The abstract element in our schema is borrowed from the JATS schema, so any markup tags supported by JATS (e.g. <jats:p>) are allowed within the abstracts.
If the text of an abstract, as a human should read it, is:
<jats:p> is hard!
That is to say, the character <, then the letters jats, then the character :, then the letter p, then the character >, then a space, and then is hard!. How should this appear in JSON?
Should the metadata service be returning the following JSON value?
"<jats:p> is hard!"
If the metadata service returns the JSON value below, should it be a parse error because the closing tag is missing?
"<jats:p>XML is hard!"
Or should software gracefully handle the missing closing tag and close it automatically, resulting in the following text for humans to read:
XML is hard!
Do you have an example where the closing tag is missing?
In the record for the DOI provided in original post here - 10.7202/1104262ar - and all other abstracts that I’m aware of, the abstract text begins <jats:p> and ends </jats:p>.
I have not seen an example where the closing tag is missing.
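For what it's worth, here is a minimal sketch of how a client could cope with both cases. It tries to read the JSON value as the contents of an XML element and falls back to plain text if that fails. The function name `abstract_to_text` is hypothetical, the namespace URI is an assumption about how the `jats` prefix is declared, and the fallback policy is only one possible choice, not something Crossref specifies.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI for the "jats" prefix; any URI would let the fragment parse.
JATS_NS = "http://www.ncbi.nlm.nih.gov/JATS1"


def abstract_to_text(value: str) -> str:
    """Return human-readable text for an abstract value.

    The value is wrapped in a root element that declares the jats prefix,
    so a fragment such as "<jats:p>...</jats:p>" can be parsed on its own.
    If the fragment is not well-formed (for example, a missing closing tag),
    the value is returned unchanged and treated as plain text.
    """
    wrapped = f'<root xmlns:jats="{JATS_NS}">{value}</root>'
    try:
        root = ET.fromstring(wrapped)
    except ET.ParseError:
        return value  # fallback: treat the value as plain text
    # Concatenate all text nodes, dropping the markup itself.
    return "".join(root.itertext())


print(abstract_to_text("<jats:p>XML is hard!</jats:p>"))  # XML is hard!
print(abstract_to_text("&lt;jats:p&gt; is hard!"))         # <jats:p> is hard!
print(abstract_to_text("<jats:p>XML is hard!"))            # <jats:p>XML is hard!
```

A stricter client could raise an error instead of falling back, which is the "parse error" behaviour asked about above.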
My fundamental question is: how does a software program know whether to treat the JSON value as XML or as plain text? My examples are there to make the question concrete.
It sounds like Crossref sometimes returns plain text and sometimes returns the contents of an XML/HTML element, and Crossref has not specified how software is supposed to tell which is which.
If you want a specific real example where this is relevant, see 10.1542/peds.2023-062391. The website search dot crossref dot org returns the following text to a human reader:
Sooner Is Better: Early Human Milk Fortification for Hospitalized Preterm Infants &lt;29 Weeks
And the JSON value for the title from Crossref is:
"Sooner Is Better: Early Human Milk Fortification for Hospitalized Preterm Infants &lt;29 Weeks"
So in this case, Crossref software, working with Crossref's own data, seems to think the title should be plain text. But clearly the title is supposed to be the contents of an XML/HTML element.
Is the data wrong? Is the Crossref software wrong? When should a title or abstract in Crossref metadata JSON be treated as plain text, and when as the contents of an XML/HTML element? Are there any examples where the data should be treated as plain text?
PubMed gets it right (PMID: 37551455). It shows human readers the text
Sooner Is Better: Early Human Milk Fortification for Hospitalized Preterm Infants <29 Weeks
The short answer is that the data is wrong in that case.
When Crossref member organizations register DOIs for their content, they submit the metadata associated with each item as XML. We make that XML, more-or-less unaltered, available through an older XML-based API. And then we convert the XML to JSON when the records are indexed in the REST API.
Sometimes the XML comes from the registrants with errors, and those errors are passed through to the JSON records as well. We do our best to have good documentation available to our member organizations, and to put checks in place when possible to catch certain kinds of errors at the point of metadata submission, but we can’t catch everything.
In this case, when AAP submitted the metadata for 10.1542/peds.2023-062391, they escaped the “<” in that title twice, and submitted it as
<titles>
<title>Sooner Is Better: Early Human Milk Fortification for Hospitalized Preterm Infants &amp;lt;29 Weeks</title>
</titles>
instead of
<titles>
<title>Sooner Is Better: Early Human Milk Fortification for Hospitalized Preterm Infants &lt;29 Weeks</title>
</titles>
It’s possible that whatever process AAP uses for sending metadata to PubMed made the same error, and PubMed has better systems in place for cleaning it up. Or, it’s possible that AAP didn’t make the same escape/encoding error in the data they sent to PubMed. I’m not sure.
But, for the most part, we’re limited to what the content registrants send us, and that accounts for a lot of the errors and inconsistencies that you might notice in the metadata that’s output downstream.
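To make the escaping issue concrete, here is a small sketch of how an extra level of escaping in the submitted XML could end up as a literal `&lt;` in the JSON. It assumes, purely for illustration, that the conversion takes the decoded text of the title element and serializes it as a JSON string. The function `title_to_json` is hypothetical and is not our actual conversion code, and the titles are shortened.

```python
import json
import xml.etree.ElementTree as ET


def title_to_json(title_xml: str) -> str:
    """Parse a <title> element and serialize its decoded text as a JSON string."""
    element = ET.fromstring(title_xml)
    text = "".join(element.itertext())  # the XML parser decodes entities once
    return json.dumps(text)


# "<" escaped once: the parser decodes it back to a plain "<".
escaped_once = "<title>Preterm Infants &lt;29 Weeks</title>"
# "<" escaped twice: the ampersand itself is escaped, so "&lt;" survives parsing.
escaped_twice = "<title>Preterm Infants &amp;lt;29 Weeks</title>"

print(title_to_json(escaped_once))   # "Preterm Infants <29 Weeks"
print(title_to_json(escaped_twice))  # "Preterm Infants &lt;29 Weeks"
```

Under that assumption, only the second input would produce the `&lt;29 Weeks` value seen in the JSON record.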
Sorry to bother you, but I’d like to learn more about MathML tagging issues. Taking the DOI 10.1073/pnas.152290999 as an example, I’ve noticed that within the tex-math tags, there are non-standard LaTeX syntax elements like \documentclass[12pt] and \usepackage. To make these display correctly on web pages or save them to a text file while preserving useful information, I’ve had to use regular expressions to extract the LaTeX content between the begin and end sections. This approach feels somewhat inelegant, and I’m concerned it might be problematic. I was wondering if Crossref provides any tools to help parse this type of content?
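For reference, the regular-expression approach I described looks roughly like the sketch below. The function name `extract_tex_body` and the sample string are made up for illustration, and the sketch assumes the useful LaTeX always sits between \begin{document} and \end{document}. It is a workaround, not a Crossref tool.

```python
import re

# Non-greedy match of everything between \begin{document} and \end{document}.
# re.DOTALL lets "." match newlines, since the wrapper may span several lines.
DOCUMENT_BODY = re.compile(r"\\begin\{document\}(.*?)\\end\{document\}", re.DOTALL)


def extract_tex_body(tex_math: str) -> str:
    """Return the LaTeX between the document markers, or the input if none are found."""
    match = DOCUMENT_BODY.search(tex_math)
    body = match.group(1) if match else tex_math
    return body.strip()


# Made-up sample resembling the wrapper seen inside tex-math tags.
sample = (
    r"\documentclass[12pt]{minimal} \usepackage{amsmath} "
    r"\begin{document} $$ \alpha $$ \end{document}"
)
print(extract_tex_body(sample))  # $$ \alpha $$
```

If the markers are absent, the input is returned as-is (after trimming whitespace), so content that is already bare LaTeX passes through untouched.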