Meaningful whitespace removed from JSON REST API

I suspect the Crossref JSON REST API is removing whitespace from abstracts, and that this removal is problematic and undesirable.

A simple example is 10.1371/journal.pgen.100 where, thanks to @Shayn in topic #14529, I got to see the metadata deposited by PLOS. This case is simple because the issue appears multiple times, each time in the text surrounding the same JATS sequence:

<jats:italic>F</jats:italic><jats:sub>ST</jats:sub>

a popular math concept in population genetics.

In the original PLOS-deposited metadata, the English text before or after the JATS tags is separated from them by a space. The first occurrence in the original is:

"<jats:p><jats:italic>F</jats:italic><jats:sub>ST</jats:sub> and kinship " ...

The second time it is:

... " estimators of <jats:italic>F</jats:italic><jats:sub>ST</jats:sub> and kinship " ...

etc…

In the JSON REST API response, those two cases have the whitespace removed from around the JATS tags:

<jats:p><jats:italic>F</jats:italic><jats:sub>ST</jats:sub>and kinship " ...

and

... " estimators of<jats:italic>F</jats:italic><jats:sub>ST</jats:sub>and kinship " ...

I assume this conversion is not what Crossref intends, and that Crossref agrees it is incorrect?

This removal is problematic because it discards information and makes proper rendering impossible, in either HTML or plaintext.

The correct rendering of the second case from the JSON REST API is

 estimators ofFSTand kinship

Rendering it any other way would be a bug, because a renderer must not invent whitespace that is not in the markup. Consider this JATS:

Eat <jats:italic>hot</jats:italic>dogs!

It would be incorrect to add a space after the closing tag and render it as

Eat hot dogs!

The space matters. The original PLOS XML has proper spacing. The Crossref JSON API should not lose this spacing information in XML mixed content, which is what any inline/phrase content in JATS or HTML is, for example the content of a paragraph element.
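To illustrate the point about mixed content, here is a minimal Python sketch of a plaintext renderer that concatenates text nodes exactly as they appear, adding and removing nothing. The helper name `render_plaintext`, the wrapping `<root>` element, and the namespace URI bound to the `jats:` prefix are my own assumptions for the demo, not anything taken from Crossref's pipeline.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI for the jats: prefix; any URI would work for this demo,
# since it is only needed to make the fragment parseable.
JATS_NS = "http://www.ncbi.nlm.nih.gov/JATS1"


def render_plaintext(fragment: str) -> str:
    """Join all text nodes of a JATS fragment without touching whitespace."""
    # Wrap the fragment in a hypothetical root element that declares the prefix.
    wrapped = f'<root xmlns:jats="{JATS_NS}">{fragment}</root>'
    root = ET.fromstring(wrapped)
    # itertext() yields element text and tail text in document order.
    return "".join(root.itertext())


deposited = "estimators of <jats:italic>F</jats:italic><jats:sub>ST</jats:sub> and kinship"
from_api = "estimators of<jats:italic>F</jats:italic><jats:sub>ST</jats:sub>and kinship"

print(render_plaintext(deposited))  # estimators of FST and kinship
print(render_plaintext(from_api))   # estimators ofFSTand kinship
```

A faithful renderer has no way to tell which of the two the depositor meant once the spaces are gone, which is why the whitespace has to survive in the metadata itself.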

Thanks for reporting this. The REST API uses the UNIXREF XML to generate the JSON. From looking at this discussion it seems like the issue is happening before the REST API gets the data, and at some point shortly after the metadata is deposited. The white spaces are swapped for line breaks. Curiously, though, for 10.1371/journal.pgen.1010373 the XML looks similar but the white space is included in the REST API, which would suggest that the issue is elsewhere. We’ll take a look.
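For anyone who wants to compare records such as the two DOIs discussed here, the following is a rough Python sketch, not a diagnostic of where the problem occurs. It reads the `abstract` field from the public `https://api.crossref.org/works/{doi}` endpoint and flags places where a word character directly touches a JATS tag. The function names and the regular expression are hypothetical, and the check is only a heuristic: legitimate markup such as `hot</jats:italic>dogs` would also be flagged, so hits are candidates to inspect by eye.

```python
import json
import re
import urllib.parse
import urllib.request
from typing import List, Optional

# Heuristic (an assumption of this sketch): a word character immediately before
# an opening jats: tag, or immediately after a closing jats: tag.
SUSPECT = re.compile(r"\w<jats:\w+[^>]*>|</jats:\w+>\w")


def fetch_abstract(doi: str) -> Optional[str]:
    """Return the JATS abstract string for a DOI, or None if the record has none."""
    url = "https://api.crossref.org/works/" + urllib.parse.quote(doi)
    with urllib.request.urlopen(url, timeout=30) as resp:
        record = json.load(resp)
    return record["message"].get("abstract")


def suspicious_joins(abstract: str) -> List[str]:
    """List the spots where text touches a tag with no whitespace in between."""
    return [m.group(0) for m in SUSPECT.finditer(abstract)]


# Offline samples taken from the thread above.
deposited = "estimators of <jats:italic>F</jats:italic><jats:sub>ST</jats:sub> and kinship"
from_api = "estimators of<jats:italic>F</jats:italic><jats:sub>ST</jats:sub>and kinship"

print(suspicious_joins(deposited))
print(suspicious_joins(from_api))
```

Against the live API, something like `suspicious_joins(fetch_abstract("10.1371/journal.pgen.1010373") or "")` would run the same check on a real record.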
