UNIXREF XML API inserting incorrect whitespace between HTML-like elements

This topic is related to topic #14533, which is a question specifically about the REST JSON API.

My question here is specifically about a different problem in the UNIXREF XML API, as used by Zotero, via the HTTP endpoint:

curl -LH "Accept: application/vnd.crossref.unixref+xml" https://doi.org/10.1007/s00253-024-13397-8

The UNIXREF XML API currently does not correctly preserve HTML-like content when two inline HTML-like elements are adjacent with zero whitespace between them.

As a concrete real example, I provide the yummy case of the MATa gene allele of ale beer yeast. What better than a delicious example! The DOI example above is for the libation of PMC11754353. It, and in particular its abstract, writes about well-known cell lines with a gene allele that I write below in three formats.

In HTML (as rendered by Wikipedia):

<i>MAT<b>a</b></i>

In Wikitext:

''MAT'''a'''''

In markdown:

_MAT**a**_

Scientific contributors to Wikipedia can write Wikitext and have it properly rendered on Wikipedia, as seen on the page about the [Mating of yeast](https://en.wikipedia.org/wiki/Mating_of_yeast).

For the abstract of PMC11754353 (10.1007/s00253-024-13397-8), the Crossref UNIXREF XML API returns the following around the gene allele name:

                ... We generated stable                                                                       
                <italic>MAT</italic>                                                                          
                <bold>                                                                                        
                  <italic>a</italic>                                                                          
                </bold>                                                                                       
                or                                                                                            
                <italic>MATα </italic>                                                                        
                lines of four different Kveik yeasts, named Odin, Thor, Freya and Vör.                        

I bet Freya tastes the best. The JATS XML in PMC for this part of the abstract is as follows:

We generated stable <italic>MAT</italic><bold><italic>a</italic></bold> or <italic>MATα </italic>lines of four different Kveik yeasts, named Odin, Thor, Freya and Vör.

I bet Thor tastes really bitter. But what about Odin and Vör?

The problem here is the insertion of whitespace between the </italic> and the <bold>. Any downstream application, such as Zotero, will now render this incorrectly as two words, “MAT” and “a”, rather than one word “MATa”, because Crossref has modified the XML mixed content of the abstract in an invalid way.
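
To make the effect concrete, here is a minimal Python sketch of what a downstream renderer would see. The two fragments are shortened copies of the PMC and UNIXREF snippets above. The wrapping <p> element and the helper name rendered_text are my own assumptions for illustration, and collapsing whitespace runs into a single space is only an approximation of how HTML-style rendering treats whitespace.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the PMC JATS fragment (no whitespace between the elements).
# The <p> wrapper is an assumption added so the fragment parses on its own.
PMC = (
    "<p>We generated stable <italic>MAT</italic><bold><italic>a</italic></bold>"
    " or <italic>MATα </italic>lines</p>"
)

# Shortened copy of the UNIXREF fragment (line breaks and indentation present).
UNIXREF = """<p>We generated stable
  <italic>MAT</italic>
  <bold>
    <italic>a</italic>
  </bold>
  or
  <italic>MATα </italic>
  lines</p>"""


def rendered_text(xml_fragment: str) -> str:
    """Join all text nodes, then collapse each whitespace run to one space."""
    root = ET.fromstring(xml_fragment)
    return " ".join("".join(root.itertext()).split())


print(rendered_text(PMC))      # We generated stable MATa or MATα lines
print(rendered_text(UNIXREF))  # We generated stable MAT a or MATα lines
```

The only difference between the two results is the space inside the allele name, which is exactly the whitespace that appears between </italic> and <bold>.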

My questions for Crossref are:

What are Crossref’s plans for this?

Are abstracts intended for data-mining only and not re-display for human reading?

Does Crossref plan to fix this or should developers use the REST JSON API if they care? (assuming the invalid removal of whitespace is fixed in the REST JSON API)

Cheers (with a beer ale),
Castedo

Hi Castedo,

For 10.1007/s00253-024-13397-8 this is the abstract from the original metadata submitted by Springer Nature:

<jats:abstract xml:lang="en">
          <jats:sec>
            <jats:title>Abstract</jats:title>
            <jats:p>Improving ale or lager yeasts by conventional breeding is a non-trivial task. Domestication of lager yeasts, which are hybrids between <jats:italic>Saccharomyces cerevisiae</jats:italic> and <jats:italic>Saccharomyces eubayanus</jats:italic>, has led to evolved strains with severely reduced or abolished sexual reproduction capabilities, due to, e.g. postzygotic barriers. On the other hand, <jats:italic>S. cerevisiae</jats:italic> ale yeasts, particularly Kveik ale yeast strains, were shown to produce abundant viable spores (~ 60%; Dippel et al. Microorganisms 10(10):1922, 2022). This led us to investigate the usefulness of Kveik yeasts for conventional yeast breeding. Surprisingly, we could isolate heterothallic colonies from germinated spores of different Kveik strains. These strains presented stable mating types in confrontation assays with pheromone-sensitive tester strains. Heterothallism was due to inactivating mutations in their <jats:italic>HO</jats:italic> genes. These led to amino acid exchanges in the Ho protein, revealing a known G223D mutation and also a novel G217R mutation, both of which abolished mating type switching. We generated stable <jats:italic>MAT</jats:italic>
              <jats:bold>
                <jats:italic>a</jats:italic>
              </jats:bold> or <jats:italic>MATα </jats:italic>lines of four different Kveik yeasts, named Odin, Thor, Freya and Vör. Analyses of bud scar positions in these strains revealed both axial and bipolar budding patterns. However, the ability of Freya and Vör to form viable meiotic offspring with haploid tester strains demonstrated that these strains are haploid. Fermentation analyses indicated that all four yeast strains were able to ferment maltose and maltotriose. Odin was found to share not only mutations in the <jats:italic>HO</jats:italic> gene, but also inactivating mutations in the <jats:italic>PAD1</jats:italic> and <jats:italic>FDC1</jats:italic> genes with lager yeasts, which makes this strain POF-, i.e. not able to generate phenolic off-flavours, a key feature of lager yeasts. These haploid ale yeast-derived strains may open novel avenues also for generating novel lager yeast strains by breeding or mutation and selection utilizing the power of yeast genetics, thus lifting a block that domestication of lager yeasts has brought about.</jats:p>
          </jats:sec>
          <jats:sec>
            <jats:title>Key points</jats:title>
            <jats:p>
              <jats:italic>• Haploid Kveik ale yeasts with stable MAT</jats:italic>
              <jats:bold>
                <jats:italic>a</jats:italic>
              </jats:bold>
              <jats:italic> and MATα mating types were isolated.</jats:italic>
            </jats:p>
            <jats:p>
              <jats:italic>• Heterothallic strains bear mutant HO alleles leading to a novel inactivating G217R amino acid change.</jats:italic>
            </jats:p>
            <jats:p>
              <jats:italic>• One strain was found to be POF- due to inactivating mutations in the PAD1 and FDC1 gene rendering it negative for phenolic off-flavor production.</jats:italic>
            </jats:p>
            <jats:p>
              <jats:italic>• These strains are highly accessible for beer yeast improvements by conventional breeding, employing yeast genetics and mutation and selection regimes.</jats:italic>
            </jats:p>
          </jats:sec>
        </jats:abstract>

They submitted it with a line break, not a space, in between the </italic> and <bold> tags that you pointed out. The unixref xml simply reflects whatever was submitted by the organization registering the content.

To answer your more general question, the abstracts (like all the other metadata we collect) are intended for machine readability first and foremost. But, they should also be useable for display purposes, with the expectation that there’s a lot of variation in how the data is submitted by our members, and knowing that some reformatting or quality assurance work could be necessary if you want everything to be displayed consistently and correctly.

We get a lot of different kinds of metadata quality feedback, and for the most part, we’re happy to pass that feedback along to the member who registered the content and submitted the metadata. So, if you’d like me to send this feedback to our contacts at Springer Nature, I can definitely do that, though I’d also encourage you to reach out to them directly as well.

Same with any other publishers whose metadata you have concerns with. Feel free to send us an email to support@crossref.org, and we can forward any specifics to the relevant publisher contacts.

Best,
Shayn

This is false. The unixref xml does not simply reflect what was submitted by this organization. Run

curl -LH "Accept: application/vnd.crossref.unixref+xml" https://doi.org/10.1007/s00253-024-13397-8

and do a diff. You will see they are not the same. Crossref is inserting newlines.

If you’d like another case look at 10.1371/journal.pgen.1009241 and see the newline that is inserted between the <italic>F</italic> and <sub>ST</sub>.
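
For anyone who wants to look for more cases, here is a rough Python sketch that lists sibling elements separated only by whitespace. The function name whitespace_only_gaps and the sample fragment are hypothetical (the sample is not the real abstract text, only the <italic>F</italic> and <sub>ST</sub> pair with a newline between them). A reported gap is only a candidate: a publisher may legitimately put a space between two elements, so each hit still has to be compared against the deposited XML.

```python
import xml.etree.ElementTree as ET


def whitespace_only_gaps(xml_fragment: str) -> list[tuple[str, str]]:
    """Return (left_tag, right_tag) for siblings separated only by whitespace."""
    root = ET.fromstring(xml_fragment)
    gaps = []
    for parent in root.iter():
        children = list(parent)
        # Pair every child with its next sibling and inspect the text between them.
        for left, right in zip(children, children[1:]):
            tail = left.tail or ""
            if tail and not tail.strip():
                gaps.append((left.tag, right.tag))
    return gaps


# Hypothetical sample: a newline sits between </italic> and <sub>.
sample = "<p>high <italic>F</italic>\n<sub>ST</sub> values</p>"
print(whitespace_only_gaps(sample))  # [('italic', 'sub')]
```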

I’m not giving feedback about metadata quality. I am identifying what appears to be a bug in the Crossref software and infrastructure and trying to get a straight answer about whether y’all are going to fix it or not.

Are you claiming that the UNIXREF XML API is not inserting incorrect whitespace between HTML-like elements?

As a clarification that might be helpful, this topic and question is about Crossref software and infrastructure, not about particular cases of publisher metadata. Based on everything I have seen to date, I believe it is literally impossible for any publisher, even with the highest possible quality of metadata, to deposit an abstract with JATS/HTML formatting of a technical term like MATa in yeast genetics and avoid having Crossref break the single word “MATa” into two words “MAT” and “a”.

This is due to Crossref, not the publisher or the quality level of their data. This is because I believe Crossref is not processing XML mixed content properly. Applications like Zotero literally cannot get the right data using the UNIXREF XML API. Due to Crossref. Not due to the publisher or the data.
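
To illustrate what I mean by not processing mixed content properly, here is a small Python sketch in which a generic pretty-printer changes the text of a fragment. I do not know what software Crossref actually runs, so this is only an assumption about the kind of step that could cause this, using the standard library function xml.etree.ElementTree.indent (Python 3.9 or later) as a stand-in. The <p> wrapper and the shortened text are also my own.

```python
import xml.etree.ElementTree as ET

# Shortened fragment with no whitespace between </italic> and <bold>.
source = (
    "<p>We generated stable <italic>MAT</italic><bold><italic>a</italic></bold>"
    " or <italic>MATα </italic>lines</p>"
)

root = ET.fromstring(source)
before = "".join(root.itertext())

# Pretty-print in place. indent() fills every empty or whitespace-only
# text/tail slot with a newline plus indentation, including slots that
# sit between inline elements of mixed content.
ET.indent(root)
after = "".join(root.itertext())

print(ET.tostring(root, encoding="unicode"))
print("MATa" in before, "MATa" in after)  # True False
```

Indenting is harmless for element-only content, but in mixed content the added newlines become part of the text, so a serializer that has to preserve the abstract needs to leave those slots untouched.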