A way, per DOI, to get original metadata deposit XML?

My motivation for asking is trying to make sense of the different XML that is returned by UNIXREF API results vs JSON API results. Specifically for DOI 10.1371/journal.pgen.1009241 I see the following JSON value for the abstract:

"<jats:p><jats:italic>F</jats:italic><jats:sub>ST</jats:sub>and kinship are ...

whereas for the UNIXREF results (thanks @skarcher) I did:

curl -LH "Accept: application/vnd.crossref.unixref+xml" https://doi.org/10.1371/journal.pgen.1009241

to get:

          <abstract>
            <p>
              <italic>F</italic>
              <sub>ST</sub>
              and kinship are ...

The JSON value has no whitespace between any of the XML elements, whereas the UNIXREF output has pretty-printed, indented whitespace between all XML elements.
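For anyone who wants to reproduce the comparison from a script rather than curl, here is a minimal Python sketch. The UNIXREF request mirrors the curl command above (same URL and Accept header). The JSON request assumes the public REST endpoint at https://api.crossref.org/works/{DOI}, which is not quoted in this thread, and the function names are only illustrative.

```python
import urllib.request

# Accept header taken from the curl command in this thread.
UNIXREF_ACCEPT = "application/vnd.crossref.unixref+xml"


def unixref_request(doi: str) -> urllib.request.Request:
    """Build the content-negotiation request used by the curl example."""
    return urllib.request.Request(
        f"https://doi.org/{doi}",
        headers={"Accept": UNIXREF_ACCEPT},
    )


def json_request(doi: str) -> urllib.request.Request:
    """Build a request for the JSON record.

    Assumption: the REST API serves works at api.crossref.org/works/{DOI};
    the abstract is expected under message -> abstract when one was deposited.
    """
    return urllib.request.Request(
        f"https://api.crossref.org/works/{doi}",
        headers={"Accept": "application/json"},
    )


def fetch(request: urllib.request.Request, timeout: float = 30.0) -> str:
    """Send the request (redirects are followed, like curl -L) and decode the body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")
```

Calling fetch(unixref_request("10.1371/journal.pgen.1009241")) should return the indented XML, and fetch(json_request("10.1371/journal.pgen.1009241")) the JSON record, so the two abstracts can be compared side by side.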

What does the metadata deposit XML have? No whitespace, pretty-printed indentation between all XML elements, or a mix?

The content of the <p> element should be XML mixed content. That is, there should be whitespace between some XML child elements and not others, as is the case for the JATS XML in PubMed Central for this article. There should be no space between </italic> and <sub>, whereas there should be whitespace between some other elements (I can provide a separate example of such).

Does the XML data stored at Crossref no longer have this whitespace data preserved?

I didn’t realize that my example above already shows a case where whitespace should be preserved but is getting lost in the JSON API results: the JSON API value has no whitespace between </sub> and the word "and".
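To make the mixed-content point concrete, here is a small Python sketch. The three fragments are shortened versions of the snippets in this thread (with the jats: prefix dropped for brevity), and rendered_text is a hypothetical helper that mimics a renderer which collapses runs of whitespace, as HTML does. It is only an illustration of why whitespace around inline tags is significant, not a description of how any Crossref API works internally.

```python
import re
import xml.etree.ElementTree as ET


def rendered_text(xml_fragment: str) -> str:
    """Concatenate all text nodes, then collapse whitespace runs to one space."""
    root = ET.fromstring(xml_fragment)
    text = "".join(root.itertext())
    return re.sub(r"\s+", " ", text).strip()


# Mixed content with the intended whitespace: none inside "FST", one space after it.
compact = "<p><italic>F</italic><sub>ST</sub> and kinship are</p>"

# The same content after pretty-printing: whitespace now separates every element.
indented = """<p>
  <italic>F</italic>
  <sub>ST</sub>
  and kinship are
</p>"""

# The same content with the space after </sub> dropped, as in the JSON value.
stripped = "<p><italic>F</italic><sub>ST</sub>and kinship are</p>"

print(rendered_text(compact))   # FST and kinship are
print(rendered_text(indented))  # F ST and kinship are
print(rendered_text(stripped))  # FSTand kinship are
```

Only the first form renders as intended; the indented form splits "FST" and the stripped form glues it to the following word.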

Hello, and thanks for your question.

In this particular case, the metadata that PLoS supplied for 10.1371/journal.pgen.1009241 had the abstract formatted as follows:

        <jats:abstract>
<jats:p><jats:italic>F</jats:italic><jats:sub>ST</jats:sub> and kinship are key parameters often estimated in modern population genetics studies in order to quantitatively characterize structure and relatedness. Kinship matrices have also become a fundamental quantity used in genome-wide association studies and heritability estimation. The most frequently-used estimators of <jats:italic>F</jats:italic><jats:sub>ST</jats:sub> and kinship are method-of-moments estimators whose accuracies depend strongly on the existence of simple underlying forms of structure, such as the independent subpopulations model of non-overlapping, independently evolving subpopulations. However, modern data sets have revealed that these simple models of structure likely do not hold in many populations, including humans. In this work, we analyze the behavior of these estimators in the presence of arbitrarily-complex population structures, which results in an improved estimation framework specifically designed for arbitrary population structures. After generalizing the definition of <jats:italic>F</jats:italic><jats:sub>ST</jats:sub> to arbitrary population structures and establishing a framework for assessing bias and consistency of genome-wide estimators, we calculate the accuracy of existing <jats:italic>F</jats:italic><jats:sub>ST</jats:sub> and kinship estimators under arbitrary population structures, characterizing biases and estimation challenges unobserved under their originally-assumed models of structure. We then present our new approach, which consistently estimates kinship and <jats:italic>F</jats:italic><jats:sub>ST</jats:sub> when the minimum kinship value in the dataset is estimated consistently. We illustrate our results using simulated genotypes from an admixture model, constructing a one-dimensional geographic scenario that departs nontrivially from the independent subpopulations model. 
Our simulations reveal the potential for severe biases in estimates of existing approaches that are overcome by our new framework. This work may significantly improve future analyses that rely on accurate kinship and <jats:italic>F</jats:italic><jats:sub>ST</jats:sub> estimates.</jats:p>
</jats:abstract>

But, as you’ve noted, when we make that accessible via our XML API, in unixref it looks like this:

          <abstract>
            <p>
              <italic>F</italic>
              <sub>ST</sub>
              and kinship are key parameters often estimated in modern population genetics studies in order to quantitatively characterize structure and relatedness. Kinship matrices have also become a fundamental quantity used in genome-wide association studies and heritability estimation. The most frequently-used estimators of
              <italic>F</italic>
              <sub>ST</sub>
              and kinship are method-of-moments estimators whose accuracies depend strongly on the existence of simple underlying forms of structure, such as the independent subpopulations model of non-overlapping, independently evolving subpopulations. However, modern data sets have revealed that these simple models of structure likely do not hold in many populations, including humans. In this work, we analyze the behavior of these estimators in the presence of arbitrarily-complex population structures, which results in an improved estimation framework specifically designed for arbitrary population structures. After generalizing the definition of
              <italic>F</italic>
              <sub>ST</sub>
              to arbitrary population structures and establishing a framework for assessing bias and consistency of genome-wide estimators, we calculate the accuracy of existing
              <italic>F</italic>
              <sub>ST</sub>
              and kinship estimators under arbitrary population structures, characterizing biases and estimation challenges unobserved under their originally-assumed models of structure. We then present our new approach, which consistently estimates kinship and
              <italic>F</italic>
              <sub>ST</sub>
              when the minimum kinship value in the dataset is estimated consistently. We illustrate our results using simulated genotypes from an admixture model, constructing a one-dimensional geographic scenario that departs nontrivially from the independent subpopulations model. Our simulations reveal the potential for severe biases in estimates of existing approaches that are overcome by our new framework. This work may significantly improve future analyses that rely on accurate kinship and
              <italic>F</italic>
              <sub>ST</sub>
              estimates.
            </p>
          </abstract>

In unixsd, it’s similar, but not exactly the same:

<jats:abstract xmlns:jats="http://www.ncbi.nlm.nih.gov/JATS1">
                  <jats:p>
                    <jats:italic>F</jats:italic>
                    <jats:sub>ST</jats:sub>
                    and kinship are key parameters often estimated in modern population genetics studies in order to quantitatively characterize structure and relatedness. Kinship matrices have also become a fundamental quantity used in genome-wide association studies and heritability estimation. The most frequently-used estimators of
                    <jats:italic>F</jats:italic>
                    <jats:sub>ST</jats:sub>
                    and kinship are method-of-moments estimators whose accuracies depend strongly on the existence of simple underlying forms of structure, such as the independent subpopulations model of non-overlapping, independently evolving subpopulations. However, modern data sets have revealed that these simple models of structure likely do not hold in many populations, including humans. In this work, we analyze the behavior of these estimators in the presence of arbitrarily-complex population structures, which results in an improved estimation framework specifically designed for arbitrary population structures. After generalizing the definition of
                    <jats:italic>F</jats:italic>
                    <jats:sub>ST</jats:sub>
                    to arbitrary population structures and establishing a framework for assessing bias and consistency of genome-wide estimators, we calculate the accuracy of existing
                    <jats:italic>F</jats:italic>
                    <jats:sub>ST</jats:sub>
                    and kinship estimators under arbitrary population structures, characterizing biases and estimation challenges unobserved under their originally-assumed models of structure. We then present our new approach, which consistently estimates kinship and
                    <jats:italic>F</jats:italic>
                    <jats:sub>ST</jats:sub>
                    when the minimum kinship value in the dataset is estimated consistently. We illustrate our results using simulated genotypes from an admixture model, constructing a one-dimensional geographic scenario that departs nontrivially from the independent subpopulations model. Our simulations reveal the potential for severe biases in estimates of existing approaches that are overcome by our new framework. This work may significantly improve future analyses that rely on accurate kinship and
                    <jats:italic>F</jats:italic>
                    <jats:sub>ST</jats:sub>
                    estimates.
                  </jats:p>
                </jats:abstract>

So, basically, there were no added spaces or indents in the original data that the publisher deposited. (Sometimes there are; it varies by publisher.) We added those in for the XML output, but the same doesn’t apply to the JSON output.

Does that answer your question?

Thank you! Very informative data that answers the main questions.

I’m inferring that there is no way for me to query this underlying data directly, per DOI.

I do have a related secondary question about spaces getting removed in the JSON value. But I think it’s better to post that as a separate topic so it’s easier for others to find by searching. I’ll post it separately, more in the format of a bug report. I think it’s basically a bug in the Crossref REST JSON API.

Oh one last request on this thread, @Shayn can you also pull the original data for 10.1101/2024.09.24.614506 please?

It is very similar to the first one I gave you but has even nastier cases of XML mixed content where whitespace around tags matters and results in bad rendering downstream (in Zotero). This case has a bunch of math using only italics, subscripts, and Unicode characters, without any MathML. Altering whitespace around these tags is problematic, and I suspect it is incorrectly being attributed to bad publisher data when it’s actually a Crossref API issue.

Sure, the abstract in the metadata deposit from CSHL/bioRxiv for 10.1101/2024.09.24.614506 looked like this:

<jats:abstract><jats:title>ABSTRACT</jats:title><jats:p>The relative genetic distance between populations is commonly measured using the fixation index (<jats:italic>F</jats:italic><jats:sub><jats:italic>ST</jats:italic></jats:sub>). Traditionally inferred from allele frequency differences, the question arises how<jats:italic>F</jats:italic><jats:sub><jats:italic>ST</jats:italic></jats:sub>can be estimated and interpreted when analysing genomic datasets with low sample sizes. Here, we advocate an elegant solution first put forward by Hudson et al. (1992):<jats:italic>F</jats:italic><jats:sub><jats:italic>ST</jats:italic></jats:sub>= (<jats:italic>D</jats:italic><jats:sub><jats:italic>xy</jats:italic></jats:sub>–<jats:italic>π</jats:italic><jats:sub><jats:italic>xy</jats:italic></jats:sub>)/<jats:italic>D</jats:italic><jats:sub><jats:italic>xy</jats:italic></jats:sub>, where<jats:italic>D</jats:italic><jats:sub><jats:italic>xy</jats:italic></jats:sub>and<jats:italic>π</jats:italic><jats:sub><jats:italic>xy</jats:italic></jats:sub>denote mean sequence dissimilarity<jats:italic>between</jats:italic>and<jats:italic>within</jats:italic>populations, respectively. This multi-locus<jats:italic>F</jats:italic><jats:sub><jats:italic>ST</jats:italic></jats:sub>-metric can be derived from allele frequency data, but also from sequence alignment data alone, even when sample sizes are low and/or unequal. As with other<jats:italic>F</jats:italic><jats:sub><jats:italic>ST</jats:italic></jats:sub>-metrices, the numerator denotes net divergence (<jats:italic>D</jats:italic><jats:sub><jats:italic>a</jats:italic></jats:sub>), which is equivalent to the<jats:italic>f</jats:italic><jats:sup><jats:italic>2</jats:italic></jats:sup>-statistic and Nei’s<jats:italic>D</jats:italic>(for realistic estimates of<jats:italic>D</jats:italic><jats:sub><jats:italic>xy</jats:italic></jats:sub>and<jats:italic>π</jats:italic><jats:sub><jats:italic>xy</jats:italic></jats:sub>). 
In terms of demographic inference, net divergence measures the difference in increase of<jats:italic>D</jats:italic><jats:sub><jats:italic>xy</jats:italic></jats:sub>and<jats:italic>π</jats:italic><jats:sub><jats:italic>xy</jats:italic></jats:sub>since the population split, owing to a reduction of coalescence times within populations as a result of genetic drift. Because different combinations of<jats:italic>ΔD</jats:italic><jats:sub><jats:italic>xy</jats:italic></jats:sub>and<jats:italic>Δπ</jats:italic><jats:sub><jats:italic>xy</jats:italic></jats:sub>can produce identical<jats:italic>F</jats:italic><jats:sub><jats:italic>ST</jats:italic></jats:sub>-estimates, no universal relationship exists between<jats:italic>F</jats:italic><jats:sub><jats:italic>ST</jats:italic></jats:sub>and population split time. Still, in case of recent population splits, when novel mutations are negligible,<jats:italic>F</jats:italic><jats:sub><jats:italic>ST</jats:italic></jats:sub>-estimates can be accurately converted into coalescent units (<jats:italic>τ</jats:italic>. i.e., split time in multiples of 2<jats:italic>N</jats:italic><jats:sub><jats:italic>e</jats:italic></jats:sub>). This then allows to quantify gene tree discordance, without the need for multispecies coalescent based analyses, using the formula:<jats:italic>P</jats:italic><jats:sub><jats:italic>discordance</jats:italic></jats:sub>= ⅔·(1 –<jats:italic>F</jats:italic><jats:sub><jats:italic>ST</jats:italic></jats:sub>). To facilitate the use of the Hudson<jats:italic>F</jats:italic><jats:sub><jats:italic>ST</jats:italic></jats:sub>-metric, we implemented new utilities in the R package SambaR.</jats:p></jats:abstract>

Their XML had no line breaks at all in the entire file.

If this context is helpful, we do have best practice guidance for abstracts and a markup guide for abstracts on our documentation site.

Oh, interesting. In this case it looks like the data got messed up between the publisher and Crossref during deposit. I’m guessing this is the publisher’s fault.

The XML I see on bioRxiv has correct spacing, for example:

the question arises how <italic toggle="yes">F</italic><sub><italic toggle="yes">ST</italic></sub> can be

but the metadata deposit you quote has

the question arises how<jats:italic>F</jats:italic><jats:sub><jats:italic>ST</jats:italic></jats:sub>can be

So the bioRxiv XML would be rendered correctly as

the question arises how FST can be

but the metadata deposit would be

the question arises howFSTcan be
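The comparison above can be checked mechanically. The following Python sketch extracts the plain text of both fragments and tests whether they differ only in whitespace. The helper names are hypothetical; the jats namespace URI is the one shown in the unixsd output earlier in this thread, and the wrapper element exists only so the prefixed fragment parses.

```python
import xml.etree.ElementTree as ET

# Namespace URI as it appears in the unixsd output quoted in this thread.
JATS_NS = "http://www.ncbi.nlm.nih.gov/JATS1"


def plain_text(fragment: str) -> str:
    """Wrap the fragment in a root that declares the jats prefix, then join all text."""
    wrapped = f'<root xmlns:jats="{JATS_NS}">{fragment}</root>'
    return "".join(ET.fromstring(wrapped).itertext())


def differs_only_in_whitespace(a: str, b: str) -> bool:
    """True when the strings are unequal but match once all whitespace is removed."""
    return a != b and "".join(a.split()) == "".join(b.split())


biorxiv = (
    'the question arises how <italic toggle="yes">F</italic>'
    '<sub><italic toggle="yes">ST</italic></sub> can be'
)
deposit = (
    "the question arises how<jats:italic>F</jats:italic>"
    "<jats:sub><jats:italic>ST</jats:italic></jats:sub>can be"
)

site_text = plain_text(biorxiv)
deposit_text = plain_text(deposit)

print(site_text)     # the question arises how FST can be
print(deposit_text)  # the question arises howFSTcan be
print(differs_only_in_whitespace(site_text, deposit_text))  # True
```

A check like this could be run over whole abstracts to confirm that a discrepancy between two sources is purely a whitespace loss rather than a content difference.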