Ticket of the month - June 2024 - What happens to submitted references

Let’s say you’ve capitulated to all our nudging and have just begun supplying references along with your DOIs’ metadata records. The submission logs that come back confirming your successful metadata deposits now have a bunch of extra ‘stuff’ in them.

Instead of just a <msg>Successfully added</msg> or <msg>Successfully updated</msg> message for each submitted DOI (or, hopefully rarely, an error message), you now see a separate diagnostic for each submitted reference in your publications’ reference lists.

These will return one of three status results:

  • status="error"
  • status="resolved_reference"
  • status="stored_query"

Let’s look at each in context. In the example submission log in our documentation, the very first reference submitted returned an error:

<citation key="10.5555/example_bb0030" status="error">Either ISSN or Journal title or Proceedings title must be supplied.</citation>

What kind of reference would result in that error? Well, it would have to be a reference where each element is tagged individually (aka, “a structured citation”) because that’s the only situation which requires an ISSN or Journal/Proceedings title. For example:

<citation key="ref1">
<author>Dobbs</author>
<volume>13</volume>
<issue>2</issue>
<first_page>16</first_page>
<cYear>2023</cYear>
<article_title>Cat Herding: A Systematic Review</article_title>
</citation>

The error itself is pretty straightforward in this case. When your publication is citing a journal article or conference paper, the structured reference data has to include some way to identify the journal or conference proceedings that it’s a part of. So, adding the journal title or title abbreviation like this would take care of the problem:

<citation key="ref1">
<journal_title>Journal of Impossible Tasks</journal_title>
<author>Dobbs</author>
<volume>13</volume>
<issue>2</issue>
<first_page>16</first_page>
<cYear>2023</cYear>
<article_title>Cat Herding: A Systematic Review</article_title>
</citation>
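A check like this can be run before depositing. What follows is only a minimal sketch, not Crossref’s own validation: it assumes the citation snippet has no XML namespace (as in the examples here), that DOI-only and unstructured references are exempt, and that `issn` and `journal_title` are the identifying elements. The element for a proceedings title isn’t shown in this post, so it is left for you to pass in. The function name `missing_container` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Elements that identify the containing journal. The post names ISSN and
# journal title; the proceedings-title element is not shown in the post,
# so it is NOT included here (assumption: pass it via `identifying_tags`).
IDENTIFYING_TAGS = ("issn", "journal_title")


def missing_container(citation_xml, identifying_tags=IDENTIFYING_TAGS):
    """Return True if a structured citation lacks every identifying element."""
    citation = ET.fromstring(citation_xml)
    # Assumption: DOI-only and unstructured references skip this check.
    if citation.find("doi") is not None:
        return False
    if citation.find("unstructured_citation") is not None:
        return False
    return all(citation.find(tag) is None for tag in identifying_tags)


broken = """<citation key="ref1">
<author>Dobbs</author>
<volume>13</volume>
<issue>2</issue>
<first_page>16</first_page>
<cYear>2023</cYear>
<article_title>Cat Herding: A Systematic Review</article_title>
</citation>"""

# Same reference with the journal title added, as in the corrected example.
fixed = broken.replace(
    "<author>",
    "<journal_title>Journal of Impossible Tasks</journal_title>\n<author>",
    1,
)

print(missing_container(broken))  # True
print(missing_container(fixed))   # False
```

Real deposit files declare a schema namespace, so the `find` calls would need namespace-qualified tag names there.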

The second reference diagnostic in that log returns the status stored_query, like this:

<citation key="10.5555/example_bb0005" status="stored_query"></citation>

Further down the list, you can see a resolved_reference status, like this:

<citation key="10.5555/example_bb0015" status="resolved_reference">10.1590/S0006-87051960000100077</citation>

Both of those were the result of references that were formatted in a completely valid way. We know this because the status was not "error". So, what’s the difference between them?

In simplest terms, resolved_reference means our reference matching system could successfully match the reference that was supplied in that metadata deposit to the metadata associated with a specific DOI. That is, your publication is citing something, and we’ve figured out what exactly it was citing.

In contrast, stored_query means that we couldn’t find a distinct match. We don’t know what exactly your publication was citing via that reference. When that happens, the reference is “stored” for later re-querying. Periodically, we’ll try to match it again, in case the cited publication has been registered in the meantime.
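If you want to tally these diagnostics rather than read them one by one, something along these lines may help. It is a sketch under assumptions: the `<diagnostics>` wrapper element is hypothetical (only the `<citation>` lines of the log are shown in this post), and `summarize` is an illustrative name, not part of any Crossref tool.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# The three <citation> lines are the ones quoted in this post. The
# <diagnostics> wrapper is a hypothetical stand-in for the real log structure.
LOG_FRAGMENT = """<diagnostics>
<citation key="10.5555/example_bb0030" status="error">Either ISSN or Journal title or Proceedings title must be supplied.</citation>
<citation key="10.5555/example_bb0005" status="stored_query"></citation>
<citation key="10.5555/example_bb0015" status="resolved_reference">10.1590/S0006-87051960000100077</citation>
</diagnostics>"""


def summarize(log_xml):
    """Count citation statuses; collect matched DOIs and error messages by key."""
    root = ET.fromstring(log_xml)
    counts = Counter()
    matched = {}  # citation key -> DOI of the cited item
    errors = {}   # citation key -> error message
    for citation in root.iter("citation"):
        status = citation.get("status")
        counts[status] += 1
        text = (citation.text or "").strip()
        if status == "resolved_reference":
            matched[citation.get("key")] = text
        elif status == "error":
            errors[citation.get("key")] = text
    return counts, matched, errors


counts, matched, errors = summarize(LOG_FRAGMENT)
print(dict(counts))
print(matched)
print(errors)
```

The references counted under stored_query are the ones that will be retried later, so a high count there is not necessarily a problem with your markup.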

When a citation match has been found, the DOI of the cited item is displayed in the submission log diagnostic. In our example, that’s 10.1590/S0006-87051960000100077.

The reference that produced this citation match may have looked like this:

<citation key="10.5555/example_bb0015">
<doi>10.1590/S0006-87051960000100077</doi>
</citation>

Or like this:

<citation key="10.5555/example_bb0015">
<journal_title>Bragantia</journal_title>
<author>Bacchi</author>
<volume>19</volume>
<first_page>XLI</first_page>
<cYear>1960</cYear>
</citation>

Or like this:

<citation key="10.5555/example_bb0015">
<unstructured_citation>Bacchi, O. (1960). Estudos sôbre a conservação de sementes. V - alface. Bragantia, 19(unico), XLI–XLV.</unstructured_citation>
</citation>

Any of those, as well as many variations of the latter two, could produce a successful citation match to 10.1590/S0006-87051960000100077 based on the metadata supplied to Crossref by its publisher.

A stored_query result, where a citation match has not been found, typically means that the referenced publication has not been registered with Crossref. While the majority of DOIs for scholarly publications are registered with Crossref, not all scholarly publications have DOIs (this is especially true for content that was published prior to the advent of the DOI system) and not all DOIs are Crossref DOIs. If a reference is citing something that isn’t registered with Crossref, then we won’t be able to match your reference to an identifier.

In some cases, the lack of a citation match is due to an inaccuracy in the way the citation has been submitted or formatted.

One common example: an author cites a paper directly from a prepublication manuscript and so gives the first page number as “1”, and the publisher passes this placeholder page number along in the reference they submit to Crossref. When the cited paper is later published as an article in a journal, it’s given a page range that doesn’t begin with “1”. The page number in the reference then doesn’t match the page number in the cited work’s metadata record, and no citation match can be made.

For example, if an item that you’re registering cites the article “Damage Tolerance Related to the Damage Area of Impacted Carbon/Epoxy Composite Laminates” in volume 57 issue 19 of Journal of Composite Materials, but you supply the reference like this:

<citation key="5555.1">
<unstructured_citation>Targino, T. G., et al. (2023). Damage tolerance related to the damage area of impacted carbon/epoxy composite laminates. Journal of Composite Materials, 57(19), 1-9</unstructured_citation>
</citation>

That won’t produce a citation match to its DOI, 10.1177/00219983231181942, because the page range in the metadata for that DOI is 2985-2993, not 1-9. However, if the first page number or page range were omitted from the reference entirely, it would match successfully. Page numbers can help disambiguate one item from another, but they’re not required: an inaccurate page number hurts more than an accurate one helps.

In other instances, a missing citation match may be due to an overall sparsity of information in the reference. This is especially a problem with structured references where each element has its own tags. Unstructured references, where a whole formatted citation is submitted as one block of text, tend to be a bit more flexible.

So, to take another example, if an item that you’re registering cites the article “Cosmological consequences of Brans–Dicke theory in 4D from 5D scalar-vacuum” in volume 139, issue 2 of The European Physical Journal Plus, but you submit a reference like this:

<citation key="5555.2">
<journal_title>Eur Phys J Plus</journal_title>
<author>Lambiase</author>
<cYear>2024</cYear>
</citation>

That’s unlikely to produce a successful match to that cited work’s DOI - 10.1140/epjp/s13360-024-04905-w - simply because there’s not enough data included. The publication year, journal abbreviation, and first author’s surname are accurate, but including the volume and issue numbers and/or the article title would be more effective.

And, of course, the simplest and most foolproof method to submit a reference is always to just use the DOI, if it exists, e.g.

<citation key="5555.3">
<doi>10.5555/12345678</doi>
</citation>

As long as that DOI exists in Crossref’s system, that is a 100% guarantee that you’ll end up with a successful citation relationship between your publication and the item it cites.


We decided to jump headlong into reporting bibliographic references, but it’s easy for us because we collect structured information (BibTeX) from authors and we check references in our copy editing process to make sure that DOIs are included wherever possible. Unfortunately, when I tested our newest issue, I saw a bunch of errors of the form:

<citation key="ref37:AC:CasLagTuc18" status="error">Reference DOI 10.1007/978-3-030-03329-3_25 not found in Crossref doi: 10.1007/978-3-030-03329-3_25</citation>

This is very peculiar since https://api.crossref.org/works/10.1007/978-3-030-03329-3_25 returns metadata for that DOI (and it’s from Springer Nature). This is not an isolated example; we had 315 such reports for DOIs that are registered. The DOI should be the most valuable key in identifying bibliographic references, so it makes me wonder what is going wrong.

Is this a test deposit that was submitted to our test system (test.crossref.org) endpoint? If so, that almost certainly explains the “not found in Crossref” response.

The test system is good for testing the process of submitting files, but that’s about where its utility ends. (I’m exaggerating a bit - it will also tell you if your xml is misformatted or invalid against the schema, but there are other ways to do that)

Responses that relate to the existence of individual DOIs or the details of titles (title ownership, journal title text, ISSNs, for example) aren’t always going to sync up with the real data in the production system.

So, for practical purposes, since you know that DOI does actually exist, you can ignore that error from the test system. It won’t happen in production.
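If you’d like to confirm that a DOI is registered in production without relying on the test system’s response, the REST API URL pattern quoted above can be queried directly. This is a rough sketch only: the function names are hypothetical, and treating an HTTP 404 as “not registered” is an assumption about how the API answers for unknown DOIs.

```python
import urllib.error
import urllib.parse
import urllib.request

API_ROOT = "https://api.crossref.org/works/"


def works_url(doi):
    """Build the REST API URL for a DOI, keeping the slashes in the DOI intact."""
    return API_ROOT + urllib.parse.quote(doi, safe="/")


def doi_is_registered(doi, opener=urllib.request.urlopen):
    """Return True if the API returns metadata for the DOI.

    Assumption: an unknown DOI produces HTTP 404. Any other HTTP error
    is re-raised instead of being interpreted.
    """
    try:
        with opener(works_url(doi)) as response:
            return response.status == 200
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return False
        raise


print(works_url("10.1007/978-3-030-03329-3_25"))
```

Calling `doi_is_registered("10.1007/978-3-030-03329-3_25")` makes a live request, so it is left out of the block; the `opener` parameter is there so the lookup can be stubbed when testing offline.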


That explains a lot - I was in fact using test.crossref.org.

This mostly seems to work for standard references like journal articles or books or articles in conference proceedings. As you mentioned, there are lots of things that get many citations and lack a DOI or ISBN. For example, this paper has over 2200 citations and has had a stable URL for almost 20 years. Some publishers, like the Internet Society and Usenix, don’t use DOIs (this one has almost 6000 citations). It feels like a great oversight to omit a field for a stable URL. Should we be using elocation_id for those URLs? A lot of people regard URLs as unstable identifiers, but the web has changed a lot due to SEO, and in some cases this is the best identifier we have. A DOI is always superior, of course.

The way references are handled in our schema was geared towards matching references to Crossref DOIs, when DOIs for those cited items exist. That’s why we haven’t allowed for URLs in the citation markup.

elocation_id is intended for article IDs or page locators. So, that’s not suitable for stable URLs.

The best option, if you have a reference for something that you already know doesn’t have a DOI, is actually just to use <unstructured_citation> rather than marking up the individual elements. You can put the URL in there the same way that you would have it at the end of a formatted reference.
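As a rough illustration of that approach, the sketch below builds an `<unstructured_citation>` from an already-formatted reference string and appends the URL at the end. The function name, the sample reference, and the example.org URL are all made up for illustration; how you format the reference text from your own source data is up to you.

```python
import xml.etree.ElementTree as ET


def unstructured_citation(key, formatted_reference, url=None):
    """Return a <citation> element (as a string) holding one formatted reference.

    If a URL is given, it is appended to the end of the reference text,
    the way it would appear at the end of a formatted reference.
    """
    citation = ET.Element("citation", {"key": key})
    body = ET.SubElement(citation, "unstructured_citation")
    text = formatted_reference.rstrip()
    if url:
        text = f"{text} {url}"
    body.text = text  # ElementTree escapes &, < and > when serializing
    return ET.tostring(citation, encoding="unicode")


# Hypothetical reference and URL, for illustration only.
example = unstructured_citation(
    "ref9",
    "Doe, J. (2005). An example report.",
    url="https://example.org/reports/42",
)
print(example)
```

This only covers plain-text references; titles containing inline markup would need additional handling.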

It makes sense that you’re targeting references with DOIs (after all that is your business). Unfortunately we would have to do quite a bit of work to generate <unstructured_citation> from the BibTeX format, because <unstructured_citation> is misnamed - it’s really just a different structure with its own tag set. For example, many of our references contain mathematics in the title, and while <article-title> supports inline mathematics as <tex-math>, <unstructured_citation> only seems to support <mml:math>. The conversion is non-trivial to handle so we’ll probably just send the structured version and have them ignored.

Since some publishers use only a stable URL as their identifier, that’s what we will be sending in <elocation_id>, since there is no other field for it. If Crossref chooses to drop it, then so be it, but if a client of the data wants to match references, then sending more data gives them a better chance.