Not sure where to start this discussion, but I’m involved in designing a workflow for an open access journal platform. The content is primarily PDFs produced from LaTeX, but there may also be other output formats such as JATS or HTML. My feeling is that metadata should, if at all possible, be embedded in the PDF itself, since PDFs are what typically get passed around and reposted. PDF is a pretty ugly format, but it at least has ways to embed metadata (notably XMP). Given the importance of citation metadata for bibliometrics, I’m trying to embed citation metadata in a structured way as XMP metadata. While XMP is an extensible format using RDF, it’s not at all clear what standards have evolved for embedding references into XMP. The lack of standardized citation metadata has led to a mess in near-duplicate detection and to complicated machine learning algorithms (arXiv: abs/2006.05563) for extracting the metadata. As a result, citation analysis is only feasible for large commercial entities, who typically do not release this data freely. This is holding back open access publishing.
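To make the idea concrete, here is a minimal sketch of what a reference list inside an XMP packet could look like, built with Python's standard library. The RDF, Dublin Core, and `x:xmpmeta` namespaces are the usual XMP ones, but the `ref` namespace and its `citations` property are purely hypothetical placeholders, since no standard property for references is known here. The sketch only produces the RDF/XML; wrapping it in an `xpacket` and attaching it to a PDF is left out.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC = "http://purl.org/dc/elements/1.1/"
X = "adobe:ns:meta/"
XML = "http://www.w3.org/XML/1998/namespace"
# ASSUMPTION: hypothetical namespace for references, not a published standard.
REF = "http://example.org/ns/references/1.0/"

for prefix, uri in (("rdf", RDF), ("dc", DC), ("x", X), ("ref", REF)):
    ET.register_namespace(prefix, uri)


def build_xmp(title, citations):
    """Return XMP-style RDF/XML with a title and an ordered list of citations."""
    xmpmeta = ET.Element(f"{{{X}}}xmpmeta")
    rdf = ET.SubElement(xmpmeta, f"{{{RDF}}}RDF")
    desc = ET.SubElement(rdf, f"{{{RDF}}}Description", {f"{{{RDF}}}about": ""})

    # dc:title is stored as a language-alternative array.
    title_el = ET.SubElement(desc, f"{{{DC}}}title")
    alt = ET.SubElement(title_el, f"{{{RDF}}}Alt")
    li = ET.SubElement(alt, f"{{{RDF}}}li", {f"{{{XML}}}lang": "x-default"})
    li.text = title

    # Hypothetical property: an ordered sequence of citation strings.
    refs = ET.SubElement(desc, f"{{{REF}}}citations")
    seq = ET.SubElement(refs, f"{{{RDF}}}Seq")
    for citation in citations:
        ET.SubElement(seq, f"{{{RDF}}}li").text = citation

    return ET.tostring(xmpmeta, encoding="unicode")
```

Each list item here is just a string. A real design would presumably replace it with a structured record in whichever schema the community settles on.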
There is an obvious desire to include DOIs wherever possible, but this isn’t always possible. There are at least three formats that seem plausible for embedding citation metadata:
- The JATS format has a very detailed schema for element-citation and seems like the best choice. The requirement to separate surname and given names may cause some complications, but it is otherwise pretty feasible.
- The Crossref format lacks a few things. In particular, it doesn’t appear to support encoding more than one author name unless I use the unstructured_citation form.
- Invent my own, based on the original citation metadata from BibTeX. This would at least preserve as much as possible of what the author supplied, though BibTeX is an aging format that is overdue to be replaced.
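As a rough sketch of the first option, the following Python maps a BibTeX-like record onto a JATS `element-citation`. The input dictionary layout and the helper names `split_name` and `bibtex_to_element_citation` are assumptions made for illustration. The name splitting is a naive heuristic, which is exactly where the surname/given-name complication shows up: particles such as "von" or multi-word surnames will be split incorrectly unless the "Surname, Given" form is used.

```python
import xml.etree.ElementTree as ET


def split_name(name):
    """Naively split a BibTeX-style name into (surname, given names)."""
    name = name.strip()
    if "," in name:
        # "Surname, Given" form is unambiguous.
        surname, given = [part.strip() for part in name.split(",", 1)]
    else:
        # "Given Surname" form: assume the last token is the surname.
        parts = name.split()
        surname, given = parts[-1], " ".join(parts[:-1])
    return surname, given


def bibtex_to_element_citation(entry):
    """Build a JATS element-citation from a dict of BibTeX-like fields."""
    cit = ET.Element(
        "element-citation", {"publication-type": entry.get("type", "journal")}
    )
    group = ET.SubElement(cit, "person-group", {"person-group-type": "author"})
    for author in entry.get("author", "").split(" and "):
        if not author.strip():
            continue
        surname, given = split_name(author)
        name = ET.SubElement(group, "name")
        ET.SubElement(name, "surname").text = surname
        if given:
            ET.SubElement(name, "given-names").text = given

    # Copy simple fields onto their JATS counterparts when present.
    for key, tag in (
        ("title", "article-title"),
        ("journal", "source"),
        ("year", "year"),
        ("volume", "volume"),
    ):
        if entry.get(key):
            ET.SubElement(cit, tag).text = entry[key]

    # The DOI is optional, since not every cited work has one.
    if entry.get("doi"):
        ET.SubElement(cit, "pub-id", {"pub-id-type": "doi"}).text = entry["doi"]
    return cit
```

Calling `ET.tostring(bibtex_to_element_citation(entry), encoding="unicode")` yields the XML string, which could then be placed wherever the references end up being embedded.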
My question is two-fold:
- has anyone observed PDFs that contain citation metadata in XMP?
- is there any progress on standardizing reference metadata?
Hi @mccurley,
Thanks for your message. I’ll leave the majority of your question(s) for the community at large to address, since there may be others outside of Crossref staff better positioned for it.
As for reference metadata, we do have a standardized method for reference metadata to be registered with us - using the DOI. I know you said that including the DOI in the metadata is not always possible, but I do feel obligated to mention that submitting a DOI in your reference metadata will ensure that our system establishes a link between the DOI being registered and the DOI in your citation list, since that is the most definitive metadata element that we use in matching.
Thank you for your question, @mccurley. I agree with @ifarley (Isaac) regarding reference metadata: there is a standardized procedure for registering reference metadata using the DOI. It may not always be possible to include the DOI in the metadata, but a DOI in your reference metadata guarantees that the system creates a link between the DOI being registered and the DOI in your citation list, because it is the most specific metadata element used in matching.
People don’t need to keep repeating that DOIs are important; that is obvious. It’s worth noting that some famous papers do not have DOIs. For one thing, DOIs cost money, so DOIs tend to have been assigned only to things that are owned by publishers and can be sold back to the reading public. I can think of several things that don’t have DOIs: Tate’s thesis in number theory, for example, or Riemann’s paper "Über die Anzahl der Primzahlen unter einer gegebenen Grösse", Monatsb. der Berliner Akad. (1858/60) 671-680, which proposed the most famous unsolved problem in mathematics. In some fields like economics it is common to cite working papers like "WP 2008-1 Debunking the Myths of Computable General Equilibrium Models", which has over a hundred citations. That’s why we need a schema to describe publications. People have been citing academic literature for hundreds of years, but we still don’t have an appropriate XML schema?
There is also the CSL JSON format that’s supported by most “citeproc” processors: it could either be included directly, or converted to XML via XSLT, e.g. via the json-to-xml function in XSLT-3.0.
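For readers without an XSLT 3.0 processor at hand, the conversion can be approximated in a few lines of Python. This is only a sketch that loosely mirrors the `map`/`array`/`string`/`number`/`boolean`/`null` element vocabulary that json-to-xml produces. It omits the XPath functions namespace and is not a substitute for the real function. The helper name `to_xml` is made up for the example.

```python
import json
import xml.etree.ElementTree as ET


def to_xml(value, key=None):
    """Convert a parsed JSON value into a json-to-xml-like element tree."""
    if isinstance(value, dict):
        el = ET.Element("map")
        for child_key, child_value in value.items():
            el.append(to_xml(child_value, child_key))
    elif isinstance(value, list):
        el = ET.Element("array")
        for item in value:
            el.append(to_xml(item))
    elif isinstance(value, bool):
        # bool must be tested before int, since bool is a subclass of int.
        el = ET.Element("boolean")
        el.text = "true" if value else "false"
    elif value is None:
        el = ET.Element("null")
    elif isinstance(value, (int, float)):
        el = ET.Element("number")
        el.text = str(value)
    else:
        el = ET.Element("string")
        el.text = str(value)

    # Members of a JSON object carry their name in a "key" attribute.
    if key is not None:
        el.set("key", key)
    return el


def csl_json_to_xml(text):
    """Parse a CSL JSON string and return the XML serialization."""
    return ET.tostring(to_xml(json.loads(text)), encoding="unicode")
```

A CSL JSON item such as `{"type": "article-journal", "author": [{"family": "Doe", "given": "Jane"}]}` becomes a `map` element whose children carry `key` attributes, which is a shape that could be embedded in XMP or transformed further.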
Albert Krewinkel’s response seems to have disappeared, but he said:
There is also the CSL JSON format: it could either be included directly, or converted to XML via XSLT, e.g. via the json-to-xml function in XSLT-3.0.
It’s an interesting approach, but it appears that cloudcite has shut down. I noticed a couple of flaws in the JSON schema:
- The “type” field is lacking a “manual” type, and there is no distinction between a PhD Thesis, Habilitation, Master’s thesis or undergraduate thesis.
- it’s not clear how to reference a letter or correspondence.
- authors do not have ORCID or any other identifier.
- author names do not include multiple given names or middle names.
I’m also not sure if this is related to the Citation Style Language, which is XML-based and apparently supported by Zotero and Mendeley. They seem to have quite a bit of information on citationstyles dot org. They appear to have boiled the ocean with thousands of styles, but I had a hard time understanding their actual schema.
Crafting a good schema is of course a challenge, and it should be done as a community activity if it is ever going to become widely used. I know of several past efforts, such as BibTeX, BibLaTeX, the RIS format, schema dot org’s ScholarlyArticle, and maybe MODS. There are also proprietary formats such as the EndNote XML schema. Zotero has a list of supported bibliographic data formats.
Apparently the spam filter didn’t like that I edited my message to link to the CSL JSON schema definition, so it removed my message.
Very good points about the shortcomings; I actually wasn’t aware of some of those (e.g., I never noticed that there is no type for “manual”).