Ticket of the month - July 2025 - Parsing with Flying Colours (and DOIs)

Crossref metadata is a treasure trove of information for researchers, publishers, and data enthusiasts alike. It provides a standardised way to describe scholarly content across the many content types that Crossref supports.

At the heart of this system lies XML, a markup language that allows for structured and machine-readable data. Understanding how to parse and validate this XML is crucial for anyone who wants to work efficiently with Crossref data, because XML is the only method we currently support for accepting DOI registrations or updates.

XML in Crossref

XML provides a flexible and hierarchical structure for organising diverse information about different pieces of work. Each piece of information, whether it’s an author’s name, a publication date, or a DOI, is enclosed within clearly defined tags, making it easy for machines to read and process.

Why XML?

  • Structure: XML’s hierarchical nature allows for complex relationships between different metadata elements in a predefined format.

  • Machine-readability: The clear tagging makes it straightforward for software applications to parse and extract specific data points.

  • Extensibility: XML can be extended to include new elements and schema updates as the needs of the scholarly publishing community evolve.

Understanding the Crossref XML Schema

While XML provides the structure, the XML Schema defines the rules for that structure. The Crossref Schema (XSD) specifies which elements and attributes are allowed, their data types, and their relationships. This is crucial for ensuring the consistency and validity of the metadata.

What is an XML Schema (XSD)?

An XML Schema Definition (XSD) is a language for defining the structure and content of XML documents. It acts as a blueprint, specifying:

  • Elements: The tags that can appear in the XML document.

  • Attributes: Additional information associated with elements.

  • Data Types: The type of data allowed for elements and attributes (e.g., string, integer, date).

  • Quantity: How many times an element can appear (e.g., optional, required, multiple).

Crossref has many past and current schema versions, all of which are accepted, but the later versions support the fullest range of metadata for deposit. The most current schema is 5.4.0.

Why Schema Validation is Essential

Validating Crossref XML against its schema ensures that the metadata adheres to the defined standards. This is vital for:

  • Data Integrity: Guarantees that the data is well-formed and follows the expected structure.

  • Interoperability: Allows different systems to correctly interpret and process the metadata.

  • Error Detection: Helps identify and correct issues in the XML before it’s used.

Working with XML files and tools to create them

At Crossref we have a collection of example XML files that members can use as templates for their own deposit-ready XML, or as a guide when building platforms that generate XML in the Crossref schema format.

Note: If you aren’t ready to create your own XML, we have two helper tools available: the Web Deposit Form and the Record Registration Form. You enter metadata into the form fields, and the tool creates the XML file and sends it to the system for deposit.

Parsing the XML

If you are creating your own XML, once you have a file that aligns with our schema 5.4.0 you can check that everything is in order by running it through our parser. The parser flags errors in the format of the XML, its construction, the ordering of the elements, and the values supplied for those elements.
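Before sending a file to the parser, a quick local check for well-formedness can catch unclosed or mismatched tags early. The sketch below is one possible approach using Python's built-in xml.etree.ElementTree. The function name check_well_formed is hypothetical, and the check covers well-formedness only, not validity against the Crossref schema.

```python
import xml.etree.ElementTree as ET


def check_well_formed(xml_text):
    """Return 'ok' if xml_text is well-formed XML, else a short error message."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        line, column = err.position  # line is 1-based, column is 0-based
        return f"not well-formed at line {line}, column {column}: {err}"
    return "ok"


# Hypothetical, heavily trimmed inputs used only to exercise the function.
good = "<doi_batch><head/><body/></doi_batch>"
bad = "<doi_batch><head></doi_batch>"  # <head> is never closed

print(check_well_formed(good))
print(check_well_formed(bad))
```

A file that passes this check can still fail schema validation, so the parser remains the real test.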

Common parsing errors

[Error]: cvc-elt.1.a: Cannot find the declaration of element ‘doi_batch’.

This is usually caused by an incorrect, malformed, or missing declaration on the doi_batch element, so the parser cannot match the element to a schema.

Make sure that the declaration is correct and present at the top of the file, e.g.

<?xml version="1.0" encoding="UTF-8"?>
<doi_batch version="5.3.1" xmlns="http://www.crossref.org/schema/5.3.1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.crossref.org/schema/5.3.1 http://data.crossref.org/schemas/crossref5.3.1.xsd">

Once a valid declaration is in place then it should parse and validate successfully.
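As a rough local check for this error, you could confirm that the root element is doi_batch in the namespace you intend to use. This is a minimal sketch, assuming the 5.3.1 namespace from the declaration above. The function name root_problem and the trimmed sample documents are hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumption: the namespace matches the schema version you are depositing against.
EXPECTED_NS = "http://www.crossref.org/schema/5.3.1"


def root_problem(xml_bytes, expected_ns=EXPECTED_NS):
    """Return None if the root is doi_batch in expected_ns, else a description."""
    root = ET.fromstring(xml_bytes)
    # ElementTree writes namespaced tags as "{namespace}localname".
    expected_tag = f"{{{expected_ns}}}doi_batch"
    if root.tag == expected_tag:
        return None
    return f"root is {root.tag!r}, expected {expected_tag!r}"


ok = (b'<?xml version="1.0" encoding="UTF-8"?>'
      b'<doi_batch version="5.3.1" xmlns="http://www.crossref.org/schema/5.3.1"/>')
missing = (b'<?xml version="1.0" encoding="UTF-8"?>'
           b'<doi_batch version="5.3.1"/>')  # no xmlns declaration

print(root_problem(ok))
print(root_problem(missing))
```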


An invalid XML character (Unicode: 0x2) was found in the element content of the document.

This error means the file contains a character that XML does not allow, in this case the control character with Unicode code point 0x2. The deposit will fail for as long as that character remains in the file.

To correct this, find the character, remove it from the file, and redeposit the file with us. Some XML editors let you search for a character by its Unicode code point in Find and Replace; otherwise, go to the line reported in the error and remove the character there. If the text was pasted in from a rich-text editor, you may need to delete the contents of that element and paste the text back in as plain text.

This error normally happens when text (usually an abstract) has been copied from a rich-text editor and hidden formatting characters from that rich text are carried into the XML, which cannot represent them.
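If your editor cannot search by code point, a short script can locate or strip these characters. The sketch below assumes the XML 1.0 rule that only tab, line feed, carriage return, and characters from U+0020 upward are allowed (excluding surrogates, U+FFFE, and U+FFFF). The function names and the sample text are hypothetical.

```python
import re

# Matches any character outside the ranges that XML 1.0 permits.
INVALID_XML_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def find_invalid_chars(text):
    """Yield (line, column, code point) for each disallowed character, 1-based."""
    # Split on "\n" only: str.splitlines() would swallow some control characters.
    for lineno, line in enumerate(text.split("\n"), start=1):
        for match in INVALID_XML_CHARS.finditer(line):
            yield lineno, match.start() + 1, f"0x{ord(match.group()):X}"


def strip_invalid_chars(text):
    """Return text with every disallowed character removed."""
    return INVALID_XML_CHARS.sub("", text)


# Hypothetical abstract containing a stray 0x2 control character.
sample = "<abstract>Results\x02 were significant.</abstract>"

print(list(find_invalid_chars(sample)))
print(strip_invalid_chars(sample))
```

Reporting positions first, rather than stripping silently, lets you confirm that nothing meaningful is lost before you redeposit.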


Invalid content was found starting with element ‘{“http://www.crossref.org/schema/5.3.1”:pages}’. One of ‘{“http://www.ncbi.nlm.nih.gov/JATS1”:abstract, “http://www.crossref.org/schema/5.3.1”:publication_date}’ is expected

This error, or a similar one listing different elements, normally comes down to the ordering of elements in the XML file: the order in the submitted file does not match the order our schema requires. In the example above, pages appears at a point where the schema expects abstract or publication_date. Check the required order of the elements in the schema and make sure your file follows it exactly.
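A simple way to reason about ordering errors is to compare the children of an element against the sequence the schema expects. The sketch below assumes a simplified subset of the child order for a journal article; EXPECTED_ORDER is illustrative only, and the schema itself is the authority. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumption: a simplified subset of the expected child order, not the full schema.
EXPECTED_ORDER = ["titles", "contributors", "abstract",
                  "publication_date", "pages", "doi_data"]


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree adds to tags."""
    return tag.rsplit("}", 1)[-1]


def first_out_of_order(parent, expected=EXPECTED_ORDER):
    """Return the first child that follows an element it should precede, or None."""
    rank = {name: index for index, name in enumerate(expected)}
    highest = -1
    for child in parent:
        name = local_name(child.tag)
        if name not in rank:
            continue  # element not covered by this simplified list
        if rank[name] < highest:
            return name
        highest = rank[name]
    return None


# Hypothetical fragment in which pages is placed before publication_date.
article = ET.fromstring(
    "<journal_article><titles/><pages/><publication_date/><doi_data/></journal_article>"
)
print(first_out_of_order(article))
```

Here the function reports publication_date, because it arrives after pages; moving it ahead of pages resolves the mismatch.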


The value ‘110.64261/ijaarai.v1n1.0011’ of element ‘doi’ is not valid.

This error states that a value within an element is not valid. It comes from the schema, which can apply strict rules to the value of an element. If we look up the element in the schema documentation, the Facets section shows the validation rules that apply to its value. In this case, the DOI prefix should begin with 10., not 110.

[Screenshot: Facets section for the doi element in the schema documentation]

This tells us that the value of this element has a minimum and a maximum length, and that it must match the regex pattern shown to be a valid input.

Staying with this type of error, if we look at the element that holds the URL for the DOI, we can see the facets for that too.

[Screenshot: Facets section for the element that holds the URL for the DOI]

Again, this value must conform to the facets on the element: the minimum and maximum length and the regex pattern.
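Facets like these can be mirrored in a small pre-check. The sketch below is illustrative only: the length limits and the pattern are assumptions about the doi facets, since the exact values appear only in the screenshot, so confirm them against the schema before relying on them. The function name doi_is_valid is hypothetical.

```python
import re

# Assumptions: stand-ins for the minLength, maxLength, and pattern facets on doi.
DOI_MIN_LENGTH = 6
DOI_MAX_LENGTH = 2048
DOI_PATTERN = re.compile(r"10\.[0-9]{4,9}/.{1,200}")


def doi_is_valid(doi):
    """Return True if doi satisfies the assumed length and pattern facets."""
    if not DOI_MIN_LENGTH <= len(doi) <= DOI_MAX_LENGTH:
        return False
    # A schema pattern must match the whole value, hence fullmatch.
    return DOI_PATTERN.fullmatch(doi) is not None


print(doi_is_valid("110.64261/ijaarai.v1n1.0011"))  # the value from the error
print(doi_is_valid("10.64261/ijaarai.v1n1.0011"))   # the corrected prefix
```

The first value fails because it starts with 110. rather than 10., which is the problem the parser reported.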

Conclusion

Working with Crossref can require a solid understanding of XML parsing and schema validation if you are creating and depositing XML files with us directly.

Of course, if you are using our helper tools, you do not need to worry as much about the XML itself. That said, it is a huge benefit to have some understanding of XML and of the errors that can and do occur during registration.

For further reading and resources on Crossref’s processes and XML markup, visit the Crossref documentation pages.

Further information

Libraries and Tools for XML Parsing

Many programming languages offer robust libraries for working with XML. Here are a few of the most widely used tools, by language:

Python

  • xml.etree.ElementTree – Built-in and great for simple XML.

  • lxml – High-performance and feature-rich; supports XPath and schema validation.

  • BeautifulSoup – Ideal for parsing XML or HTML that isn’t perfectly structured.

Java

  • JAXP – Java’s core XML API, includes support for both DOM and SAX parsing.

  • DOM/SAX Parsers – DOM for tree-based parsing, SAX for streaming large files.

  • JAXB – Allows binding XML directly to Java objects.

JavaScript

  • DOMParser – Built-in browser API for XML parsing.

  • xml2js – Popular Node.js library for converting XML to JSON.

  • fast-xml-parser – Lightweight, high-performance parser in Node.js.

Rust

  • quick-xml – Fast and memory-efficient XML parser, great for performance-critical applications.

  • xml-rs – Simple and effective XML library for standard use cases.

PHP

  • SimpleXML – Easy-to-use and ideal for basic XML manipulation.

  • DOMDocument – More powerful and flexible than SimpleXML.

  • XMLReader – Efficient stream-based parser for large XML files.

Command Line / Cross-language Tools

  • xmlstarlet – Command-line toolkit for XML transformation, querying, and formatting.

  • xmllint – Useful for validating and pretty-printing XML; comes with libxml2.

These tools are especially helpful when working with complex schemas like ours, where structured parsing, validation, and transformation are needed for successful submissions.
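As a small illustration of the first Python option listed above, the sketch below uses xml.etree.ElementTree to list each DOI and its URL from a deposit file. The sample is a heavily trimmed, hypothetical fragment rather than a complete deposit, and the doi_data, doi, and resource element names and the 5.3.1 namespace are assumptions to check against your own files.

```python
import xml.etree.ElementTree as ET

# Assumption: the file uses the 5.3.1 namespace; adjust for other schema versions.
NS = {"cr": "http://www.crossref.org/schema/5.3.1"}

# Hypothetical, trimmed sample with a placeholder DOI and URL.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<doi_batch version="5.3.1" xmlns="http://www.crossref.org/schema/5.3.1">
  <body>
    <journal>
      <journal_article>
        <doi_data>
          <doi>10.5555/example-1</doi>
          <resource>https://example.org/articles/1</resource>
        </doi_data>
      </journal_article>
    </journal>
  </body>
</doi_batch>"""


def list_dois(xml_bytes):
    """Return a list of (doi, url) pairs found in doi_data elements."""
    root = ET.fromstring(xml_bytes)
    pairs = []
    for doi_data in root.iterfind(".//cr:doi_data", NS):
        doi = doi_data.findtext("cr:doi", namespaces=NS)
        url = doi_data.findtext("cr:resource", namespaces=NS)
        pairs.append((doi, url))
    return pairs


print(list_dois(SAMPLE))
```

The namespace prefix in the search paths matters: without it, ElementTree looks for elements with no namespace and finds nothing.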
