Cleaning Tool for DOI Reference Backfilling

Hi there, everyone,

I’ve been vibe coding a small Flask tool that audits Crossref <doi_batch> deposit XML before submission and provides a card-based cleanup UI for the kinds of issues that pass XSD validation but look bad in the deposited record. I’d like to share it here in case it’s useful to other publisher members, especially smaller presses doing reference-list backfills against existing DOIs. I use these backfills to feed my own bibliometric tool, Pinakes (http://pinakes.xyz).

What it does:
• Validates against the official Crossref XSD (so you catch schema issues before submission, not after)
• Runs sixteen heuristic rules over the deposit, catching things like glued citations (two refs concatenated into one <unstructured_citation>), repeat-author “---------.” markers that didn’t get expanded, paragraph-shaped body text mistakenly picked up as a citation, fields carrying the doi prefix, duplicate years from OCR scrambling, future-dated years, stuck whitespace, and a handful of other ingestion-layer rejection patterns
• Card-based cleanup UI: each flagged citation gets a Keep / Delete / Split decision, with bulk auto-decide for the obvious cases
• Batch workflow: upload all volumes of a journal at once, clean each, merge into one consolidated deposit
• Crossref REST API integration for inline DOI matching on borderline cases
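
For anyone wondering what a heuristic rule of this kind might look like, here is a minimal Python sketch. It is not taken from the repo: the rule names, the regexes, and the "two distinct years suggests a glued citation" threshold are all illustrative assumptions, and the sample XML is a trimmed fragment rather than a complete, valid deposit. It uses only the standard library.

```python
import re
import xml.etree.ElementTree as ET
from datetime import date

# Hypothetical patterns for illustration; the tool's real rules may differ.
YEAR_RE = re.compile(r"\b(1[5-9]\d{2}|20\d{2})\b")
REPEAT_AUTHOR_RE = re.compile(r"^\s*[-_\u2014]{3,}\s*[.,]")
DOI_PREFIX_RE = re.compile(r"^\s*(doi:|https?://(dx\.)?doi\.org/)", re.I)


def local_name(tag):
    """Drop the '{namespace}' part so the sketch ignores the schema version."""
    return tag.rsplit("}", 1)[-1]


def audit(xml_text, today=None):
    """Return a list of (citation_key, rule_name) flags for one deposit."""
    this_year = (today or date.today()).year
    flags = []
    for element in ET.fromstring(xml_text).iter():
        if local_name(element.tag) != "citation":
            continue
        key = element.get("key", "?")
        for child in element:
            name, text = local_name(child.tag), (child.text or "")
            if name == "unstructured_citation":
                # Unexpanded repeat-author marker at the start of the string.
                if REPEAT_AUTHOR_RE.match(text):
                    flags.append((key, "repeat_author_marker"))
                years = {int(y) for y in YEAR_RE.findall(text)}
                if any(y > this_year for y in years):
                    flags.append((key, "future_year"))
                # Assumed weak signal: two distinct years may mean two refs.
                if len(years) >= 2:
                    flags.append((key, "possibly_glued"))
            elif name == "doi" and DOI_PREFIX_RE.match(text):
                # Assumed reading of "fields carrying the doi prefix".
                flags.append((key, "doi_has_prefix"))
    return flags


SAMPLE = """<doi_batch xmlns="http://www.crossref.org/schema/5.3.1">
  <citation_list>
    <citation key="ref1"><unstructured_citation>Smith, J. 1999. A Book. Oxford.</unstructured_citation></citation>
    <citation key="ref2"><unstructured_citation>---------. 2001. Another Book. Oxford.</unstructured_citation></citation>
    <citation key="ref3"><doi>doi:10.1000/xyz</doi></citation>
    <citation key="ref4"><unstructured_citation>Lee, K. 2099. Future Work. Berlin.</unstructured_citation></citation>
    <citation key="ref5"><unstructured_citation>Doe, A. 1987. First Title. Paris. Roe, B. 1992. Second Title. Rome.</unstructured_citation></citation>
  </citation_list>
</doi_batch>"""

if __name__ == "__main__":
    for flag in audit(SAMPLE):
        print(flag)
```

A distinct-year count will misfire on reprints and date ranges, which is why flags like these are better routed to a Keep / Delete / Split review step than applied automatically.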

I built it to support a reference-extraction pipeline I run across multiple humanities journals where GROBID and AnyStyle produce reference lists that are 95% correct but have systematic quirks (citation gluing, OCR garbage in scanned older content, footnotes/endnotes masquerading as citations, etc.) that a heuristic pass catches efficiently.

GitHub repo: justalewis/crossref-references-deposit-auditor

Here’s a Substack writeup on the tool: New Tool: CrossRef References Deposit Auditor

There’s a sample-deposits/ directory with two small example XMLs (one clean, one with seven planted issues) so you can docker compose up and see what the rules catch in about thirty seconds without needing your own deposit on hand.

It’s free, GPL-3 licensed, and a hobby project — but I check the issue tracker reliably and would love to hear from other publishers about deposit patterns the existing rules don’t yet catch. If your scraper or JATS pipeline produces a particular flavor of deposit garbage, a feature-request issue with an example XML snippet is the easiest way to land a new rule.

Happy to answer questions. Also working on a complementary tool (/mint) for back-catalog DOI minting from whole-issue PDFs (visual page tagging, ToC OCR, content-registration XML generation) — same repo. Less polished but functional and helpful for really old content in non-accessible OCRed PDFs.

Looking forward to connecting with the community more!

justin


Hi Justin,

Thanks for sharing this!

I’ve modified your forum permissions, so you can share links now.

-Shayn
