The blog post was quite good. I think part of the problem is inadequate schemas. Take page numbers as an example: part of the problem may be that the schema accepts any string of up to 32 characters. Some page numbers are Roman numerals, so it’s natural to accept a string, but it also means that something like 121-123 is not caught as an error. There is a natural tension between being as flexible as possible in allowing weird page numbers and assuming that most people will use the field in a given way. Sometimes it’s better to err on the side of a restrictive schema if you are really counting on that data for matching. In this case I suspect page numbers are diminishing in importance.
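To make the tension concrete, here is a minimal sketch in Python of a loose check next to a stricter one. The function names and the patterns are my own assumptions for illustration, not anything taken from an actual schema. The strict check is deliberately crude: it accepts any run of Roman-numeral letters, even malformed ones, and would reject things like article numbers that contain other letters.

```python
import re

# Hypothetical patterns for illustration only: a single page is either
# Arabic digits or a run of Roman-numeral letters (either case).
_ARABIC = re.compile(r"[0-9]{1,32}")
_ROMAN = re.compile(r"[ivxlcdm]{1,32}", re.IGNORECASE)


def loose_check(value: str) -> bool:
    """The loose rule: any string of up to 32 characters is accepted."""
    return len(value) <= 32


def is_single_page(value: str) -> bool:
    """Stricter rule: accept one Arabic or Roman page number, reject ranges."""
    return bool(_ARABIC.fullmatch(value) or _ROMAN.fullmatch(value))
```

With these definitions, 121 and xiv pass both checks, while 121-123 passes only the loose one, which is exactly the kind of input that later breaks matching.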
We have struggled to collect author names correctly, because authors tend to be sloppy in how they write their coauthors’ names (and are sometimes sloppy with their own). People also change their names, and some don’t have surnames. I recommend that everyone read “Falsehoods Programmers Believe About Names” to understand how complex this is.
Titles in computer science and mathematics are also problematic, because it is common to use mathematical notation in a title, and essentially nobody uses MathML correctly (authors write TeX, and there are no reliable translators). That makes matching on titles unreliable. The JATS format accepts inline mathematics in either MathML or TeX.
I am particularly concerned about the Crossref schema for citations, which is pretty loose (e.g., it only supports a single author name). I suspect that when a DOI is included for a citation, that is all you need. When there is no DOI, it’s important to have a schema that accurately reflects what a citation looks like. Because of this, we have decided to collect citation information in the JATS format instead of the Crossref format.
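As a rough sketch of what I mean, here is a small Python model of a citation record. It is neither the JATS nor the Crossref schema. The class names, the fields, and the rule about which fields are required without a DOI are all assumptions I made up for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Author:
    # One free-form name field, so we don't assume everyone has a surname.
    name: str


@dataclass
class Citation:
    doi: Optional[str] = None
    # A list, so a citation can carry every author rather than just one.
    authors: List[Author] = field(default_factory=list)
    title: Optional[str] = None
    year: Optional[int] = None


def is_usable(citation: Citation) -> bool:
    """A DOI alone is enough; otherwise require authors, a title, and a year.

    The fallback rule is an illustrative assumption, not a published policy.
    """
    if citation.doi:
        return True
    return bool(citation.authors and citation.title and citation.year)
```

The point of the fallback branch is that, without a DOI, the record has to carry enough structure on its own to be matched.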