I volunteer a lot on Wikipedia’s database, Wikidata, and have seen that a lot of DOIs have been imported, which is great, but the import identifies each author by name in a text string rather than by a unique ID.
I was wondering what ideas people have for fixing this, perhaps using a public API so that it can be done in bulk.
For example, I wonder whether there are any open DOI databases that express the authors as ORCID iDs (instead of as plain text).
Thanks for your question.
There are some limiting factors that would make that difficult: 1) not all authors have opted to get ORCIDs for themselves, 2) not all publishers collect ORCIDs along with submitted manuscripts, and 3) even among those that do, not all of them submit the ORCIDs in the metadata they send to Crossref.
We strongly encourage the collection and submission of ORCIDs, and those that are submitted are made publicly available via our APIs. They’re unfortunately just not very comprehensive.
About 8% of all Crossref DOIs have ORCIDs included in their metadata. That goes up to 11% when you’re just looking at journal articles.
Though if you search through some of our participation reports, there are certainly some publishers that are doing a great job of supplying ORCIDs. For example, eLife and F1000 are both upwards of 80%. You’ll also notice that more recently published and registered content is more likely to have ORCIDs included than older content.
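As a rough illustration of reading submitted ORCIDs from the public API, here is a minimal sketch. It assumes the Crossref REST API route `https://api.crossref.org/works/{doi}` and that each entry in the `author` array may carry an `ORCID` URL. The function names and the sample record are hypothetical, written only for this example, and the network call is defined but not run.

```python
import json
import re
import urllib.parse
import urllib.request

# Matches the ORCID iD inside a bare iD or a full orcid.org URL.
ORCID_RE = re.compile(r"\d{4}-\d{4}-\d{4}-\d{3}[\dX]")


def crossref_url(doi):
    """Build the REST API URL for a single DOI."""
    return "https://api.crossref.org/works/" + urllib.parse.quote(doi, safe="/")


def authors_with_orcids(message):
    """Return (name, orcid_or_None) pairs from a work's 'message' object."""
    pairs = []
    for author in message.get("author", []):
        parts = [author.get("given"), author.get("family")]
        # Organisational authors have a single 'name' field instead.
        name = " ".join(p for p in parts if p) or author.get("name", "")
        match = ORCID_RE.search(author.get("ORCID", ""))
        pairs.append((name, match.group(0) if match else None))
    return pairs


def fetch_work(doi):
    """Fetch one work record over the network (not called in this sketch)."""
    with urllib.request.urlopen(crossref_url(doi), timeout=30) as resp:
        return json.load(resp)["message"]


# Hand-written sample in the shape of a Crossref 'message'; not real data.
sample_message = {
    "DOI": "10.1234/example",
    "author": [
        {"given": "Ada", "family": "Example",
         "ORCID": "http://orcid.org/0000-0002-1825-0097",
         "authenticated-orcid": False},
        {"given": "Ben", "family": "Placeholder"},
    ],
}

print(authors_with_orcids(sample_message))
```

Authors without an ORCID come back paired with `None`, so a bulk job can tell which name strings still need some other form of disambiguation.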
The problem of identifying an author by name is very difficult. Just within the field of computer science there are 217 people named exactly “Wei Zhang” listed as authors in DBLP (see the dblp page “Wei Zhang (disambiguation)”). Author disambiguation is an active area of research using various signals and machine learning methods, but it isn’t foolproof by any means. See “Graph-based methods for Author Name Disambiguation: a survey” (PeerJ).
This is one reason why ORCID is so important for identifying authors, but publishers have been slow to require them and authors have been slow to declare them. I think one reason is that ORCID has tried to coerce authors into logging in to its website to prove that their ORCID is authentic. We have too many places to log in, and it’s just easier for authors to say they don’t have an ORCID. When a paper has six authors with one corresponding author, it is pure folly to think that all six authors will log in just to prove they own an ORCID. Some journals using OJS have been disabling the ORCID plugin because of this. I suspect the ORCID organization wants to drive traffic to its site, but this actually has a detrimental effect on ORCID usage. Luckily, the Crossref metadata schema allows submission of unauthenticated ORCIDs.
@back_ache I too do a lot of work on Wikidata and have tools to add articles based on DOIs from Crossref (and other DOI agencies). If the Crossref metadata includes an ORCID and that ORCID is in Wikidata, then I link the author to that Wikidata item. But as @Shayn and @mccurley note, many authors either lack an ORCID or, even if they have one, it is not recorded in the manuscript. My sense is that in many cases only the lead or submitting author has an ORCID.
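The matching step described above (ORCID in the metadata, then a lookup in Wikidata) could be sketched roughly as follows. This is not the poster's tool. It assumes the Wikidata Query Service endpoint at `https://query.wikidata.org/sparql` and the ORCID iD property `P496`. The function names, the User-Agent string, and the sample result (including its QID) are placeholders made up for this example.

```python
import json
import re
import urllib.parse
import urllib.request

SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"
ORCID_RE = re.compile(r"\d{4}-\d{4}-\d{4}-\d{3}[\dX]")


def orcid_query(orcids):
    """Build one SPARQL query that looks up many ORCID iDs at once."""
    for orcid in orcids:
        # Reject anything that is not a bare iD before it reaches the query.
        if not ORCID_RE.fullmatch(orcid):
            raise ValueError("not an ORCID iD: %r" % orcid)
    values = " ".join('"%s"' % orcid for orcid in orcids)
    return (
        "SELECT ?orcid ?item WHERE { "
        "VALUES ?orcid { %s } ?item wdt:P496 ?orcid . }" % values
    )


def parse_bindings(result):
    """Map each ORCID iD to a QID from a SPARQL JSON result."""
    mapping = {}
    for row in result["results"]["bindings"]:
        qid = row["item"]["value"].rsplit("/", 1)[-1]
        mapping[row["orcid"]["value"]] = qid
    return mapping


def lookup_items(orcids):
    """Run the query over the network (not called in this sketch)."""
    params = urllib.parse.urlencode(
        {"query": orcid_query(orcids), "format": "json"})
    request = urllib.request.Request(
        SPARQL_ENDPOINT + "?" + params,
        headers={"User-Agent": "orcid-lookup-sketch/0.1 (example only)"})
    with urllib.request.urlopen(request, timeout=60) as resp:
        return parse_bindings(json.load(resp))


# Hand-written sample in the SPARQL JSON results shape; the QID is a placeholder.
sample_result = {
    "results": {"bindings": [
        {"orcid": {"type": "literal", "value": "0000-0002-1825-0097"},
         "item": {"type": "uri",
                  "value": "http://www.wikidata.org/entity/Q4115189"}},
    ]}
}

print(parse_bindings(sample_result))
```

Batching many iDs into one `VALUES` clause keeps the number of requests low. Any iD missing from the returned mapping has no matching item yet, so that author would stay as a name string.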
Another problem is that many authors don’t populate or update their ORCID profiles, so even if they have an ORCID and have published, there is no link between the DOI and the ORCID.
A while back I wrote a crude tool that uses the ORCID API to fetch ORCIDs for a given DOI; see https://enchanting-bongo.glitch.me
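For readers who want to try the same idea, here is a minimal sketch of a DOI-to-ORCID lookup. It is not the code behind the tool linked above. It assumes the ORCID public API search route `https://pub.orcid.org/v3.0/search/`, the `doi-self` search field, and a JSON response whose `result` entries carry an `orcid-identifier` object. The function names and the sample payload are hypothetical, so check the response shape against the ORCID documentation before relying on it.

```python
import json
import urllib.parse
import urllib.request

SEARCH_BASE = "https://pub.orcid.org/v3.0/search/"


def orcid_search_url(doi):
    """Build a search URL for ORCID records that list this DOI."""
    query = 'doi-self:"%s"' % doi
    return SEARCH_BASE + "?" + urllib.parse.urlencode({"q": query})


def orcids_from_search(payload):
    """Pull the bare iDs out of a search response."""
    # 'result' is assumed to be missing or null when nothing matches.
    hits = payload.get("result") or []
    return [hit["orcid-identifier"]["path"] for hit in hits]


def find_orcids_for_doi(doi):
    """Run the search over the network (not called in this sketch)."""
    request = urllib.request.Request(
        orcid_search_url(doi), headers={"Accept": "application/json"})
    with urllib.request.urlopen(request, timeout=30) as resp:
        return orcids_from_search(json.load(resp))


# Hand-written sample in the assumed response shape; not real data.
sample_payload = {
    "result": [
        {"orcid-identifier": {
            "uri": "https://orcid.org/0000-0002-1825-0097",
            "path": "0000-0002-1825-0097",
            "host": "orcid.org"}},
    ],
    "num-found": 1,
}

print(orcids_from_search(sample_payload))
```

A search like this only finds authors who have added the work to their own ORCID record, so it complements the ORCIDs in Crossref metadata and does not replace them.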
As I’m sure you’re aware, there are active efforts to convert author name strings to Wikidata items, but the task is obviously large, and the incentives are few.