New version of the dataset of relationships between preprints and journal articles

We’ve published a new version of the dataset of relationships between preprints and journal articles :tada:

The dataset is the result of applying the preprint matching strategy at scale to the metadata deposited by Crossref members. It contains both relationships asserted by Crossref members and relationships discovered by the matching strategy.

The dataset is a single CSV file with the following fields:

  • preprint DOI (string)
  • journal article DOI (string)
  • whether the publisher of the journal article deposited this relationship (boolean)
  • whether the publisher of the preprint deposited this relationship (boolean)
  • the confidence score returned by the matching strategy (float, empty if the strategy did not discover this relationship)
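
For anyone who wants to load the file, here is a minimal Python sketch of reading those five fields. It rests on a few assumptions not stated in this post: that the columns appear in the order listed above, that booleans are written as `true`/`false`, and that a header row may or may not be present (controlled by the hypothetical `has_header` flag). The names `Relationship`, `parse_bool` and `read_relationships` are illustrative only, and the DOIs in the sample are made up.

```python
import csv
import io
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass
class Relationship:
    preprint_doi: str
    article_doi: str
    deposited_by_article_publisher: bool
    deposited_by_preprint_publisher: bool
    score: Optional[float]  # None when the strategy did not discover the pair


def parse_bool(value: str) -> bool:
    # Assumption: booleans are spelled true/false (any case); adjust if the file differs.
    return value.strip().lower() in {"true", "1", "yes"}


def read_relationships(lines: Iterable[str], has_header: bool = False) -> Iterator[Relationship]:
    # Assumption: five columns, in the order given in the field list.
    reader = csv.reader(lines)
    if has_header:
        next(reader, None)  # skip the header row if the file has one
    for row in reader:
        if len(row) != 5:
            continue  # skip blank or malformed lines
        preprint, article, by_article, by_preprint, score = row
        yield Relationship(
            preprint.strip(),
            article.strip(),
            parse_bool(by_article),
            parse_bool(by_preprint),
            float(score) if score.strip() else None,  # empty score -> None
        )


# Made-up sample rows, only to show the expected shape of the input.
sample = io.StringIO(
    "10.1101/example.1,10.1000/example.a,true,false,0.93\n"
    "10.1101/example.2,10.1000/example.b,true,true,\n"
)
for rel in read_relationships(sample):
    print(rel)
```

To read the real file, pass an open file object instead of `sample`, for example `open(path, newline="", encoding="utf-8")`.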

The dataset contains:

  • 1,060,572 relationships in total, involving 954,782 preprints and 953,453 journal articles,
  • 24,333 relationships that were deposited by Crossref members but not discovered by the strategy,
  • 598,480 relationships that were discovered by the strategy but not deposited by any Crossref member,
  • 437,759 relationships that were both deposited by a Crossref member and discovered by the strategy.
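
The three categories above partition the total (24,333 + 598,480 + 437,759 = 1,060,572). A rough sketch of how one might recompute that breakdown from the rows, assuming a relationship counts as "deposited" when either boolean flag is true and as "discovered" when the score is non-empty. The function names and the tuple layout are hypothetical, not part of the dataset.

```python
from collections import Counter
from typing import Iterable, Optional, Tuple

# (deposited by article publisher, deposited by preprint publisher, score or None)
Row = Tuple[bool, bool, Optional[float]]


def categorize(deposited: bool, score: Optional[float]) -> str:
    # Assumption: an empty score means the strategy did not discover the relationship.
    discovered = score is not None
    if deposited and discovered:
        return "both"
    if deposited:
        return "deposited_only"
    if discovered:
        return "discovered_only"
    return "neither"  # should not occur if every row has at least one source


def breakdown(rows: Iterable[Row]) -> Counter:
    counts: Counter = Counter()
    for by_article, by_preprint, score in rows:
        counts[categorize(by_article or by_preprint, score)] += 1
    return counts


# Figures reported in this post; the categories should add up to the total.
REPORTED = {"deposited_only": 24_333, "discovered_only": 598_480, "both": 437_759}
assert sum(REPORTED.values()) == 1_060_572
```

Running `breakdown` over the parsed file and comparing the result with `REPORTED` is a quick sanity check that the file was read as intended.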

More information about the matching approach and strategy can be found here.
