Increasing Crossref Data Reusability With Format Experiments - Crossref

Crossref · 22 January 2024 17:59

Every year, Crossref releases a full public data file of all of our metadata. This is partly a commitment to POSI and partly just what we do. We want the community to re-use our metadata and to find interesting ends to which they can be put!

This is a companion discussion topic for the original entry at https://www.crossref.org/blog/increasing-crossref-data-reusability-with-format-experiments

abeechin · 23 January 2024 20:08

Hi, thanks for enabling the discussion!

JSONL would already be a great improvement for us when ingesting the file. Currently, to load it into Hive we need to unzip the files and flatten all of the records, effectively into JSONL; concatenate and re-zip them into a reasonable number of files to reduce PUT operations to HDFS; mount the files in Hive; and apply a handwritten schema.

To do all of that whilst remaining future-proof against schema changes, we would propose Avro as an alternative data format.
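As a rough illustration of the flattening step described above, here is a minimal Python sketch using only the standard library. It assumes the dump is a directory of gzipped JSON files, each holding its records in a top-level "items" array; the function name and the paths are hypothetical, and this is not our actual pipeline.

```python
import gzip
import json
from pathlib import Path


def flatten_to_jsonl(src_dir, dest_path):
    """Write every record found under src_dir as one JSON object per line.

    Returns the number of records written.
    """
    count = 0
    with gzip.open(dest_path, "wt", encoding="utf-8") as out:
        for src in sorted(Path(src_dir).glob("*.json.gz")):
            with gzip.open(src, "rt", encoding="utf-8") as fh:
                payload = json.load(fh)
            # Assumption: each file keeps its records in a top-level "items" array.
            for record in payload.get("items", []):
                out.write(json.dumps(record, ensure_ascii=False) + "\n")
                count += 1
    return count
```

Writing many source files into one compressed output is also what keeps the number of files, and therefore the number of PUT operations, down.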

It:

Can be split into smaller files for distribution
Supports native compression (snappy)
Embeds the schema inside the file, allowing for full hydration natively by clients (e.g. Hadoop, BigQuery, Databricks, Snowflake) - also good for data integrity
Facilitates schema evolution

castedo · 24 January 2024 02:36

To be even more POSI, has Crossref considered making its metadata available as a Dolt [1] database, and keeping that database up to date regularly rather than annually?

My current impression is that:

the risk of vendor lock-in is extremely small,
Crossref has the option to host it themselves or have a database on dolthub [2],
hosting the data on dolthub dot com will be free,
a Dolt database can handle this size of data,
community members can easily make clone databases (with enough available disk space),
clone databases can efficiently stay up-to-date with only differences being copied,
cloned databases automatically get the schema and SQL structure without any extra work.

[1] github dot com /dolthub/dolt
[2] dolthub dot com

meve · 24 January 2024 13:27

Just to add my thanks to the current commenters for these suggestions - I can’t promise anything, but I will investigate these formats and see what we can do. It may be that we will release code that will allow for the dump to be converted into these formats, rather than releasing the formats ourselves.

abeechin · 24 January 2024 13:44

For Avro, what would be useful is to release an official Avro schema for the dataset. That way anyone can use it in conjunction with the JSONL format to generate their own Avro files.

https://avro.apache.org/docs/1.11.1/getting-started-python/#defining-a-schema

Given that the works collection has millions of records, getting the schema correct is important, as records are validated against it when written (so one malformed record can jeopardise the Avro file generation).
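To make the validation point concrete, here is a minimal Python sketch of pre-checking JSONL records before handing them to an Avro writer, so that malformed records are set aside instead of aborting the whole run. The schema below is hypothetical and heavily simplified (it is not an official Crossref schema, and its field names are illustrative), and the checker covers only a tiny subset of Avro types, so it is no substitute for a real Avro library.

```python
import json

# Hypothetical, heavily simplified schema; NOT an official Crossref schema.
WORK_SCHEMA = {
    "type": "record",
    "name": "Work",
    "fields": [
        {"name": "DOI", "type": "string"},
        {"name": "type", "type": ["null", "string"], "default": None},
        {"name": "reference_count", "type": ["null", "long"], "default": None},
    ],
}

# Only the primitive types used in the sketch schema above.
_PY_TYPES = {"string": str, "long": int, "null": type(None)}


def matches(value, avro_type):
    """Check a value against a primitive type name or a union (list) of them."""
    if isinstance(avro_type, list):
        return any(matches(value, t) for t in avro_type)
    if avro_type == "long" and isinstance(value, bool):
        return False  # bool is a subclass of int in Python
    return isinstance(value, _PY_TYPES[avro_type])


def find_problems(record, schema=WORK_SCHEMA):
    """Return a list of human-readable problems; empty means the record passes."""
    if not isinstance(record, dict):
        return ["record is not a JSON object"]
    problems = []
    for field in schema["fields"]:
        name = field["name"]
        if name not in record:
            if "default" not in field:
                problems.append(f"missing required field {name!r}")
            continue
        if not matches(record[name], field["type"]):
            problems.append(f"field {name!r} has unexpected value {record[name]!r}")
    return problems


def split_valid(jsonl_lines):
    """Separate parseable, schema-conforming records from rejected lines."""
    good, bad = [], []
    for line_no, line in enumerate(jsonl_lines, start=1):
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            bad.append((line_no, [f"invalid JSON: {exc.msg}"]))
            continue
        problems = find_problems(record)
        if problems:
            bad.append((line_no, problems))
        else:
            good.append(record)
    return good, bad
```

Only the records in `good` would then be passed to the Avro writer, while `bad` keeps the line numbers and reasons for later inspection.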

A guide à la OpenAlex for mounting the data in BigQuery or similar would also be useful (whether with Avro or not).

https://docs.openalex.org/download-all-data/upload-to-your-database/load-to-a-data-warehouse
