Date range search of index changes seems to retrieve too many records

Some months ago I retrieved the covid-19 dataset. Now I want to retrieve any records that have been added or changed since then.

I run this command to retrieve the first page of the set of records starting April 1:

https://api.crossref.org/works?filter=from-index-date:2020-04-01,until-index-date:2020-11-11

I get total-results = 88403122.

88 million records seems like a lot of records for an incremental update.

So, out of curiosity, I run this command to see how many records had an index update yesterday:

https://api.crossref.org/works?filter=from-index-date:2020-11-11,until-index-date:2020-11-11

I get total-results = 385706.
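For anyone who wants to script these count checks, here is a minimal sketch. It assumes the count is read from the `total-results` field of the response message and that `rows=0` is used to ask for the summary only; the helper names `count_url` and `total_results` are made up for illustration.

```python
from urllib.parse import urlencode

API = "https://api.crossref.org/works"


def count_url(filters, mailto=None):
    """Build a /works URL that asks only for the match count (rows=0)."""
    filter_expr = ",".join(f"{name}:{value}" for name, value in filters.items())
    params = {"filter": filter_expr, "rows": 0}
    if mailto:
        # Optional contact address (assumption: passed as a query parameter).
        params["mailto"] = mailto
    # Keep ':' ',' and '@' readable instead of percent-encoding them.
    return f"{API}?{urlencode(params, safe=':,@')}"


def total_results(payload):
    """Read the match count from an already-parsed JSON response."""
    return payload["message"]["total-results"]
```

Fetching the URL with any HTTP client and passing the parsed JSON to `total_results` gives the number quoted above, without downloading a page of records.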

That’s a lot of records to be updated in one day!

What am I missing?

Hello @slnm. Thanks for your message. Welcome to the community forum!

One alternative is to use the from-update-date filter instead of the from-index-date filter. The major difference between the two is that from-index-date covers everything from-update-date covers plus changes we make to the record ourselves: mostly updated citation counts and, very occasionally, work we’ve done on bugs. The from-update-date filter includes all metadata changes made by our members to their records. If citation counts aren’t a concern for you, from-update-date will return far fewer results.

You’re right, 385,706 records is a lot to update in one day, but we’re always updating those citation counts by matching references with existing, cited DOIs, so counts from the from-index-date filter are going to seem high.

My best,
Isaac

Thanks, Isaac, for your help and engagement.

There are over 112,000 updates for Nov 11, which is better than the nearly 386,000 records to fetch, but still a large number.

https://api.crossref.org/works?filter=from-update-date:2020-11-11,until-update-date:2020-11-11

And there are nearly 20,000,000 records to fetch to bring the covid-19 dataset up to date.

https://api.crossref.org/works?filter=from-update-date:2020-05-01,until-update-date:2020-11-11

My aim is to maintain a relatively current dataset. Should I create a daily job to fetch new records, using your deep cursor? My concern with that approach is that a query takes roughly 30 seconds to return. At the rate of 2 queries per minute, 100,000 updates per day, and 1,000 results per page, it will take 50 minutes per day to fetch the incremental changes. I have tried using the mailto parameter (and https) to get into the preferred query pool but that doesn’t seem to speed up queries.
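The estimate above can be written out as a small calculation. This is only a sketch of the arithmetic in this post (1,000 results per page, roughly 30 seconds per query, one query at a time); the function name `fetch_minutes` is made up for illustration.

```python
import math


def fetch_minutes(total_records, rows_per_page=1000, seconds_per_query=30):
    """Minutes needed to page through total_records one query at a time."""
    pages = math.ceil(total_records / rows_per_page)
    return pages * seconds_per_query / 60


daily = fetch_minutes(100_000)        # 100 pages at 30 s each -> 50.0 minutes
backlog = fetch_minutes(20_000_000)   # 20,000 pages -> 10000.0 minutes (about a week)
```

The daily job is manageable at that speed; it is the backlog that makes sequential paging painful.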

Thanks.

Hello again @slnm,

I think you’ll find that the 112,000 updates per day number is a little higher than the average, which should help with the overall time estimate for fetching these incremental changes. And, there’s no reason you can’t send us more than two queries per minute. You should be able to perform up to 50 per second and still be below our rate limits, as discussed here: https://github.com/CrossRef/rest-api-doc#rate-limits

I’d suggest using the Polite pool rather than the Public pool, as the Polite pool is the more performant of the two over the longer term.
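As a rough illustration of identifying yourself for the Polite pool, the sketch below puts a contact address in the User-Agent header, which is an alternative to the mailto query parameter mentioned earlier in the thread. The tool name `my-harvester/0.1` and the helper `polite_request` are placeholders, not anything Crossref prescribes.

```python
from urllib.request import Request


def polite_request(url, email, tool="my-harvester/0.1"):
    """Wrap a URL in a request whose User-Agent carries a contact address."""
    # Assumption: a "mailto:" address in the User-Agent identifies the caller.
    return Request(url, headers={"User-Agent": f"{tool} (mailto:{email})"})
```

The returned object can be passed straight to `urllib.request.urlopen`.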

If you need a higher rate limit or a more performant pool, our Plus pool, with its SLAs, is an option as well. You can learn more here: https://www.crossref.org/services/metadata-retrieval/metadata-plus/. If you’re interested in learning more about the Plus service, I’d be happy to answer your questions or connect you with Jennifer Kemp, our Head of Partnerships.

Kind regards,
Isaac

Thanks again, @ifarley, for your help. I’m still not clear.

According to the REST API doc, I should use a cursor if I’m fetching a large number of rows, and offset can’t be used with cursors. So I do an initial query with the parameter cursor=* to get the first cursor, then I take next-cursor from the first set of results and use it for the next query, and so on. Because the cursor changes with every query, I can’t parallelize those queries: I need the next cursor before I can send the next query. So, to get 112,000 updates with a max of 1,000 rows per query, I’ll need 112 queries, and I don’t see how I can do anything but wait for one query to complete before doing the next one.
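The loop described above might look roughly like this. It is a sketch, not a reference client: it assumes the response message carries `items` and `next-cursor`, and that an empty page means the set is exhausted. The names `fetch_page` and `iter_works` are made up.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API = "https://api.crossref.org/works"


def fetch_page(filter_expr, cursor, rows=1000, mailto=None):
    """Fetch one page of results and return the parsed 'message' object."""
    params = {"filter": filter_expr, "rows": rows, "cursor": cursor}
    if mailto:
        params["mailto"] = mailto
    with urlopen(f"{API}?{urlencode(params)}", timeout=120) as resp:
        return json.load(resp)["message"]


def iter_works(filter_expr, page_fetcher=fetch_page, **kwargs):
    """Yield every record for a filter, following next-cursor page by page."""
    cursor = "*"  # the first request always starts with cursor=*
    while True:
        message = page_fetcher(filter_expr, cursor, **kwargs)
        items = message.get("items", [])
        if not items:
            break  # assumption: an empty page marks the end of the set
        yield from items
        # The next request cannot be built until this response has arrived.
        cursor = message["next-cursor"]
```

The last line of the loop is the whole problem: each request depends on the previous response, so a single filter can only be walked sequentially.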

Back to the original question of how to efficiently retrieve all updates since April 1.

https://api.crossref.org/works?filter=from-update-date:2020-04-01,until-update-date:2020-11-16 shows that there are nearly 24M records to fetch to get my covid-19 set current. That’s 24,000 queries which, unless I’m missing something, I can’t parallelize.

Let’s say I did want to parallelize them by fetching records for April through July in one set of queries and from August on in another set of queries.

https://api.crossref.org/works?filter=from-update-date:2020-04-01,until-update-date:2020-07-31 shows 13,286,127 records.

https://api.crossref.org/works?filter=from-update-date:2020-08-01,until-update-date:2020-11-16 shows 10,660,131 records.

So, I can run those two date range query sets in parallel and cut the time for fetching the records since April roughly in half. And I can do more granular date range searches and parallelize them, but I’ll hit duplicate records (i.e. records that were updated in more than one date range).
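If duplicates do show up across overlapping ranges, one way to reconcile them is to key everything by DOI and keep the most recently indexed copy. This is a sketch under assumptions: it treats DOIs as case-insensitive and compares the `indexed.timestamp` value on each record; `merge_by_doi` is a made-up helper.

```python
def merge_by_doi(*batches):
    """Combine batches of records, keeping the newest copy of each DOI."""
    latest = {}
    for batch in batches:
        for work in batch:
            key = work["DOI"].lower()  # DOIs are matched case-insensitively
            stamp = work.get("indexed", {}).get("timestamp", 0)
            kept = latest.get(key)
            # Replace the stored copy only if this one is at least as recent.
            if kept is None or stamp >= kept.get("indexed", {}).get("timestamp", 0):
                latest[key] = work
    return latest
```

The result is a dictionary from lower-cased DOI to record, so the same record fetched from two ranges is stored once.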

But, I think I’m still missing something because you say that I can do up to 50 queries per second to fetch those 112,000 updates for one day.

Thanks, again.

Hi @slnm,

You’re right, my suggestion wasn’t well thought out. Sorry about that. You do need to wait for the cursor for each of your queries.

I’m not sure of a way around your dilemma, outside of becoming a Plus subscriber and being able to regularly pull the monthly Snapshots. That said, I’ve asked our technical team for any suggestions they may have. I’ll follow up as soon as I know more.

My best,
Isaac

My colleagues on the technical team have some suggestions:

You could divide the set you need to download by the date of creation, and download various creation date ranges in parallel. Creation date should be safe because it does not change, and every DOI has only one creation date. So a DOI should belong to exactly one creation date range, assuming all possible ranges are downloaded. The full range to cover is from 2002-07-25 (inclusive; this is the oldest creation date in our data) to the current date.

For example, I can download DOIs updated since April and created in 2020, in parallel download DOIs updated since April and created in 2019, … , and in parallel download DOIs updated since April and created in 2002, using parallel requests:

https://api.crossref.org/works?filter=from-update-date:2020-04-01,until-update-date:2020-11-16,from-created-date:2020,until-created-date:2020&cursor=…

https://api.crossref.org/works?filter=from-update-date:2020-04-01,until-update-date:2020-11-16,from-created-date:2019,until-created-date:2019&cursor=…

and so on.

Or, I could use smaller ranges and separately download DOIs updated since April and created in 2020-11, DOIs updated since April and created in 2020-10, and so on down to 2002-07. Or use just a few days as the range. The smallest range is 1 day long, as this is the creation date filter “resolution”.

Those subsets may not be well balanced in terms of the numbers of DOIs, but it should allow you to speed the whole thing up a bit.
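A rough sketch of the year-by-year split follows. It only builds the filter strings and fans them out to worker threads; `download_one` stands for whatever function walks the cursor for a single filter and is an assumption, as are the names `yearly_filters` and `download_all` and the default of 4 workers.

```python
from concurrent.futures import ThreadPoolExecutor

FIRST_YEAR = 2002  # oldest creation date in the data is 2002-07-25


def yearly_filters(update_from, update_until, last_year):
    """One filter string per creation year, newest year first."""
    base = f"from-update-date:{update_from},until-update-date:{update_until}"
    return [
        f"{base},from-created-date:{year},until-created-date:{year}"
        for year in range(last_year, FIRST_YEAR - 1, -1)
    ]


def download_all(filters, download_one, max_workers=4):
    """Run download_one(filter) for every filter, several at a time."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Each filter has its own cursor chain, so the chains can run side by side.
        results = pool.map(download_one, filters)
        return dict(zip(filters, results))
```

Calling `yearly_filters("2020-04-01", "2020-11-16", 2020)` gives 19 filters, one per year from 2020 down to 2002. Because the subsets are unbalanced, busy years could be split further by month or day in the same way.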

Does that make sense?

@ifarley Yes, this all makes sense. Thank you! I’ll do some queries to get some counts to estimate the volume of searches needed and the time needed then parallelize the whole process. Again, I appreciate your willingness to dig into this issue.

I’m always happy to help, @slnm. Thanks for posting this message here for all to benefit from the exchange.

A perfect and very useful question for many of us. Thanks for the answers and suggestions!
