Thanks again, @ifarley, for your help. I’m still not clear.
According to the REST API docs, I should use a cursor when fetching a large number of rows, and offset can’t be combined with cursors. So I do an initial query with the parameter cursor=* to get the first cursor, take next-cursor from that result set, use it for the next query, and so on. Since the cursor changes with every response, I can’t parallelize those queries; I have to wait for each response to get the cursor for the next request. So, to fetch 112,000 updates at a maximum of 1,000 rows per query, I’ll need 112 queries, and I don’t see how I can do anything but wait for one query to complete before starting the next.
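In Python terms, the sequential constraint looks something like this, a minimal sketch: the `message`, `items`, and `next-cursor` keys are what the API returns, while `fetch_all`, `pages_needed`, and the `mailto` placeholder are my own names, not anything from the docs.

```python
import json
import math
from urllib.parse import urlencode
from urllib.request import urlopen

BASE = "https://api.crossref.org/works"

def pages_needed(total, rows=1000):
    """Number of sequential queries needed to page through `total` rows."""
    return math.ceil(total / rows)

def fetch_all(filter_expr, rows=1000, mailto="you@example.org"):
    """Walk the cursor chain one page at a time.

    Each iteration needs the next-cursor from the previous response,
    so the loop cannot be parallelized.
    """
    cursor = "*"  # the documented starting value for deep paging
    while True:
        qs = urlencode({"filter": filter_expr, "rows": rows,
                        "cursor": cursor, "mailto": mailto})
        with urlopen(f"{BASE}?{qs}", timeout=60) as resp:
            msg = json.load(resp)["message"]
        if not msg["items"]:  # an empty page means we've reached the end
            break
        yield from msg["items"]
        cursor = msg["next-cursor"]  # only known after this page returns
```

That’s where the 112 comes from: `pages_needed(112000)` is 112, one query after another.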
Back to the original question of how to efficiently retrieve all updates since April 1.
https://api.crossref.org/works?filter=from-update-date:2020-04-01,until-update-date:2020-11-16 shows that there are nearly 24M records to fetch to bring my COVID-19 set current. At 1,000 rows per query, that’s 24,000 queries which, unless I’m missing something, I can’t parallelize.
Let’s say I did want to parallelize them by fetching records for April through July in one set of queries and from August on in another set of queries.
https://api.crossref.org/works?filter=from-update-date:2020-04-01,until-update-date:2020-07-31 shows 13,286,127 records.
https://api.crossref.org/works?filter=from-update-date:2020-08-01,until-update-date:2020-11-16 shows 10,660,131 records.
So, I can run those two date-range query sets in parallel and roughly halve the time to fetch the records since April. I can also do more granular date-range searches and parallelize them, but then I’ll hit duplicate records (i.e., records that were updated in more than one date range).
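For what it’s worth, here’s roughly how I’d sketch that split-and-merge approach. The filter syntax is from the API; `range_filter`, `dedupe_by_doi`, and the hypothetical `fetch` helper (which would walk one cursor chain to completion) are my own names.

```python
from concurrent.futures import ThreadPoolExecutor  # for running ranges in parallel

def range_filter(start, end):
    """Build one date-range filter expression for the works endpoint."""
    return f"from-update-date:{start},until-update-date:{end}"

def dedupe_by_doi(result_sets):
    """Merge per-range result lists, keeping the first copy of each DOI."""
    seen, merged = set(), []
    for items in result_sets:
        for item in items:
            if item["DOI"] not in seen:
                seen.add(item["DOI"])
                merged.append(item)
    return merged

# Sketch of the parallel run, assuming a fetch(filter_expr) helper that
# pages through one cursor chain (not shown here):
#
# ranges = [("2020-04-01", "2020-07-31"), ("2020-08-01", "2020-11-16")]
# with ThreadPoolExecutor(max_workers=len(ranges)) as pool:
#     results = list(pool.map(lambda r: fetch(range_filter(*r)), ranges))
# records = dedupe_by_doi(results)
```

Each range still has to be walked sequentially; the parallelism is only across ranges, and the dedup step absorbs records updated in more than one range.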
But, I think I’m still missing something because you say that I can do up to 50 queries per second to fetch those 112,000 updates for one day.