Query.affiliation

psmku · 24 August 2021 09:45

When I run a query like

https://api.crossref.org/works?query.affiliation=Kalyani+University&filter=from-pub-date:2020-01-01,until-pub-date:2020-12-31,type:journal-article&select=DOI,title,container-title&rows=300&offset=0

… it returns an abnormally large number of documents, possibly because the query is treated as Kalyani OR University.

The result set for

https://api.crossref.org/works?query.affiliation=Kalyani&filter=from-pub-date:2020-01-01,until-pub-date:2020-12-31,type:journal-article&select=DOI,title,container-title&rows=300&offset=0

… is more reasonable in terms of document count, but it includes every institution with the place name Kalyani, not only Kalyani University.

Is there a way to retrieve only documents whose affiliation is Kalyani University?

Regards

ifarley · 24 August 2021 19:03

Hi @psmku. Thanks for your question and welcome to the community forum.

Our REST API does not support Boolean operators (e.g., OR, AND). Instead, we score the results and sort them by relevance. So, the highest-ranked results in the API for your query https://api.crossref.org/works?query.affiliation=Kalyani+University&filter=from-pub-date:2020-01-01,until-pub-date:2020-12-31,type:journal-article&select=DOI,title,container-title&rows=300&offset=0 are matches whose affiliations include ALL of the query words (Kalyani and University). By comparison, if you were to page further through the results, you would eventually find matches whose affiliation element includes ONLY one of the words, Kalyani or University.

It might be helpful to include the score in your results as well: https://api.crossref.org/works?query.affiliation=Kalyani+University&filter=from-pub-date:2020-01-01,until-pub-date:2020-12-31,type:journal-article&select=DOI,title,container-title,score&rows=300&offset=0 and eliminate results with a score below a certain threshold.

For instance, this DOI 10.1386/dtr_00025_1 has a relevance score of 0.81349975 and the affiliation metadata registered for all of the contributors of that DOI is: 0000000092155771Lesley University. Thus, I would suggest eliminating this as a viable result (for this specific query), and I would ignore anything with a score below that as well. Note: this is simply an arbitrary example; I do not mean to suggest that this score is the threshold for all queries (or, even this one - I assume a higher relevance score might be a better fit for this query, but I’ll defer to you).
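
The suggestion above (request the score, then drop low-scoring results) could be sketched in Python roughly as follows. This is only an illustration: the function names are made up for this example, the code assumes the works are returned under `message` → `items` in the JSON response, and the threshold is the same arbitrary value as in the example above, not a recommendation.

```python
from urllib.parse import urlencode

API_URL = "https://api.crossref.org/works"


def build_affiliation_query(affiliation, rows=300, offset=0):
    """Build the works URL from this thread, with `score` added to `select`."""
    params = {
        "query.affiliation": affiliation,
        "filter": "from-pub-date:2020-01-01,until-pub-date:2020-12-31,"
                  "type:journal-article",
        "select": "DOI,title,container-title,score",
        "rows": rows,
        "offset": offset,
    }
    # safe=":," keeps the filter and select values readable in the URL;
    # spaces in the affiliation are encoded as "+".
    return API_URL + "?" + urlencode(params, safe=":,")


def keep_above_threshold(response, threshold):
    """Return the items whose relevance score is strictly above `threshold`.

    `response` is the parsed JSON body; the items are assumed to live
    under response["message"]["items"].
    """
    items = response.get("message", {}).get("items", [])
    return [item for item in items if item.get("score", 0.0) > threshold]


# Stand-in for a parsed response. The first item is invented for
# illustration; the second uses the DOI and score quoted above.
sample_response = {
    "message": {
        "items": [
            {"DOI": "10.0000/hypothetical-example", "score": 12.5},
            {"DOI": "10.1386/dtr_00025_1", "score": 0.81349975},
        ]
    }
}

kept = keep_above_threshold(sample_response, threshold=0.81349975)
print(build_affiliation_query("Kalyani University"))
print([item["DOI"] for item in kept])
```

With a live request, the parsed response would come from something like `json.load(urllib.request.urlopen(url))` instead of the hard-coded sample.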

Please let me know if you have any additional questions.

Kind regards,
Isaac

psmku · 11 December 2021 12:06

Yes, I do understand this now. And a long overdue thanks to Isaac @ifarley.
But sometimes selecting an appropriate threshold value (for example, running the same query for a given country, more so when the country name contains a space) is essentially a research task in itself.
Regards

ifarley · 14 December 2021 21:28

Hi @psmku,

Thanks for following up. I have some additional information about these relevance scores that might be helpful, so I am including it here:

We use the default scoring in Elasticsearch, which is BM25; see "Practical BM25 - Part 2: The BM25 Algorithm and its Variables" on the Elastic Blog. The query is scored against a single field that is a concatenation of several metadata fields.

Those fields are:

For “query”, the following fields are searched:

publication and print publication year
issue
volume
first and last page
ISSN
ISBN
title
container (journal, conference, etc.) title
description
supplementary ids
contributors’ first and last names, or name of a contributing organization
grant numbers
funder names

For “query.bibliographic”, the following fields are searched:

publication and print publication year
issue
volume
first and last page
ISSN
ISBN
title
container (journal, conference, etc.) title
contributors’ last names and initials, or name of a contributing organization

Unfortunately, we doubt this will help with choosing a threshold. The most important reason is that in search engines such as Elasticsearch, scoring is not designed to be meaningful across different queries; the score is not an objective, global measure of similarity. The number is not scaled to any known range, and it depends heavily on the query itself. Scores are only meant to let us compare how similar different indexed documents are to the same query, so they only allow us to sort the results for a given query. Our best advice for finding such a threshold is: 1) try normalizing the score by the query length (i.e., divide the score by the number of words in the query, possibly excluding stopwords) to get a score that is a bit more comparable between queries, and 2) find the best threshold by experimenting on a real, representative dataset.
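
A minimal sketch of the first piece of advice (dividing the score by the number of query words) might look like the Python below. The stopword set is a tiny illustrative placeholder, not a list used by Crossref, and the function names are hypothetical.

```python
# Illustrative stopword set only; choose one that suits your own queries.
STOPWORDS = frozenset({"of", "the", "and", "for", "in"})


def count_query_words(query, stopwords=STOPWORDS):
    """Count the words in a query, ignoring stopwords.

    Accepts either spaces or "+" between words, as in the URLs above.
    Falls back to the raw word count if every word is a stopword.
    """
    words = query.replace("+", " ").split()
    kept = [word for word in words if word.lower() not in stopwords]
    return len(kept) or len(words)


def normalize_score(score, query, stopwords=STOPWORDS):
    """Divide a relevance score by the number of (non-stopword) query words."""
    n_words = count_query_words(query, stopwords)
    if n_words == 0:
        raise ValueError("query contains no words")
    return score / n_words


# A two-word query and its four-word repetition end up on the same scale
# when the raw score of the second is twice that of the first.
print(normalize_score(10.0, "dominika tkaczyk"))
print(normalize_score(20.0, "dominika+tkaczyk+dominika+tkaczyk"))
```

The normalized value is still not a global similarity measure; it only makes scores from queries of different lengths somewhat easier to compare before you pick a threshold experimentally.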

To see how query length affects the scoring, compare the scores returned by:
http://api.crossref.org/works?query=dominika+tkaczyk&select=score,DOI

and

http://api.crossref.org/works?query=dominika+tkaczyk+dominika+tkaczyk&select=score,DOI

The scores for the second query are double those for the first. This is what I meant: the scores make sense when you compare them within the context of a given query, and much less sense when you compare them between queries.
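
To illustrate why repeating the query words doubles the score, here is a toy Python version of the textbook BM25 formula. It is a simplified model, not Crossref's or Elasticsearch's actual implementation: it assumes each occurrence of a query term contributes its own summand (which is consistent with the doubling described above), and the document, document frequencies, and corpus size are invented for the example.

```python
import math


def bm25_score(query_terms, doc_terms, doc_freq, n_docs, avg_len,
               k1=1.2, b=0.75):
    """Textbook BM25: sum one contribution per query term occurrence."""
    # Length normalization for this document.
    length_norm = k1 * (1 - b + b * len(doc_terms) / avg_len)
    score = 0.0
    for term in query_terms:  # a repeated query term is summed again
        tf = doc_terms.count(term)
        if tf == 0:
            continue
        n = doc_freq.get(term, 0)
        idf = math.log(1 + (n_docs - n + 0.5) / (n + 0.5))
        score += idf * tf * (k1 + 1) / (tf + length_norm)
    return score


# Invented toy data: one indexed "document" and made-up corpus statistics.
doc = ["dominika", "tkaczyk", "crossref"]
doc_freq = {"dominika": 5, "tkaczyk": 2}
query = ["dominika", "tkaczyk"]

single = bm25_score(query, doc, doc_freq, n_docs=1000, avg_len=4.0)
repeated = bm25_score(query * 2, doc, doc_freq, n_docs=1000, avg_len=4.0)
print(single, repeated)  # the second value is twice the first
```

Because the score is a sum over query terms, its magnitude grows with the query, which is why a threshold tuned on one query does not transfer directly to another.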

My best,
Isaac

Donald85 · 16 December 2021 10:34

Thanks for sharing this info. This is useful; keep it up.
