Query.affiliation

When I retrieve through a syntax like

https://api.crossref.org/works?query.affiliation=Kalyani+University&filter=from-pub-date:2020-01-01,until-pub-date:2020-12-31,type:journal-article&select=DOI,title,container-title&rows=300&offset=0

… it returns an abnormally large number of documents, possibly because the query is treated like Kalyani OR University.

The result set for

https://api.crossref.org/works?query.affiliation=Kalyani&filter=from-pub-date:2020-01-01,until-pub-date:2020-12-31,type:journal-article&select=DOI,title,container-title&rows=300&offset=0

… is more reasonable in size, but it includes every institute with the place name Kalyani, not only Kalyani University.

How can I retrieve only documents whose affiliation is Kalyani University?

Regards

Hi @psmku. Thanks for your question and welcome to the community forum.

Our REST API does not support Boolean operators (e.g., OR, AND). Instead, we score the results and sort them by relevance. So the top results in the API for your query https://api.crossref.org/works?query.affiliation=Kalyani+University&filter=from-pub-date:2020-01-01,until-pub-date:2020-12-31,type:journal-article&select=DOI,title,container-title&rows=300&offset=0 are affiliations that include both words, Kalyani and University. If you page further through the results, you will eventually reach matches whose affiliation includes only one of the words, Kalyani or University.

It might be helpful to include the score in your results as well, and then eliminate results with a score below a certain threshold: https://api.crossref.org/works?query.affiliation=Kalyani+University&filter=from-pub-date:2020-01-01,until-pub-date:2020-12-31,type:journal-article&select=DOI,title,container-title,score&rows=300&offset=0

For instance, the DOI 10.1386/dtr_00025_1 has a relevance score of 0.81349975, and the affiliation metadata registered for all of the contributors of that DOI is: 0000000092155771Lesley University. Thus, I would suggest eliminating this as a viable result (for this specific query), and I would ignore anything with a score below that as well. Note: this is simply an arbitrary example; I do not mean to suggest that this score is the right threshold for all queries, or even for this one. I assume a higher relevance score would be a better fit for this query, but I’ll defer to you.
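
A minimal sketch of that filtering step in Python, using only the standard library. The function names (`build_url`, `fetch_items`, `above_threshold`) are hypothetical, and the cut-off is just the arbitrary example score above, not a recommended value. It assumes the usual response layout, with the works listed under `message` and then `items`.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API = "https://api.crossref.org/works"


def build_url(affiliation, rows=300, offset=0):
    """Build the works query from this thread, with the score selected."""
    params = {
        "query.affiliation": affiliation,
        "filter": "from-pub-date:2020-01-01,until-pub-date:2020-12-31,"
                  "type:journal-article",
        "select": "DOI,title,container-title,score",
        "rows": rows,
        "offset": offset,
    }
    # urlencode percent-encodes ':' and ',' and turns spaces into '+'.
    return f"{API}?{urlencode(params)}"


def fetch_items(url):
    """Download one page of results (assumes items live under message.items)."""
    with urlopen(url) as response:
        return json.load(response)["message"]["items"]


def above_threshold(items, threshold):
    """Keep only the works whose relevance score is strictly above threshold."""
    return [item for item in items if item.get("score", 0.0) > threshold]


# Example use (performs a network request, so it is left commented out):
# items = fetch_items(build_url("Kalyani University"))
# kept = above_threshold(items, 0.81349975)  # arbitrary example cut-off
```

Because the comparison is strict, a work scoring exactly at the cut-off (like the Lesley University example) is dropped along with everything below it.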

Please let me know if you have any additional questions.

Kind regards,
Isaac

Yes, I do understand this now. And a long overdue thanks to Isaac @ifarley.
But sometimes selecting an appropriate threshold value (for example, for the same query on a given country, especially a country whose name contains a space) is essentially a research project in itself.
Regards

Hi @psmku,

Thanks for following up. I have some additional information about these relevance scores that might be helpful, so I am including it here:

We use the default scoring in Elasticsearch, which is BM25 (see “Practical BM25 - Part 2: The BM25 Algorithm and its Variables” on the Elastic Blog). The query is scored against a field which is a concatenation of several metadata fields.

Those fields are:

For “query”, the following fields are searched:

  • publication and print publication year
  • issue
  • volume
  • first and last page
  • ISSN
  • ISBN
  • title
  • container (journal, conference, etc.) title
  • description
  • supplementary ids
  • contributors’ first and last names, or name of a contributing organization
  • grant numbers
  • funder names

For “query.bibliographic”, the following fields are searched:

  • publication and print publication year
  • issue
  • volume
  • first and last page
  • ISSN
  • ISBN
  • title
  • container (journal, conference, etc.) title
  • contributors’ last names and initials, or name of a contributing organization

Unfortunately, we doubt this will help with the threshold. The most important reason is that in search engines such as Elasticsearch, scoring is not designed to be meaningful across different queries; the score is not an objective, global measure of similarity. The number is not scaled to any known range, and it depends heavily on the query itself. Scores are only meant to let us compare how well different indexed documents match the same query, so they only let us sort the results for a given query. Our best advice for finding such a threshold is: 1) normalize the score by the query length, i.e., divide the score by the number of words in the query (possibly excluding stopwords), to get a score that is a bit more comparable between queries; and 2) find the best threshold by experimenting on a real, representative dataset.
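
A rough sketch of the normalization in step 1, in Python. The function name `normalized_score` and the tiny stopword set are made up for illustration; a real stopword list would need to suit your own queries.

```python
# Illustrative stopword set only; not the list used by the Crossref API.
STOPWORDS = frozenset({"of", "the", "and", "for", "in"})


def normalized_score(score, query, stopwords=STOPWORDS):
    """Divide a relevance score by the number of non-stopword query words."""
    words = [w for w in query.lower().split() if w not in stopwords]
    # Guard against an empty query or one made up entirely of stopwords.
    return score / max(len(words), 1)
```

For example, a score of 10.0 for the query "Kalyani University" becomes 5.0, and a score of 9.0 for "University of Kalyani" becomes 4.5, since "of" is not counted. The threshold itself still has to come from experimenting on representative data, as in step 2.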

To see how query length affects the scoring, compare the scores for:
http://api.crossref.org/works?query=dominika+tkaczyk&select=score,DOI

and

http://api.crossref.org/works?query=dominika+tkaczyk+dominika+tkaczyk&select=score,DOI

The second query returns scores that are double those of the first. This is what I meant: the scores make sense when you compare them within a given query, and much less sense when you compare them between queries.
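
One way to see why, sketched with the textbook BM25 formula from the blog post linked above rather than our exact implementation: the score is a sum over the query terms q_1 … q_n, where f(q_i, D) is how often term q_i occurs in the searched field of document D, |D| is the length of that field, avgdl is the average field length, and k_1 and b are tuning parameters. Assuming each repeated word is scored as a separate term, a query Q' that repeats Q twice adds every term's contribution twice:

```latex
\mathrm{score}(D, Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i)\,
  \frac{f(q_i, D)\,(k_1 + 1)}
       {f(q_i, D) + k_1 \left(1 - b + b\,\frac{|D|}{\mathrm{avgdl}}\right)}

% Q' is Q written twice, so each term of the sum appears twice:
\mathrm{score}(D, Q') = 2\,\mathrm{score}(D, Q)
```

Nothing about the documents changed between the two queries, only the length of the query, which is why the raw numbers cannot be compared across queries.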

My best,
Isaac

Thanks for sharing this info. This is useful, keep it up.
