Query Syntax
The DSL aims to capture the type of interaction with Dimensions data that users are accustomed to performing graphically via the web application, and enable web app developers, power users, and others to carry out such interactions by writing query statements in a syntax loosely inspired by SQL but particularly suited to our specific domain and data organization.
Basic query structure
DSL queries consist of two required components: a search phrase that indicates the scientific records to be searched, and one or more return phrases which specify the contents and structure of the desired results.
The simplest valid DSL query is of the form:
search grants return grants
---------------|---------------
search <source>|return <result>
A more useful query might also make use of the optional for
and where
phrases to limit the set of records returned.
search grants for "lung cancer" where active_year=2000 return grants
--------------------------------------------------------|---------------
search <source>|for <terms> |where <filters> |return <result>
A query requesting more complex results might have the form:
search grants for "laryngectomy" where start_year=2000 return grants[id + title] sort by title return funders return funder_countries return research_orgs as "universities" aggregate funding sort by funding
--------------------------------------------------------|------------------------------------------|--------------------------------------|--------------------------------------------------------------------------
| | |
search <source>|for <terms> |where <filters> |return <source>[<fields>]|sort by <field>|return <facet> return <facet> |return <facet> as <alias> |aggregate <indicator>|sort by <indicator>
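The phrase structure shown above can also be assembled programmatically. The following is a minimal Python sketch, not part of the DSL itself: the function name build_query and its parameters are hypothetical, and it simply concatenates the phrases in the order used in the examples above without escaping the search terms.

```python
from typing import Optional, Sequence


def build_query(source: str,
                terms: Optional[str] = None,
                filters: Optional[str] = None,
                returns: Sequence[str] = ()) -> str:
    """Assemble a DSL query string from its phrases (illustrative helper)."""
    parts = [f"search {source}"]
    if terms is not None:
        # The search term string is enclosed in double quotes.
        parts.append(f'for "{terms}"')
    if filters is not None:
        parts.append(f"where {filters}")
    # When no result is named, return the searched source itself.
    for result in (returns or [source]):
        parts.append(f"return {result}")
    return " ".join(parts)


print(build_query("grants", "lung cancer", "active_year=2000"))
```

Passing several entries in returns produces one return phrase per entry, mirroring the more complex query form above.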
search source
A query must begin with the word search followed by a source name, i.e. the name of a type of scientific record, such as grants or publications.
search grants
---------------
search <source>
The source name may be followed by an optional for phrase that provides search terms to rank records against,
and/or an optional where phrase that limits the set of records that will be searched. The for
and where
phrases may be in either order.
search grants for "laryngectomy" where start_year=2000
--------------------------------------------------------
search <source>|for <terms> |where <filters>
search grants where start_year=2000 for "laryngectomy"
-------------------------------------------------------
search <source>|where <filters> |for <terms>
return source
The most basic return
phrase consists of the keyword return
followed by
the name of a record or facet to be returned.
This must be the name of the source
used in the search
phrase, or the name of a facet of that source.
return grants
---------------
return <source>
return funders
--------------
return <facet>
Full-text Searching
Full-text search, or keyword search, finds all instances of a term (keyword) in a document or group of documents. Full-text search works by using search indexes, each of which targets specific sections of a document, e.g. its abstract, authors or full text.
search publications in full_data for "space travel" return publications
----------------------------------------------------------------------
search <source> | in <search index> | for <keywords> | return <source>
in [search index]
This optional phrase consists of the particle in followed by a term indicating a search index, specifying for example whether the search is limited to full text, title and abstract only, or title only. Please check the supported sources to see which sources support which search indexes.
search grants in full_data for "something" return grants
---------------------------------------------------------------------
search grants in title_abstract_only for "something" return grants
Special search indexes for person names make it possible to run full-text searches on publications authors or grants investigators. Please see the authors search section for more information on how searches work in this case.
search publications in authors for "\"Jennifer A Doudna\"" return publications
for "search term"
This optional phrase consists of the keyword for
followed by a search term
string, enclosed in double quotes ("
).
for "motor neuron disease"
--------------------------
for <terms (string)>
Strings in double quotes can contain nested quotes escaped by a backslash \
.
This will ensure that the string in nested double quotes is searched for as
if it was a single phrase, not multiple words (note: this applies both to full-text searching and field searching).
An example of a phrase: "\"Machine Learning\""
: results must contain Machine Learning
as a phrase.
Example of multiple keywords: "Machine Learning"
: this searches for keywords independently.
Note
Special characters, such as any of ^ " : ~ \ [ ] { } ( ) ! | & +
must be escaped by a backslash \
.
Examples
Searching for phrase “How is mechanobiology involved in mesenchymal stem cell differentiation toward the osteoblastic or adipogenic fate?”
Special characters, such as question marks, must be escaped by a backslash
\
search publications for "How is mechanobiology involved in mesenchymal stem cell differentiation toward the osteoblastic or adipogenic fate\?" return publications
Searching for “dose” or “concentration”
Special characters, such as parentheses, must not be escaped here, because they are used to construct a boolean query (see next section).
search publications for "(dose OR concentration)" return publications
Searching for “haskell unit ()”
Special characters, such as parentheses, must be escaped here, because we are searching for them literally.
search publications for "haskell unit \(\)" return publications
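For literal searches like the last example, the escaping can be automated. The sketch below is a hedged illustration: escape_literal is a hypothetical helper, and the character set is the list given in the note above plus the question mark from the first example; treating that set as complete is an assumption. It must not be applied to boolean queries, where parentheses act as operators.

```python
# Characters listed in this section as needing a backslash, plus the question
# mark used in the first example. Treating this set as complete is an assumption.
SPECIAL_CHARS = set('^":~\\[]{}()!|&+?')


def escape_literal(text: str) -> str:
    """Backslash-escape every special character so it is searched literally."""
    return "".join("\\" + ch if ch in SPECIAL_CHARS else ch for ch in text)


term = escape_literal("haskell unit ()")
print(f'search publications for "{term}" return publications')
```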
Attention
Please note escaping rules in Python. For example, when writing a query with escaped quotes, such as:
search publications for "\"phrase 1\" AND \"phrase 2\""
in Python, it is necessary to escape the backslashes as well, so it would look like:
resp = requests.post(
'https://<your-url.dimensions.ai>/api/dsl.json',
data='search publications for "\\"phrase 1\\" AND \\"phrase 2\\""',
headers=headers)
In some circumstances, it can be useful to employ Python raw literals so that quotes can be escaped with a backslash, but the backslash remains in the result:
query = r'search publications for "\"phrase 1\" AND \"phrase 2\"" return publications'
See also this SO answer for more context. Please note that similar escaping rules might apply to other programming languages than Python as well.
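The equivalence of the two Python spellings can be checked directly. This is a small sketch for illustration only; the helper name phrase is hypothetical and simply wraps text in backslash-escaped double quotes.

```python
# Two ways of writing the same DSL query in Python source code.
escaped = 'search publications for "\\"phrase 1\\" AND \\"phrase 2\\"" return publications'
raw = r'search publications for "\"phrase 1\" AND \"phrase 2\"" return publications'


def phrase(text: str) -> str:
    """Wrap text in backslash-escaped double quotes (hypothetical helper)."""
    return '\\"' + text + '\\"'


built = f'search publications for "{phrase("phrase 1")} AND {phrase("phrase 2")}" return publications'

# All three spellings produce the identical string, with a single backslash
# before each inner double quote.
assert escaped == raw == built
print(raw)
```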
Using triple quotes
Another possible syntax, starting with the DSL 2.0, is to use triple quotes, which can contain simple quotes inside, without any escaping. Examples:
search publications
for """
"malaria africa" AND (blood OR "blood donors")
"""
return publications
Note that the triple-quote syntax makes escaping of quotes unnecessary, but other special characters still need to be escaped.
search publications
for """
How is mechanobiology involved in mesenchymal
stem cell differentiation toward the osteoblastic
or adipogenic fate\?
"""
return publications
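When queries are generated from Python, the triple-quote form avoids backslash handling for quotes altogether. The following is a minimal sketch under stated assumptions: triple_quoted_query is a hypothetical helper, and the layout of the generated query (line breaks and indentation) merely mirrors the examples above.

```python
def triple_quoted_query(source: str, terms: str) -> str:
    """Build a query whose search terms sit inside DSL triple quotes."""
    if '"""' in terms:
        # A triple quote inside the terms would end the string early.
        raise ValueError('terms must not contain """')
    return f'search {source}\n  for """\n  {terms}\n  """\nreturn {source}'


print(triple_quoted_query("publications",
                          '"malaria africa" AND (blood OR "blood donors")'))
```

Other special characters in terms (for example a literal question mark) still need a backslash, as described above.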
Boolean Operators
A search term can consist of multiple keywords or phrases connected using boolean logic operators, e.g. AND, OR and NOT.
search publications for "(dose OR concentration)" return publications
The full specification is shown in the table below. This table specifies the “standardQuery” grammar.
Boolean Operator | Alternative Symbol | Description |
---|---|---|
AND | && | Requires both terms on either side of the Boolean operator to be present for a match. |
NOT | ! | Requires that the following term not be present. |
OR | || | Requires that either term (or both terms) be present for a match. |
 | + | Requires that the following term be present. |
 | - | Prohibits the following term (that is, matches on fields or documents that do not include that term). |
Note
When specifying Boolean operators with keywords such as AND
, OR
and NOT
, the keywords must appear in all uppercase.
Warning
Negative search may follow only a positive search. Valid example:
search publications in title_only for "miRISC NOT RNA"
return publications
Following example is not allowed:
search publications in title_only for "NOT miRISC AND RNA"
return publications
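The two rules above (uppercase operator keywords, and negative terms only after a positive one) can be enforced when search terms are generated. The sketch below is illustrative only: boolean_terms is a hypothetical helper, and chaining several NOT terms in a row is an assumption that is not confirmed by this section.

```python
from typing import Sequence


def boolean_terms(required: Sequence[str], excluded: Sequence[str] = ()) -> str:
    """Join keywords with uppercase AND / NOT, positive terms first."""
    if not required:
        # A negative search may only follow a positive search.
        raise ValueError("at least one positive term is required")
    expression = " AND ".join(required)
    for term in excluded:
        expression += f" NOT {term}"
    return expression


terms = boolean_terms(["miRISC"], ["RNA"])
print(f'search publications in title_only for "{terms}" return publications')
```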
Searching in multiple search indexes in a single query
As of DSL 1.27, it is possible to search in multiple indexes, provided they support this option (check the data sources documentation, in the sections specific to search indexes). Operators such as and/or and parentheses may be used, the same as with where.
As an example, the following query is valid:
search publications in title_only for "graphene AND catalyst" and in concepts for "\"semiconductor materials\"" return publications
This query finds publications that contain both “graphene AND catalyst” in the title, and at the same time “"semiconductor materials"” as a concept.
Wildcard Searches
The DSL supports single and multiple character wildcard searches within single terms. Wildcard characters can be applied to single terms, but not to search phrases.
search publications in title_only for "ital*" return publications
Wildcard Search Type | Special Character | Example |
---|---|---|
Single character - matches a single character | ? | The search string |
Multiple characters - matches zero or more sequential characters | * | The wildcard search: |
Warning
Wildcard matches may only be used for word suffixes, i.e. ital* is a correct wildcard search, whereas *ital is not accepted. The wildcard match in this case is stripped out and a warning is issued.
Proximity Searches
A proximity search looks for terms that are within a specific distance from one another.
To perform a proximity search, add the tilde character ~
and a numeric value to the end of a search phrase.
For example, to search for formal and model within 10 words of each other in a document, use the search:
search publications for "\"formal model\"~10" return publications
Note
The distance here is defined by the Apache Lucene slop feature.
Quote from the documentation:
Sets the number of other words permitted between words in query phrase. If zero, then this is an exact phrase search. For larger values this works like a WITHIN or NEAR operator. The slop is in fact an edit-distance, where the units correspond to moves of terms in the query phrase out of position. For example, to switch the order of two words requires two moves (the first move places the words atop one another), so to permit re-orderings of phrases, the slop must be at least two.
More exact matches are scored higher than sloppier matches, thus search results are sorted by exactness.
The slop is zero by default, requiring exact matches.
- For example, a query like "jakarta apache lucene"~3 would match:
  jakarta lucene apache (distance is 2)
  jakarta another two words apache lucene (distance is 3)
  jakarta another word apache then lucene (distance is 3)
- It will not match:
  lucene jakarta apache (distance is 4)
  jakarta way too many extra words apache lucene (distance is 5)
  jakarta more words apache more separated lucene (distance is 4)
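A proximity term is easy to get wrong because the phrase quotes must be escaped while the tilde and the number stay outside them. The following is a minimal sketch; proximity_term is a hypothetical helper that only formats the string and does not compute slop distances.

```python
def proximity_term(phrase: str, slop: int) -> str:
    """Return a search term matching the words of `phrase` within `slop` moves."""
    if slop < 0:
        raise ValueError("slop must be non-negative")
    # Inner quotes are escaped because the term sits inside a quoted DSL string.
    return f'\\"{phrase}\\"~{slop}'


print(f'search publications for "{proximity_term("formal model", 10)}" return publications')
```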
Field Searching
Field searching allows a specific field of a source to be used as a query filter. For example, this can be a literal field such as the type of a publication, its date, mesh terms, etc. Or it can be an entity field, such as the journal title for a publication, the country name of its author affiliations, etc.
A complete list of fields available as filters for each source can be found in the supported data section. See also the entity fields section for more information on how entity fields differ from simple literal fields.
search publications where type = "book" return publications
-----------------------------------------------------------
search <source> | where <filter> | return <source>
where
This optional phrase consists of the keyword where
followed by a
filters
phrase consisting of DSL
filter expressions, as described below.
where research_org_name="Saarland University"
---------------------------------------------
where <filters>
Note
If a for phrase is also used in a filtered query, the system will first apply the filters, and then search the resulting restricted set of documents for the search term.
in
For convenience, the DSL also supports shorthand notation for filters where a particular field should be restricted to a specified range or list of values (although the same logic may be expressed using complex filters as shown below).
A range filter consists of the field
name, the keyword in
, and a range
of values enclosed in square brackets ([]
), where the range consists of
a low
value, colon :
, and a high
value.
start_year in [ 2010 : 2015 ]
----------|--|---------------
<field> in <range>
-|----|-|----|-
[ <low>:<high>]
The results are inclusive of both the low and high values, such that the following two restriction phrases give the same results:
where start_year in [2010:2015]
where (start_year>=2010 and start_year<=2015)
A list filter consists of the field
name, the keyword in
, and a list
of one or more value
s enclosed in square brackets ([]
),
where values are separated by commas (,
):
research_org_name in [ "UC Berkeley", "UC Davis", "UCLA" ]
-----------------|--|--------------------------------------
<field> in <list>
-|-------------|-------|------------|-
[ <value> , <value> , <value> ]
The following two restriction phrases give the same results:
where start_year in [2000, 2005, 2010]
where (start_year=2000 or start_year=2005 or start_year=2010)
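The shorthand and expanded forms can be generated side by side. This is a hedged Python sketch: all function names are hypothetical, and the way string values are quoted (double quotes, inner quotes escaped with a backslash) follows the examples in this document rather than a formal grammar.

```python
from typing import Sequence, Union

Value = Union[int, str]


def _literal(value: Value) -> str:
    # Strings are double-quoted (inner quotes escaped); integers are written bare.
    if isinstance(value, str):
        return '"' + value.replace('"', '\\"') + '"'
    return str(value)


def range_filter(field: str, low: int, high: int) -> str:
    """Shorthand range filter, inclusive of both ends."""
    return f"{field} in [{low}:{high}]"


def expand_range(field: str, low: int, high: int) -> str:
    """Equivalent complex filter for a range."""
    return f"({field}>={low} and {field}<={high})"


def list_filter(field: str, values: Sequence[Value]) -> str:
    """Shorthand list filter."""
    return f"{field} in [{', '.join(_literal(v) for v in values)}]"


def expand_list(field: str, values: Sequence[Value]) -> str:
    """Equivalent complex filter for a list."""
    return "(" + " or ".join(f"{field}={_literal(v)}" for v in values) + ")"


print("where " + range_filter("start_year", 2010, 2015))
print("where " + expand_list("start_year", [2000, 2005, 2010]))
```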
Note
The in
condition may be negated using the not
operator (Chaining Filters). In that case, it must follow at least one positive filter,
for example:
search publications where type = "article" not id in ["pub.1124196727", "pub.1124099280"] return publications
This query will identify publications of type “article”, except those with the IDs specified in the filter.
count
Functions can be applied on fields in filters. Currently, only filter function count
is supported on some fields in publications (e.g. researchers
and research_orgs
).
The use of this filter is shown in the example below:
count(research_orgs) = 1
-------------------------------------------
<filter function> <simple filter>
Literal Fields Vs Entity Fields
In addition to restricting the values of a particular literal field of the source being searched, filter expressions may also restrict results using an entity field, i.e. restricting the values of a particular field of an entity related to this record.
Using entity fields may lead to warnings and incomplete results (see the box below). So, in general, it is better to use them only with fields that refer to unique identifiers/attributes of the entity object (e.g., researchers.orcid_id
, organizations.ror_ids
, category_for.name
, etc..).
For example, when searching for grants, we may wish to restrict results to those whose funder has a certain acronym; or when searching for publications, we may wish to restrict results to those whose author has a certain ORCID identifier. For example:
search grants where funders.acronym="DFG" return grants
search publications where researchers.orcid_id = "0000-0002-1838-9363" return publications
An entity metadata filter consists of the entity
name, followed by
a dot (.
) and the name of the field
on the entity record
which we would like to restrict, followed by the same content as for
Filter Operators or in: namely, an op
operator
followed by a value
, or the keyword in
followed by a range or list
of values in brackets ([]
).
funders.acronym = "NHLBI"
---------|-------|----|-------
<entity>.<field> <op> <value>
funders.acronym in ["EC", "DFG"]
---------|-------|--|-------------
<entity>.<field> in <list>
Warning
Entity fields should be used with caution. Entity metadata filters impose some internal limitations, and the user should be aware of how they work in order to understand the results they produce.
For example, the following query will produce a warning message and should be avoided:
search publications where research_orgs.country_name = "South Korea" return publications
This is because the entity field filter research_orgs.country_name = "South Korea"
is first transformed by the DSL into a query to retrieve
up to 450 organization IDs from South Korea, then these IDs are used in the main query.
The limit of 450 organizations applies in total to all entity queries and is split between
the entity filters in the query. This means that the filter
research_orgs.country_name = "South Korea" or research_orgs.country_name = "Japan"
will retrieve up to
225 organization IDs from South Korea and up to 225
organization IDs from Japan.
As a result, it may come as a surprise that this filter does not return all research organizations from South Korea
or Japan, but only a limited subset.
A better way to express this filter is as research_org_country_names in ["South Korea", "Japan"]
,
which uses the field research_org_country_names
and does not trigger an extra query.
Therefore no limit on the number of entities is imposed on it.
When using entity filters you should always check whether the query returns any warning messages.
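The arithmetic behind the limit can be written out as follows. This is a sketch only: the constant and function names are hypothetical, the even split matches the two-filter example above, and the use of integer division when 450 is not evenly divisible is an assumption.

```python
ENTITY_ID_LIMIT = 450  # total across all entity filters, as stated above


def ids_per_entity_filter(filter_count: int) -> int:
    """Upper bound on entity IDs retrieved for each entity filter in one query."""
    if filter_count < 1:
        raise ValueError("filter_count must be at least 1")
    # Rounding down for uneven splits is an assumption, not documented behaviour.
    return ENTITY_ID_LIMIT // filter_count


print(ids_per_entity_filter(1))
print(ids_per_entity_filter(2))
```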
JSON fields
Filters that return raw JSON data (e.g. authors or concepts) can be used to filter, but are only searchable as strings. The inner JSON structure is not going to have any effect on the search results.
Filter Operators
A simple filter expression consists of
a field
name, an in-/equality operator op
,
and the desired field value
.
The value
must be a string enclosed in double quotes ("
) or an integer (e.g. 1234
).
The available op
operators are:
operation | meaning |
---|---|
= | is (or contains if the given …) |
!= | is not |
> | is greater than |
< | is less than |
>= | is greater than or equal to |
<= | is less than or equal to |
~ | partially matches (see Partial string matching with ~ below) |
@ | matches the search expression (see Lucene field searches with @ below) |
is empty | is empty (see Emptiness filters is empty below) |
is not empty | is not empty (see Emptiness filters is empty below) |
start_year >= 2010
----------|----|-------
<field> <op> <value>
Partial string matching with ~
The ~
operator indicates that the given field
need only partially,
instead of exactly, match the given string
(the value
used with this operator must be a string, not an integer).
For example, the filter where research_orgs.name~"Saarland Uni"
would match both the organization named “Saarland University” and the one named “Universitätsklinikum des Saarlandes”, and any other organization whose name includes the terms “Saarland” and “Uni” (the order is unimportant). However, the filter where research_orgs.name="Saarland Uni"
would not match either of these two organizations, as the =
operator requires an exact match.
Lucene field searches with @
Since version 2.10 it is possible to use the @
operator to search fields using similar approaches to those available when doing full-text or index-based searches (see Full-text Searching section).
Unlike searching in full text or indexes, triple quotes are not available for this type of filter, so make sure that quotation marks are appropriately escaped.
You can use boolean operators
search publications where mesh_terms @ "(human OR animal)" return publications
and wildcard searches
search publications where category_for.name @ "*chem*" return publications
and proximity searches
search publications where research_org_names @ "\"oxford university\"~3" return publications
Like the ~
operator, this form of search is currently only available for fields that take string (text) values. You can even recreate the effect of the ~
operator:
search publications where journal.title ~ "Nature" return journal
is equivalent to
search publications where journal.title @ "*Nature*" return journal
When using the @
operator to filter fields, all the same caveats and notes apply for these techniques as described in the Full-text Searching section.
Emptiness filters is empty
To keep only records in which a specific field has a value, or only those in which it is empty, use a filter such as where research_orgs is not empty or where issn is empty.
Note
The is not empty and is empty filters can be used on any field, irrespective of whether filtering / faceting is enabled on them.
Chaining Filters
More complex filter expressions may be created by combining multiple
simple filters using the following boolean
operators,
possibly grouped using parentheses (()
):
 | meaning |
---|---|
A and B | include results which match both filters A and B |
A or B | include results which match either filter A, or filter B, or both |
A not B | include results which match filter A and do not match filter B |
The following are all examples of valid filter expressions:
research_org_name="Saarland University" or research_org_name="Universität des Saarlandes"
---------------------------------------|---------|----------------------------------------------
<simple filter> <boolean> <simple filter>
start_year<=2010 and ( active_year=2015 or active_year=2016 )
----------------|-----|-----------------------------------------
<simple filter> <bl.> <complex filter>
-|-----------------|--|----------------|-
( <simple filter> <b.> <simple filter> )
( start_year>=2008 and start_year<=2010 ) and ( active_year=2015 or active_year=2016 )
-----------------------------------------|-----|----------------------------------------
<complex filter> <bl.> <complex filter>
-|----------------|---|----------------|- -|-----------------|--|-----------------|-
( <simple filter> <bl.> <simple filter> ) ( <simple filter> <bl> <simple filter> )
Note
Spaces around operators and parentheses are optional. They may be used to make queries easier to read, but are ignored by the query parser.
Warning
Boolean operators in a Dimensions search query are not guaranteed to follow familiar rules of precedence. For best results, use brackets to specify order of precedence and include brackets around every NOT phrase. e.g. lions AND tigers OR NOT bears would conventionally be parsed as (lions AND tigers) OR (NOT bears) but must include the brackets to guarantee that parsing by Dimensions. See the Frequently Asked Questions for more details.
Note
Outermost parentheses around the filter expression(s) are optional and have no effect, such that the following pairs are equivalent:
where start_year=2010
where (start_year=2010)
where start_year<=2010 and (active_year=2015 or active_year=2016)
where (start_year<=2010 and (active_year=2015 or active_year=2016))
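Because operator precedence is not guaranteed, generated filters are safest when every combination is wrapped in parentheses. The sketch below shows one way to do that in Python; all_of and any_of are hypothetical helper names, not DSL functions.

```python
def _join(operator: str, filters: tuple) -> str:
    if not filters:
        raise ValueError("at least one filter is required")
    # Always parenthesize so the grouping never depends on operator precedence.
    return "(" + f" {operator} ".join(filters) + ")"


def all_of(*filters: str) -> str:
    """Combine filters so that every one of them must match."""
    return _join("and", filters)


def any_of(*filters: str) -> str:
    """Combine filters so that at least one of them must match."""
    return _join("or", filters)


expression = all_of("start_year<=2010",
                    any_of("active_year=2015", "active_year=2016"))
print("where " + expression)
```

The extra outermost parentheses are harmless, as noted above.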
Searching for Researchers
The DSL offers different mechanisms for searching for researchers (e.g. publication authors, grant investigators), each of them presenting specific advantages.
Exact searches
Special full-text indices make it possible to look up a researcher’s name and surname exactly as they appear in the source documents they derive from.
This approach has a broad scope, as it searches the full collection of Dimensions documents irrespective of whether a researcher was successfully disambiguated (and hence given a Dimensions ID). On the other hand, this approach will only match names as they appear in the source document, so different spellings or initials are not necessarily returned by a single query. To address this limitation, the disambiguated researchers search can be used instead.
search in [investigators|inventors]
Investigators search is similar to authors search, only it allows to search on grants
and clinical trials
using a separate search index investigators
, and on patents
using the index inventors
.
search clinical_trials in investigators for "\"John Smith\"" return clinical_trials
search grants in investigators for "\"Satoko Shimazaki\"" return grants
search patents in inventors for "\"John Smith\"" return patents
Fuzzy Searches
As opposed to exact names search, fuzzy search can match only part of a person name, e.g. only the ‘first name’ or the ‘last name’ of a person.
search where [investigators|inventors]
Fuzzy search for researcher names is available also in Grants (field investigators
) , Clinical Trials (field investigators
) and Patents (field inventors
).
search grants where investigators ~ "Downney" return grants[id+title+investigators]
search patents where inventors = "Jobs" return patents[id+title+inventors]
Note
Generally speaking, using a where
clause to search authors|investigators|inventors is less precise than using the relevant full-text index. On the other hand, using a where
clause can be handy if one wants to combine a fuzzy search with another full-text search index.
Disambiguated Researchers
By using this method one can search within a catalogue containing only researchers that have been disambiguated.
The Dimensions Researchers source is a database of researchers information algorithmically extracted and disambiguated from all of the other content sources (publications, grants, clinical trials etc..).
Hence by using the researchers source it is possible to match an ‘aggregated’ person object linking together multiple publication authors, grant investigators, etc., irrespective of the form their names take in the original source documents.
Examples:
search researchers for "\"Satoko Shimazaki\"" return researchers
search researchers where last_name="Shimazaki" return researchers
Note
The researchers
source is the result of a vetted computational process and so it does not contain all authors and investigators information available in Dimensions. E.g. think of authors from older publications, or authors with very common names that are difficult to disambiguate, or very new authors, who have only one or few publications. In such cases, using full-text authors search might be more appropriate.
Warning
If using the full-text for
syntax, please remember that Python and other programming languages have special escaping rules when writing a query with escaped quotes.
Obsolete researchers
Dimensions data contains researcher objects that are no longer valid. To determine the validity of a researcher object, one can run the following query:
search researchers where id in ["ur.011301404166.06", "ur.07433432213.73"] return researchers[id+obsolete+redirect]
The fields returned in the example have the following meaning:
id - is the input researcher ID
obsolete - 0 means that the researcher ID is still active. In this case no additional information is provided. 1 means that the researcher ID is no longer valid, in this case see redirect field for further information.
redirect - if empty, it means that the researcher with this ID was deleted. If it contains a single value, it means that that value is a new researcher ID into which the current one was redirected. If multiple values are available, it means that the original researcher ID was split to these multiple IDs.
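The rules above translate into a small decision function. This is a sketch under stated assumptions: each returned record is taken to be a Python dict, redirect is assumed to be a list of researcher IDs that may be missing or empty, and the function name and status labels are hypothetical.

```python
from typing import Any, Dict, List, Tuple


def researcher_status(record: Dict[str, Any]) -> Tuple[str, List[str]]:
    """Classify a record returned with researchers[id+obsolete+redirect]."""
    if not record.get("obsolete"):
        # obsolete == 0: the researcher ID is still active.
        return "active", []
    redirect = record.get("redirect") or []
    if not redirect:
        # Obsolete with no redirect: the researcher was deleted.
        return "deleted", []
    if len(redirect) == 1:
        # Obsolete with one redirect: merged into a new researcher ID.
        return "redirected", list(redirect)
    # Obsolete with several redirects: split into multiple researcher IDs.
    return "split", list(redirect)


# The redirect target below is made up for illustration.
print(researcher_status({"id": "ur.011301404166.06", "obsolete": 1,
                         "redirect": ["ur.000000000000.00"]}))
```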
Searching using concepts
The Publications and Grants data sources offer the ability to search using concepts
.
Concepts are noun-phrases automatically extracted from a document’s abstract as well as the rest of the Dimensions database, which is used to weight their importance and relevance within the document’s field of study.
For instance, the phrases machine learning and neural network will be considered very relevant in a computer science paper, while project and study will have their relevance scores low as they are generic phrases.
Note
Concepts regularly get updated both as a result of constantly growing data in Dimensions, and because our concepts extraction AI tools improve.
Concepts relevance/scores
Concepts are normally ordered by relevance, where the first concept returned is the most relevant.
It is also possible to retrieve the relevance score associated with a concept (concepts_scores field). For example, for the publication with ID pub.1122072646 we would have the following JSON:
{'id': 'pub.1122072646',
'concepts_scores': [{'concept': 'acid', 'relevance': 0.07450046286579201},
{'concept': 'conversion', 'relevance': 0.055053872555463006},
{'concept': 'formic acid', 'relevance': 0.048144671935356},
{'concept': 'CO2', 'relevance': 0.032150964737607}
[........]
],
}
About scores:
Scores are represented by a number from 0 to 1.
Values approaching 1 signal concepts that are relevant to a subject of a paper they are extracted from. They may be frequent (in the entire Dimensions collection) or infrequent (but always in at least 5 documents).
Values approaching 0 signal concepts that are irrelevant to a subject of a paper they are extracted from. The same frequency facts apply.
A value of 0 is for concepts that occur fewer than 5 times in our collection and in general should be discarded (in future versions of the API these values may be filtered out automatically).
Note
Since concepts get regularly updated, also concepts scores change over time. Hence scores only represent the localized ranking of a concept within a document, at a specific point in time. They are not an ‘absolute’ ranking.
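The scoring rules above suggest a simple post-processing step: drop concepts with a score of 0 and keep the rest in order of relevance. The following is a minimal sketch; relevant_concepts is a hypothetical helper, the threshold parameter is an illustration rather than a recommended value, and the sample record contains only the four entries visible in the truncated excerpt above.

```python
record = {
    "id": "pub.1122072646",
    "concepts_scores": [
        {"concept": "acid", "relevance": 0.07450046286579201},
        {"concept": "conversion", "relevance": 0.055053872555463006},
        {"concept": "formic acid", "relevance": 0.048144671935356},
        {"concept": "CO2", "relevance": 0.032150964737607},
    ],
}


def relevant_concepts(record, min_relevance=0.0):
    """Return concept names by descending relevance, dropping zero scores
    and anything below `min_relevance`."""
    scored = [c for c in record.get("concepts_scores", [])
              if c["relevance"] > 0 and c["relevance"] >= min_relevance]
    scored.sort(key=lambda c: c["relevance"], reverse=True)
    return [c["concept"] for c in scored]


print(relevant_concepts(record, min_relevance=0.05))
```

Since scores are only meaningful within a single document, a threshold like this should not be used to compare concepts across documents.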
Concepts queries examples
Retrieving concepts from publications and grants:
search publications return publications[id+title+concepts]
search grants return grants[id+title+concepts]
Retrieving concepts from publications, as well as their scores (note: as of version 1.25 of the DSL, concepts_scores are available only in Publications).
search publications for "graphene" return publications[id+year+concepts_scores]
Full text search using specific concepts (via the concepts search index):
search publications in concepts for "\"polymer matrix\" AND graphene" return publications
search publications in concepts for "\"polymer matrix\" OR graphene" return publications
# no connector defaults to an AND
search publications in concepts for "\"polymer matrix\" graphene" return publications
Basic filtering using concepts:
search publications where concepts = "polymer matrix" return publications
Note
See also the DSL function extract_concepts, which can be used to extract concepts from any text.
Todo
The API Lab notebook Working with concepts in the Dimensions API contains many more examples of how to process concepts data.
Searching using abstracts
Search function search publications for similar_documents(abstract)
is the DSL’s implementation of the “Search for Abstracts” type of search in the Dimensions webapp.
It uses text of abstract to extract concepts which are then used to filter publications, grants or reports. It can be used like any other query, with filters, facets, paging, etc.
The function is specified after the search publications for
by appending similar_documents(abstract)
. Alternatively, it can be used with Grants source as well.
For example:
search publications for similar_documents("After spinal cord injury (SCI), a fibroblast- and microglia-mediated fibrotic scar is formed in the lesion core...") where year > 2000 return publications limit 5
Note
The input parameter is a string, and it can be either a single or triple quoted string. If single quotes are used, quotes within the text of abstract must be escaped with a backslash.
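When the abstract comes from user input, the quote escaping described in the note can be applied automatically. This is a hedged sketch: similar_documents_query and its parameters are hypothetical, it uses the single-quoted form, and it escapes double quotes only; whether other characters in the abstract need escaping is not covered here.

```python
from typing import Optional


def similar_documents_query(abstract: str,
                            source: str = "publications",
                            filters: Optional[str] = None,
                            limit: Optional[int] = None) -> str:
    """Build a similar_documents query, escaping quotes inside the abstract."""
    escaped = abstract.replace('"', '\\"')
    query = f'search {source} for similar_documents("{escaped}")'
    if filters:
        query += f" where {filters}"
    query += f" return {source}"
    if limit is not None:
        query += f" limit {limit}"
    return query


print(similar_documents_query('A so-called "fibrotic scar" forms in the lesion core',
                              filters="year > 2000", limit=5))
```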
Returning results
After the search
phrase, a query may contain one or more return
phrases,
specifying the content and format of the information that should be returned.
Note
While a query can have only one search
phrase, multiple result phrases
are allowed, one directly after another.
Note
When there is no return
phrase specified, by default the basics
fieldset of the
searched source is returned. For example search publications
is equivalent to search publications return publications[basics]
.
Examples:
return grants [extras]
------------------------
return <src>[<fieldset>]
return funders return funder_countries return research_orgs as "universities" sort by count
--------------------------------------|-------------------------------------------------------
return <facet> return <facet> |return <facet> as <alias> sort by <indicator>
return in "docs" grants[title] as "projects" sort by title limit 10 skip 20
-----------------------------------------------------------------------------------
return in <group> <src>[<fields>] as <alias> sort by <field> limit <#> skip <#>
return in "facets" funders return research_orgs as "orgs" aggregate rcr_avg, altmetric_median sort by rcr_avg limit 5
----------------------------------------------------------------------------------------------------------------------------------
return in <group> <facet> return <facet> as <alias> aggregate <indicator>, <indicator> sort by <indicator> limit <#>
Note
It is possible to specify set return_all_keys
as a prefix to the query, in order for the DSL to return all fields, even those which have no value. They will be returned in the JSON with null as a value.
This is in contrast with the default behavior, where fields with no value are omitted from the output.
Example: set return_all_keys search publications return publications
.
Returning Multiple Sources
Multiple results may not be returned in a single return
phrase; instead, use a separate return phrase for each result.
return funders return research_orgs return year
--------------------------------------------------
return <facet> return <facet> return <facet>
Note
This feature is only available for the Analytics DSL, and is not present in the Runtime DSL.
Returning Specific Fields
For control over which information from each given record will be returned,
a source or entity name in the results
phrase can be optionally
followed by a specification of fields
and fieldsets
to be included in the
JSON results for each retrieved record.
The possible types of fields
specifications are described below.
fields
The fields specification may be an arbitrary list of field names
enclosed in brackets ([
, ]
), with field names separated by a plus sign (+
).
A minus sign (-) can be used to exclude a field or a fieldset from the result.
Field names thus listed within brackets must be “known” to the DSL,
and therefore only a subset of fields may be used in this syntax (see note below).
grants:[project_num + title_original - language]
------------------------------------------------
<source>:[ <field> + <field> - <field> ]
funders:[country_name + acronym + name ]
---------------------------------------
<entity>:[ <field> + <field> + <field>]
fieldsets
The fields specification may be the name of a pre-defined fieldset
(e.g. extras
, basics
). The fields corresponding to that fieldset
will be included in the result.
publications[extras]
------------------------
<source> [<fieldset>]
funders[basics]
-------------------
<entity>[<fieldset>]
Note
The fields and fieldsets available for each source/entity are specified in
the data sources section.
Only fields/fieldsets listed in the configuration
may be used in fields
specifications of the two types listed above.
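As an illustration of the bracket syntax described in this section, the sketch below assembles a fields specification from lists of names. The helper fields_spec is hypothetical and does no validation against the data sources configuration; it follows the bracket form used in the query examples (e.g. grants[id + title]).

```python
from typing import Sequence


def fields_spec(name: str, include: Sequence[str], exclude: Sequence[str] = ()) -> str:
    """Render <name>[<field> + <field> - <field>] from field or fieldset names."""
    if not include:
        raise ValueError("at least one field or fieldset must be included")
    # Included names are joined with a plus sign.
    spec = " + ".join(include)
    # Excluded names are appended with a minus sign.
    for item in exclude:
        spec += f" - {item}"
    return f"{name}[{spec}]"


print("search grants return " + fields_spec("grants", ["id", "title"]))
print("search publications return " + fields_spec("publications", ["basics"], ["language"]))
```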
Returning facets
In addition to returning source records matching a query, it is possible to facet on the entity fields related to a particular source and return only those entity values as an aggregated view of the related source data. This operation is similar to a group by or pivot table.
Not all entity fields can be used as facets; the full list is available in the data sources section.
Warning
Faceting can return up to a maximum of 1000 results. This is to ensure adequate performance with all queries. Furthermore, although the limit operator is allowed, the skip operator cannot be used.
return in "facets"
For control over the organization and headers of the JSON query results,
the return
keyword in a return phrase may be followed by the keyword in
and then a group
name for this group of results, where the group name
is enclosed in double quotes (").
return in "facets" funders return in "facets" research_orgs
-----------------------------------------
return in <group> <facet> return in <group> <facet>
Each result returned in this return
phrase will then be placed under the
header of this group’s name in the final results
(for an example, see Groups on the Example Queries and Results page).
Note
If the given group name has already been used in a previous return
phrase,
the result(s) from this return
phrase will be added to that group.
A result may not be added to an existing group which already contains
a result of the same name.
aliases
The name of a source or facet to be returned in a return
phrase
may optionally be followed by the keyword
as
followed by an alias
for this result in double quotes ("
).
The alias will then be used instead of the original source/facet name in the returned JSON results (for an example, see Aliases on the Example Queries and Results page).
return publications as "articles"
---------------------------------
return <source> as <alias>
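The group and alias options can be combined in one return phrase. The following sketch shows one way to assemble such a phrase from Python; return_phrase is a hypothetical helper and only covers the in and as keywords described above.

```python
from typing import Optional


def return_phrase(result: str, group: Optional[str] = None, alias: Optional[str] = None) -> str:
    """Render: return [in "<group>"] <result> [as "<alias>"]."""
    parts = ["return"]
    if group is not None:
        # The group name is enclosed in double quotes.
        parts.append(f'in "{group}"')
    parts.append(result)
    if alias is not None:
        # The alias is enclosed in double quotes as well.
        parts.append(f'as "{alias}"')
    return " ".join(parts)


# Several return phrases may follow one another in a single query.
query = " ".join([
    "search grants",
    return_phrase("funders", group="facets"),
    return_phrase("research_orgs", group="facets", alias="universities"),
])
print(query)
```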
Paginating Results
At the end of a return
phrase, the user can specify
the maximum number of results to be returned
and the number of top records to skip over before returning
the first result record. This allows large result sets to be returned page by page
(i.e. “paging” results), as described below.
This is done using the keyword limit
followed by the maximum number
of results to return, optionally followed by the keyword skip
and the number of results to skip (the offset).
return publications limit 15 skip 30
---------------------------------------
return <result> limit <#> skip <#>
If paging information is not provided, the default values
limit 20 skip 0
are used, so the two following queries are equivalent:
search grants for "malaria" return grants
search grants for "malaria" return grants limit 20 skip 0
Note
While a limit
value may be specified without also specifying
a skip
value, skip
may not be used on its own without limit
;
e.g. the query search grants return grants skip 20
is invalid and
will result in an error. The valid alternative to this query would be
search grants return grants limit 20 skip 20
.
Combining limit
and skip
across multiple queries
enables paging or batching of results;
e.g. to retrieve 30 grant records divided into 3 pages of 10 records each,
the following three queries could be used:
return grants limit 10 => get 1st 10 records for page 1 (skip 0, by default)
return grants limit 10 skip 10 => get next 10 for page 2; skip the 10 we already have
return grants limit 10 skip 20 => get another 10 for page 3, for a total of 30
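The three paging queries above follow a simple pattern that can be generated programmatically. The sketch below is a client-side illustration only; paged_queries is a hypothetical helper, and it always emits both limit and skip, since skip may not be used on its own.

```python
from typing import Iterator


def paged_queries(base_query: str, total: int, page_size: int = 20) -> Iterator[str]:
    """Yield one query per page, covering `total` records in steps of `page_size`."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    for skip in range(0, total, page_size):
        # The last page may be shorter than page_size.
        limit = min(page_size, total - skip)
        yield f"{base_query} limit {limit} skip {skip}"


for q in paged_queries("search grants return grants", total=30, page_size=10):
    print(q)
```

This applies to source results only, because skip cannot be used when returning facets.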
Sorting Results
A sort order for the results in a given return
phrase can be specified
with the keyword sort by
followed by the name of a field
(in the case that a source is being requested) or an indicator
(in the case that one or more facets are being requested).
Multi-value fields cannot be used in the sort clause. By default,
the result set of full text queries (search … for “full text query”) is sorted by the default sort field, as specified for each data source.
Additionally, it is possible to specify the sort order, using asc
or desc
keywords.
These keywords specify ascending and descending ordering of results, respectively.
By default, descending order is selected.
return grants sort by title desc
----------------------------------------
return <source> sort by <field> <order>
return grants sort by relevance desc
--------------------------------------------
return <source> sort by relevance <order>
return <source> sort by score <order>
return research_orgs aggregate altmetric_median, rcr_avg sort by rcr_avg
-------------------------------------------------------------------------
return <facet(s)> aggregate <indicator(s)> sort by <indicator>
Note
If a facet is being returned, the indicator
used in the sort
phrase must either be count
(the default, such that sort by count
is unnecessary),
or one of the indicators specified in the aggregate
phrase,
i.e. one whose values are being computed in the faceting operation.
Attempting to sort by an indicator other than count
that does not appear
in the aggregate
phrase will result in an error, as in the invalid query
search publications return funders aggregate altmetric_median sort by rcr_avg
.
Please note that, due to the internal implementation of fcr_gavg, it is not possible to use this field in the sort part of the query.
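The sorting rule for facets can be checked on the client before a query is sent. The sketch below is an assumption-laden illustration: check_facet_sort is a hypothetical helper, and it encodes only the rules stated in this note (count is always allowed, other indicators must appear in the aggregate phrase, and fcr_gavg cannot be sorted on).

```python
from typing import Sequence

# Indicators that this note says cannot be used for sorting.
UNSORTABLE = {"fcr_gavg"}


def check_facet_sort(sort_by: str, aggregates: Sequence[str] = ()) -> None:
    """Raise ValueError if `sort_by` is not a valid sort indicator for a facet result."""
    if sort_by in UNSORTABLE:
        raise ValueError(f"{sort_by} cannot be used in the sort phrase")
    # count is always computed, so it is always a valid sort indicator.
    if sort_by != "count" and sort_by not in aggregates:
        raise ValueError(f"{sort_by} must appear in the aggregate phrase to be sorted on")


check_facet_sort("rcr_avg", ["altmetric_median", "rcr_avg"])  # valid
check_facet_sort("count")                                     # valid, the default
```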
Note
relevance
and score
are synonyms when it comes to sorting the result set by the relevance of a particular document
to the specified search phrase and filter options. This score or relevance has only a relative value specific to that
particular result set and is not comparable across result sets, even when using limit/skip options with the same filters and search phrase.
The actual value of the returned score can be obtained by including it in the result set, e.g. return publications[basics+score],
but this numeric value cannot be used to compare results from multiple query invocations.
Warning
Sorting by title is considered slow, might not function as expected, and should be avoided. This feature will be removed in a future release.
Aggregations
In a return
phrase requesting one or more facet results,
aggregation operations to perform during faceting can be specified after
the facet name(s) by using the keyword aggregate
followed by
a comma-separated list of one or more indicator names corresponding to
the source being searched.
return research_orgs aggregate rcr_avg, altmetric_median
-----------------------------------------------------------------
return <facet> aggregate <indicator>,<indicator>
Note
Every indicator appearing in the aggregations
phrase must be
either count
or a pre-defined indicator for the given source
as specified in the data configuration.
In addition to any specified aggregations, count
is always computed
and reported when facet results are requested.
If no aggregations
phrase is present, only count
is computed.
The following pairs of queries are therefore equivalent:
search publications return funders aggregate rcr_avg, count, altmetric_median
search publications return funders aggregate rcr_avg, altmetric_median
search grants return funders aggregate count
search grants return funders
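The equivalence of these query pairs follows from count always being computed. A minimal sketch of that rule, using a hypothetical helper effective_indicators:

```python
from typing import Iterable, Set


def effective_indicators(aggregates: Iterable[str] = ()) -> Set[str]:
    """Return the indicators computed for a facet result; count is always included."""
    return {"count", *aggregates}


# Listing count explicitly makes no difference.
print(effective_indicators(["rcr_avg", "count", "altmetric_median"])
      == effective_indicators(["rcr_avg", "altmetric_median"]))
```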
Returning complex calculations using result function expressions
The return phrase may be followed by a function expression to return additional calculations, such as per-year funding or citation statistics. These functions may take their own arguments, and are calculated using the source data as specified in the search part of the query.
search grants return funding_per_year(2010, 2020, "USD")
--------------------------------------------------------------------------------
search <source> return <function name> (<positional arguments, named arguments>)
Function expressions support two types of arguments:
- Positional Arguments
Positional arguments are given as bare values, without a name. They are completely optional and can be replaced by named arguments. Their purpose is mainly to simplify calling functions with just one or two arguments, by omitting the argument name.
- Named Arguments
Named arguments are always put after positional arguments and their order is not important.
Arguments can be of various types, for example string or integer. See below for the full list of supported functions.
Function: citations_per_year
Publication citations is the number of times that publications have been cited by other publications in the database. This function returns the number of citations received in each year.
Arguments
Argument Name | Argument Type | Optional? |
---|---|---|
Function: funding_per_year
Returns grant funding per year in the given currency, from the specified start year to the specified end year (inclusive).
Supported currencies are: USD, JPY, NZD, CAD, CHF, AUD, CNY, EUR, GBP.
Arguments
Argument Name | Argument Type | Optional? |
---|---|---|
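A funding_per_year call can be assembled and checked on the client, as in the sketch below. It is only an illustration: funding_per_year_query is a hypothetical helper, and the positional order (start year, end year, currency) is copied from the example search grants return funding_per_year(2010, 2020, "USD") rather than taken from an argument table.

```python
SUPPORTED_CURRENCIES = ("USD", "JPY", "NZD", "CAD", "CHF", "AUD", "CNY", "EUR", "GBP")


def funding_per_year_query(search_phrase: str, start_year: int, end_year: int, currency: str) -> str:
    """Append a funding_per_year function expression to a search phrase."""
    if currency not in SUPPORTED_CURRENCIES:
        raise ValueError(f"unsupported currency: {currency}")
    if start_year > end_year:
        raise ValueError("start_year must not be after end_year")
    # Assumption: positional order follows the example query in this section.
    return f'{search_phrase} return funding_per_year({start_year}, {end_year}, "{currency}")'


print(funding_per_year_query("search grants", 2010, 2020, "USD"))
```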
Unnesting multi-value entity fields
Multi-value entity and JSON fields, such as researchers, authors or research_orgs, or any of the category_* fields, may be unnested into top-level objects using the following syntax:
search publications return publications[basics+unnest(researchers)+unnest(category_for)]
----------------------------------------------------------------------------------------
search <source> return <source>[<fields, fieldsets and field functions(unnest)>]
This query transforms all of the returned researchers and FOR categories into top-level keys, such as researchers.id, researchers.first_name and researchers.last_name, while copying the other, non-unnested fields, such as the id or title of the publication, for each of them. The number of returned results is therefore multiplied by the number of researchers and FOR categories each original publication has, so it will likely exceed the overall query limit, because the limit applies to the source objects, not the unnested ones. If multiple fields are unnested, a cartesian product of all unnested fields is returned.
Note
As of release 2.1.0, the total number of records returned when using unnest is capped at 10000, as a performance and stability improvement.
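To make the row multiplication concrete, the sketch below emulates unnesting on the client with a cartesian product. It is not the DSL's implementation: the unnest function, the record shape and all values are hypothetical, and a record whose unnested field is empty yields no rows here, a case this document does not describe.

```python
from itertools import product
from typing import Any, Dict, List, Sequence


def unnest(record: Dict[str, Any], fields: Sequence[str]) -> List[Dict[str, Any]]:
    """Return one flat row per combination of values of the unnested fields."""
    # Fields that are not unnested are copied unchanged into every row.
    base = {k: v for k, v in record.items() if k not in fields}
    rows = []
    for combo in product(*(record[f] for f in fields)):
        row = dict(base)
        for field, item in zip(fields, combo):
            # Each sub-key becomes a top-level dotted key, e.g. researchers.id.
            for key, value in item.items():
                row[f"{field}.{key}"] = value
        rows.append(row)
    return rows


# Hypothetical publication with 2 researchers and 3 FOR categories.
publication = {
    "id": "pub.1",
    "title": "Example title",
    "researchers": [
        {"id": "r1", "first_name": "Ada", "last_name": "A"},
        {"id": "r2", "first_name": "Bob", "last_name": "B"},
    ],
    "category_for": [{"id": "c1"}, {"id": "c2"}, {"id": "c3"}],
}
rows = unnest(publication, ["researchers", "category_for"])
print(len(rows))  # 2 researchers x 3 categories = 6 rows
```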
Formal language specification
The general structure of the query is visualized using railroad diagrams. Rectangles represent non-terminals and elliptic shapes represent terminal symbols.
Simply put, terminal symbols are actual tokens or “words” in the DSL, while non-terminals can be expanded further. This section shows how the language is structured. For example, each query either starts with search, followed by a NAME (basically a string such as publications), or is a Basic functions structure. Then there is an optional part with filters that is introduced by the where terminal (keyword). The query is allowed to have one or more result non-terminal sections.
search:
searchFor:
restriction:
filters:
entityFilter:
simpleFilterEntity:
value:
literal:
span:
array:
stringArray:
integerArray:
op:
result:
group:
record:
fields:
aggregate:
sort:
paging:
funExpression:
funArgs:
funArgNamed:
funArgValue:
identifyExperts:
identifyResult:
identifyAnnotate:
describe:
metaExpr:
Note
Some non-terminals, such as NAME, STRING and INTEGER, are not defined using railroad diagrams; their syntax is defined using the following regular expressions:
NAME:
[a-zA-Z_]+
STRING:
'"' ('\\"'|.)*? '"'
INTEGER:
'-'?[0-9]+
Other than that, tokens can be separated by as much white space as desired; the DSL treats all [ \t\r\n]+ as white space.
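The three token definitions above can be approximated with Python's re module, for example to pre-check values before building a query. This is a sketch under assumptions: classify_token is a hypothetical helper, the patterns are hand translations of the grammar-style rules, and the wildcard in STRING is assumed to match newlines as well.

```python
import re
from typing import Optional

NAME = re.compile(r"[a-zA-Z_]+")
# Non-greedy body: either an escaped quote (\") or any single character.
STRING = re.compile(r'"(?:\\"|.)*?"', re.DOTALL)
INTEGER = re.compile(r"-?[0-9]+")


def classify_token(text: str) -> Optional[str]:
    """Return the token kind if `text` is exactly one NAME, STRING or INTEGER."""
    for kind, pattern in (("INTEGER", INTEGER), ("NAME", NAME), ("STRING", STRING)):
        match = pattern.match(text)
        # The token must consume the whole input, not just a prefix.
        if match and match.end() == len(text):
            return kind
    return None


print(classify_token("research_orgs"))
print(classify_token('"lung cancer"'))
print(classify_token("-42"))
```

Note that NAME as defined contains no digits, so an input such as year2000 is not a single NAME token under these rules.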
Full-text query formal language specification
limitedQuery:
limitedClause:
booleanPrefix:
standardQuery:
standardClause:
booleanQuery:
Note
The STRING non-terminal is not defined using railroad diagrams; its syntax is defined using the following regular expressions:
STRING:
((UNQUOTED_STRING | QUOTED_STRING) (('~' | '^') NUMBER)?)+
ESCAPES:
\^|\"|\*|\?|\:|\~|\\|\[|\]|\{|\}|\(|\)|\!|\||\&|\+|\-
UNQUOTED_STRING:
([\p{Alnum}.*?-@] | ESCAPES)+
QUOTED_STRING:
"(STRING_CHAR | WHITESPACE)*"
NUMBER:
'-'?[0-9]+ ('.' [0-9]+)?
Unicode classes (Letter, Number and Symbol) are Unicode General Category values.
Other than that, tokens can be separated by as much white space as desired; the DSL treats all [ \t\r\n]+ as white space.