Query syntax

The DSL aims to capture the type of interaction with Dimensions data that users are accustomed to performing graphically via the web application, and enable web app developers, power users, and others to carry out such interactions by writing query statements in a syntax loosely inspired by SQL but particularly suited to our specific domain and data organization.

Language specification

The general structure of a query is visualized using railroad diagrams. Rectangles represent non-terminals and ellipses represent terminal symbols.

Simply put, terminal symbols are actual tokens or “words” in the DSL, while non-terminals can be expanded further. This section shows how the language is structured. For example, a query starts with the keyword search, followed by a NAME, which is a string such as publications. Alternatively, a query can be a function call (see Function Syntax). An optional filter part may follow, introduced by the where keyword. A query must contain one or more result sections.

[Railroad diagrams for the non-terminals searchFor, restriction, filters, entityFilter, simpleFilterEntity, value, literal, span, array, stringArray, integerArray, op, result, group, record, fields, aggregate, sort, paging, funExpression, funArgs, funArgNamed and funArgValue]

Note

Some non-terminals, such as NAME, STRING, and INTEGER, are not defined using railroad diagrams; their syntax is defined by the following regular expressions:

  • NAME: [a-zA-Z_]+
  • STRING: '"' ('\\"'|.)*? '"'
  • INTEGER: '-'?[0-9]+

Other than that, tokens can be separated by as much white space as desired; the DSL treats any sequence matching [ \t\r\n]+ as white space.
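
The token patterns above translate almost directly into Python's re syntax. The following is a minimal sketch that only illustrates what each pattern accepts; the helper token_kind is hypothetical and not part of the DSL, and treating "." as also matching newlines is an assumption.

```python
import re

# Token patterns from the note above, rewritten in Python regex syntax.
NAME = re.compile(r'[a-zA-Z_]+')
# Assumption: '.' is taken to include newlines, hence re.DOTALL.
STRING = re.compile(r'"(?:\\"|.)*?"', re.DOTALL)
INTEGER = re.compile(r'-?[0-9]+')
WHITESPACE = re.compile(r'[ \t\r\n]+')


def token_kind(text):
    """Return the name of the first pattern matching the whole text, or None."""
    for kind, pattern in (("NAME", NAME), ("INTEGER", INTEGER), ("STRING", STRING)):
        if pattern.fullmatch(text):
            return kind
    return None
```

Note that NAME contains no digits, so a text such as start_year2 is not a single NAME token under this definition.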

Basic query structure

DSL queries consist of two required components: a search phrase that indicates the scientific works to be searched, and one or more return phrases which specify the contents and structure of the desired results.

The simplest valid DSL query is of the form:

search  grants  return  grants
---------------|---------------
search <source>|return <result>

A more useful query might also make use of the optional for and where phrases to limit the set of works returned.

search  grants  for "lung cancer" where active_year=2000 return  grants
--------------------------------------------------------|---------------
search <source>|for     <terms>  |where    <filters>    |return <result>

A query requesting more complex results might have the form:

search  grants  for "laryngectomy" where start_year=2000 return  grants[id + title]  sort by  title  return funders return funder_countries return research_orgs as "universities" aggregate funding  sort by funding
--------------------------------------------------------|------------------------------------------|--------------------------------------|--------------------------------------------------------------------------
                                                        |                                          |                                      |
search <source>|for     <terms>   |where    <filters>   |return <source>[<fields>]|sort by <field>|return <facet> return <facet>         |return    <facet>    as    <alias>    |aggregate <indicator>|sort by <indicator>
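
When queries are assembled programmatically, the phrase structure shown above can be mirrored by a small helper. This is a sketch only: build_query and its parameters are hypothetical names, and the terms and filters are inserted as given, so they must already be valid (and escaped) DSL text.

```python
def build_query(source, returns, terms=None, filters=None):
    """Assemble a DSL query string from its phrases.

    source  -- source name, e.g. "grants"
    returns -- non-empty list of result specifications, one per return phrase
    terms   -- optional search terms (already escaped), placed in double quotes
    filters -- optional filter expression, placed after the keyword where
    """
    if not returns:
        raise ValueError("a query needs at least one return phrase")
    parts = ["search", source]
    if terms is not None:
        parts.append('for "%s"' % terms)
    if filters is not None:
        parts.append("where " + filters)
    for result in returns:
        parts.append("return " + result)
    return " ".join(parts)


print(build_query("grants", ["grants"], terms="lung cancer", filters="active_year=2000"))
```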

Searching works

search publications

A query must begin with the keyword (KW) search followed by a source name, i.e. the name of a type of scientific work, such as grants or publications.

search  grants
---------------
search <source>

The source name may be followed by an optional for phrase that provides search terms to rank works against, and/or an optional where phrase that limits the set of works that will be searched. The for and where phrases may be in either order.

search  grants  for "laryngectomy" where start_year=2000
--------------------------------------------------------
search <source>|for     <terms>   |where    <filters>

search  grants  where start_year=2000 for "laryngectomy"
-------------------------------------------------------
search <source>|where    <filters>   |for     <terms>

in [full_data|title_abstract_only|title_only|researchers|terms|...]

This optional phrase consists of the keyword in followed by the name of a search field, such as full_data, title_abstract_only or title_only, which determines whether the search covers the full text, the title and abstract only, or the title only. Please check the supported sources to see which search fields each source supports.

    search grants in full_data for "something" return grants
    ---------------------------------------------------------------------
    search grants in title_abstract_only for "something" return grants

Another example, searching for researchers in publications, shows how to search for a phrase (a full name in this case). This feature is explained in the next section, search term.

    search publications in researchers for "\"Jennifer A Doudna\"" return publications

search term

This optional phrase consists of the keyword for followed by a search terms string, enclosed in double quotes (").

for "motor neuron disease"
--------------------------
for    <terms (string)>

Strings in double quotes can contain nested quotes escaped by a backslash \. This will ensure that the string in nested double quotes is searched for as if it was a single phrase, not multiple words. This applies both to the for part of the query and filters.

An example of a phrase: "\"Machine Learning\"". Result must contain Machine Learning as a phrase.

Example of multiple keywords: "Machine Learning". Searches for keywords independently.

Note

Special characters, i.e. any of ^ " * ? : ~ \ [ ] { } ( ) ! | & + - must be escaped by a backslash \.

Examples

  1. Searching for phrase “How is mechanobiology involved in mesenchymal stem cell differentiation toward the osteoblastic or adipogenic fate?”

    Special characters, such as question marks, must be escaped by a backslash \

    search publications for "How is mechanobiology involved in mesenchymal
                 stem cell differentiation toward the osteoblastic or adipogenic fate\?"
    return publications
    
  2. Searching for “dose” or “concentration”

    Special characters, such as parentheses, must not be escaped here, because they are used to construct a boolean query.

    search publications for "(dose OR concentration)"
    return publications
    
  3. Searching for “haskell unit ()”

    Special characters, such as parentheses, must be escaped here, because we are searching for them literally.

    search publications for "haskell unit \(\)"
    return publications
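
The escaping rules illustrated by the examples above can be sketched in Python as follows. The function names escape_term and as_phrase are hypothetical helpers, not part of the DSL; the set of special characters is taken from the note earlier in this section.

```python
# Characters that the note above lists as requiring a backslash.
SPECIAL_CHARS = set('^"*?:~\\[]{}()!|&+-')


def escape_term(text):
    """Prefix every special character with a backslash so it is searched literally."""
    return "".join("\\" + ch if ch in SPECIAL_CHARS else ch for ch in text)


def as_phrase(text):
    """Wrap the escaped text in escaped double quotes so it is searched as one phrase."""
    return '\\"' + escape_term(text) + '\\"'


print('search publications for "%s" return publications' % escape_term("haskell unit ()"))
print('search publications for "%s" return publications' % as_phrase("Machine Learning"))
```

escape_term should not be applied to text that is meant as a boolean query, such as (dose OR concentration), because the parentheses there act as operators rather than literal characters.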
    

Full query syntax specification:

[Railroad diagrams for the non-terminals limitedQuery, limitedClause, booleanPrefix, standardQuery, standardClause and booleanQuery]

Note

The STRING non-terminal is not defined using railroad diagrams; its syntax is defined by the following regular expressions:

  • STRING: UNQUOTED_STRING | QUOTED_STRING
  • ESCAPES: \^|\"|\*|\?|\:|\~|\\|\[|\]|\{|\}|\(|\)|\!|\||\&|\+|\-
  • UNQUOTED_STRING: ([\p{Letter}\p{Number}\p{Symbol}] | ESCAPES)+
  • QUOTED_STRING: "(~")*"

Unicode classes (Letter, Number and Symbol) are Unicode General Category values. Other than that, tokens can be separated by as much white space as desired; the DSL treats any sequence matching [ \t\r\n]+ as white space.

A search term can consist of multiple keywords or phrases connected using boolean logic operators, e.g. AND, OR and NOT. The full specification is shown in the table below, which describes the “standardQuery” grammar.

Supported Boolean Operators

Boolean Operator   Alternative Symbol   Description
AND                &&                   Requires both terms on either side of the Boolean operator to be present for a match.
NOT                !                    Requires that the following term not be present.
OR                 ||                   Requires that either term (or both terms) be present for a match.
                   +                    Requires that the following term be present.
                   -                    Prohibits the following term (that is, matches on fields or documents that do not include that term). The - operator is functionally similar to the Boolean operator !.

Note

When specifying Boolean operators with keywords such as AND, OR and NOT, the keywords must appear in all uppercase.

Note

Researcher search (search … in researchers) supports only the + and - operators. It uses the “limitedQuery” grammar.

Researcher search expects case-insensitive phrases in the format “<first name> <last name>” or in the reverse order. Initials can be used instead of the first name. These are examples of valid researcher search phrases:

search … in researchers for “<use one of the following>”

  • \"Hook, Daniel W.\"
  • \"Daniel W. Hook\"
  • \"DW Hook\"
  • \"Hook DW\"
  • \"D W Hook\"
  • \"Hook D W\"
  • \"D Hook\"
  • \"Hook D\"
  • \"Daniel Hook\"
  • \"Hook Daniel\"

Commas and dots are ignored, so they may be included or omitted.

where

This optional phrase consists of the keyword where followed by a filters phrase consisting of DSL filter expressions, as described below.

where research_org_name="Saarland University"
---------------------------------------------
where          <filters>

Note

The filters specified in this phrase are used to restrict the set of documents to be searched; if a for phrase is also used, the system will first apply the filters, and then search the resulting restricted set of documents for the search terms.

simple filters

A simple filter expression consists of a field name, an in-/equality operator op, and the desired field value. The value must be a string enclosed in double quotes (") or an integer (e.g. 1234).

The available op operators are:

op             meaning
=              is (or contains if the given field is multi-value)
!=             is not
>              is greater than
<              is less than
>=             is greater than or equal to
<=             is less than or equal to
~              partially matches (see Partial string matching with ~ below)
is empty       is empty (see Emptiness filters below)
is not empty   is not empty (see Emptiness filters below)

start_year  >=  2010
----------|----|-------
 <field>   <op> <value>

Partial string matching with ~

The ~ operator indicates that the given field need only partially, instead of exactly, match the given string (the value used with this operator must be a string, not an integer).

For example, the filter where research_orgs.name~"Saarland Uni" would match both the organization named “Saarland University” and the one named “Universitätsklinikum des Saarlandes”, and any other organization whose name includes the terms “Saarland” and “Uni” (the order is unimportant). However, the filter where research_orgs.name="Saarland Uni" would not match either of these two organizations, as the = operator requires an exact match.
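
The difference between ~ and = described above can be pictured with a small Python sketch. This is only an illustration of the described behaviour, not the actual matching algorithm: it assumes that a partial match means every whitespace-separated term of the value occurs somewhere in the field, ignoring case, and the function name partially_matches is hypothetical.

```python
def partially_matches(field_value, query):
    """True if every whitespace-separated query term occurs in field_value.

    Assumption: terms are compared case-insensitively as substrings,
    and their order does not matter.
    """
    haystack = field_value.casefold()
    return all(term.casefold() in haystack for term in query.split())


names = ["Saarland University", "Universitätsklinikum des Saarlandes", "University of Oxford"]
for name in names:
    # Compare the partial match with an exact comparison, which is what = requires.
    print(name, partially_matches(name, "Saarland Uni"), name == "Saarland Uni")
```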

Emptiness filters

To select records in which a specific field is present, or records in which it is empty, use a filter such as where research_orgs is not empty or where issn is empty.

Filter functions

Functions can be applied to fields in filters. Currently, count is the only supported filter function, and it is available only for some fields in publications.

The use of this filter function is shown in the example below:

count(research_orgs) =        1
-------------------------------------------
   <filter function>     <simple filter>

Please see Supported data for the exact specifications on supported filter functions.

complex filters

More complex filter expressions may be created by combining multiple simple filters using the following boolean operators, possibly grouped using parentheses (()):

boolean meaning
A and B include results which match both filters A and B
A or B include results which match either filter A, or filter B, or both
A not B include results which match filter A and do not match filter B

The following are all examples of valid filter expressions:

research_org_name="Saarland University"    or     research_org_name="Universität des Saarlandes"
---------------------------------------|---------|----------------------------------------------
        <simple filter>                 <boolean>              <simple filter>

start_year<=2010  and  ( active_year=2015  or active_year=2016 )
----------------|-----|-----------------------------------------
 <simple filter> <bl.>           <complex filter>
                       -|-----------------|--|----------------|-
                       ( <simple filter>  <b.> <simple filter> )

( start_year>=2008 and start_year<=2010 )  and  ( active_year=2015  or  active_year=2016 )
-----------------------------------------|-----|----------------------------------------
          <complex filter>                <bl.>           <complex filter>
-|----------------|---|----------------|-       -|-----------------|--|-----------------|-
( <simple filter> <bl.> <simple filter> )       (  <simple filter> <bl>  <simple filter> )

If multiple boolean operators are used in a complex filter expression without parentheses, the system will apply the operators using the following order of operator precedence:

  • not applies first
  • and applies next
  • or applies last

To deviate from these precedence rules, parentheses (()) may be used to explicitly specify the order in which operators should be applied.

To illustrate, the following filter expressions are equivalent:

start_year<=2010 not start_year=2005 and active_year=2015 or active_year=2010
((start_year<=2010 not start_year=2005) and active_year=2015) or active_year=2010

The above filters will match documents which either:

  • started in 2010 or earlier but did not start in 2005, and were active in 2015; or
  • were active in 2010

The following expression uses the same operators, but specifies a different order of operations using parentheses:

(start_year<=2010 not start_year=2005) and (active_year=2015 or active_year=2010)

In contrast to the previous two filters, this filter will only match documents which:

  • started in 2010 or earlier but did not start in 2005, and were active in 2015 or 2010
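
The effect of the two groupings discussed above can be checked with a short Python sketch. It treats A not B as "A and not B" and, as a simplifying assumption, models each document as a dictionary with a single start_year and a single active_year; the function names are hypothetical.

```python
def default_grouping(doc):
    """((A not B) and C) or D: how the unparenthesized expression is read."""
    a = doc["start_year"] <= 2010
    b = doc["start_year"] == 2005
    c = doc["active_year"] == 2015
    d = doc["active_year"] == 2010
    return ((a and not b) and c) or d


def explicit_grouping(doc):
    """(A not B) and (C or D): the order forced by explicit parentheses."""
    a = doc["start_year"] <= 2010
    b = doc["start_year"] == 2005
    c = doc["active_year"] == 2015
    d = doc["active_year"] == 2010
    return (a and not b) and (c or d)


# A document that started after 2010 but was active in 2010
# matches only under the default grouping.
doc = {"start_year": 2012, "active_year": 2010}
print(default_grouping(doc), explicit_grouping(doc))
```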

Note

Spaces around operators and parentheses are optional. They may be used to make queries easier to read, but are ignored by the query parser.

Note

Outermost parentheses around the filter expression(s) are optional and have no effect, such that the following pairs are equivalent:

where start_year=2010
where (start_year=2010)

where start_year<=2010 and (active_year=2015 or active_year=2016)
where (start_year<=2010 and (active_year=2015 or active_year=2016))

range filters

For convenience, the DSL also supports shorthand notation for filters where a particular field should be restricted to a specified range or list of values (although the same logic may be expressed using complex filters as shown below).

A range filter consists of the field name, the keyword in, and a range of values enclosed in square brackets ([]), where the range consists of a low value, colon :, and a high value.

start_year in [ 2010 : 2015 ]
----------|--|---------------
 <field>   in    <range>
              -|----|-|----|-
              [ <low>:<high>]

The results are inclusive of both the low and high values, such that the following two restriction phrases give the same results:

where start_year in [2010:2015]
where (start_year>=2010 and start_year<=2015)

A list filter consists of the field name, the keyword in, and a list of one or more values enclosed in square brackets ([]), where values are separated by commas (,):

research_org_name in [ "UC Berkeley", "UC Davis", "UCLA"  ]
-----------------|--|--------------------------------------
<field>           in              <list>
            -|-------------|-------|------------|-
            [    <value>   ,  <value>  , <value> ]

The following two restriction phrases give the same results:

where start_year in [2000, 2005, 2010]
where (start_year=2000 or start_year=2005 or start_year=2010)
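
The equivalence between the shorthand notation and complex filters can be made explicit with a short Python sketch that expands a range or list filter into its long form. The helper names expand_range and expand_list are hypothetical and exist only for this illustration.

```python
import json


def _literal(value):
    """Render a Python value as a DSL literal: integers bare, strings double-quoted."""
    if isinstance(value, int):
        return str(value)
    return json.dumps(value, ensure_ascii=False)


def expand_range(field, low, high):
    """Long form of: <field> in [<low>:<high>] (both bounds inclusive)."""
    return "(%s>=%s and %s<=%s)" % (field, _literal(low), field, _literal(high))


def expand_list(field, values):
    """Long form of: <field> in [<value>, <value>, ...]."""
    return "(" + " or ".join("%s=%s" % (field, _literal(v)) for v in values) + ")"


print("where " + expand_range("start_year", 2010, 2015))
print("where " + expand_list("start_year", [2000, 2005, 2010]))
```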

entity metadata filter

In addition to restricting the values of a particular field of the work type being searched, filter expressions may also restrict results using entity metadata, i.e. restricting the values of a particular field of an entity related to this work.

For example, when searching for grants, we may wish to restrict results to those whose funder has a certain acronym or whose research organization has a certain name, as in the following queries:

search grants where funders.acronym="DFG" return grants
search grants
    where research_orgs.name~"Harvard"
return grants

Such an entity metadata filter consists of the entity name, followed by a dot (.) and the name of the field on the entity record which we would like to restrict, followed by the same content as for simple filters or range filters: namely, an op operator followed by a value, or the keyword in followed by a range or list of values in brackets ([]).

  funders.acronym  !=  "NHLBI"
---------|-------|----|-------
<entity>.<field> <op> <value>

  funders.acronym in ["EC", "DFG"]
---------|-------|--|-------------
<entity>.<field> in    <list>

Note

Entity metadata filters are subject to some internal limitations, and users should be aware of how they work in order to understand the results they produce.

If the user queries research_orgs.country_name = "South Korea", the DSL first executes a query to retrieve up to 512 organization IDs from South Korea, and then uses these IDs in the main query.

The limit of 512 organizations applies in total to all entity queries and is split between the entity filters. This means that the filter research_orgs.country_name = "South Korea" or research_orgs.country_name = "Japan" will retrieve up to 256 organization IDs from South Korea and up to 256 organization IDs from Japan.

As a result, it may come as a surprise that this filter does not cover all research organizations from South Korea or Japan, but only a limited subset. A better way to express this filter is research_org_country_names in ["South Korea", "Japan"], which uses the field research_org_country_names and does not trigger an extra query. Therefore, no entity limit is imposed on it.
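
The arithmetic behind the split can be sketched as follows. The even split is taken from the note above; rounding down when 512 is not evenly divisible by the number of filters is an assumption, and the names used are hypothetical.

```python
ENTITY_ID_LIMIT = 512  # total across all entity filters in one query


def ids_per_entity_filter(filter_count):
    """Maximum number of entity IDs retrieved for each entity filter.

    Assumption: the total is divided evenly and rounded down.
    """
    if filter_count < 1:
        raise ValueError("filter_count must be at least 1")
    return ENTITY_ID_LIMIT // filter_count


print(ids_per_entity_filter(1))  # one entity filter gets the whole budget
print(ids_per_entity_filter(2))  # two entity filters share it equally
```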

Please see the Supported data for the list of supported fields and their meaning.

Returning results

After the search phrase, a query must contain one or more return phrases, specifying the content and format of the information that should be returned.

Note

While a query can have only one search phrase, multiple result phrases are allowed, one directly after another.

Examples:

return grants [extras]
------------------------
return <src>[<fieldset>]

return funders return funder_countries return research_orgs as "universities" sort by count
--------------------------------------|-------------------------------------------------------
return <facet> return <facet>         |return    <facet>    as    <alias>     sort by <indicator>

return in "docs"  grants[title]   as "projects" sort by  title  limit 10  skip 20
-----------------------------------------------------------------------------------
return in <group> <src>[<fields>] as  <alias>   sort by <field> limit <#> skip <#>

return in "facets" funders return research_orgs as "organizations" aggregate  rcr_avg, altmetric_median sort by rcr_avg  limit  5
----------------------------------------------------------------------------------------------------------------------------------
return in <group>  <facet> return  <facet>      as    <alias>      aggregate <indicator>,   <indicator>   sort by <indicator> limit <#>

return source

The most basic return phrase consists of the keyword return followed by the name of a work or facet to be returned. This must be the name of the source used in the search phrase, or the name of a facet of that source.

return  grants
---------------
return <source>

return funders
--------------
return <facet>

return multiple sources

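Multiple results may not be returned in a single return phrase. Instead, each result is requested in its own return phrase, and several return phrases may follow one another: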

return funders return research_orgs return year
--------------------------------------------------
return <facet> return <facet>       return <facet>

Note

This feature is only available for the Analytics DSL, and is not present in the Runtime DSL.

return publications:

For control over which information from each given record will be returned, a source or entity name in the results phrase can be optionally followed by a specification of fields and fieldsets to be included in the JSON results for each retrieved record.

The possible types of fields specifications are described below.

fieldsets

The fields specification may be the name of a pre-defined fieldset (e.g. extras, basics). The fields corresponding to that fieldset will be included in the result.

publications[extras]
------------------------
<source>    [<fieldset>]

 funders[basics]
-------------------
<entity>[<fieldset>]

fields

The fields specification may be an arbitrary list of field names enclosed in brackets ([ ]), with field names separated by a plus sign (+). A minus sign (-) can be used to exclude a field or a fieldset from the result. Field names listed within brackets must be “known” to the DSL, and therefore only a subset of fields may be used in this syntax (see note below).

  grants:[project_num + title_original - language]
------------------------------------------------
<source>:[  <field>  +    <field>    - <field> ]

funders:[country_name + acronym +  name ]
---------------------------------------
<entity>:[  <field>  + <field> + <field>]

Note

The fields and fieldsets available for each source/entity are specified in the data configuration. Only fields/fieldsets listed in the configuration may be used in fields specifications of the two types listed above.

all

The fields specification may be the keyword all, to indicate that all fields available for the given source should be returned.

publications[all]
--------------
   <source> [all]

return in "facets"

For control over the organization and headers of the JSON query results, the return keyword in a return phrase may be followed by the keyword in and then a name for this group of results, enclosed in double quotes (").

return in "facets" funders return in "facets" research_orgs
-----------------------------------------
return in <group>  <facet> return in <group> <facet>

Each result returned in this return phrase will then be placed under the header of this group’s name in the final results (for an example, see Groups on the Example queries and results page).

Note

If the given group name has already been used in a previous return phrase, the result(s) from this return phrase will be added to that group. A result may not be added to an existing group which already contains a result of the same name.

aliases

The name of a source or facet to be returned in a return phrase may optionally be followed by the keyword as followed by an alias for this result in double quotes (").

The alias will then be used instead of the original source/facet name in the returned JSON results (for an example, see Aliases on the Example queries and results page).

return publications as "articles"
---------------------------------
return   <source>   as  <alias>

aggregate

In a return phrase requesting one or more facet results, aggregation operations to perform during faceting can be specified after the facet name(s) by using the keyword aggregate followed by a comma-separated list of one or more indicator names corresponding to the source being searched.

return research_orgs aggregate rcr_avg, altmetric_median
-----------------------------------------------------------------
return <facet>       aggregate <indicator>,<indicator>

Note

Every indicator appearing in the aggregations phrase must be either count or a pre-defined indicator for the given source as specified in the data configuration.

In addition to any specified aggregations, count is always computed and reported when facet results are requested. If no aggregations phrase is present, only count is computed. The following pairs of queries are therefore equivalent:

search publications return funders aggregate rcr_avg, count, altmetric_median
search publications return funders aggregate rcr_avg, altmetric_median
search grants return funders aggregate count
search grants return funders

sort

A sort order for the results in a given return phrase can be specified with the keyword sort by followed by the name of a field (if a source is being requested) or an indicator (if one or more facets are being requested). It is also possible to sort the result set by “relevance” to the full text query (search … for “full text query”). Additionally, the sort direction can be specified using the asc or desc keyword, for ascending or descending order respectively. By default, descending order is used.

return  grants  sort by  title    desc
----------------------------------------
return <source> sort by <field>  <order>

return  grants  sort by  relevance    desc
--------------------------------------------
return <source> sort by  relevance   <order>

return research_orgs aggregate altmetric_median, rcr_avg sort by rcr_avg
-------------------------------------------------------------------------
return   <facet(s)>  aggregate         <indicator(s)>    sort by <indicator>

Note

If a facet is being returned, the indicator used in the sort phrase must either be count (the default, such that sort by count is unnecessary), or one of the indicators specified in the aggregate phrase, i.e. one whose values are being computed in the faceting operation. Attempting to sort by an indicator other than count that does not appear in the aggregate phrase will result in an error, as in the invalid query search publications return funders aggregate altmetric_median sort by rcr_avg.
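
The rules for aggregate and sort on facets can be summarized in a short Python sketch. It only mirrors the rules stated in this section (count is always computed, and a sort indicator must be among the computed indicators); the function names are hypothetical and the check against the data configuration is not modelled.

```python
def effective_indicators(aggregates=()):
    """Indicators actually computed for a facet: count is always included."""
    result = list(aggregates)
    if "count" not in result:
        result.append("count")
    return result


def check_sort(sort_by, aggregates=()):
    """Return sort_by if it is valid for the given aggregate phrase, else raise."""
    if sort_by not in effective_indicators(aggregates):
        raise ValueError(
            "cannot sort by %r: not count and not in the aggregate phrase" % sort_by
        )
    return sort_by


print(effective_indicators(["rcr_avg", "altmetric_median"]))
print(check_sort("rcr_avg", ["altmetric_median", "rcr_avg"]))
```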

limit and skip

At the end of a return phrase, the user can specify the maximum number of results to be returned and the number of top records to skip over before returning the first result record. This is useful, for example, for returning large result sets page by page (i.e. “paging” results), as described below.

This is done using the keyword limit followed by the maximum number of results to return, optionally followed by the keyword skip and the number of results to skip (the offset).

return  publications limit 15  skip 30
---------------------------------------
return  <result>     limit <#> skip <#>

If paging information is not provided, the default values limit 20 skip 0 are used, so the two following queries are equivalent:

search grants for "malaria" return grants
search grants for "malaria" return grants limit 20 skip 0

Note

While a limit value may be specified without also specifying a skip value, skip may not be used on its own without limit; e.g. the query search grants return grants skip 20 is invalid and will result in an error. The valid alternative to this query would be search grants return grants limit 20 skip 20.

Combining limit and skip across multiple queries enables paging or batching of results; e.g. to retrieve 30 grant records divided into 3 pages of 10 records each, the following three queries could be used:

return grants limit 10           => get 1st 10 records for page 1 (skip 0, by default)
return grants limit 10 skip 10   => get next 10 for page 2; skip the 10 we already have
return grants limit 10 skip 20   => get another 10 for page 3, for a total of 30
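
The paging scheme above can be generated programmatically. The following Python sketch produces one query per page; paged_queries is a hypothetical helper name, and the base query is assumed to end with a return phrase that has no limit or skip of its own.

```python
def paged_queries(base_query, page_size, pages):
    """Yield one query per page by appending limit and skip to base_query."""
    for page in range(pages):
        yield "%s limit %d skip %d" % (base_query, page_size, page * page_size)


for query in paged_queries("search grants return grants", 10, 3):
    print(query)
```

The first generated query spells out skip 0 explicitly, which is equivalent to omitting skip.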

Function Syntax

The DSL also supports function expressions, which provide non-search functionality.

Calling a function extract_grants looks like this:

extract_grants  ("R01HL117329", fundref="100000050")
-------------------------------------------------------
<function name>:(positional arguments, named arguments)

Each function call starts with a function name followed by parentheses, between which the arguments are placed, separated by commas. The DSL supports two types of arguments:

Positional Arguments
These are placed without any name. They are just values. These are completely optional and can be replaced by named arguments. Their purpose is mainly to simplify calling functions with just one or two arguments, by omitting the argument name.
Named Arguments
Named arguments are always put after positional arguments and their order is not important. In the example above, one could also write extract_grants(fundref="100000050", grant_number="R01HL117329").

Arguments can be of various types, for example string or integer. The full list of supported functions, with descriptions, is available at DSL Functions.
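
A function expression can be assembled from Python values with a small helper, sketched below. build_call is a hypothetical name; the sketch renders integers bare and strings in double quotes, and places positional arguments before named ones as required above.

```python
import json


def build_call(name, *args, **kwargs):
    """Render a DSL function expression from positional and named arguments."""

    def literal(value):
        # Integers are written bare; strings are double-quoted.
        if isinstance(value, int):
            return str(value)
        return json.dumps(value, ensure_ascii=False)

    rendered = [literal(a) for a in args]
    rendered += ["%s=%s" % (key, literal(value)) for key, value in kwargs.items()]
    return "%s(%s)" % (name, ", ".join(rendered))


print(build_call("extract_grants", "R01HL117329", fundref="100000050"))
```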