Query Syntax¶
The DSL aims to capture the type of interaction with Dimensions data that users are accustomed to performing graphically via the web application, and enable web app developers, power users, and others to carry out such interactions by writing query statements in a syntax loosely inspired by SQL but particularly suited to our specific domain and data organization.
Basic query structure¶
DSL queries consist of two required components:
a search
phrase that indicates the scientific records
to be searched,
and one or more return
phrases which specify the contents and structure
of the desired results.
The simplest valid DSL query is of the form:
search grants return grants
---------------|---------------
search <source>|return <result>
A more useful query might also make use of the optional for
and where
phrases to limit the set of records returned.
search grants for "lung cancer" where active_year=2000 return grants
--------------------------------------------------------|---------------
search <source>|for <terms> |where <filters> |return <result>
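The phrase structure shown above can also be assembled programmatically. The following Python sketch is only an illustration and is not part of the DSL or of any Dimensions client library; the helper name `build_query` and its parameters are hypothetical.

```python
def build_query(source, result, terms=None, filters=None):
    """Assemble a DSL query string from its phrases (hypothetical helper)."""
    parts = ["search", source]
    if terms is not None:
        # The optional for phrase takes a search term string in double quotes.
        parts.append('for "{}"'.format(terms))
    if filters is not None:
        # The optional where phrase takes one or more filter expressions.
        parts.append("where " + filters)
    parts.extend(["return", result])
    return " ".join(parts)


simple = build_query("grants", "grants")
limited = build_query("grants", "grants", terms="lung cancer",
                      filters="active_year=2000")
print(simple)
print(limited)
```

The helper does no escaping; search terms containing quotes or special characters need the escaping rules described in the for section.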
search source
¶
A query must begin with the word search
followed by a source name, i.e. the name
of a type of scientific record, such as grants
or publications
.
search grants
---------------
search <source>
The source name may be followed by an optional for phrase that provides search terms to rank records against,
and/or an optional where phrase that limits the set of records that will be searched. The for
and where
phrases may be in either order.
search grants for "laryngectomy" where start_year=2000
--------------------------------------------------------
search <source>|for <terms> |where <filters>
search grants where start_year=2000 for "laryngectomy"
-------------------------------------------------------
search <source>|where <filters> |for <terms>
Full-text Searching¶
Full-text search, or keyword search, finds all instances of a term (keyword) in a document or group of documents. Full-text search works by using search indexes, each of which targets a specific section of a document, e.g. its abstract, authors, or full text.
search publications in full_data for "Apollo 11" return publications
----------------------------------------------------------------------
search <source> | in <search index> | for <keywords> | return <source>
in [search index]
¶
This optional phrase consists of the particle in
followed by a term indicating a search index, specifying for example whether the search is limited to full text, title and abstract only, or title only. Please check
supported sources to see which search indexes each source supports.
search grants in full_data for "something" return grants
---------------------------------------------------------------------
search grants in title_abstract_only for "something" return grants
Special search indexes for person names make it possible to perform full-text searches on publications authors
or grants inventors
. Please see the authors search section for more information on how searches work in this case.
search publications in authors for "\"Jennifer A Doudna\"" return publications
for "search term"
¶
This optional phrase consists of the keyword for
followed by a search term
string, enclosed in double quotes ("
).
for "motor neuron disease"
--------------------------
for <terms (string)>
Strings in double quotes can contain nested quotes escaped by a backslash \
.
This will ensure that the string in nested double quotes is searched for as
if it was a single phrase, not multiple words (note: this applies both to full-text searching and field searching).
An example of a phrase: "\"Machine Learning\""
: results must contain Machine Learning
as a phrase.
Example of multiple keywords: "Machine Learning"
: this searches for keywords independently.
Note
Special characters, such as any of ^ " : ~ \ [ ] { } ( ) ! | & +
must be escaped by a backslash \
.
Examples
Searching for phrase “How is mechanobiology involved in mesenchymal stem cell differentiation toward the osteoblastic or adipogenic fate?”
Special characters, such as question marks, must be escaped by a backslash
\
search publications for "How is mechanobiology involved in mesenchymal stem cell differentiation toward the osteoblastic or adipogenic fate\?" return publications
Searching for “dose” or “concentration”
Special characters, such as parentheses, must not be escaped here, because they are used to construct a boolean query (see next section).
search publications for "(dose OR concentration)" return publications
Searching for “haskell unit ()”
Special characters, such as parentheses, must be escaped here, because we are searching for them literally.
search publications for "haskell unit \(\)" return publications
Attention
Please note escaping rules in Python. For example, when writing a query with escaped quotes, such as:
search publications for "\"phrase 1\" AND \"phrase 2\""
in Python, it is necessary to escape the backslashes as well, so it would look like:
import requests

# `headers` is assumed to be defined earlier and to carry the API credentials.
resp = requests.post(
    'https://<your-url.dimensions.ai>/api/dsl.json',
    data='search publications for "\\"phrase 1\\" AND \\"phrase 2\\""',
    headers=headers)
Similar escaping rules might apply to other programming languages than Python as well.
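The escaping rules above can be sketched in Python as follows. This is a minimal illustration, not an official utility: the helper names are hypothetical, and the character set is taken from the note above plus the question mark used in the first example.

```python
# Special characters listed in the note above, plus "?" from the first example.
SPECIAL_CHARS = set('^":~\\[]{}()!|&+?')


def escape_literal(text):
    """Prefix every special character with a backslash (hypothetical helper)."""
    return "".join("\\" + ch if ch in SPECIAL_CHARS else ch for ch in text)


def as_phrase(text):
    """Wrap text in escaped quotes so that it is searched as a single phrase."""
    return '\\"' + escape_literal(text) + '\\"'


def for_clause(term):
    """Build the for phrase around an already prepared search term."""
    return 'for "' + term + '"'


literal = for_clause(escape_literal("haskell unit ()"))
phrase = for_clause(as_phrase("Machine Learning"))
print(literal)
print(phrase)
```

Use `escape_literal` only when the characters are meant literally; parentheses and operators that form a boolean query must be left unescaped.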
Boolean Operators¶
A search term can consist of multiple keywords or phrases connected using boolean logic operators, e.g.
AND
, OR
and NOT
.
search publications for "(dose OR concentration)" return publications
The full specification is shown in the table below. This table specifies the “standardQuery” grammar.
Boolean Operator | Alternative Symbol | Description |
---|---|---|
AND | | Requires both terms on either side of the Boolean operator to be present for a match. |
NOT | | Requires that the following term not be present. |
OR | | Requires that either term (or both terms) be present for a match. |
 | | Requires that the following term be present. |
 | | Prohibits the following term (that is, matches on fields or documents that do not include that term). The |
Note
When specifying Boolean operators with keywords such as AND
, OR
and NOT
, the keywords must appear in all uppercase.
Wildcard Searches¶
The DSL supports single and multiple character wildcard searches within single terms. Wildcard characters can be applied to single terms, but not to search phrases.
search publications in title_only for "ital*" return publications
Wildcard Search Type | Special Character | Example |
---|---|---|
Single character - matches a single character | | The search string |
Multiple characters - matches zero or more sequential characters | * | The wildcard search: |
Note
Wildcard matches may only be used for word suffixes, i.e. ital* is a correct wildcard search, whereas *ital is not accepted. The wildcard match in this case is stripped out and a warning is issued.
Proximity Searches¶
A proximity search looks for terms that are within a specific distance from one another.
To perform a proximity search, add the tilde character ~
and a numeric value to the end of a search phrase. For example, to search for apache
and jakarta
within 10 words of each other in a document, use the search:
search publications for "\"jakarta apache\"~10" return publications
The distance referred to here is the number of term movements needed to match the specified phrase. In the example above, if apache
and jakarta
were 10 spaces apart in a field, but apache
appeared before jakarta
, more than 10 term movements would be required to move the terms together and position apache
to the right of jakarta
with a space in between.
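As a small illustration of the syntax above, a proximity term could be built as follows in Python. The helper name `proximity_term` is hypothetical and not part of the DSL.

```python
def proximity_term(words, distance):
    """Return a search term matching `words` within `distance` term movements."""
    if distance < 0:
        raise ValueError("distance must not be negative")
    # The phrase goes in escaped double quotes, followed by ~ and the distance.
    return '\\"{}\\"~{}'.format(" ".join(words), distance)


term = proximity_term(["jakarta", "apache"], 10)
query = 'search publications for "{}" return publications'.format(term)
print(query)
```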
Field Searching¶
Field searching allows a specific field of a source to be used as a query filter. For example, this can be a literal field, such as the type of a publication, its date, or its MeSH terms. It can also be an entity field, such as the journal title of a publication or the country name of its author affiliations.
A complete list of fields available as filters for each source can be found in the supported data section. See also the entity fields section for more information on how entity fields differ from simple literal fields.
search publications where type = "book" return publications
-----------------------------------------------------------
search <source> | where <filter> | return <source>
where
¶
This optional phrase consists of the keyword where
followed by a
filters
phrase consisting of DSL
filter expressions, as described below.
where research_org_countries="Germany"
---------------------------------------------
where <filters>
Note
If a for phrase is also used in a filtered query, the system will first apply the filters, and then search the resulting restricted set of documents for the search term.
in
¶
For convenience, the DSL also supports shorthand notation for filters where a particular field should be restricted to a specified range or list of values (although the same logic may be expressed using complex filters as shown below).
A range filter consists of the field
name, the keyword in
, and a range
of values enclosed in square brackets ([]
), where the range consists of
a low
value, colon :
, and a high
value.
start_year in [ 2010 : 2015 ]
----------|--|---------------
<field> in <range>
-|----|-|----|-
[ <low>:<high>]
The results are inclusive of both the low and high values, such that the following two restriction phrases give the same results:
where start_year in [2010:2015]
where (start_year>=2010 and start_year<=2015)
A list filter consists of the field
name, the keyword in
, and a list
of one or more value
s enclosed in square brackets ([]
),
where values are separated by commas (,
):
research_org_countries in [ "Germany", "Italy", "France" ]
-----------------|--|--------------------------------------
<field> in <list>
-|-------------|-------|------------|-
[ <value> , <value> , <value> ]
The following two restriction phrases give the same results:
where start_year in [2000, 2005, 2010]
where (start_year=2000 or start_year=2005 or start_year=2010)
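The equivalences above can be made concrete with a short Python sketch that renders both the shorthand and the expanded form of each filter. All function names are hypothetical, and the value rendering is simplified (strings are quoted without any escaping).

```python
def render(value):
    """Render a filter value: strings in double quotes, integers as-is."""
    return '"{}"'.format(value) if isinstance(value, str) else str(value)


def range_filter(field, low, high):
    """Shorthand range filter; both bounds are inclusive."""
    return "{} in [{}:{}]".format(field, low, high)


def expanded_range_filter(field, low, high):
    """The same restriction written with comparison operators."""
    return "({0}>={1} and {0}<={2})".format(field, low, high)


def list_filter(field, values):
    """Shorthand list filter with comma-separated values."""
    return "{} in [{}]".format(field, ", ".join(render(v) for v in values))


def expanded_list_filter(field, values):
    """The same restriction written as a chain of equality filters."""
    return "(" + " or ".join(
        "{}={}".format(field, render(v)) for v in values) + ")"


print(range_filter("start_year", 2010, 2015))
print(expanded_range_filter("start_year", 2010, 2015))
print(list_filter("start_year", [2000, 2005, 2010]))
print(expanded_list_filter("start_year", [2000, 2005, 2010]))
```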
count
¶
Functions can be applied to fields in filters. Currently, the only supported filter function is count, which is available on some fields in publications (e.g. researchers and research_orgs).
The use of this filter is shown in the example below:
count(research_orgs) = 1
-------------------------------------------
<filter function> <simple filter>
Literal Fields Vs Entity Fields¶
In addition to restricting the values of a particular literal field of the source being searched, filter expressions may also restrict results using an entity field, i.e. restricting the values of a particular field of an entity related to this record.
For example, when searching for grants, we may wish to restrict results to those whose funder has a certain acronym or whose research organization has a certain name, as in the following queries:
search grants where funders.acronym="DFG" return grants
search grants where research_orgs.name~"Harvard" return grants
Such an entity metadata filter consists of the entity
name, followed by
a dot (.
) and the name of the field
on the entity record
which we would like to restrict, followed by the same content as for
Filter Operators or in: namely, an op
operator
followed by a value
, or the keyword in
followed by a range or list
of values in brackets ([]
).
funders.acronym != "NHLBI"
---------|-------|----|-------
<entity>.<field> <op> <value>
funders.acronym in ["EC", "DFG"]
---------|-------|--|-------------
<entity>.<field> in <list>
Warning
Entity fields should be used with caution. Entity metadata filters impose some internal limitations, and users should understand how they work in order to interpret the results they produce.
If the user queries research_orgs.country_name = "South Korea", the DSL first executes a query to retrieve up to 450 organization IDs from South Korea, and these IDs are then used in the main query.
The limit of 450 organizations applies in total across all entity queries and is split between the entity filters. This means that the filter research_orgs.country_name = "South Korea" or research_orgs.country_name = "Japan" will retrieve up to 225 organization IDs from South Korea and up to 225 organization IDs from Japan.
As a result, it may come as a surprise that this filter does not return all research organizations from South Korea or Japan, but only a limited subset.
A better way to express this filter is research_org_country_names in ["South Korea", "Japan"], which uses the field research_org_country_names and does not trigger an extra query.
Therefore, no limit on the number of entities is imposed on it.
Please see the Data Model for the list of supported fields and their meaning.
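The arithmetic behind the warning above can be sketched as follows. Only the total of 450 and the two-filter case (225 each) come from the text; the even split with integer division for other counts is an assumption, and the function name is hypothetical.

```python
TOTAL_ENTITY_ID_LIMIT = 450  # total stated in the warning above


def ids_per_entity_filter(num_entity_filters):
    """IDs available to each entity filter, assuming an even split."""
    if num_entity_filters < 1:
        raise ValueError("at least one entity filter is required")
    return TOTAL_ENTITY_ID_LIMIT // num_entity_filters


one_filter = ids_per_entity_filter(1)
two_filters = ids_per_entity_filter(2)
print(one_filter, two_filters)
```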
Filter Operators¶
A simple filter expression consists of
a field
name, an in-/equality operator op
,
and the desired field value
.
The value
must be a string enclosed in double quotes ("
) or an integer (e.g. 1234
).
The available op
operators are:
 | meaning |
---|---|
= | is (or contains if the given |
!= | is not |
> | is greater than |
< | is less than |
>= | is greater than or equal to |
<= | is less than or equal to |
~ | partially matches (see Partial string matching with ~ below) |
is empty | is empty (see Emptiness filters is empty below) |
is not empty | is not empty (see Emptiness filters is empty below) |
start_year >= 2010
----------|----|-------
<field> <op> <value>
Partial string matching with ~
¶
The ~
operator indicates that the given field
need only partially,
instead of exactly, match the given string
(the value
used with this operator must be a string, not an integer).
For example, the filter where research_orgs.name~"Saarland Uni"
would match both the organization named “Saarland University” and the one named “Universitätsklinikum des Saarlandes”, and any other organization whose name includes the terms “Saarland” and “Uni” (the order is unimportant). However, the filter where research_orgs.name="Saarland Uni"
would not match either of these two organizations, as the =
operator requires an exact match.
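The difference between the two operators can be modelled with a few lines of Python. This is only a sketch of the behaviour described above, not the matching logic the DSL actually runs: treating each term as a case-insensitive substring is an assumption, and both function names are hypothetical.

```python
def partial_match(query, value):
    """Model of ~: every query term occurs somewhere in the value."""
    text = value.lower()
    return all(term in text for term in query.lower().split())


def exact_match(query, value):
    """Model of =: the whole value must equal the query string."""
    return query == value


names = ["Saarland University", "Universitätsklinikum des Saarlandes"]
partial_results = [partial_match("Saarland Uni", name) for name in names]
exact_results = [exact_match("Saarland Uni", name) for name in names]
print(partial_results)
print(exact_results)
```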
Emptiness filters is empty
¶
To return only records in which a specific field has a value, or only records in which it is empty, use a filter such as where research_orgs is not empty
or where issn is empty
.
Chaining Filters¶
More complex filter expressions may be created by combining multiple
simple filters using the following boolean
operators,
possibly grouped using parentheses (()
):
 | meaning |
---|---|
A and B | include results which match both filters A and B |
A or B | include results which match either filter A, or filter B, or both |
A not B | include results which match filter A and do not match filter B |
The following are all examples of valid filter expressions:
research_org_countries="Germany" or research_org_countries="Czech Republic"
---------------------------------------|---------|----------------------------------------------
<simple filter> <boolean> <simple filter>
start_year<=2010 and ( active_year=2015 or active_year=2016 )
----------------|-----|-----------------------------------------
<simple filter> <bl.> <complex filter>
-|-----------------|--|----------------|-
( <simple filter> <b.> <simple filter> )
( start_year>=2008 and start_year<=2010 ) and ( active_year=2015 or active_year=2016 )
-----------------------------------------|-----|----------------------------------------
<complex filter> <bl.> <complex filter>
-|----------------|---|----------------|- -|-----------------|--|-----------------|-
( <simple filter> <bl.> <simple filter> ) ( <simple filter> <bl> <simple filter> )
If multiple boolean
operators are used in a complex filter expression without parentheses,
the system will apply the operators using the following order of operator precedence:
not applies first
and applies next
or applies last
To deviate from these precedence rules, parentheses (()
) may be used to
explicitly specify the order in which operators should be applied.
To illustrate, the following filter expressions are equivalent:
start_year<=2010 not start_year=2005 and active_year=2015 or active_year=2010
((start_year<=2010 not start_year=2005) and active_year=2015) or active_year=2010
The above filters will match documents which either:
started in 2010 or earlier but did not start in 2005, and were active in 2015; or
were active in 2010
The following expression uses the same operators, but specifies a different order of operations using parentheses:
(start_year<=2010 not start_year=2005) and (active_year=2015 or active_year=2010)
In contrast to the previous two filters, this filter will only match documents which:
started in 2010 or earlier but did not start in 2005, and were active in 2015 or 2010
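The effect of the two groupings can be checked with plain Python booleans. In this sketch, "A not B" is written as "A and not B", and each record is simplified to a single start year and a single active year, which is an assumption made only for illustration.

```python
def default_precedence(start_year, active_year):
    # A not B and C or D  is read as  ((A and not B) and C) or D
    return ((start_year <= 2010 and not start_year == 2005)
            and active_year == 2015) or active_year == 2010


def regrouped(start_year, active_year):
    # (A not B) and (C or D)
    return (start_year <= 2010 and not start_year == 2005) and (
        active_year == 2015 or active_year == 2010)


# (start_year, active_year) pairs used as sample records
records = [(2008, 2015), (2012, 2010), (2005, 2015)]
results = [(default_precedence(s, a), regrouped(s, a)) for s, a in records]
for record, result in zip(records, results):
    print(record, result)
```

The second record (started in 2012, active in 2010) is the one that separates the two expressions: it matches under the default precedence but not under the regrouped filter.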
Note
Spaces around operators and parentheses are optional. They may be used to make queries easier to read, but are ignored by the query parser.
Note
Outermost parentheses around the filter expression(s) are optional and have no effect, such that the following pairs are equivalent:
where start_year=2010
where (start_year=2010)
where start_year<=2010 and (active_year=2015 or active_year=2016)
where (start_year<=2010 and (active_year=2015 or active_year=2016))
Searching for Researchers/Authors¶
The DSL offers different mechanisms for searching for researchers (e.g. publication authors, grant investigators), each of them presenting specific advantages.
Using Full Text Indexes¶
Using this method one can look up a researcher name and surname exactly as they appear in source documents (that is, when that document was published).
This approach searches the full collection of Dimensions documents, irrespective of whether a researcher has been disambiguated and hence given a Dimensions ID.
On the other hand, this approach will only match names as they appear in a document, so different spellings or initials are not necessarily returned via a single query. In order to address this problem, researchers search can be used.
search in [authors]
¶
Authors search can be used to look up publications authors using a specific search index called authors
.
Authors search expects case insensitive phrases, in format “<first name> <last name>” or reverse order.
search publications in authors for "\"Daniel Hook\"" return publications
--------------------------------------------------------------------------------------
search <source> | in <search index> | for <first name> <last name> | return <source>
Instead of first name, initials can also be used. These are examples of valid research search phrases:
\"Hook, Daniel W.\"
\"Daniel W. Hook\"
\"DW Hook\"
\"Hook DW\"
\"D W Hook\"
\"Hook D W\"
\"D Hook\"
\"Hook D\"
\"Daniel Hook\"
\"Hook Daniel\"
Commas and dots are ignored, so they may or may not be used. Authors search supports only the + and - operators.
search in [investigators|inventors]
¶
Investigators search is similar to authors search, except that it searches grants and clinical trials using a separate search index, investigators, and patents using the index inventors.
search clinical_trials in investigators for "smith" return clinical_trials
search grants in investigators for "\"Satoko Shimazaki\"" return grants
search patents in inventors for "\"John Smith\"" return patents
Note
In order to produce valid results, an author or investigator search query must contain at least two components (e.g. name and surname, either in full or as initials).
Using Field Filters¶
This type of search is similar to full-text search, with the difference that it allows searching by only a part of a name, such as only a last name for example.
search where [authors]
¶
This syntax makes it possible to search for only part of an author’s name.
For example:
search publications where authors = "Hawking" return publications
--------------------------------------------------------------------------------------
search <source> | where authors <part of name> | return <source>
Generally speaking, using a where
clause to search authors is less precise than using the relevant full-text index. On the other hand, using a where
clause can be handy if one wants to combine an author search with another full-text search index.
For example:
search publications in title_abstract_only for "dna replication" where authors = "smith" return publications
-----------------------------------------------------------------------------------------------------------------------------
search <source> | in <search-index> | for <keywords> | where authors <part of name> | return <source>
Note
At this moment, this type of search is only available for publications
. Other sources will add this option in the future.
Using Researchers Source¶
The Dimensions researchers source is a catalogue of researcher information algorithmically extracted and disambiguated from all of the other content sources (publications, grants, clinical trials, etc.).
The researchers
database makes it possible to match a person ‘object’ that links together multiple publication authors or grant investigators, irrespective of the form their names take in the original source documents.
Examples:
search researchers for "\"Satoko Shimazaki\"" return researchers
search researchers where last_name="Shimazaki" return researchers
Note
The researchers
database is the result of a vetted computational process and so it does not contain all authors and investigators information available in Dimensions. E.g. think of authors from older publications, or authors with very common names that are difficult to disambiguate, or very new authors, who have only one or few publications. In such cases, using full-text authors search might be more appropriate.
Warning
Please note that Python and other programming languages have special escaping rules when writing a query with escaped quotes.
Returning results¶
After the search
phrase, a query must contain at least one return
phrase,
specifying the content and format of the information that should be returned.
Examples:
return grants [extras]
------------------------
return <src>[<fieldset>]
return in "docs" grants[title] as "projects" sort by title limit 10 skip 20
-----------------------------------------------------------------------------------
return in <group> <src>[<fields>] as <alias> sort by <field> limit <#> skip <#>
Note
It is possible to add set return_all_keys
as a prefix to the query, in order for the DSL to return all fields, even those which have no value. They will be returned in the JSON with null as a value.
This is in contrast with the default behavior, where fields with no value are omitted from the output.
Example: set return_all_keys search publications return publications
.
Returning Specific Fields¶
For control over which information from each given record will be returned,
a source name in the results
phrase can be optionally
followed by a specification of fields
and fieldsets
to be included in the
JSON results for each retrieved record.
The possible types of fields
specifications are described below.
fields
¶
The fields specification may be an arbitrary list of field names
enclosed in brackets ([
, ]
), with field names separated by a plus sign (+
).
A minus sign (-) can be used to exclude a field or a fieldset from the result.
Field names thus listed within brackets must be “known” to the DSL,
and therefore only a subset of fields may be used in this syntax (see note below).
grants:[project_num + title_original - language]
------------------------------------------------
<source>:[ <field> + <field> - <field> ]
fieldsets
¶
The fields specification may be the name of a pre-defined fieldset
(e.g. extras
, basics
). The fields corresponding to that fieldset
will be included in the result.
publications[extras]
------------------------
<source> [<fieldset>]
Note
The fields and fieldsets available for each source/entity are specified in
the data sources section.
Only fields/fieldsets listed in the configuration
may be used in fields
specifications of the two types listed above.
Paginating Results¶
At the end of a return
phrase, the user can specify
the maximum number of results to be returned
and the number of top records to skip over before returning
the first result record, e.g. for returning large result sets page-by-page
(i.e. “paging” results) as described below.
This is done using the keyword limit
followed by the maximum number
of results to return, optionally followed by the keyword skip
and the number of results to skip (the offset).
return publications limit 15 skip 30
---------------------------------------
return <result> limit <#> skip <#>
If paging information is not provided, the default values
limit 20 skip 0
are used, so the two following queries are equivalent:
search grants for "malaria" return grants
search grants for "malaria" return grants limit 20 skip 0
Note
While a limit
value may be specified without also specifying
a skip
value, skip
may not be used on its own without limit
;
e.g. the query search grants return grants skip 20
is invalid and
will result in an error. The valid alternative to this query would be
search grants return grants limit 20 skip 20
.
Combining limit
and skip
across multiple queries
enables paging or batching of results;
e.g. to retrieve 30 grant records divided into 3 pages of 10 records each,
the following three queries could be used:
return grants limit 10 => get 1st 10 records for page 1 (skip 0, by default)
return grants limit 10 skip 10 => get next 10 for page 2; skip the 10 we already have
return grants limit 10 skip 20 => get another 10 for page 3, for a total of 30
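The paging pattern above can be generated in a loop. The following Python sketch only builds the query strings; sending them to the API is out of scope here, and the function name `paged_queries` is hypothetical.

```python
def paged_queries(base_query, page_size, num_pages):
    """Yield one query per page, advancing skip by page_size each time."""
    for page in range(num_pages):
        yield "{} limit {} skip {}".format(base_query, page_size,
                                           page * page_size)


pages = list(paged_queries("search grants return grants", 10, 3))
for page_query in pages:
    print(page_query)
```

The first query states skip 0 explicitly, which is equivalent to omitting skip because 0 is the default.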
Sorting Results¶
A sort order for the results in a given return
phrase can be specified
with the keyword sort by
followed by the name of a field. It is also possible
to sort the result set by “relevance” to the full text query (search … for “full text query”).
Additionally, it is possible to specify the sort order, using asc
or desc
keywords.
These keywords specify ascending and descending ordering of results, respectively.
By default, descending order is selected.
return grants sort by title desc
----------------------------------------
return <source> sort by <field> <order>
return grants sort by relevance desc
--------------------------------------------
return <source> sort by relevance <order>
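The field, sorting and paging options of a return phrase can be combined in one helper. This is a minimal Python sketch with hypothetical names; it covers only fields, sort by, limit and skip, and it enforces the rule that skip may not be used without limit.

```python
def return_phrase(source, fields=None, sort_by=None, order=None,
                  limit=None, skip=None):
    """Build a return phrase (hypothetical helper, partial syntax only)."""
    if skip is not None and limit is None:
        raise ValueError("skip may not be used without limit")
    if order is not None and order not in ("asc", "desc"):
        raise ValueError("order must be 'asc' or 'desc'")
    if order is not None and sort_by is None:
        raise ValueError("order requires a sort field")
    phrase = "return " + source
    if fields:
        # Field names are listed in brackets, separated by plus signs.
        phrase += "[" + " + ".join(fields) + "]"
    if sort_by is not None:
        phrase += " sort by " + sort_by
        if order is not None:
            phrase += " " + order
    if limit is not None:
        phrase += " limit {}".format(limit)
    if skip is not None:
        phrase += " skip {}".format(skip)
    return phrase


sorted_page = return_phrase("grants", fields=["title"], sort_by="title",
                            order="desc", limit=10, skip=20)
by_relevance = return_phrase("grants", sort_by="relevance", order="desc")
print(sorted_page)
print(by_relevance)
```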
Formal language specification¶
The general structure of a query is visualized using railroad diagrams. Rectangles represent non-terminals and elliptic shapes represent terminal symbols.
Simply put, terminal symbols are actual tokens or “words” in the DSL, while non-terminals can be expanded further. This section shows how the language is structured. For example, a query starts with the terminal search, followed by a NAME, which is basically a string such as publications. Alternatively, a query can be a Basic functions structure. Then there is an optional part with filters, introduced by the where terminal (keyword). A query may have one or more result non-terminal sections.
[Railroad diagrams not reproduced here. Rules shown: search, searchFor, restriction, filters, entityFilter, simpleFilterEntity, value, literal, span, array, stringArray, integerArray, op, result, group, record, fields, aggregate, sort, paging, funExpression, funArgs, funArgNamed, funArgValue, identifyReviewers, describe, metaExpr.]
Note
Some non-terminals, such as NAME, STRING, and INTEGER, are not defined using railroad diagrams; their syntax is defined by the following regular expressions:
NAME:
[a-zA-Z_]+
STRING:
'"' ('\\"'|.)*? '"'
INTEGER:
'-'?[0-9]+
Other than that, tokens can be separated using as much white space as desired and the DSL considers all [ \t\r\n]+
to be a white space.
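The three token definitions above can be tried out with Python's `re` module. The patterns below are a hand translation of the grammar notation into Python regular expression syntax, so they should be read as an approximation rather than as the parser's actual implementation; `token_kind` is a hypothetical helper.

```python
import re

NAME = re.compile(r"[a-zA-Z_]+")
# Translation of  '"' ('\\"'|.)*? '"' : an opening quote, then escaped quotes
# or any other character (non-greedy), then a closing quote.
STRING = re.compile(r'"(?:\\"|.)*?"', re.DOTALL)
INTEGER = re.compile(r"-?[0-9]+")


def token_kind(text):
    """Return the name of the first token pattern matching the whole text."""
    for kind, pattern in (("NAME", NAME), ("STRING", STRING),
                          ("INTEGER", INTEGER)):
        if pattern.fullmatch(text):
            return kind
    return None


print(token_kind("start_year"))
print(token_kind("-2010"))
print(token_kind('"\\"Daniel Hook\\""'))
print(token_kind("grants2"))
```

Because NAME allows only letters and underscores, a string such as grants2 is not recognised as any of the three tokens in this sketch.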
Full-text query formal language specification¶
[Railroad diagrams not reproduced here. Rules shown: limitedQuery, limitedClause, booleanPrefix, standardQuery, standardClause, booleanQuery.]
Note
The STRING non-terminal is not defined using a railroad diagram; its syntax is defined by the following regular expressions:
STRING:
((UNQUOTED_STRING | QUOTED_STRING) (~ INTEGER)?)+
ESCAPES:
\^|\"|\*|\?|\:|\~|\\|\[|\]|\{|\}|\(|\)|\!|\||\&|\+|\-
UNQUOTED_STRING:
([\p{Letter}\p{Number}\p{Symbol}.*?] | ESCAPES)+
QUOTED_STRING:
"(~")*"
INTEGER:
'-'?[0-9]+
Unicode classes (Letter, Number and Symbol) are Unicode General Category values.
Other than that, tokens can be separated using as much white space as desired and the DSL considers all [ \t\r\n]+
to be a white space.