A Tour of the DSL
The Dimensions Search Language (DSL) allows you to write concise expressions, called queries, to search and retrieve information from the Dimensions dataset. This page provides an interactive walkthrough of DSL query basics, to get you familiar with the language.
Todo
In this tutorial, suggested interactive exercises to build up your DSL knowledge will appear in boxes like this. This is the first one!
A simple query
Let’s start with a simple query example. One of the most basic and commonly-used types of questions we could ask about Dimensions data is something like, “Which are the most recent scholarly works related to a specific research topic?”.
We can answer such a question with a query like this, which asks: “Which are the most recent scientific publications related to malaria?”
search publications for "malaria" return publications
Whitespace is ignored in DSL queries, so feel free to use line breaks and indentation as you wish. The following query is exactly the same as the query above:
search publications
for "malaria"
return publications
The anatomy of a query
Each DSL query is composed of two main parts:
A search phrase, e.g. search publications for "malaria", and
At least one return phrase, e.g. return publications
The first part of the query, the search phrase, specifies the set of documents that we want to know something about. In the example search publications for "malaria", we are saying “I want to know something about publications related to malaria”.
The search phrase can optionally specify whether to search the full text or only the title and abstract. This is done with the specifier in [full_data | title_abstract_only], placed directly after the source name, as in search grants in title_abstract_only.
The second part of the query, the return phrase, specifies what we want to know about the documents specified in the search phrase. In the example return publications, we are saying “I want to see basic information about the publications matched by the search phrase”.
A DSL query is not complete without both a search phrase and at least one return phrase; we must tell the DSL both which data we are interested in and what we want to see from that data. Running a query that has only one phrase or the other results in an error, as you can see for yourself by running each of the following two incomplete queries on its own:
search publications for "malaria"
return publications
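The two-part structure can be captured in a small helper that assembles a query string. This is a minimal sketch in Python: build_query and its parameters are hypothetical names introduced for illustration, not part of the DSL or of any Dimensions client library. It only concatenates text and refuses to build a query without a return phrase; it does not send anything to Dimensions.

```python
def build_query(source, terms=None, returns=()):
    """Assemble a DSL query string (hypothetical helper, illustration only)."""
    if not source:
        raise ValueError("the search phrase needs a source, e.g. 'publications'")
    if not returns:
        raise ValueError("a query needs at least one return phrase")
    parts = [f"search {source}"]
    if terms:
        # The for phrase is optional; the terms are wrapped in double quotes.
        parts.append(f'for "{terms}"')
    parts.extend(f"return {result}" for result in returns)
    return " ".join(parts)


query = build_query("publications", terms="malaria", returns=["publications"])
print(query)
```

Calling build_query("publications", terms="malaria") without any returns raises a ValueError, mirroring the rule that a search phrase alone is not a complete query.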
In the following sections we’ll explore what we can do with each of these two phrases.
search for source documents
As we’ve just seen, the search phrase specifies a set of documents that will provide the data for the results we’ll ask for in the return phrase.
As all Dimensions data must be taken from one of the supported sources, this phrase always starts with the word search followed by the name of the desired source, e.g. search publications. This is the only part of the search phrase that is not optional; the following is a valid (though perhaps not very useful) query where the search phrase specifies the set of all publications:
search publications return publications
Choose a document source
In the previous queries we’ve searched for publications, which is an example of a Dimensions data source, i.e. a type of scientific document/work that may be of interest. Others include grants, patents, clinical_trials, researchers and policy_documents.
We can query another source by replacing publications with the name of another source, like grants. For example:
search grants for "malaria" return grants
Todo
Try changing the query above to search for another type of work, e.g. patents. (Hint: make sure you update both the search and return phrases; what happens when you only change one?)
For a list of all currently supported sources, see the Data Sources page.
Search for research topics
As we’ve seen in the preceding examples, we’re often interested in scientific work related to a particular topic, e.g. malaria. In the for phrase, we can specify such a topic as one or more search terms that we want the retrieved documents to match. We can specify a multi-word phrase as well, for example:
search grants for "attention deficit disorder" return grants
Todo
Try modifying the query above to search for different terms/subjects. Take a look at the titles of the returned documents to see how they may be related to the subject(s) you searched for.
How does this type of search work behind the scenes? First, the complete text of all documents is analyzed, and each document is assigned a score that represents how well it matches the given term(s). Documents are then ranked by their score, with the best-matching documents appearing first in the results.
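The score-then-rank idea can be illustrated with a toy example. The scoring function Dimensions actually uses is not described here, so this sketch assumes a deliberately simple stand-in: the score is the number of times the term occurs in the text. The function name and the sample documents are hypothetical.

```python
def rank_by_term_count(documents, term):
    """Toy relevance ranking: score = occurrences of the term in the text."""
    term = term.lower()
    scored = [(doc_id, text.lower().count(term)) for doc_id, text in documents.items()]
    # Documents that never mention the term do not match at all.
    matches = [(doc_id, score) for doc_id, score in scored if score > 0]
    # Best-matching documents come first.
    return sorted(matches, key=lambda pair: pair[1], reverse=True)


docs = {
    "A": "Malaria vaccines and malaria control",
    "B": "Dengue surveillance",
    "C": "A malaria case study",
}
ranking = rank_by_term_count(docs, "malaria")
print(ranking)  # [('A', 2), ('C', 1)]
```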
Restrict the search with filters
We can restrict the set of documents matched by the search phrase by specifying certain filters that must apply to the retrieved documents, in other words certain properties these documents must have. We specify these as filter expressions in the where phrase, as follows:
search grants where start_year>=2010 and funders.acronym="DFG" for "attention deficit disorder" return grants
The query above matches only grants that started in 2010 or later and whose funder has the acronym “DFG”.
Todo
Try to modify the query above to match grants that started in 2010 only (not in later years), then to match grants that started in 2010 or earlier. (You can probably guess which operators you’d need to use instead of >=, but consult the list of filter operators if you get stuck.)
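When filters are assembled programmatically, the where phrase can be generated from a list of conditions. The sketch below is an assumption-laden illustration: where_phrase is a hypothetical helper, it only joins conditions with and, and it quotes string values while leaving numbers bare, following the pattern of the example query above.

```python
def where_phrase(filters):
    """Join (field, operator, value) triples into a where phrase (hypothetical helper)."""
    expressions = []
    for field, operator, value in filters:
        # Strings are quoted as in the tutorial's examples; numbers are left bare.
        rendered = f'"{value}"' if isinstance(value, str) else str(value)
        expressions.append(f"{field}{operator}{rendered}")
    return "where " + " and ".join(expressions)


filters = [("start_year", ">=", 2010), ("funders.acronym", "=", "DFG")]
query = f'search grants {where_phrase(filters)} for "attention deficit disorder" return grants'
print(query)
```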
Filters in the where phrase may also be used to retrieve one or more specific documents using an identifier such as a publication DOI, like so:
search publications
where doi in ["10.1186/s12888-017-1463-3", "10.1186/s40479-017-0057-5", "10.1186/s12888-017-1222-5"]
return publications
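If the identifiers come from a Python list, the in [...] filter can be generated rather than typed by hand. This is a small sketch; doi_filter is a hypothetical helper name, and it simply reproduces the quoting and comma-separated list style of the query above.

```python
def doi_filter(dois):
    """Render a list of DOIs as a 'where doi in [...]' phrase (hypothetical helper)."""
    quoted = ", ".join(f'"{doi}"' for doi in dois)
    return f"where doi in [{quoted}]"


dois = ["10.1186/s12888-017-1463-3", "10.1186/s40479-017-0057-5"]
query = f"search publications {doi_filter(dois)} return publications"
print(query)
```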
The fields, entities, etc. that may appear in filter expressions depend on the source being searched; in the queries above, we see the start_year field as well as the funders entity’s acronym field being used to search grants, and the doi field being used to search publications.
If an unsupported field name is used, the DSL will report an error to that effect and list the supported fields to help you find the correct field name. For example, if you want to find grants that started in the year 2010 and create a filter for this using a field called startyear, this will result in an error, because the field name is not quite correct. Try it for yourself - this query will raise an error:
search grants where startyear=2010 return grants
Todo
Fix the query above to use the correct field name (hint: it’s in the error message).
Consult the documentation for the where phrase (including the types of expressions/operators that may be used) to find out more about the types of filters and operators you can use, and the lists of supported fields for each supported source to see which field and entity names can be used in filter expressions.
Todo
Play with the queries above to use other filter expression(s) that narrow the set of documents in different ways.
Note
As you can see in the various queries above, the where phrase may be used in combination with a for phrase, or by itself. If both phrases are used, they can appear in either order - it does not matter which comes first. Behind the scenes, first the where filters will be applied to narrow down the set of documents, and then this restricted set of documents will be scored & ranked against the for search terms as described above.
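The order of operations in this note (filter first, then score and rank) can be modeled on toy data. This is only a sketch under stated assumptions: the records, the search function and the term-count score are hypothetical stand-ins, not the actual Dimensions implementation.

```python
def search(documents, term, min_year):
    """Toy model of the described order: apply the filter, then score and rank."""
    # Step 1: the where-style filter narrows the candidate set.
    candidates = [doc for doc in documents if doc["start_year"] >= min_year]
    # Step 2: only the remaining documents are scored against the search term.
    scored = [(doc["id"], doc["title"].lower().count(term.lower())) for doc in candidates]
    ranked = sorted((pair for pair in scored if pair[1] > 0), key=lambda pair: pair[1], reverse=True)
    return [doc_id for doc_id, _ in ranked]


grants = [
    {"id": "g1", "start_year": 2008, "title": "Malaria malaria malaria"},
    {"id": "g2", "start_year": 2012, "title": "Malaria vector control"},
    {"id": "g3", "start_year": 2015, "title": "Malaria drugs and malaria vaccines"},
]
result = search(grants, "malaria", 2010)
print(result)  # ['g3', 'g2']
```

Note that g1 has the highest term count but never appears in the result, because the filter removes it before any scoring takes place.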
return information about documents
As we said earlier, the search phrase allows us to specify which documents we’re interested in, but it doesn’t allow us to specify exactly what we want to know about those documents; for that we use the return phrase(s).
A return phrase comes after the search phrase, and always starts with the word return followed by a specification of the result we wish to see. In the queries we’ve seen so far, this has been only the most basic type of result: the source documents themselves, i.e. return publications or return grants.
However, the return phrase gives us the flexibility to be much more specific about what we wish to see; in the following sections we’ll explore some of the possibilities.
Return source documents
If we are interested in knowing something about the source documents themselves, we can ask the DSL to show us only certain information (metadata) about each document by indicating the specific field(s) we wish to see, like so:
search publications for "malaria" return publications [doi + title + year]
As in the search phrase, the field(s) that may appear in such a return phrase depend on the source being searched/returned.
Todo
Modify the query above to return different metadata by adding/removing/changing fields in the return phrase. (Consult the list of publications fields for ideas.)
As a shorthand for commonly requested data, the DSL also supports a few fieldsets for each source, so that the name of a fieldset can be used instead of a long list of fields. For example, the extras fieldset includes additional fields we may not have seen before:
search publications for "malaria" return publications [extras]
Compare the results of the query above to those of the query below, which uses the smaller basics fieldset:
search publications for "malaria" return publications[basics]
The basics fieldset is the default used when no fields or fieldsets are specified in the return phrase, so return publications returns exactly the same results as return publications[basics]. Try deleting [basics] from the query above to see for yourself.
For a complete list of supported fields and fieldsets for each source, see the supported data page of the documentation.
Calculate aggregations
Sometimes we are not interested in data about each individual document, but rather in aggregate data about the set of documents matched by the search phrase. For example, perhaps you’re interested in publications about malaria, but don’t want to see the publications themselves; instead, you want to see the number of publications that appeared in each year. You can ask the DSL for that information with a return phrase like this:
search publications for "malaria" return year
When given this query, the DSL will find all publications matching the search phrase, then look at the year of each of those publications to tally up the count of matching publications for each year; the results will then present the years (not the publications themselves) with the highest publication counts.
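The tallying step can be pictured with a few lines of Python. This is a toy model only: the sample records and the facet_counts function are hypothetical, and the sketch says nothing about how the DSL computes facets internally.

```python
from collections import Counter


def facet_counts(documents, field, limit=20):
    """Count how many documents share each value of a field, highest count first."""
    counts = Counter(doc[field] for doc in documents if field in doc)
    return counts.most_common(limit)


publications = [
    {"title": "P1", "year": 2019},
    {"title": "P2", "year": 2020},
    {"title": "P3", "year": 2020},
    {"title": "P4", "year": 2018},
    {"title": "P5", "year": 2020},
    {"title": "P6", "year": 2019},
]
by_year = facet_counts(publications, "year")
print(by_year)  # [(2020, 3), (2019, 2), (2018, 1)]
```

The output lists years with their counts, not the publications themselves, which is the same shift in perspective that return year makes.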
We may also be interested in the number of malaria publications supported by different funding organizations, which we could determine with a query like:
search publications for "malaria" return funders
We can calculate such aggregations via an operation called “faceting”, which is supported for certain fields of the given source; such fields are called facets. Consult the tables of supported source fields to learn which fields are facets.
Todo
Change the query above to aggregate over another publications facet. (Consult the list of publications fields for ideas.)
Specify entity fields
Since funders are an entity, the Dimensions database stores some metadata about each of these funders. Just as we could specify which fields we wanted to see for each of the publications we returned earlier, we can specify which fields (or fieldset) we’d like to see for each of the funders in the same way. Perhaps we’re only interested in each funder’s name:
search publications for "malaria" return funders[name]
Compare this to the results of the previous query, where we saw additional data like the acronym and country_name of each funder.
Todo
Play with the query above to add fields to the funders results (consult the list of funding organization fields).
Note
Specifying fields is only possible for entity fields of a given source, such as funders of publications. With a non-entity field, such as year
, we cannot specify fields; a year has no name or other metadata. Which fields of a given source are entities or non-entities can be seen in the field tables for the supported Data Sources, and for each of the Entities there is also a list of available fields/fieldsets.
Use different aggregation indicators
So far we’ve been interested in the number of publications each funder supported, but perhaps we’re interested in knowing which funders supported the most-cited publications. In that case, instead of just tallying the count of publications per funder, we could use a different aggregation indicator, e.g. take the average of the Relative Citation Ratio (RCR) score for each publication supported by a given funder. We can tell the DSL to do this by specifying rcr_avg as the aggregation indicator:
search publications for "malaria" return funders aggregate rcr_avg
As you can see in the results of the query above, now we see an rcr_avg score for each funder, but we still see the count in the results (this is always included in aggregation results), and the funders are still ranked by count. This is because count is the default sort indicator for any aggregations; if we want to sort by a different indicator, we just need to add a sort by phrase:
search publications for "malaria" return funders aggregate rcr_avg sort by rcr_avg
Compare the results of the query above to those of the previous query; as you can see, now the funders appear ranked by their average RCR score, not publication count.
Consult the list of indicators for each supported source to find out which indicator(s) can be used.
Todo
Try updating the query above to aggregate & sort the funders by the median Altmetric Attention Score of the publications they supported.
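The difference between ranking funders by count and ranking them by an averaged indicator is easy to see on toy data. The sketch below is illustrative only: the records, the rcr values and the aggregate_by_funder function are hypothetical, and the real aggregation happens inside Dimensions, not in client code.

```python
from collections import defaultdict
from statistics import mean


def aggregate_by_funder(publications, sort_by="count"):
    """Group toy publications by funder and compute count and average RCR."""
    grouped = defaultdict(list)
    for pub in publications:
        for funder in pub["funders"]:
            grouped[funder].append(pub["rcr"])
    rows = [
        {"funder": funder, "count": len(scores), "rcr_avg": mean(scores)}
        for funder, scores in grouped.items()
    ]
    # count is always reported; sort_by only decides the ranking.
    return sorted(rows, key=lambda row: row[sort_by], reverse=True)


publications = [
    {"funders": ["Funder A"], "rcr": 1.0},
    {"funders": ["Funder A"], "rcr": 2.0},
    {"funders": ["Funder A", "Funder B"], "rcr": 3.0},
    {"funders": ["Funder B"], "rcr": 5.0},
]
print([row["funder"] for row in aggregate_by_funder(publications)])
print([row["funder"] for row in aggregate_by_funder(publications, sort_by="rcr_avg")])
```

With these sample values, Funder A leads when ranked by count (3 publications against 2), while Funder B leads when ranked by rcr_avg (4.0 against 2.0).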
Limit the number of results
In all of the results we’ve seen so far, whether we’re returning publications documents or an aggregation over year or funders, we’ve seen the top 20 entries for each. That is because the DSL is configured to return the first 20 results by default.
You can customize how many results are returned by setting a limit, as follows:
search publications for "malaria" return funders limit 5
Todo
Try it yourself by changing the limit in the query above. How high can it go?
Return multiple results
Instead of writing two separate queries that share the same search phrase and each ask for an aggregation over the results, like search publications for "malaria" return year and search publications for "malaria" return funders, we can combine them using multiple return phrases, like so:
search publications for "malaria"
return year
return funders
If you wish to also get data about the publications themselves, you can ask for that in a separate return phrase, like this:
search publications for "malaria"
return year
return funders
return publications
And if you want to use a different aggregation indicator or limit for funders, but not years, you’ll need to use separate return phrases for each, e.g.:
search publications for "malaria"
return year
return funders aggregate rcr_avg limit 5
return publications
One way of looking at this is that we are using multiple return phrases to get multiple different “views” over a single set of data (the documents matching the search phrase).
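A query with several return phrases can also be assembled from parts, with each return phrase carrying its own options. This is a minimal sketch; return_phrase and multi_return_query are hypothetical helper names, and the output only reproduces the phrase order used in the examples above (return, then aggregate, then limit).

```python
def return_phrase(result, aggregate=None, limit=None):
    """Build one return phrase with optional aggregate and limit parts."""
    parts = [f"return {result}"]
    if aggregate:
        parts.append(f"aggregate {aggregate}")
    if limit is not None:
        parts.append(f"limit {limit}")
    return " ".join(parts)


def multi_return_query(search_phrase, return_phrases):
    """Put the search phrase and each return phrase on its own line."""
    return "\n".join([search_phrase, *return_phrases])


query = multi_return_query(
    'search publications for "malaria"',
    [
        return_phrase("year"),
        return_phrase("funders", aggregate="rcr_avg", limit=5),
        return_phrase("publications"),
    ],
)
print(query)
```

Keeping each view in its own return_phrase call makes it explicit that the indicator and limit apply to funders only.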
Warning
Returning the same field both as a facet and as a field might produce incomplete results. For example, the query search publications return publications[researchers] return researchers[all] would fail to retrieve the all fieldset for the researchers facet; only the basics fieldset would be returned instead. In this case, it might be better to split the query into two and return the facet separately.