A Tour of the DSL¶
The Dimensions Search Language (DSL) allows you to write concise expressions, called queries, to search and retrieve information from the Dimensions dataset. This page provides an interactive walkthrough of DSL query basics, to get you familiar with the language.
In this tutorial, suggested interactive exercises to build up your DSL knowledge will appear in boxes like this. This is the first one!
A simple query¶
Let’s start with a simple query example. One of the most basic and commonly-used types of questions we could ask about Dimensions data is something like, “Which are the most recent scholarly works related to a specific research topic?”.
We can answer such a question with a query like this, which asks: “Which are the most recent scientific publications related to malaria?”
search publications for "malaria" return publications
Whitespace is ignored in DSL queries, so feel free to use line breaks and indentation as you wish. The following query is exactly the same as the query above:
search publications
    for "malaria"
return publications
The anatomy of a query¶
Each DSL query is composed of two main parts:
A search phrase, e.g. search publications for "malaria", and
At least one return phrase, e.g. return publications.
The first part of the query, the search phrase, specifies the set of documents that we want to know something about. In the example search publications for "malaria", we are saying “I want to know something about publications related to malaria”.
The search phrase can optionally specify whether to search the full text or only the title and abstract. This is done with the specifier in [full_data | title_abstract_only], which follows the name of the source to search, for example search grants in title_abstract_only.
The second part of the query, the return phrase, specifies what we want to know about the documents specified in the search phrase. In the example return publications, we are saying “I want to see basic information about the publications matched by the search phrase”.
A DSL query is not complete without both a search phrase and at least one return phrase; we must tell the DSL both which data we are interested in and what we want to see from that data. Trying to run a query that has only one phrase or the other will result in an error, as you can see for yourself by trying to run these incomplete queries:
search publications for "malaria"
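The two-part structure described above can be sketched as a small Python string-building helper. The function name build_query and its parameters are hypothetical, introduced only for illustration; they are not part of the DSL or of any Dimensions client library, and the helper only concatenates text.

```python
def build_query(source, returns, terms=None):
    """Compose a DSL query string from one search phrase and 1+ return phrases.

    Hypothetical helper for illustration only; it does not contact Dimensions.
    """
    if not source:
        raise ValueError("a query needs a search phrase")
    if not returns:
        raise ValueError("a query needs at least one return phrase")
    # The search phrase: the source name, optionally followed by a for phrase.
    search_phrase = f"search {source}"
    if terms is not None:
        search_phrase += f' for "{terms}"'
    # One return phrase per requested result.
    return_phrases = " ".join(f"return {r}" for r in returns)
    return f"{search_phrase} {return_phrases}"


query = build_query("publications", ["publications"], terms="malaria")
print(query)
```

Calling build_query("publications", []) raises a ValueError, mirroring the rule that a query without a return phrase is incomplete.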
In the following sections we’ll explore what we can do with each of these two phrases.
search for source documents¶
As we’ve just seen, the search phrase specifies a set of documents that will provide the data for the results we’ll ask for in the return phrase.
As all Dimensions data must be taken from one of the supported sources, this phrase always starts with the word search followed by the name of the desired source, e.g. search publications. This is the only part of the search phrase that is not optional; the following is a valid (though perhaps not very useful) query where the search phrase specifies the set of all publications:
search publications return publications
Choose a document source¶
In the previous queries we’ve searched for publications, which is an example of a Dimensions data source, i.e. a type of scientific document/work that may be of interest. Others include grants and patents.
We can query another source by replacing publications with the name of another source, like grants. For example:
search grants for "malaria" return grants
Try changing the query above to search for another type of work, e.g. patents. (Hint: make sure you update both the search and return phrases; what happens when you only change one?)
For a list of all currently supported sources, see the Data Model page.
Search for research topics¶
As we’ve seen in the preceding examples, we’re often interested in scientific work related to a particular topic, e.g. malaria. In the for phrase, we can specify such a topic as one or more search terms that we want the retrieved documents to match. We can specify a multi-word phrase as well, for example:
search grants for "attention deficit disorder" return grants
Try modifying the query above to search for different terms/subjects. Take a look at the titles of the returned documents to see how they may be related to the subject(s) you searched for.
How does this type of search work behind the scenes? First, the complete text of all documents is analyzed, and each document is assigned a score that represents how well it matches the given term(s). Documents are then ranked by their score, with the best-matching documents appearing first in the results.
Restrict the search with filters¶
We can restrict the set of documents matched by the search phrase by specifying certain filters that must apply to the retrieved documents, in other words certain properties these documents must have. We specify these as filter expressions in the where phrase, as follows:
search grants where start_year>=2010 and funders.acronym="DFG" for "attention deficit disorder" return grants
The query above only matches grants that started in 2010 or later and whose funder has the acronym “DFG”.
Try to modify the query above to match grants that started only in 2010 (not in later years), then to match grants that started in 2010 or earlier. (You can probably guess what operators you’d need to use instead of >=, but consult the list of filter operators if you get stuck.)
Filters in the where phrase may also be used to retrieve one or more specific documents using an identifier such as a publication DOI, like so:
search publications where doi in ["10.1186/s12888-017-1463-3", "10.1186/s40479-017-0057-5", "10.1186/s12888-017-1222-5"] return publications
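When the DOIs come from a program rather than being typed by hand, the filter list can be assembled as a string. The following Python sketch uses a hypothetical helper, doi_filter_query, that is not part of the DSL or any client library. It assumes that double-quoted strings produced by json.dumps are acceptable DSL string literals for ordinary DOIs; escaping rules for unusual characters are not covered here.

```python
import json


def doi_filter_query(dois):
    """Build a DSL query that retrieves publications by DOI.

    Hypothetical helper for illustration; it only builds the query text.
    """
    if not dois:
        raise ValueError("at least one DOI is required")
    # json.dumps wraps each DOI in double quotes, as in the query above.
    doi_list = ", ".join(json.dumps(doi) for doi in dois)
    return f"search publications where doi in [{doi_list}] return publications"


query = doi_filter_query([
    "10.1186/s12888-017-1463-3",
    "10.1186/s40479-017-0057-5",
    "10.1186/s12888-017-1222-5",
])
print(query)
```

With the three DOIs shown, the helper reproduces the query above character for character.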
The fields, entities, etc. that may appear in filter expressions depend on the source being searched; in the queries above, we see the start_year field as well as the funders.acronym field being used to search grants, and the doi field being used to search publications.
If an unsupported field name is used, the DSL will report an error message to that effect, and provide a list of the supported fields to help you find the correct field name. For example, if you want to find grants that started in the year 2010 and create a filter for this using a field called startyear, this will result in an error, because the field name is not quite correct. Try it for yourself; this query will raise an error:
search grants where startyear=2010 return grants
Fix the query above to use the correct field name (hint: it’s in the error message).
Consult the documentation for the where phrase (including the types of expressions/operators that may be used) to find out more about the types of filters and operators you can use, and the lists of supported fields for each supported source to see which field and entity names can be used in filter expressions.
Play with the queries above to use other filter expression(s) that narrow the set of documents in different ways.
As you can see in the various queries above, the where phrase may be used in combination with a for phrase, or by itself. If both phrases are used, they can appear in either order; it does not matter which comes first. Behind the scenes, first the where filters will be applied to narrow down the set of documents, and then this restricted set of documents will be scored and ranked against the for search terms as described above.
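The filter-then-rank order described above can be illustrated with a toy Python sketch over a handful of in-memory records. Everything here is an assumption made for illustration: the records are made up, and toy_score simply counts term occurrences in the title, which is not how Dimensions actually scores documents.

```python
# Made-up grant records, used only to illustrate the order of operations.
grants = [
    {"title": "Attention deficit disorder in adults", "start_year": 2012, "funder": "DFG"},
    {"title": "Attention deficit disorder and sleep", "start_year": 2008, "funder": "DFG"},
    {"title": "Deficit models in economics", "start_year": 2014, "funder": "DFG"},
    {"title": "Attention and memory", "start_year": 2011, "funder": "OTHER"},
    {"title": "Malaria vaccine trial", "start_year": 2015, "funder": "DFG"},
]


def toy_score(text, terms):
    """Count how often each search term occurs in the text (toy scoring)."""
    words = text.lower().split()
    return sum(words.count(term) for term in terms.lower().split())


def toy_search(docs, terms, min_start_year, funder):
    # Step 1: apply the where filters to narrow down the set of documents.
    filtered = [
        d for d in docs
        if d["start_year"] >= min_start_year and d["funder"] == funder
    ]
    # Step 2: score the remaining documents against the for terms and rank them.
    scored = [(toy_score(d["title"], terms), d) for d in filtered]
    matching = [(score, d) for score, d in scored if score > 0]
    matching.sort(key=lambda pair: pair[0], reverse=True)
    return [d["title"] for _, d in matching]


results = toy_search(grants, "attention deficit disorder", 2010, "DFG")
print(results)
```

The grant from 2008 and the grant from the other funder never reach the scoring step, which is the point of applying filters first.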
return information about documents¶
As we said earlier, the search phrase allows us to specify which documents we’re interested in, but it doesn’t allow us to specify exactly what we want to know about those documents; for that we use the return phrase.
A return phrase comes after the search phrase, and always starts with the word return followed by a specification of the result we wish to see. In the queries we’ve seen so far, this has been only the most basic type of result: the source documents themselves, i.e. return publications or return grants. The return phrase gives us the flexibility to be much more specific about what we wish to see; in the following sections we’ll explore some of the possibilities.
Return source documents¶
If we are interested in knowing something about the source documents themselves, we can ask the DSL to show us only certain information (metadata) about each document by indicating the specific field(s) we wish to see, like so:
search publications for "malaria" return publications [doi + title + year]
Modify the query above to return different metadata by adding/removing/changing fields in the return phrase. (Consult the list of publications fields for ideas.)
As a shorthand for commonly requested data, the DSL also supports a few fieldsets for each source, so that the name of a fieldset can be used instead of a long list of fields. For example, the extras fieldset includes additional fields we may not have seen before:
search publications for "malaria" return publications [extras]
Compare the results of the query above to those of the query below, which uses the smaller basics fieldset:
search publications for "malaria" return publications[basics]
The basics fieldset is the default used when no fields or fieldsets are specified in the return phrase, such that return publications returns the exact same results as return publications[basics]. Try deleting [basics] from the query above to see for yourself.
For a complete list of supported fields and fieldsets for each source, see the supported data page of the documentation.
Sometimes we are not interested in data about each individual document, but rather in aggregate data about the set of documents matched by the search phrase. For example, perhaps you’re interested in publications about malaria, but don’t want to see the publications themselves; instead, you want to see the number of publications that appeared in each year. You can ask the DSL for that information with a return phrase like this:
search publications for "malaria" return year
When given this query, the DSL will find all publications matching the search phrase, then look at the year of each of those publications to tally up the count of matching publications for each year; the results will then present the years (not the publications themselves) with the highest publication counts.
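The tallying step described above can be mimicked in Python with a Counter. This is only a conceptual sketch over made-up records; the record layout and the shape of the resulting rows are assumptions, not the format the DSL returns.

```python
from collections import Counter

# Made-up publications standing in for the documents matched by the search phrase.
publications = [
    {"title": "A", "year": 2019},
    {"title": "B", "year": 2020},
    {"title": "C", "year": 2020},
    {"title": "D", "year": 2018},
    {"title": "E", "year": 2020},
    {"title": "F", "year": 2019},
]

# Tally the number of matching publications per year.
year_counts = Counter(pub["year"] for pub in publications)

# Present the years (not the publications), highest count first.
facet = [{"year": year, "count": count} for year, count in year_counts.most_common()]
print(facet)
```

Here 2020 comes first with three publications, followed by 2019 with two and 2018 with one.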
We may also be interested in the number of malaria publications supported by different funding organizations, which we could determine with a query like:
search publications for "malaria" return funders
We can calculate such aggregations via an operation called “faceting”, which is supported for certain fields of the given source; such fields are called facets. Consult the tables of supported source fields to learn which fields are facets.
Change the query above to aggregate over another publications facet. (Consult the list of publications fields for ideas.)
Specify entity fields¶
Because funders are an entity, the Dimensions database stores some metadata about each of these funders. Just as we could specify which fields we wanted to see for each of the publications we returned earlier, we can specify which fields (or fieldset) we’d like to see for each of the funders in the same way. Perhaps we’re only interested in each funder’s name:
search publications for "malaria" return funders[name]
Compare this to the results of the previous query, where we saw additional data like the country_name of each funder.
Play with the query above to add fields to the funders results (consult the list of funding organization fields).
Specifying fields is only possible for entity fields of a given source, such as the funders of publications. With a non-entity field, such as year, we cannot specify fields; a year has no name or other metadata. Which fields of a given source are entities and which are not can be seen in the field tables for each supported source (for example, Publications); each entity also has its own list of available fields/fieldsets.
Use different aggregation indicators¶
So far we’ve been interested in the number of publications each funder supported, but perhaps we’re interested in knowing which funders supported the most-cited publications. In that case, instead of just tallying the count of publications per funder, we could use a different aggregation indicator, e.g. take the average of the Relative Citation Ratio (RCR) score for each publication supported by a given funder. We can tell the DSL to do this by specifying rcr_avg as the aggregation indicator:
search publications for "malaria" return funders aggregate rcr_avg
As you can see in the results of the query above, now we see an rcr_avg score for each funder, but we still see the count in the results (this is always included in aggregation results), and the funders are still ranked by count. This is because count is the default sort indicator for any aggregations; if we want to sort by a different indicator, we just need to add a sort by phrase:
search publications for "malaria" return funders aggregate rcr_avg sort by rcr_avg
Compare the results of the query above to those of the previous query; as you can see, now the funders appear ranked by their average RCR score, not publication count.
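The difference between the two rankings can be reproduced with a toy Python aggregation. The records, the funder names, and the row layout below are made up for illustration, and each publication is given a single funder for simplicity; this is a conceptual sketch, not the DSL's implementation.

```python
from collections import defaultdict
from statistics import mean

# Made-up publications; "rcr" stands in for the Relative Citation Ratio.
publications = [
    {"funder": "Funder A", "rcr": 1.0},
    {"funder": "Funder A", "rcr": 2.0},
    {"funder": "Funder A", "rcr": 3.0},
    {"funder": "Funder B", "rcr": 5.0},
    {"funder": "Funder B", "rcr": 4.0},
]


def aggregate_by_funder(pubs, sort_by="count"):
    """Group publications by funder, then compute count and average RCR."""
    rcr_by_funder = defaultdict(list)
    for pub in pubs:
        rcr_by_funder[pub["funder"]].append(pub["rcr"])
    rows = [
        {"funder": funder, "count": len(values), "rcr_avg": mean(values)}
        for funder, values in rcr_by_funder.items()
    ]
    # count is always present; sort_by decides which indicator ranks the rows.
    return sorted(rows, key=lambda row: row[sort_by], reverse=True)


by_count = aggregate_by_funder(publications)
by_rcr = aggregate_by_funder(publications, sort_by="rcr_avg")
print([row["funder"] for row in by_count])  # Funder A first: 3 publications vs 2
print([row["funder"] for row in by_rcr])    # Funder B first: average 4.5 vs 2.0
```

Both result lists contain the same rows with the same count and rcr_avg values; only the ordering changes, which is what adding sort by does in the query above.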
Consult the list of indicators for each supported source to find out which indicator(s) can be used.
Try updating the query above to aggregate & sort the funders by the median altmetric score of the publications they supported.
Limit the number of results¶
In all of the results we’ve seen so far, whether we’re returning publications documents or an aggregation over funders, we’ve seen the top 20 entries for each. That is because the DSL is configured to return the first 20 results by default.
You can customize how many results are returned by setting a limit, as follows:
search publications for "malaria" return funders limit 5
Try it yourself by changing the limit in the query above. How high can it go?
Return multiple results¶
Instead of writing two separate queries that use the same search phrase and each ask for an aggregation over the results, like search publications for "malaria" return year and search publications for "malaria" return funders, we can combine them using multiple return phrases, like so:
search publications for "malaria" return year return funders
If you wish to also get data about the publications themselves, you can ask for that in a separate return phrase, like this:
search publications for "malaria" return year return funders return publications
And if you want to use a different aggregation indicator or limit for funders, but not years, you’ll need to use separate return phrases for each, e.g.:
search publications for "malaria" return year return funders aggregate rcr_avg limit 5 return publications
One way of looking at this is that we are using multiple return phrases to get multiple different “views” over a single set of data (the documents matching the search phrase).
Try changing the return phrases in the query above to view different results from the same search phrase.
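Since each return phrase carries its own options, a query with several views can be assembled phrase by phrase. The Python sketch below uses a hypothetical helper, return_phrase, that is not part of the DSL or of any client library. It only reproduces clause combinations shown on this page; placing sort by before limit when both are given is an assumption.

```python
def return_phrase(name, fields=None, aggregate=None, sort_by=None, limit=None):
    """Build one return phrase with its own optional settings.

    Hypothetical helper for illustration; it only builds the query text.
    """
    phrase = f"return {name}"
    if fields:
        phrase += " [" + " + ".join(fields) + "]"
    if aggregate:
        phrase += f" aggregate {aggregate}"
    if sort_by:
        # Assumption: sort by precedes limit when both are present.
        phrase += f" sort by {sort_by}"
    if limit is not None:
        phrase += f" limit {limit}"
    return phrase


# One search phrase, three independent views over the same documents.
query = " ".join([
    'search publications for "malaria"',
    return_phrase("year"),
    return_phrase("funders", aggregate="rcr_avg", limit=5),
    return_phrase("publications"),
])
print(query)
```

Because the options are attached per phrase, changing the limit for funders leaves the year and publications views untouched.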