A Tour of the DSL

The Dimensions Search Language (DSL) allows you to write concise expressions, called queries, to search and retrieve information from the Dimensions dataset. This page provides an interactive walkthrough of DSL query basics, to get you familiar with the language.

Todo

In this tutorial, suggested interactive exercises to build up your DSL knowledge will appear in boxes like this. This is the first one!

A simple query

Let’s start with a simple query example. One of the most basic and commonly-used types of questions we could ask about Dimensions data is something like, “Which are the most recent scholarly works related to a specific research topic?”.

We can answer such a question with a query like this, which asks: “Which are the most recent scientific publications related to malaria?”

search publications for "malaria" return publications

Whitespace is ignored in DSL queries, so feel free to use line breaks and indentation as you wish. The following query is exactly the same as the query above:

search publications
  for "malaria"
return publications
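To make the whitespace rule concrete, here is a minimal Python sketch that treats the two queries above as plain strings and shows they collapse to the same text. The helper name normalize is hypothetical and not part of the DSL. It also collapses whitespace inside quoted search terms, which is harmless for this example but is a simplification.

```python
def normalize(query: str) -> str:
    """Collapse every run of whitespace (spaces, tabs, newlines) into one space."""
    # Simplification: this also touches whitespace inside quoted terms.
    return " ".join(query.split())


one_line = 'search publications for "malaria" return publications'
multi_line = '''
search publications
  for "malaria"
return publications
'''

# Both layouts reduce to the same sequence of tokens.
print(normalize(multi_line) == normalize(one_line))
```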

The anatomy of a query

Each DSL query is composed of two main parts:

  • A search phrase, e.g. search publications for "malaria", and

  • At least one return phrase, e.g. return publications

The first part of the query, the search phrase, specifies the set of documents that we want to know something about. In the example search publications for "malaria", we are saying “I want to know something about publications related to malaria”.

The search phrase can optionally specify whether to run a full-text search or to search only in titles and abstracts. This is done with the specifier in [full_data | title_abstract_only], placed after the name of the source to search, as in search grants in title_abstract_only.

The second part of the query, the return phrase, specifies what we want to know about the documents specified in the search phrase. In the example return publications, we are saying “I want to see basic information about the publications matched by the search phrase”.

A DSL query is not complete without both a search phrase and at least one return phrase; we must tell the DSL both which data we are interested in and what we want to see from that data. Trying to run a query that has only one phrase or the other will result in an error, as you can see for yourself by running each of these incomplete queries on its own:

search publications for "malaria"
return publications

In the following sections we’ll explore what we can do with each of these two phrases.

search for source documents

As we’ve just seen, the search phrase specifies a set of documents that will provide the data for the results we’ll ask for in the return phrase.

As all Dimensions data must be taken from one of the supported sources, this phrase always starts with the word search followed by the name of the desired source, e.g. search publications. This is the only part of the search phrase that is not optional; the following is a valid (though perhaps not very useful) query where the search phrase specifies the set of all publications:

search publications return publications

Choose a document source

In the previous queries we’ve searched for publications, which is an example of a Dimensions data source, i.e. a type of scientific document/work that may be of interest. Others include grants, patents, clinical_trials, researchers and policy_documents.

We can query another source by replacing publications with the name of another source, like grants. For example:

search grants for "malaria" return grants

Todo

Try changing the query above to search for another type of work, e.g. patents. (Hint: make sure you update both the search and return phrases; what happens when you only change one?)

For a list of all currently supported sources, see the Data Sources page.

Search for research topics

As we’ve seen in the preceding examples, we’re often interested in scientific work related to a particular topic, e.g. malaria. In the for phrase, we can specify such a topic as one or more search terms that we want the retrieved documents to match. We can specify a multi-word phrase as well, for example:

search grants for "attention deficit disorder" return grants

Todo

Try modifying the query above to search for different terms/subjects. Take a look at the titles of the returned documents to see how they may be related to the subject(s) you searched for.

How does this type of search work behind the scenes? First, the complete text of all documents is analyzed, and each document is assigned a score that represents how well it matches the given term(s). Documents are then ranked by their score, with the best-matching documents appearing first in the results.

Restrict the search with filters

We can restrict the set of documents matched by the search phrase by specifying certain filters that must apply to the retrieved documents, in other words certain properties these documents must have. We specify these as filter expressions in the where phrase, as follows:

search grants where start_year>=2010 and funders.acronym="DFG" for "attention deficit disorder" return grants

The query above only matches grants that started in the year 2010 or later and whose supporting funder has the acronym “DFG”.

Todo

Try to modify the query above to match grants that started only in 2010 (not in later years), then to match grants that started in 2010 or earlier. (You can probably guess what operators you’d need to use instead of >=, but consult the list of filter operators if you get stuck.)

Filters in the where phrase may also be used to retrieve one or more specific documents using an identifier such as a publication DOI, like so:

search publications
where doi in ["10.1186/s12888-017-1463-3", "10.1186/s40479-017-0057-5", "10.1186/s12888-017-1222-5"]
return publications
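When the identifiers come from a program rather than being typed by hand, the query text can be assembled from a list. The sketch below is one way to do that in Python. The function name doi_filter_query is hypothetical, and it only produces the query string; how the string is sent to the DSL is outside its scope.

```python
import json


def doi_filter_query(dois):
    """Build a publications query that filters on a list of DOIs."""
    if not dois:
        raise ValueError("at least one DOI is required")
    # json.dumps renders a list of strings as ["a", "b", ...]: double-quoted
    # and comma-separated, the same shape as the list in the query above.
    return f"search publications where doi in {json.dumps(dois)} return publications"


query = doi_filter_query([
    "10.1186/s12888-017-1463-3",
    "10.1186/s40479-017-0057-5",
])
print(query)
```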

The fields, entities, etc. that may appear in filter expressions depend on the source being searched; in the queries above, we see the start_year field as well as the funders entity’s acronym field being used to search grants, and the doi field being used to search publications.

If an unsupported field name is used, the DSL will report an error message to that effect, and provide a list of the supported fields to help you find the correct field name. For example, if you want to find grants that started in the year 2010 and create a filter for this using a field called startyear, this will result in an error, because the field name is not quite correct. Try it for yourself - this query will raise an error:

search grants where startyear=2010 return grants

Todo

Fix the query above to use the correct field name (hint: it’s in the error message).

Consult the documentation for the where phrase (including the types of expressions/operators that may be used) to find out more about the types of filters and operators you can use, and the lists of supported fields for each supported source to see which field and entity names can be used in filter expressions.

Todo

Play with the queries above to use other filter expression(s) that narrow the set of documents in different ways.

Note

As you can see in the various queries above, the where phrase may be used in combination with a for phrase, or by itself. If both phrases are used, they can appear in either order - it does not matter which comes first. Behind the scenes, first the where filters will be applied to narrow down the set of documents, and then this restricted set of documents will be scored & ranked against the for search terms as described above.
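The two-step order described in this note (filter first, then score and rank) can be imitated on a handful of made-up records. This is only a toy model: the records are invented, and toy_score simply counts term occurrences in the title, which is an assumption for illustration and not the scoring the DSL actually uses.

```python
# Invented grant records, for illustration only.
grants = [
    {"title": "Attention deficit disorder in adults", "start_year": 2012, "funder": "DFG"},
    {"title": "Attention deficit disorder: attention training", "start_year": 2015, "funder": "DFG"},
    {"title": "Attention deficit disorder in children", "start_year": 2008, "funder": "DFG"},
    {"title": "Malaria vaccine trials", "start_year": 2014, "funder": "DFG"},
    {"title": "Attention deficit disorder genetics", "start_year": 2016, "funder": "NIH"},
]


def toy_score(text, terms):
    """Count how often the search terms occur in the text (toy scoring)."""
    words = text.lower().replace(":", " ").split()
    return sum(words.count(term) for term in terms)


def toy_search(records, year_from, funder, terms):
    # Step 1: the `where` filters narrow down the set of documents.
    filtered = [r for r in records
                if r["start_year"] >= year_from and r["funder"] == funder]
    # Step 2: only the remaining documents are scored against the `for` terms.
    scored = [(toy_score(r["title"], terms), r) for r in filtered]
    matching = [(score, r) for score, r in scored if score > 0]
    # Best-matching documents come first.
    matching.sort(key=lambda pair: pair[0], reverse=True)
    return [r["title"] for _, r in matching]


results = toy_search(grants, 2010, "DFG", ["attention", "deficit", "disorder"])
for title in results:
    print(title)
```

The 2008 grant and the NIH grant never reach the scoring step, and the malaria grant passes the filters but scores zero against the search terms.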

return information about documents

As we said earlier, the search phrase allows us to specify which documents we’re interested in, but it doesn’t allow us to specify exactly what we want to know about those documents; for that we use the return phrase(s).

A return phrase comes after the search phrase, and always starts with the word return followed by a specification of the result we wish to see. In the queries we’ve seen so far, this has been only the most basic type of result: the source documents themselves, i.e. return publications or return grants.

However, the return phrase gives us the flexibility to be much more specific about what we wish to see; in the following sections we’ll explore some of the possibilities.

Return source documents

If we are interested in knowing something about the source documents themselves, we can ask the DSL to show us only certain information (metadata) about each document by indicating the specific field(s) we wish to see, like so:

search publications for "malaria" return publications [doi + title + year]

As in the search phrase, the field(s) that may appear in such a return phrase depend on the source being searched/returned.

Todo

Modify the query above to return different metadata by adding/removing/changing fields in the return phrase. (Consult the list of publications fields for ideas.)

As a shorthand for commonly requested data, the DSL also supports a few fieldsets for each source, so that the name of a fieldset can be used instead of a long list of fields. For example, the extras fieldset includes additional fields we may not have seen before:

search publications for "malaria" return publications [extras]

Compare the results of the query above to those of the query below, which uses the smaller basics fieldset:

search publications for "malaria" return publications[basics]

The basics fieldset is the default used when no fields or fieldsets are specified in the return phrase, such that return publications returns the exact same results as return publications[basics]. Try deleting [basics] from the query above to see for yourself.

For a complete list of supported fields and fieldsets for each source, see the supported data page of the documentation.

Calculate aggregations

Sometimes we are not interested in data about each individual document, but rather in aggregate data about the set of documents matched by the search phrase. For example, perhaps you’re interested in publications about malaria, but don’t want to see the publications themselves; instead, you want to see the number of publications that appeared in each year. You can ask the DSL for that information with a return phrase like this:

search publications for "malaria" return year

When given this query, the DSL will find all publications matching the search phrase, then look at the year of each of those publications to tally up the count of matching publications for each year; the results will then present the years (not the publications themselves) with the highest publication counts.
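The tallying described above is essentially a group-and-count. As a rough illustration, the Python sketch below counts a small invented list of publication years and orders them by count; it mirrors the idea only, not how the DSL computes the result internally.

```python
from collections import Counter

# Invented publication years standing in for the matched publications.
publication_years = [2019, 2020, 2020, 2018, 2020, 2019, 2021]

# Tally the number of publications per year.
counts = Counter(publication_years)

# most_common() orders the years by descending count; years with equal
# counts keep the order in which they were first seen.
facet = counts.most_common()
for year, count in facet:
    print(year, count)
```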

We may also be interested in the number of malaria publications supported by different funding organizations, which we could determine with a query like:

search publications for "malaria" return funders

We can calculate such aggregations via an operation called “faceting”, which is supported for certain fields of the given source; such fields are called facets. Consult the tables of supported source fields to learn which fields are facets.

Todo

Change the query above to aggregate over another publications facet. (Consult the list of publications fields for ideas.)

Specify entity fields

Since funders are an entity, the Dimensions database stores some metadata about each of these funders. Just as we could specify which fields we wanted to see for each of the publications we returned earlier, we can specify which fields (or fieldset) we’d like to see for each of the funders in the same way. Perhaps we’re only interested in each funder’s name:

search publications for "malaria" return funders[name]

Compare this to the results of the previous query, where we saw additional data like the acronym and country_name of each funder.

Todo

Play with the query above to add fields to the funders results (consult the list of funding organization fields).

Note

Specifying fields is only possible for entity fields of a given source, such as funders of publications. With a non-entity field, such as year, we cannot specify fields; a year has no name or other metadata. Which fields of a given source are entities/non-entities can be seen in the field tables for the supported Data Sources, and for each of the entities there is also a list of available fields/fieldsets.

Use different aggregation indicators

So far we’ve been interested in the number of publications each funder supported, but perhaps we’re interested in knowing which funders supported the most-cited publications. In that case, instead of just tallying the count of publications per funder, we could use a different aggregation indicator, e.g. take the average of the Relative Citation Ratio (RCR) score for each publication supported by a given funder. We can tell the DSL to do this by specifying rcr_avg as the aggregation indicator:

search publications for "malaria" return funders aggregate rcr_avg

As you can see in the results of the query above, now we see an rcr_avg score for each funder, but we still see the count in the results (this is always included in aggregation results), and the funders are still ranked by count. This is because count is the default sort indicator for any aggregations; if we want to sort by a different indicator, we just need to add a sort by phrase:

search publications for "malaria" return funders aggregate rcr_avg sort by rcr_avg

Compare the results of the query above to those of the previous query; as you can see, now the funders appear ranked by their average RCR score, not publication count.
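The difference between the two orderings can be reproduced on toy data. In the sketch below, the funder names and RCR values are invented, and the aggregate function is a hypothetical stand-in for what aggregate rcr_avg and sort by express in the DSL: every row always carries a count, and the sort key is chosen separately.

```python
from collections import defaultdict
from statistics import mean

# Invented publications with a funder and an RCR score each.
pubs = [
    {"funder": "Funder A", "rcr": 1.0},
    {"funder": "Funder A", "rcr": 2.0},
    {"funder": "Funder A", "rcr": 3.0},
    {"funder": "Funder B", "rcr": 4.0},
    {"funder": "Funder B", "rcr": 6.0},
]


def aggregate(publications, sort_by="count"):
    """Group by funder, compute count and average RCR, sort descending."""
    groups = defaultdict(list)
    for pub in publications:
        groups[pub["funder"]].append(pub["rcr"])
    rows = [
        {"funder": funder, "count": len(scores), "rcr_avg": mean(scores)}
        for funder, scores in groups.items()
    ]
    return sorted(rows, key=lambda row: row[sort_by], reverse=True)


by_count = aggregate(pubs)                   # default: ranked by count
by_rcr = aggregate(pubs, sort_by="rcr_avg")  # ranked by average RCR

print([row["funder"] for row in by_count])
print([row["funder"] for row in by_rcr])
```

Funder A leads when ranking by count (three publications), while Funder B leads when ranking by average RCR.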

Consult the list of indicators for each supported source to find out which indicator(s) can be used.

Todo

Try updating the query above to aggregate & sort the funders by the median Altmetric Attention Score of the publications they supported.

Limit the number of results

In all of the results we’ve seen so far, whether we’re returning publications documents or an aggregation over years or funders, we’ve seen the top 20 entries for each. That is because the DSL is configured to return 20 results by default, and for those to be the first 20 results.

You can customize how many results are returned by setting a limit, as follows:

search publications for "malaria" return funders limit 5

Todo

Try it yourself by changing the limit in the query above. How high can it go?

Return multiple results

Instead of writing two separate queries that use the same search phrase and each ask for an aggregation over the results, like search publications for "malaria" return year and search publications for "malaria" return funders, we can combine them using multiple return phrases, like so:

search publications for "malaria"
return year
return funders
return funders

If you wish to also get data about the publications themselves, you can ask for that in a separate return phrase, like this:

search publications for "malaria"
return year
return funders
return publications

And if you want to use a different aggregation indicator or limit for funders, but not years, you’ll need to use separate return phrases for each, e.g.:

search publications for "malaria"
return year
return funders aggregate rcr_avg limit 5
return publications

One way of looking at this is that we are using multiple return phrases to get multiple different “views” over a single set of data (the documents matching the search phrase).
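If queries are generated by a script, the "one search phrase, several views" structure maps naturally onto one string plus a list. The following Python sketch assumes a hypothetical helper build_query that only assembles the query text and enforces the rule that at least one return phrase is present.

```python
def build_query(search_phrase, return_phrases):
    """Join one search phrase with one or more return phrases."""
    if not return_phrases:
        raise ValueError("a query needs at least one return phrase")
    # Each view over the matched documents becomes its own `return` line.
    lines = [search_phrase] + [f"return {phrase}" for phrase in return_phrases]
    return "\n".join(lines)


query = build_query(
    'search publications for "malaria"',
    ["year", "funders aggregate rcr_avg limit 5", "publications"],
)
print(query)
```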

Warning

Returning the same field both as a facet and as a field of the source documents might produce incomplete results. For example, the query search publications return publications[researchers] return researchers[extras] would fail to retrieve the extras fieldset for the researchers facet; only the basics fieldset would be returned. In this case, it may be better to split the query in two and return the facet separately.