Strategies for cost reduction

In most scenarios, the primary cost of using BigQuery comes from the data processed by queries run against a dataset. Optimizing your queries can lower costs and improve the performance of dashboards and reports—Google has several well-documented suggestions (1, 2), and we provide some Dimensions-specific examples below.

Minimize outputs

Keep a close eye on which data from your queries is actually being put to use in your visualizations, and trim any extraneous information being returned. For example, imagine we’re evaluating a journal that has published, say, 3101 papers since 2010. Is that a lot? We can provide some context by counting how many papers have been published at all journals:

SELECT journal.id, journal.title, COUNT(id) as totcount
FROM `dimensions-ai.data_analytics.publications`
WHERE  year>=2010 AND
    type='article'
GROUP BY journal.id, journal.title

This uses 2.79 GB of data and returns results like this:

id             title                                        totcount
jour.1049388   Journal of Bioprocessing & Biotechniques     297
jour.1052310   Vibrant Virtual Brazilian Anthropology       377
(73,627 more rows)

However, if we’re just building a “total papers since 2010” histogram, all those titles aren’t being used for anything. We can change our query to omit titles:

SELECT journal.id, COUNT(id) as totcount
FROM `dimensions-ai.data_analytics.publications`
WHERE  year>=2010 AND
    type='article'
GROUP BY journal.id

This reduces data usage to 1.66 GB, a 41 percent reduction. Note that we also don’t need to return the journal.id field in this example, but removing it from the SELECT list will not reduce data usage, since the ID still has to be read for the GROUP BY clause.
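
To see why, consider this variant (a sketch, not a query used in the dashboard) that drops journal.id from the SELECT list entirely. BigQuery still has to read the journal.id column to perform the grouping, so the bytes processed stay the same:

SELECT COUNT(id) as totcount
FROM `dimensions-ai.data_analytics.publications`
WHERE  year>=2010 AND
    type='article'
GROUP BY journal.id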

Combine multiple queries

If two separate (but similar) queries power different charts in your dashboard, you can probably cut costs roughly in half by combining them into a single query. As you build out more visualizations, keep an eye out for ways to append more fields onto existing data sources rather than creating new ones. Imagine you are collecting overall measurements across all journals. You may start with a query like this, to tally all publications and citations:

SELECT journal.id AS journal_id, journal.title AS journal_title,
    COUNT(DISTINCT p.id) AS tot_pubs,
    SUM(ARRAY_LENGTH(p.citations)) AS tot_cit
FROM `dimensions-ai.data_analytics.publications` p
WHERE p.year >= 2010
    AND p.journal.id IS NOT NULL
GROUP BY 1,2
ORDER BY 3 DESC, 4 DESC, 1

This requires 2.77 GB of data usage and returns results in which each row lists a journal’s ID and title, plus its total publications and total citations.

Then, further down in the dashboard, perhaps you create another query for summing Altmetric attention scores:

SELECT journal.id AS journal_id, journal.title AS journal_title,
    SUM(IFNULL(altmetrics.score, 0)) as tot_altmetrics
FROM `dimensions-ai.data_analytics.publications` p
WHERE p.year >= 2010
    AND p.journal.id IS NOT NULL
GROUP BY 1,2
ORDER BY 3 DESC, 1

This adds another 2.1 GB of data usage to the dashboard, with each row of this separate data source listing a journal’s ID, title, and cumulative Altmetric score.

As written, these two queries account for 4.87 GB of data usage. However, the first query can be modified to cover both use cases, like this:

SELECT journal.id AS journal_id, journal.title AS journal_title,
    COUNT(DISTINCT p.id) AS tot_pubs,
    SUM(ARRAY_LENGTH(p.citations)) AS tot_cit,
    SUM(IFNULL(altmetrics.score, 0)) as tot_altmetrics
FROM `dimensions-ai.data_analytics.publications` p
WHERE p.year >= 2010
    AND p.journal.id IS NOT NULL
GROUP BY 1,2
ORDER BY 3 DESC, 4 DESC, 1

This returns one row for each journal, including total publications, total citations, and the cumulative Altmetric attention score for all journal articles going back to 2010—with total data usage of 2.84 GB, a 42 percent reduction.

Store results from a query in a table

One way you may be able to reduce costs is by pre-computing the results of expensive queries. For example, imagine we’re an editor at Journal X, and we want to monitor which journals most frequently cite papers published in our journal. A query like this would give us the information we want:

SELECT cp.journal.id AS citing, COUNT(citing_pubs.id) as totcount
-- grab a list of all publications,
FROM `dimensions-ai.data_analytics.publications` p,
-- and unpack the list of IDs of papers that cite each one:
    UNNEST(citations) as citing_pubs
-- then, we associate each citing paper with its journal:
JOIN `dimensions-ai.data_analytics.publications` cp ON cp.id=citing_pubs.id
-- we only want information for Journal X papers from 2010 onward:
WHERE  p.year>=2010 AND
    p.journal.id='JOURNAL_X_ID'
GROUP BY cp.journal.id

This query would return a table in which each row lists a single journal and how many times papers in that journal have cited Journal X—perfect for generating a Data Studio table or pie chart. However, this requires parsing a lot of data—about 9 GB per query. That’s 9 GB every time the Data Studio dashboard is refreshed, which, if it’s used frequently, can add up quickly.

If you don’t need up-to-the-minute accuracy, saving the results of this query in a table can both save you money and make your dashboard load more quickly. Google maintains detailed documentation about how to configure a scheduled query—in short, from the BigQuery SQL Workspace, rather than clicking the “Run” button, click the “Schedule” button. The critical piece here is to select a destination table for your query; otherwise, the query will run on schedule but its results won’t be stored anywhere. When you configure the scheduled query, be sure to select “Overwrite table” rather than “Append.”

Tip

Scheduled queries only support destination tables if the query does not include any scripting, which generally refers to features such as variable declarations and control flow statements.
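
If you prefer to keep everything in SQL, you can also materialize the results yourself with a DDL statement along these lines (a sketch: my_project and my_dataset are placeholders for your own project and dataset):

CREATE OR REPLACE TABLE `my_project.my_dataset.precomputed_citations` AS
SELECT cp.journal.id AS citing, COUNT(citing_pubs.id) as totcount
FROM `dimensions-ai.data_analytics.publications` p,
    UNNEST(citations) as citing_pubs
JOIN `dimensions-ai.data_analytics.publications` cp ON cp.id=citing_pubs.id
WHERE  p.year>=2010 AND
    p.journal.id='JOURNAL_X_ID'
GROUP BY cp.journal.id

Because CREATE OR REPLACE TABLE rewrites the table from scratch, each run has the same effect as the “Overwrite table” setting.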

Once you have the information stored in a table (called, let’s say, precomputed_citations), you can go back to your dashboard and modify the data source to query this new table. Now, rather than the tricky (and expensive!) query above, in which we join the whole publications table to itself, we can use this instead:

SELECT citing, totcount
FROM precomputed_citations

That’s it! And since you’ve already done all the heavy-duty processing, this now uses only 55 MB of data—a 99.5 percent reduction, not including the one-time cost of running the initial query to populate the precomputed_citations table.

Store results from multiple queries in a table

In the previous example, we used a new table to store information about the journals citing Journal X. But what if we also want to store information about citations to Journal Y and Journal Z? Two more journals could just mean two more scheduled queries and two more precomputed tables, but that isn’t a scalable solution for many more. A slightly different approach is to use a single table to store the results of multiple queries—if all the queries have the same output fields, this can be much more efficient, particularly if you can populate the entire table with a single query.

We can modify the query above, which calculates incoming citations to a single journal, to instead calculate incoming citations to all journals:

SELECT p.journal.id AS receiving, cp.journal.id AS citing, COUNT(citing_pubs.id) as totcount
FROM `dimensions-ai.data_analytics.publications` p,
    UNNEST(citations) as citing_pubs
JOIN `dimensions-ai.data_analytics.publications` cp ON cp.id=citing_pubs.id
WHERE  p.year>=2010
GROUP BY p.journal.id, cp.journal.id

There is only one small change to the original query: Rather than using a WHERE clause that restricts p.journal.id to a single value, we add p.journal.id as one of the outputs (and group by it). The result is a table containing all incoming citations to all journals, and only a minimal change to the query is required to fetch the results in Data Studio:

SELECT citing, totcount
FROM precomputed_citations
WHERE receiving='JOURNAL_X_ID'

Minimize the amount of materialized data

When recording data in materialized tables, you may be able to save on costs by saving only the data you will actually need. The simplest way to do this is to add a LIMIT clause to whatever query you’re storing. Consider the first query from the previous section, which collects the journals that most frequently cite Journal X:

SELECT cp.journal.id AS citing, COUNT(citing_pubs.id) as totcount
-- grab a list of all publications,
FROM `dimensions-ai.data_analytics.publications` p,
-- and unpack the list of IDs of papers that cite each one:
    UNNEST(citations) as citing_pubs
-- then, we associate each citing paper with its journal:
JOIN `dimensions-ai.data_analytics.publications` cp ON cp.id=citing_pubs.id
-- we only want information for Journal X papers from 2010 onward:
WHERE  p.year>=2010 AND
    p.journal.id='JOURNAL_X_ID'
GROUP BY cp.journal.id

This could return thousands of rows, one for each citing journal. If you’re only interested in, say, the top 100 entries, you could add a LIMIT 100 clause. When requesting the top results from a query, it’s important to add an ORDER BY clause to tell BigQuery what “top” actually means. In our case, we’d want to order the results by total citations before applying our cutoff:

SELECT cp.journal.id AS citing, COUNT(citing_pubs.id) as totcount
-- grab a list of all publications,
FROM `dimensions-ai.data_analytics.publications` p,
-- and unpack the list of IDs of papers that cite each one:
    UNNEST(citations) as citing_pubs
-- then, we associate each citing paper with its journal:
JOIN `dimensions-ai.data_analytics.publications` cp ON cp.id=citing_pubs.id
-- we only want information for Journal X papers after 2010:
WHERE  p.year>=2010 AND
    p.journal.id='JOURNAL_X_ID'
GROUP BY cp.journal.id
ORDER BY totcount DESC
LIMIT 100

Unfortunately, many queries are much more complicated—consider the other example, in which we calculate citations for every journal, not just one:

SELECT p.journal.id AS receiving, cp.journal.id AS citing, COUNT(citing_pubs.id) as totcount
FROM `dimensions-ai.data_analytics.publications` p,
    UNNEST(citations) as citing_pubs
JOIN `dimensions-ai.data_analytics.publications` cp ON cp.id=citing_pubs.id
WHERE  p.year>=2010
GROUP BY p.journal.id, cp.journal.id

This query results in 9.06 GB of data usage and 24.8 million rows of output, and would store thousands or tens of thousands of citing journals for each receiving journal. If you only ever use the top 10 citing journals, you can save a lot of space (and write operations) by only saving the top 10 for any particular receiving journal. However, we can’t simply add a LIMIT 10 to the end, because we want to limit it to 10 rows for each receiving journal. To make this happen, you can use one of the available numbering functions such as RANK():

SELECT p.journal.id AS receiving, cp.journal.id AS citing, COUNT(citing_pubs.id) as totcount,
    RANK() OVER (PARTITION BY p.journal.id ORDER BY COUNT(citing_pubs.id) DESC) AS cite_rank
FROM `dimensions-ai.data_analytics.publications` p,
    UNNEST(citations) as citing_pubs
JOIN `dimensions-ai.data_analytics.publications` cp ON cp.id=citing_pubs.id
WHERE  p.year>=2010 AND p.journal.id IS NOT NULL
GROUP BY p.journal.id, cp.journal.id

The call to RANK() assigns an integer rank to each row, computing separate rankings for each receiving journal (PARTITION BY p.journal.id) and ordering rows by the same value assigned to totcount (ORDER BY COUNT(citing_pubs.id) DESC). However, this still returns all the results, not just the top 10 for each journal. For that, we can use a “WITH” clause to turn our initial query into a subquery, then select only the rows with low ranks:

WITH all_results AS (
    SELECT p.journal.id AS receiving, cp.journal.id AS citing, COUNT(citing_pubs.id) as totcount,
        RANK() OVER (PARTITION BY p.journal.id ORDER BY COUNT(citing_pubs.id) DESC) AS cite_rank
    FROM `dimensions-ai.data_analytics.publications` p,
        UNNEST(citations) as citing_pubs
    JOIN `dimensions-ai.data_analytics.publications` cp ON cp.id=citing_pubs.id
    WHERE  p.year>=2010 AND p.journal.id IS NOT NULL
    GROUP BY p.journal.id, cp.journal.id
)

SELECT * FROM all_results
WHERE cite_rank <= 10

This query returns 769,585 rows, a 97 percent reduction from the unlimited query.
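
If you materialize this result (into a hypothetical table called, say, precomputed_top10), the dashboard query stays as cheap as in the earlier sections:

SELECT citing, totcount
FROM precomputed_top10
WHERE receiving='JOURNAL_X_ID'
ORDER BY cite_rank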

Add clustering

For tables that are larger than 1 GB, you may be able to improve performance by using clustered tables. BigQuery splits your data across multiple storage blocks; clustering lets you define how BigQuery does the splitting. Specify a field that you’re most likely to use when segmenting the data in individual queries. For example, if we were defining a table with information about 3 million publications, we’d probably use the journal field for clustering, since a filter on journal is the one we’re most likely to apply when querying that table.

Once this is set up (when the table is created), BigQuery can use the clustering field as a shortcut. If you request all papers published in Journal X, BigQuery uses block pruning to reduce overhead: Because the data is clustered on the journal field, BigQuery checks whether a block contains any papers from Journal X before actually searching through it, so it can skip entire storage blocks rather than spending time (and data usage!) looking through blocks that don’t have any relevant info.
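
To make this concrete, here is a sketch of how the all-journals citation table from the previous sections could be rebuilt as a clustered table (my_project and my_dataset are again placeholders; BigQuery clusters only on top-level columns, which is why the journal ID is exposed as the receiving field):

CREATE OR REPLACE TABLE `my_project.my_dataset.precomputed_citations`
CLUSTER BY receiving AS
SELECT p.journal.id AS receiving, cp.journal.id AS citing, COUNT(citing_pubs.id) as totcount
FROM `dimensions-ai.data_analytics.publications` p,
    UNNEST(citations) as citing_pubs
JOIN `dimensions-ai.data_analytics.publications` cp ON cp.id=citing_pubs.id
WHERE  p.year>=2010 AND p.journal.id IS NOT NULL
GROUP BY p.journal.id, cp.journal.id

A dashboard query that filters on receiving='JOURNAL_X_ID' can then skip every storage block that holds no Journal X rows.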