# 30. Publication volume growth rate

Level: Advanced

This query requires a good understanding of SQL and the Dimensions data model

## Description

This query calculates the growth rate of the number of publications in a particular field over time. Using annual numbers can make visualizations almost impossible to read, since year-to-year numbers can be noisy and include large swings in either direction. To avoid this, the query below uses a **sliding five-year window**: For example, the publication count for 2005 would include *all of the publications between 2001 and 2005*.

The other way the results are smoothed is in our choice of comparison: Rather than comparing the 2005 total to the 2004 total, **we compare 2005 to the previous 5-year window, which ended in 2000**. Both of these strategies can be adjusted in the query:

- To adjust the "width" of the window, modify the line that says
`4 PRECEDING`

to whatever number of years works best for your data. - To adjust which windows are compared to get the growth rate, adjust the line that says
`LAG(running_total, 5)`

by changing the`5`

to a different value. For example, a`5`

here means we compare the 2005 total to the 2000 total, but, changing this to`LAG(running_total, 1)`

would compare 2005 to*2004*.

### Unusual features

There are a few components of this query that may not be intuitive, or use a complicated syntax:

- The
`SUM(num) OVER ...`

clause is part of a*window function*that includes the`num`

field of the previous four rows (ordered by year). The BigQuery documentation has more information about using functions like`SUM() OVER`

. - The
`LAG(running_total, 5)`

clause does something similar, but instead of adding the previous four rows together, it only retrieves the value from the`running_total`

field from the row five positions higher when ordered by year. This is called a "navigation function" and is also discussed in the BigQuery documentation. - The
`allyears`

subquery is a clunky workaround for situations in which a particular year has no publications of interest. The`pub_counts`

subquery returns a row for each year, and a count of publications in that year, but a year only appears if there's at least one publication. Since we're counting rows in the`SUM`

and`LAG`

functions described above, it would be a big problem if we're calculating a "five-year window" that actually stretches over eight years because three years in the window have no publications.`allyears`

returns the exact same numbers as the`pubcounts`

subquery, but it guarantees every year will have a row.- The start and end years for this step are defined in the
`params`

subquery. Ideally, we could just use the first and last years of the publications in the list, but BigQuery doesn't support "correlated subqueries" that reference each other, so this simpler method is used here instead.

- The start and end years for this step are defined in the
- The line that includes
`COALESCE(NULLIF(prev_total, 0), 1)`

is a bit of a hack. Ideally, the`rate`

field would be calculated simply using`(running_total - prev_total) / prev_total`

. However, there may be cases where`prev_total`

is zero, which would throw an error for dividing by zero. Because there is no function for "change this value if it is zero," it has to happen in two steps:`NULLIF(prev_total, 0)`

: If`prev_total`

is equal to`0`

, this step returns`NULL`

instead of`0`

.`COALESCE(NULLIF(prev_total, 0), 1)`

: If the previous step returns`NULL`

, this step will return`1`

instead.- The end result is that all non-zero values of
`prev_total`

are used without modification, but a value of`0`

is changed to a value of`1`

. This isn't*perfect*, since the real answer is an infinite growth rate.

## Query

```
WITH params AS (
SELECT
-- Define the start and end points for your rate calculations.
-- Publications outside of this range will be ignored.
1990 AS minyear
,2020 AS maxyear
),
pub_counts AS (
SELECT
p.year AS pubyear, COUNT(p.id) AS num
FROM `dimensions-ai.data_analytics.publications` p
WHERE
p.year >= (SELECT minyear FROM params)
AND p.year <= (SELECT maxyear FROM params)
---- HERE is where to define your publications
---- of interest
AND '2101' IN UNNEST(p.category_for.second_level.codes) -- Archaeology
GROUP BY 1
),
allyears AS (
SELECT year, COALESCE(p.num, 0) AS num
FROM UNNEST(GENERATE_ARRAY(
(SELECT minyear FROM params),
(SELECT maxyear FROM params)
)) year
LEFT JOIN pub_counts p
ON p.pubyear=year
),
pub_window AS (
SELECT year, num
,SUM(num) OVER(ORDER BY year ROWS BETWEEN 4 PRECEDING AND CURRENT ROW) AS running_total
FROM allyears
),
pub_collate AS (
SELECT year, num, running_total
,LAG(running_total, 5) OVER(ORDER BY year ASC) AS prev_total
FROM pub_window
)
SELECT year
,num AS pub_count
,(running_total - prev_total) / COALESCE(NULLIF(prev_total, 0), 1) AS rate
FROM pub_collate
WHERE year >= (SELECT minyear FROM params)+10
ORDER BY 1 ASC
```

## Results

Row | year | pub_count | rate |
---|---|---|---|

0 | 2000 | 4404 | 0.1616 |

1 | 2001 | 4152 | 0.1404 |

2 | 2002 | 4477 | 0.1413 |

3 | 2003 | 4328 | 0.1047 |

4 | 2004 | 4190 | 0.0397 |