globaltrends

Download and measure global trends through Google search volumes

Harald Puhr

# install package --------------------------------------------------------------
# current cran version
install.packages("globaltrends")
# current dev version
devtools::install_github("ha-pu/globaltrends", build_vignettes = TRUE)

# load package -----------------------------------------------------------------
library(globaltrends)

# package version --------------------------------------------------------------
packageVersion("globaltrends")
#> [1] '0.0.14'

Case study: Analyzing firm internationalization

We demonstrate the functionality of the globaltrends package based on a sample of six large U.S. firms. Measuring degree of internationalization for firms is an essential empirical task in international business research. Yet the proposed methodology can be generalized to other applications. In this brief case study, we analyze the degree of internationalization of Alaska Air Group Inc., Coca-Cola Company, Facebook Inc., Illinois Tool Works Inc., J.M. Smucker Company, and Microsoft Corporation. The workflow proceeds in four major steps:

  1. Setup and start database
  2. Download data from Google Trends
  3. Compute search scores and internationalization
  4. Exports and plots

Setup and start database

Research projects that use Google Trends generate a substantial amount of data. To optimally handle this data, the globaltrends package uses an SQLite database to store and handle all data. This ensures efficiency and portability on the one hand and seamless integration with functions implemented in the DBI and dplyr packages on the other hand.

Users create the underlying database through the initialize_db command. The command creates a folder named db within the current working directory and creates an SQLite database file named globaltrends_db.sqlite within this folder. The command also creates all necessary tables within the database. For more information on database tables, please refer to their built-in documentation (e.g., ?globaltrends::data_score). The database initialization is necessary only for the first usage of the globaltrends package.

# initialize_db ----------------------------------------------------------------
setwd("your/globaltrends/folder")
initialize_db()
#> Database has been created.
#> Table 'batch_keywords' has been created.
#> ...
#> Table 'data_global' has been created.
#> Successfully disconnected.

After initialization, or when resuming work on an existing database, it is sufficient to call start_db from the respective working directory. This command connects to the globaltrends_db.sqlite database in the folder db and creates connections to all tables in the database.

# start_db ---------------------------------------------------------------------
setwd("your/globaltrends/folder")
start_db()
#> Successfully connected to database.
#> Successfully exported all objects to .GlobalEnv.
print(ls())
#>  [1] "batch_keywords"   "batch_time"       "countries"        "data_control"
#>  [5] "data_doi"         "data_global"      "data_locations"   "data_mapping"
#>  [9] "data_object"      "data_score"       "dir_current"      "dir_wd"
#> [13] "globaltrends_db"  "keyword_synonyms" "keywords_control" "keywords_object"
#> [17] "time_control"     "time_object"      "us_states"

After work with the globaltrends package is complete, the user disconnects from the database with the command disconnect_db.

# disconnect_db ----------------------------------------------------------------
disconnect_db()
#> Successfully disconnected.
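
To illustrate why an SQLite backend combines well with DBI, the following minimal sketch builds a throwaway in-memory database and queries it. It does not touch globaltrends_db.sqlite. The table name toy_score, its columns, and its values are hypothetical and only loosely mimic the package's data_score table. The sketch assumes that the DBI and RSQLite packages are installed.

```r
# Toy example: an in-memory SQLite database queried through DBI.
# Table name, columns, and values are made up for illustration.
library(DBI)

con <- dbConnect(RSQLite::SQLite(), ":memory:")

toy_score <- data.frame(
  location = c("US", "US", "DE"),
  keyword = c("coca cola", "microsoft", "coca cola"),
  score = c(0.004, 0.010, 0.003),
  stringsAsFactors = FALSE
)
dbWriteTable(con, "toy_score", toy_score)

# List tables and filter rows with plain SQL
tables <- dbListTables(con)
coke <- dbGetQuery(con, "SELECT * FROM toy_score WHERE keyword = 'coca cola'")
print(coke)

dbDisconnect(con)
```

Because the data lives in a single file (or, here, in memory), the same queries work on any machine that has the database file, which is the portability argument made above.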

Compute search scores and internationalization

Once the user has completed all control and object downloads, globaltrends computes search scores for each keyword-time-location combination and at a global level (volume of internationalization). Next, the package uses the across-country distribution of these search scores to measure the degree of internationalization of an object keyword.

Compute country search scores and volume of internationalization

The function compute_score divides the search volumes for an object keyword by the sum of search volumes for the keywords in the respective control batch. The search score computation proceeds in four steps. First, the function aggregates all search volumes to monthly data. Then, it applies some optional time series adjustments that we outline in greater detail below. Next, it follows the procedure proposed by Castelnuovo and Tran (2017, pp. A1-A2) and outlined in Appendix B to map control and object data. After the mapping, object search volumes are divided by the sum of control search volumes in the respective control batch. We use the sum of search volumes for a set of control keywords, rather than the search volumes for a single control keyword, to smooth out variation in the underlying control data. Because of this division, it is essential to define a set of control keywords that mirrors “standard” Google usage for the given research setting.

# compute_score ----------------------------------------------------------------
compute_score(control = new_control[[1]], object = new_object, locations = countries)
#> Successfully computed search score | control: 1 | object: 1 | location: US [1/66]
#> ...
#> Successfully computed search score | control: 1 | object: 2 | location: DO [66/66]

A message indicates each successful computation of search scores. The data is written directly to table data_score in the database. The computation of the volume of internationalization follows the same principles. Instead of search volumes of control and object keywords at the country level, the function compute_voi compares control and object search volumes at the global level.

# compute_voi ------------------------------------------------------------------
compute_voi(control = new_control[[1]], object = new_object)
#> Successfully computed search score | control: 1 | object: 1 | location: world [1/1]
#> Successfully computed search score | control: 1 | object: 2 | location: world [1/1]
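
The core division behind the search score can be sketched in base R. This is a simplified illustration, not the package's implementation: it skips the monthly aggregation, the time series adjustments, and the mapping step, and it assumes that object and control hits are already on a comparable scale. The control keyword "maps" and all hit values are invented.

```r
# Toy control batch: two control keywords observed in two months
control_hits <- data.frame(
  date = rep(as.Date(c("2010-01-01", "2010-02-01")), each = 2),
  keyword = rep(c("gmail", "maps"), times = 2),
  hits = c(20, 30, 25, 25)
)

# Toy object keyword for the same months
object_hits <- data.frame(
  date = as.Date(c("2010-01-01", "2010-02-01")),
  hits = c(5, 10)
)

# Sum control hits per month, then divide object hits by that sum
control_sum <- aggregate(hits ~ date, data = control_hits, FUN = sum)
names(control_sum)[2] <- "control_total"

scores <- merge(object_hits, control_sum, by = "date")
scores$score <- scores$hits / scores$control_total
print(scores)
```

Summing over several control keywords means that a spike in any single control keyword moves the denominator only partially, which is the smoothing effect described above.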

Compute degree of internationalization

The globaltrends package uses the distribution of search scores across countries to compute the degree of internationalization for objects of interest. The function compute_doi uses an inverted Gini-coefficient as its measure of degree of internationalization. The more uniform the distribution of search scores across all countries, the higher the inverted Gini-coefficient and the greater the degree of internationalization. In addition to the Gini-coefficient, the package provides an inverted Herfindahl index and inverted Entropy as measures of internationalization (details below).

# compute_doi ------------------------------------------------------------------
compute_doi(control = new_control[[1]], object = new_object, locations = "countries")
#> Successfully computed DOI | control: 1 | object: 1 [1/2]
#> Successfully computed DOI | control: 1 | object: 2 [2/2]

A message indicates each successful computation. The data is written directly to table data_doi in the database.
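
The intuition behind inverted dispersion measures can be sketched in base R. The helper functions below are hypothetical and use textbook definitions of the Gini-coefficient and the Herfindahl index. The exact normalization inside compute_doi may differ, so the numbers are illustrative only. Entropy is left out because its scaling is not specified in this text.

```r
# Inverted Gini: 1 minus the textbook Gini-coefficient of the scores
inverted_gini <- function(x) {
  x <- sort(x)
  n <- length(x)
  gini <- 2 * sum(seq_len(n) * x) / (n * sum(x)) - (n + 1) / n
  1 - gini
}

# Inverted Herfindahl index: 1 minus the sum of squared score shares
inverted_hhi <- function(x) {
  shares <- x / sum(x)
  1 - sum(shares^2)
}

# Search scores for four locations: evenly spread vs. one location only
uniform <- rep(0.01, 4)
concentrated <- c(0, 0, 0, 0.04)

print(c(gini = inverted_gini(uniform), hhi = inverted_hhi(uniform)))
print(c(gini = inverted_gini(concentrated), hhi = inverted_hhi(concentrated)))
```

In both measures, the evenly spread scores yield the higher value, which matches the reading that a more uniform distribution indicates greater internationalization.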

Exports and plots

Functions in globaltrends write all data directly to tables in the database. With the help of functions from the dplyr package and connections exported from start_db, users can access database tables and prepare their own analysis.

# manual exports ---------------------------------------------------------------
library(dplyr)
data_score %>%
  filter(keyword == "coca cola") %>%
  collect()
#> # A tibble: 8,040 x 8
#>    location keyword    date score_obs score_sad score_trd batch_c batch_o
#>    <chr>    <chr>     <int>     <dbl>     <dbl>     <dbl>   <int>   <int>
#>  1 US       coca cola 14610   0.00362   0.00381   0.00548       1      1
#>  ...
#> 10 US       coca cola 14883   0.00347   0.00365   0.00389       1      1
#> # ... with 8,030 more rows

To enhance usability, the globaltrends package includes a set of export functions that offer filters and return data as a tibble. The filter inputs of the export_xxx functions (e.g., keyword, object, control) default to NULL. In this case, all values from the database are exported. Alternatively, users can specify filters (e.g., keywords, batches, locations) individually, as a vector, or as a list.

# export_control ---------------------------------------------------------------
export_control(control = 1)
#> # A tibble: 39,600 x 5
#>    location keyword date        hits control
#>    <chr>    <chr>   <date>     <dbl>   <int>
#>  1 US       gmail   2010-01-01    22       1
#>  ...
#> 10 US       gmail   2010-10-01    27       1
#> # ... with 39,590 more rows

# export_score -----------------------------------------------------------------
export_score(object = 1, control = 1)
#> # A tibble: 23,760 x 8
#>    location keyword   date       score_obs score_sad score_trd control object
#>    <chr>    <chr>     <date>         <dbl>     <dbl>     <dbl>   <int>  <int>
#>  1 US       coca cola 2010-01-01   0.00362   0.00381   0.00548       1     1
#>  ...
#> 10 US       coca cola 2010-10-01   0.00347   0.00365   0.00389       1     1
#> # ... with 23,750 more rows

# export_doi and purrr interaction ---------------------------------------------
purrr::map_dfr(c("coca cola", "microsoft"), export_doi, control = 1, type = "obs")
#> # A tibble: 240 x 9
#>    keyword   date       type       gini   hhi entropy control object locations
#>    <chr>     <date>     <chr>     <dbl> <dbl>   <dbl>   <int>  <int> <chr>
#>  1 coca cola 2010-01-01 score_obs 0.397 0.874  -0.938       1     1 countries
#>  ...
#> 10 coca cola 2010-10-01 score_obs 0.574 0.968  -0.303       1     1 countries
#> # ... with 230 more rows

The export functions from globaltrends also allow direct interaction with dplyr or other packages for further analysis.

# export and dplyr interaction -------------------------------------------------
library(dplyr)
export_doi(object = 1, control = 1, type = "obs") %>%
  filter(lubridate::year(date) == 2019) %>%
  group_by(keyword) %>%
  summarise(gini = mean(gini), .groups = "drop")
#> # A tibble: 3 x 2
#>   keyword    gini
#>   <chr>     <dbl>
#> 1 coca cola 0.615
#> 2 facebook  0.707
#> 3 microsoft 0.682

Exports from globaltrends also serve as input for plot functions and the computation of abnormal changes in internationalization implemented in the package. Except for plot_voi_doi, plot functions have methods for classes of outputs from export_score, export_voi, and export_doi. Alternatively, all plot functions provide options to work without the respective class (e.g., for cases where the class gets lost in a join). The function plot_bar uses the output from export_score as input and shows the locations with the highest search scores for a given object keyword. The function uses only the first keyword in the dataset and averages the search scores for the input dataset; we therefore suggest filtering the output from export_score to a specific period. The plot shows that Coca-Cola has high search scores across Latin America and India.

# plot_bar ---------------------------------------------------------------------
library(dplyr)
export_score(keyword = "coca cola", control = 1) %>%
  filter(lubridate::year(date) == 2019) %>%
  plot_bar()

The functions plot_box and plot_ts have methods for classes of output from export_score, export_voi, and export_doi. The time series plot function plot_ts shows how search scores and volume or degree of internationalization for objects of interest develop over time. The function plot_box generates boxplots of search score and volume or degree of internationalization distributions. The four plots below compare volume and degree of internationalization for the six companies in our sample. At first glance, we see that Coca-Cola, Facebook, and Microsoft have higher degrees of internationalization than Alaska Air Group, Illinois Tool Works, and J.M. Smucker. It seems as if the degree of internationalization of Facebook and Microsoft increased slightly from 2010 to 2015. Although the overall trend remains stable, Coca-Cola shows greater variation than the other companies.

# plot_ts and plot_box ---------------------------------------------------------
data <- purrr::map_dfr(1:2, export_doi, keyword = NULL, control = 1, type = "obs")
plot_ts(data)
plot_box(data)

With the function plot_voi_doi, users can compare the volume of internationalization for an object of interest to its degree of internationalization. Like plot_bar, the function uses only the first keyword in a dataset, so filtering might be necessary. In the plot below, we compare Facebook’s volume of internationalization to its degree of internationalization. While volume of internationalization indicates the level of global search scores, degree of internationalization relates to the global distribution of search scores. We see that Facebook’s volume of internationalization decreased steadily after its peak in 2013. At the same time, we observe that its degree of internationalization grew from 2010 before peaking in 2013.

# plot_voi_doi -----------------------------------------------------------------
out_voi <- export_voi(keyword = "facebook", type = "obs")
out_doi <- export_doi(keyword = "facebook", object = 1, type = "obs")
plot_voi_doi(data_voi = out_voi, data_doi = out_doi)

Abnormal changes in internationalization

A unique feature of internationalization data from globaltrends is that it allows time series analysis. For a better understanding of changes in the data, the package provides the get_abnorm_hist function that implements functionality used in financial event studies (MacKinlay, 1997; McWilliams & Siegel, 1997). The function compares search scores and volume or degree of internationalization to a historic baseline. By default, the historic baseline is the average from the preceding twelve months. Users can specify the window of the baseline period (train_win) and can use a break between baseline and date of interest (train_break). Because the first train_win + train_break observations serve as the baseline, their abnormal changes are NA. The get_abnorm_hist function has methods for classes of outputs from export_score, export_voi, and export_doi. For each month in the dataset, the deviation from the historic baseline is computed. To identify abnormal changes, the function provides the percentile rank for each change within the distribution of changes.

data <- export_score(keyword = "facebook", locations = countries)
out <- get_abnorm_hist(data)
na.omit(out) # to drop baseline NA values
#> # A tibble: 7,590 x 8
#>   keyword  location date       control object score score_abnorm quantile
#>   <chr>    <chr>    <date>       <int>  <int> <dbl>        <dbl>    <dbl>
#>  1 facebook US       2011-01-01       1      1  1.19      0.0220     0.728
#>  ...
#> 10 facebook US       2011-10-01       1      1  1.32     -0.0669     0.456
#> # ... with 7,580 more rows

data <- export_voi(object = 1)
out <- get_abnorm_hist(data)
na.omit(out) # to drop baseline NA values
#> # A tibble: 345 x 7
#>    keyword   date       control object     voi voi_abnorm quantile
#>    <chr>     <date>       <int>  <int>   <dbl>      <dbl>    <dbl>
#>  1 coca cola 2011-01-01       1      1 0.00320  -0.000299    0.316
#>  ...
#> 10 coca cola 2011-10-01       1      1 0.00274  -0.000458    0.193
#> # ... with 335 more rows

data <- export_doi(keyword = "microsoft", locations = "us_states")
out <- get_abnorm_hist(data)
na.omit(out) # to drop baseline NA values
#> # A tibble: 345 x 9
#>    keyword   date       type      control object locations   doi doi_abnorm quantile
#>    <chr>     <date>     <chr>       <int>  <int> <chr>     <dbl>      <dbl>    <dbl>
#>  1 microsoft 2011-01-01 score_obs       1      1 us_states 0.919    0.0330     0.991
#>  ...
#> 10 microsoft 2011-04-01 score_obs       1      1 us_states 0.909    0.0171     0.886
#> # ... with 335 more rows
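
The baseline logic described above can be sketched in base R. The helper abnormal_changes below is hypothetical and is not the code behind get_abnorm_hist. It assumes that the deviation is a simple difference between the observed value and the mean of the baseline window, and that the percentile rank is the rank of each change divided by the number of non-missing changes. The column names in the returned data frame are also assumptions.

```r
# Hypothetical re-implementation of the baseline comparison for one series.
# train_win: number of months in the baseline window
# train_break: number of months skipped between baseline and date of interest
abnormal_changes <- function(x, train_win = 12, train_break = 0) {
  n <- length(x)
  baseline <- rep(NA_real_, n)
  first <- train_win + train_break + 1
  if (n >= first) {
    for (t in first:n) {
      window <- x[(t - train_break - train_win):(t - train_break - 1)]
      baseline[t] <- mean(window)
    }
  }
  abnorm <- x - baseline

  # Percentile rank of each change within the distribution of changes
  pct_rank <- rep(NA_real_, n)
  ok <- !is.na(abnorm)
  pct_rank[ok] <- rank(abnorm[ok]) / sum(ok)

  data.frame(value = x, baseline = baseline, abnorm = abnorm, quantile = pct_rank)
}

# Toy series: twelve flat months, one spike, then back to normal
x <- c(rep(1, 12), 2, 1)
out <- abnormal_changes(x)
print(tail(out, 3))
```

The first twelve rows contain NA because they only feed the baseline, which is why the examples above drop them with na.omit.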

The functions plot_bar, plot_box, and plot_ts have methods for classes of output from get_abnorm_hist. This allows seamless plotting of changes in internationalization. The function plot_bar shows the five locations with the highest and lowest changes in search scores for a given object keyword. The function uses only the first keyword in the dataset and averages changes in search scores for the input dataset – we therefore suggest filtering the output from get_abnorm_hist to a specific period. The plot shows that while positive abnormal changes in search scores for Facebook were greatest in Ecuador and Myanmar, negative abnormal changes were greatest in Italy and Argentina.

data <- export_score(object = 1, locations = countries)
data <- dplyr::filter(data, keyword == "facebook" & lubridate::year(date) >= 2018)
# use 2018 as baseline to compute abnormal changes in 2019
out <- get_abnorm_hist(data)
plot_bar(out)

The time series plot function plot_ts shows how search scores and volume or degree of internationalization for objects of interest changed over time. The function plot_box generates boxplots of changes in search score and volume or degree of internationalization distributions. The input ci allows users to set a confidence interval for plotting. Changes with percentile ranks outside this two-tailed confidence interval are highlighted with red dots. The left-hand plot shows abnormal changes in Facebook’s search score for Germany. Search scores increased “abnormally” (i.e., compared to the historic average) in 2012 and decreased abnormally in 2014. The right-hand plot shows the distribution for Coca-Cola’s degree of internationalization and indicates abnormal changes.

data <- export_score(keyword = "facebook", locations = "DE")
out <- get_abnorm_hist(data)
plot_ts(out)

data <- export_doi(keyword = "coca cola", locations = "countries")
out <- get_abnorm_hist(data)
plot_box(out)

Additional options

The globaltrends package offers several options that allow robustness checks and adjustments for default computations. Users can compute global trend dispersion based on different types of time series, use measures other than the inverted Gini-coefficient, or change the set of locations.

Time series adjustments

The computation of search scores in the globaltrends package compares a time series of search volumes for object keywords to the time series of search volumes for control keywords. Noise and seasonality in search volume time series could affect the resulting search scores. The globaltrends package offers two time series adjustments as robustness checks. In the data_score table, column score_obs refers to values without adjustment. Column score_trd uses the underlying time series’ trend for computation.

# computation seasonally adjusted ----------------------------------------------
search_score <- ts(data$hits, frequency = 12)
fit <- stl(search_score, s.window = "period")
seasad <- forecast::seasadj(fit)
# computation trend only -------------------------------------------------------
search_score <- ts(data$hits, frequency = 12)
fit <- stl(search_score, s.window = "period")
trend <- fit$time.series[, "trend"]

Column score_sad corrects the time series for seasonal patterns. In general, outcomes for all three types of time series are similar. Column score_trd applies the greatest smoothing, while score_sad reduces some noise.
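
The relative smoothness of the three series can be checked on simulated data. The sketch below uses only base R and an invented monthly series with a linear trend, a yearly cycle, and noise. Subtracting the seasonal component stands in for forecast::seasadj here, which assumes the additive decomposition that stl produces. The roughness measure (standard deviation of month-to-month changes) is our own choice for illustration.

```r
# Simulated monthly hits: trend + yearly seasonality + noise (made-up data)
set.seed(1)
months <- 1:120
hits <- 50 + 0.1 * months + 10 * sin(2 * pi * months / 12) + rnorm(120)

obs <- ts(hits, frequency = 12)
fit <- stl(obs, s.window = "periodic")

seasonal <- fit$time.series[, "seasonal"]
trend <- fit$time.series[, "trend"]
sad <- obs - seasonal  # seasonally adjusted: observed minus seasonal part

# Roughness: standard deviation of month-to-month changes
roughness <- c(
  obs = sd(diff(obs)),
  sad = sd(diff(sad)),
  trd = sd(diff(trend))
)
print(roughness)
```

In this simulation, the trend series varies least from month to month and the unadjusted series varies most, in line with the ordering described above.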

The export_doi, get_abnorm_hist, plot_bar, plot_ts, plot_box, and plot_voi_doi functions allow filtering for the type of time series through the type input.

# adapt export and plot options ------------------------------------------------
data_score <- export_score(keyword)
data_voi <- export_voi(keyword)
data_doi <- export_doi(keyword, type = "obs")

plot_bar(data_score, type = "obs")
plot_ts(data_voi, type = "sad")
plot_box(data_doi, type = "trd")
plot_voi_doi(data_voi, data_doi, type = "obs")

get_abnorm_hist(data_voi, type = "obs")

Alternative dispersion measures

The globaltrends package computes degree of internationalization based on the across-location distribution of search scores. By default, the package uses an inverted Gini-coefficient. In addition, the package provides inverted Herfindahl index and inverted Entropy as robustness checks. In general, outcomes for all three dispersion measures are similar.

The export_doi, get_abnorm_hist, plot_ts, plot_box, and plot_voi_doi functions allow filtering for the type of dispersion measures through the measure input.

# adapt export and plot options ------------------------------------------------
data_voi <- export_voi(keyword)
data_doi <- export_doi(keyword, measure = "gini")

plot_ts(data_doi, measure = "gini")
plot_box(data_doi, measure = "hhi")
plot_voi_doi(data_voi, data_doi, measure = "entropy")

get_abnorm_hist(data_doi, measure = "hhi")

Alternative sets of locations

By default, globaltrends makes all downloads and computations for the countries set of locations. The countries set covers all countries that generated at least 0.1% of world GDP in 2018. By changing the input locations to us_states, the package uses US states and Washington DC as the basis for downloads and computations instead. Apart from compute_doi, all functions use the name of the variable that contains the location vector as input for locations (e.g., countries, us_states). The function start_db exports these vectors of ISO2 codes to the global environment. The function compute_doi, however, does not refer to these objects directly but to their names as character strings (e.g., “countries”, “us_states”). Using state- or district-level locations allows users to analyze the within-country dispersion of firms.

# change locations -------------------------------------------------------------
download_control(control = 1, locations = us_states)
download_object(object = list(1, 2), locations = us_states)
download_mapping(control = 1, object = 2, locations = us_states)
compute_score(control = 1, object = 2, locations = us_states)
compute_doi(control = 1, object = list(1, 2), locations = "us_states")

Users can add individual sets of locations through the function add_locations. In the input locations, users specify the location codes (e.g., “AT”, “CH”, “DE”), and the input type takes the name of the location set (e.g., “dach”). The new location set can be used in all functions. Since all functions check whether data on a location already exists, globaltrends does not duplicate data for new location sets.

add_locations(c("AT", "CH", "DE"), type = "dach")
#> Successfully created new location set dach (AT, CH, DE).
data <- export_score(keyword = "coca cola", locations = dach)
dplyr::count(data, location)
#> # A tibble: 3 x 2
#>   location     n
#>   <chr>    <int>
#> 1 AT         127
#> 2 CH         127
#> 3 DE         127

Search topics vs search terms

Results for individual keywords as search terms (e.g., weather, apple, coca cola) might be distorted by translation issues (i.e., keywords are searched for in different languages), keyword contamination (i.e., keywords relate to different queries: apple vs. Apple Inc.), and keyword dilution (i.e., multiple keywords relate to the same query: election, vote). Search topics allow users to partly overcome these issues. Google defines a search topic as “a group of terms that share the same concept in any language.” Queries that use search topics are therefore language-independent, cover queries for different terms, and differentiate between queries.

Users can identify the codes of search topics on the Google Trends portal, by selecting the respective topic, rather than a search term (see the screenshot below).

After selecting the relevant search topics, users can identify the topic codes in the query’s URL. For example, based on the URL https://trends.google.com/trends/explore?q=%2Fm%2F03phgz&geo=AT, the topic code for The Coca-Cola Company is %2Fm%2F03phgz. Users can use these topic codes as keywords instead of single search terms. We point users to Kupfer and Zorn (2020, pp. 1169-1170) for a detailed comparison of search topics and search terms.
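
Extracting the topic code from a copied URL can be scripted. The sketch below uses only base R and the example URL from above. The variable names are hypothetical, and the regular expression assumes that the URL contains a single q parameter. The decoded form is shown only for readability, since the text above uses the encoded form as the keyword.

```r
# Example URL copied from the Google Trends portal
url <- "https://trends.google.com/trends/explore?q=%2Fm%2F03phgz&geo=AT"

# Keep everything between "q=" and the next "&"
topic_code <- sub(".*[?&]q=([^&]*).*", "\\1", url)
print(topic_code)

# Percent-decoded version of the same code, for readability only
topic_decoded <- utils::URLdecode(topic_code)
print(topic_decoded)
```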

Important: We recommend that search topics for control keywords are used in combination with search topics for object keywords and vice versa.

Further applications

To measure degree of internationalization, globaltrends offers a wide array of empirical possibilities (Puhr & Müllner, 2021). It allows researchers to compare degree of internationalization for various organizations on a unified scale (e.g., Coca-Cola Company, Facebook Inc., Real Madrid, and Manchester United). In addition, the time-series nature of Google Trends allows for historical analysis of internationalization patterns and speed within organizations.