library(dplyr, warn.conflicts = FALSE)
library(tardis)
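Most sentiment-analysis algorithms boil down to two things: a dictionary that assigns sentiment scores to tokens, and a set of rules for combining those scores.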
By prioritizing flexibility, transparency, and speed, tardis makes it fast and easy to analyze text with customisable dictionaries and rules.
This means you can use the right dictionary and rules for your context and study aims.
A sentiment-analysis algorithm is only as good as its dictionary and its rules.
But relying on any single dictionary can cause problems:
And similarly, standard approaches may have problems with their rules:
Tardis aims to overcome these issues by following three principles:
And given the importance of online communication and large data sets, Tardis also meets the following requirements:
Tardis first decomposes texts into tokens (words, emojis, or multi-word strings). Each token is scored based on its dictionary value (if any), whether it is in ALL CAPS, and the three preceding tokens. Preceding negations like “not” will reverse and reduce a token’s score, and modifiers will either increase (e.g. “very”) or decrease (e.g. “slightly”) its score. Sentence scores are found by summing token scores, adjusting for punctuation, and mapping results to the range \((-1, 1)\) with a sigmoid function. Text scores are means of sentence scores. Each of these steps can be tweaked or disabled by user-supplied parameters. Tardis’s algorithm is inspired by other approaches, notably VADER, although it differs from VADER in three key respects: first, it is much more customisable; second, token score adjustments are all multiplicative, making the order of operations unimportant; and third, there are no special cases or exceptions, making the rules simpler and more intuitive.
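As a rough worked example of the pipeline just described, the sketch below scores the single sentence "This is not good." by hand. The dictionary score for "good" (1.9), the negation factor (0.75), and the VADER-style sigmoid \(x / \sqrt{x^2 + 15}\) are all assumptions, chosen because together they reproduce the score reported for this sentence in the example later in this document; they are not read from tardis's source.

```r
# Assumed values: picked because they reproduce the -0.3453024 reported for
# "This is not good." later in this document, not taken from tardis itself.
score_good      <- 1.9    # assumed dictionary score for "good"
negation_factor <- 0.75   # assumed damping applied by a preceding negation
sigmoid_factor  <- 15     # assumed constant in a VADER-style sigmoid

# Map an unbounded raw score onto the open interval (-1, 1).
squash <- function(x, a = sigmoid_factor) x / sqrt(x^2 + a)

# "not" precedes "good", so its score is reversed and reduced.
token_scores   <- c(0, 0, 0, score_good * -negation_factor)
sentence_score <- squash(sum(token_scores))

# A text score is the mean of its sentence scores (here, a single sentence).
text_score <- mean(sentence_score)
round(text_score, 7)
```

Because every adjustment is a multiplication, applying the negation before or after any other multiplicative modifier would give the same raw token score.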
Because R is a vectorized language, internally tardis creates several vectors of length \(n\) and stores them in a tbl_df data frame, where \(n\) is the number of tokens in the input texts, and then operates largely by adding and multiplying across these vectors. For example, if \(neg\) is the negation scaling factor, \(s_i\) is the dictionary sentiment of the token at index \(i\), and \(n_i\) is the number of negations in the tokens at indices \(i-1\), \(i-2\), and \(i-3\), then we can calculate the effect of negations as \(s_i \cdot (-neg)^{n_i}\).
The implementation makes heavy use of the package dplyr, although it also uses base R and custom C++ functions to increase performance.
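The negation formula can be illustrated with a few lines of base R. This is a minimal sketch of the vectorized idea only, not tardis's internal code: the token scores, the list of negation words, the value of neg, and the helper lag_by() are all hypothetical.

```r
# Hypothetical toy inputs for illustration.
tokens <- c("this", "is", "not", "good")
s      <- c(0, 0, 0, 1.9)                    # assumed dictionary sentiment per token
is_neg <- tokens %in% c("not", "never", "no") # assumed negation list
neg    <- 0.75                               # assumed negation scaling factor

# Shift a logical vector right by k positions, padding the front with FALSE.
lag_by <- function(x, k) c(rep(FALSE, k), x)[seq_along(x)]

# n_i: number of negations among the three preceding tokens, for every i at once.
n_neg <- lag_by(is_neg, 1) + lag_by(is_neg, 2) + lag_by(is_neg, 3)

# s_i * (-neg)^(n_i), computed across whole vectors with no explicit loop.
adjusted <- s * (-neg)^n_neg
adjusted
```

An even number of negations leaves the sign unchanged and an odd number flips it, while each negation shrinks the magnitude by the factor neg.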
In languages like Python or C++, the preceding algorithm could be efficiently implemented through a “moving window” approach that steps through each token sequentially and computes a score based on a function \(f(t_j,t_{j-1},t_{j-2},t_{j-3})\) of each token \(t_j\) and its three preceding tokens.
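For comparison, the moving-window idea can be sketched in R with an explicit loop. This is an illustration under stated assumptions, not how tardis is implemented: the function f() only handles negations, and the inputs s, is_neg, and neg are hypothetical.

```r
# Hypothetical toy inputs for illustration.
s      <- c(0, 0, 0, 1.9)                 # assumed dictionary sentiment per token
is_neg <- c(FALSE, FALSE, TRUE, FALSE)    # TRUE where the token is a negation
neg    <- 0.75                            # assumed negation scaling factor

# Score token j from itself and up to three preceding tokens.
f <- function(j) {
  preceding <- seq_len(j - 1)
  preceding <- preceding[preceding >= j - 3]
  s[j] * (-neg)^sum(is_neg[preceding])
}

# Step through the tokens one at a time.
adjusted <- numeric(length(s))
for (j in seq_along(s)) {
  adjusted[j] <- f(j)
}
adjusted
```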
To be completed…
A simple children’s rhyme shows one pitfall of relying on a fixed dictionary. Here we see the sad story of Ed, whose bed is too small:
library(tardis)
library(dplyr)
library(knitr)
text <- c("This is not good.",
          "This is not right.",
          "My feet stick out of bed all night.",
          "And when I pull them in, oh dear!",
          "My feet stick out of bed up here!")
tardis::tardis(text) %>%
  dplyr::select(sentences, score) %>%
  knitr::kable()
sentences | score |
---|---|
This is not good. | -0.3453024 |
This is not right. | 0.0000000 |
My feet stick out of bed all night. | 0.0000000 |
And when I pull them in, oh dear! | 0.4291202 |
My feet stick out of bed up here! | 0.0000000 |
Tardis has correctly noted that “not good” is negative, but has incorrectly classified the fourth sentence as positive because it contains the affectionate term “dear.” To fix this, we can add a new row to our default dictionary classifying “oh dear” as a negative term.
custom_dictionary <- dplyr::add_row(tardis::dict_tardis_sentiment,
                                    token = "oh dear", score = -1)
tardis::tardis(text, dict_sentiments = custom_dictionary) %>%
  dplyr::select(sentences, score) %>%
  knitr::kable()
sentences | score |
---|---|
This is not good. | -0.3453024 |
This is not right. | 0.0000000 |
My feet stick out of bed all night. | 0.0000000 |
And when I pull them in, oh dear! | -0.2846456 |
My feet stick out of bed up here! | 0.0000000 |
Of course, our choice to assign “oh dear” a sentiment value of -1 was arbitrary, but with this change tardis correctly flags the fourth sentence as negative. This demonstrates how easy it is to adapt tardis’s dictionaries to a specific context.
Here are three two-sentence texts that have similarly neutral mean sentiments, but very different meanings.
text <- c("I guess so, that might be fine. I don't know.",
          "Wow, you're really smart. MORON!",
          "It's the worst idea I've ever heard 😘")
tardis::tardis(text) %>%
  knitr::kable()
sentences | score | score_sd | score_range |
---|---|---|---|
I guess so, that might be fine. I don’t know. | 0.1011443 | 0.1430397 | 0.2022887 |
Wow, you’re really smart. MORON! | 0.0767885 | 1.0030603 | 1.4185415 |
It’s the worst idea I’ve ever heard 😘 | -0.0073832 | 0.8732911 | 1.2350202 |
Only the first text is genuinely neutral; the other two each combine two wildly different sentiments that average out to neutral, but that most human readers would interpret as strongly emotional. Tardis also returns the standard deviation and range of within-text sentence sentiments, and we can see that the ranges for the two sarcastic texts are much larger than for the truly neutral text. Of course, these examples are blunt and not particularly funny, but they show the value of looking beyond the mean when studying sentiment in informal online communications.
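To make the three summary columns concrete, the sketch below recomputes them for the second text. The two sentence-level scores are back-calculated from the mean and range reported in the table above (for two values, they are the mean plus and minus half the range); they are not tardis output, and the variable names are illustrative.

```r
# Sentence scores back-calculated from the reported mean (0.0767885) and
# range (1.4185415) of the second text; they are not tardis output.
sentence_scores <- c(0.0767885 + 1.4185415 / 2,
                     0.0767885 - 1.4185415 / 2)

summary_stats <- c(
  score       = mean(sentence_scores),         # text score: mean of sentences
  score_sd    = sd(sentence_scores),           # spread of sentence scores
  score_range = diff(range(sentence_scores))   # max minus min
)
round(summary_stats, 4)
```

The mean sits close to zero even though one sentence is strongly positive and the other strongly negative, which is why the standard deviation and range are the more informative columns here.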
In some cases, researchers may have pre-built dictionaries and be interested in simply detecting those words, without worrying about any of the more complex rules described above. For this use case, tardis has a convenience parameter simple_count which, when TRUE, disables most of the logic and returns simple sums of token values. Tardis also sends the user a warning to confirm this is the expected behaviour.
For example:
dict_cats <- dplyr::tibble(token = c("cat", "cats"), score = c(1, 1))
text <- c("I love cats.", "Not a cat?!?!", "CATS CATS CATS!!!")
tardis::tardis(text, dict_sentiments = dict_cats, simple_count = TRUE) %>%
  dplyr::select(sentences, score) %>%
  knitr::kable()
#> Warning in tardis::tardis(text, dict_sentiments = dict_cats, simple_count =
#> TRUE): Parameter simple_count = TRUE overrides most other parameters. Make sure
#> this is intended!
sentences | score |
---|---|
I love cats. | 1 |
Not a cat?!?! | 1 |
CATS CATS CATS!!! | 3 |
Note that the column names are unchanged, although the interpretation differs.
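For intuition, a simple count of this kind can be approximated in a few lines of base R. This is a rough sketch under simplifying assumptions (lower-casing, stripping punctuation, splitting on whitespace); tardis's own tokenizer is more sophisticated, and the names dict and simple_counts are illustrative only.

```r
# Toy dictionary as a named vector: token -> score.
dict  <- c(cat = 1, cats = 1)
texts <- c("I love cats.", "Not a cat?!?!", "CATS CATS CATS!!!")

# Crude tokenisation: drop punctuation, lower-case, split on whitespace.
tokens <- strsplit(tolower(gsub("[[:punct:]]", "", texts)), "\\s+")

# Sum the dictionary scores of matched tokens; unmatched tokens contribute nothing.
simple_counts <- vapply(
  tokens,
  function(tk) sum(dict[tk], na.rm = TRUE),
  numeric(1)
)
simple_counts
```

Negations, capitalisation, and punctuation have no effect on these sums, which is consistent with the scores of 1, 1, and 3 in the table above.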
Once a text has been broken down into sentences and tokens, scores are built back up starting from the tokens.