The gutenbergr package helps you download and process public domain works from the Project Gutenberg collection. It provides both tools for downloading books (and stripping header/footer information) and a complete dataset of Project Gutenberg metadata that can be used to find works of interest. It includes:
gutenberg_download(), which downloads one or more works from Project Gutenberg by ID: e.g., gutenberg_download(84) downloads the text of Frankenstein.
gutenberg_metadata, which contains information about each work, pairing Gutenberg ID with title, author, language, etc.
gutenberg_authors, which contains information about each author, such as aliases and birth/death year.
gutenberg_subjects, which contains pairings of works with Library of Congress subjects and topics.
This package contains metadata for all Project Gutenberg works as R datasets, so that you can search and filter for particular works before downloading.
The dataset gutenberg_metadata contains information
about each work, pairing Gutenberg ID with title, author, language,
etc:
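library(gutenbergr)
library(dplyr)    # filter(), count() and anti_join() are used in later examples
library(stringr)  # str_detect() is used in a later example
gutenberg_metadata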
#> # A tibble: 79,491 × 8
#>    gutenberg_id title    author gutenberg_author_id language gutenberg_bookshelf
#>           <int> <chr>    <chr>                <int> <fct>    <chr>              
#>  1            1 "The De… Jeffe…                1638 en       Politics/American …
#>  2            2 "The Un… Unite…                   1 en       Politics/American …
#>  3            3 "John F… Kenne…                1666 en       Browsing: History …
#>  4            4 "Lincol… Linco…                   3 en       US Civil War/Brows…
#>  5            5 "The Un… Unite…                   1 en       United States/Poli…
#>  6            6 "Give M… Henry…                   4 en       American Revolutio…
#>  7            7 "The Ma… <NA>                    NA en       Browsing: History …
#>  8            8 "Abraha… Linco…                   3 en       US Civil War/Brows…
#>  9            9 "Abraha… Linco…                   3 en       US Civil War/Brows…
#> 10           10 "The Ki… <NA>                    NA en       Banned Books List …
#> # ℹ 79,481 more rows
#> # ℹ 2 more variables: rights <fct>, has_text <lgl>
For example, you could find the Gutenberg ID(s) of Jane Austen’s Persuasion by doing:
gutenberg_metadata |>
  filter(title == "Persuasion")
#> # A tibble: 3 × 8
#>   gutenberg_id title     author gutenberg_author_id language gutenberg_bookshelf
#>          <int> <chr>     <chr>                <int> <fct>    <chr>              
#> 1          105 Persuasi… Auste…                  68 en       Browsing: Culture/…
#> 2        22963 Persuasi… Auste…                  68 en       Browsing: Culture/…
#> 3        36777 Persuasi… Auste…                  68 fr       FR Littérature/Bro…
#> # ℹ 2 more variables: rights <fct>, has_text <lgl>
In many analyses, you may want to filter just for English works,
avoid duplicates, and include only books that have text that can be
downloaded. The gutenberg_works() function does this
pre-filtering:
gutenberg_works()
#> # A tibble: 61,693 × 8
#>    gutenberg_id title    author gutenberg_author_id language gutenberg_bookshelf
#>           <int> <chr>    <chr>                <int> <fct>    <chr>              
#>  1            1 "The De… Jeffe…                1638 en       Politics/American …
#>  2            2 "The Un… Unite…                   1 en       Politics/American …
#>  3            3 "John F… Kenne…                1666 en       Browsing: History …
#>  4            4 "Lincol… Linco…                   3 en       US Civil War/Brows…
#>  5            5 "The Un… Unite…                   1 en       United States/Poli…
#>  6            6 "Give M… Henry…                   4 en       American Revolutio…
#>  7            7 "The Ma… <NA>                    NA en       Browsing: History …
#>  8            8 "Abraha… Linco…                   3 en       US Civil War/Brows…
#>  9            9 "Abraha… Linco…                   3 en       US Civil War/Brows…
#> 10           10 "The Ki… <NA>                    NA en       Banned Books List …
#> # ℹ 61,683 more rows
#> # ℹ 2 more variables: rights <fct>, has_text <lgl>
It also allows you to perform filtering as an argument:
gutenberg_works(author == "Austen, Jane")
#> # A tibble: 13 × 8
#>    gutenberg_id title    author gutenberg_author_id language gutenberg_bookshelf
#>           <int> <chr>    <chr>                <int> <fct>    <chr>              
#>  1          105 "Persua… Auste…                  68 en       Browsing: Culture/…
#>  2          121 "Northa… Auste…                  68 en       Gothic Fiction/Bro…
#>  3          141 "Mansfi… Auste…                  68 en       Browsing: Culture/…
#>  4          158 "Emma"   Auste…                  68 en       Browsing: Culture/…
#>  5          161 "Sense … Auste…                  68 en       Browsing: Culture/…
#>  6          946 "Lady S… Auste…                  68 en       Browsing: Culture/…
#>  7         1212 "Love a… Auste…                  68 en       Browsing: Culture/…
#>  8         1342 "Pride … Auste…                  68 en       Best Books Ever Li…
#>  9        31100 "The Co… Auste…                  68 en       Browsing: Culture/…
#> 10        37431 "Pride … Auste…                  68 en       Browsing: Literatu…
#> 11        42078 "The Le… Auste…                  68 en       Browsing: Biograph…
#> 12        63569 "The Wa… Auste…                  68 en       Browsing: Culture/…
#> 13        74233 "Fragme… Auste…                  68 en       Browsing: Literatu…
#> # ℹ 2 more variables: rights <fct>, has_text <lgl>
# or with a regular expression
gutenberg_works(str_detect(author, "Austen"))
#> # A tibble: 23 × 8
#>    gutenberg_id title    author gutenberg_author_id language gutenberg_bookshelf
#>           <int> <chr>    <chr>                <int> <fct>    <chr>              
#>  1          105 Persuas… Auste…                  68 en       Browsing: Culture/…
#>  2          121 Northan… Auste…                  68 en       Gothic Fiction/Bro…
#>  3          141 Mansfie… Auste…                  68 en       Browsing: Culture/…
#>  4          158 Emma     Auste…                  68 en       Browsing: Culture/…
#>  5          161 Sense a… Auste…                  68 en       Browsing: Culture/…
#>  6          946 Lady Su… Auste…                  68 en       Browsing: Culture/…
#>  7         1212 Love an… Auste…                  68 en       Browsing: Culture/…
#>  8         1342 Pride a… Auste…                  68 en       Best Books Ever Li…
#>  9        17797 Memoir … Auste…                7603 en       Browsing: Biograph…
#> 10        22536 Jane Au… Auste…               25392 en       Browsing: Biograph…
#> # ℹ 13 more rows
#> # ℹ 2 more variables: rights <fct>, has_text <lgl>
The metadata currently in the package was last updated on 27 May 2025.
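As a rough illustration of the pre-filtering that gutenberg_works() performs, the sketch below approximates it by hand with dplyr. This is an assumption about the general approach, not the package's actual implementation. It keeps English works that have text and drops rows that repeat the same title and gutenberg_author_id. The real function may apply further criteria (for example on the rights column), so the row counts are not expected to match exactly. The name manual_works is hypothetical.

```r
library(dplyr)
library(gutenbergr)

# Hand-rolled approximation of gutenberg_works(); NOT the package's own code.
manual_works <- gutenberg_metadata |>
  # English works that have downloadable text
  filter(language == "en", has_text) |>
  # keep the first row for each title/author pair to avoid duplicates
  distinct(title, gutenberg_author_id, .keep_all = TRUE)

nrow(manual_works)
```

In practice gutenberg_works() is the more convenient choice. The manual version is mainly useful when you want to change one of the criteria yourself.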
The function gutenberg_download() downloads one or more
works from Project Gutenberg based on their ID. For example, we earlier
saw that one version of Persuasion has ID 105 (see the URL here), so
gutenberg_download(105) downloads this text.
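# Assumed call: meta_fields is inferred from the title and author columns shown below
persuasion <- gutenberg_download(105, meta_fields = c("title", "author"))
persuasion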
#> # A tibble: 8,357 × 4
#>    gutenberg_id text             title      author      
#>           <int> <chr>            <chr>      <chr>       
#>  1          105 "Persuasion"     Persuasion Austen, Jane
#>  2          105 ""               Persuasion Austen, Jane
#>  3          105 ""               Persuasion Austen, Jane
#>  4          105 "by Jane Austen" Persuasion Austen, Jane
#>  5          105 ""               Persuasion Austen, Jane
#>  6          105 "(1818)"         Persuasion Austen, Jane
#>  7          105 ""               Persuasion Austen, Jane
#>  8          105 ""               Persuasion Austen, Jane
#>  9          105 ""               Persuasion Austen, Jane
#> 10          105 ""               Persuasion Austen, Jane
#> # ℹ 8,347 more rows
Notice it is returned as a tbl_df (a type of data frame). Its two core
variables are gutenberg_id (useful if multiple books are
returned) and text, a character vector with one row per line.
You can also provide gutenberg_download() a vector of
IDs to download multiple books. For example, to download Renascence,
and Other Poems (book 109) along with
Persuasion, do:
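# Assumed call: meta_fields is inferred from the title and author columns shown below
books <- gutenberg_download(c(109, 105), meta_fields = c("title", "author"))
books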
#> # A tibble: 9,579 × 4
#>    gutenberg_id text                         title                       author 
#>           <int> <chr>                        <chr>                       <chr>  
#>  1          109 "Renascence and Other Poems" Renascence, and Other Poems Millay…
#>  2          109 ""                           Renascence, and Other Poems Millay…
#>  3          109 ""                           Renascence, and Other Poems Millay…
#>  4          109 "by"                         Renascence, and Other Poems Millay…
#>  5          109 ""                           Renascence, and Other Poems Millay…
#>  6          109 "Edna St. Vincent Millay"    Renascence, and Other Poems Millay…
#>  7          109 ""                           Renascence, and Other Poems Millay…
#>  8          109 ""                           Renascence, and Other Poems Millay…
#>  9          109 ""                           Renascence, and Other Poems Millay…
#> 10          109 ""                           Renascence, and Other Poems Millay…
#> # ℹ 9,569 more rows
Notice that the meta_fields argument allows us to add
one or more additional fields from the gutenberg_metadata
dataset to the downloaded text, such as title or author.
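Conceptually, adding meta_fields resembles joining the downloaded text to gutenberg_metadata on gutenberg_id. The sketch below illustrates that idea offline. Here text_only is a hypothetical two-line stand-in for downloaded text, and the join is only an illustration, not a description of how the package implements meta_fields.

```r
library(dplyr)
library(gutenbergr)

# Hypothetical stand-in for downloaded text, so no network access is needed
text_only <- tibble(
  gutenberg_id = 105L,
  text = c("Persuasion", "by Jane Austen")
)

# Attach title and author from the bundled metadata by matching on gutenberg_id
with_meta <- text_only |>
  left_join(
    gutenberg_metadata |> select(gutenberg_id, title, author),
    by = "gutenberg_id"
  )

with_meta
```

The same pattern works for any other column of gutenberg_metadata, such as language or rights.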
You may want to select books based on information other than their
title or author, such as their genre or topic.
gutenberg_subjects contains pairings of works with Library
of Congress subjects and topics. “lcc” means Library of Congress
Classification, while “lcsh” means Library of Congress
subject headings:
gutenberg_subjects
#> # A tibble: 255,312 × 3
#>    gutenberg_id subject_type subject                                            
#>           <int> <fct>        <chr>                                              
#>  1            1 lcsh         United States -- History -- Revolution, 1775-1783 …
#>  2            1 lcsh         United States. Declaration of Independence         
#>  3            1 lcc          E201                                               
#>  4            1 lcc          JK                                                 
#>  5            2 lcsh         Civil rights -- United States -- Sources           
#>  6            2 lcsh         United States. Constitution. 1st-10th Amendments   
#>  7            2 lcc          JK                                                 
#>  8            2 lcc          KF                                                 
#>  9            3 lcsh         United States -- Foreign relations -- 1961-1963    
#> 10            3 lcsh         Presidents -- United States -- Inaugural addresses 
#> # ℹ 255,302 more rows
This is useful for extracting texts from a particular topic or genre,
such as detective stories, or a particular character, such as Sherlock
Holmes. The gutenberg_id column can then be used to
download these texts or to link with other metadata.
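One way to link subject matches with other metadata is sketched here, assuming dplyr is available. The object names detective_ids and detective_works are hypothetical. The download step is left commented out because it needs network access.

```r
library(dplyr)
library(gutenbergr)

# IDs of works tagged with a particular subject heading
detective_ids <- gutenberg_subjects |>
  filter(subject == "Detective and mystery stories") |>
  distinct(gutenberg_id)

# Link to other metadata: keep only pre-filtered works whose ID appears above
detective_works <- gutenberg_works() |>
  semi_join(detective_ids, by = "gutenberg_id") |>
  select(gutenberg_id, title, author)

detective_works

# Downloading needs network access, so it is left commented out here:
# detective_texts <- gutenberg_download(detective_works$gutenberg_id[1:2],
#                                       meta_fields = "title")
```

Using semi_join() keeps one row per work even if a work matches the subject filter more than once.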
gutenberg_subjects |>
  filter(subject == "Detective and mystery stories")
#> # A tibble: 939 × 3
#>    gutenberg_id subject_type subject                      
#>           <int> <fct>        <chr>                        
#>  1          170 lcsh         Detective and mystery stories
#>  2          173 lcsh         Detective and mystery stories
#>  3          244 lcsh         Detective and mystery stories
#>  4          305 lcsh         Detective and mystery stories
#>  5          330 lcsh         Detective and mystery stories
#>  6          481 lcsh         Detective and mystery stories
#>  7          547 lcsh         Detective and mystery stories
#>  8          863 lcsh         Detective and mystery stories
#>  9          905 lcsh         Detective and mystery stories
#> 10         1155 lcsh         Detective and mystery stories
#> # ℹ 929 more rows
gutenberg_subjects |>
  filter(grepl("Holmes, Sherlock", subject))
#> # A tibble: 57 × 3
#>    gutenberg_id subject_type subject                                           
#>           <int> <fct>        <chr>                                             
#>  1          108 lcsh         Holmes, Sherlock (Fictitious character) -- Fiction
#>  2          221 lcsh         Holmes, Sherlock (Fictitious character) -- Fiction
#>  3          244 lcsh         Holmes, Sherlock (Fictitious character) -- Fiction
#>  4          834 lcsh         Holmes, Sherlock (Fictitious character) -- Fiction
#>  5         1661 lcsh         Holmes, Sherlock (Fictitious character) -- Fiction
#>  6         2097 lcsh         Holmes, Sherlock (Fictitious character) -- Fiction
#>  7         2343 lcsh         Holmes, Sherlock (Fictitious character) -- Fiction
#>  8         2344 lcsh         Holmes, Sherlock (Fictitious character) -- Fiction
#>  9         2345 lcsh         Holmes, Sherlock (Fictitious character) -- Fiction
#> 10         2346 lcsh         Holmes, Sherlock (Fictitious character) -- Fiction
#> # ℹ 47 more rows
gutenberg_authors contains information about each
author, such as aliases and birth/death year:
gutenberg_authors
#> # A tibble: 26,077 × 7
#>    gutenberg_author_id author        alias birthdate deathdate wikipedia aliases
#>                  <int> <chr>         <chr>     <int>     <int> <chr>     <chr>  
#>  1                   1 United States U.S.…        NA        NA https://… U.S.A. 
#>  2                   3 Lincoln, Abr… <NA>       1809      1865 https://… United…
#>  3                   4 Henry, Patri… <NA>       1736      1799 https://… <NA>   
#>  4                   5 Adam, Paul    <NA>       1849      1931 https://… <NA>   
#>  5                   7 Carroll, Lew… Dodg…      1832      1898 https://… Dodgso…
#>  6                   8 United State… <NA>         NA        NA https://… Agency…
#>  7                   9 Melville, He… Melv…      1819      1891 https://… Melvil…
#>  8                  10 Barrie, J. M… <NA>       1860      1937 https://… Barrie…
#>  9                  11 Church of Je… <NA>         NA        NA https://… <NA>   
#> 10                  12 Smith, Josep… Smit…      1805      1844 https://… Smith,…
#> # ℹ 26,067 more rows
What’s next after retrieving a book’s text? Well, having the book as a data frame is especially useful for working with the tidytext package for text analysis.
library(tidytext)  # provides unnest_tokens() and the stop_words dataset
words <- books |>
  unnest_tokens(word, text)
words
#> # A tibble: 90,581 × 4
#>    gutenberg_id title                       author                   word      
#>           <int> <chr>                       <chr>                    <chr>     
#>  1          109 Renascence, and Other Poems Millay, Edna St. Vincent renascence
#>  2          109 Renascence, and Other Poems Millay, Edna St. Vincent and       
#>  3          109 Renascence, and Other Poems Millay, Edna St. Vincent other     
#>  4          109 Renascence, and Other Poems Millay, Edna St. Vincent poems     
#>  5          109 Renascence, and Other Poems Millay, Edna St. Vincent by        
#>  6          109 Renascence, and Other Poems Millay, Edna St. Vincent edna      
#>  7          109 Renascence, and Other Poems Millay, Edna St. Vincent st        
#>  8          109 Renascence, and Other Poems Millay, Edna St. Vincent vincent   
#>  9          109 Renascence, and Other Poems Millay, Edna St. Vincent millay    
#> 10          109 Renascence, and Other Poems Millay, Edna St. Vincent contents  
#> # ℹ 90,571 more rows
word_counts <- words |>
  anti_join(stop_words, by = "word") |>
  count(title, word, sort = TRUE)
word_counts
#> # A tibble: 6,664 × 3
#>    title      word          n
#>    <chr>      <chr>     <int>
#>  1 Persuasion anne        447
#>  2 Persuasion captain     302
#>  3 Persuasion elliot      254
#>  4 Persuasion lady        214
#>  5 Persuasion wentworth   191
#>  6 Persuasion charles     155
#>  7 Persuasion time        152
#>  8 Persuasion sir         149
#>  9 Persuasion miss        125
#> 10 Persuasion walter      123
#> # ℹ 6,654 more rows
You may also find these resources useful:
The wikipedia column in gutenberg_authors can be linked to Wikipedia content with the WikipediR package or to pageview statistics with the wikipediatrend package.
format_reverse function for reversing “Last, First” names).
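Author names in this metadata are stored in “Last, First” form (for example "Austen, Jane"). The base R sketch below shows that kind of reversal. It is not the format_reverse function mentioned above: reverse_name() is a hypothetical helper written only for illustration. It rewrites names containing exactly one comma and leaves everything else unchanged.

```r
# Hypothetical helper: turn "Last, First" into "First Last".
# Names without a comma, or with more than one comma, are returned unchanged.
reverse_name <- function(x) {
  sub("^([^,]+),\\s*([^,]+)$", "\\2 \\1", x)
}

reverse_name(c("Austen, Jane", "Millay, Edna St. Vincent", "United States"))
```

Names with suffixes or several commas need more careful handling, which is where a dedicated name-parsing package becomes useful.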