Introduction to grepreaper

Introduction

Modern file systems often include data sets that are stored in components across multiple files. This could include daily stock pricing changes, monthly transactions, or quarterly updates. For the purpose of analysis, these data may be read in, filtered for relevant records, and then combined into a single object (such as a data.frame in R). However, this reading process is not computationally efficient. It requires individual reads of all of the files before aggregation. If filters are applied, this is done after the reading takes place. Many irrelevant records therefore have to be read in.

Utilizing grep at the command line can facilitate pre-filtering of data and aggregation from multiple files. Linking grep to R can help to achieve a number of goals:

Read and aggregate data from multiple files without the need for additional processing work.
Use pattern matching to filter data before it is read. This supports reading and aggregating relevant data from multiple files without loading unnecessary records.
Provide counts of the number of rows of relevant data in a range of files. This can incorporate pattern matching.

Each of these goals begins with certain assumptions about the data to be read:

The data are stored in delimited flat files that could reasonably be read into a data.frame object in R with a typical file reading program.
The data in each file have a similar structure in terms of variables (columns). It would be reasonable to bind the rows from all of the files into a single, comprehensive data.frame object.

The grepreaper package is designed to facilitate this reading process. It provides user-friendly functions for reading data and counting rows without the need for the user to craft the corresponding grep commands. This vignette shows examples of the features and capabilities of the grepreaper package.

Platform Compatibility

grepreaper is designed to be cross-platform. On Unix-like systems (Linux and macOS), it uses the system’s native grep utility. On Windows, the package requires Rtools to be installed, which provides the necessary grep.exe executable.

The package automatically detects the location of the grep binary and handles shell-specific quoting requirements (e.g., double quotes for Windows CMD and single quotes for Unix shells) to ensure consistent behavior across environments.
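As a rough illustration of the two quoting styles, the sketch below uses only base R: Sys.which() to look for a grep binary on the search path and shQuote() to quote an argument for a Unix shell or for Windows CMD. It is not taken from grepreaper's source, and the package's own detection and quoting logic may differ.

```r
# Locate grep on the search path; the result is "" if none is found.
grep_path <- Sys.which("grep")

# Quote the same argument for a Unix shell and for Windows CMD.
unix_style <- shQuote("Very Good", type = "sh")
windows_style <- shQuote("Very Good", type = "cmd")

cat(unix_style, "\n")     # single quotes around Very Good
cat(windows_style, "\n")  # double quotes around Very Good
```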

Reading Data

Most typically, we use a file reading method to load data. As an example, the fread() function from the data.table package can read a delimited file. We will work with the diamonds data from the ggplot2 package. Here the data are stored in a .csv file:

diamonds <- fread(input = "diamonds.csv")
diamonds[1:5,]

##    carat     cut  color clarity depth table price     x     y     z
##    <num>  <char> <char>  <char> <num> <num> <int> <num> <num> <num>
## 1:  0.23   Ideal      E     SI2  61.5    55   326  3.95  3.98  2.43
## 2:  0.21 Premium      E     SI1  59.8    61   326  3.89  3.84  2.31
## 3:  0.23    Good      E     VS1  56.9    65   327  4.05  4.07  2.31
## 4:  0.29 Premium      I     VS2  62.4    58   334  4.20  4.23  2.63
## 5:  0.31    Good      J     SI2  63.3    58   335  4.34  4.35  2.75

With these data, we could subsequently filter the records to only show diamonds listed as “Ideal” in the cut variable:

ideal <- diamonds[cut == "Ideal",]
ideal[1:5,]

##    carat    cut  color clarity depth table price     x     y     z
##    <num> <char> <char>  <char> <num> <num> <int> <num> <num> <num>
## 1:  0.23  Ideal      E     SI2  61.5    55   326  3.95  3.98  2.43
## 2:  0.23  Ideal      J     VS1  62.8    56   340  3.93  3.90  2.46
## 3:  0.31  Ideal      J     SI2  62.2    54   344  4.35  4.37  2.71
## 4:  0.30  Ideal      I     SI2  62.0    54   348  4.31  4.34  2.68
## 5:  0.33  Ideal      I     SI2  61.8    55   403  4.49  4.51  2.78

Utilizing grep at the command line provides another option to read the data. This can be performed within data.table’s fread() function:

diamonds <- fread(cmd = "grep '' 'diamonds.csv'")
diamonds[1:5,]

##    carat     cut  color clarity depth table price     x     y     z
##    <num>  <char> <char>  <char> <num> <num> <int> <num> <num> <num>
## 1:  0.23   Ideal      E     SI2  61.5    55   326  3.95  3.98  2.43
## 2:  0.21 Premium      E     SI1  59.8    61   326  3.89  3.84  2.31
## 3:  0.23    Good      E     VS1  56.9    65   327  4.05  4.07  2.31
## 4:  0.29 Premium      I     VS2  62.4    58   334  4.20  4.23  2.63
## 5:  0.31    Good      J     SI2  63.3    58   335  4.34  4.35  2.75

With grep, it is also possible to pre-filter the data based upon pattern matching:

ideal <- fread(cmd = "grep 'Ideal' 'diamonds.csv'")
ideal[1:5,]

##       V1     V2     V3     V4    V5    V6    V7    V8    V9   V10
##    <num> <char> <char> <char> <num> <num> <int> <num> <num> <num>
## 1:  0.23  Ideal      E    SI2  61.5    55   326  3.95  3.98  2.43
## 2:  0.23  Ideal      J    VS1  62.8    56   340  3.93  3.90  2.46
## 3:  0.31  Ideal      J    SI2  62.2    54   344  4.35  4.37  2.71
## 4:  0.30  Ideal      I    SI2  62.0    54   348  4.31  4.34  2.68
## 5:  0.33  Ideal      I    SI2  61.8    55   403  4.49  4.51  2.78

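Notice that this approach removes the headers: the header line does not contain the pattern “Ideal”, so grep filters it out, and fread() falls back to the default names V1 through V10. However, the method is otherwise sound. With grep, we can pre-filter the data.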
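One way to keep the column names when pre-filtering by hand is to read the header line separately and reattach it. The following is a minimal sketch with base R and a small made-up file (toy data, not diamonds.csv). It assumes that grep is available on the system path and that the file is comma-delimited with a single header line. It is not meant to describe how grepreaper restores headers internally.

```r
# Write a tiny CSV to a temporary file (toy data for illustration only).
csv_path <- tempfile(fileext = ".csv")
writeLines(c("carat,cut,price",
             "0.23,Ideal,326",
             "0.21,Premium,326",
             "0.30,Ideal,348"), csv_path)

# Read the header line on its own; grep would drop it because it
# does not contain the pattern.
header <- strsplit(readLines(csv_path, n = 1), ",", fixed = TRUE)[[1]]

# Build the grep command, quoting the pattern and path for the shell.
cmd <- paste("grep", shQuote("Ideal"), shQuote(csv_path))

# Read only the matching lines and restore the column names.
ideal <- read.csv(pipe(cmd), header = FALSE, col.names = header)
ideal
```

The same idea carries over to fread(cmd = ...), where the names could be assigned after reading.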

While some users may be eager to learn command line programming tools, the goal of our work is to simplify this approach. The grepreaper package provides simple functions for reading and pre-filtering data.

diamonds <- grep_read(files = "diamonds.csv")
diamonds[1:5,]

##    carat     cut  color clarity depth table price     x     y     z
##    <num>  <char> <char>  <char> <num> <int> <int> <num> <num> <num>
## 1:  0.23   Ideal      E     SI2  61.5    55   326  3.95  3.98  2.43
## 2:  0.21 Premium      E     SI1  59.8    61   326  3.89  3.84  2.31
## 3:  0.23    Good      E     VS1  56.9    65   327  4.05  4.07  2.31
## 4:  0.29 Premium      I     VS2  62.4    58   334  4.20  4.23  2.63
## 5:  0.31    Good      J     SI2  63.3    58   335  4.34  4.35  2.75

Showing the Underlying grep Command

The grep_read() function can also display the underlying grep command:

grep_read(files = "diamonds.csv", show_cmd = TRUE)

## [1] "'/usr/bin/grep' '' '/tmp/RtmpTATCD1/grepreaper_vignette/diamonds.csv'"

This is useful for educational purposes and to better understand how the data are being read.

Reading and Pre-Filtering Data

A filter can be established by adding the pattern:

ideal <- grep_read(files = "diamonds.csv", pattern = "Ideal")
ideal[1:5,]

##    carat    cut  color clarity depth table price     x     y     z
##    <num> <char> <char>  <char> <num> <int> <int> <num> <num> <num>
## 1:  0.23  Ideal      J     VS1  62.8    56   340  3.93  3.90  2.46
## 2:  0.31  Ideal      J     SI2  62.2    54   344  4.35  4.37  2.71
## 3:  0.30  Ideal      I     SI2  62.0    54   348  4.31  4.34  2.68
## 4:  0.33  Ideal      I     SI2  61.8    55   403  4.49  4.51  2.78
## 5:  0.33  Ideal      I     SI2  61.2    56   403  4.49  4.50  2.75

This would correspond to the following grep command:

grep_read(files = "diamonds.csv", pattern = "Ideal", show_cmd = TRUE)

## [1] "'/usr/bin/grep' 'Ideal' '/tmp/RtmpTATCD1/grepreaper_vignette/diamonds.csv'"

You can also search for multiple patterns (using OR logic):

multiple_cuts <- grep_read(files = "diamonds.csv", pattern = c("Ideal", "Very Good"))
multiple_cuts[1:5,]

##    carat       cut  color clarity depth table price     x     y     z
##    <num>    <char> <char>  <char> <num> <int> <int> <num> <num> <num>
## 1:  0.24 Very Good      J    VVS2  62.8    57   336  3.94  3.96  2.48
## 2:  0.24 Very Good      I    VVS1  62.3    57   336  3.95  3.98  2.47
## 3:  0.26 Very Good      H     SI1  61.9    55   337  4.07  4.11  2.53
## 4:  0.23 Very Good      H     VS1  59.4    61   338  4.00  4.05  2.39
## 5:  0.23     Ideal      J     VS1  62.8    56   340  3.93  3.90  2.46

We could also display the construction of this grep command:

grep_read(files = "diamonds.csv", pattern = c("Ideal", "Very Good"), show_cmd = TRUE)

## [1] "'/usr/bin/grep' -e 'Ideal' -e 'Very Good' '/tmp/RtmpTATCD1/grepreaper_vignette/diamonds.csv'"
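The -e construction above can be reproduced by hand with a few lines of base R. This is only a sketch of how such a command string might be assembled, with a relative file name and a bare grep in place of the full paths shown in the output. It is not grepreaper's own code.

```r
patterns <- c("Ideal", "Very Good")

# One "-e 'pattern'" pair per pattern; grep treats several -e patterns as OR.
pattern_args <- paste("-e", shQuote(patterns, type = "sh"), collapse = " ")

cmd <- paste("grep", pattern_args, shQuote("diamonds.csv", type = "sh"))
cmd
```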

Special Options

File reading with grep also allows for some variations on filtering. The grep_read() function has a number of options built in:

invert: Search for records that do NOT contain the requested pattern:

grep_read(files = "diamonds.csv", pattern = c("SI2"), invert = TRUE)[1:5,]

##    carat       cut  color clarity depth table price     x     y     z
##    <num>    <char> <char>  <char> <num> <int> <int> <num> <num> <num>
## 1:  0.21   Premium      E     SI1  59.8    61   326  3.89  3.84  2.31
## 2:  0.23      Good      E     VS1  56.9    65   327  4.05  4.07  2.31
## 3:  0.29   Premium      I     VS2  62.4    58   334  4.20  4.23  2.63
## 4:  0.24 Very Good      J    VVS2  62.8    57   336  3.94  3.96  2.48
## 5:  0.24 Very Good      I    VVS1  62.3    57   336  3.95  3.98  2.47

This adds the -v option to the grep command:

grep_read(files = "diamonds.csv", pattern = c("SI2"), invert = TRUE, show_cmd = TRUE)

## [1] "'/usr/bin/grep' -v 'SI2' '/tmp/RtmpTATCD1/grepreaper_vignette/diamonds.csv'"

ignore_case: Identify any records that contain the pattern without regard to case sensitivity:

grep_read(files = "diamonds.csv", pattern = c("ideal"), ignore_case = TRUE)[1:5,]

##    carat    cut  color clarity depth table price     x     y     z
##    <num> <char> <char>  <char> <num> <int> <int> <num> <num> <num>
## 1:  0.23  Ideal      J     VS1  62.8    56   340  3.93  3.90  2.46
## 2:  0.31  Ideal      J     SI2  62.2    54   344  4.35  4.37  2.71
## 3:  0.30  Ideal      I     SI2  62.0    54   348  4.31  4.34  2.68
## 4:  0.33  Ideal      I     SI2  61.8    55   403  4.49  4.51  2.78
## 5:  0.33  Ideal      I     SI2  61.2    56   403  4.49  4.50  2.75

This adds the -i option to the grep command:

grep_read(files = "diamonds.csv", pattern = c("ideal"), ignore_case = TRUE, show_cmd = TRUE)

## [1] "'/usr/bin/grep' -i 'ideal' '/tmp/RtmpTATCD1/grepreaper_vignette/diamonds.csv'"

fixed: The pattern is treated as a fixed (literal) string, exactly as written, rather than as a regular expression:

grep_read(files = "diamonds.csv", pattern = "IdEaL", ignore_case = TRUE, fixed = TRUE)[1:5,]

##    carat    cut  color clarity depth table price     x     y     z
##    <num> <char> <char>  <char> <num> <int> <int> <num> <num> <num>
## 1:  0.23  Ideal      J     VS1  62.8    56   340  3.93  3.90  2.46
## 2:  0.31  Ideal      J     SI2  62.2    54   344  4.35  4.37  2.71
## 3:  0.30  Ideal      I     SI2  62.0    54   348  4.31  4.34  2.68
## 4:  0.33  Ideal      I     SI2  61.8    55   403  4.49  4.51  2.78
## 5:  0.33  Ideal      I     SI2  61.2    56   403  4.49  4.50  2.75

This adds the -F option to the grep command:

grep_read(files = "diamonds.csv", pattern = "IdEaL", ignore_case = TRUE, fixed = TRUE, show_cmd = TRUE)

## [1] "'/usr/bin/grep' -i -F 'IdEaL' '/tmp/RtmpTATCD1/grepreaper_vignette/diamonds.csv'"
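The difference between a regular expression and a fixed pattern can be illustrated with R's own grepl(), whose fixed argument plays a role similar to grep's -F flag. The values below are made up for the illustration and do not come from the diamonds data.

```r
values <- c("1.5", "125")

# As a regular expression, "." matches any single character,
# so the pattern also matches "125".
regex_match <- grepl("1.5", values)

# As a fixed string, "." only matches a literal period.
fixed_match <- grepl("1.5", values, fixed = TRUE)

regex_match  # TRUE TRUE
fixed_match  # TRUE FALSE
```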

recursive: This will search recursively for all files within a folder and its subfolders. Note that it would be necessary to specify a path and potentially a file_pattern.

grep_read(path = ".", recursive = TRUE, pattern = "Ideal", file_pattern = ".csv")[1:5,]

##    carat    cut  color clarity depth table price     x     y     z
##    <num> <char> <char>  <char> <num> <int> <int> <num> <num> <num>
## 1:  0.23  Ideal      J     VS1  62.8    56   340  3.93  3.90  2.46
## 2:  0.31  Ideal      J     SI2  62.2    54   344  4.35  4.37  2.71
## 3:  0.30  Ideal      I     SI2  62.0    54   348  4.31  4.34  2.68
## 4:  0.33  Ideal      I     SI2  61.8    55   403  4.49  4.51  2.78
## 5:  0.33  Ideal      I     SI2  61.2    56   403  4.49  4.50  2.75

This adds the -r option to the grep command. Note that recursive searching will include a larger number of files, which can greatly lengthen the command.

cmd <- grep_read(path = ".", recursive = TRUE, pattern = "Ideal", file_pattern = ".csv", show_cmd = TRUE)
substring(text = cmd, first = 1, last = 100)

## [1] "'/usr/bin/grep' -r -H 'Ideal' '/tmp/RtmpTATCD1/grepreaper_vignette/diamonds.csv' '/tmp/RtmpTATCD1/gr"
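A recursive file listing of this kind can be sketched with base R's list.files(). The folder and file names below are invented for the example. Note that list.files() interprets its pattern argument as a regular expression, so ".csv" also matches names such as "mycsv_notes.txt"; whether grepreaper's file_pattern behaves the same way is not something this sketch establishes.

```r
# Build a small nested folder with made-up files.
root <- tempfile("recursive_sketch_")
dir.create(file.path(root, "sub"), recursive = TRUE)
file.create(file.path(root, "a.csv"),
            file.path(root, "sub", "b.csv"),
            file.path(root, "sub", "mycsv_notes.txt"))

# A loose pattern: "." matches any character, so the .txt file is included.
loose <- list.files(root, pattern = ".csv", recursive = TRUE)

# A stricter pattern: a literal period followed by "csv" at the end of the name.
strict <- list.files(root, pattern = "\\.csv$", recursive = TRUE)

loose
strict
```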

word_match: This restricts the matches to entire words rather than portions of words.

grep_read(files = "diamonds.csv", pattern = "VS1", word_match = TRUE)

##       carat       cut  color clarity depth table price     x     y     z
##       <num>    <char> <char>  <char> <num> <int> <int> <num> <num> <num>
##    1:  0.23 Very Good      H     VS1  59.4    61   338  4.00  4.05  2.39
##    2:  0.23     Ideal      J     VS1  62.8    56   340  3.93  3.90  2.46
##    3:  0.23 Very Good      H     VS1  61.0    57   353  3.94  3.96  2.41
##    4:  0.24   Premium      I     VS1  62.5    57   355  3.97  3.94  2.47
##    5:  0.23 Very Good      F     VS1  60.9    57   357  3.96  3.99  2.42
##   ---                                                                   
## 8166:  0.57   Premium      E     VS1  61.6    58  2753  5.36  5.33  3.29
## 8167:  0.84      Good      I     VS1  63.7    59  2753  5.94  5.90  3.77
## 8168:  0.76   Premium      I     VS1  59.3    62  2753  5.93  5.85  3.49
## 8169:  0.70 Very Good      D     VS1  63.1    59  2755  5.67  5.58  3.55
## 8170:  0.71     Ideal      G     VS1  61.4    56  2756  5.76  5.73  3.53

Notice that using word_match limits the search results to only diamonds with a clarity of ‘VS1’. Diamonds that are ‘VVS1’ would otherwise match the pattern without an exact word match.

This adds the -w option to the grep command.

grep_read(files = "diamonds.csv", pattern = "VS1", word_match = TRUE, show_cmd = TRUE)

## [1] "'/usr/bin/grep' -w 'VS1' '/tmp/RtmpTATCD1/grepreaper_vignette/diamonds.csv'"
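The effect of whole-word matching can be approximated in R with word boundaries. The sketch below uses grepl() with \b on three clarity codes; it mirrors the idea behind grep's -w flag but does not call grep itself.

```r
clarity <- c("VS1", "VVS1", "SI1")

# Plain substring search: "VVS1" also contains "VS1".
substring_match <- grepl("VS1", clarity, fixed = TRUE)

# Word-boundary search: "VS1" must not be preceded or followed
# by another letter, digit, or underscore.
word_only <- grepl("\\bVS1\\b", clarity, perl = TRUE)

substring_match  # TRUE TRUE FALSE
word_only        # TRUE FALSE FALSE
```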

include_filename: We can identify the original source file for each row:

grep_read(files = "diamonds.csv", include_filename = TRUE)[1:5]

##    carat     cut  color clarity depth table price     x     y     z
##    <num>  <char> <char>  <char> <num> <int> <int> <num> <num> <num>
## 1:  0.23   Ideal      E     SI2  61.5    55   326  3.95  3.98  2.43
## 2:  0.21 Premium      E     SI1  59.8    61   326  3.89  3.84  2.31
## 3:  0.23    Good      E     VS1  56.9    65   327  4.05  4.07  2.31
## 4:  0.29 Premium      I     VS2  62.4    58   334  4.20  4.23  2.63
## 5:  0.31    Good      J     SI2  63.3    58   335  4.34  4.35  2.75
##                                                file
##                                              <char>
## 1: /tmp/RtmpTATCD1/grepreaper_vignette/diamonds.csv
## 2: /tmp/RtmpTATCD1/grepreaper_vignette/diamonds.csv
## 3: /tmp/RtmpTATCD1/grepreaper_vignette/diamonds.csv
## 4: /tmp/RtmpTATCD1/grepreaper_vignette/diamonds.csv
## 5: /tmp/RtmpTATCD1/grepreaper_vignette/diamonds.csv

This is especially helpful when reading through multiple files, which will be discussed in more detail later in this document.

This adds -H to the grep command:

grep_read(files = "diamonds.csv", include_filename = TRUE, show_cmd = TRUE)

## [1] "'/usr/bin/grep' -H '' '/tmp/RtmpTATCD1/grepreaper_vignette/diamonds.csv'"

show_line_numbers: This provides the row indices from the original files:

grep_read(files = "diamonds.csv", pattern = "Ideal", show_line_numbers = TRUE)[1:5]

##    carat    cut  color clarity depth table price     x     y     z line_number
##    <num> <char> <char>  <char> <num> <int> <int> <num> <num> <num>       <num>
## 1:    NA  Ideal      J     VS1  62.8    56   340  3.93  3.90  2.46          NA
## 2:    NA  Ideal      J     SI2  62.2    54   344  4.35  4.37  2.71          NA
## 3:    NA  Ideal      I     SI2  62.0    54   348  4.31  4.34  2.68          NA
## 4:    NA  Ideal      I     SI2  61.8    55   403  4.49  4.51  2.78          NA
## 5:    NA  Ideal      I     SI2  61.2    56   403  4.49  4.50  2.75          NA

Note that the displayed indices assume that the header row is not part of the count. This is an adjustment to the output of grep, which would ordinarily count the header line. (This effectively subtracts 1 from grep’s line numbers.)
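The adjustment can be sketched on grep-style output of the form file:line:record. The two strings below are invented for the illustration (they are not real output from diamonds.csv), and the parsing is only one possible approach, not grepreaper's implementation. The simple regular expression assumes that the file name contains no colon, which would not hold for Windows paths such as C:\data\toy.csv.

```r
# Hypothetical lines in the shape produced by grep with -n and -H.
raw <- c("toy.csv:2:0.23,Ideal,E", "toy.csv:13:0.30,Ideal,I")

# Split each line into file name, line number, and record.
parts <- regmatches(raw, regexec("^([^:]*):([0-9]+):(.*)$", raw))
source_file <- vapply(parts, function(p) p[2], character(1))
grep_line <- as.integer(vapply(parts, function(p) p[3], character(1)))
record <- vapply(parts, function(p) p[4], character(1))

# Subtract 1 so that the header line is not counted.
row_index <- grep_line - 1L
row_index
```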

Showing the line number adds -n to the grep command:

grep_read(files = "diamonds.csv", pattern = "Ideal", show_line_numbers = TRUE, show_cmd = TRUE)

## [1] "'/usr/bin/grep' -n -H 'Ideal' '/tmp/RtmpTATCD1/grepreaper_vignette/diamonds.csv'"

For processing purposes behind the scenes, we maintain the -H for filenames when extracting the line numbers. These filenames are removed if not specifically requested.
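The options in this section each correspond to one grep flag. That mapping can be sketched with a small helper. Here build_grep_cmd() is a hypothetical name used only for this illustration; it is not a grepreaper function, and the package may order or quote its arguments differently.

```r
# Hypothetical helper (not part of grepreaper): turn logical options
# into grep flags and assemble a command string.
build_grep_cmd <- function(pattern, file, invert = FALSE, ignore_case = FALSE,
                           fixed = FALSE, word_match = FALSE) {
  flags <- c(if (invert) "-v",
             if (ignore_case) "-i",
             if (fixed) "-F",
             if (word_match) "-w")
  paste(c("grep", flags,
          shQuote(pattern, type = "sh"),
          shQuote(file, type = "sh")),
        collapse = " ")
}

cmd_fixed <- build_grep_cmd("IdEaL", "diamonds.csv", ignore_case = TRUE, fixed = TRUE)
cmd_invert <- build_grep_cmd("SI2", "diamonds.csv", invert = TRUE)
cmd_fixed
cmd_invert
```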

Aggregating Data from Multiple Files

Some data systems include records that are spread out over many similar files. For our purposes, we will assume that all of the relevant files have the same structure (number of columns, column names, and order of columns).

As an example data set, we have provided 1000 files of simulated ratings. Each row shows a user, an item, and a rating on a 1-5 Likert scale. The files are organized so that each user’s ratings are contained in a unique file.

Most file reading programs in R only read a single file at a time. As a result, we would have to iterate through the reading process. Then all of the data would require aggregation, using functions like rbind() to create a single object.
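For comparison, the iterate-then-bind workflow might look like the following base R sketch. The three files are generated on the fly with made-up users, items, and ratings; they are not the simulated ratings files that accompany this vignette.

```r
# Create three small CSV files with the same columns (toy data).
dir_path <- tempfile("ratings_sketch_")
dir.create(dir_path)
for (i in 1:3) {
  toy <- data.frame(user = paste0("user_", i),
                    item = c("item_a", "item_b"),
                    rating = c(i, i + 1))
  write.csv(toy, file.path(dir_path, sprintf("file_%d.csv", i)), row.names = FALSE)
}

# Read each file separately, then bind the pieces into one data.frame.
files <- list.files(dir_path, pattern = "\\.csv$", full.names = TRUE)
pieces <- lapply(files, read.csv)
combined <- do.call(rbind, pieces)

# Any filter is applied only after every record has been read.
high <- combined[combined$rating >= 3, ]
nrow(combined)
nrow(high)
```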

We can improve upon this process with another application of grep at the command line. This is set up to read and aggregate data from many files, all in a single line of code. The grep_read() function implements this with a simple call:

two_files <- c("ratings_data/file_1.csv", "ratings_data/file_2.csv")
grep_read(files = two_files)

##          user             item rating
##        <char>           <char>  <int>
##   1: a7gzXxfI 0JFCjVx2P1RMzy3h      4
##   2: a7gzXxfI 0kG80toKp2msfAut      5
##   3: a7gzXxfI 1Bji5PQIOKXaMGZq      3
##   4: a7gzXxfI 1fg4sLgEFzAtOqCa      5
##   5: a7gzXxfI 3k7Wf4yv0RV6vi4K      5
##  ---                                 
## 211: MAYVCdgd whllyERYJir9lTi6      2
## 212: MAYVCdgd ykBGM3UcRgF8whc5      5
## 213: MAYVCdgd ypJCM39tzfpxsVgB      3
## 214: MAYVCdgd yrltwtIEX93JzLYx      2
## 215: MAYVCdgd zQBDgVrQmSYGpYrn      4

We can also show the underlying grep command:

grep_read(files = two_files, show_cmd = TRUE)

## [1] "'/usr/bin/grep' -H '' '/tmp/RtmpTATCD1/grepreaper_vignette/ratings_data/file_1.csv' '/tmp/RtmpTATCD1/grepreaper_vignette/ratings_data/file_2.csv'"
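To see what such a multi-file command returns, the sketch below runs grep -H by hand on two small made-up files and tidies the result in base R. It assumes that grep is on the system path, that file paths contain no colon, and that no data row is identical to the header line. The way grepreaper handles the repeated headers internally may differ.

```r
# Two toy files with the same header (invented data).
dir_path <- tempfile("ratings_sketch_")
dir.create(dir_path)
files <- file.path(dir_path, c("file_1.csv", "file_2.csv"))
writeLines(c("user,item,rating", "u1,itemA,4", "u1,itemB,5"), files[1])
writeLines(c("user,item,rating", "u2,itemA,2"), files[2])

# An empty pattern matches every line; -H prefixes each line with "path:".
cmd <- paste("grep -H", shQuote(""), paste(shQuote(files), collapse = " "))
con <- pipe(cmd)
lines <- readLines(con)
close(con)

# Separate the file prefix from the record.
source_file <- sub(":.*$", "", lines)
content <- sub("^[^:]*:", "", lines)

# Every file contributes its own header line; keep a single copy.
header <- content[1]
keep <- content != header
ratings <- read.csv(text = c(header, content[keep]))
ratings$file <- basename(source_file[keep])
ratings
```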

We could likewise read in the data for 10 files:

ten_files <- sprintf("ratings_data/file_%d.csv", 1:10)
grep_read(files = ten_files)

##           user             item rating
##         <char>           <char>  <int>
##    1: a7gzXxfI 0JFCjVx2P1RMzy3h      4
##    2: a7gzXxfI 0kG80toKp2msfAut      5
##    3: a7gzXxfI 1Bji5PQIOKXaMGZq      3
##    4: a7gzXxfI 1fg4sLgEFzAtOqCa      5
##    5: a7gzXxfI 3k7Wf4yv0RV6vi4K      5
##   ---                                 
## 1016: QPW5X7ci uNp5n9ziPoSjwab6      1
## 1017: QPW5X7ci uPU8XKJD4wo3Twss      1
## 1018: QPW5X7ci uXjCOKMvr1gPaxTg      4
## 1019: QPW5X7ci vPI43TEe3CMQUM5U      3
## 1020: QPW5X7ci wN8YPrJls7N3vGjC      1

Because each filename is appended, the grep command becomes quite lengthy:

grep_read(files = ten_files, show_cmd = TRUE)

## [1] "'/usr/bin/grep' -H '' '/tmp/RtmpTATCD1/grepreaper_vignette/ratings_data/file_1.csv' '/tmp/RtmpTATCD1/grepreaper_vignette/ratings_data/file_2.csv' '/tmp/RtmpTATCD1/grepreaper_vignette/ratings_data/file_3.csv' '/tmp/RtmpTATCD1/grepreaper_vignette/ratings_data/file_4.csv' '/tmp/RtmpTATCD1/grepreaper_vignette/ratings_data/file_5.csv' '/tmp/RtmpTATCD1/grepreaper_vignette/ratings_data/file_6.csv' '/tmp/RtmpTATCD1/grepreaper_vignette/ratings_data/file_7.csv' '/tmp/RtmpTATCD1/grepreaper_vignette/ratings_data/file_8.csv' '/tmp/RtmpTATCD1/grepreaper_vignette/ratings_data/file_9.csv' '/tmp/RtmpTATCD1/grepreaper_vignette/ratings_data/file_10.csv'"

Now we can scale up to reading data from all 1000 files. First, we can use the list.files() function from base R to obtain all of the file names:

all_files <- list.files(path = "ratings_data", pattern = ".csv", full.names = TRUE)
length(all_files)

## [1] 1000

all_files[1:10]

##  [1] "ratings_data/file_1.csv"    "ratings_data/file_10.csv"  
##  [3] "ratings_data/file_100.csv"  "ratings_data/file_1000.csv"
##  [5] "ratings_data/file_101.csv"  "ratings_data/file_102.csv" 
##  [7] "ratings_data/file_103.csv"  "ratings_data/file_104.csv" 
##  [9] "ratings_data/file_105.csv"  "ratings_data/file_106.csv"

Then we can proceed with reading and aggregating all of the ratings data:

ratings <- grep_read(files = all_files)
ratings

##             user             item rating
##           <char>           <char>  <int>
##      1: a7gzXxfI 0JFCjVx2P1RMzy3h      4
##      2: a7gzXxfI 0kG80toKp2msfAut      5
##      3: a7gzXxfI 1Bji5PQIOKXaMGZq      3
##      4: a7gzXxfI 1fg4sLgEFzAtOqCa      5
##      5: a7gzXxfI 3k7Wf4yv0RV6vi4K      5
##     ---                                 
##  99996: mbPQOZV1 yO8XGg7a9UHbqatL      3
##  99997: mbPQOZV1 ySHGYLNC7XtywfIZ      2
##  99998: mbPQOZV1 yeZcUlaFqrZmEB9a      3
##  99999: mbPQOZV1 yrltwtIEX93JzLYx      5
## 100000: mbPQOZV1 zfL4EAaUxSMqxtX1      4

From there, we can utilize pattern matching to extract only the relevant records. For instance, this data system stores ratings by users. What if we wanted to pull only the ratings for a specific item?

ratings_0kG80toKp2msfAut <- grep_read(files = all_files, pattern = "0kG80toKp2msfAut")
ratings_0kG80toKp2msfAut

##        user             item rating
##      <char>           <char>  <int>
## 1: ubDjTvkL 0kG80toKp2msfAut      4
## 2: r3YTt57i 0kG80toKp2msfAut      5

File reading can also be performed recursively. This means we would not have to specify all of the filenames in a long list. Instead, we can specify a file path, a file pattern (such as searching all .csv files), and, when subfolders should be included, a recursive search:

ratings_1fg4sLgEFzAtOqCa <- grep_read(path = "ratings_data", file_pattern = ".csv", pattern = "1fg4sLgEFzAtOqCa")
ratings_1fg4sLgEFzAtOqCa

##        user             item rating
##      <char>           <char>  <int>
## 1: 5p5qBUBD 1fg4sLgEFzAtOqCa      3
## 2: OBBWuWnn 1fg4sLgEFzAtOqCa      5

With these tools, we now have a simple method that can read, pre-filter, and aggregate data from multiple files.

Counting Records

File reading often begins with uncertainty about the overall dimensions of the data to be read. We can read a few sample rows to understand the column structure, specifying the nrows parameter:

grep_read(files = "diamonds.csv", nrows = 3)

##    carat     cut  color clarity depth table price     x     y     z
##    <num>  <char> <char>  <char> <num> <int> <int> <num> <num> <num>
## 1:  0.23   Ideal      E     SI2  61.5    55   326  3.95  3.98  2.43
## 2:  0.21 Premium      E     SI1  59.8    61   326  3.89  3.84  2.31
## 3:  0.23    Good      E     VS1  56.9    65   327  4.05  4.07  2.31

However, we do not necessarily know the overall number of rows in advance. This is another place in which utilizing grep at the command line can be of benefit. It can count the rows in a file without reading the full data. The grepreaper package provides a grep_count() function to perform this task:

grep_count(files = "diamonds.csv")

##    count
##    <num>
## 1: 53940
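A count of this kind can be sketched by hand with grep's -c flag, which reports the number of matching lines. The example below uses a small made-up file, assumes that grep is on the system path, and subtracts 1 for the header line. It is an illustration of the idea rather than the code inside grep_count().

```r
# A toy file with one header line and three records.
csv_path <- tempfile(fileext = ".csv")
writeLines(c("x,y", "1,a", "2,b", "3,c"), csv_path)

# An empty pattern matches every line, so -c returns the total line count.
con <- pipe(paste("grep -c", shQuote(""), shQuote(csv_path)))
n_lines <- as.integer(readLines(con))
close(con)

# Remove the header line from the count.
n_rows <- n_lines - 1L
n_rows
```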

Counting can be performed with multiple files from the ratings data:

grep_count(files = ten_files)

##     count
##     <num>
##  1:   110
##  2:   105
##  3:   117
##  4:    94
##  5:   102
##  6:   102
##  7:   109
##  8:    82
##  9:   109
## 10:    90

We can also choose to include the filenames:

grep_count(files = ten_files, include_filename = TRUE)

##                                                             file count
##                                                           <char> <num>
##  1:  /tmp/RtmpTATCD1/grepreaper_vignette/ratings_data/file_1.csv   110
##  2:  /tmp/RtmpTATCD1/grepreaper_vignette/ratings_data/file_2.csv   105
##  3:  /tmp/RtmpTATCD1/grepreaper_vignette/ratings_data/file_3.csv   117
##  4:  /tmp/RtmpTATCD1/grepreaper_vignette/ratings_data/file_4.csv    94
##  5:  /tmp/RtmpTATCD1/grepreaper_vignette/ratings_data/file_5.csv   102
##  6:  /tmp/RtmpTATCD1/grepreaper_vignette/ratings_data/file_6.csv   102
##  7:  /tmp/RtmpTATCD1/grepreaper_vignette/ratings_data/file_7.csv   109
##  8:  /tmp/RtmpTATCD1/grepreaper_vignette/ratings_data/file_8.csv    82
##  9:  /tmp/RtmpTATCD1/grepreaper_vignette/ratings_data/file_9.csv   109
## 10: /tmp/RtmpTATCD1/grepreaper_vignette/ratings_data/file_10.csv    90

Pattern matching can also be applied:

grep_count(files = "diamonds.csv", pattern = "VVS1")

##    count
##    <num>
## 1:  3655

Likewise, the full range of options for the pattern-matching can also be applied, such as an inverted search:

grep_count(files = "diamonds.csv", pattern = "VVS1", invert = TRUE)

##    count
##    <num>
## 1: 50286

Word matching can be useful to find all cases of a 5-star rating:

grep_count(files = all_files, pattern = "5", word_match = TRUE)

##       count
##       <num>
##    1:    18
##    2:    16
##    3:    20
##    4:    16
##    5:    22
##   ---      
##  996:    19
##  997:    17
##  998:    19
##  999:    24
## 1000:    21

With word matching, we will avoid extracting rows that include the pattern “5” as part of the identifier for an item or user but do not include a 5-star rating.
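Word matching still looks at every column, so an item or user whose entire identifier is "5" would also match. If the rating is known to be the last column, a pattern anchored to the end of the line is one possible alternative. The sketch below compares the two ideas with R's grepl() on made-up lines; it assumes comma-delimited records with no trailing whitespace or carriage return.

```r
# Invented records in user,item,rating form.
lines <- c("u1,item5x,5", "u2,5,3", "u3,itemA,5")

# Whole-word match: the second line matches through its item field.
word_5 <- grepl("\\b5\\b", lines, perl = TRUE)

# End-anchored match: only lines whose final field is exactly 5.
last_field_5 <- grepl(",5$", lines)

word_5        # TRUE TRUE TRUE
last_field_5  # TRUE FALSE TRUE
```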

Discussion

The grepreaper package introduces a number of tools that greatly simplify the process of reading data. Some of its benefits include:

Simple Programming: A single function can replace iterated calls to file reading tools. No knowledge of the syntax of grep at the command line is required.
Pre-Counting: The grep_count() function allows us to understand the size of the data prior to reading it in.
Aggregation: The grep_read() function automatically binds the data from all sources without additional programming.
Pre-Filtering: With pattern matching in grep_read(), users can read in only the relevant records of data. This is more efficient than filtering after reading all of the data. In fact, we can use pre-filtering to search and aggregate data from vast file systems.