Data sets are often stored in pieces across multiple files, such as daily stock price changes, monthly transactions, or quarterly updates. For analysis, these data are typically read in, filtered for relevant records, and then combined into a single object (such as a data.frame in R). This process is not computationally efficient: every file must be read individually before aggregation, and any filters are applied only after the read completes, so many irrelevant records are loaded unnecessarily.
Using grep at the command line allows data to be pre-filtered and aggregated from multiple files. Linking command-line grep to R helps to achieve a number of goals:
Read and aggregate data from multiple files without additional processing work.
Use pattern matching to filter data before it is read. This supports reading and aggregating relevant data from multiple files without loading unnecessary records.
Provide counts of the number of rows of relevant data in a range of files. This can incorporate pattern matching.
Each of these goals begins with certain assumptions about the data to be read:
The data are stored in delimited flat files that could reasonably be read into a data.frame object in R with a typical file reading program.
The data in each file has a similar structure in terms of variables (columns). It would be reasonable to bind the rows from all of the files into a single, comprehensive data.frame object.
The grepreaper package is designed to facilitate this reading process. It provides user-friendly functions for reading data and counting rows without the need for the user to craft the corresponding grep commands. This vignette will show examples of the features and capabilities of the grepreaper package.
grepreaper is designed to be cross-platform. On Unix-like systems (Linux and macOS), it uses the system’s native grep utility. On Windows, the package requires Rtools to be installed, which provides the necessary grep.exe executable.
The package automatically detects the location of the grep binary and handles shell-specific quoting requirements (e.g., double quotes for Windows CMD and single quotes for Unix shells) to ensure consistent behavior across environments.
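As a rough illustration of the two ideas above (locating grep and quoting for a given shell), the following sketch uses only base R. It is not the package's internal code, just a minimal example of the same kind of check, and the variable names are ours.

```r
# Locate a grep executable; Sys.which() returns "" when none is on the PATH
grep_path <- unname(Sys.which("grep"))

# Quote a pattern that contains a space for two kinds of shell
pattern <- "Very Good"
unix_quoted <- shQuote(pattern, type = "sh")   # single quotes for Unix shells
cmd_quoted  <- shQuote(pattern, type = "cmd")  # double quotes for Windows CMD

print(grep_path)
print(unix_quoted)
print(cmd_quoted)
```

Quoting matters here because an unquoted pattern with a space would be split into two separate arguments by the shell.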
Most typically, we use a file reading method to load data. As an example, the fread() function from the data.table package can read a delimited file. We will work with the diamonds data from the ggplot2 package. Here the data are stored in a .csv file:
## carat cut color clarity depth table price x y z
## <num> <char> <char> <char> <num> <num> <int> <num> <num> <num>
## 1: 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2: 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3: 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4: 0.29 Premium I VS2 62.4 58 334 4.20 4.23 2.63
## 5: 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
With these data, we could subsequently filter the records to only show diamonds listed as “Ideal” in the cut variable:
## carat cut color clarity depth table price x y z
## <num> <char> <char> <char> <num> <num> <int> <num> <num> <num>
## 1: 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2: 0.23 Ideal J VS1 62.8 56 340 3.93 3.90 2.46
## 3: 0.31 Ideal J SI2 62.2 54 344 4.35 4.37 2.71
## 4: 0.30 Ideal I SI2 62.0 54 348 4.31 4.34 2.68
## 5: 0.33 Ideal I SI2 61.8 55 403 4.49 4.51 2.78
Utilizing grep at the command line provides another option to read the data. This can be performed within data.table’s fread() function:
## carat cut color clarity depth table price x y z
## <num> <char> <char> <char> <num> <num> <int> <num> <num> <num>
## 1: 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2: 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3: 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4: 0.29 Premium I VS2 62.4 58 334 4.20 4.23 2.63
## 5: 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
With grep, it is also possible to pre-filter the data based upon pattern matching:
## V1 V2 V3 V4 V5 V6 V7 V8 V9 V10
## <num> <char> <char> <char> <num> <num> <int> <num> <num> <num>
## 1: 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2: 0.23 Ideal J VS1 62.8 56 340 3.93 3.90 2.46
## 3: 0.31 Ideal J SI2 62.2 54 344 4.35 4.37 2.71
## 4: 0.30 Ideal I SI2 62.0 54 348 4.31 4.34 2.68
## 5: 0.33 Ideal I SI2 61.8 55 403 4.49 4.51 2.78
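Notice that this approach removes the headers: the header line does not contain the pattern “Ideal”, so grep drops it and fread() falls back to default column names (V1 through V10). However, the method is otherwise sound. With grep, we can pre-filter the data.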
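One way to keep the column names when pre-filtering by hand is to read the header line separately and place it back in front of the matched lines. The sketch below uses base R and a small stand-in file with made-up rows. It assumes a grep executable is available on the PATH, and it is only an illustration, not a description of how grepreaper handles headers.

```r
# Stand-in file with hypothetical rows (not the real diamonds data)
tmp <- tempfile(fileext = ".csv")
writeLines(c("carat,cut,price",
             "0.23,Ideal,326",
             "0.21,Premium,326",
             "0.30,Ideal,348"), tmp)

# grep returns only the matching lines, so the header line is absent
matched <- system2("grep", c(shQuote("Ideal"), shQuote(tmp)), stdout = TRUE)

# Read the first line separately and put it back in front of the matches
header <- readLines(tmp, n = 1)
ideal <- read.csv(text = c(header, matched), stringsAsFactors = FALSE)
print(ideal)
```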
While some users may be eager to learn command line programming tools, the goal of our work is to simplify this approach. The grepreaper package provides simple functions for reading and pre-filtering data.
## carat cut color clarity depth table price x y z
## <num> <char> <char> <char> <num> <int> <int> <num> <num> <num>
## 1: 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2: 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3: 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4: 0.29 Premium I VS2 62.4 58 334 4.20 4.23 2.63
## 5: 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
The grep_read() function can also demonstrate the underlying grep command:
## [1] "'/usr/bin/grep' '' '/tmp/RtmpTATCD1/grepreaper_vignette/diamonds.csv'"
This is useful for educational purposes and to better understand how the data are being read.
A filter can be established by adding the pattern:
## carat cut color clarity depth table price x y z
## <num> <char> <char> <char> <num> <int> <int> <num> <num> <num>
## 1: 0.23 Ideal J VS1 62.8 56 340 3.93 3.90 2.46
## 2: 0.31 Ideal J SI2 62.2 54 344 4.35 4.37 2.71
## 3: 0.30 Ideal I SI2 62.0 54 348 4.31 4.34 2.68
## 4: 0.33 Ideal I SI2 61.8 55 403 4.49 4.51 2.78
## 5: 0.33 Ideal I SI2 61.2 56 403 4.49 4.50 2.75
This would correspond to the following grep command:
## [1] "'/usr/bin/grep' 'Ideal' '/tmp/RtmpTATCD1/grepreaper_vignette/diamonds.csv'"
You can also search for multiple patterns (using OR logic):
multiple_cuts <- grep_read(files = "diamonds.csv", pattern = c("Ideal", "Very Good"))
multiple_cuts[1:5,]
## carat cut color clarity depth table price x y z
## <num> <char> <char> <char> <num> <int> <int> <num> <num> <num>
## 1: 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
## 2: 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
## 3: 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
## 4: 0.23 Very Good H VS1 59.4 61 338 4.00 4.05 2.39
## 5: 0.23 Ideal J VS1 62.8 56 340 3.93 3.90 2.46
We could also display the construction of this grep command:
## [1] "'/usr/bin/grep' -e 'Ideal' -e 'Very Good' '/tmp/RtmpTATCD1/grepreaper_vignette/diamonds.csv'"
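To see how several patterns turn into repeated -e arguments, a command string of the same shape can be assembled in base R. This is a hand-written sketch under our own variable names, not the package's implementation. It only builds the string and does not run it.

```r
patterns <- c("Ideal", "Very Good")
csv_file <- "diamonds.csv"

# One "-e <pattern>" pair per pattern; grep treats multiple -e patterns as OR
pattern_args <- paste("-e", shQuote(patterns, type = "sh"), collapse = " ")
cmd <- paste("grep", pattern_args, shQuote(csv_file, type = "sh"))
print(cmd)
```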
File reading with grep also allows for some variations on filtering. The grep_read() function has a number of options built in:
## carat cut color clarity depth table price x y z
## <num> <char> <char> <char> <num> <int> <int> <num> <num> <num>
## 1: 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 2: 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 3: 0.29 Premium I VS2 62.4 58 334 4.20 4.23 2.63
## 4: 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
## 5: 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
This adds the -v option to the grep command:
## [1] "'/usr/bin/grep' -v 'SI2' '/tmp/RtmpTATCD1/grepreaper_vignette/diamonds.csv'"
## carat cut color clarity depth table price x y z
## <num> <char> <char> <char> <num> <int> <int> <num> <num> <num>
## 1: 0.23 Ideal J VS1 62.8 56 340 3.93 3.90 2.46
## 2: 0.31 Ideal J SI2 62.2 54 344 4.35 4.37 2.71
## 3: 0.30 Ideal I SI2 62.0 54 348 4.31 4.34 2.68
## 4: 0.33 Ideal I SI2 61.8 55 403 4.49 4.51 2.78
## 5: 0.33 Ideal I SI2 61.2 56 403 4.49 4.50 2.75
This adds the -i option to the grep command:
## [1] "'/usr/bin/grep' -i 'ideal' '/tmp/RtmpTATCD1/grepreaper_vignette/diamonds.csv'"
## carat cut color clarity depth table price x y z
## <num> <char> <char> <char> <num> <int> <int> <num> <num> <num>
## 1: 0.23 Ideal J VS1 62.8 56 340 3.93 3.90 2.46
## 2: 0.31 Ideal J SI2 62.2 54 344 4.35 4.37 2.71
## 3: 0.30 Ideal I SI2 62.0 54 348 4.31 4.34 2.68
## 4: 0.33 Ideal I SI2 61.8 55 403 4.49 4.51 2.78
## 5: 0.33 Ideal I SI2 61.2 56 403 4.49 4.50 2.75
This adds the -F option to the grep command:
grep_read(files = "diamonds.csv", pattern = "IdEaL", ignore_case = TRUE, fixed = TRUE, show_cmd = TRUE)
## [1] "'/usr/bin/grep' -i -F 'IdEaL' '/tmp/RtmpTATCD1/grepreaper_vignette/diamonds.csv'"
## carat cut color clarity depth table price x y z
## <num> <char> <char> <char> <num> <int> <int> <num> <num> <num>
## 1: 0.23 Ideal J VS1 62.8 56 340 3.93 3.90 2.46
## 2: 0.31 Ideal J SI2 62.2 54 344 4.35 4.37 2.71
## 3: 0.30 Ideal I SI2 62.0 54 348 4.31 4.34 2.68
## 4: 0.33 Ideal I SI2 61.8 55 403 4.49 4.51 2.78
## 5: 0.33 Ideal I SI2 61.2 56 403 4.49 4.50 2.75
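The effect of -F is easiest to see with a pattern that contains a regular expression metacharacter. In the sketch below, which calls grep directly from base R on a small made-up file (assuming grep is on the PATH), the dot in "4.05" matches any character unless -F is supplied. The helper count_matches() is hypothetical and exists only for this example.

```r
tmp <- tempfile(fileext = ".csv")
writeLines(c("carat,cut,x",
             "0.23,Ideal,4.05",
             "0.31,Good,4105"), tmp)

# Hypothetical helper: count matching lines with optional extra flags
count_matches <- function(flags) {
  out <- system2("grep", c("-c", flags, shQuote("4.05"), shQuote(tmp)),
                 stdout = TRUE)
  as.integer(out)
}

regex_count <- count_matches(character(0))  # "." matches any character
fixed_count <- count_matches("-F")          # "." must be a literal dot
print(c(regex = regex_count, fixed = fixed_count))
```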
This adds the -r option to the grep command. Note that recursive searching will include a larger number of files, which can greatly lengthen the command.
cmd <- grep_read(path = ".", recursive = TRUE, pattern = "Ideal", file_pattern = ".csv", show_cmd = TRUE)
substring(text = cmd, first = 1, last = 100)
## [1] "'/usr/bin/grep' -r -H 'Ideal' '/tmp/RtmpTATCD1/grepreaper_vignette/diamonds.csv' '/tmp/RtmpTATCD1/gr"
## carat cut color clarity depth table price x y z
## <num> <char> <char> <char> <num> <int> <int> <num> <num> <num>
## 1: 0.23 Very Good H VS1 59.4 61 338 4.00 4.05 2.39
## 2: 0.23 Ideal J VS1 62.8 56 340 3.93 3.90 2.46
## 3: 0.23 Very Good H VS1 61.0 57 353 3.94 3.96 2.41
## 4: 0.24 Premium I VS1 62.5 57 355 3.97 3.94 2.47
## 5: 0.23 Very Good F VS1 60.9 57 357 3.96 3.99 2.42
## ---
## 8166: 0.57 Premium E VS1 61.6 58 2753 5.36 5.33 3.29
## 8167: 0.84 Good I VS1 63.7 59 2753 5.94 5.90 3.77
## 8168: 0.76 Premium I VS1 59.3 62 2753 5.93 5.85 3.49
## 8169: 0.70 Very Good D VS1 63.1 59 2755 5.67 5.58 3.55
## 8170: 0.71 Ideal G VS1 61.4 56 2756 5.76 5.73 3.53
Notice that using word_match limits the search results to only diamonds with a clarity of ‘VS1’. Diamonds that are ‘VVS1’ would otherwise match the pattern without an exact word match.
This adds the -w option to the grep command.
## [1] "'/usr/bin/grep' -w 'VS1' '/tmp/RtmpTATCD1/grepreaper_vignette/diamonds.csv'"
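The difference between a plain match and a whole-word match can be checked on a few made-up lines. This sketch runs grep directly from base R (assuming grep is on the PATH) rather than going through grepreaper.

```r
tmp <- tempfile(fileext = ".csv")
writeLines(c("cut,clarity,price",
             "Good,VS1,327",
             "Very Good,VVS1,336",
             "Ideal,VS1,340"), tmp)

# Without -w, "VS1" also matches inside "VVS1"
plain <- system2("grep", c(shQuote("VS1"), shQuote(tmp)), stdout = TRUE)

# With -w, the match must be bounded by non-word characters (here, commas)
whole <- system2("grep", c("-w", shQuote("VS1"), shQuote(tmp)), stdout = TRUE)

print(plain)
print(whole)
```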
## carat cut color clarity depth table price x y z
## <num> <char> <char> <char> <num> <int> <int> <num> <num> <num>
## 1: 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2: 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3: 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4: 0.29 Premium I VS2 62.4 58 334 4.20 4.23 2.63
## 5: 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## file
## <char>
## 1: /tmp/RtmpTATCD1/grepreaper_vignette/diamonds.csv
## 2: /tmp/RtmpTATCD1/grepreaper_vignette/diamonds.csv
## 3: /tmp/RtmpTATCD1/grepreaper_vignette/diamonds.csv
## 4: /tmp/RtmpTATCD1/grepreaper_vignette/diamonds.csv
## 5: /tmp/RtmpTATCD1/grepreaper_vignette/diamonds.csv
This is especially helpful when reading through multiple files, which will be discussed in more detail later in this document.
This adds -H to the grep command:
## [1] "'/usr/bin/grep' -H '' '/tmp/RtmpTATCD1/grepreaper_vignette/diamonds.csv'"
## carat cut color clarity depth table price x y z line_number
## <num> <char> <char> <char> <num> <int> <int> <num> <num> <num> <num>
## 1: NA Ideal J VS1 62.8 56 340 3.93 3.90 2.46 NA
## 2: NA Ideal J SI2 62.2 54 344 4.35 4.37 2.71 NA
## 3: NA Ideal I SI2 62.0 54 348 4.31 4.34 2.68 NA
## 4: NA Ideal I SI2 61.8 55 403 4.49 4.51 2.78 NA
## 5: NA Ideal I SI2 61.2 56 403 4.49 4.50 2.75 NA
Note that the displayed line numbers do not count the header row. This is an adjustment to the output of grep, which would ordinarily count the header as a line. (This effectively subtracts 1 from grep’s line numbers.)
Showing the line number adds -n to the grep command:
## [1] "'/usr/bin/grep' -n -H 'Ideal' '/tmp/RtmpTATCD1/grepreaper_vignette/diamonds.csv'"
For processing purposes behind the scenes, we maintain the -H for filenames when extracting the line numbers. These filenames are removed if not specifically requested.
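The header adjustment can be reproduced by hand. The sketch below, written in base R against a small made-up file and assuming grep is on the PATH, parses the prefix that -n adds and shifts it by one. It illustrates the arithmetic only and is not the package's own parsing code.

```r
tmp <- tempfile(fileext = ".csv")
writeLines(c("carat,cut,price",
             "0.23,Ideal,326",
             "0.21,Premium,326",
             "0.30,Ideal,348"), tmp)

# -n prefixes each match with its line number in the file, e.g. "2:0.23,Ideal,326"
out <- system2("grep", c("-n", shQuote("Ideal"), shQuote(tmp)), stdout = TRUE)

file_line <- as.integer(sub(":.*$", "", out))  # number before the first colon
data_row  <- file_line - 1L                    # header is line 1, so shift by one
print(data.frame(file_line, data_row))
```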
Some data systems include records that are spread out over many similar files. For our purposes, we will assume that all of the relevant files have the same structure (number of columns, column names, and order of columns).
As an example data set, we have provided 1000 files of simulated ratings. Each row shows a user, an item, and a rating on a 1-5 Likert scale. The files are organized so that each user’s ratings are contained in a unique file.
Most file reading programs in R only read a single file at a time. As a result, we would have to iterate through the reading process. Then all of the data would require aggregation, using functions like rbind() to create a single object.
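For reference, the iterate-then-bind workflow described above might look like the following in base R. The three small files and their contents are invented for this sketch. Every file is read in full before any filtering takes place.

```r
# Create three small stand-in files (hypothetical data)
ratings_dir <- tempfile("ratings_sketch_")
dir.create(ratings_dir)
for (i in 1:3) {
  df <- data.frame(user = paste0("user", i),
                   item = c("itemA", "itemB"),
                   rating = c(i, i + 1L))
  write.csv(df, file.path(ratings_dir, paste0("file_", i, ".csv")),
            row.names = FALSE)
}

# Read each file in turn, then bind the pieces into one data.frame
files <- list.files(ratings_dir, pattern = "\\.csv$", full.names = TRUE)
pieces <- lapply(files, read.csv, stringsAsFactors = FALSE)
ratings <- do.call(rbind, pieces)

# Filtering happens only after everything has been loaded
item_a <- ratings[ratings$item == "itemA", ]
print(item_a)
```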
We can improve upon this process with another application of grep at the command line. This is set up to read and aggregate data from many files, all in a single line of code. The grep_read() function implements this with a simple call:
## user item rating
## <char> <char> <int>
## 1: a7gzXxfI 0JFCjVx2P1RMzy3h 4
## 2: a7gzXxfI 0kG80toKp2msfAut 5
## 3: a7gzXxfI 1Bji5PQIOKXaMGZq 3
## 4: a7gzXxfI 1fg4sLgEFzAtOqCa 5
## 5: a7gzXxfI 3k7Wf4yv0RV6vi4K 5
## ---
## 211: MAYVCdgd whllyERYJir9lTi6 2
## 212: MAYVCdgd ykBGM3UcRgF8whc5 5
## 213: MAYVCdgd ypJCM39tzfpxsVgB 3
## 214: MAYVCdgd yrltwtIEX93JzLYx 2
## 215: MAYVCdgd zQBDgVrQmSYGpYrn 4
We can also show the underlying grep command:
## [1] "'/usr/bin/grep' -H '' '/tmp/RtmpTATCD1/grepreaper_vignette/ratings_data/file_1.csv' '/tmp/RtmpTATCD1/grepreaper_vignette/ratings_data/file_2.csv'"
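To make the role of -H more concrete, here is a base R sketch of how output of this form could be turned back into a single table. It assumes grep is on the PATH and that the file paths contain no colon (which would not hold for Windows drive letters). The parsing shown is our own illustration and may differ from what grepreaper does internally.

```r
ratings_dir <- tempfile("ratings_sketch_")
dir.create(ratings_dir)
files <- file.path(ratings_dir, c("file_1.csv", "file_2.csv"))
writeLines(c("user,item,rating", "u1,itemA,4", "u1,itemB,5"), files[1])
writeLines(c("user,item,rating", "u2,itemA,2"), files[2])

# -H prefixes every line with its file name; the empty pattern matches all lines
out <- system2("grep", c("-H", shQuote(""), shQuote(files)), stdout = TRUE)

# Assumption: paths contain no colon, so the first colon ends the prefix
source_file <- sub(":.*$", "", out)
content <- sub("^[^:]*:", "", out)

# Each file contributes its own header line; keep a single copy
header <- readLines(files[1], n = 1)
keep <- content != header
ratings <- read.csv(text = c(header, content[keep]), stringsAsFactors = FALSE)
ratings$file <- basename(source_file[keep])
print(ratings)
```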
We could likewise read in the data for 10 files:
## user item rating
## <char> <char> <int>
## 1: a7gzXxfI 0JFCjVx2P1RMzy3h 4
## 2: a7gzXxfI 0kG80toKp2msfAut 5
## 3: a7gzXxfI 1Bji5PQIOKXaMGZq 3
## 4: a7gzXxfI 1fg4sLgEFzAtOqCa 5
## 5: a7gzXxfI 3k7Wf4yv0RV6vi4K 5
## ---
## 1016: QPW5X7ci uNp5n9ziPoSjwab6 1
## 1017: QPW5X7ci uPU8XKJD4wo3Twss 1
## 1018: QPW5X7ci uXjCOKMvr1gPaxTg 4
## 1019: QPW5X7ci vPI43TEe3CMQUM5U 3
## 1020: QPW5X7ci wN8YPrJls7N3vGjC 1
Because each filename is appended, the grep command becomes quite lengthy:
## [1] "'/usr/bin/grep' -H '' '/tmp/RtmpTATCD1/grepreaper_vignette/ratings_data/file_1.csv' '/tmp/RtmpTATCD1/grepreaper_vignette/ratings_data/file_2.csv' '/tmp/RtmpTATCD1/grepreaper_vignette/ratings_data/file_3.csv' '/tmp/RtmpTATCD1/grepreaper_vignette/ratings_data/file_4.csv' '/tmp/RtmpTATCD1/grepreaper_vignette/ratings_data/file_5.csv' '/tmp/RtmpTATCD1/grepreaper_vignette/ratings_data/file_6.csv' '/tmp/RtmpTATCD1/grepreaper_vignette/ratings_data/file_7.csv' '/tmp/RtmpTATCD1/grepreaper_vignette/ratings_data/file_8.csv' '/tmp/RtmpTATCD1/grepreaper_vignette/ratings_data/file_9.csv' '/tmp/RtmpTATCD1/grepreaper_vignette/ratings_data/file_10.csv'"
Now we can scale up to reading data from all 1000 files. First, we can use the list.files() function from base R to obtain all of the file names:
all_files <- list.files(path = "ratings_data", pattern = ".csv", full.names = TRUE)
length(all_files)
## [1] 1000
## [1] "ratings_data/file_1.csv" "ratings_data/file_10.csv"
## [3] "ratings_data/file_100.csv" "ratings_data/file_1000.csv"
## [5] "ratings_data/file_101.csv" "ratings_data/file_102.csv"
## [7] "ratings_data/file_103.csv" "ratings_data/file_104.csv"
## [9] "ratings_data/file_105.csv" "ratings_data/file_106.csv"
Then we can proceed with reading and aggregating all of the ratings data:
## user item rating
## <char> <char> <int>
## 1: a7gzXxfI 0JFCjVx2P1RMzy3h 4
## 2: a7gzXxfI 0kG80toKp2msfAut 5
## 3: a7gzXxfI 1Bji5PQIOKXaMGZq 3
## 4: a7gzXxfI 1fg4sLgEFzAtOqCa 5
## 5: a7gzXxfI 3k7Wf4yv0RV6vi4K 5
## ---
## 99996: mbPQOZV1 yO8XGg7a9UHbqatL 3
## 99997: mbPQOZV1 ySHGYLNC7XtywfIZ 2
## 99998: mbPQOZV1 yeZcUlaFqrZmEB9a 3
## 99999: mbPQOZV1 yrltwtIEX93JzLYx 5
## 100000: mbPQOZV1 zfL4EAaUxSMqxtX1 4
From there, we can utilize pattern matching to extract only the relevant records. For instance, this data system stores ratings by users. What if we wanted to pull only the ratings for a specific item?
ratings_0kG80toKp2msfAut <- grep_read(files = all_files, pattern = "0kG80toKp2msfAut")
ratings_0kG80toKp2msfAut
## user item rating
## <char> <char> <int>
## 1: ubDjTvkL 0kG80toKp2msfAut 4
## 2: r3YTt57i 0kG80toKp2msfAut 5
File reading can also be performed recursively. This means we would not have to specify all of the filenames in a long list. Instead, we can specify a file path, a file pattern (such as searching all .csv files), and a recursive search:
ratings_1fg4sLgEFzAtOqCa <- grep_read(path = "ratings_data", file_pattern = ".csv", pattern = "1fg4sLgEFzAtOqCa")
ratings_1fg4sLgEFzAtOqCa
## user item rating
## <char> <char> <int>
## 1: 5p5qBUBD 1fg4sLgEFzAtOqCa 3
## 2: OBBWuWnn 1fg4sLgEFzAtOqCa 5
With these tools, we now have a simple method that can read, pre-filter, and aggregate data from multiple files.
File reading begins with an uncertainty about the overall dimensions of the data to be read. We can read a few sample rows to understand the column structure, specifying the nrows parameter:
## carat cut color clarity depth table price x y z
## <num> <char> <char> <char> <num> <int> <int> <num> <num> <num>
## 1: 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2: 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3: 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
However, we do not necessarily know the overall number of rows in advance. This is another place in which using grep at the command line can be of benefit. It can count the rows in a file without loading the data into R. The grepreaper package provides a grep_count() function to perform this task:
## count
## <num>
## 1: 53940
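The underlying idea can be sketched with grep's -c option, which reports the number of matching lines rather than the lines themselves. The example below uses base R and a small made-up file, assumes grep is on the PATH, and subtracts the header line by hand. It is an illustration of the counting step, not the package's code.

```r
tmp <- tempfile(fileext = ".csv")
writeLines(c("carat,cut,price",
             "0.23,Ideal,326",
             "0.21,Premium,326",
             "0.30,Ideal,348"), tmp)

# The empty pattern matches every line, so -c returns the total line count
n_lines <- as.integer(
  system2("grep", c("-c", shQuote(""), shQuote(tmp)), stdout = TRUE)
)

# The header is one of those lines, so the data row count is one less
n_rows <- n_lines - 1L
print(n_rows)
```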
Counting can be performed with multiple files from the ratings data:
## count
## <num>
## 1: 110
## 2: 105
## 3: 117
## 4: 94
## 5: 102
## 6: 102
## 7: 109
## 8: 82
## 9: 109
## 10: 90
We can also choose to include the filenames:
## file count
## <char> <num>
## 1: /tmp/RtmpTATCD1/grepreaper_vignette/ratings_data/file_1.csv 110
## 2: /tmp/RtmpTATCD1/grepreaper_vignette/ratings_data/file_2.csv 105
## 3: /tmp/RtmpTATCD1/grepreaper_vignette/ratings_data/file_3.csv 117
## 4: /tmp/RtmpTATCD1/grepreaper_vignette/ratings_data/file_4.csv 94
## 5: /tmp/RtmpTATCD1/grepreaper_vignette/ratings_data/file_5.csv 102
## 6: /tmp/RtmpTATCD1/grepreaper_vignette/ratings_data/file_6.csv 102
## 7: /tmp/RtmpTATCD1/grepreaper_vignette/ratings_data/file_7.csv 109
## 8: /tmp/RtmpTATCD1/grepreaper_vignette/ratings_data/file_8.csv 82
## 9: /tmp/RtmpTATCD1/grepreaper_vignette/ratings_data/file_9.csv 109
## 10: /tmp/RtmpTATCD1/grepreaper_vignette/ratings_data/file_10.csv 90
Pattern matching can also be applied:
## count
## <num>
## 1: 3655
Likewise, the full range of pattern-matching options can be applied, such as an inverted search:
## count
## <num>
## 1: 50286
Word matching can be useful to find all cases of a 5-star rating:
## count
## <num>
## 1: 18
## 2: 16
## 3: 20
## 4: 16
## 5: 22
## ---
## 996: 19
## 997: 17
## 998: 19
## 999: 24
## 1000: 21
With word matching, we will avoid extracting rows that include the pattern “5” as part of the identifier for an item or user but do not include a 5-star rating.
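The same point can be checked on two tiny made-up files. This base R sketch (assuming grep is on the PATH and that paths contain no trailing colon-number ambiguity) compares per-file counts of the pattern "5" with and without -w. The helper count_fives() is hypothetical. Note that -w would still match a standalone 5 in any column, not only the rating column.

```r
ratings_dir <- tempfile("ratings_sketch_")
dir.create(ratings_dir)
files <- file.path(ratings_dir, c("file_1.csv", "file_2.csv"))
writeLines(c("user,item,rating", "u5,itemA,5", "u5,item5,3"), files[1])
writeLines(c("user,item,rating", "u2,itemA,5", "u2,itemB,5"), files[2])

# With several files, grep -c prints one "path:count" line per file
count_fives <- function(flags) {
  out <- system2("grep", c("-c", flags, shQuote("5"), shQuote(files)),
                 stdout = TRUE)
  as.integer(sub("^.*:", "", out))  # keep the number after the last colon
}

plain_counts <- count_fives(character(0))  # also counts "u5" and "item5"
word_counts  <- count_fives("-w")          # only a standalone 5
print(rbind(plain_counts, word_counts))
```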
The grepreaper package introduces a number of tools that greatly simplify the process of reading data. Some of its benefits include:
Simple Programming: A single function can replace iterated calls to file reading tools. No knowledge of the syntax of grep at the command line is required.
Pre-Counting: The grep_count() function allows us to understand the size of the data prior to reading it in.
Aggregation: The grep_read() function automatically binds the data from all sources without additional programming.
Pre-Filtering: With pattern matching in grep_read(), users can read in only the relevant records of data. This is more efficient than filtering after reading all of the data. In fact, we can use pre-filtering to search and aggregate data from vast file systems.