| Title: | Efficient Data Filtering and Aggregation Using Grep |
| Version: | 0.1.0 |
| Description: | Provides an interface to the system-level 'grep' utility for efficiently reading, filtering, and aggregating data from multiple flat files. By pre-filtering data at the command line before it enters the R environment, the package reduces memory overhead and improves ingestion speed. Includes functions for counting records across large file systems and supports recursive directory searching. |
| License: | MIT + file LICENSE |
| Encoding: | UTF-8 |
| RoxygenNote: | 7.3.3 |
| Suggests: | ggplot2, knitr, rmarkdown |
| VignetteBuilder: | knitr |
| Imports: | data.table, methods |
| NeedsCompilation: | no |
| Packaged: | 2026-01-20 12:59:50 UTC; akshat |
| Author: | David Shilane [aut], Atharv Raskar [aut], Akshat Maurya [aut, cre] |
| Maintainer: | Akshat Maurya <codingmaster902@gmail.com> |
| Repository: | CRAN |
| Date/Publication: | 2026-01-23 21:10:02 UTC |
Build grep command string
Description
Constructs a safely quoted, properly formatted grep command string for system execution. Inputs are sanitized with R's built-in shell quoting, which keeps the command compatible across operating systems.
Usage
build_grep_cmd(pattern, files, options = "", fixed = FALSE)
Arguments
pattern |
Character vector of patterns to search for. |
files |
Character vector of file paths to search in. |
options |
Character string containing grep flags (e.g., "-i", "-v"). |
fixed |
Logical; if TRUE, grep is told to treat patterns as fixed strings. |
Value
A properly formatted command string ready for system execution.
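As an illustration of the quoting approach described above, the following is a minimal sketch in base R. The helper name sketch_grep_cmd is hypothetical, and the body is an assumption about how such a command could be assembled with shQuote(); it is not the package's actual implementation.

```r
# Hypothetical helper, not the package's implementation: builds a grep
# command string with every user-supplied value passed through shQuote().
sketch_grep_cmd <- function(pattern, files, options = "", fixed = FALSE) {
  # Keep the caller's flags (if any) and add -F for fixed-string matching.
  flags <- c(if (nzchar(options)) options, if (fixed) "-F")
  # One -e per pattern so that several patterns can be searched at once.
  pattern_args <- paste("-e", shQuote(pattern), collapse = " ")
  # Quote each path so that spaces and shell metacharacters stay literal.
  file_args <- paste(shQuote(files), collapse = " ")
  paste(c("grep", flags, pattern_args, file_args), collapse = " ")
}

cmd <- sketch_grep_cmd("3.94", c("my data.csv", "other.csv"),
                       options = "-i", fixed = TRUE)
print(cmd)
```

Quoting each value separately means a file name such as "my data.csv" reaches grep as a single argument instead of being split by the shell.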
grep_count: Efficiently count the number of relevant records from one or more files using grep
Description
Counts the records that match a pattern in one or more files by delegating the search to the system grep utility.
Usage
grep_count(
files = NULL,
path = NULL,
file_pattern = NULL,
pattern = "",
invert = FALSE,
ignore_case = FALSE,
fixed = FALSE,
recursive = FALSE,
word_match = FALSE,
only_matching = FALSE,
skip = 0,
header = TRUE,
include_filename = FALSE,
show_cmd = FALSE,
show_progress = FALSE,
...
)
Arguments
files |
Character vector of file paths to read. |
path |
Optional. Directory path to search for files. |
file_pattern |
Optional. A pattern to filter filenames when using the path argument. |
pattern |
Pattern to search for within files (passed to grep). |
invert |
Logical; if TRUE, count non-matching lines. |
ignore_case |
Logical; if TRUE, perform case-insensitive matching (default: FALSE). |
fixed |
Logical; if TRUE, pattern is a fixed string, not a regular expression. |
recursive |
Logical; if TRUE, search recursively through directories. |
word_match |
Logical; if TRUE, match only whole words. |
only_matching |
Logical; if TRUE, return only the matching part of the lines. |
skip |
Integer; number of rows to skip. |
header |
Logical; if TRUE, treat first row as header. |
include_filename |
Logical; if TRUE, include source filename as a column. |
show_cmd |
Logical; if TRUE, return the grep command string instead of executing it. |
show_progress |
Logical; if TRUE, show progress indicators. |
... |
Additional arguments passed to fread. |
Value
A data.table containing file names and counts.
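The kind of per-file counting described above can be sketched with the system grep directly. This is an assumption-laden illustration, not the package's code: it requires a grep binary on the PATH, uses the -c and -H flags, and returns a plain data.frame rather than a data.table so that it runs without extra packages.

```r
# Two small temporary files stand in for real data.
demo_dir <- tempfile("grep_count_demo")
dir.create(demo_dir)
f1 <- file.path(demo_dir, "a.txt")
f2 <- file.path(demo_dir, "b.txt")
writeLines(c("error: disk", "ok", "error: net"), f1)
writeLines(c("ok", "error: cpu"), f2)

# -c counts matching lines per file; -H forces the "file:count" form.
# system2() hands its arguments to a shell, so each value is shQuote()d.
out <- system2("grep",
               c("-c", "-H", "-e", shQuote("error"), shQuote(c(f1, f2))),
               stdout = TRUE)

# Split on the last colon so that paths containing ':' still parse.
counts <- data.frame(
  file  = sub(":[0-9]+$", "", out),
  count = as.integer(sub("^.*:", "", out)),
  stringsAsFactors = FALSE
)
print(counts)
```

Only one short line per file crosses into R, which is why counting this way stays cheap even when the files themselves are large.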
grep_read: Efficiently read and filter lines from one or more files using grep, returning a data.table.
Description
Reads and filters lines from one or more files by running the system grep utility, and returns the matching lines as a data.table.
Usage
grep_read(
files = NULL,
path = NULL,
file_pattern = NULL,
pattern = "",
invert = FALSE,
ignore_case = FALSE,
fixed = FALSE,
show_cmd = FALSE,
recursive = FALSE,
word_match = FALSE,
show_line_numbers = FALSE,
only_matching = FALSE,
nrows = Inf,
skip = 0,
header = TRUE,
col.names = NULL,
include_filename = FALSE,
show_progress = FALSE,
...
)
Arguments
files |
Character vector of file paths to read. |
path |
Optional. Directory path to search for files. |
file_pattern |
Optional. A pattern to filter filenames when using the path argument. |
pattern |
Pattern to search for within files (passed to grep). |
invert |
Logical; if TRUE, return non-matching lines. |
ignore_case |
Logical; if TRUE, perform case-insensitive matching (default: FALSE). |
fixed |
Logical; if TRUE, pattern is a fixed string, not a regular expression. |
show_cmd |
Logical; if TRUE, return the grep command string instead of executing it. |
recursive |
Logical; if TRUE, search recursively through directories. |
word_match |
Logical; if TRUE, match only whole words. |
show_line_numbers |
Logical; if TRUE, include line numbers from source files. Headers are automatically removed and lines renumbered. |
only_matching |
Logical; if TRUE, return only the matching part of the lines. |
nrows |
Integer; maximum number of rows to read. |
skip |
Integer; number of rows to skip. |
header |
Logical; if TRUE, treat first row as header. If FALSE, the first row is read as a row of data. |
col.names |
Character vector of column names. |
include_filename |
Logical; if TRUE, include source filename as a column. |
show_progress |
Logical; if TRUE, show progress indicators. |
... |
Additional arguments passed to fread. |
Value
A data.table whose structure depends on the options, except that show_cmd=TRUE returns a character string:
Default: Data columns with original types preserved
show_line_numbers=TRUE: Additional 'line_number' column (integer) with source file line numbers
include_filename=TRUE: Additional 'source_file' column (character)
only_matching=TRUE: Single 'match' column with matched substrings
show_cmd=TRUE: Character string containing the grep command
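The idea of filtering with grep before the data reaches R can be sketched as follows. This is an illustration under stated assumptions rather than the package's implementation: it needs a grep binary on the PATH, the file and column names are invented for the demo, and it uses base R read.csv in place of fread so that it runs without extra packages.

```r
# A small CSV file stands in for a large data file.
csv <- tempfile(fileext = ".csv")
writeLines(c("carat,cut,price",
             "0.23,Ideal,326",
             "0.21,Premium,326",
             "0.31,Ideal,335"), csv)

# grep keeps only the lines containing "Ideal". The header line does not
# match, so it is dropped by the filter and recovered separately below.
matched <- system2("grep", c("-e", shQuote("Ideal"), shQuote(csv)),
                   stdout = TRUE)

# Read just enough of the file to learn the column names.
col_names <- names(read.csv(csv, nrows = 1))

# Parse the pre-filtered lines; only matching rows are ever parsed by R.
ideal <- read.csv(text = matched, header = FALSE, col.names = col_names)
print(ideal)
```

Because the header rarely matches the search pattern, any grep-based reader has to restore column names in some way; this sketch does it with a second, one-row read.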
Note
When searching for literal strings (not regex patterns), set
fixed = TRUE to avoid regex interpretation. For example, searching for
"3.94" with fixed = FALSE will also match "3894", because "." is a regex
metacharacter that matches any single character.
Header rows are automatically handled:
With show_line_numbers=TRUE: Headers (line_number=1) are removed and lines renumbered
Without line numbers: Headers matching column names are removed
Empty rows and all-NA rows are automatically filtered out
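The effect of fixed = TRUE described in the note can be reproduced with base R's grepl(), which treats "." the same way. The snippet uses grepl() rather than the package functions, so it only illustrates the regex behaviour and assumes nothing about how the package passes the flag to grep.

```r
values <- c("3.94", "3894", "3.95")

# As a regular expression, "." matches any single character,
# so "3894" is reported as a match as well.
regex_hits <- grepl("3.94", values)

# As a fixed string, only the literal text "3.94" matches.
literal_hits <- grepl("3.94", values, fixed = TRUE)

print(data.frame(values, regex_hits, literal_hits))
```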
Detect Windows reliably
Description
Detect Windows reliably
Usage
is_windows()
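The help page does not say how the check is performed. One common approach in base R is sketched below; the helper name sketch_is_windows is hypothetical and the use of .Platform$OS.type is an assumption, not a description of the package's code.

```r
# Hypothetical helper: .Platform$OS.type is "windows" on Windows
# and "unix" on Linux and macOS.
sketch_is_windows <- function() {
  identical(.Platform$OS.type, "windows")
}

on_windows <- sketch_is_windows()
print(on_windows)
```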
Split columns based on a delimiter
Description
Efficiently splits character vectors into multiple columns based on a specified delimiter. This function is optimized for performance and handles common use cases like parsing grep output or other delimited text data.
Usage
split_columns(
x,
column.names = NA,
split = ":",
resulting.columns = 3,
fixed = TRUE
)
Arguments
x |
Character vector to split |
column.names |
Names for the resulting columns (optional) |
split |
Delimiter to split on (default: ":") |
resulting.columns |
Number of columns to create (default: 3) |
fixed |
Whether to use fixed string matching (default: TRUE) |
Value
A data.table with split columns. Column names are automatically assigned
as V1, V2, V3, etc. unless custom names are provided via column.names.
Examples
# Split grep-like output with colon delimiter
data <- c("file.txt:15:error message", "file.txt:23:warning message")
result <- split_columns(data, resulting.columns = 3)
print(result)
# With custom column names
result_named <- split_columns(data,
                              column.names = c("filename", "line", "message"),
                              resulting.columns = 3)
print(result_named)
# Split into 2 columns (combining remaining elements)
result_2col <- split_columns(data, resulting.columns = 2)
print(result_2col)
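To make "combining remaining elements" concrete, here is a minimal base R sketch. It assumes that pieces beyond the requested number of columns are re-joined with the delimiter into the last column; the helper name sketch_split is hypothetical, it returns a data.frame rather than a data.table, and it is not the package's implementation.

```r
# Hypothetical helper illustrating a capped split in base R.
sketch_split <- function(x, split = ":", n = 3) {
  pieces <- strsplit(x, split, fixed = TRUE)
  rows <- lapply(pieces, function(p) {
    if (length(p) > n) {
      # Re-join everything from piece n onward into the last column.
      p <- c(p[seq_len(n - 1)], paste(p[n:length(p)], collapse = split))
    }
    # Pad short rows with NA so every row has exactly n entries.
    length(p) <- n
    p
  })
  out <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
  names(out) <- paste0("V", seq_len(n))
  out
}

x <- c("file.txt:15:error message", "file.txt:23:warning message")
two_cols <- sketch_split(x, n = 2)
print(two_cols)
```

Capping the number of columns this way keeps a message that itself contains ":" intact instead of scattering it across extra columns.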