
grep is a command-line utility for searching plain-text data sets for lines that match a regular expression. Its name comes from the ed command g/re/p (globally search for a regular expression and print), which has the same effect: it runs a global search with the regular expression and prints all matching lines. grep was originally developed for the Unix operating system and later became available for all Unix-like systems. (source: Wikipedia)
Recently I had to search a rather large set of files (4 GB), spread across nested directories, for exact matches of the string 'error'. After some googling, I found a few commands recommended by others.
```
szehuang:~$ du -sh /dir/
4.1G dir/
```
Normal grep search
```
szehuang:~$ time grep -rniw --exclude=a.txt '/dir/' -e 'error' > err_grep.log

real 17m36.222s
user 0m10.594s
sys 0m13.398s
```
> Change the environment variable. GNU grep works faster in the C locale (128 unique characters) than in UTF-8 (>110,000 unique characters).
```
szehuang:~$ time LANG=C grep -rniw --exclude=a.txt '/dir/' -e 'error' > err_grep.log

real 8m46.802s
user 0m6.776s
sys 0m6.907s
```
> ripgrep is built on top of Rust's regex engine, which uses finite automata, SIMD and aggressive literal optimizations to make searching very fast.
```
szehuang:~$ time rg -rniw -g '!a.txt' '/dir/' -e 'error' > s1238_err.log

real 2m10.138s
user 0m6.793s
sys 0m15.474s
```
One caveat on the ripgrep flags: ripgrep searches directories recursively by default, and its -r is short for --replace, so -rniw is read as --replace niw rather than as the four grep flags. The closest counterpart of grep -rniw is rg -niw, so the run above is not a strictly like-for-like comparison.
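To see what the grep flags above actually select, here is a minimal sketch on a throwaway directory. The file names and contents are made up for illustration, and it uses LC_ALL=C instead of LANG=C because LC_ALL takes precedence over LANG and the other LC_* variables if they happen to be set.

```shell
#!/bin/sh
# Build a tiny temporary tree; all file names and contents here are hypothetical.
demo=$(mktemp -d)
mkdir -p "$demo/sub/deeper"
printf 'Error: disk full\nno errors here\n' > "$demo/sub/app.log"
printf 'fatal ERROR in module\n' > "$demo/sub/deeper/worker.log"
printf 'error in excluded file\n' > "$demo/a.txt"

# -r  recurse into subdirectories
# -n  prefix each match with its line number
# -i  ignore case (Error, ERROR, error)
# -w  match whole words only, so "errors" does not count
# --exclude=a.txt  skip files whose base name is a.txt
matches=$(LC_ALL=C grep -rniw --exclude=a.txt -e 'error' "$demo" | sort)
printf '%s\n' "$matches"

# Clean up the temporary tree.
rm -rf "$demo"
```

Only the lines from app.log and worker.log containing the standalone word should be printed; the "no errors here" line is dropped by -w and a.txt is skipped entirely.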
Of course, further testing with different settings is needed to find the best time, but I am currently satisfied with ripgrep's result: roughly 8 times faster than the plain grep run by wall-clock time (2m10s versus 17m36s).
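The speedup figures can be checked from the three "real" times reported above. This is just a small arithmetic sketch; the helper names to_seconds and speedup are my own, not part of any tool.

```shell
#!/bin/sh
# Convert a `time` value such as 17m36.222s into seconds.
to_seconds() {
    printf '%s\n' "$1" | awk -F'[ms]' '{ printf "%.3f", $1 * 60 + $2 }'
}

# Wall-clock ("real") times copied from the three runs.
grep_default=$(to_seconds 17m36.222s)
grep_c_locale=$(to_seconds 8m46.802s)
ripgrep=$(to_seconds 2m10.138s)

# Speedup = baseline time / new time.
speedup() {
    awk -v base="$1" -v new="$2" 'BEGIN { printf "%.1f", base / new }'
}

echo "LANG=C grep vs plain grep: $(speedup "$grep_default" "$grep_c_locale")x"
echo "ripgrep vs plain grep:     $(speedup "$grep_default" "$ripgrep")x"
echo "ripgrep vs LANG=C grep:    $(speedup "$grep_c_locale" "$ripgrep")x"
```

On these numbers the C locale roughly halves the grep time, and ripgrep comes out at about 8.1x over plain grep and about 4.0x over the LANG=C run. These are single runs, so disk caching between runs may account for part of the difference.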
Read more on: Feature comparison of ack, ag, git-grep, GNU grep and ripgrep