Faster Grep for matching strings

16 May 2018

In case you are wondering:

grep is a command-line utility for searching plain-text data sets for lines that match a regular expression. Its name comes from the ed command g/re/p (globally search a regular expression and print), which has the same effect: doing a global search with the regular expression and printing all matching lines. Grep was originally developed for the Unix operating system, but later became available for all Unix-like systems. (source: Wikipedia)

Recently I had to search a rather large set of files (about 4 GB), spread over a deeply nested directory tree, for whole-word matches of the string 'error'. After some googling, I found a few commands recommended by others.

szehuang:~$ du -sh /dir/
4.1G    /dir/

Normal grep search:

```
szehuang:~$ time grep -rniw --exclude=a.txt '/dir/' -e 'error' > err_grep.log

real    17m36.222s
user    0m10.594s
sys     0m13.398s
```

> Change the environment variable. GNU grep works faster in the C locale (128 unique characters) than in UTF-8 (more than 110,000 unique characters).

```
szehuang:~$ time LANG=C grep -rniw --exclude=a.txt '/dir/' -e 'error' > err_grep.log

real    8m46.802s
user    0m6.776s
sys     0m6.907s
```

> Ripgrep is built on top of Rust's regex engine, which uses finite automata, SIMD and aggressive literal optimizations to make searching very fast.

```
szehuang:~$ time rg -rniw -g '!a.txt' '/dir/' -e 'error' > s1238_err.log

real    2m10.138s
user    0m6.793s
sys     0m15.474s
```
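The locale override can be tried safely on a small scratch directory before pointing it at real data. The following is a minimal sketch, not the setup used for the timings: the file names and contents are made up for illustration, and it uses `LC_ALL=C` instead of `LANG=C` on the assumption that some `LC_*` variable might already be set, since `LC_ALL` takes precedence over `LANG`.

```shell
# Build a small scratch tree (hypothetical files, not the 4 GB data set).
scratch=$(mktemp -d)
mkdir -p "$scratch/a/b"
printf 'first line\nfatal ERROR here\nno errors\n' > "$scratch/a/one.log"
printf 'Error: disk full\nall good\n' > "$scratch/a/b/two.log"
printf 'error in excluded file\n' > "$scratch/a.txt"

# Same flags as the grep commands above: recursive, line numbers,
# case-insensitive, whole words only, skipping files named a.txt.
default_hits=$(grep -rniw --exclude=a.txt -e 'error' "$scratch" | sort)
c_hits=$(LC_ALL=C grep -rniw --exclude=a.txt -e 'error' "$scratch" | sort)

# For an ASCII pattern both runs should report the same lines.
if [ "$default_hits" = "$c_hits" ]; then
    echo "same matches in both locales"
fi
printf '%s\n' "$c_hits"

rm -rf "$scratch"
```

With a pure ASCII pattern such as 'error' the two runs should agree. If the pattern or the case-insensitive matching depends on non-ASCII characters, the C locale can give different results, so it is worth comparing the output once before trusting the faster run.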

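Of course, further testing with different settings is required to find the best time, but I am currently satisfied with the results from ripgrep: about 8 times faster than plain grep (17m36s down to 2m10s), while `LANG=C` alone made grep about twice as fast. One caveat for that further testing: ripgrep searches recursively by default and its `-r` flag means `--replace`, so `-rniw` is read as `--replace niw` rather than as the grep-style flag bundle. The closer equivalent of the grep command is `rg -niw`, and ripgrep also skips hidden, ignored and binary files unless told otherwise, so the two tools may not have searched exactly the same data.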
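The ratio between two `real` times from `time` can be worked out with a few lines of shell. This is a small sketch that assumes the `XmY.YYYs` format shown in the timings; the helper names `to_seconds` and `speedup` are made up for illustration.

```shell
# Convert a `time`-style duration such as 17m36.222s into seconds.
to_seconds() {
    echo "$1" | awk -F'[ms]' '{ printf "%.3f\n", $1 * 60 + $2 }'
}

# Print how many times longer the first duration is than the second.
speedup() {
    awk -v a="$(to_seconds "$1")" -v b="$(to_seconds "$2")" \
        'BEGIN { printf "%.1f\n", a / b }'
}

speedup 17m36.222s 8m46.802s   # plain grep vs LANG=C grep, prints 2.0
speedup 17m36.222s 2m10.138s   # plain grep vs ripgrep, prints 8.1
```

Expressing the result as "times faster" avoids the ambiguity of percentage figures, which can mean either the ratio of the two times or the reduction relative to the slower one.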
