GNU bug report logs - #60506
feature: parallel grep --recursive


Package: grep;

Reported by: Eike Dierks <foonlyboy <at> gmail.com>

Date: Tue, 3 Jan 2023 00:22:03 UTC

Severity: normal

To reply to this bug, email your comments to 60506 AT debbugs.gnu.org.



Report forwarded to bug-grep <at> gnu.org:
bug#60506; Package grep. (Tue, 03 Jan 2023 00:22:03 GMT) Full text and rfc822 format available.

Acknowledgement sent to Eike Dierks <foonlyboy <at> gmail.com>:
New bug report received and forwarded. Copy sent to bug-grep <at> gnu.org. (Tue, 03 Jan 2023 00:22:03 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Eike Dierks <foonlyboy <at> gmail.com>
To: bug-grep <at> gnu.org
Subject: feature: parallel grep --recursive
Date: Mon, 2 Jan 2023 21:49:13 +0100
Hi to the GNU grep development team,

I'd like to suggest a new feature
for grep --recursive:

grep --recursive should search files in parallel.

Rationale:
This could speed up grep by up to the number of available threads.

Currently, the --recursive option processes every file in sequence.
Instead, I want to start several greps in parallel.

If we want to do this well,
we would parse the expression first (which might be expensive)
and then fork on the files.

The master grep process would then collect the results,
so that they can be serialized
to be identical to the current implementation's output.

I'd like to suggest a --fast option,
where results show up as soon as they are found.

I am fed up with all those precomputed indexes.
I want to grep really fast, right now.

I expect that file access is fast these days, but has latency.
I want grep to saturate the machine.
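[Editorial note: nothing in grep itself does this today, but the unordered --fast behavior described above can already be approximated with standard tools. A sketch using the common xargs -P extension (not a grep feature; the pattern and starting directory are placeholders):]

```shell
# Sketch: parallel recursive search with existing tools, not a grep feature.
# xargs -P 4 keeps up to four grep processes running at once; matches are
# printed as they are found, i.e. unordered, like the proposed --fast mode.
find . -type f -print0 | xargs -0 -P 4 grep -H -n 'pattern'
```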

// job card
Information forwarded to bug-grep <at> gnu.org:
bug#60506; Package grep. (Tue, 03 Jan 2023 02:36:02 GMT) Full text and rfc822 format available.

Message #8 received at submit <at> debbugs.gnu.org (full text, mbox):

From: "Paul Jackson" <pj <at> usa.net>
To: bug-grep <at> gnu.org
Subject: Re: bug#60506: feature: parallel grep --recursive
Date: Mon, 02 Jan 2023 20:34:39 -0600
There's no need for special logic in grep to run parallel greps.

The "parallel" command can handle that for you.

For example, on the 12 core, 24 thread Ryzen CPU that I am using:

find $HOME -xdev -type f -ctime -333  |  wc -l     ## counts 136126 files.

find $HOME -xdev -type f -ctime -333 |
    parallel -m grep -l foobar | wc -l                        ## takes about 13 seconds

find $HOME -xdev -type f -ctime -333 |
    xargs -d '\n' grep -l foobar | wc -l                      ## takes about 52 seconds

The above parallel invocation ran 24 grep commands at once and took
about a quarter of the time; otherwise it performed much like xargs, which
ran one grep command at a time.

(Granted, reading either the 'parallel' or 'xargs' man pages is not easy <grin>.)

-- 
                Paul Jackson
                pj <at> usa.net




Information forwarded to bug-grep <at> gnu.org:
bug#60506; Package grep. (Tue, 03 Jan 2023 02:50:01 GMT) Full text and rfc822 format available.

Message #11 received at 60506 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Paul Jackson <pj <at> usa.net>, 60506 <at> debbugs.gnu.org
Subject: Re: bug#60506: feature: parallel grep --recursive
Date: Mon, 2 Jan 2023 18:48:49 -0800
On 2023-01-02 18:34, Paul Jackson wrote:
> There's no need for special logic in grep to run parallel greps.

There might be, if one wants to use a parallel grep to search a single 
large file.
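[Editorial note: a crude version of single-file parallelism is possible without touching grep, by splitting the file on line boundaries first. A sketch with coreutils; the file name big.txt and the chunk size are illustrative, and the reported names and line numbers refer to the chunks, not the original file:]

```shell
# Sketch: search one large file in parallel by splitting it into
# line-aligned chunks, then grepping the chunks as background jobs.
# Caveat: output file names and line numbers refer to the chunks.
split -l 100000 big.txt chunk.
for c in chunk.*; do
    grep -H -n 'pattern' "$c" &
done
wait            # collect all background greps
rm -f chunk.*
```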




Information forwarded to bug-grep <at> gnu.org:
bug#60506; Package grep. (Tue, 03 Jan 2023 03:00:02 GMT) Full text and rfc822 format available.

Message #14 received at 60506 <at> debbugs.gnu.org (full text, mbox):

From: "Paul Jackson" <pj <at> usa.net>
To: "Paul Eggert" <eggert <at> cs.ucla.edu>, 60506 <at> debbugs.gnu.org
Subject: Re: bug#60506: feature: parallel grep --recursive
Date: Mon, 02 Jan 2023 20:56:23 -0600
<< a parallel grep to search a single large file >>

I'm but one user, and a rather idiosyncratic user at that,
but for my usage patterns, the specialized logic that it
would take to run a parallelized grep on a large file
would likely not shrink the elapsed time enough to justify
the coding, documentation, and maintenance effort.

I would expect the time to read the large file in from disk to
dominate the total elapsed time in any case.

(or maybe I am just jealous that I didn't think of that parallel
grep use case myself <grin>.)

-- 
                Paul Jackson
                pj <at> usa.net




Information forwarded to bug-grep <at> gnu.org:
bug#60506; Package grep. (Tue, 03 Jan 2023 22:34:01 GMT) Full text and rfc822 format available.

Message #17 received at 60506 <at> debbugs.gnu.org (full text, mbox):

From: "Paul Jackson" <pj <at> usa.net>
To: "David G. Pickett" <dgpickett <at> aol.com>,
 "eggert <at> cs.ucla.edu" <eggert <at> cs.ucla.edu>,
 "60506 <at> debbugs.gnu.org" <60506 <at> debbugs.gnu.org>
Subject: Re: bug#60506: feature: parallel grep --recursive
Date: Tue, 03 Jan 2023 16:32:01 -0600
David Pickett wrote:
<< I also wrote a simpler, line oriented, faster xargs, fxargs!  >>

I've been quite pleased with an xargs wrapper I wrote that basically
converts newlines to nuls, and then invokes either "xargs" or, if asked
to run multiple threads, "parallel --xargs", passing all the "xargs" arguments
to "xargs --null".

I got all the exit statuses and such just right, and preferred having all the
xargs options available, once this hack worked around the confusing
space-character handling of xargs without the --null option.

I call my wrapper "x", a short name since I use it a lot, having been a regular
xargs user since it was added to Version 7 Unix, inside Bell Labs, back around
1978.

You can find my wrapper at:

http://thepythoniccow.us/x.c

By the way, even the original author of xargs, Herb Gellis, agrees that its
interface is somewhat borked.  See a note Gellis posted a decade after writing
xargs, which I include in the above "x.c" source.  An amusing bit of history ...

-- 
                Paul Jackson
                pj <at> usa.net


Information forwarded to bug-grep <at> gnu.org:
bug#60506; Package grep. (Tue, 03 Jan 2023 23:57:03 GMT) Full text and rfc822 format available.

Message #20 received at 60506 <at> debbugs.gnu.org (full text, mbox):

From: "David G. Pickett" <dgpickett <at> aol.com>
To: "pj <at> usa.net" <pj <at> usa.net>, "eggert <at> cs.ucla.edu" <eggert <at> cs.ucla.edu>, 
 "60506 <at> debbugs.gnu.org" <60506 <at> debbugs.gnu.org>
Subject: Re: bug#60506: feature: parallel grep --recursive
Date: Tue, 3 Jan 2023 21:15:18 +0000 (UTC)
It seems like we have two suggestions: parallelism across different files and parallelism within large files.
 - Parallelism across different files is doubly tricky: you need threads and a mutex on the file-name stream, and, for parallel directories, some sort of threads-and-queue arrangement to pass the file names from the producers to the grep consumers.
   - You might need a following consumer layer to ensure the output lines are in order, or at the very least not commingled.  A big FILE* buffer and fflush() can ensure each line is a single write(), but you give up the original ordering unless you buffer or sort the output.
   - You probably want to set a thread-count limit.
   - You might want to start with one file-name producer, one grep consumer-producer, and one arrange/sort consumer, and add more threads to whichever upstream side is emptying or filling a fixed-size queue.
   - But of course, a lot of this is available from "parallel" if you make a study of it!
   - I made a C pipe fitting I called xdemux to take a stream of lines (such as file names) from stdin and spread it in rotation to N downstream popen() pipes to a given command, like xargs grep.  N can be set to 2 x your local core count so it is less likely to block on IO, paging, or congestion.
   - I also wrote a simpler, line-oriented, faster xargs, fxargs!
   - I also wrote a C tool I called pipebuf to buffer stdin to stdout so one slow consumer does not stop others from getting work, but more parallelism is a simpler solution.
   - Hyperthreaded Intel CPUs can run twice as many threads in parallel as they have physical cores.

 - Parallelism within large files reminds me of Ab Initio ETL, which I assume divides a file into N portions; each thread is responsible for any line that starts in its portion, even if it ends in another.  Merging output to present hits in order requires some sort of buffering or sorting of output.  For very simple grep (is it in there?), you need to design it so you can call off the other threads on any hit.
Doing both of the above simultaneously would be a lot!  Either one is a lot to focus on for what is just one of many simple tools!  Other tools might want similar enhancements!  :D

File read speeds vary wildly, between network drives on various speed and congestion networks, spinning hard drives of various RPM and bit density, solid state drives, and then files cached in DRAM (most read IO uses mmap64()), not to mention in MOBO and CPU caches at many levels.  I wrote a mmap64() based fgrep and it turned out to be so "good" on a big file list that ALL the other processes on the group's server got swapped out big time (without parallelism)!


Information forwarded to bug-grep <at> gnu.org:
bug#60506; Package grep. (Wed, 04 Jan 2023 16:44:01 GMT) Full text and rfc822 format available.

Message #23 received at 60506 <at> debbugs.gnu.org (full text, mbox):

From: "David G. Pickett" <dgpickett <at> aol.com>
To: "60506 <at> debbugs.gnu.org" <60506 <at> debbugs.gnu.org>
Subject: Re: bug#60506: feature: parallel grep --recursive
Date: Wed, 4 Jan 2023 16:42:56 +0000 (UTC)
xargs enhancement: I collect new args while the last set is running, use a fixed common buffer for input, and vary the arg count down for long args.

dgp <at> dgp-p6803w:~$ fxargs2
Usage: fxargs2 [ -n <args_per_exec> ] [ -v ] [ -p ] <cmd> [ <cmd_arg> ... ]
Reads arguments as lines from standard input and executes: <cmd> [ <cmd_args> ... ] <args_from_stdin>
Each line becomes one argument.  The number of args per command is limited
by <args_per_exec> (default 1024).  The command is executed when either:
 - the total number of args from standard input is <args_per_exec>, or
 - the buffer has ( 80 * <args_per_exec> ) unexecuted bytes of input, or
 - stdin EOF is detected with any args from standard input.
The <cmd> [ <cmd_args> ... ] is never executed alone.
The buffer is fixed in size at 80 * <args_per_exec>, so long args can force
fewer <args_per_exec> for any pass.
While a command is executing, reading resumes, but before another command
is executed, the prior command must return a status.
With -v, any abnormal child state returned is reported.
With -p, any child terminating on SIGPIPE causes a normal exit.
dgp <at> dgp-p6803w:~$ 

I was tempted to exec more often if stdin was temporarily dry, but better is the enemy of good enough!
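[Editorial note: the arg-count cap described above mirrors the -n option of standard xargs; a minimal illustration with stock tools:]

```shell
# Standard xargs analogue of the args-per-exec cap described above:
# -n 2 limits each exec to two arguments, so echo runs three times here.
printf 'a\nb\nc\nd\ne\n' | xargs -n 2 echo
# → a b
#   c d
#   e
```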


Information forwarded to bug-grep <at> gnu.org:
bug#60506; Package grep. (Sat, 07 Jan 2023 00:42:02 GMT) Full text and rfc822 format available.

Message #26 received at 60506 <at> debbugs.gnu.org (full text, mbox):

From: Eike Dierks <foonlyboy <at> gmail.com>
To: 60506 <at> debbugs.gnu.org
Subject: parallel grep
Date: Sat, 7 Jan 2023 01:40:46 +0100
I was thinking about this again.
It looked easy at first, but it is not.

My prime use case would be grepping in /usr/include.
That searches a lot of files but returns only a few results.
In that case, searching many files in parallel could be beneficial.

But it gets a lot more troublesome
if you get many results from a single file.
In that case many results would need to be buffered
to preserve exactly the current ordering of the results,
because the output of grep should always stay stable.

But we could make this explicit:
we could introduce a new option, --parallel (-P),
which would not guarantee any ordering of the returned results.

I know that we have to be very conservative about how grep works.
Actually, a wrapper around GNU parallel could do.

I want grep to make my machine scream.
I want grep to use all the I/O and all the compute,
and to return results as fast as it can.
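[Editorial note: if the unordered --parallel mode proposed above existed, deterministic output could still be recovered after the fact by sorting on file name and line number, at the cost of streaming. A sketch that emulates the proposed mode with xargs -P, since the grep option is only a proposal:]

```shell
# Emulation of the proposed unordered --parallel mode via xargs -P,
# followed by a sort that restores a deterministic order:
# sort by file name, then numerically by line number.
find /usr/include -type f -name '*.h' -print0 |
    xargs -0 -P 8 grep -H -n 'pattern' |
    sort -t: -k1,1 -k2,2n
```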

// hi at the grep
// eike




Information forwarded to bug-grep <at> gnu.org:
bug#60506; Package grep. (Sun, 08 Jan 2023 04:34:01 GMT) Full text and rfc822 format available.

Message #29 received at 60506 <at> debbugs.gnu.org (full text, mbox):

From: "David G. Pickett" <dgpickett <at> aol.com>
To: "60506 <at> debbugs.gnu.org" <60506 <at> debbugs.gnu.org>
Subject: Re: bug#60506: parallel grep
Date: Sun, 8 Jan 2023 04:33:40 +0000 (UTC)
I recommend cscope for source file analysis.





GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.