GNU bug report logs - #56079
Missing performance optimisation: Word start/end tests

Previous Next

Package: grep;

Reported by: sur-behoffski <sur_behoffski <at> grouse.com.au>

Date: Sun, 19 Jun 2022 05:40:02 UTC

Severity: wishlist

To reply to this bug, email your comments to 56079 AT debbugs.gnu.org.

Toggle the display of automated, internal messages from the tracker.

View this report as an mbox folder, status mbox, maintainer mbox


Report forwarded to bug-grep <at> gnu.org:
bug#56079; Package grep. (Sun, 19 Jun 2022 05:40:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to sur-behoffski <sur_behoffski <at> grouse.com.au>:
New bug report received and forwarded. Copy sent to bug-grep <at> gnu.org. (Sun, 19 Jun 2022 05:40:02 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: sur-behoffski <sur_behoffski <at> grouse.com.au>
To: bug-grep <at> gnu.org
Subject: Missing performance optimisation: Word start/end tests
Date: Sun, 19 Jun 2022 15:08:46 +0930
G'day,

I'm in the throes of a massive rewrite of "hstbm", which combined
a very-quick-and-dirty lexer, no parser, and an optimised
Boyer-Moore-style search where I had made some incremental
improvements.  The only release is at savannah.non-gnu.org:

      https://savannah.nongnu.org/projects/hstbm

It's been over six years since the first and only release; Lua
fans will note that I have had another project that has been
active over that time, intended to help people use a
scientific/technical toolkit on a range of GNU/Linux machines.

--

I'm now in the process of trialling an all-singing, all-dancing
lexer, with the philosophy that it tries to capture the pattern
syntax and semantics, without resorting to parser constructs
such as an AST.  [I'm currently at a hairy point where the meaning
of characters such as "^" can vary, based on constructs such as
"(" (start-of-group) and/or "|" (alternation)... where does lexing
stop and parsing start?!]

One thing that is captured is predicates e.g. relating to IS_WORD:
	 "IS_WORD_YES"   (0x01),
         "IS_WORD_NO"    (0x10), and
         "IS_WORD_MAYBE" (0x11).

I've found some patterns containing word start/end boundary checks
that are impossible to match in practice, e.g.:

        a\<b
        [abc]\>[def]

Grep does not recognise these cases, and so spends time ploughing
through the text for a match that can never occur.  My lexing code,
in contrast, sees the "IS_WORD_YES" "\<" "IS_WORD_YES" (or,
equivalently, pairs of "IS_WORD_NO"), and arranges the lexical
token stream such that the very first token is (effectively)
MATCH_FAILED -- without any effort to inspect the haystack buffer.
This can reduce runtimes for large haystack input from seconds to
milliseconds.

While this is not a terribly common case, it's an easy item to
check for; it's possible that, in the future, patterns may become
less hand-crafted and more machine-crafted, and so this case may
become more relevant.

cheers,

sur-behoffski (Brenton Hoff)
programmer, Grouse Software

["sur-" means "meta-", it's a commentary on a peculiar Australian
event:  See "Tony Abbott" + "Captain's Pick" + "Prince Philip".
Absolutely no disrespect is intended to Honour-receivers at any level;
I am grateful for your service, and how you have enriched society.]




Information forwarded to bug-grep <at> gnu.org:
bug#56079; Package grep. (Sun, 19 Jun 2022 15:27:02 GMT) Full text and rfc822 format available.

Message #8 received at 56079 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: sur-behoffski <sur_behoffski <at> grouse.com.au>
Cc: 56079 <at> debbugs.gnu.org
Subject: Re: bug#56079: Missing performance optimisation: Word start/end tests
Date: Sun, 19 Jun 2022 10:26:43 -0500
On 6/19/22 00:38, sur-behoffski wrote:
>          a\<b
>          [abc]\>[def]
>
Yes, and this comes up even in strict POSIX without GNU extensions, as 
"a^" is a valid extended regular expression that cannot match anything.

I don't know whether dfa.c, regex.c, etc. optimize for this, but if not 
it would be nice if they did. For example, 'grep -E "a^" FOO' can 
silently exit with status 1 without even opening FOO.





Severity set to 'wishlist' from 'normal' Request was from Paul Eggert <eggert <at> cs.ucla.edu> to control <at> debbugs.gnu.org. (Sat, 02 Jul 2022 22:27:02 GMT) Full text and rfc822 format available.

This bug report was last modified 2 years and 132 days ago.

Previous Next


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.