GNU bug report logs -
#56079
Missing performance optimisation: Word start/end tests
Previous Next
To reply to this bug, email your comments to 56079 AT debbugs.gnu.org.
Toggle the display of automated, internal messages from the tracker.
Report forwarded
to
bug-grep <at> gnu.org
:
bug#56079
; Package
grep
.
(Sun, 19 Jun 2022 05:40:02 GMT)
Full text and
rfc822 format available.
Acknowledgement sent
to
sur-behoffski <sur_behoffski <at> grouse.com.au>
:
New bug report received and forwarded. Copy sent to
bug-grep <at> gnu.org
.
(Sun, 19 Jun 2022 05:40:02 GMT)
Full text and
rfc822 format available.
Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):
G'day,
I'm in the throes of a massive rewrite of "hstbm", which combined
a very-quick-and-dirty lexer, no parser, and an optimised
Boyer-Moore-style search where I had made some incremental
improvements. The only release is at savannah.non-gnu.org:
https://savannah.nongnu.org/projects/hstbm
It's been over six years since the first and only release; Lua
fans will note that I have had another project that has been
active over that time, intended to help people use a
scientific/technical toolkit on a range of GNU/Linux machines.
--
I'm now in the process of trialling an all-singing, all-dancing
lexer, with the philosophy that it tries to capture the pattern
syntax and semantics, without resorting to parser constructs
such as an AST. [I'm currently at a hairy point where the meaning
of characters such as "^" can vary, based on constructs such as
"(" (start-of-group) and/or "|" (alternation)... where does lexing
stop and parsing start?!]
One thing that is captured is predicates e.g. relating to IS_WORD:
"IS_WORD_YES" (0x01),
"IS_WORD_NO" (0x10), and
"IS_WORD_MAYBE" (0x11).
I've found some patterns containing word start/end boundary checks
that are impossible to match in practice, e.g.:
a\<b
[abc]\>[def]
Grep does not recognise these cases, and so spends time ploughing
through the text for a match that can never occur. My lexing code,
in contrast, sees the "IS_WORD_YES" "\<" "IS_WORD_YES" (or,
equivalently, pairs of "IS_WORD_NO"), and arranges the lexical
token stream such that the very first token is (effectively)
MATCH_FAILED -- without any effort to inspect the haystack buffer.
This can reduce runtimes for large haystack input from seconds to
milliseconds.
While this is not a terribly common case, it's an easy item to
check for; it's possible that, in the future, patterns may become
less hand-crafted and more machine-crafted, and so this case may
become more relevant.
cheers,
sur-behoffski (Brenton Hoff)
programmer, Grouse Software
["sur-" means "meta-", it's a commentary on a peculiar Australian
event: See "Tony Abbott" + "Captain's Pick" + "Prince Philip".
Absolutely no disrespect is intended to Honour-receivers at any level;
I am grateful for your service, and how you have enriched society.]
Information forwarded
to
bug-grep <at> gnu.org
:
bug#56079
; Package
grep
.
(Sun, 19 Jun 2022 15:27:02 GMT)
Full text and
rfc822 format available.
Message #8 received at 56079 <at> debbugs.gnu.org (full text, mbox):
On 6/19/22 00:38, sur-behoffski wrote:
> a\<b
> [abc]\>[def]
>
Yes, and this comes up even in strict POSIX without GNU extensions, as
"a^" is a valid extended regular expression that cannot match anything.
I don't know whether dfa.c, regex.c, etc. optimize for this, but if not
it would be nice if they did. For example, 'grep -E "a^" FOO' can
silently exit with status 1 without even opening FOO.
Severity set to 'wishlist' from 'normal'
Request was from
Paul Eggert <eggert <at> cs.ucla.edu>
to
control <at> debbugs.gnu.org
.
(Sat, 02 Jul 2022 22:27:02 GMT)
Full text and
rfc822 format available.
This bug report was last modified 2 years and 132 days ago.
Previous Next
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.