GNU bug report logs - #62483: echo a | grep -E -w '((()|a)|())*' # does not terminate
To reply to this bug, email your comments to 62483 AT debbugs.gnu.org.
Report forwarded to bug-grep <at> gnu.org: bug#62483; Package grep. (Mon, 27 Mar 2023 13:15:05 GMT)
Acknowledgement sent to Koen Claessen <koen <at> chalmers.se>: New bug report received and forwarded. Copy sent to bug-grep <at> gnu.org. (Mon, 27 Mar 2023 13:15:05 GMT)
Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):
Hello!
Running the command:
echo a | grep -E -w '((()|a)|())*'
does not terminate, and uses a LOT of processor time, for all versions of
grep I have tried.
This is the smallest case that could be found; simplifying anything in the
input and/or expression leads to correct behavior.
Kind regards,
/Koen
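When reproducing the report it helps to bound the run time so that an affected grep cannot hang the terminal. The following is a minimal sketch, assuming GNU coreutils `timeout` is installed; the helper name `probe_grep` and the `PROBE_TIMEOUT` variable are made up for illustration and are not part of grep.

```shell
# Hypothetical helper: run one "grep -E" probe under a time limit.
# usage: probe_grep PATTERN INPUT [extra grep options...]
probe_grep() {
  pattern=$1
  input=$2
  shift 2
  status=0
  printf '%s\n' "$input" |
    timeout "${PROBE_TIMEOUT:-5}" grep -E "$@" -e "$pattern" >/dev/null 2>&1 ||
    status=$?
  case $status in
    0)   echo match ;;
    1)   echo no-match ;;
    124) echo timeout ;;          # GNU timeout exits with 124 when the limit is hit
    *)   echo "error($status)" ;;
  esac
}

# The reported case: an affected grep should report "timeout", a fixed one "match".
probe_grep '((()|a)|())*' a -w
```

The result depends on the installed grep and glibc, so no expected output is given for the last line.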
Information forwarded to bug-grep <at> gnu.org: bug#62483; Package grep. (Mon, 27 Mar 2023 17:55:01 GMT)
Message #8 received at 62483 <at> debbugs.gnu.org (full text, mbox):
On 27 Mar 2023, at 11:14, Koen Claessen <koen <at> chalmers.se> wrote:
>
> Running the command:
>
> echo a | grep -E -w '((()|a)|())*'
>
> does not terminate, and uses a LOT of processor time, for all versions of
> grep I have tried.
>
> This is the smallest case that could be found; simplifying anything in the
> input and/or expression leads to correct behavior.
Reproducible with GNU grep 3.7 on Ubuntu 22:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
93938 dim 20 0 9.0m 2.1m 2.0m R 100.0 0.0 0:08.32 grep -E -w ((()|a)|())*
It seems that at least on Ubuntu, grep in this mode uses glibc's regexec(3), and it is that implementation which ends up in an endless loop.
It loops between lines 1415, 1417 and 1443, but idx and cur_node never change from 0:
1378 static reg_errcode_t
1379 __attribute_warn_unused_result__
1380 set_regs (const regex_t *preg, const re_match_context_t *mctx, size_t nmatch,
1381 regmatch_t *pmatch, bool fl_backtrack)
1382 {
...
1415 for (idx = pmatch[0].rm_so; idx <= pmatch[0].rm_eo ;)
1416 {
1417 update_regs (dfa, pmatch, prev_idx_match, cur_node, idx, nmatch);
1418
1419 if ((idx == pmatch[0].rm_eo && cur_node == mctx->last_node)
1420 || (fs && re_node_set_contains (&eps_via_nodes, cur_node)))
...
1442 /* Proceed to next node. */
1443 cur_node = proceed_next_node (mctx, nmatch, pmatch, prev_idx_match,
1444 &idx, cur_node,
1445 &eps_via_nodes, fs);
1446
1447 if (__glibc_unlikely (cur_node < 0))
...
1465 }
1466 }
-Dimitry
P.S.: Interestingly this does not reproduce with BSD grep, which returns immediately with "a".
Information forwarded to bug-grep <at> gnu.org: bug#62483; Package grep. (Tue, 28 Mar 2023 06:48:01 GMT)
Message #11 received at 62483 <at> debbugs.gnu.org (full text, mbox):
This does not reproduce with gawk, even when I force use of the regex
matcher.
What happens if grep is built with the included regex files instead of
relying on the ones in the local glibc?
Arnold
Information forwarded to bug-grep <at> gnu.org: bug#62483; Package grep. (Fri, 31 Mar 2023 15:26:02 GMT)
Message #14 received at submit <at> debbugs.gnu.org (full text, mbox):
Yes, it still reproduces when I configure the latest grep using
--without-included-regex:
1400 for (idx = pmatch[0].rm_so; idx <= pmatch[0].rm_eo ;)
1: idx = 0
(gdb)
1402 update_regs (dfa, pmatch, prev_idx_match, cur_node, idx, nmatch);
1: idx = 0
(gdb)
1404 if ((idx == pmatch[0].rm_eo && cur_node == mctx->last_node)
1: idx = 0
(gdb)
1405 || (fs && re_node_set_contains (&eps_via_nodes, cur_node)))
1: idx = 0
(gdb)
1428 cur_node = proceed_next_node (mctx, nmatch, pmatch, prev_idx_match,
1: idx = 0
(gdb)
1432 if (__glibc_unlikely (cur_node < 0))
1: idx = 0
(gdb)
1400 for (idx = pmatch[0].rm_so; idx <= pmatch[0].rm_eo ;)
1: idx = 0
(gdb)
1402 update_regs (dfa, pmatch, prev_idx_match, cur_node, idx, nmatch);
1: idx = 0
(gdb)
1404 if ((idx == pmatch[0].rm_eo && cur_node == mctx->last_node)
1: idx = 0
(gdb)
1405 || (fs && re_node_set_contains (&eps_via_nodes, cur_node)))
1: idx = 0
The endless loop looks the same. grep's regexec.c is mostly the same as
glibc's, except for the latter having a bunch of #if RE_ENABLE_I18N
directives added.
-Dimitry
Information forwarded to bug-grep <at> gnu.org: bug#62483; Package grep. (Sun, 02 Apr 2023 04:16:02 GMT)
Message #20 received at 62483 <at> debbugs.gnu.org (full text, mbox):
On Mon, Mar 27, 2023 at 6:15 AM Koen Claessen <koen <at> chalmers.se> wrote:
> Running the command:
>
> echo a | grep -E -w '((()|a)|())*'
>
> does not terminate, and uses a LOT of processor time, for all versions of
> grep I have tried.
>
> This is the smallest case that could be found; simplifying anything in the
> input and/or expression leads to correct behavior.
Thank you! How did you find that?
FYI, this strikes grep-3.10 (on Fedora 37/glibc-2.36-9.fc37.x86_64)
when using LC_ALL=en_US.UTF-8, but not with LC_ALL=C.
I.e., this infloops:
echo a | LC_ALL=en_US.UTF-8 grep -E -w '((()|a)|())*'
but this works as expected and promptly prints its line of input:
echo a | LC_ALL=C grep -E -w '((()|a)|())*'
For now, I've added an expected-failing test case for this bug:
[grep-glibc-infloop.patch (application/octet-stream, attachment)]
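The locale dependence described above can be checked side by side. This is only a sketch: it assumes GNU coreutils `timeout` and an installed `en_US.UTF-8` locale (if that locale is missing, grep typically falls back to the C locale and both runs behave alike). The helper `classify_status` and the `PROBE_TIMEOUT` variable are hypothetical names used for illustration.

```shell
# Hypothetical helper: turn an exit status into a readable label.
classify_status() {
  case $1 in
    0)   echo match ;;
    1)   echo no-match ;;
    124) echo timeout ;;          # exit status GNU timeout uses when the limit is hit
    *)   echo "error($1)" ;;
  esac
}

# Run the reported command once per locale, with a time limit on each run.
for loc in C en_US.UTF-8; do
  status=0
  printf 'a\n' |
    LC_ALL=$loc timeout "${PROBE_TIMEOUT:-5}" grep -E -w '((()|a)|())*' >/dev/null 2>&1 ||
    status=$?
  printf '%-12s %s\n' "$loc" "$(classify_status "$status")"
done
```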
Information forwarded to bug-grep <at> gnu.org: bug#62483; Package grep. (Sun, 02 Apr 2023 06:53:01 GMT)
Message #23 received at submit <at> debbugs.gnu.org (full text, mbox):
Hi.
Dimitry Andric <dimitry <at> andric.com> wrote:
> Yes, it still reproduces when I configure the latest grep using
> --without-included-regex:
I assume you meant --with-included-regex?
If you really used --without-included-regex, that doesn't prove anything... :-)
It's interesting, as gawk uses the same regex, but with different flags.
It might be worth trying to understand which of the syntax bits
is causing this.
Thanks,
Arnold
Information forwarded to bug-grep <at> gnu.org: bug#62483; Package grep. (Sun, 02 Apr 2023 09:34:02 GMT)
Message #29 received at 62483 <at> debbugs.gnu.org (full text, mbox):
Ah sorry, I did indeed rebuild grep with --with-included-regex, and for
debugging purposes added CFLAGS="-O0 -g".
In any case, the problematic code is in both glibc and grep, as I
believe both originate from the same source.
-Dimitry
Information forwarded to bug-grep <at> gnu.org: bug#62483; Package grep. (Sun, 02 Apr 2023 10:05:01 GMT)
Message #35 received at submit <at> debbugs.gnu.org (full text, mbox):
OK, thanks!
Dimitry Andric <dimitry <at> andric.com> wrote:
> Ah sorry, I did indeed rebuild grep with --with-included-regex, and for
> debugging purposes added CFLAGS="-O0 -g".
>
> In any case, the problematic code is both in glibc and grep, as I
> believe these are originating from the same source.
Information forwarded to bug-grep <at> gnu.org: bug#62483; Package grep. (Sun, 02 Apr 2023 18:30:02 GMT)
Message #41 received at 62483 <at> debbugs.gnu.org (full text, mbox):
On 2023-04-01 23:52, arnold <at> skeeve.com wrote:
> It's interesting, as gawk uses the same regex, but with different flags.
Also, GNU grep -w passes the following more-complicated regexp to dfaparse:
(^|[^[:alnum:]_])(((()|a)|())*)([^[:alnum:]_]|$)
and quite possibly the bug is related to this extra complexity.
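To experiment with the wrapped form directly, without `-w`, the expression quoted above can be rebuilt from any inner pattern. This sketch only reproduces the string shown in this message; the function name `wrap_word_regex` is hypothetical, and whether grep constructs the expression in exactly this way internally is an assumption.

```shell
# Hypothetical helper: wrap PATTERN the way the expression quoted above is shaped:
#   (^|[^[:alnum:]_])(PATTERN)([^[:alnum:]_]|$)
wrap_word_regex() {
  printf '%s\n' "(^|[^[:alnum:]_])($1)([^[:alnum:]_]|\$)"
}

# Rebuild the expression from the report's pattern.
wrap_word_regex '((()|a)|())*'
```

The resulting string can then be passed to `grep -E` (ideally under `timeout`) to see whether the wrapped form alone is enough to trigger the loop.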
Information forwarded to bug-grep <at> gnu.org: bug#62483; Package grep. (Sun, 02 Apr 2023 20:25:02 GMT)
Message #44 received at 62483 <at> debbugs.gnu.org (full text, mbox):
On Sun, Apr 2, 2023 at 11:30 AM Paul Eggert <eggert <at> cs.ucla.edu> wrote:
>
> Also, GNU grep -w passes the following more-complicated regexp to dfaparse:
But AFAIK `-w` is not necessary to trigger it, as the following also
infloops in Fedora Rawhide:
$ echo a | grep -E '((()|a)|())+'
Interestingly, the loop is broken if any character is added to any of
the `()` branches, which might mean that this is unlikely to
happen in well-formed expressions.
Carlo
PS. -P doesn't loop, and neither does `echo a | grep -E '((a|())|())+'`,
nor `'(()|(a|()))+'`, nor `'(()|(()|a))+'`.
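The variants listed in this message can be compared in one pass. A minimal sketch, assuming GNU coreutils `timeout`; the function name `probe` and the `PROBE_TIMEOUT` variable are hypothetical. The first pattern is the one reported to loop; the other three are the reorderings reported not to.

```shell
# Hypothetical helper: match PATTERN against the single input line "a"
# with "grep -E", giving up after a time limit.
probe() {
  status=0
  printf 'a\n' |
    timeout "${PROBE_TIMEOUT:-5}" grep -E -e "$1" >/dev/null 2>&1 ||
    status=$?
  case $status in
    0)   echo match ;;
    1)   echo no-match ;;
    124) echo timeout ;;          # GNU timeout's exit status when the limit is hit
    *)   echo "error($status)" ;;
  esac
}

# One pattern per line; the quoted heredoc keeps the shell from expanding them.
while IFS= read -r pattern; do
  printf '%-16s %s\n' "$pattern" "$(probe "$pattern")"
done <<'EOF'
((()|a)|())+
((a|())|())+
(()|(a|()))+
(()|(()|a))+
EOF
```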
Information forwarded to bug-grep <at> gnu.org: bug#62483; Package grep. (Sun, 02 Apr 2023 21:31:02 GMT)
Message #47 received at 62483 <at> debbugs.gnu.org (full text, mbox):
On Sun, Apr 2, 2023 at 1:25 PM Carlo Arenas <carenas <at> gmail.com> wrote:
> On Sun, Apr 2, 2023 at 11:30 AM Paul Eggert <eggert <at> cs.ucla.edu> wrote:
> >
> > Also, GNU grep -w passes the following more-complicated regexp to dfaparse:
>
> but AFAIK `-w` is not necessary to trigger it, as the following also
> infloops in Fedora Rawhide
>
> $ echo a | grep -E '((()|a)|())+'
FYI, this prints its input line (and no infloop) when grep is
configured --with-included-regex, so at least for that one, it may be
due to a recent change in upstream glibc.
Information forwarded to bug-grep <at> gnu.org: bug#62483; Package grep. (Mon, 03 Apr 2023 14:08:03 GMT)
Message #50 received at 62483 <at> debbugs.gnu.org (full text, mbox):
I found it when I was testing various new regular expression algorithms
against grep (which I used as the gold standard for this).
I used a random generator for regular expressions (using the QuickCheck
framework) and then shrinking/delta debugging to automatically find the
smallest failing test case.
BTW, if you are interested, I could do a larger, more targeted effort to
stress-test grep like this and possibly find more test cases with unexpected
behavior. I would need some guidance on where to put most effort in order
to be as useful as possible. I could find an MSc student to help out with
that. Let me know if this sounds like an interesting thing to do!
kind regards,
/Koen
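One step of the shrinking idea mentioned above can be sketched in shell: produce every candidate obtained by deleting a single character from a failing pattern. This is an illustration only, not the QuickCheck tooling used for the report; the function name `shrink_candidates` is hypothetical. A real shrinker would re-run the failing check on each candidate, discard those that are not valid expressions, and repeat from any candidate that still fails.

```shell
# Hypothetical helper: print PATTERN with each single character removed,
# one candidate per line. The pattern is passed through the environment so
# that awk does not interpret backslashes in it.
shrink_candidates() {
  PATTERN=$1 awk 'BEGIN {
    p = ENVIRON["PATTERN"]
    for (i = 1; i <= length(p); i++)
      print substr(p, 1, i - 1) substr(p, i + 1)
  }'
}

shrink_candidates '(a|b)*'
```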
Information forwarded to bug-grep <at> gnu.org: bug#62483; Package grep. (Mon, 03 Apr 2023 15:58:02 GMT)
Message #53 received at 62483 <at> debbugs.gnu.org (full text, mbox):
On 2023-04-03 05:07, Koen Claessen wrote:
> BTW, if you are interested, I could do a larger more targeted effort stress
> testing grep like this and possibly find more test cases with unexpected
> behavior. I would need some guidance on where to put most effort in order
> to be as useful as this can be. I could find a MSc student to help out with
> that. Let me know if this sounds like an interesting thing to do!
Any help like this would be most welcome.
Unfortunately (or perhaps fortunately for your student, who will learn a
lot!), none of the current maintainers of the glibc regex code really
understand it. The code's original author is no longer available to
answer questions, and the code is tricky as it attempts to implement
POSIX regular expressions (which are worst-case exponential) efficiently
in the usual cases.
The main guidance I can give you is to look at the existing bug reports
against glibc regex[1] and against grep[2], as well as at the grep
source code itself[3].
[1]: https://sourceware.org/bugzilla/buglist.cgi?component=regex&product=glibc
[2]: https://debbugs.gnu.org/cgi/pkgreport.cgi?package=grep
[3]: https://savannah.gnu.org/projects/grep
This bug report was last modified 1 year and 172 days ago.
GNU bug tracking system
Copyright (C) 1999 Darren O. Benham,
1997,2003 nCipher Corporation Ltd,
1994-97 Ian Jackson.