GNU bug report logs - #62483
echo a | grep -E -w '((()|a)|())*' # does not terminate

Package: grep;

Reported by: Koen Claessen <koen <at> chalmers.se>

Date: Mon, 27 Mar 2023 13:15:05 UTC

Severity: normal

To reply to this bug, email your comments to 62483 AT debbugs.gnu.org.


Report forwarded to bug-grep <at> gnu.org:
bug#62483; Package grep. (Mon, 27 Mar 2023 13:15:05 GMT) Full text and rfc822 format available.

Acknowledgement sent to Koen Claessen <koen <at> chalmers.se>:
New bug report received and forwarded. Copy sent to bug-grep <at> gnu.org. (Mon, 27 Mar 2023 13:15:05 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Koen Claessen <koen <at> chalmers.se>
To: bug-grep <at> gnu.org
Subject: echo a | grep -E -w '((()|a)|())*' # does not terminate
Date: Mon, 27 Mar 2023 11:14:17 +0200
[Message part 1 (text/plain, inline)]
Hello!

Running the command:

  echo a | grep -E -w '((()|a)|())*'

does not terminate, and uses a LOT of processor time, for all versions of
grep I have tried.

This is the smallest case that could be found; simplifying anything in the
input and/or expression leads to correct behavior.
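To try the command without tying up a terminal, it can be run under a time limit. The following is a minimal sketch, assuming timeout(1) from GNU coreutils is installed; the helper name `probe` and the 2-second limit are illustrative choices, not part of the report. The labels follow grep's exit-status convention (0 for a match, 1 for no match), and 124 is the status timeout(1) returns when the limit expires.

```shell
# Hypothetical helper: run a command under a wall-clock limit and name the outcome.
# Assumes timeout(1) from GNU coreutils, which exits with 124 when the limit expires.
probe() {
  limit=$1
  shift
  timeout "$limit" "$@" >/dev/null 2>&1
  case $? in
    0)   echo matched ;;
    1)   echo no-match ;;
    124) echo timed-out ;;
    *)   echo error ;;
  esac
}

# The reported command, capped at 2 seconds instead of running indefinitely.
echo a | probe 2 grep -E -w '((()|a)|())*'
```

On a system that shows the behavior described here, the last line would be expected to report `timed-out`; elsewhere it should report `matched`.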

Kind regards,
/Koen
[Message part 2 (text/html, inline)]

Information forwarded to bug-grep <at> gnu.org:
bug#62483; Package grep. (Mon, 27 Mar 2023 17:55:01 GMT) Full text and rfc822 format available.

Message #8 received at 62483 <at> debbugs.gnu.org (full text, mbox):

From: Dimitry Andric <dimitry <at> andric.com>
To: Koen Claessen <koen <at> chalmers.se>
Cc: 62483 <at> debbugs.gnu.org
Subject: Re: bug#62483: echo a | grep -E -w '((()|a)|())*' # does not terminate
Date: Mon, 27 Mar 2023 19:54:27 +0200
[Message part 1 (text/plain, inline)]
On 27 Mar 2023, at 11:14, Koen Claessen <koen <at> chalmers.se> wrote:
> 
> Running the command:
> 
>  echo a | grep -E -w '((()|a)|())*'
> 
> does not terminate, and uses a LOT of processor time, for all versions of
> grep I have tried.
> 
> This is the smallest case that could be found; simplifying anything in the
> input and/or expression leads to correct behavior.

Reproducible with GNU grep 3.7 on Ubuntu 22:

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
  93938 dim       20   0    9.0m   2.1m   2.0m R 100.0   0.0   0:08.32 grep -E -w ((()|a)|())*

It seems that at least on Ubuntu, grep in this mode uses glibc's regexec(3), and it is that implementation which ends up in an endless loop.

It loops between lines 1415, 1417 and 1443, but idx and cur_node never change from 0:

  1378  static reg_errcode_t
  1379  __attribute_warn_unused_result__
  1380  set_regs (const regex_t *preg, const re_match_context_t *mctx, size_t nmatch,
  1381            regmatch_t *pmatch, bool fl_backtrack)
  1382  {
...
  1415    for (idx = pmatch[0].rm_so; idx <= pmatch[0].rm_eo ;)
  1416      {
  1417        update_regs (dfa, pmatch, prev_idx_match, cur_node, idx, nmatch);
  1418
  1419        if ((idx == pmatch[0].rm_eo && cur_node == mctx->last_node)
  1420            || (fs && re_node_set_contains (&eps_via_nodes, cur_node)))
...
  1442        /* Proceed to next node.  */
  1443        cur_node = proceed_next_node (mctx, nmatch, pmatch, prev_idx_match,
  1444                                      &idx, cur_node,
  1445                                      &eps_via_nodes, fs);
  1446
  1447        if (__glibc_unlikely (cur_node < 0))
...
  1465          }
  1466      }

-Dimitry

P.S.: Interestingly this does not reproduce with BSD grep, which returns immediately with "a".

[signature.asc (application/pgp-signature, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#62483; Package grep. (Tue, 28 Mar 2023 06:48:01 GMT) Full text and rfc822 format available.

Message #11 received at 62483 <at> debbugs.gnu.org (full text, mbox):

From: arnold <at> skeeve.com
To: koen <at> chalmers.se, dimitry <at> andric.com
Cc: 62483 <at> debbugs.gnu.org
Subject: Re: bug#62483: echo a | grep -E -w '((()|a)|())*' # does not terminate
Date: Tue, 28 Mar 2023 00:46:53 -0600
This does not reproduce with gawk, even when I force use of the regex
matcher.

What happens if grep is built with the included regex files instead of
relying on the ones in the local glibc?

Arnold





Information forwarded to bug-grep <at> gnu.org:
bug#62483; Package grep. (Fri, 31 Mar 2023 15:26:02 GMT) Full text and rfc822 format available.

Message #14 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Dimitry Andric <dimitry <at> andric.com>
To: arnold <at> skeeve.com
Cc: 62483 <at> debbugs.gnu.org, Koen Claessen <koen <at> chalmers.se>, bug-grep <at> gnu.org
Subject: Re: bug#62483: echo a | grep -E -w '((()|a)|())*' # does not terminate
Date: Fri, 31 Mar 2023 17:25:21 +0200
[Message part 1 (text/plain, inline)]
Yes, it still reproduces when I configure the latest grep using
--without-included-regex:

1400      for (idx = pmatch[0].rm_so; idx <= pmatch[0].rm_eo ;)
1: idx = 0
(gdb)
1402          update_regs (dfa, pmatch, prev_idx_match, cur_node, idx, nmatch);
1: idx = 0
(gdb)
1404          if ((idx == pmatch[0].rm_eo && cur_node == mctx->last_node)
1: idx = 0
(gdb)
1405              || (fs && re_node_set_contains (&eps_via_nodes, cur_node)))
1: idx = 0
(gdb)
1428          cur_node = proceed_next_node (mctx, nmatch, pmatch, prev_idx_match,
1: idx = 0
(gdb)
1432          if (__glibc_unlikely (cur_node < 0))
1: idx = 0
(gdb)
1400      for (idx = pmatch[0].rm_so; idx <= pmatch[0].rm_eo ;)
1: idx = 0
(gdb)
1402          update_regs (dfa, pmatch, prev_idx_match, cur_node, idx, nmatch);
1: idx = 0
(gdb)
1404          if ((idx == pmatch[0].rm_eo && cur_node == mctx->last_node)
1: idx = 0
(gdb)
1405              || (fs && re_node_set_contains (&eps_via_nodes, cur_node)))
1: idx = 0

The endless loop looks the same. grep's regexec.c is mostly the same as
glibc's, except for the latter having a bunch of #if RE_ENABLE_I18N
directives added.

-Dimitry


[signature.asc (application/pgp-signature, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#62483; Package grep. (Sun, 02 Apr 2023 04:16:02 GMT) Full text and rfc822 format available.

Message #20 received at 62483 <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: Koen Claessen <koen <at> chalmers.se>
Cc: 62483 <at> debbugs.gnu.org
Subject: Re: bug#62483: echo a | grep -E -w '((()|a)|())*' # does not terminate
Date: Sat, 1 Apr 2023 21:15:07 -0700
[Message part 1 (text/plain, inline)]
On Mon, Mar 27, 2023 at 6:15 AM Koen Claessen <koen <at> chalmers.se> wrote:
> Running the command:
>
>   echo a | grep -E -w '((()|a)|())*'
>
> does not terminate, and uses a LOT of processor time, for all versions of
> grep I have tried.
>
> This is the smallest case that could be found; simplifying anything in the
> input and/or expression leads to correct behavior.

Thank you! How did you find that?

FYI, this strikes grep-3.10 (on Fedora 37/glibc-2.36-9.fc37.x86_64)
when using LC_ALL=en_US.UTF-8, but not with LC_ALL=C.
I.e., this infloops:
   echo a | LC_ALL=en_US.UTF-8 grep -E -w '((()|a)|())*'

but this works as expected and promptly prints its line of input:
     echo a | LC_ALL=C grep -E -w '((()|a)|())*'
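The locale comparison above can be scripted so that neither case blocks. This is a sketch only: `locale_probe` is a hypothetical helper, the 1-second cap is arbitrary, and timeout(1) from GNU coreutils is assumed to be available. If a named locale is not installed, grep falls back to the C locale, so that result says nothing about the bug.

```shell
# Hypothetical helper: run the same -w search under each given locale,
# capped at 1 second, and print the exit status per locale.
# Status 124 comes from timeout(1) and means the time limit expired.
locale_probe() {
  pattern=$1
  shift
  for loc in "$@"; do
    echo a | LC_ALL=$loc timeout 1 grep -E -w "$pattern" >/dev/null 2>&1
    echo "LC_ALL=$loc exit=$?"
  done
}

locale_probe '((()|a)|())*' C en_US.UTF-8
```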

For now, I've added an expected-failing test case for this bug:
[grep-glibc-infloop.patch (application/octet-stream, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#62483; Package grep. (Sun, 02 Apr 2023 06:53:01 GMT) Full text and rfc822 format available.

Message #23 received at submit <at> debbugs.gnu.org (full text, mbox):

From: arnold <at> skeeve.com
To: dimitry <at> andric.com, arnold <at> skeeve.com
Cc: 62483 <at> debbugs.gnu.org, koen <at> chalmers.se, bug-grep <at> gnu.org
Subject: Re: bug#62483: echo a | grep -E -w '((()|a)|())*' # does not terminate
Date: Sun, 02 Apr 2023 00:52:02 -0600
Hi.

Dimitry Andric <dimitry <at> andric.com> wrote:

> Yes, it still reproduces when I configure the latest grep using
> --without-included-regex:

I assume you meant --with-included-regex?

If you really used --without-included-regex, that doesn't prove anything... :-)

It's interesting, as gawk uses the same regex, but with different flags.
It might be worth trying to understand which of the syntax bits
is causing this.

Thanks,

Arnold





Information forwarded to bug-grep <at> gnu.org:
bug#62483; Package grep. (Sun, 02 Apr 2023 06:53:02 GMT) Full text and rfc822 format available.

Information forwarded to bug-grep <at> gnu.org:
bug#62483; Package grep. (Sun, 02 Apr 2023 09:34:02 GMT) Full text and rfc822 format available.

Message #29 received at 62483 <at> debbugs.gnu.org (full text, mbox):

From: Dimitry Andric <dimitry <at> andric.com>
To: arnold <at> skeeve.com
Cc: 62483 <at> debbugs.gnu.org, Koen Claessen <koen <at> chalmers.se>, bug-grep <at> gnu.org
Subject: Re: bug#62483: echo a | grep -E -w '((()|a)|())*' # does not terminate
Date: Sun, 2 Apr 2023 11:33:42 +0200
[Message part 1 (text/plain, inline)]
Ah sorry, I did indeed rebuild grep with --with-included-regex, and for
debugging purposes added CFLAGS="-O0 -g".

In any case, the problematic code is both in glibc and grep, as I
believe these are originating from the same source.

-Dimitry


[signature.asc (application/pgp-signature, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#62483; Package grep. (Sun, 02 Apr 2023 10:05:01 GMT) Full text and rfc822 format available.

Message #35 received at submit <at> debbugs.gnu.org (full text, mbox):

From: arnold <at> skeeve.com
To: dimitry <at> andric.com, arnold <at> skeeve.com
Cc: 62483 <at> debbugs.gnu.org, koen <at> chalmers.se, bug-grep <at> gnu.org
Subject: Re: bug#62483: echo a | grep -E -w '((()|a)|())*' # does not terminate
Date: Sun, 02 Apr 2023 04:03:56 -0600
OK, thanks!

Dimitry Andric <dimitry <at> andric.com> wrote:

> Ah sorry, I did indeed rebuild grep with --with-included-regex, and for
> debugging purposes added CFLAGS="-O0 -g".
>
> In any case, the problematic code is both in glibc and grep, as I
> believe these are originating from the same source.
>
> -Dimitry




Information forwarded to bug-grep <at> gnu.org:
bug#62483; Package grep. (Sun, 02 Apr 2023 10:05:02 GMT) Full text and rfc822 format available.

Information forwarded to bug-grep <at> gnu.org:
bug#62483; Package grep. (Sun, 02 Apr 2023 18:30:02 GMT) Full text and rfc822 format available.

Message #41 received at 62483 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: arnold <at> skeeve.com, dimitry <at> andric.com
Cc: 62483 <at> debbugs.gnu.org, koen <at> chalmers.se
Subject: Re: bug#62483: echo a | grep -E -w '((()|a)|())*' # does not terminate
Date: Sun, 2 Apr 2023 11:29:39 -0700
On 2023-04-01 23:52, arnold <at> skeeve.com wrote:
> It's interesting, as gawk uses the same regex, but with different flags.

Also, GNU grep -w passes the following more-complicated regexp to dfaparse:

  (^|[^[:alnum:]_])(((()|a)|())*)([^[:alnum:]_]|$)

and quite possibly the bug is related to this extra complexity.
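To experiment with the wrapped form directly (without -w), the wrapper can be reproduced by hand. A minimal sketch: `wrap_word` is a hypothetical shell helper that only builds the string shown above; grep's actual implementation is in C and is not reproduced here.

```shell
# Hypothetical helper: surround a pattern with the word-boundary
# alternatives quoted above. This only builds the string.
wrap_word() {
  printf '(^|[^[:alnum:]_])(%s)([^[:alnum:]_]|$)\n' "$1"
}

wrap_word '((()|a)|())*'
# prints: (^|[^[:alnum:]_])(((()|a)|())*)([^[:alnum:]_]|$)
```

The printed pattern can then be passed to `grep -E` to check whether the wrapped form alone shows the same behavior.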




Information forwarded to bug-grep <at> gnu.org:
bug#62483; Package grep. (Sun, 02 Apr 2023 20:25:02 GMT) Full text and rfc822 format available.

Message #44 received at 62483 <at> debbugs.gnu.org (full text, mbox):

From: Carlo Arenas <carenas <at> gmail.com>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 62483 <at> debbugs.gnu.org, dimitry <at> andric.com, arnold <at> skeeve.com,
 koen <at> chalmers.se
Subject: Re: bug#62483: echo a | grep -E -w '((()|a)|())*' # does not terminate
Date: Sun, 2 Apr 2023 13:23:54 -0700
On Sun, Apr 2, 2023 at 11:30 AM Paul Eggert <eggert <at> cs.ucla.edu> wrote:
>
> Also, GNU grep -w passes the following more-complicated regexp to dfaparse:

But AFAIK `-w` is not necessary to trigger it, as the following also
infloops on Fedora Rawhide:

  $ echo a | grep -E '((()|a)|())+'

Interestingly, the loop is broken if any character is added to any of
the `()` branches, which might mean that this is unlikely to
happen in well-formed expressions.

Carlo

PS. -P doesn't loop, and neither does `echo a | grep -E '((a|())|())+'`,
nor `'(()|(a|()))+'`, nor `'(()|(()|a))+'`.
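The variants listed in this message can be checked side by side. This is a sketch under assumptions: `try_patterns` is a hypothetical helper, timeout(1) from GNU coreutils is assumed, and the 1-second cap is arbitrary. The verdicts on any given system depend on the regex implementation grep was built against.

```shell
# Hypothetical helper: try each pattern against the input "a" with a
# 1-second cap and print a verdict per pattern (tab-separated).
try_patterns() {
  for pat in "$@"; do
    echo a | timeout 1 grep -E "$pat" >/dev/null 2>&1
    case $? in
      0)   verdict=matched ;;
      124) verdict=timed-out ;;   # timeout(1) reports an expired limit as 124
      *)   verdict=other ;;
    esac
    printf '%s\t%s\n' "$verdict" "$pat"
  done
}

try_patterns '((()|a)|())+' '((a|())|())+' '(()|(a|()))+' '(()|(()|a))+'
```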




Information forwarded to bug-grep <at> gnu.org:
bug#62483; Package grep. (Sun, 02 Apr 2023 21:31:02 GMT) Full text and rfc822 format available.

Message #47 received at 62483 <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: Carlo Arenas <carenas <at> gmail.com>
Cc: 62483 <at> debbugs.gnu.org, dimitry <at> andric.com, Paul Eggert <eggert <at> cs.ucla.edu>,
 arnold <at> skeeve.com, koen <at> chalmers.se
Subject: Re: bug#62483: echo a | grep -E -w '((()|a)|())*' # does not terminate
Date: Sun, 2 Apr 2023 14:30:37 -0700
On Sun, Apr 2, 2023 at 1:25 PM Carlo Arenas <carenas <at> gmail.com> wrote:
> On Sun, Apr 2, 2023 at 11:30 AM Paul Eggert <eggert <at> cs.ucla.edu> wrote:
> >
> > Also, GNU grep -w passes the following more-complicated regexp to dfaparse:
>
> but AFAIK `-w` is not necessary to trigger it, as the following also
> infloops in Fedora Rawhide
>
>   $ echo a | grep -E '((()|a)|())+'

FYI, this prints its input line (and no infloop) when grep is
configured --with-included-regex, so at least for that one, it may be
due to a recent change in upstream glibc.




Information forwarded to bug-grep <at> gnu.org:
bug#62483; Package grep. (Mon, 03 Apr 2023 14:08:03 GMT) Full text and rfc822 format available.

Message #50 received at 62483 <at> debbugs.gnu.org (full text, mbox):

From: Koen Claessen <koen <at> chalmers.se>
To: Jim Meyering <jim <at> meyering.net>
Cc: 62483 <at> debbugs.gnu.org
Subject: Re: bug#62483: echo a | grep -E -w '((()|a)|())*' # does not terminate
Date: Mon, 3 Apr 2023 14:07:00 +0200
[Message part 1 (text/plain, inline)]
I found it when I was testing various new regular expression algorithms
against grep (which I used as the gold standard for this).

I used a random generator for regular expressions (using the QuickCheck
framework) and then shrinking/delta debugging to automatically find the
smallest failing test case.
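The shrinking step can be illustrated with a much simpler technique than the one described: greedy deletion of one character at a time, rather than QuickCheck's structural shrinking. In this sketch, `shrink` and `contains_ab` are hypothetical names, and the predicate is a stand-in for "grep hangs on this pattern".

```shell
# shrink PRED TEXT: repeatedly delete single characters from TEXT for as
# long as PRED still succeeds on the shorter candidate. Prints the result.
shrink() {
  pred=$1
  cur=$2
  i=1
  while [ "$i" -le "${#cur}" ]; do
    # Candidate: cur with its i-th character removed.
    cand=$(printf '%s\n' "$cur" | sed "s/.//$i")
    if "$pred" "$cand"; then
      cur=$cand          # still "failing": keep it and rescan from the start
      i=1
    else
      i=$((i + 1))
    fi
  done
  printf '%s\n' "$cur"
}

# Stand-in predicate for illustration: succeeds when the text contains "ab".
contains_ab() {
  case $1 in
    *ab*) return 0 ;;
    *)    return 1 ;;
  esac
}

shrink contains_ab 'xxaybzabq'
# prints: ab
```

For a real search, the predicate would run grep on the candidate under a time limit and succeed only on a timeout; candidates that are not valid regular expressions make grep exit with an error, so they are rejected automatically.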

BTW, if you are interested, I could do a larger more targeted effort stress
testing grep like this and possibly find more test cases with unexpected
behavior. I would need some guidance on where to put most effort in order
to be as useful as this can be. I could find a MSc student to help out with
that. Let me know if this sounds like an interesting thing to do!

kind regards,
/Koen

[Message part 2 (text/html, inline)]

Information forwarded to bug-grep <at> gnu.org:
bug#62483; Package grep. (Mon, 03 Apr 2023 15:58:02 GMT) Full text and rfc822 format available.

Message #53 received at 62483 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Koen Claessen <koen <at> chalmers.se>
Cc: 62483 <at> debbugs.gnu.org, Jim Meyering <jim <at> meyering.net>
Subject: Re: bug#62483: echo a | grep -E -w '((()|a)|())*' # does not terminate
Date: Mon, 3 Apr 2023 08:57:19 -0700
On 2023-04-03 05:07, Koen Claessen wrote:

> BTW, if you are interested, I could do a larger more targeted effort stress
> testing grep like this and possibly find more test cases with unexpected
> behavior. I would need some guidance on where to put most effort in order
> to be as useful as this can be. I could find a MSc student to help out with
> that. Let me know if this sounds like an interesting thing to do!

Any help like this would be most welcome.

Unfortunately (or perhaps fortunately for your student, who will learn a 
lot!), none of the current maintainers of the glibc regex code really 
understand it. The code's original author is no longer available to 
answer questions, and the code is tricky as it attempts to implement 
POSIX regular expressions (which are worst-case exponential) efficiently 
in the usual cases.

The main guidance I can give you is to look at the existing bug reports 
against glibc regex[1] and against grep[2], as well as at the grep 
source code itself[3].

[1]: https://sourceware.org/bugzilla/buglist.cgi?component=regex&product=glibc
[2]: https://debbugs.gnu.org/cgi/pkgreport.cgi?package=grep
[3]: https://savannah.gnu.org/projects/grep




This bug report was last modified 1 year and 172 days ago.


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.