GNU bug report logs - #60708
pcre: improve support for linking with a library without unicode

Previous Next

Package: grep;

Reported by: Carlo Arenas <carenas <at> gmail.com>

Date: Tue, 10 Jan 2023 11:14:02 UTC

Severity: normal

To reply to this bug, email your comments to 60708 AT debbugs.gnu.org.

Toggle the display of automated, internal messages from the tracker.

View this report as an mbox folder, status mbox, maintainer mbox


Report forwarded to bug-grep <at> gnu.org:
bug#60708; Package grep. (Tue, 10 Jan 2023 11:14:02 GMT) Full text and rfc822 format available.

Acknowledgement sent to Carlo Arenas <carenas <at> gmail.com>:
New bug report received and forwarded. Copy sent to bug-grep <at> gnu.org. (Tue, 10 Jan 2023 11:14:02 GMT) Full text and rfc822 format available.

Message #5 received at submit <at> debbugs.gnu.org (full text, mbox):

From: Carlo Arenas <carenas <at> gmail.com>
To: bug-grep <at> gnu.org
Subject: pcre: improve support for linking with a library without unicode
Date: Tue, 10 Jan 2023 03:13:17 -0800
[Message part 1 (text/plain, inline)]
Noticed while testing the previous patch, and which resulted in tests
being skipped for the wrong reason.

Carlo
[0001-pcre-only-use-UTF-when-available-in-the-library.patch (application/octet-stream, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#60708; Package grep. (Wed, 11 Jan 2023 05:39:02 GMT) Full text and rfc822 format available.

Message #8 received at 60708 <at> debbugs.gnu.org (full text, mbox):

From: Jim Meyering <jim <at> meyering.net>
To: Carlo Arenas <carenas <at> gmail.com>
Cc: 60708 <at> debbugs.gnu.org
Subject: Re: bug#60708: pcre: improve support for linking with a library
 without unicode
Date: Tue, 10 Jan 2023 21:37:51 -0800
[Message part 1 (text/plain, inline)]
On Tue, Jan 10, 2023 at 3:19 AM Carlo Arenas <carenas <at> gmail.com> wrote:
> Noticed while testing the previous patch, and which resulted in tests
> being skipped for the wrong reason.

Thanks for catching that.
I'll push the following tomorrow.
It has a tiny change that moves the declaration of "unicode" down to
just before where it's set and changes its type to uint32_t.
[pcre-no-unicode.diff (application/octet-stream, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#60708; Package grep. (Wed, 11 Jan 2023 21:14:02 GMT) Full text and rfc822 format available.

Message #11 received at 60708 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Jim Meyering <jim <at> meyering.net>, Carlo Arenas <carenas <at> gmail.com>
Cc: 60708 <at> debbugs.gnu.org
Subject: Re: bug#60708: pcre: improve support for linking with a library
 without unicode
Date: Wed, 11 Jan 2023 13:13:19 -0800
On 2023-01-10 21:37, Jim Meyering wrote:

> +  uint32_t unicode = 1;
> +  pcre2_config (PCRE2_CONFIG_UNICODE, &unicode);
> +  if (unicode && localeinfo.multibyte)

Shouldn't that be:

  uint32_t unicode;
  if (localeinfo.multibyte
      && 0 <= pcre2_config (PCRE2_CONFIG_UNICODE, &unicode)
      && unicode)

That is, don't bother to call pcre2_config in a unibyte locale, and 
don't initialize 'unicode' (so that valgrind-like tools can detect an 
error if pcre2_config is buggy), and check the return value of pcre2_config.




Information forwarded to bug-grep <at> gnu.org:
bug#60708; Package grep. (Wed, 11 Jan 2023 22:13:01 GMT) Full text and rfc822 format available.

Message #14 received at 60708 <at> debbugs.gnu.org (full text, mbox):

From: Carlo Arenas <carenas <at> gmail.com>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: Jim Meyering <jim <at> meyering.net>, 60708 <at> debbugs.gnu.org
Subject: Re: bug#60708: pcre: improve support for linking with a library
 without unicode
Date: Wed, 11 Jan 2023 14:12:19 -0800
pcre2_config does a static check (defined at compile time) and
therefore is unlikely to fail and might be even under the right
circumstances optimized out.

you are correct that setting the original value was meant to protect
from that function failing and will ensure the original path was still
being taken (which I thought was safer), while your suggested change
will take the opposite one (not setting UTF in a multibyte locale,
which will fail in different ways).

either way, IMHO, considering that most PCRE libraries have utf
enabled (the default) the original code was the less likely to
introduce any changing behaviour or even code changes (because of the
expected optimization), but agree might have been too clever without a
corresponding explanation.

Carlo




Information forwarded to bug-grep <at> gnu.org:
bug#60708; Package grep. (Thu, 12 Jan 2023 02:30:02 GMT) Full text and rfc822 format available.

Message #17 received at 60708 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Carlo Arenas <carenas <at> gmail.com>
Cc: Jim Meyering <jim <at> meyering.net>, 60708 <at> debbugs.gnu.org
Subject: Re: bug#60708: pcre: improve support for linking with a library
 without unicode
Date: Wed, 11 Jan 2023 18:29:26 -0800
On 1/11/23 14:12, Carlo Arenas wrote:
> pcre2_config does a static check (defined at compile time) and
> therefore is unlikely to fail and might be even under the right
> circumstances optimized out.

Not sure what is meant by "static check" here. The call won't be 
optimized out unless you compile with -flto or equivalent, and have the 
source code to pcre2 as well as the source code to grep. And in that 
case the two forms should generate equivalent code (no insns needed).

> you are correct that setting the original value was meant to protect
> from that function failing and will ensure the original path was still
> being taken (which I thought was safer), while your suggested change
> will take the opposite one (not setting UTF in a multibyte locale,
> which will fail in different ways).

Oh, I think see your point, but doesn't this mean that even my code was 
too trusting? It should be something like this:

  if (localeinfo.multibyte)
    {
      uint32_t unicode;
      if (! (localeinfo.using_utf8
             && 0 <= pcre2_config (PCRE2_CONFIG_UNICODE, &unicode)
	     && unicode))
        die (EXIT_TROUBLE, 0, _("-P supports only unibyte and UTF-8 
locales"));
      ...

That is, we're better off diagnosing the problem and not attempting to 
use pcre2 if the result will be wrong (or even result in undefined 
behavior). The problem is unlikely to occur so it's good to be 
conservative here.




Information forwarded to bug-grep <at> gnu.org:
bug#60708; Package grep. (Thu, 12 Jan 2023 04:05:01 GMT) Full text and rfc822 format available.

Message #20 received at 60708 <at> debbugs.gnu.org (full text, mbox):

From: Carlo Arenas <carenas <at> gmail.com>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: Jim Meyering <jim <at> meyering.net>, 60708 <at> debbugs.gnu.org
Subject: Re: bug#60708: pcre: improve support for linking with a library
 without unicode
Date: Wed, 11 Jan 2023 20:03:43 -0800
On Wed, Jan 11, 2023 at 6:29 PM Paul Eggert <eggert <at> cs.ucla.edu> wrote:
>
> Oh, I think see your point, but doesn't this mean that even my code was
> too trusting? It should be something like this:
>
>    if (localeinfo.multibyte)
>      {
>        uint32_t unicode;
>        if (! (localeinfo.using_utf8
>               && 0 <= pcre2_config (PCRE2_CONFIG_UNICODE, &unicode)
>              && unicode))
>          die (EXIT_TROUBLE, 0, _("-P supports only unibyte and UTF-8
> locales"));
>        ...
>
> That is, we're better off diagnosing the problem and not attempting to
> use pcre2 if the result will be wrong (or even result in undefined
> behavior). The problem is unlikely to occur so it's good to be
> conservative here.

Maybe we are not clear on what the "problem" is.

The issue the original code was trying to avoid was to set PCRE_UTF if
the library doesn't have Unicode support, as that would block grep
with a PCRE error (as shown in the commit message), and which also
disabled some tests as it couldn't be differentiated with a failure in
grep because -P wasn't supported.  Your suggested code doesn't address
that, it merely changes the error message with one that would be IMHO
even less clear and worsens the problem.

Using a non Unicode PCRE library is perfectly fine, and there is no
"undefined behavior" risk, and indeed `grep -P` without the UTF flag
is exactly what the alternate path uses and what is recommended for
speed, so?

Carlo




Information forwarded to bug-grep <at> gnu.org:
bug#60708; Package grep. (Fri, 13 Jan 2023 03:40:01 GMT) Full text and rfc822 format available.

Message #23 received at 60708 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Carlo Arenas <carenas <at> gmail.com>
Cc: Jim Meyering <jim <at> meyering.net>, 60708 <at> debbugs.gnu.org
Subject: Re: bug#60708: pcre: improve support for linking with a library
 without unicode
Date: Thu, 12 Jan 2023 19:38:52 -0800
[Message part 1 (text/plain, inline)]
On 1/11/23 20:03, Carlo Arenas wrote:
> Your suggested code doesn't address
> that, it merely changes the error message with one that would be IMHO
> even less clear and worsens the problem.

In that case let's improve the error message wording; something like the 
attached patch, say.


> Using a non Unicode PCRE library is perfectly fine, and there is no
> "undefined behavior" risk, and indeed `grep -P` without the UTF flag
> is exactly what the alternate path uses and what is recommended for
> speed, so?

It's not a question of undefined behavior. It's a question of whether 
grep does what the user requested. Without the attached patch, in a 
UTF-8 locale "grep -P '[[:alpha:]]'" won't report matching alphabetic 
characters, if they're multibyte. Silent misbehavior is quite bad, and 
it's better for grep to issue a diagnostic and exit than to silently do 
the wrong thing.
[0001-grep-diagnose-no-UTF-8-support-Bug-60708.patch (text/x-patch, attachment)]

Information forwarded to bug-grep <at> gnu.org:
bug#60708; Package grep. (Fri, 13 Jan 2023 05:55:02 GMT) Full text and rfc822 format available.

Message #26 received at 60708 <at> debbugs.gnu.org (full text, mbox):

From: Carlo Arenas <carenas <at> gmail.com>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: Jim Meyering <jim <at> meyering.net>, 60708 <at> debbugs.gnu.org
Subject: Re: bug#60708: pcre: improve support for linking with a library
 without unicode
Date: Thu, 12 Jan 2023 21:54:36 -0800
On Thu, Jan 12, 2023 at 7:38 PM Paul Eggert <eggert <at> cs.ucla.edu> wrote:
>
> Without the attached patch, in a
> UTF-8 locale "grep -P '[[:alpha:]]'" won't report matching alphabetic
> characters, if they're multibyte. Silent misbehavior is quite bad, and
> it's better for grep to issue a diagnostic and exit than to silently do
> the wrong thing.

Fair enough, this will likely need a new test though, and of course
changes to the current ones as well.

Right now they will report that grep doesn't have '-P' support instead
of reporting that unicode is missing, and take into consideration
those tests that set multibyte locale were successful after my change,
so they will also need changes as they would misbehave silently
otherwise.

Carlo




Information forwarded to bug-grep <at> gnu.org:
bug#60708; Package grep. (Fri, 13 Jan 2023 07:38:01 GMT) Full text and rfc822 format available.

Message #29 received at 60708 <at> debbugs.gnu.org (full text, mbox):

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Carlo Arenas <carenas <at> gmail.com>
Cc: Jim Meyering <jim <at> meyering.net>, 60708 <at> debbugs.gnu.org
Subject: Re: bug#60708: pcre: improve support for linking with a library
 without unicode
Date: Thu, 12 Jan 2023 23:37:47 -0800
[Message part 1 (text/plain, inline)]
On 2023-01-12 21:54, Carlo Arenas wrote:
> On Thu, Jan 12, 2023 at 7:38 PM Paul Eggert <eggert <at> cs.ucla.edu> wrote:

> Fair enough, this will likely need a new test though, and of course
> changes to the current ones as well.
> 
> Right now they will report that grep doesn't have '-P' support instead
> of reporting that unicode is missing, and take into consideration
> those tests that set multibyte locale were successful after my change,
> so they will also need changes as they would misbehave silently
> otherwise.

OK, I installed it with the attached additional patch, which attempts to 
address these issues. The new test is done by init.cfg each time you run 
require_pcre_ in a UTF-8 locale. The current tests are changed by 
altering require_pcre_ to do the right thing in the en_US.UTF-8 locale 
(I also tightened up the other locales' testing slightly). The new 
require_pcre_ uses a different diagnostic when checking en_US.UTF-8 
locale. Since all the PCRE tests are done in either the C or the 
en_US.UTF-8 locale, this should be good enough.
[0001-tests-better-diagnostic-for-P-sans-Unicode.patch (text/x-patch, attachment)]

This bug report was last modified 1 year and 252 days ago.

Previous Next


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.