GNU bug report logs - #64277
[feature request] handle files encoded in utf-16le

Package: grep

Reported by: Jeremy Hetzler <jeremyhetzler <at> gmail.com>

Date: Sat, 24 Jun 2023 21:24:02 UTC

Severity: wishlist

To reply to this bug, email your comments to 64277 AT debbugs.gnu.org.


Report forwarded to bug-grep <at> gnu.org:
bug#64277; Package grep. (Sat, 24 Jun 2023 21:24:02 GMT)

Acknowledgement sent to Jeremy Hetzler <jeremyhetzler <at> gmail.com>:
New bug report received and forwarded. Copy sent to bug-grep <at> gnu.org. (Sat, 24 Jun 2023 21:24:02 GMT)

Message #5 received at submit <at> debbugs.gnu.org:

From: Jeremy Hetzler <jeremyhetzler <at> gmail.com>
To: bug-grep <at> gnu.org
Subject: [feature request] handle files encoded in utf-16le
Date: Sat, 24 Jun 2023 17:23:05 -0400
Maintainers,

I recently was confused as to why GNU grep did not find any matches in
certain files, when vim clearly showed that the search string was present.

Turns out the files (log files from a Windows application) are encoded in
UTF16-LE.

$ file '06-21-2023 03-22-46'
06-21-2023 03-22-46: Unicode text, UTF-16, little-endian text, with CRLF line terminators

$ /bin/od -Ad -w16 -t cz '06-21-2023 03-22-46' | head -10
0000000 377 376   [  \0   H  \0   E  \0   A  \0   D  \0   E  \0   R  \0  >..[.H.E.A.D.E.R.<
0000016   :  \0   ]  \0  \r  \0  \n  \0   [  \0   I  \0   D  \0   r  \0  >:.].....[.I.D.r.<
0000032   i  \0   v  \0   e  \0      \0   v  \0   e  \0   r  \0   s  \0  >i.v.e. .v.e.r.s.<
0000048   i  \0   o  \0   n  \0   :  \0      \0   6  \0   .  \0   7  \0  >i.o.n.:. .6...7.<
0000064   .  \0   4  \0   .  \0   4  \0   6  \0      \0   R  \0   e  \0  >..4...4.6. .R.e.<
0000080   l  \0   e  \0   a  \0   s  \0   e  \0      \0   D  \0   a  \0  >l.e.a.s.e. .D.a.<
0000096   t  \0   e  \0   :  \0      \0   0  \0   6  \0   /  \0   1  \0  >t.e.:. .0.6./.1.<
0000112   6  \0   /  \0   2  \0   0  \0   2  \0   3  \0   ]  \0  \r  \0  >6./.2.0.2.3.]...<
0000128  \n  \0   [  \0   I  \0   n  \0   t  \0   e  \0   r  \0   a  \0  >..[.I.n.t.e.r.a.<
0000144   c  \0   t  \0   i  \0   v  \0   e  \0      \0   B  \0   a  \0  >c.t.i.v.e. .B.a.<

$ grep --version
grep (GNU grep) 3.11
Packaged by Cygwin (3.11-1)
Copyright (C) 2023 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <https://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

There is no easy way to use grep to search these files, even if one knows
the encoding in advance.

I would like to request a feature to be added to grep which would enable it
to transparently decode UTF16-LE files so they can be conveniently searched.

Thanks,
Jeremy Hetzler
(he/him)

Information forwarded to bug-grep <at> gnu.org:
bug#64277; Package grep. (Sat, 24 Jun 2023 22:18:01 GMT)

Message #8 received at 64277 <at> debbugs.gnu.org:

From: Paul Eggert <eggert <at> cs.ucla.edu>
To: Jeremy Hetzler <jeremyhetzler <at> gmail.com>
Cc: 64277 <at> debbugs.gnu.org
Subject: Re: bug#64277: [feature request] handle files encoded in utf-16le
Date: Sat, 24 Jun 2023 15:17:34 -0700
On 2023-06-24 14:23, Jeremy Hetzler wrote:
> I would like to request a feature to be added to grep which would enable it
> to transparently decode UTF16-LE files so they can be conveniently searched.

Not sure it's worth the effort as this format is not that common for GNU 
grep, it'd be a pain to add proper support for it, and anyway 16-bit 
encodings have been problematic ever since Unicode crossed the 16-bit 
Rubicon. I'm not saying we'd reject a patch if someone wrote it, but I'd 
say it should be low priority.
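
As a small illustration of the 16-bit limit mentioned above: code points beyond U+FFFF need two 16-bit units (a surrogate pair) in UTF-16, so a character is not always two bytes. The sketch below assumes an iconv that knows UTF-16LE; utf16le_hex is a hypothetical helper name used only for this example.

```shell
# utf16le_hex: hypothetical helper, for illustration only.
# Prints the UTF-16LE bytes of the UTF-8 string in $1 as hex digits.
utf16le_hex() {
    printf '%s' "$1" | iconv -f UTF-8 -t UTF-16LE | od -An -tx1 | tr -d ' \n'
    echo
}

# U+0041 'A' fits in one 16-bit unit.
utf16le_hex 'A'                               # prints 4100

# U+1F600 (UTF-8 bytes F0 9F 98 80, written in octal for portability)
# needs a surrogate pair: D83D followed by DE00, each stored little-endian.
utf16le_hex "$(printf '\360\237\230\200')"    # prints 3dd800de
```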




Information forwarded to bug-grep <at> gnu.org:
bug#64277; Package grep. (Sun, 25 Jun 2023 16:55:02 GMT)

Message #11 received at 64277 <at> debbugs.gnu.org:

From: Jeremy Hetzler <jeremyhetzler <at> gmail.com>
To: Paul Eggert <eggert <at> cs.ucla.edu>
Cc: 64277 <at> debbugs.gnu.org
Subject: Re: bug#64277: [feature request] handle files encoded in utf-16le
Date: Sun, 25 Jun 2023 12:53:31 -0400
Thanks Paul.

I understand. Appreciate the quick response.

Yours,
Jeremy

On Sat, Jun 24, 2023 at 6:17 PM Paul Eggert <eggert <at> cs.ucla.edu> wrote:

> On 2023-06-24 14:23, Jeremy Hetzler wrote:
> > I would like to request a feature to be added to grep which would enable it
> > to transparently decode UTF16-LE files so they can be conveniently searched.
>
> Not sure it's worth the effort as this format is not that common for GNU
> grep, it'd be a pain to add proper support for it, and anyway 16-bit
> encodings have been problematic ever since Unicode crossed the 16-bit
> Rubicon. I'm not saying we'd reject a patch if someone wrote it, but I'd
> say it should be low priority.
>


-- 
Thanks,
Jeremy Hetzler
(he/him)
203-887-5398

Information forwarded to bug-grep <at> gnu.org:
bug#64277; Package grep. (Mon, 26 Jun 2023 08:04:02 GMT)

Message #14 received at 64277 <at> debbugs.gnu.org:

From: lacsaP Patatetom <patatetom <at> gmail.com>
To: Jeremy Hetzler <jeremyhetzler <at> gmail.com>
Cc: Paul Eggert <eggert <at> cs.ucla.edu>, 64277 <at> debbugs.gnu.org
Subject: Re: bug#64277: [feature request] handle files encoded in utf-16le
Date: Mon, 26 Jun 2023 10:03:27 +0200
On Sun, Jun 25, 2023 at 18:55, Jeremy Hetzler <jeremyhetzler <at> gmail.com> wrote:

> Thanks Paul.
>
> I understand. Appreciate the quick response.
>
> Yours,
> Jeremy
>
> On Sat, Jun 24, 2023 at 6:17 PM Paul Eggert <eggert <at> cs.ucla.edu> wrote:
>
> > On 2023-06-24 14:23, Jeremy Hetzler wrote:
> > > I would like to request a feature to be added to grep which would enable it
> > > to transparently decode UTF16-LE files so they can be conveniently searched.
> >
> > Not sure it's worth the effort as this format is not that common for GNU
> > grep, it'd be a pain to add proper support for it, and anyway 16-bit
> > encodings have been problematic ever since Unicode crossed the 16-bit
> > Rubicon. I'm not saying we'd reject a patch if someone wrote it, but I'd
> > say it should be low priority.
> >
>
>
> --
> Thanks,
> Jeremy Hetzler
> (he/him)
> 203-887-5398
>

iconv -f utf-16le  '06-21-2023 03-22-46' | grep regexp
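
A sketch building on the one-liner above, under a few assumptions. The helper name u16grep is hypothetical (it is not a grep feature), and the file is assumed to start with a byte-order mark, as the od dump in the report shows (377 376). With -f UTF-16, iconv implementations such as glibc's use the BOM to pick the byte order and consume it, whereas -f utf-16le would be expected to pass it through as U+FEFF at the start of the first line, which can defeat a '^' anchor. The tr step drops the CR of each CRLF so that '$' anchors behave as expected.

```shell
# u16grep: hypothetical helper, not part of GNU grep.
# Usage: u16grep PATTERN FILE
u16grep() {
    pattern=$1
    file=$2
    # Decode UTF-16 (byte order taken from the BOM) to UTF-8,
    # strip carriage returns, then search the result with grep.
    iconv -f UTF-16 -t UTF-8 < "$file" | tr -d '\r' | grep -e "$pattern"
}
```

For example, u16grep 'Release Date' '06-21-2023 03-22-46' would search the log file from the report. Reported line numbers and byte offsets then refer to the converted text, not to the original file.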

This bug report was last modified 1 year and 88 days ago.


GNU bug tracking system
Copyright (C) 1999 Darren O. Benham, 1997,2003 nCipher Corporation Ltd, 1994-97 Ian Jackson.