Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Filenames with newlines aren't handled properly #695

Closed
systwi-again opened this issue Mar 24, 2022 · 9 comments
Closed

Filenames with newlines aren't handled properly #695

systwi-again opened this issue Mar 24, 2022 · 9 comments

Comments

@systwi-again
Copy link

systwi-again commented Mar 24, 2022

When I use md5sum (version 9.0 from GNU coreutils), I can properly calculate and verify hashes of files containing 0x0A in their filenames:

$ md5sum -- 'foo
bar' |tee md5.txt
\6d573d7359721f7ee1274e75fc4fd21a  foo\nbar
$ md5sum -cw --strict -- md5.txt
\foo\nbar: OK

Unfortunately, this intended behaviour cannot be replicated with xxhsum (version 0.8.1):

$ xxhsum -- 'foo
bar' | tee xxh64.txt
5a651732ba4018ed  foo
bar
$ xxhsum -c --strict --warn xxh64.txt
xxh64.txt:1: Could not open or read 'foo': No such file or directory.
xxh64.txt:2: Error: Improperly formatted checksum line.
1 line is improperly formatted
1 listed file could not be read

Filenames containing carriage returns (0x0D) act bizarrely when quoted (finished with tab complete):

$ xxhsum -- 'foo
bar' | tee xxh64.txt
Error: Could not open 'foo
bar': No such file or directory.

as well as when unquoted:

$ xxhsum -- foo^Mbar # my terminal displays '0x0D' as '^M'
bar13e00c0d7ef4c  foo
$ xxhsum -c --strict --warn xxh64.txt
bar: OK

Notice the 'bar' overlapped part of the hash in my terminal and makes it look like a file named 'bar' has a matching hash. GNU coreutils' md5sum uses \r instead of the raw character (excluding its -z flag).

It seems that the offending character(s) are not backslash-escaped and therefore causes confusion with xxhsum when parsing the list back.

Including/omitting --tag results in the same behaviour.

Tested using Arch GNU/Linux.

@t-mat
Copy link
Contributor

t-mat commented Mar 24, 2022

Hi @systwi-again, thanks for the report.

I've confirmed your issue. So far, I put terminal agnostic version of your investigation for contributors and future version of our test.

$ # clone

$ cd
$ mkdir xxhsum-issue-695
$ cd xxhsum-issue-695
$ git clone https://github.com/Cyan4973/xxHash.git
$ cd xxHash
$ make
$ mkdir issue-695
$ cd issue-695
$ # create specimen
$ echo issue-695-test-string > $'foo\nbar'
$ # md5sum

$ md5sum -- $'foo\nbar' | tee md5.txt
\c7300c3387ac5a6d439c41122d0e4b43  foo\nbar
$ cat md5.txt
\c7300c3387ac5a6d439c41122d0e4b43  foo\nbar
$ md5sum -cw --strict -- md5.txt
\foo\nbar: OK

$ hexdump -C md5.txt
00000000  5c 63 37 33 30 30 63 33  33 38 37 61 63 35 61 36  |\c7300c3387ac5a6|
00000010  64 34 33 39 63 34 31 31  32 32 64 30 65 34 62 34  |d439c41122d0e4b4|
00000020  33 20 20 66 6f 6f 5c 6e  62 61 72 0a              |3  foo\nbar.|
0000002c
$ # xxhsum

$ ../xxhsum $'foo\nbar' | tee xxh64.txt
901a65c004253f1f  foo
bar
$ cat xxh64.txt
901a65c004253f1f  foo
bar
$ ../xxhsum -c xxh64.txt
xxh64.txt:1: Could not open or read 'foo': No such file or directory.
1 line is improperly formatted
1 listed file could not be read

$ hexdump -C xxh64.txt
00000000  39 30 31 61 36 35 63 30  30 34 32 35 33 66 31 66  |901a65c004253f1f|
00000010  20 20 66 6f 6f 0a 62 61  72 0a                    |  foo.bar.|
0000001a

@t-mat
Copy link
Contributor

t-mat commented Mar 24, 2022

How md5sum (digest.c) handles special characters in the filename?

filename_unescape() unescapes the following characters

  • \n (0x0a, LF)
  • \r (0x0d, CR)
  • \\ (0x5c, back slash)

It doesn't accept other \ escaped characters such as \0 or \x7f.
It also doesn't accept NUL (0x00) as a part of filename.

filename_unescape()
https://git.savannah.gnu.org/cgit/coreutils.git/tree/src/digest.c?id=04d735ea9637b213ef7fc327a635decbbf20152e#n547

How md5sum (digest.c) put \ at the beginning of the line?

If filename contains escaped character, digest_check() put \ at the beginning of the line.

https://git.savannah.gnu.org/cgit/coreutils.git/tree/src/digest.c?id=04d735ea9637b213ef7fc327a635decbbf20152e#n1130
https://git.savannah.gnu.org/cgit/coreutils.git/tree/src/digest.c?id=04d735ea9637b213ef7fc327a635decbbf20152e#n1165


The following functions also recognize escaped filename.

split_3() , escaped_filename
https://git.savannah.gnu.org/cgit/coreutils.git/tree/src/digest.c?id=04d735ea9637b213ef7fc327a635decbbf20152e#n716

print_filename() , escape
https://git.savannah.gnu.org/cgit/coreutils.git/tree/src/digest.c?id=04d735ea9637b213ef7fc327a635decbbf20152e#n858

output_file() , needs_escape
https://git.savannah.gnu.org/cgit/coreutils.git/tree/src/digest.c?id=04d735ea9637b213ef7fc327a635decbbf20152e#n970

t-mat added a commit to t-mat/xxHash that referenced this issue Mar 31, 2022
@t-mat
Copy link
Contributor

t-mat commented Mar 31, 2022

I've made experimental branch for this issue - https://github.com/t-mat/xxHash/tree/issue-695

You can check it by the following commands

git clone -b issue-695 https://github.com/t-mat/xxHash xxHash-isshe-695
cd xxHash-isshe-695
make
$ mkdir issue-695
$ cd issue-695/
$ echo issue-965-test-string > $'foo\nbar'
$ ../xxhsum $'foo\nbar' | tee xxh64.txt
\dcf121a176784e70  foo\nbar
$ cat xxh64.txt
\dcf121a176784e70  foo\nbar
$ ../xxhsum -c xxh64.txt
\foo\nbar: OK
$ hexdump -C xxh64.txt
00000000  5c 64 63 66 31 32 31 61  31 37 36 37 38 34 65 37  |\dcf121a176784e7|
00000010  30 20 20 66 6f 6f 5c 6e  62 61 72 0a              |0  foo\nbar.|
0000001c

But I have 2 questions

(1) Should we really support \n, \r in the filename / path?

I'm really sorry. And this is 100% my personal taste.
I think we should refuse these filename / path. E.g. just display Error: Failed to open the file.
Because practically we don't gain any benefit from these filenames. I think they just spread confusion and inconvenience.

But since we've been stating and trying to mimic md5sum, we should accept it though.

(2) Should we introduce md5sum style convention to BSD style line?

Since I'm no-brainer, I put same rule to XSUM_printLine_BSD_internal(). But I have absolutely no knowledge about it. I honestly need some help.

@Cyan4973
Copy link
Owner

(1) Should we really support \n, \r in the filename / path?

Good question,
I've got essentially the same state of mind :
I would prefer to simplify the situation, even at the cost of forbidding these cases,
but if the goal is to mimic md5sum, and md5sum supports these characters,
then it would be a change of objective.

Hence I presume it's better to support them, though this is weak preference.
If the code becomes massively complex because of that, we might want to revisit that.

(2) Should we introduce md5sum style convention to BSD style line?

Same idea, it depends on md5sum --tag support.
My understanding is that it supports the same rules (regarding "complex" characters) as the regular md5sum output format.

@t-mat
Copy link
Contributor

t-mat commented Apr 2, 2022

(1) Should we really support \n, \r in the filename / path?
Hence I presume it's better to support them, though this is weak preference.
If the code becomes massively complex because of that, we might want to revisit that.

Sure!

--tag support

Same idea, it depends on md5sum --tag support.
My understanding is that it supports the same rules (regarding "complex" characters) as the regular md5sum output format.

You're right. I've checked md5sum --tag. And your assumption is right.

$ md5sum --tag -- $'foo\nbar' | tee md5-tag.txt
\MD5 (foo\nbar) = c7300c3387ac5a6d439c41122d0e4b43
$ cat md5-tag.txt
\MD5 (foo\nbar) = c7300c3387ac5a6d439c41122d0e4b43

$ md5sum -cw --strict -- md5-tag.txt
\foo\nbar: OK

$ hexdump -C md5-tag.txt
00000000  5c 4d 44 35 20 28 66 6f  6f 5c 6e 62 61 72 29 20  |\MD5 (foo\nbar) |
00000010  3d 20 63 37 33 30 30 63  33 33 38 37 61 63 35 61  |= c7300c3387ac5a|
00000020  36 64 34 33 39 63 34 31  31 32 32 64 30 65 34 62  |6d439c41122d0e4b|
00000030  34 33 0a                                          |43.|
00000033

I also checked xxhsum --tag in my branch. Since it behaves similar, seems okay.

$ ../xxhsum --tag $'foo\nbar' | tee xxh64-tag.txt
\XXH64 (foo\nbar) = dcf121a176784e70
$ cat xxh64-tag.txt
\XXH64 (foo\nbar) = dcf121a176784e70

$ ../xxhsum -c xxh64-tag.txt
\foo\nbar: OK

$ hexdump -C xxh64-tag.txt
00000000  5c 58 58 48 36 34 20 28  66 6f 6f 5c 6e 62 61 72  |\XXH64 (foo\nbar|
00000010  29 20 3d 20 64 63 66 31  32 31 61 31 37 36 37 38  |) = dcf121a17678|
00000020  34 65 37 30 0a                                    |4e70.|
00000025

--zero

It seems filename escaping has relation with --zero. Their web (info) manual of md5sum has the following note.

Without --zero, if file contains a backslash, newline, or carriage return, the line is started with a backslash, and each problematic character in the file name is escaped with a backslash, making the output unambiguous even in the presence of arbitrary file names.

-z
--zero
Output a zero byte (ASCII NUL) at the end of each line, rather than a newline. This option enables other programs to parse the output even when that output would contain data with embedded newlines. Also file name escaping is not used.

So, should we also introduce this feature?

Personally, I'd like to leave it for now. And wait for actual application program which take a benefit from this feature.

@Cyan4973
Copy link
Owner

Cyan4973 commented Apr 2, 2022

Personally, I'd like to leave it for now. And wait for actual application program which take a benefit from this feature.

Agreed

@t-mat
Copy link
Contributor

t-mat commented Apr 3, 2022

I've added test script to my branch.

make test-filename-escape invokes the test script.

git clone -b issue-695 https://github.com/t-mat/xxHash xxHash-isshe-695
cd xxHash-isshe-695
make
make test-filename-escape

The test script filename-escape.sh invokes the following commands and report there's any error or not.

./xxhsum $'filename-escape-foo\nbar'
./xxhsum --tag $'filename-escape-foo\nbar'

t-mat added a commit to t-mat/xxHash that referenced this issue Apr 4, 2022
For details, see issue Cyan4973#695.

If filename contains special character (\n, 0x0a, LF),

- Put '\' (0x5c) at the beginning of the line.
- Escape special character by '\'.
@t-mat t-mat mentioned this issue Apr 4, 2022
Cyan4973 added a commit that referenced this issue Apr 4, 2022
t-mat added a commit to t-mat/xxHash that referenced this issue Apr 5, 2022
@t-mat
Copy link
Contributor

t-mat commented Apr 6, 2022

Hi @systwi-again, since we've merged the fix of this issue, could you check fresh dev branch?
When you'll confirm the resolution. we'd like to close this issue.

@systwi-again
Copy link
Author

systwi-again commented Apr 9, 2022

Thank you @t-mat and @Cyan4973 for looking into this issue. I sincerely apologize for the delay in replying.

The fixes in the dev branch work perfectly under my Arch GNU/Linux install. Thanks again! :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants