Question

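I want to fix a bad encoding in thousands of file names. The error is always the same: the unknown character should be replaced with the French é.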

$ find . -type f | grep 127427
./documents/1778_commande_127427_accus�_de_r�ception.pdf

$ find . -type f | grep 127427 | hexdump -C
00000000  2e 2f 64 6f 63 75 6d 65  6e 74 73 2f 31 37 37 38  |./documents/1778|
00000010  5f 63 6f 6d 6d 61 6e 64  65 5f 31 32 37 34 32 37  |_commande_127427|
00000020  5f 61 63 63 75 73 ef bf  bd 5f 64 65 5f 72 ef bf  |_accus..._de_r..|
00000030  bd 63 65 70 74 69 6f 6e  2e 70 64 66 0a           |.ception.pdf.|
0000003d

So I am looking for ef bf bd, which does not look like a Unicode character. Unfortunately, searching for 0xef does not work:

$ find . -type f | grep -P '\xef'
(nothing)

Any clue?

As a next step, I plan to do something like this:

$ find . -type f | grep <magic-here> | xargs -n1 -I{} sh -c 'mv "{}" $(echo "{}" | sed s/<magic-here>/é/) '

Answer 1

Like this:

echo $'\x2e\x2f\x64\x6f\x63\x75\x6d\x65\x6e\x74\x73\x2f\x31\x37\x37\x38\x5f\x63\x6f\x6d\x6d\x61\x6e\x64\x65\x5f\x31\x32\x37\x34\x32\x37\x5f\x61\x63\x63\x75\x73\xef\xbf\xbd\x5f\x64\x65\x5f\x72\xef\xbf\xbd\x63\x65\x70\x74\x69\x6f\x6e\x2e\x70\x64\x66\x0a'\
| grep -Fa $'\xef\xbf\xbd'

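-a treats binary input as text. -F does a fixed-string search instead of using a regular expression. $'...' is ANSI-C quoting, which lets the shell expand \xHH escapes into raw bytes.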
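For context, ef bf bd is the UTF-8 encoding of U+FFFD, the Unicode replacement character. In a UTF-8 locale, grep -P most likely reads \xef as the code point U+00EF rather than as a single byte, which may be why the search in the question came back empty. The sketch below applies the same fixed-string search to a list of file names. It builds the three bytes with printf so that it does not depend on $'...' support, and it forces byte-wise matching with LC_ALL=C. The temporary directory and the file names are made up for the demonstration.

```shell
# Hypothetical demo tree: one damaged name and one clean name.
demo_dir=$(mktemp -d)
bad=$(printf '\357\277\275')   # octal for the bytes EF BF BD (U+FFFD)
touch "$demo_dir/accus${bad}_de_r${bad}ception.pdf"
touch "$demo_dir/clean_name.pdf"

# LC_ALL=C makes grep compare raw bytes; -F takes the pattern literally;
# -a keeps grep from treating the input as binary.
find "$demo_dir" -type f | LC_ALL=C grep -aF "$bad"

# Same search, but only counting the matching names.
match_count=$(find "$demo_dir" -type f | LC_ALL=C grep -acF "$bad")
echo "damaged names: $match_count"
```

Only the damaged name should be listed; the clean one is skipped.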

The find command should look like this:

find ... -exec sed $'s/\xef\xbf\xbd/é/g' {} +

Once you are sure it works, add -i, which changes the files in place:

find ... -exec sed -i $'s/\xef\xbf\xbd/é/g' {} +
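Note that sed invoked this way rewrites the contents of each file, whereas the question is about file names. A minimal sketch of a rename follows, assuming the replacement should be é encoded as UTF-8 (bytes c3 a9) and that only the last path component needs fixing. The demo directory, the variable names, and the inline sh script are illustrative and not part of the original answer.

```shell
# Hypothetical demo tree; point find at the real top-level directory instead.
demo_dir=$(mktemp -d)
bad=$(printf '\357\277\275')   # bytes EF BF BD (U+FFFD)
good=$(printf '\303\251')      # bytes C3 A9 (é in UTF-8)
mkdir "$demo_dir/documents"
touch "$demo_dir/documents/1778_commande_127427_accus${bad}_de_r${bad}ception.pdf"

# Pass each matching path to a small sh script. Only the file name part is
# rewritten, so directory components are left untouched.
find "$demo_dir" -type f -name "*${bad}*" -exec sh -c '
  bad=$1; good=$2; shift 2
  for path in "$@"; do
    dir=${path%/*}
    name=${path##*/}
    new=$(printf "%s\n" "$name" | LC_ALL=C sed "s/$bad/$good/g")
    mv -i -- "$path" "$dir/$new"   # swap mv for echo to do a dry run first
  done
' sh "$bad" "$good" {} +

ls "$demo_dir/documents"
```

Passing the paths as arguments instead of pasting {} into the script text keeps names with spaces or quotes intact. Names containing a newline are not handled by this sketch.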
