Question

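I have a bash script that finds phone numbers in the .htm or .html files of a directory (descending recursively if I need it to). It looks for phone numbers of the form (ddd)ddd-dddd or ddd-ddd-dddd, where d stands for a digit.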

Here is my code:

find ./ -maxdepth 1 -regex ".*\(html\|htm\)$" | xargs grep '\(([0-9]\{3\})\|[0-9]\{3\}\)[-]\?[0-9]\{3\}-[0-9]\{4\}'

The output is:

./dash_only_phone.htm:800-555-1212</p>
./paren_phone.htm:(800)555-1212</p>

I would like to know how to change the grep command so that the trailing HTML </p> tag is not included in the printed output.

Thanks,

Answer 1

If your grep supports Perl-compatible regular expressions (the -P option), as GNU grep and OS X grep do:

grep -Po '(\([0-9]{3}\)|[0-9]{3})-?[0-9]{3}-[0-9]{4}(?=</p>)'

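Note the change in escaping: groups, alternation, and repetition counts are no longer backslash-escaped, while literal parentheses now are (similar or identical to grep -E). The -o option prints only the matched text, and the lookahead (?=</p>) requires the closing tag to follow without making it part of the match.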
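To illustrate the escaping note above, here is a minimal sketch using grep -E instead of -P, for a grep that lacks PCRE support. The function name extract_phones is hypothetical, and the sketch assumes a grep that accepts -o (GNU and BSD grep both do). Unlike the lookahead version, it does not require </p> to follow the number.

```shell
# Hypothetical helper: print only the phone numbers found on stdin.
# With -E, ( ) | { } are operators; literal parentheses are written \( \).
extract_phones() {
    grep -oE '(\([0-9]{3}\)|[0-9]{3})-?[0-9]{3}-[0-9]{4}'
}

# Sample input modeled on the lines shown in the question.
printf '%s\n' '<p>800-555-1212</p>' '<p>(800)555-1212</p>' | extract_phones
```

On this sample input it prints 800-555-1212 and (800)555-1212, one per line.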

Answer 2

Why not pass the output through a sed filter to remove it, as in the following transcript:

pax$ echo './dash_only_phone.htm:800-555-1212</p>' | sed 's?</p>$??'
./dash_only_phone.htm:800-555-1212

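This removes any </p> sequence that appears at the end of a line. The ? characters serve as the delimiter of the s command, so the / in </p> does not need escaping.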
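The same filter can be wrapped for reuse at the end of the find/grep pipeline. This is only a sketch; strip_trailing_p is a hypothetical name, and the input lines are copied from the output shown in the question.

```shell
# Hypothetical helper: drop a </p> that sits at the very end of each line.
strip_trailing_p() {
    sed 's?</p>$??'
}

# Feed it the two lines grep printed in the question.
printf '%s\n' './dash_only_phone.htm:800-555-1212</p>' \
              './paren_phone.htm:(800)555-1212</p>' | strip_trailing_p
```

A </p> that is followed by other text on the same line is left alone, because the pattern is anchored with $.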

Answer 3

You just need to add the -o switch so that only the matched phone number is printed:

find ./ -maxdepth 1 -regex ".*\(html\|htm\)$" | xargs grep -o '\(([0-9]\{3\})\|[0-9]\{3\}\)[-]\?[0-9]\{3\}-[0-9]\{4\}'
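For the recursive case mentioned in the question, one possible variant drops -maxdepth 1 and lets find run grep directly. This is a sketch under a few assumptions: find_phones is a hypothetical name, the pattern is rewritten for grep -E, and the grep in use accepts -o and -H (GNU and BSD grep both do).

```shell
# Hypothetical helper: recursively search .htm/.html files under a directory
# (default: current directory) and print file:number for every match.
find_phones() {
    find "${1:-.}" -type f \( -name '*.htm' -o -name '*.html' \) \
        -exec grep -oHE '(\([0-9]{3}\)|[0-9]{3})-?[0-9]{3}-[0-9]{4}' {} +
}

# Example call (searches the current directory tree):
#   find_phones .
```

Using -name tests instead of -regex means only names that really end in .htm or .html are searched, and -exec ... {} + passes file names to grep without the word-splitting problems xargs has with spaces in paths.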