Find unique filenames in an HTML file

Asked: 2010-12-14 06:15:27

Tags: regex shell sed awk grep

$ cat downloaded_file.html

1373 <A HREF="http://site.com/STDMON11202010_company.txt">STDMON11202010_company.txt</A><br> Monday, November 22, 2010  1:31 AM  

How can I search this HTML file from my shell script and select the unique filenames that start with STDMON and end with _company.txt?

2 answers:

Answer 0 (score: 2)

If only digits appear between STDMON and _company.txt, you can do:

grep -o 'STDMON[0-9]*_company\.txt' input.txt | sort -u

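If anything can appear between STDMON and _company.txt, use a non-greedy match (the -P option enables Perl-compatible regular expressions and requires GNU grep):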

grep -oP 'STDMON.*?_company\.txt' input.txt | sort -u
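
The non-greedy `.*?` matters because each line of the question's HTML contains the filename twice (in the HREF and in the link text), so a greedy `.*` would swallow everything between the first STDMON and the last _company.txt. Where `-P` is not available, one possible workaround is a negated bracket expression instead. The following is a minimal sketch, assuming a grep that supports `-o` and `-E` (GNU and BSD grep do) and that filenames never contain `"`, `<`, `>` or `/`; the variable names are made up for illustration:

```shell
# One sample line shaped like the question's HTML: the name appears in the
# HREF and again as the link text.
line='<A HREF="http://site.com/STDMON11202010_company.txt">STDMON11202010_company.txt</A>'

# Greedy .* runs from the first STDMON to the last _company.txt on the line.
greedy=$(printf '%s\n' "$line" | grep -o 'STDMON.*_company\.txt')

# The bracket expression cannot cross a quote, angle bracket, or slash,
# so each filename is matched on its own; sort -u then removes duplicates.
names=$(printf '%s\n' "$line" | grep -oE 'STDMON[^"<>/]*_company\.txt' | sort -u)

printf 'greedy: %s\n' "$greedy"
printf 'unique: %s\n' "$names"
```

Here `greedy` ends up holding both names joined by `">`, while `names` holds the single filename.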

Answer 1 (score: 0)

awk -F'>|<' '$3 ~ /STDMON[0-9]+_company\.txt/ && !a[$0=$3]++' downloaded_file.html

Input

$ cat downloaded_file.html
1373 <A HREF="http://site.com/STDMON11202010_company.txt">STDMON11202010_company.txt</A><br> Monday, November 22, 2010  1:31 AM
1373 <A HREF="http://site.com/STDMON11202010_company.txt">STDMON11202010_company.txt</A><br> Monday, November 22, 2010  1:31 AM
1373 <A HREF="http://site.com/STDMON11202010_company.txt">STDMON14959440_company.txt</A><br> Monday, November 22, 2010  1:31 AM
1373 <A HREF="http://site.com/STDMON11202010_company.txt">STDMON11202010_company.txt</A><br> Monday, November 22, 2010  1:31 AM
1373 <A HREF="http://site.com/STDMON11202010_company.txt">STDMON14959440_company.txt</A><br> Monday, November 22, 2010  1:31 AM
1373 <A HREF="http://site.com/STDMON11202010_company.txt">STDMON11202010_company.txt</A><br> Monday, November 22, 2010  1:31 AM
1373 <A HREF="http://site.com/STDMON11202010_company.txt">STDMON12342440_company.txt</A><br> Monday, November 22, 2010  1:31 AM

Output

$ awk -F'>|<' '$3 ~ /STDMON[0-9]+_company\.txt/ && !a[$0=$3]++' downloaded_file.html
STDMON11202010_company.txt
STDMON14959440_company.txt
STDMON12342440_company.txt
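
The awk approach works because splitting each line on `<` and `>` makes the link text the third field, and an associative array counts how often each value has been seen, so only the first occurrence is printed. To use this from a shell script, it could be wrapped roughly as below. This is only a sketch: the function name `unique_names` and the shortened sample data are assumptions for illustration, and the pattern is anchored with `^` and `$` so the whole field must match.

```shell
# unique_names [FILE...]: print each STDMON<digits>_company.txt link text once,
# in order of first appearance. Reads standard input when no file is given.
# (Hypothetical helper name, not part of the original answer.)
unique_names() {
    awk -F'[<>]' '$3 ~ /^STDMON[0-9]+_company\.txt$/ && !seen[$3]++ { print $3 }' "$@"
}

# Shortened sample standing in for downloaded_file.html.
result=$(unique_names <<'EOF'
1373 <A HREF="http://site.com/STDMON11202010_company.txt">STDMON11202010_company.txt</A><br> Monday, November 22, 2010  1:31 AM
1373 <A HREF="http://site.com/STDMON11202010_company.txt">STDMON14959440_company.txt</A><br> Monday, November 22, 2010  1:31 AM
1373 <A HREF="http://site.com/STDMON11202010_company.txt">STDMON11202010_company.txt</A><br> Monday, November 22, 2010  1:31 AM
EOF
)

# Handle each unique name; this loop only prints it.
printf '%s\n' "$result" | while IFS= read -r name; do
    printf 'found: %s\n' "$name"
done
```

In a real script the call would be `unique_names downloaded_file.html`, with the loop body replaced by whatever needs to happen per file.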