Extract URLs down to their domain

Date: 2014-07-22 12:08:51

Tags: shell awk sed

I have a file like this:

http://article.wn.com/view/2010/11/26/IV_drug_policy_feels_HIV_patients_Red_Cross/      http://aidsjournal.com/,www.cfpa.org.cn/page1/page2 , www.youtube.com
http://seattletimes.nwsource.com/html/jerrybrewer/2013517803_brewer25.html   http://www.moortowntoday.co.uk/your-moortown/Yorkshire-Evening-Post-First-for.6038672.jp, www.yorkshireeveningpost.co.uk/business/1/

I want to reduce every URL to its domain:

http://article.wn.com        http://aidsjournal.com,www.cfpa.org.cn, www.youtube.com
http://seattletimes.nwsource.com   http://www.moortowntoday.co.uk, www.yorkshireeveningpost.co.uk

I used this script, but it only gives me the result for one column:

sed  's|\(http://[^/]*/\).*|\1|g' file

Any suggestion that works for all the URLs in the file is welcome.

5 answers:

Answer 0 (score: 1)

Using perl:

$ perl -ple 's/(?:http:\/\/|www\.)[^\/]*\K[^, ]*//g' file
http://article.wn.com      http://aidsjournal.com,www.cfpa.org.cn , www.youtube.com
http://seattletimes.nwsource.com   http://www.moortowntoday.co.uk, www.yorkshireeveningpost.co.uk
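
If perl is not available, the same idea can probably be sketched with sed -E (extended regular expressions, accepted by GNU and BSD sed). This is only a sketch: the function name strip_paths and the sample line are made up for illustration, and the pattern also accepts https, which the perl pattern above does not.

```shell
# strip_paths: hypothetical helper name, not part of the original answer.
# Keep an optional scheme plus the host, then drop everything up to the
# next comma or space. Reads stdin, or the files given as arguments.
strip_paths() {
  sed -E 's#((https?://)?[^/, ]+)[^, ]*#\1#g' "$@"
}

# made-up sample line
printf '%s\n' 'http://a.example/x/y   http://b.example/,www.c.example/p , www.d.example' | strip_paths
```

As with the perl version, the original spacing and commas between the URLs are left as they were.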

Answer 1 (score: 1)

You can try awk:

awk -F/ '{print $1"//"$3}' file
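
One caveat worth checking before relying on this: with / as the only field separator, $3 is the host of the first URL on the line, so any further URLs on that line are dropped. A small check on a made-up line (first_domain is a hypothetical wrapper name used only here):

```shell
# first_domain: hypothetical wrapper around the one-liner above
first_domain() {
  awk -F/ '{print $1"//"$3}' "$@"
}

# made-up sample line with three URLs; only the first one is printed
printf '%s\n' 'http://a.example/x   http://b.example/y, www.c.example/z' | first_domain
```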

Answer 2 (score: 1)

awk -v FS='[ ,]+' -v OFS=', ' '{ for (i = 1; i <= NF; ++i) { match($i, /^(([[:alpha:]]+:[/][/])?[^/]+)/); $i = substr($i, RSTART, RLENGTH) } print }' file

Output:

http://article.wn.com, http://aidsjournal.com, www.cfpa.org.cn, www.youtube.com
http://seattletimes.nwsource.com, http://www.moortowntoday.co.uk, www.yorkshireeveningpost.co.uk

Answer 3 (score: 0)

A variation on fesias's answer.

awk 'BEGIN{RS="((\n| +),* *|,)";FS="/"}/^http:\/\//{print $1"//"$3;next}{print $1}' file
Edit: I had missed the cfpa entry.

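Answer 4 (score: 0)

If you don't actually care about the whitespace in the output, and you don't actually want the comma that follows one of the URLs (if you do, how would we tell the commas you want apart from the ones you don't?):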

awk -v RS='[[:space:],]+' '{sub(/http:\/\//," "); sub(/\/.*/,""); sub(/ /,"http://")} 1' file
http://article.wn.com
http://aidsjournal.com
www.cfpa.org.cn
www.youtube.com
http://seattletimes.nwsource.com
http://www.moortowntoday.co.uk
www.yorkshireeveningpost.co.uk
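
A regular expression in RS is an extension (gawk and mawk accept it, while POSIX leaves a multi-character RS unspecified). Where that matters, roughly the same one-per-line result can be sketched with tr and sed. The name one_per_line and the sample line are assumptions for illustration, and https is accepted as well as http:

```shell
# one_per_line: hypothetical helper name, reads stdin.
# Step 1 (tr): turn every space, tab and comma into a newline and squeeze
#              runs of newlines into one, so each URL sits on its own line.
# Step 2 (sed): keep an optional scheme plus the host, drop the path.
one_per_line() {
  tr -s ' \t,' '\n\n\n' | sed -E 's#^((https?://)?[^/]+).*#\1#'
}

# made-up sample line
printf '%s\n' 'http://a.example/x   http://b.example/,www.c.example/p , www.d.example' | one_per_line
```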