Finding and listing specific links in a web page with Python

Time: 2015-09-18 18:14:11

Tags: python beautifulsoup

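1.a From the links in a web page's source code, I want to list every link of the form "mypage.php?REF=1137988", that is, mypage.php?REF= followed by a number.

1.b The source page also contains links such as SuppForm.php?REF=1137988, which I want to exclude.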

</TD></TR>
</TABLE>
<FONT CLASS=t><TABLE cellspacing=5><TR><TD bgcolor='#FFFFA0' style='border:5px ridge lightgray'><TABLE cellspacing=4><TR><TD VALIGN=top><FONT CLASS=t2><CENTER>2015-09-03<BR><TABLE cellspacing=4><TR><TD bgcolor='#FFFFFF' style='border:4px ridge lightgray'><CENTER><FONT CLASS=t9>1137988 <A HREF='SuppForm.php?REF=1137988' target='_blank'><IMG SRC='boutons/supp.gif' width=12 height=12 border=0 TITLE='delete'></A> <A HREF='ModifForm.php?REF=1137988' target='_blank'><IMG SRC='boutons/modif.gif' width=10 height=11 border=0 TITLE='modify'></A><BR><TABLE cellspacing=4><TR><TD bgcolor='#FFFFA0' style='border:4px ridge lightgray'><TABLE><TR><TD><IMG SRC='faces/F.gif' width=36 border=0></TD><TD><CENTER><FONT SIZE=1>Age<BR></FONT><FONT SIZE=5><B>35</TD></TR></TABLE></TD></TR></TABLE></TD></TR></TABLE></TD><TD WIDTH=50%><CENTER><FONT class=t><A HREF='mypage.php?REF=1137988' TARGET='_blank'><I>

Here is the code I have been working on so far:

from bs4 import BeautifulSoup
import urllib2
url = "http://wwww.somewebsite.com"

headers = { 'User-Agent' : 'Mozilla/5.0' }
html = urllib2.urlopen(urllib2.Request(url, None, headers)).read()
soup = BeautifulSoup(html)
links = soup.find_all("a")
for link in links:
    # print each link's target and text (no filtering yet)
    print "A HREF=%s %s" % (link.get("href"), link.text)

print links
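  1. I would also like to put the numbers that follow REF into a list. I will use those numbers in the code below.
  2. That means the numbers extracted from the first list have to be comma-separated so they can go into replace = []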

    template = """fjajflakjfakjfl;kj REF={}
    sklkasalsjklas
    klajsl;kdajs;djas
    aksljl;askjflka
    """
    
    replace = [1131062,
               1140921,
               1141326,
               1141355,
               1141426,
               1141430,
               1141461,
               1141473,
               1141477,
               1141502]
    
    # one filled-in copy of the template per REF number
    output = [template.format(r) for r in replace]
    with open('output.txt', 'w') as f_output:
        f_output.write(''.join(output))
    
    
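  3. So please help me with these two things. Sorry if the formatting is a bit off.

    Thank you very much.

    As @wilbur suggested, I modified my code; this is what I have now: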

    from bs4 import BeautifulSoup
    import urllib2
    import re
    
    url = "somewebsite"
    
    headers = { 'User-Agent' : 'Mozilla/5.0' }
    html = urllib2.urlopen(urllib2.Request(url, None, headers)).read()
    soup = BeautifulSoup(html)
    
    links = soup.findAll('a', href=re.compile(r'.*mypage\.php\?REF=[0-9]+'))
    template = """lasljasfkljaslkfj{}
    slajfljasflk
    aslkjfklasjflkasjf
    alksjflkasjf;lk
    """
    
    # each link is a tag object, so read its href before splitting
    replace = [ link['href'].split("=")[1] for link in links ]
    
    output = [template.format(r) for r in replace]
    
    print output
    with open('output.txt', 'w') as f_output:
        f_output.write(''.join([template.format(r) for r in replace]))
    

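1 answer:

Answer 0: (score: 0)

The following gets all links that match your description, then takes the REF parameter from each link and substitutes it into the template.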
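
The answer's original code is not available here, so what follows is only a minimal sketch of the link-matching step, written under a few assumptions. It uses Python 3 and the standard library's `html.parser` instead of BeautifulSoup so that it runs without third-party packages. `AnchorCollector`, `extract_refs`, and the `sample` markup are hypothetical names introduced for illustration. With BeautifulSoup, as in the question, the same compiled pattern could be passed as `href=` to `find_all` and the value read from `link['href']`.

```python
import re
from html.parser import HTMLParser

# Matches hrefs that end in mypage.php?REF=<digits>, either at the start of
# the href or right after a "/". SuppForm.php and ModifForm.php do not match.
REF_PATTERN = re.compile(r"(?:^|/)mypage\.php\?REF=(\d+)$")


class AnchorCollector(HTMLParser):
    """Hypothetical helper: collects the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, so <A HREF=...> works.
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.hrefs.append(href)


def extract_refs(html):
    """Return the REF numbers (as strings) of all mypage.php links."""
    collector = AnchorCollector()
    collector.feed(html)
    refs = []
    for href in collector.hrefs:
        match = REF_PATTERN.search(href)
        if match:
            refs.append(match.group(1))
    return refs


# Sample markup modelled on the snippet in the question.
sample = (
    "<A HREF='SuppForm.php?REF=1137988' target='_blank'>delete</A> "
    "<A HREF='ModifForm.php?REF=1137988' target='_blank'>modify</A> "
    "<A HREF='mypage.php?REF=1137988' TARGET='_blank'>profile</A>"
)
print(extract_refs(sample))  # prints ['1137988']
```

The result is already a Python list, so no manual comma separation is needed before using it in place of the hand-written `replace` list.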

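
Once the REF numbers are in a list, the substitution step might look like the sketch below. It is Python 3, the template text is a placeholder, and `render_all` and `write_output` are hypothetical helper names rather than anything from the original answer.

```python
# Placeholder template text; {} marks where each REF number is substituted.
template = """first line REF={}
second line
"""


def render_all(template, refs):
    """Return one filled-in copy of the template per REF, concatenated."""
    return "".join(template.format(ref) for ref in refs)


def write_output(path, template, refs):
    """Write the rendered templates to the file at path."""
    with open(path, "w") as f_output:
        f_output.write(render_all(template, refs))


# A few REF numbers taken from the list in the question.
refs = [1131062, 1140921, 1141326]
print(render_all(template, refs))
```

Calling `write_output('output.txt', template, refs)` would then produce the same kind of file as the code in the question.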