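1.a From the links in a web page's source code, I want to list every link of the form "mypage.php?REF=1137988", that is, mypage.php?REF= followed by a number.
1.b The source page also contains links such as SuppForm.php?REF=1137988, and I want to skip those.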
</TD></TR>
</TABLE>
<FONT CLASS=t><TABLE cellspacing=5><TR><TD bgcolor='#FFFFA0' style='border:5px ridge lightgray'><TABLE cellspacing=4><TR><TD VALIGN=top><FONT CLASS=t2><CENTER>2015-09-03<BR><TABLE cellspacing=4><TR><TD bgcolor='#FFFFFF' style='border:4px ridge lightgray'><CENTER><FONT CLASS=t9>1137988 <A HREF='SuppForm.php?REF=1137988' target='_blank'><IMG SRC='boutons/supp.gif' width=12 height=12 border=0 TITLE='delete'></A> <A HREF='ModifForm.php?REF=1137988' target='_blank'><IMG SRC='boutons/modif.gif' width=10 height=11 border=0 TITLE='modify'></A><BR><TABLE cellspacing=4><TR><TD bgcolor='#FFFFA0' style='border:4px ridge lightgray'><TABLE><TR><TD><IMG SRC='faces/F.gif' width=36 border=0></TD><TD><CENTER><FONT SIZE=1>Age<BR></FONT><FONT SIZE=5><B>35</TD></TR></TABLE></TD></TR></TABLE></TD></TR></TABLE></TD><TD WIDTH=50%><CENTER><FONT class=t><A HREF='mypage.php?REF=1137988' TARGET='_blank'><I>
Here is the code I have been trying to get working so far:
from bs4 import BeautifulSoup
import urllib2

url = "http://wwww.somewebsite.com"
headers = { 'User-Agent' : 'Mozilla/5.0' }
html = urllib2.urlopen(urllib2.Request(url, None, headers)).read()
soup = BeautifulSoup(html)
links = soup.find_all("a")
for link in links:
    # The target URL is stored in the "href" attribute, not "a"
    print "A HREF=%s (%s)" % (link.get("href"), link.text)
print links
2. The numbers extracted from that first list then have to be separated by commas so they can go into replace = []:
template = """fjajflakjfakjfl;kj REF={}
sklkasalsjklas
klajsl;kdajs;djas
aksljl;askjflka
"""
replace = [1131062,
1140921,
1141326,
1141355,
1141426,
1141430,
1141461,
1141473,
1141477,
1141502]
output = [template.format(r) for r in replace]
with open('output.txt', 'w') as f_output:
    f_output.write(''.join(output))
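So please help me with these two things. Sorry if the formatting is a bit off.
Thank you very much. As @wilbur suggested, I modified my code, and this is what I have now: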
from bs4 import BeautifulSoup
import urllib2
import re
url = "somewebsite"
headers = { 'User-Agent' : 'Mozilla/5.0' }
html = urllib2.urlopen(urllib2.Request(url, None, headers)).read()
soup = BeautifulSoup(html)
links = soup.findAll('a', href=re.compile('.*mypage\.php\?REF=[0-9]*'))
template = """lasljasfkljaslkfj{}
slajfljasflk
aslkjfklasjflkasjf
alksjflkasjf;lk
"""
# Each link is a Tag, so read its href string before splitting on "="
replace = [ link['href'].split("=")[1] for link in links ]
output = [template.format(r) for r in replace]
print output
with open('output.txt', 'w') as f_output:
    f_output.write(''.join(output))
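Answer 0 (score: 0)
The following gets all links that match your description, then takes the REF parameter from each link and substitutes it into the template.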
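A minimal sketch of this approach, assuming Python 3 and BeautifulSoup 4 with the built-in "html.parser" backend. SAMPLE_HTML, TEMPLATE, extract_refs and render are hypothetical names chosen for illustration, and the sample markup is trimmed from the snippet in the question rather than fetched from the real site:

```python
import re
from urllib.parse import parse_qs, urlparse

from bs4 import BeautifulSoup

# Hypothetical sample markup, trimmed from the snippet in the question.
SAMPLE_HTML = """
<A HREF='SuppForm.php?REF=1137988' target='_blank'>delete</A>
<A HREF='ModifForm.php?REF=1137988' target='_blank'>modify</A>
<A HREF='mypage.php?REF=1137988' TARGET='_blank'>profile</A>
<A HREF='mypage.php?REF=1140921' TARGET='_blank'>profile</A>
"""

# Placeholder template; {} marks where each REF number goes.
TEMPLATE = "some text REF={}\nmore text\n"

# Match only hrefs whose file name is exactly mypage.php with a numeric REF,
# so SuppForm.php and ModifForm.php links are skipped.
MYPAGE_LINK = re.compile(r"(^|/)mypage\.php\?REF=[0-9]+$")


def extract_refs(html):
    """Return the REF values of all mypage.php links, in document order."""
    soup = BeautifulSoup(html, "html.parser")
    links = soup.find_all("a", href=MYPAGE_LINK)
    # link["href"] is a string; parse_qs reads REF from its query part.
    return [parse_qs(urlparse(link["href"]).query)["REF"][0] for link in links]


def render(refs, template=TEMPLATE):
    """Fill the template once per REF and join the results into one string."""
    return "".join(template.format(ref) for ref in refs)


if __name__ == "__main__":
    refs = extract_refs(SAMPLE_HTML)
    with open("output.txt", "w") as f_output:
        f_output.write(render(refs))
```

Because extract_refs already returns a list, there is no need to separate the numbers with commas by hand. To run it against the live page, pass the downloaded HTML to extract_refs in place of SAMPLE_HTML.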