I am using BeautifulSoup to extract some text, and then I want to save the entries to a CSV file. My code is as follows:
for trTag in trTags:
    tdTags = trTag.find("td", class_="result-value")
    tdTags_string = tdTags.get_text(strip=True)
    saveFile = open("some.csv", "a")
    saveFile.write(str(tdTags_string) + ",")
    saveFile.close()
saveFile = open("some.csv", "a")
saveFile.write("\n")
saveFile.close()
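It does what I want most of the time, except when an entry contains a comma (","): the comma is treated as a delimiter and the single entry is split across two cells, which is not what I want. I searched online, found suggestions to use the csv module, and changed my code to: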
for trTag in trTags:
    tdTags = trTag.find("td", class_="result-value")
    tdTags_string = tdTags.get_text(strip=True)
    print tdTags_string
    with open("some.csv", "a") as f:
        writeFile = csv.writer(f)
        writeFile.writerow(tdTags_string)
saveFile = open("some.csv", "a")
saveFile.write("\n")
saveFile.close()
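This made things worse: now every letter or digit occupies its own cell in the CSV file. For example, if the entry is "Cat", then "C" is in one cell, "a" is in the next cell, "t" is in the third, and so on.
Edited version: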
import urllib2
import re
import csv
from bs4 import BeautifulSoup
SomeSiteURL = "https://SomeSite.org/xyz"
OpenSomeSiteURL = urllib2.urlopen(SomeSiteURL)
Soup_SomeSite = BeautifulSoup(OpenSomeSiteURL, "lxml")
OpenSomeSiteURL.close()
# finding name
NameParentTag = Soup_SomeSite.find("tr", class_="result-item highlight-person")
Name = NameParentTag.find("td", class_="result-value-bold").get_text(strip=True)
saveFile = open("SomeSite.csv", "a")
saveFile.write(str(Name) + ",")
saveFile.close()
# finding other info
# <tbody> -> many <tr> -> in each <tr>, extract second <td>
tbodyTags = Soup_SomeSite.find("tbody")
trTags = tbodyTags.find_all("tr", class_="result-item ")
for trTag in trTags:
    tdTags = trTag.find("td", class_="result-value")
    tdTags_string = tdTags.get_text(strip=True)
    with open("SomeSite.csv", "a") as f:
        writeFile = csv.writer(f)
        writeFile.writerow([tdTags_string])
Version 2:
placeHolder = []
for trTag in trTags:
    tdTags = trTag.find("td", class_="result-value")
    tdTags_string = tdTags.get_text(strip=True)
    placeHolder.append(tdTags_string)
with open("SomeSite.csv", "a") as f:
    writeFile = csv.writer(f)
    writeFile.writerow(placeHolder)
Updated output:
u'stuff1'
u'stuff2'
u'stuff3'
Sample output:
u'record1' u'31 Mar 1901' u'California'
u'record1' u'31 Mar 1901' u'California'
record1 31-Mar-01 California
Another edited version of the code (one problem remains: it skips a line below each row):
import urllib2
import re
import csv
from bs4 import BeautifulSoup
SomeSiteURL = "https://SomeSite.org/xyz"
OpenSomeSiteURL = urllib2.urlopen(SomeSiteURL)
Soup_SomeSite = BeautifulSoup(OpenSomeSiteURL, "lxml")
OpenSomeSiteURL.close()
# finding name
NameParentTag = Soup_SomeSite.find("tr", class_="result-item highlight-person")
Name = NameParentTag.find("td", class_="result-value-bold").get_text(strip=True)
saveFile = open("SomeSite.csv", "a")
saveFile.write(str(Name) + ",")
saveFile.close()
# finding other info
# <tbody> -> many <tr> -> in each <tr>, extract second <td>
tbodyTags = Soup_SomeSite.find("tbody")
trTags = tbodyTags.find_all("tr", class_="result-item ")
placeHolder = []
for trTag in trTags:
    tdTags = trTag.find("td", class_="result-value")
    tdTags_string = tdTags.get_text(strip=True)
    #print repr(tdTags_string)
    placeHolder.append(tdTags_string.rstrip('\n'))
with open("SomeSite.csv", "a") as f:
    writeFile = csv.writer(f)
    writeFile.writerow(placeHolder)
Answer 0 (score: 1)
with open("some.csv", "a") as f:
    writeFile = csv.writer(f)
    writeFile.writerow([tdTags_string])  # put in a list
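writeFile.writerow iterates over whatever you pass in, so the string "foo" becomes f,o,o: three separate values. Wrapping the string in a list prevents this, because the writer then iterates over the list rather than over the string.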
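A minimal sketch of the difference, written for Python 3 and using an in-memory io.StringIO buffer instead of a real file (both are assumptions made for illustration; the code in the question is Python 2). The helper name rows_to_csv and the sample values are hypothetical. It also shows that the writer quotes a field containing a comma, which was the original problem:

```python
import csv
import io


def rows_to_csv(rows):
    """Pass each item of `rows` to writerow() and return the resulting CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for row in rows:
        writer.writerow(row)
    return buf.getvalue()


# A bare string is iterated character by character: one cell per letter.
as_string = rows_to_csv(["Cat"])

# Wrapped in a list, the whole string becomes a single cell.
as_list = rows_to_csv([["Cat"]])

# A field that contains a comma is quoted instead of being split.
with_comma = rows_to_csv([["Smith, John", "California"]])

print(repr(as_string))
print(repr(as_list))
print(repr(with_comma))
```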
You should also open your file once, rather than on every iteration of the loop:
with open("SomeSite.csv", "a") as f:
    writeFile = csv.writer(f)
    for trTag in trTags:
        tdTags = trTag.find("td", class_="result-value")
        tdTags_string = tdTags.get_text(strip=True)
        writeFile.writerow([tdTags_string])
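The same pattern can be tried without scraping anything. The sketch below assumes Python 3, writes to a temporary directory, and uses a hypothetical list named values in place of the strings returned by get_text(strip=True); reading the file back confirms that each value landed in its own single-cell row, comma included:

```python
import csv
import os
import tempfile

# Hypothetical stand-ins for the scraped <td> texts.
values = ["record1", "31 Mar 1901", "Sacramento, California"]

# Temporary location so the example does not touch a real SomeSite.csv.
path = os.path.join(tempfile.mkdtemp(), "SomeSite.csv")

# Open the file once, outside the loop. newline="" is what the Python 3
# csv documentation recommends for files handed to csv.writer.
with open(path, "a", newline="") as f:
    writer = csv.writer(f)
    for value in values:
        writer.writerow([value])  # one single-cell row per value

# Read the file back to check what was written.
with open(path, newline="") as f:
    rows = list(csv.reader(f))

print(rows)
```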
Answer 1 (score: 1)
For the latest problem of a line being skipped, I found the answer. Instead of
with open("SomeSite.csv", "a") as f:
    writeFile = csv.writer(f)
    writeFile.writerow(placeHolder)
use this:
with open("SomeSite.csv", "ab") as f:
    writeFile = csv.writer(f)
    writeFile.writerow(placeHolder)
Source: https://docs.python.org/3/library/functions.html#open. "a" is append mode, whereas "ab" is append mode that also opens the file as binary, which solves the problem of an extra line being skipped.
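The "ab" fix applies to Python 2, which the question uses. In Python 3, csv.writer needs a text-mode file, and the equivalent is to pass newline="" to open(). The sketch below assumes Python 3; the helper write_row and the file names are hypothetical. Passing newline="\r\n" imitates the translation that Windows text mode applies by default, so the doubled carriage return can be reproduced on any platform:

```python
import csv
import os
import tempfile


def write_row(path, row, newline):
    """Append one CSV row using the given newline setting; return the raw bytes."""
    with open(path, "a", newline=newline) as f:
        csv.writer(f).writerow(row)
    with open(path, "rb") as f:
        return f.read()


tmp = tempfile.mkdtemp()
row = ["record1", "31 Mar 1901", "California"]

# The writer ends each row with "\r\n". With newline translation active,
# the "\n" is expanded to "\r\n" again, which spreadsheets show as a blank row.
translated = write_row(os.path.join(tmp, "translated.csv"), row, "\r\n")

# newline="" switches translation off, so the row terminator is written as-is.
untranslated = write_row(os.path.join(tmp, "plain.csv"), row, "")

print(repr(translated))
print(repr(untranslated))
```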