Question

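I have built a web scraper that extracts all the images on a web page. My code is supposed to print each img URL to standard output and write a CSV file containing all of them, but right now it only writes the last image it found, together with its result number, to the CSV.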

Here is the code I am currently using:

# This program prints a list of all images contained in a web page 
#imports library for url/html recognition
from urllib.request import urlopen
from HW_6_CSV import writeListToCSVFile
#imports library for regular expressions
import re
#imports for later csv writing
import csv
#gets user input
address = input("Input a url for a page to get your list of image urls       ex. https://www.python.org/:  ")
#opens Web Page for processing
webPage = urlopen(address)
#defines encoding
encoding = "utf-8"
#defines resultList variable
resultList=[]
#sets i for later printing
i=0
#defines logic flow
for line in webPage :
   line = str(line, encoding)
   #defines imgTag
   imgTag = '<img '
   #goes to next piece of logical flow
   if imgTag in line :
      i = i+1
      srcAttribute = 'src="'
      if srcAttribute in line:
      #parses the html retrieved from user input 
       m = re.search('src="(.+?)"', line)
       if m:
          reline = m.group(1)
          #prints results
          print("[ ",[i], reline , " ]")

data = [[i, reline]]

output_file = open('examp_output.csv', 'w')
datawriter = csv.writer(output_file)
datawriter.writerows(data)
output_file.close()
webPage.close()

How can I get this program to write every image it finds to the CSV file?

Answer 1

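You only see the last result in your CSV because `data` is never properly updated inside the for loop: it is assigned only once, after the loop has finished, so it holds just the final values of `i` and `reline`. To collect every relevant match in the list `data`, you should indent that line so it runs inside the loop, and use the list's append or extend method instead of rebuilding the list.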

So if you rewrite the loop as:

img_nbr = 0  # prefer a descriptive name over `i`: it is much easier to search for and rename later
data = []
imgTag = '<img '  # no need to redefine this variable on every iteration
srcAttribute = 'src="'  # same comment applies here

for line in webPage:
    line = str(line, encoding)
    if imgTag in line:
        img_nbr += 1  # += saves a few keystrokes and a possible future find-replace
        # if srcAttribute in line:  # this check and the regex match below do nearly the same thing: keep only one
        m = re.search('src="(.+?)"', line)
        if m:
            reline = m.group(1)
            print("[{}: {}]".format(img_nbr, reline))  # `format` is the suggested way to build strings; it has been around since Python 2.6
            data.append((img_nbr, reline))  # this is what you were really missing

you will get better results. I added a few comments with suggestions on coding style, and removed your own comments so that the new ones stand out.
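For completeness, here is a minimal sketch of how the accumulate-then-write pattern could look end to end. It is not the asker's exact program: the helper names `collect_image_rows` and `write_rows` are made up for illustration, and the input is assumed to be an iterable of already-decoded lines rather than the `urlopen` response. The regex and the counting behaviour follow the loop above.

```python
import csv
import re

IMG_TAG = '<img '
SRC_PATTERN = re.compile(r'src="(.+?)"')


def collect_image_rows(lines):
    """Return one (image number, src) row per line that has an <img> tag with a src.

    `lines` is assumed to be an iterable of decoded strings (hypothetical helper).
    """
    rows = []
    img_nbr = 0
    for line in lines:
        if IMG_TAG in line:
            img_nbr += 1  # counts every <img> line, as in the loop above
            m = SRC_PATTERN.search(line)
            if m:
                # Append inside the loop so earlier matches are kept.
                rows.append((img_nbr, m.group(1)))
    return rows


def write_rows(path, rows):
    """Write all collected rows to a CSV file in a single call."""
    # newline='' is what the csv module documentation recommends when opening
    # the file; the with-block closes it even if writing fails.
    with open(path, 'w', newline='') as output_file:
        csv.writer(output_file).writerows(rows)


if __name__ == '__main__':
    sample = [
        '<html><body>',
        '<img src="/static/logo.png" alt="logo">',
        '<p>no image here</p>',
        '<img class="banner" src="https://example.com/banner.jpg">',
    ]
    write_rows('examp_output.csv', collect_image_rows(sample))
```

With the real page, the decoded lines could be produced from the response with something like `(str(line, encoding) for line in webPage)`, which is the same conversion the loop above performs.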

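However, there are still a few issues with your code: HTML should not be parsed with regular expressions unless the source is extremely well structured (and even then...). Right now, because you are asking for user input, the user can supply any URL, and web pages are more often than not poorly structured. If you want to build more robust web scrapers, I recommend having a look at BeautifulSoup.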
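To illustrate the point about using a real parser, here is a small sketch. Note that it does not use BeautifulSoup, which is a third-party package; it swaps in the standard library's `html.parser.HTMLParser` so the example has no dependencies. The class name `ImgSrcCollector` and the function `image_sources` are hypothetical names chosen for this example.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collects the src attribute of every <img> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # The parser lower-cases tag and attribute names and strips the
        # quotes around attribute values before calling this method.
        if tag == 'img':
            src = dict(attrs).get('src')
            if src:
                self.sources.append(src)


def image_sources(html_text):
    """Return the src values of all <img> tags in html_text (illustrative helper)."""
    parser = ImgSrcCollector()
    parser.feed(html_text)
    parser.close()
    return parser.sources


if __name__ == '__main__':
    html_text = """
    <p><img src="a.png"><img src='b.png'></p>
    <IMG
       alt="split across lines"
       SRC="c.png">
    """
    for number, src in enumerate(image_sources(html_text), start=1):
        print("[{}: {}]".format(number, src))
```

The sample input contains cases the line-by-line regex would likely miss: two images on one line, a single-quoted attribute, and an upper-case tag split across several lines. With BeautifulSoup the equivalent lookup is usually written with `find_all('img')`, but check its documentation for the details.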
