Find a description using BeautifulSoup

Asked: 2012-10-11 18:50:41

Tags: python beautifulsoup

There are a lot of posts here asking how to do automated searches on Google. I chose to use BeautifulSoup and have read many of the questions about it, but I couldn't find a direct answer to my problem, even though the specific task seems commonplace. My code below is self-explanatory; the commented sections are where I ran into trouble (EDIT: by "ran into trouble" I mean I couldn't figure out how to implement my pseudocode for those parts, and after reading the documentation and searching online for similar code I still didn't know how to do it). If it helps, I think my problem is probably very similar to that of anyone doing automated searches on PubMed to find specific articles of interest. Thanks very much.

#Find Description

from BeautifulSoup import BeautifulSoup
import csv
import urllib
import urllib2

input_csv = "Company.csv"
output_csv = "output.csv"

def main():
    with open(input_csv, "rb") as infile:
        input_fields = ("Name",)  # trailing comma makes this a tuple, not a string
        reader = csv.DictReader(infile, fieldnames = input_fields)
        with open(output_csv, "wb") as outfile:
            output_fields = ("Name", "Description")
            writer = csv.DictWriter(outfile, fieldnames = output_fields)
            writer.writerow(dict((h,h) for h in output_fields))
            next(reader)
            first_row = next(reader)
            for next_row in reader:
                search_term = first_row["Name"]
                url = "http://google.com/search?q=%s" % urllib.quote_plus(search_term)

                #STEP ONE: Enter "search term" into Google Search
                #req = urllib2.Request(url, None, {'User-Agent':'Google Chrome'} )
                #res = urllib2.urlopen(req)
                #dat = res.read()
                #res.close()
                #BeautifulSoup(dat)


                #STEP TWO: Find Description
                #if there is a wikipedia page for the entity:
                    #return first sentence of wikipedia page
                #if other site:
                    #return all sentences that have the keyword "keyword" in them

                #STEP THREE: Return Description as "google_search" variable

                first_row["Description"] = google_search  # must match output_fields
                writer.writerow(first_row)
                first_row = next_row

if __name__ == "__main__":
    main()
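As a sketch of how STEP TWO's "return first sentence" idea might work once a page has been fetched and parsed, here is a minimal first-sentence extractor. It uses the stdlib `html.parser` as a stand-in for BeautifulSoup so it runs with no dependencies or network access; the assumption that the description lives in the page's first `<p>` tag, and the naive period-based sentence split, are both simplifications.

```python
# Sketch: extract the first sentence of the first <p> in an HTML page.
# Stdlib-only stand-in for the BeautifulSoup-based approach in the question.
from html.parser import HTMLParser

class FirstParagraph(HTMLParser):
    """Collects the text of the first <p> element encountered."""
    def __init__(self):
        super().__init__()
        self.in_p = False
        self.done = False
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "p" and not self.done:
            self.in_p = True

    def handle_endtag(self, tag):
        if tag == "p" and self.in_p:
            self.in_p = False
            self.done = True

    def handle_data(self, data):
        if self.in_p:
            self.text.append(data)

def first_sentence(html):
    parser = FirstParagraph()
    parser.feed(html)
    paragraph = "".join(parser.text).strip()
    if not paragraph:
        return ""
    # naive split: assume the first sentence ends at the first ". "
    return paragraph.split(". ")[0].rstrip(".") + "."

print(first_sentence("<p>Acme makes anvils. It was founded in 1920.</p>"))
# prints: Acme makes anvils.
```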

Addendum

For anyone working on this or researching it, I came up with a suboptimal solution that I am still finishing. But I figured I would post it in case it helps anyone else who lands on this page. Basically, instead of dealing with the problem of finding which web page to select, I just added a preliminary step that runs all the searches against Wikipedia. It isn't what I want, but at least it makes it easy to get a subset of the entities. The code is split across two files (Wikipedia.py and wiki_test.py):

#Wikipedia.py

from BeautifulSoup import BeautifulSoup
import csv
import urllib
import urllib2
import wiki_test


input_csv = "Name.csv"
output_csv = "WIKIPEDIA.csv"

def main():
    with open(input_csv, "rb") as infile:
        input_fields = ("A", "C", "E", "M", "O", "N", "P", "Y")
        reader = csv.DictReader(infile, fieldnames = input_fields)
        with open(output_csv, "wb") as outfile:
            output_fields = ("A", "C", "E", "M", "O", "N", "P", "Y", "Description")
            writer = csv.DictWriter(outfile, fieldnames = output_fields)
            writer.writerow(dict((h,h) for h in output_fields))
            next(reader)
            first_row = next(reader)
            for next_row in reader:
                print(next_row)
                print(first_row["A"])
                search_term = first_row["A"]
                #print(search_term)
                result = wiki_test.wiki(search_term)
                first_row["Description"] = result
                writer.writerow(first_row)
                first_row = next_row

if __name__ == "__main__":
    main()
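For readers less familiar with `csv.DictReader` / `csv.DictWriter`, the read-lookup-write loop above can be sketched with in-memory files. Here `io.StringIO` stands in for Name.csv and WIKIPEDIA.csv, and a placeholder function stands in for the `wiki_test.wiki` lookup; the column names are simplified to a single "Name" field.

```python
# Sketch of the CSV round-trip, with in-memory files instead of disk files.
import csv
import io

def describe(name):
    return "description of " + name  # placeholder for the real wiki lookup

infile = io.StringIO("Name\nAcme\nGlobex\n")
outfile = io.StringIO()

reader = csv.DictReader(infile)  # uses the header row for field names
writer = csv.DictWriter(outfile, fieldnames=["Name", "Description"])
writer.writeheader()
for row in reader:
    writer.writerow({"Name": row["Name"],
                     "Description": describe(row["Name"])})

print(outfile.getvalue())
```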

The helper module (wiki_test.py), based on the post Extract the first paragraph from a Wikipedia article (Python):

import urllib
import urllib2
from BeautifulSoup import BeautifulSoup

def wiki(article):
    article = urllib.quote(article)
    opener = urllib2.build_opener()
    opener.addheaders = [('User-agent', 'Google Chrome')] #Wikipedia rejects the default urllib2 user agent
    resource = opener.open("http://en.wikipedia.org/wiki/" + article)
    #try:
    #    urllib2.urlopen(resource)
    #except urllib2.HTTPError, e:
    #    print(e)
    data = resource.read()
    resource.close()
    soup = BeautifulSoup(data)
    return soup.find('div', id="bodyContent").p  # return (not print) so Wikipedia.py can write it out

I just need to fix it to handle HTTP 404 errors (i.e. page not found); as it stands, this code works for anyone who wants to look up the basic company information available on Wikipedia. Again, I would rather get something working on a Google search, finding the relevant site and the sections of that site that mention the "keyword", but at least this current program gets us something.
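The 404 handling mentioned above could look like the following sketch (Python 3 `urllib` rather than the `urllib2` used in the original; the injectable `fetch` function and `fake_fetch` are illustration conveniences, not part of the original code):

```python
# Sketch: treat a missing Wikipedia page (HTTP 404) as "no description"
# instead of crashing. fetch is injectable so the error path can be
# exercised without network access.
import urllib.error

def wiki_description(article, fetch):
    """Return the page HTML for `article`, or None if the page is missing."""
    try:
        return fetch("http://en.wikipedia.org/wiki/" + article.replace(" ", "_"))
    except urllib.error.HTTPError as e:
        if e.code == 404:
            return None   # page not found: skip this company
        raise             # other HTTP errors are unexpected

def fake_fetch(url):
    """Stand-in for a real network fetch, for illustration only."""
    if "Nonexistent" in url:
        raise urllib.error.HTTPError(url, 404, "Not Found", None, None)
    return "<html><p>Example Co. is a company.</p></html>"

print(wiki_description("Example Co.", fake_fetch) is not None)  # True
print(wiki_description("Nonexistent Co.", fake_fetch))          # None
```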

0 Answers:

There are no answers.