What is the best (fastest) way to scrape web pages?

Time: 2019-03-23 06:12:17

Tags: python selenium web-scraping beautifulsoup

I am trying to scrape data from Google Patents and find that the execution time is far too slow. How can I speed it up? Running 8000 patents has already taken 7 hours...

Here is an example of a patent.

I need to get data from the tables below and write it to a csv file. I think the bottleneck is WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, "//div[@class='table style-scope patent-result']")))

Is this wait necessary, or could I use find_elements_by_css_selector and simply check whether it returns anything?

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException, WebDriverException
#...

##  read file of patent numbers and initiate chrome

url = "https://patents.google.com/patent/US6403086B1/en?oq=US6403086B1"

for x in patent_number:

    #url = new url with new patent number similar to above

    try:
        driver.set_page_load_timeout(20)  # must be set before the get() it should apply to
        driver.get(url)
    except WebDriverException:  # page failed to load in time
        #--write to csv
        continue

    if "404" in driver.title: #patent number not found
        #--write to csv
        continue

    try: 
        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.XPATH, "//div[@class='table style-scope patent-result']"))
        )
    except TimeoutException:  # table did not appear within 10 seconds
        #--write to csv
        continue


    ##  rest of code to get data from tables and write to csv

Is there a more efficient way to check whether these tables exist on the patent page? Or would it make any difference if I used BeautifulSoup instead? The alternative check I have in mind is sketched below.
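For reference, this is roughly the find_elements_by_css_selector check I am considering (a sketch only; the selector is my translation of the XPATH above, and note that without an explicit wait it may run before the page has finished rendering):

    # find_elements returns an empty list instead of raising, so an empty
    # result means the table div is absent (or has not rendered yet)
    tables = driver.find_elements_by_css_selector("div.table.style-scope.patent-result")
    if not tables:
        #--write to csv
        continue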

I am new to web scraping, so any help would be greatly appreciated :)

1 Answer:

Answer 0 (score: 2)

Not sure which tables you want, but consider that you can grab the tables with requests and pandas, and use a Session to re-use the connection.

import requests
from bs4 import BeautifulSoup as bs
import pandas as pd

codes = ['US6403086B1','US6403086B1'] #patent numbers to come from file
with requests.Session() as s:
    for code in codes:
        url = 'https://patents.google.com/patent/{}/en?oq={}'.format(code, code)
        r = s.get(url)
        tables = pd.read_html(r.text)  # r.text is the decoded HTML; str(r.content) would include the b'' wrapper
        print(tables)  #example only. Remove later
        # here would add some tidying up to tables e.g. drop NaN rows, replace NaN with '' ....
        # rather than print... whatever steps to store info you want until write out
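To extend the trailing comments above, here is a minimal sketch of what the tidy-and-store step could look like (the dropna/fillna choices and the patent_tables.csv filename are illustrative assumptions, not part of the original answer):

import requests
import pandas as pd

codes = ['US6403086B1']  # patent numbers to come from file
frames = []
with requests.Session() as s:
    for code in codes:
        url = 'https://patents.google.com/patent/{}/en?oq={}'.format(code, code)
        r = s.get(url)
        for table in pd.read_html(r.text):
            tidy = table.dropna(how='all').fillna('')  # drop all-NaN rows, blank out remaining NaN
            tidy.insert(0, 'patent', code)             # tag each row with its patent number
            frames.append(tidy)

# hypothetical output file; concatenate everything and write once at the end
pd.concat(frames, ignore_index=True).to_csv('patent_tables.csv', index=False)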