webdriver downloads a temporary file instead of a csv?

Date: 2020-06-17 09:47:43

Tags: python selenium web-scraping

I wrote some code that looks up companies on this website https://violationtracker.goodjobsfirst.org/ and downloads the csv results from the company page - see an example for Nike here: https://violationtracker.goodjobsfirst.org/prog.php?parent=&major_industry_sum=&offense_group_sum=&primary_offense_sum=&agency_sum=&agency_sum_st=&hq_id_sum=&company_op=starts&company=nike&major_industry%5B%5D=&case_category=&offense_group=&all_offense%5B%5D=&penalty_op=%3E&penalty=&govt_level=&agency_code%5B%5D=&agency_code_st%5B%5D=&pen_year%5B%5D=&pres_term=&free_text=&case_type=&ownership%5B%5D=&hq_id=&naics%5B%5D=&state=&city=

The code worked fine for a long time, but now it downloads a temporary file instead of the csv, and I am not sure why. The website itself is not the problem, because when I try manually I can download the csv.

Here is my code:

import glob
import os
import time

import pandas as pd
from selenium import webdriver
from selenium.webdriver import ActionChains
from selenium.webdriver.common.keys import Keys

df_all = []

supplier = ['Nike']

length = len(supplier)

## go to the website
for idx, i in enumerate(supplier):
    rem = length - idx
    print('This is index: ', idx, ', element: ', i, ', with remaining : ', rem, ' elements')
    try:
        driver = webdriver.Chrome(executable_path=r"C:\webdrivers\chromedriver.exe")
        driver.get("https://www.goodjobsfirst.org/violation-tracker")

        ## find the iframe within the browser
        driver.switch_to.frame(0)
        ## insert the search text via xpath
        elem = driver.find_element_by_xpath("//*[@id='edit-field-violation-company-value']")
        elem.send_keys(i)
        elem.send_keys(Keys.RETURN)
        time.sleep(10)
        try:
            ## download the information from the relevant page
            button = driver.find_element_by_xpath('//*[@id="content"]/div/div[2]/a[1]/img')
            ActionChains(driver).move_to_element(button).click(button).perform()
            ## give the download time to finish before looking for the file
            time.sleep(3)
            ## load the latest csv in the download folder
            list_of_files = glob.glob(r'C:\Users\~\Downloads\*.csv')
            latest_file = max(list_of_files, key=os.path.getctime)
            df = pd.read_csv(latest_file)
            print(df)
            df_all.append(df)
            driver.close()
            if os.path.exists(latest_file):
                os.remove(latest_file)
            else:
                print("The file does not exist")
        except Exception:
            driver.close()
    except Exception:
        pass


violation_tracker = pd.concat(df_all)

What am I missing?
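For reference, Chrome saves an in-progress download as a temporary .crdownload file and only renames it to .csv once the transfer completes, so a fixed time.sleep() can easily fire too early and leave a temp file behind. Below is a minimal sketch of a polling helper plus a pinned download directory; the folder path and timeout are illustrative assumptions, not part of the original code:

import glob
import os
import time

from selenium import webdriver

DOWNLOAD_DIR = r"C:\webdriver_downloads"  # placeholder path, adjust as needed

def wait_for_csv(download_dir, timeout=60):
    """Poll until no partial (.crdownload) files remain in download_dir,
    then return the newest .csv, or None if the timeout expires."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        ## Chrome keeps a partial download as *.crdownload until it finishes
        in_progress = glob.glob(os.path.join(download_dir, '*.crdownload'))
        finished = glob.glob(os.path.join(download_dir, '*.csv'))
        if not in_progress and finished:
            return max(finished, key=os.path.getctime)
        time.sleep(1)
    return None

## pinning the download directory keeps the glob independent of the user profile
options = webdriver.ChromeOptions()
options.add_experimental_option("prefs", {"download.default_directory": DOWNLOAD_DIR})
driver = webdriver.Chrome(executable_path=r"C:\webdrivers\chromedriver.exe", options=options)

A call like latest_file = wait_for_csv(DOWNLOAD_DIR) could then replace the glob/max pair in the loop above.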

1 answer:

Answer 0 (score: 0):

This website looks really interesting! Thank you.

Just add "&detail=csv_results" to the end of the second URL. The following code works:

import csv

import requests as rq

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101 Firefox/78.0"}
url = "https://violationtracker.goodjobsfirst.org/prog.php?parent=&major_industry_sum=&offense_group_sum=&primary_offense_sum=&agency_sum=&agency_sum_st=&hq_id_sum=&company_op=starts&company=nike&major_industry[]=&case_category=&offense_group=&all_offense[]=&penalty_op=%3E&penalty=&govt_level=&agency_code[]=&agency_code_st[]=&pen_year[]=&pres_term=&free_text=&case_type=&ownership[]=&hq_id=&naics[]=&state=&city=&detail=csv_results"

resp = rq.get(url, headers=headers)
# the response body is the csv itself, so it can be parsed directly
wrapper = csv.reader(resp.text.strip().split('\n'))
for record in wrapper:
    print(record)
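Since the question ultimately collects pandas DataFrames, the same response can also be read straight into one. A small sketch, reusing the url and headers defined above and assuming the endpoint returns a CSV body with a header row:

import io

import pandas as pd
import requests as rq

resp = rq.get(url, headers=headers)  # url and headers as defined above
df = pd.read_csv(io.StringIO(resp.text))
print(df.head())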