BeautifulSoup - Problem appending scraped results to a CSV file

Time: 2018-03-18 00:11:07

Tags: python python-3.x beautifulsoup tags

HTML:

<div class="job-result-logo-title">
   <div class="job-result-logo">
      <a href="/Recruiters/SQS-Ireland-5673.aspx"><img alt="SQS Ireland" src="/Logos/SQS-Ireland-small-5673.gif"></a>
   </div>
   <div class="job-result-title">
      <h2 itemprop="title"><a href="/Jobs/QA-Analyst-8148774.aspx">QA Analyst</a>
      </h2>
      <h3 itemprop="name">
         <a itemprop="hiringOrganization" itemscope="" itemtype="https://schema.org/Organization" href="/Recruiters/SQS-Ireland-5673.aspx">SQS Ireland</a>
      </h3>
   </div>
</div>
<div class="job-result-overview" style="display: ">
   <ul class="job-overview">
      <li itemprop="baseSalary" class="salary">Negotiable</li>
      <li itemprop="datePosted" class="updated-time">Updated 17/03/2018</li>
      <li itemprop="jobLocation" class="location">
         <a href="/Jobs/Dublin-City-Centre/">Dublin City Centre</a>
         <span>&nbsp;/</span>                                            <a href="/Jobs/Dublin-South/">Dublin South</a>
         <span>&nbsp;/</span>                                            <a href="/Jobs/Dublin-North/">Dublin North</a>
      </li>
   </ul>
</div>

My code:

def find_data(source):
    for a in source.find_all('div', class_='job-result-title'):
        job_info = a.find('h2').find('a')
        company_name = a.find('h3').find('a').get_text()
        url = job_info['href']
        full_url = base_url + url
        role = job_info.get_text()
    for ul in source.find_all('ul', class_='job-overview'):
        date = ul.find('li',class_='updated-time').get_text().replace('Updated','').strip()
    append_data("data.csv", company_name, role, full_url, date)

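I have tried many variations of this code and searched here for similar answers, but with no luck. I always get the same date from the line below, and I don't understand why the code does not iterate over all of the matching tags, each of which contains a date: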

<li itemprop="datePosted" class="updated-time">Updated 17/03/2018</li>

1 Answer:

Answer 0 (score: 0)

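You are not saving the values found inside the for loops. Each pass overwrites the same variables, so by the time you write to the CSV file (once, after both loops have finished) you only have the last value of each variable.

You need to collect all the values in lists and then write them to the CSV.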
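The difference can be seen in a minimal sketch. The `raw_dates` list is a hypothetical stand-in for the text of the `updated-time` tags, not data taken from the real page:

```python
# Hypothetical stand-ins for the text of each 'updated-time' <li> tag.
raw_dates = ['Updated 17/03/2018', 'Updated 16/03/2018', 'Updated 15/03/2018']

# Overwriting one variable: only the value from the final pass survives.
for text in raw_dates:
    last_date = text.replace('Updated', '').strip()

# Appending to a list: every value is kept, in order.
all_dates = []
for text in raw_dates:
    all_dates.append(text.replace('Updated', '').strip())

print(last_date)
print(all_dates)
```

After the first loop `last_date` holds a single string, while `all_dates` holds one entry per tag.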

Code:

import requests
from bs4 import BeautifulSoup

r = requests.get('https://www.irishjobs.ie/ShowResults.aspx?Keywords=test&Location=102&Category=3&Recruiter=All&SortBy=MostRecent&PerPage=100')
source = BeautifulSoup(r.text, 'lxml')

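# One list per CSV column; each loop appends instead of overwriting.
company_name, role, full_url, date = [], [], [], []
base_url = 'https://www.irishjobs.ie'

# Each 'job-result-title' div holds the role link (h2) and the company link (h3).
for a in source.find_all('div', class_='job-result-title'):
    job_info = a.find('h2').find('a')
    company_name.append(a.find('h3').find('a').get_text())
    url = job_info['href']
    full_url.append(base_url + url)
    role.append(job_info.get_text())
# Each 'job-overview' list holds the 'Updated dd/mm/yyyy' entry for one job.
for ul in source.find_all('ul', class_='job-overview'):
    date.append(ul.find('li', class_='updated-time').get_text().replace('Updated', '').strip())

# Walk the four lists in parallel, one job per iteration.
for a, b, c, d in zip(company_name, role, full_url, date):
    print(a, b, c, d)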

Partial output:

Globoforce Senior QA Automation Engineer https://www.irishjobs.ie/Jobs/Senior-QA-Automation-Engineer-8149253.aspx 17/03/2018
Globoforce Technical  Team Lead (Java) https://www.irishjobs.ie/Jobs/Technical-Team-Lead-Java-8149252.aspx 17/03/2018
Globoforce Performance Test Engineer https://www.irishjobs.ie/Jobs/Performance-Test-Engineer-8149251.aspx 17/03/2018
Globoforce Senior Front End Developer https://www.irishjobs.ie/Jobs/Senior-Front-End-Developer-8149249.aspx 17/03/2018
Synchronoss Technologies Lead iOS Swift  Developer Enterprise Agile https://www.irishjobs.ie/Jobs/Lead-iOS-Swift-Developer-Enterprise-8149248.aspx 17/03/2018
Computer Futures .NET Engineer Front End Developer https://www.irishjobs.ie/Jobs/NET-Engineer-Front-End-Developer-8149244.aspx 17/03/2018
Computer Futures .NET Developer C# ASP.NET Core https://www.irishjobs.ie/Jobs/NET-Developer-CSharp-ASP-NET-8149241.aspx 17/03/2018
Computer Futures Senior C# Developer TDD DDD https://www.irishjobs.ie/Jobs/Senior-CSharp-Developer-TDD-DDD-8149240.aspx 17/03/2018

You just need to write the values to the CSV instead of calling print(a, b, c, d).
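One way to do that is with the standard library's `csv` module. This is only a sketch: `write_rows` is a hypothetical helper (the question's own `append_data` function is not shown, so it is not reproduced here), and the short lists are sample values copied from the output above, standing in for the lists built by the scraping loops.

```python
import csv

# Sample values standing in for the lists filled by the scraping loops.
company_name = ['Globoforce', 'Computer Futures']
role = ['Performance Test Engineer', 'Senior C# Developer TDD DDD']
full_url = [
    'https://www.irishjobs.ie/Jobs/Performance-Test-Engineer-8149251.aspx',
    'https://www.irishjobs.ie/Jobs/Senior-CSharp-Developer-TDD-DDD-8149240.aspx',
]
date = ['17/03/2018', '17/03/2018']


def write_rows(path, *columns):
    """Append one CSV row per job, taking one item from each column list."""
    # newline='' lets the csv module control line endings itself.
    with open(path, 'a', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        # zip pairs up the i-th item of every list into one row.
        writer.writerows(zip(*columns))


write_rows('data.csv', company_name, role, full_url, date)
```

Note that `zip` stops at the shortest list, so if one loop finds fewer tags than the other, the extra entries are silently dropped. The file is opened in append mode (`'a'`), so running the script again adds rows rather than replacing them.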