How do I write web-scraped text to a CSV file with Python?

Asked: 2016-10-14 08:12:20

Tags: python csv web-scraping beautifulsoup

I have been working on a practice web scraper that collects written reviews and writes them to a CSV file, one review per row. I keep running into two problems:

  1. I can't seem to strip out the HTML and get just the text (i.e. the written review and nothing else).
  2. There is a lot of odd whitespace between and within the review text (e.g. a blank line between rows, and so on).

Thanks for any help!

My code so far:

    #! python3
    
    import bs4, os, requests, csv
    
    # Get URL of the page
    
    URL = ('https://www.tripadvisor.com/Attraction_Review-g294265-d2149128-Reviews-Gardens_by_the_Bay-Singapore.html')
    
    # Looping until the 5th page of reviews
    
    pagecounter = 0
    while pagecounter != 5:
    
        # Request get the first page
        res = requests.get(URL)
        res.raise_for_status
    
        # Download the html of the first page
        soup = bs4.BeautifulSoup(res.text, "html.parser")
        reviewElems = soup.select('.partial_entry')
    
    
        if reviewElems == []:
            print('Could not find clue.')
    
        else:
            #for i in range(len(reviewElems)):
                #print(reviewElems[i].getText())
    
            with open('GardensbytheBay.csv', 'a', newline='') as csvfile:
    
                for row in reviewElems:
                    writer = csv.writer(csvfile, delimiter=' ', quoting=csv.QUOTE_ALL)
                    writer.writerow(row)
                print('Writing page')
    
        # Find URL of next page and update URL
        if pagecounter == 0:
            nextLink = soup.select('a[data-offset]')[0]
    
        elif pagecounter != 0:
            nextLink = soup.select('a[data-offset]')[1]
    
        URL = 'http://www.tripadvisor.com' + nextLink.get('href')
        pagecounter += 1
    
    print('Download complete')
    csvfile.close()
    

1 Answer:

Answer 0 (score: 1)

You can use row.get_text(strip=True) to get just the text of each selected p.partial_entry. Try the following:

import bs4, os, requests, csv

# Get URL of the page
URL = ('https://www.tripadvisor.com/Attraction_Review-g294265-d2149128-Reviews-Gardens_by_the_Bay-Singapore.html')

with open('GardensbytheBay.csv', 'w', newline='', encoding='utf-8') as csvfile:
    writer = csv.writer(csvfile)

    # Loop through the first 5 pages of reviews
    for pagecounter in range(5):

        # Request the current page (note the parentheses on
        # raise_for_status -- without them the line is a no-op)
        res = requests.get(URL)
        res.raise_for_status()

        # Parse the HTML and select the review paragraphs
        soup = bs4.BeautifulSoup(res.text, "html.parser")
        reviewElems = soup.select('p.partial_entry')

        if reviewElems:
            for row in reviewElems:
                review_text = row.get_text(strip=True)
                writer.writerow([review_text])
            print('Writing page', pagecounter + 1)
        else:
            print('Could not find clue.')

        # Find URL of next page and update URL
        if pagecounter == 0:
            nextLink = soup.select('a[data-offset]')[0]
        else:
            nextLink = soup.select('a[data-offset]')[1]

        URL = 'http://www.tripadvisor.com' + nextLink.get('href')

print('Download complete')
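To see concretely why get_text(strip=True) addresses both of the original problems (leftover HTML and stray whitespace), here is a minimal, self-contained comparison; the HTML fragment is invented for illustration:

```python
import bs4

# A made-up fragment standing in for one TripAdvisor review paragraph
html = '<p class="partial_entry">\n  Amazing gardens!\n</p>'
soup = bs4.BeautifulSoup(html, "html.parser")
elem = soup.select('p.partial_entry')[0]

print(str(elem))                   # raw tag: HTML markup plus the stray newlines
print(elem.get_text())             # text only, but surrounding whitespace survives
print(elem.get_text(strip=True))   # 'Amazing gardens!' -- clean text, nothing else
```

str(elem) keeps the markup, get_text() keeps the surrounding newlines and indentation, and get_text(strip=True) returns only the clean review text.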
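A second pitfall in the original code is writer.writerow(row): csv.writer.writerow() expects a sequence of fields, so a bare string is iterated character by character, and a BeautifulSoup Tag is iterated over its child nodes rather than its text. Wrapping the text in a one-element list writes the whole review as a single cell. A small stdlib-only sketch:

```python
import csv, io

review = 'Great place, would visit again'

# Write to an in-memory buffer instead of a file, just for the demo
buf = io.StringIO()
writer = csv.writer(buf)

# Passing the bare string: every character becomes its own CSV field
writer.writerow(review)
# Wrapping it in a list: the whole review is one field on one row
writer.writerow([review])

print(buf.getvalue())
```

The first output row is a comma-separated spray of single characters; the second is the review as one properly quoted cell.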