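Counter that returns the 10 most common words

Time: 2019-06-12 08:42:03

Tags: python web-scraping counter

I'm trying to build a scraper that collects the 10 most common words, along with their counts, from each blog post, but I'm running into a problem with Counter.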

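The scraped data has to go into a database. A REST API should then return the following statistics as JSON documents:

The 10 most common words with their counts, available at the address /stats/

The 10 most common words with their counts per author, available at the address /stats/ /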
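To make the target concrete, here is a minimal sketch of how the two JSON documents could be shaped once the words have been counted. The function names `stats_payload` and `author_stats_payload` and the `{word: count}` layout are assumptions for illustration; the task description only says the statistics are returned as JSON.

```python
import json
from collections import Counter


def stats_payload(word_counts, top_n=10):
    """Serialize the top_n most common words as a {word: count} JSON object."""
    # most_common() returns (word, count) pairs, highest count first
    return json.dumps(dict(word_counts.most_common(top_n)))


def author_stats_payload(counts_by_author, author, top_n=10):
    """Same document, restricted to one author (hypothetical lookup by name)."""
    # An unknown author yields an empty JSON object rather than an error
    return stats_payload(counts_by_author.get(author, Counter()), top_n)
```

A web framework view for /stats/ would then only need to return the string produced by `stats_payload`.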

I tried the following Counter code:

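from collections import Counter

# split() returns a list of all the words in the string
# (contents must be a single string here, not a list of tags)
split_it = contents.split()

# Pass the split_it list to an instance of the Counter class.
# Use a different name so the Counter class itself is not overwritten.
word_counts = Counter(split_it)

# most_common(k) returns the k most frequent
# input values and their respective counts.
most_occur = word_counts.most_common(10)

print(most_occur)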
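For comparison, a minimal self-contained sketch of the counting step. It assumes the scraped posts are available as a list of plain strings, and the helper name `top_words` is hypothetical. Tokenizing with a regular expression instead of `split()` is a swapped-in choice so that punctuation does not stick to the words.

```python
import re
from collections import Counter


def top_words(texts, n=10):
    """Count words across a list of strings and return the n most common."""
    word_counts = Counter()  # a distinct name, so the Counter class stays usable
    for text in texts:
        # \w+ keeps letters, digits and underscores and drops punctuation
        word_counts.update(re.findall(r"\w+", text.lower()))
    return word_counts.most_common(n)


sample = ["The cat sat.", "The dog sat, the end."]
print(top_words(sample, 2))  # [('the', 3), ('sat', 2)]
```

Lowercasing before counting means "The" and "the" are treated as the same word; drop `.lower()` if case should be preserved.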

Below is my entire scraper:

import requests
from bs4 import BeautifulSoup as bs
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from collections import Counter
import psycopg2


url = 'https://teonite.com/blog/page/{}/index.html'
all_links = []


headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
    'User-Agent': 'Mozilla/5.0'
}
with requests.Session() as s:
    r = s.get('https://teonite.com/blog/')
    soup = bs(r.content, 'lxml')
    article_links = ['https://teonite.com' + item['href'][2:] for item in soup.select('.post-content a')]
    all_links.append(article_links)
    num_pages = int(soup.select_one('.page-number').text.split('/')[1])


    for page in range(2, num_pages + 1):
        r = s.get(url.format(page))
        soup = bs(r.content, 'lxml')
        article_links = ['https://teonite.com' + item['href'][2:] for item in soup.select('.post-content a')]
        all_links.append(article_links)



    all_links = [item for i in all_links for item in i]

    d = webdriver.Chrome(ChromeDriverManager().install())

    contents = []
    authors = []

    for article in all_links:
        d.get(article)
        soup = bs(d.page_source, 'lxml')
        # Drop non-visible elements before extracting any text
        for t in soup(['style', 'script', 'head', 'title']):
            t.extract()
        # Store plain text rather than Tag objects so the words can be counted later
        content = soup.find('section', attrs={'class': 'post-content'})
        contents.append(content.get_text(' ', strip=True) if content else '')
        author = soup.find('span', attrs={'class': 'author-content'})
        authors.append(author.get_text(strip=True) if author else '')

        try:
            print(soup.select_one('.post-title').text)
        except AttributeError:  # select_one() returned None
            print(article)
            print(soup.select_one('h1').text)
            break  # for debugging

    # Deduplicate once, after the loop has collected everything
    unique_authors = list(set(authors))
    unique_contents = list(set(contents))
    d.quit()

    # POSTGRESQL CONNECTION
    # 1. Connect to the hosted PostgreSQL database using psycopg2

    import psycopg2

    hostname = 'balarama.db.elephantsql.com'
    username = 'yagoiucf'
    password = 'jXoWg8Hc8Ftxxxxxxxxxxxxxxxxxxxxxxxxxo'
    database = 'yagoiucf'

    # Reuse the settings above instead of repeating them as literals
    conn = psycopg2.connect(host=hostname, user=username,
                            password=password, dbname=database)
    conn.close()
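The step still missing from the scraper is getting the per-author counts into the database. A rough sketch of one way to do it follows, assuming `authors` and `contents` are parallel lists of plain strings. The table `word_stats` with columns `author`, `word` and `count` is hypothetical, as are the helper names; `conn` is expected to be an open psycopg2 connection.

```python
from collections import Counter, defaultdict

# Hypothetical table layout: word_stats(author TEXT, word TEXT, count INTEGER)
INSERT_SQL = "INSERT INTO word_stats (author, word, count) VALUES (%s, %s, %s)"


def rows_per_author(authors, contents, top_n=10):
    """Build (author, word, count) rows with each author's top_n words."""
    counts = defaultdict(Counter)
    for author, text in zip(authors, contents):
        counts[author].update(text.lower().split())
    return [(author, word, n)
            for author, author_counts in counts.items()
            for word, n in author_counts.most_common(top_n)]


def save_rows(conn, rows):
    """Insert the rows through a DB-API connection such as psycopg2's."""
    with conn.cursor() as cur:
        cur.executemany(INSERT_SQL, rows)
    conn.commit()
```

Calling `save_rows(conn, rows_per_author(authors, contents))` before `conn.close()` would store the rows, provided the table already exists.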

0 answers:

No answers yet.