Python web crawler: crawl through links and find specific words

Question

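I am trying to write a web crawler that goes into each chapter of a statute title and counts the occurrences of a set of keywords ("shall", "must") in its content.

Below is the code I use to get the section links for each chapter. The base URL I am using is http://law.justia.com/codes/georgia/2015/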

import requests
from bs4 import BeautifulSoup, SoupStrainer
import re
from collections import Counter

pattern1 = re.compile(r"\bshall\b",re.IGNORECASE)
pattern2 = re.compile(r"\bmust\b",re.IGNORECASE)


########################################Sections##########################
def levelthree(item2_url):
    r = requests.get(item2_url)
    for sectionlinks in BeautifulSoup((r.content), "html.parser", parse_only=SoupStrainer('a')):
        if sectionlinks.has_attr('href'):
            if 'section' in sectionlinks['href']:
                href = "http://law.justia.com" + sectionlinks.get('href')
                href = "\n" + str(href)
                print (href)



########################################Chapters##########################
def leveltwo(item_url):
    r = requests.get(item_url)
    for sublinks in BeautifulSoup((r.content), "html.parser", parse_only=SoupStrainer('a')):
        if sublinks.has_attr('href'):
            if 'chapt' in sublinks['href']:
                chapterlinks = "http://law.justia.com" + sublinks.get('href')
                # chapterlinks = "\n" + str(chapterlinks)
                #print (chapterlinks)


######################################Titles###############################
def levelone(url):
    r = requests.get(url)
    for links in BeautifulSoup((r.content), "html.parser", parse_only=SoupStrainer('a')):
        if links.has_attr('href'):
            if 'title-43' in links['href']:
                titlelinks = "http://law.justia.com" + links.get('href')
                # titlelinks = "\n" + str(titlelinks)
                leveltwo(titlelinks)
                # print (titlelinks)


###########################################################################
base_url = "http://law.justia.com/codes/georgia/2015/"
levelone(base_url)

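The problem is that the page structure is usually Title - Chapter - Section - Content (for example: http://law.justia.com/codes/georgia/2015/title-43/chapter-1/section-43-1-1/)

But some are Title - Chapter - Article - Section - Content (for example: http://law.justia.com/codes/georgia/2015/title-43/chapter-4/article-1/section-43-4-1/)

I am able to get the links for the first scenario. However, I miss every page that follows the Title - Chapter - Article - Section - Content layout.

My question is: how can I code this so that I fetch the content of each chapter (both through chapter-to-section links and through chapter-to-article-to-section links) and then count the occurrences of the words (e.g. "shall" or "must") for each chapter separately?

I would like to get the word frequency per chapter. Ideally the output would look like this: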

Chapter 1

Word     Frequency
shall     35
must      3

Chapter 2

Word     Frequency
shall     59
must      14

Answer 1

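For this problem, count the "/" characters in the URL. The first URL below contains 9 of them and the second contains 10:

http://law.justia.com/codes/georgia/2015/title-43/chapter-1/section-43-1-1/
http://law.justia.com/codes/georgia/2015/title-43/chapter-4/article-1/section-43-4-1/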

if url.count('/') == 9:
    pass  # title/chapter/section page: do something
elif url.count('/') == 10:
    pass  # title/chapter/article/section page: do something
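
One caveat with counting slashes alone: an article index page (assumed here to look like .../title-43/chapter-4/article-1/, which is not shown in the question) would also contain 9 slashes, so it probably helps to check the last path component as well. A minimal sketch, assuming absolute URLs with a trailing slash as in the two examples; `classify_section_url` is a hypothetical helper name, not something from the question:

```python
SECTION_DEPTH = 9            # .../title-43/chapter-1/section-43-1-1/
ARTICLE_SECTION_DEPTH = 10   # .../title-43/chapter-4/article-1/section-43-4-1/


def classify_section_url(url):
    """Return the layout of a section URL, or None if it is not a section page."""
    # The last non-empty path component must be a section, otherwise an
    # article index page with the same slash count would be misclassified.
    last_part = url.rstrip('/').split('/')[-1]
    if not last_part.startswith('section-'):
        return None

    depth = url.count('/')
    if depth == SECTION_DEPTH:
        return 'chapter/section'
    if depth == ARTICLE_SECTION_DEPTH:
        return 'chapter/article/section'
    return None
```

Links for which this returns None can then be treated as candidates to descend into (for example article pages) rather than as content pages.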

Or you can use a simple trick:

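# Drop the trailing '/' so the last element is the section, not an empty string
part = url.rstrip('/').split('/')
title = part[6]     # e.g. 'title-43'
chapter = part[7]   # e.g. 'chapter-1'
section = part[-1]  # e.g. 'section-43-1-1'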

Note: -1 refers to the last part.
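
If fixed positions feel fragile, the parts can also be picked out by their prefix, which handles the optional article level without special-casing. This is only a sketch, assuming every component is named `title-…`, `chapter-…`, `article-…` or `section-…` as in the two example URLs; `parse_statute_url` is a hypothetical helper name:

```python
from urllib.parse import urlparse


def parse_statute_url(url):
    """Split a statute URL into its title, chapter, article and section parts.

    Parts that are absent from the URL (for example the article) stay None.
    """
    # Keep only the non-empty path components, e.g.
    # ['codes', 'georgia', '2015', 'title-43', 'chapter-4', 'article-1', 'section-43-4-1']
    parts = [p for p in urlparse(url).path.split('/') if p]

    info = {'title': None, 'chapter': None, 'article': None, 'section': None}
    for part in parts:
        for key in info:
            if part.startswith(key + '-'):
                info[key] = part
    return info
```

The `chapter` value can then serve as the key under which the counts for every section in that chapter are accumulated.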

To count "shall" or "must", use the same count function:

shall_count = response_text.count('shall')
must_count = response_text.count('must')
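
Keep in mind that `str.count` is case-sensitive and also matches inside longer words (the "shall" in "Marshall", the "must" in "mustard"). The question already defines case-insensitive word-boundary patterns, so a per-chapter tally could be built on those instead. A rough sketch, assuming you have already collected `(chapter, section_text)` pairs from the crawl; `count_keywords`, `tally_by_chapter` and `format_report` are hypothetical helper names:

```python
import re
from collections import Counter, defaultdict

KEYWORDS = ("shall", "must")
# Same idea as pattern1/pattern2 in the question: whole words, any case.
PATTERNS = {
    word: re.compile(r"\b%s\b" % re.escape(word), re.IGNORECASE)
    for word in KEYWORDS
}


def count_keywords(text):
    """Count whole-word, case-insensitive occurrences of each keyword."""
    return Counter({word: len(p.findall(text)) for word, p in PATTERNS.items()})


def tally_by_chapter(pages):
    """Sum keyword counts per chapter.

    pages: iterable of (chapter_name, section_text) pairs.
    """
    totals = defaultdict(Counter)
    for chapter, text in pages:
        totals[chapter].update(count_keywords(text))
    return totals


def format_report(totals):
    """Render the totals in the Word / Frequency layout from the question."""
    lines = []
    for chapter, counts in totals.items():
        lines.append(chapter)
        lines.append("")
        lines.append("Word     Frequency")
        for word in KEYWORDS:
            lines.append("%-9s %d" % (word, counts[word]))
        lines.append("")
    return "\n".join(lines)
```

Passing the visible text of each section page (rather than the raw HTML) keeps words inside tags and attributes out of the counts.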
