Python program hangs at random

Date: 2018-07-26 07:49:23

Tags: python python-3.x web-scraping beautifulsoup

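I am doing some web scraping in Python with BeautifulSoup. It involves visiting about 500 similar web pages and writing their data to a .txt file.

However, I have run into a few problems:

  • My program's CPU usage (checked in Task Manager) randomly drops to 0% and stays there.
  • My command prompt itself becomes unresponsive, because Ctrl+C does not terminate the program.
  • This seems to happen at random, anywhere between the 8th and the 480th web page.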

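    # Assumed imports; the original post does not show them.
    # containers, animeFile, and Anime are defined elsewhere in the asker's script.
    from urllib.request import urlopen as uReq
    from bs4 import BeautifulSoup as soup

    def getAnime():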
    
      for index in range(2, 502):
    
          # gets anime statistics from HTML
          container = containers[index]
          ranking = container.td.text
          name = container.findAll('td', {'class': 't'})
          link = 'https://www.animenewsnetwork.com' + name[0].a['href']
          name = name[0].text
          statistics = container.findAll('td', {'class': 'r'})
          rating = statistics[0].text
          numVotes = statistics[1].text
    
          # prints out anime stats to file
          currentAnime = Anime(name, ranking, rating, numVotes, link)
          animeFile.write('\nname: ' + name)
          animeFile.write('\nlink: ' + link)
          animeFile.write('\nranking: ' + ranking)
          animeFile.write('\nrating: ' + rating)
          animeFile.write('\nvotes: ' + numVotes)
    
          # Goes to the webpage for the current anime
          animeClient = uReq(link)
          animeHTML = animeClient.read()
          animeClient.close()
          pageSoup = soup(animeHTML, 'html.parser')
    
          # Genres of the current anime
          try:
              genreDiv = pageSoup.find(id='infotype-30')
              genres = genreDiv.findAll('span')
              genreList = []
              for genre in genres:
                  genreList.append(genre.a.text)
              currentAnime.genres = genreList
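          except AttributeError:  # a find() returned None for this page
              currentAnime.genres = ['unknown']  # list, so the write loop does not split the string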
    
          # Themes of the current anime
          try:
              themes = pageSoup.find(id='infotype-31').findAll('span')
              themeList = []
              for theme in themes:
                  themeList.append(theme.a.text)
              currentAnime.themes = themeList
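          except AttributeError:  # a find() returned None for this page
              currentAnime.themes = ['unknown']  # list, so the write loop does not split the string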
    
          # Premiere date of the current anime
          try:
              date = pageSoup.find(id='infotype-9').div.text
              currentAnime.premiereDate = date
          except AttributeError:  # a find() returned None for this page
              currentAnime.premiereDate = 'unknown'
    
          # Director of the current anime
          try:
              director = pageSoup.find('b', text='Director').parent.a.text
              currentAnime.director = director
          except AttributeError:  # a find() returned None for this page
              currentAnime.director = 'unknown'
    
          # Production Studio of the current anime
          try:
              productionStudio = pageSoup.find('b', text='Production').parent.a.text
              currentAnime.studio = productionStudio
          except AttributeError:  # a find() returned None for this page
              currentAnime.studio = 'unknown'
    
          # Prints the genres
          animeFile.write('\ngenres: ')
          for genre in currentAnime.genres:
              animeFile.write(genre + ', ')
          # Prints the themes
          animeFile.write('\nthemes: ')
          for theme in currentAnime.themes:
              animeFile.write(theme + ', ')
          # Prints the premiere date, director, and studio
          animeFile.write('\npremiere date: ' + currentAnime.premiereDate)
          animeFile.write('\ndirector: ' + currentAnime.director)
          animeFile.write('\nproduction studio: ' + currentAnime.studio)
    
          animeFile.write('\n')
    

1 answer:

Answer 0 (score: 3)

Ctrl+C just sends a KeyboardInterrupt to Python. That means you are probably only going up one exception level while the HTTP requests for BeautifulSoup are being made. Ctrl+Break will stop the program completely.
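
To illustrate what "going up one exception level" can look like, here is a minimal sketch of how a KeyboardInterrupt may be absorbed by an enclosing handler instead of ending the program. The function names are hypothetical and only stand in for whatever code happens to be running when Ctrl+C is pressed.

```python
def swallow_with_bare_except(action):
    """Run action(); a bare except also catches KeyboardInterrupt."""
    try:
        action()
    except:  # catches every BaseException subclass, including KeyboardInterrupt
        return 'swallowed'
    return 'finished'


def swallow_with_except_exception(action):
    """Run action(); KeyboardInterrupt is not an Exception subclass, so it propagates."""
    try:
        action()
    except Exception:
        return 'swallowed'
    return 'finished'


def press_ctrl_c():
    # Hypothetical stand-in for the user pressing Ctrl+C while action() runs.
    raise KeyboardInterrupt
```

With a bare `except:` the interrupt is treated like any other error and the loop carries on, while `except Exception:` lets it travel up and stop the script.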

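Your script is probably running into a web page that does not respond. Your CPU usage sits at 0% because the process is just waiting on the web server. I would suggest printing the value of link in your code before each call to uReq, to track down where this is happening.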
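
The suggestion can be sketched roughly as follows, assuming uReq is `urllib.request.urlopen` (the post does not show its imports). The `fetch` helper, its 10-second default, and the `opener` parameter are hypothetical additions. The timeout goes one step beyond the answer, so that a stalled page raises an error instead of blocking forever.

```python
from urllib.request import urlopen


def fetch(link, timeout=10, opener=urlopen):
    """Return the page body as bytes, or None if the request fails or times out."""
    # Print the link first, so the last line of output names the page that stalled.
    print('fetching:', link, flush=True)
    try:
        # timeout is in seconds; without it, urlopen can block indefinitely.
        with opener(link, timeout=timeout) as client:
            return client.read()
    except OSError as err:
        # URLError, HTTPError, and socket timeouts are all OSError subclasses.
        print('skipped:', link, '->', err, flush=True)
        return None
```

In the question's loop, `animeHTML = fetch(link)` could stand in for the three uReq lines, followed by a `continue` when it returns None. The `opener` argument exists only so the helper can be exercised without network access.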