How do I find the total number of positive and negative words in a text?

Asked: 2014-02-28 11:42:29

Tags: python file file-io web-crawler scrapy

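I want to find the total number of positive and negative words matched in a given text. I have a list of positive words in positive.txt and a list of negative words in negative.txt. If a word matches the positive word list, I want a simple integer variable to be incremented by 1, and likewise for words that match the negative list. The code below gives me the paragraphs under @class="story-hed". This is the text I want to compare against the positive and negative word lists to get the total counts. My code is: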

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from dawn.items import DawnItem

class dawnSpider(BaseSpider):
   name = "dawn"
   allowed_domains = ["dawn.com"]
   start_urls = [
       "http://dawn.com/"
   ]

   def parse(self, response):

      hxs = HtmlXPathSelector(response)      
      sites = hxs.select('//h3[@class="story-hed"]//a/text()').extract()
      items=[]

      for site in sites:
         item=DawnItem()
         item['title']=site
         items.append(item)
      return items

4 answers:

Answer 0 (score: 4)

The following standalone code solves the problem:

from collections import Counter

def readwords(filename):
    # Read one word per line; the with block closes the file when done
    with open(filename) as f:
        return [line.rstrip() for line in f]

positive = readwords('positive.txt')
negative = readwords('negative.txt')

paragraph = 'this is really bad and in fact awesome. really awesome.'

count = Counter(paragraph.split())

pos = 0
neg = 0
for key, val in count.iteritems():
    key = key.rstrip('.,?!\n') # removing possible punctuation signs
    if key in positive:
        pos += val
    if key in negative:
        neg += val

print pos, neg

Here is what I have in the two input files:

positive.txt:

good 
awesome

negative.txt:

bad
ugly

and the output is: 2 1
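The same counting idea can also be packaged as a function. The following is only a sketch: the name count_sentiment is hypothetical, the word lists are hard-coded stand-ins for the two files, and it swaps rstrip for a regular expression so that case and surrounding punctuation are both ignored.

```python
import re

def count_sentiment(text, positive, negative):
    """Return (pos, neg): how many words of text appear in each word list."""
    positive, negative = set(positive), set(negative)
    # Lowercase the text and keep only runs of letters/apostrophes,
    # which drops punctuation such as '.', ',' and '!'
    words = re.findall(r"[a-z']+", text.lower())
    pos = sum(1 for w in words if w in positive)
    neg = sum(1 for w in words if w in negative)
    return pos, neg

paragraph = 'this is really bad and in fact awesome. really awesome.'
# Hard-coded lists stand in for the contents of positive.txt / negative.txt
print(count_sentiment(paragraph, ['good', 'awesome'], ['bad', 'ugly']))  # (2, 1)
```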

To implement this in Scrapy, you will probably want to use an item pipeline: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
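A pipeline along those lines might look roughly like the sketch below. Scrapy calls process_item(item, spider) on each enabled pipeline for every scraped item. Everything else here is an assumption for illustration: the class name SentimentCountPipeline, the 'pos' and 'neg' fields (which would have to be declared on DawnItem), and the hard-coded default word lists standing in for the two files.

```python
import re

class SentimentCountPipeline(object):
    """Hypothetical pipeline that adds positive/negative counts to each item."""

    def __init__(self, positive=('good', 'awesome'), negative=('bad', 'ugly')):
        # Assumption: in a real project these would be read from
        # positive.txt and negative.txt instead of being hard-coded
        self.positive = set(positive)
        self.negative = set(negative)

    def process_item(self, item, spider):
        # Split the scraped title into lowercase words, ignoring punctuation
        words = re.findall(r"[a-z']+", item['title'].lower())
        item['pos'] = sum(1 for w in words if w in self.positive)
        item['neg'] = sum(1 for w in words if w in self.negative)
        return item

# Stand-alone check with a plain dict in place of a DawnItem
demo = SentimentCountPipeline().process_item(
    {'title': 'Awesome day, bad ugly weather'}, None)
print(demo['pos'], demo['neg'])
```

The class would still need to be listed in the project's ITEM_PIPELINES setting before Scrapy runs it.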

Answer 1 (score: 0)

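First, you need to read the files. Assuming there is one word per line, you can read all the words with:

positive = [l.strip() for l in open("positive.txt")]

Once that is done, you can create a dict that stores each word as a key and its count as the value. To initialize the dictionary with zeros, you can use:

positive_count = dict.fromkeys(positive, 0)

Next, iterate over all the items and increment the count whenever the word is found:

# items is assumed to be an iterable of words taken from the text
for item in items:
    if item in positive_count:
        positive_count[item] += 1

Finally, you can print the result:

for item, value in positive_count.items():
    print("Word %s count %d" % (item, value))

The negative list is handled the same way; it is omitted to keep the answer short.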
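As a variation on the per-word tally, collections.Counter can build the same dictionary in one step. This is only a sketch: the word set and the sample sentence are made-up stand-ins for positive.txt and the scraped text.

```python
from collections import Counter

positive = {'good', 'awesome'}   # stand-in for the contents of positive.txt
words = 'this is good , really awesome and awesome'.split()

# Count only the words that appear in the positive list
positive_count = Counter(w for w in words if w in positive)

for word, count in sorted(positive_count.items()):
    print("Word %s count %d" % (word, count))

# The overall total is the sum of the per-word counts
total = sum(positive_count.values())
print(total)
```

Unlike dict.fromkeys, the Counter only holds words that actually occurred, so words with a count of zero are not listed.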

Answer 2 (score: 0)

It depends on the size of the word lists. If they are small (less than a few KB), read them into a list:

with open(positive_wordlist_file_name) as fd:
  positive_words = [line.strip() for line in fd]
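If the lists are larger, a set may be a better fit than a list, because the "in" operator on a set does not have to scan every element. A minimal sketch, assuming one word per line; the helper name load_words is hypothetical, and a temporary file stands in for positive.txt:

```python
import os
import tempfile

def load_words(path):
    """Read one word per line into a set, skipping blank lines."""
    with open(path) as fd:
        return {line.strip().lower() for line in fd if line.strip()}

# Demo with a throwaway file standing in for positive.txt
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('good \nawesome\n\n')
positive_words = load_words(tmp.name)
os.remove(tmp.name)
print(sorted(positive_words))  # ['awesome', 'good']
```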

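Once you have both word lists, you can work through the text with them, line by line if possible. Split each line into words, then check them against the lists with the "in" operator. I would use a couple of coroutines in a class: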

class WordCounter:
  def __init__(self, positive_words, negative_words):
    # Word lists are stored as sets so the "in" test stays cheap
    self.positive_words = set(positive_words)
    self.negative_words = set(negative_words)
    self.positive_count = 0
    self.negative_count = 0

  def positive_word_counter(self):
    """Co-routine that counts positive words."""
    while True:
      words = yield
      self.positive_count += sum(1 for word in words if word in self.positive_words)

  def negative_word_counter(self):
    """Co-routine that counts negative words."""
    while True:
      words = yield
      self.negative_count += sum(1 for word in words if word in self.negative_words)

  def read_text(self, text):
    """text: some iterable of lines, e.g. a file handle or a list."""
    # Splits on whitespace only; expand this with other word separators,
    # or use re.split with a word boundary instead
    line_words = (line.strip().split() for line in text)
    # Create and prime the coroutines
    positive_counter = self.positive_word_counter()
    next(positive_counter)
    negative_counter = self.negative_word_counter()
    next(negative_counter)
    # Now feed each line's words to both coroutines
    for words in line_words:
      positive_counter.send(words)
      negative_counter.send(words)
    # positive_count and negative_count can now be read from this object
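For readers unfamiliar with generator-based coroutines, the priming and send() steps can be seen in isolation in the sketch below. The function name match_counter, the totals dict, and the sample words are all hypothetical and only serve the illustration.

```python
def match_counter(wordlist, totals, key):
    """Coroutine: receives lists of words via send() and tallies matches."""
    wordset = set(wordlist)
    totals[key] = 0
    while True:
        words = yield              # paused here until send() delivers a value
        totals[key] += sum(1 for w in words if w in wordset)

totals = {}
counter = match_counter(['good', 'awesome'], totals, 'pos')
next(counter)                      # prime: run up to the first yield
counter.send(['really', 'awesome'])
counter.send(['good', 'bad', 'good'])
counter.close()                    # stop the coroutine cleanly
print(totals['pos'])               # 3
```

The call to next() is needed because a generator cannot accept a value through send() until it has reached its first yield.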

Answer 3 (score: 0)

for key, val in count.iteritems(): ==> this only works on Python versions below 3. If you are using Python 3 or above, use:

for key, val in count.items():
    key = key.rstrip('.,?!\n') # removing possible punctuation signs
    if key in positive:
        pos += val
    if key in negative:
        neg += val