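Counting occurrences of elements from a list in a string?

Time: 2015-10-06 20:22:04

Tags: python string for-loop text

I'm trying to count the number of verbal contractions in some speeches I've collected. One particular speech looks like this: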

speech = "I've changed the path of the economy, and I've increased jobs in our own
home state. We're headed in the right direction - you've all been a great help."

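So, in this case, I'd like to count four (4) contractions. I have a list of contractions (stored as a dictionary), and here are the first few entries: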

contractions = {"ain't": "am not; are not; is not; has not; have not",
"aren't": "are not; am not",
"can't": "cannot",...}

My code looks something like this, to begin with:

count = 0
for word in speech:
    if word in contractions:
        count = count + 1
print count
However, I'm not getting anywhere with this, because the code iterates over every letter instead of over whole words.

3 Answers:

Answer 0: (score: 5)

Use str.split() to split your string on whitespace:

for word in speech.split():

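This splits on arbitrary whitespace; that means spaces, tabs, newlines, and a few more exotic whitespace characters, and any number of them in a row.

You probably also need to lowercase your words with str.lower() (otherwise Ain't won't be found), and remove punctuation: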

from string import punctuation

count = 0
for word in speech.lower().split():
    word = word.strip(punctuation)
    if word in contractions:
        count += 1

I'm using the str.strip() method here; it removes everything found in the string.punctuation string from the start and end of a word.
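As a rough check of the approach above, here is a minimal sketch that wraps the loop in a function and runs it on the speech from the question. The helper name `count_contractions` and the three-entry `known` dictionary are made up for illustration; the asker's real dictionary is larger.

```python
from string import punctuation

def count_contractions(text, known):
    """Count the words in text that appear as keys in known (hypothetical helper)."""
    count = 0
    for word in text.lower().split():
        # strip() only trims the two ends of the word, so the apostrophe
        # inside "we're" is kept while a trailing "," or "." is removed.
        if word.strip(punctuation) in known:
            count += 1
    return count

# Illustrative stand-in for the asker's dictionary; the values are assumptions.
known = {"i've": "I have", "we're": "we are", "you've": "you have"}
speech = ("I've changed the path of the economy, and I've increased jobs in our own "
          "home state. We're headed in the right direction - you've all been a great help.")

print(count_contractions(speech, known))  # 4
```

One caveat: because the apostrophe is itself part of string.punctuation, a word that starts or ends with one (for example 'tis) would lose it and no longer match its dictionary key.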

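Answer 1: (score: 1)

You are iterating over a string, so the items are characters. To get the words out of a string, you can use a naive method such as str.split(), which does this for you: you can then iterate over a list of strings (the words, split on the argument passed to str.split(), which by default splits on whitespace). There is even re.split(), which is more powerful, but I don't think you need regular expressions to split your text.

At the very least, you should lowercase your string with str.lower(), or else put every possible variant (including capitalized ones) into the dictionary. I strongly recommend the first alternative; the latter isn't really practical. Removing punctuation is also a must. But this is still naive. If you need a more sophisticated approach, you'll have to split the text with a word tokenizer. NLTK is a good starting point; see the nltk tokenizer. But I strongly feel that this isn't your main problem and won't keep you from actually solving your question. :)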

speech = """I've changed the path of the economy, and I've increased jobs in our own home state. We're headed in the right direction - you've all been a great help."""
# Maybe this dict makes more sense (list items as values). But for your question it doesn't matter.
contractions = {"ain't": ["am not", "are not", "is not", "has not", "have not"], "aren't": ["are not", "am not"], "i've": ["i have", ]} # ...

# with re you can define advanced regexes, but maybe
# `from string import punctuation` (the suggestion from Martijn Pieters' answer)
# is still enough for you
import re

def abbreviation_counter(input_text, abbreviation_dict):   
    count = 0
    # what you want is a list of words. str.split() does this job for you.
    # Passing " " splits on single spaces only; calling split() with no argument
    # splits on any run of whitespace instead. If you really need better
    # methods (see the answer text above), you have to use a word tokenizer
    # or write your own.
    for word in input_text.split(" "):
        # and also clean word (remove ',', ';', ...) afterwards. The advantage of 
        # using re over `from string import punctuation` is that you have more
        # control in what you want to remove. That means that you can add or
        # remove easily any punctuation mark. It could be very handy. It could be
        # also overpowered. If the latter is the case, just stick to Martijn Pieters
        # solution.
        if re.sub(',|;', '', word).lower() in abbreviation_dict:
            count += 1

    return count

print(abbreviation_counter(speech, contractions))
# 2  (yeah, it worked - I've included I've in your list :)

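Answering at the same time as Martijn Pieters is a little bit frustrating ;) But I hope I've still created some value for you. That's why I edited my answer to give you some hints for future work.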
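The regex-based tokenizing this answer alludes to could look roughly like the sketch below. The pattern and the names `WORD_RE` and `count_with_regex` are assumptions made for illustration, not something from the original answers: instead of splitting and then cleaning each word, it pulls out runs of letters that may contain one inner apostrophe.

```python
import re

# Assumed token pattern: letters, optionally followed by an apostrophe and more letters.
# It only knows the plain ASCII apostrophe, so curly quotes would need extra handling.
WORD_RE = re.compile(r"[a-z]+(?:'[a-z]+)?")

def count_with_regex(text, known):
    """Count tokens of text (lowercased) that are keys in known (hypothetical helper)."""
    tokens = WORD_RE.findall(text.lower())
    return sum(1 for token in tokens if token in known)

# Illustrative entries only; the values are not used for counting.
known = {"i've": ["i have"], "we're": ["we are"], "can't": ["cannot"]}

print(count_with_regex("I've tried; we're stuck (can't fix it).", known))  # 3
```

Because punctuation never becomes part of a token, no separate stripping step is needed. For anything beyond simple English text, a real tokenizer such as the one in NLTK is probably the safer choice.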

Answer 2: (score: 0)

A for loop in Python iterates over all the elements of an iterable. In the case of a string, the elements are its characters.
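A tiny sketch of that difference, using a made-up three-word sample: looping over the string itself yields single characters, which can never equal a key such as "i've", while looping over the result of split() yields whole words.

```python
sample = "I've increased jobs"  # hypothetical sample text

# Iterating over the string visits one character at a time.
first_four = [ch for ch in sample][:4]
print(first_four)  # ['I', "'", 'v', 'e']

# Iterating over sample.split() visits whole words instead.
words = [word for word in sample.split()]
print(words)  # ["I've", 'increased', 'jobs']
```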

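You need to split your string into a list (or tuple) of strings containing the words. You can use .split(delimiter).

Your problem is very common, so Python has a shortcut: speech.split() splits on any number of spaces/tabs/newlines, so you only get your words in the list.

So your code should look like this: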

count = 0
for word in speech.split():
    if word in contractions:
        count = count + 1
print(count)

speech.split(" ") works too, but it splits only on spaces, not on tabs or newlines, and if there are double spaces you get empty elements in the resulting list.
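The difference between the two calls is easy to see on a small made-up string that contains a double space and a tab:

```python
sample = "two  spaces\tand a tab"  # hypothetical input: double space plus a tab

by_space = sample.split(" ")
by_whitespace = sample.split()

print(by_space)       # ['two', '', 'spaces\tand', 'a', 'tab']
print(by_whitespace)  # ['two', 'spaces', 'and', 'a', 'tab']
```

With split(" "), the double space produces an empty string and the tab stays glued between two words, so neither "spaces" nor "and" would be found by a dictionary lookup.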