Python program for word count, average word length, word frequency, and frequency of words starting with each letter

Time: 2018-08-26 18:10:01

Tags: python file dictionary python-3.3 word-count

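I need to write a Python program that analyzes a file and counts:

  • The number of words
  • The average length of a word
  • How many times each word occurs
  • How many words start with each letter of the alphabet

I have code that does the first two things: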

with open(input('Please enter the full name of the file: '),'r') as f:
     w = [len(word) for line in f for word in line.rstrip().split(" ")]
     total_w = len(w)
     avg_w = sum(w)/total_w

print('The total number of words in this file is:', total_w)
print('The average length of the words in this file is:', avg_w)

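But I am not sure how to do the others. Any help is appreciated.

By the way, when I say "how many words start with each letter of the alphabet", I mean how many words start with "A", how many start with "B", how many start with "C", and so on, all the way to "Z".

2 answers:

Answer 0: (score: 0)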

  

That is an interesting challenge you were given. Here is a suggestion for question 3, how many times a word occurs in the text. This code is far from optimal, but it does work.

with open('text.txt', 'r') as doc:
    print('opened txt')
    for words in doc:
        # Split the current line into a list of words
        wordlist = words.split()
        # Compare every word on this line with every other word on the same line
        for numbers in range(len(wordlist)):
            for inner_numbers in range(len(wordlist)):
                if inner_numbers != numbers:
                    if wordlist[numbers] == wordlist[inner_numbers]:
                        print('word: %s == %s' % (wordlist[numbers], wordlist[inner_numbers]))

I also used the file text.txt.

Edit: I noticed that I had forgotten to create the word list, because it was still held in memory.

Answer to question four: once you have a list containing all the words, this is not very hard, because a string can be indexed like a list: string[0] gives its first character, and the same indexing works for each string in a list of strings.
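As a possible alternative to the nested loops, here is a minimal sketch that tallies both word frequency (question 3) and first-letter frequency (question 4) with `collections.Counter` from the standard library. The function name `count_words_and_initials` and the sample sentence are made up for illustration. Words are lowercased and split on whitespace only, so punctuation stays attached to the words.

```python
from collections import Counter


def count_words_and_initials(text):
    """Return (word_counts, initial_counts) for whitespace-separated words."""
    # str.split() with no argument splits on runs of whitespace and never
    # yields empty strings, so word[0] below is always safe.
    words = text.lower().split()
    word_counts = Counter(words)
    initial_counts = Counter(word[0] for word in words)
    return word_counts, initial_counts


# Hypothetical sample text, used only to show the call
sample = "the cat and the hat and the bat"
word_counts, initial_counts = count_words_and_initials(sample)

print(word_counts.most_common(2))  # the two most frequent words
print(initial_counts["t"])         # number of words starting with "t"
```

To use it on a file, the whole content could be passed in, for example `count_words_and_initials(open('text.txt').read())`, assuming the file is small enough to fit in memory.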

Answer 1: (score: 0)

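There are many ways to do this. A more advanced approach is to first simply collect the text and the words, and then process the data with ML/DS tools, which lets you infer more statistics (such as "a new paragraph mostly starts with word X" or "word X mostly comes before/after word Y").

If you only need very basic statistics, you can collect them while iterating over each word and do the calculations at the end, for example: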
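To give a rough idea of the "word X mostly comes before/after word Y" kind of statistic, here is a small sketch that uses plain counting rather than any ML/DS library. It counts adjacent word pairs with `collections.Counter`. The helper name `following_words` and the sample sentence are assumptions for illustration only.

```python
from collections import Counter


def following_words(words):
    """Count how often each (word, next_word) pair occurs in sequence."""
    # zip pairs every word with its successor; the last word has none.
    return Counter(zip(words, words[1:]))


# Hypothetical sample text
words = "the cat saw the cat and the dog".split()
pairs = following_words(words)

print(pairs[("the", "cat")])  # how often "cat" directly follows "the"
print(pairs.most_common(1))   # the most frequent adjacent pair
```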

stats = {
  'amount': 0,
  'length': 0,
  'word_count': {},
  'initial_count': {}
}

with open('lorem.txt', 'r') as f:
  for line in f:
    line = line.strip()
    if not line:
      continue
    for word in line.split():
      word = word.lower()
      initial = word[0]

      # Add word and length count
      stats['amount'] += 1
      stats['length'] += len(word)

      # Add initial count
      if initial not in stats['initial_count']:
        stats['initial_count'][initial] = 0
      stats['initial_count'][initial] += 1

      # Add word count
      if word not in stats['word_count']:
        stats['word_count'][word] = 0
      stats['word_count'][word] += 1

# Calculate average word length
stats['average_length'] = stats['length'] / stats['amount']
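Since the question asks for a count for every letter from "A" to "Z", one way to report the result is to walk the alphabet and look each letter up, using 0 for letters that never occurred. The sketch below assumes a dictionary shaped like `stats['initial_count']`. The helper `letter_report` and the example counts are hypothetical.

```python
import string


def letter_report(initial_count):
    """Return (letter, count) pairs for a..z, with 0 for unseen letters."""
    return [(letter, initial_count.get(letter, 0))
            for letter in string.ascii_lowercase]


# Hypothetical counts shaped like stats['initial_count']; the "(" entry
# shows that non-letter initials are simply left out of the report.
example_initials = {'l': 2, 'i': 1, 'd': 1, '(': 1}
report = letter_report(example_initials)

for letter, count in report:
    if count:
        print(letter.upper(), count)
```

Note that words beginning with digits or punctuation still end up in the dictionary itself. Only the report restricts itself to the 26 ASCII letters.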

Online demo here