Question

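I need to write a Python file that analyzes a file and counts:

The number of words
The average length of a word
How many times each word occurs
How many words start with each letter of the alphabet

I have code that does the first two things: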

with open(input('Please enter the full name of the file: '),'r') as f:
     w = [len(word) for line in f for word in line.rstrip().split(" ")]
     total_w = len(w)
     avg_w = sum(w)/total_w

print('The total number of words in this file is:', total_w)
print('The average length of the words in this file is:', avg_w)

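But I'm not sure how to do the others. Any help is appreciated.

By the way, when I say "how many words start with each letter of the alphabet", I mean how many words start with "A", how many start with "B", how many start with "C", and so on, all the way to "Z".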

Answer 1

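Interesting challenge you've been given. Here is a suggestion for question 3, how many times a word occurs in the text. This code is far from optimal, but it does work.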


I also used the file text.txt.

Edit: I noticed I had forgotten to create wordlist, because it was already saved in memory.

with open('text.txt', 'r') as doc:
    print('opened txt')
    wordlist = []
    for words in doc:
        # Collect the words from every line, not only the last one
        wordlist.extend(words.split())

# Compare every word with every other word (quadratic, but simple)
for numbers in range(len(wordlist)):
    for inner_numbers in range(len(wordlist)):
        if inner_numbers != numbers:
            if wordlist[numbers] == wordlist[inner_numbers]:
                print('word: %s == %s' % (wordlist[numbers], wordlist[inner_numbers]))

Answer to question four: once you have created a list containing all the words, this is not very difficult. A string can be treated like a list, so string[0] gives you the first letter of a word, and the same indexing works on each string in a list of strings.
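
A shorter way to get the same counts, sketched here on the assumption that the words are already in a list, is collections.Counter from the standard library. The function names and the sample sentence below are made up for illustration; lowercasing the words is also an assumption about how you want to treat "The" versus "the".

```python
from collections import Counter


def word_frequencies(words):
    """Count how often each word occurs, ignoring case."""
    return Counter(word.lower() for word in words)


def initial_frequencies(words):
    """Count how many words start with each character, using word[0]."""
    return Counter(word[0].lower() for word in words if word)


# Sample text used purely for illustration.
sample_words = "the cat and the dog and the bird".split()
word_counts = word_frequencies(sample_words)
initial_counts = initial_frequencies(sample_words)

# Print the words from most to least frequent.
for word, count in word_counts.most_common():
    print(word, count)
```

Counter makes a single pass over the list, so it avoids comparing every word with every other word.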

Answer 2

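There are many ways to achieve this. A more advanced approach is to first simply collect the text and words, then process the data with ML/DS tools, from which you can infer further statistics (such as "a new paragraph mostly starts with word X" or "word X mostly comes before/after word Y").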
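
As a rough idea of what such a "word X comes before word Y" statistic looks like, here is a minimal sketch that needs no ML tooling, only the standard library. The function name and the sample sentence are hypothetical, and it assumes the words have already been collected into a list.

```python
from collections import Counter


def following_word_counts(words):
    """Count each (word, next_word) pair of adjacent words."""
    return Counter(zip(words, words[1:]))


# Sample text used purely for illustration.
sample_words = "the cat saw the dog and the cat ran".split()
pair_counts = following_word_counts(sample_words)

# Show the most common adjacent pairs first.
for (first, second), count in pair_counts.most_common(3):
    print(first, second, count)
```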

If you only need very basic statistics, you can collect them while iterating over each word and do the calculations at the end, for example:

stats = {
  'amount': 0,
  'length': 0,
  'word_count': {},
  'initial_count': {}
}

with open('lorem.txt', 'r') as f:
  for line in f:
    line = line.strip()
    if not line:
      continue
    for word in line.split():
      word = word.lower()
      initial = word[0]

      # Add word and length count
      stats['amount'] += 1
      stats['length'] += len(word)

      # Add initial count
      if initial not in stats['initial_count']:
        stats['initial_count'][initial] = 0
      stats['initial_count'][initial] += 1

      # Add word count
      if word not in stats['word_count']:
        stats['word_count'][word] = 0
      stats['word_count'][word] += 1

# Calculate average word length
stats['average_length'] = stats['length'] / stats['amount']

Online demo here
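
To report the results for every letter from A to Z, including letters that never occur, something like the following sketch could work. The stats dict here is a hand-written sample with the same shape as the one built by the loop, and letter_table is a hypothetical helper name. The guard against an empty file is an added assumption as well.

```python
import string

# Hypothetical sample with the same shape as the stats dict built by the loop.
stats = {
    'amount': 5,
    'length': 20,
    'word_count': {'apple': 2, 'bear': 1, 'and': 1, 'zoo': 1},
    'initial_count': {'a': 3, 'b': 1, 'z': 1},
}


def letter_table(initial_count):
    """Return (letter, count) pairs for a to z, with 0 for letters never seen."""
    return [(letter, initial_count.get(letter, 0))
            for letter in string.ascii_lowercase]


# Avoid dividing by zero when the file contained no words.
average_length = stats['length'] / stats['amount'] if stats['amount'] else 0.0

print('Total words:', stats['amount'])
print('Average word length:', average_length)
for letter, count in letter_table(stats['initial_count']):
    print(letter.upper(), count)
```

Note that initial_count may also contain digits or punctuation if a word starts with one; this table simply ignores those entries.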

Python program for word count, average word length, word frequency, and frequency of words starting with each letter of the alphabet

2 answers: