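How to count the occurrences of a word in Python

Time: 2016-02-18 19:08:55

Tags: python python-2.7

I'm trying to write a Python script that reads a log file and tells us how many times the user bin appears. So far I have this: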

import re

# Open auth.log for reading
myAuthlog = open('auth.log', 'r')
for line in myAuthlog:
    if re.match("(.*)(B|b)in(.*)", line):
        print line

This prints the whole line, for example:

>>> Feb  4 10:43:14 j4-be02 sshd[1212]: Failed password for bin from 83.212.110.234 port 42670 ssh2

But I only want the count, e.g. that the user attempted to log in 26 times.

3 answers:

Answer 0: (score: 1)

import re

count = 0
myAuthlog = open('auth.log', 'r')
for line in myAuthlog:
    if re.match("(.*)(B|b)in(.*)", line):
        count += 1
print count

Answer 1: (score: 0)

Option 1:

If your file is not huge, you can use re.findall and take the length of the resulting list:

count = len(re.findall(your_regex, myAuthlog.read()))

Option 2:

If your file is very large, iterate over the lines in a generator expression and sum up the matches:

count = sum(1 for line in myAuthlog if re.search(your_regex, line))

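Both options assume that you want to count the number of lines that match, as in your example code. Option 1 additionally assumes that the username appears at most once per line.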
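To make the difference between the two options concrete, here is a minimal sketch that runs on an in-memory sample instead of a real auth.log. The sample lines and the pattern `\b[Bb]in\b` are assumptions chosen for illustration; with that pattern, a line that mentions the name twice is counted twice by option 1 but only once by option 2.

```python
import re

# Hypothetical sample standing in for auth.log; the second line mentions "bin" twice.
sample = (
    "Feb  4 10:43:14 host sshd[1212]: Failed password for bin from 192.0.2.1 port 42670 ssh2\n"
    "Feb  4 10:43:15 host sshd[1213]: Invalid user bin, home /bin missing\n"
    "Feb  4 10:43:16 host sshd[1214]: Failed password for root from 192.0.2.1 port 42671 ssh2\n"
)
pattern = r'\b[Bb]in\b'

# Option 1: count every match in the whole text.
match_count = len(re.findall(pattern, sample))

# Option 2: count the lines that contain at least one match.
line_count = sum(1 for line in sample.splitlines() if re.search(pattern, line))
```

If what you need is the number of log entries, the per-line count of option 2 is probably the safer choice.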

A note on your regex:

(.*)(B|b)in(.*) will also match strings such as 'Carabinero'. Consider using word boundaries, i.e. \b(B|b)in\b
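As a quick illustration of that note, the sketch below runs both patterns over a handful of test strings. The word list is made up for the example; only the two patterns come from the thread.

```python
import re

loose = re.compile(r'(.*)(B|b)in(.*)')   # pattern from the question
strict = re.compile(r'\b(B|b)in\b')      # pattern with word boundaries

# Hypothetical test strings
words = ['bin', 'Bin', 'Carabinero', 'binary', 'for bin from']

loose_hits = [w for w in words if loose.search(w)]
strict_hits = [w for w in words if strict.search(w)]
```

Here `loose_hits` contains every string in the list, while `strict_hits` keeps only the ones where bin stands as a word of its own.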

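Answer 2: (score: 0)

In addition to @cricket_007's comment (no need for .*, as long as you switch to re.search, which doesn't implicitly insert a start of line anchor at the front), searching for bin with no other qualifiers is likely to produce a lot of false positives. Using grouping parens also makes the check more expensive (it has to store the capture groups). Finally, you should always use raw strings for regular expressions, or it will eventually bite you. Putting it all together, you can use the test if re.search(r'\b[Bb]in\b', line): to enforce word boundaries, avoid unnecessary capturing, and still do what you intended.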
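The difference between re.match and re.search can be checked on the sample log line from the question. This is only a small sketch; the variable names are arbitrary.

```python
import re

# Log line quoted in the question
line = ("Feb  4 10:43:14 j4-be02 sshd[1212]: Failed password for bin "
        "from 83.212.110.234 port 42670 ssh2")

# re.match only tries at the start of the string, so this finds nothing.
anchored = re.match(r'\b[Bb]in\b', line)

# re.search scans the whole string and finds the user name.
anywhere = re.search(r'\b[Bb]in\b', line)

# re.match only succeeds if something like .* is allowed to consume the prefix.
padded = re.match(r'.*\b[Bb]in\b', line)
```

Because the character class `[Bb]` is not a capture group, `anywhere.groups()` is empty, which is the "avoid unnecessary capturing" part of the advice.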

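You can even optimize it a bit by precompiling the regex (Python caches compiled regexes, but it still involves executing Python-level code to check the cache every time; a compiled object goes straight to C with no delay).

This can be simplified to: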

import re

# Compile and store the bound method under a useful name; use a character
# class to avoid capturing B/b, and word boundary assertions to avoid matching
# longer words containing the name, e.g. "binary" when you want bin
hasuser = re.compile(r'\b[Bb]in\b').search

# Open auth.log for reading; the with statement closes the file deterministically
with open('auth.log') as myAuthlog:
    # Filter for lines with the specified user (in Py3, you would need to wrap
    # filter in list() or use the sum(1 for _ in filter(hasuser, myAuthlog)) idiom)
    loginattempts = len(filter(hasuser, myAuthlog))
print "User attempted to log in", loginattempts, "times"
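The code above relies on Python 2 behaviour (the print statement, and filter returning a list). A sketch of the same idea that should also run on Python 3 might look like the following. The function name count_matching_lines, its parameters, and the sample lines are hypothetical and only serve the illustration.

```python
import re

def count_matching_lines(lines, pattern=r'\b[Bb]in\b'):
    """Count the lines in which `pattern` matches at least once."""
    has_match = re.compile(pattern).search
    # sum(1 for ...) works whether filter returns a list (Py2) or an iterator (Py3).
    return sum(1 for _ in filter(has_match, lines))

# Hypothetical sample lines standing in for auth.log
sample_lines = [
    "Failed password for bin from 192.0.2.1 port 42670 ssh2",
    "Failed password for Bin from 192.0.2.1 port 42671 ssh2",
    "Accepted password for binary from 192.0.2.1 port 42672 ssh2",
    "Failed password for root from 192.0.2.1 port 42673 ssh2",
]
attempts = count_matching_lines(sample_lines)

# With a real file, the call would be roughly:
# with open('auth.log') as auth_log:
#     attempts = count_matching_lines(auth_log)
```

Since an open file is itself an iterable of lines, the same function works for a real log without reading the whole file into memory.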