Question

I'm trying to write a Python script that reads a log file and tells us how many times the user bin appears, so I have this:

import re

# Open auth.log for reading
myAuthlog = open('auth.log', 'r')
for line in myAuthlog:
    if re.match("(.*)(B|b)in(.*)", line):
        print(line)

This prints out the whole line, for example:

>>> Feb  4 10:43:14 j4-be02 sshd[1212]: Failed password for bin from 83.212.110.234 port 42670 ssh2

But I only want to produce the count, e.g. "the user attempted to log in 26 times".

Answer 1

import re

count = 0
with open('auth.log', 'r') as myAuthlog:
    for line in myAuthlog:
        # Count every matching line instead of printing it
        if re.match("(.*)(B|b)in(.*)", line):
            count += 1
print(count)

Answer 2

Option 1:

If your file is not huge, you can use re.findall and take the length of the resulting list:

count = len(re.findall(your_regex, myAuthlog.read()))

Option 2:

If your file is very large, iterate over the lines in a generator expression and sum up the matches:

count = sum(1 for line in myAuthlog if re.search(your_regex, line))

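Both options assume that you want to count the number of lines that contain a match, as in your example code. Option 1 additionally assumes that the user name occurs at most once per line.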
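To make that last assumption concrete, here is a minimal sketch that runs both options on a few made-up log lines. The sample text, the pattern r'\b[Bb]in\b' and the use of io.StringIO in place of the opened auth.log are assumptions for illustration only; the third line is contrived so that it mentions bin twice.

```python
import io
import re

# Hypothetical sample data standing in for the contents of auth.log.
sample = (
    "Feb  4 10:43:14 j4-be02 sshd[1212]: Failed password for bin from 83.212.110.234 port 42670 ssh2\n"
    "Feb  4 10:43:20 j4-be02 sshd[1213]: Failed password for root from 83.212.110.234 port 42671 ssh2\n"
    "Feb  4 10:43:25 j4-be02 sshd[1214]: contrived line: bin tried again as bin\n"
)
your_regex = r"\b[Bb]in\b"  # assumed pattern, not taken from the question

# Option 1: counts every occurrence in the whole text.
occurrences = len(re.findall(your_regex, sample))

# Option 2: counts each matching line once, however often it matches.
myAuthlog = io.StringIO(sample)  # behaves like an open text file
matching_lines = sum(1 for line in myAuthlog if re.search(your_regex, line))

print(occurrences)
print(matching_lines)
```

The two numbers differ as soon as a line mentions the name more than once, so pick the option that matches what you actually want to count.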

A note on your regex:

(.*)(B|b)in(.*) will also match strings such as 'Carabinero'; consider using word boundaries, i.e. \b(B|b)in\b.
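A quick way to see the difference is to run both patterns over a handful of words. This is only a sketch; the word list is made up for illustration.

```python
import re

loose = re.compile(r"(.*)(B|b)in(.*)")   # pattern from the question
strict = re.compile(r"\b(B|b)in\b")      # same pattern with word boundaries

# Hypothetical test words; only the first two are the user name itself.
words = ["bin", "Bin", "Carabinero", "binary", "cabin"]

loose_hits = [w for w in words if loose.search(w)]
strict_hits = [w for w in words if strict.search(w)]

print(loose_hits)
print(strict_hits)
```

The loose pattern accepts every word in the list, while the word-boundary version keeps only bin and Bin.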

Answer 3

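In addition to @cricket_007's comment (no need for .*, as long as you switch to re.search, which doesn't implicitly anchor the pattern at the start of the string), searching for bin with no other qualifiers is likely to produce a lot of false positives. Using grouping parens also makes the check more expensive (it has to store the capture groups). Finally, you should always use raw strings for regexes, or it will eventually bite you. Putting it all together, you could use if re.search(r'\b[Bb]in\b', line): to enforce word boundaries, avoid unnecessary captures, and still do what you intended.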
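The points about re.match versus re.search and about raw strings can be checked on the log line from the question. A minimal sketch (the variable names are mine):

```python
import re

line = ("Feb  4 10:43:14 j4-be02 sshd[1212]: Failed password for bin "
        "from 83.212.110.234 port 42670 ssh2")

# re.match only tries the pattern at the start of the string, so without
# a leading .* it cannot find "bin" in the middle of the line.
anchored = re.match(r"\b[Bb]in\b", line)

# re.search scans the whole string.
anywhere = re.search(r"\b[Bb]in\b", line)

# Without the r prefix, "\b" is a backspace character (\x08), not a
# word-boundary assertion, so this pattern looks for literal backspaces.
not_raw = re.search("\b[Bb]in\b", line)

print(anchored)
print(anywhere.group())
print(not_raw)
```

Only the raw-string pattern passed to re.search finds the user name; the other two calls return None.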

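You could even optimize it a bit by precompiling the regex (Python caches compiled regexes, but using the cache still means executing Python-level code to check it on every call; a compiled object goes straight to C with no delay).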

This could simplify to:

import re

# Compile once and keep the bound search method under a descriptive name.
# The character class [Bb] avoids a capture group, and the word-boundary
# assertions avoid matching longer words that contain the name,
# e.g. "binary" when you want bin.
hasuser = re.compile(r'\b[Bb]in\b').search

# Open auth.log with a with statement so the file is closed deterministically
with open('auth.log') as myAuthlog:
    # Count the lines that mention the user. sum(1 for _ in filter(...)) works
    # on both Python 2 and 3; len(filter(...)) only works on Python 2, because
    # filter returns a lazy iterator on Python 3.
    loginattempts = sum(1 for _ in filter(hasuser, myAuthlog))
print("User attempted to log in %d times" % loginattempts)
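If you would rather have something reusable and easy to test than a one-off script, the same idea could be wrapped in a function. This is only a sketch under a few assumptions: count_user_lines is a hypothetical name, the user parameter and the re.escape call are my additions, the sample lines are made up, and unlike [Bb]in the match here is case-sensitive.

```python
import re

def count_user_lines(lines, user):
    """Count the lines in which `user` appears as a whole word."""
    # re.escape guards against regex metacharacters in the user name.
    has_user = re.compile(r"\b" + re.escape(user) + r"\b").search
    return sum(1 for line in lines if has_user(line))

# Hypothetical sample lines; any iterable of strings works, including
# an open file object.
sample_lines = [
    "Failed password for bin from 83.212.110.234 port 42670 ssh2",
    "Failed password for root from 83.212.110.234 port 42670 ssh2",
    "Failed password for binary from 83.212.110.234 port 42670 ssh2",
]
attempts = count_user_lines(sample_lines, "bin")
print("User attempted to log in %d times" % attempts)
```

With a real log you would pass the open file instead of the list, e.g. inside with open('auth.log') as myAuthlog: call count_user_lines(myAuthlog, 'bin').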

How to count occurrences of a word in Python
