Question

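Long-time reader, first-time asker (please be gentle).

I've been doing this with a very messy WHILE READ loop in Unix Bash, but I'm learning Python and would like to try writing a more efficient parser routine.

I have a bunch of log files that are mostly space-delimited, but they contain square-bracketed fields that may also have spaces inside them. How do I ignore whatever is inside the brackets when looking for the delimiter?

(I assume the re library is needed.)

Sample input: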

[21/Sep/2014:13:51:12 +0000] serverx 192.0.0.1 identity 200 8.8.8.8 - 500 unavailable RESULT 546 888 GET htp://www.google.com/something/fsd?=somegibberish&youscanseethereisalotofcharactershere+bananashavealotofpotassium [somestuff/1.0 (OSX v. 1.0; this_is_a_semicolon; colon:93.1.1) Somethingelse/1999 (COMMA, yep_they_didnt leave_me_a_lot_to_make_this_easy) DoesanyonerememberAOL/1.0]

Desired output:

'21/Sep/2014:13:51:12 +0000'; 'serverx'; '192.0.0.1'; 'identity'; '200'; '8.8.8.8'; '-'; '500'; 'unavailable'; 'RESULT'; '546'; '888'; 'GET'; 'htp://www.google.com/something/fsd?=somegibberish&youscanseethereisalotofcharactershere+bananashavealotofpotassium'; 'somestuff/1.0 (OSX v. 1.0; this_is_a_semicolon; colon:93.1.1) Somethingelse/1999 (COMMA, yep_they_didnt leave_me_a_lot_to_make_this_easy) DoesanyonerememberAOL/1.0'

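Note that the first and last fields (the ones in square brackets) still have their spaces intact.

Bonus points: the 14th field (the URL) is always in one of the following formats: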

htp://google.com/path-data-might-be-here-and-can-contain-special-characters
google.com/path-data-might-be-here-and-can-contain-special-characters
xyz.abc.www.google.com/path-data-might-be-here-and-can-contain-special-characters
google.com:443
google.com

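I'd like to add an extra column to the data that contains only the domain (e.g. xyz.abc.www.google.com or google.com).

So far I've been using Unix AWK with an IF statement on the parsed output to split this field on '/' and check whether the third field is blank. If it is, I return the first field (up to the ':' if one is present); otherwise I return the third field. If there's a better way to do this, preferably in the same routine as above, I'd love to hear it. My final output might then be: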

'21/Sep/2014:13:51:12 +0000'; 'serverx'; '192.0.0.1'; 'identity'; '200'; '8.8.8.8'; '-'; '500'; 'unavailable'; 'RESULT'; '546'; '888'; 'GET'; 'htp://www.google.com/something/fsd?=somegibberish&youscanseethereisalotofcharactershere+bananashavealotofpotassium'; 'somestuff/1.0 (OSX v. 1.0; this_is_a_semicolon; colon:93.1.1) Somethingelse/1999 (COMMA, yep_they_didnt leave_me_a_lot_to_make_this_easy) DoesanyonerememberAOL/1.0'; **'www.google.com'**

Footnote: I changed http to htp in the examples so they wouldn't create a bunch of distracting links.

Answer 1

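The regular expression pattern \[[^\]]*\]|\S+ will tokenize your data, but it won't remove the brackets from the multi-word values. You'll need to do that in a separate step: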

import re

def parse_line(line):
    values = re.findall(r'\[[^\]]*\]|\S+', line)
    values = [v.strip("[]") for v in values]
    return values
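As a quick check, here is a minimal sketch that runs the same function on a shortened version of the sample line from the question. The `sample` string is abbreviated for readability (the long URL and user-agent fields are trimmed), so treat it as illustrative rather than as the real log data.

```python
import re

def parse_line(line):
    # Same tokenizer as above: a bracketed group, or a run of non-space characters.
    values = re.findall(r'\[[^\]]*\]|\S+', line)
    # Remove the surrounding brackets from the multi-word values.
    return [v.strip("[]") for v in values]

# Abbreviated sample line (assumption: trimmed from the question's example).
sample = (
    "[21/Sep/2014:13:51:12 +0000] serverx 192.0.0.1 identity 200 8.8.8.8 - 500 "
    "unavailable RESULT 546 888 GET "
    "htp://www.google.com/something/fsd?=somegibberish "
    "[somestuff/1.0 (OSX v. 1.0; this_is_a_semicolon) DoesanyonerememberAOL/1.0]"
)

fields = parse_line(sample)
print(len(fields))                            # number of fields found
print("; ".join(repr(f) for f in fields))     # same layout as the desired output
```

One thing to keep in mind: `strip("[]")` removes bracket characters from both ends of every token, so an unbracketed field that happens to start or end with `[` or `]` would lose that character as well.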

Here's a more verbose version of the regular expression pattern:

pattern = r"""(?x)   # turn on verbose mode (ignores whitespace and comments)
    \[       # match a literal open bracket '['
    [^\]]*   # match zero or more characters, as long as they are not ']'
    \]       # match a literal close bracket ']'
    |        # alternation, match either the section above or the section below
    \S+      # match one or more non-space characters
    """

values = re.findall(pattern, line) # findall returns a list with all matches it finds
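For the bonus part of the question (pulling the domain out of the URL field), one possible approach is sketched below. It is only a sketch under the assumption that the field always looks like one of the five formats listed in the question; `extract_domain` and `_SCHEME` are hypothetical names, not part of any library.

```python
import re

# Matches a leading scheme such as "htp://" or "http://" (only at the start).
_SCHEME = re.compile(r'^[A-Za-z][A-Za-z0-9+.\-]*://')

def extract_domain(url):
    # Drop the scheme prefix if there is one.
    without_scheme = _SCHEME.sub('', url, count=1)
    # Keep everything before the first "/" (the host, possibly with a port).
    host = without_scheme.split('/', 1)[0]
    # Drop a trailing ":port" if present.
    return host.split(':', 1)[0]

examples = [
    "htp://google.com/path-data-might-be-here",
    "google.com/path-data-might-be-here",
    "xyz.abc.www.google.com/path-data-might-be-here",
    "google.com:443",
    "google.com",
]
for example in examples:
    print(example, "->", extract_domain(example))
```

With the list returned by parse_line, the extra column could then be added with something like `values.append(extract_domain(values[13]))`, since the URL is the 14th field. The standard library's `urllib.parse.urlsplit` is an alternative, but it only fills in the host when the string has a `//` prefix, so scheme-less values such as google.com/path would need to be adjusted first. This sketch also does not handle bracketed IPv6 hosts.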
