Question

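I am parsing a source code file and want to remove all line comments (those starting with "//") and all multi-line comments (/* ... */). However, if a multi-line comment contains at least one newline (\n), I want the output to contain exactly one newline in its place.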

For example, the code:

qwe /* 123
456 
789 */ asd

should turn into exactly:

qwe
asd

and not "qweasd" or:

qwe

asd

What is the best way to do this? Thanks.

Edit: sample code for testing:

comments_test = "hello // comment\n"+\
                "line 2 /* a comment */\n"+\
                "line 3 /* a comment*/ /*comment*/\n"+\
                "line 4 /* a comment\n"+\
                "continuation of a comment*/ line 5\n"+\
                "/* comment */line 6\n"+\
                "line 7 /*********\n"+\
                "********************\n"+\
                "**************/\n"+\
                "line ?? /*********\n"+\
                "********************\n"+\
                "********************\n"+\
                "********************\n"+\
                "********************\n"+\
                "**************/\n"+\
                "line ??"

Expected result:

hello 
line 2 
line 3  
line 4
line 5
line 6
line 7
line ??
line ??

Answer 1

import re

comment_re = re.compile(
    r'(^)?[^\S\n]*/(?:\*(.*?)\*/[^\S\n]*|/[^\n]*)($)?',
    re.DOTALL | re.MULTILINE
)

def comment_replacer(match):
    start,mid,end = match.group(1,2,3)
    if mid is None:
        # single line comment
        return ''
    elif start is not None or end is not None:
        # multi line comment at start or end of a line
        return ''
    elif '\n' in mid:
        # multi line comment with line break
        return '\n'
    else:
        # multi line comment without line break
        return ' '

def remove_comments(text):
    return comment_re.sub(comment_replacer, text)

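Notes:

(^)? will match if the comment starts at the beginning of a line, as long as the MULTILINE flag is used.
[^\S\n] will match any whitespace character except a newline. We don't want to match line breaks if the comment starts on its own line.
/\*(.*?)\*/ will match a multi-line comment and capture its content. The match is lazy, so it does not span two or more comments. The DOTALL flag makes . match newlines.
//[^\n]* will match a single-line comment. We can't use . here because of the DOTALL flag.
($)? will match if the comment stops at the end of a line, as long as the MULTILINE flag is used.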
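For readers who find the one-line pattern hard to scan, here is a sketch of the same regex written with re.VERBOSE so that each piece carries its own comment. The name comment_re_verbose is made up for this illustration; the pattern is meant to behave like comment_re above, assuming the three flags shown.

```python
import re

# Hypothetical name; intended as an annotated equivalent of comment_re.
comment_re_verbose = re.compile(
    r"""
    (^)?            # group 1: set when the match starts at a line start
    [^\S\n]*        # horizontal whitespace before the comment
    /               # every comment starts with a slash
    (?:
        \*(.*?)\*/  # block comment; group 2 captures the body (lazy)
        [^\S\n]*    # horizontal whitespace after the block comment
      |
        /[^\n]*     # line comment: second slash, then the rest of the line
    )
    ($)?            # group 3: set when the match ends at a line end
    """,
    re.DOTALL | re.MULTILINE | re.VERBOSE,
)

# Groups 1 and 3 are None when the optional anchor did not match,
# and the empty string when it did.
m = comment_re_verbose.search("a /* x\ny */ b")
print(m.group(1, 2, 3))
```

Because group 2 is only set for block comments, comment_replacer can tell the two comment kinds apart by checking whether it is None.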

Examples:

>>> s = ("qwe /* 123\n"
...      "456\n"
...      "789 */ asd /* 123 */ zxc\n"
...      "rty // fgh\n")
>>> print('"' + '"\n"'.join(
...     remove_comments(s).splitlines()
... ) + '"')
"qwe"
"asd zxc"
"rty"
>>> comments_test = ("hello // comment\n"
...                  "line 2 /* a comment */\n"
...                  "line 3 /* a comment*/ /*comment*/\n"
...                  "line 4 /* a comment\n"
...                  "continuation of a comment*/ line 5\n"
...                  "/* comment */line 6\n"
...                  "line 7 /*********\n"
...                  "********************\n"
...                  "**************/\n"
...                  "line ?? /*********\n"
...                  "********************\n"
...                  "********************\n"
...                  "********************\n"
...                  "********************\n"
...                  "**************/\n"
...                  "line ??")
>>> print('"' + '"\n"'.join(
...     remove_comments(comments_test).splitlines()
... ) + '"')
"hello"
"line 2"
"line 3 "
"line 4"
"line 5"
"line 6"
"line 7"
"line ??"
"line ??"

Edit:

Updated to the new specification.
Added another example.

Answer 2

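The fact that you even have to ask this question, and that the solutions given are, shall we say, less than perfectly readable :-) should be a good indication that REs are not the real answer to this question.

From a readability viewpoint, you would be far better off coding this up as a relatively simple parser.

Too often, people try to use REs to be "clever" (I don't mean that in a disparaging way), thinking that a single line is elegant, but all they end up with is an unmaintainable morass of characters. I'd rather have a fully commented 20-line solution that I can understand in an instant.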
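To make the suggestion concrete, a hand-written scanner might look roughly like the sketch below. The function name strip_comments and the exact whitespace rules are assumptions, chosen to reproduce the expected output given in the question. It makes no attempt to recognise comment markers inside string literals.

```python
def strip_comments(text):
    """Remove // and /* */ comments with a simple scanner (a sketch)."""
    out = []                      # output characters collected so far
    i, n = 0, len(text)
    while i < n:
        if text.startswith("//", i):
            # Line comment: drop everything up to, not including, the newline.
            end = text.find("\n", i)
            i = n if end == -1 else end
        elif text.startswith("/*", i):
            end = text.find("*/", i + 2)
            stop = n if end == -1 else end + 2   # unterminated: runs to EOF
            spans_lines = "\n" in text[i:stop]
            i = stop
            if spans_lines:
                # Assumed rule: trim blanks on both sides of the comment.
                while out and out[-1] in " \t":
                    out.pop()
                while i < n and text[i] in " \t":
                    i += 1
                # Emit one newline only when code sits on both sides.
                if out and out[-1] != "\n" and i < n and text[i] != "\n":
                    out.append("\n")
            # A comment on a single line is simply dropped.
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


print(strip_comments("qwe /* 123\n456 \n789 */ asd"))
```

Each branch handles one kind of input, so changing a rule (for example, replacing a single-line block comment with a space) means editing one clearly marked place.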

Answer 3

Is this what you are looking for?

>>> print(s)
qwe /* 123
456
789 */ asd
>>> print(re.sub(r'\s*/\*.*\n.*\*/\s*', '\n', s, flags=re.S))
qwe
asd

This only works for comments that span more than one line; other comments are left alone.

Answer 4

How about this:

# flags must be passed by keyword: the fourth positional argument of re.sub is count
re.sub(r'\s*/\*(.|\n)*?\*/\s*', '\n', s, flags=re.DOTALL).strip()

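It eats up any leading whitespace, the /*, any text and newlines up to the first */, and then any whitespace after that too.

It's a small twist on sykora's example, but it is also non-greedy on the inside. You might also want to look into the MULTILINE option.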
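The difference between greedy and non-greedy matching shows up once a line holds more than one comment. The snippet below is only an illustration: the input string and the variable names greedy and lazy are made up, and the two patterns are the ones from sykora's answer and from this one.

```python
import re

# Made-up input: a multi-line comment followed by a single-line one.
s = "a /* 1\n2 */ b /* 3 */ c"

# Greedy .* runs on to the LAST */ and swallows the code in between.
greedy = re.sub(r'\s*/\*.*\n.*\*/\s*', '\n', s, flags=re.S)

# Lazy (.|\n)*? stops at the FIRST */, so each comment is matched separately.
lazy = re.sub(r'\s*/\*(.|\n)*?\*/\s*', '\n', s, flags=re.DOTALL).strip()

print(repr(greedy))  # 'a\nc'
print(repr(lazy))    # 'a\nb\nc'
```

Note that the lazy pattern turns every block comment into a newline, including ones that sit on a single line, which is not quite what the question asks for.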

Answer 5

See can-regular-expressions-be-used-to-match-nested-patterns: if you need to handle nested comments, regular expressions are not the solution.
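As a rough illustration of why nesting calls for something other than a regex, here is a sketch that tracks nesting depth with a counter. The function name strip_nested_comments is hypothetical, and the sketch assumes a language that allows /* */ comments to nest (C and C++ do not). It ignores string literals and the newline rule from the question.

```python
def strip_nested_comments(text):
    """Drop /* */ comments, allowing them to nest (a sketch)."""
    out = []
    depth = 0                 # how many comments are currently open
    i, n = 0, len(text)
    while i < n:
        if text.startswith("/*", i):
            depth += 1        # opening marker, possibly inside a comment
            i += 2
        elif depth and text.startswith("*/", i):
            depth -= 1        # closing marker ends the innermost comment
            i += 2
        elif depth:
            i += 1            # inside a comment: skip the character
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


print(repr(strip_nested_comments("a /* b /* c */ d */ e")))  # 'a  e'
```

A non-greedy regex would stop at the first */ and leave " d */" behind, which is the failure the linked question describes.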

Python regex question: stripping multi-line comments but maintaining a newline
