Question

I need to use Python to remove URLs, empty lines, and lines containing Unicode characters from a large text file (500 MiB).

This is my file:

https://removethis1.com
http://removethis2.com foobar1
http://removethis3.com foobar2
foobar3 http://removethis4.com
www.removethis5.com


foobar4 www.removethis6.com foobar5
foobar6 foobar7
foobar8 www.removethis7.com

After applying the regex, it should look like this:

foobar1
foobar2
foobar3 
foobar4 foobar5
foobar6 foobar7
foobar8

The code I came up with is:

    file = open(file_path, encoding="utf8")
    self.rawFile = file.read()
    rep = re.compile(r"""
                        http[s]?://.*?\s 
                        |www.*?\s  
                        |(\n){2,}  
                        """, re.X)
    self.processedFile = rep.sub('', self.rawFile)

But the output is incorrect:

foobar3 foobar4 foobar5
foobar6 foobar7
foobar8 www.removethis7.com

I also need to remove every line that contains at least one non-ASCII character, but I could not come up with a regex for that task.

Answer 1

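You can try encoding each line to ASCII to catch the non-ASCII lines, which I think is what you want:

import re

with open("test.txt", encoding="utf-8") as f:
    # Match a URL together with the whitespace after it, or a bare newline.
    rep = re.compile(r"""
                        http[s]?://.*?\s
                        |www.*?\s
                        |(\n)
                        """, re.X)
    for line in f:
        m = rep.search(line)
        if m:
            line = line.replace(m.group(), "")
        try:
            # Raises UnicodeEncodeError if any non-ASCII character remains.
            line.encode("ascii")
        except UnicodeEncodeError:
            continue
        if line.strip():
            print(line.strip())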

Input:

https://removethis1.com
http://removethis2.com foobar1
http://removethis3.com foobar2
foobar3 http://removethis4.com
www.removethis5.com

1234 ā
5678 字
foobar4 www.removethis6.com foobar5
foobar6 foobar7
foobar8 www.removethis7.com

Output:

foobar1
foobar2
foobar3
foobar4 foobar5
foobar6 foobar7
foobar8

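Or use a regex that matches any non-ASCII character:

import re

with open("test.txt", encoding="utf-8") as f:
    # Match a URL together with the whitespace after it, or a bare newline.
    rep = re.compile(r"""
                        http[s]?://.*?\s
                        |www.*?\s
                        |(\n)
                        """, re.X)
    # Any character outside the 7-bit ASCII range.
    non_asc = re.compile(r"[^\x00-\x7F]")
    for line in f:
        non = non_asc.search(line)
        if non:
            continue
        m = rep.search(line)
        if m:
            line = line.replace(m.group(), "")
        if line.strip():
            print(line.strip())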

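Same output as above. You cannot combine the regexes into one, because one of them discards the whole line if there is any match, while the other only replaces the matched part.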
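For what it's worth, the same two-step idea can be sketched as a generator that streams the file line by line, which also avoids holding all 500 MiB in memory. This is only a sketch under a few assumptions: `clean_lines`, `URL` and `NON_ASCII` are names made up for illustration, the URL pattern is a simplification (anything starting with `http://`, `https://` or `www.` up to the next whitespace), and non-ASCII lines are dropped before URLs are removed.

```python
import re

# Simplified URL pattern (an assumption, not a full URL grammar).
URL = re.compile(r"(?:https?://|www\.)\S+")
# Any character outside the 7-bit ASCII range.
NON_ASCII = re.compile(r"[^\x00-\x7F]")


def clean_lines(lines):
    """Yield cleaned lines: skip non-ASCII lines, remove URLs, skip blanks."""
    for line in lines:
        if NON_ASCII.search(line):
            continue
        # sub() removes every URL on the line, not only the first match.
        line = URL.sub("", line)
        # Collapse the whitespace left behind by the removed URLs.
        line = " ".join(line.split())
        if line:
            yield line


sample = [
    "https://removethis1.com\n",
    "http://removethis2.com foobar1\n",
    "foobar3 http://removethis4.com\n",
    "\n",
    "1234 ā\n",
    "foobar4 www.removethis6.com foobar5\n",
    "foobar8 www.removethis7.com",  # last line without a trailing newline
]
result = list(clean_lines(sample))
print("\n".join(result))
```

Since an open file object iterates over its lines, something like `clean_lines(open(path, encoding="utf-8"))` should work on the real file as well.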

Answer 2

This will remove all the links:

(?:http|www).*?(?=\s|$)

Explanation

(?:            #non capturing group
    http|www   #match "http" OR "www"
)
    .*?        #lazy match anything until...
(
    ?=\s|$     #it is followed by white space or the end of line (positive lookahead)
)

Replace whitespace \s with a newline \n, then remove all empty lines.
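In Python, applying this pattern might look like the sketch below. The function name `remove_links` is made up for illustration. Note that the cleanup step here takes a simpler route than the one described above: it drops blank lines directly and collapses leftover spaces, rather than replacing whitespace with newlines.

```python
import re

# The pattern from this answer: "http" or "www", then lazily anything
# up to (but not including) the next whitespace or the end of the text.
LINK = re.compile(r"(?:http|www).*?(?=\s|$)")


def remove_links(text):
    """Remove links, then drop the lines that end up empty."""
    without_links = LINK.sub("", text)
    kept = []
    for line in without_links.splitlines():
        # Collapse runs of spaces left where a link used to be.
        line = " ".join(line.split())
        if line:
            kept.append(line)
    return "\n".join(kept)


sample = (
    "foobar3 http://removethis4.com\n"
    "www.removethis5.com\n"
    "\n"
    "foobar4 www.removethis6.com foobar5\n"
    "foobar8 www.removethis7.com"
)
cleaned = remove_links(sample)
print(cleaned)
```

Be aware that the pattern is not anchored to a word boundary, so it also starts matching at "http" or "www" in the middle of a longer word.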

Answer 3

Depending on how closely you want the result to match your example text:

( +)?\b(?:http|www)[^\s]*(?(1)|( +)?)|\n{2,}

regex101 demo

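This bit of magic looks for leading spaces and captures them, if present. It then looks for the http or www part, followed by everything that is not whitespace (I used [^\s]* rather than simply \S* in case you want to add more characters to exclude). After that, it uses a regex conditional to check whether any leading spaces were captured. If not, it tries to capture any trailing spaces instead (so that it does not remove too much in, for example, foobar4 www.removethis6.com foobar5). Alternatively, it looks for two or more consecutive newlines.

If you replace all of that with nothing, it should give you the same output as you requested.

Now, this regex is rather strict and will probably have many edge cases where it doesn't work. It works for the OP's sample, but if you need something more flexible, you may need to provide more details.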
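As a rough illustration, the pattern can be used from Python as is, since the `re` module supports the `(?(1)...|...)` conditional syntax. The variable names below are made up, and the sample is the file from the question.

```python
import re

# The pattern from this answer. Group 1 holds leading spaces, if any;
# trailing spaces are only consumed when group 1 did not match.
PATTERN = re.compile(r"( +)?\b(?:http|www)[^\s]*(?(1)|( +)?)|\n{2,}")

sample = (
    "https://removethis1.com\n"
    "http://removethis2.com foobar1\n"
    "http://removethis3.com foobar2\n"
    "foobar3 http://removethis4.com\n"
    "www.removethis5.com\n"
    "\n"
    "\n"
    "foobar4 www.removethis6.com foobar5\n"
    "foobar6 foobar7\n"
    "foobar8 www.removethis7.com"
)

# Replace every match with nothing.
cleaned = PATTERN.sub("", sample)
# The first line was only a URL, so a leading newline is left over.
lines = cleaned.strip().splitlines()
print("\n".join(lines))
```

One of the edge cases to keep in mind: because runs of newlines are replaced with nothing, two text lines separated only by blank lines get joined, so "a\n\nb" becomes "ab". It works on the sample only because a URL-only line sits before the blank lines and leaves its own newline behind.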
