Question

尝试找出regex检测文本中的网址，但<a href="url">...</a>已包围的网址除外，并用标记围绕它们。

input: "http://google.sk this is an url"
result: "<a href="http://google.sk">http://google.sk</a> this is an url"

input: "<a href="http://google.sk">http://google.sk</a> this is an url"
result: "<a href="http://google.sk">http://google.sk</a> this is an url"

这个answer给了我很多帮助，但它并不期望已经包含了URL。

def fix_urls(text):
    pat_url = re.compile(  r'''
                     (?x)( # verbose identify URLs within text
         (https|http|ftp|gopher) # make sure we find a resource type
                       :// # ...needs to be followed by colon-slash-slash
            (\w+[:.]?){2,} # at least two domain groups, e.g. (gnosis.)(cx)
                      (/?| # could be just the domain name (maybe w/ slash)
                [^ \n\r"]+ # or stuff then space, newline, tab, quote
                    [\w/]) # resource name ends in alphanumeric or slash
         (?=[\s\.,>)'"\]]) # assert: followed by white or clause ending
                         ) # end of match group
                           ''')

    for url in re.findall(pat_url, text):
       text = text.replace(url[0], '<a href="%(url)s">%(url)s</a>' % {"url" : url[0]})

    return text

如果文字中有任何<a>标记，则此功能会再次包装其中我不想要的网址。你知道怎么做吗？

Answer 1

使用否定的lookbehind检查href="是否在您的网址（第二行）之前：

(?x) # verbose
(?<!href=\") #don't match already inside hrefs
(https?|ftp|gopher) # make sure we find a resource type
:// # ...needs to be followed by colon-slash-slash
((?:\w+[:.]?){2,}) # at least two domain groups, e.g. (gnosis.)(cx) fixed capture group*
(/?| # could be just the domain name (maybe w/ slash)
[^ \n\r\"]+ # or stuff then space, newline, tab, quote
[\w\/]) # resource name ends in alphanumeric or slash
(?=[\s\.,>)'\"\]]) # assert: followed by white or clause ending

https://regex101.com/r/EpcMKw/2/

Python - 找到所有未被标签包围的URL

1 个答案: