Question

I want to get a URL starting with http:// or https:// from a text file that also contains other, unrelated text, and transfer it to another file or list.

    def test():
        with open('findlink.txt') as infile, open('extractlink.txt', 'w') as outfile:
            for line in infile:
                if "https://" in line:
                    outfile.write(line[line.find("https://"): line.find("")])
            print("Done")

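The code currently does nothing.

Edit: I see this is being downvoted as usual. Is there anything I can add here?

This is not a duplicate; please read carefully.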

Answer 1

You can use re to extract all the URLs.

In [1]: st = '''https://regex101.com/ ha the hkj adh erht  https://regex202.gov
   ...: h euy ashiu fa https://regex303.com aj feij ajj ai http://regex101.com/'''

In [2]: st
Out[2]: 'https://regex101.com/ ha the hkj adh erht  https://regex202.gov h euy ashiu fa https://regex303.com aj feij ajj ai http://regex101.com/'

In [3]: import re

In [4]: a = re.compile(r"https?://(\w+\.\w{3})/*")
In [5]: for i in a.findall(st):
   ...:     print(i)


regex101.com
regex202.gov
regex303.com
regex101.com

For variable TLDs and paths:

st = '''https://regex101.com/ ha the hkj adh erht  https://regex202.gov h euy ashiu fa https://regex303.com aj feij ajj ai http://regex101.com/ ie fah fah http://regex101.co/ ty ahn fah jaio l http://regex101/yhes.com/'''
a = re.compile(r"https?://([\w/]+\.\w{0,3})/*")
for i in a.findall(st):
    print(i)

regex101.com
regex202.gov
regex303.com
regex101.com
regex101.co
regex101/yhes.com
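
Both patterns above capture only the host part of each match. Since the question asks for the URLs themselves in a file or list, here is a minimal sketch that keeps the whole URL. It assumes a URL is simply http:// or https:// followed by a run of non-whitespace characters; the names URL_PATTERN and extract_urls are made up for illustration.

```python
import re

# Assumption: a URL runs from "http://" or "https://" to the next whitespace.
URL_PATTERN = re.compile(r"https?://\S+")

def extract_urls(text):
    """Return every substring of text that looks like an http(s) URL."""
    return URL_PATTERN.findall(text)

sample = "see https://regex101.com/ and http://regex101.co/path?x=1 for details"
urls = extract_urls(sample)
print(urls)  # ['https://regex101.com/', 'http://regex101.co/path?x=1']
```

With this simple pattern, trailing punctuation such as a closing parenthesis or a full stop would be kept as part of the match, so real input may need extra trimming.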

Answer 2

You need to use re, as in this answer. Below is how to integrate it into your function.

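import re

def test():
    with open('findlink.txt', 'r') as infile, open('extractlink.txt', 'w') as outfile:
        for line in infile:
            try:
                # re.search returns None when the line has no URL, so .group raises AttributeError
                url = re.search(r"(?P<url>https?://[^\s]+)", line).group("url")
                outfile.write(url + "\n")  # newline keeps each URL on its own line
            except AttributeError:
                pass
        print("Done")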
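
Note that re.search returns only the first match in each line. If a line may hold several URLs, something like the following sketch could be used instead. It assumes the same "non-whitespace run" definition of a URL; copy_urls is a hypothetical helper, and io.StringIO objects stand in for findlink.txt and extractlink.txt so the example runs without touching the disk.

```python
import io
import re

URL_RE = re.compile(r"https?://\S+")

def copy_urls(infile, outfile):
    """Write every URL found in infile to outfile, one per line; return the count."""
    count = 0
    for line in infile:
        # finditer yields every match in the line, not just the first one
        for match in URL_RE.finditer(line):
            outfile.write(match.group(0) + "\n")
            count += 1
    return count

# In-memory stand-ins for the input and output files
source = io.StringIO("a https://one.example b http://two.example/x\nno link here\n")
target = io.StringIO()
found = copy_urls(source, target)
print(found)  # 2
print(target.getvalue())
```

Any pair of open file objects can be passed in place of the StringIO objects.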

Answer 3

This is why the code currently does nothing:

outfile.write(line[line.find("https://"): line.find("")])

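Note that line.find("") is searching for the empty string. The empty string is always found at the very start of any string, so the call always returns 0. The slice therefore ends at index 0, which is never after its start, so it is always empty.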
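
A quick way to see this for yourself; the sample line is made up for illustration:

```python
line = "some text https://example.com more text\n"

start = line.find("https://")  # index where the URL begins
end = line.find("")            # the empty string always matches at index 0
print(start, end)              # 10 0

piece = line[start:end]        # line[10:0] is an empty slice
print(repr(piece))             # ''
```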

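Try changing it to line.find(" "): what you are looking for is a space, not the empty string.

However, if the line contains a space before the URL, you will still be stuck, because find returns the first space in the whole line. The most readable fix is probably to use separate variables and start the search for the space where the URL begins: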

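if "https://" in line:
    https_begin = line.find("https://")
    # find the next space after the url begins; searching from https_begin keeps the index absolute
    https_end = line.find(" ", https_begin)
    if https_end == -1:  # no space after the url: it runs to the end of the line
        https_end = len(line.rstrip())
    outfile.write(line[https_begin:https_end] + "\n")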
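
If you would rather avoid index arithmetic altogether, another option is to split the line on whitespace and keep the words that start with a URL scheme. This is only a sketch: urls_in_line is a made-up name, and it assumes every URL is separated from the surrounding text by whitespace, so a URL glued to other characters, such as one directly after an opening parenthesis, would be missed.

```python
def urls_in_line(line):
    """Return the whitespace-separated words that start with http:// or https://."""
    return [word for word in line.split()
            if word.startswith(("http://", "https://"))]

result = urls_in_line("before https://example.com/page after http://example.org\n")
print(result)  # ['https://example.com/page', 'http://example.org']
```

This also covers http:// links and several URLs on one line, which the find-based version does not.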

Search and extract URLs from a text file
