Question

I'm learning regular expressions and would like to know how to extract URLs from an HTML page. I want to print the URL from this line:

Website is: http://www.somesite.com

Every time that line is found, I want to extract the URL that follows **Website is:**. Any help would be appreciated.

Answer 1

Is this enough, or do you need something more specific?

In [229]: import re
In [230]: s = 'Website is: http://www.somesite.com '
In [231]: re.findall(r'Website is:\s+(\S+)', s)
Out[231]: ['http://www.somesite.com']
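
As a rough sketch of how the same pattern behaves on a whole page rather than a single line: re.findall scans the entire string and returns every captured URL. The sample text and the name page_text are made up for illustration.

```python
import re

# Hypothetical sample text; a real page would come from a file or an HTTP response.
page_text = (
    "Header line\n"
    "Website is: http://www.somesite.com\n"
    "Some other text\n"
    "Website is:   http://www.example.org/path\n"
)

# \s+ tolerates any run of whitespace after the colon;
# (\S+) captures everything up to the next whitespace character.
urls = re.findall(r"Website is:\s+(\S+)", page_text)
print(urls)
```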

Answer 2

You can match each line against a regular expression with a capturing group, like this:

import re

for l in page:  # page is assumed to be an iterable of lines
    m = re.match(r"Website is: (.*)", l)
    if m:
        print(m.group(1))

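This checks whether each line matches the pattern and extracts the link from it.

Some gotchas:

This assumes the "Website is" expression is always at the start of the line. If it isn't, you can use re.search instead.
This assumes there is exactly one space between the colon and the website. If that's not the case, you can change the expression to Website is:\s+(http.*).

The specifics depend on the page you are trying to parse.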
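
Both gotchas can be handled together. The following is only a sketch, assuming the page is available as a list of lines; the sample lines and the names pattern and found are invented for illustration.

```python
import re

# Hypothetical lines standing in for the page; the prefix is not always at the start.
page = [
    "<p>Website is: http://www.somesite.com</p>",
    "no link here",
    "Website is:    http://www.example.org",
]

pattern = re.compile(r"Website is:\s+(http.*)")

found = []
for line in page:
    # search() scans the whole line, whereas match() only tries at the start.
    m = pattern.search(line)
    if m:
        found.append(m.group(1))

print(found)
```

Note that .* runs to the end of the line, so the first result here keeps the trailing </p>. Swapping (http.*) for something like (http[^<\s]+) is one way to tighten it, depending on the markup involved.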

Answer 3

Since the task is so simple, a regular expression may be overkill.

def main():
    urls = []
    lines = prepare_file("<yourfile>.html")
    for line in lines:
        # Keep any line that looks like it contains a URL
        if "www" in line or "http://" in line:
            urls.append(line)
    return urls


def prepare_file(filename):
    with open(filename) as f:
        lines = f.readlines()  # splits on new lines
    lines = [line.strip() for line in lines]  # remove surrounding white space
    return [line for line in lines if line != '']  # remove empty elements
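
The function above collects whole lines. If only the URL after "Website is:" is wanted, plain string methods may be enough. This is a minimal sketch; the names PREFIX, extract_urls and sample are hypothetical, not part of the answer.

```python
PREFIX = "Website is:"  # the marker text from the question


def extract_urls(lines):
    """Return the first whitespace-delimited token after PREFIX on each line that has one."""
    urls = []
    for line in lines:
        # partition() returns (before, separator, after); separator is '' if PREFIX is absent.
        _, found, rest = line.partition(PREFIX)
        tokens = rest.split()
        if found and tokens:
            urls.append(tokens[0])
    return urls


sample = ["Website is: http://www.somesite.com", "", "nothing here", "Website is:"]
print(extract_urls(sample))
```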

Answer 4

From what I've read, capturing URLs with a regular expression is hard.

The following regex pattern may work well for you:

pat = 'Website is: (%s)' % fireball

where fireball is a pattern for matching URLs, which you can find here:

daringfireball.net/2010/07/improved_regex_for_matching_urls
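
To show how the pieces fit together, here is a sketch in which fireball is a deliberately simplified stand-in, not the pattern from the linked page. The sample text is also made up.

```python
import re

# Simplified stand-in, NOT the Daring Fireball pattern:
# a scheme followed by characters that are not whitespace or angle brackets.
fireball = r"https?://[^\s<>]+"

pat = 'Website is: (%s)' % fireball

text = 'Intro. Website is: http://www.somesite.com</p> more text'
m = re.search(pat, text)
if m:
    # Group 1 is the outer group from pat, even if the URL pattern has groups of its own.
    print(m.group(1))
```

Because the URL pattern stops at < and at whitespace, the trailing markup is left out of the captured group.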

Python regular expression for extracting URLs
