Split a string but keep the delimiter in the resulting substrings in Python

Asked: 2018-02-07 20:24:51

Tags: python regex string

I have a string containing URLs:

string = "https://www.link1.net/abc/cik?xai=En8MmT__aF_nQm-F48&sig=Cg0A7_5AE&urlfix=1&;ccurl=https://aax-us.link-two.com/x/c/Qoj_sZnkA%2526adurl%253Dhttp%253A%252F%252Fwww.link-three.mu%252F"

I want to extract all of them, so that the result looks like this:

['https://www.link1.net/abc/cik?xai=En8MmT__aF_nQm-F48&sig=Cg0A7_5AE&urlfix=1&;ccurl=','https://aax-us.link-two.com/x/c/Qoj_sZnkA%2526adurl%253D','http%253A%252F%252Fwww.link-three.mu%252F']

I tried:

urls = [x for x in re.split('(http[s]?)', string) if x]
print(urls)

The result is:

['https', '://www.link1.net/abc/cik?xai=En8MmT__aF_nQm-F48&sig=Cg0A7_5AE&urlfix=1&;ccurl=', 'https', '://aax-us.link-two.com/x/c/Qoj_sZnkA%2526adurl%253D', 'http', '%253A%252F%252Fwww.link-three.mu%252F']

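How can I get the complete URLs, given that each one may start with either "http" or "https"?

Any ideas?

2 answers:

Answer 0 (score: 2)

Without using re, you can handle this as follows: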

['http' + x for x in filter(lambda x: x, string.split('http'))]

The result will be:

['https://www.link1.net/abc/cik?xai=En8MmT__aF_nQm-F48&sig=Cg0A7_5AE&urlfix=1&;ccurl=', 'https://aax-us.link-two.com/x/c/Qoj_sZnkA%2526adurl%253D', 'http%253A%252F%252Fwww.link-three.mu%252F']
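One thing to watch with this approach: it prepends 'http' to every non-empty piece, so any text that appears before the first URL would get the prefix too. Below is a minimal sketch of a guarded variant for Python 3. The helper name split_keep_prefix and the sample input are hypothetical and not part of the original answer.

```python
def split_keep_prefix(text, delimiter="http"):
    """Split text on delimiter and re-attach the delimiter to the front
    of every piece that followed one (hypothetical helper)."""
    head, *rest = text.split(delimiter)
    # Every piece after the first was preceded by the delimiter in text.
    # Empty pieces are dropped, mirroring the filter in the answer above.
    parts = [delimiter + piece for piece in rest if piece]
    # Text before the first delimiter is kept as-is, without a prefix.
    if head:
        parts.insert(0, head)
    return parts


sample = "see https://a.example/x?next=http://b.example/"  # made-up input
print(split_keep_prefix(sample))
```

For the string in the question, which starts with a URL, this should give the same list as the one-liner above.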

Answer 1 (score: 1)

You could take your result and join each pair of consecutive items, which would work.

urls = [urls[i]+urls[i+1] for i in range(0,len(urls),2)]

But it is better to use findall with a lookahead for https? or the end of the string:

import re

string = "https://www.link1.net/abc/cik?xai=En8MmT__aF_nQm-F48&sig=Cg0A7_5AE&urlfix=1&;ccurl=https://aax-us.link-two.com/x/c/Qoj_sZnkA%2526adurl%253Dhttp%253A%252F%252Fwww.link-three.mu%252F"

print(re.findall("https?.*?(?=https?|$)",string))

Result:

['https://www.link1.net/abc/cik?xai=En8MmT__aF_nQm-F48&sig=Cg0A7_5AE&urlfix=1&;ccurl=',
 'https://aax-us.link-two.com/x/c/Qoj_sZnkA%2526adurl%253D', 
 'http%253A%252F%252Fwww.link-three.mu%252F']
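Since the question asks about splitting specifically, the same lookahead can also be handed to re.split. This is a sketch that assumes Python 3.7 or later, where re.split accepts patterns that match the empty string; the variable name parts is illustrative only.

```python
import re

string = "https://www.link1.net/abc/cik?xai=En8MmT__aF_nQm-F48&sig=Cg0A7_5AE&urlfix=1&;ccurl=https://aax-us.link-two.com/x/c/Qoj_sZnkA%2526adurl%253Dhttp%253A%252F%252Fwww.link-three.mu%252F"

# (?=https?) matches the empty position just before each "http"/"https",
# so nothing is consumed and the delimiter stays attached to what follows.
# The leading empty string (the split at position 0) is filtered out.
parts = [p for p in re.split(r"(?=https?)", string) if p]
print(parts)
```

Because the split points are zero-width, joining the pieces back together reproduces the original string.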

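As noted in the comments, because you cannot add : to the delimiter (the last URL in the string is encoded as http%253A rather than http:), you cannot be sure the URLs are split correctly: if a URL contains http somewhere inside its address, it will be cut at that point.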
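The caveat can be seen with a short experiment. The URL below is made up, and the stricter pattern is only an assumption based on the two forms that occur in the question's string ("://" and the encoded colon "%253A"); other encodings would need their own alternatives.

```python
import re

tricky = "https://site.example/http-status"  # made-up URL with "http" in its path

# The pattern from the answer cuts the URL at the inner "http".
loose = re.findall(r"https?.*?(?=https?|$)", tricky)

# Assumed stricter delimiter: "http"/"https" followed by "://" or "%253A".
START = r"https?(?:://|%253A)"
STRICT = re.compile(START + r".*?(?=" + START + r"|$)")
strict = STRICT.findall(tricky)

print(loose)   # two pieces: the URL is broken at "http-status"
print(strict)  # one piece: the URL stays whole
```

The stricter pattern trades generality for safety: it only recognises the URL starts it was told about, so it is a sketch to adapt rather than a general URL splitter.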
