Question

I have this regular expression: http://regexr.com/39rbe

1413323829.0907|172.168.1.0|  |somedomain.com|OK|0015e248f2484591f52ed37030001|st=bla&cp=huh%2Cs_de%2Cf_bt%2Ce_rc%2Ch_sub%2Cl_ol%2Ca_noapp%2Cp_npaid%2Ci_t-e&sv=i2&pt=CP&rf=www.google.de&r2=https%3A%2F%2Fwww.google.de%2F&ur=mydomain.de&xy=1366x768x24&lo=DE%asdaasdasdcb=0009&vr=306&id=guccjs&lt=1413373830843&ev=&cs=w2dwmo&mo=1&la=1413773766|i00=0615e248f8484591f52ed47030001%3B543e5f46%3B55966cde|Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/527.36 (KHTML, like Gecko) Chrome/37.0.2162.124 Safari/527.36|http://mydomain.de/uriPath|023|web|OK|OK

I am trying to capture the user agent string of entries whose URL is http://mydomain.de/uriPath. My attempt so far, which does not work yet:

[^\|]+(?=https?:\/\/(?:www\.)?mydomain\.de[^\|]+)

Answer 1

How about this?

\|[^|]+\|(?=https?:\/\/(?:www\.)?mydomain\.de[^\|]+)

Example: http://regex101.com/r/tF4jD3/5

If you don't want the leading and trailing | in the match, move them into the look-around assertions:

(?<=\|)[^|]+(?=\|https?:\/\/(?:www\.)?mydomain\.de[^\|]+)

Output

Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/527.36 (KHTML, like Gecko) Chrome/37.0.2162.124 Safari/527.36

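What does it do?

(?<=\|) asserts that the match is immediately preceded by |

[^|]+ matches one or more characters other than |

(?=\|https?:\/\/(?:www\.)?mydomain\.de[^\|]+) asserts that this run of non-| characters is followed by | and a mydomain.de URL, here |http://mydomain.de/uriPath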
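The three parts above can be checked in Python, whose re module supports both assertions. This is only a sketch: the log line is a shortened, made-up stand-in for the one in the question, and the variable names are illustrative. It also shows why the attempt from the question finds nothing: its lookahead expects the URL directly after the non-| run, but a | always sits between the two fields.

```python
import re

# Shortened, illustrative stand-in for the log line in the question
# (same pipe-separated layout, fewer fields).
line = (
    "1413323829.0907|172.168.1.0|somedomain.com|OK|"
    "Mozilla/5.0 (Windows NT 6.1; WOW64)|http://mydomain.de/uriPath|023|web|OK|OK"
)

pattern = re.compile(
    r"(?<=\|)"                                     # preceded by a pipe
    r"[^|]+"                                       # one field: non-pipe characters
    r"(?=\|https?://(?:www\.)?mydomain\.de[^|]+)"  # next field is a mydomain.de URL
)

match = pattern.search(line)
user_agent = match.group(0) if match else None
print(user_agent)  # Mozilla/5.0 (Windows NT 6.1; WOW64)

# The pattern from the question: no pipe inside the lookahead, so it never matches.
original_attempt = re.search(
    r"[^\|]+(?=https?:\/\/(?:www\.)?mydomain\.de[^\|]+)", line
)
print(original_attempt)  # None
```

Because the pipes live inside the assertions, group(0) is the bare field with no delimiters attached.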

Edit

Using a capturing group:

\|([^|]+)\|(?:https?:\/\/(?:www\.)?mydomain\.de[^\|]+)
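As a rough sketch of how the capturing-group version might be applied to several log lines, the snippet below compiles the pattern once and collects group 1 from every line that matches. The three log lines and the names PATTERN, log_lines and agents are made up for illustration.

```python
import re

# Capturing-group variant: group 1 is the field right before the URL field.
PATTERN = re.compile(r"\|([^|]+)\|(?:https?://(?:www\.)?mydomain\.de[^|]+)")

# Hypothetical, shortened log lines (not taken from the question).
log_lines = [
    "1.0|10.0.0.1|OK|AgentA/1.0|http://mydomain.de/uriPath|023|web",
    "2.0|10.0.0.2|OK|AgentB/2.0|http://otherdomain.com/page|023|web",
    "3.0|10.0.0.3|OK|AgentC/3.0|https://www.mydomain.de/other|023|web",
]

agents = []
for line in log_lines:
    m = PATTERN.search(line)
    if m:  # lines whose URL field is on another domain are skipped
        agents.append(m.group(1))

print(agents)  # ['AgentA/1.0', 'AgentC/3.0']
```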

Answer 2

Use the positive lookahead below,

[^|]+(?=\|[^\|]*(?:https?:\/\/)(?:www\.)?mydomain\.de[^\|]+)

DEMO

or

use capturing groups,

\|([^|]+)\|[^\|]*(?:https?:\/\/)(?:www\.)?mydomain\.de[^\|]+

DEMO

>>> import re
>>> s = "1413323829.0907|172.168.1.0| |somedomain.com|OK|0015e248f2484591f52ed37030001|st=bla&cp=huh%2Cs_de%2Cf_bt%2Ce_rc%2Ch_sub%2Cl_ol%2Ca_noapp%2Cp_npaid%2Ci_t-e&sv=i2&pt=CP&rf=www.google.de&r2=https%3A%2F%2Fwww.google.de%2F&ur=mydomain.de&xy=1366x768x24&lo=DE%asdaasdasdcb=0009&vr=306&id=guccjs&lt=1413373830843&ev=&cs=w2dwmo&mo=1&la=1413773766|i00=0615e248f8484591f52ed47030001%3B543e5f46%3B55966cde|Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/527.36 (KHTML, like Gecko) Chrome/37.0.2162.124 Safari/527.36|http://mydomain.de/uriPath|023|web|OK|OK"
>>> re.search(r'\|([^|]+)\|[^\|]*(?:https?:\/\/)(?:www\.)?mydomain\.de[^\|]+', s).group(1)
'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/527.36 (KHTML, like Gecko) Chrome/37.0.2162.124 Safari/527.36'

By splitting,

import re
s = "1413323829.0907|172.168.1.0| |somedomain.com|OK|0015e248f2484591f52ed37030001|st=bla&cp=huh%2Cs_de%2Cf_bt%2Ce_rc%2Ch_sub%2Cl_ol%2Ca_noapp%2Cp_npaid%2Ci_t-e&sv=i2&pt=CP&rf=www.google.de&r2=https%3A%2F%2Fwww.google.de%2F&ur=mydomain.de&xy=1366x768x24&lo=DE%asdaasdasdcb=0009&vr=306&id=guccjs&lt=1413373830843&ev=&cs=w2dwmo&mo=1&la=1413773766|i00=0615e248f8484591f52ed47030001%3B543e5f46%3B55966cde|Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/527.36 (KHTML, like Gecko) Chrome/37.0.2162.124 Safari/527.36|http://mydomain.de/uriPath|023|web|OK|OK"
L = s.split('|')
previous = ''
for i in L:
    if re.match(r'[^\|]*(?:https?:\/\/)(?:www\.)?mydomain\.de[^\|]+', i):
        print(previous)
    previous = i

Output:

Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/527.36 (KHTML, like Gecko) Chrome/37.0.2162.124 Safari/527.36
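The splitting approach could also be wrapped in a small helper that pairs each field with the one after it. This is a sketch rather than a definitive implementation: user_agents_before_url and URL_RE are hypothetical names, the sample line is shortened and made up, and unlike the pattern above it anchors the URL at the start of the field.

```python
import re

# Assumption: the URL field starts directly with the scheme.
URL_RE = re.compile(r"https?://(?:www\.)?mydomain\.de[^|]+")

def user_agents_before_url(line):
    """Return every field that directly precedes a mydomain.de URL field."""
    fields = line.split("|")
    # zip pairs each field with its successor: (f0, f1), (f1, f2), ...
    return [prev for prev, cur in zip(fields, fields[1:]) if URL_RE.match(cur)]

# Hypothetical, shortened log line for illustration.
sample = "1.0|10.0.0.1|OK|AgentA/1.0|http://mydomain.de/uriPath|023|web"
result = user_agents_before_url(sample)
print(result)  # ['AgentA/1.0']
```

Pairing neighbours with zip avoids the mutable previous variable and returns a list, so a line with more than one matching URL field yields all preceding fields.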

Lookahead for URL and user agent in a fairly complex log file

2 answers: