Finding multiple strings in text with a Python regular expression

Date: 2015-04-15 07:34:32

Tags: python regex parsing

I have the following string:

background:url('http://images.bloomingdales.com/is/image/BLM/?&$b=BLM/swatches/&layer=0&size=322,23&src=is{$b$1/optimized/8757901_fpx.tif}&cropN=0,0,14,1&anchor=0,0&layer=1&size=23,23&src=is{$b$2/optimized/8757902_fpx.tif}&anchor=0,0&posN=0.071,0&layer=2&size=23,23&src=is{$b$4/optimized/8234544_fpx.tif}&anchor=0,0&posN=0.143,0&layer=3&size=23,23&src=is{$b$7/optimized/1111977_fpx.tif}&anchor=0,0&posN=0.214,0&layer=4&size=23,23&src=is{$b$0/optimized/8538460_fpx.tif}&anchor=0,0&posN=0.286,0&layer=5&size=23,23&src=is{$b$5/optimized/8234545_fpx.tif}&anchor=0,0&posN=0.357,0&layer=6&size=23,23&src=is{$b$3/optimized/1111973_fpx.tif}&anchor=0,0&posN=0.429,0&layer=7&size=23,23&src=is{$b$7/optimized/1252857_fpx.tif}&anchor=0,0&posN=0.5,0&layer=8&size=23,23&src=is{$b$8/optimized/1252858_fpx.tif}&anchor=0,0&posN=0.571,0&layer=9&size=23,23&src=is{$b$7/optimized/8234547_fpx.tif}&anchor=0,0&posN=0.643,0&layer=10&size=23,23&src=is{$b$0/optimized/8757900_fpx.tif}&anchor=0,0&posN=0.714,0&layer=11&size=23,23&src=is{$b$0/optimized/1111970_fpx.tif}&anchor=0,0&posN=0.786,0&layer=12&size=23,23&src=is{$b$1/optimized/1111971_fpx.tif}&anchor=0,0&posN=0.857,0&layer=13&size=23,23&src=is{$b$2/optimized/1111972_fpx.tif}&anchor=0,0&posN=0.929,0&layer=14&op_sharpen=1&fmt=jpeg&qlt=90,0&hei=23') 322px 0 transparent;

I need to extract all of these parts:

1/optimized/8757901_fpx.tif, 2/optimized/8757902_fpx.tif, and so on.

I am using this regular expression:

re.findall(re.compile(r'\d{1,2}/optimized/.+\.tif'), swatch)

It returns the wrong result, a single match that runs from the first path to the last .tif:

['1/optimized/8757901_fpx.tif}&cropN=0,0,14,1&anchor=0,0&layer=1&size=23,23&src=is{$b$2/optimized/8757902_fpx.tif}&anchor=0,0&posN=0.071,0&layer=2&size=23,23&src=is{$b$4/optimized/8234544_fpx.tif}&anchor=0,0&posN=0.143,0&layer=3&size=23,23&src=is{$b$7/optimized/1111977_fpx.tif}&anchor=0,0&posN=0.214,0&layer=4&size=23,23&src=is{$b$0/optimized/8538460_fpx.tif}&anchor=0,0&posN=0.286,0&layer=5&size=23,23&src=is{$b$5/optimized/8234545_fpx.tif}&anchor=0,0&posN=0.357,0&layer=6&size=23,23&src=is{$b$3/optimized/1111973_fpx.tif}&anchor=0,0&posN=0.429,0&layer=7&size=23,23&src=is{$b$7/optimized/1252857_fpx.tif}&anchor=0,0&posN=0.5,0&layer=8&size=23,23&src=is{$b$8/optimized/1252858_fpx.tif}&anchor=0,0&posN=0.571,0&layer=9&size=23,23&src=is{$b$7/optimized/8234547_fpx.tif}&anchor=0,0&posN=0.643,0&layer=10&size=23,23&src=is{$b$0/optimized/8757900_fpx.tif}&anchor=0,0&posN=0.714,0&layer=11&size=23,23&src=is{$b$0/optimized/1111970_fpx.tif}&anchor=0,0&posN=0.786,0&layer=12&size=23,23&src=is{$b$1/optimized/1111971_fpx.tif}&anchor=0,0&posN=0.857,0&layer=13&size=23,23&src=is{$b$2/optimized/1111972_fpx.tif']

I tested this regex on regex101.com and it works fine there: https://regex101.com/r/tV9kU8/1#

3 answers:

Answer 0 (score: 3)

re.findall(r'\d{1,2}/optimized/.+?\.tif', swatch)

                                ^^

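Make .+ non-greedy by adding ? to the quantifier.

Answer 1 (score: 2)

Instead of the greedy .+, use the quantifier in ungreedy (lazy) mode: .+?. That way your regex never matches more characters than necessary between / and .tif, i.e. each match ends at the next instance of .tif.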
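
The difference is easy to see on a shortened version of the input. The sketch below is only an illustration: the two-layer sample string and the variable names are made up for brevity and are not the full URL from the question.

```python
import re

# Shortened, hypothetical sample with two layers (not the full string from the question).
sample = ("src=is{$b$1/optimized/8757901_fpx.tif}&anchor=0,0"
          "&src=is{$b$2/optimized/8757902_fpx.tif}&hei=23")

# Greedy: .+ first consumes the rest of the string, then backtracks
# only as far as the LAST ".tif", so everything ends up in one match.
greedy = re.findall(r'\d{1,2}/optimized/.+\.tif', sample)

# Lazy: .+? grows one character at a time and stops at the FIRST ".tif".
lazy = re.findall(r'\d{1,2}/optimized/.+?\.tif', sample)

print(len(greedy))  # 1
print(lazy)         # ['1/optimized/8757901_fpx.tif', '2/optimized/8757902_fpx.tif']
```

With the greedy quantifier the single match spans both paths, which is the behaviour reported in the question.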

Answer 2 (score: 1)

You can use non-greedy grouping in your regex (note that in your own pattern you need to add ? after + to make it non-greedy):

>>> re.findall(re.compile(r'{\$b\$(.*?)}'), s)
['1/optimized/8757901_fpx.tif', '2/optimized/8757902_fpx.tif', 
'4/optimized/8234544_fpx.tif', '7/optimized/1111977_fpx.tif', 
'0/optimized/8538460_fpx.tif', '5/optimized/8234545_fpx.tif', 
'3/optimized/1111973_fpx.tif', '7/optimized/1252857_fpx.tif', 
'8/optimized/1252858_fpx.tif', '7/optimized/8234547_fpx.tif', 
'0/optimized/8757900_fpx.tif', '0/optimized/1111970_fpx.tif', 
'1/optimized/1111971_fpx.tif', '2/optimized/1111972_fpx.tif']

Since all of your image paths come right after $b$ (escaped as \$b\$ in the pattern), you can use the following pattern:

{\$b\$(.*?)}

It matches everything that follows $b$ inside the braces, up to the closing }.
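
Because the pattern contains exactly one capturing group, re.findall returns only the captured text, without the surrounding {$b$ and }. Here is a minimal sketch of that behaviour, again on a shortened, hypothetical sample. The variable names are illustrative, and the braces are escaped here only for readability.

```python
import re

# Shortened, hypothetical sample with two layers (not the full string from the question).
sample = ("src=is{$b$1/optimized/8757901_fpx.tif}&anchor=0,0"
          "&src=is{$b$2/optimized/8757902_fpx.tif}&hei=23")

# \$ must be escaped because a bare $ is an end-of-string anchor.
# (.*?) lazily captures everything up to the first closing brace.
pattern = re.compile(r'\{\$b\$(.*?)\}')

# With one capturing group, findall returns the group contents only.
paths = pattern.findall(sample)
print(paths)  # ['1/optimized/8757901_fpx.tif', '2/optimized/8757902_fpx.tif']

# finditer exposes both the full match (group 0) and the captured path (group 1).
for match in pattern.finditer(sample):
    print(match.group(0), '->', match.group(1))
```

Note that this variant returns whatever sits between the braces, so it does not check that the text ends in .tif, unlike the pattern in the question.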