Question

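I am trying to tokenize organic chemistry names, i.e. split "hexane" into its component parts ['hex', 'an', 'e'].

The core of the question is: how do I list all of the matches of a "one or more" regex, rather than only the last one?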

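I am testing with several functions from the re module. With no parentheses around the individual parts, the output is: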

Regex: \A((-|nonadeca|heptadeca|tetradec|imine|hept|heptadec|benzene|cyclo|oate|tetradeca|hex|yn|octa|phenyl|arsine|yl|dodec|e|eth|meth|pentadec|nona|phosphino|octadec|di|formyl|arsino|oct|oxo|tridec|penta|pent|dodeca|hydroxy|hexadec|hexa|ol|an|oyl|ether|non|trideca|prop|undec|hepta|pentadeca|nonadec|amine|tri|but|carbonyl|deca|en|amino|undeca|hexadeca|thiol|oxy|tetra|dec|carboxy|chloro|mercapto|iodo|fluoro|octadeca|imino|bromo|al|phosphine|carboxylicacid|amide|one|amido|oicacid)+)\Z
Findall: [('hexane', 'e')]
Finditer [('hexane', 'e')]
Search: ('hexane', 'e')
Split ['', 'hexane', 'e', '']
Match ('hexane', 'e')

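In all of my tests, name = "hexane". This should parse to ['hex', 'an', 'e']. The regexes I have tried follow the pattern "\A({many groups here, separated by pipes})+\Z", where the groups are a subset of the prefixes and suffixes used in organic chemistry names.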
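
A minimal sketch of a test harness that produces output in the shape shown above. The shortened part list and the variable names are assumptions made for illustration; the original test code is not shown here.

```python
import re

# Hypothetical, shortened part list; the question uses a much longer one.
parts = ['hex', 'hexa', 'an', 'e', 'meth', 'eth']

# \A((part1|part2|...)+)\Z : one group around the whole repetition,
# one group around the alternation that is repeated.
regex = r'\A((' + '|'.join(re.escape(p) for p in parts) + r')+)\Z'
name = 'hexane'

print('Regex:', regex)
print('Findall:', re.findall(regex, name))
print('Finditer', [m.groups() for m in re.finditer(regex, name)])
print('Search:', re.search(regex, name).groups())
print('Split', re.split(regex, name))
print('Match', re.match(regex, name).groups())
```

Every call reports the whole name plus only the last repetition of the inner group, which is the behaviour the question is about.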

With parentheses around each part of the regex, I get the following output:

Regex: \A(-|(tetradec)|(thiol)|(phenyl)|(arsino)|(carbonyl)|(one)|(e)|(fluoro)|(ol)|(ether)|(eth)|(trideca)|(hex)|(iodo)|(nonadeca)|(non)|(pent)|(al)|(octa)|(octadec)|(di)|(undeca)|(arsine)|(tri)|(cyclo)|(prop)|(nona)|(dodec)|(phosphine)|(yn)|(but)|(an)|(heptadeca)|(carboxy)|(imine)|(hept)|(octadeca)|(amide)|(imino)|(deca)|(dodeca)|(oct)|(hydroxy)|(bromo)|(undec)|(pentadeca)|(tetra)|(hexadec)|(benzene)|(phosphino)|(hexa)|(tridec)|(mercapto)|(dec)|(oyl)|(oxy)|(meth)|(penta)|(amido)|(oicacid)|(amine)|(yl)|(nonadec)|(tetradeca)|(hexadeca)|(carboxylicacid)|(amino)|(chloro)|(pentadec)|(en)|(hepta)|(heptadec)|(formyl)|(oate)|(oxo))+\Z
Findall: [('e', '', '', '', '', '', '', 'e', '', '', '', '', '', 'hex', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'an', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '')]
Finditer [('e', None, None, None, None, None, None, 'e', None, None, None, None, None, 'hex', None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, 'an', None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None)]
Search: ('e', None, None, None, None, None, None, 'e', None, None, None, None, None, 'hex', None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, 'an', None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None)
Split ['', 'e', None, None, None, None, None, None, 'e', None, None, None, None, None, 'hex', None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, 'an', None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, '']
Match ('e', None, None, None, None, None, None, 'e', None, None, None, None, None, 'hex', None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, 'an', None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None)

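This shows that the regex must be finding the ['hex', 'an', 'e'] split correctly, since no other combination of parts gives a full \A-STUFF_IN_HERE-\Z match. However, none of the results hands me the molecule split into its component parts.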

In the parenthesized pattern above, each part has its own capture group, and the only per-part groups that are filled are 'e', 'hex', and 'an'.

This again shows that the parts ['hex', 'an', 'e'] were parsed successfully, but it still does not give me those parts as a simple list.
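
The per-part behaviour can be reproduced with a reduced pattern. This is only an illustrative sketch: the three-part vocabulary is an assumption, and the parts are deliberately listed in a different order from the one in which they occur in the name.

```python
import re

# Hypothetical three-part vocabulary, listed as e, an, hex on purpose.
m = re.match(r'\A((e)|(an)|(hex))+\Z', 'hexane')

# Group 1 is the repeated outer group and keeps only its last repetition.
# Groups 2-4 each keep their own last match, reported in pattern order,
# not in the order the parts appear in the name.
print(m.groups())  # ('e', 'e', 'an', 'hex')
```

So the parts are all present in the match object, but their order in the name is lost, and a part that occurred twice would be reported only once.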

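Note: ambiguities such as the one between the prefixes "hex" and "hexa" make a simple left-to-right re.split or re.findall without the \A and \Z anchors unworkable. Either priority always goes to "hex", in which case "hexapentyldecane" is parsed as ["hex", ?????] and breaks on the trailing "a", or priority always goes to "hexa", in which case "hexane" is parsed as ["hexa", ???] and breaks on the trailing "n".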
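
The ambiguity can be illustrated with a small sketch. The helper greedy_split and the shortened part list are hypothetical; the helper always takes the first listed part that fits and never backs up, which is how a plain left-to-right scan behaves.

```python
def greedy_split(name, parts):
    """Take the first listed part that fits at each position; never back up."""
    tokens = []
    pos = 0
    while pos < len(name):
        for part in parts:
            if name.startswith(part, pos):
                tokens.append(part)
                pos += len(part)
                break
        else:
            break  # no part fits at this position: give up
    return tokens, name[pos:]  # parts found so far, plus the unparsed remainder


# Hypothetical, shortened part list.
rest = ['pent', 'yl', 'dec', 'an', 'e']

# Priority to "hex": stuck on the "a" of "hexa".
print(greedy_split('hexapentyldecane', ['hex', 'hexa'] + rest))
# (['hex'], 'apentyldecane')

# Priority to "hexa": stuck on the "n" of "hexane".
print(greedy_split('hexane', ['hexa', 'hex'] + rest))
# (['hexa'], 'ne')
```

Each name parses cleanly under the opposite priority, so no single fixed ordering works for both.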

Answer 1

If you do not know in advance how many groups will match, a single regex cannot capture all of them in a convenient structure. But you can loop or split.

import re

string = 'hexane'
while True:
    oldstring = string
    # Strip one known part from the front of the string.
    string = re.sub(r'\A(-|nonadeca|heptadeca|tetradec|imine|hept|heptadec|benzene|cyclo|oate|tetradeca|hex|yn|octa|phenyl|arsine|yl|dodec|e|eth|meth|pentadec|nona|phosphino|octadec|di|formyl|arsino|oct|oxo|tridec|penta|pent|dodeca|hydroxy|hexadec|hexa|ol|an|oyl|ether|non|trideca|prop|undec|hepta|pentadeca|nonadec|amine|tri|but|carbonyl|deca|en|amino|undeca|hexadeca|thiol|oxy|tetra|dec|carboxy|chloro|mercapto|iodo|fluoro|octadeca|imino|bromo|al|phosphine|carboxylicacid|amide|one|amido|oicacid)', '', string)
    if string == oldstring:
        # Nothing was removed: stop instead of looping forever.
        print('unparsed:', oldstring)
        break
    if not string:
        print(oldstring)  # the last part
        break
    print(oldstring[0:-len(string)])  # the part that was just removed

The above is not particularly elegant, but it should at least get you started.
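
One possible refinement of the loop, sketched here under assumptions rather than taken from the answer: a small backtracking tokenizer that returns the parts as a list and retries with another part when a choice leads to a dead end, which is what the "hex"/"hexa" ambiguity from the question requires. The names tokenize and PARTS and the shortened part list are hypothetical.

```python
from functools import lru_cache

# Hypothetical subset of the part list, for illustration only.
PARTS = ('hexa', 'hex', 'pent', 'yl', 'dec', 'an', 'e')


def tokenize(name, parts=PARTS):
    """Return one split of name into known parts, or None if none exists."""

    @lru_cache(maxsize=None)  # remember dead ends so they are not retried
    def split_from(pos):
        if pos == len(name):
            return ()  # reached the end: a complete split
        for part in parts:
            if name.startswith(part, pos):
                rest = split_from(pos + len(part))
                if rest is not None:
                    return (part,) + rest
        return None  # no part leads to a complete split from here

    result = split_from(0)
    return None if result is None else list(result)


print(tokenize('hexane'))            # ['hex', 'an', 'e']
print(tokenize('hexapentyldecane'))  # ['hexa', 'pent', 'yl', 'dec', 'an', 'e']
print(tokenize('hexanx'))            # None
```

It returns the first complete split found in the listed part order. If a name can be split in more than one way, choosing between the candidates would need extra rules that are outside this sketch.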
