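How to avoid false matches of chemical formulas when using scanString in pyparsing

Time: 2015-07-01 13:35:31

Tags: pyparsing

I am trying to use pyparsing to scan text for chemical formulas. I have the following example code: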

from pyparsing import *

caps = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
lowers = caps.lower()
digits = "0123456789"
integer = Word( digits )
parl = Literal("(").suppress()
parr = Literal(")").suppress()

element = oneOf( """H He Li Be B C N O F Ne Na Mg Al Si P S Cl
            Ar K Ca Sc Ti V Cr Mn Fe Co Ni Cu Zn Ga Ge
            As Se Br Kr Rb Sr Y Zr Nb Mo Tc Ru Rh Pd Ag
            Cd In Sn Sb Te I Xe Cs Ba Lu Hf Ta W Re Os
            Ir Pt Au Hg Tl Pb Bi Po At Rn Fr Ra Lr Rf
            Db Sg Bh Hs Mt Ds Rg Uub Uut Uuq Uup Uuh Uus
            Uuo La Ce Pr Nd Pm Sm Eu Gd Tb Dy Ho Er Tm
            Yb Ac Th Pa U Np Pu Am Cm Bk Cf Es Fm Md No """ )

separator = Literal( "," ).setParseAction(lambda s,l,t: t[0].replace(',','.')) | Literal( "." )

nreal = (Combine( integer + Optional( separator +\
    Optional( integer ) ))\
    | Combine( separator + integer )).setParseAction( lambda s,l,t: [ float(t[0]) ] )

block = Forward()
groupElem = (Group( element('elem') + Optional( nreal, default=1)('esteq') ))('dupla') | \
     Group( parl + block + parr + Optional( nreal,default=1 )('modi'))
block << groupElem + ZeroOrMore( groupElem )
formula = OneOrMore( block )+ Optional(Or([Literal("-"), Literal("+")]))

s = '''Water is H2O not h2o, methane is CH4 and of course there is PtCl4.
What about H+ and OH-? and carbon or Carbon or H2SO4?'''
for match, start, stop in formula.scanString(s):
  print(match, s[start:stop])

and it outputs:

 [['W', 1]] W
 [['H', 2.0], ['O', 1]] H2O
 [['C', 1], ['H', 4.0]] CH4
 [['Pt', 1], ['Cl', 4.0], ['W', 1]] PtCl4.
 W
 [['H', 1], '+'] H+
 [['O', 1], ['H', 1], '-'] OH-     
 [['Ca', 1]] Ca
 [['H', 2.0], ['S', 1], ['O', 4.0]] H2SO4

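This is roughly right, but there are a few false hits. For example, the W and the Ca from "Carbon" should not be listed. I am not sure how to modify the grammar to express that the Ca in "Carbon" is not a chemical formula. The parser works perfectly with parseString on a bare formula, but it is not specific enough in mixed text. Any hints on how to fix this?

1 Answer:

Answer 0: (score: 0)

I think you want every formula to be a self-contained group of characters, that is, one that neither starts nor ends in the middle of a word. So just change the definition of formula to: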

formula = (WordStart() + 
           OneOrMore( block )+ Optional(Or([Literal("-"), Literal("+")])) +
           WordEnd())