I am trying to use pyparsing to scan text for chemical formulas. I have the following sample code:
from pyparsing import *
caps = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
lowers = caps.lower()
digits = "0123456789"
integer = Word( digits )
parl = Literal("(").suppress()
parr = Literal(")").suppress()
element = oneOf( """H He Li Be B C N O F Ne Na Mg Al Si P S Cl
Ar K Ca Sc Ti V Cr Mn Fe Co Ni Cu Zn Ga Ge
As Se Br Kr Rb Sr Y Zr Nb Mo Tc Ru Rh Pd Ag
Cd In Sn Sb Te I Xe Cs Ba Lu Hf Ta W Re Os
Ir Pt Au Hg Tl Pb Bi Po At Rn Fr Ra Lr Rf
Db Sg Bh Hs Mt Ds Rg Uub Uut Uuq Uup Uuh Uus
Uuo La Ce Pr Nd Pm Sm Eu Gd Tb Dy Ho Er Tm
Yb Ac Th Pa U Np Pu Am Cm Bk Cf Es Fm Md No """ )
separator = Literal( "," ).setParseAction(lambda s,l,t: t[0].replace(',','.')) | Literal( "." )
nreal = (Combine( integer + Optional( separator +\
Optional( integer ) ))\
| Combine( separator + integer )).setParseAction( lambda s,l,t: [ float(t[0]) ] )
block = Forward()
groupElem = (Group( element('elem') + Optional( nreal, default=1)('esteq') ))('dupla') | \
Group( parl + block + parr + Optional( nreal,default=1 )('modi'))
block << groupElem + ZeroOrMore( groupElem )
formula = OneOrMore( block )+ Optional(Or([Literal("-"), Literal("+")]))
s = '''Water is H2O not h2o, methane is CH4 and of course there is PtCl4.
What about H+ and OH-? and carbon or Carbon or H2SO4?'''
for match, start, stop in formula.scanString(s):
    print(match, s[start:stop])
which outputs:
[['W', 1]] W
[['H', 2.0], ['O', 1]] H2O
[['C', 1], ['H', 4.0]] CH4
[['Pt', 1], ['Cl', 4.0], ['W', 1]] PtCl4.
W
[['H', 1], '+'] H+
[['O', 1], ['H', 1], '-'] OH-
[['Ca', 1]] Ca
[['H', 2.0], ['S', 1], ['O', 4.0]] H2SO4
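This is roughly right, but there are some false hits. For example, the W from "Water" and "What" and the Ca from "Carbon" should not be listed. I am not sure how to modify the grammar to indicate that the Ca in "Carbon" is not a chemical formula. The parser works perfectly with parseString on a bare formula, but it is not specific enough in mixed text. Any hints on how to fix this?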
Answer 0 (score: 0)
I think you want every formula to be a self-contained word, so just change the definition of formula to:
formula = (WordStart() +
OneOrMore( block )+ Optional(Or([Literal("-"), Literal("+")])) +
WordEnd())