Question

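I am writing a Python script to identify and remove reference numbers from natural-text documents. For example, in "The patient used a successful treatment4-6", 4-6 is the reference number.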

I have successfully created a grammar that can tell apart a string of letters (a word) and the reference numbers that follow it:

import pyparsing
from pyparsing import Word, WordStart, WordEnd

real_word = Word(pyparsing.alphas)
first_punctuation = Word('.!?:,;')
second_punctuation = Word('.!?:,;-')
nums = Word(pyparsing.nums)

number_then_punctuation = (WordStart() + real_word + nums +
    second_punctuation + pyparsing.ZeroOrMore(nums |
    second_punctuation) + WordEnd())

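However, I would like to extend this to recognize words that may contain dashes or other characters. I think the simplest solution is a grammar that recognizes only the reference-number pattern within a token and then strips it from the token (without caring what the "word" part looks like), like this: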

number_then_punctuation = (nums + second_punctuation +
    pyparsing.ZeroOrMore(nums | second_punctuation) + WordEnd())

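However, this fails when I use parseString to find the pattern in a token, because the grammar has no pattern for the word that precedes the reference number. Within a token, how can I skip ahead to the reference-number pattern while keeping both the pattern and the preceding "word" in a list? I can use searchString to find the pattern, but that does not keep the preceding word.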
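To make the problem concrete, here is a minimal sketch of the two behaviours described above, using the second grammar and the token 'treatment4-6'. The names token, parse_failed and found are only illustrative, and the camelCase method names match the ones used in this post (pyparsing 3 also offers snake_case equivalents).

```python
import pyparsing as pp

nums = pp.Word(pp.nums)
second_punctuation = pp.Word('.!?:,;-')
number_then_punctuation = (
    nums + second_punctuation
    + pp.ZeroOrMore(nums | second_punctuation) + pp.WordEnd()
)

token = 'treatment4-6'

# parseString starts matching at the beginning of the token, so the
# leading word makes the parse fail.
try:
    number_then_punctuation.parseString(token)
    parse_failed = False
except pp.ParseException:
    parse_failed = True

# searchString scans ahead and finds the reference number, but the
# preceding word is not part of the result.
found = number_then_punctuation.searchString(token).asList()

print(parse_failed)
print(found)
```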

In the example above, I would like to get back ['treatment', '4-6']. I can use searchString and then Python's str.find() method:

string_test = 'treatment4-5'
x = number_then_punctuation.searchString(string_test).asList()[0][0]
index = string_test.find(x)
split = [string_test[:index], string_test[index:]]

But I was hoping there is a way to do this that is built into pyparsing.
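One possible built-in route, offered as a sketch rather than a definitive answer, is pyparsing's SkipTo, which consumes everything up to the first position where another expression matches. The names reference, word_then_reference and split_token below are hypothetical, and wrapping the number pattern in Combine (so that it comes back as a single string) is an assumption about the desired output.

```python
import pyparsing as pp

nums = pp.Word(pp.nums)
second_punctuation = pp.Word('.!?:,;-')

# Combine joins the pieces of the reference into one string, e.g. '4-6'.
reference = pp.Combine(
    nums + second_punctuation + pp.ZeroOrMore(nums | second_punctuation)
) + pp.WordEnd()

# SkipTo returns the text before the first position where `reference`
# matches, whatever characters the "word" part contains.
word_then_reference = pp.SkipTo(reference) + reference


def split_token(token):
    """Return [word, reference], or [token] if no trailing reference is found."""
    try:
        return word_then_reference.parseString(token).asList()
    except pp.ParseException:
        return [token]


print(split_token('treatment4-6'))
print(split_token('well-known4-6'))
```

Because SkipTo does not care what it skips, the "word" part may contain dashes or other characters. Tokens without a trailing reference are returned unchanged in this sketch.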

Thanks.

Using PyParsing to distinguish a pattern at the end of a string
