Extracting specific information from data

Time: 2016-10-08 03:06:32

Tags: python python-3.x nltk stanford-nlp information-retrieval

How can I convert data such as:

James Smith was born on November 17, 1948

into something like

("James Smith", DOB, "November 17, 1948")

without relying on positional indices into the string?

I have tried the following:

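import nltk
from nltk import word_tokenize, pos_tag

new = "James Smith was born on November 17, 1948"
# Tokenize the sentence, then tag each token with its part of speech
sentences = word_tokenize(new)
sentences = pos_tag(sentences)
# Chunk two consecutive proper-noun tags (e.g. a first and last name)
grammar = "Chunk: {<NNP*><NNP*>}"
cp = nltk.RegexpParser(grammar)
result = cp.parse(sentences)
print(result)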

How do I proceed from here to get the desired output?

2 answers:

Answer 0: (score: 1)

Split the string on 'was born on', then trim the whitespace from each part and assign the parts to name and dob.
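The split-and-trim idea could be sketched roughly as follows. The function name extract_dob and the marker parameter are hypothetical names chosen for illustration, and the sketch assumes the sentence contains the literal phrase "was born on" at most once.

```python
def extract_dob(sentence, marker="was born on"):
    """Split on the marker phrase and return (name, "DOB", date), or None."""
    # str.partition returns (before, separator, after); the separator is
    # an empty string when the marker does not occur in the sentence.
    name, sep, dob = sentence.partition(marker)
    if not sep:
        return None
    # Trim the whitespace left over on either side of the marker.
    return (name.strip(), "DOB", dob.strip())


print(extract_dob("James Smith was born on November 17, 1948"))
```

This does not depend on character positions, but it does depend on the exact wording of the marker phrase, so sentences phrased differently would return None.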

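Answer 1: (score: 1)

You can always use a regular expression. The regex (\S+)\s(\S+)\s\bwas born on\b\s(\S+)\s(\S+),\s(\S+) will match and return the data for the specific string format above.

It works here: https://regex101.com/r/W2ykKS/1

The regex in Python: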

import re

regex = r"(\S+)\s(\S+)\s\bwas born on\b\s(\S+)\s(\S+),\s(\S+)"
test_str = "James Smith was born on November 17, 1948"

matches = re.search(regex, test_str)

# group 0 is the entire match; groups 1-5 are the captured parts

print(matches.group(1)) # James
print(matches.group(2)) # Smith
print(matches.group(3)) # November
print(matches.group(4)) # 17
print(matches.group(5)) # 1948
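
To get from the individual groups to the tuple asked for in the question, one possible variation uses named groups, which also allows names of more than two words. This is only a sketch under assumptions: the pattern DOB_PATTERN and the function extract_dob_regex are hypothetical names, and the date is assumed to be written as "Month day, year" in English.

```python
import re
from datetime import datetime

# Hypothetical pattern: everything before "was born on" is the name,
# followed by a date written like "November 17, 1948".
DOB_PATTERN = re.compile(
    r"^(?P<name>.+?)\s+was born on\s+(?P<dob>\w+ \d{1,2}, \d{4})"
)


def extract_dob_regex(sentence):
    """Return (name, "DOB", date string), or None if the pattern does not match."""
    match = DOB_PATTERN.search(sentence)
    if match is None:
        return None
    return (match.group("name"), "DOB", match.group("dob"))


triple = extract_dob_regex("James Smith was born on November 17, 1948")
print(triple)

# Optionally turn the date string into a date object; %B expects the full
# month name in the current locale (English is assumed here).
birth_date = datetime.strptime(triple[2], "%B %d, %Y").date()
print(birth_date)
```

Parsing the date with datetime.strptime also acts as a sanity check, since a string such as "Foo 99, 1948" would match the pattern but raise a ValueError when parsed.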