我有一些数据看起来:
[[('Natural', 'JJ', 'B'), ('language', 'NN', 'I'), ('processing', 'NN', 'I'), ('is', 'VBZ', 'O'), ('one', 'CD', 'O'), ('of', 'IN', 'O'), ('the', 'DT', 'O'), ('important', 'JJ', 'O'), ('branch', 'NN', 'O'), ('of', 'IN', 'O'), ('CS', 'NNP', 'B'), ('.', '.', 'I')] ... ...]]
我想对具有标签B或I的连续单词进行分组,而忽略具有'O'标签的连续单词。
输出关键字应类似于:
自然语言处理, CS , 机器学习, 深度学习
我做了如下代码:
data=[[('Natural', 'JJ', 'B'), ('language', 'NN', 'I'), ('processing', 'NN', 'I'), ('is', 'VBZ', 'O'), ('one', 'CD', 'O'), ('of', 'IN', 'O'), ('the', 'DT', 'O'), ('important', 'JJ', 'O'), ('branch', 'NN', 'O'), ('of', 'IN', 'O'), ('CS', 'NNP', 'B'), ('.', '.', 'I')],
[('Machine', 'NN', 'B'), ('learning', 'NN', 'I'), (',', ',', 'I'), ('deep', 'JJ', 'I'), ('learning', 'NN', 'I'), ('are', 'VBP', 'O'), ('heavily', 'RB', 'O'), ('used', 'VBN', 'O'), ('in', 'IN', 'O'), ('natural', 'JJ', 'B'), ('language', 'NN', 'I'), ('processing', 'NN', 'I'), ('.', '.', 'I')],
[('It', 'PRP', 'O'), ('is', 'VBZ', 'O'), ('too', 'RB', 'O'), ('cool', 'JJ', 'O'), ('.', '.', 'O')]]
Key_words = []
index = 0
for sen in data:
for i in range(len(sen)):
while index < len(sen):
我不知道下一步该怎么做。谁能帮我吗?
谢谢
答案 0 :(得分:1)
您应该使用itertools.groupby
作为一个紧凑的解决方案:
import itertools
import string
data = [[('Natural', 'JJ', 'B'), ('language', 'NN', 'I'), ('processing', 'NN', 'I'), ('is', 'VBZ', 'O'), ('one', 'CD', 'O'), ('of', 'IN', 'O'), ('the', 'DT', 'O'), ('important', 'JJ', 'O'), ('branch', 'NN', 'O'), ('of', 'IN', 'O'), ('CS', 'NNP', 'B'), ('.', '.', 'I')],
[('Machine', 'NN', 'B'), ('learning', 'NN', 'I'), (',', ',', 'I'), ('deep', 'JJ', 'I'), ('learning', 'NN', 'I'), ('are', 'VBP', 'O'), ('heavily', 'RB', 'O'), ('used', 'VBN', 'O'), ('in', 'IN', 'O'), ('natural', 'JJ', 'B'), ('language', 'NN', 'I'), ('processing', 'NN', 'I'), ('.', '.', 'I')],
[('It', 'PRP', 'O'), ('is', 'VBZ', 'O'), ('too', 'RB', 'O'), ('cool', 'JJ', 'O'), ('.', '.', 'O')]]
punctuation = set(string.punctuation)
keywords = [[' '.join(w[0] for w in g) for k, g in itertools.groupby(sen, key=lambda x: x[0] not in punctuation and x[2] != 'O') if k] for sen in data]
print(keywords)
# [['Natural language processing', 'CS'],
# ['Machine learning', 'deep learning', 'natural language processing'],
# []]
答案 1 :(得分:0)
希望这会有所帮助。
remove_o = list(filter(lambda x: x[2] in ['I', 'B'], data))
words = [item[0] for item in remove_o]
reuslt = ' '.join(words)
答案 2 :(得分:0)
您需要在不存在“ O”作为第三个元素的地方获取元组中的第一个值,对吗?您可以通过这种方式进行。
output = [j[0] for i in data for j in i if(j[2]!='O')]
上面的代码与
相同for i in data:
for j in i:
if(j[2]!='O'): # if(j[2] in ['I','B']) also works
print(j[0]) # Or append to the output list