Phrase matching in a list

Date: 2018-08-14 08:42:41

Tags: python list tuples phrase

Suppose I have a list representing a sentence, for example:

sent = ['terras', 'ipsius', 'Azar', 'vocatas', 'Ta', 'Xellule', 'et', 'Ginen', 'Chagem', 'in', 'contrata', 'Deyr', 'Issafisaf']

and a list of place names:

places = ['Ta Xellule', 'Ginen Chagem', 'Deyr Issafisaf']

How would I end up with:

[('O', 'terras'), ('O', 'ipsius'), ('O', 'Azar'), ('O', 'vocatas'), ('PLACE', 'Ta'), ('PLACE', 'Xellule'), ('O', 'et'), ('PLACE', 'Ginen'), ('PLACE', 'Chagem'), ('O', 'in'), ('O', 'contrata'), ('PLACE', 'Deyr'), ('PLACE', 'Issafisaf')]

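Quick note:

Ta, for example, should be tagged only when it is directly next to Xellule. If it is found in any other context in the sentence, it should not be tagged as PLACE. Example: in Ta Buni mar Ta Xellule ... only the second Ta should be tagged.

Here is a sample of my places list: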

 'Ras il Huichile',
 'Ras il Hued',
 'Ta Richardu',
 'Roma',
 'Russilion',
 'La Rukiha',
 'Irrukiha ta il Bayada',
 'Casalis Milleri',
 'Ta Sabat',
 'Casalis Zebug',
 'Ta Zagra',
 'Sagra in  Ras il Hued',
 'Ta Isalme'

Here is an example sentence:

terras ipsius Azar vocatas Ta Xellule et Ginen Chagem in contrata Deyr Issafisaf cum iuribus suis omnibus

Although in occurs inside Sagra in  Ras il Hued, it should not be tagged as a place here.

5 answers:

Answer 0: (score: 2)

OK, I updated my answer based on your edit:

from functools import reduce

sent = "terras ipsius Azar vocatas Ta Ta Zagra Ta Zagra Xellule et Ginen Chagem in contrata Deyr Issafisaf cum iuribus suis omnibus"
places = [ 'Ras il Huichile', 'Ras il Hued', 'Ta Richardu', 'Roma', 'Russilion', 'La Rukiha', 'Irrukiha ta il Bayada',
'Casalis Milleri', 'Ta Sabat', 'Casalis Zebug', 'Ta Zagra', 'Sagra in  Ras il Hued', 'Ta Isalme', 'Ta Xellule', 'Ginen Chagem',
'Deyr Issafisaf']

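# Map each place name to its pre-tagged words,
# e.g. 'Ta Zagra' -> [('PLACE', 'Ta'), ('PLACE', 'Zagra')]
places_map = {p: [('PLACE', w) for w in p.split()] for p in places}

def find_places(sent, places):
    # Base case: no places left to look for, so every remaining word is 'O'
    if not places:
        return [('O', w) for w in sent.split()]

    place = places[0]
    remaining_places = places[1:]

    # Split the sentence around this place, tag each piece recursively,
    # then stitch the pieces back together with the tagged place in between
    sent_splits = sent.split(place)
    return reduce(lambda a, b: a + places_map[place] + b,
                  [find_places(s, remaining_places) for s in sent_splits])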

print(find_places(sent, places))

The output is:

[('O', 'terras'), ('O', 'ipsius'), ('O', 'Azar'), ('O', 'vocatas'), ('O', 'Ta'), ('PLACE', 'Ta'), ('PLACE', 'Zagra'), ('PLACE', 'Ta'), ('PLACE', 'Zagra'), ('O', 'Xellule'), ('O', 'et'), ('PLACE', 'Ginen'), ('PLACE', 'Chagem'), ('O', 'in'), ('O', 'contrata'), ('PLACE', 'Deyr'), ('PLACE', 'Issafisaf'), ('O', 'cum'), ('O', 'iuribus'), ('O', 'suis'), ('O', 'omnibus')]

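So I used a recursive approach: find one place in the sentence, convert it to the required format, recurse on the remaining parts of the sentence with the remaining places, and finally merge everything back together.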
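The same matching can also be sketched directly on the token list from the question, without splitting strings. This is only a minimal sketch, not part of the answer above: `tag_places` is a hypothetical helper name, and it assumes that when several place names start at the same token the longest one should win.

```python
def tag_places(tokens, places):
    """Tag tokens that form a complete place name as 'PLACE', the rest as 'O'."""
    # Turn each place into a tuple of words; str.split() also absorbs
    # repeated spaces such as in 'Sagra in  Ras il Hued'. Empty names are skipped.
    phrases = sorted({tuple(p.split()) for p in places if p.split()},
                     key=len, reverse=True)
    tagged, i = [], 0
    while i < len(tokens):
        for phrase in phrases:  # phrases with more words are tried first
            n = len(phrase)
            if tuple(tokens[i:i + n]) == phrase:
                tagged.extend(('PLACE', w) for w in phrase)
                i += n
                break
        else:
            # No place name starts at this token
            tagged.append(('O', tokens[i]))
            i += 1
    return tagged

print(tag_places('Ta Buni mar Ta Xellule'.split(), ['Ta Xellule', 'Ginen Chagem']))
# [('O', 'Ta'), ('O', 'Buni'), ('O', 'mar'), ('PLACE', 'Ta'), ('PLACE', 'Xellule')]
```

Because whole token windows are compared, a lone Ta or in is left as 'O' unless the full phrase is present.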

Answer 1: (score: 0)

Just iterate and test:

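result = []
for word in sent:
    # A word counts as a place word if it is one of the words of any place name
    is_place = False
    for place in places:
        if word in place.split():
            is_place = True
            break
    if is_place:
        result.append(('PLACE', word))
    else:
        result.append(('O', word))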

Answer 2: (score: 0)

Try something like this:

d3.selectAll(".c3-area")
    .style ("pointer-events", "all")
    .on("mouseover", function (d) { return d3.select(this).style("opacity", 0.6)})
    .on("mouseout", function (d) { return d3.select(this).style("opacity", 0.2)})
;

Answer 3: (score: 0)

Here is a suggestion based only on list comprehensions, for comprehension fans:

sent = ['terras', 'ipsius', 'Azar', 'vocatas', 'Ta', 'Xellule', 'et', 'Ginen', 'Chagem', 'in', 'contrata', 'Deyr', 'Issafisaf']
places = ['Ta Xellule', 'Ginen Chagem', 'Deyr Issafisaf']

p      = [i for place in places for i in place.split()]
result = [('PLACE',word) if word in p else ('O',word) for word in sent]

print(result)
# [('O', 'terras'), ('O', 'ipsius'), ('O', 'Azar'), ('O', 'vocatas'), ('PLACE', 'Ta'),
#  ('PLACE', 'Xellule'), ('O', 'et'), ('PLACE', 'Ginen'), ('PLACE', 'Chagem'), 
#  ('O', 'in'), ('O', 'contrata'), ('PLACE', 'Deyr'), ('PLACE', 'Issafisaf')]
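
One thing worth checking with a word-by-word membership test is the case from the question's note, where Ta appears on its own. A small sketch reusing the same comprehension (the set `place_words` is an illustrative name, not from the original) suggests the stray Ta is tagged as well, because neighbouring words are never looked at:

```python
places = ['Ta Xellule', 'Ginen Chagem', 'Deyr Issafisaf']
sent = 'Ta Buni mar Ta Xellule'.split()

# Every individual word that occurs in any place name
place_words = {w for place in places for w in place.split()}
result = [('PLACE', w) if w in place_words else ('O', w) for w in sent]

print(result)
# [('PLACE', 'Ta'), ('O', 'Buni'), ('O', 'mar'), ('PLACE', 'Ta'), ('PLACE', 'Xellule')]
```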

Answer 4: (score: 0)

Another approach is to use join on places to build a single string, then check whether each word occurs in that string:

sent = ['terras', 'ipsius', 'Azar', 'vocatas', 'Ta', 'Xellule', 'et', 'Ginen', 'Chagem', 'in', 'contrata', 'Deyr', 'Issafisaf']
places = ['Ta Xellule', 'Ginen Chagem', 'Deyr Issafisaf']

newList = [('PLACE',elem) if elem in " ".join(places) else ('O',elem) for elem in sent]
print(newList)
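
A caveat with a substring test on the joined string: short words also match inside longer names (in occurs inside Ginen), so they end up tagged too. One possible alternative, sketched here with the standard `re` module, is to match whole place names in the joined sentence. `tag_with_regex` is a hypothetical name, and the sketch assumes tokens contain no whitespace and that longer names should win over shorter ones.

```python
import re

def tag_with_regex(tokens, places):
    # Normalise repeated spaces inside place names and try longer names first
    names = sorted({' '.join(p.split()) for p in places if p.split()},
                   key=len, reverse=True)
    if not names:
        return [('O', w) for w in tokens]
    sentence = ' '.join(tokens)
    # (?<!\S) and (?!\S) require whitespace or string boundaries around a match,
    # so a name is only matched as whole words
    pattern = re.compile(
        r'(?<!\S)(?:' + '|'.join(map(re.escape, names)) + r')(?!\S)')
    tagged, pos = [], 0
    for m in pattern.finditer(sentence):
        # Words between the previous match and this one are ordinary words
        tagged += [('O', w) for w in sentence[pos:m.start()].split()]
        tagged += [('PLACE', w) for w in m.group().split()]
        pos = m.end()
    tagged += [('O', w) for w in sentence[pos:].split()]
    return tagged

sent = ['vocatas', 'Ta', 'Xellule', 'et', 'Ginen', 'Chagem', 'in', 'contrata']
print(tag_with_regex(sent, ['Ta Xellule', 'Ginen Chagem', 'Deyr Issafisaf']))
# [('O', 'vocatas'), ('PLACE', 'Ta'), ('PLACE', 'Xellule'), ('O', 'et'),
#  ('PLACE', 'Ginen'), ('PLACE', 'Chagem'), ('O', 'in'), ('O', 'contrata')]
```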