How to make spaCy matching case-insensitive

Time: 2021-06-16 13:40:43

Tags: python pandas nlp spacy

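How can I make spaCy case-insensitive?

Is there a code snippet I should add? With the code below, I can't get entities whose casing differs from my patterns.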

import spacy
import pandas as pd

from spacy.pipeline import EntityRuler
nlp = spacy.load('en_core_web_sm', disable = ['ner'])
ruler = nlp.add_pipe("entity_ruler")


flowers = ["rose", "tulip", "african daisy"]
for f in flowers:
    ruler.add_patterns([{"label": "flower", "pattern": f}])
animals = ["cat", "dog", "artic fox"]
for a in animals:
    ruler.add_patterns([{"label": "animal", "pattern": a}])



result = {}
doc = nlp("CAT and Artic fox, plant african daisy")
for ent in doc.ents:
    result[ent.label_] = ent.text
df = pd.DataFrame([result])
print(df)

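2 Answers:

Answer 0: (score: 3)

As long as matching on LOWER is acceptable for all of your patterns, you can keep using phrase patterns and add the phrase_matcher_attr option to the entity ruler. You then don't have to worry about tokenizing the phrases yourself, and if you have many patterns to match, this will also be faster than using token patterns: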

import spacy

nlp = spacy.load('en_core_web_sm', disable=['ner'])
ruler = nlp.add_pipe("entity_ruler", config={"phrase_matcher_attr": "LOWER"})

flowers = ["rose", "tulip", "african daisy"]
for f in flowers:
    ruler.add_patterns([{"label": "flower", "pattern": f}])
animals = ["cat", "dog", "artic fox"]
for a in animals:
    ruler.add_patterns([{"label": "animal", "pattern": a}])

doc = nlp("CAT and Artic fox, plant african daisy")
for ent in doc.ents:
    print(ent, ent.label_)

Output:

CAT animal
Artic fox animal
african daisy flower
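
To see what the option changes, here is a minimal sketch that runs the same patterns with and without phrase_matcher_attr. It assumes spaCy v3 and uses spacy.blank("en") instead of en_core_web_sm so that no trained model is needed; the helper name find_entities is hypothetical and only exists for this comparison.

```python
import spacy


def find_entities(text, phrase_matcher_attr=None):
    """Return (text, label) pairs found by an entity ruler on a blank pipeline."""
    # Assumption: a blank English pipeline (tokenizer only) is enough here,
    # because the entity ruler does not depend on any trained component.
    nlp = spacy.blank("en")
    config = {"phrase_matcher_attr": phrase_matcher_attr} if phrase_matcher_attr else {}
    ruler = nlp.add_pipe("entity_ruler", config=config)
    ruler.add_patterns([
        {"label": "flower", "pattern": "rose"},
        {"label": "flower", "pattern": "tulip"},
        {"label": "flower", "pattern": "african daisy"},
        {"label": "animal", "pattern": "cat"},
        {"label": "animal", "pattern": "dog"},
        {"label": "animal", "pattern": "artic fox"},
    ])
    doc = nlp(text)
    return [(ent.text, ent.label_) for ent in doc.ents]


text = "CAT and Artic fox, plant african daisy"

# Default behaviour: phrase patterns are compared on the exact token text.
default_matches = find_entities(text)
# With LOWER: patterns and text are compared on their lowercase forms.
lower_matches = find_entities(text, phrase_matcher_attr="LOWER")

print(default_matches)
print(lower_matches)
```

With the default setting only "african daisy" should be found, because it is the only phrase whose casing matches a pattern exactly; with LOWER all three entities should be returned.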

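Answer 1: (score: 1)

You need to create the patterns with LOWER. However, you also need to account for multi-word entities, so you need to split each phrase into words and build the token patterns dynamically: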

import spacy
import pandas as pd

from spacy.pipeline import EntityRuler
nlp = spacy.load('en_core_web_sm', disable = ['ner'])
ruler = nlp.add_pipe("entity_ruler")

patterns = []
flowers = ["rose", "tulip", "african daisy"]
for f in flowers:
    patterns.append({"label": "FLOWER", "pattern": [{'LOWER': w} for w in f.split()]})
animals = ["cat", "dog", "artic fox"]
for a in animals:
    patterns.append({"label": "ANIMAL", "pattern": [{'LOWER': w} for w in a.split()]})

ruler.add_patterns(patterns)

result = {}
doc = nlp("CAT and Artic fox, plant african daisy")
for ent in doc.ents:
    result[ent.label_] = ent.text

print([(ent.text, ent.label_) for ent in doc.ents])

Output:

[('CAT', 'ANIMAL'), ('Artic fox', 'ANIMAL'), ('african daisy', 'FLOWER')]
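
One caveat with splitting on whitespace: a phrase is only matched if its pieces line up with spaCy's own tokens, and that may not hold for phrases containing hyphens or other punctuation. The sketch below is one possible variation, not part of the original answer: it builds the LOWER patterns from the pipeline's tokenizer via nlp.make_doc instead of str.split. The phrase "black-eyed susan", the helper name token_pattern, and the use of spacy.blank("en") are assumptions made for illustration.

```python
import spacy

# Assumption: a blank English pipeline is used so no trained model is needed.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("entity_ruler")


def token_pattern(phrase):
    """Build a case-insensitive token pattern using spaCy's own tokenization."""
    # nlp.make_doc only runs the tokenizer, so the pattern has exactly the
    # same token boundaries the ruler will later see in real text.
    return [{"LOWER": token.lower_} for token in nlp.make_doc(phrase)]


# "black-eyed susan" is a hypothetical extra phrase, not from the question.
flowers = ["rose", "tulip", "african daisy", "black-eyed susan"]
ruler.add_patterns(
    [{"label": "FLOWER", "pattern": token_pattern(f)} for f in flowers]
)

doc = nlp("Plant a Black-Eyed Susan next to the African daisy")
matches = [(ent.text, ent.label_) for ent in doc.ents]
print(matches)
```

Because the patterns and the text go through the same tokenizer, this approach should keep working if the tokenizer rules are customized later, whereas a whitespace split would have to be kept in sync by hand.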