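How can I make spaCy case-insensitive?
Is there a code snippet I should add, or some other reason why I can't get entities that aren't written in exactly the same case as my patterns?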
import spacy
import pandas as pd
from spacy.pipeline import EntityRuler
nlp = spacy.load('en_core_web_sm', disable = ['ner'])
ruler = nlp.add_pipe("entity_ruler")
flowers = ["rose", "tulip", "african daisy"]
for f in flowers:
    ruler.add_patterns([{"label": "flower", "pattern": f}])
animals = ["cat", "dog", "artic fox"]
for a in animals:
    ruler.add_patterns([{"label": "animal", "pattern": a}])
result={}
doc = nlp("CAT and Artic fox, plant african daisy")
for ent in doc.ents:
    result[ent.label_]=ent.text
df = pd.DataFrame([result])
print(df)
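Answer 0 (score: 3)
As long as matching on LOWER is fine for all of your patterns, you can keep using phrase patterns and add the phrase_matcher_attr option to the entity ruler. Then you don't have to worry about tokenizing the phrases yourself, and if you have a lot of patterns to match, it will also be faster than using token patterns: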
import spacy
nlp = spacy.load('en_core_web_sm', disable=['ner'])
ruler = nlp.add_pipe("entity_ruler", config={"phrase_matcher_attr": "LOWER"})
flowers = ["rose", "tulip", "african daisy"]
for f in flowers:
    ruler.add_patterns([{"label": "flower", "pattern": f}])
animals = ["cat", "dog", "artic fox"]
for a in animals:
    ruler.add_patterns([{"label": "animal", "pattern": a}])
doc = nlp("CAT and Artic fox, plant african daisy")
for ent in doc.ents:
    print(ent, ent.label_)
Output:
CAT animal
Artic fox animal
african daisy flower
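One thing to watch for in the question's code: result[ent.label_] = ent.text keeps only the last match per label, so 'CAT' would be overwritten by 'Artic fox'. A minimal sketch of collecting every match instead follows. Note that group_entities is a hypothetical helper, not part of spaCy, and the input pairs are simply copied from the output above so the snippet runs without a model:

```python
from collections import defaultdict


def group_entities(pairs):
    """Group (text, label) pairs into {label: [text, ...]}, keeping every match."""
    grouped = defaultdict(list)
    for text, label in pairs:
        grouped[label].append(text)
    return dict(grouped)


# Pairs copied from the output above; with a real doc you would pass
# ((ent.text, ent.label_) for ent in doc.ents) instead.
pairs = [("CAT", "animal"), ("Artic fox", "animal"), ("african daisy", "flower")]
result = group_entities(pairs)
print(result)
# {'animal': ['CAT', 'Artic fox'], 'flower': ['african daisy']}
```

The resulting dict of lists can then be turned into a DataFrame in whatever shape you need, for example one row per label.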
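Answer 1 (score: 1)
You need to create the patterns with LOWER. However, you also need to account for multi-word entities, so you have to split each phrase into words and build the token patterns dynamically: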
import spacy
import pandas as pd
from spacy.pipeline import EntityRuler
nlp = spacy.load('en_core_web_sm', disable = ['ner'])
ruler = nlp.add_pipe("entity_ruler")
patterns = []
flowers = ["rose", "tulip", "african daisy"]
for f in flowers:
    patterns.append({"label": "FLOWER", "pattern": [{'LOWER': w} for w in f.split()]})
animals = ["cat", "dog", "artic fox"]
for a in animals:
    patterns.append({"label": "ANIMAL", "pattern": [{'LOWER': w} for w in a.split()]})
ruler.add_patterns(patterns)
result={}
doc = nlp("CAT and Artic fox, plant african daisy")
for ent in doc.ents:
    result[ent.label_]=ent.text
print([(ent.text, ent.label_) for ent in doc.ents])
Output:
[('CAT', 'ANIMAL'), ('Artic fox', 'ANIMAL'), ('african daisy', 'FLOWER')]
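If you have more than a couple of categories, the two loops can be folded into one helper. This is only a sketch: build_lower_patterns is a hypothetical name, and it assumes that splitting on whitespace gives the same tokens spaCy would produce, which may not hold for phrases containing hyphens or other punctuation.

```python
def build_lower_patterns(phrases_by_label):
    """Turn {label: [phrase, ...]} into entity ruler token patterns matching on LOWER."""
    patterns = []
    for label, phrases in phrases_by_label.items():
        for phrase in phrases:
            # One {"LOWER": ...} dict per whitespace-separated word. The word is
            # lowercased so a phrase written as "African Daisy" still matches.
            tokens = [{"LOWER": word.lower()} for word in phrase.split()]
            patterns.append({"label": label, "pattern": tokens})
    return patterns


patterns = build_lower_patterns({
    "FLOWER": ["rose", "tulip", "african daisy"],
    "ANIMAL": ["cat", "dog", "artic fox"],
})
print(patterns[2])
# {'label': 'FLOWER', 'pattern': [{'LOWER': 'african'}, {'LOWER': 'daisy'}]}
```

The returned list can be passed to ruler.add_patterns(patterns) in a single call, exactly as in the code above.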