I'm indexing HTML documents written in English using Elasticsearch. The data arrives as raw HTML. I found a character filter that strips HTML tags, but I can't use that filter together with the built-in `english` analyzer.
I expected the request below to return three tokens, but it returns five, because it treats "html" as a token twice (once for each tag):
POST _analyze
{
"analyzer": "english",
"char_filter": ["html_strip"],
"text": "<html>It will be raining in yosemite this weekend</html>"
}
How can I get just the three tokens (with no HTML tags) from the text above, so that my response looks like this?
{
"tokens": [
{
"token": "rain",
"start_offset": 11,
"end_offset": 18,
"type": "<ALPHANUM>",
"position": 3
},
{
"token": "yosemit",
"start_offset": 22,
"end_offset": 30,
"type": "<ALPHANUM>",
"position": 5
},
{
"token": "weekend",
"start_offset": 36,
"end_offset": 43,
"type": "<ALPHANUM>",
"position": 7
}
]
}
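To see where the five tokens come from, here is a rough Python sketch (pure stdlib, not Elasticsearch code; a deliberately crude approximation of the analysis chain): without `html_strip`, the tag name "html" survives tokenization and stop-word removal on both ends of the text.

```python
import re

raw = "<html>It will be raining in yosemite this weekend</html>"

# Crude stand-ins for parts of the english analyzer's chain; the real
# analyzer also applies stemming, which doesn't change the token count here.
STOPWORDS = {"it", "will", "be", "in", "this"}  # subset of _english_ stopwords

tokens = re.findall(r"[a-z0-9]+", raw.lower())      # standard-ish tokenization
tokens = [t for t in tokens if t not in STOPWORDS]  # stop-word filter

print(tokens)  # ['html', 'raining', 'yosemite', 'weekend', 'html'] -> 5 tokens
```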
Answer (score: 2)
Define a custom analyzer that uses the `english` analyzer's definition as its base template and adds the `html_strip` character filter to it.
Then you can do:
PUT /english_with_html_strip
{
"settings": {
"analysis": {
"filter": {
"english_stop": {
"type": "stop",
"stopwords": "_english_"
},
"english_keywords": {
"type": "keyword_marker",
"keywords": ["example"]
},
"english_stemmer": {
"type": "stemmer",
"language": "english"
},
"english_possessive_stemmer": {
"type": "stemmer",
"language": "possessive_english"
}
},
"analyzer": {
"english_with_html_strip": {
"tokenizer": "standard",
"char_filter": ["html_strip"],
"filter": [
"english_possessive_stemmer",
"lowercase",
"english_stop",
"english_keywords",
"english_stemmer"
]
}
}
}
}
}
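As a sanity check on what this chain should produce, here is a hedged Python sketch (pure stdlib, not Elasticsearch) that mimics the steps above: the `html_strip` char filter, lowercasing, stop-word removal, and a deliberately crude stand-in for the English stemmer, just enough for this example text.

```python
import re

STOPWORDS = {"it", "will", "be", "in", "this"}  # subset of _english_ stopwords

def crude_stem(token: str) -> str:
    # Very rough stand-in for the 'english' stemmer: strip a trailing
    # "ing" or a final "e". Enough for this example; not a real Porter stemmer.
    if token.endswith("ing"):
        return token[:-3]
    if token.endswith("e"):
        return token[:-1]
    return token

def analyze(text: str) -> list[str]:
    text = re.sub(r"<[^>]+>", "", text)              # html_strip char filter
    tokens = re.findall(r"[a-z0-9]+", text.lower())  # tokenizer + lowercase
    tokens = [t for t in tokens if t not in STOPWORDS]
    return [crude_stem(t) for t in tokens]

print(analyze("<html>It will be raining in yosemite this weekend</html>"))
# ['rain', 'yosemit', 'weekend']
```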
This assumes you want the text analyzed with the full English chain. (If you only want it tokenized with the HTML stripped, a custom analyzer with just the `standard` tokenizer and the `html_strip` char filter would do.) You can then test it with:
POST /english_with_html_strip/_analyze
{
"analyzer": "english_with_html_strip",
"text": "<html>It will be raining in yosemite this weekend</html>"
}