Elasticsearch and Spanish accents

Time: 2014-11-19 12:48:17

Tags: elasticsearch

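I am trying to use Elasticsearch to index some data about research papers, but I am struggling with accents. For instance, if I use the following request, I get this output: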

GET /_analyze?tokenizer=standard&filter=asciifolding&text="Boletínes de investigaciónes"

{
   "tokens": [
      {
         "token": "Bolet",
         "start_offset": 1,
         "end_offset": 6,
         "type": "<ALPHANUM>",
         "position": 1
      },
      {
         "token": "nes",
         "start_offset": 7,
         "end_offset": 10,
         "type": "<ALPHANUM>",
         "position": 2
      },
      {
         "token": "de",
         "start_offset": 11,
         "end_offset": 13,
         "type": "<ALPHANUM>",
         "position": 3
      },
      {
         "token": "investigaci",
         "start_offset": 14,
         "end_offset": 25,
         "type": "<ALPHANUM>",
         "position": 4
      },
      {
         "token": "nes",
         "start_offset": 26,
         "end_offset": 29,
         "type": "<ALPHANUM>",
         "position": 5
      }
   ]
}

I should get something like this:

{
   "tokens": [
      {
         "token": "Boletines",
         "start_offset": 1,
         "end_offset": 6,
         "type": "<ALPHANUM>",
         "position": 1
      },
      {
         "token": "de",
         "start_offset": 11,
         "end_offset": 13,
         "type": "<ALPHANUM>",
         "position": 3
      },
      {
         "token": "investigacion",
         "start_offset": 14,
         "end_offset": 25,
         "type": "<ALPHANUM>",
         "position": 4
      }
   ]
}

What should I do?

1 answer:

Answer 0 (score: 0):

To prevent the extra tokens from being formed, you need to use a different tokenizer; for example, try the whitespace tokenizer.
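
As a rough sketch of the whitespace-tokenizer suggestion, the settings below could be sent as the body of an index-creation request. The analyzer name `example_folding_whitespace` is made up for illustration, and the index it is created on is likewise an assumption. The analyzer combines the `whitespace` tokenizer with the `asciifolding` token filter.

```json
{
  "settings": {
    "analysis": {
      "analyzer": {
        "example_folding_whitespace": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": ["asciifolding"]
        }
      }
    }
  }
}
```

Provided the text reaches Elasticsearch as correctly encoded UTF-8, analyzing the phrase Boletínes de investigaciónes with this analyzer should give tokens along the lines of Boletines, de and investigaciones. Adding `lowercase` before `asciifolding` is a common choice if case-insensitive matching is also wanted.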

Alternatively, use a language analyzer and specify the language.
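
For Spanish, the built-in language analyzer is named `spanish`. The sketch below is one possible variation, not the only way to do it: it rebuilds a Spanish-style analyzer from the standard tokenizer, a Spanish stop-word filter and the light Spanish stemmer, then appends `asciifolding`. The names `spanish_stop`, `spanish_stemmer` and `example_spanish_folding` are illustrative, and placing `asciifolding` last is an assumption so that the stemmer still sees the original accented forms.

```json
{
  "settings": {
    "analysis": {
      "filter": {
        "spanish_stop": { "type": "stop", "stopwords": "_spanish_" },
        "spanish_stemmer": { "type": "stemmer", "language": "light_spanish" }
      },
      "analyzer": {
        "example_spanish_folding": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "spanish_stop", "spanish_stemmer", "asciifolding"]
        }
      }
    }
  }
}
```

Note that a language analyzer removes stop words and stems the remaining terms, so its tokens will differ from the output sketched in the question; for example, de would be dropped as a stop word.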