How to include spaces in Solr ngrams

Time: 2016-10-20 17:28:53

Tags: solr

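I'm using Solr to search names and want partial matches to work. With a minimum gram size of 2, I get the following ngrams for "Bob Smith":

  • bob
  • sm
  • smi
  • smit
  • smith

However, this does not include "bob s", so searching for that query returns nothing. What are my options for including that form in the ngrams? This is the field type I'm using: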

<fieldType name="edgytext" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="25"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

2 answers:

Answer 0 (score: 1)

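Your index and query analyzers use different tokenizers, so they produce different tokens. The KeywordTokenizer on the query side emits the entire search input as a single token ("bob s"), while the StandardTokenizer on the index side splits the name into words, so no indexed term contains a space. Using the StandardTokenizer on the query side as well will work, but a search for 'bob smith' will then also match 'smith bob'.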
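To make the mismatch concrete, here is a rough Python simulation of the two analyzer chains from the question. It is only a sketch: the regex split is an approximation of StandardTokenizer, the query-side StopFilter is left out, and the helper names (`edge_ngrams`, `index_terms`, `query_term`) are made up for illustration.

```python
import re

def edge_ngrams(token, min_size=2, max_size=25):
    """Prefixes of token, from min_size up to max_size characters."""
    return [token[:n] for n in range(min_size, min(len(token), max_size) + 1)]

def index_terms(text):
    # Approximates StandardTokenizer + LowerCaseFilter + EdgeNGramFilter.
    tokens = re.findall(r"\w+", text.lower())
    return {gram for tok in tokens for gram in edge_ngrams(tok)}

def query_term(text):
    # Approximates KeywordTokenizer + LowerCaseFilter: one token, spaces kept.
    return text.lower()

terms = index_terms("Bob Smith")
print(sorted(terms))                 # ['bo', 'bob', 'sm', 'smi', 'smit', 'smith']
print(query_term("bob s") in terms)  # False: no indexed term contains a space
print(query_term("smi") in terms)    # True
```

The query token "bob s" is compared against the indexed terms as a whole, which is why it finds nothing.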

An alternative is to also index the content as shingles, which join adjacent tokens into a single token (this example uses shingles of two tokens):

bob smith jr. => bob smith, smith jr.

.. and then generate edge ngrams from those shingles, giving you:

"bo", "bob", "bob ", "bob s", "bob sm", ...

and so on. The ShingleFilterFactory also emits the original single tokens by default (outputUnigrams="true"), so you should still be able to find just 'smith', etc.

<analyzer type="index">
 <tokenizer class="solr.StandardTokenizerFactory"/>
 <filter class="solr.LowerCaseFilterFactory"/>
 <filter class="solr.ShingleFilterFactory"/>
 <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="25" />
</analyzer>

.. should give you additional tokens that let you match words that follow each other. You can increase maxShingleSize (the default is 2) if you want more than two sequential tokens to be combined.
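The effect of shingling before the edge ngram step can be sketched in Python. This is a simulation under simplifying assumptions, not Solr's implementation: tokens are split with a regex, shingles are joined with a single space, and the function names and the `max_shingle_size` parameter are hypothetical stand-ins for the filter settings.

```python
import re

def shingles(tokens, max_size=2):
    """Single tokens plus every run of 2..max_size adjacent tokens."""
    out = []
    for i in range(len(tokens)):
        for n in range(1, max_size + 1):
            if i + n <= len(tokens):
                out.append(" ".join(tokens[i:i + n]))
    return out

def edge_ngrams(token, min_size=2, max_size=25):
    """Prefixes of token, from min_size up to max_size characters."""
    return [token[:n] for n in range(min_size, min(len(token), max_size) + 1)]

def index_terms(text, max_shingle_size=2):
    # Approximates StandardTokenizer + LowerCaseFilter + ShingleFilter + EdgeNGramFilter.
    tokens = re.findall(r"\w+", text.lower())
    return {g for s in shingles(tokens, max_shingle_size) for g in edge_ngrams(s)}

terms = index_terms("Bob Smith Jr.")
print("bob s" in terms)        # True: prefix of the shingle "bob smith"
print("smith" in terms)        # True: single tokens are still indexed
print("bob smith j" in terms)  # False: needs a three-token shingle
print("bob smith j" in index_terms("Bob Smith Jr.", max_shingle_size=3))  # True
```

Larger shingle sizes match longer phrases but also multiply the number of indexed terms, so the index grows accordingly.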

Also, if you only need autocomplete from the beginning of the text, indexing with a KeywordTokenizer and a LowerCaseFilter and searching with a wildcard will work (as long as you lowercase the text before sending it to Solr, since analysis is largely skipped for wildcard queries). This would also work with an EdgeNGramFilter on top of the KeywordTokenizer.
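For the wildcard variant, the client has to lowercase and escape the input itself. Below is a minimal sketch of building such a query string in Python. The field name `name_exact` and the helper `prefix_query` are hypothetical, and the escaped character set is my reading of the standard query parser's special characters, so check it against your Solr version.

```python
import re

# Characters treated as special by Solr's standard query parser, plus whitespace.
_SPECIAL = re.compile(r'([+\-&|!(){}\[\]^"~*?:\\/\s])')

def prefix_query(field, user_input):
    # Lowercase on the client, backslash-escape special characters,
    # then append the trailing wildcard.
    escaped = _SPECIAL.sub(r"\\\1", user_input.lower())
    return f"{field}:{escaped}*"

q = prefix_query("name_exact", "Bob S")
print(q)  # name_exact:bob\ s*
```

The escaped space keeps "bob s" together as one term, so the wildcard applies to the whole prefix rather than only to "s".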

Answer 1 (score: 0)

You will need to use the KeywordTokenizerFactory on the index analyzer as well as the query analyzer, like this:

<fieldType name="edgytext" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="25"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

This way, the EdgeNGramFilter is applied at index time to the entire string rather than to individual tokens. The value is tokenized as the single token "bob smith" (instead of the "bob" and "smith" produced by the StandardTokenizer) and then expanded into "bo", "bob", "bob ", "bob s", and so on. With minGramSize="2", the single-character prefix "b" is not generated.
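A small Python simulation of this index chain shows which terms end up in the index. It is a sketch only: `edge_ngrams` and `index_terms` are illustrative names, and the whole field value is treated as one lowercase token, as KeywordTokenizer plus LowerCaseFilter would do.

```python
def edge_ngrams(token, min_size=2, max_size=25):
    """Prefixes of token, from min_size up to max_size characters."""
    return [token[:n] for n in range(min_size, min(len(token), max_size) + 1)]

def index_terms(text):
    # Approximates KeywordTokenizer + LowerCaseFilter + EdgeNGramFilter:
    # the whole value is one token, so prefixes can span the space.
    return edge_ngrams(text.lower())

terms = index_terms("Bob Smith")
print(terms)
# ['bo', 'bob', 'bob ', 'bob s', 'bob sm', 'bob smi', 'bob smit', 'bob smith']
print("bob s" in terms)  # True
print("smith" in terms)  # False: only prefixes of the full name are indexed
```

The last line shows the trade-off: a search for the surname alone no longer matches, because only prefixes of the complete value are indexed.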