Solr DisMax query parser is working really badly

Time: 2016-10-15 22:00:53

Tags: database apache search solr dismax

I have a fairly large database of 4.5M documents. With the default query parser, the document I am looking for shows up in the results.

{
  "responseHeader":{
    "status":0,
    "QTime":0,
    "params":{
      "q":"\"I predict a riot\"",
      "rows":"1"}},
  "response":{
    "numFound":15,"start":0,"docs":[
      {
        "artist":"Kaiser Chiefs",
        "text":"<p>Oh, watchin' the people get lairy<br>It's not very pretty, I tell thee<br>Walkin' through town is quite scary<br>And not very sensible either<br>A friend of a friend he got beaten<br>He looked the wrong way at a policeman<br>Would never have happened to Smeaton<br>An old Leodiensian<br><br>I predict a riot, I predict a riot<br>I predict a riot, I predict a riot<br><br>Oh, I try to get to my taxi<br>A man in a tracksuit attacks me<br>He said that he saw it before me<br>Wants to get things a bit gory<br>Girls scrabble round with no clothes on<br>To borrow a pound for a condom<br>If it wasn't for chip fat, they'd be frozen<br>They're not very sensible<br><br>I predict a riot, I predict a riot<br>I predict a riot, I predict a riot<br><br>And if there's anybody left in here<br>That doesn't want to be out there<br><br>Ow!<br><br>Oh, watchin' the people get lairy<br>It's not very pretty, I tell thee<br>Walkin' through town is quite scary<br>Not very sensible<br><br>I predict a riot, I predict a riot<br>I predict a riot, I predict a riot<br><br>And if there's anybody left in here<br>That doesn't want to be out there<br><br>I predict a riot, I predict a riot<br>I predict a riot, I predict a riot</p>",
        "_ts":6341730138387906561,
        "title":"I predict a riot",
        "id":"redacted"}]
  }}

However, when I switch to the DisMax query handler with all the additional parameters, this is what I get:

{
  "responseHeader": {
  "status": 0,
  "QTime": 1,
  "params": {
    "q": "\"I predict a riot\"",
    "defType": "dismax",
    "ps": "0",
    "qf": "text",
    "echoParams": "all",
    "pf": "text^5",
    "wt": "json"
  }
},
  "response": {
    "numFound": 0,
    "start": 0,
    "docs": []
  }
}

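Nothing... If I remove the quotes, it finds some very irrelevant results (songs by an artist called "I"). In case it isn't clear, "I predict a riot" is present in this document's text field. Several times, even.

I am new to Solr and I don't understand what is wrong with this query. I tried changing qf and pf to "artist text title", but nothing changed.

Ideally, the goal is to find matches across all three fields, with a big boost when all the words are found in the same order in the title, artist, or text. But even this simple test doesn't seem to work. :-/

Thanks!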

Edit: using these parameters

"params": {
"q": "I predict a riot",
"defType": "dismax",
"qf": "text artist title",
"echoParams": "all",
"pf": "text^5",
"rows": "100",
"wt": "json"
}

gives me this debug output:

"debug": {
"rawquerystring": "I predict a riot",
"querystring": "I predict a riot",
"parsedquery": "(+(DisjunctionMaxQuery((text:I | title:I | artist:I)) DisjunctionMaxQuery((text:predict | title:predict | artist:predict)) DisjunctionMaxQuery((text:a | title:a | artist:a)) DisjunctionMaxQuery((text:riot | title:riot | artist:riot))) DisjunctionMaxQuery(((text:I predict a riot)^5.0)))/no_coord",
"parsedquery_toString": "+((text:I | title:I | artist:I) (text:predict | title:predict | artist:predict) (text:a | title:a | artist:a) (text:riot | title:riot | artist:riot)) ((text:I predict a riot)^5.0)",
"QParser": "DisMaxQParser",
"altquerystring": null,
"boostfuncs": null
}

I get terrible results, namely an artist called "I", but not the Kaiser Chiefs song, which has the query in its title and several times in its text.

Field definitions:

 <field name="title" type="string" indexed="true" stored="true"/>
 <field name="artist" type="string" indexed="true" stored="true"/>   
 <field name="text" type="string" indexed="true" stored="true"/>

1 answer:

Answer 0: (score: 1)

A string field only matches the exact value of the field (including case, whitespace, and so on).
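To make the difference concrete, here is a toy model in Python. It is only a sketch of the idea, not how Lucene indexes anything: the helper names are made up for illustration, and the regex is a crude stand-in for a real tokenizer plus lowercase filter.

```python
import re

def index_string_field(value):
    # Hypothetical helper: a string field is indexed as ONE token,
    # the whole value exactly as it was sent in.
    return {value}

def index_text_field(value):
    # Hypothetical helper: rough approximation of a tokenizer followed
    # by a lowercase filter (split into word runs, then lowercase).
    return {tok.lower() for tok in re.findall(r"\w+", value)}

title = "I predict a riot"
string_tokens = index_string_field(title)
text_tokens = index_text_field(title)

# A single word never equals the one big token of a string field...
print("riot" in string_tokens)
# ...but it is one of the tokens of an analyzed text field.
print("riot" in text_tokens)
# Only the exact, case-sensitive value matches the string field.
print("I predict a riot" in string_tokens)
print("i predict a riot" in string_tokens)
```

The point is that a search term has to be equal to an indexed token to produce a hit, and for a string field the only indexed token is the full value.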

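To get the kind of matching you expect, you need to use a text field instead. The text_general / text_en field types in the example schema are probably usable, at least as a starting point, but you may want to tune exactly what the field does depending on how you query it. If you don't have synonyms or don't want to remove stopwords, remove those lines and keep only the tokenizer and the lowercase filter: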

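<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <!-- The tokenizer and filters must sit inside an analyzer element -->
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Optional: remove this line if you do not want stopwords dropped -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
    <!-- Optional: remove this line if you have no synonyms -->
    <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>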

After changing the field type, you need to reindex your data.

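"But I do have a field in qf that contains the complete sentence?" Yes. But the dismax query parser tokenizes the input according to its own rules and then builds a new internal query from the resulting terms. You can see in the debug output that it expands the query string into a long list of ORs, where each term is searched for separately. Since no tokens matching those individual terms have been indexed for your string fields, there are no hits.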
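The expansion can be mimicked with a few lines of Python. This is a simplified sketch that assumes plain whitespace splitting and uses made-up helper names and toy documents; the real parser does considerably more, but the per-term structure is the same one visible in the parsedquery above.

```python
def dismax_term_clauses(query, fields):
    # Hypothetical helper: one group per whitespace-separated word,
    # each group asking "does this word match in ANY of the fields?"
    return [[(field, term) for field in fields] for term in query.split()]

def group_matches(group, doc_tokens):
    # doc_tokens maps a field name to the set of tokens indexed for it.
    return any(term in doc_tokens.get(field, set()) for field, term in group)

fields = ["text", "title", "artist"]
clauses = dismax_term_clauses("I predict a riot", fields)

# With string fields, each field holds exactly one token: its whole value.
kaiser_chiefs_doc = {"title": {"I predict a riot"}, "artist": {"Kaiser Chiefs"}}
artist_i_doc = {"title": {"Some song"}, "artist": {"I"}}  # toy document

hits = [group_matches(g, kaiser_chiefs_doc) for g in clauses]
hits_for_artist_i = [group_matches(g, artist_i_doc) for g in clauses]

print(hits)               # no single word equals a whole field value
print(hits_for_artist_i)  # the word "I" equals the whole artist value
```

Under these assumptions the document you actually want matches none of the term clauses, while a document whose artist field is literally "I" matches one of them, which is consistent with the irrelevant results you are seeing.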

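If you used the edismax query parser, which supports the Lucene query syntax, you could use title:"I predict a riot" to get at least one hit, but it still wouldn't behave the way you expect: you would only get documents whose title matches the query character for character.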
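Once the three fields use a tokenized text type and the data has been reindexed, a request along the following lines might get close to the behaviour described in the question (match in any field, extra boost when the words appear as a phrase). This is only a sketch: the host, port and the core name "lyrics" are placeholders I made up, and the boost values are arbitrary starting points, not recommendations.

```python
from urllib.parse import urlencode

# Assumption: local Solr instance and a core called "lyrics" (both hypothetical).
base_url = "http://localhost:8983/solr/lyrics/select"

params = {
    "defType": "edismax",
    "q": "I predict a riot",
    # Search each word in all three fields.
    "qf": "title artist text",
    # Boost documents where the whole query appears as a phrase in a field.
    # The boost factors here are illustrative guesses.
    "pf": "title^10 artist^10 text^5",
    # Phrase slop of 0: the words must be adjacent and in the same order.
    "ps": "0",
    "wt": "json",
}

url = base_url + "?" + urlencode(params)
print(url)
```

Adding debugQuery=true to the parameters while tuning shows how the boosts end up in the parsed query, in the same way as the debug output in the question.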