Question

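I have set up a single-core Solr (4.6.0) and am trying to index documents in multiple languages. I configured Solr to detect the document language automatically, but it always assigns the default language (the one configured in the langid.fallback parameter).

Here is what I added to solrconfig.xml to enable language detection: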

<requestHandler name="/update" class="solr.UpdateRequestHandler">
  <lst name="defaults">
    <str name="update.chain">langid</str>
  </lst>
</requestHandler>

and

<updateRequestProcessorChain name="langid">
  <processor class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
    <str name="langid.fl">text,title,description,content</str>
    <str name="langid.langField">language_s</str>
    <str name="langid.fallback">en</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>

After uploading a document, the log shows:

248638 [qtp723484867-14] DEBUG org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor  – LangId configured
248639 [qtp723484867-14] DEBUG org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor  – Language fallback to value en
248639 [qtp723484867-14] DEBUG org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor  – Appending field text
248639 [qtp723484867-14] DEBUG org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor  – Appending field title
248639 [qtp723484867-14] WARN  org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor  – Field title not a String value, not including in detection
248640 [qtp723484867-14] DEBUG org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor  – Appending field description
248640 [qtp723484867-14] WARN  org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor  – Field description not a String value, not including in detection
248640 [qtp723484867-14] DEBUG org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor  – Appending field content
248640 [qtp723484867-14] DEBUG org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor  – No input text to detect language from, returning empty list
248641 [qtp723484867-14] DEBUG org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor  – No language detected, using fallback en
248641 [qtp723484867-14] DEBUG org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor  – Detected main document language from fields [Ljava.lang.String;@6efbb783: en

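As far as I can tell, LanguageIdentifierUpdateProcessor cannot handle solr.TextField fields for language detection, but I have not seen this restriction mentioned in any documentation. Besides, I have seen several examples in books, and they all use text fields (not string fields) for language detection. Also, for reasons I don't understand, the fields text and content are not taken into account at all.

Can anyone point me in the right direction?

Here are the definitions of these fields: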
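One possible reading of the log above, sketched in Python: the processor appears to skip fields that are missing from the incoming document without any warning, and to warn about fields whose value is not a plain String. The function below is a hypothetical model inferred only from the log messages, not Solr's actual source; `concat_fields` and the sample document are made up for illustration.

```python
import logging

log = logging.getLogger("langid_sketch")


def concat_fields(doc, input_fields):
    """Join the String values of input_fields, skipping absent or non-String ones.

    Hypothetical model of what the log suggests; not the real Solr code.
    """
    parts = []
    for name in input_fields:
        if name not in doc:
            # Absent fields contribute nothing and produce no warning,
            # which would match the log lines for "text" and "content".
            continue
        value = doc[name]
        if isinstance(value, str):
            parts.append(value)
        else:
            # Mirrors the WARN lines seen for "title" and "description".
            log.warning(
                "Field %s not a String value, not including in detection", name
            )
    return " ".join(parts)


# Assumed input: title arrives as a list, description as a non-string object.
doc = {"title": ["Ein Titel"], "description": 42}
text = concat_fields(doc, ["text", "title", "description", "content"])
```

Under these assumptions `text` ends up empty, so there would be nothing to detect a language from and the fallback would be used, which is consistent with the last three log lines.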

<field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/>
<field name="title" type="text_general" indexed="true" stored="true" multiValued="true"/>
<field name="description" type="text_general" indexed="true" stored="true"/>
<field name="content" type="text_general" indexed="false" stored="true" multiValued="true"/>

Thanks!

Answer 1

I got it working by calling /update/extract.

In solrconfig.xml:

<!-- Solr Cell Update Request Handler
     http://wiki.apache.org/solr/ExtractingRequestHandler 
-->
<requestHandler name="/update/extract" 
                startup="lazy"
                class="solr.extraction.ExtractingRequestHandler" >
  <lst name="defaults">
    <str name="lowernames">true</str>
    <str name="uprefix">ignored_</str>

    <!-- capture link hrefs but ignore div attributes -->
    <str name="captureAttr">true</str>
    <str name="fmap.a">ignored_</str>
    <str name="fmap.div">ignored_</str>

    <str name="update.chain">langid</str>
  </lst>
</requestHandler>

In the Java code:

  // Upload pdf content
  ContentStreamUpdateRequest up = new ContentStreamUpdateRequest("/update/extract");
  up.setParam("literal.id", doc.getId().toString());
  up.setParam("literal.title", doc.getTitle());
  up.setParam("literal.description", doc.getDescription());
  up.addFile(new java.io.File(doc.getFile().getFilePath()), doc.getProcessedFile().getFile()
      .getMimeType());
  up.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
  solrServer.getServer().request(up);

With this approach the document language is detected correctly.

Hope it helps!

Answer 2

As of Solr 7.1:

1) Uncomment the <updateRequestProcessorChain name="langid"> section and set the other required parameters.

2) Add the langid entry to

  <initParams path="/update/**,/query,/select,/tvrh,/elevate,/spell,/browse">
    <lst name="defaults">
      <str name="df">_text_</str>
      <str name="update.chain">langid</str>

    </lst>
  </initParams>

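3) Restart Solr and use standard pysolr as shown below:

  import pysolr

  # dataTFText is the document (a dict of field values) to index
  solrTargetCollection = pysolr.Solr('http://localhost:8983/solr/LangCollection', timeout=10)
  solrTargetCollection.add([dataTFText])
  solrTargetCollection.commit()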
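To check whether detection actually ran after indexing, one option is to facet on the language field. The sketch below only builds the query URL with the standard library; the helper name `language_facet_url` is hypothetical, and the field name `language_s` is an assumption taken from the `langid.langField` setting in the question, so adjust it to whatever your chain writes to.

```python
from urllib.parse import urlencode


def language_facet_url(core_url, lang_field="language_s"):
    """Build a /select URL that counts indexed documents per detected language."""
    params = {
        "q": "*:*",          # match every document
        "rows": 0,           # only the facet counts are of interest
        "facet": "true",
        "facet.field": lang_field,
        "wt": "json",
    }
    return core_url.rstrip("/") + "/select?" + urlencode(params)


url = language_facet_url("http://localhost:8983/solr/LangCollection")
```

Opening that URL in a browser (or with any HTTP client) should list one facet bucket per language code; if every document lands in the fallback bucket, the chain is probably not seeing the text fields.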

Answer 3

I'm on 6.1.0, and there /update actually works, whereas /update/extract no longer does.


Answer 4

When updating, you should use

/update?update.chain=langid

which will work as long as the chain is configured correctly.
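As a rough illustration of passing the chain explicitly, here is a minimal sketch that prepares such an update request with Python's standard library. The core URL, the helper name `build_update_request`, the `commit=true` parameter, and the sample document are assumptions for the example, not part of the original setup.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request


def build_update_request(core_url, docs, chain="langid"):
    """Prepare a JSON POST to /update that names the update chain explicitly."""
    query = urlencode({"update.chain": chain, "commit": "true"})
    return Request(
        core_url.rstrip("/") + "/update?" + query,
        data=json.dumps(docs).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical core name and document, for illustration only.
req = build_update_request(
    "http://localhost:8983/solr/collection1",
    [{"id": "1", "title": "Bonjour tout le monde"}],
)
# To actually send it: urllib.request.urlopen(req)
```

Naming the chain in the URL sidesteps the question of whether the handler's defaults were picked up, which makes it a convenient way to test the chain on its own.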

Solr not detecting language automatically
