How do I index different types of XML with DIH in Solr?

Time: 2010-09-20 19:36:05

Tags: solr dih

I need to index 5 different types of XML files. They share a similar structure, but each type differs slightly.

Example 1:

<?xml version="1.0"?>
<manifest>
        <metadata>
                <isbn>9780815341291</isbn>
                <title>Essential Cell Biology,Third Edition</title>
                <authors>
                        <author>Alberts;Bruce</author>
                        <author>Bray;Dennis</author>
                </authors>
                <categories>
                        <category>SCABC</category>
                        <category>SCDEF</category>
                </categories>
        </metadata>

        <resources>
                <audioresource>
                        <uuid>123456789</uuid>
                        <source>03_Mutations_Origin_Cancer.mp3</source>
                        <mimetype>audio/mpeg</mimetype>
                        <title>Part Three - Mutations and the Origin of Cancer</title>
                        <description>123</description>
                        <chapters>
                                <chapter>1</chapter>
                        </chapters>
                </audioresource>
        </resources>
</manifest>

Example 2:

<?xml version="1.0"?> 
<manifest> 
        <metadata> 
                <isbn>9780815341291</isbn> 
                <title>Essential Cell Biology,Third Edition</title> 
                <authors> 
                        <author>FN:Alberts;Bruce</author> 
                        <author>FN:Bray;Dennis</author> 
                </authors> 
                <categories> 
                        <category>SCABC</category> 
                        <category>SCGHI</category> 
                </categories> 
        </metadata> 

        <resources> 
                <glossaryresource> 
                        <uuid>123456789</uuid> 
                        <term>A subunit </term> 
                        <definition>The portion of a bacterial exotoxin that interferes with normal host cell function. </definition> 
                        <chapters> 
                                <chapter>10</chapter> 
                        </chapters> 
                </glossaryresource> 
        </resources> 
</manifest> 

My dih-config.xml looks like this:

 
<dataConfig> 
        <dataSource name="fileReader" type="FileDataSource" encoding="UTF-8"/> 
        <document> 
                <entity name="dir" rootEntity="false" dataSource="null" processor="FileListEntityProcessor" fileName="^.*\.xml$" recursive="true" baseDir="X:/tmp/npr"> 
                        <entity name="audioresource" 
                                        rootEntity="true" 
                                        dataSource="fileReader" 
                                        url="${dir.fileAbsolutePath}" 
                                        stream="false" 
                                        logTemplate=" processing ${dir.fileAbsolutePath}" 
                                        logLevel="debug" 
                                        processor="XPathEntityProcessor" 
                                        forEach="/manifest/metadata | /manifest/metadata/authors | /manifest/metadata/categories | /manifest/metadata/resources | /manifest/resources/audioresource | /manifest/resources/audioresource/chapters" 
                                        transformer="DateFormatTransformer"> 

                                        <field column="category" xpath="/manifest/metadata/categories/category" /> 
                                        <field column="author" xpath="/manifest/metadata/authors/author" /> 
                                        <field column="book_title" xpath="/manifest/metadata/title" /> 
                                        <field column="isbn" xpath="/manifest/metadata/isbn"/> 
                                        <field column="id" xpath="/manifest/resources/audioresource/uuid"/> 
                                        <field column="mimetype" xpath="/manifest/resources/audioresource/mimetype" /> 
                                        <field column="title" xpath="/manifest/resources/audioresource/title"/> 
                                        <field column="description" xpath="/manifest/resources/audioresource/description"/> 
                                        <field column="chapter" xpath="/manifest/resources/audioresource/chapters/chapter"/> 
                                        <field column="source" xpath="/manifest/resources/audioresource/source"/> 
                        </entity> 
                </entity> 
        </document> 
</dataConfig> 

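I'm not very familiar with XPath. I can't use wildcards in element names, can I? I tried that and it didn't work.

Thanks very much in advance.

1 Answer:

Answer 0: (score: 0)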

I'm currently working on a similar problem. Have you tried creating an XSLT? The entity element has an optional "xsl" attribute.
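
A minimal sketch of that idea, assuming XSLT 1.0 and manifests shaped like the two samples in the question: the stylesheet renames every child of <resources> to a generic <resource> element and keeps the original element name in a type attribute, so one entity can address all five resource types with a single set of XPaths. The file name normalize-resources.xsl, the <resource> element, and the type attribute are hypothetical names chosen for illustration, not anything Solr prescribes.

```xml
<?xml version="1.0"?>
<!-- normalize-resources.xsl (hypothetical file name) -->
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>

  <!-- Identity template: copy every node and attribute unchanged by default -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>

  <!-- Rename each child of <resources> (audioresource, glossaryresource, ...)
       to <resource> and record the original element name in @type -->
  <xsl:template match="/manifest/resources/*">
    <resource type="{local-name()}">
      <xsl:apply-templates select="@*|node()"/>
    </resource>
  </xsl:template>
</xsl:stylesheet>
```

The inner entity could then point at the stylesheet through the xsl attribute and use one forEach for all file types. The stylesheet path and the resource_type column below are assumptions, and commonField="true" is meant to copy the book-level values onto each resource record; it is worth verifying that behaviour against your Solr version. Type-specific columns such as term or mimetype would simply stay empty for resources that lack those elements.

```xml
<entity name="resource"
        rootEntity="true"
        dataSource="fileReader"
        url="${dir.fileAbsolutePath}"
        processor="XPathEntityProcessor"
        xsl="xslt/normalize-resources.xsl"
        forEach="/manifest/resources/resource">

        <!-- Fields shared by every resource type after normalization -->
        <field column="id" xpath="/manifest/resources/resource/uuid"/>
        <field column="resource_type" xpath="/manifest/resources/resource/@type"/>
        <field column="chapter" xpath="/manifest/resources/resource/chapters/chapter"/>

        <!-- Type-specific fields: audio files and glossary entries -->
        <field column="mimetype" xpath="/manifest/resources/resource/mimetype"/>
        <field column="term" xpath="/manifest/resources/resource/term"/>
        <field column="definition" xpath="/manifest/resources/resource/definition"/>

        <!-- Book-level metadata, read once and carried over to each record -->
        <field column="isbn" xpath="/manifest/metadata/isbn" commonField="true"/>
        <field column="book_title" xpath="/manifest/metadata/title" commonField="true"/>
</entity>
```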