Parsing zip files

Time: 2018-11-23 09:52:36

Tags: parsing nutch apache-tika

I am using Nutch 1.15 to crawl a link to a zip file that contains file1.txt, file2.txt and file3.txt.

I have enabled the parse-zip and parse-tika plugins in "plugin.includes", but the contents of the text files are not extracted and indexed.

The parsed content comes back like this:

"content" : "file1.txt\nfile2.txt\nfile3.txt\n"

Why are the contents of file1.txt and the other files not extracted?
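To make the symptom concrete, here is a minimal sketch, independent of Nutch, that contrasts listing a zip's entry names with reading the entries' text. The archive is built in memory and the file contents are invented for illustration; only the three file names come from the question.

```python
import io
import zipfile


def build_sample_zip() -> bytes:
    """Build an in-memory zip with the three entries named in the question.

    The text inside each entry is made up for this example.
    """
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name in ("file1.txt", "file2.txt", "file3.txt"):
            zf.writestr(name, f"text of {name}")
    return buf.getvalue()


def entry_names(data: bytes) -> str:
    """Return only the entry names, one per line (what the index currently holds)."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        return "".join(name + "\n" for name in zf.namelist())


def entry_texts(data: bytes) -> dict:
    """Return the decoded text of every entry (what the question expects to be indexed)."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        return {name: zf.read(name).decode("utf-8") for name in zf.namelist()}
```

`entry_names(build_sample_zip())` reproduces the shape of the "content" field shown above, while `entry_texts(...)` returns the per-file text that is missing from it.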

I removed zip from the exclusion rule in regex-urlfilter.txt:

#-(?i)\.(gif|jpg|png|ico|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|exe|jpeg|bmp|js)$
-(?i)\.(gif|jpg|png|ico|css|sit|eps|wmf|ppt|mpg|xls|gz|rpm|tgz|mov|exe|jpeg|bmp|js)$
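As a quick sanity check of that change, the two rules can be compared outside Nutch. This is only a sketch: it treats the leading `-` as the exclusion marker and applies the rest of the line as a search pattern with Python's `re` module, which is close to but not identical to Java's regex engine. The URLs are hypothetical.

```python
import re

# The commented-out rule (still lists zip) and the active rule (zip removed).
OLD_RULE = r"-(?i)\.(gif|jpg|png|ico|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|exe|jpeg|bmp|js)$"
NEW_RULE = r"-(?i)\.(gif|jpg|png|ico|css|sit|eps|wmf|ppt|mpg|xls|gz|rpm|tgz|mov|exe|jpeg|bmp|js)$"


def is_excluded(rule: str, url: str) -> bool:
    """Return True if an exclusion rule ('-' followed by a regex) matches the URL."""
    sign, pattern = rule[0], rule[1:]
    if sign != "-":
        raise ValueError("this sketch only handles exclusion rules starting with '-'")
    return re.search(pattern, url) is not None


# Hypothetical URLs, used only for illustration.
ZIP_URL = "http://example.com/files/archive.zip"
GZ_URL = "http://example.com/files/archive.gz"
```

With these definitions, `is_excluded(OLD_RULE, ZIP_URL)` is true and `is_excluded(NEW_RULE, ZIP_URL)` is false, so the edited rule no longer rejects zip links, while `GZ_URL` is still rejected by both.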

plugin.includes in nutch-site.xml:

<property>
    <name>plugin.includes</name>
    <value>protocol-http|protocol-httpclient|urlfilter-regex|parse-(html|text|tika|zip|metatags)|index-(basic|anchor|metadata)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)|indexer-elastic</value>
</property>
<property>
    <name>http.content.limit</name>
    <value>-1</value>
</property>

parse-plugins.xml file:

<parse-plugins>

  <!--  by default if the mimeType is set to *, or
        if it can't be determined, use parse-tika -->
    <mimeType name="*">
      <plugin id="parse-tika" />
    </mimeType>

    <mimeType name="application/rss+xml">
        <plugin id="parse-tika" />
        <plugin id="feed" />
    </mimeType>

    <mimeType name="application/x-bzip2">
        <!--  try and parse it with the zip parser -->
        <plugin id="parse-zip" />
    </mimeType>

    <mimeType name="application/x-gzip">
        <!--  try and parse it with the zip parser -->
        <plugin id="parse-zip" />
    </mimeType>

    <mimeType name="application/x-javascript">
        <plugin id="parse-js" />
    </mimeType>

    <mimeType name="application/x-shockwave-flash">
      <plugin id="parse-swf" />
    </mimeType>

    <mimeType name="application/zip">
        <plugin id="parse-zip" />
    </mimeType>

    <mimeType name="text/html">
        <plugin id="parse-html" />
    </mimeType>

    <mimeType name="application/xhtml+xml">
        <plugin id="parse-html" />
    </mimeType>

    <mimeType name="text/xml">
        <plugin id="parse-tika" />
        <plugin id="feed" />
    </mimeType>

    <!-- Types for parse-ext plugin: required for unit tests to pass. -->

    <mimeType name="application/vnd.nutch.example.cat">
        <plugin id="parse-ext" />
    </mimeType>

    <mimeType name="application/vnd.nutch.example.md5sum">
        <plugin id="parse-ext" />
    </mimeType>

    <!--  alias mappings for parse-xxx names to the actual extension implementation
    ids described in each plugin's plugin.xml file -->
    <aliases>
        <alias name="parse-tika"
            extension-id="org.apache.nutch.parse.tika.TikaParser" />
        <alias name="parse-ext" extension-id="ExtParser" />
        <alias name="parse-html"
            extension-id="org.apache.nutch.parse.html.HtmlParser" />
        <alias name="parse-js" extension-id="JSParser" />
        <alias name="feed"
            extension-id="org.apache.nutch.parse.feed.FeedParser" />
        <alias name="parse-swf"
            extension-id="org.apache.nutch.parse.swf.SWFParser" />
        <alias name="parse-zip"
            extension-id="org.apache.nutch.parse.zip.ZipParser" />
    </aliases>

</parse-plugins>
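To double-check which parser this mapping assigns to a zip file, the file can be read with a small script. The sketch below is not Nutch's own resolution code: it uses a shortened copy of the configuration above, and its fallback to the `*` entry is a simplification based on the comment at the top of the file. The function name `parsers_for` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the parse-plugins.xml shown above (only the entries needed here).
SAMPLE = """<parse-plugins>
    <mimeType name="*">
        <plugin id="parse-tika" />
    </mimeType>
    <mimeType name="application/zip">
        <plugin id="parse-zip" />
    </mimeType>
    <aliases>
        <alias name="parse-tika"
            extension-id="org.apache.nutch.parse.tika.TikaParser" />
        <alias name="parse-zip"
            extension-id="org.apache.nutch.parse.zip.ZipParser" />
    </aliases>
</parse-plugins>"""


def parsers_for(xml_text: str, mime_type: str) -> list:
    """Return the extension ids mapped to a MIME type, falling back to the '*' entry."""
    root = ET.fromstring(xml_text)
    # alias name -> extension id, e.g. parse-zip -> org.apache.nutch.parse.zip.ZipParser
    aliases = {a.get("name"): a.get("extension-id") for a in root.iter("alias")}
    # MIME type -> ordered list of plugin ids
    mapping = {
        m.get("name"): [p.get("id") for p in m.findall("plugin")]
        for m in root.findall("mimeType")
    }
    plugin_ids = mapping.get(mime_type, mapping.get("*", []))
    # Keep the plugin id itself when no alias is declared for it.
    return [aliases.get(pid, pid) for pid in plugin_ids]
```

For `application/zip` this returns the ZipParser extension id, which matches the mapping above and suggests the MIME-type-to-plugin mapping itself is not what is missing.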

Am I missing any configuration on the Nutch side?

0 answers:

No answers yet