Question

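If the GATE document I use is slightly large, I get the error java.lang.OutOfMemoryError: GC overhead limit exceeded when I try to execute the pipeline.

The code works fine if the GATE document is small.

My Java code looks like this:

Class TestGate: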

    public void gateProcessor(Section section) throws Exception {
        Gate.init();
        Gate.getCreoleRegister().registerDirectories(....
        SerialAnalyserController pipeline .......
        pipeline.add(All the language analyzers)
        pipeline.add(My Jape File)
        Corpus corpus = Factory.newCorpus("Gate Corpus");
        Document doc = Factory.newDocument(section.getContent());
        corpus.add(doc);

        pipeline.setCorpus(corpus);
        pipeline.execute();
    }

The main class contains:

            StringBuilder body = new StringBuilder();
            int character;
            FileInputStream file = new FileInputStream(
                    new File(
                            "filepath\\out.rtf"));  //The Document in question
            while (true)
            {
                character = file.read();
                if (character == -1) break;
                body.append((char) character);
            }


            Section section = new Section(body.toString()); //Creating object of Type Section with content field = body.toString()
            TestGate testgate = new TestGate();
            testgate.gateProcessor(section);

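Interestingly, the same thing fails in the GATE Developer tool: if the document exceeds a certain size, say more than one page, the tool basically hangs.

This suggests that my code is logically correct but my approach is wrong. How do we deal with large chunks of data in a GATE document?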

Answer 1

You need to call

corpus.clear();
Factory.deleteResource(doc);

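after each document. Otherwise, if you run it enough times, you will eventually get an OutOfMemoryError on documents of any size (although, given that you initialize GATE inside the method, it looks like you really only need to process a single document once).

Besides that, annotations and features usually take a lot of memory. If you have an annotation-intensive pipeline, i.e. you generate a lot of annotations with many features and values, you may run out of memory. Make sure you don't have a processing resource that generates annotations exponentially, for instance a JAPE or Groovy rule that generates n to the power of W annotations, where W is the number of words in your document. Likewise, if you have a feature for each possible word combination in your document, that would generate on the order of W factorial strings.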
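To get a feel for how quickly those two growth rates get out of hand, here is a minimal sketch in plain Java (no GATE calls). The class name AnnotationGrowth, the helper methods, and the branching factor n = 2 are illustrative assumptions, not anything taken from the pipeline in the question.

```java
import java.math.BigInteger;

class AnnotationGrowth {

    // n^w: how many annotations result if a rule branches n ways
    // for each of the w words (hypothetical rule, for illustration).
    static BigInteger power(int n, int w) {
        return BigInteger.valueOf(n).pow(w);
    }

    // w!: how many orderings exist for w distinct words.
    static BigInteger factorial(int w) {
        BigInteger result = BigInteger.ONE;
        for (int i = 2; i <= w; i++) {
            result = result.multiply(BigInteger.valueOf(i));
        }
        return result;
    }

    public static void main(String[] args) {
        int n = 2; // assumed branching factor, chosen only for the example
        for (int w : new int[] {10, 20, 50}) {
            System.out.println("W=" + w
                    + "  n^W=" + power(n, w)
                    + "  W!=" + factorial(w));
        }
    }
}
```

With n = 2, a document of only 20 words already gives more than a million combinations, so a rule with this behaviour would exhaust the heap long before it reached a full page of text.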

Answer 2

Every time you create a pipeline object, it takes up a lot of memory. That is why you should clean up every time you use ANNIE:

pipeline.cleanup();
pipeline = null;
