Speed-optimizing a tree data parser

Time: 2015-03-02 14:29:28

Tags: java algorithm parsing optimization

I'm working on an assignment whose input has the following format, and I have to parse it as fast as possible:

5 (
 5 (
  3 (
  )
 )
 3 (
  3 (
  )
  3 (
  )
 )
 5 (
  2 (
  )
  4 (
  )
 )
)

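This is a tree structure of "employees"; the numbers are used in a later part of the task (they are language indexes).

Each employee can have any number of subordinates and exactly one superior (the root node is the "Boss").

Here is my parser (originally I used a Scanner, which was simple but twice as slow):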

// Invocation
// Employee boss = collectEmployee(null, 0, reader);

private Employee collectEmployee(final Employee parent, int indent, final Reader r) throws IOException
{
    final StringBuilder sb = new StringBuilder();
    boolean nums = false;
    while (true) {
        char c = (char) r.read();
        if (c == 10 || c == 13) continue; // newline
        if (c == ' ') {
            if (nums) break;
        } else {
            nums = true;
            sb.append(c);
        }
    }
    final int lang = Integer.parseInt(sb.toString());
    final Employee self = new Employee(lang, parent);

    r.skip(1); // opening paren
    int spaces = 0;
    while (true) {
        r.mark(1);
        int i = r.read();
        char c = (char) i;
        if (c == 10 || c == 13) continue; // newline
        if (c == ' ') {
            spaces++;
        } else {
            if (spaces == indent) {
                break; // End of this employee
            } else {
                spaces = 0; // new line.
                r.reset();
                self.add(collectEmployee(self, indent + 1, r));
            }
        }
    }
    return self; // the root employee for this subtree
}

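I need to shave a few more cycles off this code so that it passes the strict requirements. I've profiled it, and this part is what really slows the application down. The input file can be up to 30 MiB, so any small improvement makes a big difference.

Any ideas are appreciated. Thanks.

(For completeness, here is the Scanner implementation; it gives you an idea of how I parse the input.)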

private Employee collectEmployee(final Employee parent, final Scanner sc)
{
    final int lang = Integer.parseInt(sc.next());
    sc.nextLine(); // trash the opening parenthesis

    final Employee self = new Employee(lang, parent);

    while (sc.hasNextInt()) {
        Employee sub = collectEmployee(self, sc);
        self.add(sub);
    }

    sc.nextLine(); // trash the closing parenthesis

    return self;
}

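3 answers:

Answer 0 (score: 2)

  1. You're pushing a lot of data through the StringBuilder. It might help to keep an int value that you update whenever you encounter a decimal character ('0' to '9') and that you store/reset when you encounter a non-decimal character. That way you can also get rid of Integer.parseInt.

  2. You appear to be using/checking the indentation to find the hierarchy, but your input format contains parentheses that make it an S-expression-style syntax, so your parser is doing much more work than it needs to (you can ignore the whitespace and handle the parentheses with a stack of Employees).

  3. I'd consider benchmarking with JMH and running it with perf-asm (if available) to see where the code spends its time. It really is an invaluable tool.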
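Points 1 and 2 could be combined roughly as follows. This is only a minimal sketch: the `StackParser` class and its nested `Employee` stand-in are assumptions made for illustration, not the asker's actual classes. It accumulates each number in an `int`, ignores whitespace entirely, and uses a stack of open employees to handle the parentheses.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class StackParser {

    // Minimal stand-in for the question's Employee class (an assumption).
    static final class Employee {
        final int lang;
        final Employee parent;
        final List<Employee> subs = new ArrayList<>();

        Employee(int lang, Employee parent) {
            this.lang = lang;
            this.parent = parent;
        }
    }

    // Parses "number ( children... )" text; whitespace carries no meaning here.
    static Employee parse(CharSequence in) {
        Deque<Employee> open = new ArrayDeque<>(); // employees whose ')' is still pending
        Employee root = null;
        int num = 0; // number accumulated since the last '('

        for (int i = 0; i < in.length(); i++) {
            char c = in.charAt(i);
            if (c >= '0' && c <= '9') {
                num = num * 10 + (c - '0');    // replaces StringBuilder + parseInt
            } else if (c == '(') {
                Employee parent = open.peek(); // null while parsing the root
                Employee e = new Employee(num, parent);
                if (parent == null) {
                    root = e;
                } else {
                    parent.subs.add(e);
                }
                open.push(e);
                num = 0;
            } else if (c == ')') {
                open.pop();                    // this employee is complete
            }
            // any other character (spaces, newlines) is skipped
        }
        return root;
    }

    public static void main(String[] args) {
        Employee boss = parse("5 (\n 5 (\n  3 (\n  )\n )\n 3 (\n )\n)");
        System.out.println(boss.lang + " has " + boss.subs.size() + " subordinates");
        // prints "5 has 2 subordinates"
    }
}
```

The sketch assumes well-formed input; an unbalanced `)` would make `pop()` throw. Because it is iterative, the depth of the tree does not affect the call stack.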

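Answer 1 (score: 2)

Well, the basics are reading, parsing, and what you do with the data.

Reading and parsing by recursive descent should be completely IO-bound. The parsing should take only a fraction of the time it takes to read the characters.

What you do with the data depends on how you design your data structure. If you're not careful, you can end up spending more time on memory management than on anything else.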
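As one illustration of a data-structure choice that keeps memory management cheap, the tree could be stored in flat primitive arrays instead of one object per node. This is a sketch under assumptions: the `FlatTree` class and its method names are hypothetical, and whether this layout suits the later task depends on how the tree is traversed.

```java
import java.util.Arrays;

public class FlatTree {
    private int[] langs = new int[1024];   // language index of each node
    private int[] parents = new int[1024]; // id of each node's parent, -1 for the boss
    private int size = 0;

    // Appends a node and returns its id; the arrays double when they fill up.
    int add(int language, int parentId) {
        if (size == langs.length) {
            langs = Arrays.copyOf(langs, size * 2);
            parents = Arrays.copyOf(parents, size * 2);
        }
        langs[size] = language;
        parents[size] = parentId;
        return size++;
    }

    int lang(int id) { return langs[id]; }
    int parent(int id) { return parents[id]; }
    int size() { return size; }

    public static void main(String[] args) {
        FlatTree t = new FlatTree();
        int boss = t.add(5, -1);
        int sub = t.add(3, boss);
        System.out.println(t.lang(sub) + " reports to node " + t.parent(sub));
        // prints "3 reports to node 0"
    }
}
```

Two `int` arrays grow by occasional doubling, so the parser does no per-node object allocation.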

Anyway, here's a bone-simple parser in C++. You can convert it to whatever language you like.

#include <cctype>

// Helpers the snippet relies on; these definitions are assumptions.
#define WHITE(c) (std::isspace(static_cast<unsigned char>(c)))
#define DIGIT(c) (std::isdigit(static_cast<unsigned char>(c)))
#define LP '('
#define RP ')'

// skip whitespace; stops at the terminating NUL
void scanWhite(const char* &pc){while(WHITE(*pc)) pc++;}

// consume c if it is the next non-white character
bool seeChar(const char* &pc, char c){
  scanWhite(pc);
  if (*pc != c) return false;
  pc++;
  return true;
}

// consume a decimal number into n if one comes next
bool seeNum(const char* &pc, int &n){
  scanWhite(pc);
  if (!DIGIT(*pc)) return false;
  n = 0; while(DIGIT(*pc)) n = n * 10 + (*pc++ - '0');
  return true;
}

// this sucks up strings of the form: either nothing or number ( ... )
bool readNumFollowedByList(const char* &pc){
  int n = 0;
  if (!seeNum(pc, n)) return false;
  // what you do with this number and what follows is up to you
  // if you hit the error, print a message and throw to the top level
  if (!seeChar(pc, LP)){ /* ERROR - NUMBER NOT FOLLOWED BY LEFT PAREN */ }
  // read any number of number ( ... )
  while(readNumFollowedByList(pc)); // <<-- note the recursion
  if (!seeChar(pc, RP)){ /* ERROR - MISSING RIGHT PAREN */ }
  return true;
}
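
Since the question is about Java, the same recursive-descent idea might look roughly like this when ported. It is a sketch, not a definitive translation: the `RecursiveDescent` class, its `nodeCount()` method, and the choice to throw `IllegalStateException` on malformed input are all assumptions made for illustration. It only counts the numbers it sees; building the tree is left to the caller.

```java
public class RecursiveDescent {
    private final char[] buf; // whole input, already in memory
    private int pos = 0;      // cursor into buf
    private int nodes = 0;    // how many numbers were parsed, as a visible result

    RecursiveDescent(String text) {
        buf = text.toCharArray();
    }

    private void scanWhite() {
        while (pos < buf.length && Character.isWhitespace(buf[pos])) pos++;
    }

    // Consumes c if it is the next non-white character.
    private boolean seeChar(char c) {
        scanWhite();
        if (pos >= buf.length || buf[pos] != c) return false;
        pos++;
        return true;
    }

    // Returns the number at the cursor, or -1 if no digit comes next.
    private int seeNum() {
        scanWhite();
        if (pos >= buf.length || buf[pos] < '0' || buf[pos] > '9') return -1;
        int n = 0;
        while (pos < buf.length && buf[pos] >= '0' && buf[pos] <= '9') {
            n = n * 10 + (buf[pos++] - '0');
        }
        return n;
    }

    // Consumes either nothing or: number ( ... )
    boolean readNumFollowedByList() {
        int n = seeNum();
        if (n < 0) return false;
        nodes++; // what to do with n is up to the caller
        if (!seeChar('(')) throw new IllegalStateException("number not followed by '(' at " + pos);
        while (readNumFollowedByList()) { /* each call consumes one child */ }
        if (!seeChar(')')) throw new IllegalStateException("missing ')' at " + pos);
        return true;
    }

    int nodeCount() { return nodes; }

    public static void main(String[] args) {
        RecursiveDescent p = new RecursiveDescent("5 (\n 5 (\n  3 (\n  )\n )\n 3 (\n )\n)");
        p.readNumFollowedByList();
        System.out.println(p.nodeCount() + " employees parsed");
        // prints "4 employees parsed"
    }
}
```

Recursion depth equals the depth of the tree, so a very deeply nested input could overflow the call stack; an explicit stack avoids that at the cost of slightly more bookkeeping.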

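Answer 2 (score: 0)

A proper implementation should really use a state machine and a Builder. I'm not sure how much more or less efficient this is, but it certainly lends itself to later enhancement and is really quite simple.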

static class Employee {

    final int language;
    final Employee parent;
    final List<Employee> children = new ArrayList<>();

    public Employee(int language, Employee parent) {
        this.language = language;
        this.parent = parent;
    }

    @Override
    public String toString() {
        StringBuilder s = new StringBuilder();
        s.append(language);
        if (!children.isEmpty()) {
            for (Employee child : children) {
                s.append("(").append(child.toString()).append(")");
            }
        } else {
            s.append("()");
        }
        return s.toString();
    }

    static class Builder {

        // Make a boss to wrap the data.
        Employee current = new Employee(0, null);
        // The number that is growing into the `language` field.
        StringBuilder number = new StringBuilder();
        // Bracket counter - not sure if this is necessary.
        int brackets = 0;
        // Current state.
        State state = State.Idle;

        enum State {

            Idle {

                        @Override
                        State next(Builder builder, char ch) {
                            // Any digits kick me into Number state.
                            if (Character.isDigit(ch)) {
                                return Number.next(builder, ch);
                            }
                            // Watch for brackets.
                            if ("()".indexOf(ch) != -1) {
                                return Bracket.next(builder, ch);
                            }
                            // No change - stay as I am.
                            return this;
                        }
                    },
            Number {

                        @Override
                        State next(Builder builder, char ch) {
                            // Any non-digits treated like an idle.
                            if (Character.isDigit(ch)) {
                                // Store it.
                                builder.number.append(ch);
                            } else {
                                // Now we have his number - make the new employee.
                                builder.current = new Employee(Integer.parseInt(builder.number.toString()), builder.current);
                                // Clear the number for next time around.
                                builder.number.setLength(0);
                                // Remember - could be an '('.
                                return Idle.next(builder, ch);
                            }
                            // No change - stay as I am.
                            return this;
                        }
                    },
            Bracket {

                        @Override
                        State next(Builder builder, char ch) {
                            // Open or close.
                            if (ch == '(') {
                                builder.brackets += 1;
                            } else {
                                builder.brackets -= 1;
                                // Keep that child.
                                Employee child = builder.current;
                                // Up to parent.
                                builder.current = builder.current.parent;
                                // Add the child.
                                builder.current.children.add(child);
                            }
                            // Always back to Idle after a bracket.
                            return Idle;
                        }
                    };

            abstract State next(Builder builder, char ch);
        }

        Builder data(String data) {
            for (int i = 0; i < data.length(); i++) {
                state = state.next(this, data.charAt(i));
            }
            return this;
        }

        Employee build() {
            // Current should hold the boss.
            return current;
        }
    }
}

static String testData = "5 (\n"
        + " 5 (\n"
        + "  3 (\n"
        + "  )\n"
        + " )\n"
        + " 3 (\n"
        + "  3 (\n"
        + "  )\n"
        + "  3 (\n"
        + "  )\n"
        + " )\n"
        + " 5 (\n"
        + "  2 (\n"
        + "  )\n"
        + "  4 (\n"
        + "  )\n"
        + " )\n"
        + ")";

public void test() throws IOException {
    Employee e = new Employee.Builder().data(testData).build();
    System.out.println(e.toString());
    // Collect every *.in file in the data directory (null if the directory is missing).
    File[] ins = new File("C:\\Temp\\datapub").listFiles(
            new FileFilter() {

                @Override
                public boolean accept(File file) {
                    return file.getName().endsWith(".in");
                }

            });
    if (ins == null) {
        return;
    }
    for (File f : ins) {
        Employee.Builder builder = new Employee.Builder();
        List<String> lines = java.nio.file.Files.readAllLines(f.toPath());
        // Time only the parsing, not the file read.
        long start = System.nanoTime();
        for (String line : lines) {
            builder.data(line);
        }
        System.out.println("Read file " + f + " took " + (System.nanoTime() - start) / 1_000_000 + " ms");
    }
}

Prints:

0(5(5(3()))(3(3())(3()))(5(2())(4())))

Note that the first element, 0, is the boss you mentioned.