Question

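I have been working on a function that searches hierarchically structured text. Because the files I work with are large in both size and number, optimization (speed and memory) is becoming increasingly important, and I am looking into ways to improve the algorithm.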

public NestLink<String> searchsubs(String input, NestLink<String> searchTags) {

    // parameter tests

    int startIndex = 0; // index of the opening tag of the current match
    int endIndex = 0; // index of the closing tag of the current match
    String openTag = searchTags.elem1; // start of the string I need
    String closeTag = searchTags.elem2; // end of the string I need
    String subString = null; // stores the result in between
    NestLink<String> node = null; // temp variable
    NestLink<String> out = null; // output variable

    while (true) {
        // find the opening string
        startIndex = input.indexOf(openTag, endIndex);
        if (startIndex == -1) {
            break; // if no more are found, break from the loop
        }
        // look for the closing string only after the end of the opening one
        endIndex = input.indexOf(closeTag, startIndex + openTag.length());
        if (endIndex == -1) {
            break; // if tag isn't closed, break from the loop
        }

        // we now have a pair of tags with content between them
        subString = input.substring(startIndex + openTag.length(), endIndex);

        // store what we found, method unimportant

        // search this content for each subsearch in the hierarchy
        for (NestLink<String> subSearch : searchTags.subBranches) {
            // recurse
            node = searchsubs(subString, subSearch);
            // do stuff with results
        }
    }

    // final do stuff
    return out;
}

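Note: NestLink is a custom tree structure, but its format is not important.

The result is that every level of the search creates a copy of the substring, sometimes up to 1 MB in size, which is clearly far from efficient.

To try to solve this, I considered the following: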

public NestLink<String> searchsubs(String input, int substringStart, int substringEnd,
        NestLink<String> searchTags) {

    // parameter tests

    int startIndex = substringStart; // index of the opening tag of the current match
    int endIndex = substringStart; // index of the closing tag of the current match
    String openTag = searchTags.elem1; // start of the string I need
    String closeTag = searchTags.elem2; // end of the string I need
    NestLink<String> node = null; // temp variable
    NestLink<String> out = null; // output variable

    while (true) {
        // find the opening string; it must fit entirely before substringEnd
        startIndex = input.indexOf(openTag, endIndex);
        if (startIndex == -1 || startIndex + openTag.length() > substringEnd) {
            break; // if no more are found, break from the loop
        }
        // look for the closing string only after the end of the opening one
        endIndex = input.indexOf(closeTag, startIndex + openTag.length());
        if (endIndex == -1 || endIndex + closeTag.length() > substringEnd) {
            break; // if tag isn't closed within the bounds, break from the loop
        }

        // we now have a pair of tags with content between them
        // store what we found, method unimportant

        // search this content for each subsearch in the hierarchy
        for (NestLink<String> subSearch : searchTags.subBranches) {
            // recurse, this time sharing input, but with a new substring start and end to serve as bounds
            node = searchsubs(input, startIndex + openTag.length(), endIndex, subSearch);
            // do stuff with results
        }
    }

    // final do stuff
    return out;
}

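This time, instead of creating a substring, I pass the input along with a pair of bounds. This raises the question: how will the JRE handle this? Will it copy the input string (making performance worse, since an even larger string is now being copied), or will it just pass a pointer to the object, as it does with other objects, so that all recursive calls share the same String object (a significant performance gain, since nothing is copied)?

Alternatively, are there any other concepts that might apply to hierarchical searches and hierarchical results?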

K.Barad

Answer 1

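substring does not create a new String holding a copy of the characters from the original string. It simply returns a String object that shares the same char array as the original, but with a different offset and length. (This holds for JDKs before Java 7 update 6; from that update on, substring copies the selected range into a new array.)

So your second implementation is equivalent to the first, only more complex (because it does by hand what String already does internally).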
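On the other half of the question, whether passing input to each recursive call copies it: Java passes object references by value, so every call sees the same String object. Here is a minimal sketch of that, together with a bounded search that never creates a substring. The names SharedInputDemo, passThrough and countTagPairs are made up for illustration and are not part of the asker's code.

```java
final class SharedInputDemo {

    // Returns its argument unchanged; the caller can compare identities
    // to confirm that the call did not copy the String.
    static String passThrough(String received) {
        return received;
    }

    // Counts open/close tag pairs that lie entirely inside input[from, to),
    // using only index arithmetic on the shared input (no substring calls).
    static int countTagPairs(String input, int from, int to,
                             String openTag, String closeTag) {
        int count = 0;
        int pos = from;
        while (true) {
            int start = input.indexOf(openTag, pos);
            if (start == -1 || start + openTag.length() > to) {
                break; // no further opening tag inside the bounds
            }
            int end = input.indexOf(closeTag, start + openTag.length());
            if (end == -1 || end + closeTag.length() > to) {
                break; // tag is not closed inside the bounds
            }
            count++;
            pos = end + closeTag.length(); // continue after the closing tag
        }
        return count;
    }

    public static void main(String[] args) {
        String input = "<a>one</a><a>two</a><b>three</b>";
        // Identity comparison (==), not equals(): same object on both sides.
        System.out.println(passThrough(input) == input);                            // true
        System.out.println(countTagPairs(input, 0, input.length(), "<a>", "</a>")); // 2
        System.out.println(countTagPairs(input, 0, 10, "<a>", "</a>"));             // 1
    }
}
```

One caveat with this approach: indexOf has no upper bound, so a search for a tag that does not occur inside the bounds still scans to the end of the whole input before giving up.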

Answer 2

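I'm afraid the Java standard API already implements this optimization, so there is nothing to gain from it.

java.lang.String consists of an underlying char[], an offset, and a length. substring() reuses the char[] and only adjusts the offset and length.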
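To illustrate the offset-and-length idea, here is a rough sketch of a view type built on the same principle. This is not the JDK's code: StringSlice and all of its methods are hypothetical names invented for this example. Something along these lines could be useful on JDKs where substring() copies (Java 7 update 6 and later), since the view keeps sharing the original characters.

```java
final class StringSlice implements CharSequence {
    private final String source; // shared backing string, never copied
    private final int offset;    // where this view starts inside source
    private final int length;    // number of characters in this view

    StringSlice(String source, int offset, int length) {
        if (offset < 0 || length < 0 || offset > source.length() - length) {
            throw new IndexOutOfBoundsException("offset=" + offset + ", length=" + length);
        }
        this.source = source;
        this.offset = offset;
        this.length = length;
    }

    @Override
    public int length() {
        return length;
    }

    @Override
    public char charAt(int index) {
        if (index < 0 || index >= length) {
            throw new IndexOutOfBoundsException("index=" + index);
        }
        return source.charAt(offset + index);
    }

    // A narrower view over the same backing string: only offset and length change.
    @Override
    public StringSlice subSequence(int start, int end) {
        if (start < 0 || end > length || start > end) {
            throw new IndexOutOfBoundsException("start=" + start + ", end=" + end);
        }
        return new StringSlice(source, offset + start, end - start);
    }

    // Index of target relative to this view, or -1 if it does not occur
    // entirely inside the view. The underlying search may scan past the
    // end of the view before the bounds check rejects the hit.
    int indexOf(String target, int fromIndex) {
        int found = source.indexOf(target, offset + Math.max(fromIndex, 0));
        if (found == -1 || found + target.length() > offset + length) {
            return -1;
        }
        return found - offset;
    }

    boolean sharesSourceWith(StringSlice other) {
        return source == other.source;
    }

    // Materializes the view as a String only when the caller asks for one.
    @Override
    public String toString() {
        return source.substring(offset, offset + length);
    }
}
```

A nested search could then pass whole.subSequence(contentStart, contentEnd) down the recursion in place of a substring, and call toString() only on the pieces it actually stores.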

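I suggest running a profiler on your code to find out where it actually spends most of its time before you think about any optimization. That will keep you from wasting effort like this, and I can almost guarantee you will be surprised by the results.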

Optimizing strings and nested searches
