Splitting a string with multiple delimiters using only String methods

Time: 2015-10-31 17:07:27

Tags: java tokenize

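I want to split a string into tokens.

I found another Stack Overflow question, Equivalent to StringTokenizer with multiple characters delimiters, but I would like to know whether this can be done using only String methods (.equals(), .startsWith(), etc.). I don't want to use regexes, the StringTokenizer class, Pattern, Matcher, or anything other than String.

For example, this is how I would like to call the method: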

String[] delimiters = {" ", "==", "=", "+", "+=", "++", "-", "-=", "--", "/", "/=", "*", "*=", "(", ")", ";", "/**", "*/", "\t", "\n"};
        String splitString[] = tokenizer(contents, delimiters);

Here is the code I lifted from the other question (this is what I don't want to do).

    private String[] tokenizer(String string, String[] delimiters) {
        // First, create a regular expression that matches the union of the
        // delimiters
        // Be aware that, in case of delimiters containing others (example &&
        // and &),
        // the longer may be before the shorter (&& should be before &) or the
        // regexpr
        // parser will recognize && as two &.
        Arrays.sort(delimiters, new Comparator<String>() {
            @Override
            public int compare(String o1, String o2) {
                return -o1.compareTo(o2);
            }
        });
        // Build a string that will contain the regular expression
        StringBuilder regexpr = new StringBuilder();
        regexpr.append('(');
        for (String delim : delimiters) { // For each delimiter
            if (regexpr.length() != 1)
                regexpr.append('|'); // Add union separator if needed
            for (int i = 0; i < delim.length(); i++) {
                // Add an escape character if the character is a regexp reserved
                // char
                regexpr.append('\\');
                regexpr.append(delim.charAt(i));
            }
        }
        regexpr.append(')'); // Close the union
        Pattern p = Pattern.compile(regexpr.toString());

        // Now, search for the tokens
        List<String> res = new ArrayList<String>();
        Matcher m = p.matcher(string);
        int pos = 0;
        while (m.find()) { // While there's a delimiter in the string
            if (pos != m.start()) {
                // If there's something between the current and the previous
                // delimiter
                // Add it to the tokens list
                res.add(string.substring(pos, m.start()));
            }
            res.add(m.group()); // add the delimiter
            pos = m.end(); // Remember end of delimiter
        }
        if (pos != string.length()) {
            // If it remains some characters in the string after last delimiter
            // Add this to the token list
            res.add(string.substring(pos));
        }
        // Return the result
        return res.toArray(new String[res.size()]);
    }
    public static String[] clean(final String[] v) {
        List<String> list = new ArrayList<String>(Arrays.asList(v));
        list.removeAll(Collections.singleton(" "));
        return list.toArray(new String[list.size()]);
    }

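Edit: I only want to use the String methods charAt, equals, equalsIgnoreCase, indexOf, length, and substring.

8 answers:

Answer 0 (score: 9)

Edit: My original answer did not quite solve the problem. It did not include the delimiters in the result array, and it used the String.split() method, which is not allowed.

Here is my new solution, split into two methods: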

/**
 * Splits the string at all specified literal delimiters, and includes the delimiters in the resulting array
 */
private static String[] tokenizer(String subject, String[] delimiters)  { 

    //Sort delimiters into length order, starting with longest
    Arrays.sort(delimiters, new Comparator<String>() {
        @Override
        public int compare(String s1, String s2) {
          return s2.length()-s1.length();
         }
      });

    //start with a list with only one string - the whole thing
    List<String> tokens = new ArrayList<String>();
    tokens.add(subject);

    //loop through the delimiters, splitting on each one
    for (int i=0; i<delimiters.length; i++) {
        tokens = splitStrings(tokens, delimiters, i);
    }

    return tokens.toArray(new String[] {});
}

/**
 * Splits each String in the subject at the delimiter
 */
private static List<String> splitStrings(List<String> subject, String[] delimiters, int delimiterIndex) {

    List<String> result = new ArrayList<String>();
    String delimiter = delimiters[delimiterIndex];

    //for each input string
    for (String part : subject) {

        int start = 0;

        //if this part equals one of the delimiters, don't split it up any more
        boolean alreadySplit = false;
        for (String testDelimiter : delimiters) {
            if (testDelimiter.equals(part)) {
                alreadySplit = true;
                break;
            }
        }

        if (!alreadySplit) {
            for (int index=0; index<part.length(); index++) {
                String subPart = part.substring(index);
                if (subPart.indexOf(delimiter)==0) {
                    result.add(part.substring(start, index));   // part before delimiter
                    result.add(delimiter);                      // delimiter
                    start = index+delimiter.length();           // next part starts after delimiter
                    index = Math.max(index, start-1);           // resume after the delimiter so matches cannot overlap
                }
            }
        }
        result.add(part.substring(start));                      // rest of string after last delimiter          
    }
    return result;
}

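Original answer

I notice that you are using Pattern even though you say you only want to use String methods.

The approach I would take is to look for the simplest way possible. I think that is to first replace every possible delimiter with a single delimiter, and then split on that one.

Here is the code: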

private String[] tokenizer(String string, String[] delimiters)  {       

    //replace all specified delimiters with one
    for (String delimiter : delimiters) {
        while (string.indexOf(delimiter)!=-1) {
            string = string.replace(delimiter, "{split}");
        }
    }

    //now split at the new delimiter
    return string.split("\\{split\\}");

}

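I need to use String.replace() rather than String.replaceAll() because replace() takes literal text whereas replaceAll() takes a regular expression argument, and the supplied delimiters are literal text.

I also wrapped the call in a while loop to make sure every instance of each delimiter is replaced. Note that replace() already replaces all occurrences, so the loop is redundant, and it would never terminate if a delimiter occurred inside the replacement text "{split}".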
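To illustrate the literal-versus-regex difference described above, here is a minimal sketch. The class and method names are made up for this example. It shows that replace() treats "+" as plain text, while replaceAll() tries to compile it as a regular expression and rejects it.

```java
import java.util.regex.PatternSyntaxException;

class ReplaceVersusReplaceAll {

    // Literal replacement: replace(CharSequence, CharSequence) swaps every occurrence.
    static String literal(String text, String delimiter, String marker) {
        return text.replace(delimiter, marker);
    }

    // Returns true if the delimiter is rejected when interpreted as a regex.
    static boolean failsAsRegex(String text, String delimiter, String marker) {
        try {
            text.replaceAll(delimiter, marker);
            return false;
        } catch (PatternSyntaxException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(literal("a+b+c", "+", "#"));      // a#b#c
        System.out.println(failsAsRegex("a+b+c", "+", "#")); // true
    }
}
```

Delimiters such as ";" happen to be valid regular expressions, so the failure only shows up for metacharacters like "+", "*", or "(".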

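Answer 1 (score: 3)

Using only non-regex String methods... I used the startsWith(...) method, which is not in the exclusive list of methods you gave, because it is a plain string comparison rather than a regex comparison.

As follows: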

public static void main(String ... params) {
    String haystack = "abcdefghijklmnopqrstuvwxyz";
    String [] needles = new String [] { "def", "tuv" };
    String [] tokens = splitIntoTokensUsingNeedlesFoundInHaystack(haystack, needles);
    for (String string : tokens) {
        System.out.println(string);
    }
}

private static String[] splitIntoTokensUsingNeedlesFoundInHaystack(String haystack, String[] needles) {
    List<String> list = new LinkedList<String>();
    StringBuilder builder = new StringBuilder();
    for(int haystackIndex = 0; haystackIndex < haystack.length(); haystackIndex++) {
        boolean foundAnyNeedle = false;
        String substring = haystack.substring(haystackIndex);
        for(int needleIndex = 0; (!foundAnyNeedle) && needleIndex < needles.length; needleIndex ++) {
            String needle = needles[needleIndex];
            if(substring.startsWith(needle)) {
                if(builder.length() > 0) {
                    list.add(builder.toString());
                    builder = new StringBuilder();
                }
                foundAnyNeedle = true;
                list.add(needle);
                haystackIndex += (needle.length() - 1);
            }
        }
        if( ! foundAnyNeedle) {
            builder.append(substring.charAt(0));
        }
    }
    if(builder.length() > 0) {
        list.add(builder.toString());
    }
    return list.toArray(new String[]{});
}

Output

abc
def
ghijklmnopqrs
tuv
wxyz

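Please note... This code is for demonstration only. If one of the delimiters is an empty string, it will behave badly and eventually crash with OutOfMemoryError: Java heap space, after burning a lot of CPU.

Answer 2 (score: 1)

As I understand your question, you can do something like this -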
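One way to avoid the empty-delimiter problem mentioned above is to validate the delimiters before tokenizing. The helper below is hypothetical (it is not part of the answer's code) and only sketches the idea of rejecting unusable input up front.

```java
class DelimiterCheck {

    // Hypothetical helper: rejects null or empty delimiters before any scanning starts.
    static void requireUsable(String[] delimiters) {
        for (int i = 0; i < delimiters.length; i++) {
            if (delimiters[i] == null || delimiters[i].length() == 0) {
                throw new IllegalArgumentException("delimiter " + i + " is null or empty");
            }
        }
    }

    public static void main(String[] args) {
        requireUsable(new String[] { "def", "tuv" }); // passes silently
        try {
            requireUsable(new String[] { "def", "" });
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage()); // delimiter 1 is null or empty
        }
    }
}
```

Calling such a check at the top of the splitting method would turn the endless loop into an immediate, descriptive exception.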

public Object[] tokenizer(String value, String[] delimeters){
    List<String> list= new ArrayList<String>();
    for(String s:delimeters){
        if(value.contains(s)){
            String[] strArr=value.split("\\"+s);
            for(String str:strArr){
                list.add(str);
                if(!list.contains(s)){
                    list.add(s);
                }
            }
        }
    }
    Object[] newValues=list.toArray();
    return newValues;
}

Now call this function from the main method -

String[] delimeters = {" ", "{", "==", "=", "+", "+=", "++", "-", "-=", "--", "/", "/=", "*", "*=", "(", ")", ";", "/**", "*/", "\t", "\n"};
    Object[] obj=st.tokenizer("ge{ab", delimeters); //st is a reference to the class that declares tokenizer; adjust this to your own code.
    for(Object o:obj){
        System.out.println(o.toString());
    }

Answer 3 (score: 1)

Suggestion:

  private static int INIT_INDEX_MAX_INT = Integer.MAX_VALUE;

  private static String[] tokenizer(final String string, final String[] delimiters) {
    final List<String> result = new ArrayList<>();

    int currentPosition = 0;
    while (currentPosition < string.length()) {
      // plan: search for the nearest delimiter and its position
      String nextDelimiter = "";
      int positionIndex = INIT_INDEX_MAX_INT;
      for (final String currentDelimiter : delimiters) {
        final int currentPositionIndex = string.indexOf(currentDelimiter, currentPosition);
        if (currentPositionIndex < 0) { // current delimiter not found, go to the next
          continue;
        }
        if (currentPositionIndex < positionIndex) { // we found a better one, update
          positionIndex = currentPositionIndex;
          nextDelimiter = currentDelimiter;
        }
      }
      if (positionIndex == INIT_INDEX_MAX_INT) { // we found nothing, finish up
        final String finalPart = string.substring(currentPosition, string.length());
        result.add(finalPart);
        break;
      }
      // we have one, add substring + delimiter to result and update current position
      // System.out.println(positionIndex + ":[" + nextDelimiter + "]"); // to follow the internals
      final String stringBeforeNextDelimiter = string.substring(currentPosition, positionIndex);
      result.add(stringBeforeNextDelimiter);
      result.add(nextDelimiter);
      currentPosition += stringBeforeNextDelimiter.length() + nextDelimiter.length();
    }

    return result.toArray(new String[] {});
  }

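Notes:

  • I added more comments than strictly necessary. I think they help in this case.
  • The performance is very poor (it could be improved with a tree structure and hashing). Performance is not part of the specification.
  • Operator precedence is not specified (see my comment on the question). It is not part of the specification either.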
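One possible way to settle the precedence question left open above is a "nearest match first, longest delimiter on ties" rule, so that "+=" wins over "+" when both start at the same position. The sketch below assumes that rule. The class name is invented, and unlike the code above it skips empty delimiters and does not emit empty strings between adjacent delimiters. The splitting itself still uses only indexOf(), length(), and substring().

```java
import java.util.ArrayList;
import java.util.List;

class NearestLongestTokenizer {

    static String[] tokenize(String text, String[] delimiters) {
        List<String> result = new ArrayList<String>();
        int position = 0;
        while (position < text.length()) {
            String best = null;
            int bestIndex = -1;
            for (String d : delimiters) {
                if (d.length() == 0) {
                    continue; // an empty delimiter would never advance the position
                }
                int index = text.indexOf(d, position);
                if (index < 0) {
                    continue; // this delimiter does not occur in the rest of the text
                }
                // Prefer the nearest match; at the same index prefer the longer delimiter.
                if (best == null || index < bestIndex
                        || (index == bestIndex && d.length() > best.length())) {
                    best = d;
                    bestIndex = index;
                }
            }
            if (best == null) {
                result.add(text.substring(position)); // no delimiter left
                break;
            }
            if (bestIndex > position) {
                result.add(text.substring(position, bestIndex)); // text before the delimiter
            }
            result.add(best);
            position = bestIndex + best.length();
        }
        return result.toArray(new String[0]);
    }

    public static void main(String[] args) {
        String[] delimiters = { " ", "=", "==", "+", "+=", "++", ";" };
        for (String token : tokenize("a+=b==c;", delimiters)) {
            System.out.println(token);
        }
        // prints a, +=, b, ==, c, ; on separate lines
    }
}
```

Because the tie is broken by length, the order of the delimiters array no longer matters for overlapping operators.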
  

"I only want to use the String methods charAt, equals, equalsIgnoreCase, indexOf, length, and substring"

Check. The function uses only indexOf(), length(), and substring().

  

"No, I mean in the returned result. For example, if my delimiter is { and the string is ge{ab, I want an array containing ge, {, and ab"

Check:

  private static void test() {
    final String[] delimiters = { "{" };
    final String contents = "ge{ab";
    final String splitString[] = tokenizer(contents, delimiters);
    final String joined = String.join("", splitString);
    System.out.println(Arrays.toString(splitString));
    System.out.println(contents.equals(joined) ? "ok" : "wrong: [" + contents + "]#[" + joined + "]");
  }
  // [ge, {, ab]
  // ok

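A final remark: if you are after best practices for this kind of problem, I would suggest reading up on compiler construction, in particular compiler front ends.

Answer 4 (score: 1)

Maybe I have not fully understood the question, but my impression is that you want to rewrite the Java String method split(). I suggest you look at that method, see how it is done, and start from there.

Answer 5 (score: 1)

Honestly, you could use Apache Commons Lang. If you check the library's source code, you will notice that it does not use regular expressions. The method [StringUtils.split](http://commons.apache.org/proper/commons-lang/javadocs/api-2.6/org/apache/commons/lang/StringUtils.html#split(java.lang.String,java.lang.String)) uses only String and a lot of flags.

Anyway, have a look at this code using Apache Commons Lang.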

import org.apache.commons.lang.StringUtils;
import org.junit.Assert;
import org.junit.Test;

public class SimpleTest {

    @Test
    public void testSplitWithoutRegex() {
        String[] delimiters = {"==", "+=", "++", "-=", "--", "/=", "*=", "/**", "*/",
            " ", "=", "+", "-", "/", "*", "(", ")", ";", "\t", "\n"};

        String finalDelimiter = "#";

        //check if delimiter can be used
        boolean canBeUsed = true;

        for (String delimiter : delimiters) {
            if (finalDelimiter.equals(delimiter)) {
                canBeUsed = false;
                break;
            }
        }

        if (!canBeUsed) {
            Assert.fail("The selected delimiter can't be used.");
        }

        String s = "Assuming that we have /** or /* all these signals like == and; / or * will be replaced.";
        System.out.println(s);

        for (String delimiter : delimiters) {
            while (s.indexOf(delimiter) != -1) {
                s = s.replace(delimiter, finalDelimiter);
            }
        }

        String[] splitted = StringUtils.split(s, "#");

        for (String s1 : splitted) {
            System.out.println(s1);
        }

    }
}

I hope it helps.

Answer 6 (score: 1)

As simple as I could make it...

public class StringTokenizer {
    public static String[] split(String s, String[] tokens) {
        Arrays.sort(tokens, new Comparator<String>() {
            @Override
            public int compare(String o1, String o2) {
                return o2.length()-o1.length();
            }
        });

        LinkedList<String> result = new LinkedList<>();

        int j=0;
        for (int i=0; i<s.length(); i++) {
            String ss = s.substring(i);

            for (String token : tokens) {
                if (ss.startsWith(token)) {
                    if (i>j) {
                        result.add(s.substring(j, i));
                    }

                    result.add(token);

                    j = i+token.length();
                    i = j-1;

                    break;
                }
            }
        }

        result.add(s.substring(j));

        return result.toArray(new String[result.size()]);
    }
}

It creates a lot of new objects, and could be optimized by writing a custom startsWith() implementation that compares the strings char by char.

@Test
public void test() {
    String[] split = StringTokenizer.split("this==is the most>complext<=string<<ever", new String[] {"=", "<", ">", "==", ">=", "<="});

    assertArrayEquals(new String[] {"this", "==", "is the most", ">", "complext", "<=", "string", "<", "<", "ever"}, split);
}

Passes fine :)

Answer 7 (score: 1)

You can use recursion (a hallmark of functional programming) to make it less verbose.
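The custom startsWith() mentioned above could look roughly like the sketch below. The class and method names are hypothetical. It checks for a token at a given offset using only charAt() and length(), so no substring is allocated for each position.

```java
class PrefixCheck {

    // Hypothetical replacement for s.substring(offset).startsWith(token):
    // compares char by char without creating a new String.
    static boolean startsWithAt(String s, int offset, String token) {
        if (offset < 0 || offset + token.length() > s.length()) {
            return false; // token would run past the end of s
        }
        for (int k = 0; k < token.length(); k++) {
            if (s.charAt(offset + k) != token.charAt(k)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String s = "this==is";
        System.out.println(startsWithAt(s, 4, "==")); // true
        System.out.println(startsWithAt(s, 5, "==")); // false
        System.out.println(startsWithAt(s, 7, "==")); // false
    }
}
```

In the loop above, `ss.startsWith(token)` would then become `startsWithAt(s, i, token)`. The JDK also offers String.startsWith(String prefix, int toffset), which does the same job if that method is acceptable.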

public static String[] tokenizer(String text, String[] delims) {
    for(String delim : delims) {
        int i = text.indexOf(delim);

        if(i >= 0) {

            // recursive call
            String[] tail = tokenizer(text.substring(i + delim.length()), delims);

            // return [ head, middle, tail.. ]
            String[] list = new String[tail.length + 2];
            list[0] = text.substring(0,i);
            list[1] = delim;
            System.arraycopy(tail, 0, list, 2, tail.length);
            return list;
        }
    }
    return new String[] { text };
}

Tested with the same unit test used in the other answer:

public static void main(String ... params) {
    String haystack = "abcdefghijklmnopqrstuvwxyz";
    String [] needles = new String [] { "def", "tuv" };
    String [] tokens = tokenizer(haystack, needles);
    for (String string : tokens) {
        System.out.println(string);
    }
}

Output

abc
def
ghijklmnopqrs
tuv
wxyz

It would be more elegant if Java had better native array support.
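One caveat with the recursive version: it splits at the first delimiter in array order that occurs anywhere in the text, and the text before that match is never scanned again. With the needles reversed to { "tuv", "def" }, the head "abcdefghijklmnopqrs" would come back unsplit. The sketch below is a variation rather than the original method. It picks the delimiter whose occurrence is earliest, so only the tail needs the recursive call. The class name is invented for the example, and skipping empty delimiters is an added assumption.

```java
class RecursiveTokenizer {

    static String[] tokenize(String text, String[] delims) {
        // Find the delimiter whose first occurrence is earliest in the text.
        String first = null;
        int firstIndex = -1;
        for (String delim : delims) {
            if (delim.length() == 0) {
                continue; // an empty delimiter would make the recursion endless
            }
            int i = text.indexOf(delim);
            if (i >= 0 && (first == null || i < firstIndex)) {
                first = delim;
                firstIndex = i;
            }
        }
        if (first == null) {
            return new String[] { text }; // base case: nothing left to split
        }
        // No delimiter starts before firstIndex, so only the tail is split further.
        String[] tail = tokenize(text.substring(firstIndex + first.length()), delims);
        String[] list = new String[tail.length + 2];
        list[0] = text.substring(0, firstIndex);
        list[1] = first;
        System.arraycopy(tail, 0, list, 2, tail.length);
        return list;
    }

    public static void main(String[] args) {
        String[] needles = { "tuv", "def" }; // reversed on purpose
        for (String token : tokenize("abcdefghijklmnopqrstuvwxyz", needles)) {
            System.out.println(token);
        }
        // prints abc, def, ghijklmnopqrs, tuv, wxyz on separate lines
    }
}
```

When two delimiters start at the same index, the one listed first in the array still wins, so overlapping operators such as "=" and "==" need the longer one listed first.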