Question

我使用Google Caliper对两种检查字符串中的mdn编号的方法进行基准测试。一种方法使用用户定义的方法，而另一种方法使用正则表达式。我很惊讶地发现，平均而言，正则表达式方法比用户定义的方法花费的时间长五倍。

这是我的基准测试代码。

package com.code4refernce.caliper;

import java.util.Random;
import java.util.regex.Pattern;

import com.google.caliper.Param;
import com.google.caliper.SimpleBenchmark;

public class SimpleCaliperTest extends SimpleBenchmark {
    String extensiveregex = "^\\d?(?:(?:[\\+]?(?:[\\d]{1,3}(?:[ ]+|[\\-.])))?[(]?(?:[\\d]{3})[\\-/)]?(?:[ ]+)?)?(?:[a-zA-Z2-9][a-zA-Z0-9 \\-.]{6,})(?:(?:[ ]+|[xX]|(i:ext[\\.]?)){1,2}(?:[\\d]{1,5}))?$";
    Pattern EXTENSIVE_REGEX_PATTERN = Pattern.compile(extensiveregex);

    String mdn[][];
    Random random;
    @Param
    int index;

    @Override
    protected void setUp() {
        random = new Random(0);
        mdn = new String[11][1<<16];
        for (int i=0; i<mdn.length; ++i) {
            mdn[0][i] = String.format("%03ddsfasdf00000", random.nextInt(1000));
            mdn[1][i] = String.format("%04d", random.nextInt(10000));
            mdn[2][i] = String.format("%10d", random.nextInt((int) 1e10));
            mdn[3][i] = String.format("-%10d", random.nextInt((int) 1e10));
            mdn[4][i] = String.format("%10d-", random.nextInt((int) 1e10));
            mdn[5][i] = String.format("%03d-%03d-%03d", random.nextInt(1000), random.nextInt(1000), random.nextInt(1000));
            mdn[6][i] = String.format("-%03d-%03d-%03d-", random.nextInt(1000), random.nextInt(1000), random.nextInt(1000));
            mdn[7][i] = String.format("%03d-%03d-%03d-", random.nextInt(1000), random.nextInt(1000), random.nextInt(1000));
            mdn[8][i] = String.format("%03d-%03d-%03d ext %04d", random.nextInt(1000), random.nextInt(1000), random.nextInt(1000), random.nextInt(10000));
            mdn[9][i] = String.format("%03d-%03d-%03d ext %04d-", random.nextInt(1000), random.nextInt(1000), random.nextInt(1000), random.nextInt(10000));
            mdn[10][i] = "123456789012345677890";
        }
    }

    /**
    *This method benchmark the user defined method to check the mdn.
    **/
    public boolean timeExtensiveSimpleMDNCheck(int reps){
        boolean results = false;
        for(int i = 0; i<reps; i ++){
            for(int index2=0; index2<mdn.length; index2++)
                //Use simple method to check the phone number in string.
            results ^= extensiveMDNCheckRegularMethod(mdn[index][index2]);
        }
        return results;
    }

    /**
    *This method benchmark the regex method.
    **/
    public boolean timeExtensiveMDNRegexCheck(int reps){
        boolean results = false;
        for(int i = 0; i<reps; i ++){
            for(int index2=0; index2<mdn.length; index2++)
                //user Regular expression to check the phone number in string.
            results ^=mdnExtensiveCheckRegEx(mdn[index][index2]);
        }
        return results;
    }

    public boolean extensiveMDNCheckRegularMethod(String mdn){

        //Strip the character which not numeric or 'x' character.
        String stripedmdn = stripString(mdn);

        if(stripedmdn.length() >= 10 && stripedmdn.length() <= 11 && (!stripedmdn.contains("x") || !stripedmdn.contains("X"))){
            //For following condition
            //1-123-456-7868 or 123-456-7868
            return true;
        }else if ( stripedmdn.length() >= 15 && stripedmdn.length() <= 16  ) {
            //1-123-456-7868 ext 2345 or  123-456-7868 ext 2345
            //
            if ( stripedmdn.contains("x") ) {
                int index = stripedmdn.indexOf("x");
                if(index >= 9 && index <= 10){
                    return true;
                }
            }else if( stripedmdn.contains("X") ) {
                int index = stripedmdn.indexOf("X");
                if(index >= 9 && index <= 10){
                    return true;
                }
            }
        }
        return false;
    }

    /**
     * Strip the other character and leave only x and numeric values.
     * @param extendedMdn
     * @return
     */
    public String stripString(String extendedMdn){
        byte mdn[] = new byte[extendedMdn.length()];
        int index = 0;
        for(byte b : extendedMdn.getBytes()){
            if((b >= '0' && b <='9') || b == 'x'){
                mdn[index++] = b;
            }
        }
        return new String(mdn);
    }

    private boolean mdnExtensiveCheckRegEx(String mdn){
        return EXTENSIVE_REGEX_PATTERN.matcher(mdn).matches();
    }
}

执行基准测试的主要类：

package com.code4refernce.caliper;
import com.google.caliper.Runner;

public class CaliperRunner {
    public static void main(String[] args) {
        String myargs[] = new String[1];
        myargs[0] = new String("-Dindex=0,1,2,3,4,5,6,7,8,9,10");
        Runner.main(SimpleCaliperTest.class, myargs);
    }
}

Caliper基准测试结果如下。

Benchmark index            us  linear runtime
ExtensiveSimpleMDNCheck     0  5.44 =====
ExtensiveSimpleMDNCheck     1  4.34 ====
ExtensiveSimpleMDNCheck     2  5.02 =====
ExtensiveSimpleMDNCheck     3  5.08 =====
ExtensiveSimpleMDNCheck     4  4.92 ====
ExtensiveSimpleMDNCheck     5  4.83 ====
ExtensiveSimpleMDNCheck     6  4.87 ====
ExtensiveSimpleMDNCheck     7  4.72 ====
ExtensiveSimpleMDNCheck     8  5.14 =====
ExtensiveSimpleMDNCheck     9  5.25 =====
ExtensiveSimpleMDNCheck    10  5.57 =====
 ExtensiveMDNRegexCheck     0 17.71 =================
 ExtensiveMDNRegexCheck     1 21.73 =====================
 ExtensiveMDNRegexCheck     2 13.47 =============
 ExtensiveMDNRegexCheck     3  3.37 ===
 ExtensiveMDNRegexCheck     4 12.44 ============
 ExtensiveMDNRegexCheck     5 26.06 ==========================
 ExtensiveMDNRegexCheck     6  3.36 ===
 ExtensiveMDNRegexCheck     7 29.84 ==============================
 ExtensiveMDNRegexCheck     8 23.80 =======================
 ExtensiveMDNRegexCheck     9 24.01 ========================
 ExtensiveMDNRegexCheck    10 20.53 ====================

我在这里遗漏了什么吗？为什么正则表达式需要更长时间才能执行？

Answer 1

正则表达式引擎只能与你提供的正则表达式一样好，而你的正则表达式效率非常低。我在RegexBuddy中尝试了这个输入：

1-123-456-7868 x2345!

...尾随!确保它无法匹配，但在此过程中做了大量工作。你的正则表达式采取了142步失败。然后我通过将大多数非捕获组更改为atomic groups并制作一些量词possessive来调整它，并且它只需要35个步骤才能失败。

仅供参考，如果您在使用正则表达式时会出现性能问题，那么绝大多数可能是您将看到它们的失败匹配尝试，而不是成功的匹配。当我从上面的字符串中删除!时，您的正则表达式和我的正则表达式只能在34个步骤中匹配。

另外，您的stripString()方法在很多方面都是错误的。您应该使用StringBuilder来创建新字符串，并且应该将char值与其他char进行比较，而不是byte。帮自己一个忙，忘记getBytes()方法和String(byte[])构造函数存在。如果必须执行String-to-byte []或byte [] - to-String转换，请始终使用允许您指定Charset的方法。

编辑根据下面的评论，这里是经过调整的正则表达式作为Java字符串文字：

"^\\d?(?>(?>\\+?(?>\\d{1,3}(?:\\s+|[.-])))?\\(?\\d{3}[/)-]?\\s*)?+(?>[a-zA-Z2-9][a-zA-Z0-9\\s.-]{6,})(?>(?>\\s+|[xX]|(i:ext\\s?)){1,2}\\d{1,5})?+$"

..以更易读的形式：

^
\d?
(?>
  (?>
    \+?
    (?>
      \d{1,3}
      (?:\s+|[.-])
    )
  )?
  \(?
  \d{3}
  [/)-]?
  \s*
)?+
(?>[a-zA-Z2-9][a-zA-Z0-9\s.-]{6,})
(?>
  (?>
    \s+
   |
    [xX]
   |
    (i:ext\s?)
  ){1,2}
  \d{1,5}
)?+
$

但我只是为了证明原子团和占有量词的影响而写的;为此，我独自留下了其他几个问题。我的观点是要证明写得不好的正则表达式对mdnExtensiveCheckRegEx()方法的性能有多大影响。

使用Caliper进行正则表达式的微基准测试

1 个答案: