用数字数据比cPickle更快?

时间:2013-05-30 09:57:36

标签: python numpy pickle

目前我正在使用Python进行图像检索。在该示例中从图像提取的关键点和描述符表示为numpy.array s。形状(2000,5)中的第一个和形状(2000,128)的后者。两者都只包含dtype=numpy.float32的值。

所以,我想知道使用哪种格式来保存我提取的关键点和描述符。即我总是保存2个文件:一个用于关键点,一个用于描述符 - 这在我的测量中算作一步。我比较了picklecPickle(协议0和2)和NumPy的二进制格式.pny,结果让我很困惑:

enter image description here

我一直认为cPickle应该比pickle模块更快。但特别是协议0的加载时间在结果中非常突出。 有没有人对此有解释?是因为我只使用数字数据吗?看起来很奇怪......

PS:在我的代码中,我基本上在每种技术上循环1000次(number=1000)并最终平均测量时间:

    timer = time.time

    print 'npy save...'
    t0 = timer()
    for i in range(number):
        numpy.save(npy_kp_path, kp)
        numpy.save(npy_descr_path, descr)
    t1 = timer()
    results['npy']['save'] = t1 - t0

    print 'npy load...'
    t0 = timer()
    for i in range(number):
        kp = numpy.load(npy_kp_path)
        descr = numpy.load(npy_descr_path)
    t1 = timer()
    results['npy']['load'] = t1 - t0


    print 'pickle protocol 0 save...'
    t0 = timer()
    for i in range(number):
        with open(pkl0_descr_path, 'wb') as f:
            pickle.dump(descr, f, protocol=0)
        with open(pkl0_kp_path, 'wb') as f:
            pickle.dump(kp, f, protocol=0)
    t1 = timer()
    results['pkl0']['save'] = t1 - t0

    print 'pickle protocol 0 load...'
    t0 = timer()
    for i in range(number):
        with open(pkl0_descr_path, 'rb') as f:
            descr = pickle.load(f)
        with open(pkl0_kp_path, 'rb') as f:
            kp = pickle.load(f)
    t1 = timer()
    results['pkl0']['load'] = t1 - t0


    print 'cPickle protocol 0 save...'
    t0 = timer()
    for i in range(number):
        with open(cpkl0_descr_path, 'wb') as f:
            cPickle.dump(descr, f, protocol=0)
        with open(cpkl0_kp_path, 'wb') as f:
            cPickle.dump(kp, f, protocol=0)
    t1 = timer()
    results['cpkl0']['save'] = t1 - t0

    print 'cPickle protocol 0 load...'
    t0 = timer()
    for i in range(number):
        with open(cpkl0_descr_path, 'rb') as f:
            descr = cPickle.load(f)
        with open(cpkl0_kp_path, 'rb') as f:
            kp = cPickle.load(f)
    t1 = timer()
    results['cpkl0']['load'] = t1 - t0


    print 'pickle highest protocol (2) save...'
    t0 = timer()
    for i in range(number):
        with open(pkl2_descr_path, 'wb') as f:
            pickle.dump(descr, f, protocol=pickle.HIGHEST_PROTOCOL)
        with open(pkl2_kp_path, 'wb') as f:
            pickle.dump(kp, f, protocol=pickle.HIGHEST_PROTOCOL)
    t1 = timer()
    results['pkl2']['save'] = t1 - t0

    print 'pickle highest protocol (2) load...'
    t0 = timer()
    for i in range(number):
        with open(pkl2_descr_path, 'rb') as f:
            descr = pickle.load(f)
        with open(pkl2_kp_path, 'rb') as f:
            kp = pickle.load(f)
    t1 = timer()
    results['pkl2']['load'] = t1 - t0


    print 'cPickle highest protocol (2) save...'
    t0 = timer()
    for i in range(number):
        with open(cpkl2_descr_path, 'wb') as f:
            cPickle.dump(descr, f, protocol=cPickle.HIGHEST_PROTOCOL)
        with open(cpkl2_kp_path, 'wb') as f:
            cPickle.dump(kp, f, protocol=cPickle.HIGHEST_PROTOCOL)
    t1 = timer()
    results['cpkl2']['save'] = t1 - t0

    print 'cPickle highest protocol (2) load...'
    t0 = timer()
    for i in range(number):
        with open(cpkl2_descr_path, 'rb') as f:
            descr = cPickle.load(f)
        with open(cpkl2_kp_path, 'rb') as f:
            kp = cPickle.load(f)
    t1 = timer()
    results['cpkl2']['load'] = t1 - t0 

1 个答案:

答案 0 :(得分:6)

ndarray的数字数据的(二进制表示)被腌制为一个长字符串。在从协议0文件中取消大字符串时,cPickle似乎确实比pickle慢得多。为什么?我的猜测是pickle使用了来自标准库的经过良好调整的字符串算法,并且cPickle落后了。

上面的观察来自于使用Python 2.7。 Python 3.3自动使用C扩展,比Python 2.7上的任何一个模块都快,所以显然这个问题已得到解决。

相关问题