Counting frequencies of a large number of items in Python

Time: 2013-03-13 12:54:35

Tags: python tree hashtable

I want to count the frequencies of a large number of items in memory. I used collections.Counter, but it gets slower and slower as more data comes in. That is somewhat expected, since it is essentially a hash table, but I did not expect it to get this slow. These are some log lines from my program:

$ grep merg ex42.out
2013-03-13 12:47:07,544 - ex42.data - DEBUG - Complete merging 4175889 keys in 0.650000 seconds.
2013-03-13 12:47:24,996 - ex42.data - DEBUG - Complete merging 4135905 keys in 7.890000 seconds.
2013-03-13 13:13:33,577 - ex42.data - DEBUG - Complete merging 4159325 keys in 21.560000 seconds.
2013-03-13 13:13:40,822 - ex42.data - DEBUG - Complete merging 4140346 keys in 23.070000 seconds.
2013-03-13 13:14:04,972 - ex42.data - DEBUG - Complete merging 4187157 keys in 35.340000 seconds.
2013-03-13 13:14:18,744 - ex42.data - DEBUG - Complete merging 4205433 keys in 31.900000 seconds.
2013-03-13 13:14:34,457 - ex42.data - DEBUG - Complete merging 4255486 keys in 35.940000 seconds.
2013-03-13 13:14:51,988 - ex42.data - DEBUG - Complete merging 4220057 keys in 39.950000 seconds.
2013-03-13 13:15:15,714 - ex42.data - DEBUG - Complete merging 4215430 keys in 45.280000 seconds.
2013-03-13 13:15:32,742 - ex42.data - DEBUG - Complete merging 4232054 keys in 47.470000 seconds.
2013-03-13 13:51:28,386 - ex42.data - DEBUG - Complete merging 4244061 keys in 2187.990000 seconds.
2013-03-13 13:51:46,548 - ex42.data - DEBUG - Complete merging 4306790 keys in 2195.190000 seconds.

I am still waiting for the next log line to come out...

I think the number of distinct items may be around 100 million, but I have 192 GB of RAM, so it should fit in memory comfortably. Is there any library that could help in this situation? I know about trees, heaps and so on, but implementing them myself is not only tiring but also error-prone, so I would rather reuse existing code.

Any suggestions would be appreciated!
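To make the merge times above easier to relate to, here is a minimal, illustrative sketch of the pattern being measured: repeatedly merging a partial Counter of a few million keys into one growing Counter. The random key generation here is made up for the sketch, not my actual data.

import random
import time
from collections import Counter

total = Counter()
for batch in range(10):
    # roughly 4 million tuple keys per partial counter, as in the log lines above
    partial = Counter((random.randrange(10**8), random.randrange(100))
                      for _ in range(4 * 10**6))
    start = time.time()
    total.update(partial)            # the "merging N keys" step from the logs
    print("merged %d keys into a table of %d entries in %.2f seconds"
          % (len(partial), len(total), time.time() - start))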


Update: I am counting co-occurrences in a (bunch of) text corpora. Basically, I count words that occur next to each other in a few GB of text. There is also some filtering to reduce the number of items, but as you can imagine it is still huge. Each item is a tuple (word, context). I use the default hash function.
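Conceptually, the counting looks roughly like the sketch below. The window of one neighbouring word on each side, the whitespace tokenization and the file name are illustrative assumptions, not my actual pipeline.

from collections import Counter

def adjacent_cooccurrences(tokens):
    # yield a (word, context) tuple for each pair of neighbouring words
    for i, word in enumerate(tokens):
        if i > 0:
            yield (word, tokens[i - 1])
        if i + 1 < len(tokens):
            yield (word, tokens[i + 1])

counter = Counter()
with open("corpus.txt") as f:        # hypothetical corpus file
    for line in f:
        counter.update(adjacent_cooccurrences(line.split()))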


Update 2: SQLite is not available. I tried it before and would love to use it, but I do not have enough permissions to install/fix it.

$ /opt/python/bin/python3
Python 3.2.1 (default, Nov 27 2012, 05:59:14) 
[GCC 4.4.6 20120305 (Red Hat 4.4.6-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sqlite3
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/python/lib/python3.2/sqlite3/__init__.py", line 23, in <module>
    from sqlite3.dbapi2 import *
  File "/opt/python/lib/python3.2/sqlite3/dbapi2.py", line 26, in <module>
    from _sqlite3 import *
ImportError: No module named _sqlite3
>>> quit()

$ ~/python2.7 
Python 2.7.3 (default, Nov 27 2012, 05:53:53) 
[GCC 4.4.6 20120305 (Red Hat 4.4.6-4)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sqlite3
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/python/lib/python2.7/sqlite3/__init__.py", line 24, in <module>
    from dbapi2 import *
  File "/opt/python/lib/python2.7/sqlite3/dbapi2.py", line 27, in <module>
    from _sqlite3 import *
ImportError: No module named _sqlite3
>>> quit()
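Since the tracebacks show that the pure-Python sqlite3 package is present but its _sqlite3 C extension is not, about all that can be done without installation rights is to detect the failure and fall back to the in-memory counter. A trivial, illustrative guard:

try:
    import sqlite3               # fails here with "No module named _sqlite3"
    HAVE_SQLITE = True
except ImportError:
    HAVE_SQLITE = False          # fall back to the in-memory Counter approach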

Update 3: the code. I found a bug that made the reported times longer than they actually are. I will update the log lines above once I have new numbers.

# (module-level imports -- Counter, time, Thread, Pool, logging -- are elided in this excerpt)
def _count_coocurrences(coocurrences_callback, corpus_paths, words, contexts,
                        multiprocessing=False, suspend_exceptions=True):
    counter = Counter()
    try:
        # find co-occurrences; the actual iteration is elided in this excerpt
        for c in []:  # placeholder for the elided loop over co-occurrences
            counter[c] += 1
    except Exception as e:
        # prevent the worker process from hanging (error handling elided)
        pass
    return counter


class AsynchronousCounter:
    '''
    A counter that provides an asynchronous update function
    '''

    def update_locked(self, counter):
        start = time.clock() # I made a mistake at this line (see Update 4)
        if self.lock.acquire():
            try:
                self.counter.update(counter)
                elapsed = time.clock() - start
                if _log.isEnabledFor(DEBUG):
                    _log.debug("Completed merging %d keys in %f seconds."
                                %(len(counter), elapsed))
            finally:
                self.lock.release()

    def update_async(self, counter):
        thread = Thread(target=self.update_locked, args=(counter,))
        thread.start()
        self.threads.append(thread)

    def wait(self):
        for thread in self.threads:
            thread.join()


    # some other functions


counter = AsynchronousCounter()
pool = Pool(max_processes)
for path in corpus_paths:
    pool.apply_async(_count_coocurrences, 
                     args=(coocurrences_callback, path, words, contexts, True),
                     callback=counter.update_async)
pool.close()
pool.join()
counter.wait()

Update 4: new statistics

I fixed the bug found in the previous update by moving start = time.clock() into the synchronized block, right before the self.counter.update() call (a sketch of the corrected method follows the log lines below). These are the latest results:

2013-03-14 12:30:54,888 - ex42.data - DEBUG - Completed merging 4140346 keys in 0.770000 seconds.
2013-03-14 12:56:47,205 - ex42.data - DEBUG - Completed merging 4135905 keys in 1536.090000 seconds.
2013-03-14 12:57:04,156 - ex42.data - DEBUG - Completed merging 4159325 keys in 18.250000 seconds.
2013-03-14 12:57:34,640 - ex42.data - DEBUG - Completed merging 4175889 keys in 30.760000 seconds.
2013-03-14 14:01:09,155 - ex42.data - DEBUG - Completed merging 4187157 keys in 3811.940000 seconds.
2013-03-14 14:01:51,244 - ex42.data - DEBUG - Completed merging 4220057 keys in 39.260000 seconds.
2013-03-14 14:02:07,782 - ex42.data - DEBUG - Completed merging 4215430 keys in 11.470000 seconds.
2013-03-14 14:02:40,478 - ex42.data - DEBUG - Completed merging 4205433 keys in 25.340000 seconds.
2013-03-14 14:42:48,693 - ex42.data - DEBUG - Completed merging 4232054 keys in 2371.140000 seconds.
2013-03-14 14:43:13,818 - ex42.data - DEBUG - Completed merging 4255486 keys in 12.360000 seconds.
2013-03-14 14:43:28,132 - ex42.data - DEBUG - Completed merging 4244061 keys in 11.990000 seconds.
2013-03-14 14:43:56,665 - ex42.data - DEBUG - Completed merging 4269879 keys in 23.470000 seconds.
2013-03-14 14:44:13,066 - ex42.data - DEBUG - Completed merging 4282191 keys in 11.810000 seconds.
2013-03-14 14:44:24,671 - ex42.data - DEBUG - Completed merging 4306790 keys in 11.320000 seconds.
2013-03-14 15:56:59,668 - ex42.data - DEBUG - Completed merging 4320573 keys in 4352.680000 seconds.
2013-03-14 15:57:09,125 - ex42.data - DEBUG - Completed merging 4130680 keys in 9.300000 seconds.
2013-03-14 15:57:18,628 - ex42.data - DEBUG - Completed merging 4104878 keys in 9.950000 seconds.
2013-03-14 15:57:27,747 - ex42.data - DEBUG - Completed merging 4095587 keys in 9.030000 seconds.
2013-03-14 15:59:29,345 - ex42.data - DEBUG - Completed merging 4088393 keys in 11.290000 seconds.
2013-03-14 17:23:36,209 - ex42.data - DEBUG - Completed merging 4082050 keys in 2374.850000 seconds.
2013-03-14 17:23:55,361 - ex42.data - DEBUG - Completed merging 4062960 keys in 13.840000 seconds.
2013-03-14 17:24:10,038 - ex42.data - DEBUG - Completed merging 4048144 keys in 12.140000 seconds.
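For reference, this is a sketch of update_locked after the fix, reconstructed from the description above (the rest of the class is unchanged):

    def update_locked(self, counter):
        if self.lock.acquire():
            try:
                start = time.clock()            # timer now starts inside the lock,
                self.counter.update(counter)    # so time spent waiting for the lock
                elapsed = time.clock() - start  # is no longer counted
                if _log.isEnabledFor(DEBUG):
                    _log.debug("Completed merging %d keys in %f seconds."
                               % (len(counter), elapsed))
            finally:
                self.lock.release()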

0 Answers:

There are no answers yet.