Question

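I am new to Python, and I have to thank everyone for the great discussions here, but I have a question for which I have not seen any suggestions. (Or they were too complex for me to understand.)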

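I have two lists (of tuples?), each with about one million entries. Both are sorted on the first entry (the word) and have the same format. Within each list, every word/page combination is unique.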

List1=  [('word1', 'page1'), ('word1', 'page2'), ('word3', 'page1'),...]
List2 = [('word1', 'page4'), ('word2', 'page2'), ('word3', 'page1'),...]

I need to find every word in List1 that also appears in List2. The output for this example should be

[('word1', 'page1'), ('word1', 'page2'), ('word1', 'page4'),('word3','page1')]

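I have been searching so much that I am now completely confused about sets, lists, tuples, strings... I could write a for loop, but it seems there should be a better option here.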

Answer 1

If you need to map words to pages, you can use a dict that maps each word to its list of pages.

from collections import defaultdict
# map each word to the list of pages it appears on
word_pages_1 = defaultdict(list)
for w, p in List1:
    word_pages_1[w].append(p)

Then you can perform set operations on the keys of your dicts to compare them.
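As a rough sketch of the whole idea, assuming Python 3 and the sample lists from the question (the helper name `group_pages` and the variable `common_words` are made up for illustration and are not part of the answer):

```python
from collections import defaultdict

List1 = [('word1', 'page1'), ('word1', 'page2'), ('word3', 'page1')]
List2 = [('word1', 'page4'), ('word2', 'page2'), ('word3', 'page1')]

def group_pages(pairs):
    """Map each word to the list of pages it appears on (hypothetical helper)."""
    word_pages = defaultdict(list)
    for w, p in pairs:
        word_pages[w].append(p)
    return word_pages

word_pages_1 = group_pages(List1)
word_pages_2 = group_pages(List2)

# dict key views support set operations such as intersection in Python 3
common_words = word_pages_1.keys() & word_pages_2.keys()

# collect the (word, page) pairs of both lists for the common words;
# the set comprehension drops pairs that occur in both lists
result = sorted(
    {(w, p) for w in common_words for p in word_pages_1[w] + word_pages_2[w]}
)
print(result)
```

Building the two dicts takes one pass over each list, so the work grows roughly linearly with the number of entries, at the cost of holding both dicts in memory.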

Answer 2

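This looks like a big-data problem. You may want to use dedicated tools such as numpy and pandas. If you have enough RAM to hold both datasets in memory, it can be done in numpy: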

In [103]:
import numpy as np
List1=  [('word1', 'page1'), ('word1', 'page2'), ('word3', 'page1')]
List2 = [('word1', 'page4'), ('word2', 'page2'), ('word3', 'page1')]

In [104]:
arr1 = np.array(List1)
arr2 = np.array(List2)

In [105]:
arr3 = np.vstack((arr1, arr2))  # stack the two datasets together
arr3

Out[105]:
array([['word1', 'page1'],
       ['word1', 'page2'],
       ['word3', 'page1'],
       ['word1', 'page4'],
       ['word2', 'page2'],
       ['word3', 'page1']], 
      dtype='|S5')

In [106]:
np.in1d(arr3[:,0], arr1[:,0])
# for each row in arr3, does its first value appear in the first column of arr1?

Out[106]:
array([ True,  True,  True,  True, False,  True], dtype=bool)

In [107]:
arr3[np.in1d(arr3[:,0], arr1[:,0])] #Boolean indexing

Out[107]:
array([['word1', 'page1'],
       ['word1', 'page2'],
       ['word3', 'page1'],
       ['word1', 'page4'],
       ['word3', 'page1']], 
      dtype='|S5')

In [108]:
set(map(tuple, arr3[np.in1d(arr3[:,0], arr1[:,0])]))

Out[108]:
{('word1', 'page1'),
 ('word1', 'page2'),
 ('word1', 'page4'),
 ('word3', 'page1')}
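Note that the mask in the session above only tests membership in arr1. It gives the expected result here because every word in List1 also occurs in List2. A minimal sketch that keeps only words present in both lists, assuming Python 3 and a NumPy version that provides `np.isin`, could look like this (the extra entry `('word5', 'page9')` and the names `common_words` and `mask` are added for illustration only):

```python
import numpy as np

# ('word5', 'page9') is a hypothetical extra entry: its word is only in List1
List1 = [('word1', 'page1'), ('word1', 'page2'), ('word3', 'page1'), ('word5', 'page9')]
List2 = [('word1', 'page4'), ('word2', 'page2'), ('word3', 'page1')]

arr1 = np.array(List1)
arr2 = np.array(List2)

# words that occur in the first column of both arrays
common_words = np.intersect1d(arr1[:, 0], arr2[:, 0])

# stack both datasets and keep the rows whose word is a common word
arr3 = np.vstack((arr1, arr2))
mask = np.isin(arr3[:, 0], common_words)

# .tolist() converts the rows back to plain Python strings before deduplicating
result = set(map(tuple, arr3[mask].tolist()))
print(result)
```

With this mask, `('word5', 'page9')` is dropped, whereas a test against arr1 alone would keep it.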

Answer 3

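I believe there are many ways to achieve your goal. Since your data is very large, you have to think about performance: do you care more about time or about space? Here are some examples.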

#!/usr/bin/python
#-*- coding:utf-8 -*-

L1 = [('word1', 'page1'), ('word1', 'page2'), ('word3', 'page1'), ]
L2 = [('word1', 'page4'), ('word2', 'page2'), ('word3', 'page2'), ]

def func1():
    '''
    Time Complexity is O(n^2) 
    '''
    res = []
    for i in L1:
        for k in L2:
            if i[0] == k[0]:
                res.append(i)
                res.append(k)
    return list(set(res))

def func2():
    '''
    Time Complexity is O(n)
    '''
    # group the pages of each list by word
    d1 = {}
    for word, page in L1:
        d1.setdefault(word, []).append(page)
    d2 = {}
    for word, page in L2:
        d2.setdefault(word, []).append(page)
    # keep only the words present in both dicts and merge their pages
    # (unlike func1, pairs that occur in both lists are not deduplicated)
    d3 = {}
    for key in d1:
        if key in d2:
            d3[key] = d2[key] + d1[key]

    return [(m, n) for m in d3 for n in d3[m]]



if __name__ == '__main__':
    print(func1())
    print(func2())

    # time 10000 calls of each function
    import timeit
    t = timeit.Timer(func1)
    print(t.timeit(10000))
    t = timeit.Timer(func2)
    print(t.timeit(10000))
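Because the question says both lists are already sorted on the word, a merge-style pass is another option that avoids building a dict for each whole list. The following is only a sketch of that different technique, not part of the answer above. It assumes Python 3 and that both lists are sorted by plain string comparison of the word; the function name `common_entries` is hypothetical.

```python
from itertools import groupby
from operator import itemgetter

def common_entries(list1, list2):
    """Return entries of both sorted lists whose word occurs in both lists."""
    # each generator yields (word, [entries with that word]) in sorted order;
    # the group is copied into a list before groupby moves on to the next word
    g1 = ((w, list(grp)) for w, grp in groupby(list1, key=itemgetter(0)))
    g2 = ((w, list(grp)) for w, grp in groupby(list2, key=itemgetter(0)))
    result = []
    a = next(g1, None)
    b = next(g2, None)
    while a is not None and b is not None:
        if a[0] < b[0]:
            a = next(g1, None)      # word only in list1 so far, skip it
        elif a[0] > b[0]:
            b = next(g2, None)      # word only in list2 so far, skip it
        else:
            # same word in both lists: keep entries from both, without duplicates
            seen = set(a[1])
            result.extend(a[1])
            result.extend(e for e in b[1] if e not in seen)
            a = next(g1, None)
            b = next(g2, None)
    return result

List1 = [('word1', 'page1'), ('word1', 'page2'), ('word3', 'page1')]
List2 = [('word1', 'page4'), ('word2', 'page2'), ('word3', 'page1')]
print(common_entries(List1, List2))
```

Each list is traversed once, and only the entries of the current word are held in memory besides the result. If the lists are sorted in some other order (for example numerically by a suffix), the comparison would need to use the same ordering.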
