Question

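I have several lists of words. Some of the lists share common words with each other. For each list, I am trying to find which other lists contain common words in the same sequence. For example, suppose these are my lists (using letters instead of words/strings for simplicity):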

list1 = [a,b,c,d]
list2 = [f,n,a,b,g]
list3 = [x,f,g,z]
list4 = [y,a,b,f,g,k]

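Here we can see that [a, b] from list1 also appears, in that order, in list2 and list4. We can also see that [f, g] from list3 appears in list4. So we would map these lists to each other as follows: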

list1: list2, list4 #(contains [a,b])
list2: list1, list4 #(contains [a,b])
list3: list4 #(contains [f,g])
list4: list1, list2, list3 #(contains [a,b] and [f,g])

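You can ignore the comments; they are only there for explanation, and the result is just list names mapped to each other. Note that even though list2 contains the elements 'f' and 'g', they do not appear in the order [f, g], so list2 is not mapped to list3 or list4.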

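I have written a function using set.intersection() to get the words common to all lists, but it does not take order into account. So I cannot figure out which data structure or algorithm to use to map the lists to each other in this way.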

I am trying the following, where wordlists is my list of lists, each containing its own number of words:

filelist = {}
for i in range(0, len(wordlists)):
    current_wordlist = wordlists[i]
    for j, j_word in enumerate(current_wordlist):
        if current_wordlist[j] == j_word:
            if j_word not in filelist:
                filelist[i] = {j}
            else:
                filelist[i].append(j)

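But it does not map correctly, because it does not map to the right list numbers. I would appreciate some feedback or another technique for checking this.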

How can I achieve this?

Answer 1

First, I would create a helper that builds the set of successive items for each list:

def create_successive_items(lst, n):
    return set(zip(*[lst[i:] for i in range(n)]))
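
As a small illustration of what this helper returns, here is a sketch that applies it to list2 from the question. The variable name pairs is only used for this example.

```python
def create_successive_items(lst, n):
    # Zip lst[0:], lst[1:], ..., lst[n-1:] together, which yields every
    # window of n consecutive items as a tuple.
    return set(zip(*[lst[i:] for i in range(n)]))

pairs = create_successive_items(['f', 'n', 'a', 'b', 'g'], 2)
# pairs holds ('f', 'n'), ('n', 'a'), ('a', 'b') and ('b', 'g')
print(sorted(pairs))
```

Because the windows are tuples, they are hashable and order-sensitive, so ('a', 'b') and ('b', 'a') are treated as different items in the set.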

Then you can simply check the intersections of these sets across all lists:

list1 = ['a','b','c','d']
list2 = ['f','n','a','b','g']
list3 = ['x','f','g','z']
list4 = ['y','a','b','f','g','k']


lists = [list1, list2, list3, list4]

# First look for two elements
i = 2

all_found = []

while True:
    # find all "i" successive items in each list as sets
    succ = [create_successive_items(lst, i) for lst in lists]
    founds = []
    # Check for matches in different lists
    for list_number1, successives1 in enumerate(succ, 1):
        # one only needs to check all remaining other lists so slice the first ones away
        for list_number2, successives2 in enumerate(succ[list_number1:], list_number1+1):
            # Find matches in the sets with intersection
            inters = successives1.intersection(successives2)
            # Print and save them
            if inters:
                founds.append((inters, list_number1, list_number2))
                print(list_number1, list_number2, inters)

    # If we found matches look for "i+1" successive items that match in the lists
    # One could also discard lists that didn't have "i" matches, but that makes it
    # much more complicated.
    if founds:
        i += 1
        all_found.append(founds)
    # no new found, just end it
    else:  
        break

This prints the matches:

1 2 {('a', 'b')}
1 4 {('a', 'b')}
2 4 {('a', 'b')}
3 4 {('f', 'g')}

The matches are also available in all_found and can be used and/or converted, for example to a dict:

matches = {}
for match, idx1, idx2 in all_found[0]:
    matches.setdefault(idx1, []).append(idx2)
    matches.setdefault(idx2, []).append(idx1)

>>> matches
{1: [2, 4], 
 2: [1, 4], 
 3: [4], 
 4: [1, 2, 3]}
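
If the shared sequences themselves are needed, and not only the list numbers, one possible sketch is to collect them per pair of lists. The all_found literal below is written out by hand to match what the loop produces for the four example lists, and the name shared is hypothetical.

```python
# all_found as the loop above builds it for the four example lists:
# one inner list per sequence length that produced matches (here only length 2).
all_found = [[
    ({('a', 'b')}, 1, 2),
    ({('a', 'b')}, 1, 4),
    ({('a', 'b')}, 2, 4),
    ({('f', 'g')}, 3, 4),
]]

shared = {}
for founds in all_found:
    for inters, idx1, idx2 in founds:
        # Merge the matches of every length under the same (idx1, idx2) key.
        shared.setdefault((idx1, idx2), set()).update(inters)

print(shared[(3, 4)])
```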

Answer 2

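You can have some fun with sets of tuples. Because tuples are hashable, all you need is a couple of helper functions that return every consecutive ordered sublist of a given list, and then you can compare lists using set intersection.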

from itertools import permutations
def radix(rg, n_len):
    """
    Returns all ordered sublists of length n_len from
    the list rg
    :type rg: list[char]
    :type n_len: int
    """
    for x in range(0, len(rg) - n_len + 1):
        yield tuple(rg[x:x + n_len])

def all_radixes(rg):
    """
    Returns all ordered sublists of length 2 or longer
    from the given list
    :type rg: list[char]
    """
    for x in range(2, len(rg) + 1):
        for result in radix(rg, x):
            yield result

def compare_lists(rg1, rg2):
    s1 = set(all_radixes(rg1))
    s2 = set(all_radixes(rg2))
    return s1 & s2

list1 = 'a,b,c,d'.split(',')
list2 = 'f,n,a,b,g'.split(',')
list3 = 'x,f,g,z'.split(',')
list4 = 'y,a,b,f,g,k'.split(',')

all_lists = [ list1, list2, list3, list4 ]
for z in permutations(all_lists, 2):
    # print() as a function so the example also runs on Python 3
    print('Intersection of %s and %s: %s' % (z[0], z[1], compare_lists(z[0], z[1])))
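
To get from pairwise intersections to the name-to-names mapping the question asks for, a minimal sketch could look like the following. It uses itertools.combinations so each pair is compared only once. The names windows, shared_runs, named_lists and mapping are assumptions made for this example and do not come from the answer.

```python
from itertools import combinations

def windows(seq, n):
    """Yield every run of n consecutive items of seq as a tuple."""
    for start in range(len(seq) - n + 1):
        yield tuple(seq[start:start + n])

def shared_runs(seq1, seq2, min_len=2):
    """Return the runs of at least min_len items that occur in both sequences."""
    def all_runs(seq):
        return {w for n in range(min_len, len(seq) + 1) for w in windows(seq, n)}
    return all_runs(seq1) & all_runs(seq2)

named_lists = {
    'list1': ['a', 'b', 'c', 'd'],
    'list2': ['f', 'n', 'a', 'b', 'g'],
    'list3': ['x', 'f', 'g', 'z'],
    'list4': ['y', 'a', 'b', 'f', 'g', 'k'],
}

# Map each list name to the names of the lists it shares an ordered run with.
mapping = {name: [] for name in named_lists}
for (name1, seq1), (name2, seq2) in combinations(named_lists.items(), 2):
    if shared_runs(seq1, seq2):
        mapping[name1].append(name2)
        mapping[name2].append(name1)

for name, others in mapping.items():
    print(name, others)
```

Building every run of every length is simple but grows quadratically with the length of each list, so for long word lists it may be worth capping the run length.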

Compare lists with other lists for elements in preserved order
