I have two lists of strings, as follows:
test1 = ["abc", "abcdef", "abcedfhi"]
test2 = ["The", "silver", "proposes", "the", "blushing", "number", "burst", "explores", "the", "fast", "iron", "impossible"]
The second list is longer, so I want to downsample it to the length of the first list by random sampling.
import random

def downsample(data):
    # Sample every list down to the length of the shortest one
    min_len = min(len(x) for x in data)
    return [random.sample(x, min_len) for x in data]

downsample([test1, test2])
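However, I want to add a constraint: the words chosen from the second list must match the length distribution of the first list. So the first randomly chosen word must have the same length as the first word of the shorter list, and so on. The catch is that sampling with replacement is not allowed either.
How can I randomly select n elements (n being the length of the shorter list) from test2 that match the character-length distribution of test1?
Thanks,
Jack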
Answer 0 (score: 7)
Setup
from collections import defaultdict
import random
dct = defaultdict(list)
l1 = ["abc", "abcdef", "abcedfhi"]
l2 = ["The", "silver", "proposes", "the", "blushing", "number", "burst", "explores", "the", "fast", "iron", "impossible"]
First, use collections.defaultdict to build a dictionary whose keys are word lengths:
for word in l2:
    dct[len(word)].append(word)
# Result
defaultdict(<class 'list'>, {3: ['The', 'the', 'the'], 6: ['silver', 'number'], 8: ['proposes', 'blushing', 'explores'], 5: ['burst'], 4: ['fast', 'iron'], 10: ['impossible']})
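Then you can use a simple list comprehension together with random.choice to pick, for each element of the first list, a random word of the same length. If a word length is not found in the dictionary, -1 is filled in instead: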
final = [random.choice(dct.get(len(w), [-1])) for w in l1]
# Output
['The', 'silver', 'blushing']
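To make the -1 fallback concrete, here is a small sketch. It assumes a hypothetical l1 containing a 7-letter word, a length that does not occur in l2; the rest mirrors the setup above.

```python
import random
from collections import defaultdict

# Hypothetical input: "abcdefg" has 7 letters, and no word in l2 has length 7
l1 = ["abc", "abcdefg"]
l2 = ["The", "silver", "proposes", "the", "blushing", "number", "burst",
      "explores", "the", "fast", "iron", "impossible"]

# Group the words of l2 by their length
dct = defaultdict(list)
for word in l2:
    dct[len(word)].append(word)

# dct.get() returns the fallback list [-1] when the length is missing
final = [random.choice(dct.get(len(w), [-1])) for w in l1]
print(final)     # the second entry is always -1

# Unlike dct[7], dct.get(7, ...) does not insert the missing key
print(7 in dct)  # False
```

Note that random.choice draws independently on each call, so this version can return the same element of l2 more than once.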
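Edit based on the clarified requirements
This approach satisfies the requirement that repeats are not allowed, provided list 2 itself contains no duplicates: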
for word in l2:
    dct[len(word)].append(word)
for k in dct:
    random.shuffle(dct[k])
final = [dct[len(w)].pop() for w in l1]
# ['The', 'silver', 'proposes']
If the second list does not contain enough words to complete the distribution, this approach raises an IndexError.
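Since pop() only fails once a bucket is exhausted, it may be useful to check up front whether the second list can cover every length that is needed. A minimal sketch, assuming a hypothetical helper named length_shortfall (it is not part of the answer above):

```python
from collections import Counter

def length_shortfall(reference, pool):
    """Return {length: number of missing words} for lengths the pool cannot cover."""
    needed = Counter(len(w) for w in reference)
    available = Counter(len(w) for w in pool)
    # A Counter returns 0 for lengths that never occur in the pool
    return {n: needed[n] - available[n] for n in needed if needed[n] > available[n]}

l1 = ["abc", "abcdef", "abcedfhi"]
l2 = ["The", "silver", "proposes", "the", "blushing", "number", "burst",
      "explores", "the", "fast", "iron", "impossible"]

print(length_shortfall(l1, l2))                             # {}
print(length_shortfall(["burst", "crash", "abcdefg"], l2))  # {5: 1, 7: 1}
```

An empty dictionary means the pop-based selection can complete without running out of words.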
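Answer 1 (score: 1)
One approach is to create a list of the lengths of the items in test1. Then use it to create another list containing sublists of the items from test2 that have those lengths. Finally, randomly pop from the list of lists (following a similar answer), so that an item is removed once it has been selected for the sample.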
from random import randrange
test1 = ["abc", "abcdef", "abcedfhi"]
test2 = ["The", "silver", "proposes", "the", "blushing", "number", "burst", "explores", "the", "fast", "iron", "impossible"]
sizes = [len(i) for i in test1]
# results: [3, 6, 8]
sublists = [[item for item in test2 if len(item) == i] for i in sizes ]
# results for sublists: [['The', 'the', 'the'], ['silver', 'number'], ['proposes', 'blushing', 'explores']]
# randomly pop from the list for samples
samples = [i.pop(randrange(len(i))) for i in sublists]
print('Samples: ',samples)
Result:
Samples: ['the', 'number', 'blushing']
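One caveat: if test1 contains two words of the same length, the comprehension above builds a separate sublist for each of them, so the same element of test2 could be popped twice. A possible variation, sketched here with a hypothetical function name and a hypothetical test1 that repeats a length, keeps one shared pool per distinct length:

```python
from random import randrange

def sample_by_length(reference, candidates):
    # One pool per distinct length, shared by all reference words of that length
    pools = {}
    for item in candidates:
        pools.setdefault(len(item), []).append(item)

    samples = []
    for word in reference:
        pool = pools.get(len(word), [])
        if not pool:
            raise ValueError(f"no unused candidate of length {len(word)}")
        # Popping removes the item, so it cannot be selected again
        samples.append(pool.pop(randrange(len(pool))))
    return samples

# Hypothetical input: two 3-letter words share the same pool
test1 = ["abc", "xyz", "abcdef"]
test2 = ["The", "silver", "proposes", "the", "blushing", "number", "burst",
         "explores", "the", "fast", "iron", "impossible"]

samples = sample_by_length(test1, test2)
print('Samples:', samples)
```

Because test2 itself holds "the" twice, the sample may still show that string twice; each occurrence comes from a different position in test2.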