Question

I have a sorted, unique numpy character array:

import numpy as np
vocab = np.asarray(['a', 'aaa', 'b', 'c', 'd', 'e', 'f'])

I have another, unsorted array (actually I have millions of them):

sentence = np.asarray(['b', 'aaa', 'b', 'aaa', 'b', 'z'])

This second array is much smaller than the first and may also contain values that are not in the original array.

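What I want to do is match each value in the second array to its corresponding index in the first, returning nan or a special value for non-matches.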

e.g.:

sentence_idx = np.asarray([2, 1, 2, 1, 2, np.nan])

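I've tried several iterations of a matching function built on np.in1d, but it always seems to break down for sentences that contain repeated words.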

I've also tried a few different list comprehensions, but they are too slow to run on my collection of millions of sentences.

So, what is the best way to accomplish this in numpy? In R I'd use the match function, but there doesn't seem to be a numpy equivalent.

Answer 1

You can use np.searchsorted, a neat tool for this kind of search, like so -

# Store matching indices of 'sentence' in 'vocab' when "left-searched"
out = np.searchsorted(vocab,sentence,'left').astype(float)

# Get matching indices of 'sentence' in 'vocab' when "right-searched".
# Now, the trick is that non-matches won't have any change between left 
# and right searches. So, compare these two searches and look for the 
# unchanged ones, which are the invalid ones and set them as NaNs.
right_idx = np.searchsorted(vocab,sentence,'right')
out[out == right_idx] = np.nan

Sample run -

In [17]: vocab = np.asarray(['a', 'aaa', 'b', 'c', 'd', 'e', 'f']) 
    ...: sentence = np.asarray(['b', 'aaa', 'b', 'aaa', 'b', 'z'])
    ...: 

In [18]: out = np.searchsorted(vocab,sentence,'left').astype(float)
    ...: right_idx = np.searchsorted(vocab,sentence,'right')
    ...: out[out == right_idx] = np.nan
    ...: 

In [19]: out
Out[19]: array([  2.,   1.,   2.,   1.,   2.,  nan])
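
If an integer result with a sentinel is more convenient than a float array with NaNs, the same idea can be sketched with a single np.searchsorted call followed by an equality check. This is only a sketch under some assumptions: match_sorted and its missing parameter are hypothetical names (not part of numpy), vocab is assumed to be sorted, unique and non-empty, and -1 is an arbitrary choice for the "not found" value. The second half shows one possible way to handle many sentences in one call, by concatenating them and splitting the result back.

```python
import numpy as np

def match_sorted(vocab, sentence, missing=-1):
    """Hypothetical helper: index of each `sentence` item in `vocab`.

    Assumes `vocab` is sorted, unique and non-empty.
    Items not present in `vocab` are mapped to `missing`.
    """
    vocab = np.asarray(vocab)
    sentence = np.asarray(sentence)
    # Leftmost insertion point of every item of `sentence` in `vocab`.
    idx = np.searchsorted(vocab, sentence, side='left')
    # Items that sort after every vocab entry get idx == len(vocab);
    # clip so the lookup below stays in bounds.
    safe_idx = np.minimum(idx, len(vocab) - 1)
    # An item is a real match only if vocab holds that exact value there.
    found = vocab[safe_idx] == sentence
    return np.where(found, safe_idx, missing)

vocab = np.asarray(['a', 'aaa', 'b', 'c', 'd', 'e', 'f'])
sentence = np.asarray(['b', 'aaa', 'b', 'aaa', 'b', 'z'])
result = match_sorted(vocab, sentence)
print(result)  # [ 2  1  2  1  2 -1]

# Many sentences at once: search the concatenation, then split it back.
sentences = [np.asarray(['b', 'z']), np.asarray(['aaa', 'c', 'a'])]
lengths = [len(s) for s in sentences]
flat = match_sorted(vocab, np.concatenate(sentences))
per_sentence = np.split(flat, np.cumsum(lengths)[:-1])
```

Keeping the output as integers means it can be used directly for indexing after masking out the -1 entries. Whether batching the sentences actually helps will depend on how the data is stored, so it is worth timing on the real collection.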

Match a non-unique, unsorted array to indices in a unique, sorted array
