Question

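I have a DataFrame, data, with nearly 4 million rows. It is a list of the world's cities. I need to query by city name as fast as possible.

I found one approach that takes 346 ms by indexing on the city name:

d2 = data.set_index("city", inplace=False)

%timeit d2.loc[['PARIS']]

1 loops, best of 3: 346 ms per loop

This is still far too slow. I wonder whether a faster query is possible with group-by (and how to write such a query). Each city has about 10 rows in the DataFrame (duplicate cities). I have searched for several days but could not find a clear solution on the internet.

Thank you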

Answer 1

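Taking the index's underlying array data, comparing it against the wanted index label, and then indexing with the resulting boolean mask might be an option when lookup performance matters. A sample case should make things clear.

1) Input dataframes: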

In [591]: df
Out[591]: 
    city  population
0  Delhi        1000
1  Paris          56
2     NY          89
3  Paris          36
4  Delhi         300
5  Paris          52
6  Paris          34
7  Delhi          40
8     NY          89
9  Delhi         450

In [592]: d2 = df.set_index("city",inplace=False)

In [593]: d2
Out[593]: 
       population
city             
Delhi        1000
Paris          56
NY             89
Paris          36
Delhi         300
Paris          52
Paris          34
Delhi          40
NY             89
Delhi         450

2) Indexing with .loc:

In [594]: d2.loc[['Paris']]
Out[594]: 
       population
city             
Paris          56
Paris          36
Paris          52
Paris          34

3) Indexing with a mask:

In [595]: d2[d2.index.values=="Paris"]
Out[595]: 
       population
city             
Paris          56
Paris          36
Paris          52
Paris          34

4) Finally, timings:

In [596]: %timeit d2.loc[['Paris']]
1000 loops, best of 3: 475 µs per loop

In [597]: %timeit d2[d2.index.values=="Paris"]
10000 loops, best of 3: 156 µs per loop
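For readers working outside IPython, the same comparison can be sketched with the standard-library timeit module. The table size, the key "Paris9999", and the helper name best_time are assumptions made for illustration only. Absolute numbers will differ by machine and pandas version.

```python
import timeit

import pandas as pd

# Synthetic stand-in: 10,000 distinct cities, 10 rows each (an assumption).
df = pd.DataFrame(
    data=[["Paris" + str(i), i] for i in range(10_000)] * 10,
    columns=["city", "value"],
)
d2 = df.set_index("city")


def best_time(func, repeat=3, number=10):
    """Return the best per-call time in seconds over `repeat` runs."""
    return min(timeit.repeat(func, repeat=repeat, number=number)) / number


# Label-based lookup versus boolean-mask lookup on the same indexed frame.
t_loc = best_time(lambda: d2.loc[["Paris9999"]])
t_mask = best_time(lambda: d2[d2.index.values == "Paris9999"])

print(f"loc:  {t_loc * 1e3:.3f} ms per call")
print(f"mask: {t_mask * 1e3:.3f} ms per call")
```

Taking the minimum over several repeats mirrors the "best of 3" figure that %timeit reports in the transcripts here.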

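Further boost

Working with array data, we can extract the entire input dataframe as an array and index into that. An implementation following that philosophy would look something like this -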

def full_array_based(d2, indexval):
    df0 = pd.DataFrame(d2.values[d2.index.values==indexval])
    df0.index = [indexval]*df0.shape[0]
    df0.columns = d2.columns
    return df0

Sample run and timings -

In [635]: full_array_based(d2, "Paris")
Out[635]: 
       population
Paris          56
Paris          36
Paris          52
Paris          34

In [636]: %timeit full_array_based(d2, "Paris")
10000 loops, best of 3: 146 µs per loop
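As a quick sanity check, the mask-based lookup can be wrapped in a small function and compared against .loc on the sample data. This is only a sketch; the helper name mask_lookup is made up for illustration.

```python
import pandas as pd

# The small sample table used in this answer.
df = pd.DataFrame({
    "city": ["Delhi", "Paris", "NY", "Paris", "Delhi",
             "Paris", "Paris", "Delhi", "NY", "Delhi"],
    "population": [1000, 56, 89, 36, 300, 52, 34, 40, 89, 450],
})
d2 = df.set_index("city")


def mask_lookup(frame, key):
    """Select the rows whose index label equals `key` using a boolean mask."""
    mask = frame.index.values == key  # elementwise comparison on the index array
    return frame[mask]


via_loc = d2.loc[["Paris"]]
via_mask = mask_lookup(d2, "Paris")
print(via_mask)
```

Both selections should contain the same four Paris rows, in their original order.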

Answer 2

Setup

df = pd.DataFrame(data=[['Paris'+str(i),i] for i in range(100000)]*10,columns=['city','value'])

Baseline

df2 = df.set_index('city')
%timeit df2.loc[['Paris9999']]
10 loops, best of 3: 45.6 ms per loop

Solution

Build a lookup dictionary, then use iloc:

idx_dict = df.groupby(by='city').apply(lambda x: x.index.tolist()).to_dict()

%timeit df.iloc[idx_dict['Paris9999']]
1000 loops, best of 3: 432 µs per loop

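This approach appears to be about 100 times faster than the baseline (432 µs versus 45.6 ms), not counting the one-time cost of building the dictionary.

Comparison with the other approaches: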

%timeit df2[df2.index.values=="Paris9999"]
100 loops, best of 3: 16.7 ms per loop

%timeit full_array_based(df2, "Paris9999")
10 loops, best of 3: 19.6 ms per loop
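One caveat: the dictionary above stores index labels, which only coincide with row positions because df has the default integer index. A possible variant, sketched here on a smaller synthetic table (the size and the key "Paris999" are assumptions), uses the indices attribute of the groupby object, which maps each key to an array of row positions and so pairs naturally with iloc.

```python
import pandas as pd

# Smaller synthetic table: 1,000 distinct cities, 10 rows each (an assumption).
df = pd.DataFrame(
    data=[["Paris" + str(i), i] for i in range(1000)] * 10,
    columns=["city", "value"],
)

# GroupBy.indices is a dict: group key -> NumPy array of integer row positions.
idx_dict = df.groupby("city").indices

# Positional lookup; works regardless of what the frame's index labels are.
rows = df.iloc[idx_dict["Paris999"]]
print(rows)
```

The dictionary is built once up front, so this only pays off when many lookups are made against the same frame.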

Answer 3

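If we are allowed to preprocess and set up a dictionary that can be indexed by the city string to extract the matching rows from the input dataframe, here is a solution using NumPy -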

def indexed_dict_numpy(df):
    cs = df.city.values.astype(str)
    sidx = cs.argsort()
    scs = cs[sidx]    
    idx = np.concatenate(( [0], np.flatnonzero(scs[1:] != scs[:-1])+1, [cs.size]))
    return {n:sidx[i:j] for n,i,j in zip(cs[sidx[idx[:-1]]], idx[:-1], idx[1:])}

Sample run -

In [10]: df
Out[10]: 
    city  population
0  Delhi        1000
1  Paris          56
2     NY          89
3  Paris          36
4  Delhi         300
5  Paris          52
6  Paris          34
7  Delhi          40
8     NY          89
9  Delhi         450

In [11]: dict1 = indexed_dict_numpy(df)

In [12]: df.iloc[dict1['Paris']]
Out[12]: 
    city  population
1  Paris          56
3  Paris          36
5  Paris          52
6  Paris          34

Runtime test against @Allen's solution, setting up a similar dictionary for 4 million rows -

In [43]: # Setup 4 million rows of df
    ...: df = pd.DataFrame(data=[['Paris'+str(i),i] for i in range(400000)]*10,\
    ...:                                                 columns=['city','value'])
    ...: np.random.shuffle(df.values)
    ...: 

In [44]: %timeit df.groupby(by='city').apply(lambda x: x.index.tolist()).to_dict()
1 loops, best of 3: 2.01 s per loop

In [45]: %timeit indexed_dict_numpy(df)
1 loops, best of 3: 1.15 s per loop
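To check that the sort-based dictionary returns the right rows, it can be compared against a plain boolean mask on shuffled data. The sketch below is a lightly adapted, self-contained version of the function above. Two changes are my own assumptions, not part of the original: kind="stable" so the positions inside each group stay in ascending order, and an early return for an empty frame. The test data is also made up.

```python
import numpy as np
import pandas as pd


def indexed_dict_numpy(df):
    """Map each city name to the array of row positions where it occurs."""
    cs = df["city"].to_numpy().astype(str)
    if cs.size == 0:                       # assumption: empty frame -> empty dict
        return {}
    sidx = cs.argsort(kind="stable")       # positions that sort the city names
    scs = cs[sidx]                         # sorted city names
    # Boundaries of each run of equal names in the sorted array.
    bounds = np.concatenate(
        ([0], np.flatnonzero(scs[1:] != scs[:-1]) + 1, [cs.size])
    )
    return {str(scs[i]): sidx[i:j] for i, j in zip(bounds[:-1], bounds[1:])}


# Synthetic, shuffled data: 200 distinct cities, 5 rows each (an assumption).
df = pd.DataFrame(
    data=[["Paris" + str(i), i] for i in range(200)] * 5,
    columns=["city", "value"],
)
df = df.sample(frac=1, random_state=0).reset_index(drop=True)

lookup = indexed_dict_numpy(df)

# Compare one key against a direct boolean-mask search.
expected = np.flatnonzero(df["city"].to_numpy() == "Paris7")
print(np.array_equal(lookup["Paris7"], expected))
print(df.iloc[lookup["Paris7"]])
```

Here the shuffle uses df.sample followed by reset_index, which returns a genuinely reordered frame with a fresh default index.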

Optimizing string queries with pandas on large data
