Question

I have a DataFrame with 3 columns. Each row holds the probabilities that the feature T takes the values 1, 2 and 3 for that row:

import pandas as pd
import numpy as np
np.random.seed(42)
df = pd.DataFrame({"T1" : [0.8,0.5,0.01],"T2":[0.1,0.2,0.89],"T3":[0.1,0.3,0.1]})

For row 0, T is 1 with 80% probability, 2 with 10%, and 3 with 10%.

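I want to simulate the value of T for each row and turn the columns T1, T2, T3 into binary features. I have a solution, but it loops over the rows of the DataFrame, which is really slow (my real DataFrame has over a million rows):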

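possib = df.columns
for i in range(df.shape[0]):
    # Probabilities for row i, in column order
    probas = df.iloc[i][possib].tolist()
    choix_transp = np.random.choice(possib, 1, p=probas)[0]
    for pos in possib:
        # Assign through a single .loc call; chained df.iloc[i][pos] = ... may write to a copy
        df.loc[df.index[i], pos] = 1 if pos == choix_transp else 0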

Is there a way to vectorize this code?

Thanks!

Answer 1

Here's an approach based on "vectorized random.choice with a given matrix of probabilities":

def matrixprob_to_onehot(ar):
    # Get one-hot encoded boolean array based on matrix of probabilities
    c = ar.cumsum(axis=1)
    idx = (np.random.rand(len(c), 1) < c).argmax(axis=1)
    ar_out = np.zeros(ar.shape, dtype=bool)
    ar_out[np.arange(len(idx)),idx] = 1
    return ar_out

ar_out = matrixprob_to_onehot(df.values)
df_out = pd.DataFrame(ar_out.view('i1'), index=df.index, columns=df.columns)
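
To see why the cumulative-sum comparison picks each column with the right probability, it may help to trace it with fixed uniform draws instead of random ones. The sketch below is only an illustration: `onehot_from_uniform` and the hand-picked values in `u` are hypothetical names introduced here, not part of the code above.

```python
import numpy as np


def onehot_from_uniform(probs, u):
    # Same logic as matrixprob_to_onehot, but the uniform draws are passed in
    # so the result is deterministic and can be followed by hand.
    c = probs.cumsum(axis=1)          # running totals per row; last column is ~1
    idx = (u < c).argmax(axis=1)      # first column whose running total exceeds u
    out = np.zeros(probs.shape, dtype=bool)
    out[np.arange(len(idx)), idx] = True
    return out


probs = np.array([[0.8, 0.1, 0.1],
                  [0.5, 0.2, 0.3],
                  [0.01, 0.89, 0.1]])

# Hand-picked stand-ins for np.random.rand(3, 1)
u = np.array([[0.85], [0.30], [0.95]])

onehot = onehot_from_uniform(probs, u)
print(onehot.astype(int))
```

With these draws, row 0 lands in the second column (0.85 lies between the running totals 0.8 and 0.9), row 1 in the first, and row 2 in the third. In general a draw selects a column exactly when it falls between that column's running total and the previous one, and that interval has the width of the column's probability.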

Verifying the probabilities with a large number of samples:

In [139]: df = pd.DataFrame({"T1" : [0.8,0.5,0.01],"T2":[0.1,0.2,0.89],"T3":[0.1,0.3,0.1]})

In [140]: df
Out[140]: 
     T1    T2   T3
0  0.80  0.10  0.1
1  0.50  0.20  0.3
2  0.01  0.89  0.1

In [141]: p = np.array([matrixprob_to_onehot(df.values) for i in range(100000)]).argmax(2)

In [142]: np.array([np.bincount(p[:,i])/100000.0 for i in range(len(df))])
Out[142]: 
array([[0.80064, 0.0995 , 0.09986],
       [0.50051, 0.20113, 0.29836],
       [0.01015, 0.89045, 0.0994 ]])

In [145]: np.round(_,2)
Out[145]: 
array([[0.8 , 0.1 , 0.1 ],
       [0.5 , 0.2 , 0.3 ],
       [0.01, 0.89, 0.1 ]])
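
The same check can also be done in one vectorized call rather than 100000 separate ones, by repeating each probability row many times and averaging the one-hot result per original row. This is a sketch under assumptions: `n_draws` and the seed are arbitrary choices made here, and the function is copied from above so the snippet runs on its own.

```python
import numpy as np


def matrixprob_to_onehot(ar):
    # Copied from the answer above
    c = ar.cumsum(axis=1)
    idx = (np.random.rand(len(c), 1) < c).argmax(axis=1)
    ar_out = np.zeros(ar.shape, dtype=bool)
    ar_out[np.arange(len(idx)), idx] = 1
    return ar_out


probs = np.array([[0.8, 0.1, 0.1],
                  [0.5, 0.2, 0.3],
                  [0.01, 0.89, 0.1]])
n_draws = 100_000                     # assumed number of samples per original row

np.random.seed(0)
# Each original row becomes a contiguous block of n_draws identical rows
stacked = np.repeat(probs, n_draws, axis=0)
onehot = matrixprob_to_onehot(stacked)

# Reshape to (original rows, draws, columns) and average over the draws
freq = onehot.reshape(len(probs), n_draws, -1).mean(axis=1)
print(np.round(freq, 2))
```

The rounded frequencies should come out close to the input probabilities, with small deviations from run to run depending on the seed.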

Timings on `1,000,000` rows:

# Setup input
In [169]: N = 1000000
     ...: a = np.random.rand(N,3)
     ...: df = pd.DataFrame(a/a.sum(1,keepdims=True),columns=['T1','T2','T3'])

# @gmds's soln
In [171]: %timeit pd.get_dummies((np.random.rand(len(df), 1) > df.cumsum(axis=1)).idxmin(axis=1))
1 loop, best of 3: 4.82 s per loop

# Soln from this post
In [172]: %%timeit 
     ...: ar_out = matrixprob_to_onehot(df.values)
     ...: df_out = pd.DataFrame(ar_out.view('i1'), index=df.index, columns=df.columns)
10 loops, best of 3: 43.1 ms per loop
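
For anyone who wants to rerun the comparison outside IPython, here is a rough sketch using the standard `timeit` module. The helper names `numpy_version`, `get_dummies_version` and `time_both` are made up for this example, and the numbers it reports will depend on the machine and on the NumPy and pandas versions.

```python
import timeit

import numpy as np
import pandas as pd


def matrixprob_to_onehot(ar):
    # Copied from the answer above
    c = ar.cumsum(axis=1)
    idx = (np.random.rand(len(c), 1) < c).argmax(axis=1)
    ar_out = np.zeros(ar.shape, dtype=bool)
    ar_out[np.arange(len(idx)), idx] = 1
    return ar_out


def numpy_version(df):
    ar_out = matrixprob_to_onehot(df.values)
    return pd.DataFrame(ar_out.view('i1'), index=df.index, columns=df.columns)


def get_dummies_version(df):
    picks = (np.random.rand(len(df), 1) > df.cumsum(axis=1)).idxmin(axis=1)
    return pd.get_dummies(picks)


def time_both(n_rows=1_000_000, repeats=3):
    # Build a random probability table whose rows sum to 1
    a = np.random.rand(n_rows, 3)
    df = pd.DataFrame(a / a.sum(axis=1, keepdims=True), columns=['T1', 'T2', 'T3'])
    timings = {}
    for name, func in [('get_dummies', get_dummies_version), ('numpy', numpy_version)]:
        # Best of `repeats` single runs, in seconds
        timings[name] = min(timeit.repeat(lambda: func(df), number=1, repeat=repeats))
    return timings
```

Calling `time_both()` runs both versions on a million rows and returns the best time for each in seconds; pass a smaller `n_rows` for a quick check.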

Answer 2

We can use numpy for this:

result = pd.get_dummies((np.random.rand(len(df), 1) > df.cumsum(axis=1)).idxmin(axis=1))

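This generates a single column of random values and compares it to the cumulative sum taken across the columns of the DataFrame. The result is a DataFrame of boolean values in which the first False in each row marks the "bucket" the random value falls into. With idxmin, we can get the label of this bucket, which we can then convert back to binary columns with pd.get_dummies.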

Example:

import numpy as np
import pandas as pd

np.random.seed(0)
data = np.random.rand(10, 3)
normalised = data / data.sum(axis=1)[:, np.newaxis]

df = pd.DataFrame(normalised)
result = pd.get_dummies((np.random.rand(len(df), 1) > df.cumsum(axis=1)).idxmin(axis=1))

print(result)

Output:

Notes:

Most of the slowdown comes from pd.get_dummies; if you use Divakar's approach of pd.DataFrame(result.view('i1'), index=df.index, columns=df.columns), it gets much faster.
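
One more thing that may be worth checking with the pd.get_dummies route: it only creates columns for labels that were actually drawn, so on a small frame a column can go missing. The sketch below uses hand-picked values in `u` (an assumption for illustration, standing in for np.random.rand(len(df), 1)) so that T3 is never chosen, and then restores the full column set with reindex.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"T1": [0.8, 0.5, 0.01],
                   "T2": [0.1, 0.2, 0.89],
                   "T3": [0.1, 0.3, 0.1]})

# Hand-picked stand-ins for the random draws; with these values T3 is never drawn
u = np.array([[0.85], [0.60], [0.005]])

# True where the draw is still above the running total, False from the chosen bucket on
above = pd.DataFrame(u > df.cumsum(axis=1).to_numpy(), index=df.index, columns=df.columns)
chosen = above.idxmin(axis=1)          # label of the first False in each row
dummies = pd.get_dummies(chosen)       # only has columns for labels that occurred

# Restore the full column set and use a plain integer dtype
result = dummies.reindex(columns=df.columns, fill_value=0).astype(int)
print(list(dummies.columns))
print(result)
```

Here `dummies` ends up with only the T1 and T2 columns, while `result` has all three. On a frame with a million rows every label will almost certainly occur, but the reindex step is cheap insurance if the output has to line up with the original columns.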
