Question

I have a wide table in the following format (for up to 10 people):

person1_status | person2_status | person3_status | person1_type | person2_type | person3_type
       0       |        1       |        0       |       7      |       4      |       6

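Status can be 0 or 1 (the first 3 columns).

Type can be a number in the range 4-7. These values correspond to another table that specifies a value for each type. So...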

Type | Value
 4   |   10
 5   |   20
 6   |   30
 7   |   40

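I need to calculate two columns, 'A' and 'B', where:

A is the sum of the values for each person's type (in that row) where status = 0.
B is the sum of the values for each person's type (in that row) where status = 1.

For example, the resulting columns 'A' and 'B' would be: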

A  | B
70 | 10

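The explanation for this:

'A' has the value 70 because person1 and person3 have "status" 0 and the corresponding types 7 and 6 (which correspond to the values 40 and 30).

Similarly, there should be another column 'B' with the value "10", because only person2 has status "1" and their type is "4" (whose corresponding value is 10).

This may be a silly question, but how do I do this in a vectorized way? I don't want to use a for loop or anything like that, because it would be less efficient...

I hope that makes sense... can anyone help me? I think I've fried my brain trying to figure this out.

For simpler calculated columns I got away with just np.where, but I'm a little stuck here, because I need to sum values from multiple columns under certain conditions while pulling those values from a separate table...

Hope that makes sense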

Answer 1

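Use the filter method, which selects the columns whose names contain the given string.

Create a lookup table, other_table, for the values, with the type as its index.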

df_status = df.filter(like = 'status')
df_type = df.filter(like = 'type')
df_type_lookup = df_type.applymap(lambda x: other_table.loc[x]).values

df['A'] = np.sum((df_status == 0).values * df_type_lookup, 1)
df['B'] = np.sum((df_status == 1).values * df_type_lookup, 1)
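To see why the multiplication works, here is a minimal sketch on the single example row from the question. The column names are taken from the question's table, and the per-column `map` lookup is an assumed stand-in for the `applymap` call above, used only to keep the sketch short.

```python
import numpy as np
import pandas as pd

# The single example row from the question.
df = pd.DataFrame({
    'person1_status': [0], 'person2_status': [1], 'person3_status': [0],
    'person1_type': [7], 'person2_type': [4], 'person3_type': [6],
})
other_table = pd.Series({4: 10, 5: 20, 6: 30, 7: 40})

status = df.filter(like='status').values  # [[0, 1, 0]]
# Assumed alternative to applymap: look up each type column by column.
values = df.filter(like='type').apply(lambda col: col.map(other_table)).values  # [[40, 10, 30]]

# A boolean mask times the values zeroes out the people who do not match.
A = np.sum((status == 0) * values, axis=1)
B = np.sum((status == 1) * values, axis=1)
print(A, B)  # [70] [10]
```

The mask `(status == 0)` acts as a 0/1 weight, so only the values of people with status 0 survive the product before the row sum.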

Full example below:

Create fake data

import numpy as np
import pandas as pd

df = pd.DataFrame({'person_1_status':np.random.randint(0, 2,1000) , 
                   'person_2_status':np.random.randint(0, 2,1000), 
                   'person_3_status':np.random.randint(0, 2,1000), 
                   'person_1_type':np.random.randint(4, 8,1000), 
                   'person_2_type':np.random.randint(4, 8,1000),
                   'person_3_type':np.random.randint(4, 8,1000)},
                 columns= ['person_1_status', 'person_2_status', 'person_3_status',
                           'person_1_type', 'person_2_type', 'person_3_type'])

   person_1_status  person_2_status  person_3_status  person_1_type  \
0                1                0                0              7   
1                0                1                0              6   
2                1                0                1              7   
3                0                0                0              7   
4                0                0                1              4   

   person_3_type  person_3_type  
0              5              5  
1              7              7  
2              7              7  
3              7              7  
4              7              7

Create other_table

other_table = pd.Series({4:10, 5:20, 6:30, 7:40})

4    10
5    20
6    30
7    40
dtype: int64

Filter the status and type columns out into their own dataframes

df_status = df.filter(like = 'status')
df_type = df.filter(like = 'type')

Build the lookup table

df_type_lookup = df_type.applymap(lambda x: other_table.loc[x]).values

Multiply element-wise and sum across each row.

df['A'] = np.sum((df_status == 0).values * df_type_lookup, 1)
df['B'] = np.sum((df_status == 1).values * df_type_lookup, 1)

Output

   person_1_status  person_2_status  person_3_status  person_1_type  \
0                0                0                1              7   
1                0                1                0              4   
2                0                1                1              7   
3                0                1                0              6   
4                0                0                1              5   

   person_2_type  person_3_type   A   B  
0              7              5  80  20  
1              6              4  20  30  
2              5              5  40  40  
3              6              4  40  30  
4              7              5  60  20
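If the per-element lambda in the lookup step turns out to be slow on a large frame, one possible alternative is a dense NumPy lookup array indexed directly by type. This is a sketch under the assumption that the types are small non-negative integers, as in the question; the name `lut` and the seeded generator are introduced here for illustration only.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded so the run is repeatable
n = 1000
df = pd.DataFrame({
    'person_1_status': rng.integers(0, 2, n),
    'person_2_status': rng.integers(0, 2, n),
    'person_3_status': rng.integers(0, 2, n),
    'person_1_type': rng.integers(4, 8, n),
    'person_2_type': rng.integers(4, 8, n),
    'person_3_type': rng.integers(4, 8, n),
})
other_table = pd.Series({4: 10, 5: 20, 6: 30, 7: 40})

# Dense lookup array: position t holds the value for type t (unused slots stay 0).
lut = np.zeros(other_table.index.max() + 1, dtype=np.int64)
lut[other_table.index.to_numpy()] = other_table.to_numpy()

status = df.filter(like='status').to_numpy()
values = lut[df.filter(like='type').to_numpy()]  # integer-array indexing does the whole lookup at once

df['A'] = ((status == 0) * values).sum(axis=1)
df['B'] = ((status == 1) * values).sum(axis=1)
```

Since every person has status 0 or 1, `A + B` should equal the row sum of all looked-up values, which makes for a quick sanity check.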

Answer 2

Consider the dataframe df

import numpy as np
import pandas as pd

mux = pd.MultiIndex.from_product([['status', 'type'], ['p%i' % i for i in range(1, 6)]])
data = np.concatenate([np.random.choice((0, 1), (10, 5)), np.random.rand(10, 5)], axis=1)
df = pd.DataFrame(data, columns=mux)
df

With the data structured this way, we can do this for status == 1

df.status.mul(df.type).sum(1)

0    0.935290
1    1.252478
2    1.354461
3    1.399357
4    2.102277
5    1.589710
6    0.434147
7    2.553792
8    1.205599
9    1.022305
dtype: float64

and for status == 0

df.status.rsub(1).mul(df.type).sum(1)

0    1.867986
1    1.068045
2    0.653943
3    2.239459
4    0.214523
5    0.734449
6    1.291228
7    0.614539
8    0.849644
9    1.109086
dtype: float64
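The `rsub(1)` call computes `1 - status`, which flips the 0/1 flags so the same multiply-and-sum picks out the other group. A small deterministic sketch, with made-up numbers and hypothetical column names `p1` and `p2`:

```python
import pandas as pd

status = pd.DataFrame({'p1': [0, 1], 'p2': [1, 1]})
value = pd.DataFrame({'p1': [40, 10], 'p2': [10, 30]})

flipped = status.rsub(1)             # same as 1 - status
B = status.mul(value).sum(axis=1)    # sums values where status == 1
A = flipped.mul(value).sum(axis=1)   # sums values where status == 0
print(A.tolist(), B.tolist())        # [40, 0] [10, 40]
```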

You can use the following code to get your columns into this format

df.columns = df.columns.str.split('_', expand=True)
df = df.swaplevel(0, 1, 1)
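Putting the pieces together on the example row from the question might look like the sketch below. It assumes column names of the form `person1_status` (a single underscore, as in the question's table), and the step that maps types to values through `other_table` is an assumption added here, since the sample data above uses random numbers for type.

```python
import pandas as pd

df = pd.DataFrame({
    'person1_status': [0], 'person2_status': [1], 'person3_status': [0],
    'person1_type': [7], 'person2_type': [4], 'person3_type': [6],
})
other_table = pd.Series({4: 10, 5: 20, 6: 30, 7: 40})

# 'person1_status' -> ('person1', 'status'), then swap to ('status', 'person1').
wide = df.copy()
wide.columns = wide.columns.str.split('_', expand=True)
wide = wide.swaplevel(0, 1, axis=1)

status = wide['status']  # columns: person1, person2, person3
# Assumed lookup step: replace each type with its value from other_table.
values = wide['type'].apply(lambda col: col.map(other_table))

df['A'] = status.rsub(1).mul(values).sum(axis=1)  # people with status == 0
df['B'] = status.mul(values).sum(axis=1)          # people with status == 1
print(df[['A', 'B']])  # A is 70 and B is 10 for this row
```

Because both sub-frames end up with the same person labels as columns, `mul` aligns each person's status with that person's value automatically.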

Pandas: calculating the sum of multiple columns under multiple conditions