Question

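Suppose I have a df with the columns 'ID', 'col_1', and 'col_2', and I define a function:

f = lambda x, y : my_function_expression

Now I want to apply f to df's two columns 'col_1' and 'col_2' to calculate a new column 'col_3' element-wise, somewhat like: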

df['col_3'] = df[['col_1','col_2']].apply(f)  
# Pandas gives : TypeError: ('<lambda>() takes exactly 2 arguments (1 given)'

How do I do this?

*** Detailed sample added below ***

import pandas as pd

df = pd.DataFrame({'ID':['1','2','3'], 'col_1': [0,2,3], 'col_2':[1,4,5]})
mylist = ['a','b','c','d','e','f']

def get_sublist(sta,end):
    return mylist[sta:end+1]

#df['col_3'] = df[['col_1','col_2']].apply(get_sublist,axis=1)
# expect above to output df as below 

  ID  col_1  col_2            col_3
0  1      0      1       ['a', 'b']
1  2      2      4  ['c', 'd', 'e']
2  3      3      5  ['d', 'e', 'f']

Answer 1

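Here's an example using apply on the dataframe, which I am calling with axis = 1.

Note the difference: instead of trying to pass two values to the function f, rewrite the function to accept a pandas Series object, and then index the Series to get the values needed.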

In [49]: df
Out[49]: 
          0         1
0  1.000000  0.000000
1 -0.494375  0.570994
2  1.000000  0.000000
3  1.876360 -0.229738
4  1.000000  0.000000

In [50]: def f(x):    
   ....:  return x[0] + x[1]  
   ....:  

In [51]: df.apply(f, axis=1) #passes a Series object, row-wise
Out[51]: 
0    1.000000
1    0.076619
2    1.000000
3    1.646622
4    1.000000

Depending on your use case, it is sometimes helpful to create a pandas group object and then use apply on the group.
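
As a rough sketch of the group-object idea (the column names key, a, b and the helper weighted_total are made up for illustration and are not from the question), a function can receive each group's sub-frame and read two of its columns:

```python
import pandas as pd

# Hypothetical data: two numeric columns plus a grouping key
df = pd.DataFrame({
    "key": ["x", "x", "y"],
    "a": [1, 2, 3],
    "b": [10, 20, 30],
})

def weighted_total(group):
    # group is a DataFrame holding the rows of one key
    return (group["a"] * group["b"]).sum()

# Select the two columns explicitly, then apply once per group
totals = df.groupby("key")[["a", "b"]].apply(weighted_total)
print(totals)
```

This returns one value per group rather than one per row, so it only fits when a per-group result is what you are after.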

Answer 2

A simple solution is:

df['col_3'] = df[['col_1','col_2']].apply(lambda x: f(*x), axis=1)
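
A minimal sketch of this on the question's sample data, assuming f is the question's get_sublist. Because *x unpacks the row positionally, the order of the columns in the selection decides which value becomes the first argument:

```python
import pandas as pd

df = pd.DataFrame({'ID': ['1', '2', '3'], 'col_1': [0, 2, 3], 'col_2': [1, 4, 5]})
mylist = ['a', 'b', 'c', 'd', 'e', 'f']

def get_sublist(sta, end):
    return mylist[sta:end + 1]

# Each row arrives as a Series; *x spreads its values over (sta, end)
df['col_3'] = df[['col_1', 'col_2']].apply(lambda x: get_sublist(*x), axis=1)
print(df)
```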

Answer 3

An interesting question! My answer is below:

import pandas as pd

def sublst(row):
    return lst[row['J1']:row['J2']]

df = pd.DataFrame({'ID':['1','2','3'], 'J1': [0,2,3], 'J2':[1,4,5]})
print(df)
lst = ['a','b','c','d','e','f']

df['J3'] = df.apply(sublst,axis=1)
print(df)

Output:

  ID  J1  J2
0  1   0   1
1  2   2   4
2  3   3   5
  ID  J1  J2      J3
0  1   0   1     [a]
1  2   2   4  [c, d]
2  3   3   5  [d, e]

I changed the column names to ID, J1, J2, J3 to ensure ID < J1 < J2 < J3, so the columns display in the right order.

A shorter version:

import pandas as pd

df = pd.DataFrame({'ID':['1','2','3'], 'J1': [0,2,3], 'J2':[1,4,5]})
print(df)
lst = ['a','b','c','d','e','f']

df['J3'] = df.apply(lambda row:lst[row['J1']:row['J2']],axis=1)
print(df)

Answer 4

There is a clean, one-line way of doing this in Pandas:

df['col_3'] = df.apply(lambda x: f(x.col_1, x.col_2), axis=1)

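This allows f to be a user-defined function with multiple input values, and uses (safe) column names rather than (unsafe) numeric indices to access the columns.

Example with data (based on the original question):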

import pandas as pd

df = pd.DataFrame({'ID':['1', '2', '3'], 'col_1': [0, 2, 3], 'col_2':[1, 4, 5]})
mylist = ['a', 'b', 'c', 'd', 'e', 'f']

def get_sublist(sta,end):
    return mylist[sta:end+1]

df['col_3'] = df.apply(lambda x: get_sublist(x.col_1, x.col_2), axis=1)

Output of print(df):

  ID  col_1  col_2      col_3
0  1      0      1     [a, b]
1  2      2      4  [c, d, e]
2  3      3      5  [d, e, f]

Answer 5

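The method you are looking for is Series.combine. However, it seems some care has to be taken around datatypes. In your example, you would (as I did when testing the answer) naively call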

df['col_3'] = df.col_1.combine(df.col_2, func=get_sublist)

However, this throws the error:

ValueError: setting an array element with a sequence.

My best guess is that it expects the result to be of the same type as the Series calling the method (df.col_1 here). However, the following works:

df['col_3'] = df.col_1.astype(object).combine(df.col_2, func=get_sublist)

df

  ID  col_1  col_2      col_3
0  1      0      1     [a, b]
1  2      2      4  [c, d, e]
2  3      3      5  [d, e, f]
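
For comparison, a small sketch of Series.combine where the function returns a scalar, so the dtype workaround above is not needed. Using the built-in max and the column name col_max is my choice for illustration, not part of the answer:

```python
import pandas as pd

df = pd.DataFrame({'col_1': [0, 2, 3], 'col_2': [1, 4, 5]})

# combine calls the function once per pair of aligned elements
df['col_max'] = df['col_1'].combine(df['col_2'], max)
print(df)
```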

Answer 6

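The way you have written f, it needs two inputs. If you look at the error message, it says you are not providing two inputs to f, just one. The error message is correct. The mismatch is because df[['col_1','col_2']] returns a single dataframe with two columns, not two separate columns.

You need to change f so that it takes a single input, keep the above dataframe as the input, and then break it up into x, y inside the function body. Then do whatever you need and return a single value.

You need this function signature because the syntax is .apply(f), so f needs to take a single thing (the dataframe) rather than two things, which is what your current f expects.

Since you haven't provided the body of f, I can't help in any more detail, but this should provide a way out without fundamentally changing your code or using some other method instead of apply.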
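
A sketch of what that rewrite could look like. The body of f (a simple sum) is a placeholder, since the question does not give one, and axis=1 is my addition: without it, apply hands f one column at a time rather than one row.

```python
import pandas as pd

df = pd.DataFrame({'ID': ['1', '2', '3'], 'col_1': [0, 2, 3], 'col_2': [1, 4, 5]})

def f(row):
    # f takes a single input (one row) and splits it into x, y itself
    x, y = row['col_1'], row['col_2']
    return x + y  # placeholder expression, not from the question

df['col_3'] = df[['col_1', 'col_2']].apply(f, axis=1)
print(df)
```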

Answer 7

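I'm going to put in a vote for np.vectorize. It allows you to just shoot over x number of columns and not deal with the dataframe in the function, so it's great for functions you don't control or for doing something like sending 2 columns and a constant into a function (i.e. col_1, col_2, 'foo').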

import numpy as np
import pandas as pd

df = pd.DataFrame({'ID':['1','2','3'], 'col_1': [0,2,3], 'col_2':[1,4,5]})
mylist = ['a','b','c','d','e','f']

def get_sublist(sta,end):
    return mylist[sta:end+1]

#df['col_3'] = df[['col_1','col_2']].apply(get_sublist,axis=1)
# expect above to output df as below 

df.loc[:,'col_3'] = np.vectorize(get_sublist, otypes=["O"]) (df['col_1'], df['col_2'])


df

  ID  col_1  col_2      col_3
0  1      0      1     [a, b]
1  2      2      4  [c, d, e]
2  3      3      5  [d, e, f]
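
The "two columns and a constant" case mentioned above could look roughly like this. The function tag_sum is hypothetical and only there to show the scalar 'foo' being broadcast against both columns:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col_1': [0, 2, 3], 'col_2': [1, 4, 5]})

def tag_sum(a, b, tag):
    # Hypothetical helper: works on plain scalars, knows nothing about pandas
    return f"{tag}:{a + b}"

# otypes=[object] keeps the results as Python objects (here, strings)
tagged = np.vectorize(tag_sum, otypes=[object])(df['col_1'], df['col_2'], 'foo')
df['col_3'] = tagged
print(df)
```

Note that np.vectorize is a convenience wrapper around a Python-level loop, so it mainly saves rewriting the function rather than giving compiled-code speed.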

Answer 8

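Returning a list from apply is a dangerous operation, as the resulting object is not guaranteed to be either a Series or a DataFrame, and exceptions may be raised in certain cases. Let's walk through a simple example: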

import numpy as np
import pandas as pd

df = pd.DataFrame(data=np.random.randint(0, 5, (5,3)),
                  columns=['a', 'b', 'c'])
df
   a  b  c
0  4  0  0
1  2  0  1
2  2  2  2
3  1  2  2
4  3  0  0

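There are three possible outcomes when returning a list from apply

1) If the length of the returned list is not equal to the number of columns, then a Series of lists is returned.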

df.apply(lambda x: list(range(2)), axis=1)  # returns a Series
0    [0, 1]
1    [0, 1]
2    [0, 1]
3    [0, 1]
4    [0, 1]
dtype: object

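2) When the length of the returned list is equal to the number of columns, then a DataFrame is returned and each column gets the corresponding value in the list.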

df.apply(lambda x: list(range(3)), axis=1) # returns a DataFrame
   a  b  c
0  0  1  2
1  0  1  2
2  0  1  2
3  0  1  2
4  0  1  2

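3) If the length of the returned list equals the number of columns for the first row, but there is at least one row where the list has a different number of elements than the number of columns, a ValueError is raised.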

i = 0
def f(x):
    global i
    if i == 0:
        i += 1
        return list(range(3))
    return list(range(4))

df.apply(f, axis=1) 
ValueError: Shape of passed values is (5, 4), indices imply (5, 3)
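
The outcomes above depend on the pandas version. To my knowledge, since pandas 0.23 DataFrame.apply accepts a result_type argument, and a list returned with axis=1 is kept as a Series of lists by default. A sketch of stating the intent explicitly (the small frame below is made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({'a': [4, 2, 2], 'b': [0, 0, 2], 'c': [0, 1, 2]})

def pair(row):
    # Returns a two-element list for every row
    return [row['a'], row['b'] + row['c']]

# 'reduce' asks for a Series whose elements are the returned lists
as_series = df.apply(pair, axis=1, result_type='reduce')

# 'expand' asks for the list elements to be spread over columns
as_frame = df.apply(pair, axis=1, result_type='expand')

print(as_series)
print(as_frame)
```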

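Solving the problem without apply

Using apply with axis=1 is very slow. It is possible to get much better performance (especially on larger datasets) with basic iterative methods.

Create a larger dataframe

# df here is assumed to be the question's dataframe (col_1, col_2), not the random one above
df1 = df.sample(100000, replace=True).reset_index(drop=True)

Timings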

# apply is slow with axis=1
%timeit df1.apply(lambda x: mylist[x['col_1']: x['col_2']+1], axis=1)
2.59 s ± 76.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# zip - similar to @Thomas
%timeit [mylist[v1:v2+1] for v1, v2 in zip(df1.col_1, df1.col_2)]  
29.5 ms ± 534 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

@Thomas's answer

%timeit list(map(get_sublist, df1['col_1'],df1['col_2']))
34 ms ± 459 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
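
For readers who want to try the zip approach from the timings outside of %timeit, here is a self-contained sketch on the question's sample data:

```python
import pandas as pd

df = pd.DataFrame({'ID': ['1', '2', '3'], 'col_1': [0, 2, 3], 'col_2': [1, 4, 5]})
mylist = ['a', 'b', 'c', 'd', 'e', 'f']

# Walk both columns in lockstep and build the new column as a plain list
df['col_3'] = [mylist[v1:v2 + 1] for v1, v2 in zip(df['col_1'], df['col_2'])]
print(df)
```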

Answer 9

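I'm sure this isn't as fast as the solutions using Pandas or Numpy operations, but if you don't want to rewrite your function, you can use map. Using the original example data: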

import pandas as pd

df = pd.DataFrame({'ID':['1','2','3'], 'col_1': [0,2,3], 'col_2':[1,4,5]})
mylist = ['a','b','c','d','e','f']

def get_sublist(sta,end):
    return mylist[sta:end+1]

df['col_3'] = list(map(get_sublist,df['col_1'],df['col_2']))
#In Python 2 don't convert above to list

We could pass as many arguments as we wanted into the function this way. The output is what we wanted:

  ID  col_1  col_2      col_3
0  1      0      1     [a, b]
1  2      2      4  [c, d, e]
2  3      3      5  [d, e, f]

Answer 10

My example for your question:

def get_sublist(row, col1, col2):
    return mylist[row[col1]:row[col2]+1]
df.apply(get_sublist, axis=1, col1='col_1', col2='col_2')

Answer 11

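I suppose you don't want to change the get_sublist function and just want to use the DataFrame's apply method to do the job. To get the result you want, I've written two helper functions: get_sublist_list and unlist. As the function names suggest, the first gets the sublist wrapped in a list, and the second extracts that sublist from the wrapping list. Finally, we call apply to apply the two functions in turn to the df[['col_1','col_2']] DataFrame.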

import pandas as pd

df = pd.DataFrame({'ID':['1','2','3'], 'col_1': [0,2,3], 'col_2':[1,4,5]})
mylist = ['a','b','c','d','e','f']

def get_sublist(sta,end):
    return mylist[sta:end+1]

def get_sublist_list(cols):
    return [get_sublist(cols[0],cols[1])]

def unlist(list_of_lists):
    return list_of_lists[0]

df['col_3'] = df[['col_1','col_2']].apply(get_sublist_list,axis=1).apply(unlist)

df

If you don't use [] to enclose the result of get_sublist, then get_sublist_list returns a plain list and raises ValueError: could not broadcast input array from shape (3) into shape (2), as @Ted Petrou mentioned.

Answer 12

If you have a huge dataset, you can use swifter, an easy but faster (in execution time) way of doing this:

import pandas as pd
import swifter

def fnc(m,x,c):
    return m*x+c

df = pd.DataFrame({"m": [1,2,3,4,5,6], "c": [1,1,1,1,1,1], "x":[5,3,6,2,6,1]})
df["y"] = df.swifter.apply(lambda x: fnc(x.m, x.x, x.c), axis=1)
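
When the function is plain arithmetic, as fnc is here, the same column can also be computed with ordinary column arithmetic and no row-wise call at all. This is a different technique from swifter, shown only as a sketch for comparison:

```python
import pandas as pd

df = pd.DataFrame({"m": [1, 2, 3, 4, 5, 6], "c": [1, 1, 1, 1, 1, 1], "x": [5, 3, 6, 2, 6, 1]})

# Whole-column arithmetic: each operation runs over the full Series at once
df["y"] = df["m"] * df["x"] + df["c"]
print(df)
```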

