Check a DataFrame for changes for a specific user

Time: 2016-12-11 19:01:49

Tags: python pandas

ID  Date   T   Country 
1   2/5/12 120 US
1   2/4/13 110 US
1   3/4/12 120 France
2   3/4/12 110 US
2   3/5/12 140 US
3   3/4/12 133 US

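I'm trying to write code that, for each unique ID, checks whether the T column is below a threshold (i.e. below 110) or whether the country has changed. If so, I'd like another column named Treatment with a 1 for that ID. How can I do this?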

Basically:

For a given ID:
    if T < 110 -> 1
    if the country changes -> 1
    else -> 0

Expected output:

ID  Date   T   Country Treatment
1   2/5/12 120 US      1
1   2/4/13 110 US      1
1   3/4/12 120 France  1
2   3/4/12 110 US      0
2   3/5/12 140 US      0
3   3/4/12 133 US      0

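2 answers:

Answer 0 (score: 1)

Use groupby and apply to get a Boolean Series indicating whether the condition is met for each ID, and use astype to convert it to 0/1. Then use map on the ID column to assign each ID's result to all of its rows: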

def check_condition(grp):
    return (grp['T'] < 110).any() | (grp['Country'].nunique() > 1)

cond_map = df.groupby('ID').apply(check_condition).astype(int)
df['Treatment'] = df['ID'].map(cond_map)

Alternatively, if you don't want to create the intermediate cond_map, you can put the groupby inside the map:

df['Treatment'] = df['ID'].map(df.groupby('ID').apply(check_condition).astype(int))

Resulting output:

   ID    Date    T Country  Treatment
0   1  2/5/12  120      US          1
1   1  2/4/13  110      US          1
2   1  3/4/12  120  France          1
3   2  3/4/12  110      US          0
4   2  3/5/12  140      US          0
5   3  3/4/12  133      US          0
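As a variation on the groupby/apply/map approach above, the same per-ID check can be sketched with groupby and transform, which returns a result already aligned to the original rows, so no separate map step is needed. This is only an illustrative alternative, not part of the original answer; the names low_t and country_changed are made up for the example, and "country changes" is assumed to mean more than one distinct Country value within an ID.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2, 3],
    "Date": ["2/5/12", "2/4/13", "3/4/12", "3/4/12", "3/5/12", "3/4/12"],
    "T": [120, 110, 120, 110, 140, 133],
    "Country": ["US", "US", "France", "US", "US", "US"],
})

# True on every row of an ID if any row of that ID has T < 110
low_t = (df["T"] < 110).groupby(df["ID"]).transform("any")

# True on every row of an ID if that ID has more than one distinct Country
# (assumption: this is what "the country changes" means)
country_changed = df.groupby("ID")["Country"].transform("nunique") > 1

# Combine both conditions and convert the Boolean result to 0/1
df["Treatment"] = (low_t | country_changed).astype(int)

print(df)
```

With the sample data, only ID 1 is flagged, because it has both US and France; no row has T strictly below 110.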

Answer 1 (score: 0)

Using the power of pandas:

import pandas as pd

# Future note: if you could include your sample data like this that would save 
# those who are trying to help you a LOT of time :)
df = pd.DataFrame({"ID":[1,1,1,2,2,3],
                   "Date":["2/5/12","2/4/13","3/4/12","3/4/12","3/5/12","3/4/12"],
                   "T":[120,110,120,110,140,133],
                   "Country":["US","US","France","US","US","US"]})

# Using a dictionary to map into the original DataFrame
d = {}

# For each row (an ID can appear on several rows)
for i in range(len(df["ID"].values)):
    unique_id = df["ID"][i]

    # Breaking the original data into rows to check each
    # instance of 'T'
    sub_frame = df.loc[i, :]

    # Checks both cases ('T'<110 and unique('Country')>1) at once
    if sub_frame["T"] < 110 or len(df.loc[df["ID"]==unique_id, "Country"].unique()) > 1:
        d[unique_id] = 1
    else:
        # Keep an earlier 1 for this ID instead of overwriting it with 0
        d.setdefault(unique_id, 0)

df["Treatment"] = df["ID"].map(d)

print(df)

  Country    Date  ID    T  Treatment
0      US  2/5/12   1  120          1
1      US  2/4/13   1  110          1
2  France  3/4/12   1  120          1
3      US  3/4/12   2  110          0
4      US  3/5/12   2  140          0
5      US  3/4/12   3  133          0

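Note: your question asks to consider each unique ID, but since you want to find T<110 for each instance, you can't do this once per unique ID (because there are multiple instances of a single ID; you would be trying to compare the value 110 against the array [120,110,120]).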
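For what it's worth, the per-row comparison described in this note can be collapsed into a single value per ID with any(). A minimal sketch, using the T values of ID 1 from the question (the variable names are illustrative):

```python
import pandas as pd

# T values for ID 1 in the sample data
t_values = pd.Series([120, 110, 120])

# Element-wise comparison: one Boolean per row
per_row = t_values < 110

# Reduce to a single answer for the whole ID:
# True if at least one row is below the threshold
any_below = bool(per_row.any())

print(per_row.tolist(), any_below)
```

Here no value is strictly below 110, so the reduction yields False; ID 1 is flagged only because of its Country values.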