Change column values for each specific column

Time: 2017-03-12 05:33:33

Tags: python pandas dataframe

I'm working with a large dataset of about 200 columns and 70,000 rows. The data is messy, so I want to make it more readable.


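The column names encode the answer: ATT_A (agree), ATT_SA (strongly agree), ATT_D (disagree), and so on.

Every 5 columns represent just 1 answer.

My idea is to use the .replace() function so that each column's value reflects its column name (if the column name ends in _SA, the value should be 'SA' instead of 1).

Then I could merge the 5 columns into a single column, which would be much less messy.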

IDEA_COLUMN

SA
A
SD
A
D
SA

This is the code I tried.

for c in cols.columns:
    if c.upper()[:4] == 'ATT_':
        if c[-2:] == 'SA':
             c.replace('1', 'SA')

I've tried many different variations, but I can't see my mistake. I'm new to coding, so I may be making a silly error.

1 answer:

Answer 0: (score: 3)

Here is one option:

# split each column name at its last underscore so the columns become a MultiIndex
# of (question prefix, answer)
df.columns = df.columns.str.rsplit("_", n=1, expand=True)

# move the answer level (SA, A, D, ...) from the columns into the row index, group by
# level 0 (the original row label), and use idxmax to find the answer marked with 1
df.stack(level=1).groupby(level=0).agg(lambda x: x.idxmax()[1])


Another option:

# split columns as above
df.columns = df.columns.str.rsplit("_", n=1, expand=True)    

# group columns based on the prefix along axis 1, and for each row find out the index with 
# value 1 using idxmax() function
df.groupby(level=0, axis=1).apply(lambda g: g.apply(lambda x: x.idxmax()[1], axis = 1))
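As for the loop in the question: `c` there is the column name (a plain string), so `c.replace('1', 'SA')` builds a new string and throws it away without touching the data. A minimal sketch of a working loop is below. It assumes a tiny made-up frame with a single question (`ATT_X_*` is a hypothetical name) and that answers are marked with the number 1; with several questions, one such column would be needed per prefix.

```python
import pandas as pd

# Tiny hypothetical frame in the same one-hot layout as the question
df = pd.DataFrame(
    {
        "ATT_X_SA": [1, None, None],
        "ATT_X_A": [None, 1, None],
        "ATT_X_D": [None, None, 1],
    }
)

# Start with an empty column and fill it in as each answer column is visited
idea_column = pd.Series([None] * len(df), index=df.index, dtype=object)
for c in df.columns:
    if c.upper().startswith("ATT_"):
        suffix = c.rsplit("_", 1)[1]  # "SA", "A", "D", ...
        # Work on the column's values (df[c]), not on the name string c
        labels = pd.Series(
            [suffix if v == 1 else None for v in df[c]],
            index=df.index,
            dtype=object,
        )
        # Keep labels found so far and fill the remaining gaps from this column
        idea_column = idea_column.combine_first(labels)

print(idea_column.tolist())
```

The two changes that matter are reading the values through `df[c]` and keeping the result; the rest is just one way of merging the labelled columns into one.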

Data setup

import pandas as pd

cols1 = ["ATT_TECHIMP_" + x for x in ["SA", "A", "NO", "D", "SD"]]
cols2 = ["ATT_BBB_" + x for x in ["SA", "A", "NO", "D", "SD"]]

df1 = pd.DataFrame([[1, None, None, None, None], [None, None, 1, None, None], [None, None, 1, None, None], [None, None, None, 1, None], [None, None, None, None, 1]], columns=cols1)
df2 = pd.DataFrame([[None, 1, None, None, None], [None, None, None, None, 1], [None, None, 1, None, None], [None, None, None, 1, None], [None, None, None, None, 1]], columns=cols2)

df = pd.concat([df1, df2], axis=1)
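Recent pandas releases have deprecated `axis=1` in `groupby`, so the second option may warn or fail depending on the installed version. The sketch below is one possible alternative that avoids both `groupby(axis=1)` and `stack`. It rebuilds the same sample data so that it runs on its own; `collapse_one_hot` is a hypothetical helper name, not a pandas function, and it assumes every answer is marked with the number 1.

```python
import pandas as pd

answers = ["SA", "A", "NO", "D", "SD"]
cols1 = ["ATT_TECHIMP_" + x for x in answers]
cols2 = ["ATT_BBB_" + x for x in answers]

# Same one-hot sample as the data setup, written as (TECHIMP, BBB) answers per row
picked = [("SA", "A"), ("NO", "SD"), ("NO", "NO"), ("D", "D"), ("SD", "SD")]
rows = [
    [1 if a == x else None for x in answers] + [1 if b == x else None for x in answers]
    for a, b in picked
]
df = pd.DataFrame(rows, columns=cols1 + cols2)


def collapse_one_hot(frame, sep="_"):
    """Hypothetical helper: build one label column per question prefix."""
    # Group column names by everything before the last separator
    groups = {}
    for col in frame.columns:
        prefix, _, suffix = col.rpartition(sep)
        groups.setdefault(prefix, []).append((col, suffix))

    out = {}
    for prefix, members in groups.items():
        marked = frame[[col for col, _ in members]].eq(1)
        marked.columns = [suffix for _, suffix in members]
        # idxmax on booleans returns the first column that is True in each row;
        # rows with no 1 at all are masked to missing instead of given a label
        out[prefix] = marked.idxmax(axis=1).where(marked.any(axis=1))
    return pd.DataFrame(out)


collapsed = collapse_one_hot(df)
print(collapsed)
```

The result has one column per question (`ATT_TECHIMP`, `ATT_BBB`) holding the answer code for each row, which is the single-column layout the question asks for.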