Most Pythonic way to pivot this data

Time: 2019-02-17 20:07:40

Tags: python pandas dataframe

Suppose I have the following data:

     ID  basetime  basevalue  timestamp2  value2  timestamp3  value3
0  gj93  01/01/19         50    01/02/19      60    01/03/19      70
1  mif3  02/01/19         70    02/02/19      80    02/03/19      90

How would I reshape it into this:

ID    Date      Label       Value
gj93  01/01/19  basetime    50
gj93  01/02/19  timestamp2  60
gj93  01/03/19  timestamp3  70
mif3  02/01/19  basetime    70
mif3  02/02/19  timestamp2  80
mif3  02/03/19  timestamp3  90

One caveat: some of the later values may be missing, e.g. timestamp3 ...

Thanks!

2 answers:

Answer 0: (score: 2)

pandas' melt should work.

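# Melt the date columns; Label keeps the original column name
out = pd.melt(df, id_vars=['ID'], value_vars=['basetime', 'timestamp2', 'timestamp3'], var_name="Label", value_name="Date")

# Melt the value columns in the same order, so the rows line up by position
out['Value'] = pd.melt(df, value_vars=['basevalue', 'value2', 'value3'])['value']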
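For the caveat about missing later values, here is a minimal self-contained sketch of the same melt approach. The sample frame is rebuilt by hand, and the missing timestamp3/value3 for mif3 is an assumption made purely to illustrate that case; dropping rows without a date is one possible way to handle it, not necessarily what the asker wants.

```python
import pandas as pd

# Sample frame mirroring the question. The None entries are an assumed
# example of the "later values may be missing" case.
df = pd.DataFrame({
    "ID": ["gj93", "mif3"],
    "basetime": ["01/01/19", "02/01/19"],
    "basevalue": [50, 70],
    "timestamp2": ["01/02/19", "02/02/19"],
    "value2": [60, 80],
    "timestamp3": ["01/03/19", None],
    "value3": [70, None],
})

time_cols = ["basetime", "timestamp2", "timestamp3"]
value_cols = ["basevalue", "value2", "value3"]

# Melt the date columns; Label keeps the original column name.
out = pd.melt(df, id_vars=["ID"], value_vars=time_cols,
              var_name="Label", value_name="Date")

# Melt the value columns in the same order so rows line up by position.
out["Value"] = pd.melt(df, id_vars=["ID"], value_vars=value_cols)["value"]

# Drop rows with no date, then order the rows and columns as in the question.
out = (out.dropna(subset=["Date"])
          .sort_values(["ID", "Label"])
          .reset_index(drop=True)
          [["ID", "Date", "Label", "Value"]])

print(out)
```

Because one value is missing, the Value column will likely come out as floats; cast it back after the dropna if integers matter.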

Answer 1: (score: 2)

A longer version that, structurally, goes beyond what was asked.

import pandas as pd
from io import StringIO

# Sample data
df = pd.read_fwf(StringIO("""     
i       ID  basetime  basevalue timestamp2  value2 timestamp3 value3
0     gj93  01/01/19         50   01/02/19      60   01/03/19     70
1     mif3  02/01/19         70   02/02/19      80   02/03/19     90
"""), header=1, parse_dates=[2,4,6], index_col=0)


# melt to a vertical/tall format 
df2 = df.melt(id_vars="ID").sort_values(["ID", "variable"])

# replace basetime and basevalue with timestamp1 and value1 respectively
# ... to be consistent with other names
df2['variable'] = df2['variable'].str.replace("basetime", "timestamp1") \
                                 .str.replace("basevalue", "value1")

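# extract the sequence number to a column and remove the sequence from the variable name
# (regex=True is required on recent pandas, where str.replace defaults to literal matching)
df2['seq'] = df2['variable'].str.replace(r"[^\d]", "", regex=True)
df2['variable'] = df2['variable'].str.replace(r"\d+$", "", regex=True)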
df3 = df2.sort_values(["ID",  "seq", "variable"])


# join back on itself to match up the time and value rows
df4 = df3[df3.variable == 'timestamp'].merge(df3[df3.variable=='value'], on=['ID', 'seq'])

# Clean up - taking and renaming only the needed values
df5 = df4[['ID', 'value_x', 'value_y']]
df5.columns = ['ID', 'timestamp', 'value']

#     ID            timestamp value
#0  gj93  2019-01-01 00:00:00    50
#1  gj93  2019-01-02 00:00:00    60
#2  gj93  2019-01-03 00:00:00    70
#3  mif3  2019-02-01 00:00:00    70
#4  mif3  2019-02-02 00:00:00    80
#5  mif3  2019-02-03 00:00:00    90
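The rename, split-suffix, and self-join steps above could also be sketched with pd.wide_to_long, which pairs columns that share a stub name and a numeric suffix. This is an alternative, not the answer's own method. The hand-built frame and the names wide, tall and seq are assumptions for illustration, and the dates are left as plain strings.

```python
import pandas as pd

# Sample frame mirroring the question (dates kept as strings for simplicity).
df = pd.DataFrame({
    "ID": ["gj93", "mif3"],
    "basetime": ["01/01/19", "02/01/19"],
    "basevalue": [50, 70],
    "timestamp2": ["01/02/19", "02/02/19"],
    "value2": [60, 80],
    "timestamp3": ["01/03/19", "02/03/19"],
    "value3": [70, 90],
})

# Give the base columns the same stub + number pattern as the others.
wide = df.rename(columns={"basetime": "timestamp1", "basevalue": "value1"})

# wide_to_long matches timestamp<N> with value<N> and puts N in "seq".
tall = (pd.wide_to_long(wide, stubnames=["timestamp", "value"], i="ID", j="seq")
          .dropna(subset=["timestamp"])   # skips pairs with a missing date
          .reset_index()
          .sort_values(["ID", "seq"])
          .reset_index(drop=True))

print(tall)
```

The trade-off is that the original labels (basetime, timestamp2, ...) are replaced by a sequence number, which matches this answer's output more closely than the Label column the question asked for.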