Convert a DataFrame column of years to month/day/year dates

Asked: 2016-12-11 20:13:14

Tags: python datetime pandas dataframe sklearn-pandas

I'm doing this for a homework assignment.

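My goal is to end up with a brand new column of days elapsed. The DataFrame has more than 500,000 rows, so my plan is:

  1. In a pandas DataFrame, I have two date columns in different formats. I want to subtract one from the other and create a new 'Days Elapsed' column that is a simple list of integers.
  2. Write the result to a new CSV (this code is already done).
  3. After that I can skip date parsing entirely every time I modify the code or read the CSV, because parsing takes a long time and slows down my work.
  4. I am trying to convert this: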

       Yearmade         Saledate
    0      2004  11/16/2006 0:00
    1      1996   3/26/2004 0:00
    2      2001   2/26/2004 0:00
    

    into:

           Days Elapsed
    0      1050
    1      3007
    2      1151
    

    Current attempt:

    year_mean = df[df['YearMade'] > 1000]['YearMade'].mean()
    df.loc[df['YearMade'] == 1000, 'YearMade'] = year_mean
    ## There's lots of erroneous data of the year 1000, so replacing all of them with the mean of the column (mean of column without error data, that is)
    df['Yearmade'] = "1/1/"+df['YearMade'].astype(str)
    ## This is where it errors out.
    df['Yearmade'] = pd.to_datetime(df['Yearmade'])
    df['Saledate'] = pd.to_datetime(df['Saledate'])
    df['Age_at_Sale'] = df['Saledate'].sub(df['Yearmade'])
    df = df.drop(['Saledate', 'Yearmade'], axis=1)
    
    [then there's another class method to convert the current df into csv]
    

1 answer:

Answer 0 (score: 1)

Assuming you have the following DF:

In [203]: df
Out[203]:
   Yearmade   Saledate
0      2004 2006-11-16
1      1996 2004-03-26
2      2001 2004-02-26
3      1000 2003-12-23     # <--- erroneous year 

Solution:

In [204]: df.loc[df.Yearmade <= 1900, 'Yearmade'] = round(df.Yearmade.loc[df.Yearmade > 1900].mean())

In [205]: df
Out[205]:
   Yearmade   Saledate
0      2004 2006-11-16
1      1996 2004-03-26
2      2001 2004-02-26
3      2000 2003-12-23    # <--- replaced with avg. year 

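A note on the `round(...)` in In [204]: the mean of the valid years is generally not a whole number, which is a likely reason the string-building approach in the question fails. The sketch below is only an illustration; the names `years`, `mean_year`, `unrounded` and `rounded` are made up for this example and do not appear in the original post.

```python
import pandas as pd

# Valid years from the example DF above (illustrative data).
years = pd.Series([2004, 1996, 2001])
mean_year = years.mean()  # 6001 / 3, not a whole number

# Building a date string from the raw mean leaves a decimal part in the year.
unrounded = "1/1/" + pd.Series([mean_year]).astype(str)

# Rounding to an integer first gives a string that parses cleanly.
rounded = "1/1/" + pd.Series([int(round(mean_year))]).astype(str)
parsed = pd.to_datetime(rounded, format="%m/%d/%Y")
```

Keeping the column as integers also lets `format='%Y'` be used directly, as in the next step.
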
In [206]: df['days'] = (pd.to_datetime(df.Saledate) - pd.to_datetime(df.Yearmade, format='%Y')).dt.days

In [207]: df
Out[207]:
   Yearmade   Saledate  days
0      2004 2006-11-16  1050
1      1996 2004-03-26  3007
2      2001 2004-02-26  1151
3      2000 2003-12-23  1452
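
Putting the steps together, here is a minimal end-to-end sketch that starts from the string dates shown in the question and ends with a CSV containing only the integer column. It rests on several assumptions: the helper name `add_days_elapsed` and its `min_valid_year` parameter are hypothetical, the fourth row is the erroneous-year example from this answer rewritten in the question's date format, and the explicit `format` string assumes every `Saledate` looks like `11/16/2006 0:00`. Passing an explicit format may also speed up parsing on a large frame, though that is worth measuring on the real data.

```python
import io

import pandas as pd


def add_days_elapsed(df, min_valid_year=1900):
    """Return a copy of df with an integer 'Days Elapsed' column
    replacing 'Yearmade' and 'Saledate' (hypothetical helper)."""
    out = df.copy()
    valid = out["Yearmade"] > min_valid_year
    # Replace implausible years with the rounded mean of the plausible ones.
    fill_year = int(round(out.loc[valid, "Yearmade"].mean()))
    out.loc[~valid, "Yearmade"] = fill_year
    # A bare year parses as January 1 of that year.
    made = pd.to_datetime(out["Yearmade"].astype(str), format="%Y")
    sold = pd.to_datetime(out["Saledate"], format="%m/%d/%Y %H:%M")
    out["Days Elapsed"] = (sold - made).dt.days
    return out.drop(columns=["Yearmade", "Saledate"])


raw = pd.DataFrame({
    "Yearmade": [2004, 1996, 2001, 1000],
    "Saledate": ["11/16/2006 0:00", "3/26/2004 0:00",
                 "2/26/2004 0:00", "12/23/2003 0:00"],
})
result = add_days_elapsed(raw)

# Write to an in-memory buffer here; a real run would pass a file path.
buffer = io.StringIO()
result.to_csv(buffer, index=False)
```

Reading that CSV back later needs no date parsing at all, since the only column left is plain integers.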