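Sorting a column containing strings in Pandas

Time: 2016-08-18 10:33:04

Tags: python sorting pandas dataframe categorical-data

I am new to Pandas. I want to sort a column that contains strings and then generate a numerical value that uniquely identifies each string. My data frame looks like this: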

df = pd.DataFrame({'key': range(8), 'year_week': ['2015_10', '2015_1', '2015_11', '2016_9', '2016_10','2016_3', '2016_9', '2016_10']})

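First, I want to sort the 'year_week' column in ascending order by year and then by week (2015_1, 2015_10, 2015_11, 2016_3, 2016_9, 2016_9, 2016_10, 2016_10), and then generate a numerical value for each unique 'year_week' string.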

2 answers:

Answer 0: (score: 3)

You can first convert year_week with to_datetime, then sort with sort_values, and finally use factorize to assign a number to each unique value:

import pandas as pd

df = pd.DataFrame({'key': range(8), 'year_week': ['2015_10', '2015_1', '2015_11', '2016_9', '2016_10','2016_3', '2016_9', '2016_10']})

# http://stackoverflow.com/a/17087427/2901002
# append '-0' (Sunday) so that %W and %w together identify a concrete date
df['date'] = pd.to_datetime(df.year_week + '-0', format='%Y_%W-%w')
# sort by column date
df.sort_values('date', inplace=True)
# create numerical values
df['num'] = pd.factorize(df.year_week)[0]
print(df)
   key year_week       date  num
1    1    2015_1 2015-01-11    0
0    0   2015_10 2015-03-15    1
2    2   2015_11 2015-03-22    2
5    5    2016_3 2016-01-24    3
3    3    2016_9 2016-03-06    4
6    6    2016_9 2016-03-06    4
4    4   2016_10 2016-03-13    5
7    7   2016_10 2016-03-13    5
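
For comparison, here is a minimal sketch of why the conversion is needed, together with a variant that avoids the helper date column. A plain string sort is lexicographic, so '2016_10' lands before '2016_3'. The helper name year_week_key is hypothetical, and the key argument of sort_values assumes pandas 1.1 or later.

```python
import pandas as pd

df = pd.DataFrame({
    'key': range(8),
    'year_week': ['2015_10', '2015_1', '2015_11', '2016_9',
                  '2016_10', '2016_3', '2016_9', '2016_10'],
})

# Plain string sort: compares character by character, so '2016_10' < '2016_3'.
naive = df.sort_values('year_week')['year_week'].tolist()

def year_week_key(col):
    # Hypothetical helper: map 'YYYY_W' strings to the integer YYYY * 100 + W.
    parts = col.str.split('_', expand=True).astype(int)
    return parts[0] * 100 + parts[1]

# Assumes pandas >= 1.1, where sort_values accepts a `key` callable.
by_week = df.sort_values('year_week', key=year_week_key)['year_week'].tolist()

print(naive)
print(by_week)
```

The integer key only orders the rows; it does not produce real dates, so the to_datetime approach above is still the one to use if the date itself is needed.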

Answer 1: (score: 0)

# Method 1: suited to large datasets

# Split the "year_week" column into two columns
df[['year', 'week']] = df['year_week'].str.split('_', expand=True)

# Convert the new columns to integers so they sort numerically
df['year'] = df['year'].astype('int')
df['week'] = df['week'].astype('int')

# Sort the dataframe by the new columns
df = df.sort_values(['year', 'week'], ascending=True)

# Drop the helper year and week columns
df.drop(['year', 'week'], axis=1, inplace=True)

# Sorted dataframe
df
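
The first method sorts the rows but stops short of the numeric identifier the question asks for. A minimal sketch of one way to add it, assuming the sample data from the question: the names parts, order, and df_sorted are illustrative, and factorize is borrowed from the other answer.

```python
import pandas as pd

df = pd.DataFrame({
    'key': range(8),
    'year_week': ['2015_10', '2015_1', '2015_11', '2016_9',
                  '2016_10', '2016_3', '2016_9', '2016_10'],
})

# Split into integer year and week so the sort is numeric, not lexicographic.
parts = df['year_week'].str.split('_', expand=True).astype(int)
parts.columns = ['year', 'week']

# Order the rows by (year, week) without adding helper columns to df.
order = parts.sort_values(['year', 'week']).index
df_sorted = df.loc[order].copy()

# factorize numbers values in order of first appearance, so after sorting
# the codes increase with (year, week).
df_sorted['num'] = pd.factorize(df_sorted['year_week'])[0]
print(df_sorted)
```

Because the split columns live in a separate frame, there is nothing to drop afterwards.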


# Method 2: suited to small datasets

# Make sure the column holds strings
df['year_week'] = df['year_week'].astype('str')

# List the categories in the order you want them sorted
cats = ['2015_1', '2015_10', '2015_11', '2016_3', '2016_9', '2016_10']

# Use pd.Categorical() to make the column an ordered categorical
df['year_week'] = pd.Categorical(df['year_week'], categories=cats, ordered=True)

# Sort by the 'year_week' column
df = df.sort_values('year_week')

# Sorted dataframe
df
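
With an ordered categorical, the numeric identifier can be read from the category codes. The sketch below assumes the sample data from the question and builds the category list from the data rather than typing it by hand. The sorting lambda is an illustrative choice and is not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({
    'key': range(8),
    'year_week': ['2015_10', '2015_1', '2015_11', '2016_9',
                  '2016_10', '2016_3', '2016_9', '2016_10'],
})

# Derive the ordered categories from the data: sort unique strings by
# their (year, week) integer pair.
cats = sorted(df['year_week'].unique(),
              key=lambda s: tuple(int(p) for p in s.split('_')))

df['year_week'] = pd.Categorical(df['year_week'], categories=cats, ordered=True)
df = df.sort_values('year_week')

# Each category's position in `cats` serves as its numeric identifier.
df['num'] = df['year_week'].cat.codes
print(df)
```

Unlike factorize, the codes here do not depend on row order, so they can be assigned before or after sorting.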