When groupby is followed by value_counts()

Time: 2020-08-18 18:20:36

Tags: python pandas group-by

I have data like this:

import pandas as pd

year = ['2010', '2011-2014', '2013', '2012-2016', '2018-present', '2019', '2015-present', '2015']
products = ['A', 'B', 'C', 'D', 'B', 'E', 'F', 'A']
rating = [4, 2, 2, 3, 1, 1, 2, 2]

data = pd.DataFrame({'Products': products, 'Year': year, 'Rating': rating})

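In my analysis, I want to convert the year ranges into single-year values (e.g. ['2010', '2011', '2013', '2014', '2015', '2016', '2017', '2018', '2019', '2020']) and, for the other columns, repeat the values for every year in the range so they can be counted per year. For the example above, I want: {'2010':'A','2011':'B','2013':'B','2014':'B','2013':'c','2012':'D', '2013':'D','2014':'D','2015':'D','2016':'D',...}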

I believe I need pandas.cut to do the binning, but I don't know how to do the binning in pandas.

4 Answers:

Answer 0: (score: 3)

Use explode:

# Extract the range information from the Year column
y = data['Year'].str.extract(r'(?P<From>\d+)-?(?P<To>\d+|present)?')  # raw string so \d is not treated as a string escape
y['To'] = y['To'].combine_first(y['From']).replace({'present': '2020'})
y = y.astype('int')
y['Range'] = y.apply(lambda row: range(row['From'], row['To']+1), axis=1)

# The explosion
data['Range'] = y['Range']
data = data.explode('Range')

Result:

Products          Year  Rating Range
       A          2010       4  2010
       B     2011-2014       2  2011
       B     2011-2014       2  2012
       B     2011-2014       2  2013
       B     2011-2014       2  2014
       C          2013       2  2013
       D     2012-2016       3  2012
       D     2012-2016       3  2013
       D     2012-2016       3  2014
       D     2012-2016       3  2015
       D     2012-2016       3  2016
       B  2018-present       1  2018
       B  2018-present       1  2019
       B  2018-present       1  2020
       E          2019       1  2019
       F  2015-present       2  2015
       F  2015-present       2  2016
       F  2015-present       2  2017
       F  2015-present       2  2018
       F  2015-present       2  2019
       F  2015-present       2  2020
       A          2015       2  2015

Rename the columns as needed.
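Once the ranges are exploded into one row per year, the per-year counts the question asks about can be taken with groupby and value_counts. The following is only a minimal sketch on a smaller, made-up frame; the names `exploded` and `counts` are illustrative and not part of the answer, and `list(range(...))` is used in place of the answer's `apply` for brevity.

```python
import pandas as pd

# Small illustrative frame (not the question's full data)
data = pd.DataFrame({
    'Products': ['A', 'B', 'C'],
    'Year': ['2010', '2011-2013', '2013'],
    'Rating': [4, 2, 2],
})

# Same extraction steps as in the answer above
y = data['Year'].str.extract(r'(?P<From>\d+)-?(?P<To>\d+|present)?')
y['To'] = y['To'].combine_first(y['From']).replace({'present': '2020'})
y = y.astype('int')

# One list of years per row, then one row per year
data['Range'] = [list(range(f, t + 1)) for f, t in zip(y['From'], y['To'])]
exploded = data.explode('Range')

# For each year, how many rows each product contributes
counts = exploded.groupby('Range')['Products'].value_counts()
print(counts)
```

The result is a Series indexed by (year, product), so a single year can be inspected with `counts.loc[2013]`.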

Answer 1: (score: 3)

IIUC, you can str.split the Year column and then use a list comprehension with a few conditions:

df["Year"] = [list(range(int(i[0]), int(i[1] if i[1]!= "present" else "2020")+1))
              if len(i)>1 else list(range(int(i[0]), int(i[0])+1))
              for i in df["Year"].str.split("-")]

print (df.explode("Year"))

  Products  Year  Rating
0        A  2010       4
1        B  2011       2
1        B  2012       2
1        B  2013       2
1        B  2014       2
2        C  2013       2
3        D  2012       3
3        D  2013       3
3        D  2014       3
3        D  2015       3
3        D  2016       3
4        B  2018       1
4        B  2019       1
4        B  2020       1
5        E  2019       1
6        F  2015       2
6        F  2016       2
6        F  2017       2
6        F  2018       2
6        F  2019       2
6        F  2020       2
7        A  2015       2
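The conditions packed into the list comprehension can be easier to read when pulled into a small function. This is a sketch of the same idea, not the answerer's code: `expand_year` and its `present` parameter are hypothetical names, and 2020 is assumed as the value for "present", as in the answers.

```python
import pandas as pd

def expand_year(label, present=2020):
    """Turn '2011-2014', '2018-present' or '2019' into a list of ints."""
    start, _, end = label.partition('-')
    if not end:              # single year such as '2019'
        end = start
    elif end == 'present':   # open-ended range, capped at the assumed year
        end = present
    return list(range(int(start), int(end) + 1))

# Small illustrative frame (a subset of the question's rows)
df = pd.DataFrame({
    'Products': ['A', 'B', 'B'],
    'Year': ['2010', '2011-2014', '2018-present'],
    'Rating': [4, 2, 1],
})

df['Year'] = [expand_year(label) for label in df['Year']]
out = df.explode('Year')
print(out)
```

Keeping the parsing in one function also makes it simple to change what "present" means without touching the explode step.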

Answer 2: (score: 0)

A simple solution is as follows :)

data[["start", "end"]] = data["Year"].str.split('-',expand=True).ffill(axis=1)
data["end"] = data["end"].replace({"present":pd.Timestamp("now").year})
data[["start", "end"]] = data[["start", "end"]].astype(int)
data = data.drop("Year", axis=1)
data = data.loc[data.index.repeat(data.end - data.start + 1)].reset_index(drop=True)
data["counter"] = data.groupby(["Products", "start"]).cumcount()
data["Year"] = data["start"] + data["counter"]
data = data.drop(["start", "end", "counter"], axis=1)
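The repeat-then-cumcount step is the core of this approach, so here it is in isolation on a two-row frame. Note one deliberate variation, which is an assumption of this sketch rather than the answer's code: the copies are numbered by grouping on the original index (`groupby(level=0)`) instead of on `["Products", "start"]`, so that two input rows sharing the same product and start year would not be counted together.

```python
import pandas as pd

# Illustrative frame with start/end already parsed to integers
data = pd.DataFrame({
    'Products': ['A', 'B'],
    'start': [2010, 2011],
    'end': [2010, 2013],
})

# Repeat each row once per year it covers; copies keep the original index label
repeated = data.loc[data.index.repeat(data['end'] - data['start'] + 1)]

# Number the copies of each original row 0, 1, 2, ...
offset = repeated.groupby(level=0).cumcount()

# start + offset gives the individual year for each copy
repeated = repeated.assign(Year=repeated['start'] + offset).reset_index(drop=True)
print(repeated[['Products', 'Year']])
```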

Answer 3: (score: 0)

import numpy as np

df1 = df['Year'].str.split("-", expand = True)\
.rename(columns={0:'Year1', 1:'Year2'}) #For Splitting into columns
df2 = pd.concat([df,df1], axis=1) #Merging

def a(b):
    if pd.isna(b['Year2']):  # no end year: covers both None and NaN
        return b['Year1']
    if b['Year2'] == 'present':
        return 2020
    else:
        return b['Year2']

df2['Year3'] = df2.apply(a, axis=1) #Conditional replacement

df2['Year1'] = df2['Year1'].astype(int) #Character --> Integer
df2['Year3'] = df2['Year3'].astype(int) #Character --> Integer

df2['Year4'] = [np.arange(f,t+1) for f,t in zip(df2['Year1'], df2['Year3'])]
#Build the array of years from start to end for each row

df3 = df2.explode('Year4').drop(columns=['Year', 'Year2', 'Year3', 'Year1'])
#Explode --> List to Rows + Drop unwanted columns

df4 = df3[['Products']+['Year4']+['Rating']] #Rearranging
print(df4)