Question

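I have two DataFrames, each containing data values and a month (these are the relevant columns). The second DataFrame also contains TMIN (minimum) and TMAX (maximum) values, listed under the "Element" column.

The first DataFrame has 12 entries showing the highest temperature that occurred in a given month between 2005 and 2014. Let's call it df_max.

The second DataFrame shows temperatures that occurred after the 2014 cutoff. Let's call it df2.

I want to create a third DataFrame that shows the temperatures in df2 that exceed those in df_max, grouped by month.

These are the values in df_max: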

       Data_Value
Month
1.0         217.0
2.0         194.0
3.0         317.0
4.0         306.0
5.0         367.0
6.0         406.0
7.0         406.0
8.0         372.0
9.0         372.0
10.0        328.0
11.0        256.0
12.0        194.0

Here are some of the values in df2:

     ID           Date        Element  Data_Value  Month
19   USC00205563  2015-01-03  TMIN     -39         1
30   USC00203712  2015-03-17  TMAX     800         3
34   USC00200032  2015-06-06  TMIN     128         6
46   USW00014833  2015-08-30  TMIN     178         8
50   USC00202308  2015-08-30  TMIN     156         8
51   USC00205563  2015-01-03  TMAX     22          1
59   USC00202308  2015-08-30  TMAX     600         8
72   USC00200230  2015-04-01  TMIN     -17         4
126  USC00200032  2015-06-06  TMAX     233         6
139  USW00014853  2015-05-17  TMIN     183         5
146  USC00208972  2015-04-09  TMAX     67          4
155  USC00205050  2015-01-05  TMIN     -139        1
157  USC00200230  2015-04-01  TMAX     183         4
170  USC00203712  2015-03-17  TMIN     11          3
179  USC00208972  2015-05-27  TMAX     500         5

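I think I should first zero in on the TMAX temperatures in the Element column, group them by month, and then filter the values to keep only those greater than the maximum for each month in df_max. This is my code: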

df3 = df2[df2['Element'] =='TMAX'].groupby[('Month')('Data_Value')].filter(lambda x: x > df_max['Data_Value'])

This returns the error message "TypeError: 'str' object is not callable".

Desired result

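So the result I want is this. For example, suppose df2 has (i) 3 rows belonging to month 2, with values 800, 400, and 150, and (ii) 5 rows belonging to month 5, with values 100, 500, 700, 300, and 100.

The new DataFrame (df3) would:
i.) contain the rows with 800 and 400, because they exceed the maximum of 194 in df_max for month 2;
ii.) contain the rows with 500 and 700, because they exceed the value of 367 in df_max for month 5.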

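Update: To find all values in df2 that might exceed the maximum for each month in df_max, I decided to use groupby and nlargest to determine the top 3 temperatures for each month, assuming (based on looking at the dataset) that at most 3 values will exceed the monthly maximum in df_max. The problem is that the output is a pd.Series, and I'm not sure how to compare each month's values against the values in the df_max DataFrame.

Here is the code I wrote: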

df3 = df2[df2['Element'] =='TMAX'].groupby("Month")["Data_Value"].nlargest(3)

#find values in df3 that exceed the maximum temperatures in df_max for each month in the year
df3_max = df3[df3.Data_Value >= df_max.Data_Value]

However, I get the error message: AttributeError: 'Series' object has no attribute 'Data_Value'

Answer 1

Is this what you want?

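# df_max holds the 2005-2014 monthly maxima; after the merge,
# Data_Value_x comes from df_max and Data_Value_y from df2
df3 = df_max.merge(df2.groupby('Month').agg({'Data_Value': 'max'}).reset_index(),
                   on='Month', how='inner')
df3[df3.Data_Value_x > df3.Data_Value_y]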

   Month  Data_Value_x  Data_Value_y
0     1         217.0            22
2     4         306.0           183
4     6         406.0           233
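
The merge above yields one row per month. If the goal is the individual df2 rows that beat the 2005-2014 maximum for their month, one possible variation is to attach each month's df_max value to every TMAX row and then filter. This is only a minimal sketch: it assumes df_max is indexed by Month as shown in the question, the small frames below are stand-ins built from the question's sample values (with integer month labels), and the column name Max_2005_2014 is made up for illustration.

```python
import pandas as pd

# Stand-in for df_max: monthly maxima indexed by Month (subset of the question's values).
df_max = pd.DataFrame(
    {"Data_Value": [217.0, 317.0, 306.0, 367.0, 406.0, 372.0]},
    index=pd.Index([1, 3, 4, 5, 6, 8], name="Month"),
)

# Stand-in for df2: a few TMIN/TMAX rows taken from the question's sample.
df2 = pd.DataFrame({
    "Element": ["TMIN", "TMAX", "TMAX", "TMAX", "TMAX", "TMAX", "TMAX", "TMAX"],
    "Data_Value": [-39, 800, 22, 600, 233, 67, 183, 500],
    "Month": [1, 3, 1, 8, 6, 4, 4, 5],
})

# Keep only TMAX rows, then bring in the per-month threshold from df_max.
tmax = df2[df2["Element"] == "TMAX"]
merged = tmax.merge(
    df_max.rename(columns={"Data_Value": "Max_2005_2014"}),  # hypothetical column name
    left_on="Month",
    right_index=True,
    how="inner",
)

# Row-level filter: keep readings strictly above that month's earlier maximum.
df3 = merged[merged["Data_Value"] > merged["Max_2005_2014"]]
print(df3)
```

With these stand-in values, only the month 3, 5, and 8 readings (800, 500, and 600) are above their thresholds.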

Answer 2

I think this is what you want.

df3 = df2[df2['Element'] == 'TMAX'].groupby("Month").max()
df3 = df3[df3.Data_Value == df_max.Data_Value.max()]

I think the code is self-explanatory.
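
Note that the snippet above compares against a single overall maximum (df_max.Data_Value.max()), whereas the question asks for a comparison month by month. One way this could be sketched, assuming df_max is indexed by Month, is to look up each row's monthly threshold with Series.map and use the result as a boolean mask. The data below reproduces the worked example from the question (months 2 and 5); the variable name threshold is illustrative only.

```python
import pandas as pd

# Monthly maxima for the two months used in the question's worked example.
df_max = pd.DataFrame(
    {"Data_Value": [194.0, 367.0]},
    index=pd.Index([2, 5], name="Month"),
)

# Hypothetical df2 rows matching the example values given in the question.
df2 = pd.DataFrame({
    "Element": ["TMAX"] * 8,
    "Data_Value": [800, 400, 150, 100, 500, 700, 300, 100],
    "Month": [2, 2, 2, 5, 5, 5, 5, 5],
})

tmax = df2[df2["Element"] == "TMAX"]

# For every row, fetch the df_max value of that row's month.
threshold = tmax["Month"].map(df_max["Data_Value"])

# Keep rows whose reading is strictly above their month's threshold.
df3 = tmax[tmax["Data_Value"] > threshold]
print(df3["Data_Value"].tolist())
```

This keeps 800 and 400 for month 2 and 500 and 700 for month 5, matching the desired result described in the question. It also avoids groupby(...).filter(...), which keeps or drops whole groups rather than individual rows.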

Comparing two DataFrames on multiple conditions using groupby and filter
