Group by rolling conditions

Time: 2019-06-05 11:37:58

Tags: python pandas dataframe group-by

I am trying to group a DataFrame based on some conditions.

DataFrame:

Start Date  End Date    value
1971-07-01  1971-07-31  0.0
1971-08-01  1971-08-31  0.25
1971-09-01  1971-09-30  -0.62
1971-10-01  1971-10-31  0.0
1971-11-01  1971-11-30  -0.63
1971-12-01  1971-12-31  -1.0
1972-01-01  1972-01-31  0.0
1972-02-01  1972-02-29  0.0
1972-03-01  1972-03-31  2.0
1972-04-01  1972-04-30  0.0
.
.
1973-07-01  1973-07-31  2.0
1973-08-01  1973-08-31  0.5
1973-09-01  1973-09-30  -2.0
1973-10-01  1973-10-31  0.0
1973-11-01  1973-11-30  0.0
1973-12-01  1973-12-31  0.0
1974-01-01  1974-01-31  0.0
1974-02-01  1974-02-28  0.0
.
.
.
1974-11-01  1974-11-30  0.0
1974-12-01  1974-12-31  -1.25
1975-01-01  1975-01-31  -1.0
1975-02-01  1975-02-28  -1.0
1975-03-01  1975-03-31  -0.5
1975-04-01  1975-04-30  -0.25
1975-05-01  1975-05-31  0.0
1975-06-01  1975-06-30  1.25
1975-07-01  1975-07-31  0.0
1975-08-01  1975-08-31  0.0

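Grouping conditions

A group should always start with a negative value.

The group continues as long as we keep seeing negative values.

The group ends when we reach a positive value or three consecutive zeros.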
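The three rules above can be read as a small state machine. What follows is only a sketch under stated assumptions: the function name `label_groups` and the convention of using -1 for rows outside any group are made up for illustration. It treats the positive value that ends a group as not belonging to it, and the third consecutive zero as belonging to it, which matches the examples in this question.

```python
def label_groups(values):
    """Return one label per value: a group number (0, 1, ...) or -1 outside any group.

    Hypothetical helper; the -1 convention is an assumption of this sketch.
    """
    labels = []
    current = -1     # label of the open group, -1 when no group is open
    next_label = 0   # label to hand out to the next group
    zeros = 0        # consecutive zeros seen inside the open group
    for v in values:
        if current == -1:
            if v < 0:
                # a negative value opens a new group
                current = next_label
                next_label += 1
                zeros = 0
                labels.append(current)
            else:
                labels.append(-1)
        elif v > 0:
            # a positive value closes the group and is not part of it
            labels.append(-1)
            current = -1
        else:
            # zero or negative: the row stays in the open group
            zeros = zeros + 1 if v == 0 else 0
            labels.append(current)
            if zeros == 3:
                # the third consecutive zero closes the group and is part of it
                current = -1
    return labels


# Values of Example 2, then Example 3, then the positive value that ends it
values = [-2.0, 0.0, 0.0, 0.0, -1.25, -1.0, -1.0, -0.5, -0.25, 0.0, 1.25]
print(label_groups(values))  # → [0, 0, 0, 0, 1, 1, 1, 1, 1, 1, -1]
```

Note that in this sketch a group that is still open when the data ends keeps its label; whether such a trailing group should count is not stated in the question.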

Example 1 from the DataFrame above

1971-09-01  1971-09-30  -0.62
1971-10-01  1971-10-31  0.0
1971-11-01  1971-11-30  -0.63
1971-12-01  1971-12-31  -1.0
1972-01-01  1972-01-31  0.0
1972-02-01  1972-02-29  0.0

Example 2 (in this case we reached 3 consecutive zeros)

1973-09-01  1973-09-30  -2.0
1973-10-01  1973-10-31  0.0
1973-11-01  1973-11-30  0.0
1973-12-01  1973-12-31  0.0

Example 3 (in this case we reached a positive value)

1974-12-01  1974-12-31  -1.25
1975-01-01  1975-01-31  -1.0
1975-02-01  1975-02-28  -1.0
1975-03-01  1975-03-31  -0.5
1975-04-01  1975-04-30  -0.25
1975-05-01  1975-05-31  0.0

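I don't have any working code, because I'm still looking for a way to put these conditions into a groupby, or any other efficient way to do this.

I tried a loop, but I'm not getting anywhere.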

for i in df.index:
    no = 0
    if df['value'][i] < 0:
        df['groupno'] = no

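After grouping, I want to get the start date from the first row of each group and the end date from the last row of each group.

Expected result (from the examples):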

Start Date   End Date
1971-09-01   1972-02-29
1973-09-01   1973-12-31
1974-12-01   1975-05-31
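Once every row carries a group label, the last step is an ordinary groupby. The sketch below assumes a hypothetical `groupno` column that already exists (here written by hand, with -1 marking rows outside any group) and uses only the rows of Examples 2 and 3 plus the positive row that ends Example 3. It is not a full solution, since producing the labels is the hard part.

```python
import pandas as pd

# Subset of the question's data: Examples 2 and 3, plus the row that ends Example 3.
df = pd.DataFrame({
    "Start Date": ["1973-09-01", "1973-10-01", "1973-11-01", "1973-12-01",
                   "1974-12-01", "1975-01-01", "1975-02-01", "1975-03-01",
                   "1975-04-01", "1975-05-01", "1975-06-01"],
    "End Date": ["1973-09-30", "1973-10-31", "1973-11-30", "1973-12-31",
                 "1974-12-31", "1975-01-31", "1975-02-28", "1975-03-31",
                 "1975-04-30", "1975-05-31", "1975-06-30"],
    "value": [-2.0, 0.0, 0.0, 0.0, -1.25, -1.0, -1.0, -0.5, -0.25, 0.0, 1.25],
    # Hypothetical labels, written by hand for this sketch: -1 = not in a group.
    "groupno": [0, 0, 0, 0, 1, 1, 1, 1, 1, 1, -1],
})

result = (
    df[df["groupno"] >= 0]          # drop rows outside any group
    .groupby("groupno")
    .agg({"Start Date": "first",    # start date of the group's first row
          "End Date": "last"})      # end date of the group's last row
    .reset_index(drop=True)
)
print(result)
```

For this subset, `result` has two rows: 1973-09-01 to 1973-12-31 and 1974-12-01 to 1975-05-31.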

Thanks for reading.

1 Answer:

Answer 0: (score: 0)

I don't think this is the Pythonic way, but it works, and I think it will help you.

import pandas as pd

groups = []
start = None  # start date of the open group, None when no group is open
end = None    # end date of the previous row
zeros = 0     # count of consecutive zeros
for _, row in df.iterrows():
    value = row['value']
    # first negative value: start the group
    if value < 0 and start is None:
        start = row['Start Date']
        zeros = 0
    # count consecutive zeros; any non-zero value resets the count
    if value == 0:
        zeros += 1
    else:
        zeros = 0
    # a positive value or 3 consecutive zeros ends the group (if one is open)
    if (value > 0 or zeros == 3) and start is not None:
        # after 3 zeros the group ends on this row, not on the previous one
        if zeros == 3:
            end = row['End Date']
        groups.append((start, end))
        start = None
        zeros = 0
    # 3 zeros outside a group: just reset the counter
    if zeros == 3:
        zeros = 0
    # remember this row's end date for the next iteration
    end = row['End Date']
result = pd.DataFrame(groups, columns=['Start Date', 'End Date'])
print(result)

It isn't a group by, but it finds the start and end dates of the groups.

Output:

   Start Date    End Date
0  1971-09-01  1972-02-29
1  1973-09-01  1973-12-31
2  1974-12-01  1975-05-31