Merging rows in a DataFrame

Time: 2020-11-02 08:24:08

Tags: python dataframe

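I want to build a new DataFrame (or update the existing one). How can I do this?
If a user has another row where Date_start equals Date_end + 1 second, the two rows should be merged into one. So with the DataFrame below, for the first user XXX, I want to merge the first 3 rows into one. In addition, this should only happen when the user and the date are in the same group.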

      Group   Date      User    Date_start            Date_end
1       A   2018-09-20  XXX 2018-09-20 00:01:35 2018-09-20 00:59:59
2       A   2018-09-20  XXX 2018-09-20 01:00:00 2018-09-20 01:59:59
3       A   2018-09-20  XXX 2018-09-20 02:00:00 2018-09-20 02:18:10
4       A   2018-09-20  XXY 2018-09-20 00:00:19 2018-09-20 00:59:59
5       A   2018-09-20  XXY 2018-09-20 01:00:00 2018-09-20 01:09:26
6       B   2018-09-20  XXZ 2018-09-20 00:28:39 2018-09-20 00:59:59
... ... ... ... ... ...
1999996 A   2018-09-20  ZZX 2018-09-20 00:00:08 2018-09-20 00:59:59
1999997 B   2018-09-20  ZZX 2018-09-20 01:00:00 2018-09-20 01:59:59
1999998 A   2018-09-20  ZZY 2018-09-20 00:04:22 2018-09-20 00:59:59
1999999 B   2018-09-20  ZZZ 2018-09-20 00:00:54 2018-09-20 00:59:59
2000000 A   2018-09-20  ZZZ 2018-09-20 01:00:00 2018-09-20 01:59:59

This is the expected result for user XXX (the first three rows merged into one):

1       A   2018-09-20  XXX 2018-09-20 00:01:35 2018-09-20 02:18:10

1 answer:

Answer 0 (score: 1)

IIUC, this can be done with groupby. First, I converted the string times to datetime:

    import pandas as pd

    df['Date_start'] = pd.to_datetime(df['Date_start'])
    df['Date_end'] = pd.to_datetime(df['Date_end'])

Second, write a function to apply to a groupby operation. We will group by User and Group to merge their times:

    def mygroup(d):
        out = d.iloc[0].copy()               # take the first row of each group (copy so the slice is not modified in place)
        x = d.columns.get_loc('Date_end')    # integer position of the Date_end column
        out.loc['Date_end'] = d.iloc[-1, x]  # replace the first row's Date_end with that of the last row
        return out

Finally, apply the function and reset the index:

    df = df.groupby(['Group', 'User']).apply(mygroup).reset_index(drop=True)

Output for the first 5 rows:

      Group        Date User          Date_start            Date_end
    0     A  2018-09-20  XXX 2018-09-20 00:01:35 2018-09-20 02:18:10
    1     A  2018-09-20  XXY 2018-09-20 00:00:19 2018-09-20 01:09:26

Note that this does not make use of the "1 second later" condition you mentioned. It also assumes the rows within each Group/User combination are already in chronological order, since it simply takes the first row's Date_start and the last row's Date_end. I imagine it would be a problem if, for any Group/User combination, there is more than one run of times you want to merge. In that case, this approach can still be used if an extra step creates a new column that labels each run of times to merge. That may not be the simplest operation, but it should be doable.
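The extra labelling step mentioned above could look roughly like the following. This is only a sketch under a few assumptions: rows are sorted by Date_start within each Group/Date/User combination, Date is treated as part of the grouping key (the question suggests this but does not spell it out), and the `block` helper column plus the fourth XXX row in the sample data are hypothetical additions made here to demonstrate a gap that must not be merged.

```python
import pandas as pd

# Sample rows in the question's layout. The XXX row starting at 05:00:00 is
# NOT from the question; it is a made-up row that leaves a gap after 02:18:10.
df = pd.DataFrame(
    {
        "Group": ["A"] * 6,
        "Date": ["2018-09-20"] * 6,
        "User": ["XXX", "XXX", "XXX", "XXX", "XXY", "XXY"],
        "Date_start": [
            "2018-09-20 00:01:35", "2018-09-20 01:00:00", "2018-09-20 02:00:00",
            "2018-09-20 05:00:00",
            "2018-09-20 00:00:19", "2018-09-20 01:00:00",
        ],
        "Date_end": [
            "2018-09-20 00:59:59", "2018-09-20 01:59:59", "2018-09-20 02:18:10",
            "2018-09-20 05:30:00",
            "2018-09-20 00:59:59", "2018-09-20 01:09:26",
        ],
    }
)
df["Date_start"] = pd.to_datetime(df["Date_start"])
df["Date_end"] = pd.to_datetime(df["Date_end"])

keys = ["Group", "Date", "User"]
df = df.sort_values(keys + ["Date_start"]).reset_index(drop=True)

# Date_end of the previous row within the same Group/Date/User (NaT for the first row).
prev_end = df.groupby(keys)["Date_end"].shift()

# A row opens a new block unless it starts exactly 1 second after the previous row ended.
# The first row of every key combination compares against NaT and therefore opens a block.
new_block = df["Date_start"] != prev_end + pd.Timedelta(seconds=1)

# Running count of block starts: a unique id per run of contiguous rows.
df["block"] = new_block.cumsum()

# Collapse each block to its first start and last end, then drop the helper column.
merged = (
    df.groupby(keys + ["block"], as_index=False)
    .agg(Date_start=("Date_start", "first"), Date_end=("Date_end", "last"))
    .drop(columns="block")
)
print(merged)
```

With this sample, the three contiguous XXX rows collapse into one row, the hypothetical 05:00:00 row stays separate, and the two XXY rows collapse into one. Because it relies on vectorized `shift` and `cumsum` rather than a Python-level `apply`, this variant may also scale better to a frame with around two million rows, though that has not been measured here.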
