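I want to build a new DataFrame (or update the existing one). How can I do that?
If a user has another row where Date_start = Date_end + 1 second, the two rows should be merged into one. So, with the data frame below, I want to merge the first 3 rows of the first user, XXX, into one.
In addition, this should only be done when the user and their date are in the same group.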
Group Date User Date_start Date_end
1 A 2018-09-20 XXX 2018-09-20 00:01:35 2018-09-20 00:59:59
2 A 2018-09-20 XXX 2018-09-20 01:00:00 2018-09-20 01:59:59
3 A 2018-09-20 XXX 2018-09-20 02:00:00 2018-09-20 02:18:10
4 A 2018-09-20 XXY 2018-09-20 00:00:19 2018-09-20 00:59:59
5 A 2018-09-20 XXY 2018-09-20 01:00:00 2018-09-20 01:09:26
6 B 2018-09-20 XXZ 2018-09-20 00:28:39 2018-09-20 00:59:59
... ... ... ... ... ...
1999996 A 2018-09-20 ZZX 2018-09-20 00:00:08 2018-09-20 00:59:59
1999997 B 2018-09-20 ZZX 2018-09-20 01:00:00 2018-09-20 01:59:59
1999998 A 2018-09-20 ZZY 2018-09-20 00:04:22 2018-09-20 00:59:59
1999999 B 2018-09-20 ZZZ 2018-09-20 00:00:54 2018-09-20 00:59:59
2000000 A 2018-09-20 ZZZ 2018-09-20 01:00:00 2018-09-20 01:59:59
This is the desired result for user XXX (the first three rows merged into one):
1 A 2018-09-20 XXX 2018-09-20 00:01:35 2018-09-20 02:18:10
Answer 0 (score: 1)
IIUC, this can be done with groupby. First, I convert the time strings to datetime:

    import pandas as pd

    df['Date_start'] = pd.to_datetime(df['Date_start'])
    df['Date_end'] = pd.to_datetime(df['Date_end'])

Second, use groupby together with apply on a function. We will group by Group and User to merge their times:

    def mygroup(d):
        # take the first row of each group (copied so the source frame is not modified)
        out = d.iloc[0].copy()
        # replace the first row's Date_end with that of the last row
        out['Date_end'] = d['Date_end'].iloc[-1]
        return out

Finally, apply the function and reset the index:

    df = df.groupby(['Group', 'User']).apply(mygroup).reset_index(drop=True)

Output of the first rows:

      Group  Date        User  Date_start           Date_end
    0 A      2018-09-20  XXX   2018-09-20 00:01:35  2018-09-20 02:18:10
    1 A      2018-09-20  XXY   2018-09-20 00:00:19  2018-09-20 01:09:26

Note that this does not make use of the "1 second before" aspect you mentioned, and it relies on the rows of each group already being sorted by Date_start. I think it would be a problem if, for any Group and User combination, there is more than one series of times to merge. In that case, this approach could still be used with an extra step that creates a new column marking each time period to be grouped. That may not be the simplest thing to do, but it should be feasible.
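The extra step mentioned above could look roughly like the sketch below. It is only one possible way to do it, not something taken from the question. The sample frame is a small made-up subset modeled on the question's rows (the 05:00:00 row for XXX is invented to show a gap), the `block` column is a hypothetical helper name, and including Date in the grouping keys is an assumption based on the "same group and date" requirement.

```python
import pandas as pd

# Hypothetical sample modeled on the question; the fourth XXX row is invented
# to show a gap that must NOT be merged with the first three rows.
df = pd.DataFrame(
    {
        "Group": ["A"] * 6,
        "Date": ["2018-09-20"] * 6,
        "User": ["XXX", "XXX", "XXX", "XXX", "XXY", "XXY"],
        "Date_start": [
            "2018-09-20 00:01:35", "2018-09-20 01:00:00", "2018-09-20 02:00:00",
            "2018-09-20 05:00:00", "2018-09-20 00:00:19", "2018-09-20 01:00:00",
        ],
        "Date_end": [
            "2018-09-20 00:59:59", "2018-09-20 01:59:59", "2018-09-20 02:18:10",
            "2018-09-20 05:30:00", "2018-09-20 00:59:59", "2018-09-20 01:09:26",
        ],
    }
)
df["Date_start"] = pd.to_datetime(df["Date_start"])
df["Date_end"] = pd.to_datetime(df["Date_end"])

# Assumption: rows are only merged within the same Group, Date and User.
keys = ["Group", "Date", "User"]
df = df.sort_values(keys + ["Date_start"]).reset_index(drop=True)

# Date_end of the previous row within the same key combination (NaT for the first row).
prev_end = df.groupby(keys)["Date_end"].shift()

# A row opens a new block unless it starts exactly one second after the previous end.
new_block = df["Date_start"] != prev_end + pd.Timedelta(seconds=1)

# Hypothetical helper column: the running count of block starts labels each block.
df["block"] = new_block.cumsum()

# Collapse every block to a single row spanning its first start and last end.
merged = (
    df.groupby(keys + ["block"], as_index=False)
    .agg(Date_start=("Date_start", "min"), Date_end=("Date_end", "max"))
    .drop(columns="block")
)
print(merged)
```

Because the block label is built with shift and cumsum instead of a Python-level apply, this should also scale better to a frame with around two million rows, although that has not been measured here.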