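Match, then group list elements

Time: 2016-10-26 02:19:33

Tags: python list parsing itertools pandas-groupby

I have parsed a text file to pull out the relevant data. I then joined the variables (dlOrbit2, imageId3, imageStart4, imageEnd4) together, producing a series of lists of 4 strings each.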

combined = str(','.join([dlOrbit2, imageId3, imageStart4, imageEnd4]))
strSplit = combined.split(',')

print(strSplit)

['46284', '514607', '2016-10-26 02:43:46', '2016-10-26 02:48:39']
['46284', '514607', '2016-10-26 02:43:46', '2016-10-26 02:48:39']
['46284', '514607', '2016-10-26 02:43:46', '2016-10-26 02:48:39']
['46284', '514607', '2016-10-26 02:43:46', '2016-10-26 02:48:39']
['46288', '514626', '2016-10-26 09:48:26', '2016-10-26 09:48:37']
['46288', '514663', '2016-10-26 09:53:46', '2016-10-26 09:54:57']
['46288', '514663', '2016-10-26 09:53:46', '2016-10-26 09:54:57']
['46288', '514663', '2016-10-26 09:53:46', '2016-10-26 09:54:57']
['46288', '514663', '2016-10-26 09:53:46', '2016-10-26 09:54:57']
['46288', '514663', '2016-10-26 09:53:46', '2016-10-26 09:54:57']
['46290', '514628', '2016-10-26 13:12:34', '2016-10-26 13:12:53']
['46290', '514628', '2016-10-26 13:12:54', '2016-10-26 13:13:13']
['46291', '514738', '2016-10-26 14:56:39', '2016-10-26 14:59:06']
['46291', '514738', '2016-10-26 14:56:39', '2016-10-26 14:59:06']
['46291', '514738', '2016-10-26 14:56:39', '2016-10-26 14:59:06']
['46291', '514738', '2016-10-26 14:56:39', '2016-10-26 14:59:06']

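I want to match and group the rows on the first column: 46284 x 4, 46288 x 6, 46290 x 2, 46291 x 4. Within each group, I want the earliest time from element 2 and the latest time from element 3. So the desired output would be: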

['46284', '514607', '2016-10-26 02:43:46', '2016-10-26 02:48:39']
['46288', '514626', '2016-10-26 09:48:26', '2016-10-26 09:54:57']
['46290', '514628', '2016-10-26 13:12:34', '2016-10-26 13:13:13']
['46291', '514738', '2016-10-26 14:56:39', '2016-10-26 14:59:06']

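Each list will always have 4 elements, but the number of rows in each group (first column) will vary.

I am going to export these results to a CSV file. However, I only need help with the part above.

3 answers:

Answer 0: (score: 1)

Using pandas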

import pandas as pd

dat = [['46284', '514607', '2016-10-26 02:43:46', '2016-10-26 02:48:39'],
['46284', '514607', '2016-10-26 02:43:46', '2016-10-26 02:48:39'],
['46284', '514607', '2016-10-26 02:43:46', '2016-10-26 02:48:39'],
['46284', '514607', '2016-10-26 02:43:46', '2016-10-26 02:48:39'],
['46288', '514626', '2016-10-26 09:48:26', '2016-10-26 09:48:37'],
['46288', '514663', '2016-10-26 09:53:46', '2016-10-26 09:54:57'],
['46288', '514663', '2016-10-26 09:53:46', '2016-10-26 09:54:57'],
['46288', '514663', '2016-10-26 09:53:46', '2016-10-26 09:54:57'],
['46288', '514663', '2016-10-26 09:53:46', '2016-10-26 09:54:57'],
['46288', '514663', '2016-10-26 09:53:46', '2016-10-26 09:54:57'],
['46290', '514628', '2016-10-26 13:12:34', '2016-10-26 13:12:53'],
['46290', '514629', '2016-10-26 13:12:54', '2016-10-26 13:13:13'],
['46291', '514738', '2016-10-26 14:56:39', '2016-10-26 14:59:06'],
['46291', '514738', '2016-10-26 14:56:39', '2016-10-26 14:59:06'],
['46291', '514738', '2016-10-26 14:56:39', '2016-10-26 14:59:06'],
['46291', '514738', '2016-10-26 14:56:39', '2016-10-26 14:59:06']]

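# Drop exact duplicate rows; the columns are labeled 0..3
df = pd.DataFrame(dat).drop_duplicates()
# Per orbit (column 0): earliest start (column 2) and latest end (column 3)
df_times = df.groupby([0]).agg({2: 'min', 3: 'max'}).reset_index()
# Join back on orbit and start time to recover the image ID (column 1);
# '3_x' is the aggregated end time that comes from df_times
df_times.merge(df, on=[0, 2])[[0, 1, 2, '3_x']]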

Output:

0   46284   514607  2016-10-26 02:43:46 2016-10-26 02:48:39
1   46288   514626  2016-10-26 09:48:26 2016-10-26 09:54:57
2   46290   514628  2016-10-26 13:12:34 2016-10-26 13:13:13
3   46291   514738  2016-10-26 14:56:39 2016-10-26 14:59:06
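
If the rows already arrive in chronological order within each orbit, the merge step may not be needed. The following is a minimal sketch under that assumption: it takes the image ID from the first row of each group. The 'first' choice and the shortened `dat` sample are illustrative and are not part of the answer above.

```python
import pandas as pd

# Shortened sample of the data above (assumed sorted by time within each orbit)
dat = [
    ['46288', '514626', '2016-10-26 09:48:26', '2016-10-26 09:48:37'],
    ['46288', '514663', '2016-10-26 09:53:46', '2016-10-26 09:54:57'],
    ['46290', '514628', '2016-10-26 13:12:34', '2016-10-26 13:12:53'],
    ['46290', '514629', '2016-10-26 13:12:54', '2016-10-26 13:13:13'],
]

df = pd.DataFrame(dat)

# One row per orbit: first image ID, earliest start, latest end.
# The timestamps are fixed-width strings, so string min/max matches time order.
out = df.groupby(0, as_index=False).agg({1: 'first', 2: 'min', 3: 'max'})

rows = out.values.tolist()
print(rows)
```

Unlike the merge on the start time, this picks the ID by row position, so the two approaches can disagree when the input is not sorted by time.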

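Answer 1: (score: 1)

As someone new to Python, I would want to see examples that use basic Python features before reaching for the Big Hammers.

If it can be done in under a dozen lines of code with no module imports, that is what I would want to learn first.

Perhaps manipulating a list of lists with double indexing is not well understood?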

combined = [['46284', '514607', '2016-10-26 02:43:46', '2016-10-26 02:48:39'], ['46284', '514607', '2016-10-26 02:43:46', '2016-10-26 02:48:39'], ['46284', '514607', '2016-10-26 02:43:46', '2016-10-26 02:48:39'], ['46284', '514607', '2016-10-26 02:43:46', '2016-10-26 02:48:39'], ['46288', '514626', '2016-10-26 09:48:26', '2016-10-26 09:48:37'], ['46288', '514663', '2016-10-26 09:53:46', '2016-10-26 09:54:57'], ['46288', '514663', '2016-10-26 09:53:46', '2016-10-26 09:54:57'], ['46288', '514663', '2016-10-26 09:53:46', '2016-10-26 09:54:57'], ['46288', '514663', '2016-10-26 09:53:46', '2016-10-26 09:54:57'], ['46288', '514663', '2016-10-26 09:53:46', '2016-10-26 09:54:57'], ['46290', '514628', '2016-10-26 13:12:34', '2016-10-26 13:12:53'], ['46290', '514629', '2016-10-26 13:12:54', '2016-10-26 13:13:13'], ['46291', '514738', '2016-10-26 14:56:39', '2016-10-26 14:59:06'], ['46291', '514738', '2016-10-26 14:56:39', '2016-10-26 14:59:06'], ['46291', '514738', '2016-10-26 14:56:39', '2016-10-26 14:59:06'], ['46291', '514738', '2016-10-26 14:56:39', '2016-10-26 14:59:06']]

combined[0][0]    # double index
Out[28]: '46284'

combined[2][2:]   # slice
Out[29]: ['2016-10-26 02:43:46', '2016-10-26 02:48:39']

max(combined[2][2:])    # duck type order comparison
Out[30]: '2016-10-26 02:48:39'

Why not define a function that uses these basic Python tools on the input list before grouping?
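
One way such a function might look, as a sketch with no imports. The name `collapse_by_first_column` and the shortened `rows` sample are hypothetical, and the sketch assumes every row has exactly 4 string elements with timestamps in the fixed `YYYY-MM-DD HH:MM:SS` format, so that plain string comparison matches time order.

```python
def collapse_by_first_column(rows):
    """Collapse rows that share the same first element into one row per key."""
    merged = {}  # key -> [key, first image ID, earliest start, latest end]
    order = []   # keys in order of first appearance
    for key, image_id, start, end in rows:
        if key not in merged:
            # First row seen for this key: keep its image ID
            merged[key] = [key, image_id, start, end]
            order.append(key)
        else:
            current = merged[key]
            current[2] = min(current[2], start)  # earliest start so far
            current[3] = max(current[3], end)    # latest end so far
    return [merged[key] for key in order]


rows = [
    ['46288', '514626', '2016-10-26 09:48:26', '2016-10-26 09:48:37'],
    ['46288', '514663', '2016-10-26 09:53:46', '2016-10-26 09:54:57'],
    ['46290', '514628', '2016-10-26 13:12:34', '2016-10-26 13:12:53'],
    ['46290', '514629', '2016-10-26 13:12:54', '2016-10-26 13:13:13'],
]

result = collapse_by_first_column(rows)
for row in result:
    print(row)
```

Because it keys on a dictionary, this version does not require rows with the same first element to be adjacent in the input.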

Answer 2: (score: 0)

You can make use of groupby and tee

data = [
    ['46284', '514607', '2016-10-26 02:43:46', '2016-10-26 02:48:39'],
    ['46284', '514607', '2016-10-26 02:43:46', '2016-10-26 02:48:39'],
    ['46284', '514607', '2016-10-26 02:43:46', '2016-10-26 02:48:39'],
    ['46284', '514607', '2016-10-26 02:43:46', '2016-10-26 02:48:39'],
    ['46288', '514626', '2016-10-26 09:48:26', '2016-10-26 09:48:37'],
    ['46288', '514663', '2016-10-26 09:53:46', '2016-10-26 09:54:57'],
    ['46288', '514663', '2016-10-26 09:53:46', '2016-10-26 09:54:57'],
    ['46288', '514663', '2016-10-26 09:53:46', '2016-10-26 09:54:57'],
    ['46288', '514663', '2016-10-26 09:53:46', '2016-10-26 09:54:57'],
    ['46288', '514663', '2016-10-26 09:53:46', '2016-10-26 09:54:57'],
    ['46290', '514628', '2016-10-26 13:12:34', '2016-10-26 13:12:53'],
    ['46290', '514629', '2016-10-26 13:12:54', '2016-10-26 13:13:13'],
    ['46291', '514738', '2016-10-26 14:56:39', '2016-10-26 14:59:06'],
    ['46291', '514738', '2016-10-26 14:56:39', '2016-10-26 14:59:06'],
    ['46291', '514738', '2016-10-26 14:56:39', '2016-10-26 14:59:06'],
    ['46291', '514738', '2016-10-26 14:56:39', '2016-10-26 14:59:06']
]


from itertools import groupby, tee
import pprint

res = []
for k, g in groupby(data, key=lambda x: x[0]):
    it1, it2, it3 = tee(g, 3)
    res.append(next(it1)[:2] + [min(x[2] for x in it2), max(x[3] for x in it3)])

pprint.pprint(res)

Output:

[['46284', '514607', '2016-10-26 02:43:46', '2016-10-26 02:48:39'],
 ['46288', '514626', '2016-10-26 09:48:26', '2016-10-26 09:54:57'],
 ['46290', '514628', '2016-10-26 13:12:34', '2016-10-26 13:13:13'],
 ['46291', '514738', '2016-10-26 14:56:39', '2016-10-26 14:59:06']]

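for k, g in groupby(data, key=lambda x: x[0]) groups consecutive rows by the first column. On each iteration it yields a tuple whose first item is the key used for grouping and whose second item is an iterator over the items in that group.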
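
Since groupby only merges adjacent rows, input that is not already ordered by the first column would yield the same key more than once. A small sketch of that behavior, using made-up two-element rows rather than the data above:

```python
from itertools import groupby

# Hypothetical rows where the key '46288' is not contiguous
unsorted_rows = [
    ['46288', 'a'],
    ['46284', 'b'],
    ['46288', 'c'],
]

# The key '46288' shows up twice because its rows are not adjacent
keys_unsorted = [k for k, _ in groupby(unsorted_rows, key=lambda r: r[0])]

# Sorting on the same key first makes each key appear exactly once
sorted_rows = sorted(unsorted_rows, key=lambda r: r[0])
keys_sorted = [k for k, _ in groupby(sorted_rows, key=lambda r: r[0])]

print(keys_unsorted)
print(keys_sorted)
```

The data in the question is already ordered by orbit, so no sort is needed there.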

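it1, it2, it3 = tee(g, 3) splits the group iterator into three iterators, each of which yields exactly the same items. Finally, the result is built by taking the first two columns from the first item in the group and running min and max over the other two iterators.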
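
As an alternative sketch, each group can be materialized into a list, which can be traversed more than once, so tee is not needed. This is a variation on the answer rather than its original code; the function name `summarize` and the shortened `data` sample are hypothetical.

```python
from itertools import groupby


def summarize(rows):
    """Return one row per run of equal first-column values."""
    result = []
    for _, group in groupby(rows, key=lambda row: row[0]):
        items = list(group)  # a list can be iterated repeatedly
        earliest_start = min(row[2] for row in items)
        latest_end = max(row[3] for row in items)
        # Orbit and image ID come from the first row of the group
        result.append(items[0][:2] + [earliest_start, latest_end])
    return result


data = [
    ['46288', '514626', '2016-10-26 09:48:26', '2016-10-26 09:48:37'],
    ['46288', '514663', '2016-10-26 09:53:46', '2016-10-26 09:54:57'],
    ['46290', '514628', '2016-10-26 13:12:34', '2016-10-26 13:12:53'],
    ['46290', '514629', '2016-10-26 13:12:54', '2016-10-26 13:13:13'],
]

res = summarize(data)
print(res)
```

With groups this small, holding each group in memory costs little and keeps the loop body easier to read.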