Combining multiple CSVs on a "DateTime" column

Asked: 2021-06-12 13:14:18

Tags: python pandas csv

I have multiple CSV files with different numbers of columns, say csv1 with 42 columns, csv2 with 79 columns and csv3 with 20 columns. They all have a DateTime column, which is the one column shared across all the csv files. I am trying to merge all the files on the DateTime column.

I tried the code below, but it creates a large number of empty columns. Please suggest an efficient solution.

import os
import glob
import pandas as pd
os.chdir("/home/reports")


extension = 'csv'

all_filenames = [i for i in glob.glob('report*.{}'.format(extension))]


#combine all files in the list
combined_csv = pd.concat([pd.read_csv(f, delimiter=';') for f in all_filenames ])

#export to csv
combined_csv.to_csv( "combined_report.csv", index=False, encoding='utf-8-sig')
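For context, a minimal sketch (toy frames, not the asker's data; the column name `DateTime` is assumed) of why this happens: `pd.concat` aligns frames by column name, so every column present in only one file comes back as NaN in the rows contributed by all the other files, while an outer `merge` on the shared date column produces one row per timestamp instead.

```python
from functools import reduce

import pandas as pd

# two toy frames that share only a "DateTime" column (names are illustrative)
a = pd.DataFrame({"DateTime": ["2021-01-01", "2021-01-02"], "a1": [1, 2]})
b = pd.DataFrame({"DateTime": ["2021-01-01", "2021-01-02"], "b1": [3, 4]})

# concat stacks rows: a's rows get NaN in b1, b's rows get NaN in a1
stacked = pd.concat([a, b])
print(int(stacked.isna().sum().sum()))  # 4 empty cells

# an outer merge on the shared column keeps one row per timestamp, no holes
merged = reduce(lambda left, right: pd.merge(left, right, on="DateTime", how="outer"),
                [a, b])
print(merged.shape)  # (2, 3)
```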

2 Answers:

Answer 0 (score: 0)

  • You have described date columns that are common across the CSVs plus independent data fields
  • This is synthesized below: the number of columns and rows is random, and the independent columns are filled with random values
  • The concat/merge requirement is handled with concat() followed by groupby().first()
  • Since there are many columns, the sample output shows only the first independent column from each CSV
  • Inspect the generated CSVs; if they do not match what you described, update the synthesis step
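Stripped of the synthesis, the concat()/groupby().first() pattern from the bullets can be sketched with two tiny hand-made frames (column names here are illustrative):

```python
import pandas as pd

# two tiny frames with a shared "date" column and disjoint data columns
x = pd.DataFrame({"date": ["2020-01-01", "2020-01-02"], "x0": [1.0, 2.0]})
y = pd.DataFrame({"date": ["2020-01-01", "2020-01-02"], "y0": [3.0, 4.0]})

# stacking leaves NaN holes; groupby().first() then takes the first
# non-null value per date, collapsing rows that share a date
out = pd.concat([x, y]).groupby("date", as_index=False).first()
print(out.shape)  # (2, 3) - one row per date, no empty cells
```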
import numpy as np
import pandas as pd
from pathlib import Path

# synthesize what's described - common date columns, independent data fields per CSV
# random number of columns and rows, random values in independent columns, overlapping dates in date columns
d = {
    f"df{i}": pd.DataFrame(
        {
            **{
                f"date{d}": pd.date_range(f"1-jan-{2018+d}", periods=r)
                for d in range(3)
            },
            **{f"{i}_{c2}": np.random.uniform(1, 10, r) for c2 in range(c)},
        }
    )
    for i, (r, c) in enumerate(
        zip(np.random.randint(2, 20, 10), np.random.randint(2, 80, 5))
    )
}

# generate CSVs from synthesized DFs
for df in d.keys():
    d[df].to_csv(Path.cwd().joinpath(f"report_{df}.csv"), index=False)

# now the requirement - concat / merge them.  rows with common dates are merged
pd.concat([pd.read_csv(p) for p in Path.cwd().glob("report_df*.csv")]).groupby(
    ["date0", "date1", "date2"], as_index=False
).first().to_csv(Path.cwd().joinpath("report_combined.csv"), index=False)

# sample output
pd.read_csv(Path.cwd().joinpath("report_combined.csv")).loc[
    :, ["date0", "date1", "date2"] + [f"{x}_0" for x in range(5)]
]

Sample output:

date0 date1 date2 0_0 1_0 2_0 3_0 4_0
0 2018-01-01 2019-01-01 2020-01-01 4.97835 3.71253 5.01434 8.27109 2.99249
1 2018-01-02 2019-01-02 2020-01-02 5.73684 1.20299 1.85132 8.06872 9.11377
2 2018-01-03 2019-01-03 2020-01-03 2.09498 nan 8.0877 5.13207 2.50901
3 2018-01-04 2019-01-04 2020-01-04 9.64076 nan 7.33267 1.15581 7.05995
4 2018-01-05 2019-01-05 2020-01-05 5.27771 nan nan 4.75795 2.85646
5 2018-01-06 2019-01-06 2020-01-06 4.04003 nan nan 3.81245 1.52377
6 2018-01-07 2019-01-07 2020-01-07 nan nan nan nan 4.71341
7 2018-01-08 2019-01-08 2020-01-08 nan nan nan nan 8.18832
8 2018-01-09 2019-01-09 2020-01-09 nan nan nan nan 3.23354
9 2018-01-10 2019-01-10 2020-01-10 nan nan nan nan 8.50481
10 2018-01-11 2019-01-11 2020-01-11 nan nan nan nan 4.75847
11 2018-01-12 2019-01-12 2020-01-12 nan nan nan nan 3.05732
12 2018-01-13 2019-01-13 2020-01-13 nan nan nan nan 4.31586
13 2018-01-14 2019-01-14 2020-01-14 nan nan nan nan 7.94507
14 2018-01-15 2019-01-15 2020-01-15 nan nan nan nan 3.62756
15 2018-01-16 2019-01-16 2020-01-16 nan nan nan nan 1.09299
16 2018-01-17 2019-01-17 2020-01-17 nan nan nan nan 3.85213
17 2018-01-18 2019-01-18 2020-01-18 nan nan nan nan 1.14182

Answer 1 (score: 0)

There is also the long format, which saves you from the empty columns. I have included some code to generate fake data as well.

import pandas as pd
import numpy as np

dates = ["2021-01-01", "2021-02-01", "2021-03-01"]
widths = [3, 2, 1]

buf = []

for i in range(len(dates)):
    date_col = pd.to_datetime(sorted(np.round(np.random.rand(3), 4)), origin=dates[i], unit="D")
    data = np.random.rand(3, widths[i])
    df = pd.DataFrame(data, index=date_col, columns=["col_{0}{1}".format(i, j) for j in range(widths[i])])
    df.index.name = "Date"
    df.index = df.index.floor('s')
    buf.append(df)


We now have a list buf of 3 dataframes whose datetimes are completely different from each other. If we concatenate them as-is, we get empty columns anyway, because neither the datetime indexes nor the column names match. So we must melt each dataframe: stack the data columns into a single one, create a new column to hold the former column name, and repeat the datetime index as needed. Each dataframe then has only 3 columns: DateTime, variable and value, and they can easily be concatenated.

df = pd.concat([df.melt(ignore_index=False) for df in buf])

A sample of the concatenated dataframe, where col_0x comes from the first dataframe, col_1x from the second, and col_2x from the third:

                    variable     value
Date
2021-01-01 15:06:02   col_00  0.656970
2021-01-01 21:23:28   col_00  0.095424
2021-01-01 22:40:30   col_00  0.012732
2021-01-01 15:06:02   col_01  0.950258
2021-01-01 21:23:28   col_01  0.485026
2021-01-01 22:40:30   col_01  0.172216
2021-01-01 15:06:02   col_02  0.553444
2021-01-01 21:23:28   col_02  0.948313
2021-01-01 22:40:30   col_02  0.069310
2021-02-01 05:41:42   col_10  0.649030
2021-02-01 09:30:05   col_10  0.766949
2021-02-01 17:18:48   col_10  0.813431
2021-02-01 05:41:42   col_11  0.915467
2021-02-01 09:30:05   col_11  0.458455
2021-02-01 17:18:48   col_11  0.553780
2021-03-01 03:07:29   col_20  0.399751
2021-03-01 06:42:46   col_20  0.393324
2021-03-01 13:10:33   col_20  0.520106

At the cost of a repeated index and the former column names stored as data, we now have a long-format dataframe that is much easier to visualize later.
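If the wide layout is needed again later, the long frame can be pivoted back; a sketch assuming the `Date`/`variable`/`value` names produced above (the values here are illustrative, not the random ones generated earlier):

```python
import pandas as pd

# a small long-format frame shaped like the one built above
long_df = pd.DataFrame({
    "Date": pd.to_datetime(["2021-01-01", "2021-01-01", "2021-02-01"]),
    "variable": ["col_00", "col_01", "col_10"],
    "value": [0.5, 0.9, 0.6],
})

# each (Date, variable) pair is unique, so a plain pivot restores the wide shape
wide = long_df.pivot(index="Date", columns="variable", values="value")
print(wide.shape)  # (2, 3) - two dates by three former columns
```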