I have multiple CSV files with different numbers of columns — say csv1 has 42 columns, csv2 has 79 and csv3 has 20. Each has a DateTime column whose values are unique within that file. I am trying to merge all the files on the DateTime column.
I tried the code below, but it creates a large number of empty columns. Please suggest an efficient solution.
import os
import glob
import pandas as pd
os.chdir("/home/reports")
extension = 'csv'
all_filenames = [i for i in glob.glob('report*.{}'.format(extension))]
#combine all files in the list
combined_csv = pd.concat([pd.read_csv(f, delimiter=';') for f in all_filenames ])
#export to csv
combined_csv.to_csv( "combined_report.csv", index=False, encoding='utf-8-sig')
Answer 0 (score: 0)
Use concat() and groupby().first().
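The idea can be seen on a tiny example first. This is a minimal sketch with made-up frames (the names a, b and the column names are assumptions, not from the question's files):

```python
import pandas as pd

# two hypothetical frames sharing a date column but carrying different data columns
a = pd.DataFrame({"date0": ["2018-01-01", "2018-01-02"], "x": [1.0, 2.0]})
b = pd.DataFrame({"date0": ["2018-01-01", "2018-01-02"], "y": [3.0, 4.0]})

# concat stacks the rows, leaving x and y each half-empty; groupby().first()
# then keeps the first non-null value per date, collapsing the gaps
merged = pd.concat([a, b]).groupby("date0", as_index=False).first()
print(merged)
```

The same pattern is what the full code below applies to the synthesized CSV files.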
import numpy as np
import pandas as pd
from pathlib import Path

# synthesize what's described - common date columns, independent data fields per CSV
# random number of columns and rows, random values in independent columns, overlapping dates in date columns
d = {
    f"df{i}": pd.DataFrame(
        {
            **{
                f"date{n}": pd.date_range(f"1-jan-{2018+n}", periods=r)
                for n in range(3)
            },
            **{f"{i}_{c2}": np.random.uniform(1, 10, r) for c2 in range(c)},
        }
    )
    for i, (r, c) in enumerate(
        zip(np.random.randint(2, 20, 5), np.random.randint(2, 80, 5))
    )
}
# generate CSVs from synthesized DFs
for df in d.keys():
    d[df].to_csv(Path.cwd().joinpath(f"report_{df}.csv"), index=False)
# now the requirement - concat / merge them. rows with common dates are merged
pd.concat([pd.read_csv(p) for p in Path.cwd().glob("report_df*.csv")]).groupby(
    ["date0", "date1", "date2"], as_index=False
).first().to_csv(Path.cwd().joinpath("report_combined.csv"), index=False)
# sample output
pd.read_csv(Path.cwd().joinpath("report_combined.csv")).loc[
    :, ["date0", "date1", "date2"] + [f"{x}_0" for x in range(5)]
]
| | date0 | date1 | date2 | 0_0 | 1_0 | 2_0 | 3_0 | 4_0 |
|---|---|---|---|---|---|---|---|---|
| 0 | 2018-01-01 | 2019-01-01 | 2020-01-01 | 4.97835 | 3.71253 | 5.01434 | 8.27109 | 2.99249 |
| 1 | 2018-01-02 | 2019-01-02 | 2020-01-02 | 5.73684 | 1.20299 | 1.85132 | 8.06872 | 9.11377 |
| 2 | 2018-01-03 | 2019-01-03 | 2020-01-03 | 2.09498 | nan | 8.0877 | 5.13207 | 2.50901 |
| 3 | 2018-01-04 | 2019-01-04 | 2020-01-04 | 9.64076 | nan | 7.33267 | 1.15581 | 7.05995 |
| 4 | 2018-01-05 | 2019-01-05 | 2020-01-05 | 5.27771 | nan | nan | 4.75795 | 2.85646 |
| 5 | 2018-01-06 | 2019-01-06 | 2020-01-06 | 4.04003 | nan | nan | 3.81245 | 1.52377 |
| 6 | 2018-01-07 | 2019-01-07 | 2020-01-07 | nan | nan | nan | nan | 4.71341 |
| 7 | 2018-01-08 | 2019-01-08 | 2020-01-08 | nan | nan | nan | nan | 8.18832 |
| 8 | 2018-01-09 | 2019-01-09 | 2020-01-09 | nan | nan | nan | nan | 3.23354 |
| 9 | 2018-01-10 | 2019-01-10 | 2020-01-10 | nan | nan | nan | nan | 8.50481 |
| 10 | 2018-01-11 | 2019-01-11 | 2020-01-11 | nan | nan | nan | nan | 4.75847 |
| 11 | 2018-01-12 | 2019-01-12 | 2020-01-12 | nan | nan | nan | nan | 3.05732 |
| 12 | 2018-01-13 | 2019-01-13 | 2020-01-13 | nan | nan | nan | nan | 4.31586 |
| 13 | 2018-01-14 | 2019-01-14 | 2020-01-14 | nan | nan | nan | nan | 7.94507 |
| 14 | 2018-01-15 | 2019-01-15 | 2020-01-15 | nan | nan | nan | nan | 3.62756 |
| 15 | 2018-01-16 | 2019-01-16 | 2020-01-16 | nan | nan | nan | nan | 1.09299 |
| 16 | 2018-01-17 | 2019-01-17 | 2020-01-17 | nan | nan | nan | nan | 3.85213 |
| 17 | 2018-01-18 | 2019-01-18 | 2020-01-18 | nan | nan | nan | nan | 1.14182 |
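An alternative to concat/groupby, for the single shared DateTime column the question actually describes, is a chain of outer merges. This is a minimal sketch with made-up frames (the names df1/df2/df3 and their columns are assumptions standing in for the three CSV files):

```python
import functools
import pandas as pd

# hypothetical small frames standing in for the CSV files;
# each shares a DateTime column and has its own data columns
df1 = pd.DataFrame({"DateTime": ["2021-01-01", "2021-01-02"], "a": [1, 2]})
df2 = pd.DataFrame({"DateTime": ["2021-01-01", "2021-01-03"], "b": [3, 4]})
df3 = pd.DataFrame({"DateTime": ["2021-01-02", "2021-01-03"], "c": [5, 6]})

# outer-merge every frame on the shared key instead of stacking rows,
# so each date ends up on one row with all available columns filled in
combined = functools.reduce(
    lambda left, right: pd.merge(left, right, on="DateTime", how="outer"),
    [df1, df2, df3],
)
print(combined)
```

Rows keep NaN only where a file genuinely has no data for that date, rather than producing whole empty columns.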
Answer 1 (score: 0)
There is also the long format, which saves you from the empty columns. I have also included some code to generate fake data.
import pandas as pd
import numpy as np
from io import StringIO
dates = ["2021-01-01", "2021-02-01", "2021-03-01"]
widths = [3, 2, 1]
buf = []
for i in range(len(dates)):
    date_col = pd.to_datetime(
        sorted(np.round(np.random.rand(3), 4)), origin=dates[i], unit="D"
    )
    data = np.random.rand(3, widths[i])
    df = pd.DataFrame(
        data, index=date_col,
        columns=["col_{0}{1}".format(i, j) for j in range(widths[i])]
    )
    df.index.name = "Date"
    df.index = df.index.floor('s')
    buf.append(df)
Now we have a list buf of 3 dataframes whose datetimes are completely different from one another. If we simply concatenate them, we get empty columns anyway, because neither the datetime indexes nor the column names match. So we must melt each dataframe: the data columns are stacked into a single column, a new column stores the former column name, and the datetime index is repeated as needed. Each dataframe then has just 3 columns (DateTime, variable and value), and they can be concatenated cleanly.
df = pd.concat([df.melt(ignore_index=False) for df in buf])
A sample of the concatenated dataframe, where col_0x comes from the first dataframe, col_1x from the second and col_2x from the third:
variable value
Date
2021-01-01 15:06:02 col_00 0.656970
2021-01-01 21:23:28 col_00 0.095424
2021-01-01 22:40:30 col_00 0.012732
2021-01-01 15:06:02 col_01 0.950258
2021-01-01 21:23:28 col_01 0.485026
2021-01-01 22:40:30 col_01 0.172216
2021-01-01 15:06:02 col_02 0.553444
2021-01-01 21:23:28 col_02 0.948313
2021-01-01 22:40:30 col_02 0.069310
2021-02-01 05:41:42 col_10 0.649030
2021-02-01 09:30:05 col_10 0.766949
2021-02-01 17:18:48 col_10 0.813431
2021-02-01 05:41:42 col_11 0.915467
2021-02-01 09:30:05 col_11 0.458455
2021-02-01 17:18:48 col_11 0.553780
2021-03-01 03:07:29 col_20 0.399751
2021-03-01 06:42:46 col_20 0.393324
2021-03-01 13:10:33 col_20 0.520106
At the cost of a repeated index and the former column names being stored as data, we now have a long-format dataframe, which is much easier to visualize later.
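If a wide layout is needed again later, the long frame can be pivoted back. This is a sketch with made-up values in the same Date/variable/value shape as the output above:

```python
import pandas as pd

# hypothetical long-format frame like the concatenated result above
long_df = pd.DataFrame({
    "Date": ["2021-01-01", "2021-01-01", "2021-02-01"],
    "variable": ["col_00", "col_01", "col_10"],
    "value": [0.1, 0.2, 0.3],
})

# pivot restores one column per former variable name;
# dates a variable never appeared on come back as NaN
wide = long_df.pivot_table(index="Date", columns="variable", values="value")
print(wide)
```

So the long format loses nothing; it just defers the sparsity until (and unless) you actually need the wide view.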