I have multiple CSV files with different numbers of columns — say csv1 has 42 columns, csv2 has 79 and csv3 has 20. Each has a DateTime column whose values are unique within that file. I am trying to merge all the files on the DateTime column.
I tried the code below, but it creates a large number of empty columns. Please suggest an efficient solution.
import os
import glob
import pandas as pd
os.chdir("/home/reports")
extension = 'csv'
all_filenames = [i for i in glob.glob('report*.{}'.format(extension))]
#combine all files in the list
combined_csv = pd.concat([pd.read_csv(f, delimiter=';') for f in all_filenames ])
#export to csv
combined_csv.to_csv( "combined_report.csv", index=False, encoding='utf-8-sig')
Answer 0 (score: 0)
Use concat() and groupby().first().
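The idea can be seen on a tiny example first. This is a minimal sketch with made-up frames (the names a, b and the column names are assumptions, not from the question's files):

```python
import pandas as pd

# two hypothetical frames sharing a date column but carrying different data columns
a = pd.DataFrame({"date0": ["2018-01-01", "2018-01-02"], "x": [1.0, 2.0]})
b = pd.DataFrame({"date0": ["2018-01-01", "2018-01-02"], "y": [3.0, 4.0]})

# concat stacks the rows, leaving x and y each half-empty; groupby().first()
# then keeps the first non-null value per date, collapsing the gaps
merged = pd.concat([a, b]).groupby("date0", as_index=False).first()
print(merged)
```

The same pattern is what the full code below applies to the synthesized CSV files.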
import numpy as np
import pandas as pd
from pathlib import Path

# synthesize what's described - common date columns, independent data fields per CSV
# random number of columns and rows, random values in independent columns, overlapping dates in date columns
d = {
    f"df{i}": pd.DataFrame(
        {
            **{
                f"date{n}": pd.date_range(f"1-jan-{2018+n}", periods=r)
                for n in range(3)
            },
            **{f"{i}_{c2}": np.random.uniform(1, 10, r) for c2 in range(c)},
        }
    )
    for i, (r, c) in enumerate(
        zip(np.random.randint(2, 20, 5), np.random.randint(2, 80, 5))
    )
}
# generate CSVs from synthesized DFs
for df in d.keys():
    d[df].to_csv(Path.cwd().joinpath(f"report_{df}.csv"), index=False)
# now the requirement - concat / merge them. rows with common dates are merged
pd.concat([pd.read_csv(p) for p in Path.cwd().glob("report_df*.csv")]).groupby(
    ["date0", "date1", "date2"], as_index=False
).first().to_csv(Path.cwd().joinpath("report_combined.csv"), index=False)
# sample output
pd.read_csv(Path.cwd().joinpath("report_combined.csv")).loc[
    :, ["date0", "date1", "date2"] + [f"{x}_0" for x in range(5)]
]
| | date0 | date1 | date2 | 0_0 | 1_0 | 2_0 | 3_0 | 4_0 |
|---|---|---|---|---|---|---|---|---|
| 0 | 2018-01-01 | 2019-01-01 | 2020-01-01 | 4.97835 | 3.71253 | 5.01434 | 8.27109 | 2.99249 |
| 1 | 2018-01-02 | 2019-01-02 | 2020-01-02 | 5.73684 | 1.20299 | 1.85132 | 8.06872 | 9.11377 |
| 2 | 2018-01-03 | 2019-01-03 | 2020-01-03 | 2.09498 | nan | 8.0877 | 5.13207 | 2.50901 |
| 3 | 2018-01-04 | 2019-01-04 | 2020-01-04 | 9.64076 | nan | 7.33267 | 1.15581 | 7.05995 |
| 4 | 2018-01-05 | 2019-01-05 | 2020-01-05 | 5.27771 | nan | nan | 4.75795 | 2.85646 |
| 5 | 2018-01-06 | 2019-01-06 | 2020-01-06 | 4.04003 | nan | nan | 3.81245 | 1.52377 |
| 6 | 2018-01-07 | 2019-01-07 | 2020-01-07 | nan | nan | nan | nan | 4.71341 |
| 7 | 2018-01-08 | 2019-01-08 | 2020-01-08 | nan | nan | nan | nan | 8.18832 |
| 8 | 2018-01-09 | 2019-01-09 | 2020-01-09 | nan | nan | nan | nan | 3.23354 |
| 9 | 2018-01-10 | 2019-01-10 | 2020-01-10 | nan | nan | nan | nan | 8.50481 |
| 10 | 2018-01-11 | 2019-01-11 | 2020-01-11 | nan | nan | nan | nan | 4.75847 |
| 11 | 2018-01-12 | 2019-01-12 | 2020-01-12 | nan | nan | nan | nan | 3.05732 |
| 12 | 2018-01-13 | 2019-01-13 | 2020-01-13 | nan | nan | nan | nan | 4.31586 |
| 13 | 2018-01-14 | 2019-01-14 | 2020-01-14 | nan | nan | nan | nan | 7.94507 |
| 14 | 2018-01-15 | 2019-01-15 | 2020-01-15 | nan | nan | nan | nan | 3.62756 |
| 15 | 2018-01-16 | 2019-01-16 | 2020-01-16 | nan | nan | nan | nan | 1.09299 |
| 16 | 2018-01-17 | 2019-01-17 | 2020-01-17 | nan | nan | nan | nan | 3.85213 |
| 17 | 2018-01-18 | 2019-01-18 | 2020-01-18 | nan | nan | nan | nan | 1.14182 |
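An alternative to concat/groupby, for the single shared DateTime column the question actually describes, is a chain of outer merges. This is a minimal sketch with made-up frames (the names df1/df2/df3 and their columns are assumptions standing in for the three CSV files):

```python
import functools
import pandas as pd

# hypothetical small frames standing in for the CSV files;
# each shares a DateTime column and has its own data columns
df1 = pd.DataFrame({"DateTime": ["2021-01-01", "2021-01-02"], "a": [1, 2]})
df2 = pd.DataFrame({"DateTime": ["2021-01-01", "2021-01-03"], "b": [3, 4]})
df3 = pd.DataFrame({"DateTime": ["2021-01-02", "2021-01-03"], "c": [5, 6]})

# outer-merge every frame on the shared key instead of stacking rows,
# so each date ends up on one row with all available columns filled in
combined = functools.reduce(
    lambda left, right: pd.merge(left, right, on="DateTime", how="outer"),
    [df1, df2, df3],
)
print(combined)
```

Rows keep NaN only where a file genuinely has no data for that date, rather than producing whole empty columns.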
Answer 1 (score: 0)
There is also the long format, which saves you from the empty columns. I have also included some code to generate fake data.
import pandas as pd
import numpy as np
from io import StringIO
dates = ["2021-01-01", "2021-02-01", "2021-03-01"]
widths = [3, 2, 1]
buf = []
for i in range(len(dates)):
    date_col = pd.to_datetime(
        sorted(np.round(np.random.rand(3), 4)), origin=dates[i], unit="D"
    )
    data = np.random.rand(3, widths[i])
    df = pd.DataFrame(
        data, index=date_col,
        columns=["col_{0}{1}".format(i, j) for j in range(widths[i])]
    )
    df.index.name = "Date"
    df.index = df.index.floor('s')
    buf.append(df)
Now we have a list buf of 3 dataframes whose datetimes are completely different from one another. If we simply concatenate them, we get empty columns anyway, because neither the datetime indexes nor the column names match. So we must melt each dataframe: the data columns are stacked into a single column, a new column stores the former column name, and the datetime index is repeated as needed. Each dataframe then has just 3 columns (DateTime, variable and value), and they can be concatenated cleanly.
df = pd.concat([df.melt(ignore_index=False) for df in buf])
A sample of the concatenated dataframe, where col_0x comes from the first dataframe, col_1x from the second and col_2x from the third:
variable value
Date
2021-01-01 15:06:02 col_00 0.656970
2021-01-01 21:23:28 col_00 0.095424
2021-01-01 22:40:30 col_00 0.012732
2021-01-01 15:06:02 col_01 0.950258
2021-01-01 21:23:28 col_01 0.485026
2021-01-01 22:40:30 col_01 0.172216
2021-01-01 15:06:02 col_02 0.553444
2021-01-01 21:23:28 col_02 0.948313
2021-01-01 22:40:30 col_02 0.069310
2021-02-01 05:41:42 col_10 0.649030
2021-02-01 09:30:05 col_10 0.766949
2021-02-01 17:18:48 col_10 0.813431
2021-02-01 05:41:42 col_11 0.915467
2021-02-01 09:30:05 col_11 0.458455
2021-02-01 17:18:48 col_11 0.553780
2021-03-01 03:07:29 col_20 0.399751
2021-03-01 06:42:46 col_20 0.393324
2021-03-01 13:10:33 col_20 0.520106
At the cost of a repeated index and the former column names being stored as data, we now have a long-format dataframe, which is much easier to visualize later.
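If a wide layout is needed again later, the long frame can be pivoted back. This is a sketch with made-up values in the same Date/variable/value shape as the output above:

```python
import pandas as pd

# hypothetical long-format frame like the concatenated result above
long_df = pd.DataFrame({
    "Date": ["2021-01-01", "2021-01-01", "2021-02-01"],
    "variable": ["col_00", "col_01", "col_10"],
    "value": [0.1, 0.2, 0.3],
})

# pivot restores one column per former variable name;
# dates a variable never appeared on come back as NaN
wide = long_df.pivot_table(index="Date", columns="variable", values="value")
print(wide)
```

So the long format loses nothing; it just defers the sparsity until (and unless) you actually need the wide view.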