Concatenating dataframes adds extra columns

Time: 2020-02-09 17:50:59

Tags: python pandas dataframe

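I am trying to build one combined dataframe from a series of 12 separate CSVs (the 12 months of a year). All of the CSVs share the same format and column layout.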

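When I first ran it, it appeared to work and left me with a combined dataframe of 6 columns, as expected. On closer inspection, I found that the header row of every file had been read in as actual data, so I had some bad rows to eliminate. I could make those changes manually, but I would like the code to handle it automatically.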

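To that end, I updated the code so that it reads the first CSV with its header and the remaining CSVs without a header, then concatenates everything. BUT the result now has 12 columns instead of 6, which is clearly not what I want (see the image below).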

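The code is similar to before; the only difference is that I pass header=None to pd.read_csv() for the 11 CSVs after the first one (for the first CSV I do not use that parameter). Can anyone give me a hint as to why I get 12 columns (with the data positioned as described above) when I run this code? The layout of the CSV files is shown below.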

Any help is appreciated.

[Image: screenshot of the combined dataframe]

[Image: layout of the CSV files]

import pandas as pd
import numpy as np
import os

# Need to include the header row only for the first csv (otherwise header row will be included
# for each read csv, which places improperly formatted rows into the combined dataframe).
totrows = 0

# Get list of csv files to read.
files = os.listdir('c:/data/datasets')

# Read the first csv file, including the header row.
dfSD = pd.read_csv('c:/data/datasets/' + files[0], skip_blank_lines=True)

# Now read the remaining csv files (without header row) and concatenate their values
# into our full Sales Data dataframe.
for file in files[1:]:
    df = pd.read_csv('c:/data/datasets/' + file, skip_blank_lines=True, header=None)
    dfSD = pd.concat([dfSD, df])
    totrows += df.shape[0]
    print(file + " == " + str(df.shape[0]) + " rows")               

print()
print("TOTAL ROWS = " + str(totrows + pd.read_csv('c:/data/datasets/' + files[0]).shape[0]))

1 Answer:

Answer 0 (score: 0)

Here is a simple solution.

import pandas as pd
import numpy as np
import os

totrows = 0

files = os.listdir('c:/data/datasets')

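# Read every file with its own header row (the default), so no header line ends up as data.
# Use None as the "not set yet" marker: `not columns` raises ValueError once columns is a pandas Index.
columns = None
dfSD = []
for file in files:
    df = pd.read_csv('c:/data/datasets/' + file, skip_blank_lines=True)
    # Remember the column names from the first file and reuse them for the rest.
    if columns is None:
        columns = df.columns
    df.columns = columns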

    dfSD.append(df)

    totrows += df.shape[0]
    print(file + " == " + str(df.shape[0]) + " rows")               

dfSD = pd.concat(dfSD, axis = 0)

dfSD = dfSD.reset_index(drop = True)
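As for why the original code ends up with 12 columns: pd.concat aligns frames by column label, and a frame read with header=None gets integer labels (0, 1, 2, ...) that do not match the names taken from the first file, so the result holds the union of both label sets. Below is a minimal sketch of that effect, using two small in-memory CSVs. The column names and values are made up for illustration and are not the asker's data.

```python
import io
import pandas as pd

# Two tiny stand-ins for the monthly files (hypothetical columns and values).
jan = io.StringIO("order_id,product,qty\n1,widget,2\n2,gadget,1\n")
feb = io.StringIO("order_id,product,qty\n3,widget,5\n")

first = pd.read_csv(jan)               # labels come from the header line
rest = pd.read_csv(feb, header=None)   # labels are 0, 1, 2; the header line becomes a data row

# concat matches columns by label, so named and integer labels never line up.
combined = pd.concat([first, rest])
print(combined.columns.tolist())
print(combined.shape)
```

With three named columns in the sketch, the combined frame has six; with the six columns described in the question, the same mechanism gives twelve. Note also that header=None keeps the header line of each later file as an ordinary row, which is the unwanted-row problem the question started with.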

Another possibility is:

import pandas as pd
import numpy as np
import os

# Need to include the header row only for the first csv (otherwise header row will be included
# for each read csv, which places improperly formatted rows into the combined dataframe).
totrows = 0

# Get list of csv files to read.
files = os.listdir('c:/data/datasets')

# Read the first csv file, including the header row.
dfSD = pd.read_csv('c:/data/datasets/' + files[0], skip_blank_lines=True)
df_comb = [dfSD]
# Now read the remaining csv files (without header row) and concatenate their values
# into our full Sales Data dataframe.
for file in files[1:]:
    df = pd.read_csv('c:/data/datasets/' + file, skip_blank_lines=True, header=None)

    df.columns = dfSD.columns
    df_comb.append(df)
    totrows += df.shape[0]
    print(file + " == " + str(df.shape[0]) + " rows")

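# df_comb is already a list of dataframes; wrapping it in another list makes pd.concat raise a TypeError.
dfSD = pd.concat(df_comb, axis = 0).reset_index(drop = True)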
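If every file really does start with the same header line, a more compact variant is to let each read_csv call consume its own header and stack the results in one step. The sketch below assumes exactly that. The helper name combine_csvs and the sample data are hypothetical, and in-memory buffers stand in for the real files so the example runs on its own.

```python
import io
import pandas as pd

def combine_csvs(sources):
    """Read each CSV with its own header row and stack the rows.

    `sources` is any iterable of paths or file-like objects. The helper
    name is hypothetical; it is not part of pandas.
    """
    frames = [pd.read_csv(src, skip_blank_lines=True) for src in sources]
    # ignore_index=True renumbers the rows 0..n-1, like reset_index(drop=True).
    return pd.concat(frames, ignore_index=True)

# Stand-ins for two monthly files (made-up columns and values).
months = [
    io.StringIO("order_id,product,qty\n1,widget,2\n2,gadget,1\n"),
    io.StringIO("order_id,product,qty\n3,widget,5\n"),
]
combined = combine_csvs(months)
print(combined.shape)
```

With real files, the same helper could be given full paths, for example ones built with os.path.join from the os.listdir result used above.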