Question

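I have six .csv files. Together they are about 4 GB. I need to clean each one and run some data analysis tasks on it. The operations are the same for every frame. This is the code I use to read them.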

#df = pd.read_csv(r"yellow_tripdata_2018-01.csv")
#df = pd.read_csv(r"yellow_tripdata_2018-02.csv")
#df = pd.read_csv(r"yellow_tripdata_2018-03.csv")
#df = pd.read_csv(r"yellow_tripdata_2018-04.csv")
#df = pd.read_csv(r"yellow_tripdata_2018-05.csv")
df = pd.read_csv(r"yellow_tripdata_2018-06.csv")

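Each time I run the kernel, I uncomment the one file I want to read. I'm looking for a more elegant way to do this. I thought about using a for loop: make a list of file names and read them one after another. But I don't want to merge the files together, so I think there must be another approach. I have been searching, but every question I find ends up concatenating the files after reading them.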

Answer 1

You can use a list to hold all the dataframes:

import pandas as pd

number_of_files = 6
dfs = []

# The files are numbered 01 through 06, so start the range at 1.
# f-strings need Python 3.6+; on older versions use .format() instead.
for file_num in range(1, number_of_files + 1):
    dfs.append(pd.read_csv(f"yellow_tripdata_2018-0{file_num}.csv"))

Then, to use a specific dataframe:

df1 = dfs[0]

Edit:

Since you are trying to avoid loading all of this into memory, I would stream the files instead. Try changing the for loop to something like this:

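import csv

dfs = []  # now holds csv readers instead of dataframes
for file_num in range(1, number_of_files + 1):
    # Open for reading ('r', not 'wb') and keep the file open while its reader is in use
    f = open(f"yellow_tripdata_2018-0{file_num}.csv", "r", newline="")
    dfs.append(csv.reader(f))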

Then use a for loop over dfs[n], or call next(dfs[n]), to read one row at a time into memory.
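As a minimal sketch of the same streaming idea, the generator below yields one row at a time and closes the file itself once the rows run out. The name iter_rows is hypothetical (it is not part of the csv module), and the usage comments assume the file has a header row.

```python
import csv


def iter_rows(path):
    """Yield the rows of a CSV file one at a time (hypothetical helper)."""
    # newline="" is what the csv module documentation recommends for files
    # that are handed to csv.reader.
    with open(path, "r", newline="") as f:
        for row in csv.reader(f):
            yield row


# Usage (assumes the file exists and starts with a header row):
# rows = iter_rows("yellow_tripdata_2018-01.csv")
# header = next(rows)
# for row in rows:
#     ...  # clean or inspect one row
```

Only the current row is held in memory, so this scales to files that do not fit in RAM, at the cost of losing pandas' column-wise operations.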

PS

You may need multithreading if you want to iterate over all of the files at the same time.
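One way to sketch the multithreading suggestion is with concurrent.futures from the standard library. The function names and the row-counting task are placeholders chosen for illustration; threads mainly help while a file is waiting on disk I/O, so the speed-up for CPU-heavy cleaning may be small.

```python
import csv
from concurrent.futures import ThreadPoolExecutor


def count_rows(path):
    """Count the data rows in one CSV file, excluding the header (placeholder task)."""
    with open(path, "r", newline="") as f:
        reader = csv.reader(f)
        next(reader, None)  # skip the header row, if there is one
        return sum(1 for _ in reader)


def process_all(paths, max_workers=4):
    """Run count_rows on every path in a thread pool; returns {path: row_count}."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as the input paths
        return dict(zip(paths, pool.map(count_rows, paths)))


# paths = [f"yellow_tripdata_2018-0{i}.csv" for i in range(1, 7)]
# counts = process_all(paths)
```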

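Load/edit/save: using the csv module

Okay, so I did a lot of research: Python's csv module does load one row at a time, most likely because of the mode we open the file in.

If you don't want to use Pandas (and honestly, chunking may well be the answer; you could implement it in @seralouk's answer), then yes, this will do! In my opinion the following is the best approach; we only need to change a few things.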

import csv

number_of_files = 6
filename = "yellow_tripdata_2018-{}.csv"

for file_num in range(1, number_of_files + 1):
    month = str(file_num).zfill(2)  # "01" ... "06"
    # notice I'm opening the original file as f in mode 'r' for read only
    # and the new file as nf in mode 'a' for append
    with open(filename.format(month), 'r', newline='') as f, \
         open(filename.format(month + "-new"), 'a', newline='') as nf:
        # initialize the writer before looping over every line
        w = csv.writer(nf)
        for row in csv.reader(f):
            # do your "data cleaning" here (THIS IS PER-LINE, REMEMBER)
            # then save the row to the new file
            w.writerow(row)

Note:

You may want to consider using DictReader and/or DictWriter. I prefer them over the plain reader and writer because I find them easier to understand.
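To illustrate the DictReader/DictWriter variant, here is a small sketch. The helper names clean_file and clean_row are made up for this example, and the column name trip_distance in the usage comment is an assumption about the data, not something stated above.

```python
import csv


def clean_file(src, dst, clean_row):
    """Copy src to dst, passing each row (a dict keyed by column name) through clean_row."""
    with open(src, "r", newline="") as f, open(dst, "w", newline="") as nf:
        reader = csv.DictReader(f)
        # Reuse the header of the source file for the output file
        writer = csv.DictWriter(nf, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            cleaned = clean_row(row)
            if cleaned is not None:  # returning None drops the row
                writer.writerow(cleaned)


# Usage (trip_distance is an assumed column name):
# def drop_empty_trips(row):
#     return row if float(row["trip_distance"]) > 0 else None
#
# clean_file("yellow_tripdata_2018-01.csv", "yellow_tripdata_2018-01-new.csv", drop_empty_trips)
```

Because rows are addressed by column name rather than position, the cleaning function keeps working if the column order changes.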

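Pandas approach: using chunks

PLEASE READ this answer if you would like to steer away from my csv approach and stick with Pandas :) It is literally the same issue as yours, and the answer is what you are asking for.

Basically, Pandas lets you load a file partially, in chunks, apply any changes, and then write those chunks to a new file. The code below is mostly from that answer, but I did some further reading in the docs.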

import pandas as pd

number_of_files = 6
chunksize = 500  # find the chunksize that works best for you
filename = "yellow_tripdata_2018-{}.csv"

for file_num in range(1, number_of_files + 1):
    month = str(file_num).zfill(2)  # "01" ... "06"
    for i, chunk in enumerate(pd.read_csv(filename.format(month), chunksize=chunksize)):
        # Do your data cleaning
        # Append mode builds the new file chunk by chunk; write the header only with the first chunk
        chunk.to_csv(filename.format(month + "-new"), mode='a', header=(i == 0), index=False)

For more information on chunking the data, see that answer as well; it is good reading for anyone who is getting headaches over these memory issues.
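Chunking also works for analysis, not only for rewriting files. As a rough sketch, the function below computes the mean of one numeric column by accumulating a sum and a count across chunks. The name column_mean and the default chunk size are assumptions made for this example.

```python
import pandas as pd


def column_mean(path, column, chunksize=100_000):
    """Mean of one numeric column, computed chunk by chunk (hypothetical helper)."""
    total = 0.0
    count = 0
    # usecols limits parsing to the one column we need
    for chunk in pd.read_csv(path, usecols=[column], chunksize=chunksize):
        values = chunk[column].dropna()
        total += values.sum()
        count += len(values)
    return total / count if count else float("nan")


# Usage (trip_distance is an assumed column name):
# mean_distance = column_mean("yellow_tripdata_2018-01.csv", "trip_distance")
```

Statistics that can be built from running totals (sum, count, min, max) fit this pattern; ones that need all values at once, such as an exact median, do not.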

Answer 2

Use `for` and `format` like this. I use this every day:

import pandas as pd

number_of_files = 6

for i in range(1, number_of_files+1):
    df = pd.read_csv("yellow_tripdata_2018-0{}.csv".format(i))

    #your code here, do analysis and then the loop will return and read the next dataframe
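If the per-file work grows beyond a few lines, one option is to move it into a function and keep only the small results, keyed by file name. This is a sketch under assumptions: clean_and_analyze and run_all are hypothetical names, and dropna plus describe merely stand in for the real cleaning and analysis.

```python
import pandas as pd


def clean_and_analyze(df):
    """Placeholder for the per-file work: drop incomplete rows, then summarize."""
    df = df.dropna()
    return df.describe()


def run_all(paths):
    """Apply clean_and_analyze to each file in turn; returns {path: summary}."""
    results = {}
    for path in paths:
        df = pd.read_csv(path)  # only one file's data is bound to df at a time
        results[path] = clean_and_analyze(df)
    return results


# paths = ["yellow_tripdata_2018-0{}.csv".format(i) for i in range(1, 7)]
# summaries = run_all(paths)
```

Nothing is concatenated here: each file is read, processed, and replaced by the next one, and only the summaries are kept.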

Answer 3

Use glob.glob to get all the files with similar names:

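import glob
import pandas as pd

files = glob.glob("yellow_tripdata_2018-0?.csv")
for f in files:
    df = pd.read_csv(f)
    # manipulate df
    df.to_csv(f, index=False)  # overwrites the original file; index=False avoids adding an index column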

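This matches yellow_tripdata_2018-0<any one character>.csv. You can also use yellow_tripdata_2018-0*.csv to match yellow_tripdata_2018-0<anything>.csv, or even yellow_tripdata_*.csv to match every csv file whose name starts with yellow_tripdata.

Note that this also loads only one file at a time.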
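Two details of the loop above may matter in practice: glob returns matches in arbitrary order, and writing back to the same path replaces the raw data. A possible sketch using pathlib is shown below; the helper names and the idea of a separate output folder are assumptions for illustration.

```python
from pathlib import Path


def matching_files(folder, pattern="yellow_tripdata_2018-0?.csv"):
    """Return the files in folder that match pattern, sorted by name."""
    return sorted(Path(folder).glob(pattern))


def output_path_for(src, out_dir):
    """Return the path inside out_dir that has the same file name as src."""
    return Path(out_dir) / Path(src).name


# Usage ("cleaned" is a hypothetical output folder that must already exist):
# for src in matching_files("."):
#     df = pd.read_csv(src)
#     # manipulate df
#     df.to_csv(output_path_for(src, "cleaned"), index=False)
```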

Answer 4

Use os.listdir() to build a list of files that you can loop over?

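import os
import pandas as pd

samplefiles = os.listdir(filepath)
for filename in samplefiles:
    # os.listdir returns bare file names, so join each one with the directory
    df = pd.read_csv(os.path.join(filepath, filename))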

Here filepath is the directory that contains the csv files.

Or use a loop that changes the file name:

for i in range(1, 7):
    df = pd.read_csv(r"yellow_tripdata_2018-0%s.csv" % i)

Answer 5

# import libraries
import pandas as pd
import glob

# store the project folder in a variable
# (a raw string cannot end with a backslash, so the trailing one is dropped)
project_folder = r"C:\file_path"

# save all file paths in a variable
all_files_paths = glob.glob(project_folder + "/*.csv")

# read each file into a dataframe and collect the dataframes in a list
li = [pd.read_csv(filename, index_col=None, header=0) for filename in all_files_paths]

# concatenate the list into a single pandas dataframe
df = pd.concat(li, axis=0, ignore_index=True)
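If you do concatenate as above but still want to tell the files apart afterwards, one option is to tag every row with its file of origin before concatenating. This is only a sketch: the function name concat_with_source and the column name source_file are made up for this example.

```python
import pandas as pd


def concat_with_source(frames_by_name):
    """Concatenate dataframes, recording each row's origin in a source_file column."""
    tagged = [df.assign(source_file=name) for name, df in frames_by_name.items()]
    return pd.concat(tagged, axis=0, ignore_index=True)


# Usage (all_files_paths as built by glob above):
# frames = {path: pd.read_csv(path) for path in all_files_paths}
# df = concat_with_source(frames)
# per_file = df.groupby("source_file").size()
```

Note that this still loads every file into memory at once, so it only suits data that fits in RAM.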

Apply the same operation to multiple .csv files in pandas
