
Question

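I have a large student dataset in which every student has their own CSV file. Dataset B contains 297,444 CSV files, and I want to find out which student CSV files are missing from it.

As you can see in this image, the file u2.csv does not exist in the dataset. How can I use pandas to check for all of the missing CSV files?

This is the code I have tried so far: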

import os
import glob

import numpy as np
import pandas as pd

path = r'C:/Users/user1/Desktop/EDNET DATA/EdNet-KT4/KT4' # use your path
all_files = glob.glob(path + "/*.csv")

li = []

for filename in all_files:
    # Use the file name without its extension (e.g. "u1") as the user ID
    user_id = os.path.splitext(os.path.basename(filename))[0]
    df = pd.read_csv(filename, sep=',', index_col=None, header=0).assign(user_iD=user_id)
    li.append(df)

data = pd.concat(li, axis=0, ignore_index=True)
df = data.copy()

df.isnull().sum()

df.to_feather('KT4.ftr')
data1= pd.read_feather('KT4.ftr')
data1.head()

Answer 1

Solution

Note: You only need the list of file names. The code you posted reads the contents of every file instead, which is not what you want here.

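You can use either of the two methods below. For reproducibility, I created some dummy data and tested the solution on Google Colab. In my test, the pandas version (Method 2) turned out to be faster.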
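As a small illustration of the point that only the file names are needed, the names can be collected from the directory listing without opening any file. This is a minimal sketch, not part of the original answer; the helper name `list_csv_names` and its `directory` parameter are hypothetical.

```python
import os


def list_csv_names(directory):
    """Return the sorted names of the .csv files in `directory`.

    Only the directory listing is used; no file is opened or parsed.
    """
    with os.scandir(directory) as entries:
        return sorted(
            entry.name
            for entry in entries
            if entry.is_file() and entry.name.endswith(".csv")
        )
```

Calling `list_csv_names(path)` would give names such as `u0.csv`, which is all either method below needs as input.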

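Common Code

import glob
import pandas as pd  # needed for Method 2

# `path` is the folder that holds the student CSV files
# (set to 'test' where the dummy data is created).
all_files = glob.glob(path + "/*.csv")

# I am deliberately using a small number
#   of students to test the code.
num_students = 20 # 297444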

Method 1: Simple Python Loop

For 100,000 files, this took about 1 min 29 s on Google Colab.
Run the following in a Jupyter notebook cell.

%%time
missing_files = []

for i in range(num_students):
    student_file = f'u{i}.csv'
    if f'{path}/{student_file}' not in all_files:
        missing_files.append(student_file)

#print(f"Total missing: {len(missing_files)}")
#print(missing_files)

## Runtime
# CPU times: user 1min 29s, sys: 0 ns, total: 1min 29s
# Wall time: 1min 29s
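Much of the time in Method 1 is likely spent in the `not in all_files` test, because a membership test on a Python list scans the list element by element. The following is a minimal sketch of the same loop with the paths stored in a set first. The function name `find_missing_files` is hypothetical, and no timing was measured for this variant.

```python
def find_missing_files(all_files, path, num_students):
    """Return the expected 'u<i>.csv' names that are absent from all_files."""
    existing = set(all_files)  # set membership is O(1) on average
    missing_files = []
    for i in range(num_students):
        student_file = f"u{i}.csv"
        if f"{path}/{student_file}" not in existing:
            missing_files.append(student_file)
    return missing_files


# Example with hand-written paths (hypothetical data)
paths = ["test/u0.csv", "test/u1.csv", "test/u3.csv"]
print(find_missing_files(paths, "test", 5))  # ['u2.csv', 'u4.csv']
```

The logic is unchanged from Method 1; only the container used for the lookup differs.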

Method 2: Using the pandas Library (Faster)

For 100,000 files, this took about 358 ms on Google Colab.
That is almost 250 times faster than Method 1.
Run the following in a Jupyter notebook cell.

%%time
import pandas as pd

existing_student_ids = (
    pd.DataFrame({'Filename': all_files})
      # raw f-string so that \d and \. reach the regex engine unchanged
      .Filename.str.extract(rf'{path}/u(?P<StudentID>\d+)\.csv')
      .astype(int)
      .sort_values('StudentID')
      .StudentID.to_list()
)

missing_student_ids = sorted(set(range(num_students)) - set(existing_student_ids))

# print(f"Total missing students: {len(missing_student_ids)}")
# print(f'missing_student_ids: {missing_student_ids}')

## Runtime
# CPU times: user 323 ms, sys: 31.1 ms, total: 354 ms
# Wall time: 358 ms
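The same extract-then-subtract idea also works without pandas. Below is a rough sketch using `re` and a set difference. `missing_ids` is a hypothetical helper, file names are assumed to follow `u<number>.csv`, and paths that do not match that pattern are ignored rather than raising an error.

```python
import os
import re

# Assumption: file names follow the pattern u<number>.csv
ID_PATTERN = re.compile(r"u(\d+)\.csv")


def missing_ids(all_files, num_students):
    """Return the sorted student IDs in range(num_students) without a file."""
    existing = set()
    for file_path in all_files:
        match = ID_PATTERN.fullmatch(os.path.basename(file_path))
        if match:  # skip anything that is not a student file
            existing.add(int(match.group(1)))
    return sorted(set(range(num_students)) - existing)


print(missing_ids(["test/u0.csv", "test/u2.csv", "test/readme.txt"], 4))  # [1, 3]
```

Matching on the base name instead of the full path avoids having to embed `path` in the regular expression.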

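Dummy Data

Here I define some dummy data so that the solution is reproducible and easy to test.

I will skip the following student IDs (skip_student_ids) and will not create any .csv files for them.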

import os

NUM_STUDENTS = 20

## CREATE FILE NAMES
num_students = NUM_STUDENTS
skip_student_ids = [3, 8, 10, 13] ## --> we will skip these student-ids
skip_files = [f'u{i}.csv' for i in skip_student_ids]
all_files = [f'u{i}.csv' for i in range(num_students) if i not in skip_student_ids]

if num_students <= 20:
    print(f'skip_files: {skip_files}')
    print(f'all_files: {all_files}')

## CREATE FILES
path = 'test'
os.makedirs(path, exist_ok=True)
for filename in all_files:
    with open(path + '/' + filename, 'w') as f:
        student_id = str(filename).split(".")[0].replace('u', '')
        # Write a header row and one data row, without stray indentation
        content = f"Filename,StudentID\n{filename},{student_id}\n"
        f.write(content)
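To check that a method recovers exactly the skipped IDs, the dummy-data idea can be wrapped in a small self-test. This is a sketch under the same naming assumption (`u<number>.csv`); it uses a temporary directory rather than `test`, and the function names `make_dummy_files` and `find_missing_ids` are hypothetical.

```python
import glob
import os
import tempfile


def make_dummy_files(directory, num_students, skip_student_ids):
    """Create a small u<i>.csv file for every ID that is not skipped."""
    for i in range(num_students):
        if i not in skip_student_ids:
            with open(os.path.join(directory, f"u{i}.csv"), "w") as f:
                f.write(f"Filename,StudentID\nu{i}.csv,{i}\n")


def find_missing_ids(directory, num_students):
    """Return the sorted IDs in range(num_students) that have no file."""
    existing = {
        os.path.basename(p) for p in glob.glob(os.path.join(directory, "*.csv"))
    }
    return [i for i in range(num_students) if f"u{i}.csv" not in existing]


with tempfile.TemporaryDirectory() as tmp:
    skipped = [3, 8, 10, 13]
    make_dummy_files(tmp, 20, skipped)
    result = find_missing_ids(tmp, 20)

print(result)  # [3, 8, 10, 13]
```

The temporary directory is deleted when the `with` block exits, so the check leaves no files behind.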
