Kernel dies when reading a large CSV file

Date: 2019-06-25 15:15:49

Tags: python-3.x pandas dataframe jupyter-notebook dask

I cannot read a 28-million-row file with pandas in a Jupyter notebook: the kernel dies. Dask might do the job, but I can't find anything helpful in its documentation, and I would rather not work with Dask anyway; it can't even display the row count.

So first I want to get pandas working.

I tried several things: with and without chunks, and with Dask (but I want to use pandas).

File: https://www.data.gouv.fr/fr/datasets/r/7c7e737a-0fc8-4da8-a887-155ce648e3d7

print("Loading...",end='\r')
import numpy as np
import pandas as pd
import requests
import time
import sys
import dask.dataframe as dd
pd.show_versions()
print("ok        ")
def timing(begin):
    now = time.time()
    run = round(now-begin,1)
    if run > 60:
        run = round(run/60,1)
        return str(run)+" minutes"
    else:
        return str(run)+" seconds"
names = ['Siren',
 'nic',
 'Siret',
 'statutDiffusionEtablissement',
 'dateCreationEtablissement',
 'trancheEffectifsEtablissement',
 'anneeEffectifsEtablissement',
 'activitePrincipaleRegistreMetiersEtablissement',
 'dateDernierTraitementEtablissement',
 'Head Quarter',
 'nombrePeriodesEtablissement',
 'complementAdresseEtablissement',
 'Street Number',
 'indiceRepetitionEtablissement',
 'Street Type',
 'Street',
 'Postal Code',
 'City',
 'libelleCommuneEtrangerEtablissement',
 'distributionSpecialeEtablissement',
 'codeCommuneEtablissement',
 'codeCedexEtablissement',
 'libelleCedexEtablissement',
 'codePaysEtrangerEtablissement',
 'Country',
 'complementAdresse2Etablissement',
 'numeroVoie2Etablissement',
 'indiceRepetition2Etablissement',
 'typeVoie2Etablissement',
 'libelleVoie2Etablissement',
 'codePostal2Etablissement',
 'libelleCommune2Etablissement',
 'libelleCommuneEtranger2Etablissement',
 'distributionSpeciale2Etablissement',
 'codeCommune2Etablissement',
 'codeCedex2Etablissement',
 'libelleCedex2Etablissement',
 'codePaysEtranger2Etablissement',
 'libellePaysEtranger2Etablissement',
 'dateDebut',
 'Status',
 'enseigne1Etablissement',
 'enseigne2Etablissement',
 'enseigne3Etablissement',
 'denominationUsuelleEtablissement',
 'activitePrincipaleEtablissement',
 'nomenclatureActivitePrincipaleEtablissement',
 'caractereEmployeurEtablissement']
usecols = ["Siren",
        "Siret",
        "Postal Code",
        "Head Quarter",
        "Status",
        "City",
        "Country",
        "Street",
        "Street Number",
        "Street Type"]

file = "siren_etab.csv"  # 28M rows, Siren database
#Without Chunks: kernel dies quickly in ~2 minutes
df = pd.read_csv(file, header=None, skiprows=[0], names=names, usecols=usecols, low_memory=False, sep=',', error_bad_lines=False)
#No output before kernel dies
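One general way to shrink the in-memory footprint before resorting to chunks is to give read_csv explicit dtypes: repeated strings such as city names stored as pandas categories take far less memory than the default object dtype. A minimal sketch on synthetic data (the column names here are only illustrative, not the real Sirene schema):

```python
import io
import pandas as pd

# Hypothetical stand-in for a slice of siren_etab.csv
csv = io.StringIO("Siren,City\n123,PARIS\n456,LYON\n789,PARIS\n")

# Default read: string columns land as object dtype (one Python str per cell)
df_obj = pd.read_csv(csv)

csv.seek(0)
# Categorical read: each distinct string is stored once, rows hold small codes
df_cat = pd.read_csv(csv, dtype={"City": "category"})

print(df_cat["City"].dtype)  # category
```

At 28M rows the savings on low-cardinality columns (city, country, street type, status) can be substantial.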
#With chunks of 1M rows (optimal): kernel dies after a wait of ~12 minutes
begin = time.time()
df_etab = pd.DataFrame()
for i,chunk in enumerate(pd.read_csv(file, header=None, skiprows=[0], names=names, usecols=usecols, chunksize=1000000, low_memory=False, sep=',', error_bad_lines=False)):
    running = timing(begin)
    print("#"+str(i),"|",round((i*1000000*100)/28000000),"% |",running,"|",df_etab.shape,end="\r")
    df_etab = pd.concat([df_etab, chunk])
#Output before kernel dies: #17 | 61 % | 12.4 minutes | (17000000, 10)
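Note that calling pd.concat inside the loop re-copies the ever-growing df_etab on every iteration, so memory and time costs grow quadratically. A common alternative (sketched here on synthetic data) is to collect the chunks in a list and concatenate once at the end:

```python
import io
import pandas as pd

# Tiny synthetic CSV standing in for the real file
csv = io.StringIO("a,b\n" + "\n".join(f"{i},{i * 2}" for i in range(10)))

# Accumulate chunks in a list; one concat at the end avoids
# re-copying the growing frame on every iteration.
chunks = []
for chunk in pd.read_csv(csv, chunksize=3):
    chunks.append(chunk)
df = pd.concat(chunks, ignore_index=True)

print(len(df))  # 10
```

This doesn't reduce the final frame's size, but it removes the repeated intermediate copies that can kill the kernel mid-loop.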
#With chunks of 1M rows and engine=Python: kernel dies after a wait of ~18 minutes
begin = time.time()
df_etab = pd.DataFrame()
for i,chunk in enumerate(pd.read_csv(file, header=None, skiprows=[0], names=names, usecols=usecols, chunksize=1000000, sep=',', error_bad_lines=False, engine="python")):
    running = timing(begin)
    print("#"+str(i),"|",round((i*1000000*100)/28000000),"% |",running,"|",df_etab.shape,end="\r")
    df_etab = pd.concat([df_etab, chunk])
#Output before kernel dies: #8 | 29 % | 18.0 minutes | (8000000, 10)
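If the concatenated frame itself simply does not fit in RAM, another option is to reduce each chunk as it arrives and keep only the aggregate rather than the raw rows. A sketch, with a hypothetical count-per-postal-code reduction on synthetic data:

```python
import io
import pandas as pd

csv = io.StringIO("Postal Code,Siren\n75001,1\n69001,2\n75001,3\n")

# Keep only the running aggregate (row counts per postal code),
# discarding each chunk after it has been reduced.
counts = pd.Series(dtype="int64")
for chunk in pd.read_csv(csv, chunksize=2):
    counts = counts.add(chunk["Postal Code"].value_counts(), fill_value=0)

print(counts.sort_index())
```

Whether this applies depends on what the full frame is ultimately used for; it only works when the downstream need is an aggregate, not the rows themselves.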
#With dask library
dtype={'codeCommune2Etablissement': 'object',
       'codeCommuneEtablissement': 'object',
       'complementAdresse2Etablissement': 'object',
       'enseigne2Etablissement': 'object',
       'enseigne3Etablissement': 'object',
       'indiceRepetition2Etablissement': 'object',
       'libelleCommuneEtrangerEtablissement': 'object',
       'libellePaysEtrangerEtablissement': 'object',
       'numeroVoieEtablissement': 'object'}
df_etab = dd.read_csv(file, dtype=dtype, low_memory=False)
df_etab.head()
#Outputs the first 5 rows of the dataframe
#Impossible to have a correct row count
#df_etab.isnull().sum(axis=1)
#df_etab.count(axis=1)
#Outputs:
    #Dask Series Structure:
    #npartitions=78
    #    int64
    #      ...
    #    ...  
    #      ...
    #      ...
    #dtype: int64
    #Dask Name: dataframe-sum, 390 tasks

No error messages.

Result: the kernel dies, then restarts.

Installed versions | commit: None | python: 3.6.8.final.0 | python-bits: 64 | OS: Linux | OS-release: 3.10.0-327.el7.x86_64 | machine/processor: x86_64 | byteorder: little | LC_ALL: None | pandas: 0.24.2 |

0 answers