Loading a large dataset into Pandas (Python)

Date: 2017-06-14 10:06:59

Tags: python csv pandas

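I want to load a large open-source .csv dataset (3.4m rows, 206k users) from InstaCart: https://www.instacart.com/datasets/grocery-shopping-2017

Basically, I am unable to load orders.csv into a Pandas DataFrame. I would like to learn the best practices for loading large files into Pandas / Python.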

3 answers:

Answer 0: (score: 3)

The best option is to read the data in chunks instead of loading the whole file into memory.

Fortunately, the read_csv method accepts a chunksize argument:

    for chunk in pd.read_csv("file.csv", chunksize=somesize):
        process(chunk)

Note: when chunksize is passed to read_csv or read_table, the return value is an iterable object of type TextFileReader rather than a DataFrame.
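As a rough sketch of the chunked approach, the snippet below counts orders per user without ever holding the whole file in memory. The in-memory CSV, the column names (order_id, user_id, order_dow) and the helper count_orders_per_user are assumptions made for illustration; with the real orders.csv you would pass the file path and a much larger chunksize.

```python
import io

import pandas as pd

# Small stand-in for orders.csv; the column names are assumed for illustration.
csv_text = "order_id,user_id,order_dow\n" + "\n".join(
    f"{i},{i % 3},{i % 7}" for i in range(10)
)


def count_orders_per_user(source, chunksize):
    """Count rows per user_id, reading at most `chunksize` rows at a time."""
    partial_counts = []
    for chunk in pd.read_csv(source, chunksize=chunksize):
        # Each chunk is an ordinary DataFrame; keep only its small summary.
        partial_counts.append(chunk["user_id"].value_counts())
    # Merge the per-chunk summaries into one total per user.
    return pd.concat(partial_counts).groupby(level=0).sum().sort_index()


orders_per_user = count_orders_per_user(io.StringIO(csv_text), chunksize=4)
print(orders_per_user)
```

The key point is that only the per-chunk summary is kept, so peak memory depends on chunksize rather than on the size of the file.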


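Answer 1: (score: 0)

dask is very useful when you have large data frames that may not fit in memory. The main page I linked to has examples of how to create a dask DataFrame, which has largely the same API as pandas but can be distributed.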

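Answer 2: (score: 0)

Depending on your machine, you may be able to read everything into memory by specifying data types when reading the csv file. When pandas reads a csv, the default data types it picks may not be the best ones. With the dtype argument you can specify the data types yourself, which reduces the size of the data frame read into memory.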
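To illustrate, here is a small sketch comparing memory use with and without explicit dtypes. The sample data, the column names and the chosen integer widths are assumptions; check the value ranges in the real file before narrowing a column.

```python
import io

import pandas as pd

# Synthetic stand-in for orders.csv; the column names are assumed.
csv_text = "order_id,user_id,order_dow,order_hour_of_day\n" + "\n".join(
    f"{i},{i % 100},{i % 7},{i % 24}" for i in range(1000)
)

# Default parsing: every integer column becomes int64 (8 bytes per value).
default_df = pd.read_csv(io.StringIO(csv_text))

# Narrower types, chosen to fit the value ranges of this sample data.
dtypes = {
    "order_id": "int32",
    "user_id": "int32",
    "order_dow": "int8",
    "order_hour_of_day": "int8",
}
compact_df = pd.read_csv(io.StringIO(csv_text), dtype=dtypes)

default_bytes = int(default_df.memory_usage(deep=True).sum())
compact_bytes = int(compact_df.memory_usage(deep=True).sum())
print(f"default: {default_bytes} bytes, with dtype: {compact_bytes} bytes")
```

Both frames hold the same values; only the storage width per column differs.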