How do I sample a very large CSV file (6GB)?

Asked: 2015-01-08 08:36:01

Tags: python file memory

I have a large CSV file (the first line is the header), and I want to split it into 100 samples (e.g. by line_num % 100). How can I do this efficiently given limited main memory?

I want to split the file into 100 smaller ones: every line with line_num % 100 == 1 goes into subfile 1, every line with line_num % 100 == 2 into subfile 2, ..., up to subfile 100. That yields 100 files of roughly 60MB each.

I do not want a 100-line sample, nor a single 1/100-sized sample.

I tried to do it like this:

fi = [open('split_data/%d.csv' % i, 'w') for i in range(100)]
i = 0
with open('data/train.csv') as fin:
    first = fin.readline()  # header
    for line in fin:
        fi[i % 100].write(line)
        i = i + 1
for i in range(100):
    fi[i].close()

But the file is too large to handle comfortably with my limited memory. How should I deal with that? I'd like to do it in a single pass.

(My code does work, but it takes so long that I mistakenly thought it had crashed. Sorry~~)
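For reference, a Python 3 sketch of the same round-robin split (the paths and function name here are illustrative, not from the question) that parses and rewrites CSV rows instead of raw lines; it is single-pass and never holds more than one row in memory:

```python
import csv
import os

def split_round_robin(src_path, out_dir, parts=100):
    """Distribute the data rows of src_path across `parts` files, round-robin."""
    outs = [open(os.path.join(out_dir, '%d.csv' % i), 'w', newline='')
            for i in range(parts)]
    writers = [csv.writer(f) for f in outs]
    try:
        with open(src_path, newline='') as fin:
            reader = csv.reader(fin)
            next(reader, None)  # skip the header row
            for rowno, row in enumerate(reader):
                writers[rowno % parts].writerow(row)
    finally:
        for f in outs:
            f.close()
```

Buffering is handled by the file objects themselves, so memory use stays bounded regardless of file size.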

3 answers:

Answer 0 (score: 6):

As discussed in the comments, you can split the file into 100 parts round-robin by taking the row number modulo 100 (so, for range(200): part 0 gets rows 0 and 100, part 1 gets rows 1 and 101, part 2 gets rows 2 and 102, and so on). That turns one large file into 100 smaller ones:

import csv

files = [open('part_{}'.format(n), 'wb') for n in xrange(100)]
csvouts = [csv.writer(f) for f in files]
with open('yourcsv') as fin:
    csvin = csv.reader(fin)
    next(csvin, None) # Skip header
    for rowno, row in enumerate(csvin):
        csvouts[rowno % 100].writerow(row)

for f in files:
    f.close()

Alternatively, you can iterate over the file with islice using a step, instead of taking the row number modulo 100, e.g.:

import csv
from itertools import islice

with open('yourcsv') as fin:
    csvin = csv.reader(fin)
    # Skip the header, then take every 100th row until the file ends
    for line in islice(csvin, 1, None, 100):
        pass  # do something with line

Example:

r = xrange(1000)
res = list(islice(r, 1, None, 100))
# [1, 101, 201, 301, 401, 501, 601, 701, 801, 901]
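As a self-contained Python 3 illustration of the same islice pattern applied to an actual csv.reader (the data here is made up, with the first row as a header):

```python
import csv
import io
from itertools import islice

# A made-up 1000-row CSV with a header, held in memory only for illustration
data = "id\n" + "".join("%d\n" % i for i in range(1000))

reader = csv.reader(io.StringIO(data))
# Start at index 1 (skipping the header at index 0), then take every 100th row
sample = [row for row in islice(reader, 1, None, 100)]
```

This still scans the file sequentially, but it never holds more than one row in memory at a time.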

Answer 1 (score: 1):

Building on @Jon Clements's answer, I would also benchmark this variation:

import csv
from itertools import islice

with open('in.csv') as fin:
  first = fin.readline() # discard the header
  csvin = csv.reader( islice(fin, None, None, 100) )  # this line is the only difference
  for row in csvin:
    print row # do something with row

If you only want 100 samples, you can use this idea: it does just 100 reads, at equally spaced positions in the file. This should work well for CSV files whose line lengths are reasonably uniform.

import csv
import os

def sample100(path):
  with open(path) as fin:
    end = os.fstat(fin.fileno()).st_size
    fin.readline()              # skip the first line
    start = fin.tell()
    step = (end - start) / 100
    offset = start
    while offset < end:
      fin.seek(offset)
      fin.readline()            # this might not be a complete line
      if fin.tell() < end:
        yield fin.readline()    # this is a complete non-empty line
      else:
        break                   # not really necessary...
      offset = offset + step

for row in csv.reader( sample100('in.csv') ):
  print row                     # do something with row
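For anyone running this today, here is a Python 3 port of the sampler above (the // makes the Python 2 integer division explicit, and a max(1, ...) guard avoids a zero step on tiny files; the function name and parameter are otherwise illustrative):

```python
import os

def sample_n(path, n=100):
    """Yield roughly n complete lines read at equally spaced byte offsets."""
    with open(path) as fin:
        end = os.fstat(fin.fileno()).st_size
        fin.readline()                      # skip the header line
        start = fin.tell()
        step = max(1, (end - start) // n)   # explicit integer division
        offset = start
        while offset < end:
            fin.seek(offset)
            fin.readline()                  # likely a partial line; discard it
            if fin.tell() < end:
                yield fin.readline()        # the next complete line
            else:
                break
            offset += step
```

As in the original, the lines this yields feed straight into csv.reader.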

Answer 2 (score: 0):

I think you could open the same file 10 times and then read each handle independently, effectively splitting the file into subfiles without actually doing it.

Unfortunately, this requires knowing in advance how many rows the file contains, which means reading the whole thing once just to count them. On the other hand, that counting pass should be relatively quick, since no other processing takes place.
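One way to do that counting pass with constant memory is a plain streaming line count (a minimal sketch; the function name is mine):

```python
def count_data_rows(path):
    """Count the non-header lines of a file, reading one line at a time."""
    with open(path) as f:
        next(f, None)              # skip the header
        return sum(1 for _ in f)   # never holds more than one line in memory
```

Note that this counts physical lines; if your CSV has quoted fields containing embedded newlines, count with csv.reader instead, as the code below does.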

To illustrate and test this approach, I created a much smaller and simpler CSV test file, with just one item per row, that looks like this (the first line is the header and doesn't count):

line_no
1
2
3
4
5
...
9995
9996
9997
9998
9999
10000

Here is the code along with sample output:

import csv

# count number of rows in csv file
# (this requires reading the whole file)
file_name = 'mycsvfile.csv'
with open(file_name, 'rb') as csv_file:
    for num_rows, _ in enumerate(csv.reader(csv_file)): pass
rows_per_section = num_rows // 10

print 'number of rows: {:,d}'.format(num_rows)
print 'rows per section: {:,d}'.format(rows_per_section)

csv_files = [open(file_name, 'rb') for _ in xrange(10)]
csv_readers = [csv.reader(f) for f in csv_files]
map(next, csv_readers)  # skip header

# position each file handle at its starting position in file
for i in xrange(10):
    for j in xrange(i * rows_per_section):
        try:
            next(csv_readers[i])
        except StopIteration:
            pass

# read rows from each of the sections
for i in xrange(rows_per_section):
    # elements are one row from each section
    rows = [next(r) for r in csv_readers]
    print rows  # show what was read

# clean up
for i in xrange(10):
    csv_files[i].close()

Output:

number of rows: 10,000
rows per section: 1,000
[['1'], ['1001'], ['2001'], ['3001'], ['4001'], ['5001'], ['6001'], ['7001'], ['8001'], ['9001']]
[['2'], ['1002'], ['2002'], ['3002'], ['4002'], ['5002'], ['6002'], ['7002'], ['8002'], ['9002']]
...
[['998'], ['1998'], ['2998'], ['3998'], ['4998'], ['5998'], ['6998'], ['7998'], ['8998'], ['9998']]
[['999'], ['1999'], ['2999'], ['3999'], ['4999'], ['5999'], ['6999'], ['7999'], ['8999'], ['9999']]
[['1000'], ['2000'], ['3000'], ['4000'], ['5000'], ['6000'], ['7000'], ['8000'], ['9000'], ['10000']]
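For completeness, a compact Python 3 sketch of the same multiple-handles idea (the function and variable names are mine, not from the answer above):

```python
import csv
import itertools

def parallel_sections(path, parts=10):
    """Open `path` `parts` times and advance each handle to the start of its
    section. Returns (readers, files, rows_per_section); caller closes files."""
    with open(path, newline='') as f:
        num_rows = sum(1 for _ in f) - 1      # counting pass; excludes header
    rows_per_section = num_rows // parts
    files = [open(path, newline='') for _ in range(parts)]
    readers = [csv.reader(f) for f in files]
    for i, r in enumerate(readers):
        # skip the header plus all rows belonging to earlier sections
        for _ in itertools.islice(r, 1 + i * rows_per_section):
            pass
    return readers, files, rows_per_section
```

Reading one row from each reader then yields one row per section per step, exactly like the output shown above.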