Reading a large table file with pandas while keeping only a small subset of rows

Time: 2017-04-07 19:41:31

Tags: python pandas dataframe

I have a large table file (about 2 GB) that contains a distance matrix indexed by its first column. Its rows look like

A 0 1.2 1.3 ...
B 1.2 0 3.5 ...
C 1.5 0 4.5 ...

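However, I only need to keep a small subset of the rows. Given a list of the indices I need to keep, what is the best and fastest way to read this file into a pandas DataFrame? Right now I am using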

distance_matrix = pd.read_table("hla_distmat.txt", header = None, index_col = 0)[columns_to_keep]

to read the file, but this causes memory problems in the read_table command. Is there a faster and more memory-efficient way? Thanks.

1 answer:

Answer 0: (score: 1)

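To filter columns, use the usecols parameter. To filter rows, use skiprows, where you specify which rows must be removed as a list, a range or an np.array:

distance_matrix = pd.read_table("hla_distmat.txt", header = None, index_col = 0, usecols=columns_to_keep, skiprows = range(10, 100))

Sample (with real data, omit the sep parameter, because sep='\t' is the default in read_table):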

import pandas as pd
import numpy as np
# pandas.compat.StringIO is not available in current pandas; the standard library provides it
from io import StringIO

temp=u"""0;119.02;0.0
1;121.20;0.0
3;112.49;0.0
4;113.94;0.0
5;114.67;0.0
6;111.77;0.0
7;117.57;0.0
6648;0.00;420.0
6649;0.00;420.0
6650;0.00;420.0"""
# after testing, replace StringIO(temp) with 'filename.csv'

columns_to_keep = [0,1]

df = pd.read_table(StringIO(temp), 
                   sep=";", 
                   header=None,
                   index_col=0, 
                   usecols=columns_to_keep,
                   skiprows = range(5, 100))
print (df)
        1
0        
0  119.02
1  121.20
3  112.49
4  113.94
5  114.67
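
skiprows selects rows by position. When the rows to keep are identified by their index label, as in the question, one option is to read the file in chunks and keep only the matching rows of each chunk, so that the full table is never held in memory at once. This is a minimal sketch, not a tuned solution: the sample data, the name labels_to_keep and the chunk size are assumptions made for illustration.

```python
from io import StringIO

import pandas as pd

# hypothetical miniature of a tab-separated distance matrix
temp = "A\t0\t1.2\t1.3\nB\t1.2\t0\t3.5\nC\t1.3\t3.5\t0\n"

# hypothetical list of row labels to keep
labels_to_keep = ["A", "C"]

# chunksize makes read_csv return an iterator of DataFrames;
# 2 is only for the demo, a real file would use a much larger value
reader = pd.read_csv(StringIO(temp),
                     sep="\t",
                     header=None,
                     index_col=0,
                     chunksize=2)

# keep only the rows of each chunk whose index label is wanted
kept = [chunk[chunk.index.isin(labels_to_keep)] for chunk in reader]
distance_matrix = pd.concat(kept)
print(distance_matrix)
```

Only one chunk plus the rows already kept are in memory at any time, at the cost of still parsing every line of the file.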

A more general solution with read_table, which skips every row that is not in a list of rows to keep:

# with index_col=0, the first column (0) must always be included
columns_to_keep = [0,1]
# keep the second, third and fifth rows (0-based positions 1, 2, 4)
rows_to_keep = [1,2,4]
# estimated row count (an upper bound), or count the lines as in http://stackoverflow.com/q/19001402/2901002
max_rows = 100

df = pd.read_table(StringIO(temp), 
                   sep=";", 
                   header=None,
                   index_col=0, 
                   usecols=columns_to_keep,
                   skiprows = np.setdiff1d(np.arange(max_rows), np.array(rows_to_keep)))
print (df)
        1
0        
1  121.20
3  112.49
5  114.67
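
If the row count is not known in advance, skiprows can also be given a function instead of a list, which may remove the need for the max_rows estimate. The sketch below assumes a pandas version that accepts a callable for skiprows (0.20 or later, to my knowledge). It uses read_csv with an explicit sep, which behaves like read_table here, and it stores rows_to_keep as a set, which is a choice made for this example.

```python
from io import StringIO

import pandas as pd

temp = u"""0;119.02;0.0
1;121.20;0.0
3;112.49;0.0
4;113.94;0.0
5;114.67;0.0
6;111.77;0.0
7;117.57;0.0
6648;0.00;420.0
6649;0.00;420.0
6650;0.00;420.0"""

columns_to_keep = [0, 1]
# 0-based row positions to keep; a set gives fast membership tests
rows_to_keep = {1, 2, 4}

df = pd.read_csv(StringIO(temp),
                 sep=";",
                 header=None,
                 index_col=0,
                 usecols=columns_to_keep,
                 # the function receives each row position and
                 # returns True when that row should be skipped
                 skiprows=lambda i: i not in rows_to_keep)
print(df)
```

The function is called once per row of the file, so on a very large file it is worth timing this against the precomputed array from np.setdiff1d.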