Question

I am trying to concatenate a large number of columns containing integers into a single string.

Basically, starting from:

df = pd.DataFrame({'id':[1,2,3,4],'a':[0,1,2,3], 'b':[4,5,6,7], 'c':[8,9,0,1]})

to get:

I found several methods (here and here):

Method 1:

conc['glued'] = ''
i = 1
while i < len(df.columns):
    conc['glued'] = conc['glued'] + df[df.columns[i]].values.astype(str)
    i = i + 1

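This method works, but it is slow (45 minutes for my "test" case of 18,000 rows x 40,000 columns). The loop over the columns worries me, because this program should eventually be applied to tables with 600,000 columns, and I am afraid it will take too long.

Method 2a: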

conc['join']=[''.join(row) for row in df[df.columns[1:]].values.astype(str)]

Method 2b:

conc['apply'] = df[df.columns[1:]].apply(lambda x: ''.join(x.astype(str)), axis=1)

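Both of these methods are about 10 times more efficient than the previous one, they iterate over rows, which is good, and they work on my "debug" table (df). However, when I apply them to my "test" table of 18k x 40k, they lead to a MemoryError (after reading the corresponding csv file, 60% of my 32 GB of RAM is already occupied). I can copy my DataFrame without running out of memory, but, curiously, applying this method makes the code crash.

Do you see how I could fix and improve this code to use an efficient row-based iteration? Thanks!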
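One possible explanation for the crash is the size of the intermediate string array. `.values.astype(str)` on an integer array produces a fixed-width NumPy unicode array, so every element takes the same number of bytes regardless of how short the number is. The sketch below estimates that footprint for the 18k x 40k case. The helper name `estimate_str_array_bytes` is hypothetical, and the int64 dtype is an assumption about how the values end up after conversion.

```python
import numpy as np

def estimate_str_array_bytes(n_rows, n_cols, dtype=np.int64):
    """Estimate the size of arr.astype(str) for an integer array of this shape."""
    # Convert a single element to find the fixed item size NumPy picks for this dtype.
    itemsize = np.zeros(1, dtype=dtype).astype(str).dtype.itemsize
    return n_rows * n_cols * itemsize

# Assumption: the values are 64-bit integers by the time they are converted.
itemsize = np.zeros(1, dtype=np.int64).astype(str).dtype.itemsize
total = estimate_str_array_bytes(18_000, 40_000)
print(f"{itemsize} bytes per element, about {total / 1e9:.1f} GB in total")
```

With NumPy's usual `<U21` result for int64, that is 84 bytes per element, or roughly 60 GB for the whole table, which would exceed 32 GB of RAM even before the rows are joined. Converting fewer rows at a time keeps this intermediate array small.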

Addendum: here is the code I use on my test case:

geno_reader = pd.read_csv(genotype_file,header=0,compression='gzip', usecols=geno_columns_names)
fimpute_geno = pd.DataFrame({'SampID': geno_reader['SampID']})

I should probably use the chunksize option to read this file, but I have not really understood how to use it after reading.

Method 1:

fimpute_geno['Calls'] = ''
for i in range(1, len(geno_reader.columns)):
    fimpute_geno['Calls'] = fimpute_geno['Calls'] \
        + geno_reader[geno_reader.columns[i]].values.astype(int).astype(str)

This works in 45 minutes. There is some quite ugly code, like .astype(int).astype(str). I do not know why Python does not recognize my integers and treats them as floats.

Method 2:

fimpute_geno['Calls'] = geno_reader[geno_reader.columns[1:]] \
    .apply(lambda x: ''.join(x.astype(int).astype(str)), axis=1)

This leads to a MemoryError.
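As a rough illustration of how the chunksize option mentioned above could be used: `pd.read_csv(..., chunksize=n)` returns an iterator of DataFrames with at most `n` rows each, so the string conversion only ever touches one chunk. This is a minimal sketch, not a tested solution for the real file. The function name `concat_calls_chunked`, the marker column names, and the in-memory demo data are all made up, and it assumes the value columns contain no missing values.

```python
import io
import pandas as pd

def concat_calls_chunked(csv_source, id_col, chunksize=1000, **read_kwargs):
    """Read a CSV in row chunks and join all non-id columns into one string per row."""
    parts = []
    for chunk in pd.read_csv(csv_source, header=0, chunksize=chunksize, **read_kwargs):
        value_cols = [c for c in chunk.columns if c != id_col]
        # Only one chunk is converted to strings at a time, which bounds peak memory.
        values = chunk[value_cols].to_numpy().astype(int).astype(str)
        calls = [''.join(row) for row in values]
        parts.append(pd.DataFrame({id_col: chunk[id_col].to_numpy(), 'Calls': calls}))
    return pd.concat(parts, ignore_index=True)

# Tiny in-memory stand-in for the real genotype file (made-up data).
demo_csv = io.StringIO("SampID,m1,m2,m3\ns1,0,1,2\ns2,2,2,0\ns3,1,0,1\n")
result = concat_calls_chunked(demo_csv, id_col='SampID', chunksize=2)
print(result)
```

For the real file, extra arguments such as `compression='gzip'` and `usecols=...` could be passed through `read_kwargs`. The chunk size is a trade-off between peak memory and per-chunk overhead and would need tuning.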

Answer 1

Here is something to try. It requires you to convert the columns to strings. Your sample frame

    a   b   c   id
0   0   4   8   1
1   1   5   9   2
2   2   6   0   3
3   3   7   1   4

Then

import numpy as np

# .ix was removed in pandas 1.0; select the columns by an explicit label list instead
cols = ['b', 'c', 'id']
conc[cols] = conc[cols].astype(str)
conc['join'] = np.sum(conc[cols], axis=1)  # summing strings row-wise concatenates them

will give

    a   b   c   id  join
0   0   4   8   1   481
1   1   5   9   2   592
2   2   6   0   3   603
3   3   7   1   4   714
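The result above joins b, c and id, whereas the question asks for every column except id. A column-wise variant aimed at that output might look like the sketch below. It swaps the string sum for `Series.str.cat`, which concatenates element-wise, so it is a related technique rather than the exact code above. The frame is the sample from the question, and the name `value_cols` is only for illustration.

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3, 4], 'a': [0, 1, 2, 3],
                   'b': [4, 5, 6, 7], 'c': [8, 9, 0, 1]})

value_cols = ['a', 'b', 'c']  # every column except 'id'
as_text = df[value_cols].astype(str)

# Start from the first column and append the remaining ones;
# Series.str.cat concatenates the strings element-wise.
glued = as_text[value_cols[0]].str.cat([as_text[c] for c in value_cols[1:]])

conc = pd.DataFrame({'id': df['id'], 'glued': glued})
print(conc)
```

Like the approach above, this still converts the whole table to strings at once, so on a very wide table it would likely need to be combined with chunked reading to stay within memory.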
