Question

I have the following data:

# Data set number 1
# 
# Number of lines 4081 
# 
# Max number of column 3 is 5  
# Blahblah
# The explanation about each rows
 3842 1 1 3843 0         0.873         0.922         0.000         0.317
 3843 2 2 3842 3844 0         0.873         0.873         1.747         2.000        -0.614
 3844 1 1 3843 0         0.873         0.922         0.000         0.312
......
2191 3 2 2117 2120 0         0.925         0.934         1.878         2.000        -0.750
# Data set number 2 
# 
# Number of lines 4081 
# 
# Max number of column 3 is 5  
# Blahblah
# The explanation about each rows
 3842 1 1 3843 0         0.873         0.922         0.000         0.317
 3843 2 2 3842 3844 0         0.873         0.873         1.747         2.000        -0.614

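My data consists of 2010 data sets in the same repeated format, each made up of 7 header lines + 4081 data lines. How can I sort the data lines, not across the whole file but within each repeating data set? In other words, I want to sort the 4081 data lines that follow the 7 header lines of each data set.

P.S. I want to sort the data with respect to the first column, i.e. a row sort keyed on that column. The first column should end up in order, and the other columns of each row should move along with it.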

Answer 1

You know the header is 7 lines long, so you can skip it:

data_txt='''\
# Data set number 1
# 
# Number of lines 4081 
# 
# Max number of column 3 is 5  
# Blahblah
# The explanation about each rows
 3842 1 1 3843 0         0.873         0.922         0.000         0.317
 3843 2 2 3842 3844 0         0.873         0.873         1.747         2.000        -0.614
 3844 1 1 3843 0         0.873         0.922         0.000         0.312'''

data_lines = data_txt.splitlines()
# Skip the 7 header lines and parse each data row into a list of floats.
data = [[float(x) for x in line.split()] for line in data_lines[7:]]

print(data)
# [[3842.0, 1.0, 1.0, 3843.0, 0.0, 0.873, 0.922, 0.0, 0.317], [3843.0, 2.0, 2.0, 3842.0, 3844.0, 0.0, 0.873, 0.873, 1.747, 2.0, -0.614], [3844.0, 1.0, 1.0, 3843.0, 0.0, 0.873, 0.922, 0.0, 0.312]]

Then, if you want to sort the list of rows by the first element of each row:

data = sorted(data, key=lambda l: l[0])
print(data)
# [[3842.0, 1.0, 1.0, 3843.0, 0.0, 0.873, 0.922, 0.0, 0.317], [3843.0, 2.0, 2.0, 3842.0, 3844.0, 0.0, 0.873, 0.873, 1.747, 2.0, -0.614], [3844.0, 1.0, 1.0, 3843.0, 0.0, 0.873, 0.922, 0.0, 0.312]]

If you want to leave the first element alone and instead sort the rest of each row:

data=[[e[0]]+sorted(e[1:]) for e in data]

etc.
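The snippet above handles a single data set. A minimal sketch of how the same idea might be applied to every data set in the file, assuming each block is exactly 7 header lines followed by 4081 data lines and every data row starts with an integer. The function name `sort_data_sets` and its parameters are hypothetical, and the tiny sample uses 2 header lines and 3 rows only to keep the illustration short:

```python
def sort_data_sets(text, n_header=7, n_data=4081):
    """Sort the data rows of every data set by their integer first column."""
    lines = text.splitlines()
    block = n_header + n_data
    out = []
    for start in range(0, len(lines), block):
        header = lines[start:start + n_header]
        rows = lines[start + n_header:start + block]
        # Sort whole rows, so the other columns follow the first one.
        rows.sort(key=lambda row: int(row.split()[0]))
        out.extend(header)
        out.extend(rows)
    return "\n".join(out)


# Illustration only: two data sets with 2 header lines and 3 data rows each.
sample = "\n".join([
    "# Data set number 1", "# header",
    " 30 1 0.5", " 10 2 0.7", " 20 1 0.1",
    "# Data set number 2", "# header",
    " 3 1 0.5", " 1 2 0.7", " 2 1 0.1",
])
result = sort_data_sets(sample, n_header=2, n_data=3)
print(result)
```

Sorting the raw lines with a numeric key, rather than converting them to floats first, keeps the original column formatting when the result is written back out.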

Answer 2

Something like this should work.

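with open('input', 'r') as f_in, open('output', 'w') as f_out:
    while True:
        # Read the 7 header lines of the next data set.
        hdr = [f_in.readline() for _ in range(7)]
        # Detect end-of-file condition
        if not hdr[0]:
            break

        # Read the 4081 data lines, dropping empty reads from a short final block.
        data = [f_in.readline() for _ in range(4081)]
        data = [line for line in data if line.strip()]
        # Sort numerically by the first column; a plain data.sort() would
        # compare the lines as strings.
        data.sort(key=lambda line: int(line.split()[0]))
        f_out.writelines(hdr)
        f_out.writelines(data)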
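If the number of data lines per set is not guaranteed, one possible variant is to detect the headers instead of counting lines. The sketch below assumes every header line starts with `#` and every data line starts with an integer; the helper name `sort_between_headers` is hypothetical. It groups consecutive lines by whether they are headers and sorts only the data runs:

```python
from itertools import groupby


def sort_between_headers(lines):
    """Sort each run of data lines; runs of '#' header lines pass through unchanged."""
    out = []
    for is_header, group in groupby(lines, key=lambda line: line.startswith("#")):
        group = list(group)
        if not is_header:
            # Numeric sort on the first column; whole lines move together.
            group.sort(key=lambda line: int(line.split()[0]))
        out.extend(group)
    return out


lines = [
    "# Data set number 1\n", "# Blahblah\n",
    " 3844 1 1 3843 0\n", " 3842 1 1 3843 0\n", " 3843 2 2 3842 3844 0\n",
    "# Data set number 2\n",
    " 7 1\n", " 5 1\n",
]
sorted_lines = sort_between_headers(lines)
print("".join(sorted_lines))
```

Because a file object iterates over its lines, the same function could be fed an open file and its result passed to `writelines`.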

Answer 3

You can use numpy to split the data into blocks:

import numpy as np

# Keep only the data lines; header lines start with "#".
with open("foo4", "r") as f:
    full = [line for line in f if not line.startswith("#")]
# Split into equal blocks of 4081 lines (len(full) must be a multiple of 4081).
datablocks = np.split(np.array(full), len(full) // 4081)
for block in datablocks:
    # lines is a dataset, sorted by first column
    lines = sorted(block, key=lambda line: int(line.split()[0]))
    print(lines)
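Since each block is already a numpy array of lines, the ordering could also be done with `np.argsort` rather than `sorted`. This is only a sketch with a made-up three-line block. The rows have different numbers of columns, so they are kept as strings and only the first column is parsed as the sort key:

```python
import numpy as np

# Illustrative block; in the answer above this would be one element of datablocks.
block = np.array([
    " 3844 1 1 3843 0 0.873",
    " 3842 1 1 3843 0 0.873",
    " 3843 2 2 3842 3844 0 0.873",
])
# Integer first column of every line, used only as the sort key.
keys = np.array([int(line.split()[0]) for line in block])
# argsort returns the row order; indexing with it keeps each line intact.
order = np.argsort(keys, kind="stable")
sorted_block = block[order]
print(sorted_block.tolist())
```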

Python: sorting data within every n lines?

3 answers: