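I have a very large text file, shown below, that contains both strings and numbers. I want to read only the numbers, drop the rows that have only 3 columns, and write the rest into an (m×n) matrix. Can anyone tell me the best way to handle this kind of file in Python?
My file looks like this: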
# Chunk-averaged data for fix Dens and group ave
# Timestep Number-of-chunks Total-count
# Chunk Coord1 Ncount density/number
4010000 14 1500
1 4.323 138.758 0.00167105
2 12.969 121.755 0.00146629
3 21.615 127.7 0.00153788
4 30.261 131.682 0.00158584
5 38.907 127.525 0.00153578
6 47.553 136.322 0.00164172
7 56.199 118.014 0.00142124
8 64.845 125.842 0.00151551
9 73.491 120.684 0.00145339
10 82.137 132.282 0.00159306
11 90.783 121.567 0.00146402
12 99.429 97.869 0.00117863
13 108.075 0 0
14 116.721 0 0......
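Answer 0 (score: 1)
You haven't specified exactly what you mean by a matrix, so here is a solution that turns the text file into a 2D list in which every number can be accessed individually.
It checks that the first item on a line is a number and that the line has 4 items; if so, it appends that line to the 2D list mat as four separate entries. To access any entry in mat, use mat[i][j].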
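with open("test.txt") as f:
    # strip surrounding whitespace (including the newline) from every line
    content = [x.strip() for x in f]

mat = []
for line in content:
    s = line.split()  # split on any run of whitespace, not just single spaces
    # keep only rows with 4 items whose first item is an integer
    if len(s) == 4 and s[0].isdigit():
        mat.append([float(x) for x in s])  # store the row as numbers, not strings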
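If an actual m×n matrix is wanted rather than a list of lists, one option is to hand the filtered rows to NumPy. The following is only a sketch under the assumption that NumPy is available; the helper name `rows_to_matrix` and the short inline sample are made up for illustration and are not part of the answer above.

```python
import numpy as np

def rows_to_matrix(lines, ncols=4):
    """Keep only whitespace-separated rows with exactly `ncols` numeric fields."""
    rows = []
    for line in lines:
        fields = line.split()
        if len(fields) != ncols:
            continue  # wrong number of columns, e.g. the 3-column header row
        try:
            rows.append([float(x) for x in fields])
        except ValueError:
            continue  # a field is not numeric, e.g. a comment line
    return np.array(rows, dtype=float)

# Hypothetical sample standing in for the real file.
sample = """# Timestep Number-of-chunks Total-count
# Chunk Coord1 Ncount density/number
4010000 14 1500
1 4.323 138.758 0.00167105
2 12.969 121.755 0.00146629
"""

mat = rows_to_matrix(sample.splitlines())
print(mat.shape)   # number of rows kept, number of columns
print(mat[0, 1])   # second entry of the first kept row
```

For a real file, passing the open file object instead of `sample.splitlines()` should work the same way, since the function only iterates over lines.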
Answer 1 (score: 1)
With the sample copied and pasted into txt:
In [350]: np.genfromtxt(txt.splitlines(), invalid_raise=False)
/usr/local/bin/ipython3:1: ConversionWarning: Some errors were detected !
Line #2 (got 4 columns instead of 3)
Line #3 (got 4 columns instead of 3)
....
#!/usr/bin/python3
Out[350]: array([4.01e+06, 1.40e+01, 1.50e+03])
This reads the first non-comment line and takes its column count as the standard. Skipping that line, I can read everything else:
In [351]: np.genfromtxt(txt.splitlines(), invalid_raise=False,skip_header=4)
Out[351]:
array([[1.00000e+00, 4.32300e+00, 1.38758e+02, 1.67105e-03],
[2.00000e+00, 1.29690e+01, 1.21755e+02, 1.46629e-03],
[3.00000e+00, 2.16150e+01, 1.27700e+02, 1.53788e-03],
[4.00000e+00, 3.02610e+01, 1.31682e+02, 1.58584e-03],
[5.00000e+00, 3.89070e+01, 1.27525e+02, 1.53578e-03],
[6.00000e+00, 4.75530e+01, 1.36322e+02, 1.64172e-03],
[7.00000e+00, 5.61990e+01, 1.18014e+02, 1.42124e-03],
[8.00000e+00, 6.48450e+01, 1.25842e+02, 1.51551e-03],
[9.00000e+00, 7.34910e+01, 1.20684e+02, 1.45339e-03],
[1.00000e+01, 8.21370e+01, 1.32282e+02, 1.59306e-03],
[1.10000e+01, 9.07830e+01, 1.21567e+02, 1.46402e-03],
[1.20000e+01, 9.94290e+01, 9.78690e+01, 1.17863e-03],
[1.30000e+01, 1.08075e+02, 0.00000e+00, 0.00000e+00],
[1.40000e+01, 1.16721e+02, 0.00000e+00, 0.00000e+00]])
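In fact, in this case all of the remaining lines have the required 4 columns. If I truncate the last two lines, I get the warning, but it still reads the other lines.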
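The truncated-row behaviour can be reproduced with a small sketch. The shortened `txt` below is an assumed stand-in for the pasted sample, with one row deliberately cut to 3 columns; the warning is silenced here only to keep the output clean.

```python
import warnings
import numpy as np

# Assumed stand-in for the pasted sample; row 3 is deliberately truncated.
txt = """# Chunk-averaged data for fix Dens and group ave
# Timestep Number-of-chunks Total-count
# Chunk Coord1 Ncount density/number
4010000 14 1500
1 4.323 138.758 0.00167105
2 12.969 121.755 0.00146629
3 21.615 127.7
4 30.261 131.682 0.00158584
"""

with warnings.catch_warnings():
    # genfromtxt emits a ConversionWarning for the short row; ignore it here
    warnings.simplefilter("ignore")
    # skip_header=4 skips the three comment lines and the 3-column header row
    arr = np.genfromtxt(txt.splitlines(), invalid_raise=False, skip_header=4)

print(arr.shape)   # the truncated row is dropped, the others are kept
print(arr[:, 0])   # chunk numbers of the rows that survived
```

Without `invalid_raise=False`, the same call would raise an error on the short row instead of skipping it.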
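Filtering the lines before passing them to genfromtxt is another option. genfromtxt accepts anything that feeds it lines: a file, a list of strings, or a function that reads and filters a file.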
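As a rough sketch of that filtering approach, a generator can yield only the 4-column, non-comment lines and be handed straight to genfromtxt. The generator name `four_column_lines` is hypothetical, and `io.StringIO` stands in for the real file here.

```python
import io
import numpy as np

def four_column_lines(lines, ncols=4):
    """Yield only non-comment lines that have exactly `ncols` fields."""
    for line in lines:
        if line.lstrip().startswith("#"):
            continue  # comment line
        if len(line.split()) == ncols:
            yield line

# Assumed stand-in for the real file; open("myfile.txt") would be used instead.
fake_file = io.StringIO(
    "# Chunk Coord1 Ncount density/number\n"
    "4010000 14 1500\n"
    "1 4.323 138.758 0.00167105\n"
    "2 12.969 121.755 0.00146629\n"
    "3 21.615 127.7 0.00153788\n"
)

data = np.genfromtxt(four_column_lines(fake_file))
print(data.shape)
print(data[-1])
```

Because the 3-column header row never reaches genfromtxt, neither `skip_header` nor `invalid_raise=False` is needed in this variant.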
Answer 2 (score: 0)
For your task you need iteration over the file, str.split() and the re module:
import re  # regular expressions, used to check that a line contains only numbers

matrix = []  # the rows of numbers are collected here
with open('myfile.txt') as f:
    for line in f:  # iterate over the file one line at a time
        # skip lines containing anything other than digits, dots and whitespace
        if not re.fullmatch(r'[\d.\s]+', line):
            continue
        list_of_items = line.split()  # numbers are assumed to be whitespace-separated
        if len(list_of_items) <= 3:  # rows with 3 or fewer columns are not wanted
            continue
        # convert the strings to floats and add them to the matrix as one row
        matrix.append([float(an_item) for an_item in list_of_items])
I tried to let the code and comments speak for themselves; let me know if anything is unclear.