Question

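I have a file to parse and I'm not sure what the best strategy is for building the regular expression. I want to get the lines that contain the data. (I can already pull the values I want out of a line, but I realised I was missing some matches because my first regular expression wasn't good.)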

Here are some of the regular expressions/strategies I have tried:

Find the header and match everything after it, up to the two blank lines:

data_regex = re.compile("(?<=    ------- ------ -----    ------- ------ -----   ---- --  --------     -----------\n)[^(\n)^(\n)^]+")

Which matches:

1.3e-26   92.9  13.7    4.3e-26   91.2   8.9    2.0  2  BPD_transp_1 Binding-protein-dependent transport system inne
4.7e-34  117.1  19.5      9e-34  116.2  13.5    1.4  1  BPD_transp_1 Binding-protein-dependent transport system inne
3.2e-153  509.4   5.2   3.6e-153  509.2   3.6    1.0  1  IMPDH        IMP dehydrogenase / GMP reductase domain
1.3e-20   73.2   0.2    3.4e-19   68.6   0.1    2.5  3  DEAD         DEAD/DEAH box helicase
6.9e-11   42.1   0.0    1.8e-09   37.5   0.0    2.4  2  CTP_transf_2 Cytidylyltransferase

As you can see, it matches some of the data, but not all the data I expected. So I tried another one:

data_regex = re.compile("(?<=    E-value  score  bias    E-value  score  bias    exp  N  Model        Description\s)(.+\s)+")

With this expression I expected to get more than I needed, including the --- line, but I ended up with this:

3.6    7.2  11.6       0.13   11.9   3.6    2.0  2  Spore_YabQ   Spore cortex protein YabQ (Spore_YabQ)

0.63    9.6   3.1       0.42   10.2   0.3    2.1  2  IBV_3C       IBV 3C protein

0.38    9.6   4.8       0.65    8.9   0.8    2.6  3  PcrB         PcrB family

0.059   12.6   0.3          1    8.6   0.0    2.8  3  DUF699       Putative ATPase (DUF699)

0.028   14.1   0.9         14    5.7   0.0    3.8  4  HEAT         HEAT repeat

Again, some results, but not what I expected.
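A likely explanation for both results can be reproduced on a shortened table. In the sketch below the two-row `sample` is a hypothetical miniature of the real file, and `re.escape` is used only to build the same lookbehinds safely. The first pattern's character class simply excludes `(`, `)`, `^` and newline, so it stops at the first line break; the second pattern matches every line, but `findall` returns the capture group, and a repeated group keeps only its final repetition.

```python
import re

RULE = "    ------- ------ -----    ------- ------ -----   ---- --  --------     -----------\n"
HEADER = "    E-value  score  bias    E-value  score  bias    exp  N  Model        Description\n"

# Hypothetical miniature of one result table, shortened for illustration.
sample = (
    HEADER
    + RULE
    + "      3e-24   85.3   0.0    5.2e-24   84.5   0.0    1.4  1  ABC_tran     ABC transporter\n"
    + "       0.23   11.3   1.3       0.92    9.4   0.1    2.7  4  RNA_helicase RNA helicase\n"
    + "\n\n"
)

# Attempt 1: [^(\n)^(\n)^] is a plain character class, i.e. "anything except
# '(', ')', '^' or a newline", so the match ends at the first line break.
first = re.compile("(?<=" + re.escape(RULE) + ")[^(\n)^(\n)^]+")
first_hits = first.findall(sample)

# Attempt 2: the whole match spans several lines, but findall() returns the
# capture group, and a repeated group only remembers its last repetition.
second = re.compile("(?<=" + re.escape(HEADER) + r")(.+\s)+")
second_hits = second.findall(sample)
whole_match = second.search(sample).group(0)

print(first_hits)    # only the first data row
print(second_hits)   # only the last data row, with its trailing newline
print(whole_match)   # the rule line plus both data rows
```

The trailing newline kept in each captured row would also account for the blank lines between the printed results.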

Find the structure of several whitespace-separated numbers followed by words:

data_regex = re.compile("(\s+([+-]?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?)\s+)(\w+\s)+")

But it finds many isolated numbers rather than the numbers-whitespace-words structure I want:

(' 2010 ', '2010', 'Medical ')
(' 1 ', '1', 'domain ')
('    1.5  ', '1.5', '1 ')
('   6.2e-27      ', '6.2e-27', '12 ')
('      17     ', '17', '129 ')
('       7     ', '7', '130 ')
(' 0.92\n\n  ', '0.92', 'each ')
(' 5.2e-31\n                        ', '5.2e-31', 'PucR ')

I use this to get the matches:

data_result = re.findall(data_regex, document)
print(data_result)
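For comparison, here is a minimal sketch of a pattern that can be run over the whole document with `findall`. It assumes every data row consists of leading blanks, eight blank-separated numbers and then text, as in the excerpt; the names `NUM`, `ROW` and the shortened `document` are hypothetical. Anchoring with `^` under `re.MULTILINE` ties each match to one line, `[ \t]+` keeps a match from spilling across line breaks, and using only non-capturing groups makes `findall` return whole lines instead of tuples.

```python
import re

# One number: optional sign, digits, optional fraction, optional exponent.
NUM = r"[+-]?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?"

# Leading blanks, eight numbers each followed by blanks, then the rest of the line.
ROW = re.compile(r"^[ \t]+(?:" + NUM + r"[ \t]+){8}\S.*$", re.MULTILINE)

# Hypothetical shortened document, for illustration only.
document = (
    "Query:       LD_216  [L=247]\n"
    "    E-value  score  bias    E-value  score  bias    exp  N  Model        Description\n"
    "    ------- ------ -----    ------- ------ -----   ---- --  --------     -----------\n"
    "      3e-24   85.3   0.0    5.2e-24   84.5   0.0    1.4  1  ABC_tran     ABC transporter\n"
    "  ------ inclusion threshold ------\n"
    "      0.016   14.5   0.3      0.037   13.4   0.2    1.8  1  Arch_ATPase  Archaeal ATPase\n"
    "\n\n"
    "Domain annotation for each model (and alignments):\n"
)

rows = ROW.findall(document)
for row in rows:
    print(row)
```

This returns the rows but not their position relative to the inclusion threshold line; for that, a line-by-line pass is easier.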

The kind of data I am parsing, an excerpt from the file:

# CPU time: 0.66u 0.50s 00:00:01.16 Elapsed: 00:00:00.55
# Mc/sec: 902.81
//
Query:       LD_216  [L=247]
Description: # 237337 # 238077 # 1 # ID=1_216;partial=00;start_type=ATG;rbs_motif=GGAG/GAGG;rbs_spacer=5-10bp;gc_cont=0.390
Scores for complete sequence (score includes all domains):
   --- full sequence ---   --- best 1 domain ---    -#dom-
    E-value  score  bias    E-value  score  bias    exp  N  Model        Description
    ------- ------ -----    ------- ------ -----   ---- --  --------     -----------
      3e-24   85.3   0.0    5.2e-24   84.5   0.0    1.4  1  ABC_tran     ABC transporter
    3.2e-11   42.5   0.3    9.7e-11   40.9   0.2    1.7  1  SMC_N        RecF/RecN/SMC N terminal domain
    3.1e-05   22.4   0.1       0.17   10.1   0.0    2.6  2  ABC_ATPase   Predicted ATPase of the ABC class
    6.5e-05   21.8   0.1     0.0001   21.2   0.0    1.3  1  DUF258       Protein of unknown function, DUF258
      0.001   19.0   0.5       0.21   11.5   0.0    2.2  2  AAA          ATPase family associated with various cellular 
     0.0019   16.4   0.1     0.0046   15.1   0.0    1.6  2  DLIC         Dynein light intermediate chain (DLIC)
     0.0032   15.8   0.1      0.028   12.7   0.0    2.0  2  Adeno_IVa2   Adenovirus IVa2 protein
  ------ inclusion threshold ------
      0.016   14.5   0.3      0.037   13.4   0.2    1.8  1  Arch_ATPase  Archaeal ATPase
      0.018   14.3   0.0      0.046   13.0   0.0    1.6  1  UPF0079      Uncharacterised P-loop hydrolase UPF0079
       0.02   13.3   0.2      0.041   12.3   0.1    1.4  1  Rad17        Rad17 cell cycle checkpoint protein
      0.026   13.7   0.1      0.049   12.8   0.0    1.4  1  PduV-EutP    Ethanolamine utilisation - propanediol utilisat
      0.046   12.2   0.0      0.085   11.4   0.0    1.5  1  GSPII_E      Type II/IV secretion system protein
       0.05   12.4   0.0      0.087   11.6   0.0    1.4  1  Mg_chelatase Magnesium chelatase, subunit ChlI
      0.054   12.0   0.2      0.094   11.2   0.2    1.7  1  NB-ARC       NB-ARC domain
      0.056   12.9   0.1       0.15   11.5   0.1    1.8  1  MobB         Molybdopterin guanine dinucleotide synthesis pr
      0.059   12.0   0.4        8.9    4.8   0.0    2.4  2  KAP_NTPase   KAP family P-loop domain
      0.079   12.3   0.3       0.57    9.5   0.1    2.1  2  AAA_5        AAA domain (dynein-related subfamily)
      0.086   11.9   0.2       0.32   10.0   0.0    2.0  2  IstB         IstB-like ATP binding protein
       0.13   11.0   1.6        3.5    6.3   0.1    2.7  3  KaiC         KaiC
       0.23   11.3   1.3       0.92    9.4   0.1    2.7  4  RNA_helicase RNA helicase


Domain annotation for each model (and alignments):
>> ABC_tran  ABC transporter


# Here begins other type of data but above there are two empty lines

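The ------ inclusion threshold ------ line can appear right after the ------- ------ ----- ------- ------ ----- ---- -- -------- ----------- line or at any other position. If possible, I would like to know on which side of it each matched line falls, because rows below the inclusion threshold need to be handled differently.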

How can I get these lines from the file?

Expected output:

      3e-24   85.3   0.0    5.2e-24   84.5   0.0    1.4  1  ABC_tran     ABC transporter
    3.2e-11   42.5   0.3    9.7e-11   40.9   0.2    1.7  1  SMC_N        RecF/RecN/SMC N terminal domain
    3.1e-05   22.4   0.1       0.17   10.1   0.0    2.6  2  ABC_ATPase   Predicted ATPase of the ABC class
    6.5e-05   21.8   0.1     0.0001   21.2   0.0    1.3  1  DUF258       Protein of unknown function, DUF258
      0.001   19.0   0.5       0.21   11.5   0.0    2.2  2  AAA          ATPase family associated with various cellular 
     0.0019   16.4   0.1     0.0046   15.1   0.0    1.6  2  DLIC         Dynein light intermediate chain (DLIC)
     0.0032   15.8   0.1      0.028   12.7   0.0    2.0  2  Adeno_IVa2   Adenovirus IVa2 protein

      0.016   14.5   0.3      0.037   13.4   0.2    1.8  1  Arch_ATPase  Archaeal ATPase
      0.018   14.3   0.0      0.046   13.0   0.0    1.6  1  UPF0079      Uncharacterised P-loop hydrolase UPF0079
       0.02   13.3   0.2      0.041   12.3   0.1    1.4  1  Rad17        Rad17 cell cycle checkpoint protein
      0.026   13.7   0.1      0.049   12.8   0.0    1.4  1  PduV-EutP    Ethanolamine utilisation - propanediol utilisat
      0.046   12.2   0.0      0.085   11.4   0.0    1.5  1  GSPII_E      Type II/IV secretion system protein
       0.05   12.4   0.0      0.087   11.6   0.0    1.4  1  Mg_chelatase Magnesium chelatase, subunit ChlI
      0.054   12.0   0.2      0.094   11.2   0.2    1.7  1  NB-ARC       NB-ARC domain
      0.056   12.9   0.1       0.15   11.5   0.1    1.8  1  MobB         Molybdopterin guanine dinucleotide synthesis pr
      0.059   12.0   0.4        8.9    4.8   0.0    2.4  2  KAP_NTPase   KAP family P-loop domain
      0.079   12.3   0.3       0.57    9.5   0.1    2.1  2  AAA_5        AAA domain (dynein-related subfamily)
      0.086   11.9   0.2       0.32   10.0   0.0    2.0  2  IstB         IstB-like ATP binding protein
       0.13   11.0   1.6        3.5    6.3   0.1    2.7  3  KaiC         KaiC
       0.23   11.3   1.3       0.92    9.4   0.1    2.7  4  RNA_helicase RNA helicase

Edit: I ended up reading the file with readlines() and then doing the following for each line:

elif lines.startswith("   "):
    data_regex = re.compile(r"-?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?")  # Matches numbers
    data_result = re.findall(data_regex, lines)
    data_regex2 = re.compile(r"[?!]")  # Some other characters found
    data_result2 = re.findall(data_regex2, lines)
    data_regex3 = re.compile(r"-{2,}")  # Finds the ----- lines
    data_result3 = re.findall(data_regex3, lines)

    # Keep the line if it contains numbers, has 10 or more whitespace-separated
    # tokens (8 numbers plus the id and the description), contains no "strange"
    # character and is not a --- line.
    if data_result != [] and len(lines.split()) >= 10 and data_result2 == [] and data_result3 == []:
        print(lines[:-1])
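The same filter can be sketched as a standalone predicate, which makes it easier to try on individual lines. This is only a restatement of the checks above under the same assumptions; the names `NUMBER`, `ODD_CHARS`, `DASH_RUN` and `is_data_row` are hypothetical, and the patterns are compiled once rather than on every line.

```python
import re

NUMBER = re.compile(r"-?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?")  # any number
ODD_CHARS = re.compile(r"[?!]")                            # characters seen in other tables
DASH_RUN = re.compile(r"-{2,}")                            # rule or separator lines


def is_data_row(line):
    """Return True when a line passes the same checks as the filter above."""
    if not line.startswith("   "):
        return False
    return (
        NUMBER.search(line) is not None      # contains at least one number
        and len(line.split()) >= 10          # 8 numbers + id + description
        and ODD_CHARS.search(line) is None   # no "strange" characters
        and DASH_RUN.search(line) is None    # not a --- line
    )


# Lines copied from the excerpt above.
row = "      3e-24   85.3   0.0    5.2e-24   84.5   0.0    1.4  1  ABC_tran     ABC transporter\n"
rule = "    ------- ------ -----    ------- ------ -----   ---- --  --------     -----------\n"
threshold = "  ------ inclusion threshold ------\n"

print(is_data_row(row), is_data_row(rule), is_data_row(threshold))
```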

Answer 1

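My suggestions:

Remove all of those comment lines, such as -----blablabla-----, so that you are left with a file containing only the data columns.
If you use numpy, and assuming the columns are tab-delimited: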
```
#!/usr/bin/env python

import numpy as np

dat = np.genfromtxt('data.txt', delimiter='\t', dtype=str)
```
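dat will then hold all the numbers and words in a 2-D array of type str, and dat[:,0:7] will hold the first seven numeric columns (use dat[:,0:8] to include the N column as well).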
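The cleanup step is not shown in this answer. A minimal sketch of one way to do it, assuming the columns are separated by runs of blanks as in the excerpt from the question rather than by tabs: keep only the lines whose first eight tokens parse as numbers, and join the ten fields with tabs so the result can be saved and loaded with `np.genfromtxt(..., delimiter='\t', dtype=str)`. The helper names `is_number` and `to_tsv` are hypothetical.

```python
def is_number(token):
    """Return True if the token can be read as a float (e.g. '3e-24', '0.17', '1')."""
    try:
        float(token)
        return True
    except ValueError:
        return False


def to_tsv(lines):
    """Keep only table rows and return them as tab-separated strings."""
    out = []
    for line in lines:
        # At most 10 fields: 8 numbers, the model id, and the free-text description.
        fields = line.split(None, 9)
        if len(fields) >= 9 and all(is_number(f) for f in fields[:8]):
            out.append("\t".join(f.strip() for f in fields))
    return out


# Lines copied from the excerpt in the question.
sample = [
    "Scores for complete sequence (score includes all domains):\n",
    "    E-value  score  bias    E-value  score  bias    exp  N  Model        Description\n",
    "    ------- ------ -----    ------- ------ -----   ---- --  --------     -----------\n",
    "      3e-24   85.3   0.0    5.2e-24   84.5   0.0    1.4  1  ABC_tran     ABC transporter\n",
    "  ------ inclusion threshold ------\n",
    "      0.016   14.5   0.3      0.037   13.4   0.2    1.8  1  Arch_ATPase  Archaeal ATPase\n",
]

cleaned = to_tsv(sample)
for row in cleaned:
    print(row)
```

Capping the split at ten fields keeps a multi-word description in a single column instead of spreading it over several.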

Answer 2

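I ended up writing this regular expression, which I apply after reading the file line by line.

data_regex = re.compile(r"^ {3,10}((-?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?)\s*){8}.+")

It checks for enough spaces ({3,10}) at the start of the line (^) to avoid other data, followed by 8 ({8}) numbers (-?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?) that may have whitespace between them (\s*), and then the rest of the line (.+).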
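A sketch of how a pattern like this might be used per line while also recording on which side of the inclusion threshold each row falls. Two things here are assumptions, not part of the answer: the flag is reset whenever a line starts with "Query:", and `STRICT` is a hypothetical variant that requires whitespace (`\s+`) after each number. Because `\s*` may match nothing, the original pattern can also accept a single run of eight digits as "eight numbers".

```python
import re

NUM = r"-?\d+(?:\.\d+)?(?:[eE][+-]?\d+)?"
LOOSE = re.compile(r"^ {3,10}((" + NUM + r")\s*){8}.+")      # pattern from the answer
STRICT = re.compile(r"^ {3,10}(?:" + NUM + r"\s+){8}\S.*")   # hypothetical stricter variant


def tag_rows(lines, pattern=STRICT):
    """Yield (model, full_sequence_evalue, below_threshold) for each table row."""
    below_threshold = False
    for line in lines:
        if line.startswith("Query:"):
            below_threshold = False      # assumed: each query starts a new table
        elif "inclusion threshold" in line:
            below_threshold = True       # rows from here on are below the threshold
        elif pattern.match(line):
            fields = line.split(None, 9)  # 8 numbers, model id, description
            yield fields[8], float(fields[0]), below_threshold


# Lines copied from the excerpt in the question.
lines = [
    "Query:       LD_216  [L=247]\n",
    "    E-value  score  bias    E-value  score  bias    exp  N  Model        Description\n",
    "    ------- ------ -----    ------- ------ -----   ---- --  --------     -----------\n",
    "      3e-24   85.3   0.0    5.2e-24   84.5   0.0    1.4  1  ABC_tran     ABC transporter\n",
    "  ------ inclusion threshold ------\n",
    "      0.016   14.5   0.3      0.037   13.4   0.2    1.8  1  Arch_ATPase  Archaeal ATPase\n",
]

tagged = list(tag_rows(lines))
for model, evalue, below in tagged:
    print(model, evalue, "below threshold" if below else "above threshold")
```

In real use, `lines` would come from iterating over the open file, so the whole document never needs to be held in one string.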
