Use re (regular expression) to parse only chunks of a line

Time: 2018-03-09 19:22:47

Tags: python python-3.x pandas dataframe

My laboratory works with software that produces very messy output, so I’m trying to make things easier with Python. So far, I believe the best approach is to split each line into a list and treat it as chunks of data, but that is not so easy. The first chunk is simple: its 3 columns are fixed and can be obtained with:

chunk1 = my_data[:3]

The 2nd chunk is harder because it can have 2, 3 or 4 columns. I believe the key is that the 2nd chunk ends at the first token that contains letters (for example, in 1 3 7 CCC it ends before CCC). It should be possible to use the re module to capture the two, three or four columns and stop before the first letter, but I don’t know how to do it. I intend to “normalize” these columns by filling the vacant spots with zeros or “-”: the 2-column case becomes [x, y, 0, 0] and the 3-column case becomes [x, y, z, 0].

The 3rd chunk is fixed (two, three or four letters and a number) like this: CCC 119.62

And the 4th chunk is the rest.

Here is a representation of the messy output: [image]

The final result could be something like: ["s 91", "1.00", "OUT"] ["9", "3", "12", "7"] ["OCCC", "0.34"] ["f829", "27","f752","33"]

So far, I’m stuck trying to figure out how to make the re module work like this: [image]

Any help is much appreciated, guys.

Data sample

s 27    1.00   STRE   30   16   OC    1.355049  f1291 50
s 28   -1.00   STRE    8    6   CC    1.494281  f1340 12  f1271 17
s 29   -1.00   STRE   14   15   NC    1.421282  f1358 49
s 30    1.00   STRE   14   15   NC    1.421282  f1337 10  f1290 33
s 31    1.00   STRE    8    6   CC    1.494281  f1171 15  f323 11
s 32    1.00   STRE   30   31   OC    1.419982  f1082 51  f1077 24
s 33    1.00   STRE   13   11   ClC    1.740581  f842 15  f323 19
s 34   -1.00   BEND    1    3    7   CCC   119.62  f1037 26  f485 10
s 35   -1.00   BEND    3    1    4   CCC   119.74  f1124 29
s 36    1.00   BEND    7    3    1   CCC   119.62  f733 25  f288 13
s 37    1.00   BEND   21   14   15   HNC   116.16  f1578 40  f1560 20
s 38    1.00   BEND   24    5    2   HCC   119.73  f1186 67
s 39    1.00   BEND   25    2    6   HCC   118.80  f1536 53  f1082 10  f1077 17
s 40   -1.00   BEND   24    5    2   HCC   119.73  f1508 44  f1171 14  f1124 13
s 41    1.00   BEND   25    2    6   HCC   118.80  f1669 14  f1271 32  f1124 15
s 42   -1.00   BEND   26   19   18   HCC   119.04  f1578 10  f1560 37  f1291 11
s 89    1.00   TORS   31   30   16   19   COCC     0.24  f161 14  f104 46  f87 19  f43 10
s 90    1.00   OUT     8    2    3    6   CCCC     1.09  f466 36  f125 22
s 91    1.00   OUT     9    3   12    7   OCCC     0.34  f829 27  f752 33

3 Answers:

Answer 0 (score: 2)

You don't need a regular expression to solve this problem. You can do it like this:

from io import StringIO

text = """s 27    1.00   STRE   30   16   OC    1.355049  f1291 50
s 34   -1.00   BEND    1    3    7   CCC   119.62  f1037 26  f485 10
s 89    1.00   TORS   31   30   16   19   COCC     0.24  f161 14  f104 46  f87 19  f43 10
s 91    1.00   OUT     9    3   12    7   OCCC     0.34  f829 27  f752 33"""

my_file = StringIO(text)

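chunks = []
for line in my_file:
    my_data = line.split()
    # Chunk 1: the four fixed leading tokens, e.g. s, 27, 1.00, STRE.
    chunk1 = my_data[:4]
    # Chunk 2: always has at least two numeric columns...
    chunk2 = my_data[4:6]
    # ...and up to two more; stop at the first non-numeric token.
    for i in range(6, 8):
        if my_data[i].isdigit():
            chunk2.append(my_data[i])
        else:
            break
    # Chunk 3 (letters plus one number) starts right after chunk 2.
    chunk3_start = len(chunk1) + len(chunk2)
    chunk3 = my_data[chunk3_start:chunk3_start+2]
    # Chunk 4: everything that is left.
    chunk4 = my_data[chunk3_start+2:]
    chunks.append({1: chunk1, 2: chunk2, 3: chunk3, 4: chunk4})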

This produces the following output:

[{1: ['s', '27', '1.00', 'STRE'],
  2: ['30', '16'],
  3: ['OC', '1.355049'],
  4: ['f1291', '50']},
 {1: ['s', '34', '-1.00', 'BEND'],
  2: ['1', '3', '7'],
  3: ['CCC', '119.62'],
  4: ['f1037', '26', 'f485', '10']},
 {1: ['s', '89', '1.00', 'TORS'],
  2: ['31', '30', '16', '19'],
  3: ['COCC', '0.24'],
  4: ['f161', '14', 'f104', '46', 'f87', '19', 'f43', '10']},
 {1: ['s', '91', '1.00', 'OUT'],
  2: ['9', '3', '12', '7'],
  3: ['OCCC', '0.34'],
  4: ['f829', '27', 'f752', '33']}]

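Basically, you keep appending elements to chunk2 until you hit something that is not a number. Then you use the lengths of chunk1 and chunk2 to locate the remaining chunks.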
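The question also asks for chunk2 to be "normalized" to four entries. A minimal sketch of that padding step, assuming the tokens stay strings and "0" is the filler (the helper name `pad_chunk` and its parameters are made up for illustration):

```python
def pad_chunk(chunk, size=4, fill="0"):
    """Return a copy of chunk, padded on the right with fill up to size items."""
    # A negative repeat count yields an empty list, so full-width chunks pass through.
    return list(chunk) + [fill] * (size - len(chunk))

print(pad_chunk(["30", "16"]))           # ['30', '16', '0', '0']
print(pad_chunk(["1", "3", "7"]))        # ['1', '3', '7', '0']
print(pad_chunk(["9", "3", "12", "7"]))  # ['9', '3', '12', '7']
```

Passing fill="-" gives the other filler mentioned in the question.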

Answer 1 (score: 2)

I wrote a generator that pulls items from an iterator until it finds an alphabetic string.

from itertools import chain

def while_not_alpha(iterator):
    iterator = iter(iterator)
    for s in iterator:
        if not str(s).isalpha():
            yield s
        else:
            yield chain([s], iterator)
            break

def parse(line):
    *chunk1, rest = line.split(maxsplit=4)
    *chunk2, rest = while_not_alpha(rest.split())
    rest = list(rest)
    chunk3 = rest[:2]
    chunk4 = rest[2:]
    return chunk1, chunk2, chunk3, chunk4

# See below for definition of `txt`
chunk1, chunk2, chunk3, chunk4 = map(list, zip(*map(parse, txt.splitlines())))

We can see that chunk2 looks like this:

chunk2[:4]

[['30', '16'],
 ['8', '6'],
 ['14', '15'],
 ['14', '15']]

And chunk3:

chunk3[:4]

[['OC', '1.355049'],
 ['CC', '1.494281'],
 ['NC', '1.421282'],
 ['NC', '1.421282']]
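To make the generator's behaviour easier to follow, here is a small self-contained sketch that runs it on the variable part of a single line. The generator is repeated so the snippet runs on its own; the names `tokens`, `numbers` and `remainder` are only for illustration.

```python
from itertools import chain

def while_not_alpha(iterator):
    # Same generator as above, repeated so this sketch is self-contained.
    iterator = iter(iterator)
    for s in iterator:
        if not str(s).isalpha():
            yield s
        else:
            # Final item: the alphabetic token plus everything not yet consumed.
            yield chain([s], iterator)
            break

tokens = "1 3 7 CCC 119.62 f1037 26 f485 10".split()
*numbers, remainder = while_not_alpha(tokens)
remainder = list(remainder)

print(numbers)    # ['1', '3', '7']
print(remainder)  # ['CCC', '119.62', 'f1037', '26', 'f485', '10']
```

One caveat: this relies on every line containing a purely alphabetic token. If none is found, the last item unpacked is an ordinary string rather than a chain, so malformed lines would need a separate check.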

We can go one step further and build dataframes:

import pandas as pd

chunk1, chunk2, chunk3, chunk4 = map(
    pd.DataFrame, map(list, zip(*map(parse, txt.splitlines()))))

chunk2

     0   1     2     3
0   30  16  None  None
1    8   6  None  None
2   14  15  None  None
3   14  15  None  None
4    8   6  None  None
5   30  31  None  None
6   13  11  None  None
7    1   3     7  None
8    3   1     4  None
9    7   3     1  None
10  21  14    15  None
11  24   5     2  None
12  25   2     6  None
13  24   5     2  None
14  25   2     6  None
15  26  19    18  None
16  31  30    16    19
17   8   2     3     6
18   9   3    12     7

Or go even further:

df = pd.concat(
    map(pd.DataFrame, map(list, zip(*map(parse, txt.splitlines())))),
    axis=1, keys=[f'chunk{i}' for i in range(1, 5)]
)

df

   chunk1                  chunk2                 chunk3           chunk4                                          
        0   1      2     3      0   1     2     3      0         1      0   1      2     3      4     5     6     7
0       s  27   1.00  STRE     30  16  None  None     OC  1.355049  f1291  50   None  None   None  None  None  None
1       s  28  -1.00  STRE      8   6  None  None     CC  1.494281  f1340  12  f1271    17   None  None  None  None
2       s  29  -1.00  STRE     14  15  None  None     NC  1.421282  f1358  49   None  None   None  None  None  None
3       s  30   1.00  STRE     14  15  None  None     NC  1.421282  f1337  10  f1290    33   None  None  None  None
4       s  31   1.00  STRE      8   6  None  None     CC  1.494281  f1171  15   f323    11   None  None  None  None
5       s  32   1.00  STRE     30  31  None  None     OC  1.419982  f1082  51  f1077    24   None  None  None  None
6       s  33   1.00  STRE     13  11  None  None    ClC  1.740581   f842  15   f323    19   None  None  None  None
7       s  34  -1.00  BEND      1   3     7  None    CCC    119.62  f1037  26   f485    10   None  None  None  None
8       s  35  -1.00  BEND      3   1     4  None    CCC    119.74  f1124  29   None  None   None  None  None  None
9       s  36   1.00  BEND      7   3     1  None    CCC    119.62   f733  25   f288    13   None  None  None  None
10      s  37   1.00  BEND     21  14    15  None    HNC    116.16  f1578  40  f1560    20   None  None  None  None
11      s  38   1.00  BEND     24   5     2  None    HCC    119.73  f1186  67   None  None   None  None  None  None
12      s  39   1.00  BEND     25   2     6  None    HCC    118.80  f1536  53  f1082    10  f1077    17  None  None
13      s  40  -1.00  BEND     24   5     2  None    HCC    119.73  f1508  44  f1171    14  f1124    13  None  None
14      s  41   1.00  BEND     25   2     6  None    HCC    118.80  f1669  14  f1271    32  f1124    15  None  None
15      s  42  -1.00  BEND     26  19    18  None    HCC    119.04  f1578  10  f1560    37  f1291    11  None  None
16      s  89   1.00  TORS     31  30    16    19   COCC      0.24   f161  14   f104    46    f87    19   f43    10
17      s  90   1.00   OUT      8   2     3     6   CCCC      1.09   f466  36   f125    22   None  None  None  None
18      s  91   1.00   OUT      9   3    12     7   OCCC      0.34   f829  27   f752    33   None  None  None  None

Setup

txt = """\
s 27    1.00   STRE   30   16   OC    1.355049  f1291 50
s 28   -1.00   STRE    8    6   CC    1.494281  f1340 12  f1271 17
s 29   -1.00   STRE   14   15   NC    1.421282  f1358 49
s 30    1.00   STRE   14   15   NC    1.421282  f1337 10  f1290 33
s 31    1.00   STRE    8    6   CC    1.494281  f1171 15  f323 11
s 32    1.00   STRE   30   31   OC    1.419982  f1082 51  f1077 24
s 33    1.00   STRE   13   11   ClC    1.740581  f842 15  f323 19
s 34   -1.00   BEND    1    3    7   CCC   119.62  f1037 26  f485 10
s 35   -1.00   BEND    3    1    4   CCC   119.74  f1124 29
s 36    1.00   BEND    7    3    1   CCC   119.62  f733 25  f288 13
s 37    1.00   BEND   21   14   15   HNC   116.16  f1578 40  f1560 20
s 38    1.00   BEND   24    5    2   HCC   119.73  f1186 67
s 39    1.00   BEND   25    2    6   HCC   118.80  f1536 53  f1082 10  f1077 17
s 40   -1.00   BEND   24    5    2   HCC   119.73  f1508 44  f1171 14  f1124 13
s 41    1.00   BEND   25    2    6   HCC   118.80  f1669 14  f1271 32  f1124 15
s 42   -1.00   BEND   26   19   18   HCC   119.04  f1578 10  f1560 37  f1291 11
s 89    1.00   TORS   31   30   16   19   COCC     0.24  f161 14  f104 46  f87 19  f43 10
s 90    1.00   OUT     8    2    3    6   CCCC     1.09  f466 36  f125 22
s 91    1.00   OUT     9    3   12    7   OCCC     0.34  f829 27  f752 33"""

Answer 2 (score: 2)

Here is my variant:

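import re

def simple_parsing(string):
    # Split on runs of whitespace; strip first so no empty tokens appear at the ends.
    parts = re.split(r'\s+', string.strip())
    result = []
    i = 4  # the first four tokens form chunk 1
    # Collect tokens until the first purely alphabetic one (start of chunk 3).
    while not parts[i].isalpha():
        result.append(parts[i])
        i += 1
    return [parts[0:4], result, parts[i:i+2], parts[i+2:]]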

For example, taking one of your lines, the result is:

simple_parsing('s 91    1.00   OUT     9    3   12    7   OCCC     0.34  f829 27  f752 33')
[['s', '91', '1.00', 'OUT'], ['9', '3', '12', '7'], ['OCCC', '0.34'], ['f829', '27', 'f752', '33']]
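Since the question asks about re specifically, the same split can also be sketched with a single regular expression. This is only a sketch written against the sample lines in the question: the pattern, the group names and the function `parse_with_re` are my own assumptions, and it assumes chunk 2 always consists of two to four plain integers followed by a token made only of letters.

```python
import re

# Hypothetical pattern, written against the sample lines in the question only.
LINE_RE = re.compile(
    r"""
    ^\s*
    (?P<chunk1>\S+\s+\S+\s+\S+\s+\S+)   # four fixed tokens, e.g. s 91 1.00 OUT
    \s+
    (?P<chunk2>\d+(?:\s+\d+){1,3})      # two to four integer columns
    \s+
    (?P<chunk3>[A-Za-z]+\s+\S+)         # letters followed by one number
    (?:\s+(?P<chunk4>.*\S))?            # whatever is left, if anything
    \s*$
    """,
    re.VERBOSE,
)

def parse_with_re(line, size=4, fill="0"):
    match = LINE_RE.match(line)
    if match is None:
        raise ValueError(f"unrecognized line: {line!r}")
    chunk1 = match["chunk1"].split()
    chunk2 = match["chunk2"].split()
    chunk3 = match["chunk3"].split()
    # Pad the variable-width chunk so it always has `size` entries.
    chunk2 += [fill] * (size - len(chunk2))
    # chunk4 is optional; a missing group comes back as None.
    chunk4 = (match["chunk4"] or "").split()
    return chunk1, chunk2, chunk3, chunk4

print(parse_with_re('s 34   -1.00   BEND    1    3    7   CCC   119.62  f1037 26  f485 10'))
# (['s', '34', '-1.00', 'BEND'], ['1', '3', '7', '0'], ['CCC', '119.62'], ['f1037', '26', 'f485', '10'])
```

Compared with the split-based answers, the pattern rejects lines that do not fit the expected layout instead of silently mis-slicing them, at the cost of being harder to read.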