Extract the lines between Question and Answer

Time: 2019-02-18 14:04:14

Tags: python regex

Question No. 01 
Which of the following has more fire resisting characteristics? 
(A) Marble 
(B) Lime stone 
(C) Compact sand stone 
(D) Granite 
Answer: Option C 

Question No. 02 
The rocks which are formed due to cooling of magma at a considerable depth from earth's surface are called 
(A) Plutonic rocks 
(B) Hypabyssal rocks 
(C) Volcanic rocks 
(D) Igneous rocks 
Answer: Option A 

Question No. 03 
Plywood has the advantage of 
(A) Greater tensile strength in longer direction 
(B) Greater tensile strength in shorter direction 
(C) Same tensile strength in all directions 
(D) None of the above 
Answer: Option C 

I am trying to extract the text between Question No. \d+ and Answer: Option, as a list.

import re

with open('Building materials.txt', 'r') as lines:
    for line in lines:
        if re.search(r'Question No. (\d+)', line):
            print(line.split())

Expected output:

['Which of the following has more fire resisting characteristics?\n(A) Marble \n(B) Lime stone \n(C) Compact sand stone \n(D) Granite', "The rocks which are formed due to cooling of magma at a considerable depth from earth's surface are called \n(A) Plutonic rocks \n(B) Hypabyssal rocks \n(C) Volcanic rocks \n(D) Igneous rocks"]

3 answers:

Answer 0 (score: 2)

You can use

^Question[^\d\r\n]+
(?P<nr>\d+)\s+
(?P<block>[\s\S]+?)(?=^Answer|\Z)

with the verbose and multiline flags; see a demo on regex101.com.

In Python:

import re
rx = re.compile(r'''
    ^Question[^\d\r\n]+
    (?P<nr>\d+)\s+
    (?P<block>[\s\S]+?)(?=^Answer|\Z)''', re.M | re.X)

for m in rx.finditer(your_data_as_string_here):
    print(m.group('nr'), m.group('block'))
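
Since the question asks for a list, the matches can be collected with a list comprehension. The following is only a minimal sketch: it assumes the quiz text is already in a string, and `sample` and `extract_blocks` are names made up for this illustration. The lookahead stops at a line that starts with `Answer`, so the answer line is left out of each block.

```python
import re

# Hypothetical sample data mirroring the layout in the question.
sample = """Question No. 01
Which of the following has more fire resisting characteristics?
(A) Marble
(B) Lime stone
(C) Compact sand stone
(D) Granite
Answer: Option C

Question No. 02
Plywood has the advantage of
(A) Greater tensile strength in longer direction
(B) Greater tensile strength in shorter direction
(C) Same tensile strength in all directions
(D) None of the above
Answer: Option C
"""

rx = re.compile(r'''
    ^Question[^\d\r\n]+          # header line up to the number
    (?P<nr>\d+)\s+               # question number, then the line break
    (?P<block>[\s\S]+?)          # question text and options, matched lazily
    (?=^Answer|\Z)               # stop before the Answer line or at the end
    ''', re.M | re.X)


def extract_blocks(text):
    """Return the question-plus-options text of every block as a list."""
    return [m.group('block').rstrip() for m in rx.finditer(text)]


blocks = extract_blocks(sample)
print(blocks)
```

Each list item is one string holding the question and its options, separated by `\n`, which is the shape shown in the question.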

Answer 1 (score: 1)

This reads the file line by line and stores the lines in a list.

with open(fname) as f:
    content = f.readlines()

If you want to get rid of the newline characters, strip the trailing newline from each line.

for i in range(len(content)):
    content[i] = content[i].rstrip('\n')
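
Reading the lines is only the first step; the text between each `Question No.` header and its `Answer:` line still has to be grouped. A minimal line-by-line sketch, assuming every block starts with a line beginning with `Question No.` and ends at a line beginning with `Answer:`. `collect_questions` is a hypothetical helper name; it accepts any iterable of lines, so an open file object should work as well as a list.

```python
def collect_questions(lines):
    """Group the lines between each 'Question No.' header and 'Answer:' line."""
    questions = []   # finished blocks, one string per question
    current = None   # lines of the block being read, or None outside a block
    for raw in lines:
        line = raw.rstrip('\n')
        if line.startswith('Question No.'):
            current = []                      # header seen: start a new block
        elif line.startswith('Answer:'):
            if current is not None:
                questions.append('\n'.join(current))
            current = None                    # block closed
        elif current is not None:
            current.append(line.rstrip())     # question text or an option
    return questions


# Hypothetical sample lines, as readlines() would return them.
sample_lines = [
    'Question No. 01 \n',
    'Which of the following has more fire resisting characteristics? \n',
    '(A) Marble \n',
    '(B) Lime stone \n',
    '(C) Compact sand stone \n',
    '(D) Granite \n',
    'Answer: Option C \n',
    '\n',
]
print(collect_questions(sample_lines))
```

This variant does not depend on blank lines or on a fixed number of options, only on the two marker prefixes.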

Answer 2 (score: 1)

"""
This answer works if your format is always the same, meaning...
Question Number
Question
Option 1
Option 2
Option N
...
Correct answer.

It doesn't matter how many options there are.
"""

if __name__ == '__main__':
    #   Opening your text file.
    with open('file.txt', 'r') as f:
        #   You're getting a list of lines out of it.
        lines = f.readlines()

    #   You want to split your text into blocks.
    #   Blocks are separated by a blank line, i.e. a double '\n'.
    #   First, join all the lines and strip the trailing newline so the
    #   last block ends with its answer line, then re-split on that token.
    lines = ''.join(lines).strip().split('\n\n')

    #   Here, we use the index to change each item in place.
    for index in range(len(lines)):
        #   First: lines[index].split('\n')[1:-1]
        #   This splits the block on the inner '\n' and drops the
        #   header (first line) and the answer (last line).
        #   Then, rejoin using the '\n' that split removed.
        lines[index] = '\n'.join( lines[index].split('\n')[1:-1] )

    #   What remains is what you asked for.
    for line in lines:
        print(type(line))
        print(line, end='\n\n')
    # <class 'str'>
    # Which of the following has more fire resisting characteristics? 
    # (A) Marble 
    # (B) Lime stone 
    # (C) Compact sand stone 
    # (D) Granite 

    # <class 'str'>
    # The rocks which are formed due to cooling of magma at a considerable depth from earth's surface are called 
    # (A) Plutonic rocks 
    # (B) Hypabyssal rocks 
    # (C) Volcanic rocks 
    # (D) Igneous rocks 

    # <class 'str'>
    # Plywood has the advantage of 
    # (A) Greater tensile strength in longer direction 
    # (B) Greater tensile strength in shorter direction 
    # (C) Same tensile strength in all directions 
    # (D) None of the above
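
Splitting on an exact '\n\n' is sensitive to a newline at the very end of the file and to separator lines that contain stray spaces. A hedged variant of the same idea, assuming the same block layout (header first, answer last): `split_blocks` is a hypothetical helper, and `re.split` with the pattern `\n\s*\n` is swapped in for the plain `str.split`.

```python
import re


def split_blocks(text):
    """Split text on blank lines and drop each block's header and answer line."""
    results = []
    # strip() removes leading/trailing blank lines; \n\s*\n also matches
    # separator lines that contain only whitespace.
    for block in re.split(r'\n\s*\n', text.strip()):
        rows = block.split('\n')
        # rows[0] is the 'Question No.' header, rows[-1] the 'Answer:' line.
        results.append('\n'.join(row.rstrip() for row in rows[1:-1]))
    return results


# Hypothetical sample: whitespace-only separator line and trailing newline.
text = (
    'Question No. 01 \nQ1 text \n(A) a \n(B) b \nAnswer: Option A \n'
    '  \n'
    'Question No. 02 \nQ2 text \n(A) c \n(B) d \nAnswer: Option B \n'
)
print(split_blocks(text))
```

The positional `[1:-1]` slice still assumes that the answer sits on its own last line of every block.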

If you have a strict format, i.e. the same one shown above, and there are always exactly 4 options, then you can...

if __name__ == '__main__':
    #   Opening your text file.
    with open('file.txt', 'r') as f:
        #   You're getting a list of lines out of it.
        lines = f.readlines()

    #   Create an empty list to store our result.
    my_lines = []
    for index in range(1, len(lines), 8):
        #   Since we know exactly where each line will be, we jump from
        #   block to block (8 lines each), using the question line as
        #   our index.
        #   As the number of lines needed is always the same (the question
        #   plus 4 options), keep a fixed number of lines and join them.
        my_lines.append( ''.join(lines[index : index+5]) )

    for line in my_lines:
        print(line)
    # Which of the following has more fire resisting characteristics? 
    # (A) Marble 
    # (B) Lime stone 
    # (C) Compact sand stone 
    # (D) Granite 

    # The rocks which are formed due to cooling of magma at a considerable depth from earth's surface are called 
    # (A) Plutonic rocks 
    # (B) Hypabyssal rocks 
    # (C) Volcanic rocks 
    # (D) Igneous rocks 

    # Plywood has the advantage of 
    # (A) Greater tensile strength in longer direction 
    # (B) Greater tensile strength in shorter direction 
    # (C) Same tensile strength in all directions 
    # (D) None of the above
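
The fixed stride silently returns wrong slices if any block deviates from the 8-line layout (header, question, four options, answer, blank line). A sketch that checks this assumption before slicing; `fixed_stride_blocks` and its parameters are hypothetical names chosen for this illustration.

```python
def fixed_stride_blocks(lines, stride=8, keep=5):
    """Slice out question + options, assuming a fixed block length.

    Raises ValueError if a block does not start with a 'Question No.' header.
    """
    blocks = []
    for start in range(0, len(lines), stride):
        if not lines[start].startswith('Question No.'):
            raise ValueError('unexpected layout at line %d' % (start + 1))
        # Skip the header line, keep the question and the four options.
        blocks.append(''.join(lines[start + 1:start + 1 + keep]))
    return blocks


# Hypothetical sample in the strict 8-line layout.
lines = [
    'Question No. 01\n', 'Q1\n', '(A) a\n', '(B) b\n', '(C) c\n', '(D) d\n',
    'Answer: Option A\n', '\n',
    'Question No. 02\n', 'Q2\n', '(A) e\n', '(B) f\n', '(C) g\n', '(D) h\n',
    'Answer: Option B\n',
]
print(fixed_stride_blocks(lines))
```

Failing loudly on a misaligned block is a design choice here; for files with a varying number of options, a marker-based approach is the safer route.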