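Splitting lines into paragraphs

Time: 2012-12-01 20:09:55

Tags: python

Input: a list of lines

Output: a list of lists of lines, i.e. the input list split at every run of (one or more) empty lines.

This is my rather ugly solution so far: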

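def split_at_empty(lines):
    paragraphs = []
    p = []
    def flush():
        # Store a copy of the current paragraph, then empty p in place.
        # Rebinding p here (p = []) would make p local to flush().
        if p:
            paragraphs.append(p[:])
        del p[:]
    for l in lines:
        if l:
            p.append(l)
        else:
            flush()
    flush()
    return paragraphs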

There must be a better solution (perhaps even a functional one)! Anyone?

Example input list:

['','2','3','','5','6','7','8','','','11']

Output:

[['2','3'],['5','6','7','8'],['11']]

5 answers:

Answer 0: (score: 2)

import re

ss =  '''Princess Maria Amelia of Brazil (1831–1853)


was the daughter of Dom Pedro I,
founder of Brazil's independence and its first emperor,

and Amelie of Leuchtenberg.



The only child from her father's second marriage,
Maria Amelia was born in France
following Pedro I's 1831 abdication in favor of his son Dom Pedro II.

Before Maria Amelia was a month old, Pedro I left for Portugal
to restore its crown to his eldest daughter Dona Maria II.
He defeated his brother Miguel I (who had usurped Maria II's throne),
only to die a few months later of tuberculosis.


'''

def select_lines(input,regx = re.compile('((?:^.+\n)+)',re.MULTILINE)):
    return [x.splitlines() for x in regx.findall(input)]

for sl in select_lines(ss):
    print(sl)
    print()

Result

['Princess Maria Amelia of Brazil (1831–1853)']

['was the daughter of Dom Pedro I,', "founder of Brazil's independence and its first emperor,"]

['and Amelie of Leuchtenberg.']

["The only child from her father's second marriage,", 'Maria Amelia was born in France', "following Pedro I's 1831 abdication in favor of his son Dom Pedro II."]

['Before Maria Amelia was a month old, Pedro I left for Portugal', 'to restore its crown to his eldest daughter Dona Maria II.', "He defeated his brother Miguel I (who had usurped Maria II's throne),", 'only to die a few months later of tuberculosis.']

[['2', '3'], ['5', '6', '7', '8'], ['11']]
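
One caveat with the pattern above: every matched line has to end in '\n', so a final line without a trailing newline would be left out of the result. The following is only a sketch of one possible adjustment, assuming Python 3; the names PARAGRAPH and select_lines_any_ending are made up for this illustration and are not part of the answer.

```python
import re

# Making the newline optional lets the last line of the text match
# even when the text does not end with '\n'.
PARAGRAPH = re.compile(r'(?:^.+\n?)+', re.MULTILINE)

def select_lines_any_ending(text):
    """Split text into paragraphs, each returned as a list of lines."""
    return [block.splitlines() for block in PARAGRAPH.findall(text)]

# Note the missing newline after the final 'c'
result = select_lines_any_ending('a\nb\n\n\nc')
print(result)
```

As in the original pattern, a line that contains only spaces still counts as non-empty here.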

Another way, working directly on the list:

li = [ '', '2', '3', '', '5', '6', '7', '8', '', '', '11']

lo = ['5055','','','2','54','87','','1','2','5','8','','']

lu = ['AAAAA','BB','','HU','JU','GU']

def selines(L):
    ye = []
    for x in L:
        if x:
            ye.append(x)
        elif ye:
            # An empty line closes the current paragraph
            yield ye
            ye = []
    if ye:
        yield ye



for lx in (li, lo, lu):
    print(lx)
    print(list(selines(lx)))
    print()

Result

['', '2', '3', '', '5', '6', '7', '8', '', '', '11']
[['2', '3'], ['5', '6', '7', '8'], ['11']]

['5055', '', '', '2', '54', '87', '', '1', '2', '5', '8', '', '']
[['5055'], ['2', '54', '87'], ['1', '2', '5', '8']]

['AAAAA', 'BB', '', 'HU', 'JU', 'GU']
[['AAAAA', 'BB'], ['HU', 'JU', 'GU']]
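
Since the generator walks over its argument only once, the same idea should also work on a stream of lines rather than a list. Here is a minimal sketch under that assumption, written for Python 3; paragraphs_from_stream is a hypothetical name, and io.StringIO only stands in for an open text file.

```python
import io

def paragraphs_from_stream(stream):
    """Yield lists of non-blank lines from any iterable of lines."""
    current = []
    for raw in stream:
        line = raw.rstrip('\n')  # file iteration keeps the line terminator
        if line:
            current.append(line)
        elif current:
            yield current
            current = []
    if current:
        yield current

# io.StringIO is a stand-in for a real file object (assumption for the demo)
fake_file = io.StringIO("2\n3\n\n5\n6\n\n\n11\n")
result = list(paragraphs_from_stream(fake_file))
print(result)
```

Only one paragraph is held in memory at a time, which may matter for large files.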

Answer 1: (score: 2)

Slightly less ugly than the original:

def split_at_empty(lines):
    r = [[]]
    for l in lines:
        if l:
            r[-1].append(l)
        else:
            r.append([])
    return [l for l in r if l]

(The last line removes the empty lists that would otherwise end up in the result.)
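
To make the role of that final filter visible, here is a small sketch that returns the intermediate list unfiltered. The name split_keep_empty is invented for this illustration; the loop itself is the same as in the function above.

```python
def split_keep_empty(lines):
    """Same loop as above, but without the final filtering step."""
    r = [[]]
    for l in lines:
        if l:
            r[-1].append(l)
        else:
            r.append([])  # every empty line opens a new, initially empty group
    return r

lines = ['', '2', '3', '', '5', '6', '7', '8', '', '', '11']
raw = split_keep_empty(lines)
filtered = [group for group in raw if group]
print(raw)
print(filtered)
```

The leading empty line and the pair of adjacent empty lines each leave an empty list behind in raw, and the list comprehension drops them.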

Answer 2: (score: 1)

For those obsessed with list comprehensions...

def split_at_empty(L):
    # First list: indices where a paragraph starts; second: where it ends
    return [L[start:end+1] for start, end in zip(
        [n for n in range(len(L)) if L[n] and (n == 0 or not L[n-1])],
        [n for n in range(len(L)) if L[n] and (n+1 == len(L) or not L[n+1])]
        )]

Or better:

def split_at_empty(lines):
    L = [i for i, a in enumerate(lines) if not a]
    return [lines[s + 1:e] for s, e in zip([-1] + L, L + [len(lines)]) 
            if e > s + 1]
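
To see why the second version works, it may help to look at the index pairs it slices with. The following sketch just unrolls the one-liner on the example input from the question; the variable names empty, pairs and paragraphs are chosen here for illustration.

```python
lines = ['', '2', '3', '', '5', '6', '7', '8', '', '', '11']

# Indices of the empty lines
empty = [i for i, a in enumerate(lines) if not a]

# Each pair (s, e) brackets a candidate paragraph lines[s + 1:e];
# -1 and len(lines) act as virtual empty lines before and after the list
pairs = list(zip([-1] + empty, empty + [len(lines)]))

# Pairs with e == s + 1 are neighbouring empties and enclose nothing
paragraphs = [lines[s + 1:e] for s, e in pairs if e > s + 1]

print(empty)
print(pairs)
print(paragraphs)
```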

Answer 3: (score: 0)

You can join the list into a single string and then split it again:

>>> a = ['', '2', '3', '', '5', '6', '7', '8', '', '', '11']
>>> [x.strip().split(' ') for x in ' '.join(a).split('  ')]
[['2', '3'], ['5', '6', '7', '8'], ['11']]

You should use a regular expression to catch any number of spaces (I added one more empty string to the list here):

>>> import re
>>> pat = re.compile(r'\s{2,}')
>>> a = ['', '2', '3', '', '5', '6', '7', '8', '', '', '', '11']
>>> [x.strip().split(' ') for x in pat.split(' '.join(a))]
[['2', '3'], ['5', '6', '7', '8'], ['11']]
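
Note that joining with a space only works as long as no line contains a space itself. If the lines are ordinary text, a newline is probably a safer separator. This is a rough sketch of that variant, assuming Python 3 and assuming no line contains '\n'; split_by_joining is a hypothetical name.

```python
import re

def split_by_joining(lines):
    """Join with newlines, then split at runs of two or more newlines."""
    # Strip newlines at both ends so leading/trailing empty lines
    # do not produce empty chunks
    text = '\n'.join(lines).strip('\n')
    chunks = re.split(r'\n{2,}', text)
    return [chunk.split('\n') for chunk in chunks if chunk]

result = split_by_joining(['', 'a b', 'c', '', '', 'd e f'])
print(result)
```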

Answer 4: (score: 0)

Here is a generator-based solution:

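def split_at_empty(lines):
    # -1 stands for "just before the first line", so that a leading
    # non-empty line is not dropped by the start + 1 slice below.
    sep = [-1] + [i for (i, l) in enumerate(lines) if not l] + [len(lines)]
    for start, end in zip(sep[:-1], sep[1:]):
        if start + 1 < end:
            yield lines[start + 1:end]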

With your input:

l = ['', '2', '3', '', '5', '6', '7', '8', '', '', '11']
for para in split_at_empty(l):
    print(para)

it produces

['2', '3']
['5', '6', '7', '8']
['11']
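
As a further variant, not taken from the answer above, the grouping could also be sketched with itertools.groupby from the standard library, which may be closer to the functional style the question asks for. The function name split_at_empty_groupby is made up here, and the code assumes Python 3.

```python
from itertools import groupby

def split_at_empty_groupby(lines):
    """Group consecutive lines by truthiness and keep the non-empty runs."""
    # key=bool maps '' to False and any non-empty string to True, so
    # groupby yields alternating runs of empty and non-empty lines.
    return [list(group)
            for non_empty, group in groupby(lines, key=bool)
            if non_empty]

example = ['', '2', '3', '', '5', '6', '7', '8', '', '', '11']
result = split_at_empty_groupby(example)
print(result)
```

Unlike the index-based version, this does not need len(lines) or slicing, so it accepts any iterable of lines.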