Question

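I'm still learning Python and I have a problem I can't solve. I have a very long string (millions of lines) that I would like to split into smaller strings based on a specified number of occurrences of a delimiter.

For example: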

ABCDEF
//
GHIJKLMN
//
OPQ
//
RSTLN
//
OPQR
//
STUVW
//
XYZ
//

In this case, I want to split on "//" and return a string containing all the lines before the nth occurrence of the delimiter.

So an input of 1 would return:

ABCDEF

An input of 2 would return:

ABCDEF
//
GHIJKLMN

An input of 3 would return:

ABCDEF
//
GHIJKLMN
//
OPQ

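And so on... However, the sheer length of my original 2-million-line string seems to be a problem when I simply try to split the entire string on "//" and then work with the individual indexes (I get a memory error). Perhaps Python can't handle that many lines in a single split? So I can't do it that way.

I'm looking for a way that doesn't require splitting the entire string into hundreds of thousands of pieces when I may only need 100: just start from the beginning, go up to a certain point, stop, and return everything before it. I assume that would also be faster? I hope my question is as clear as possible.

Is there a simple or elegant way to achieve this? Thanks!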

Answer 1

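Here is another answer, in case you would rather work with a file than with a string in memory.

This version is written as a function that reads lines and prints them immediately until it has found the specified number of delimiters (no extra memory is needed to hold the whole string).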

def file_split(file_name, delimiter, n=1):
    with open(file_name) as fh:
        for line in fh:
            line = line.rstrip()    # use .rstrip("\n") to only strip newlines
            if line == delimiter:
                n -= 1
                if n <= 0:
                    return
            print(line)

file_split('data.txt', '//', 3)

You can use it to write the output to a new file, like this:

python split.py > newfile.txt

With a little extra work, you can use argparse to pass the arguments to the program.
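As a rough sketch of what that could look like: the version below turns file_split into a generator and wires it to argparse. The option names (-d/--delimiter, -n/--count) and the helper names build_parser and main are illustrative choices made for this example, not anything prescribed above.

```python
import argparse


def file_split(file_name, delimiter, n=1):
    """Yield lines from file_name until the n-th delimiter line is reached."""
    with open(file_name) as fh:
        for line in fh:
            line = line.rstrip("\n")
            if line == delimiter:
                n -= 1
                if n <= 0:
                    return
            yield line


def build_parser():
    # Option names are assumptions made for this sketch.
    parser = argparse.ArgumentParser(
        description="Print every line before the n-th delimiter line.")
    parser.add_argument("file_name")
    parser.add_argument("-d", "--delimiter", default="//")
    parser.add_argument("-n", "--count", type=int, default=1)
    return parser


def main(argv=None):
    # argv=None makes argparse read sys.argv[1:]
    args = build_parser().parse_args(argv)
    for line in file_split(args.file_name, args.delimiter, args.count):
        print(line)
```

Saved as split.py with an `if __name__ == "__main__": main()` guard added at the bottom, this could be run as `python split.py data.txt -n 3 > newfile.txt`.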

Answer 2

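For example:

   delimiter = "//"
   max_split = 3   # stop at this occurrence of the delimiter
   i = 0
   parts = []
   with open("...") as fd:
       for l in fd:
           if l.rstrip("\n") == delimiter:  # ignore the trailing '\n'
               i += 1
               if i >= max_split:
                   break
           parts.append(l)
   s = "".join(parts)  # everything before the max_split-th delimiter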

Answer 3

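As a more efficient approach, you can read only the first N records separated by your delimiter. If you are sure that every record is a single line followed by a delimiter line, itertools.islice can do the job: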

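from itertools import islice

N = 3  # number of records wanted
with open('filename') as f:
    # N records plus the N-1 delimiter lines between them;
    # materialize inside the with block, before the file is closed
    lines = list(islice(f, 2 * N - 1))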
Answer 4

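The approach that came to mind when I read your question uses a loop: cut the string into several pieces (for example, the 100 you mentioned) and iterate through the substrings.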

thestring = ""  # your string
steps = 100     # number of characters to examine per chunk
log = 0         # offset of the current chunk
while log < len(thestring):
    substring = thestring[log:log + steps]  # the chunk you will split and iterate through
    thelist = substring.split("//")
    for element in thelist:
        pass  # do your thing with the element
    # note: a "//" that straddles a chunk boundary is not detected here
    log = log + steps  # and go again with this offset

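Now you can go through the entire 2-million(!)-line string element by element.

The best thing to do here might actually be to turn this into a recursive function (if that is what you want):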

 thestring = ""  # your string
 steps = 100     # number of characters to examine per chunk

 def iterateThroughHugeString(beginning):
     if beginning >= len(thestring):
         return  # nothing left to process
     substring = thestring[beginning:beginning + steps]  # the chunk you will split and iterate through
     thelist = substring.split("//")
     for element in thelist:
         pass  # do your thing with the element
     # go again with this offset; Python's default recursion limit (about 1000)
     # caps how many chunks this can handle
     iterateThroughHugeString(beginning + steps)
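If the string is already in memory, another option along the same lines is to skip splitting altogether: scan forward with str.find until the nth delimiter and slice once. This is a minimal sketch, with before_nth as a hypothetical helper name; it matches the delimiter anywhere in the text, not only on a line of its own.

```python
def before_nth(text, delimiter, n):
    """Return the part of text that precedes the n-th occurrence of delimiter."""
    pos = -len(delimiter)
    for _ in range(n):
        # resume the search just past the previous match
        pos = text.find(delimiter, pos + len(delimiter))
        if pos == -1:
            return text  # fewer than n delimiters: return everything
    return text[:pos]


sample = "ABCDEF\n//\nGHIJKLMN\n//\nOPQ\n//\n"
print(before_nth(sample, "//", 2))
```

Only the characters up to the nth delimiter are examined, and no list of pieces is built.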

Answer 5

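Since you are still learning Python, building a complete dynamic solution would be a challenge. Here is a sketch of how you could model one.

Note: the following snippet only works for files in the given format (see the example in the question). It is therefore a static solution.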

num = int(input("Enter the number of occurrences: ")) * 2
with open("./data.txt") as myfile:
    # num - 1 lines: the records plus the delimiter lines between them
    print([next(myfile) for x in range(num - 1)])

With this idea in hand, you could move on to pattern matching and so on.
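One way the pattern-matching idea might look, assuming the text is available as a single string and the delimiter sits on a line of its own: re.finditer yields matches lazily, so the scan can stop at the nth one. The name before_nth_match and the default pattern are assumptions made for this sketch, and n is expected to be at least 1.

```python
import re
from itertools import islice


def before_nth_match(text, n, pattern=r"^//$"):
    """Return the text before the n-th line that matches pattern (n >= 1)."""
    matches = re.finditer(pattern, text, flags=re.MULTILINE)
    # skip the first n-1 matches and take the next one, if it exists
    nth = next(islice(matches, n - 1, None), None)
    return text if nth is None else text[:nth.start()]


sample = "ABCDEF\n//\nGHIJKLMN\n//\nOPQ\n//\n"
print(before_nth_match(sample, 2))
```

With re.MULTILINE, ^ and $ anchor at line boundaries, so a "//" embedded inside a longer line is not counted as a delimiter.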

Python - Split a large string by number of delimiter occurrences

5 answers: