Fastest way to reformat terabytes of data

Asked: 2018-09-05 20:02:22

Tags: python pandas bigdata

I have 100 files of 10 GB each. I need to reformat the files and combine them into a more usable table format so that the data can be grouped, summed, averaged, and so on. Reformatting the data with Python takes more than a week. Even once it is reformatted into a table, I do not know whether it will be too big for a dataframe, but that is a problem for another time.

Can anyone suggest a faster way to reformat the text files? I would consider anything: C++, Perl, etc.

Sample data:

Scenario:  Modeling_5305 (0.0001)

Position:  NORTHERN UTILITIES SR NT,

"  ","THEO/Effective Duration","THEO/Yield","THEO/Implied Spread","THEO/Value","THEO/Price","THEO/Outstanding Balance","THEO/Effective Convexity","ID","WAL","Type","Maturity Date","Coupon Rate","POS/Position Units","POS/Portfolio","POS/User Defined 1","POS/SE Cash 1","User Defined 2","CMO WAL","Spread Over Yield",

"2017/12/31",16.0137 T,4.4194 % SEMI 30/360,0.4980 % SEMI 30/360,"6,934,452.0000 USD","6,884,052.0000 USD","7,000,000.0000 USD",371.6160 T,CachedFilterPartitions-PL_SPLITTER.2:665876C#3,29.8548 T,Fixed Rate Bond,2047/11/01,4.3200 % SEMI 30/360,"70,000.0000",All Portfolios,030421000,0.0000 USD,FRB,N/A,0.4980 % SEMI 30/360,

"2018/01/12",15.5666 T,4.8499 % SEMI 30/360,0.4980 % SEMI 30/360,"6,477,803.7492 USD","6,418,163.7492 USD","7,000,000.0000 USD",356.9428 T,CachedFilterPartitions-PL_SPLITTER.2:665876C#3,29.8219 T,Fixed Rate Bond,2047/11/01,4.3200 % SEMI 30/360,"70,000.0000",All Portfolios,030421000,0.0000 USD,FRB,N/A,0.4980 % SEMI 30/360,

Scenario:  Modeling_5305 (0.0001)

Position:  OLIVIA ISSUER TR SER A (A,

"  ","THEO/Effective Duration","THEO/Yield","THEO/Implied Spread","THEO/Value","THEO/Price","THEO/Outstanding Balance","THEO/Effective Convexity","ID","WAL","Type","Maturity Date","Coupon Rate","POS/Position Units","POS/Portfolio","POS/User Defined 1","POS/SE Cash 1","User Defined 2","CMO WAL","Spread Over Yield",

"2017/12/31",1.3160 T,19.0762 % SEMI 30/360,0.2990 % SEMI 30/360,"3,862,500.0000 USD","3,862,500.0000 USD","5,000,000.0000 USD",2.3811 T,CachedFilterPartitions-PL_SPLITTER.2:681071AA4,1.3288 T,Interest Rate Index Linked Note,2019/05/30,0.0000 % MON 30/360,"50,000.0000",All Portfolios,010421002,0.0000 USD,IRLIN,N/A,0.2990 % SEMI 30/360,

"2018/01/12",1.2766 T,21.9196 % SEMI 30/360,0.2990 % SEMI 30/360,"3,815,391.3467 USD","3,815,391.3467 USD","5,000,000.0000 USD",2.2565 T,CachedFilterPartitions-PL_SPLITTER.2:681071AA4,1.2959 T,Interest Rate Index Linked Note,2019/05/30,0.0000 % MON 30/360,"50,000.0000",All Portfolios,010421002,0.0000 USD,IRLIN,N/A,0.2990 % SEMI 30/360,

I want to reformat it into this CSV table so that it can be imported into a dataframe:

Position, Scenario, TimeSteps, THEO/Value

NORTHERN UTILITIES SR NT, Modeling_5305, 2018/01/12, 6477803.7492

OLIVIA ISSUER TR SER A (A, Modeling_5305, 2018/01/12, 3815391.3467
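
For illustration, here is a minimal Python sketch of this reformatting, based on the sample above. It assumes every block repeats the Scenario/Position/header structure shown, emits one output row per timestep, and strips the thousands separators and "USD" suffix from THEO/Value; the glob pattern and output file name are placeholders.

import csv
import glob

def parse_report(path, out_writer):
    # Stream one report file and emit (Position, Scenario, TimeSteps, THEO/Value) rows.
    scenario = position = None
    value_idx = None
    with open(path, newline='') as f:
        for raw in f:
            line = raw.strip()
            if not line:
                continue
            if line.startswith('Scenario:'):
                # 'Scenario:  Modeling_5305 (0.0001)' -> 'Modeling_5305'
                scenario = line.split(':', 1)[1].split()[0]
            elif line.startswith('Position:'):
                # 'Position:  NORTHERN UTILITIES SR NT,' -> drop the trailing comma
                position = line.split(':', 1)[1].strip().rstrip(',')
            elif line.startswith('"  "'):
                # Header row: find the THEO/Value column for this block.
                header = next(csv.reader([line]))
                value_idx = header.index('THEO/Value')
            elif line.startswith('"2') and value_idx is not None:
                row = next(csv.reader([line]))
                # '6,477,803.7492 USD' -> '6477803.7492'
                value = row[value_idx].replace(',', '').split()[0]
                out_writer.writerow([position, scenario, row[0], value])

with open('combined.csv', 'w', newline='') as out:
    writer = csv.writer(out)
    writer.writerow(['Position', 'Scenario', 'TimeSteps', 'THEO/Value'])
    for path in glob.glob('reports/*.txt'):  # placeholder: your 100 input files
        parse_report(path, writer)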

2 answers:

Answer 0: (score 0)

When you have to process huge files or a large number of files, there are two big bottlenecks. One is your file system, which is limited by the storage medium (HDD or SSD), the connection to that medium, and the operating system. You usually cannot change any of that, but you should ask yourself: what is my top speed? How fast can the system read and write? You will never be faster than that. A rough lower bound on the total runtime is the time needed to read all the data plus the time needed to write it all back; the sketch below shows one way to measure that baseline.
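
For a concrete baseline, this small sketch (the file path is a placeholder) times a sequential read of one input file in large blocks:

import time

path = 'reports/file_001.txt'  # placeholder: point at one of the 10 GB inputs
block_size = 64 * 1024 * 1024  # read in 64 MB blocks

start = time.perf_counter()
total = 0
with open(path, 'rb') as f:
    while True:
        chunk = f.read(block_size)
        if not chunk:
            break
        total += len(chunk)
elapsed = time.perf_counter() - start
# Note: the OS page cache can inflate the result if the file was read recently.
print(f'{total / elapsed / 1e6:.0f} MB/s over {total / 1e9:.1f} GB')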

The other bottleneck is the library you use to make the changes. Not all Python packages are created equal, and the speed differences between them are huge. I recommend trying a few approaches on a small test sample until you find one that works for you.

Keep in mind that most file systems prefer reading or writing large chunks of data. So you should avoid, wherever possible, alternating between reading one line and writing one line. In other words, not only the library matters, but also the way you use it; the sketch after this paragraph illustrates the pattern.
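
A minimal sketch of that idea (transform() is a hypothetical stand-in for the actual reformatting logic, and the file names are placeholders): the per-line logic stays, but generous buffers mean the file system sees a few large reads and writes instead of one tiny request per line.

BUF = 8 * 1024 * 1024  # 8 MB I/O buffers; tune on your hardware

def transform(line: str) -> str:
    # hypothetical placeholder for the real reformatting of one line
    return line

with open('input.txt', 'r', buffering=BUF) as src, \
        open('output.csv', 'w', buffering=BUF) as dst:
    for line in src:                # line-by-line logic in Python...
        dst.write(transform(line))  # ...large batched reads/writes underneath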

A different programming language may well offer a good library for this task, and switching can be a good idea, but it will not speed the process up in any fundamental way (the gain will not be anything like tenfold).

Answer 1: (score 0)

I would use C/C++ with memory mapping.

With memory mapping you can iterate over the data as if it were one big byte array (on Windows this also avoids copying the data from kernel space to user space; I am not sure about Linux).

For very large files, you can map one chunk at a time (for example, 10 GB).

For writing, collect the results in a buffer (for example, 1 MB) and write that buffer out to the file each time it fills, using fwrite().

Whatever you do, do not use streaming I/O or readline(). A sketch of the approach follows below.
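
Since the question is tagged python, here is a sketch of the same idea using Python's mmap module, which exposes the same OS facility; transform() and the file names are placeholders. In C/C++ the structure would be identical: map the input, scan it as a byte array, and flush output through a buffer with fwrite().

import mmap

WRITE_BUF = 1 << 20  # accumulate ~1 MB of output before each write()

def transform(line: bytes) -> bytes:
    # hypothetical stand-in for the real per-line reformatting
    return line

with open('input.txt', 'rb') as src, open('output.csv', 'wb') as dst:
    # On a 64-bit system a 10 GB file can be mapped in one piece; pages are
    # faulted in on demand, so this does not load the whole file into RAM.
    # (Mapping fixed-size windows instead requires page-aligned offsets.)
    with mmap.mmap(src.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        out = bytearray()
        start = 0
        while True:
            end = mm.find(b'\n', start)        # scan the mapped bytes directly
            if end < 0:
                out += transform(mm[start:])   # trailing bytes without a newline
                break
            out += transform(mm[start:end + 1])
            start = end + 1
            if len(out) >= WRITE_BUF:
                dst.write(out)                 # one large write, not many small ones
                out.clear()
        dst.write(out)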

The whole process should take no longer than (or at least not much longer than) the time it takes simply to copy the files on disk (or over the network, if you are on network file storage).

If you have the option, write the output to a different (physical) disk than the one you are reading from.