Remove blank lines, spaces, and paragraph marks from a text file

Time: 2020-05-05 09:35:16

Tags: python python-3.x

My text file contains sample data like the following:

 E-RECEIPT FOR  TRANSFER FUNDS                                                                                                                                                                                                                                                                                                                                                                                                         

   Payee Name:                                                   AAA CHS                                                                                                                                                                                                                                                                                                                                                            

   Nickname:                                                     AAA CHS                                                                                                                                                                                                                                                                                                                                                            

   Credit Account No::                                           AAAA0000006666                                                                                                                                                                                                                                                                                                                                                         

   Remarks:                                                      4869                                                                                                                                                                                                                                                                                                                                                                    

   Debit Account:                                                99999999999999                                                                                                                                                                                                                                                                                                                                                         

   Date:                                                         05 May '20                                                                                                                                                                                                                                                                                                                                                              

   Amount:                                                       INR 4,869.00         (Rupees     Four Thousand Eight Hundred Sixty  Nine  and Zero Paisa only) 

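If I view this file with formatting marks turned on (File -> Options -> Display -> Always show these formatting marks on the screen, with all of the listed options selected), it looks like this: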

 ....E-RECEIPT FOR  TRANSFER Of Funds...................................................................Payee Name...................
.....................................................................................................
AAA CHS.........................................................AAA CHS...........................Nickname ....etc 

Here (...) means spaces. Between the lines it also shows paragraph symbols (¶, the pilcrow), and at the end of the file it shows 3 paragraph symbols.

I want output like this (with the blank lines and paragraph symbols removed):

E-RECEIPT FOR  TRANSFER FUNDS
Payee Name:                                                   AAA CHS 
Nickname:                                                     AAA CHS 
Credit Account No::                                           AAAA0000006666
...
...

I tried the following:

file=open("c:\\temp1\\tt1.txt", "r+")
for line in file.readlines():
    print(line.strip())
file.close()

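It does not work. Note that I do not want to remove the spaces between words; I want to remove the blank lines and special characters between lines.

Second, though this is not required: if possible, I would like exactly one space before and after each ":" or "::", for example: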

E-RECEIPT FOR  TRANSFER FUNDS
Payee Name : AAA CHS 
Nickname : AAA CHS 
Credit Account No :: AAAA0000006666

...etc.

1 Answer:

Answer 0 (score: 0)

Use this convenience function:

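import re
def text_processor(s):
    # s = your text
    s = s.replace('::', ':')        # collapse '::' into a single ':'
    s = re.sub(r'\n\n', '|\n', s)   # mark each blank line with '|'
    s = re.sub(r'\s{2,}', ' ', s)   # squeeze whitespace runs (newlines included) to one space
    return '\n'.join(s.split('|')).replace(':', ' :')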

Example

# s = your text
# assuming you are reading in from a file: 'data.txt'
# with open('data.txt', 'r') as f:
#    s = f.read()
print(text_processor(s))

Output

E-RECEIPT FOR TRANSFER FUNDS 
 Payee Name : AAA CHS 
 Nickname : AAA CHS 
 Credit Account No : AAAA0000006666 
 Remarks : 4869 
 Debit Account : 99999999999999 
 Date : 05 May '20 
 Amount : INR 4,869.00 (Rupees Four Thousand Eight Hundred Sixty Nine and Zero Paisa only) 
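The loop in the question calls `print(line.strip())` on every line, blank ones included, so each blank line still prints as an empty line. The output above also squeezes the column spacing and leaves a space at the start and end of most lines. If the goal is the first desired output from the question (blank lines gone, spacing inside each line untouched), a line-based filter may be enough. The following is only a sketch; `drop_blank_lines` and the shortened `sample` string are hypothetical names introduced here, not part of the original answer.

```python
def drop_blank_lines(text):
    """Strip each line's outer whitespace and discard lines that end up empty."""
    cleaned = []
    for line in text.splitlines():
        stripped = line.strip()   # removes leading/trailing spaces only
        if stripped:              # whitespace-only lines become '' and are skipped
            cleaned.append(stripped)
    return "\n".join(cleaned)


# Shortened stand-in for the receipt text (assumption: real lines look similar)
sample = (
    " E-RECEIPT FOR  TRANSFER FUNDS      \n"
    "\n"
    "   Payee Name:      AAA CHS      \n"
    "   \n"
    "   Nickname:        AAA CHS      \n"
    "\n\n\n"
)
print(drop_blank_lines(sample))
```

Because only the outer whitespace of each line is removed, the double space in "FOR  TRANSFER" and the padding after each label are preserved.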

Dummy data

s = """
E-RECEIPT FOR  TRANSFER FUNDS                                                                                                                                                                                                                                                                                                                                                                                                         

   Payee Name:                                                   AAA CHS                                                                                                                                                                                                                                                                                                                                                            

   Nickname:                                                     AAA CHS                                                                                                                                                                                                                                                                                                                                                            

   Credit Account No::                                           AAAA0000006666                                                                                                                                                                                                                                                                                                                                                         

   Remarks:                                                      4869                                                                                                                                                                                                                                                                                                                                                                    

   Debit Account:                                                99999999999999                                                                                                                                                                                                                                                                                                                                                         

   Date:                                                         05 May '20                                                                                                                                                                                                                                                                                                                                                              

   Amount:                                                       INR 4,869.00         (Rupees     Four Thousand Eight Hundred Sixty  Nine  and Zero Paisa only) 
"""

print(s)
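The question's optional second request keeps "::" as it is ("Credit Account No :: AAAA0000006666"), whereas `text_processor` first replaces "::" with ":". One possible way to put a single space around the first separator of a line without collapsing "::" is sketched below. `normalize_separator` is a hypothetical helper, and limiting the substitution to the first colon run (`count=1`) is an assumption made so that colons inside a value are left alone.

```python
import re


def normalize_separator(line):
    """Put exactly one space on each side of the first ':' or '::' in a line."""
    # \s*(:+)\s* matches the first run of colons plus any surrounding whitespace
    return re.sub(r"\s*(:+)\s*", r" \1 ", line.strip(), count=1).strip()


print(normalize_separator("   Credit Account No::          AAAA0000006666   "))
# Credit Account No :: AAAA0000006666
print(normalize_separator("   Payee Name:                  AAA CHS   "))
# Payee Name : AAA CHS
```

Lines without a colon, such as the receipt title, come back stripped but otherwise unchanged.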

Opening a Docx file from Python

Reference: source

import docx2txt

# read in word file
s = docx2txt.process("data.docx")

# After pasting the dummy data into a docx file and reading it back,
# the processed text still contains ' \n \n ' runs between lines,
# so it needs the following fix

print(text_processor(s).replace(' \n \n ', '\n'))
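The literal `' \n \n '` in the fix above only matches that exact spacing. If the extracted text has a different number of spaces or blank lines between entries, a regular expression that collapses any whitespace around a run of newlines may be more forgiving. This is a sketch on a plain string, not something the answer or `docx2txt` provides; `collapse_line_breaks` is a hypothetical name.

```python
import re


def collapse_line_breaks(text):
    """Replace each run of newlines, plus the whitespace around it, with one newline."""
    # [ \t]* eats trailing spaces before the newline; \s* eats the blank
    # lines and any leading spaces of the next non-empty line
    return re.sub(r"[ \t]*\n\s*", "\n", text).strip()


processed = "A : 1 \n \n B : 2 \n\n\n   C : 3 "
print(collapse_line_breaks(processed))
```

Note that this also removes indentation at the start of each line, which matches the desired output in the question but may not suit text where leading spaces matter.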