My text file's sample data looks like this:
E-RECEIPT FOR TRANSFER FUNDS
Payee Name: AAA CHS
Nickname: AAA CHS
Credit Account No:: AAAA0000006666
Remarks: 4869
Debit Account: 99999999999999
Date: 05 May '20
Amount: INR 4,869.00 (Rupees Four Thousand Eight Hundred Sixty Nine and Zero Paisa only)
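If I view this file with formatting marks turned on (File -> Options -> Display -> Always show these formatting marks on the screen, with every option ticked), it looks like this: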
....E-RECEIPT FOR TRANSFER Of Funds...................................................................Payee Name...................
.....................................................................................................
AAA CHS.........................................................AAA CHS...........................Nickname ....etc
Here (...) stands for spaces. Between the lines it also shows paragraph marks (¶, the pilcrow), and at the end of the file it shows 3 paragraph marks.
The output I want (spaces and paragraph marks removed):
E-RECEIPT FOR TRANSFER FUNDS
Payee Name: AAA CHS
Nickname: AAA CHS
Credit Account No:: AAAA0000006666
...
...
I tried the following:
file = open("c:\\temp1\\tt1.txt", "r+")
for line in file.readlines():
    print(line.strip())
file.close()
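It does not work. Note that I do not want to remove the spaces between words; I want to remove the blank space/special characters between lines.
Second, though this is not essential, I would like exactly one space before and after each ":" or "::", for example: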
E-RECEIPT FOR TRANSFER FUNDS
Payee Name : AAA CHS
Nickname : AAA CHS
Credit Account No :: AAAA0000006666
... etc.
Answer 0 (score: 0)
Use this handy function:
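import re

def text_processor(s):
    # s = your text
    # Turns '::' into ':', collapses repeated whitespace (a '\n\n' pair is
    # shielded by the temporary '|' marker), and puts a space before each ':'.
    return '\n'.join(str.split(re.sub(r'\s{2,}', ' ', re.sub('\n\n', '|\n', s.replace('::', ':'))), '|')).replace(':', ' :')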
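Since the one-liner above is dense, here is a rough unpacking of the same steps into separate statements. The name `text_processor_steps` is only for illustration and is not part of the original answer. One thing the unpacked form makes visible: a blank line made of exactly two newlines survives, because the newline taken out in step 2 is put back in step 5, and `'::'` ends up as a single `':'`.

```python
import re

def text_processor_steps(s):
    # Illustrative name; mirrors the one-liner step by step.
    s = s.replace('::', ':')         # 1. turn every '::' into ':'
    s = re.sub('\n\n', '|\n', s)     # 2. swap the first newline of each '\n\n' pair for '|'
    s = re.sub(r'\s{2,}', ' ', s)    # 3. collapse runs of 2+ whitespace characters to one space
    parts = s.split('|')             # 4. cut the text at the '|' markers
    s = '\n'.join(parts)             # 5. rejoin with '\n', restoring the newline from step 2
    return s.replace(':', ' :')      # 6. put a space in front of every ':'
```

Note that a `'|'` already present in the input would also be treated as a marker in step 4, so this approach assumes the text contains none.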
Example:
# s = your text
# assuming you are reading in from a file: 'data.txt'
# with open('data.txt', 'r') as f:
# s = f.read()
print(text_processor(s))
Output:
E-RECEIPT FOR TRANSFER FUNDS
Payee Name : AAA CHS
Nickname : AAA CHS
Credit Account No : AAAA0000006666
Remarks : 4869
Debit Account : 99999999999999
Date : 05 May '20
Amount : INR 4,869.00 (Rupees Four Thousand Eight Hundred Sixty Nine and Zero Paisa only)
Dummy data used in the example:
s = """
E-RECEIPT FOR TRANSFER FUNDS
Payee Name: AAA CHS
Nickname: AAA CHS
Credit Account No:: AAAA0000006666
Remarks: 4869
Debit Account: 99999999999999
Date: 05 May '20
Amount: INR 4,869.00 (Rupees Four Thousand Eight Hundred Sixty Nine and Zero Paisa only)
"""
print(s)
Reference: source
import docx2txt
# read in word file
s = docx2txt.process("data.docx")
# Copy pasting the dummy data into a docx file
# and trying to read and correcting the data
# requires the following fix
print(text_processor(s).replace(' \n \n ', '\n'))
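As an alternative, here is a minimal line-by-line sketch. It is a different technique from the function above, not a fix to it. The loop in the question still prints an empty line for every blank input line, because `strip()` on a blank line gives an empty string and `print()` emits it anyway; the sketch below skips those lines instead. It also keeps `'::'` as `'::'`, as the question asked. The names `clean_receipt` and `SEPARATOR` are hypothetical, and the sketch assumes that only the first `':'` or `'::'` on a line is a label separator.

```python
import re

# Hypothetical helper, not part of the original answer.
# Matches the first ':' or '::' on a line plus any whitespace around it.
SEPARATOR = re.compile(r"\s*(::?)\s*")

def clean_receipt(text):
    """Drop blank lines, collapse repeated whitespace, and leave exactly
    one space on each side of the first ':' or '::' in every line."""
    cleaned = []
    for raw in text.splitlines():
        # split() with no arguments trims the ends and collapses inner runs
        line = " ".join(raw.split())
        if not line:
            continue  # skip lines that were empty or whitespace only
        # count=1 leaves any later colons on the line untouched
        line = SEPARATOR.sub(r" \1 ", line, count=1).strip()
        cleaned.append(line)
    return "\n".join(cleaned)
```

To use it on a file, read the whole file into one string first (for example with `f.read()` inside a `with open(...)` block) and pass that string to `clean_receipt`.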