Question

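I have a number of log files that I need to extract data from and reformat. Some of these log files are very large, over 10,000 lines.

Can anyone suggest some code examples to help me read a text file, remove the unwanted lines, and then edit the remaining lines into a specific format? I haven't been able to find a previous thread that covers what I'm after.

A sample of the data I need to edit is shown below: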

136: add student 50000000 35011 / Y01T :Unknown id in field 3 - ignoring line

137: add student 50000000 5031 / Y01S :Unknown id in field 3 - ignoring line

138: add student 50000000 881 / Y01S :Unknown course idnumber in field 4 - ignoring line

139: add student 50000000 5732 / Y01S :Unknown id in field 3 - ignoring line

134: add student 50000000 W250 / Y02S :OK

135: add student 50000000 35033 / Y01T :OK

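I need to search the file and remove any line that ends with the suffix :OK. Then I need to edit the remaining lines into CSV format, for example: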

add,student,50000000,1234 / abcd

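Any tips, code snippets, etc. would be really useful and I would be very grateful. I would normally have a go myself before asking, but I don't have the time to teach myself Python file access and string formatting right now. So please accept my apologies in advance for asking before trying.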

Answer 1

This could be a solution:

import sys

if len(sys.argv) != 2:
    print('Add an input file as parameter')
    sys.exit(1)

print('opening file: %s' % sys.argv[1])

with open(sys.argv[1]) as infile, open('output', 'w') as outfile:
    for line in infile:
        line = line.strip()
        # skip blank lines and lines whose status suffix is ":OK"
        if not line or line.endswith(':OK'):
            continue
        # fields: '136:', 'add', 'student', '50000000', '35011', '/', 'Y01T', ':message'
        fields = line.split(' ', 7)
        csv_line = '%s,%s,%s,%s / %s' % (fields[1], fields[2], fields[3], fields[4], fields[6])
        outfile.write(csv_line + '\n')
        # just for checking purposes let's print the lines
        print(csv_line)

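Mind the output file name! The script writes to a file called output in the current directory and overwrites it if it already exists.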
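The per-line logic can also be pulled into a small function, which makes it easy to check against the sample lines from the question. This is only a minimal sketch, assuming Python 3 and the exact line layout shown in the question; `to_csv_line` is a hypothetical helper name, not something from the script above.

```python
def to_csv_line(line):
    """Return the CSV form of one log line, or None if the line should be dropped."""
    line = line.strip()
    # Drop blank lines and lines whose status suffix is ":OK".
    if not line or line.endswith(':OK'):
        return None
    # Cut off the "NNN: " prefix, then the " :message" suffix.
    body = line.partition(': ')[2].partition(' :')[0]
    # Split only the first three fields; the "id / course" part stays together.
    parts = body.split(None, 3)
    if len(parts) != 4:
        return None  # assumption: lines with an unexpected layout are skipped
    return ','.join(parts)


if __name__ == '__main__':
    sample = '136: add student 50000000 35011 / Y01T :Unknown id in field 3 - ignoring line\n'
    print(to_csv_line(sample))
```

Because the function handles one line at a time, it can be called inside a plain `for line in infile:` loop, so large files never have to be loaded into memory in one go.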

Answer 2

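If your lines differ from the sample, you can change the regular expression to suit your needs, and you can also modify the csv.writer parameters if you need a different delimiter: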

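import re, csv

# groups: line number, three words, "id / course" field, status message
regex = re.compile(r"(\d+)\s*:\s*(\w+)\s+(\w+)\s+(\w+)\s+([\w/ ]+?)\s*:\s*(.+)")
# newline="" (Python 3) stops the csv module from writing extra blank lines on Windows
with open("out.csv", "w", newline="") as outfile:
    writer = csv.writer(outfile, delimiter=',', quotechar='"')
    with open("log.txt") as f:
        for line in f:
            m = regex.match(line)
            # keep only lines whose status message is not "OK"
            if m and m.group(6).strip() != "OK":
                writer.writerow(m.groups()[1:-1])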
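To illustrate the point about changing the delimiter, here is a small sketch that runs a pattern of the same shape over two of the sample lines and writes semicolon-separated output into an in-memory buffer instead of a file. It assumes Python 3; the `\s*` before the second colon is an adjustment made for this sketch so that the trailing space stays out of the captured field, and the names `LINE_RE`, `sample` and `buffer` are illustrative only.

```python
import csv
import io
import re

# Groups: line number, three words, "id / course" field, status message.
LINE_RE = re.compile(r"(\d+)\s*:\s*(\w+)\s+(\w+)\s+(\w+)\s+([\w/ ]+?)\s*:\s*(.+)")

sample = [
    "138: add student 50000000 881 / Y01S :Unknown course idnumber in field 4 - ignoring line\n",
    "134: add student 50000000 W250 / Y02S :OK\n",
]

buffer = io.StringIO()
# delimiter=';' swaps the comma for a semicolon; lineterminator keeps plain "\n" endings.
writer = csv.writer(buffer, delimiter=';', lineterminator='\n')
for line in sample:
    m = LINE_RE.match(line)
    if m and m.group(6).strip() != "OK":
        # Drop the first group (line number) and the last group (message).
        writer.writerow(m.groups()[1:-1])

print(buffer.getvalue(), end='')
```

Only the first sample line survives the filter, so the buffer ends up holding the single row `add;student;50000000;881 / Y01S`.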

Answer 3

Thanks for the help, guys. As a newbie, the code I ended up with isn't all that elegant, but it still gets the job done :).

#open the file and create the CSV after filtering the input file.
def openFile(filename, keyword): #defines the function to open the file. User to pass two variables.

    matches = [] #named 'matches' rather than 'list' to avoid shadowing the built-in.
    output = ''

    with open(filename, 'r') as f: #opens the file for reading; 'with' closes it automatically.
        for line in f: #for each line in 'f'.
            if keyword in line: #check to see if the keyword is in the line.
                matches.append(line) #add the line to the list.

    print(matches) #test.

    for each in matches: #filter and clean the info, format the info into a CSV format.
        choppy = each.partition(': ') #split to remove the prefix.
        chunk = choppy[2] #take the good string.
        choppy = chunk.partition(' :') #split to remove the suffix.
        chunk = choppy[0] #take the good string.
        strsplit = chunk.split(' ') #split the string by spaces ' '.
        line = strsplit[0] + ',' + strsplit[1] + ',' + strsplit[2] + ',' + strsplit[3] + ' ' + strsplit[4] + ' ' + strsplit[5] + '\n' #concatenate the strings.

        output = output + line #concatenate each line to create a single string.

    print(output) #test.

    with open(keyword + '.csv', 'w') as f: #open a file to write; closed automatically.
        f.write(output) #write the string to the file.



openFile('russtest.txt', 'cat')
openFile('CRON ENROL LOG 200913.txt', 'field 4')
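Since the question mentions logs of more than 10,000 lines, the same partition-and-split approach can be arranged so that each line is written out as soon as it is read, rather than collected into a list and one big string first. This is just a sketch under the assumption that every matching line follows the sample layout; `filter_log` and its parameters are hypothetical names, not part of the code above.

```python
def filter_log(in_path, out_path, keyword):
    """Write a CSV row for each line containing keyword; return how many rows were written."""
    written = 0
    with open(in_path) as src, open(out_path, 'w') as dst:
        for line in src:  # one line at a time, nothing is accumulated in memory
            if keyword not in line:
                continue
            # Remove the "NNN: " prefix and the " :message" suffix.
            chunk = line.partition(': ')[2].partition(' :')[0]
            parts = chunk.split(' ')
            if len(parts) < 6:
                continue  # assumption: lines with an unexpected layout are skipped
            # First three fields comma-separated, then "id / course" kept together.
            dst.write(','.join(parts[:3]) + ',' + ' '.join(parts[3:6]) + '\n')
            written += 1
    return written
```

A call such as `filter_log('CRON ENROL LOG 200913.txt', 'field 4.csv', 'field 4')` would then produce the same kind of file as the function above, with the output name passed in explicitly instead of being derived from the keyword.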

Thanks :).

Python - Formatting specific data in a text file
