Parsing a text file with Python

Date: 2010-07-15 07:34:38

Tags: python parsing

I'm new to Python and would like to use it to parse a text file. The file contains 250-300 lines in the following format:

---- Mark Grey (mark.grey@gmail.com) changed status from Busy to Available @ 14/07/2010 16:32:36 ----
----  Silvia Pablo (spablo@gmail.com) became Available @ 14/07/2010 16:32:39 ----

For all the entries in this file, I need to store the following information into another file (Excel or text):

UserName/ID  Previous Status New Status Date Time

So for the lines mentioned above, my result file should look like this:
Mark Grey/mark.grey@gmail.com  Busy Available 14/07/2010 16:32:36
Silvia Pablo/spablo@gmail.com  NaN  Available 14/07/2010 16:32:39

Thanks in advance, any help would be much appreciated.

6 Answers:

Answer 0 (score: 15)

To get you started:

import re

result = []
regex = re.compile(
    r"""^-*\s+
    (?P<name>.*?)\s+
    \((?P<email>.*?)\)\s+
    (?:changed\s+status\s+from\s+(?P<previous>.*?)\s+to|became)\s+
    (?P<new>.*?)\s+@\s+
    (?P<date>\S+)\s+
    (?P<time>\S+)\s+
    -*$""", re.VERBOSE)
with open("inputfile") as f:
    for line in f:
        match = regex.match(line)
        if match:
            result.append([
                match.group("name"),
                match.group("email"),
                match.group("previous")
                # etc.
            ])
        else:
            pass  # Match attempt failed; skip or report the line

This will give you a list of lists containing the matched parts. I'd then suggest you use the csv module to store the results in a standard format.
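
For example, a minimal sketch of that last step (assuming the remaining fields are appended to each row where the # etc. comment is, and using a hypothetical output name results.csv) might look like:

import csv

# `result` is the list of rows built by the loop above
with open("results.csv", "w") as out:
    writer = csv.writer(out)
    writer.writerow(["Name", "Email", "Previous Status", "New Status", "Date", "Time"])
    writer.writerows(result)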

Answer 1 (score: 6)

import re

pat = re.compile(r"----\s+(.*?) \((.*?)\) (?:changed status from (\w+) to|became) (\w+) @ (.*?) ----\s*")
with open("data.txt") as f:
    for line in f:
        (name, email, prev, curr, date) = pat.match(line).groups()
        print "{0}/{1}  {2} {3} {4}".format(name, email, prev or "NaN", curr, date)

This makes assumptions about the whitespace, and assumes that every line conforms to the pattern. You might want to add error checking (such as checking that pat.match() doesn't return None) if you need to handle dirty input gracefully.
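
For example, a minimal variant of the loop with that None check added (keeping the Python 2 print used above) could be:

with open("data.txt") as f:
    for line in f:
        m = pat.match(line)
        if m is None:
            continue  # skip lines that don't fit the expected format
        (name, email, prev, curr, date) = m.groups()
        print "{0}/{1}  {2} {3} {4}".format(name, email, prev or "NaN", curr, date)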

Answer 2 (score: 6)

The two RE patterns of interest seem to be...:

p1 = r'^---- ([^(]+) \(([^)]+)\) changed status from (\w+) to (\w+) @ (\S+) (\S+) ----$'
p2 = r'^---- ([^(]+) \(([^)]+)\) became (\w+) @ (\S+) (\S+) ----$'

So I would do something like:

import csv, re, sys

# assign p1, p2 as above (or enhance them, etc etc)

r1 = re.compile(p1)
r2 = re.compile(p2)
data = []

with open('somefile.txt') as f:
    for line in f:
        m = r1.match(line)
        if m:
            data.append(m.groups())
            continue
        m = r2.match(line)
        if not m:
            print>>sys.stderr, "No match for line: %r" % line
            continue
        listofgroups = list(m.groups())
        listofgroups.insert(2, 'NaN')
        data.append(listofgroups)

with open('result.csv', 'w') as f:
    w = csv.writer(f)
    w.writerow('UserName/ID Previous Status New Status Date Time'.split())
    w.writerows(data)

The two patterns I've described may of course need tweaking if they're not general enough, but I believe this general approach will be useful. While many Python users on Stack Overflow strongly dislike REs, I find them very useful for this kind of pragmatic ad-hoc text processing.

Maybe the dislike is explained by others wanting to use REs for absurd purposes, such as ad-hoc parsing of CSV, HTML, XML, and many other kinds of structured text formats for which perfectly good parsers exist! Or for tasks well beyond REs' "comfort zone", which instead call for a solid general parser system like pyparsing. Or, at the other extreme, super-simple tasks done perfectly well with plain strings (e.g. I remember a recent SO question which used if re.search('something', s): instead of if 'something' in s:!-).

But for the reasonably broad swath of tasks (excluding the very simplest at one end, and the parsing of structured or somewhat-complicated grammars at the other) for which REs are appropriate, there's really nothing wrong with using them, and I recommend that all programmers learn at least the basics of REs.

Answer 3 (score: 4)

Alex mentioned pyparsing, so here is a pyparsing approach to the same problem:

from pyparsing import Word, Suppress, Regex, oneOf, SkipTo
import datetime

DASHES = Word('-').suppress()
LPAR,RPAR,AT = map(Suppress,"()@")
date = Regex(r'\d{2}/\d{2}/\d{4}')
time = Regex(r'\d{2}:\d{2}:\d{2}')
status = oneOf("Busy Available Idle Offline Unavailable")

statechange1 = 'changed status from' + status('fromstate') + 'to' + status('tostate')
statechange2 = 'became' + status('tostate')
linefmt = (DASHES + SkipTo('(')('name') + LPAR + SkipTo(RPAR)('email') + RPAR + 
            (statechange1 | statechange2) +
            AT + date('date') + time('time') + DASHES)

def convertFields(tokens):
    if 'fromstate' not in tokens:
        tokens['fromstate'] = 'NULL'
    tokens['name'] = tokens.name.strip()
    tokens['email'] = tokens.email.strip()
    d,mon,yr = map(int, tokens.date.split('/'))
    h,m,s = map(int, tokens.time.split(':'))
    tokens['datetime'] = datetime.datetime(yr, mon, d, h, m, s)
linefmt.setParseAction(convertFields)

# `text` is assumed to hold the raw log lines shown in the question
for line in text.splitlines():
    fields = linefmt.parseString(line)
    print "%(name)s/%(email)s  %(fromstate)-10.10s %(tostate)-10.10s %(datetime)s" % fields

This prints:

Mark Grey/mark.grey@gmail.com  Busy       Available  2010-07-14 16:32:36
Silvia Pablo/spablo@gmail.com  NULL       Available  2010-07-14 16:32:39

pyparsing lets you attach names to the result fields (just like the named groups in Tom Pietzcker's RE-style answer), plus parse-time actions to act on or manipulate the parsed tokens - note the conversion of the separate date and time fields into a true datetime object, already converted and ready for use after parsing with no extra fuss.

Here is a modified loop that just dumps out the parsed tokens and the named fields for each line:

for line in text.splitlines():
    fields = linefmt.parseString(line)
    print fields.dump()

This prints:

['Mark Grey ', 'mark.grey@gmail.com', 'changed status from', 'Busy', 'to', 'Available', '14/07/2010', '16:32:36']
- date: 14/07/2010
- datetime: 2010-07-14 16:32:36
- email: mark.grey@gmail.com
- fromstate: Busy
- name: Mark Grey
- time: 16:32:36
- tostate: Available
['Silvia Pablo ', 'spablo@gmail.com', 'became', 'Available', '14/07/2010', '16:32:39']
- date: 14/07/2010
- datetime: 2010-07-14 16:32:39
- email: spablo@gmail.com
- fromstate: NULL
- name: Silvia Pablo
- time: 16:32:39
- tostate: Available

I suspect that as you continue working on this problem, you will find other variations in the format of the input text that specify how a user's status changed. In that case, you would just add another definition like statechange1 or statechange2, and insert it into linefmt with the others. I feel that pyparsing's way of structuring parser definitions helps developers come back to a parser after things have changed, and easily extend their parsing program.
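
For instance, if a hypothetical third variation such as "---- Jane Doe (jd@example.com) went Offline @ 15/07/2010 09:00:00 ----" turned up (this line is not in the original data), one more alternative could be added and linefmt rebuilt:

# hypothetical extra variation: "... went Offline @ ..."
statechange3 = 'went' + status('tostate')
linefmt = (DASHES + SkipTo('(')('name') + LPAR + SkipTo(RPAR)('email') + RPAR +
            (statechange1 | statechange2 | statechange3) +
            AT + date('date') + time('time') + DASHES)
linefmt.setParseAction(convertFields)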

Answer 4 (score: 1)

Well, if I were going to tackle this problem, I'd probably start by splitting each entry into its own separate string. It looks line-oriented, so inputfile.split('\n') would probably be adequate. From there I would craft a regular expression to match each of the possible status changes, with subgroups wrapping each of the important fields.
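
A rough, illustrative sketch of that approach (the exact patterns and the data variable holding the file contents are assumptions, not part of this answer):

import re

# one pattern per kind of status change, with subgroups for the important fields
changed = re.compile(r'-+\s*(.+?) \((.+?)\) changed status from (\w+) to (\w+) @ (\S+) (\S+)')
became  = re.compile(r'-+\s*(.+?) \((.+?)\) became (\w+) @ (\S+) (\S+)')

for entry in data.split('\n'):          # `data` holds the whole input file as one string
    m = changed.match(entry)
    if m:
        name, email, prev, new, date, time = m.groups()
    else:
        m = became.match(entry)
        if not m:
            continue                    # not a recognised status-change entry
        name, email, new, date, time = m.groups()
        prev = 'NaN'
    print name + '/' + email, prev, new, date, time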

Answer 5 (score: 1)

Thanks a lot for all your comments, they were very useful. I wrote the code using the directory functionality: it reads through the data files and, for each user, prints an output file containing all of that user's status updates. The code is pasted below.

#Script to extract info from individual data files and print out a data file combining info from these files

import os
import commands

dataFileDir="data/";

#Dictionary linking names to email ids
#For the time being, assume no 2 people have the same name
usrName2Id={};

#User id  to user name mapping to check for duplicate names
usrId2Name={};

#Store info: key: user ids and values a dictionary with time stamp keys and status messages values
infoDict={};

#Given an array of space tokenized inputs, extract user name
def getUserName(info,mailInd):

    userName="";
    for i in range(mailInd-1,0,-1):

        if info[i].endswith("-") or info[i].endswith("+"):
            break;

        userName=info[i]+" "+userName;

    userName=userName.strip();
    userName=userName.replace("  "," ");
    userName=userName.replace(" ","_");

    return userName;

#Given an array of space tokenized inputs, extract time stamp
def getTimeStamp(info,timeStartInd):
    timeStamp="";
    for i in range(timeStartInd+1,len(info)):
        timeStamp=timeStamp+" "+info[i];

    timeStamp=timeStamp.replace("-","");
    timeStamp=timeStamp.strip();
    return timeStamp;

#Given an array of space tokenized inputs, extract status message
def getStatusMsg(info,startInd,endInd):
    msg="";
    for i in range(startInd,endInd):
        msg=msg+" "+info[i];
    msg=msg.strip();
    msg=msg.replace(" ","_");
    return msg;

#Extract and store info from each line in the datafile
def extractLineInfo(line):

    print line;
    info=line.split(" ");

    mailInd=-1;userId="-NONE-";
    timeStartInd=-1;timeStamp="-NONE-";
    becameInd="-1";
    statusMsg="-NONE-";

    #Find indices of email id and "@" char indicating start of timestamp
    for i in range(0,len(info)):
        #print (str(i)+" "+info[i]);
        if(info[i].startswith("(") and info[i].endswith("@in.ibm.com)")):
            mailInd=i;
        if(info[i]=="@"):
            timeStartInd=i;

        if(info[i]=="became"):
            becameInd=i;

    #Debug print of mail and time stamp start inds
    """print "\n";
    print "Index of mail id: "+str(mailInd);
    print "Index of time start index: "+str(timeStartInd);
    print "\n";"""

    #Extract IBM user id and name for lines with ibm id
    if(mailInd>=0):
        userId=info[mailInd].replace("(","");
        userId=userId.replace(")","");
        userName=getUserName(info,mailInd);
    #Lines with no ibm id are of the form "Suraj Godar Mr became idle @ 15/07/2010 16:30:18"
    elif(becameInd>0):
        userName=getUserName(info,becameInd);

    #Time stamp info
    if(timeStartInd>=0):
        timeStamp=getTimeStamp(info,timeStartInd);
        if(mailInd>=0):
            statusMsg=getStatusMsg(info,mailInd+1,timeStartInd);
        elif(becameInd>0):
            statusMsg=getStatusMsg(info,becameInd,timeStartInd);

    print userId;
    print userName;
    print timeStamp
    print statusMsg+"\n";

    if not(userName in usrName2Id) and not(userName=="-NONE-") and not(userId=="-NONE-"):
        usrName2Id[userName]=userId;

    #Store status messages keyed by user email ids
    timeDict={};

    #Retrieve user id corresponding to user name
    if userName in usrName2Id:
        userId=usrName2Id[userName];

    #For valid user ids, store status message in the dict within dict data str arrangement
    if not(userId=="-NONE-"):

        if not(userId in infoDict.keys()):
            infoDict[userId]={};

        timeDict=infoDict[userId];
        if not(timeStamp in timeDict.keys()):
            timeDict[timeStamp]=statusMsg;
        else:
            timeDict[timeStamp]=timeDict[timeStamp]+" "+statusMsg;


#Print for each user a file containing status
def printStatusFiles(dataFileDir):


    volNum=0;

    for userName in usrName2Id:
        volNum=volNum+1;

        filename=dataFileDir+"/"+"status-"+str(volNum)+".txt";
        file = open(filename,"w");

        print "Printing output file name: "+filename;
        print volNum,userName,usrName2Id[userName]+"\n";
        file.write(userName+" "+usrName2Id[userName]+"\n");

        timeDict=infoDict[usrName2Id[userName]];
        for time in sorted(timeDict.keys()):
            file.write(time+" "+timeDict[time]+"\n");


#Read and store data from individual data files
def readDataFiles(dataFileDir):

    #Process each datafile
    files=os.listdir(dataFileDir)
    files.sort();
    for i in range(0,len(files)):
    #for i in range(0,1):

        file=files[i];

        #Do not process other non-data files lying around in that dir
        if not file.endswith(".txt"):
            continue

        print "Processing data file: "+file
        dataFile=dataFileDir+str(file);
        inpFile=open(dataFile,"r");
        lines=inpFile.readlines();

        #Process lines
        for line in lines:

            #Clean lines
            line=line.strip();
            line=line.replace("/India/Contr/IBM","");
            line=line.strip();

            #Skip header line of the file and L's sign in sign out times
            if(line.startswith("System log for account") or line.find("signed")>-1):
                continue;


            extractLineInfo(line);


print "\n";
readDataFiles(dataFileDir);
print "\n";
printStatusFiles("out/");