Skipping certain lines in a large text file with Python

Time: 2016-11-16 21:52:49

Tags: python

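I am reading a very large text file. When I run my code, I get a 'list index out of range' error. I noticed that some lines in my data need to be ignored. Each set of 9 lines should look like the first example below. Some sets contain stray lines (see the second set). How can I ignore or drop those lines so that my line count isn't thrown off? I need all of the data to be in sets of 9 lines. Could I require that line 1 starts with product, lines 2-8 start with review, and line 9 is blank?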

product/productId: B001E4KFG0
review/userId: A3SGXH7AUHU8GW
review/profileName: delmartian
review/helpfulness: 1/1
review/score: 5.0
review/time: 1303862400
review/summary: Good Quality Dog Food
review/text: I have bought several of the Vitality canned dog food products and have
found them all to be of good quality. The product looks more like a stew than a
processed meat and it smells better. My Labrador is finicky and she appreciates this
product better than most.

product/productId: B001E4KFG0
review/userId: A3SGXH7AUHU8GW
review/profileName: delmartian
review/helpfulness: 1/1
error error error
review/score: 5.0
review/time: 1303862400
review/summary: Good Quality Dog Food
review/text: I have bought several of the Vitality canned dog food products and have
found them all to be of good quality. The product looks more like a stew than a
processed meat and it smells better. My Labrador is finicky and she appreciates this
product better than most.
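
One possible way to apply the rule described in the question is to stop counting lines and instead group them by the blank separator, keeping only lines whose key is one of the eight expected field names. The following is a minimal sketch under that assumption, not a tested solution for the real file: `EXPECTED_KEYS` and `iter_records` are hypothetical names, and it assumes each field sits on a single physical line (as the 9-line layout implies), so any wrapped continuation line would be dropped along with stray lines such as "error error error".

```python
import io

# Field names taken from the sample records in the question.
EXPECTED_KEYS = (
    "product/productId",
    "review/userId",
    "review/profileName",
    "review/helpfulness",
    "review/score",
    "review/time",
    "review/summary",
    "review/text",
)


def iter_records(lines):
    """Yield one dict per complete record, ignoring unrecognized lines."""
    record = {}
    for raw in lines:
        line = raw.strip()
        if not line:
            # A blank line closes the current record.
            if all(key in record for key in EXPECTED_KEYS):
                yield record
            record = {}
            continue
        # Split on the first colon only, so colons inside the value survive.
        key, sep, value = line.partition(":")
        if sep and key in EXPECTED_KEYS:
            record[key] = value.strip()
        # Any other line (for example "error error error") is skipped.
    # Handle a last record that is not followed by a blank line.
    if all(key in record for key in EXPECTED_KEYS):
        yield record


# Two records modeled on the question's sample; the second has a stray line.
sample = io.StringIO(
    "product/productId: B001E4KFG0\n"
    "review/userId: A3SGXH7AUHU8GW\n"
    "review/profileName: delmartian\n"
    "review/helpfulness: 1/1\n"
    "review/score: 5.0\n"
    "review/time: 1303862400\n"
    "review/summary: Good Quality Dog Food\n"
    "review/text: I have bought several of the Vitality canned dog food products\n"
    "\n"
    "product/productId: B001E4KFG0\n"
    "review/userId: A3SGXH7AUHU8GW\n"
    "review/profileName: delmartian\n"
    "review/helpfulness: 1/1\n"
    "error error error\n"
    "review/score: 5.0\n"
    "review/time: 1303862400\n"
    "review/summary: Good Quality Dog Food\n"
    "review/text: I have bought several of the Vitality canned dog food products\n"
    "\n"
)

records = list(iter_records(sample))
print(len(records))                 # 2
print(records[1]["review/score"])   # 5.0
```

For the real data, the open file object can be passed in place of `sample`, for example `iter_records(food_file)` inside the `with open('foods.txt', encoding='ISO-8859-1')` block. Because the function is a generator and reads one line at a time, the file is never loaded into memory at once. Records missing any expected field are silently dropped in this sketch; counting or logging them may be preferable.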

Code

import pandas as pd
import numpy as np
import collections

%time
with open('foods.txt',encoding='ISO-8859-1') as food_file:

    dict_list = []
    column_names = ('Product ID', 'Number of people who voted this review helpful', 'Total number of people who rated this review', 'Rating of product', 'Text of the review')

    line_num = 1
    while line_num <20000000:  
        #Read Lines
        line1 = food_file.readline()
        line2 = food_file.readline()
        line3 = food_file.readline()
        line4 = food_file.readline()
        line5 = food_file.readline()
        line6 = food_file.readline()
        line7 = food_file.readline()
        line8 = food_file.readline()
        line9 = food_file.readline()

        #Break out of the loop if we hit the end of the file
        if not line1:
            break

        #This code when in use tells me the last successful line. I then searched the text file to make corrections.
        #Manual process - not desirable
        #if len(line9) > 1:
            #print(line9)
            #break

        #Split Lines for Dataframe
        prod = line1.split(':')[1].strip()
        helpful = line4.split(':')[1].strip()
        helpful = helpful.split('/')[0] #More efficient approach?
        review_total = "/".join(line4.split("/",2)[2:]).strip()
        rating = line5.split(':')[1].strip()
        review_text = line8.split(':')[1].strip()    

        dict_list.append(collections.OrderedDict(zip(column_names, [prod, helpful, review_total, rating, review_text])))

        line_num += 9

amazon_df = pd.DataFrame(dict_list)
amazon_df
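
On the "More efficient approach?" comment and the `split(':')[1]` calls in the code above: `split(':')[1]` raises the 'list index out of range' error whenever a line contains no colon, and it also cuts a review text off at its first internal colon. A hedged alternative is `str.partition`, which always returns three parts. The helper names `split_field` and `parse_helpfulness` below are hypothetical, and the review text in the demo is a made-up line used only for illustration.

```python
def split_field(line):
    """Return (key, value) for a 'key: value' line, or None if there is no colon."""
    key, sep, value = line.partition(":")
    if not sep:
        return None
    return key.strip(), value.strip()


def parse_helpfulness(value):
    """Turn a string such as '1/1' into (helpful_votes, total_votes) as integers."""
    helpful, _, total = value.partition("/")
    return int(helpful), int(total)


# Made-up line with a colon inside the text, to show that nothing is truncated.
key, text = split_field("review/text: Great food: my dog loves it\n")
print(key)   # review/text
print(text)  # Great food: my dog loves it

# A stray line without a colon yields None instead of raising IndexError.
print(split_field("error error error\n"))  # None

_, votes = split_field("review/helpfulness: 1/1\n")
print(parse_helpfulness(votes))  # (1, 1)
```

`parse_helpfulness` raises `ValueError` if either side of the slash is not an integer, so a malformed value surfaces immediately rather than ending up in the DataFrame as a string.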

0 answers:

No answers