Reformatting a CSV file using Python?

Time: 2016-06-04 08:37:13

Tags: python-3.x csv

I have this CSV file with only two entries. Here it is:

Meat One,['Abattoirs', 'Exporters', 'Food Delivery', 'Butchers Retail', 'Meat Dealers-Retail', 'Meat Freezer', 'Meat Packers']

The first is the title and the second is the business headings.

The problem is with the second entry.

Here is my code:

import csv

with open('phonebookCOMPK-Directory.csv', "rt") as textfile:
    reader = csv.reader(textfile)

    for row in reader:
        row5 = row[5].replace("[", "").replace("]", "")
        listt = [(''.join(row5))]
        print (listt[0])

This prints:

'Abattoirs', 'Exporters', 'Food Delivery', 'Butchers Retail', 'Meat Dealers-Retail', 'Meat Freezer', 'Meat Packers'

What I need to do is create a list containing these words and then print them with a for loop, one item per line:

Abattoirs
Exporters
Food Delivery
Butchers Retail
Meat Dealers-Retail
Meat Freezer
Meat Packers

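Actually, I am trying to reformat my current CSV file and clean it up so that it is more precise and easier to understand.

The complete first line of the CSV looks like this: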

Meat One,+92-21-111163281,Al Shaheer Corporation,Retailers,2008,"['Abattoirs', 'Exporters', 'Food Delivery', 'Butchers Retail', 'Meat Dealers-Retail', 'Meat Freezer', 'Meat Packers']","[[' Outlets Address : Shop No. Z-10, Station Shopping Complex, MES Market, Malir-Cantt, Karachi. Landmarks : MES Market, Station Shopping Complex City : Karachi UAN : +92-21-111163281 '], [' Outlets Address : Shop 13, Ground Floor, Plot 14-D, Sky Garden, Main Tipu Sultan Road, KDA Scheme No.1, Karachi. Landmarks : Nadra Chowrangi, Sky Garden, Tipu Sultan Road City : Karachi UAN : +92-21-111163281 '], ["" Outlets Address : Near Jan's Broast, Boat Basin, Khayaban-e-Roomi, Block 5, Clifton, Karachi. Landmarks : Boat Basin, Jans Broast, Khayaban-e-Roomi City : Karachi UAN : +92-21-111163281 View Map ""], [' Outlets Address : Gulistan-e-Johar, Karachi. Landmarks : Perfume Chowk City : Karachi UAN : +92-21-111163281 '], [' Outlets Address : Tee Emm Mart, Creek Vista Appartments, Khayaban-e-Shaheen, Phase VIII, DHA, Karachi. Landmarks : Creek Vista Appartments, Nueplex Cinema, Tee Emm Mart, The Place City : Karachi Mobile : 0302-8333666 '], [' Outlets Address : Y-Block, DHA, Lahore. Landmarks : Y-Block City : Lahore UAN : +92-42-111163281 '], [' Outlets Address : Adj. PSO, Main Bhittai Road, Jinnah Supermarket, F-7 Markaz, Islamabad. Landmarks : Bhittai Road, Jinnah Super Market, PSO Petrol Pump City : Islamabad UAN : +92-51-111163281 ']]","Agriculture, fishing & Forestry > Farming equipment & services > Abattoirs in Pakistan"

First column is Name
Second column is Number
Third column is Owner
Fourth column is Business type
Fifth column is Y.O.E
Sixth column is Business Headings
Seventh column is Outlets (List of lists containing every branch address)
Eighth column is classification

There is no requirement to use csv.reader; I can use any available technique to clean up the file.

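2 Answers:

Answer 0: (score: 1)

Think of it in terms of two separate tasks:

  • Collect some data items from a "dirty" source (this CSV file)
  • Store the data somewhere so that it can be easily accessed and manipulated programmatically (according to what you want to do with it)

Handling the dirty CSV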

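One way to do this is to have a function deserialize_business() that distills structured business information from each incoming line of your CSV. This function can be complex, because that is the nature of the task, but it is still advisable to split it into smaller self-contained functions (such as get_outlets(), get_headings(), and so on). The function can return a dictionary, but depending on your needs it could also be a [named] tuple, a custom object, etc.

This function would be the "adapter" for this particular CSV data source.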
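As an illustration of one such smaller helper, here is a minimal sketch of an outlet parser. The name deserialize_outlet() is used later in this answer, but its body is not given there, so the regular expression and the returned keys below are assumptions based only on the sample row in the question (fields labelled Outlets Address, Landmarks, City, and UAN or Mobile).

```python
import re

# Assumed layout, taken from the sample row only:
#   "Outlets Address : <address> Landmarks : <landmarks> City : <city> UAN : <phone>"
# The leading "Outlets Address" label is optional so that text which was
# already split on that label is accepted as well.
OUTLET_RE = re.compile(
    r"^(?:\s*Outlets Address)?\s*:\s*(?P<address>.*?)\s*"
    r"Landmarks\s*:\s*(?P<landmarks>.*?)\s*"
    r"City\s*:\s*(?P<city>.*?)\s*"
    r"(?:UAN|Mobile)\s*:\s*(?P<phone>[+\d-]+)"
)


def deserialize_outlet(raw_outlet):
    """Turn one raw outlet string into a dict with address, landmarks, city, phone."""
    match = OUTLET_RE.search(raw_outlet)
    if match is None:
        raise ValueError("Unrecognized outlet format: %r" % raw_outlet)
    return match.groupdict()
```

Raising ValueError on an unexpected layout keeps malformed outlets visible instead of silently storing half-parsed data.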

Example deserializing function:

def deserialize_business(csv_line):
    """
    Distills structured business information from given raw CSV line.
    Returns a dictionary like {name, phone, owner,
    btype, yoe, headings[], outlets[], category}.
    """

    # the category may itself contain commas, so cut it off after the
    # closing "]]" of the outlets list before splitting on commas
    body, _, category = csv_line.strip().rpartition("]]")
    category = category.strip("\", ")

    pieces = [piece.strip("[[\"\']] ") for piece in body.split(',')]

    name = pieces[0]
    phone = pieces[1]
    owner = pieces[2]
    btype = pieces[3]
    yoe = pieces[4]

    # index of the first piece that starts an outlet description
    first_outlet = next(
        i for i, piece in enumerate(pieces)
        if piece.startswith("Outlets Address")
    )

    # after yoe headings begin, until the first outlet
    headings = pieces[5:first_outlet]

    # outlets run from the first outlet to the end of the remaining pieces
    outlet_pieces = pieces[first_outlet:]

    # combine each individual outlet information into a string
    # and let ``deserialize_outlet()`` (not shown here) deal with that;
    # the split leaves an empty leading chunk, which is skipped
    raw_outlets = ', '.join(outlet_pieces).split("Outlets Address")
    outlets = [deserialize_outlet(outlet) for outlet in raw_outlets if outlet.strip()]

    return {
        'name': name,
        'phone': phone,
        'owner': owner,
        'btype': btype,
        'yoe': yoe,
        'headings': headings,
        'outlets': outlets,
        'category': category,
    }
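As an alternative to splitting on commas by hand, and not the approach taken in this answer, the sample row appears to be valid CSV whose sixth and seventh columns are Python list literals. Under that assumption, a sketch built on the standard csv and ast modules could look like the following; deserialize_business_csv() is a hypothetical name, and the code assumes every row has exactly eight columns.

```python
import ast
import csv


def deserialize_business_csv(csv_line):
    """Hypothetical variant of deserialize_business() built on csv + ast."""
    # csv.reader handles the quoting, including the comma inside the category
    fields = next(csv.reader([csv_line]))
    # raises ValueError if the row does not have exactly eight columns
    name, phone, owner, btype, yoe, raw_headings, raw_outlets, category = fields

    # "['Abattoirs', ...]" -> ['Abattoirs', ...] without executing code
    headings = ast.literal_eval(raw_headings)

    # the outlets column is a list of one-element lists; flatten and trim it
    outlets = [
        text.strip()
        for sublist in ast.literal_eval(raw_outlets)
        for text in sublist
    ]

    return {
        'name': name,
        'phone': phone,
        'owner': owner,
        'btype': btype,
        'yoe': yoe,
        'headings': headings,
        'outlets': outlets,
        'category': category,
    }
```

Each outlet is left as a trimmed string here; it could be parsed further by a helper such as deserialize_outlet().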

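Example calling it:

import logging

log = logging.getLogger(__name__)

with open("phonebookCOMPK-Directory.csv") as f:
    for lineno, line in enumerate(f, start=1):
        try:
            business = deserialize_business(line)
        except Exception:
            # Bad line formatting?
            log.exception("Failed to deserialize line #%s!", lineno)
        else:
            # All is well
            store_business(business)

Storing the data

You will have the store_business() function take your data structure and write it somewhere. Maybe it will be another, better structured CSV, maybe multiple CSVs, a JSON file, or you can make use of SQLite relational database facilities, since Python has them built in.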

It all depends on what you want to do later.
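For the JSON option, a minimal sketch of store_business() might append each business to a JSON Lines file (one JSON object per line). The default file name businesses.jsonl and the path parameter are assumptions made for illustration only.

```python
import json


def store_business(business, path="businesses.jsonl"):
    """Append one business dictionary to a JSON Lines file."""
    with open(path, "a", encoding="utf-8") as out:
        # ensure_ascii=False keeps any non-ASCII text readable in the file
        json.dump(business, out, ensure_ascii=False)
        out.write("\n")
```

Appending one object per line means the file can later be read back line by line with json.loads(), without loading everything at once.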

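Relational example

In this case your data would be spread across multiple tables. (I use the word "table", but it can be a CSV file, although you can just as well use an SQLite DB, since Python has that built in.)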

Table identifying all possible business headings:

business heading ID, name
1, Abattoirs
2, Exporters
3, Food Delivery
4, Butchers Retail
5, Meat Dealers-Retail
6, Meat Freezer
7, Meat Packers

Table identifying all possible categories:

category ID, parent category, name
1, NULL, "Agriculture, fishing & Forestry"
2, 1, "Farming equipment & services"
3, 2, "Abattoirs in Pakistan"

Table identifying businesses:

business ID, name, phone, owner, type, yoe, category
1, Meat One, +92-21-111163281, Al Shaheer Corporation, Retailers, 2008, 3

Table describing their outlets:

business ID, city, address, landmarks, phone
1, Karachi, "Shop 13, Ground Floor, Plot 14-D, Sky Garden, Main Tipu Sultan Road, KDA Scheme No.1, Karachi", "Nadra Chowrangi, Sky Garden, Tipu Sultan Road", +92-21-111163281
1, Karachi, "Near Jan's Broast, Boat Basin, Khayaban-e-Roomi, Block 5, Clifton, Karachi", "Boat Basin, Jans Broast, Khayaban-e-Roomi", +92-21-111163281

Table describing their headings:

business ID, business heading ID
1, 1
1, 2
1, 3
…

Handling all of this would require a complex store_business() function. If you go the relational way of keeping the data, it may be worth looking into SQLite and some ORM framework.
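If the relational route is taken, Python's built-in sqlite3 module could hold tables like those above. The following is a minimal sketch covering only the business, heading, and business-heading tables; the table and column names and the helper store_business_sqlite() are hypothetical, and categories and outlets are left out for brevity.

```python
import sqlite3

# Hypothetical schema mirroring three of the tables described above
SCHEMA = """
CREATE TABLE heading (
    id   INTEGER PRIMARY KEY,
    name TEXT UNIQUE NOT NULL
);
CREATE TABLE business (
    id    INTEGER PRIMARY KEY,
    name  TEXT NOT NULL,
    phone TEXT,
    owner TEXT,
    btype TEXT,
    yoe   TEXT
);
CREATE TABLE business_heading (
    business_id INTEGER NOT NULL REFERENCES business(id),
    heading_id  INTEGER NOT NULL REFERENCES heading(id),
    PRIMARY KEY (business_id, heading_id)
);
"""


def store_business_sqlite(conn, business):
    """Insert one business dict and link it to its headings; return its row ID."""
    cur = conn.execute(
        "INSERT INTO business (name, phone, owner, btype, yoe) VALUES (?, ?, ?, ?, ?)",
        (business['name'], business['phone'], business['owner'],
         business['btype'], business['yoe']),
    )
    business_id = cur.lastrowid
    for heading in business['headings']:
        # reuse an existing heading row if this name was stored before
        conn.execute("INSERT OR IGNORE INTO heading (name) VALUES (?)", (heading,))
        (heading_id,) = conn.execute(
            "SELECT id FROM heading WHERE name = ?", (heading,)
        ).fetchone()
        conn.execute(
            "INSERT INTO business_heading (business_id, heading_id) VALUES (?, ?)",
            (business_id, heading_id),
        )
    conn.commit()
    return business_id
```

A connection would be prepared once with sqlite3.connect(...) followed by conn.executescript(SCHEMA), then passed to the helper for each deserialized business.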

Answer 1: (score: 0)

You only need to replace one line:

print(listt[0])

with:

print(*listt[0], sep='\n')
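Note that in the question's code listt[0] is a single string, so unpacking it with * prints one character per line rather than one heading per line. A minimal sketch of an alternative, assuming the sixth column always holds a valid Python list literal as in the sample row; iter_headings() is a hypothetical helper name:

```python
import ast
import csv
import io


def iter_headings(textfile):
    """Yield every business heading found in column six of each CSV row."""
    for row in csv.reader(textfile):
        # row[5] looks like "['Abattoirs', 'Exporters']"; literal_eval turns
        # that text into a real list without executing arbitrary code
        for heading in ast.literal_eval(row[5]):
            yield heading


# Shortened version of the sample row from the question
sample = io.StringIO(
    "Meat One,+92-21-111163281,Al Shaheer Corporation,Retailers,2008,"
    "\"['Abattoirs', 'Exporters', 'Food Delivery']\"\n"
)
for heading in iter_headings(sample):
    print(heading)
```

With the real file, the object returned by open('phonebookCOMPK-Directory.csv', newline='') can be passed in place of the StringIO sample.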