Question

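I have a CSV file with many lines (currently more than 40k). Because of the number of lines, I need to read them one at a time and run a series of operations on each. That is the first question. The second is: how do I read the CSV file and encode it as UTF-8? I tried to read the file as UTF-8 following the example in the csv documentation. Even using the class UTF8Recoder, what my print shows is \xe9 s\xf3. Can someone help me solve this?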

import preprocessing
import pymongo
import csv,codecs,cStringIO
from pymongo import MongoClient
from unicodedata import normalize
from preprocessing import PreProcessing

class UTF8Recoder:
    def __init__(self, f, encoding):
        self.reader = codecs.getreader(encoding)(f)
    def __iter__(self):
        return self
    def next(self):
        return self.reader.next().encode("utf-8")

class UnicodeReader:
    def __init__(self, f, dialect=csv.excel, encoding="utf-8-sig", **kwds):
        f = UTF8Recoder(f, encoding)
        self.reader = csv.reader(f, dialect=dialect, **kwds)
    def next(self):
        '''next() -> unicode
        This function reads and returns the next line as a Unicode string.
        '''
        row = self.reader.next()
        return [unicode(s, "utf-8") for s in row]
    def __iter__(self):
        return self

with open('data/MyCSV.csv','rb') as csvfile:
    reader = UnicodeReader(csvfile)
    #writer = UnicodeWriter(fout,quoting=csv.QUOTE_ALL)
    for row in reader:
        print row

def status_processing(corpus):

    myCorpus = preprocessing.PreProcessing()
    myCorpus.text = corpus

    print "Starting..."
    myCorpus.initial_processing()
    print "Done."
    print "----------------------------"

Edit 1: S Ringne's solution works. But now I cannot run my operations inside the def. This is the new code:

for csvfile in pd.read_csv('data/AracajuAgoraNoticias_facebook_statuses.csv',encoding='utf-8',sep=',', header='infer',engine='c', chunksize=2):

    def status_processing(csvfile):

        myCorpus = preprocessing.PreProcessing()
        myCorpus.text = csvfile

        print "Fazendo o processo inicial..."
        myCorpus.initial_processing()
        print "Feito."
        print "----------------------------"

At the end of the script:

def main():
    status_processing(csvfile)

main()

The output when I use BeautifulSoup to remove the links:

ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

Answer 1

Here is a simple pattern for reading a file line by line as UTF-8:

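import csv

# Python 3: open() accepts an encoding; newline='' is what the csv module expects.
with open(filename, 'r', encoding="utf-8", newline='') as csvfile:
    spamreader = csv.reader(csvfile, delimiter=',', quotechar='"')
    for row in spamreader:
        # your operations go here
        print(row)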
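
To show the pattern end to end, here is a minimal sketch, assuming Python 3. The sample file, its contents, and the helper name read_rows are made up for the demonstration. As a side note, in Python 2 `print row` prints the repr of the list, which shows non-ASCII characters as escapes such as \xe9 even when the decoding worked, so the sketch prints the fields themselves rather than the list.

```python
import csv
import os
import tempfile


def read_rows(path, encoding="utf-8"):
    """Yield one parsed CSV row at a time, so the file is never fully loaded."""
    # newline="" lets the csv module handle line endings itself.
    with open(path, "r", encoding=encoding, newline="") as csvfile:
        for row in csv.reader(csvfile, delimiter=",", quotechar='"'):
            yield row


# Hypothetical sample data, written only so the sketch is runnable.
tmp = tempfile.NamedTemporaryFile(
    "w", suffix=".csv", encoding="utf-8", newline="", delete=False
)
with tmp:
    csv.writer(tmp).writerows([["id", "message"], ["1", "é só"]])
sample_path = tmp.name

rows = list(read_rows(sample_path))
os.remove(sample_path)

for row in rows:
    # Print the joined fields instead of the list object.
    print(" | ".join(row))
```

Because read_rows is a generator, you can also loop over it directly and process each row as it arrives instead of collecting everything into a list first.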

Answer 2

You can load your CSV into pandas and do the rest of your processing there, which will be faster.

import pandas as pd
df = pd.read_csv('path_to_file.csv',encoding='utf-8',header = 'infer',engine = 'c')
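
If the file is too large to load at once, read_csv also accepts chunksize, in which case it returns an iterator of DataFrames instead of a single one. The sketch below is one possible way to process such chunks cell by cell. It assumes the text lives in a column named status_message, and it uses a made-up clean_text function in place of the asker's PreProcessing class, whose API is not shown in the question. Calling bool() on a DataFrame is what raises "The truth value of a DataFrame is ambiguous", which is likely to happen when a whole chunk is passed to code that expects a string, so each cell is handed over as a plain str.

```python
import io

import pandas as pd


def clean_text(text):
    """Hypothetical stand-in for the real per-status processing."""
    return text.strip().lower()


def process_chunks(source, column, chunksize=2):
    """Read the CSV in chunks and process one cell (a plain str) at a time."""
    results = []
    for chunk in pd.read_csv(source, encoding="utf-8", sep=",", chunksize=chunksize):
        # chunk is a DataFrame; walk the non-missing values of one column.
        for text in chunk[column].dropna():
            results.append(clean_text(str(text)))
    return results


# In-memory sample standing in for the real file (made-up data).
sample = io.StringIO("status_id,status_message\n1, Olá Mundo \n2,É SÓ\n3,\n")
processed = process_chunks(sample, "status_message")
print(processed)
```

Defining the processing function once at module level and calling it inside the loop also avoids the situation in Edit 1, where the def sits inside the for loop and is only called after the loop has finished.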

Python - CSV reader - reading one line at a time

2 answers: