Question

I have a very large CSV file (more than a million rows) that I want to run some operations on. The problem is that some rows contain unwanted line breaks, for example:

New York City; Iron Man; no superpowers;
Metropolis; Superman; superpowers;
New York City;
Spider-Man;
superpowers;
Gotham; Batman; no superpowers;
New York City; Doctor Strange; superpowers;

So the file has three columns (location, superhero, superpowers). Because the Spider-Man entry is broken by line breaks between its fields, pandas wrongly treats it as three separate rows, with NaNs in the second and third columns.

My idea was to fix this during import with a regular expression. According to the site where I tested it, this regex correctly matches the well-formed lines and does not match the faulty ones (i.e. Spider-Man):

(.*[;].*[;].*)

Its inverse, (?!(.*[;].*[;].*)), does not work, because it fails to match not only the three faulty lines but also the third entry of every normal line.

My other approach was to simply set the number of columns and then remove all line breaks from the whole file. That did not work either:

superhero_df = pd.read_csv("superheroes.csv", sep=' *; *', skiprows=12, names=["location", "superhero", "superpower"], index_col=False, engine="python")
superhero_df = superhero_df.replace('\r\n','', regex=True)

The desired output should look like this:

New York City; Iron Man; no superpowers;
Metropolis; Superman; superpowers;
New York City; Spider-Man; superpowers;
Gotham; Batman; no superpowers;
New York City; Doctor Strange; superpowers;

Answer 1

How about this pattern:

^([^;]+);[\r\n]*([^;]+);[\r\n]*([^;]+);

replaced with:

\1;\2;\3;

import re

regex = r"^([^;]+);[\r\n]*([^;]+);[\r\n]*([^;]+);"

test_str = ("New York City; Iron Man; no superpowers;\n"
    "Metropolis; Superman; superpowers;\n"
    "New York City;\n"
    "Spider-Man;\n"
    "superpowers;\n"
    "Gotham; Batman; no superpowers;\n"
    "New York City; Doctor Strange; superpowers;\n\n")

subst = "\\1;\\2;\\3;"

# count=0 replaces every match; pass a positive count to limit the number of replacements
result = re.sub(regex, subst, test_str, count=0, flags=re.MULTILINE)

print(result)

Answer 2

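The following regex removes the unwanted line breaks and other whitespace that follow the fields of each three-field record. It assumes that the fields contain no internal semicolons: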

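import re

# `line` holds the raw text to clean, e.g. the sample from the question,
# ending with a line break so that the last record also matches \s+
print(re.sub(r'([^;]*);\s*([^;]*);\s*([^;]*);\s+', r'\1;\2;\3\n',
      line, flags=re.M))
#New York City;Iron Man;no superpowers
#Metropolis;Superman;superpowers
#New York City;Spider-Man;superpowers
#Gotham;Batman;no superpowers
#New York City;Doctor Strange;superpowers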

You can use it in a loop to preprocess the file before handing it to Pandas.
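A minimal sketch of that preprocessing step, assuming the whole file fits in memory and that no field contains a semicolon. The helper name `clean_text` is made up for illustration; the pattern is the one from this answer.

```python
import re

# Same pattern as above: three semicolon-terminated fields, then trailing whitespace.
ROW = re.compile(r'([^;]*);\s*([^;]*);\s*([^;]*);\s+')

def clean_text(raw: str) -> str:
    """Collapse every group of three fields onto one line (hypothetical helper)."""
    # Append a line break so the final record also has whitespace for \s+ to match.
    return ROW.sub(r'\1;\2;\3\n', raw + '\n')

sample = (
    "New York City; Iron Man; no superpowers;\n"
    "Metropolis; Superman; superpowers;\n"
    "New York City;\n"
    "Spider-Man;\n"
    "superpowers;\n"
    "Gotham; Batman; no superpowers;\n"
    "New York City; Doctor Strange; superpowers;"
)

cleaned = clean_text(sample)
print(cleaned)
```

The cleaned text could then be passed to Pandas without a temporary file, for example with pd.read_csv(io.StringIO(cleaned), sep=';', header=None).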

Answer 3

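If I were you, I would rewrite the whole data set into a new text file with one simple pass over the original file, then load the resulting file into Pandas, with no need for re: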

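with open('source.txt') as fin, open('target.txt', 'w') as fout:
    lc = 0  # semicolons seen so far in the current record
    for line in fin:
        lc += line.count(';')
        if lc < 3:
            # Record is still incomplete: write the line without its line break
            fout.write(line.rstrip('\n'))
        else:
            # Record is complete: keep the line break and reset the counter
            fout.write(line)
            lc = 0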

Result:

# New York City; Iron Man; no superpowers;
# Metropolis; Superman; superpowers;
# New York City;Spider-Man;superpowers;
# Gotham; Batman; no superpowers;
# New York City; Doctor Strange; superpowers;

Read it into Pandas:

pd.read_csv('target.txt', header=None, sep=';', usecols=range(3))

#                0                1                2
# 0  New York City         Iron Man   no superpowers
# 1     Metropolis         Superman      superpowers
# 2  New York City       Spider-Man      superpowers
# 3         Gotham           Batman   no superpowers
# 4  New York City   Doctor Strange      superpowers

Note: usecols is needed only because of the trailing semicolons. This can be avoided by preprocessing the file like this instead:

with open('source.txt') as fin, open('target.txt', 'w') as fout:
    lc = 0
    for line in fin:
        lc += line.count(';')
        if  lc < 3:
            fout.write(line.strip())
        else:
            fout.write(line.strip()[:-1] + '\n')
            lc = 0

Read it into Pandas:

pd.read_csv('target.txt', header=None, sep=';')

#                0                1                2
# 0  New York City         Iron Man   no superpowers
# 1     Metropolis         Superman      superpowers
# 2  New York City       Spider-Man      superpowers
# 3         Gotham           Batman   no superpowers
# 4  New York City   Doctor Strange      superpowers
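For a file with more than a million rows, the same counting idea can also be written as a generator, so that nothing has to be written to an intermediate file. This is only a sketch under the same assumptions (three fields per record, no semicolons inside a field, no empty last field); the name `repaired_rows` and its `n_fields` parameter are hypothetical.

```python
from typing import Iterable, Iterator

def repaired_rows(lines: Iterable[str], n_fields: int = 3) -> Iterator[str]:
    """Yield one complete record per item, joining physical lines that were split.

    An incomplete record at the very end of the input is dropped.
    """
    parts = []  # pieces of the record collected so far
    seen = 0    # semicolons seen in the current record
    for line in lines:
        piece = line.strip()
        if not piece:
            continue  # ignore blank lines
        parts.append(piece)
        seen += piece.count(';')
        if seen >= n_fields:
            # Drop the trailing semicolon so no empty fourth column appears.
            yield ''.join(parts).rstrip(';')
            parts, seen = [], 0

sample = [
    "New York City; Iron Man; no superpowers;\n",
    "Metropolis; Superman; superpowers;\n",
    "New York City;\n",
    "Spider-Man;\n",
    "superpowers;\n",
    "Gotham; Batman; no superpowers;\n",
    "New York City; Doctor Strange; superpowers;\n",
]

rows = list(repaired_rows(sample))
for row in rows:
    print(row)
```

Because an open file object is itself an iterable of lines, `repaired_rows(open('source.txt'))` would process the file lazily, one line at a time.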

Answer 4

The simplest solution:

import pandas as pd
import re

string = """New York City; Iron Man; no superpowers;
Metropolis; Superman; superpowers;
New York City;
Spider-Man;
superpowers;
Gotham; Batman; no superpowers;
New York City; Doctor Strange; superpowers;"""

cities=[]
superheros=[]
superpowers = []

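# Split on semicolons and strip the spaces and line breaks around each field
splited_list = [field.strip() for field in re.split(';', string)]
splited_list.pop()  # drop the empty string left after the final semicolon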

i = 0

while i < len(splited_list) - 1:
    cities.append(splited_list[i])
    superheros.append(splited_list[i + 1])
    superpowers.append(splited_list[i + 2])

    i = i + 3


df = pd.DataFrame({
    "City": cities,
    "Superhero": superheros,
    "superpowers": superpowers
})
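The index-based loop can also be expressed with slice steps. The sketch below assumes, as the answer does, that every record has exactly three fields, that no field contains a semicolon, and that the text ends with a semicolon; the variable names are illustrative only.

```python
text = """New York City; Iron Man; no superpowers;
Metropolis; Superman; superpowers;
New York City;
Spider-Man;
superpowers;
Gotham; Batman; no superpowers;
New York City; Doctor Strange; superpowers;"""

# Everything after the final semicolon is discarded, then each field is stripped
# of surrounding spaces and line breaks.
fields = [field.strip() for field in text.split(';')[:-1]]

# Every third item, starting at offsets 0, 1 and 2, forms one column.
cities = fields[0::3]
superheroes = fields[1::3]
superpowers = fields[2::3]

rows = list(zip(cities, superheroes, superpowers))
for row in rows:
    print(row)
```

The three column lists can be passed to pd.DataFrame exactly as in the answer above.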

Answer 5

Here is my approach. It is not optimized for performance, but it does the job:

from pprint import pprint

def main():
    count=0
    outer_list=[]
    row=[]
    with open('superheroes.csv') as f:
        for line in f:
            for word in line.split(";"):
                word = word.strip()
                if word:  # skip empty pieces and bare line breaks
                    row.append(word)
                    count = count + 1
                    if count % 3 == 0:
                        outer_list.append(row)
                        row=[]
    pprint(outer_list)

if __name__== "__main__":
    main()

The output is a list of lists:

[['New York City', 'Iron Man', 'no superpowers'],
 ['Metropolis', 'Superman', 'superpowers'],
 ['New York City', 'Spider-Man', 'superpowers'],
 ['Gotham', 'Batman', 'no superpowers'],
 ['New York City', 'Doctor Strange', 'superpowers']]
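If a cleaned file is wanted rather than an in-memory list, the list of lists can be written back out with the standard csv module. This is a sketch only: the in-memory buffer stands in for a real output file, and the variable names are not part of the answer.

```python
import csv
import io

rows = [
    ['New York City', 'Iron Man', 'no superpowers'],
    ['Metropolis', 'Superman', 'superpowers'],
    ['New York City', 'Spider-Man', 'superpowers'],
    ['Gotham', 'Batman', 'no superpowers'],
    ['New York City', 'Doctor Strange', 'superpowers'],
]

# io.StringIO stands in for open('clean.csv', 'w', newline='') in this sketch.
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter=';', lineterminator='\n')
writer.writerows(rows)

cleaned = buffer.getvalue()
print(cleaned)
```

Each record ends up on a single line with no trailing semicolon, so the result can be read with pd.read_csv(..., sep=';', header=None) without usecols.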

Eliminating unwanted line breaks in a CSV file

5 answers: