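Using re while reading a CSV file

Time: 2018-05-30 07:23:05

Tags: python regex csv

I have a list of keywords, healthy_list, that I want to look for in one column of a CSV file. If at least one keyword from the list appears in that column, I write the whole row to a new CSV file.

I use re.search to check for the keywords and record the row numbers, then use csv.writer to write those rows to the new CSV. But many rows that contain a keyword do not show up in my new CSV file. Any suggestions?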

healthy_new=[]
with open("Data 2017.csv","rb") as f:
    csvreader=csv.reader(f,delimiter=",")
    next(csvreader)
    for line, row in enumerate(csvreader):
        for word in healthy_list:
            try:
                if  (re.search(word,row[4].lower()) ):
                    healthy_new.append(line)
            except ValueError:
                continue 

healthy_new=list(set(healthy_new))

....

f = open("Data 2017.csv", "r")
reader = csv.reader(f)

data = open("healthy_new_output.csv", "w")
w = csv.writer(data, delimiter=',')
for idx, row in enumerate(reader):
    idx+=-1
    if idx in healthy_new:
        my_row = row
        w.writerow(my_row)

Edit: some of the data from Data 2017.csv

healthy_list:

 [...'diet', 'low-fat', 'light', 'diet', 'salad', 'salads', 'baked', 'grilled', 'whole grain']
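
For comparison, the two-pass approach described above (collect row numbers, then re-read the file) can be collapsed into a single pass that writes each matching row as soon as it is found, so no row numbers need to line up between two reads. The following is only a minimal sketch, assuming Python 3; the helper name filter_rows, the sample rows, and the shortened keyword list are hypothetical and not taken from the actual data.

```python
import csv
import io
import re


def filter_rows(src, dst, keywords, column=4):
    """Copy the header and every row whose `column` contains a keyword."""
    # Escape each keyword so characters such as '-' or '.' match literally.
    pattern = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)
    reader = csv.reader(src)
    writer = csv.writer(dst)
    writer.writerow(next(reader))  # keep the header row
    kept = 0
    for row in reader:
        # Rows that are too short are skipped instead of raising IndexError.
        if len(row) > column and pattern.search(row[column]):
            writer.writerow(row)
            kept += 1
    return kept


# Hypothetical sample; with real files, open both with newline="".
sample = io.StringIO(
    "id,a,b,c,item\n"
    "1,x,x,x,Grilled chicken salad\n"
    "2,x,x,x,Double cheeseburger\n"
    "3,x,x,x,Low-fat yogurt\n"
)
out = io.StringIO()
kept = filter_rows(sample, out, ["salad", "grilled", "low-fat"])
print(kept)
print(out.getvalue())
```

Compiling one alternation pattern up front also means each row is scanned once instead of once per keyword.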

1 Answer:

Answer 0 (score: 0)

You can use pandas to filter the rows and then, if needed, write the result to CSV with the to_csv method.

Here is a basic illustration of how it works, using this sample Data 2017.csv:

name,age,description
Andy,15,Having a bad stomach
Bobby,21,Having a good stomach and a little flu
Connie,22,Not having anything particularly bad
Derry,12,Bad stomach & lightheaded

In []: import re

In []: import pandas as pd

In []: df = pd.read_csv('Data 2017.csv')

In []: word_flags = ['bad', 'flu', 'lightheaded']

In []: df_filtered = df[df.description.str.contains("|".join(word_flags), flags=re.IGNORECASE)]

In []: df_filtered
Out[]: 
     name  age                             description
0    Andy   15                    Having a bad stomach
1   Bobby   21  Having a good stomach and a little flu
2  Connie   22    Not having anything particularly bad
3   Derry   12               Bad stomach & lightheaded

In []: word_flags = ['flu', 'foo', 'bar']

In []: df_filtered = df[df.description.str.contains("|".join(word_flags), flags=re.IGNORECASE)]

In []: df_filtered
Out[]: 
    name  age                             description
1  Bobby   21  Having a good stomach and a little flu

df_filtered.to_csv("Filtered Data 2017.csv", index=False)

The resulting Filtered Data 2017.csv contains:

name,age,description
Bobby,21,Having a good stomach and a little flu
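
Applied to the setup in the question, the same idea might look like the sketch below. It assumes the text to search is the fifth column (row[4] in the question), uses a shortened healthy_list, and reads from an in-memory sample instead of the real file; the column names and rows are hypothetical. The keywords are passed through re.escape so they are matched literally.

```python
import io
import re

import pandas as pd

# Hypothetical stand-in for the question's file; "item" plays the role of row[4].
raw = io.StringIO(
    "id,a,b,c,item\n"
    "1,x,x,x,Grilled chicken salad\n"
    "2,x,x,x,Double cheeseburger\n"
    "3,x,x,x,\n"
    "4,x,x,x,LOW-FAT yogurt\n"
)
healthy_list = ["diet", "low-fat", "salad", "grilled", "whole grain"]

df = pd.read_csv(raw)
pattern = "|".join(re.escape(word) for word in healthy_list)

# iloc[:, 4] selects the fifth column by position; case=False ignores case,
# and na=False treats empty cells as non-matches instead of propagating NaN.
mask = df.iloc[:, 4].str.contains(pattern, case=False, na=False)
healthy_rows = df[mask]

# Write to a buffer here; pass a file name instead to create a real CSV file.
out = io.StringIO()
healthy_rows.to_csv(out, index=False)
print(out.getvalue())
```

Passing na=False matters when the column has empty cells: without it the mask contains missing values, and indexing with such a mask fails.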

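To check every text column instead of only description, loop over the string columns and collect the matches: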

In []: word_flags = ['bad', 'flu', 'lightheaded']

In []: df2 = pd.DataFrame()

In []: for col in df.select_dtypes(object):
    ...:     df2 = pd.concat([df2, df[df[col].str.contains("|".join(word_flags), flags=re.IGNORECASE)]])
    ...:     

In []: df2
Out[]: 
     name  age                             description
0    Andy   15                    Having a bad stomach
1   Bobby   21  Having a good stomach and a little flu
2  Connie   22    Not having anything particularly bad
3   Derry   12               Bad stomach & lightheaded

In []: word_flags = ['flu', 'foo', 'bar']

In []: df2 = pd.DataFrame()

In []: for col in df.select_dtypes(object):
    ...:     df2 = pd.concat([df2, df[df[col].str.contains("|".join(word_flags), flags=re.IGNORECASE)]])
    ...:     

In []: df2
Out[]: 
    name  age                             description
1  Bobby   21  Having a good stomach and a little flu

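To address your problem specifically, see the code snippets above.

However, the column-by-column loop only gives clean results if the keywords match in one specific column. Suppose you define word_flags this way:

In []: word_flags = ['flu', 'foo', 'bar', 'bobby']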

In []: df2 = pd.DataFrame()

In []: for col in df.select_dtypes(object):
    ...:     df2 = pd.concat([df2, df[df[col].str.contains("|".join(word_flags), flags=re.IGNORECASE)]])
    ...:     

In []: df2
Out[]: 
    name  age                             description
1  Bobby   21  Having a good stomach and a little flu
1  Bobby   21  Having a good stomach and a little flu

This produces duplicate records, which need further cleanup.
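
One way to avoid the duplicates in the first place is to combine the per-column matches into a single boolean mask and index the frame once. This is a sketch under assumptions, not part of the original answer: the text columns are listed explicitly as text_columns (a hypothetical name), and the sample data is loaded from a string rather than from Data 2017.csv.

```python
import io
import re

import pandas as pd

df = pd.read_csv(io.StringIO(
    "name,age,description\n"
    "Andy,15,Having a bad stomach\n"
    "Bobby,21,Having a good stomach and a little flu\n"
    "Connie,22,Not having anything particularly bad\n"
    "Derry,12,Bad stomach & lightheaded\n"
))
word_flags = ["flu", "foo", "bar", "bobby"]
pattern = "|".join(re.escape(w) for w in word_flags)

# Assumption: the text columns are known up front and listed explicitly.
text_columns = ["name", "description"]

# Start with an all-False mask and OR in the matches from each text column,
# so a row that matches in several columns is still selected only once.
mask = pd.Series(False, index=df.index)
for col in text_columns:
    mask |= df[col].str.contains(pattern, flags=re.IGNORECASE, na=False)

df_unique = df[mask]
print(df_unique)
```

Calling drop_duplicates() on df2 would also remove the repeats, but it would additionally merge rows that were already identical in the source file.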
