Remove URL strings containing specific substrings from a .csv

Asked: 2019-06-27 14:34:37

Tags: python-3.x csv

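I wrote the following code. It definitely removes some URLs from the list, but I can see that many of the remaining URLs still contain the parameters I am filtering on.

I added

row[0].lower()

to try to remedy this, but it still doesn't work.

The URLs carry parameters like these:

?currentPage=2&Nrpp=24&No=24
?pagination=1&currentPage=2

Does it have something to do with the "?"?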

import csv

values =  [
   "/blog",
   "nrpp",
   "pagination"
]  

added_vals = []

with open("internal_all_dup_facets.csv", "rt", encoding="utf-8") as inp, open("dupfacets.csv", "w", newline='') as out:
  writer = csv.writer(out)
  for row in csv.reader(inp):
     for value in values:
         if value not in row[0].lower() and row[0] not in added_vals:
            writer.writerow(row)
         added_vals.append(row[0])

The output should essentially be the same file, but with far fewer rows. Here are some sample URLs:

/category/dresses-5699972/juna-rose/N-ihuZ20cbZc1y?currentPage=29&Nrpp=24&No=672
/category/dresses-5699972/tall-dresses-204374/purple/N-ij9ZbyvZc1y
/category/dresses-5699972/pencil-dresses-204531/short-sleeve/N-iisZ21b9Zc1y?pagination=1&currentPage=2
/category/dresses-5699972/tan/N-ihuZbyyZc1y?currentPage=10&Nrpp=24&No=216

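2 Answers:

Answer 0: (score: 0)

Here is the problem: you loop over the three values, and on the first iteration you only test whether the first value ("/blog") is in row[0]. If it is not, the row is written right away, even when it contains one of the other values. row[0] is then appended to added_vals regardless of the result, so the row is never effectively tested against the remaining values, and a row that failed the first test can never be written either.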
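To see this concretely, here is a minimal sketch that mirrors the loop from the question but collects rows in a list instead of writing a file. The URLs are the question's samples with the stray spaces removed; the "/blog/some-post" entry and the function name original_filter are hypothetical, added only for illustration.

```python
values = ["/blog", "nrpp", "pagination"]

# Sample URLs from the question, plus a hypothetical "/blog" URL for illustration
urls = [
    "/category/dresses-5699972/juna-rose/N-ihuZ20cbZc1y?currentPage=29&Nrpp=24&No=672",
    "/category/dresses-5699972/tall-dresses-204374/purple/N-ij9ZbyvZc1y",
    "/category/dresses-5699972/pencil-dresses-204531/short-sleeve/N-iisZ21b9Zc1y?pagination=1&currentPage=2",
    "/blog/some-post",
]


def original_filter(urls, values):
    """Mirror the question's loop, collecting rows instead of writing them."""
    written = []
    added_vals = []
    for url in urls:
        for value in values:
            # Same condition as in the question
            if value not in url.lower() and url not in added_vals:
                written.append(url)
            # Appended on every iteration, whether or not the row was written
            added_vals.append(url)
    return written


kept = original_filter(urls, values)
for url in kept:
    print(url)
```

Only the "/blog" URL is dropped here. The two paginated URLs are written on the first iteration, because "/blog" is not in them, before "nrpp" or "pagination" is ever checked.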

What you should do is something like:

for row in csv.reader(inp):
     # row is a list of columns, so test the URL in the first column
     if not any(v.lower() in row[0].lower() for v in values):
         writer.writerow(row)

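Also, a plain in test matches the substring anywhere in the URL, which can produce a lot of false matches, so this would be better: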

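import re

# Match either of the two query-string shapes shown in the question
rx = re.compile(r"\?(?:currentPage=\d+&Nrpp=\d+&No=\d+|pagination=\d+&currentPage=\d+)", re.IGNORECASE)

for row in csv.reader(inp):
     # row is a list of columns; search the URL in the first column
     if not rx.search(row[0]):
         writer.writerow(row)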
  

More on regular expressions: https://docs.python.org/3.7/library/re.html

Answer 1: (score: 0)

I'm not sure what your added_vals variable is for, but I think you are overcomplicating things.

It should be an easy fix:

import csv

values =  [
   "/blog",
   "nrpp",
   "pagination"
]

# Open input and output files
with open("internal_all_dup_facets.csv", "rt", encoding="utf-8") as inp, open("dupfacets.csv", "w", newline='') as out:
    writer = csv.writer(out)

    # Iterate through the rows in the file
    for row in csv.reader(inp):
        url = row[0].lower()

        # Iterate through the values, and see if one matches
        for value in values:
            # If we find a match, cancel the current `for` loop
            if value in url:
                break
        else:
            # This will only run if we finished the `for` loop without a `break`.
            # So, if we reached this code, no match was found
            writer.writerow(row)

With a regular expression, the code becomes more compact:

import csv
import re

rx = re.compile(r"^[^?]*/blog|[?&](currentPage|nrpp)=", re.IGNORECASE)

with open("internal_all_dup_facets.csv", "rt", encoding="utf-8") as inp, open("dupfacets.csv", "w", newline='') as out:
    writer = csv.writer(out)

    for row in csv.reader(inp):
        if not rx.search(row[0]):
            writer.writerow(row)
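As a quick check, the pattern above can be run against the sample URLs from the question (stray spaces removed). The "/blog/some-post" entry is a hypothetical URL added for illustration. Note that the pattern has no explicit pagination alternative; the paginated sample URLs are caught because they also carry currentPage.

```python
import re

rx = re.compile(r"^[^?]*/blog|[?&](currentPage|nrpp)=", re.IGNORECASE)

urls = [
    "/category/dresses-5699972/juna-rose/N-ihuZ20cbZc1y?currentPage=29&Nrpp=24&No=672",
    "/category/dresses-5699972/tall-dresses-204374/purple/N-ij9ZbyvZc1y",
    "/category/dresses-5699972/pencil-dresses-204531/short-sleeve/N-iisZ21b9Zc1y?pagination=1&currentPage=2",
    "/category/dresses-5699972/tan/N-ihuZbyyZc1y?currentPage=10&Nrpp=24&No=216",
    "/blog/some-post",  # hypothetical, not from the question
]

# Keep only the URLs the pattern does not match
kept = [url for url in urls if not rx.search(url)]
for url in kept:
    print(url)
```

Only the URL without a query string survives. A URL whose query string contained pagination= but neither currentPage= nor nrpp= would not be matched by this pattern, so it would be kept.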

An alternative version, closer to your original code:

import csv

values =  [
   "/blog",
   "nrpp",
   "pagination"
]

# Open input and output files
with open("internal_all_dup_facets.csv", "rt", encoding="utf-8") as inp, open("dupfacets.csv", "w", newline='') as out:
    writer = csv.writer(out)

    # Iterate through the rows in the file
    for row in csv.reader(inp):
        url = row[0].lower()

        # Iterate through the values, and see if one matches
        matches = False
        for value in values:
            if value in url:
                matches = True
                break

        # If none match, write to output csv
        if not matches:
            writer.writerow(row)
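If added_vals was meant to skip URLs that have already been written (this is only a guess at its purpose), a set can handle that alongside the substring filter. The sketch below makes that assumption; the function name filter_rows and the in-memory sample data are hypothetical, and io.StringIO stands in for the real files so the example is self-contained.

```python
import csv
import io

values = ["/blog", "nrpp", "pagination"]


def filter_rows(inp, out, values):
    """Copy rows whose first column has none of `values`, skipping repeated URLs."""
    writer = csv.writer(out)
    seen = set()
    for row in csv.reader(inp):
        if not row:
            continue  # skip blank lines
        url = row[0].lower()
        # Drop the row if it was already written or contains an unwanted substring
        if url in seen or any(value in url for value in values):
            continue
        seen.add(url)
        writer.writerow(row)


# Hypothetical sample data standing in for internal_all_dup_facets.csv
sample = io.StringIO(
    "/category/a/N-1?currentPage=2&Nrpp=24&No=24\n"
    "/category/b/N-2\n"
    "/category/b/N-2\n"
    "/blog/post\n"
)
result = io.StringIO()
filter_rows(sample, result, values)
print(result.getvalue())
```

Duplicates are compared on the lowercased URL here, which is a design choice rather than something the question specifies. With real files, the same function can be called with the two handles opened in the with statement above.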