Question

Is there a simple way to find and remove duplicate rows from a CSV file?

Sample test.csv file:

row1 test tyy......
row2 tesg ghh
row2 tesg ghh
row2 tesg ghh
....
row3 tesg ghh
row3 tesg ghh
...
row4 tesg ghh

Expected result:

row1 test tyy......
row2 tesg ghh
....
row3 tesg ghh
...
row4 tesg ghh

Where should I start to accomplish this in PHP?

Answer 1

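The straightforward approach is to read the file line by line and keep track of every line you have already seen. If the current line has been seen before, skip it.

The following (untested) code may work: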

<?php
// array to hold all "seen" lines
$lines = array();

// open the csv file
if (($handle = fopen("test.csv", "r")) !== false) {
    // read each line into an array
    while (($data = fgetcsv($handle, 8192, ",")) !== false) {
        // build a "line" key from the parsed data
        $line = join(",", $data);

        // if the line has been seen, skip it
        if (isset($lines[$line])) continue;

        // save the parsed row under its key
        $lines[$line] = $data;
    }
    fclose($handle);
}

// save the unique rows to a new file; fputcsv() re-quotes fields that need it
if (($out = fopen("test_unique.csv", "w")) !== false) {
    foreach ($lines as $data) fputcsv($out, $data);
    fclose($out);
}
?>

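This code uses fgetcsv() with a comma as the column separator (based on the sample data in the question's comments).

Storing every line seen so far, as above, guarantees that all duplicate lines in the file are removed, whether or not they directly follow one another. If duplicates are always adjacent, a simpler (and more memory-conscious) approach is to store only the last line seen and compare it with the current one.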
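The adjacent-duplicates variant described above could look roughly like the sketch below. The function name removeAdjacentDuplicates() is made up for illustration, and the sketch assumes each CSV record sits on a single physical line (no quoted fields containing line breaks), because it compares raw lines rather than parsed columns.

```php
<?php
// Hypothetical helper (not part of the original answer): copies $source to
// $target and drops every line that is identical to the line directly before it.
// Only the previous line is kept in memory.
function removeAdjacentDuplicates(string $source, string $target): int
{
    $in = fopen($source, "r");
    if ($in === false) {
        throw new RuntimeException("Could not open $source for reading.");
    }
    $out = fopen($target, "w");
    if ($out === false) {
        fclose($in);
        throw new RuntimeException("Could not open $target for writing.");
    }

    $previous = null; // last line written, without its line ending
    $written = 0;     // number of lines kept

    while (($line = fgets($in)) !== false) {
        // strip the line ending so a final line without one still compares equal
        $line = rtrim($line, "\r\n");

        // same as the line before it: skip
        if ($line === $previous) continue;

        fwrite($out, $line . "\n");
        $previous = $line;
        $written++;
    }

    fclose($in);
    fclose($out);
    return $written;
}
```

Calling removeAdjacentDuplicates("test.csv", "test_unique.csv") would then write the de-duplicated file and return the number of lines kept. Duplicates that are separated by other lines are not detected by this variant.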

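Update (duplicates determined by the SKU column rather than the full line)
According to the sample data provided in the comments, the "duplicate lines" are not actually equal (they are similar, but differ widely in their number of columns). What they have in common can be tied to a single column, sku.

Below is an extended version of the code above. This block parses the first line of the CSV file (the column list) to determine which column contains the sku code. From there, it keeps a list of unique SKU codes and, if the current line has a "new" code, writes that line to the new "unique" file using fputcsv():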

<?php
// array to hold all unique SKU codes
$skus = array();

// index of the `sku` column
$skuIndex = -1;

// open the "save-file"
if (($saveHandle = fopen("test_unique.csv", "w")) !== false) {
    // open the csv file
    if (($readHandle = fopen("test.csv", "r")) !== false) {
        // read each line into an array
        while (($data = fgetcsv($readHandle, 8192, ",")) !== false) {
            if ($skuIndex == -1) {
                // we need to determine what column the "sku" is; this will identify
                // the "unique" rows
                foreach ($data as $index => $column) {
                    if ($column == 'sku') {
                        $skuIndex = $index;
                        break;
                    }
                }
                if ($skuIndex == -1) {
                    echo "Couldn't determine the SKU-column.";
                    die();
                }
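                // write the header line to the file, then move on to the data
                // rows (without this, the header would be written a second time below)
                fputcsv($saveHandle, $data);
                continue;
            }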

            // if the sku has been seen, skip it
            if (isset($skus[$data[$skuIndex]])) continue;
            $skus[$data[$skuIndex]] = true;

            // write this line to the file
            fputcsv($saveHandle, $data);
        }
        fclose($readHandle);
    }
    fclose($saveHandle);
}
?>

Overall, this approach is quite memory-friendly, since it does not need to keep a copy of every line in memory (only the SKU codes).

Answer 2

One-line solution:

file_put_contents('newdata.csv', array_unique(file('data.csv')));
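
One thing to watch with the one-liner: file() keeps each line's ending by default, so a last line that has no trailing newline will not compare equal to an earlier, otherwise identical line. A slightly longer sketch that sidesteps this is shown below; the function name dedupeLines() is hypothetical, and like the one-liner it loads the whole file into memory and compares whole lines, not individual columns.

```php
<?php
// Hypothetical helper (not part of the original answer): whole-line
// de-duplication that ignores line endings and blank lines.
function dedupeLines(string $source, string $target): void
{
    // FILE_IGNORE_NEW_LINES drops the newline at the end of each element;
    // FILE_SKIP_EMPTY_LINES leaves out empty lines
    $lines = file($source, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
    if ($lines === false) {
        throw new RuntimeException("Could not read $source");
    }

    // array_unique() keeps the first occurrence of each value, in original order
    $unique = array_unique($lines);

    // re-join with a consistent line ending; write nothing for an empty input
    $contents = $unique ? implode(PHP_EOL, $unique) . PHP_EOL : '';
    file_put_contents($target, $contents);
}
```

With the file names from the one-liner, the call would be dedupeLines('data.csv', 'newdata.csv').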
