Question

I have two .csv files. The first line of file 1 is:

MPID,Title,Description,Model,Category ID,Category Description,Subcategory ID,Subcategory Description,Manufacturer ID,Manufacturer Description,URL,Manufacturer (Brand) URL,Image URL,AR Price,Price,Ship Price,Stock,Condition

and the first line of file 2 is:

Regular Price,Sale Price,Manufacturer Name,Model Number,Retailer Category,Buy URL,Product Name,Availability,Shipping Cost,Condition,MPID,Image URL,UPC,Description

The rest of each file is then filled with data.

As you can see, both files share a common field called MPID (file 1: col 1, file 2: col 11, where the first col is col 1).

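I would like to create a new file that combines these two files by looking at this column (that is, if an MPID appears in both files, then in the new file that MPID will appear with both its row from file 1 and its row from file 2). If an MPID appears in only one file, it should also go into the combined file.

The files are not sorted in any way.

How do I do this on a debian machine with either a shell script or python?

Thanks.

EDIT: Neither file contains commas other than the ones separating the fields.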

Answer 1

# index1 and index2 are the 1-based positions of the MPID column in each file
sort -t , -k index1,index1 file1 > sorted1
sort -t , -k index2,index2 file2 > sorted2
join -t , -1 index1 -2 index2 -a 1 -a 2 sorted1 sorted2

Answer 2

This is the classical "relational join" problem.

You have several algorithms to choose from.

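Nested loops. You read from one file to pick a "master" record. You then read the entire other file, locating all "detail" records that match the master. This is a bad idea.
Sort-merge. You sort each file into a temporary copy based on the common key. You then merge both files by reading a row from the master, then reading all matching rows from the detail and writing the merged records.
Lookup. You read one of the files entirely into a dictionary in memory, indexed by the key field. This can be tricky for the detail file, where each key may have multiple children. Then you read the other file and look up the matching records in the dictionary.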

Of these, sort-merge is often the fastest. The sorting step can be done entirely with the unix sort command.
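For readers who prefer to see the sort-merge idea in Python rather than with sort and join, here is a minimal sketch. It assumes the rows are dicts such as those produced by csv.DictReader and that the key column is MPID; the function name sort_merge_join is made up for illustration. It sorts in memory, so it only shows the merge logic and is not a replacement for an external sort on very large files.

```python
from itertools import groupby
from operator import itemgetter


def sort_merge_join(rows1, rows2, key="MPID"):
    """Full outer join of two lists of dicts on `key`.

    Yields (row1, row2) pairs; the side with no match is None.
    """
    get_key = itemgetter(key)
    # Sort both inputs on the key, then group rows that share a key value.
    groups1 = groupby(sorted(rows1, key=get_key), key=get_key)
    groups2 = groupby(sorted(rows2, key=get_key), key=get_key)
    k1, g1 = next(groups1, (None, None))
    k2, g2 = next(groups2, (None, None))
    while g1 is not None or g2 is not None:
        if g2 is None or (g1 is not None and k1 < k2):
            # Key exists only on the left side.
            for r1 in g1:
                yield r1, None
            k1, g1 = next(groups1, (None, None))
        elif g1 is None or k2 < k1:
            # Key exists only on the right side.
            for r2 in g2:
                yield None, r2
            k2, g2 = next(groups2, (None, None))
        else:
            # Same key on both sides: emit every combination.
            left, right = list(g1), list(g2)
            for r1 in left:
                for r2 in right:
                    yield r1, r2
            k1, g1 = next(groups1, (None, None))
            k2, g2 = next(groups2, (None, None))
```

Keys are compared as strings here, which is consistent as long as both inputs are sorted the same way.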

Lookup Implementation

import csv
import collections

# Index every row of the first file by its MPID value
index = collections.defaultdict(list)

with open("someFile", newline="") as file1:
    rdr = csv.DictReader(file1)
    for row in rdr:
        index[row['MPID']].append(row)

# Stream the second file and look up the matching rows from the first
with open("anotherFile", newline="") as file2:
    rdr = csv.DictReader(file2)
    for row in rdr:
        print(row, index[row['MPID']])
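The lookup code above only visits MPIDs that occur in the second file. Since the question also wants MPIDs that appear in just one file, a possible extension is sketched below. The function name outer_join_csv and the "file2 " prefix for clashing column names (both files have columns such as Condition and Description) are assumptions made for this example, not part of the original answer.

```python
import csv
from collections import defaultdict


def outer_join_csv(path1, path2, out_path, key="MPID"):
    """Write a combined CSV containing every key from either input file."""
    # Read file 1 completely into an in-memory index keyed on `key`.
    with open(path1, newline="") as f1:
        rdr1 = csv.DictReader(f1)
        fields1 = list(rdr1.fieldnames)
        index = defaultdict(list)
        for row in rdr1:
            index[row[key]].append(row)

    with open(path2, newline="") as f2, open(out_path, "w", newline="") as out:
        rdr2 = csv.DictReader(f2)
        fields2 = [f for f in rdr2.fieldnames if f != key]
        # Prefix file-2 columns whose names clash with file 1 (assumed convention).
        renamed = {f: ("file2 " + f if f in fields1 else f) for f in fields2}
        writer = csv.DictWriter(
            out, fieldnames=fields1 + list(renamed.values()), restval=""
        )
        writer.writeheader()

        seen = set()
        for row2 in rdr2:
            seen.add(row2[key])
            extra = {renamed[f]: row2[f] for f in fields2}
            # If file 1 has no row for this key, emit the file-2 data alone.
            matches = index.get(row2[key]) or [{key: row2[key]}]
            for row1 in matches:
                writer.writerow({**row1, **extra})

        # Finally emit rows from file 1 whose key never appeared in file 2.
        for k, rows in index.items():
            if k not in seen:
                writer.writerows(rows)
```

Columns that have no value for a given row are left empty in the output.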

Answer 3

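You need to look at the join command in the shell. You will also need to sort the data, and probably drop the first line. The whole process will fall flat if any of the data contains commas. Otherwise you will need to pass the data through a CSV-aware process that introduces a different field separator (perhaps control-A), which you can then use to split the fields unambiguously.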
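As an illustration of the "CSV-aware process" idea, here is a minimal Python sketch that rewrites a CSV stream with control-A (\x01) as the field separator. The function name redelimit is hypothetical, and the sketch simply refuses rows whose fields contain the new separator or a newline, since those could not be split unambiguously afterwards.

```python
import csv


def redelimit(src, dst, delimiter="\x01"):
    """Copy CSV rows from file object `src` to `dst`, joined by `delimiter`."""
    for row in csv.reader(src):
        # A field containing the new separator or a newline would break
        # line-oriented tools such as sort and join, so reject it.
        if any(delimiter in field or "\n" in field for field in row):
            raise ValueError("field cannot be represented: %r" % (row,))
        dst.write(delimiter.join(row) + "\n")
```

The output could then be fed to sort and join with control-A given as the -t separator.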

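The alternative, using Python, is to read the two files into a pair of dictionaries (keyed on the common column) and then loop over all the elements in the smaller of the two dictionaries, looking up the matching values in the other. (This is basic nested loop query processing.)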

Answer 4

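It seems you are trying to do in a shell script something that is commonly done with a SQL server. Is it possible to use SQL for this task? For example, you could import both files into mysql, create a join, and then export the result to CSV.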
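To give a rough idea of the SQL route without setting up a server, the sketch below swaps in SQLite (through Python's standard sqlite3 module) for mysql. The table names f1 and f2 and the helper names load_csv and outer_join are assumptions for this example. Because older SQLite versions lack FULL OUTER JOIN, the full join is emulated with two LEFT JOINs combined by UNION ALL.

```python
import csv
import sqlite3


def load_csv(conn, table, path):
    """Create `table` from the header of the CSV at `path` and load its rows."""
    with open(path, newline="") as f:
        rdr = csv.reader(f)
        header = next(rdr)
        # Quote column names, since headers may contain spaces or parentheses.
        cols = ", ".join('"%s"' % h.replace('"', '""') for h in header)
        marks = ", ".join("?" for _ in header)
        conn.execute(f'CREATE TABLE "{table}" ({cols})')
        conn.executemany(f'INSERT INTO "{table}" VALUES ({marks})', rdr)


def outer_join(conn):
    """Return all rows of f1 and f2 joined on MPID, unmatched sides as NULL."""
    sql = """
        SELECT f1.*, f2.* FROM f1 LEFT JOIN f2 ON f1.MPID = f2.MPID
        UNION ALL
        SELECT f1.*, f2.* FROM f2 LEFT JOIN f1 ON f1.MPID = f2.MPID
        WHERE f1.MPID IS NULL
    """
    return conn.execute(sql).fetchall()
```

A typical use would be to open sqlite3.connect(":memory:"), call load_csv once per file, and pass the result of outer_join to csv.writer(...).writerows to export it.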

Answer 5

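You could take a look at my FOSS project CSVfix, which is a stream editor for manipulating CSV files. It supports joins, among its other features, and requires no scripting to use.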

Answer 6

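For merging multiple files (even more than 2) based on one or more common columns, one of the best and most efficient approaches in python is to use "brewery". You can even specify which fields should be considered for merging and which fields should be kept.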

import brewery
from brewery import ds
import sys

sources = [
    {"file": "grants_2008.csv",
     "fields": ["receiver", "amount", "date"]},
    {"file": "grants_2009.csv",
     "fields": ["id", "receiver", "amount", "contract_number", "date"]},
    {"file": "grants_2010.csv",
     "fields": ["receiver", "subject", "requested_amount", "amount", "date"]}
]

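Create a list of all fields and add a file name field to store information about the origin of each data record. Go through the source definitions and collect the fields: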

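# Start with the extra "file" field that records where each row came from
all_fields = ["file"]

for source in sources:
    for field in source["fields"]:
        if field not in all_fields:
            all_fields.append(field)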
out = ds.CSVDataTarget("merged.csv")
out.fields = brewery.FieldList(all_fields)
out.initialize()

for source in sources:
    path = source["file"]

    # Initialize data source: skip reading of headers
    # use XLSDataSource for XLS files
    # We ignore the fields in the header, because we have set up fields
    # previously. We need to skip the header row.
    src = ds.CSVDataSource(path, read_header=False, skip_rows=1)
    src.fields = ds.FieldList(source["fields"])
    src.initialize()

    for record in src.records():
        # Add file reference into output - to know where the row comes from
        record["file"] = path
        out.append(record)

    # Close the source stream
    src.finalize()


cat merged.csv | brewery pipe pretty_printer

Combining 2 .csv files by a common column
