Question

I'm new to programming, so I'm confused about how to get unique combinations of data.

Here is my dataset:

Customer, Transaction, Date, Product, Cost
X,1,02/02,A,10.99
X,1,02/02,B,4.99
X,2,04/02,A,9.99
Y,4,10/02,C,0.99
Y,5,03/03,D,13.99
Z,7,03/04,D,13.99
Z,9,07/05,B,5.99
Z,9,07/05,A,11.99

I want output like this:

Product, CustomerCount, TotalRevenue
A,2,32.97
B,2,10.98
C,1,0.99
D,2,27.98

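One thing to note here is that the customer count is the number of unique customers who bought a given product.

I wrote the following code in MRJob: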

def mapper(self, _, file):
    customer, transaction, date, product, cost = file.split(',')
    yield [customer, product], 1


def reducer(self, key, values):
    yield key, sum(values)

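But the code above doesn't work for me. I'm confused about how to set up the relationship so that I get only the unique counts. Any help would be appreciated!

I'd like to do this in a purely Pythonic way!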

Answer 1

Staying as close as possible to the original approach (using map-reduce)

from functools import reduce
import csv

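def mapper(line):
  " Select the customer, product and cost fields from one input line"
  # The other two columns (transaction, date) are not needed for the stats
  customer, _, _, product, cost = line.rstrip().split(',')
  return customer, product, cost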

def reducer(acc, v):
  " Used by reduce to generate dictionary of products by customer and revenue"
  customer, product, cost = v
  if product not in acc:
    acc[product] = {}
  if customer not in acc[product]:
    acc[product][customer] = []
  acc[product][customer].append(cost)
  return acc

def create_row(results, product):
  " Generate stats row for a product "
  customers = results[product]  # maps each unique customer to a list of costs
  total = sum(float(cost) for costs in customers.values() for cost in costs)
  return {"Product": product,
          "CustomerCount": len(customers),
          "TotalRevenue": f"{total:.2f}"}  # two decimals hides float rounding noise

with open('test.csv') as ifile:
  next(ifile)  # skip input file header
  data = map(mapper, ifile)  # lazy, so it must be consumed while the file is open

  # Generate results as a nested dictionary
  results = reduce(reducer, data, {})

# List of products in alphabetical order
products = sorted(results.keys())

# Show results as CSV File
fieldnames = ["Product", "CustomerCount", "TotalRevenue"]
with open("results.csv", "w", newline="") as ofile:  # newline="" is recommended by the csv module
  writer = csv.DictWriter(ofile, fieldnames=fieldnames)
  writer.writeheader()
  rows = map(lambda product: create_row(results, product), products)
  writer.writerows(rows)

Test

Input file

Customer, Transaction, Date, Product, Cost
X,1,02/02,A,10.99
X,1,02/02,B,4.99
X,2,04/02,A,9.99
Y,4,10/02,C,0.99
Y,5,03/03,D,13.99
Z,7,03/04,D,13.99
Z,9,07/05,B,5.99
Z,9,07/05,A,11.99

Output file

Product,CustomerCount,TotalRevenue
A,2,32.97
B,2,10.98
C,1,0.99
D,2,27.98

Note: the results dictionary is

{'A': {'X': ['10.99', '9.99'], 'Z': ['11.99']},
 'B': {'X': ['4.99'], 'Z': ['5.99']},
 'C': {'Y': ['0.99']},
 'D': {'Y': ['13.99'], 'Z': ['13.99']}}
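
Both statistics can be read straight off this nested dictionary: the customer count is the number of customer keys under a product, and the revenue is the sum of every cost string converted to float. A minimal sketch, with the dictionary copied from the note above; the helper name `summarize` is only illustrative and not part of the original code:

```python
# Nested dictionary as shown in the note: product -> customer -> list of cost strings
results = {
    'A': {'X': ['10.99', '9.99'], 'Z': ['11.99']},
    'B': {'X': ['4.99'], 'Z': ['5.99']},
    'C': {'Y': ['0.99']},
    'D': {'Y': ['13.99'], 'Z': ['13.99']},
}

def summarize(results):
    """Return (product, unique customer count, total revenue) tuples (hypothetical helper)."""
    rows = []
    for product in sorted(results):
        customers = results[product]
        # Each customer appears once as a key, so len() is the unique count
        revenue = sum(float(c) for costs in customers.values() for c in costs)
        rows.append((product, len(customers), round(revenue, 2)))
    return rows

for product, count, revenue in summarize(results):
    print(f'{product},{count},{revenue:.2f}')
```

Customer X bought product A twice but still counts once, because the inner dictionary keys on the customer.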

Alternative approach using Python groupby

from itertools import groupby

with open('test.csv') as ifile, open('results.csv', 'w') as ofile:
  next(ifile) # skip input file header
  # Input Data as list of list
  data = [line.rstrip().split(',') for line in ifile]

  # Function key for sorting and grouping by product field in each sublist
  keyfunc = lambda x: x[3]  # product column

  # Inplace sort
  data.sort(key=keyfunc) # Sort by product

  # Write Header
  ofile.write('Product, CustomerCount, TotalRevenue' + '\n')  # Header

  # Process by grouping by product field
  for product, g in groupby(data, keyfunc):
    g = list(g)
    customers = set(x[0] for x in g)  # set of customers in current grouping
    total_revenue = sum(float(x[4]) for x in g) 

    ofile.write(f'{product},{len(customers)},{total_revenue:.2f}\n')
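
If sorting the whole input first is undesirable, the same grouping idea might also be done in a single pass by keeping a set of customers and a running total per product. This is a sketch of a different technique (`collections.defaultdict` instead of `groupby`), not something from the code above; the function name `product_stats` and the inlined `SAMPLE` data are assumptions made so the example runs on its own:

```python
import csv
import io
from collections import defaultdict

# Sample data from the question, inlined so the sketch is self-contained
SAMPLE = """Customer, Transaction, Date, Product, Cost
X,1,02/02,A,10.99
X,1,02/02,B,4.99
X,2,04/02,A,9.99
Y,4,10/02,C,0.99
Y,5,03/03,D,13.99
Z,7,03/04,D,13.99
Z,9,07/05,B,5.99
Z,9,07/05,A,11.99
"""

def product_stats(lines):
    """Return (product, unique customers, revenue) tuples (hypothetical helper)."""
    customers = defaultdict(set)    # product -> set of customer ids
    revenue = defaultdict(float)    # product -> running revenue total
    reader = csv.reader(lines)
    next(reader)                    # skip the header row
    for row in reader:
        if not row:                 # ignore blank lines
            continue
        customer, _, _, product, cost = (field.strip() for field in row)
        customers[product].add(customer)  # a set drops repeat customers
        revenue[product] += float(cost)
    return [(p, len(customers[p]), round(revenue[p], 2)) for p in sorted(customers)]

stats = product_stats(io.StringIO(SAMPLE))
for product, count, total in stats:
    print(f'{product},{count},{total:.2f}')
```

To run it against a real file, pass an open file object instead of the `io.StringIO` wrapper.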
