Question

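I'm trying to pivot my measure column so that its values become fields.

That is, net_revenue and vic should each become their own field.

In the image below, the input is on the left and the desired output is on the right:

I know measure has duplicate keys (for example, net_revenue appears more than once), but the date_budget I'm indexing on is different within that chunk of data. date_budget does repeat, but only when measure changes, so the index columns never produce a truly duplicate row.

Question: in a Pentaho CPython script, when I view the output within the script, I only get my index columns back, not the pivoted columns net_revenue and vic. Why is that?

Script: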
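The claim that the index columns plus measure never repeat can be checked directly before pivoting. The following is a minimal sketch, assuming a small stand-in for the budget frame (the rows are illustrative, and the names key_cols, dupes and has_duplicates are not from the original script):

```python
import pandas as pd

# Small stand-in for the long-format budget data (values are illustrative).
budget = pd.DataFrame({
    'country': ['us', 'us', 'us', 'us'],
    'customer': ['customer1', 'customer1', 'customer1', 'customer1'],
    'measure': ['net_revenue', 'net_revenue', 'vic', 'vic'],
    'date_budget': ['1/1/2018', '2/1/2018', '1/1/2018', '2/1/2018'],
    'monthly_budget_phasing': ['$55', '$23', '$29', '$35'],
})

# The pivot is unambiguous only if these four columns identify a row uniquely.
key_cols = ['country', 'customer', 'date_budget', 'measure']

# keep=False marks every row that shares its key with another row.
dupes = budget[budget.duplicated(subset=key_cols, keep=False)]
has_duplicates = not dupes.empty
print(has_duplicates)
```

If has_duplicates were true, the pivot would have to aggregate several values into one cell, and the choice of aggfunc would start to matter.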

import pandas as pd

budget['monthly_budget_phasing'] = pd.to_numeric(budget['monthly_budget_phasing'], errors='coerce')

# Perform the pivot.
budget = pd.pivot_table(budget,
    values='monthly_budget_phasing',
    index=['country', 'customer', 'date_budget'],
    columns='measure'
    )

budget.reset_index(inplace=True)

result_df = budget

Sample DataFrame:

d = {
    'country': ['us', 'us', 'us', 'us', 'us', 'us', 'us', 'us', 'us', 'us', 'us', 'us'],
    'customer': ['customer1', 'customer1', 'customer1', 'customer1', 'customer1', 'customer1', 'customer2', 'customer2', 'customer2', 'customer2', 'customer2', 'customer2',],
    'measure': ['net_revenue', 'net_revenue', 'net_revenue', 'vic', 'vic', 'vic', 'net_revenue', 'net_revenue', 'net_revenue', 'vic', 'vic', 'vic'],
    'date_budget': ['1/1/2018', '2/1/2018', '3/1/2018', '1/1/2018', '2/1/2018', '3/1/2018', '1/1/2018', '2/1/2018', '3/1/2018', '1/1/2018', '2/1/2018', '3/1/2018'],
    'monthly_budget_phasing': ['$55', '$23', '$42', '$29', '$35', '$98', '$87', '$77', '$34', '$90', '$75', '$12']
    }
df = pd.DataFrame(data=d)

This works in pandas with aggfunc='first', but it doesn't work in Pentaho. Pentaho is still outputting country, customer, measure.

Terminal output from pandas:

   country   customer date_budget      measure monthly_budget_phasing
0       us  customer1    1/1/2018  net_revenue                    $55
1       us  customer1    2/1/2018  net_revenue                    $23
2       us  customer1    3/1/2018  net_revenue                    $42
3       us  customer1    1/1/2018          vic                    $29
4       us  customer1    2/1/2018          vic                    $35
5       us  customer1    3/1/2018          vic                    $98
6       us  customer2    1/1/2018  net_revenue                    $87
7       us  customer2    2/1/2018  net_revenue                    $77
8       us  customer2    3/1/2018  net_revenue                    $34
9       us  customer2    1/1/2018          vic                    $90
10      us  customer2    2/1/2018          vic                    $75
11      us  customer2    3/1/2018          vic                    $12
measure country   customer date_budget net_revenue  vic
0            us  customer1    1/1/2018         $55  $29
1            us  customer1    2/1/2018         $23  $35
2            us  customer1    3/1/2018         $42  $98
3            us  customer2    1/1/2018         $87  $90
4            us  customer2    2/1/2018         $77  $75
5            us  customer2    3/1/2018         $34  $12

Even though the Python above works, the Pentaho 8.0 CPython plugin still has problems.

First I melt the dates:

Then I unstack the measure:

Where are my net_revenue and vic fields?

Answer 1

It seems you need to add replace:

# Strip the dollar sign first; otherwise errors='coerce' turns every value into NaN.
budget['monthly_budget_phasing'] = pd.to_numeric(budget['monthly_budget_phasing'].replace(r'\$', '', regex=True), errors='coerce')
# Alternative:
# budget['monthly_budget_phasing'] = budget['monthly_budget_phasing'].replace(r'\$', '', regex=True).astype(int)


df = pd.pivot_table(budget,
    values='monthly_budget_phasing',
    index=['country', 'customer', 'date_budget'],
    columns='measure',
    aggfunc='first'

    ).reset_index()
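
To see why the replace matters, the conversion can be tried on its own. This is a minimal sketch on a three-value series (raw, coerced and cleaned are illustrative names); it only shows the to_numeric behaviour, and whether an all-NaN value column is what makes the pivoted columns disappear inside Pentaho is an assumption the thread does not confirm:

```python
import pandas as pd

raw = pd.Series(['$55', '$23', '$42'])

# With the dollar sign still present nothing parses, so every value becomes NaN.
coerced = pd.to_numeric(raw, errors='coerce')

# Removing the literal '$' first (regex=False avoids escaping) lets the values parse.
cleaned = pd.to_numeric(raw.str.replace('$', '', regex=False), errors='coerce')

print(coerced.isna().all())
print(cleaned.tolist())
```

Because errors='coerce' never raises, a quick isna() check after the conversion is a cheap way to notice that the whole column was lost.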

Alternative:

cols = ['country', 'customer', 'date_budget', 'measure']
# If there are duplicates, remove them first.
df = budget.drop_duplicates(cols)
# Pivot with unstack.
df = df.set_index(cols)['monthly_budget_phasing'].unstack().reset_index()

print(df)
measure country   customer date_budget  net_revenue  vic
0            us  customer1    1/1/2018           55   29
1            us  customer1    2/1/2018           23   35
2            us  customer1    3/1/2018           42   98
3            us  customer2    1/1/2018           87   90
4            us  customer2    2/1/2018           77   75
5            us  customer2    3/1/2018           34   12
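
The printed header above still shows measure as the name of the column axis. The sketch below, on a two-row stand-in frame, clears that name so the result has plain column labels. Whether the leftover axis name has any effect on the CPython plugin is not established in this thread, so treat this as optional tidying:

```python
import pandas as pd

# Two-row stand-in with the dollar signs already removed (illustrative values).
budget = pd.DataFrame({
    'country': ['us', 'us'],
    'customer': ['customer1', 'customer1'],
    'measure': ['net_revenue', 'vic'],
    'date_budget': ['1/1/2018', '1/1/2018'],
    'monthly_budget_phasing': [55, 29],
})

cols = ['country', 'customer', 'date_budget', 'measure']
wide = budget.set_index(cols)['monthly_budget_phasing'].unstack().reset_index()

# The column axis still carries the name 'measure'; clear it for plain labels.
wide.columns.name = None
print(list(wide.columns))
```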

Answer 2

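Kettle needs to know which columns each step produces before the transformation runs, which is why I don't think this can be done with Python (select * queries are something of an exception, but they too fetch their metadata before the transformation runs). The usual way to pivot in Kettle is the Row denormaliser step. That step requires you to specify the column names for the unpivoted values, but if you can't hard-code them you can pass them in based on your data through the ETL Metadata Injection step.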
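
The idea of declaring the output columns up front can be illustrated outside Kettle. The following is a plain-Python analogy, not the Row denormaliser's actual implementation; denormalise and its parameters are hypothetical names chosen for this sketch:

```python
def denormalise(rows, group_fields, key_field, value_field, target_fields):
    """Pivot long rows into wide rows with a fixed, pre-declared set of columns."""
    grouped = {}
    for row in rows:
        group_key = tuple(row[f] for f in group_fields)
        # Every output row starts with all declared target columns present.
        out = grouped.setdefault(
            group_key,
            {**dict(zip(group_fields, group_key)), **{t: None for t in target_fields}},
        )
        # Values whose key was not declared are ignored rather than adding columns.
        if row[key_field] in target_fields:
            out[row[key_field]] = row[value_field]
    return list(grouped.values())


rows = [
    {'country': 'us', 'customer': 'customer1', 'date_budget': '1/1/2018',
     'measure': 'net_revenue', 'monthly_budget_phasing': 55},
    {'country': 'us', 'customer': 'customer1', 'date_budget': '1/1/2018',
     'measure': 'vic', 'monthly_budget_phasing': 29},
]

wide_rows = denormalise(
    rows,
    group_fields=['country', 'customer', 'date_budget'],
    key_field='measure',
    value_field='monthly_budget_phasing',
    target_fields=['net_revenue', 'vic'],  # hard-coded, known before any row is read
)
print(wide_rows)
```

The output schema here depends only on group_fields and target_fields, never on the data, which is the property the answer says Kettle relies on.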

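To pass the values dynamically, create two transformations. The child transformation takes its input data from the parent transformation and performs the pivot with the Row denormaliser. The parent transformation reads the input data, gets the unique values that are to become column names, and passes them to the ETL Metadata Injection step. The injection step populates the Row denormaliser's metadata with those column names and executes the child transformation, feeding it the input data.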
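
The two-pass flow described above can be sketched in plain Python. This is only an analogy under stated assumptions: unique_in_order stands in for the parent's "get unique values" stage and child_pivot for the child transformation; neither name comes from Kettle, and the sample rows are illustrative:

```python
def unique_in_order(rows, field):
    """Parent-side pass: collect the distinct values that will become column names."""
    seen = []
    for row in rows:
        if row[field] not in seen:
            seen.append(row[field])
    return seen


def child_pivot(rows, group_fields, key_field, value_field, target_fields):
    """Child-side pass: pivot using the column names handed in by the parent."""
    out = {}
    for row in rows:
        key = tuple(row[f] for f in group_fields)
        wide = out.setdefault(key, dict(zip(group_fields, key)))
        # Make sure every injected column exists, even if this group lacks a value.
        for t in target_fields:
            wide.setdefault(t, None)
        if row[key_field] in target_fields:
            wide[row[key_field]] = row[value_field]
    return list(out.values())


rows = [
    {'country': 'us', 'customer': 'customer1', 'date_budget': '1/1/2018',
     'measure': 'net_revenue', 'monthly_budget_phasing': 55},
    {'country': 'us', 'customer': 'customer1', 'date_budget': '1/1/2018',
     'measure': 'vic', 'monthly_budget_phasing': 29},
    {'country': 'us', 'customer': 'customer2', 'date_budget': '1/1/2018',
     'measure': 'net_revenue', 'monthly_budget_phasing': 87},
]

# Pass 1 (parent): derive the column names from the data.
target_fields = unique_in_order(rows, 'measure')

# Pass 2 (child): run the pivot with those names fixed before any row is processed.
wide_rows = child_pivot(
    rows, ['country', 'customer', 'date_budget'],
    'measure', 'monthly_budget_phasing', target_fields,
)
print(target_fields)
print(wide_rows)
```

Splitting the work this way means the second pass always knows its full column list in advance, even though the list itself came from the data.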

Pivot table returns only the index columns and omits the pivoted columns
