Replace column values based on some conditions

Time: 2020-08-13 11:15:30

Tags: sql apache-spark pyspark apache-spark-sql

Input:

item   loc   month    year    qty_name      qty_value
a       x     8        2020    chocolate      10
a       x     8        2020    gum            15
a       x     8        2020    maggi          11
a       x     8        2020    colgate        18
b       y     8        2020    chocolate      20
b       y     8        2020    gum            30
b       y     8        2020    maggi          40
b       y     8        2020    colgate        9
c       s     8        2020    gum            15
c       s     8        2020    maggi          11
c       s     8        2020    colgate        18

Expected output:

item   loc   month    year    qty_name      qty_value
a       x     8        2020    chocolate      10
a       x     8        2020    gum            15
a       x     8        2020    maggi          0
a       x     8        2020    colgate        0
b       y     8        2020    chocolate      20
b       y     8        2020    gum            30
b       y     8        2020    maggi          0
b       y     8        2020    colgate        0
c       s     8        2020    gum            15
c       s     8        2020    maggi          11
c       s     8        2020    colgate        18

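Explanation:

For each item, loc, month, year combination:

If chocolate > 0, then every value except chocolate and gum becomes 0 (this happens for items a and b).

If chocolate is not present, the values stay unchanged (this is the case for item = c and loc = s).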
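The rule can be sketched in plain Python before turning to SQL or Spark. This is only an illustration under the assumption that rows are tuples in the column order shown above; `apply_rule`, `ROWS`, and `KEEP` are hypothetical names, not part of the question.

```python
# Sample rows from the question: (item, loc, month, year, qty_name, qty_value)
ROWS = [
    ("a", "x", 8, 2020, "chocolate", 10),
    ("a", "x", 8, 2020, "gum", 15),
    ("a", "x", 8, 2020, "maggi", 11),
    ("a", "x", 8, 2020, "colgate", 18),
    ("b", "y", 8, 2020, "chocolate", 20),
    ("b", "y", 8, 2020, "gum", 30),
    ("b", "y", 8, 2020, "maggi", 40),
    ("b", "y", 8, 2020, "colgate", 9),
    ("c", "s", 8, 2020, "gum", 15),
    ("c", "s", 8, 2020, "maggi", 11),
    ("c", "s", 8, 2020, "colgate", 18),
]

KEEP = {"chocolate", "gum"}  # names whose values are never zeroed


def apply_rule(rows):
    """Zero out non-chocolate/gum values in groups that have chocolate > 0."""
    # First pass: collect the (item, loc, month, year) groups with positive chocolate.
    flagged = {row[:4] for row in rows if row[4] == "chocolate" and row[5] > 0}
    # Second pass: rewrite qty_value only for flagged groups and non-kept names.
    return [
        row[:5] + (0,) if row[:4] in flagged and row[4] not in KEEP else row
        for row in rows
    ]


result = apply_rule(ROWS)
for row in result:
    print(row)
```

Groups a and b are flagged because they contain a chocolate row with a positive quantity; group c has no chocolate row, so its values pass through unchanged.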

3 answers:

Answer 0: (score: 0)

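If you are using MySQL 8 or later, you can use window functions. Here COUNT(...) OVER (...) counts the chocolate rows into an extra column, so every row of a combination carries the same count. The outer query can then check that count.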

SELECT ITEM,
       LOC,
       MONTH,
       YEAR,
       QTY_NAME,
       -- Zero everything except chocolate and gum when the combination has chocolate > 0
       CASE
          WHEN QTY_NAME NOT IN ('chocolate', 'gum') AND CNT > 0 THEN 0
          ELSE QTY_VALUE
       END
          AS QTY_VALUE
  FROM (  SELECT ITEM,
                 LOC,
                 MONTH,
                 YEAR,
                 QTY_NAME,
                 QTY_VALUE,
                 -- Count chocolate rows with a positive quantity per combination
                 COUNT(CASE WHEN QTY_NAME = 'chocolate' AND QTY_VALUE > 0 THEN 1 END)
                    OVER (PARTITION BY ITEM, LOC, MONTH, YEAR)
                    AS CNT
            FROM TEST_TABLE) T
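To try the window-function idea without a MySQL server, here is a minimal sketch that runs a partitioned COUNT(...) OVER (...) against an in-memory SQLite database through Python's standard `sqlite3` module. SQLite is only a stand-in (window functions need SQLite 3.25 or newer), and the derived-table alias `flagged` is an assumption of this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE test_table "
    "(item TEXT, loc TEXT, month INT, year INT, qty_name TEXT, qty_value INT)"
)
conn.executemany(
    "INSERT INTO test_table VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("a", "x", 8, 2020, "chocolate", 10),
        ("a", "x", 8, 2020, "gum", 15),
        ("a", "x", 8, 2020, "maggi", 11),
        ("a", "x", 8, 2020, "colgate", 18),
        ("b", "y", 8, 2020, "chocolate", 20),
        ("b", "y", 8, 2020, "gum", 30),
        ("b", "y", 8, 2020, "maggi", 40),
        ("b", "y", 8, 2020, "colgate", 9),
        ("c", "s", 8, 2020, "gum", 15),
        ("c", "s", 8, 2020, "maggi", 11),
        ("c", "s", 8, 2020, "colgate", 18),
    ],
)

# The inner query tags each row with the number of positive chocolate rows in
# its (item, loc, month, year) partition; the outer query applies the rule.
QUERY = """
SELECT item, loc, month, year, qty_name,
       CASE
          WHEN qty_name NOT IN ('chocolate', 'gum') AND cnt > 0 THEN 0
          ELSE qty_value
       END AS qty_value
  FROM (SELECT t.*,
               COUNT(CASE WHEN qty_name = 'chocolate' AND qty_value > 0 THEN 1 END)
                  OVER (PARTITION BY item, loc, month, year) AS cnt
          FROM test_table t) AS flagged
 ORDER BY item, qty_name
"""

rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
```

Partitioning by the four key columns matters: an empty OVER () would count chocolate rows across the whole table, so group c would be zeroed as well.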

Answer 1: (score: 0)

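The solution below assumes that a given item, loc, month, year combination has at most one 'chocolate' record, as in the sample data. Under that assumption there is no need to aggregate per combination.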

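Simply set the quantity to zero for every record that is not 'chocolate' or 'gum' and for which the same combination has a 'chocolate' record with a quantity greater than 0.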

Sample data

create table quantities
(
  item nvarchar(1),
  loc nvarchar(1),
  month int,
  year int,
  qty_name nvarchar(10),
  qty_value int
);

insert into quantities (item, loc, month, year, qty_name, qty_value) values
('a', 'x', 8, 2020, 'chocolate', 10),
('a', 'x', 8, 2020, 'gum'      , 15),
('a', 'x', 8, 2020, 'maggi'    , 11),
('a', 'x', 8, 2020, 'colgate'  , 18),
('b', 'y', 8, 2020, 'chocolate', 20),
('b', 'y', 8, 2020, 'gum'      , 30),
('b', 'y', 8, 2020, 'maggi'    , 40),
('b', 'y', 8, 2020, 'colgate'  , 9),
('c', 's', 8, 2020, 'gum'      , 15),
('c', 's', 8, 2020, 'maggi'    , 11),
('c', 's', 8, 2020, 'colgate'  , 18);

Solution

update quantities q
join quantities q2
  on  q2.item = q.item
  and q2.loc = q.loc
  and q2.month = q.month
  and q2.year = q.year
  and q2.qty_name = 'chocolate'
  and q2.qty_value > 0
set q.qty_value = 0
where q.qty_name not in ('chocolate', 'gum');

Result

select * from quantities;

item    loc month   year    qty_name    qty_value
------- --- ------- ------- ----------- ----------
a       x   8       2020    chocolate   10
a       x   8       2020    gum         15
a       x   8       2020    maggi       0
a       x   8       2020    colgate     0
b       y   8       2020    chocolate   20
b       y   8       2020    gum         30
b       y   8       2020    maggi       0
b       y   8       2020    colgate     0
c       s   8       2020    gum         15
c       s   8       2020    maggi       11
c       s   8       2020    colgate     18

SQL Fiddle

EDIT: This is a MySQL solution, because the question was previously tagged with it. I do not have an Apache Spark SQL engine at hand to verify this solution.
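Because the multi-table UPDATE ... JOIN form is MySQL syntax, here is a hedged sketch of the same update written with a correlated EXISTS subquery, run against an in-memory SQLite database through Python's `sqlite3` module. SQLite is only a stand-in for engines without the join form; this variant is an assumption of the sketch, not part of the answer, and MySQL itself typically rejects a subquery on the table being updated.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE quantities "
    "(item TEXT, loc TEXT, month INT, year INT, qty_name TEXT, qty_value INT)"
)
conn.executemany(
    "INSERT INTO quantities VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("a", "x", 8, 2020, "chocolate", 10),
        ("a", "x", 8, 2020, "gum", 15),
        ("a", "x", 8, 2020, "maggi", 11),
        ("a", "x", 8, 2020, "colgate", 18),
        ("b", "y", 8, 2020, "chocolate", 20),
        ("b", "y", 8, 2020, "gum", 30),
        ("b", "y", 8, 2020, "maggi", 40),
        ("b", "y", 8, 2020, "colgate", 9),
        ("c", "s", 8, 2020, "gum", 15),
        ("c", "s", 8, 2020, "maggi", 11),
        ("c", "s", 8, 2020, "colgate", 18),
    ],
)

# Zero a row only if it is not chocolate/gum and its combination contains
# a chocolate row with a positive quantity.
conn.execute(
    """
    UPDATE quantities
       SET qty_value = 0
     WHERE qty_name NOT IN ('chocolate', 'gum')
       AND EXISTS (SELECT 1
                     FROM quantities q2
                    WHERE q2.item = quantities.item
                      AND q2.loc = quantities.loc
                      AND q2.month = quantities.month
                      AND q2.year = quantities.year
                      AND q2.qty_name = 'chocolate'
                      AND q2.qty_value > 0)
    """
)

updated = {
    (item, name): value
    for item, name, value in conn.execute(
        "SELECT item, qty_name, qty_value FROM quantities"
    )
}
print(updated)
```

The chocolate rows are never modified by the statement, so the EXISTS check sees the same data regardless of the order in which rows are updated.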

Answer 2: (score: 0)

Here is the PySpark way.

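import pyspark.sql.functions as f

# Combinations that have a chocolate row with a positive quantity, tagged with a marker.
# distinct() guards against duplicated rows if a combination has several chocolate records.
df2 = df.filter("qty_name = 'chocolate' and qty_value > 0") \
  .select('item', 'loc', 'month', 'year').distinct() \
  .withColumn('marker', f.lit('Y'))

# The left join keeps every row; in marked combinations, values other than chocolate and gum become 0.
df.join(df2, ['item', 'loc', 'month', 'year'], 'left') \
  .withColumn('qty_value', f.when(f.expr("marker = 'Y' and qty_name not in ('chocolate', 'gum')"), 0).otherwise(f.col('qty_value'))) \
  .drop('marker').show(12, False)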

+----+---+-----+----+---------+---------+
|item|loc|month|year|qty_name |qty_value|
+----+---+-----+----+---------+---------+
|a   |x  |8    |2020|chocolate|10       |
|a   |x  |8    |2020|gum      |15       |
|a   |x  |8    |2020|maggi    |0        |
|a   |x  |8    |2020|colgate  |0        |
|b   |y  |8    |2020|chocolate|20       |
|b   |y  |8    |2020|gum      |30       |
|b   |y  |8    |2020|maggi    |0        |
|b   |y  |8    |2020|colgate  |0        |
|c   |s  |8    |2020|gum      |15       |
|c   |s  |8    |2020|maggi    |11       |
|c   |s  |8    |2020|colgate  |18       |
+----+---+-----+----+---------+---------+