Get the most recent column value with nested and repeated fields

Time: 2016-02-02 10:21:58

Tags: sql google-bigquery

My table has the following structure:
[Image: table schema]

It contains the following data:

[
  {
    "addresses": [
      {
        "city": "New York"
      },
      {
        "city": "San Francisco"
      }
    ],
    "age": "26.0",
    "name": "Foo Bar",
    "createdAt": "2016-02-01 15:54:25 UTC"
  },
  {
    "addresses": [
      {
        "city": "New York"
      },
      {
        "city": "San Francisco"
      }
    ],
    "age": "26.0",
    "name": "Foo Bar",
    "createdAt": "2016-02-01 15:54:16 UTC"
  }
]

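What I want to do is recreate the same table (same structure), but containing only the latest version of each row. In this example, say I want to group everything by name and take the row with the most recent createdAt. I tried something like "Google Big Query SQL - Get Most Recent Column Value", but I could not get it to work with record and repeated fields.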

1 answer:

Answer 0: (score: 2)

I really hoped someone from the Google team would answer this question, as it is a very frequent topic here on SO. BigQuery is definitely not friendly enough when it comes to writing nested/repeated data back to BigQuery from a query result.

So I will provide the workaround I found a relatively long time ago. I DO NOT like it, but it works (which is why I hoped for an answer from the Google team). I hope you will be able to adapt it to your particular scenario.

So, based on your example, assume you have the table below:

[Image: example input table]

and you expect to get the most recent records based on the createdAt column, so the result will look like:

[Image: expected result]

The code below does this:

SELECT name, age, createdAt, addresses.city
FROM JS( 
  ( // input table 
    SELECT name, age, createdAt, NEST(city) AS addresses 
    FROM (
      SELECT name, age, createdAt, addresses.city 
      FROM (
        SELECT 
          name, age, createdAt, addresses.city, 
          MAX(createdAt) OVER(PARTITION BY name, age) AS lastAt
        FROM yourTable
      )
      WHERE createdAt = lastAt
    )
    GROUP BY name, age, createdAt
  ), 
  name, age, createdAt, addresses, // input columns 
  "[ // output schema 
    {'name': 'name', 'type': 'STRING'},
    {'name': 'age', 'type': 'INTEGER'},
    {'name': 'createdAt', 'type': 'INTEGER'},
    {'name': 'addresses', 'type': 'RECORD',
     'mode': 'REPEATED',
     'fields': [
       {'name': 'city', 'type': 'STRING'}
       ]    
     }
  ]", 
  "function(row, emit) { // function 
    var c = []; 
    for (var i = 0; i < row.addresses.length; i++) { 
      c.push({city:row.addresses[i]});
    }; 
    emit({name: row.name, age: row.age, createdAt: row.createdAt, addresses: c}); 
  }"
) 

The above code works as follows: it implicitly flattens the original records; finds the rows that belong to the most recent records (partitioned by name and age); and assembles those rows back into their respective records. The final step is processing with a JS UDF to build a proper schema, so that the result can actually be written back to a BigQuery table as nested/repeated rather than flattened.
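
To make those stages easier to follow, here is a rough sketch of the same flatten / pick-latest / re-nest logic in plain JavaScript, run outside BigQuery. It is only an illustration of the idea, not what BigQuery executes: `latestPerName` is a hypothetical helper name, the sample data is copied from the question, and it assumes every record has at least one address and that all createdAt strings share one format, so that string comparison orders them correctly.

```javascript
// Plain-JavaScript illustration of the query stages (not BigQuery code).
// Sample data mirrors the question; `latestPerName` is a hypothetical helper.
const records = [
  { name: "Foo Bar", age: "26.0", createdAt: "2016-02-01 15:54:25 UTC",
    addresses: [{ city: "New York" }, { city: "San Francisco" }] },
  { name: "Foo Bar", age: "26.0", createdAt: "2016-02-01 15:54:16 UTC",
    addresses: [{ city: "New York" }, { city: "San Francisco" }] },
];

function latestPerName(rows) {
  // 1. Flatten: one row per (record, address) pair.
  const flat = rows.flatMap(r =>
    r.addresses.map(a => ({
      name: r.name, age: r.age, createdAt: r.createdAt, city: a.city,
    })));

  // 2. Equivalent of MAX(createdAt) OVER(PARTITION BY name, age),
  //    followed by WHERE createdAt = lastAt.
  //    Assumption: identically formatted timestamps compare correctly as strings.
  const partitionKey = f => JSON.stringify([f.name, f.age]);
  const lastAt = new Map();
  for (const f of flat) {
    const key = partitionKey(f);
    if (!lastAt.has(key) || f.createdAt > lastAt.get(key)) {
      lastAt.set(key, f.createdAt);
    }
  }
  const newest = flat.filter(f => f.createdAt === lastAt.get(partitionKey(f)));

  // 3. Re-nest: group by (name, age, createdAt) and collect the cities.
  const grouped = new Map();
  for (const f of newest) {
    const key = JSON.stringify([f.name, f.age, f.createdAt]);
    if (!grouped.has(key)) {
      grouped.set(key, {
        name: f.name, age: f.age, createdAt: f.createdAt, addresses: [],
      });
    }
    grouped.get(key).addresses.push({ city: f.city });
  }
  return [...grouped.values()];
}

console.log(JSON.stringify(latestPerName(records), null, 2));
```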

The last step is the most annoying part of this workaround, as it needs to be customized each time for the specific schema(s).
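
Since that function has to be rewritten for every schema, it may help to try it outside BigQuery first with a stub `emit`. The sketch below reuses the function body from the query above; the name `rebuildAddresses`, the sample row, and the stub are assumptions made for illustration. In particular, it assumes NEST(city) delivers `addresses` as an array of plain strings, and the `age` and `createdAt` values are placeholders that the function simply passes through.

```javascript
// Same body as the UDF in the query above, given a (hypothetical) name
// so it can be called locally.
function rebuildAddresses(row, emit) {
  var c = [];
  for (var i = 0; i < row.addresses.length; i++) {
    // Wrap each plain city string in a {city: ...} record.
    c.push({ city: row.addresses[i] });
  }
  emit({ name: row.name, age: row.age, createdAt: row.createdAt, addresses: c });
}

// Hypothetical input row: how BigQuery actually represents age and createdAt
// inside the UDF is not shown in the source; these are placeholder values.
var sampleRow = {
  name: "Foo Bar",
  age: "26.0",
  createdAt: "2016-02-01 15:54:25 UTC",
  addresses: ["New York", "San Francisco"],
};

// Stub emit that just collects whatever the function outputs.
var emitted = [];
rebuildAddresses(sampleRow, function (out) { emitted.push(out); });

console.log(JSON.stringify(emitted[0]));
```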

Please note that in this example there is only one nested field inside the addresses record, so the NEST() function worked. In scenarios where you have more than one field inside, the above approach still works, but you need to concatenate those fields so they can go inside NEST(), and then do the extra work of splitting them apart again inside the JS function.
You can see examples in the answers below:
Create a table with Record type column
create a table with a column type RECORD
How to store the result of query on the current table without changing the table schema?
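
As a rough idea of what the concatenate-then-split variant might look like on the JS side, the sketch below assumes a second, purely hypothetical field `state` next to `city`, and assumes the SQL side packed both into one string per address with something like NEST(CONCAT(city, '|', state)). The `|` separator is also an assumption and must not occur in the data. The output schema string in the query would need a matching `state` entry as well.

```javascript
// Hypothetical separator; must match the one used in CONCAT() on the SQL side.
var SEP = "|";

// Variant of the UDF for an addresses record with two fields.
// `state` is a made-up field used only for illustration.
function splitAddresses(row, emit) {
  var c = [];
  for (var i = 0; i < row.addresses.length; i++) {
    // Each element is assumed to look like "city|state".
    var parts = row.addresses[i].split(SEP);
    c.push({ city: parts[0], state: parts[1] });
  }
  emit({ name: row.name, age: row.age, createdAt: row.createdAt, addresses: c });
}

// Local trial run with a stub emit and a hypothetical packed row.
var packedRow = {
  name: "Foo Bar",
  age: "26.0",
  createdAt: "2016-02-01 15:54:25 UTC",
  addresses: ["New York|NY", "San Francisco|CA"],
};
var unpacked = [];
splitAddresses(packedRow, function (out) { unpacked.push(out); });

console.log(JSON.stringify(unpacked[0].addresses));
```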

I hope this is a good foundation for you to experiment with and make your case work!