How to extract a repeated nested field from a JSON string and join it with an existing repeated nested field in BigQuery

Date: 2018-01-13 18:30:38

Tags: google-bigquery standard-sql

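I have a table with an article_id field, a nested repeated field called author_names, and a string field, extra_informations, that contains a JSON string.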

This is the schema of my table:

Here is a sample row from the table:

[
  {
"article_id": "2732930586",
"author_names": [
  {
    "AuN": "h kanahashi",
    "AuId": "2591665239",
    "AfN": null,
    "AfId": null,
    "S": "1"
  },
  {
    "AuN": "t mukai",
    "AuId": "2607493793",
    "AfN": null,
    "AfId": null,
    "S": "2"
  },
  {
    "AuN": "y yamada",
    "AuId": "2606624579",
    "AfN": null,
    "AfId": null,
    "S": "3"
  },
  {
    "AuN": "k shimojima",
    "AuId": "2606600298",
    "AfN": null,
    "AfId": null,
    "S": "4"
  },
  {
    "AuN": "m mabuchi",
    "AuId": "2606138976",
    "AfN": null,
    "AfId": null,
    "S": "5"
  },
  {
    "AuN": "t aizawa",
    "AuId": "2723380540",
    "AfN": null,
    "AfId": null,
    "S": "6"
  },
  {
    "AuN": "k higashi",
    "AuId": "2725066679",
    "AfN": null,
    "AfId": null,
    "S": "7"
  }
],
"extra_informations": "{
\"DN\": \"Experimental study for improvement of crashworthiness in AZ91 magnesium foam controlling its microstructure.\",
\"S\":[{\"Ty\":1,\"U\":\"https://shibaura.pure.elsevier.com/en/publications/experimental-study-for-improvement-of-crashworthiness-in-az91-mag\"}],
 \"VFN\":\"Materials Science and Engineering\",
 \"FP\":283,
 \"LP\":287,
 \"RP\":[{\"Id\":2024275625,\"CoC\":5},{\"Id\":2035451257,\"CoC\":5},     {\"Id\":2141952446,\"CoC\":5},{\"Id\":2126566553,\"CoC\":6},  {\"Id\":2089573897,\"CoC\":5},{\"Id\":2069241702,\"CoC\":7},  {\"Id\":2000323790,\"CoC\":6},{\"Id\":1988924750,\"CoC\":16}],
\"ANF\":[
{\"FN\":\"H.\",\"LN\":\"Kanahashi\",\"S\":1},
{\"FN\":\"T.\",\"LN\":\"Mukai\",\"S\":2},    
{\"FN\":\"Y.\",\"LN\":\"Yamada\",\"S\":3},    
{\"FN\":\"K.\",\"LN\":\"Shimojima\",\"S\":4},    
{\"FN\":\"M.\",\"LN\":\"Mabuchi\",\"S\":5},    
{\"FN\":\"T.\",\"LN\":\"Aizawa\",\"S\":6},    
{\"FN\":\"K.\",\"LN\":\"Higashi\",\"S\":7}
],
\"BV\":\"Materials Science and Engineering\",\"BT\":\"a\"}"
  }
]

In extra_informations.ANF I have a nested array that contains more information about the author names.

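The nested repeated author_names field has a subfield, author_names.S, which can be mapped to extra_informations.ANF.S for the join. Using this mapping, I am trying to produce the following table: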

| article_id | author_names.AuN | S | extra_informations.ANF.FN | extra_informations.ANF.LN |
| 2732930586 | h kanahashi      | 1 | H.                        | Kanahashi                 |
| 2732930586 | t mukai          | 2 | T.                        | Mukai                     |
| 2732930586 | y yamada         | 3 | Y.                        | Yamada                    |
| 2732930586 | k shimojima      | 4 | K.                        | Shimojima                 |
| 2732930586 | m mabuchi        | 5 | M.                        | Mabuchi                   |
| 2732930586 | t aizawa         | 6 | T.                        | Aizawa                    |
| 2732930586 | k higashi        | 7 | K.                        | Higashi                   |

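The main problem I am running into is that when I convert the JSON string with JSON_EXTRACT(extra_informations, "$.ANF"), it does not give me an array. Instead, it returns the nested repeated array as a string, and I have not been able to convert that string into an array.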

Is it possible to produce such a table in BigQuery using standard SQL?

1 Answer:

Answer 0 (score: 2)

Option 1

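This is based on the REGEXP_REPLACE function plus a few more functions (REPLACE, SPLIT, etc.) to manipulate the result. Note: the extra manipulation is needed because JsonPath expressions in BigQuery do not support wildcards or filters.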

#standardSQL
SELECT
  article_id, author.AuN, author.S,
  -- Each element of extra looks like "FN":"H.","LN":"Kanahashi","S":1
  REPLACE(SPLIT(extra, '","')[OFFSET(0)], '"FN":"', '') AS FirstName,
  REPLACE(SPLIT(extra, '","')[OFFSET(1)], 'LN":"', '') AS LastName
FROM `table`, UNNEST(author_names) AS author
-- Strip the outer [{ and }] from the ANF string, then split it into one string per author
LEFT JOIN UNNEST(SPLIT(REGEXP_REPLACE(JSON_EXTRACT(extra_informations, '$.ANF'), r'\[{|}\]', ''), '},{')) AS extra
ON author.S = CAST(REPLACE(SPLIT(extra, '","')[OFFSET(2)], 'S":', '') AS INT64)
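
To see what each string step in the query above produces, the same sequence can be traced in plain JavaScript. This is only an illustrative sketch, not part of the query: the anf literal is a hand-shortened stand-in for what JSON_EXTRACT is assumed to return (compact JSON without spaces), and parseExtra is a hypothetical helper name.

```javascript
// Assumed output of JSON_EXTRACT(extra_informations, '$.ANF'),
// shortened by hand to the first two authors of the sample row.
const anf = '[{"FN":"H.","LN":"Kanahashi","S":1},{"FN":"T.","LN":"Mukai","S":2}]';

// REGEXP_REPLACE(..., r'\[{|}\]', ''): drop the leading "[{" and the trailing "}]".
const stripped = anf.replace(/\[\{|\}\]/g, "");

// SPLIT(..., '},{'): one string per author object.
const extras = stripped.split("},{");

// SPLIT(extra, '","') followed by REPLACE: peel the key prefix off each piece.
// (hypothetical helper, named here only for illustration)
function parseExtra(extra) {
  const parts = extra.split('","');
  return {
    FirstName: parts[0].replace('"FN":"', ""),
    LastName: parts[1].replace('LN":"', ""),
    S: parseInt(parts[2].replace('S":', ""), 10), // mirrors CAST(... AS INT64)
  };
}

const rows = extras.map(parseExtra);
console.log(rows);
```

The technique depends on the values never containing the sequences "," or },{ themselves; a name that included one of them would be split in the wrong place.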
  

Option 2

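To overcome BigQuery's "limitation" for JsonPath, you can use a custom function, as in the example below.
Note: it uses jsonpath-0.8.0.js, which can be downloaded from https://code.google.com/archive/p/jsonpath/downloads and is assumed to have been uploaded to Google Cloud Storage at gs://your_bucket/jsonpath-0.8.0.js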

#standardSQL
CREATE TEMPORARY FUNCTION CUSTOM_JSON_EXTRACT(json STRING, json_path STRING)
RETURNS STRING
LANGUAGE js AS """
    try { var parsed = JSON.parse(json);
        return jsonPath(parsed, json_path);
    } catch (e) { return null }
"""
OPTIONS (
    library="gs://your_bucket/jsonpath-0.8.0.js"
);
SELECT 
  article_id, author.AuN, author.S,
  CUSTOM_JSON_EXTRACT(extra_informations, CONCAT('$.ANF[?(@.S==', CAST(author.S AS STRING), ')].FN')) FirstName,
  CUSTOM_JSON_EXTRACT(extra_informations, CONCAT('$.ANF[?(@.S==', CAST(author.S AS STRING), ')].LN')) LastName
FROM `table`, UNNEST(author_names) author 

As you can see, now you can do all the magic in one simple JsonPath.
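
To make the filter path concrete, here is a rough sketch in plain JavaScript of what a path such as $.ANF[?(@.S==2)].FN is meant to select. It does not use or reproduce jsonpath-0.8.0.js; extractAnfField is a hypothetical helper written only for illustration, and the extra literal is a hand-shortened version of the sample extra_informations value.

```javascript
// Hypothetical helper, not part of jsonpath-0.8.0.js: it mimics the selection
// made by the filter path $.ANF[?(@.S==<s>)].<field> for a single author.
function extractAnfField(json, s, field) {
  try {
    const parsed = JSON.parse(json);
    const matches = parsed.ANF
      .filter((entry) => entry.S == s) // loose equality, like the == in the filter
      .map((entry) => entry[field]);
    return matches.length > 0 ? String(matches[0]) : null;
  } catch (e) {
    return null; // malformed JSON or no ANF array, like the catch in the UDF
  }
}

// Hand-shortened stand-in for the extra_informations string of the sample row.
const extra =
  '{"ANF":[{"FN":"H.","LN":"Kanahashi","S":1},{"FN":"T.","LN":"Mukai","S":2}]}';

console.log(extractAnfField(extra, 2, "FN")); // T.
console.log(extractAnfField(extra, 2, "LN")); // Mukai
```

Because the filter is evaluated per author, no join against an unnested ANF array is needed; the trade-off is that the JSON string is parsed again for every extracted field.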