How to get table names from a Spark SQL query [PySpark]?

Posted: 2019-10-25 09:45:52

Tags: python sql scala apache-spark pyspark

I want to get the table names from a SQL query such as

select *
from table1 as t1
full outer join table2 as t2
  on t1.id = t2.id

I found a solution in the Scala question How to get table names from SQL query?:

def getTables(query: String): Seq[String] = {
  val logicalPlan = spark.sessionState.sqlParser.parsePlan(query)
  import org.apache.spark.sql.catalyst.analysis.UnresolvedRelation
  logicalPlan.collect { case r: UnresolvedRelation => r.tableName }
}

When I iterate over the returned sequence with

getTables(query).foreach(println)

it gives me the correct table names:

table1
table2

What is the PySpark equivalent? The closest I have come across is How to extract column name and column type from SQL in pyspark:

plan = spark_session._jsparkSession.sessionState().sqlParser().parsePlan(query)
print(f"table: {plan.tableDesc().identifier().table()}")

which fails with the traceback

Py4JError: An error occurred while calling o78.tableDesc. Trace:
py4j.Py4JException: Method tableDesc([]) does not exist
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
    at py4j.Gateway.invoke(Gateway.java:274)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.base/java.lang.Thread.run(Thread.java:835)

I understand the problem stems from the fact that I need to filter the plan for items of type UnresolvedRelation, but I cannot find the equivalent in Python/PySpark.

1 Answer:

Answer 0 (score: 2)

I have a method, but it is convoluted. It dumps the Java plan object to JSON (a poor man's serialization), deserializes it into Python objects, then filters them and parses out the table names:

import json

def get_tables(query: str):
    # Parse the query into an unanalyzed logical plan via the JVM gateway.
    plan = spark._jsparkSession.sessionState().sqlParser().parsePlan(query)
    # Serialize the plan to JSON and deserialize it into Python dicts.
    plan_items = json.loads(plan.toJSON())
    for plan_item in plan_items:
        # UnresolvedRelation nodes are the raw table references in the plan.
        if plan_item['class'] == 'org.apache.spark.sql.catalyst.analysis.UnresolvedRelation':
            yield plan_item['tableIdentifier']['table']

When I iterate over the generator with list(get_tables(query)), it yields ['fast_track_gv_nexus', 'buybox_gv_nexus'].
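To see the filtering step in isolation, here is a minimal sketch that runs the same class-based filter over a hand-written plan JSON; the plan fragment below is a hypothetical stand-in for what plan.toJSON() returns for the join query above, so no Spark session is needed:

```python
import json

# Hypothetical stand-in for plan.toJSON(): a JSON array of plan nodes,
# each tagged with its Catalyst class name.
sample_plan_json = json.dumps([
    {"class": "org.apache.spark.sql.catalyst.plans.logical.Project"},
    {"class": "org.apache.spark.sql.catalyst.analysis.UnresolvedRelation",
     "tableIdentifier": {"table": "table1"}},
    {"class": "org.apache.spark.sql.catalyst.analysis.UnresolvedRelation",
     "tableIdentifier": {"table": "table2", "database": "sales"}},
])

def tables_from_plan_json(plan_json: str):
    # Keep only UnresolvedRelation nodes and pull out their table names.
    for item in json.loads(plan_json):
        if item["class"] == "org.apache.spark.sql.catalyst.analysis.UnresolvedRelation":
            yield item["tableIdentifier"]["table"]

print(list(tables_from_plan_json(sample_plan_json)))  # ['table1', 'table2']
```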

Note that, unfortunately, this breaks on CTEs (common table expressions), because the CTE aliases also show up as UnresolvedRelation nodes.

Example:

with delta as (
   select *
    group by id
    cluster by id
 )
select   *
  from ( select  *
         FROM
          (select   *
            from dmm
            inner join delta on dmm.id = delta.id
           )
  )

To work around it, I had to hack in a regular expression:

import json
import re

def get_tables(query: str):
    plan = spark._jsparkSession.sessionState().sqlParser().parsePlan(query)
    plan_items = json.loads(plan.toJSON())
    # The string form of the plan lists CTE definitions as "CTE [name, ...]".
    plan_string = plan.toString()
    cte = re.findall(r"CTE \[(.*?)\]", plan_string)
    for plan_item in plan_items:
        if plan_item['class'] == 'org.apache.spark.sql.catalyst.analysis.UnresolvedRelation':
            tableIdentifier = plan_item['tableIdentifier']
            table = tableIdentifier['table']
            database = tableIdentifier.get('database', '')
            table_name = "{}.{}".format(database, table) if database else table
            # Skip names that are really CTE aliases, not physical tables.
            if table_name not in cte:
                yield table_name