I'm new to Apache Airflow and I want to write a DAG that moves some data from a set of tables in a source database to a set of tables in a target database. I'm trying to design the DAG so that someone can simply write the `create table` and `insert into` SQL scripts for a new source table -> target table process and drop them into a folder; on the next DAG run, the DAG would pick up the scripts from the folder and run the new tasks. I set my DAG up like:
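For the folder-scanning piece, this is roughly what I have in mind (a minimal sketch; the `create_<tbl>.sql` / `insert_<tbl>.sql` naming convention and the helper name are assumptions for illustration):

```python
from pathlib import Path

def discover_tables(sql_dir):
    """Return table names that have both a create and an insert script.

    Assumes scripts are named create_<tbl>.sql and insert_<tbl>.sql.
    """
    sql_dir = Path(sql_dir)
    creates = {p.stem[len('create_'):] for p in sql_dir.glob('create_*.sql')}
    inserts = {p.stem[len('insert_'):] for p in sql_dir.glob('insert_*.sql')}
    # only process tables for which both scripts are present
    return sorted(creates & inserts)
```

The returned list would then drive the per-table task loop below.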
source_data_check_task_1 (CheckOperator or ValueCheckOperator)
source_data_check_task_2 (CheckOperator or ValueCheckOperator, trigger on ALL_SUCCESS)
source_data_check_task_3 (CheckOperator or ValueCheckOperator, trigger on ALL_SUCCESS)
source_data_check_task_1 >> source_data_check_task_2 >> source_data_check_task_3

for tbl_name in tbl_name_list:
    tbl_exists_check (CheckOperator, trigger on ALL_SUCCESS): check if `new_tbl` exists in the database by querying `information_schema`
    tbl_create_task (SQL Operator, trigger on ALL_FAILED): run the `create table` SQL script
    tbl_insert_task (SQL Operator, trigger on ONE_SUCCESS): run the `insert into` SQL script
    source_data_check_task_3 >> tbl_exists_check
    tbl_exists_check >> tbl_create_task
    tbl_exists_check >> tbl_insert_task
    tbl_create_task >> tbl_insert_task
I run into two problems with this setup: (1) if any data quality check task fails, `tbl_create_task` still kicks off because it triggers on ALL_FAILED; and (2) the DAG shows the run as SUCCESS no matter which tasks failed. It's fine for `tbl_exists_check` to fail, since it's supposed to fail at least once, but it's not ideal when a critical task (like any of the data quality check tasks) fails.

Is there a different way to set up this DAG that avoids these problems?

Actual code below:
from airflow import DAG
from airflow.operators.postgres_operator import PostgresOperator
from airflow.operators.check_operator import ValueCheckOperator, CheckOperator
from airflow.operators.bash_operator import BashOperator
from airflow.models import Variable
from datetime import datetime, timedelta
from airflow.utils.trigger_rule import TriggerRule
sql_path = Variable.get('sql_path')
default_args = {
    'owner': 'enmyj',
    'depends_on_past': True,
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 0
}

dag = DAG(
    'test',
    default_args=default_args,
    schedule_interval=None,
    template_searchpath=sql_path
)

# check number of weeks in bill pay (made up example)
check_one = CheckOperator(
    task_id='check_one',
    conn_id='conn_name',
    sql="""select count(distinct field) from dbo.table having count(distinct field) >= 4""",
    dag=dag
)

check_two = CheckOperator(
    task_id='check_two',
    conn_id='conn_name',
    sql="""select count(distinct field) from dbo.table having count(distinct field) <= 100""",
    dag=dag
)
check_one >> check_two
ls = ['foo', 'bar', 'baz', 'quz', 'apple']
for tbl_name in ls:
    exists = CheckOperator(
        task_id='tbl_exists_{}'.format(tbl_name),
        conn_id='conn_name',
        sql="""select count(*) from information_schema.tables where table_schema = 'test' and table_name = '{}'""".format(tbl_name),
        trigger_rule=TriggerRule.ALL_SUCCESS,
        depends_on_past=True,
        dag=dag
    )
    create = PostgresOperator(
        task_id='tbl_create_{}'.format(tbl_name),
        postgres_conn_id='conn_name',
        database='triforcedb',
        sql='create table test.{} (like dbo.source)'.format(tbl_name),  # will be read from SQL file
        trigger_rule=TriggerRule.ONE_FAILED,
        depends_on_past=True,
        dag=dag
    )
    insert = PostgresOperator(
        task_id='tbl_insert_{}'.format(tbl_name),
        postgres_conn_id='conn_name',
        database='triforcedb',
        sql='insert into test.{} (select * from dbo.source limit 10)'.format(tbl_name),  # will be read from SQL file
        trigger_rule=TriggerRule.ONE_SUCCESS,
        depends_on_past=True,
        dag=dag
    )
    check_two >> exists
    exists >> create
    create >> insert
    exists >> insert
Answer 0 (score: 3)

You have a perfect use case for the BranchPythonOperator: use it for the exists check, then continue on to create the table before inserting into it, without having to worry about TRIGGER_RULES — and your DAG logic will be much clearer in the UI.
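A minimal sketch of the branch decision (the task-id names follow the question's pattern; the operator wiring itself is omitted here):

```python
def choose_branch(row_count, tbl_name):
    """Return the task_id the BranchPythonOperator should follow.

    row_count is the information_schema count for tbl_name; every
    downstream task whose id is not returned gets skipped.
    """
    if row_count == 0:
        return 'tbl_create_{}'.format(tbl_name)
    return 'tbl_insert_{}'.format(tbl_name)
```

The callable passed to the BranchPythonOperator would run the `information_schema` query, feed the count into this decision, and return the chosen task id.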
Answer 1 (score: 0)
Below is the code I eventually ended up with. This solution addresses both problems above:

1. The `tbl_create` task is not triggered if an upstream task fails.
2. The DAG registers as FAILED if any of the `check` tasks fail.

This solution still feels a little kludgy to me, so I'd happily take suggestions for improvements or ways to make it more "Airflow-y".
from airflow.models import DAG
from airflow.models import Variable
from airflow.operators.postgres_operator import PostgresOperator
from airflow.operators.check_operator import ValueCheckOperator, CheckOperator
from airflow.operators.python_operator import BranchPythonOperator
from airflow.operators.dummy_operator import DummyOperator
from airflow.utils.trigger_rule import TriggerRule
from datetime import datetime, timedelta
from airflow.hooks.postgres_hook import PostgresHook
sql_path = Variable.get('sql_path')
default_args = {
    'owner': 'enmyj',
    'depends_on_past': False,
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 0
}

dag = DAG(
    'test',
    default_args=default_args,
    schedule_interval=None,
    template_searchpath=sql_path
)
# check number of weeks in bill pay (made up example)
check_one = CheckOperator(
    task_id='check_one',
    conn_id='conn_id',
    sql="""select count(distinct field) from dbo.table having count(distinct field) >= 4""",
    dag=dag
)

def check_two_func():
    p = PostgresHook('conn_id')
    sql = """select count(distinct field) from dbo.table having count(distinct field) <= 100"""
    count = p.get_records(sql)[0][0]
    if count == 0:
        return 'dummy_fail'
    else:
        return 'dummy_success'

check_two = BranchPythonOperator(
    task_id='check_two',
    python_callable=check_two_func,
    dag=dag
)

dummy_fail = DummyOperator(task_id='dummy_fail', dag=dag)
dummy_success = DummyOperator(task_id='dummy_success', dag=dag)
join = DummyOperator(task_id='join', dag=dag)

check_one >> check_two
check_two >> dummy_fail
check_two >> dummy_success
ls = ['foo', 'bar', 'baz', 'quz', 'apple']
for tbl_name in ls:
    def has_table(tbl_name=tbl_name):
        p = PostgresHook('conn_id')
        sql = """select count(*) from information_schema.tables where table_schema = 'test' and table_name = '{}'""".format(tbl_name)
        count = p.get_records(sql)[0][0]  # unpack the list/tuple
        # If the query didn't return rows, branch to create table
        # otherwise, branch to dummy
        if count == 0:
            return 'tbl_create_{}'.format(tbl_name)
        else:
            return 'dummy_{}'.format(tbl_name)

    exists = BranchPythonOperator(
        task_id='tbl_exists_{}'.format(tbl_name),
        python_callable=has_table,
        depends_on_past=False,
        dag=dag
    )
    create = PostgresOperator(
        task_id='tbl_create_{}'.format(tbl_name),
        postgres_conn_id='conn_id',
        database='database_name',
        sql='create table test.{} (like dbo.source)'.format(tbl_name),  # will be read from SQL file
        dag=dag
    )
    insert = PostgresOperator(
        task_id='tbl_insert_{}'.format(tbl_name),
        postgres_conn_id='conn_id',
        database='database_name',
        sql='insert into test.{} (select * from dbo.source limit 10)'.format(tbl_name),  # will be read from SQL file
        trigger_rule=TriggerRule.ONE_SUCCESS,
        dag=dag
    )
    # per-table no-op target for the "table already exists" branch
    dummy = DummyOperator(task_id='dummy_{}'.format(tbl_name), dag=dag)

    dummy_success >> exists
    exists >> create >> insert
    exists >> dummy >> insert
    insert >> join
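Note that `insert` needs `trigger_rule=TriggerRule.ONE_SUCCESS` because exactly one of its two upstreams (`create` or `dummy`) is skipped by the branch. Conceptually the rule evaluates like this (a simplified sketch, not Airflow's actual scheduler code):

```python
def one_success(upstream_states):
    # ONE_SUCCESS fires if at least one upstream task instance
    # succeeded, no matter how many others were skipped or failed.
    return any(state == 'success' for state in upstream_states)

# branch took the "table exists" path: create skipped, dummy succeeded
one_success(['skipped', 'success'])  # True -> insert runs
# both upstreams skipped
one_success(['skipped', 'skipped'])  # False -> insert is skipped
```

With the default ALL_SUCCESS rule, the skipped branch would cause `insert` to be skipped every time.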