Running an Airflow DAG every X minutes

Date: 2017-09-12 18:03:33

Tags: python airflow apache-airflow

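I am using Airflow on an EC2 instance with the LocalScheduler option. I've invoked airflow scheduler and airflow webserver, and everything seems to be running fine. That said, after supplying the cron string '*/10 * * * *' ("do this every 10 minutes") to schedule_interval, the job continues to execute every 24 hours, which is the default. Here is the header of the code: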

from datetime import datetime
import os
import sys

from airflow.models import DAG
from airflow.operators.python_operator import PythonOperator

import ds_dependencies

SCRIPT_PATH = os.getenv('PREPROC_PATH')

if SCRIPT_PATH:
    sys.path.insert(0, SCRIPT_PATH)
    import workers
else:
    print('Define PREPROC_PATH value in environmental variables')
    sys.exit(1)

default_args = {
  'start_date': datetime(2017, 9, 9, 10, 0, 0, 0), #..EC2 time. Equal to 11pm hora México
  'max_active_runs': 1,
  'concurrency': 4,
  'schedule_interval': '*/10 * * * *' #..every 10 minutes
}

DAG = DAG(
  dag_id='dash_update',
  default_args=default_args
)

...

2 answers:

Answer 0 (score: 5)

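default_args is only used to fill in the arguments passed to the operators within a DAG. max_active_runs, concurrency, and schedule_interval are all parameters for initializing the DAG itself, not the operators. This is what you want: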

DAG = DAG(
  dag_id='dash_update',
  start_date=datetime(2017, 9, 9, 10, 0, 0, 0), #..EC2 time. Equal to 11pm hora México
  max_active_runs=1,
  concurrency=4,
  schedule_interval='*/10 * * * *', #..every 10 minutes
  default_args=default_args,
)

I've mixed them up before as well, so for reference (note that there is some overlap):

DAG parameters: https://airflow.incubator.apache.org/code.html?highlight=dag#airflow.models.DAG
Operator parameters: https://airflow.incubator.apache.org/code.html#baseoperator
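To make the split concrete, here is a minimal sketch in plain Python (no Airflow import needed) that separates a mixed dictionary like the one in the question into DAG-level keyword arguments and operator defaults. The helper name split_dag_kwargs and the DAG_LEVEL_KEYS set are hypothetical and cover only the three parameters named above, not the full DAG signature. start_date is left with the operator defaults as an assumption, since it is one of the parameters that both DAG and BaseOperator accept.

```python
from datetime import datetime

# DAG-level keyword names taken from the answer above.
# Illustrative only: this is not the complete list of DAG parameters.
DAG_LEVEL_KEYS = {"schedule_interval", "max_active_runs", "concurrency"}


def split_dag_kwargs(mixed_args):
    """Hypothetical helper: separate DAG constructor kwargs from operator defaults."""
    dag_kwargs = {k: v for k, v in mixed_args.items() if k in DAG_LEVEL_KEYS}
    operator_defaults = {k: v for k, v in mixed_args.items() if k not in DAG_LEVEL_KEYS}
    return dag_kwargs, operator_defaults


# The mixed dictionary from the question.
mixed = {
    "start_date": datetime(2017, 9, 9, 10, 0, 0, 0),
    "max_active_runs": 1,
    "concurrency": 4,
    "schedule_interval": "*/10 * * * *",
}

dag_kwargs, operator_defaults = split_dag_kwargs(mixed)
print(sorted(dag_kwargs))         # keys to pass directly to DAG(...)
print(sorted(operator_defaults))  # keys that can stay in default_args
```

With the two dictionaries separated, the DAG could then be built as DAG(dag_id='dash_update', default_args=operator_defaults, **dag_kwargs).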

Answer 1 (score: 1)

For Airflow versions >2.1, you can use a datetime.timedelta() object:

from datetime import datetime, timedelta

DAG = DAG(
  dag_id='dash_update',
  start_date=datetime(2017, 9, 9, 10, 0, 0, 0),
  max_active_runs=1,
  concurrency=4,
  schedule_interval=timedelta(minutes=10),
  default_args=default_args,
)
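One practical difference from the cron string is worth sketching. As far as I understand it, a timedelta schedule counts forward from start_date, while '*/10 * * * *' is aligned to the wall clock (minutes 00, 10, 20, and so on). The stdlib-only snippet below lists the first few interval boundaries for a timedelta schedule; interval_starts is a hypothetical helper, not an Airflow function, and the 10:03 start time is chosen only to make the offset visible.

```python
from datetime import datetime, timedelta


def interval_starts(start_date, interval, count):
    """Hypothetical helper: first `count` interval boundaries counted from start_date."""
    return [start_date + i * interval for i in range(count)]


# A start_date that is deliberately not aligned to a 10-minute clock boundary.
starts = interval_starts(datetime(2017, 9, 9, 10, 3), timedelta(minutes=10), 3)
for s in starts:
    print(s.isoformat())
```

With the start_date of 10:00 used in the question, both forms land on the same boundaries, so the choice only matters when start_date is not aligned to the interval.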

Another cool feature for handling start_date is days_ago:

from airflow.utils.dates import days_ago

DAG = DAG(
  dag_id='dash_update',
  start_date=days_ago(2, minute=15), # would start 2 days ago at 00:15
  max_active_runs=1,
  concurrency=4,
  schedule_interval=timedelta(minutes=10),
  default_args=default_args,
)
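For readers without Airflow at hand, the following stdlib-only sketch shows roughly what days_ago computes. It assumes the helper takes the current UTC time, overwrites the time-of-day fields, and subtracts n days. days_ago_sketch is a stand-in written for illustration, and its now parameter is an addition that makes the result reproducible; neither is part of Airflow.

```python
from datetime import datetime, timedelta, timezone


def days_ago_sketch(n, hour=0, minute=0, second=0, microsecond=0, now=None):
    """Stand-in for airflow.utils.dates.days_ago; `now` is an added hook for testing."""
    now = now or datetime.now(timezone.utc)
    # Overwrite the time-of-day fields, then step back n whole days.
    today = now.replace(hour=hour, minute=minute, second=second, microsecond=microsecond)
    return today - timedelta(days=n)


# Pin "now" to a fixed moment so the result does not depend on when this runs.
fixed_now = datetime(2017, 9, 12, 18, 3, 33, tzinfo=timezone.utc)
print(days_ago_sketch(2, minute=15, now=fixed_now))
```

Note that a start_date computed this way moves forward every time the DAG file is parsed. The Airflow documentation generally recommends a fixed start_date for that reason, so days_ago is probably best kept to quick experiments.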