Airflow task takes 7 minutes to execute

Time: 2018-09-19 04:49:15

Tags: google-cloud-platform airflow google-cloud-shell

I am getting started with Airflow, and my first workflow involves moving files from GCP to S3 (and back).

The task that does this work (and the whole DAG) completes successfully, but the file transfer itself takes about 7 minutes, as the log lines below show (I suspect some authentication and protocol overhead):

[2018-09-19 13:58:34,498] {logging_mixin.py:95} INFO - [2018-09-19 13:58:34,496] {credentials.py:1032} INFO - Found credentials in shared credentials file: ~/.aws/credentials

[2018-09-19 14:05:55,920] {logging_mixin.py:95} INFO - [2018-09-19 14:05:55,920] {gcp_api_base_hook.py:84} INFO - Getting connection using `google.auth.default()` since no key file is defined for hook.
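
The timestamps bracket the slow part: the AWS credentials are found at 13:58:34, yet the GCS hook only reports its connection at 14:05:55, roughly seven minutes later. One way to check whether that time is spent in the transfer itself rather than somewhere in Airflow is to time the same copy with the client libraries directly. The sketch below is not part of the original DAG; the object name is only an example, and it assumes google-cloud-storage and boto3 are installed and use the same credentials as the DAG.

# Standalone timing sketch (not part of the DAG below): copy one object from
# GCS to S3 with the client libraries directly, to see how much of the seven
# minutes is the transfer itself versus Airflow overhead. Bucket names match
# the DAG; the object name is only an example and would need adjusting.
import io
import time

import boto3
from google.cloud import storage

GCS_BUCKET = 'ds_de_airflow'
GCS_OBJECT = 'Task1_upload/airflow/fileS3.txt'   # hypothetical example object
S3_BUCKET = 'data-preprod-redshift-exports'
S3_KEY = 'airflow/fileS3.txt'

start = time.time()

# Download the object from GCS into memory.
gcs_blob = storage.Client().bucket(GCS_BUCKET).blob(GCS_OBJECT)
data = gcs_blob.download_as_string()

# Upload the same bytes to S3.
boto3.client('s3').upload_fileobj(io.BytesIO(data), S3_BUCKET, S3_KEY)

print('transfer took %.1f seconds' % (time.time() - start))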

In the same DAG there is a task that does the complementary work, i.e. the file transfer from S3 to GCP, and it is very fast (under 1 minute).

from __future__ import print_function

from builtins import range
from datetime import datetime
import airflow
from airflow.operators import OmegaFileSensor
from airflow.operators.python_operator import PythonOperator
from airflow.contrib.operators.s3_to_gcs_operator import S3ToGoogleCloudStorageOperator
from airflow.contrib.operators.gcs_to_s3 import GoogleCloudStorageToS3Operator
from airflow.contrib.sensors.gcs_sensor import GoogleCloudStorageObjectSensor
from airflow.models import DAG

import time
from pprint import pprint

S3_BUCKET = 'data-preprod-redshift-exports'
# S3_OBJECT = 'airflow/seattlecheckoutsbytitle.zip' # 2GB
S3_OBJECT = 'airflow/cnpjqsa.zip' # 400Mb   
# S3_OBJECT = '/airflow/chicagobusinesslicensesandowners.zip' # 100 Mb  

GCS_BUCKET = 'ds_de_airflow' 

args = {
    'owner': 'airflow',
    'start_date': datetime(2018, 9, 18),
    # 'execution_timeout': None,
    # 'dagrun_timeout': None
}

def print_context(ds, **kwargs):
    pprint(kwargs)
    print(ds)
    return 'Whatever you return gets printed in the logs'

with DAG( dag_id='a_second', default_args=args, schedule_interval=None) as dag:

    run_this = PythonOperator(
        task_id='run_this',
        provide_context=True,
        python_callable=print_context
    )

    s3_to_gcs_op = S3ToGoogleCloudStorageOperator( 
        task_id = 's3_to_gcs_op', 
        bucket = S3_BUCKET,
        prefix = S3_OBJECT, 
        dest_gcs_conn_id = 'google_cloud_default',
        dest_gcs = 'gs://ds_de_airflow/Task1_upload/', 
        replace = False
    )

    # for some reason this takes no less than 7 minutes (tried 3 times) 
    gcs_to_s3_op = GoogleCloudStorageToS3Operator(
        task_id = 'gcs_to_s3_op', 
        bucket = GCS_BUCKET,
        prefix = 'Task1_upload',
        delimiter = 'fileGCS.txt',
        google_cloud_storage_conn_id ='google_cloud_default',
        dest_aws_conn_id = 'aws_default',
        dest_s3_key = 's3://data-preprod-redshift-exports/airflow/',
        replace = False
    )

    gcs_sensor = GoogleCloudStorageObjectSensor(
        task_id = 'gcs_sensor',
        bucket = GCS_BUCKET,
        object = 'Task1_upload/airflow/fileS3.txt'  # not the most interesting file to wait for, but it will do for now
    )

    run_this >> s3_to_gcs_op >> gcs_sensor >> gcs_to_s3_op

We installed Airflow in Google Cloud Shell using the default database engine (1 thread).
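
For reference, what "default database engine (1 thread)" resolves to can be read from the running configuration; on an out-of-the-box install this is typically the SequentialExecutor with a SQLite connection string. A minimal sketch, assuming Airflow 1.10-style configuration access:

# Minimal sketch (not from the original post): print the executor and the
# metadata-database connection this install is actually using.
from airflow.configuration import conf

print(conf.get('core', 'executor'))          # e.g. SequentialExecutor on a default install
print(conf.get('core', 'sql_alchemy_conn'))  # e.g. a sqlite:///... connection string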

The question is: how can the execution time of the 7-minute task be brought down to something more reasonable?

0 Answers:

There are no answers yet.