What does job.commit() do in AWS Glue?

Date: 2018-01-14 08:35:09

Tags: amazon-web-services aws-glue

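Every job script should end with job.commit(), but what exactly does this function do?

  1. Is it just a job-end marker?
  2. Can it be called twice during one job (and if so, in which cases)?
  3. Is it safe to execute any Python statement after job.commit() has been called?

P.S. I could not find any description of it in the source code in PyGlue.zip :(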

3 answers:

Answer 0 (score: 6)

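As of today, the only case where the Job object is useful is when you use Job Bookmarks. When you read files from Amazon S3 (the only supported source for bookmarks so far) and call job.commit(), the time and the paths read so far are stored internally, so that if for some reason you try to read that path again, you only get back the unread (new) files.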

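In this code sample, I read and process two different paths separately, and commit after each path has been processed. If for some reason I stop my job, the same files will not be processed again.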

import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ['TempDir', 'JOB_NAME'])
sc = SparkContext()
glue_context = GlueContext(sc)
# Init my job
job = Job(glue_context)
job.init(args['JOB_NAME'], args)

paths = [
    's3://bucket-name/my_partition=apples/',
    's3://bucket-name/my_partition=oranges/']
# Read each path individually, operate on them and commit
for path in paths:
    try:
        dynamic_frame = glue_context.create_dynamic_frame_from_options(
            connection_type='s3',
            connection_options={'paths': [path]},
            format='json',
            transformation_ctx="path={}".format(path))
        do_something(dynamic_frame)  # placeholder for your own processing
        # Commit file read to Job Bookmark
        job.commit()
    except Exception as exc:
        # Something failed; report it and continue with the next path
        print("Failed to process {}: {}".format(path, exc))

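The commit method on a Job object only has an effect if Job Bookmarks are enabled, and the stored references are kept from JobRun to JobRun until you reset or pause the Job Bookmark. It is completely safe to execute more Python statements after job.commit(), and, as the previous code sample shows, committing multiple times is also valid.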
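To make the bookmark behaviour described in this answer more concrete, here is a minimal pure-Python sketch. It is only a toy model and not how Glue is implemented: `ToyBookmark`, its methods, and the file names are all hypothetical, and it tracks file names rather than the timestamps Glue is said to store.

```python
class ToyBookmark:
    """Toy stand-in for a job bookmark (hypothetical, not the Glue implementation)."""

    def __init__(self):
        self.committed = set()  # files recorded by earlier commits
        self.pending = set()    # files read since the last commit

    def read(self, available_files):
        """Return only the files that no earlier commit has recorded."""
        new_files = sorted(set(available_files) - self.committed)
        self.pending.update(new_files)
        return new_files

    def commit(self):
        """Record everything read so far, so later reads skip it."""
        self.committed |= self.pending
        self.pending.clear()


bookmark = ToyBookmark()
first_read = bookmark.read(["a.json", "b.json"])
bookmark.commit()
# After the commit, only the file that arrived later is returned.
second_read = bookmark.read(["a.json", "b.json", "c.json"])

# Without a commit, the same file comes back on the next read.
uncommitted = ToyBookmark()
uncommitted.read(["a.json"])
reread = uncommitted.read(["a.json"])

print(first_read)   # ['a.json', 'b.json']
print(second_read)  # ['c.json']
print(reread)       # ['a.json']
```

The point of the sketch is that the commit, not the read, is what moves the bookmark forward.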

Hope this helps.

Answer 1 (score: 1)

According to the AWS support team, commit should not be called more than once. This is the exact reply I got from them:

The method job.commit() can be called multiple times and it would not throw any error 
as well. However, if job.commit() would be called multiple times in a Glue script 
then job bookmark will be updated only once in a single job run that would be after 
the first time when job.commit() gets called and the other calls for job.commit() 
would be ignored by the bookmark. Hence, job bookmark may get stuck in a loop and 
would not able to work well with multiple job.commit(). Thus, I would recommend you 
to use job.commit() once in the Glue script.
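If the behaviour is as the support reply describes, the consequence can be illustrated with a small pure-Python simulation. This is a sketch under that assumption only: `ToyRun` and `run_job` are hypothetical names, and the code does not call or reproduce any Glue API.

```python
class ToyRun:
    """One simulated job run in which only the first commit() counts (assumption)."""

    def __init__(self, bookmark):
        self.bookmark = bookmark      # set of paths shared across runs
        self.pending = []             # paths read during this run
        self.committed_once = False

    def read(self, path):
        """Return True if the path still needs processing."""
        if path in self.bookmark:
            return False
        self.pending.append(path)
        return True

    def commit(self):
        """Update the bookmark on the first call; ignore later calls."""
        if self.committed_once:
            return
        self.bookmark.update(self.pending)
        self.committed_once = True


def run_job(bookmark, paths):
    """Read each path and commit after it, as in the looped example."""
    run = ToyRun(bookmark)
    processed = []
    for path in paths:
        if run.read(path):
            processed.append(path)
        run.commit()
    return processed


bookmark = set()
paths = ["apples", "oranges"]
run1 = run_job(bookmark, paths)
run2 = run_job(bookmark, paths)
run3 = run_job(bookmark, paths)
print(run1)  # ['apples', 'oranges']
print(run2)  # ['oranges']
print(run3)  # ['oranges']
```

In this model the second path is never recorded, so every later run processes it again, which is one way to read the "stuck in a loop" remark in the reply.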

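Answer 2 (score: 0)

Expanding on @yspotts' answer: as they mentioned, it is possible to call job.commit() more than once in an AWS Glue job script, although the bookmark will be updated only once. However, it is also safe to call job.init() more than once. In that case, the bookmark is updated correctly with the S3 files processed since the previous commit.

In the init() function there is an "initialised" flag that gets set to true. The commit() function then checks this flag: if it is true, it performs the steps to commit the bookmark and resets the flag; if it is false, it does nothing.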

So the only thing to change from @hoaxz's answer is to call job.init() in every iteration of the for loop:

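import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ['TempDir', 'JOB_NAME'])
sc = SparkContext()
glue_context = GlueContext(sc)
# Create my job; it is initialised inside the loop
job = Job(glue_context)

paths = [
    's3://bucket-name/my_partition=apples/',
    's3://bucket-name/my_partition=oranges/']
# Read each path individually, operate on them and commit
for path in paths:
    # Re-initialise before each read so the following commit takes effect
    job.init(args['JOB_NAME'], args)
    dynamic_frame = glue_context.create_dynamic_frame_from_options(
        connection_type='s3',
        connection_options={'paths': [path]},
        format='json',
        transformation_ctx="path={}".format(path))
    do_something(dynamic_frame)  # placeholder for your own processing
    # Commit file read to Job Bookmark
    job.commit()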
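The init()/commit() flag described in this answer can be modelled in a few lines of plain Python. This is a sketch of the described logic only: `ToyJob` and its `bookmark_updates` counter are hypothetical and are not taken from the Glue source.

```python
class ToyJob:
    """Toy model of the "initialised" flag described above (hypothetical)."""

    def __init__(self):
        self.initialised = False
        self.bookmark_updates = 0  # counts commits that actually took effect

    def init(self):
        """Mark the job as initialised so the next commit takes effect."""
        self.initialised = True

    def commit(self):
        """Commit only if initialised, then reset the flag."""
        if not self.initialised:
            return False  # nothing happens without a preceding init()
        self.bookmark_updates += 1
        self.initialised = False
        return True


job = ToyJob()
job.init()
first = job.commit()   # takes effect
second = job.commit()  # ignored: the flag was reset by the first commit
job.init()
third = job.commit()   # takes effect again after re-initialising

print(first, second, third)  # True False True
print(job.bookmark_updates)  # 2
```

Under this model, pairing each commit() with its own init() is what lets every iteration of the loop update the bookmark.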