Cannot connect to HDFS from a worker/data node on a Kerberized cluster using pyarrow's hdfs API

Asked: 2019-07-11 22:51:23

Tags: hdfs dask pyarrow

Here is what I am trying:

import pyarrow as pa

conf = {"hadoop.security.authentication": "kerberos"}
fs = pa.hdfs.connect(kerb_ticket="/tmp/krb5cc_44444", extra_conf=conf)

However, when I submit this job to the cluster with Dask-YARN, I get the following error:

  File "test/run.py", line 3
    fs = pa.hdfs.connect(kerb_ticket="/tmp/krb5cc_44444", extra_conf=conf)
  File "/opt/hadoop/data/10/hadoop/yarn/local/usercache/hdfsf6/appcache/application_1560931326013_183242/container_e47_1560931326013_183242_01_000003/environment/lib/python3.7/site-packages/pyarrow/hdfs.py", line 211, in connect
  File "/opt/hadoop/data/10/hadoop/yarn/local/usercache/hdfsf6/appcache/application_1560931326013_183242/container_e47_1560931326013_183242_01_000003/environment/lib/python3.7/site-packages/pyarrow/hdfs.py", line 38, in __init__
  File "pyarrow/io-hdfs.pxi", line 105, in pyarrow.lib.HadoopFileSystem._connect
  File "pyarrow/error.pxi", line 83, in pyarrow.lib.check_status
pyarrow.lib.ArrowIOError: HDFS connection failed

I also tried setting host (to a name node) and port (=8020), but I ran into the same error. Since the error message is not descriptive, I am not sure which setting needs to change. Any clues?
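One thing worth checking before blaming the connection arguments (this diagnostic is my own sketch, not part of the original question): pyarrow's libhdfs driver depends on environment variables that the pyarrow docs list, namely JAVA_HOME, HADOOP_HOME, and CLASSPATH (the latter typically populated from `hadoop classpath --glob`). Any of these missing inside a YARN container can surface as exactly this opaque "HDFS connection failed".

```python
import os

def missing_hdfs_env(environ=None):
    """Return the names of required Hadoop environment variables that are
    not set. An empty list means the libhdfs prerequisites look present."""
    environ = os.environ if environ is None else environ
    required = ("JAVA_HOME", "HADOOP_HOME", "CLASSPATH")
    return [name for name in required if name not in environ]
```

Running this inside the Dask worker (e.g. via `client.run`) shows whether the container environment differs from the edge node where the same code works.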

1 answer:

Answer 0 (score: 0)

Normally the configuration and the Kerberos ticket are loaded automatically, and you should be able to connect with

fs = pa.hdfs.connect()

alone. This does require that you have already run kinit (on the worker nodes, the credentials, but not the ticket, are transferred to the worker environment automatically, so nothing needs to be done there). I suggest trying with no arguments locally first, and then on the worker nodes.
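That advice can be sketched as follows (the helper name is mine, not part of pyarrow): build the keyword arguments incrementally, so the no-argument form is tried first and explicit settings are only layered on if autodiscovery fails.

```python
def hdfs_connect_kwargs(host=None, port=None, kerb_ticket=None, extra_conf=None):
    """Collect only the explicitly-set options for pa.hdfs.connect().
    An empty dict means: let libhdfs discover the namenode, port, and
    Kerberos ticket from the Hadoop configuration and the environment."""
    options = {
        "host": host,
        "port": port,
        "kerb_ticket": kerb_ticket,
        "extra_conf": extra_conf,
    }
    return {key: value for key, value in options.items() if value is not None}

# Preferred first attempt, after a successful `kinit`:
#   fs = pa.hdfs.connect(**hdfs_connect_kwargs())
# Fallback with explicit settings (hypothetical host name):
#   fs = pa.hdfs.connect(**hdfs_connect_kwargs(host="namenode", port=8020))
```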