Dask: pyarrow / hdfs.py returns OSError: Getting symbol hdfsNewBuilder failed when reading from HDFS

Asked: 2020-07-06 04:28:37

Tags: hadoop hdfs dask dask-distributed pyarrow

I am trying to run dask-on-yarn on my research group's Hadoop cluster.

I tried each of the following statements:

  • dd.read_parquet('hdfs://file.parquet', engine='fastparquet')
  • dd.read_parquet('hdfs://file.parquet', engine='pyarrow')
  • dd.read_csv('hdfs://file.csv')

Each time, the following error message appears:

~/miniconda3/envs/dask/lib/python3.8/site-packages/fsspec/core.py in get_fs_token_paths(urlpath, mode, num, name_function, storage_options, protocol)
    521         path = cls._strip_protocol(urlpath)
    522         update_storage_options(options, storage_options)
--> 523         fs = cls(**options)
    524 
    525         if "w" in mode:

~/miniconda3/envs/dask/lib/python3.8/site-packages/fsspec/spec.py in __call__(cls, *args, **kwargs)
     52             return cls._cache[token]
     53         else:
---> 54             obj = super().__call__(*args, **kwargs)
     55             # Setting _fs_token here causes some static linters to complain.
     56             obj._fs_token_ = token

~/miniconda3/envs/dask/lib/python3.8/site-packages/fsspec/implementations/hdfs.py in __init__(self, host, port, user, kerb_ticket, driver, extra_conf, **kwargs)
     42         AbstractFileSystem.__init__(self, **kwargs)
     43         self.pars = (host, port, user, kerb_ticket, driver, extra_conf)
---> 44         self.pahdfs = HadoopFileSystem(
     45             host=host,
     46             port=port,

~/miniconda3/envs/dask/lib/python3.8/site-packages/pyarrow/hdfs.py in __init__(self, host, port, user, kerb_ticket, driver, extra_conf)
     38             _maybe_set_hadoop_classpath()
     39 
---> 40         self._connect(host, port, user, kerb_ticket, extra_conf)
     41 
     42     def __reduce__(self):

~/miniconda3/envs/dask/lib/python3.8/site-packages/pyarrow/io-hdfs.pxi in pyarrow.lib.HadoopFileSystem._connect()

~/miniconda3/envs/dask/lib/python3.8/site-packages/pyarrow/error.pxi in pyarrow.lib.check_status()

OSError: Getting symbol hdfsNewBuilderfailed

How should I resolve this?

My environment

These are the packages in my conda env:

# Name                    Version                   Build  Channel
_libgcc_mutex             0.1                        main
abseil-cpp                20200225.2           he1b5a44_0    conda-forge
arrow-cpp                 0.17.1          py38h1234567_9_cpu    conda-forge
attrs                     19.3.0                     py_0
aws-sdk-cpp               1.7.164              hc831370_1    conda-forge
backcall                  0.2.0                      py_0
blas                      1.0                         mkl
bleach                    3.1.5                      py_0
bokeh                     2.1.1                    py38_0
boost-cpp                 1.72.0               h7b93d67_1    conda-forge
brotli                    1.0.7                he6710b0_0
brotlipy                  0.7.0           py38h7b6447c_1000
bzip2                     1.0.8                h7b6447c_0
c-ares                    1.15.0            h7b6447c_1001
ca-certificates           2020.6.24                     0
certifi                   2020.6.20                py38_0
cffi                      1.14.0           py38he30daa8_1
chardet                   3.0.4                 py38_1003
click                     7.1.2                      py_0
cloudpickle               1.4.1                      py_0
conda-pack                0.4.0                      py_0
cryptography              2.9.2            py38h1ba5d50_0
curl                      7.71.0               hbc83047_0
cytoolz                   0.10.1           py38h7b6447c_0
dask                      2.19.0                     py_0
dask-core                 2.19.0                     py_0
dask-yarn                 0.8.1            py38h32f6830_0    conda-forge
decorator                 4.4.2                      py_0
defusedxml                0.6.0                      py_0
distributed               2.19.0                   py38_0
entrypoints               0.3                      py38_0
fastparquet               0.3.2            py38heb32a55_0
freetype                  2.10.2               h5ab3b9f_0
fsspec                    0.7.4                      py_0
gflags                    2.2.2                he6710b0_0
glog                      0.4.0                he6710b0_0
grpc-cpp                  1.30.0               h9ea6770_0    conda-forge
grpcio                    1.27.2           py38hf8bcb03_0
heapdict                  1.0.1                      py_0
icu                       67.1                 he1b5a44_0    conda-forge
idna                      2.10                       py_0
importlib-metadata        1.7.0                    py38_0
importlib_metadata        1.7.0                         0
intel-openmp              2020.1                      217
ipykernel                 5.3.0            py38h5ca1d4c_0
ipython                   7.16.1           py38h5ca1d4c_0
ipython_genutils          0.2.0                    py38_0
jedi                      0.17.1                   py38_0
jinja2                    2.11.2                     py_0
jpeg                      9b                   h024ee3a_2
json5                     0.9.5                      py_0
jsonschema                3.2.0                    py38_0
jupyter_client            6.1.3                      py_0
jupyter_core              4.6.3                    py38_0
jupyterlab                2.1.5                      py_0
jupyterlab_server         1.1.5                      py_0
krb5                      1.18.2               h173b8e3_0
ld_impl_linux-64          2.33.1               h53a641e_7
libcurl                   7.71.0               h20c2e04_0
libedit                   3.1.20191231         h7b6447c_0
libevent                  2.1.10               hcdb4288_1    conda-forge
libffi                    3.3                  he6710b0_1
libgcc-ng                 9.1.0                hdf63c60_0
libgfortran-ng            7.3.0                hdf63c60_0
libllvm9                  9.0.1                h4a3c616_0
libpng                    1.6.37               hbc83047_0
libprotobuf               3.12.3               hd408876_0
libsodium                 1.0.18               h7b6447c_0
libssh2                   1.9.0                h1ba5d50_1
libstdcxx-ng              9.1.0                hdf63c60_0
libtiff                   4.1.0                h2733197_1
llvmlite                  0.33.0           py38hd408876_0
locket                    0.2.0                    py38_1
lz4-c                     1.9.2                he6710b0_0
markupsafe                1.1.1            py38h7b6447c_0
mistune                   0.8.4           py38h7b6447c_1000
mkl                       2020.1                      217
mkl-service               2.3.0            py38he904b0f_0
mkl_fft                   1.1.0            py38h23d657b_0
mkl_random                1.1.1            py38h0573a6f_0
msgpack-python            1.0.0            py38hfd86e86_1
nbconvert                 5.6.1                    py38_0
nbformat                  5.0.7                      py_0
ncurses                   6.2                  he6710b0_1
notebook                  6.0.3                    py38_0
numba                     0.50.1           py38h0573a6f_0
numpy                     1.18.5           py38ha1c710e_0
numpy-base                1.18.5           py38hde5b4d6_0
olefile                   0.46                       py_0
openssl                   1.1.1g               h7b6447c_0
packaging                 20.4                       py_0
pandas                    1.0.5            py38h0573a6f_0
pandoc                    2.9.2.1                       0
pandocfilters             1.4.2                    py38_1
parquet-cpp               1.5.1                         2    conda-forge
parso                     0.7.0                      py_0
partd                     1.1.0                      py_0
pexpect                   4.8.0                    py38_0
pickleshare               0.7.5                 py38_1000
pillow                    7.1.2            py38hb39fc2d_0
pip                       20.1.1                   py38_1
prometheus_client         0.8.0                      py_0
prompt-toolkit            3.0.5                      py_0
protobuf                  3.12.3           py38he6710b0_0
psutil                    5.7.0            py38h7b6447c_0
ptyprocess                0.6.0                    py38_0
pyarrow                   0.17.1          py38h1234567_9_cpu    conda-forge
pycparser                 2.20                       py_0
pygments                  2.6.1                      py_0
pyopenssl                 19.1.0                   py38_0
pyparsing                 2.4.7                      py_0
pyrsistent                0.16.0           py38h7b6447c_0
pysocks                   1.7.1                    py38_0
python                    3.8.3                hcff3b4d_2
python-dateutil           2.8.1                      py_0
python_abi                3.8                      1_cp38    conda-forge
pytz                      2020.1                     py_0
pyyaml                    5.3.1            py38h7b6447c_1
pyzmq                     19.0.1           py38he6710b0_1
re2                       2020.07.01           he1b5a44_0    conda-forge
readline                  8.0                  h7b6447c_0
requests                  2.24.0                     py_0
send2trash                1.5.0                    py38_0
setuptools                47.3.1                   py38_0
six                       1.15.0                     py_0
skein                     0.8.0            py38h32f6830_1    conda-forge
snappy                    1.1.8                he6710b0_0
sortedcontainers          2.2.2                      py_0
sqlite                    3.32.3               h62c20be_0
tbb                       2020.0               hfd86e86_0
tblib                     1.6.0                      py_0
terminado                 0.8.3                    py38_0
testpath                  0.4.4                      py_0
thrift                    0.13.0           py38he6710b0_0
thrift-cpp                0.13.0               h62aa4f2_2    conda-forge
tk                        8.6.10               hbc83047_0
toolz                     0.10.0                     py_0
tornado                   6.0.4            py38h7b6447c_1
traitlets                 4.3.3                    py38_0
typing_extensions         3.7.4.2                    py_0
urllib3                   1.25.9                     py_0
wcwidth                   0.2.5                      py_0
webencodings              0.5.1                    py38_1
wheel                     0.34.2                   py38_0
xz                        5.2.5                h7b6447c_0
yaml                      0.2.5                h7b6447c_0
zeromq                    4.3.2                he6710b0_2
zict                      2.0.0                      py_0
zipp                      3.1.0                      py_0
zlib                      1.2.11               h7b6447c_3
zstd                      1.4.4                h0b5b093_3

The Hadoop cluster is running Hadoop 2.7.0-mapr-1607.

The cluster object is created as follows:

from dask_yarn import YarnCluster

# Create the cluster from a packed conda environment shipped to the worker nodes
cluster = YarnCluster(
    environment='conda-env-packed-for-worker-nodes.tar.gz',
    worker_env={
        # Tell pyarrow on the workers where to find libhdfs.so.
        # See https://github.com/dask/dask-yarn/pull/30#issuecomment-434001858
        'ARROW_LIBHDFS_DIR': '/opt/mapr/hadoop/hadoop-0.20.2/c++/Linux-amd64-64/lib',
    },
)

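Suspected cause

I suspect a version mismatch between the hadoop-0.20.2 in the ARROW_LIBHDFS_DIR environment variable and the hadoop CLI version, Hadoop 2.7.0.

I had to manually point pyarrow at this file (using the setup from https://stackoverflow.com/a/62749053/1147061). The required file, libhdfs.so, is not provided under /opt/mapr/hadoop/hadoop-2.7.0/. Installing libhdfs3 via conda install -c conda-forge libhdfs3 does not satisfy the requirement either.

Could this be the problem?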
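
One way to test this suspicion is to ask the library itself whether it exports the symbol named in the error. The following is a minimal sketch, not part of the original setup: exports_symbol is a hypothetical helper, and it assumes the library can be loaded with ctypes from the current process. If libhdfs.so depends on libjvm.so and that is not on the loader path, the load fails and the helper returns None rather than an answer.

```python
import ctypes
import os
from typing import Optional


def exports_symbol(library_path: str, symbol: str) -> Optional[bool]:
    """Return True/False if the library loads and does/doesn't export `symbol`.

    Returns None when the library cannot be loaded at all (missing file,
    unresolved dependency such as libjvm.so, wrong architecture, ...).
    """
    try:
        lib = ctypes.CDLL(library_path)
    except OSError:
        return None
    # ctypes looks symbols up lazily; a missing symbol raises
    # AttributeError, which hasattr() turns into False.
    return hasattr(lib, symbol)


if __name__ == "__main__":
    # Default directory taken from the question; adjust for your cluster.
    libhdfs_dir = os.environ.get(
        "ARROW_LIBHDFS_DIR",
        "/opt/mapr/hadoop/hadoop-0.20.2/c++/Linux-amd64-64/lib",
    )
    candidate = os.path.join(libhdfs_dir, "libhdfs.so")
    print(candidate, "->", exports_symbol(candidate, "hdfsNewBuilder"))
```

A result of False would be consistent with the library being too old for pyarrow; None only means the check itself could not run.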

1 Answer:

Answer 0: (score: 0)

(Partial answer)

To use libhdfs3 (which is poorly maintained at this point), you need to call

dd.read_csv('hdfs://file.csv', storage_options={'driver': 'libhdfs3'})

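and, of course, install libhdfs3. The Hadoop library options will not help with this, because they are a separate code path.

I also suspect that getting the JNI libhdfs (without the "3") to work is a matter of locating the right .so file.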
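
As a starting point for locating that file, here is a small sketch that walks a directory tree and lists every file whose name starts with libhdfs.so. The helper name find_libhdfs is made up for illustration, and /opt/mapr/hadoop is simply the install prefix mentioned in the question; other distributions keep the library elsewhere.

```python
import os
from typing import List


def find_libhdfs(root: str) -> List[str]:
    """Return sorted paths under `root` whose file name starts with 'libhdfs.so'."""
    matches = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # Matches libhdfs.so and versioned names like libhdfs.so.0.0.0,
            # but not libhdfs3.so, which is a different (non-JNI) library.
            if name.startswith("libhdfs.so"):
                matches.append(os.path.join(dirpath, name))
    return sorted(matches)


if __name__ == "__main__":
    # Install prefix taken from the question; adjust for your cluster.
    for path in find_libhdfs("/opt/mapr/hadoop"):
        print(path)
```

The directory containing the chosen file is what ARROW_LIBHDFS_DIR should point to, both in worker_env and in the client environment.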