Using PyArrow

Time: 2017-11-22 20:10:31

Tags: hdfs parquet pyarrow

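I know I can connect to an HDFS cluster via pyarrow using pyarrow.hdfs.connect().

I also know I can read a Parquet file using pyarrow.parquet's read_table().

However, read_table() accepts a file path, whereas hdfs.connect() gives me a HadoopFileSystem instance.

Is it possible to use only pyarrow (with libhdfs3 installed) to get hold of a Parquet file or folder residing in an HDFS cluster? What I want to reach is the to_pydict() function, so that I can pass the data along.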

2 answers:

Answer 0: (score: 4)

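Try

import pyarrow as pa

# Connection arguments are elided here, as in the original answer.
fs = pa.hdfs.connect(...)
table = fs.read_parquet('/path/to/hdfs-file', **other_options)

or

import pyarrow.parquet as pq

# Open the HDFS file and hand the file object to read_table().
with fs.open(path) as f:
    table = pq.read_table(f, **read_options)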

I opened https://issues.apache.org/jira/browse/ARROW-1848 about adding some more explicit documentation for this.

Answer 1: (score: 1)

I tried the same thing through the Pydoop library with engine = pyarrow, and it worked perfectly for me. Here is the generalized method.
