Question

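I am trying to gather some statistics for a directory in HDFS. I want to get the files/subdirectories and the size of each. I started out thinking I could do this in bash.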

#!/bin/bash
OP=$(hadoop fs -ls hdfs://mydirectory)
echo "$OP" | wc -l

This is all I have so far, and I quickly realized that Python might be a better choice. However, I can't figure out how to execute Hadoop commands such as hadoop fs -ls from Python.

Answer 1

You can refer to these subprocess examples: https://community.hortonworks.com/articles/92321/interacting-with-hadoop-hdfs-using-python-codes.html

They let you capture the return status, the output, and the error message separately.

Or run the command from Python:

import subprocess

output = subprocess.Popen(["hadoop", "fs", "-ls", "/user"], stdout=subprocess.PIPE, stderr=subprocess.PIPE)
for line in output.stdout:
    print(line)
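To illustrate getting the return status, output, and error message separately, here is a minimal sketch. The helper name `run_command` is hypothetical (it is not from the linked article), and the demo runs the Python interpreter instead of Hadoop so that it works on a machine without a Hadoop client installed.

```python
import subprocess
import sys


def run_command(args):
    """Run a command and return (return_code, stdout_text, stderr_text).

    Hypothetical helper for illustration; `args` is a list such as
    ["hadoop", "fs", "-ls", "/user"].
    """
    proc = subprocess.Popen(
        args,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,  # return text instead of bytes
    )
    # communicate() drains both pipes and waits for the process to exit,
    # which avoids the deadlock an unread, full stderr pipe can cause.
    out, err = proc.communicate()
    return proc.returncode, out, err


# Demo with the Python interpreter itself, so the sketch runs without Hadoop.
status, out, err = run_command([sys.executable, "-c", "print('hello')"])
print(status, out.strip(), err.strip())

# Assuming the hadoop client is on the PATH, the call would look like:
#   status, out, err = run_command(["hadoop", "fs", "-ls", "/user"])
```

A non-zero `status` together with the text in `err` is usually the quickest way to see why a Hadoop command failed, for example when the path does not exist.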

Answer 2

See https://docs.python.org/2/library/commands.html for your options, including how to get the return status (in case of an error). The basic code you're missing is

import commands

hdir_list = commands.getoutput('hadoop fs -ls hdfs://mydirectory')

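The commands module is deprecated as of 2.6, still usable in 2.7, but removed in Python 3. If that bothers you, switch to

os.system(<command string>)

...or, better yet, use subprocess.call (introduced in 2.4). Note that both of these return the exit status rather than the command's output; to capture the output, use subprocess.check_output (2.7+) or subprocess.Popen.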
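As a rough sketch of how this could serve the original goal (listing files/subdirectories with their sizes), the code below captures the listing with `subprocess.check_output` and parses it. The function names `list_hdfs` and `parse_ls`, the sample paths, and the owner/group names are all made up for illustration. The parser assumes the usual eight-column `hadoop fs -ls` layout (permissions, replication, owner, group, size, date, time, path) preceded by a "Found N items" header; check this against the output of your Hadoop version.

```python
import subprocess


def list_hdfs(path):
    """Return the raw text of `hadoop fs -ls <path>`.

    Assumes the hadoop client is on the PATH; raises CalledProcessError
    if the command exits with a non-zero status.
    """
    return subprocess.check_output(
        ["hadoop", "fs", "-ls", path], universal_newlines=True
    )


def parse_ls(text):
    """Parse ls output into (path, size_in_bytes, is_directory) tuples."""
    entries = []
    for line in text.splitlines():
        # At most 8 fields, so a path containing spaces stays in one piece.
        fields = line.split(None, 7)
        # Skip the "Found N items" header and any other non-entry line.
        if len(fields) < 8 or fields[0][0] not in "-d":
            continue
        perms, _repl, _owner, _group, size, _date, _time, name = fields
        entries.append((name, int(size), perms.startswith("d")))
    return entries


# Made-up sample in the assumed output format, so this runs without Hadoop.
SAMPLE = """Found 2 items
drwxr-xr-x   - alice hadoop          0 2017-03-01 12:00 /user/alice/logs
-rw-r--r--   3 alice hadoop       1234 2017-03-02 08:30 /user/alice/data.csv
"""

entries = parse_ls(SAMPLE)
for name, size, is_dir in entries:
    print("%s\t%d\t%s" % (name, size, "dir" if is_dir else "file"))
```

On a machine with Hadoop installed, `parse_ls(list_hdfs("hdfs://mydirectory"))` would replace the sample text. Directories are reported with size 0 in this listing, so per-directory totals would need a different command or a recursive walk.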
