HDFS file blocks distribution in a two-node cluster

Asked: 2014-07-18 14:57:04

Tags: hadoop hdfs

Environment

Hadoop: 0.20.205.0
Machines in cluster: 2 nodes
Replication: set to 1
DFS block size: 1MB
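For reference, these two settings can also be passed per command instead of being set cluster-wide; a minimal sketch, assuming the 0.20.x property names dfs.replication and dfs.block.size (the paths here are placeholders):

bin/hadoop fs -D dfs.replication=1 -D dfs.block.size=1048576 -put <localfile> <hdfsdir>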

I put a 7.4MB file into HDFS with the put command and ran fsck to check how the file's blocks were distributed across the datanodes. All 8 blocks of the file went to a single node. This skews the load distribution, and only that one node is ever used when mapred tasks run.

Is there a way to distribute the file's blocks across multiple datanodes?

bin/hadoop dfsadmin -report
Configured Capacity: 4621738717184 (4.2 TB)
Present Capacity: 2008281120783 (1.83 TB)
DFS Remaining: 2008281063424 (1.83 TB)
DFS Used: 57359 (56.01 KB)
DFS Used%: 0%
Under replicated blocks: 0
Blocks with corrupt replicas: 0
Missing blocks: 0

-------------------------------------------------
Datanodes available: 2 (6 total, 4 dead)

Name: 143.215.131.246:50010
Decommission Status : Normal
Configured Capacity: 2953506713600 (2.69 TB)
DFS Used: 28687 (28.01 KB)
Non DFS Used: 1022723801073 (952.49 GB)
DFS Remaining: 1930782883840(1.76 TB)
DFS Used%: 0%
DFS Remaining%: 65.37%
Last contact: Fri Jul 18 10:31:51 EDT 2014

bin/hadoop fs -put /scratch/rkannan3/hadoop/test/pg20417.txt /user/rkannan3

bin/hadoop fs -ls /user/rkannan3
Found 1 items
-rw-------   1 rkannan3 supergroup    7420270 2014-07-18 10:40 /user/rkannan3/pg20417.txt

bin/hadoop fsck /user/rkannan3 -files -blocks -locations
FSCK started by rkannan3 from /143.215.131.246 for path /user/rkannan3 at Fri Jul 18 10:43:13 EDT 2014
/user/rkannan3 <dir>
/user/rkannan3/pg20417.txt 7420270 bytes, 8 block(s):  OK <==== All the 8 blocks in one DN
0. blk_3659272467883498791_1006 len=1048576 repl=1 [143.215.131.246:50010]
1. blk_-5158259524162513462_1006 len=1048576 repl=1 [143.215.131.246:50010]
2. blk_8006160220823587653_1006 len=1048576 repl=1 [143.215.131.246:50010]
3. blk_4541732328753786064_1006 len=1048576 repl=1 [143.215.131.246:50010]
4. blk_-3236307221351862057_1006 len=1048576 repl=1 [143.215.131.246:50010]
5. blk_-6853392225410344145_1006 len=1048576 repl=1 [143.215.131.246:50010]
6. blk_-2293710893046611429_1006 len=1048576 repl=1 [143.215.131.246:50010]
7. blk_-1502992715991891710_1006 len=80238 repl=1 [143.215.131.246:50010]

2 Answers:

Answer 0 (score: 1)

If you want distribution at the file level, use a replication factor of at least 2. The first replica is always placed on the node where the writer runs (see the introductory paragraph of http://waset.org/publications/16836/optimizing-hadoop-block-placement-policy-and-cluster-blocks-distribution); a file usually has only one writer, so the first replica of each of the file's blocks always lands on that node. You probably don't want to change this behaviour, because it keeps the option open of raising the minimum split size when you want to avoid spawning too many mappers without losing data locality for them.
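As a minimal sketch, two ways to get a second replica onto the other datanode, using standard hadoop fs options and the file path from the question:

# Override dfs.replication for a single put:
bin/hadoop fs -D dfs.replication=2 -put /scratch/rkannan3/hadoop/test/pg20417.txt /user/rkannan3

# Or raise the replication factor of a file already in HDFS
# (-w waits until the extra replicas have actually been created):
bin/hadoop fs -setrep -w 2 /user/rkannan3/pg20417.txt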

Answer 1 (score: 0)

You have to use the Hadoop balancer command. Details below (tutorial link).

Balancer

Runs a cluster balancing utility. The rebalancing process can be stopped simply by pressing Ctrl-C. More details can be found here.

   Usage: hadoop balancer [-threshold <threshold>]

   -threshold <threshold>   Percentage of disk capacity. This overwrites the default threshold.
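For example, a possible invocation (the default threshold is 10):

   # Rebalance until each datanode's utilization is within 5 percentage
   # points of the cluster-wide average:
   bin/hadoop balancer -threshold 5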