Question

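I have set up a fully distributed Hadoop cluster with Apache Hive on top of it. I am loading data into a Hive table from Java code. The replication factor in hdfs-site.xml is 2. When I copy a file to HDFS with hadoop fs -put, the file shows 2 replicas. But files loaded into the Hive table show 3 replicas.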

Does Hive use a different replication parameter for the files it loads?

Answer 1

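To control the replication factor of a table's files when loading data into Hive, set the following property in the Hive client before running the load: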

SET dfs.replication=2;
LOAD DATA LOCAL ......;

Answer 2

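I finally found the reason for this behavior.

Before loading the file into the table, I was copying it from my local machine to HDFS with the following code: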

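import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

Configuration config = new Configuration();
config.set("fs.defaultFS", "hdfs://mycluster:8020");
FileSystem dfs = FileSystem.get(config);
Path src = new Path("D:\\testfile.txt");  // local source file
Path dst = new Path(dfs.getWorkingDirectory(), "testffileinHDFS.txt");  // target in the HDFS working directory
dfs.copyFromLocalFile(src, dst);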

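The copyFromLocalFile() API was keeping 3 replicas by default, even though I had set the replication factor to 2 in hdfs-site.xml. I don't know the reason for this behavior.

After explicitly specifying the replication factor in the code, as follows: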

Configuration config = new Configuration();
config.set("fs.defaultFS", "hdfs://mycluster:8020");
config.set("dfs.replication", "1");  // replication factor specified explicitly on the client
FileSystem dfs = FileSystem.get(config);
Path src = new Path("D:\\testfile.txt");  // local source file
Path dst = new Path(dfs.getWorkingDirectory(), "testffileinHDFS.txt");  // target in the HDFS working directory
dfs.copyFromLocalFile(src, dst);

there is now only one replica of the file in HDFS.
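
A likely explanation for the 3 replicas, offered as an assumption rather than something confirmed in this thread: dfs.replication is applied by the client that writes the file, and Hadoop's built-in default is 3. If the cluster's hdfs-site.xml is not on the Java client's classpath, new Configuration() never sees the value 2. The sketch below uses only the JDK to mimic that lookup. ReplicationLookup and effectiveReplication are hypothetical names, not Hadoop APIs, and the sketch simplifies by returning the first matching property.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper (not part of Hadoop): shows which replication factor
// a client would end up with depending on whether it can see a site file.
class ReplicationLookup {
    // Hadoop's built-in default for dfs.replication when nothing overrides it.
    static final int BUILT_IN_DEFAULT = 3;

    // Returns dfs.replication from a Hadoop-style site XML stream,
    // or the built-in default when the stream is null or lacks the property.
    static int effectiveReplication(InputStream siteXml) {
        if (siteXml == null) {
            return BUILT_IN_DEFAULT;  // client cannot see any hdfs-site.xml
        }
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(siteXml);
            NodeList props = doc.getElementsByTagName("property");
            for (int i = 0; i < props.getLength(); i++) {
                Element prop = (Element) props.item(i);
                String value = childText(prop, "value");
                if ("dfs.replication".equals(childText(prop, "name")) && value != null) {
                    return Integer.parseInt(value);
                }
            }
            return BUILT_IN_DEFAULT;  // file present, property not set
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not read site XML", e);
        }
    }

    // Trimmed text of the first <tag> element under parent, or null if missing.
    private static String childText(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent().trim();
    }

    public static void main(String[] args) {
        String xml = "<configuration><property>"
                + "<name>dfs.replication</name><value>2</value>"
                + "</property></configuration>";
        // No site file visible to the client: falls back to 3.
        System.out.println(effectiveReplication(null));
        // Site file visible: the configured value 2 is used.
        System.out.println(effectiveReplication(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8))));
    }
}
```

With a real client, the same effect is usually achieved either by making the cluster's hdfs-site.xml available on the client's classpath or by calling config.set("dfs.replication", ...) as in the code above.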

Replicas of files loaded into a Hive table