Hadoop: How do I merge reducer outputs into a single file?

merge java mapreduce hadoop hdfs

2022-09-03 07:14:13

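I know that the "getmerge" shell command can do this.

But what should I do if I want to merge these outputs after the job finishes, using the HDFS API from Java?

What I actually want is a single merged file on HDFS.

The only approach I can think of is to launch an additional job afterwards.

Thanks!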

Answer 1

"But what should I do if I want to merge these outputs after the job finishes, using the HDFS API from Java?"

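This is a guess, because I haven't tried it myself, but I think the method you are looking for is FileUtil.copyMerge, which is what FsShell invokes when you run the -getmerge command. FileUtil.copyMerge takes two FileSystem objects as arguments. FsShell uses FileSystem.getLocal to retrieve the destination file system, but I don't see any reason you couldn't instead call Path.getFileSystem on the destination path to obtain the output stream.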

That said, I don't think it gains you much: the merge still happens in the local JVM, so you are not saving much compared with running -getmerge followed by -put.
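To make the "merge happens in the local JVM" point concrete, here is a minimal sketch of the loop such a merge amounts to: list the part files, sort them by name, and stream each one into a single output. It is deliberately written against java.nio on local paths so that it runs without a cluster, so it is a stand-in and not the Hadoop implementation. The class name PartFileMerger and the "part-" prefix filter are assumptions made for illustration. On HDFS the same shape would use FileSystem.listStatus, FileSystem.open, FileSystem.create and IOUtils.copyBytes, which may be the more practical route anyway, since as far as I know FileUtil.copyMerge is no longer available in Hadoop 3.x.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.stream.Stream;

// Hypothetical helper: local-filesystem stand-in for an HDFS part-file merge.
class PartFileMerger {

    /**
     * Concatenates every regular file named part-* in sourceDir, in name
     * order, into destination. Returns the number of bytes written.
     */
    static long merge(Path sourceDir, Path destination) throws IOException {
        List<Path> parts = new ArrayList<>();
        // On HDFS this listing step would be FileSystem.listStatus(sourceDir).
        try (Stream<Path> entries = Files.list(sourceDir)) {
            entries.filter(Files::isRegularFile)
                   .filter(p -> p.getFileName().toString().startsWith("part-"))
                   .forEach(parts::add);
        }
        // part-r-00000, part-r-00001, ... sort correctly as plain strings.
        Collections.sort(parts);

        long total = 0;
        // On HDFS: FileSystem.create(destination) for the single output stream.
        try (OutputStream out = Files.newOutputStream(destination)) {
            for (Path part : parts) {
                // On HDFS: FileSystem.open(part) plus IOUtils.copyBytes(...).
                total += Files.copy(part, out);
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: PartFileMerger <sourceDir> <destinationFile>");
            return;
        }
        long bytes = merge(Paths.get(args[0]), Paths.get(args[1]));
        System.out.println("merged " + bytes + " bytes");
    }
}
```

Every byte passes through the process running this loop, which is why an API-level merge saves little over -getmerge followed by -put. The destination should live outside the source directory, or at least not be named part-*, so that it is not picked up as an input.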

Answer 2

You can get a single output file by configuring a single reducer in your code.

job.setNumReduceTasks(1);

This will meet your requirement, but it is costly, because all reduce work then runs in a single task.

Or

Static method to execute a shell command. 
Covers most of the simple cases without requiring the user to implement the Shell interface.

Parameters:
env the map of environment key=value
cmd shell command to execute.
Returns:
the output of the executed command.

org.apache.hadoop.util.Shell.execCommand(String[])
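As a rough illustration of the shell route, the sketch below runs -getmerge and then -put from Java. It swaps in the standard library's ProcessBuilder for Shell.execCommand so that it compiles without Hadoop jars; the overall shape (pass an argument array, get the command output back as a string) is the same. The class and method names are hypothetical, and it assumes a `hadoop` executable is on the PATH of the machine running it.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: drives "hadoop fs" commands from Java.
class GetMergeViaShell {

    /** Argument list for: hadoop fs -getmerge <hdfsDir> <localFile> */
    static List<String> getMergeCommand(String hdfsDir, String localFile) {
        return Arrays.asList("hadoop", "fs", "-getmerge", hdfsDir, localFile);
    }

    /** Argument list for: hadoop fs -put <localFile> <hdfsFile> */
    static List<String> putCommand(String localFile, String hdfsFile) {
        return Arrays.asList("hadoop", "fs", "-put", localFile, hdfsFile);
    }

    /** Runs a command, returns its combined stdout/stderr, fails on non-zero exit. */
    static String run(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command).redirectErrorStream(true).start();
        ByteArrayOutputStream captured = new ByteArrayOutputStream();
        try (InputStream in = process.getInputStream()) {
            byte[] buffer = new byte[8192];
            int n;
            while ((n = in.read(buffer)) != -1) {
                captured.write(buffer, 0, n);
            }
        }
        int exitCode = process.waitFor();
        String output = new String(captured.toByteArray(), StandardCharsets.UTF_8);
        if (exitCode != 0) {
            throw new IOException("command " + command + " exited with " + exitCode + ": " + output);
        }
        return output;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 3) {
            System.err.println("usage: GetMergeViaShell <hdfsOutputDir> <localTempFile> <hdfsMergedFile>");
            return;
        }
        // Step 1: pull all part files into one local file.
        run(getMergeCommand(args[0], args[1]));
        // Step 2: push the merged file back to HDFS.
        run(putCommand(args[1], args[2]));
    }
}
```

This route needs local disk space for the merged file and moves the data over the network twice, once down and once back up.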