Should spilled records in Hadoop MapReduce always equal the map input records or map output records?

Date: 2011-12-14 12:38:36

Tags: hadoop mapreduce

I am working through the matrix multiplication example using MapReduce in Hadoop. I would like to ask whether the number of spilled records should always equal the map input and map output records. In my runs, the spilled records differ from both the map input and the map output records.

Here is the output from one of my tests:

Three by three test
   IB = 1
   KB = 2
   JB = 1
11/12/14 13:16:22 INFO input.FileInputFormat: Total input paths to process : 2
11/12/14 13:16:22 INFO mapred.JobClient: Running job: job_201112141153_0003
11/12/14 13:16:23 INFO mapred.JobClient:  map 0% reduce 0%
11/12/14 13:16:32 INFO mapred.JobClient:  map 100% reduce 0%
11/12/14 13:16:44 INFO mapred.JobClient:  map 100% reduce 100%
11/12/14 13:16:46 INFO mapred.JobClient: Job complete: job_201112141153_0003
11/12/14 13:16:46 INFO mapred.JobClient: Counters: 17
11/12/14 13:16:46 INFO mapred.JobClient:   Job Counters
11/12/14 13:16:46 INFO mapred.JobClient:     Launched reduce tasks=1
11/12/14 13:16:46 INFO mapred.JobClient:     Launched map tasks=2
11/12/14 13:16:46 INFO mapred.JobClient:     Data-local map tasks=2
11/12/14 13:16:46 INFO mapred.JobClient:   FileSystemCounters
11/12/14 13:16:46 INFO mapred.JobClient:     FILE_BYTES_READ=1464
11/12/14 13:16:46 INFO mapred.JobClient:     HDFS_BYTES_READ=528
11/12/14 13:16:46 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=2998
11/12/14 13:16:46 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=384
11/12/14 13:16:46 INFO mapred.JobClient:   Map-Reduce Framework
11/12/14 13:16:46 INFO mapred.JobClient:     Reduce input groups=36
11/12/14 13:16:46 INFO mapred.JobClient:     Combine output records=0
11/12/14 13:16:46 INFO mapred.JobClient:     Map input records=18
11/12/14 13:16:46 INFO mapred.JobClient:     Reduce shuffle bytes=735
11/12/14 13:16:46 INFO mapred.JobClient:     Reduce output records=15
11/12/14 13:16:46 INFO mapred.JobClient:     Spilled Records=108
11/12/14 13:16:46 INFO mapred.JobClient:     Map output bytes=1350
11/12/14 13:16:46 INFO mapred.JobClient:     Combine input records=0
11/12/14 13:16:46 INFO mapred.JobClient:     Map output records=54
11/12/14 13:16:46 INFO mapred.JobClient:     Reduce input records=54
11/12/14 13:16:46 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
11/12/14 13:16:46 INFO input.FileInputFormat: Total input paths to process : 1
11/12/14 13:16:46 INFO mapred.JobClient: Running job: job_local_0001
11/12/14 13:16:46 INFO input.FileInputFormat: Total input paths to process : 1
11/12/14 13:16:46 INFO mapred.MapTask: io.sort.mb = 100
11/12/14 13:16:46 INFO mapred.MapTask: data buffer = 79691776/99614720
11/12/14 13:16:46 INFO mapred.MapTask: record buffer = 262144/327680
11/12/14 13:16:46 INFO mapred.MapTask: Starting flush of map output
11/12/14 13:16:46 INFO mapred.MapTask: Finished spill 0
11/12/14 13:16:46 INFO mapred.TaskRunner: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
11/12/14 13:16:46 INFO mapred.LocalJobRunner:
11/12/14 13:16:46 INFO mapred.TaskRunner: Task 'attempt_local_0001_m_000000_0' done.
11/12/14 13:16:46 INFO mapred.LocalJobRunner:
11/12/14 13:16:46 INFO mapred.Merger: Merging 1 sorted segments
11/12/14 13:16:46 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 128 bytes
11/12/14 13:16:46 INFO mapred.LocalJobRunner:
11/12/14 13:16:46 INFO mapred.TaskRunner: Task:attempt_local_0001_r_000000_0 is done. And is in the process of commiting
11/12/14 13:16:46 INFO mapred.LocalJobRunner:
11/12/14 13:16:46 INFO mapred.TaskRunner: Task attempt_local_0001_r_000000_0 is allowed to commit now
11/12/14 13:16:46 INFO output.FileOutputCommitter: Saved output of task 'attempt_local_0001_r_000000_0' to hdfs://localhost:9000/tmp/MatrixMultiply/out
11/12/14 13:16:46 INFO mapred.LocalJobRunner: reduce > reduce
11/12/14 13:16:46 INFO mapred.TaskRunner: Task 'attempt_local_0001_r_000000_0' done.
11/12/14 13:16:47 INFO mapred.JobClient:  map 100% reduce 100%
11/12/14 13:16:47 INFO mapred.JobClient: Job complete: job_local_0001
11/12/14 13:16:47 INFO mapred.JobClient: Counters: 14
11/12/14 13:16:47 INFO mapred.JobClient:   FileSystemCounters
11/12/14 13:16:47 INFO mapred.JobClient:     FILE_BYTES_READ=89412
11/12/14 13:16:47 INFO mapred.JobClient:     HDFS_BYTES_READ=37206
11/12/14 13:16:47 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=37390
11/12/14 13:16:47 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=164756
11/12/14 13:16:47 INFO mapred.JobClient:   Map-Reduce Framework
11/12/14 13:16:47 INFO mapred.JobClient:     Reduce input groups=9
11/12/14 13:16:47 INFO mapred.JobClient:     Combine output records=9
11/12/14 13:16:47 INFO mapred.JobClient:     Map input records=15
11/12/14 13:16:47 INFO mapred.JobClient:     Reduce shuffle bytes=0
11/12/14 13:16:47 INFO mapred.JobClient:     Reduce output records=9
11/12/14 13:16:47 INFO mapred.JobClient:     Spilled Records=18
11/12/14 13:16:47 INFO mapred.JobClient:     Map output bytes=180
11/12/14 13:16:47 INFO mapred.JobClient:     Combine input records=15
11/12/14 13:16:47 INFO mapred.JobClient:     Map output records=15
11/12/14 13:16:47 INFO mapred.JobClient:     Reduce input records=9
...........X[0][0]=30, Y[0][0]=9
Bad Answer
...........X[0][1]=36, Y[0][1]=36
...........X[0][2]=42, Y[0][2]=42
...........X[1][0]=66, Y[1][0]=24
Bad Answer
...........X[1][1]=81, Y[1][1]=81
...........X[1][2]=96, Y[1][2]=96
...........X[2][0]=102, Y[2][0]=39
Bad Answer
...........X[2][1]=126, Y[2][1]=126
...........X[2][2]=150, Y[2][2]=150 

The example and its code are described here:

http://www.norstad.org/matrix-multiply/index.html

Can you tell me where the problem is and how I can fix it? Thanks.

WL

1 answer:

Answer 0 (score: 5)

According to Hadoop: The Definitive Guide, "Spilled Records" counts the total number of records that were spilled to disk over the course of the job, on both the map and the reduce side. A "Spilled Records" count of zero is perfectly fine; the counter does not have to match the map input or map output records. Spilling generally means you have exceeded the amount of memory available for the map output buffer, and a small number of spilled records is usually not a problem. The amount of memory available is controlled by io.sort.mb and io.sort.spill.percent in mapred-site.xml. If performance is a concern, you may want to tune these settings to minimize spilling; the presentation Optimizing MapReduce Job Performance has more details, in particular slides #12 and #13.

If you spill more than once, you pay an extra 3x I/O penalty because the spill files have to be merged. A rule of thumb: if "Spilled Records" exceeds "Map output records" + "Reduce output records", then you spilled more than once. Note that the buffer size is ultimately limited by the Java VM's heap size, so you may need to grow the cluster, or increase the number of map tasks by increasing the number of input splits for the job, in order to reduce the number of spills.
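If you would rather override these two settings per job than cluster-wide in mapred-site.xml, a minimal sketch using the old (0.20-era) API that your logs suggest might look like the following; the class name and the values 200 and 0.90 are placeholders for illustration, not recommendations:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class SpillTuning {  // hypothetical driver class
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // In-memory map output sort buffer, in MB (default 100).
            conf.set("io.sort.mb", "200");
            // Fraction of the buffer that fills before a background spill starts (default 0.80).
            conf.set("io.sort.spill.percent", "0.90");
            Job job = new Job(conf, "matrix-multiply");
            // ... configure mapper, reducer, and input/output paths as usual, then:
            // job.waitForCompletion(true);
        }
    }

Raising io.sort.mb only helps up to the task's JVM heap size, which is the ultimate limit mentioned above.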

In your specific example, "Spilled Records" is larger than that sum, so you spilled more than once.
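Concretely, applying the rule of thumb to the counters you posted: for job_201112141153_0003, Spilled Records = 108 while Map output records + Reduce output records = 54 + 15 = 69, and 108 > 69, so that job spilled more than once. For the local job, 18 < 15 + 9 = 24, so it spilled at most once.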