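I am trying to read web logs from the following directory on HDFS. The weblogs directory contains many log files, but for some reason only the first one, 2013-09-15.log, is read and the rest are ignored, as the log excerpt below shows.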
webrdd1 = sc.textFile('hdfs://localhost:9000/HDFSHomeDir/data/weblogs/*').keyBy(lambda line: fun_findDocid(line)).filter(lambda x: x[0])
Excerpt from the log:
8/07/15 11:24:13 INFO DAGScheduler: Submitting ResultStage 0 (PythonRDD[2] at RDD at PythonRDD.scala:48), which has no missing parents
18/07/15 11:24:13 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 7.4 KB, free 366.0 MB)
18/07/15 11:24:13 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 4.8 KB, free 366.0 MB)
18/07/15 11:24:13 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on kuchibaby:49665 (size: 4.8 KB, free: 366.3 MB)
18/07/15 11:24:13 INFO SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:1039
18/07/15 11:24:13 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 0 (PythonRDD[2] at RDD at PythonRDD.scala:48) (first 15 tasks are for partitions Vector(0))
18/07/15 11:24:13 INFO TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
18/07/15 11:24:13 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, executor driver, partition 0, ANY, 7908 bytes)
18/07/15 11:24:13 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
18/07/15 11:24:13 INFO HadoopRDD: Input split: hdfs://localhost:9000/HDFSHomeDir/data/weblogs/2013-09-15.log:0+539134
[Stage 0:> (0 + 1) / 1]
18/07/15 11:24:14 INFO PythonRunner: Times: total = 548, boot = 470, init = 78, finish = 0
18/07/15 11:24:14 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 3497 bytes result sent to driver
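For checking which files a job actually touched, the `HadoopRDD: Input split` and `Submitting N missing tasks` entries in the driver log can be collected programmatically. The following is only a minimal sketch that assumes log lines in the format shown above; the function name `summarize_driver_log` and the two regular expressions are hypothetical helpers written for illustration, not part of Spark.

```python
import re

# Matches lines such as:
#   ... INFO HadoopRDD: Input split: hdfs://host/path/file.log:0+539134
# (assumed format: <path>:<start offset>+<length>)
SPLIT_RE = re.compile(r"HadoopRDD: Input split: (\S+):(\d+)\+(\d+)")

# Matches lines such as:
#   ... INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 0 ...
TASKS_RE = re.compile(r"Submitting (\d+) missing tasks from (\w+ \d+)")


def summarize_driver_log(log_text):
    """Return (splits, task_counts) extracted from Spark driver log text.

    splits      -- list of (path, start_offset, length) tuples, in log order
    task_counts -- dict mapping a stage name (e.g. "ResultStage 0") to the
                   number of tasks submitted for it
    """
    splits = [
        (path, int(start), int(length))
        for path, start, length in SPLIT_RE.findall(log_text)
    ]
    task_counts = {stage: int(n) for n, stage in TASKS_RE.findall(log_text)}
    return splits, task_counts


if __name__ == "__main__":
    # Two lines copied from the excerpt above.
    sample = (
        "18/07/15 11:24:13 INFO DAGScheduler: Submitting 1 missing tasks from "
        "ResultStage 0 (PythonRDD[2] at RDD at PythonRDD.scala:48) "
        "(first 15 tasks are for partitions Vector(0))\n"
        "18/07/15 11:24:13 INFO HadoopRDD: Input split: "
        "hdfs://localhost:9000/HDFSHomeDir/data/weblogs/2013-09-15.log:0+539134\n"
    )
    splits, task_counts = summarize_driver_log(sample)
    for path, start, length in splits:
        print(path, start, length)
    print(task_counts)
```

Run against the excerpt, this reports a single input split (2013-09-15.log) and a single task for ResultStage 0, which matches the log: the stage was submitted with one task, for partition 0 only.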