apache-spark - FileStream cannot read all existing files from HDFS

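I need to read files from an HDFS folder, and I am using the code below to do so. It reads files that were created within the last minute, but it does not read existing files that are older than one minute.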

import org.apache.hadoop.fs.Path

// Accept every path; the println only traces which files the stream inspects.
val filterF: Path => Boolean = { path =>
  println(s"Checking whether $path should be considered")
  true
}

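import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// inputPath is the HDFS directory to watch, e.g. "/user/cust/sample".
def processStream(inputPath: String): Unit = {
  // newFilesOnly = false: also consider files that already exist in the directory.
  val messages = streamingContext
    .fileStream[LongWritable, Text, TextInputFormat](inputPath, filterF, false)
    .map { case (_, line) => line.toString }
  val words = messages.flatMap(_.split(" "))
  val wordCount = words.map(word => (word, 1)).reduceByKey(_ + _)
  wordCount.print()
}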

Can anyone help?

Thanks.
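
A possible explanation, offered as an assumption rather than a confirmed diagnosis for this thread: even with newFilesOnly set to false, Spark Streaming's file stream only considers files whose modification time falls inside its remember window. That window is controlled by spark.streaming.fileStream.minRememberDuration and defaults to 60s, which would match the one-minute cutoff described above. A minimal configuration sketch follows; the app name, the "7d" window, and the 10-second batch interval are illustrative values, not taken from the question.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Assumption: widening the remember window lets fileStream pick up older files.
val conf = new SparkConf()
  .setAppName("FileStreamExistingFiles") // hypothetical app name
  // Illustrative value: files modified within the last 7 days stay eligible.
  .set("spark.streaming.fileStream.minRememberDuration", "7d")

// Illustrative batch interval; use whatever the job already uses.
val streamingContext = new StreamingContext(conf, Seconds(10))
```

The window needs to be at least as long as the age of the oldest file that should be picked up. A larger window also means the stream keeps track of more file names, so it is probably worth setting it no wider than necessary.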

0 answers: