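Stream input to an external process in Scala

Asked: 2015-01-22 17:52:08

Tags: scala streaming

I have an Iterable[String] that I want to pipe to an external process, getting an Iterable[String] back for the output.

I feel like this should work, since it compiles: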

import scala.sys.process._

object PipeUtils {
  implicit class IteratorStream(s: TraversableOnce[String]) {
    def pipe(cmd: String) = s.toStream.#>(cmd).lines
    def run(cmd: String) = s.toStream.#>(cmd).!
  }
}

However, Scala tries to execute the contents of s instead of passing them to standard input. Can anyone tell me what I am doing wrong?

UPDATE:

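I think my original problem was that s.toStream was being implicitly converted to a ProcessBuilder and then executed. That is wrong, because it is meant to be the input to the process.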
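To illustrate that conversion: with `scala.sys.process._` in scope, a `Seq[String]` is implicitly turned into a `ProcessBuilder` (through `stringSeqToProcess`), where the first element is the command and the rest are its arguments. The following is only a minimal sketch; it assumes a Unix-like system with `echo` on the PATH, and the names `SeqAsCommand` and `runSeq` are made up for illustration.

```scala
import scala.sys.process._

object SeqAsCommand {
  // The Seq is NOT treated as input data here: the implicit conversion
  // turns it into a command line, so Seq("echo", "hello") runs `echo hello`.
  def runSeq(s: Seq[String]): String = s.!!

  def main(args: Array[String]): Unit =
    print(runSeq(Seq("echo", "hello")))
}
```

This is why the elements of s end up being executed rather than written to the process's standard input.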

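I have come up with the following solution. It feels very hacky and wrong, but it seems to work for now. I am not writing this up as an answer because I feel the answer should be one line, not this gigantic thing.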

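import java.io.OutputStream
import scala.collection.mutable.ListBuffer
import scala.io.Source
import scala.sys.process._

object PipeUtils {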

  /**
   * This class feels wrong.  I think that for the pipe command it actually loads all of the output
   * into memory.  This could blow up the machine if used wrong, however, I cannot figure out how to get it to
   * work properly.  Hopefully http://stackoverflow.com/questions/28095469/stream-input-to-external-process-in-scala
   * will get some good responses.
   * @param s
   */
  implicit class IteratorStream(s: TraversableOnce[String]) {

    val in = (in: OutputStream) => {
      s.foreach(x => in.write((x + "\n").getBytes))
      in.close
    }

    def pipe(cmd: String) = {
      val output = ListBuffer[String]()
      val io = new ProcessIO(in,
      out => {Source.fromInputStream(out).getLines.foreach(output += _)},
      err => {Source.fromInputStream(err).getLines.foreach(println)})

      cmd.run(io).exitValue
      output.toIterable
    }

    def run(cmd: String) = {
      cmd.run(BasicIO.standard(in)).exitValue
    }
  }
}

EDIT

The motivation for this comes from using Spark's .pipe function on an RDD. I want exactly the same functionality in my local code.

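2 answers:

Answer 0 (score: 4)

Assuming Scala 2.11+, you should use lineStream as suggested by @edi. The reason is that you get a streaming response, with each line delivered as it becomes available, instead of a batched response. Let's say I have a shell script echo-sleep.sh: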

#!/usr/bin/env bash
# echo-sleep.sh
while read -r line; do echo "$line"; sleep 1; done

And we want to call it from Scala with code like the following:

import scala.sys.process._
import scala.language.postfixOps
import java.io.ByteArrayInputStream

implicit class X(in: TraversableOnce[String]) {
  // Don't build a ByteArrayInputStream like this in real code (it buffers all input).  Just for illustration.
  def pipe(cmd: String) = 
    cmd #< new ByteArrayInputStream(in.mkString("\n").getBytes) lineStream
}

Then, if we make a final call like:

1 to 10 map (_.toString) pipe "echo-sleep.sh" foreach println

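The numbers in the sequence appear on STDOUT one per second. If you buffer and convert to an Iterable, as in your example, you lose this responsiveness.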
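On Scala 2.13, `lineStream` is deprecated in favor of `lazyLines`, which returns a `LazyList[String]`. The following is only a sketch of the same idea under that assumption; it uses `cat` as a stand-in for echo-sleep.sh, and `LazyLinesSketch` and its `pipe` helper are hypothetical names, not part of any library.

```scala
import java.io.ByteArrayInputStream
import scala.sys.process._

object LazyLinesSketch {
  // Joins the input into one buffer (for illustration only, as above),
  // feeds it to the command's stdin and returns its stdout lazily.
  def pipe(in: IterableOnce[String], cmd: Seq[String]): LazyList[String] = {
    val bytes = in.iterator.map(_ + "\n").mkString.getBytes("UTF-8")
    (cmd #< new ByteArrayInputStream(bytes)).lazyLines
  }

  def main(args: Array[String]): Unit =
    pipe((1 to 3).map(_.toString), Seq("cat")).foreach(println)
}
```

The output side is lazy here, but the input side is still fully buffered before the process starts.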

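Answer 1 (score: 3)

Here is a solution that demonstrates how to write the process code so that both the input and the output are streamed. The key is to produce a java.io.PipedInputStream that is passed to the process as its input. This stream is filled asynchronously from the iterator through a java.io.PipedOutputStream. Obviously, feel free to change the input type of the implicit class to an Iterable.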
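The piped-stream mechanism can be seen in isolation, without any external process. This is a minimal sketch assuming Scala 2.12+ (for the lambda-to-Runnable conversion); `PipedPairSketch` and `roundTrip` are illustrative names only.

```scala
import java.io.{BufferedReader, InputStreamReader, PipedInputStream, PipedOutputStream, PrintWriter}

object PipedPairSketch {
  // Feeds `items` into a PipedOutputStream on a background thread and
  // reads them back, line by line, from the connected PipedInputStream.
  def roundTrip(items: Iterator[String]): List[String] = {
    val pos = new PipedOutputStream
    val pis = new PipedInputStream(pos)

    val writer = new Thread(() => {
      val w = new PrintWriter(pos, true)
      try items.foreach(s => w.println(s))
      finally w.close() // closing signals end-of-stream to the reader
    })
    writer.start()

    val reader = new BufferedReader(new InputStreamReader(pis))
    val lines = Iterator.continually(reader.readLine()).takeWhile(_ != null).toList
    writer.join()
    lines
  }

  def main(args: Array[String]): Unit =
    roundTrip(Iterator("a", "b", "c")).foreach(println)
}
```

The writer and the reader must run on different threads, because the pipe has a bounded buffer and a single thread doing both could block on itself.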

Here is an iterator used to show that this works.

/**
 * An iterator with pauses used to illustrate data streaming to the process to be run.
 */
class PausingIterator[A](zero: A, until: A, pauseMs: Int)(subsequent: A => A) 
extends Iterator[A] {
  private[this] var current = zero
  def hasNext = current != until
  def next(): A = {
    if (!hasNext) throw new NoSuchElementException
    val r = current
    current = subsequent(current)
    Thread.sleep(pauseMs)
    r
  }
}

Here is the actual code you want:

import java.io.PipedOutputStream
import java.io.PipedInputStream
import java.io.InputStream
import java.io.PrintWriter

// For process stuff
import scala.sys.process._
import scala.language.postfixOps

// For asynchronous stream writing.
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.Future

/**
 * A streaming version of the original class.  This does not block to wait for the entire 
 * input or output to be constructed.  This allows the process to get data ASAP and allows 
 * the process to return information back to the scala environment ASAP.  
 *
 * NOTE: Don't forget about error handling in the final production code.
 */
implicit class X(it: Iterator[String]) {
  def pipe(cmd: String) = cmd #< iter2is(it) lineStream

  /**
   * Convert an iterator to an InputStream for use in the pipe function.
   * @param it an iterator to convert
   */
  private[this] def iter2is[A](it: Iterator[A]): InputStream = {
    // What is written to the output stream will appear in the input stream.
    val pos = new PipedOutputStream
    val pis = new PipedInputStream(pos)
    val w = new PrintWriter(pos, true)

    // Scala 2.11 (on Scala 2.10, use 'future').  Executes asynchronously.
    // Fill the stream, then close.
    Future {
      it foreach w.println
      w.close
    }

    // Return possibly before pis is fully written to.
    pis
  }
}

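The final call displays 0 through 9, pausing for 3 seconds between numbers (a 2-second pause on the Scala side plus a 1-second pause on the shell script side).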

// echo-sleep.sh is the same script as in my previous post
new PausingIterator(0, 10, 2000)(_ + 1)
  .map(_.toString)
  .pipe("echo-sleep.sh")
  .foreach(println)

Output

0          [ pause 3 secs ]
1          [ pause 3 secs ]
...
8          [ pause 3 secs ]
9          [ pause 3 secs ]