spark:读取镶木地板文件并进行处理

时间:2016-12-09 11:39:27

标签: scala apache-spark parquet

我是Spark 1.6的新手。我想读一个镶木地板文件并加工它。 为了简化,假设有这种结构的镶木地板:

id, amount, label

我有3条规则:

amount < 10000 => label=LOW
10000 < amount < 100000 => label=MEDIUM
amount > 1000000 => label = HIGH

如何在spark和scala中做到这一点?

我尝试这样的事情:

case class SampleModels(
  id: String,
  amount: Double,
  label: String,
)

val sc = SparkContext.getOrCreate()
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

val df = sqlContext.read.parquet("/path/file/")
val ds = df.as[SampleModels].map( row=>
    MY LOGIC 
    WRITE OUTPUT IN PARQUET
)

这是正确的做法吗?它有效吗? “MYLOGIC”可能更复杂。

由于

2 个答案:

答案 0 :(得分:1)

是的,这是使用spark的正确方法。如果你的逻辑很简单,你可以尝试使用内置函数直接操作数据帧(比如你的when),它会比将行映射到case类并在jvm中执行代码快一点并且您可以轻松地将结果保存回实木复合地板。

答案 1 :(得分:0)

是的,这是正确的做法。 它会对您的完整数据进行一次传递,以构建您需要的额外列。

如果你想要一种sql方式,这就是你要走的路,

    [12/9/16 10:42:56:432 MST] 0000008c ServletWrappe E com.ibm.ws.webcontainer.servlet.ServletWrapper service SRVE0014E: Uncaught service() exception root cause dispatcher: javax.servlet.ServletException: Could not resolve view with name 'tiles_sign_on' in servlet with name 'dispatcher'
    at org.springframework.web.servlet.DispatcherServlet.render(DispatcherServlet.java:1237)
    at org.springframework.web.servlet.DispatcherServlet.processDispatchResult(DispatcherServlet.java:1037)
    at org.springframework.web.servlet.DispatcherServlet.doDispatch(DispatcherServlet.java:980)
    at org.springframework.web.servlet.DispatcherServlet.doService(DispatcherServlet.java:897)
    at org.springframework.web.servlet.FrameworkServlet.processRequest(FrameworkServlet.java:970)
    at org.springframework.web.servlet.FrameworkServlet.doGet(FrameworkServlet.java:861)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:575)
    at org.springframework.web.servlet.FrameworkServlet.service(FrameworkServlet.java:846)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:668)
    at com.ibm.ws.webcontainer.servlet.ServletWrapper.service(ServletWrapper.java:1232)
    at com.ibm.ws.webcontainer.servlet.ServletWrapper.handleRequest(ServletWrapper.java:781)
    at com.ibm.ws.webcontainer.servlet.ServletWrapper.handleRequest(ServletWrapper.java:480)
    at com.ibm.ws.webcontainer.servlet.ServletWrapperImpl.handleRequest(ServletWrapperImpl.java:178)
    at com.ibm.ws.webcontainer.filter.WebAppFilterChain.invokeTarget(WebAppFilterChain.java:136)
    at com.ibm.ws.webcontainer.filter.WebAppFilterChain.doFilter(WebAppFilterChain.java:79)
    at com.ibm.ws.webcontainer.filter.WebAppFilterManager.doFilter(WebAppFilterManager.java:967)
    at com.ibm.ws.webcontainer.filter.WebAppFilterManager.invokeFilters(WebAppFilterManager.java:1107)
    at com.ibm.ws.webcontainer.webapp.WebAppRequestDispatcher.dispatch(WebAppRequestDispatcher.java:1385)
    at com.ibm.ws.webcontainer.webapp.WebAppRequestDispatcher.forward(WebAppRequestDispatcher.java:194)
    at com.ibm.ws.webcontainer.extension.DefaultExtensionProcessor.handleRequest(DefaultExtensionProcessor.java:681)
    at com.ibm.ws.webcontainer.extension.DefaultExtensionProcessor.service(DefaultExtensionProcessor.java:1483)
    at com.ibm.ws.webcontainer.servlet.ServletWrapper.service(ServletWrapper.java:1232)
    at com.ibm.ws.webcontainer.servlet.ServletWrapper.handleRequest(ServletWrapper.java:781)
    at com.ibm.ws.webcontainer.servlet.ServletWrapper.handleRequest(ServletWrapper.java:480)
    at com.ibm.ws.webcontainer.servlet.ServletWrapperImpl.handleRequest(ServletWrapperImpl.java:178)
    at com.ibm.ws.webcontainer.filter.WebAppFilterChain.invokeTarget(WebAppFilterChain.java:136)
    at com.ibm.ws.webcontainer.filter.WebAppFilterChain.doFilter(WebAppFilterChain.java:79)
    at com.ibm.ws.webcontainer.filter.WebAppFilterManager.doFilter(WebAppFilterManager.java:967)
    at com.ibm.ws.webcontainer.filter.WebAppFilterManager.invokeFilters(WebAppFilterManager.java:1107)
    at com.ibm.ws.webcontainer.webapp.WebAppRequestDispatcher.dispatch(WebAppRequestDispatcher.java:1385)
    at com.ibm.ws.webcontainer.webapp.WebAppRequestDispatcher.forward(WebAppRequestDispatcher.java:194)
    at org.springframework.web.servlet.resource.DefaultServletHttpRequestHandler.handleRequest(DefaultServletHttpRequestHandler.java:122)
    at org.springframework.web.servlet.mvc.HttpRequestHandlerAdapter.handle(HttpRequestHandlerAdapter.java:51)
    at org.springframework.web.servlet.DispatcherServlet.doDispatch(DispatcherServlet.java:963)
    at org.springframework.web.servlet.DispatcherServlet.doService(DispatcherServlet.java:897)
    at org.springframework.web.servlet.FrameworkServlet.processRequest(FrameworkServlet.java:970)
    at org.springframework.web.servlet.FrameworkServlet.doGet(FrameworkServlet.java:861)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:575)
    at org.springframework.web.servlet.FrameworkServlet.service(FrameworkServlet.java:846)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:668)
    at com.ibm.ws.webcontainer.servlet.ServletWrapper.service(ServletWrapper.java:1232)
    at com.ibm.ws.webcontainer.servlet.ServletWrapper.handleRequest(ServletWrapper.java:781)
    at com.ibm.ws.webcontainer.servlet.ServletWrapper.handleRequest(ServletWrapper.java:480)
    at com.ibm.ws.webcontainer.servlet.ServletWrapperImpl.handleRequest(ServletWrapperImpl.java:178)
    at com.ibm.ws.webcontainer.filter.WebAppFilterChain.invokeTarget(WebAppFilterChain.java:136)
    at com.ibm.ws.webcontainer.filter.WebAppFilterChain.doFilter(WebAppFilterChain.java:79)
    at com.ibm.ws.webcontainer.filter.WebAppFilterManager.doFilter(WebAppFilterManager.java:967)
    at com.ibm.ws.webcontainer.filter.WebAppFilterManager.invokeFilters(WebAppFilterManager.java:1107)
    at com.ibm.ws.webcontainer.servlet.CacheServletWrapper.handleRequest(CacheServletWrapper.java:87)
    at com.ibm.ws.webcontainer.WebContainer.handleRequest(WebContainer.java:948)
    at com.ibm.ws.webcontainer.WSWebContainer.handleRequest(WSWebContainer.java:1817)
    at com.ibm.ws.webcontainer.channel.WCChannelLink.ready(WCChannelLink.java:200)
    at com.ibm.ws.http.channel.inbound.impl.HttpInboundLink.handleDiscrimination(HttpInboundLink.java:463)
    at com.ibm.ws.http.channel.inbound.impl.HttpInboundLink.handleNewRequest(HttpInboundLink.java:530)
    at com.ibm.ws.http.channel.inbound.impl.HttpInboundLink.processRequest(HttpInboundLink.java:316)
    at com.ibm.ws.http.channel.inbound.impl.HttpInboundLink.ready(HttpInboundLink.java:287)
    at com.ibm.ws.tcp.channel.impl.NewConnectionInitialReadCallback.sendToDiscriminators(NewConnectionInitialReadCallback.java:214)
    at com.ibm.ws.tcp.channel.impl.NewConnectionInitialReadCallback.complete(NewConnectionInitialReadCallback.java:113)
    at com.ibm.ws.tcp.channel.impl.AioReadCompletionListener.futureCompleted(AioReadCompletionListener.java:175)
    at com.ibm.io.async.AbstractAsyncFuture.invokeCallback(AbstractAsyncFuture.java:217)
    at com.ibm.io.async.AsyncChannelFuture.fireCompletionActions(AsyncChannelFuture.java:161)
    at com.ibm.io.async.AsyncFuture.completed(AsyncFuture.java:138)
    at com.ibm.io.async.ResultHandler.complete(ResultHandler.java:204)
    at com.ibm.io.async.ResultHandler.runEventProcessingLoop(ResultHandler.java:775)
    at com.ibm.io.async.ResultHandler$2.run(ResultHandler.java:905)
    at com.ibm.ws.util.ThreadPool$Worker.run(ThreadPool.java:1892)

[12/9/16 10:42:56:432 MST] 0000008c ServletWrappe E com.ibm.ws.webcontainer.servlet.ServletWrapper service SRVE0014E: Uncaught service() exception root cause SimpleFileServlet: javax.servlet.ServletException: javax.servlet.ServletException: Could not resolve view with name 'tiles_sign_on' in servlet with name 'dispatcher'
    at com.ibm.ws.webcontainer.extension.DefaultExtensionProcessor.service(DefaultExtensionProcessor.java:1486)
    at com.ibm.ws.webcontainer.servlet.ServletWrapper.service(ServletWrapper.java:1232)
    at com.ibm.ws.webcontainer.servlet.ServletWrapper.handleRequest(ServletWrapper.java:781)
    at com.ibm.ws.webcontainer.servlet.ServletWrapper.handleRequest(ServletWrapper.java:480)
    at com.ibm.ws.webcontainer.servlet.ServletWrapperImpl.handleRequest(ServletWrapperImpl.java:178)
    at com.ibm.ws.webcontainer.filter.WebAppFilterChain.invokeTarget(WebAppFilterChain.java:136)
    at com.ibm.ws.webcontainer.filter.WebAppFilterChain.doFilter(WebAppFilterChain.java:79)
    at com.ibm.ws.webcontainer.filter.WebAppFilterManager.doFilter(WebAppFilterManager.java:967)
    at com.ibm.ws.webcontainer.filter.WebAppFilterManager.invokeFilters(WebAppFilterManager.java:1107)
    at com.ibm.ws.webcontainer.webapp.WebAppRequestDispatcher.dispatch(WebAppRequestDispatcher.java:1385)
    at com.ibm.ws.webcontainer.webapp.WebAppRequestDispatcher.forward(WebAppRequestDispatcher.java:194)
    at org.springframework.web.servlet.resource.DefaultServletHttpRequestHandler.handleRequest(DefaultServletHttpRequestHandler.java:122)
    at org.springframework.web.servlet.mvc.HttpRequestHandlerAdapter.handle(HttpRequestHandlerAdapter.java:51)
    at org.springframework.web.servlet.DispatcherServlet.doDispatch(DispatcherServlet.java:963)
    at org.springframework.web.servlet.DispatcherServlet.doService(DispatcherServlet.java:897)
    at org.springframework.web.servlet.FrameworkServlet.processRequest(FrameworkServlet.java:970)
    at org.springframework.web.servlet.FrameworkServlet.doGet(FrameworkServlet.java:861)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:575)
    at org.springframework.web.servlet.FrameworkServlet.service(FrameworkServlet.java:846)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:668)
    at com.ibm.ws.webcontainer.servlet.ServletWrapper.service(ServletWrapper.java:1232)
    at com.ibm.ws.webcontainer.servlet.ServletWrapper.handleRequest(ServletWrapper.java:781)
    at com.ibm.ws.webcontainer.servlet.ServletWrapper.handleRequest(ServletWrapper.java:480)
    at com.ibm.ws.webcontainer.servlet.ServletWrapperImpl.handleRequest(ServletWrapperImpl.java:178)
    at com.ibm.ws.webcontainer.filter.WebAppFilterChain.invokeTarget(WebAppFilterChain.java:136)
    at com.ibm.ws.webcontainer.filter.WebAppFilterChain.doFilter(WebAppFilterChain.java:79)
    at com.ibm.ws.webcontainer.filter.WebAppFilterManager.doFilter(WebAppFilterManager.java:967)
    at com.ibm.ws.webcontainer.filter.WebAppFilterManager.invokeFilters(WebAppFilterManager.java:1107)
    at com.ibm.ws.webcontainer.servlet.CacheServletWrapper.handleRequest(CacheServletWrapper.java:87)
    at com.ibm.ws.webcontainer.WebContainer.handleRequest(WebContainer.java:948)
    at com.ibm.ws.webcontainer.WSWebContainer.handleRequest(WSWebContainer.java:1817)
    at com.ibm.ws.webcontainer.channel.WCChannelLink.ready(WCChannelLink.java:200)
    at com.ibm.ws.http.channel.inbound.impl.HttpInboundLink.handleDiscrimination(HttpInboundLink.java:463)
    at com.ibm.ws.http.channel.inbound.impl.HttpInboundLink.handleNewRequest(HttpInboundLink.java:530)
    at com.ibm.ws.http.channel.inbound.impl.HttpInboundLink.processRequest(HttpInboundLink.java:316)
    at com.ibm.ws.http.channel.inbound.impl.HttpInboundLink.ready(HttpInboundLink.java:287)
    at com.ibm.ws.tcp.channel.impl.NewConnectionInitialReadCallback.sendToDiscriminators(NewConnectionInitialReadCallback.java:214)
    at com.ibm.ws.tcp.channel.impl.NewConnectionInitialReadCallback.complete(NewConnectionInitialReadCallback.java:113)
    at com.ibm.ws.tcp.channel.impl.AioReadCompletionListener.futureCompleted(AioReadCompletionListener.java:175)
    at com.ibm.io.async.AbstractAsyncFuture.invokeCallback(AbstractAsyncFuture.java:217)
    at com.ibm.io.async.AsyncChannelFuture.fireCompletionActions(AsyncChannelFuture.java:161)
    at com.ibm.io.async.AsyncFuture.completed(AsyncFuture.java:138)
    at com.ibm.io.async.ResultHandler.complete(ResultHandler.java:204)
    at com.ibm.io.async.ResultHandler.runEventProcessingLoop(ResultHandler.java:775)
    at com.ibm.io.async.ResultHandler$2.run(ResultHandler.java:905)
    at com.ibm.ws.util.ThreadPool$Worker.run(ThreadPool.java:1892)
Caused by: javax.servlet.ServletException: Could not resolve view with name 'tiles_sign_on' in servlet with name 'dispatcher'
    at org.springframework.web.servlet.DispatcherServlet.render(DispatcherServlet.java:1237)
    at org.springframework.web.servlet.DispatcherServlet.processDispatchResult(DispatcherServlet.java:1037)
    at org.springframework.web.servlet.DispatcherServlet.doDispatch(DispatcherServlet.java:980)
    at org.springframework.web.servlet.DispatcherServlet.doService(DispatcherServlet.java:897)
    at org.springframework.web.servlet.FrameworkServlet.processRequest(FrameworkServlet.java:970)
    at org.springframework.web.servlet.FrameworkServlet.doGet(FrameworkServlet.java:861)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:575)
    at org.springframework.web.servlet.FrameworkServlet.service(FrameworkServlet.java:846)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:668)
    at com.ibm.ws.webcontainer.servlet.ServletWrapper.service(ServletWrapper.java:1232)
    at com.ibm.ws.webcontainer.servlet.ServletWrapper.handleRequest(ServletWrapper.java:781)
    at com.ibm.ws.webcontainer.servlet.ServletWrapper.handleRequest(ServletWrapper.java:480)
    at com.ibm.ws.webcontainer.servlet.ServletWrapperImpl.handleRequest(ServletWrapperImpl.java:178)
    at com.ibm.ws.webcontainer.filter.WebAppFilterChain.invokeTarget(WebAppFilterChain.java:136)
    at com.ibm.ws.webcontainer.filter.WebAppFilterChain.doFilter(WebAppFilterChain.java:79)
    at com.ibm.ws.webcontainer.filter.WebAppFilterManager.doFilter(WebAppFilterManager.java:967)
    at com.ibm.ws.webcontainer.filter.WebAppFilterManager.invokeFilters(WebAppFilterManager.java:1107)
    at com.ibm.ws.webcontainer.webapp.WebAppRequestDispatcher.dispatch(WebAppRequestDispatcher.java:1385)
    at com.ibm.ws.webcontainer.webapp.WebAppRequestDispatcher.forward(WebAppRequestDispatcher.java:194)
    at com.ibm.ws.webcontainer.extension.DefaultExtensionProcessor.handleRequest(DefaultExtensionProcessor.java:681)
    at com.ibm.ws.webcontainer.extension.DefaultExtensionProcessor.service(DefaultExtensionProcessor.java:1483)
    ... 45 more

请记住使用hiveContext而不是sparkContext。