Setting up pyspark on Windows 10

Posted: 2016-10-06 05:02:32

Tags: python-2.7 windows-10 pyspark anaconda

I am trying to set up Spark on my Windows 10 machine. I have Anaconda2 with Python 2.7, and I managed to open an IPython notebook instance. I am able to run the following lines:

airlines=sc.textFile("airlines.csv")
print (airlines)

But I get an error when I run: airlines.first()

This is the error I get:

---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
<ipython-input-6-85a5d6f5110f> in <module>()
----> 1 airlines.first()

C:\spark\python\pyspark\rdd.py in first(self)
   1326         ValueError: RDD is empty
   1327         """
-> 1328         rs = self.take(1)
   1329         if rs:
   1330             return rs[0]

C:\spark\python\pyspark\rdd.py in take(self, num)
   1308 
   1309             p = range(partsScanned, min(partsScanned + numPartsToTry, totalParts))
-> 1310             res = self.context.runJob(self, takeUpToNumLeft, p)
   1311 
   1312             items += res

C:\spark\python\pyspark\context.py in runJob(self, rdd, partitionFunc, partitions, allowLocal)
    932         mappedRDD = rdd.mapPartitions(partitionFunc)
    933         port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partitions)
--> 934         return list(_load_from_socket(port, mappedRDD._jrdd_deserializer))
    935 
    936     def show_profiles(self):

C:\spark\python\pyspark\rdd.py in _load_from_socket(port, serializer)
    137         break
    138     if not sock:
--> 139         raise Exception("could not open socket")
    140     try:
    141         rf = sock.makefile("rb", 65536)

Exception: could not open socket

I run into a different error when executing: airlines.collect()

This is the error:

---------------------------------------------------------------------------
error                                     Traceback (most recent call last)
<ipython-input-5-3745b2fa985a> in <module>()
      1 # Using the collect operation, you can view the full dataset
----> 2 airlines.collect()

C:\spark\python\pyspark\rdd.py in collect(self)
    775         with SCCallSiteSync(self.context) as css:
    776             port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
--> 777         return list(_load_from_socket(port, self._jrdd_deserializer))
    778 
    779     def reduce(self, f):

C:\spark\python\pyspark\rdd.py in _load_from_socket(port, serializer)
    140     try:
    141         rf = sock.makefile("rb", 65536)
--> 142         for item in serializer.load_stream(rf):
    143             yield item
    144     finally:

C:\spark\python\pyspark\serializers.py in load_stream(self, stream)
    515         try:
    516             while True:
--> 517                 yield self.loads(stream)
    518         except struct.error:
    519             return

C:\spark\python\pyspark\serializers.py in loads(self, stream)
    504 
    505     def loads(self, stream):
--> 506         length = read_int(stream)
    507         if length == SpecialLengths.END_OF_DATA_SECTION:
    508             raise EOFError

C:\spark\python\pyspark\serializers.py in read_int(stream)
    541 
    542 def read_int(stream):
--> 543     length = stream.read(4)
    544     if not length:
    545         raise EOFError

C:\Users\AS\Anaconda2\lib\socket.pyc in read(self, size)
    382                 # fragmentation issues on many platforms.
    383                 try:
--> 384                     data = self._sock.recv(left)
    385                 except error, e:
    386                     if e.args[0] == EINTR:

error: [Errno 10054] An existing connection was forcibly closed by the remote host

Please help.

1 answer:

Answer 0: (score: 0)

Installing PYSPARK on Windows 10 with JUPYTER-NOTEBOOK via ANACONDA NAVIGATOR

Step 1

Download the following packages:

1) spark-2.2.0-bin-hadoop2.7.tgz

2) Java JDK 8

3) Anaconda v5.2

4) scala-2.12.6.msi

5) hadoop v2.7.1

Step 2

Create a spark folder in the C:\ drive and put everything you downloaded into it (a quick layout check follows the note below).

Note: while installing Scala, set the Scala install path to a folder inside the spark folder.
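
As a small aid (my addition, not part of the original answer), here is a minimal Python sketch to confirm the expected C:\spark layout before moving on; the subfolder names are the ones assumed by the environment variables in Step 3.

# Minimal sanity check (my addition): confirm the C:\spark layout.
import os

for sub in ("spark", "hadoop", "scala"):
    path = os.path.join(r"C:\spark", sub)
    print(path, "exists" if os.path.isdir(path) else "MISSING")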

Step 3

Now set the following new Windows environment variables (a verification sketch follows the list):

  1. HADOOP_HOME=C:\spark\hadoop

  2. JAVA_HOME=C:\Program Files\Java\jdk1.8.0_151

  3. SCALA_HOME=C:\spark\scala\bin

  4. SPARK_HOME=C:\spark\spark\bin

  5. PYSPARK_PYTHON=C:\Users\user\Anaconda3\python.exe

  6. PYSPARK_DRIVER_PYTHON=C:\Users\user\Anaconda3\Scripts\jupyter.exe

  7. PYSPARK_DRIVER_PYTHON_OPTS=notebook

  8. Finally, add the Spark path to the Path variable: edit it and add a new entry

    Add "C:\spark\spark\bin" to the Windows "Path" variable
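
If pyspark later fails to start, it can help to confirm that a fresh Anaconda prompt actually sees these values. Below is a minimal verification sketch (my addition, not part of the original answer); the paths simply mirror the list above, so adjust them to your own machine.

# Verification sketch (my addition): compare the environment against the
# values listed above. Run it from a fresh Anaconda prompt.
import os

expected = {
    "HADOOP_HOME": r"C:\spark\hadoop",
    "JAVA_HOME": r"C:\Program Files\Java\jdk1.8.0_151",
    "SCALA_HOME": r"C:\spark\scala\bin",
    "SPARK_HOME": r"C:\spark\spark\bin",
    "PYSPARK_PYTHON": r"C:\Users\user\Anaconda3\python.exe",
}

for name, want in expected.items():
    got = os.environ.get(name)
    print("%-22s %s  (%s)" % (name, "OK" if got == want else "CHECK", got))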

Step 4

  • Create a folder where you want to store the Jupyter-Notebook outputs and files
  • After that, open the Anaconda command prompt and cd to that folder
  • Then type pyspark

That's it, your browser will pop up with Jupyter on localhost.
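
As an alternative route (my addition, not from the original answer): if you would rather start a plain python or ipython session instead of relying on PYSPARK_DRIVER_PYTHON, the third-party findspark package (pip install findspark) can locate Spark for you. Note that findspark expects the Spark root directory (here C:\spark\spark, not its bin subfolder).

# Sketch using the third-party findspark package (my addition).
import findspark
findspark.init(r"C:\spark\spark")  # adds Spark's Python libraries to sys.path

from pyspark import SparkContext

sc = SparkContext(appName="smoke-test")
print(sc.parallelize(range(10)).sum())  # expect 45
sc.stop()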

Step 5

Check whether pyspark is working or not!

Type this simple code and run it:

from pyspark.sql import Row
a = Row(name='Vinay', age=22, height=165)
print("a: ", a)