Question

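I saved a pandas DataFrame in Parquet format. Because it is large and I need to run gradient-boosted classification on it, I want to use PySpark to speed up the process. My pandas df looks like this: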

 Y    X
 a    3.0
 b    3.5
 c    4.9
 d    6.8

All of my X columns are of type int64 or float64, and Y is object. So I saved the dataset to Parquet and then followed this documentation: https://spark.apache.org/docs/2.3.0/ml-classification-regression.html#gradient-boosted-tree-classifier

 df.to_parquet('DF.parquet')

In: data = spark.read.load("DF.parquet")
Out: DataFrame[X: double, Y: string]

Answer 1

Your column name Y appears to contain a space or a tab character (\t).

Please check for it and remove it.
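
One way to do that check, as a minimal sketch: the helper name `clean_column_names` and the sample names below are made up for illustration and are not taken from the question's data. Printing each name with `repr` makes hidden whitespace visible, and `str.strip` removes it from both ends.

```python
def clean_column_names(columns):
    """Return the column names with leading/trailing whitespace removed."""
    return [str(name).strip() for name in columns]


# Hypothetical names: "Y" carries a trailing tab, "X" a leading space.
raw_columns = ["Y\t", " X"]

# repr() shows whitespace that a plain print would hide.
for name in raw_columns:
    print(repr(name))

cleaned = clean_column_names(raw_columns)
print(cleaned)

# With pandas, the same idea would be applied before writing the file:
#   df.columns = clean_column_names(df.columns)
#   df.to_parquet("DF.parquet")
```

Cleaning the names on the pandas side before calling `to_parquet` means the Parquet file itself carries the corrected names, so nothing needs to be renamed after loading it in Spark.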

Answer 2

This should work:

data = spark.read.parquet("DF.parquet")

I'm not sure whether the accepted answer helps.

Unable to import Parquet data in PySpark
