Question

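This is probably a simple question, but I am just starting my adventure with Spark.

Problem: I would like to get the following structure in Spark (the expected result). Right now I have this structure: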

title1, {word11, word12, word13 ...}
title2, {word12, word22, word23 ...}

The data is stored in a Dataset[(String, Seq[String])].

Expected result: I would like to get a Tuple[word, title]

word11, {title1}
word12, {title1}

What I do
1. Build (title, Seq[word1, word2, word3])

docs.mapPartitions { iter =>
  iter.map {
     case (title, contents) => {
        val textToLemmas: Seq[String] = toText(....)
        (title, textToLemmas)
     }
  }
}

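I tried using .map to transform my structure into tuples, but could not get it to work.
I also tried iterating over all the elements, but then I could not return the right type.

Thanks for your answers.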

Answer 1

This should work:

val result = dataSet.flatMap { case (title, words) => words.map((_, title)) }
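
For anyone who wants to try the line above end to end, here is a minimal sketch. It assumes a local SparkSession and uses the two sample rows from the question; the object name FlattenTitles and the helper run are made up for illustration and are not part of the original code.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical wrapper object, only here to make the snippet runnable.
object FlattenTitles {

  // Builds the sample data from the question, flattens it, and returns the rows.
  def run(): Seq[(String, String)] = {
    // Assumption: a local session is good enough for experimenting.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("flatten-titles")
      .getOrCreate()
    // Brings the encoders for tuples and the toDS() helper into scope.
    import spark.implicits._

    try {
      val dataSet: Dataset[(String, Seq[String])] = Seq(
        ("title1", Seq("word11", "word12", "word13")),
        ("title2", Seq("word12", "word22", "word23"))
      ).toDS()

      // One output row per (word, title) pair.
      val result: Dataset[(String, String)] =
        dataSet.flatMap { case (title, words) => words.map((_, title)) }

      result.collect().toSeq
    } finally {
      spark.stop()
    }
  }

  def main(args: Array[String]): Unit =
    run().foreach(println)
}
```

Note that the encoders from spark.implicits._ have to be in scope, otherwise flatMap will not compile for the tuple type.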

Answer 2

Another solution is to call the explode function, like this:

import org.apache.spark.sql.functions.{col, explode}
// explode expects a Column rather than a String, so wrap the column name with col
dataset.withColumn("_2", explode(col("_2"))).as[(String, String)]

Hope this helps. Best regards.

Answer 3

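I'm surprised nobody has offered a solution using Scala's for-comprehension (which gets "desugared" at compile time to flatMap and map, as in Yuval Itzchakov's answer).

Whenever you see a series of flatMap and map calls (possibly with filter), that is where Scala's for-comprehension comes in.

So the following: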

val result = dataSet.flatMap { case (title, words) => words.map((_, title)) }

is equivalent to the following:

val result = for {
  (title, words) <- dataSet
  w <- words
} yield (w, title)

After all, that is why we enjoy the flexibility of Scala, isn't it?
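
If you want to convince yourself of the equivalence without starting Spark, a rough sketch with plain Scala collections could look like this. A Seq stands in for the Dataset here (that substitution is my assumption, purely for illustration), and the names ForDesugarDemo, viaFlatMap and viaFor are invented.

```scala
// Hypothetical demo object; a plain Seq replaces the Spark Dataset.
object ForDesugarDemo {
  type Doc = (String, Seq[String])

  // Sample rows taken from the question.
  val docs: Seq[Doc] = Seq(
    "title1" -> Seq("word11", "word12", "word13"),
    "title2" -> Seq("word12", "word22", "word23")
  )

  // Explicit flatMap + map, as in the first answer.
  def viaFlatMap(ds: Seq[Doc]): Seq[(String, String)] =
    ds.flatMap { case (title, words) => words.map(w => (w, title)) }

  // The same transformation written as a for-comprehension.
  def viaFor(ds: Seq[Doc]): Seq[(String, String)] =
    for {
      (title, words) <- ds
      w <- words
    } yield (w, title)

  def main(args: Array[String]): Unit = {
    // Both versions produce the same pairs in the same order.
    assert(viaFlatMap(docs) == viaFor(docs))
    viaFor(docs).foreach(println)
  }
}
```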

How to convert Dataset[(String, Seq[String])] to Dataset[(String, String)]?
