Question

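I am processing a large CSV file (more than 50 GB) with PySpark. For each row, I need to find the number of distinct values that appear between two references to the same value, that is, between the current row and the previous row holding the same value. For example: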

input dataframe:
+----+
|col1|
+----+
|   a|
|   b|
|   c|
|   c| 
|   a|   
|   b|
|   a|     
+----+


output dataframe:
+----+-----+
|col1|col2 |
+----+-----+
|   a| null|
|   b| null|
|   c| null|
|   c|    0| 
|   a|    2|
|   b|    2|   
|   a|    1| 
+----+-----+

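I have been struggling with this for the past week. I tried window functions and many other things, but got nowhere. If anyone knows how to solve this, it would be a great help. Thanks.

Please comment if you need any clarification on the question.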
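
To pin down the expected output, here is a minimal plain-Python sketch of the definition, with no Spark involved. The function name `distinct_between` is hypothetical and is not part of the question. It is a brute-force reference meant for checking small samples, not something that would scale to a 50 GB file.

```python
def distinct_between(values):
    """For each item, count the distinct values seen since its previous occurrence.

    The first occurrence of a value has no previous reference, so it yields None.
    """
    last_seen = {}  # value -> index of its most recent occurrence
    result = []
    for i, value in enumerate(values):
        if value in last_seen:
            # Rows strictly between the previous occurrence and the current row.
            between = values[last_seen[value] + 1:i]
            result.append(len(set(between)))
        else:
            result.append(None)
        last_seen[value] = i
    return result


print(distinct_between(list("abccaba")))
# [None, None, None, 0, 2, 2, 1]
```

The printed list matches the col2 column of the output dataframe above.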

Answer 1

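I am offering a solution that rests on some assumptions.

Assume the previous reference can always be found within at most the preceding 'n' rows. If 'n' is a reasonably small value, I think this is a good solution.

Here I assume the previous reference can be found within the preceding 5 rows.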

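from pyspark.sql import Window
from pyspark.sql.functions import array, col, lag, monotonically_increasing_id, udf
from pyspark.sql.types import IntegerType

def get_distincts(previous_values, current_value):
    # previous_values holds the preceding rows, nearest first (lag 1, lag 2, ...)
    distinct = set()
    for value in previous_values:
        if value == current_value:
            # found the previous reference: return the number of distinct values in between
            return len(distinct)
        distinct.add(value)
    # no previous reference within the look-back window
    return None

get_distincts_udf = udf(get_distincts, IntegerType())

df = spark.createDataFrame([["a"],["b"],["c"],["c"],["a"],["b"],["a"]]).toDF("col1")
# You can replace this if you already have a unique id column
df = df.withColumn("seq_id", monotonically_increasing_id())

# Note: a window without partitionBy moves all rows to a single partition
window = Window.orderBy("seq_id")
# collect the previous 5 values (lag 1..5) into an array column
df = df.withColumn("list", array([lag(col("col1"), i, None).over(window) for i in range(1, 6)]))

df = df.withColumn("col2", get_distincts_udf(col('list'), col('col1'))).drop('seq_id','list')
df.show()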

Result

+----+----+
|col1|col2|
+----+----+
|   a|null|
|   b|null|
|   c|null|
|   c|   0|
|   a|   2|
|   b|   2|
|   a|   1|
+----+----+
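
One caveat of the fixed look-back assumption: if the previous reference lies further back than 'n' rows, the row gets null even though an earlier occurrence exists. The plain-Python sketch below mimics the lag-array logic without Spark to make that visible. The name `lookback_distinct` is hypothetical and is used only for illustration.

```python
def lookback_distinct(values, n):
    """Simulate the lag-array approach with a look-back window of n rows."""
    result = []
    for i, current in enumerate(values):
        # Previous values, nearest first, like lag(1) .. lag(n); None pads the start.
        previous = [values[i - k] if i - k >= 0 else None for k in range(1, n + 1)]
        seen = set()
        count = None  # stays None when no previous reference is inside the window
        for value in previous:
            if value == current:
                count = len(seen)
                break
            seen.add(value)
        result.append(count)
    return result


sample = list("abccaba")
print(lookback_distinct(sample, 5))  # [None, None, None, 0, 2, 2, 1]
print(lookback_distinct(sample, 2))  # [None, None, None, 0, None, None, 1]
```

With n=2 the second 'a' and the second 'b' are missed, because their previous references are more than two rows back.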

Answer 2

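You can try the following approach:

Add a monotonically increasing column id to keep track of the order of the rows
For each col1 value, find the prev_id (the id of the previous row with the same value) and save the result to a new df
For the new DF (aliased 'd1'), do a LEFT JOIN to the DF itself (aliased 'd2') on the condition (d2.id > d1.prev_id) & (d2.id < d1.id)
Then group by ('d1.col1', 'd1.id') and aggregate with countDistinct('d2.col1')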
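
The four steps above can be checked on a small sample with a plain-Python sketch that imitates the join and aggregation without Spark. The function name `self_join_distinct` and the dictionary layout are assumptions made for illustration only. The nested loop is quadratic, so this is only a way to reason about the logic.

```python
def self_join_distinct(values):
    # Step 1: attach an increasing id to every row.
    rows = [{"id": i, "col1": v} for i, v in enumerate(values)]

    # Step 2: prev_id is the id of the previous row with the same col1 (None if absent).
    last_id = {}
    for row in rows:
        row["prev_id"] = last_id.get(row["col1"])
        last_id[row["col1"]] = row["id"]

    # Steps 3 and 4: left join on prev_id < d2.id < d1.id, then count distinct d2.col1.
    out = []
    for d1 in rows:
        matched = {
            d2["col1"]
            for d2 in rows
            if d1["prev_id"] is not None and d1["prev_id"] < d2["id"] < d1["id"]
        }
        # A row with no match yields 0, as a distinct count over no matches would.
        out.append((d1["col1"], d1["id"], len(matched)))
    return out


for record in self_join_distinct(list("abccaba")):
    print(record)
```

Rows without a previous occurrence come out as 0 rather than null in this sketch, because the distinct count is taken over an empty set of matches.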

The PySpark code based on the above logic and the sample data looks like this:

from pyspark.sql import functions as F, Window

df1 = spark.createDataFrame([ (i,) for i in list("abccaba")], ["col1"])

# create a WinSpec partitioned by col1 so that we can find the prev_id
win = Window.partitionBy('col1').orderBy('id')

# set up id and prev_id
df11 = df1.withColumn('id', F.monotonically_increasing_id())\
          .withColumn('prev_id', F.lag('id').over(win))

# check the newly added columns
df11.sort('id').show()
# +----+---+-------+
# |col1| id|prev_id|
# +----+---+-------+
# |   a|  0|   null|
# |   b|  1|   null|
# |   c|  2|   null|
# |   c|  3|      2|
# |   a|  4|      0|
# |   b|  5|      1|
# |   a|  6|      4|
# +----+---+-------+

# let's cache the new dataframe
df11.persist()

# do a self-join on id and prev_id and then do the aggregation
df12 = df11.alias('d1') \
           .join(df11.alias('d2')
               , (F.col('d2.id') > F.col('d1.prev_id')) & (F.col('d2.id') < F.col('d1.id')), how='left') \
           .select('d1.col1', 'd1.id', F.col('d2.col1').alias('ids')) \
           .groupBy('col1','id') \
           .agg(F.countDistinct('ids').alias('distinct_values'))

# display the result
df12.sort('id').show()
# +----+---+---------------+
# |col1| id|distinct_values|
# +----+---+---------------+
# |   a|  0|              0|
# |   b|  1|              0|
# |   c|  2|              0|
# |   c|  3|              0|
# |   a|  4|              2|
# |   b|  5|              2| 
# |   a|  6|              1|
# +----+---+---------------+

# release the cached df11
df11.unpersist()

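Note that you will need to keep this id column in order to sort the rows; otherwise the rows will come back in a completely scrambled order each time you collect them.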

Answer 3

var arr = [10, 0, -1, 20, 25, 30];
var sum = 29;
var newArr = [];
var sum_expected = 0;
var y = 0;
while (y < arr.length) {
    for (let i = 0; i < arr.length; i++) {
        var subArr = [];
        sum_expected = arr[i];
        if (arr[i] != 0) subArr.push(arr[i]);
        for (let j = 0; j < arr.length; j++) {
            if (i == j)
                continue;
            sum_expected += arr[j];
            if (arr[j] != 0) subArr.push(arr[j]);
            if (sum_expected == sum) {
                var result = arr.filter((el)=>(subArr.indexOf(el) > -1));
                !newArr.length ? newArr = result : result.length < newArr.length ? newArr = result :  1;
                break;
            }
        }
    }
    let x = arr.shift();
    arr.push(x);
    y++;
}
if (newArr.length) {
    console.log(newArr);
} else {
    console.log('Not found');
}

The "block" here is nothing more than the value you read from the CSV and want to search for.

reuse_distance = []

block_dict = {}
stack_dict = {}
counter_reuse = 0
counter_stack = 0
reuse_list = []
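
The variable names above suggest a reuse-distance computation over a stack of recently seen blocks, but the rest of that code is not shown here. As a hedged illustration only, and not the answerer's actual code, the sketch below computes the same distances with a simple list-based stack. The function name `reuse_distances` is hypothetical.

```python
def reuse_distances(blocks):
    """Return, for each block, the number of distinct blocks since its last reference."""
    stack = []      # distinct blocks, most recently referenced at the end
    distances = []
    for block in blocks:
        if block in stack:
            pos = stack.index(block)
            # Every block above this one in the stack was referenced after it.
            distances.append(len(stack) - 1 - pos)
            stack.pop(pos)
        else:
            distances.append(None)  # first reference, no reuse distance
        stack.append(block)         # move the block to the top of the stack
    return distances


print(reuse_distances(list("abccaba")))
# [None, None, None, 0, 2, 2, 1]
```

The list lookup makes each step linear in the number of distinct blocks, so this form is suited to small inputs or to validating a distributed implementation.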

Find the count of distinct values between two identical values in a CSV file using PySpark

3 answers: