TAGS :Viewed: 6 - Published at: a few seconds ago

[ Access to a particular RDD partition from one other ]

I want get elements from beside partitions on the current partition when i'm using mapPartition or another function.

More generally, i'm curious to know how to get access to a particular partition from a RDD.

val rdd = sc.parallelize(Array(1,2,3,4,5,6,7,8,9,10,11,12),4)

I would like

val rdd_2 = rdd.something(2) = RDD[Array(4,5,6)]

Thank you to tell me if it's unclear.

Answer 1


More generally, i'm curious to know how to get access to a particular partition from a RDD.

For debugging purposes you can use TaskContext

import org.apache.spark.TaskContext

rdd
   .mapPartitions(iter => Iterator((TaskContext.get.partitionId, iter.toList)))
   .filter{case (k, _) => k == 1}
   .values

Internally Spark is using runJob to operate only on selected partitions.

I want get elements from beside partitions on the current partition when I'm using mapPartition or another function.

There is probably some hacky way to achieve something like this but generally speaking it is not possible. Assumption that each partition can be processed independently is pretty much the core concept behind Spark computation model.

If you want to access some specific subset of your data at once you can use a custom partitioner.