7 RDD常用算子(2)（rd算法）

itomcoil 2025-07-02 21:22 11 浏览

filter()

def filter(f: T => Boolean): RDD[T]

函数说明

将数据根据指定的规则进行筛选过滤，符合规则的数据保留，不符合规则的数据丢弃。当数据进行筛选过滤后，分区不变，但是分区内的数据可能不均衡，生产环境下，可能会出现数据倾斜。

object RDD_Transform_filter {

  def main(args: Array[String]): Unit = {
    // Step1: 准备环境
    val sparkConf = new SparkConf().setMaster("local[*]").setAppName("Operator")
    val sc = new SparkContext(sparkConf)
    // Step2: 算子 filter
    val rdd = sc.makeRDD(List(1, 2, 3, 4))
    val filterRDD = rdd.filter(num => num % 2 != 0)
    filterRDD.collect.foreach(println)
    // Step3: 关闭环境
    sc.stop()
  }

}

输出结果：

1
3

应用场景，日志过滤。

object RDD_Transform_filter_LogFilter {

  def main(args: Array[String]): Unit = {
    // Step1: 准备环境
    val sparkConf = new SparkConf().setMaster("local[*]").setAppName("Operator")
    val sc = new SparkContext(sparkConf)

    // Step2: 算子 groupBy
    val rdd = sc.textFile("datas/apache.log")
    var timeRDD = rdd.filter(
      line => {
        val data = line.split(" ")
        val time = data(3)
        time.startsWith("17/May/2023")
      }
    ).collect().foreach(println)

    // Step3: 关闭环境
    sc.stop()
  }

}

日志内容：

https://gitee.com/wuji1626/bigdata/blob/master/datas/apache.log

输出结果：

220.181.108.112 - - 17/May/2023:02:04:48 +0800 "GET / HTTP/1.1" 200 9603 www.sohu.com
220.181.108.112 - - 17/May/2023:01:04:58 +0800 "GET / HTTP/1.1" 200 9603 www.iqiyi.com
220.181.108.112 - - 17/May/2023:12:04:58 +0800 "GET / HTTP/1.1" 200 9603 www.163.com

sample()

函数签名

def sample(
    withReplacement: Boolean,
    fraction: Double,
    seed: Long = Utils.random.nextLong): RDD[T]
)

参数说明：

抽取数据不放回（伯努利算法）伯努利算法：又叫 0、1 分布。例如扔硬币，要么正面，要么反面。

具体实现：根据种子和随机算法算出一个数和第二个参数设置几率比较，小于第二个参数要，大于不

要

withReplacement：boolean，抽取的数据不放回，false；
fraction：Double，抽取的几率，范围在[0,1]之间,0：全不取；1：全取；
seed：Long，随机数种子。

抽取数据放回（泊松算法）

withReplacement：boolean，抽取的数据是否放回，true：放回；false：不放回；
fraction：Double，重复数据的几率，范围大于等于 0.表示每一个元素被期望抽取到的次数，但实际的抽取次数可能远大于期望值；
seed：Long，随机数种子。

def sample(
     withReplacement: Boolean,
     fraction: Double, 
     seed: Long = Utils.random.nextLong): RDD[T] 
)

def sample(
    withReplacement: Boolean,
    fraction: Double,
    seed: Long = Utils.random.nextLong): RDD[T] = {
  require(fraction >= 0,
    s"Fraction must be nonnegative, but got ${fraction}")

  withScope {
    require(fraction >= 0.0, "Negative fraction value: " + fraction)
    if (withReplacement) {
      new PartitionwiseSampledRDD[T, T](this, new PoissonSampler[T](fraction), true, seed)
    } else {
      new PartitionwiseSampledRDD[T, T](this, new BernoulliSampler[T](fraction), true, seed)
    }
  }
}

函数说明

根据指定的规则从数据集中抽取数据。

object RDD_Transform_sample {

  def main(args: Array[String]): Unit = {
    // Step1: 准备环境
    val sparkConf = new SparkConf().setMaster("local[*]").setAppName("Operator")
    val sc = new SparkContext(sparkConf)
    // Step2: 算子 filter
    val rdd = sc.makeRDD(List(1, 2, 3, 4, 5, 6, 7, 8, 9, 10))
    println(rdd.sample(
      withReplacement = false,
      0.4,
      1
    ).collect().mkString(","))
    // Step3: 关闭环境
    sc.stop()
  }
}

输出结果：

2,5,6,8

sample() 算子源码如下，

当 withReplacement 为 false 时，会使用 BernoulliSampler() 贝努力（伯努利）算法选取样本，贝努力算法相当于抛硬币，将符合人头或字的数字作为样本返回。
当 withReplacement 为 true 时，会使用 PoissonSampler() 离散概率分布算法生成样本。

应用场景：在产生数据倾斜时使用。在 shuffle 的情况下，数据可能出现倾斜，通过 sample() 算子获取一定数量的数据样本观察其中的某些 key 值是否重复的比例过高，如果反复取 sample 对应的 key 都有重复值较多的情况，就说明该 key 可能引起数据倾斜，需要加以处理。

distinct()

函数签名

def distinct()(implicit ord: Ordering[T] = null): RDD[T]
def distinct(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T]

函数说明

将数据集中数据去重。

object RDD_Transform_distinct {

  def main(args: Array[String]): Unit = {
    // Step1: 准备环境
    val sparkConf = new SparkConf().setMaster("local[*]").setAppName("Operator")
    val sc = new SparkContext(sparkConf)
    // Step2: 算子 distinct
    val rdd = sc.makeRDD(List(1, 2, 3, 4, 1, 2, 3, 4))
    val distinctRDD = rdd.distinct()
    distinctRDD.collect.foreach(println)
    // Step3: 关闭环境
    sc.stop()
  }
}

输出结果：

Scala 集合的 distinct 方法，其中通过 HashSet 的方法识别重复值。

default Repr distinct() {
    boolean isImmutable = this instanceof scala.collection.immutable.Seq;
    if (isImmutable && this.lengthCompare(1) <= 0) {
        return this.repr();
    } else {
        Builder b = this.newBuilder();
        HashSet seen = new HashSet();
        Iterator it = this.iterator();
        boolean different = false;
        ...
}

RDD 的去重方式，主要看 partitioner 部分，由于 partitioner 默认为 null，因此主要逻辑体现在下述代码行：

 case _ => map(x => (x, null)).reduceByKey((x, _) => x, numPartitions).map(_._1)

以数组 List(1, 2, 3, 4, 1, 2, 3, 4) 为例：

map(x => (x, null))：将数组变为：(1, null)，(2, null)，(3, null)，(4, null)，(1, null)，(2, null)，(3, null)，(4, null) 的 tuple 数组；
reduceByKey((x, _) => x, numPartitions)：对 tuple 数组做聚合 (1, null)，(1, null)，聚合后，key 不变，(null，null)，取第一个值，即 null，所以聚合后的结果：(1, null)，(2, null)，(3, null)，(4, null)；
map(_._1)：只保留 tuple 中的第一个元素，(1, null)，(2, null)，(3, null)，(4, null) 变回 1, 2, 3, 4。

distinct() 算子的完整源码如下：

def distinct(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
  def removeDuplicatesInPartition(partition: Iterator[T]): Iterator[T] = {
    // Create an instance of external append only map which ignores values.
    val map = new ExternalAppendOnlyMap[T, Null, Null](
      createCombiner = _ => null,
      mergeValue = (a, b) => a,
      mergeCombiners = (a, b) => a)
    map.insertAll(partition.map(_ -> null))
    map.iterator.map(_._1)
  }
  partitioner match {
    case Some(_) if numPartitions == partitions.length =>
      mapPartitions(removeDuplicatesInPartition, preservesPartitioning = true)
    case _ => map(x => (x, null)).reduceByKey((x, _) => x, numPartitions).map(_._1)
  }
}

coalesce()

函数签名

def coalesce(numPartitions: Int, shuffle: Boolean = false,
    partitionCoalescer: Option[PartitionCoalescer] = Option.empty)
    (implicit ord: Ordering[T] = null)
: RDD[T]

函数说明

根据数据量缩减分区，用于大数据集过滤后，提高小数据集的执行效率当 spark 程序中，存在过多的小任务的时候，可以通过 coalesce 方法，收缩合并分区，减少分区的个数，减小任务调度成本。

object RDD_Transform_coalesce {

  def main(args: Array[String]): Unit = {
    // Step1: 准备环境
    val sparkConf = new SparkConf().setMaster("local[*]").setAppName("Operator")
    val sc = new SparkContext(sparkConf)
    // Step2: 算子 distinct
    val rdd = sc.makeRDD(List(1, 2, 3, 4), numSlices = 4)
    val coalesceRDD = rdd.coalesce(2)
    coalesceRDD.saveAsTextFile("output")
    // Step3: 关闭环境
    sc.stop()
  }
}

生成的分区文件：

[1,2] [3,4]

如果原始数据与分区数据调整成下述的方式，数组有 6 个元素，变化前 3 个分区，变化后 2 个分区。

object RDD_Transform_coalesce {

  def main(args: Array[String]): Unit = {
    // Step1: 准备环境
    val sparkConf = new SparkConf().setMaster("local[*]").setAppName("Operator")
    val sc = new SparkContext(sparkConf)
    // Step2: 算子 coalesce
    val rdd = sc.makeRDD(List(1, 2, 3, 4, 5, 6), numSlices = 3)
    val coalesceRDD = rdd.coalesce(2)
    coalesceRDD.saveAsTextFile("output")
    // Step3: 关闭环境
    sc.stop()
  }
}

默认情况下，coalesce() 算子不打乱原始分区，如下图所示，coalesce() 采用的是整个分区合并的方式。

可以通过指定 shuffle 参数的方式，显式让 shuffle 执行。

val coalesceRDD = rdd.coalesce(2, shuffle = true)

指定 shuffle 参数后，分区随机进行打散：

[1,2] [3,4] [5,6] => [1,4,5] [2,3,6]

如果需要扩展分区，可以使用 coalesce() 进行分区扩展，如果不指定 shuffle = true，将无效，只有指定 shuffle = true 才能实现分区扩大。

repartition()

为让 coalesce() 功能更明确，只作为分区缩减，专门增加 repartition() 算子，进行分区扩展。

函数签名

def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T]

函数说明

该操作内部其实执行的是 coalesce 操作，参数 shuffle 的默认值为 true。无论是将分区数多的 RDD 转换为分区数少的 RDD，还是将分区数少的 RDD 转换为分区数多的 RDD，repartition 操作都可以完成，因为无论如何都会经 shuffle 过程。

查看 repartion() 源代码，发现底层就是调用 coalesce() 算子，并默认指定 shuffle = ture。

def repartition(numPartitions: Int)(implicit ord: Ordering[T] = null): RDD[T] = withScope {
  coalesce(numPartitions, shuffle = true)
}

sortBy()

根据指定规则对数据进行排序。

函数签名

def sortBy[K](
     f: (T) => K,
     ascending: Boolean = true,
     numPartitions: Int = this.partitions.length)
    (implicit ord: Ordering[K], ctag: ClassTag[K]): RDD[T]

函数说明

该操作用于排序数据。在排序之前，可以将数据通过 f 函数进行处理，之后按照 f 函数处理的结果进行排序，默认为升序排列。排序后新产生的 RDD 的分区数与原 RDD 的分区数一致。中间存在 shuffle 的过程。

object RDD_Transform_sortBy {

  def main(args: Array[String]): Unit = {
    // Step1: 准备环境
    val sparkConf = new SparkConf().setMaster("local[*]").setAppName("Operator")
    val sc = new SparkContext(sparkConf)
    // Step2: 算子 sortBy
    val rdd = sc.makeRDD(List(6, 2, 4, 5, 3, 1), numSlices = 2)
    val sortRDD = rdd.sortBy(num => num)
    sortRDD.collect().foreach(println)
    // Step3: 关闭环境
    sc.stop()
  }
}

数据的顺序：[6, 2, 4] [5, 3, 1] => [1, 2, 3] [4, 5, 6]

从结果知：sortBy() 算子要进行 shuffle。

应用场景，对 tuple 进行排序。

object RDD_Transform_sortBy_tuple {

  def main(args: Array[String]): Unit = {
    // Step1: 准备环境
    val sparkConf = new SparkConf().setMaster("local[*]").setAppName("Operator")
    val sc = new SparkContext(sparkConf)
    // Step2: 算子 sortBy
    val rdd = sc.makeRDD(List(("1",1),("11",2),("2",3)), numSlices = 2)
    val sortRDD = rdd.sortBy(t=> t._1)
    sortRDD.collect().foreach(println)
    // Step3: 关闭环境
    sc.stop()
  }
}

默认会按照字典序进行排序：

(1,1)
(11,2)
(2,3)

若不想按字典序排列，可以进行数据类型转换：

val sortRDD = rdd.sortBy(t=> t._1.toInt)

输出结果：

(1,1)
(2,3)
(11,2)

sortBy() 默认为升序，第二个方式可以改变排序方式：

val sortRDD = rdd.sortBy(t=> t._1,ascending = false)

输出结果，按字典序降序：

(2,3)
(11,2)
(1,1)

partition函数

上一篇：从零开始学SQL进阶，数据分析师必备SQL取数技巧，建议收藏
下一篇：excel的高级用法——宏，原来如此实用

7 RDD常用算子(2)（rd算法）

filter()

sample()

distinct()

coalesce()

repartition()

sortBy()

相关推荐

我用 1 个 2 手计算器换了 3 台 MacBook(上)

零基础也能搞定!DeepSeek大模型本地安装全攻略

Win7中同时安装python2和python3的方法

Python三目运算符(三元运算符)用法详解

PS零基础入门教程:Photoshop 2024工具详解—标尺工具

按颜色计数、求和、算平均值或最大值?学这个函数就够啦!

SpringBoot中使用LocalDateTime踩坑记录

中药古今研究:人参

「mysql第二次安装不了」mysql安装失败怎么清理干净?

最全的linux安装软件方法 linux安装软件流程