Spark + Kafka Direct approach: sending the consumption offsets to Zookeeper
Apache Spark 1.3.0 introduced the Direct API, which uses Kafka's low-level consumer API to read data from the Kafka cluster and maintains the offset-related information inside the Spark Streaming system itself. This is how it achieves zero data loss, and it is more efficient than the Receiver-based approach. However, because Spark Streaming maintains the Kafka read offsets on its own and never sends these consumption offsets to Zookeeper, offset-based Kafka cluster monitoring software (e.g. Apache Kafka Web Console, KafkaOffsetMonitor, etc.) cannot see the consumption progress.
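To make the contrast concrete, the minimal sketch below shows the two ways a stream can be created; the broker address, Zookeeper address, group id, topic name and batch interval are all placeholders. Only the Receiver-based high-level consumer commits its offsets to Zookeeper; the Direct stream does not.

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val ssc = new StreamingContext(new SparkConf().setAppName("iteblog-demo"), Seconds(10))

// Receiver-based approach: the high-level consumer commits its offsets to Zookeeper,
// so offset-based monitoring tools can follow the consumption progress.
val receiverStream = KafkaUtils.createStream(
  ssc, "zk1:2181", "iteblog-group", Map("iteblog" -> 1))

// Direct approach: offsets are tracked inside Spark Streaming and nothing is written
// to Zookeeper, which is exactly what breaks offset-based monitoring.
val directStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, Map("metadata.broker.list" -> "broker1:9092"), Set("iteblog"))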
This article sets out to solve exactly that problem, so that after every batch of data is received our Spark Streaming program automatically updates the Kafka offsets stored in Zookeeper. From Spark's official documentation we know that the per-partition consumption offsets maintained inside Spark are exposed through the HasOffsetRanges trait and its offsetRanges field, which we can obtain in the Spark Streaming program with:

val offsetsList = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

With the partition consumption information in hand, we only need to traverse offsetsList and send it to Zookeeper to update the Kafka consumption offsets. The complete code snippet is as follows:
val messages = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topicsSet)

messages.foreachRDD(rdd => {
  // The offset ranges this batch read from each Kafka partition
  val offsetsList = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  val kc = new KafkaCluster(kafkaParams)
  for (offsets <- offsetsList) {
    val topicAndPartition = TopicAndPartition("iteblog", offsets.partition)
    // args(0) is the consumer group id; commit the until-offset of this batch
    val o = kc.setConsumerOffsets(args(0), Map((topicAndPartition, offsets.untilOffset)))
    if (o.isLeft) {
      println(s"Error updating the offset to Kafka cluster: ${o.left.get}")
    }
  }
})
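For reference, each element of offsetsList is an OffsetRange describing what the current batch read from one partition; the snippet commits untilOffset, the position just after the last message of the batch. The values in the small illustration below are made up:

import org.apache.spark.streaming.kafka.OffsetRange

// An OffsetRange records, per partition, the interval of offsets a batch consumed.
val example = OffsetRange("iteblog", 0, 100L, 250L)  // topic, partition, fromOffset, untilOffset
println(s"topic=${example.topic} partition=${example.partition} " +
  s"from=${example.fromOffset} until=${example.untilOffset}")

When consuming more than one topic, offsets.topic could be used in the loop instead of the hard-coded "iteblog" name.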
The KafkaCluster class in the program above is used to talk to the Kafka cluster. For each partition we build a Map((topicAndPartition, offsets.untilOffset)) entry with its consumption offset and then call the KafkaCluster class's setConsumerOffsets method to update the information stored in Zookeeper. Once the Kafka offsets are updated this way, we can finally use the KafkaOffsetMonitor software to monitor the consumption of the corresponding topic in Kafka. The figure below shows the KafkaOffsetMonitor view.

[Figure: KafkaOffsetMonitor displaying the consumption progress of the monitored topic and partitions]

From the figure we can see that KafkaOffsetMonitor is now able to monitor the consumption of the relevant Kafka partitions, which is very important for monitoring our whole Spark Streaming program, because it lets us see Spark's read speed at any time. In addition, the complete code of the KafkaCluster utility class is as follows:
package org.apache.spark.streaming.kafka

import kafka.api.OffsetCommitRequest
import kafka.common.{ErrorMapping, OffsetMetadataAndError, TopicAndPartition}
import kafka.consumer.SimpleConsumer
import org.apache.spark.SparkException
import org.apache.spark.streaming.kafka.KafkaCluster.SimpleConsumerConfig

import scala.collection.mutable.ArrayBuffer
import scala.util.Random
import scala.util.control.NonFatal

/**
 * User: 过往记忆
 * Date: 2015-06-02
 * Time: 23:46
 * Blog: https://www.iteblog.com
 * The 过往记忆 blog focuses on Hadoop, Hive, Spark, Shark and Flume.
 * WeChat public account: iteblog_hadoop
 */
class KafkaCluster(val kafkaParams: Map[String, String]) extends Serializable {
  type Err = ArrayBuffer[Throwable]

  @transient private var _config: SimpleConsumerConfig = null

  // Lazily build the consumer configuration from the supplied Kafka parameters
  def config: SimpleConsumerConfig = this.synchronized {
    if (_config == null) {
      _config = SimpleConsumerConfig(kafkaParams)
    }
    _config
  }

  // Commit plain offsets for a consumer group by wrapping them in OffsetMetadataAndError
  def setConsumerOffsets(groupId: String,
      offsets: Map[TopicAndPartition, Long]): Either[Err, Map[TopicAndPartition, Short]] = {
    setConsumerOffsetMetadata(groupId, offsets.map { kv =>
      kv._1 -> OffsetMetadataAndError(kv._2)
    })
  }

  // Send an OffsetCommitRequest to the brokers and collect per-partition results
  def setConsumerOffsetMetadata(groupId: String,
      metadata: Map[TopicAndPartition, OffsetMetadataAndError]): Either[Err, Map[TopicAndPartition, Short]] = {
    var result = Map[TopicAndPartition, Short]()
    val req = OffsetCommitRequest(groupId, metadata)
    val errs = new Err
    val topicAndPartitions = metadata.keySet
    withBrokers(Random.shuffle(config.seedBrokers), errs) { consumer =>
      val resp = consumer.commitOffsets(req)
      val respMap = resp.requestInfo
      val needed = topicAndPartitions.diff(result.keySet)
      needed.foreach { tp: TopicAndPartition =>
        respMap.get(tp).foreach { err: Short =>
          if (err == ErrorMapping.NoError) {
            result += tp -> err
          } else {
            errs.append(ErrorMapping.exceptionFor(err))
          }
        }
      }
      if (result.keys.size == topicAndPartitions.size) {
        return Right(result)
      }
    }
    val missing = topicAndPartitions.diff(result.keySet)
    errs.append(new SparkException(s"Couldn't set offsets for ${missing}"))
    Left(errs)
  }

  // Try each broker in turn, collecting any connection or request failures
  private def withBrokers(brokers: Iterable[(String, Int)], errs: Err)
      (fn: SimpleConsumer => Any): Unit = {
    brokers.foreach { hp =>
      var consumer: SimpleConsumer = null
      try {
        consumer = connect(hp._1, hp._2)
        fn(consumer)
      } catch {
        case NonFatal(e) =>
          errs.append(e)
      } finally {
        if (consumer != null) {
          consumer.close()
        }
      }
    }
  }

  def connect(host: String, port: Int): SimpleConsumer =
    new SimpleConsumer(host, port, config.socketTimeoutMs,
      config.socketReceiveBufferBytes, config.clientId)
}
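As a standalone illustration of how the class above can be driven outside of foreachRDD, a single commit looks roughly like this; the broker address, group id, topic, partition and offset value are all made-up placeholders.

import kafka.common.TopicAndPartition

val kc = new KafkaCluster(Map("metadata.broker.list" -> "broker1:9092"))
val tp = TopicAndPartition("iteblog", 0)

// setConsumerOffsets returns an Either: Left carries the accumulated errors,
// Right carries the per-partition result codes reported by the broker.
kc.setConsumerOffsets("iteblog-group", Map(tp -> 12345L)) match {
  case Left(errs)   => errs.foreach(e => println(s"commit failed: ${e.getMessage}"))
  case Right(codes) => println(s"committed: $codes")
}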