Changes from 1 commit

Commits (45)
76913e2  Batch oriented kafka rdd, WIP. todo: cluster metadata / finding leader (koeninger, Nov 23, 2014)
1d70625  WIP on kafka cluster (koeninger, Nov 23, 2014)
0b94b33  use dropWhile rather than filter to trim beginning of fetch response (koeninger, Nov 24, 2014)
4dafd1b  method to get leader offsets, switch rdd bound to being exclusive sta… (koeninger, Nov 24, 2014)
ce91c59  method to get consumer offsets, explicit error handling (koeninger, Nov 24, 2014)
7d050bc  methods to set consumer offsets and get topic metadata, switch back t… (koeninger, Nov 24, 2014)
783b477  update tests for kafka 8.1.1 (koeninger, Nov 25, 2014)
29c6b43  cleanup logging (koeninger, Nov 25, 2014)
3c2a96a  fix scalastyle errors (koeninger, Nov 25, 2014)
4b078bf  differentiate between leader and consumer offsets in error message (koeninger, Nov 25, 2014)
8d7de4a  make sure leader offsets can be found even for leaders that arent in … (koeninger, Nov 25, 2014)
979da25  dont allow empty leader offsets to be returned (koeninger, Nov 26, 2014)
38bb727  give easy access to the parameters of a KafkaRDD (koeninger, Dec 3, 2014)
326ff3c  add some tests (koeninger, Dec 16, 2014)
6bf14f2  first attempt at a Kafka dstream that allows for exactly-once semantics (koeninger, Dec 24, 2014)
bcca8a4  Merge branch 'master' of https://github.com/apache/spark into kafkaRdd (koeninger, Dec 24, 2014)
37d3053  make KafkaRDDPartition available to users so offsets can be committed… (koeninger, Dec 25, 2014)
cac63ee  additional testing, fix fencepost error (koeninger, Dec 25, 2014)
e09045b  [SPARK-4964] add foreachPartitionWithIndex, to avoid doing equivalent… (koeninger, Dec 26, 2014)
8bfd6c0  [SPARK-4964] configure rate limiting via spark.streaming.receiver.max… (koeninger, Dec 30, 2014)
1d50749  [SPARK-4964] code cleanup per tdas (koeninger, Dec 30, 2014)
adf99a6  [SPARK-4964] fix serialization issues for checkpointing (koeninger, Jan 5, 2015)
356c7cc  [SPARK-4964] code cleanup per helena (koeninger, Jan 9, 2015)
e93eb72  [SPARK-4964] refactor to add preferredLocations. depends on SPARK-4014 (koeninger, Jan 9, 2015)
e86317b  [SPARK-4964] try seed brokers in random order to spread metadata requ… (koeninger, Jan 10, 2015)
0458e4e  [SPARK-4964] recovery of generated rdds from checkpoint (koeninger, Jan 10, 2015)
548d529  Merge branch 'master' of https://github.com/apache/spark into kafkaRdd (koeninger, Jan 14, 2015)
c1bd6d9  [SPARK-4964] use newly available attemptNumber for correct retry beha… (koeninger, Jan 14, 2015)
d4a7cf7  [SPARK-4964] allow for use cases that need to override compute for cu… (koeninger, Jan 14, 2015)
bb80bbe  [SPARK-4964] scalastyle line length (koeninger, Jan 26, 2015)
2e67117  [SPARK-4964] one potential way of hiding most of the implementation, … (koeninger, Jan 28, 2015)
19406cc  Merge branch 'master' of https://github.com/apache/spark into kafkaRdd (koeninger, Jan 28, 2015)
99d2eba  [SPARK-4964] Reduce level of nesting. If beginning is past end, its … (koeninger, Jan 29, 2015)
80fd6ae  [SPARK-4964] Rename createExactlyOnceStream so it isnt over-promising… (koeninger, Jan 29, 2015)
2b340d8  [SPARK-4964] refactor per TD feedback (koeninger, Jan 30, 2015)
9a838c2  [SPARK-4964] code cleanup, add more tests (koeninger, Jan 30, 2015)
0090553  [SPARK-4964] javafication of interfaces (koeninger, Jan 30, 2015)
9adaa0a  [SPARK-4964] formatting (koeninger, Jan 30, 2015)
4354bce  [SPARK-4964] per td, remove java interfaces, replace with final class… (koeninger, Feb 3, 2015)
825110f  [SPARK-4964] rename stuff per TD (koeninger, Feb 3, 2015)
8991017  [SPARK-4964] formatting (koeninger, Feb 3, 2015)
0df3ebe  [SPARK-4964] add comments per pwendell / dibbhatt (koeninger, Feb 3, 2015)
8c31855  [SPARK-4964] remove HasOffsetRanges interface from return types (koeninger, Feb 4, 2015)
59e29f6  [SPARK-4964] settle on "Direct" as a naming convention for the new st… (koeninger, Feb 4, 2015)
1dc2941  [SPARK-4964] silence ConsumerConfig warnings about broker connection … (koeninger, Feb 4, 2015)
Commit ce91c591569b8ac4e91dd29d013961fe0ee5c316
method to get consumer offsets, explicit error handling
koeninger committed Nov 24, 2014
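The heart of this commit is replacing Option return types in KafkaCluster with Either[Err, T], where the Err alias accumulates every Throwable hit while cycling through brokers, so callers can see why all attempts failed instead of getting a bare None. A minimal sketch of that calling convention (illustrative only, not code from the PR):

import scala.collection.mutable.ArrayBuffer

object ErrHandlingSketch {
  // Same shape as the commit's alias: one accumulator for all broker failures.
  type Err = ArrayBuffer[Throwable]

  // A caller unwraps the Either instead of pattern matching on Option.
  def unwrapOffsets(result: Either[Err, Map[String, Long]]): Map[String, Long] =
    result match {
      case Right(offsets) => offsets
      case Left(errs) =>
        throw new Exception(s"offset lookup failed:\n${errs.mkString("\n")}")
    }
}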
KafkaCluster.scala

@@ -18,8 +18,9 @@
package org.apache.spark.rdd.kafka

import scala.util.control.NonFatal
import scala.collection.mutable.ArrayBuffer
import java.util.Properties
import kafka.api.{OffsetRequest, OffsetResponse, PartitionOffsetRequestInfo, TopicMetadataRequest, TopicMetadataResponse}
import kafka.api.{OffsetRequest, OffsetResponse, OffsetFetchRequest, OffsetFetchResponse, PartitionOffsetRequestInfo, TopicMetadataRequest, TopicMetadataResponse}
import kafka.common.{ErrorMapping, TopicAndPartition}
import kafka.consumer.{ConsumerConfig, SimpleConsumer}

@@ -30,6 +31,8 @@ import kafka.consumer.{ConsumerConfig, SimpleConsumer}
* NOT zookeeper servers, specified in host1:port1,host2:port2 form
*/
class KafkaCluster(val kafkaParams: Map[String, String]) {
Contributor: It would probably be good to make this private[spark] and keep it as an internal utility.

Contributor (author): The rdd would be really unpleasant to actually use without the convenience methods exposed by KafkaCluster, especially if you're keeping your offsets in zookeeper and doing idempotent writes.

Contributor: I see - sorry, let me look more; I didn't realize this is necessary for users.

Contributor (author): For example:

https://github.com/koeninger/kafka-exactly-once/blob/master/src/main/scala/example/IdempotentExample.scala#L60

We also use it for doing things like starting a stream at the leader offsets before a given time.
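A sketch of that last use case, under the KafkaCluster API as it stands in this commit (the broker address, topic, and timestamp are placeholders, and it glosses over any additional consumer configuration the constructor may require):

import kafka.common.TopicAndPartition
import org.apache.spark.rdd.kafka.KafkaCluster

object LeaderOffsetsBeforeTime {
  def main(args: Array[String]): Unit = {
    val kc = new KafkaCluster(Map("metadata.broker.list" -> "broker1:9092"))
    val tps = Set(TopicAndPartition("events", 0), TopicAndPartition("events", 1))

    // Leader offsets for the latest messages written before this epoch-ms time;
    // Right holds one offset per partition, Left holds the accumulated errors.
    kc.getLeaderOffsets(tps, before = 1416787200000L) match {
      case Right(offsets) => offsets.foreach { case (tp, o) => println(s"$tp -> $o") }
      case Left(errs) => errs.foreach(e => println(e.getMessage))
    }
  }
}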

type Err = ArrayBuffer[Throwable]

val brokers: Array[(String, Int)] =
kafkaParams.get("metadata.broker.list")
.orElse(kafkaParams.get("bootstrap.servers"))
@@ -47,79 +50,113 @@ class KafkaCluster(val kafkaParams: Map[String, String]) {
def connect(hostAndPort: (String, Int)): SimpleConsumer =
Contributor: better remove this method since it doesn't do much ...
connect(hostAndPort._1, hostAndPort._2)

def connectLeader(topic: String, partition: Int): Option[SimpleConsumer] =
findLeader(topic, partition).map(connect)
def connectLeader(topic: String, partition: Int): Either[Err, SimpleConsumer] =
findLeader(topic, partition).right.map(connect)

def findLeader(topic: String, partition: Int): Option[(String, Int)] = {
def findLeader(topic: String, partition: Int): Either[Err, (String, Int)] = {
val req = TopicMetadataRequest(TopicMetadataRequest.CurrentVersion, 0, config.clientId, Seq(topic))
brokers.foreach { hp =>
var consumer: SimpleConsumer = null
try {
consumer = connect(hp)
val resp: TopicMetadataResponse = consumer.send(req)
resp.topicsMetadata.find(_.topic == topic).flatMap { t =>
t.partitionsMetadata.find(_.partitionId == partition)
}.foreach { partitionMeta =>
partitionMeta.leader.foreach { leader =>
return Some((leader.host, leader.port))
}
val errs = new Err
withBrokers(errs) { consumer =>
val resp: TopicMetadataResponse = consumer.send(req)
resp.topicsMetadata.find(_.topic == topic).flatMap { t =>
Contributor: This block of code is pretty hard to understand with so many levels of nesting. Can you rewrite it, maybe by introducing variables and adding comments to explain what is going on? Overall I feel this PR went slightly overboard with Scala: with no explicit types, intermediate variables, or comments, a lot of the blocks are pretty hard to understand.
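One possible flattening along the lines the reviewer suggests (a hypothetical rewrite with explicit types and intermediate values, assuming the kafka.api metadata classes used elsewhere in this file):

def leaderHostPort(
    consumer: SimpleConsumer,
    req: TopicMetadataRequest,
    topic: String,
    partition: Int): Option[(String, Int)] = {
  val resp: TopicMetadataResponse = consumer.send(req)
  // Narrow down step by step: topic metadata, then partition metadata, then leader.
  val topicMeta: Option[TopicMetadata] = resp.topicsMetadata.find(_.topic == topic)
  val partMeta: Option[PartitionMetadata] =
    topicMeta.flatMap(_.partitionsMetadata.find(_.partitionId == partition))
  partMeta.flatMap(_.leader).map(leader => (leader.host, leader.port))
}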

t.partitionsMetadata.find(_.partitionId == partition)
}.foreach { partitionMeta =>
partitionMeta.leader.foreach { leader =>
return Right((leader.host, leader.port))
}
} catch {
case NonFatal(e) =>
} finally {
if (consumer != null) consumer.close()
}
}
None
Left(errs)
}

def getLatestLeaderOffsets(topicAndPartitions: Set[TopicAndPartition]): Map[TopicAndPartition, Long] =
def getLatestLeaderOffsets(topicAndPartitions: Set[TopicAndPartition]): Either[Err, Map[TopicAndPartition, Long]] =
getLeaderOffsets(topicAndPartitions, OffsetRequest.LatestTime)

def getEarliestLeaderOffsets(topicAndPartitions: Set[TopicAndPartition]): Map[TopicAndPartition, Long] =
def getEarliestLeaderOffsets(topicAndPartitions: Set[TopicAndPartition]): Either[Err, Map[TopicAndPartition, Long]] =
getLeaderOffsets(topicAndPartitions, OffsetRequest.EarliestTime)

def getLeaderOffsets(topicAndPartitions: Set[TopicAndPartition], before: Long): Map[TopicAndPartition, Long] =
getLeaderOffsets(topicAndPartitions, before, 1).map { kv =>
// mapValues isnt serializable, see SI-7005
kv._1 -> kv._2.head
def getLeaderOffsets(topicAndPartitions: Set[TopicAndPartition], before: Long): Either[Err, Map[TopicAndPartition, Long]] =
getLeaderOffsets(topicAndPartitions, before, 1).right.map { r =>
r.map { kv =>
// mapValues isnt serializable, see SI-7005
kv._1 -> kv._2.head
}
}

def getLeaderOffsets(topicAndPartitions: Set[TopicAndPartition], before: Long, maxNumOffsets: Int): Map[TopicAndPartition, Seq[Long]] = {
def getLeaderOffsets(
topicAndPartitions: Set[TopicAndPartition],
before: Long,
maxNumOffsets: Int
): Either[Err, Map[TopicAndPartition, Seq[Long]]] = {
var result = Map[TopicAndPartition, Seq[Long]]()
val req = OffsetRequest(
topicAndPartitions.map(tp => tp -> PartitionOffsetRequestInfo(before, 1)).toMap
)
val errs = new Err
withBrokers(errs) { consumer =>
val resp: OffsetResponse = consumer.getOffsetsBefore(req)
val respMap = resp.partitionErrorAndOffsets
val needed = topicAndPartitions.diff(result.keys.toSet)
needed.foreach { tp =>
respMap.get(tp).foreach { errAndOffsets =>
if (errAndOffsets.error == ErrorMapping.NoError) {
result += tp -> errAndOffsets.offsets
} else {
errs.append(ErrorMapping.exceptionFor(errAndOffsets.error))
}
}
}
if (result.keys.size == topicAndPartitions.size) {
return Right(result)
}
}
val missing = topicAndPartitions.diff(result.keys.toSet)
errs.append(new Exception(s"Couldn't find offsets for ${missing}"))
Left(errs)
}

def getConsumerOffsets(groupId: String, topicAndPartitions: Set[TopicAndPartition]): Either[Err, Map[TopicAndPartition, Long]] = {
var result = Map[TopicAndPartition, Long]()
val req = OffsetFetchRequest(groupId, topicAndPartitions.toSeq)
val errs = new Err
withBrokers(errs) { consumer =>
val resp: OffsetFetchResponse = consumer.fetchOffsets(req)
val respMap = resp.requestInfo
val needed = topicAndPartitions.diff(result.keys.toSet)
needed.foreach { tp =>
respMap.get(tp).foreach { offsetMeta =>
if (offsetMeta.error == ErrorMapping.NoError) {
result += tp -> offsetMeta.offset
} else {
errs.append(ErrorMapping.exceptionFor(offsetMeta.error))
}
}
}
if (result.keys.size == topicAndPartitions.size) {
return Right(result)
Reviewer: return? How about just Right(result)?

Contributor (author): That's an early return from withBrokers, which would otherwise keep running the closure on each broker.
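For readers less familiar with this corner of Scala, a standalone sketch of the behavior the author is relying on: a return inside a closure unwinds the enclosing method, not just the closure, so the loop over brokers stops at the first complete result rather than contacting every broker (illustrative names only):

object EarlyReturnSketch {
  // Stand-in for withBrokers: runs fn once per broker.
  def withBrokers(brokers: Seq[String])(fn: String => Unit): Unit =
    brokers.foreach(fn)

  def firstAnswer(brokers: Seq[String]): Either[String, String] = {
    withBrokers(brokers) { b =>
      println(s"asking $b")
      // Non-local return: exits firstAnswer itself, not just this closure.
      return Right(s"answer from $b")
    }
    Left("no broker answered")
  }

  def main(args: Array[String]): Unit =
    println(firstAnswer(Seq("b1", "b2"))) // prints "asking b1" once, then Right(...)
}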

}
}
val missing = topicAndPartitions.diff(result.keys.toSet)
errs.append(new Exception(s"Couldn't find offsets for ${missing}"))
Left(errs)
}

def setConsumerOffsets(groupId: String, offsets: Map[TopicAndPartition, Long]): Unit = ???

private def withBrokers(errs: Err)(fn: SimpleConsumer => Any): Unit = {
brokers.foreach { hp =>
var consumer: SimpleConsumer = null
try {
consumer = connect(hp)
val resp: OffsetResponse = consumer.getOffsetsBefore(req)
val respParts = resp.partitionErrorAndOffsets
val needed = topicAndPartitions.diff(result.keys.toSet)
needed.foreach { tp =>
respParts.get(tp).foreach { errAndOffsets =>
if (errAndOffsets.error == ErrorMapping.NoError) {
result += tp -> errAndOffsets.offsets
}
}
}
if (result.keys.size == topicAndPartitions.size) {
return result
}
fn(consumer)
} catch {
case NonFatal(e) =>
errs.append(e)
} finally {
if (consumer != null) consumer.close()
}
Reviewer: Or you can stay away from java nulls:

var consumer: Option[SimpleConsumer] = None
try {
  consumer = Some(connect(hp))
  consumer map (c => fn(c)) 
} catch {
  case NonFatal(e) => errs.append(e)
} finally consumer map (_.close())

Contributor (author): I appreciate the feedback, but it's impossible to "stay away from java nulls" in a JVM language without runtime checks.

Despite propaganda to the contrary, option.map is not a replacement for null checks.

The code you wrote can still throw a null pointer exception (if SimpleConsumer returns null, for instance).

You can hide the null pointer check inside of Option.apply instead of using Some, but at that point I'd rather just be explicit about what is going on.
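A tiny standalone illustration of the distinction being drawn here: Some(x) wraps whatever it is given, including null, while Option(x) performs the null check up front:

object NullVsOption {
  def main(args: Array[String]): Unit = {
    val fromJava: String = null // stand-in for a Java API that can return null

    println(Option(fromJava).map(_.length)) // None: Option.apply does the null check

    try {
      println(Some(fromJava).map(_.length)) // Some(null).map still dereferences null
    } catch {
      case e: NullPointerException => println(s"Some(null) blew up: $e")
    }
  }
}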

Reviewer: True, it could, because my suggestion wasn't conclusive; it should also map to None where applicable, i.e. catch => None.

Contributor: The use of null here is fine, and very clear. The pattern matching with the finally actually makes it much harder to understand what is going on.

One nitpick: you need to put curly braces around consumer.close(), i.e.

if (consumer != null) {
  consumer.close()
}

}
val missing = topicAndPartitions.diff(result.keys.toSet)
throw new Exception(s"Couldn't find offsets for ${missing}")
}

def getConsumerOffsets(topicAndPartitions: Set[TopicAndPartition]): Map[TopicAndPartition, Long] = ???

def setConsumerOffsets(offsets: Map[TopicAndPartition, Long]): Unit = ???
}

object KafkaCluster {
KafkaRDD.scala

@@ -82,8 +82,10 @@ class KafkaRDD[
val valueDecoder = classTag[T].runtimeClass.getConstructor(classOf[VerifiableProperties])
.newInstance(kc.config.props)
.asInstanceOf[Decoder[V]]
val consumer: SimpleConsumer = kc.connectLeader(part.topic, part.partition)
.getOrElse(throw new Exception(s"Couldn't connect to leader for topic ${part.topic} ${part.partition}"))
val consumer: SimpleConsumer = kc.connectLeader(part.topic, part.partition).fold(
errs => throw new Exception(s"""Couldn't connect to leader for topic ${part.topic} ${part.partition}: ${errs.mkString("\n")}"""),
consumer => consumer
)
var requestOffset = part.afterOffset + 1
var iter: Iterator[MessageAndOffset] = null
