Kafka Consumer Management

''Oh no! Something is wrong with the consumer!'' (set: $_expert to false) (set: $_rootCauseFound to false) (set: $_deploy to false) (set: $mitigated to false) (set: $timeToRecovery to 0) You have been using Kafka in your team for a while. The consumer is chugging along, no problem. Suddenly, you get tagged into that one random Slack thread as the on-call for your team. It seems your data is not being updated... what do you do? (button:)[[Try to redeploy the service->Deploy1]] (button:)[[Open Datadog->Datadog1]] (button:)[[I'm not a Kafka expert. Let's call someone else->Other1]]

<div class="title"> (align:"=><=")+(box:"X=")[(text-colour:purple)[''Mean Time To Recovery'']] </div> Are you an operational hero? A champion of the systems? A paladin of reliability? Show it! How fast can you help recover "business as usual" in these scenarios? (align:"=><=")+(box:"X=")[(button:)[[Play Game: Kafka Troubles->Kafka1]]]

You go into Rundeck to restart the service deployment. (set: $_deploy to true) (Wait 5 minutes.) (click-append: "(Wait 5 minutes.)")[ The deploy finishes. The data doesn't seem to appear. (set: $timeToRecovery to $timeToRecovery+5) (button:)[[Open Datadog->Datadog1]] (button:)[[I'm not a Kafka expert. Let's call someone else->Other1]] ]

You open Datadog. You see a wide array of dashboards. (button:)[[I have a Kafka dashboard!->Datadog2]] (button:)[[I don't know what to look for...->DatadogLost1]]

You spend time figuring out who to call. (set: $timeToRecovery to $timeToRecovery+10) Eventually, exhausted, you post a message in the #kafka channel. (Wait 30 minutes.) (click-append: "(Wait 30 minutes.)")[You are joined by a passerby who has seen your plea for help. (set: $timeToRecovery to $timeToRecovery+30) They ask, "Do you have consumer offset lag?" (set:$_expert to true) (button:)[[Open Datadog->Datadog1]] ]

Your Kafka dashboard shows many charts, and several of them are changing. <img src="https://files.slack.com/files-pri/T02CZ747G-F0439ALBFUL/image.png?pub_secret=c6cd3ab52f" /> <img src="https://files.slack.com/files-pri/T02CZ747G-F0436MZTYF5/image.png?pub_secret=27b7b2c373"/> <img src="https://files.slack.com/files-pri/T02CZ747G-F0433T8B5U5/image.png?pub_secret=b501cc72ba" /> (button:)[[Check your consumer performance (first chart)->Perf1]] (button:)[[Check your offset lag (second chart)->OffsetLag1]] (button:)[[Check your producer message rate (third chart)->Perf2]]

(set: $timeToRecovery to $timeToRecovery+1) (if: $_expert)[ Your expert is telling you to look up consumer offset lag. They give you the correct metric name. (button:)[[Look up offset lag->OffsetLag1]] ] (else:)[ You don't know what metrics you can use. Or do you? (click-append:"do you?")[ Type the metric name. (input: bind _metric, "X=")(button:)[Look up metric.] (click-append: "Look up metric.")[ (if: _metric contains "kafka.kafka_consumergroup_group" and _metric contains "lag")[(go-to:"OffsetLag1")] (else:)[ This isn't a metric name we know... (button:)[[Go back to Datadog->Datadog1]] (button:)[[I'm not a Kafka expert. Let's call someone else->Other1]] ] ] ] ]

The consumer is lagging. Its offset lag has been growing. (set: $timeToRecovery to $timeToRecovery+1) (if: $_deploy is false)[ (button:)[[Try redeploying maybe.->Deploy1]] ] (if: $_expert)[ (button:)[[Ask the expert what we should do.->Other2]] ] (button:)[[Look for the root cause.->RootCause1]] (button:)[[Let's deploy more consumers.->Deploy3]]

You need to investigate the root cause of the consumer lag problem. How do you proceed? (button:)[[Ask Confluent support.->Confluent1]] (button:)[[Look for messages in the consumer log.->Logging1]] (button:)[[Check your consumer performance.->Perf1]]

You have a 5-minute discussion on what to do. (set: $timeToRecovery to $timeToRecovery+5) Your Kafka expert looks around a bit and says it can't hurt to deploy more consumers, as it may give the ones that are stuck time to recover. However, this only helps if the topic has at least as many partitions as the new number of consumers; any extra consumers in the group would sit idle. (button:)[[Let's deploy more consumers.->Deploy3]] (button:)[[Let's try to find the root cause first.->RootCause1]]

You write to Confluent support. (Wait 15 minutes.) (click-append: "(Wait 15 minutes.)")[ (set: $timeToRecovery to $timeToRecovery+15) The support agent says there is no known issue on their side, but an engineer can look into it in more detail and get back to you in 1-3 days. (button:)[[Let's wait till then.->TimesUp1]] (button:)[[Look for messages in the consumer log.->Logging1]] (button:)[[Check your consumer performance.->Perf1]] ]

You see a lot of messages in the consumer log. (set: $timeToRecovery to $timeToRecovery+1) <img src="https://files.slack.com/files-pri/T02CZ747G-F042RLCC89M/image.png?pub_secret=30dfb5e4c3"/> The consumer group seems to be rebalancing, but you're not sure how that relates to your issue. (if: $_expert)[ The Kafka expert in your team tells you that if each consumer re-joins the group after a rebalance, it should be getting messages. Maybe your consumer is the one that's not processing them. ] (button:)[[Ask Confluent support.->Confluent1]] (button:)[[Let's get Infraplat involved.->Infra1]] (button:)[[Check your consumer performance.->Perf1]]

The following charts show your consumer's CPU usage, memory usage, and time spent on requests over the last 15 minutes. <img src="https://files.slack.com/files-pri/T02CZ747G-F0439ALBFUL/image.png?pub_secret=c6cd3ab52f" /> <img src="https://files.slack.com/files-pri/T02CZ747G-F043C9A7TS6/image.png?pub_secret=940566646d"/> <img src="https://files.slack.com/files-pri/T02CZ747G-F043CBUAAUT/image.png?pub_secret=652dad9dff" /> (button:)[[Let's try to give it more CPU.->Deploy2]] (if: $mitigated is false)[ (button:)[[Let's deploy more consumers.->Deploy3]] ] (button:)[[It looks like it's taking longer on some requests...->Spikes1]] (button:)[[All this looks normal. We should ask Infraplat if the reason for the slowness is on their side.->Infra1]]

<div class="end"> (text-rotate-z:348)[''The End''] </div> You waited too long for the issue to be resolved. When you have an incident that threatens your app functionality, you need to mitigate it as fast as possible. Research shows that in high-performing teams, mean time to recovery is less than an hour. ''Your Score:'' 0

You update skydome-playbooks and have to kick off a redeploy for it to pick up the new CPU allocation. (Wait 5 minutes.) (click-append: "(Wait 5 minutes.)")[ The deploy finishes. There is no improvement. (set: $timeToRecovery to $timeToRecovery+5) (Wait 5 more minutes.) (click-append: "(Wait 5 more minutes.)")[ The lag continues. (set: $timeToRecovery to $timeToRecovery+5) ] (if: $mitigated is false)[ (button:)[[Let's deploy more consumers.->Deploy3]] ] (button:)[[Let's get Infraplat involved.->Infra1]] ]

You increase the `min_count` of consumer containers. You need to re-deploy to propagate your change. (Wait 5 minutes.) (click-append: "(Wait 5 minutes.)")[ The deploy finishes. The consumers are starting up. (set: $timeToRecovery to $timeToRecovery+5) (Wait 5 more minutes.) (click-append: "(Wait 5 more minutes.)")[ The lag seems to slowly reduce. (set: $timeToRecovery to $timeToRecovery+5) (set: $mitigated to true) (set: $timeToMitigation to $timeToRecovery) ''Congratulations!'' You have mitigated the incident. ''Time To Recovery'': $timeToRecovery minutes. (button:)[[Cool, let's go back to work!->Quality1]] (if: $_rootCauseFound)[ (button:)[[Try to fix the root cause.->Spikes1]] ] (else:)[ (button:)[[Look for the root cause.->RootCause1]] ] ] ]

The Infraplat on-call joins. They do not believe your sudden problem is related to their side because no one else has seen it on their service, but they are still willing to investigate. (Wait 3 hours.) (set: $timeToRecovery to $timeToRecovery+180) (button:)[[Maybe it'll get better tomorrow.->TimesUp1]] (if: $_rootCauseFound)[ (button:)[[I can fix it myself. I know the root cause.->Spikes1]] ] (else:)[ (button:)[[Look for the root cause in your consumer.->RootCause1]] ]

You see little spikes on some requests. It seems that your consumer is periodically doing an API call to another service that can sometimes take 2 minutes to fail... (set:$_rootCauseFound to true) (button:)[[The owning team should fix it.->Other3]] (button:)[[Why not make it an async call?->Fix1]] (button:)[[We should add a timeout on our consumer->Fix2]]

<div class="end"> (text-rotate-z:348)[''The End''] </div> You mitigated the issue, but did not attempt to find or fix the real reason for the problem. It will reoccur, and you will keep throwing money at it, only being alerted after your clients have already experienced data that is hard to trust or orders that are too late. (set: $score to 3000 - $timeToMitigation) ''Your Score:'' $score

You contact the team that owns the other service. (set: $timeToRecovery to $timeToRecovery+15) They promise to have a fix out next week. (button:)[[Let's wait a week.->TimesUp1]]

To change the code of your consumer to make the call async, you need to figure out what to do with messages where this call fails, and how to track state between the downstream calls and the Kafka event you have to process. You're thinking of something like a dead-letter queue, and you're wondering what to do if you run out of workers to do your async requests, too. Maybe you should put the API request result in a cache to help prevent that. You analyze how many new components you need, and you estimate it will take two weeks to ship. (set: $timeToRecovery to $timeToRecovery+20) (if: $mitigated is false)[ (button:)[[Let's wait till then.->TimesUp1]] ] (else:)[ The issue has been mitigated; we only have to keep adding consumers until we can ship the new architecture and hope the problem doesn't worsen. (button:)[[Let's take the two weeks.->LTFix1]] ] Unless we want to see if there's a quicker fix? (button:)[[We should add a timeout on our consumer->Fix2]] (if: $mitigated is false)[ (button:)[[Let's deploy more consumers.->Deploy3]] ]

You add a timeout to your consumer so that it doesn't wait forever on this API call (but fails to process a message when the call fails). You need to redeploy for the changes to take effect. (Wait 5 minutes.) (click-append: "(Wait 5 minutes.)")[ The deploy finishes. The consumers are starting up. (set: $timeToRecovery to $timeToRecovery+5) (Wait 5 more minutes.) (click-append: "(Wait 5 more minutes.)")[ The lag seems to slowly reduce. (set: $timeToRecovery to $timeToRecovery+5) (if: $mitigated is false)[ (set: $mitigated to true) (set: $timeToMitigation to $timeToRecovery) ''Congratulations!'' You have mitigated the incident. ''Time To Recovery'': $timeToRecovery minutes. ] (button:)[[Cool, let's go back to work!->LTFix2]] ] ]

<div class="end"> (text-rotate-z:348)[''The End''] </div> You mitigated the issue and then improved your system so it did not reoccur. Congratulations! While you had to make many architecture changes to get there, you're still happy about what you've learned in the process. (set: $score to 3000 - $timeToMitigation) ''Your Score:'' $score

Your producer is emitting messages at a regular rate and you do not see any spikes. (set: $timeToRecovery to $timeToRecovery+1) (if: $_expert)[ The expert you pulled in indicates that this means your producer probably needs no changes. ] (else:)[ (button:)[[Try to find an expert to help.->Other1]] ] (button:)[[Try to redeploy the producers.->Deploy4]] (button:)[[Go back to looking at the charts.->Datadog2]]

You go into Rundeck to restart the producer service deployment. (Wait 5 minutes.) (click-append: "(Wait 5 minutes.)")[ The deploy finishes. The data doesn't seem to appear. (set: $timeToRecovery to $timeToRecovery+5) (button:)[[Go back to looking at the charts.->Datadog2]] (if: $_expert is false)[ (button:)[[Try to find an expert to help.->Other1]] ] ]

<div class="end"> (text-rotate-z:348)[''The End''] </div> You mitigated the issue and then improved your system so it did not reoccur. Congratulations! The fix was actually really quick, all in a day's work! You're especially happy about what you've learned in the process. (set: $score to 6000 - $timeToMitigation) ''Your Score:'' $score
