Camunda Community Podcast

Dynamic Scaling of Zeebe Brokers: “Christmas is coming, so I'm going to get new orders, so I'm going to scale”

March 21, 2024 | Season 5, Episode 6

On this episode of the Camunda Community Podcast, we discuss the dynamic scaling of Zeebe clusters. Host and Senior Developer Advocate Niall Deehan is joined by Senior Backend Software Engineer Deepthi Akkoorath to learn more about this new feature and how it came to be, including the technical challenges we overcame and decisions we made, as well as what's next for scaling Zeebe.

Zeebe, the cloud-native workflow and decision engine that powers Camunda 8, has many next-generation capabilities, including a language-agnostic approach so you can build clients in any programming language, enterprise-grade resilience, and horizontal scaling via a distributed architecture. We have further improved this last one by providing the ability to dynamically scale Zeebe brokers and are committed to further enhancing this feature in the future. 

Listen to learn more about: 

  • How this “daisy of destiny” came to Niall’s attention
  • The structure of the Zeebe team and Deepthi’s role on the Zeebe Distributed Platform Team
  • What dynamic scaling of Zeebe brokers is
  • How users can dynamically add brokers
  • What actually happens under the hood when dynamically scaling Zeebe
  • How to scale down Zeebe
  • Failover with scaling Zeebe
  • What’s next for dynamic scaling 

Learn more about Zeebe: https://bit.ly/3vjvslt

Get started with Camunda 8 today:
https://bit.ly/3vjM8tb

---

Additional Resources:
[docs] Camunda 8 Concepts: Clusters
[docs] Zeebe Technical Concepts: Clustering
[docs] Create your cluster
[docs] Setting up a Zeebe cluster
[docs] Cluster Scaling
[github] Zeebe
[github issue] Dynamic resize of an existing Zeebe cluster
[github issue] [EPIC] Dynamic Scaling of Zeebe Cluster - Phase 1: Number of brokers
[github issue] Dynamically start and stop partitions
[blogpost] Zeebe Performance 2023 Year in Review

---

Visit our website.
Connect with us on LinkedIn, Facebook, Mastodon, Threads, and Bluesky.
Check out our videos on YouTube.
Tweet with us.

---

Camunda enables organizations to orchestrate processes across people, systems, and devices to continuously overcome complexity and increase efficiency. With Camunda, business users and developers collaborate using BPMN to model end-to-end processes and run sophisticated automation with the speed, scale, and resilience required to stay competitive. Hundreds of enterprises such as Atlassian, ING, and Vodafone design, orchestrate, and improve business-critical processes with Camunda to accelerate digital transformation.

---

Camunda presents this podcast for informational and entertainment purposes only and does not wish or intend to provide any legal, technical, or any other advice or services to the listeners of this podcast. Please see here for the full disclaimer.

[NIALL DEEHAN]  How quick is it to actually scale Zeebe?

[DEEPTHI AKKOORATH]
As for all questions, the answer is, it depends.

[NIALL]
Perfect. Love that answer.

MUSIC

[NIALL] Hello and welcome along to the Camunda Community Podcast, where you’ll learn all sorts of things about Camunda, BPMN, DMN, and, of course, general orchestration topics. I’m Niall Deehan, senior developer advocate here at Camunda, and I’m host of the podcast.

This week, especially, we're talking about Zeebe, and even more specifically than that, we're talking about a new feature that lets Zeebe scale even better.

So I do a bunch of things here at Camunda–some of which they even pay me for!
One of them is that I get to talk to community members and partners and customers. And they’ll generally give me some feedback about things they would like or things they would like to change.

And then I'll tend to pop along to the field of dreams, as I call it–the backlog of tickets that we have–and just take a look at what's going on with those requests and maybe move some along if I can. But also, while I'm there, I wander this wonderful meadow of potential and sometimes pick a daisy of destiny–potentially a feature that I think is really, really good that I want to keep an eye on. And that happened a couple of years ago with dynamic scaling in Zeebe. I thought it was a feature I'd really love to see, and it's incredibly complex, so I knew I'd want to find out how they were going to build it.

Now, I could try to explain to you myself how we built this, but I won't–partly because I don't know, and partly because I was able to get our Senior Software Engineer Deepthi Akkoorath in today to chat with me about it. Deepthi managed to build this bit of software and has been working on it for a while. And, in this little chat today, we discuss some of the challenges we overcame and the decisions we made in building it–there were lots to make–and I found this fascinating!

It’s one thing to hear that this new feature exists. It’s another to talk about its journey into the release. There’s lots more to come from this feature as well, so we’ll also talk a little bit about what the potential of this feature is. I really hope you enjoy it.

Here we go!

[NIALL]
So Deepthi has been with Camunda for about 5 years now, and people may remember one of her most popular talks at CamundaCon a couple of years ago, where Deepthi discussed how Zeebe's brokers deal with disaster recovery–specifically, when a broker goes down, what happens? How does it get set back up?

Now we want to talk about how Zeebe specifically scales, because it's one of those things that you may not see under the hood. Very kindly, Deepthi is taking valuable time away from developing Zeebe in order to answer these questions.

Deepthi, tell me a little bit about the team you work on here at Camunda.

[DEEPTHI]
Hey, hello! I'm working with Zeebe, but we have two Zeebe teams here. One is focusing on Zeebe process automation–that is everything related to BPMN and DMN and handling all of that. And the other team is called the Zeebe Distributed Platform team, and that's the one I'm on. We handle everything distributed within Zeebe–so all the cluster management, distributed data management, and all the performance, reliability, and scalability parts of Zeebe.

And we also provide the stream processing platform that's being used by the other Zeebe team to implement the actual workflow engine.

[NIALL] Perfect. So, as you told me earlier, you always describe it as: if everything you're doing goes according to plan, the other team does not need to worry about anything involving data streaming or scaling or executing. They're just dealing with the engine aspects, BPMN symbols, and how the engine fundamentally works.

Now, the reason we're talking, of course, is I was going through some of the tickets in Zeebe, and I came across an epic that described dynamic scaling of brokers, and this was a really interesting one for me. I looked at it and thought, “That is a really big deal.” I wanted to talk to you because you are involved in that.

Can you give me a quick understanding of what it's about and what we have managed to achieve?

[DEEPTHI] Well, this has been a much-requested feature from the community and also from our customers. The fundamental architecture of Zeebe allows horizontal scaling because we have these partitions. And you know you can add as many partitions as you want, and it should be horizontally scalable. But support for this was not available. If you look at one of the tickets, it has been open for more than two years, I think.

I was also very excited to work on it, because this is something that's also exciting to me–a very interesting topic. And this would also be important for the future of Zeebe.

[NIALL] Yeah, I completely agree. I still remember when you gave that talk at CamundaCon–the diagram you showed, which was a big deal for people, is that when you start up Zeebe, you're actually starting up three nodes, usually, by default, right? So there's three nodes that all talk to each other, and the reason they exist is for failover, right? So that, you know, if one goes down, you still have two others until the other one comes back, right?

And the dynamic aspect of this is the ability to add more brokers to this at runtime. Can you describe sort of how that works, like what the current setup is? Let's imagine you have your three brokers and you want to add a fifth or sixth. What's the way in which someone can then dynamically add these brokers?

[DEEPTHI] First let me explain from a user's perspective, and then I can go into the details of what happens under the hood. So let's say you start your cluster with three brokers and six partitions, for example, just to keep it simple. Every broker has six partitions, because we also assume that the minimum replication factor is three.

So each broker has six partitions, and maybe it doesn't have enough resources for six partitions. Let's just keep it simple. Let's assume one broker has only one CPU and all six partitions are competing for this CPU. So you have a limit on how much throughput or performance you get from that system.

But maybe that's okay when you started your Zeebe cluster. Your business was just starting up, so you don't have much load. So that was fine, and maybe in one year your business has boomed, and you have a lot of load, and, at that point, you want to increase the throughput.

That's when you want to scale Zeebe. And to do that, it's really quite simple. You just start new Zeebe brokers. Like, if you want to scale from three to six, you start three more Zeebe brokers. And then you send a request to Zeebe that says, "Okay, scale, and these are the new brokers." And then, in the background, Zeebe will distribute these partitions to the new ones. So now every broker has three partitions each. So now only three partitions are competing for the one CPU in the broker, and, in theory, you will get double the performance.
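[Note: a quick back-of-the-envelope illustration of the example above–the numbers are simply the ones from this conversation, not output from any Zeebe tool.]

```python
# Illustrative arithmetic only, using the numbers from the example above.
partitions = 6
replication_factor = 3

for brokers in (3, 6):
    total_replicas = partitions * replication_factor  # 18 partition replicas overall
    per_broker = total_replicas // brokers             # 6 replicas each -> 3 each
    print(f"{brokers} brokers: {per_broker} partition replicas per broker")
```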

[NIALL] Ah, okay. So the idea here is, the reason you would scale is because there are too many partitions per broker. That's the reason why you'd want to scale, and, as you've mentioned, a bunch of requests are coming to the same CPU, and you say, "Well, to lessen the number of requests on the CPU, I can add more Zeebe brokers." And then they have their own resources, so they can handle the additional partitions.

How would someone find out that they need to dynamically scale their brokers?

[DEEPTHI] I would say it's mostly about what users can observe from the outside. So it could be either throughput–like how many instances you can create or complete per second–or the overall latency taken for a single task, or maybe that it was never processed at all.

If you are competing for CPU, then the partitions are not able to process your instance requests as fast. So you will see that either in the processing data or the overall latency that you observe in your process, or just in terms of throughput.

[NIALL] I think the first example I had of someone who really wanted this was someone who was using the older version of Camunda, and they would manually scale it up at certain parts of the year, because they were a risk management company, and they did most of their business in a particular month of the year. I don't know why that is. I can't remember, I wish I did.

But they have huge throughput then, when people are trying to hit their system to get this risk management stuff checked, and most of the time they don't. So they quite liked the idea of being able to just bring up a bunch of Camunda nodes. It wasn't all that easy, actually, to bring up Camunda 7 nodes, because, unlike Zeebe, Camunda 7 required you to have a load balancer on top of it, which Zeebe doesn't need because it has gateways–it happens out of the box. You then need to connect to the same database, so that means you're actually limited by the number of requests to a database.

But actually, here, we've greatly improved this, because now you can scale it. There's no database to worry about. You don't need to talk to a load balancer anymore. You just need to do that scaling, and there's only one component that you need to talk to. I think that's a massive benefit from a use case perspective, and another reason why it's a lot easier and better to scale this than it would have been with the previous version of Camunda.

Now, this feature, of course, is difficult, I imagine. I was a software developer a long time ago, and I never made anything very complicated. So even just thinking about where to start with this burns my brain, so I'd love you to talk me through: when you were assigned this, what did you see as being the potential, let's say, hurdles, the potential challenges? And how did you solve them?

[DEEPTHI] I mean, the main thing is that Zeebe's a stateful system, right? So you could think of it as also being like a database. It has state in it. If it was a stateless application, then it would be easy to scale up: you just spin up new instances of it, and then everything is fine. So the thing with a stateful system is also about distributing this data, right? When you start a new broker, you have to kind of distribute it. If you keep this state on the old brokers themselves, then you're not really scaling anything; you're not making any change.

So the first thing is all about how you can correctly move partitions from the old brokers to the new brokers. Moving partitions means that you're moving the data along with them, and also the process engine that is processing them should also be–

[NIALL] Should also be with the right broker, yeah. Did you have any ability to move partitions before this?

[DEEPTHI]
No.

[NIALL] Okay, so you had to build the whole partition-movement thing first, I assume, before anything else? Okay.

[DEEPTHI] 
Yes. Yeah. But fortunately, we are using the Raft protocol for managing the data replication and consensus as part of partitions. And, fortunately, the Raft paper already kind of outlines how to make changes to this replication group.

[NIALL] Ah, cool.

[DEEPTHI] So let me say it simply: you have 3 replicas for one partition, and let's say brokers 0, 1, and 2 are part of that replication group. And now, when you add broker 4, you have to move, let's say, the partition from broker 0 to broker 4. So the new replication group will be 1, 2, and 4.

So this is what you want to achieve, and to achieve that with Raft, we can make this configuration change. So first you add broker 4, so you temporarily have 0, 1, 2, and 4 in the replication group, and then you remove broker 0.

So you do that in two steps and, without going too much into the details of the protocol, you can just do it. Once broker 4 has joined the group, the protocol just kicks in, and it will replicate everything that was in this partition to broker 4.

So, under the hood, it's just the Raft protocol.
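[Note: a minimal sketch of the two-step membership change Deepthi describes, assuming a single partition whose replicas move from brokers {0, 1, 2} to {1, 2, 4}. The function and variable names are invented for illustration; they are not Zeebe internals.]

```python
# Illustrative sketch of a two-step Raft membership change for one partition.
def move_replica(replicas: set[int], add: int, remove: int) -> set[int]:
    # Step 1: the new broker joins, so the group is temporarily {0, 1, 2, 4}.
    # Raft then replicates the partition's existing data to the joiner.
    replicas = replicas | {add}
    # Step 2: only after that does the old member leave, giving {1, 2, 4}.
    replicas = replicas - {remove}
    return replicas

print(move_replica({0, 1, 2}, add=4, remove=0))  # {1, 2, 4}
```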

[NIALL] That's really cool. Now, does this change the broker that owns the process? Because in Zeebe, of course, only one broker can move the state of any given process instance, correct?

[DEEPTHI] Hm-hm.

[NIALL] So, do we move instances that are already owned by, let's say, Broker 0? Do we move the ownership to broker 4, or do we just move the replication to different brokers?

[DEEPTHI] So each process instance resides in one partition, and whoever is the leader for that partition is making the changes. So when we make these changes–like when we are moving one partition from one broker to another–we are just moving one replica. And then, depending on how much we change, it can result in a new leader.

So let's say, in the previous example, we only changed one broker. So the other two are still the same, and that means most likely the leader will remain the same. But I'm saying there might be cases where we have changed all replicas to new brokers. Then there will be a new leader, and that leader will continue processing from where the previous leader stopped.

[NIALL] Ah, gotchu.

[DEEPTHI] So, in that sense, it's not like an ownership. The ownership is still with the same partition. But the leader can change.

[NIALL] Gotchu. So yeah, actually, that's a misunderstanding that I had. I always assumed that the leadership was linked to a broker. But it's not–it's linked to a partition. That's the leading partition, right? I'd always just assumed it worked the other way. But that makes more sense, of course, for moving partitions.

So that's actually quite a risky thing to do, to move a partition of a running instance, as it potentially could have its state changed. How do we deal with the fact that the partition, while being moved, could need to also change?

[DEEPTHI] So that's where the Raft protocol comes in. There's a safe way to do it: if you do this step by step, you are also making sure that the data is consistently moved. So there is no data loss, and even if there's some leader change in between, or a broker restarts in between, the Raft protocol guarantees that everything works.

[NIALL] Okay, very, very handy, then.

How long would it take, start to finish, to dynamically scale a cluster of, let's say, three brokers to six? So, double the number of brokers. How long? And let's imagine, to use your example, there are three partitions per broker. How quick is it to actually scale Zeebe?

[DEEPTHI] As for all questions, the answer is, it depends.

[NIALL] Perfect. Love that answer.

[DEEPTHI] It depends on how much data you have in your system, and also what the current workload on your system is, because you need resources, and if you're sharing the resources with the actual processing, then it can get slower.

So moving data from one broker to another is time-consuming, right? So it depends on how much. It might take a few seconds if you have very small state, but it can take up to several minutes if you already have a lot of state on each partition.

[NIALL]
Yeah, that makes sense. So the ballpark is between seconds and minutes.

[DEEPTHI] So, at least in one test that I did, I think it took more than 5 minutes.

[NIALL] Okay, that seems quite reasonable. And I guess, if you plan for the scalability–like, if you can predict high throughput is coming–it's much easier and less resource-intensive than if you wait until you already have the problem of high throughput, which means you have to scale. But if we're able to predict a little bit about when the scalability is required, then it'll be much faster, I guess.

[DEEPTHI] Yeah, that's right. So I think a good use case for this feature would be, you know, "Okay, Christmas is coming. So I'm going to get new orders. So I'm going to scale," something like that.

[NIALL] Nice, makes perfect sense. Yes, I always scale up around Christmas time, too, and then do my very best to scale back down in January. And actually, what a wonderful segue that is.

Because, of course, dynamic scaling goes both ways, I assume. I actually have always assumed, and you can tell me if this is correct or not, that it's much easier to scale up than scale down. It seems far more complicated to remove, let's say, a broker when there's less throughput than it would be to scale up. But maybe you can tell me if that is a fallacy that I've always believed. How does scaling down work?

[DEEPTHI] So currently, scale up and scale down work the same way. When you scale up, you're moving some partitions to the new brokers, and when you scale down, you have to move those partitions back.

[NIALL] With new brokers, they're completely empty. They have all their resources, so it's a very easy thing to say, "There's an empty bucket. I'm going to put a bunch of stuff in there." But if you have, let's say, six half-filled buckets, then how do you decide where everything goes to make sure things are evened out? Is there decision-making happening there?

[DEEPTHI] So for that, Zeebe uses a round-robin distribution for partitions. When you add new brokers, it uses the same logic to equally distribute the partitions to the brokers. One thing to note is that we don't, right now, allow changing the number of partitions, right? So when you add new brokers, we are not starting new partitions. We just move existing partitions to the new brokers. And then, yeah, we just use this round-robin strategy.
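[Note: a simplified sketch of round-robin partition placement, just to illustrate the idea of equal distribution. Zeebe's actual placement logic may differ in detail.]

```python
# Illustrative round-robin layout: partition p starts at broker (p - 1) % brokers
# and its remaining replicas wrap around to the next brokers.
def distribute(partitions: int, brokers: int, replication_factor: int) -> dict[int, list[int]]:
    return {
        p: [((p - 1) + r) % brokers for r in range(replication_factor)]
        for p in range(1, partitions + 1)
    }

# 6 partitions, replication factor 3: with 3 brokers each broker holds all 6
# partitions; with 6 brokers each broker holds only 3.
print(distribute(6, 3, 3))
print(distribute(6, 6, 3))
```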

[NIALL] Okay, so let's discuss the scaling down again. What decisions are made when, say, from 6 brokers we want to get back down to 3? How do we shut them down? When do we know to shut them down? How does all that work?

[DEEPTHI] So Zeebe doesn't try to be intelligent there. We just assume that the decision is taken by a user or an operator. They know when to scale up or scale down, and then they can just send a request to Zeebe to do that.

For scaling down, you first have to send a request to Zeebe to move all the partitions, and once the partitions have completely moved, then you can shut down the brokers.

[NIALL] Gotchu. Okay. So it's pretty safe then. So nothing is going to accidentally shut down a broker that might have stuff on it?

[DEEPTHI]  Yeah.

[NIALL] That makes a lot of sense. Okay, cool. That's really useful. 

What other sort of hurdles or hard programming problems did you find yourself dealing with on this ticket?

[DEEPTHI] Yeah. So one of the blockers was that Zeebe uses a static configuration. So when you create the cluster, you're saying what the cluster size is, how many partitions there are, and what the replication factor is.

And most of the components use this information. So now, the problem is that it's changing dynamically, right? You are allowed to add new brokers and then change the partition distribution. And this has to be managed.

So the first step was to store this cluster configuration within Zeebe, instead of using the static configuration that was provided by environment variables or the configuration yaml. We cannot rely on that anymore. We should use that only the first time you start up, and after that, it can change dynamically. So for that, we implemented a new topology management, which works via a gossip mechanism within Zeebe.

And this topology gives up-to-date information about which brokers are currently in the cluster and what the partition distribution is.

And this was also really interesting–how to do it. We had some discussions initially regarding whether we should use a centralized system to manage this information, or whether we needed a distributed way of managing and coordinating it. And in the end, we decided to do it in a distributed way. So everything is managed via gossip, and it's also coordinated in a very distributed way.

So, when you have to scale, this information is also updated in the topology, and then this topology management keeps track of what point in the scaling we are at. Like we discussed before, we have to move all the partitions, and we do that step by step: we first move one partition and then move the second one. And this is all stored and managed and coordinated via this topology manager.

This is also completely new stuff we had to build to support this feature.
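[Note: a heavily simplified illustration of the idea of a replicated topology that tracks a step-by-step change plan. All names here are invented for illustration and are not Zeebe's internal types.]

```python
# Illustrative only: a scale request becomes an ordered plan of small operations,
# and progress through that plan is tracked in the shared cluster topology.
from dataclasses import dataclass, field

@dataclass
class Operation:
    kind: str        # e.g. "join" or "leave" for a partition replica
    partition: int
    broker: int

@dataclass
class Topology:
    brokers: set[int]
    pending: list[Operation] = field(default_factory=list)

    def next_step(self):
        # Operations are applied one at a time; after each step the updated
        # topology would be gossiped so every broker converges on the same view.
        return self.pending.pop(0) if self.pending else None

plan = [Operation("join", partition=1, broker=3), Operation("leave", partition=1, broker=0)]
topology = Topology(brokers={0, 1, 2, 3}, pending=plan)
print(topology.next_step())  # Operation(kind='join', partition=1, broker=3)
```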

[NIALL] Yeah, I can imagine. That sounds complicated. So obviously, I know the configuration file, the yaml file that I use all the time for defining Zeebe stuff. And I can imagine that the difference between starting up, reading a single static file, and doing what that says, versus creating that knowledge at runtime as a dynamic value–that's incredibly complicated. The gossip protocol means that something changes on a broker, and then that broker goes ahead and tells everyone else stuff is happening. That's basically what's going on, right? So each broker then stores the settings–the number of partitions, I guess, and the number of brokers that are supposed to be in this cluster. And if you change them on any of the brokers, that broker then tells the rest of them, "Hey! Stuff has changed. Here's the new paradigm we're working with."

The other option was centralizing it. Why did you decide not to take the centralized approach?

[DEEPTHI] Then it's also a single point of failure if you have a centralized approach, and, if that system is not available, we won't be able to even just restart a broker. If you talk about SaaS, we expect that we constantly restart the pods, or move to a new node, or stuff like that. It's not running on static hardware. And then, every time it restarts, if the central component is not available, you cannot complete startup, because it doesn't know what the configuration is.

So that's the main thing. We want to avoid a single point of failure whenever we can. That was the main reason to go with this distributed coordination.

[NIALL]
I think it makes sense, considering it might even have been the only single point of failure in Zeebe. I think everything else is scalable. You can have multiple gateways, multiple brokers.

Are you happy with how that decision has been working out?

[DEEPTHI] So far, yes. Let's see what happens in the future, if we have to add more features. But I think it still was a good decision.

[NIALL] And obviously, as well as creating this brand new aspect of the gossip protocol to be able to send that information, you needed to create the API so that users could actually tell Zeebe this information. So in that regard, what does that API look like? What can you tell Zeebe exactly? Like, if I was, let's say, a software developer and I have my Zeebe cluster, what is the call I use? And what kind of information do I give Zeebe in order to scale it?

[DEEPTHI] So it's a REST API that's exposed on the gateway, so you can send this request to any of the available gateways, and you just have to say, "Scale, and these are the new set of broker IDs." So, in our case, we just say, "Scale 0, 1, 2, 3, 4, 5," because those are the broker IDs for a cluster size of 6.

[NIALL] Nice! That seems very straightforward actually; a little easier than I thought.

[DEEPTHI] Yeah, I mean, that would also probably be interesting in future for more nuanced use cases, where you want a specific way of scaling or a specific partition distribution. But that's more like an extended thing, and something that we can build on demand if there's any use for it.

One more thing to add about this API: this is just for requesting the scaling, and, as we discussed before, it's not an instantaneous process. It takes time to move the data and complete all the scaling. So you have to keep monitoring it.

So you also have a query API, where you can just query the current topology and then monitor the progress of the scaling.
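[Note: a hedged sketch of what these two calls can look like, assuming a self-managed setup where the gateway's management port (9600 by default) exposes the scaling endpoints described in the Cluster Scaling docs linked above. Check the docs for your exact version; the endpoint paths and response fields may differ.]

```python
# Sketch only: request a scale-up to six brokers, then poll the topology.
import requests

GATEWAY = "http://localhost:9600"  # gateway management port in this assumed setup

# Ask Zeebe to scale by listing the desired set of broker IDs.
resp = requests.post(f"{GATEWAY}/actuator/cluster/brokers", json=["0", "1", "2", "3", "4", "5"])
resp.raise_for_status()

# Scaling is not instantaneous, so query the cluster topology to monitor progress.
topology = requests.get(f"{GATEWAY}/actuator/cluster").json()
print(topology)  # current brokers, partition distribution, and any pending change
```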

[NIALL] Cool! Previously, if I wanted to know what the configuration of my Zeebe cluster was, I would check out the yaml file–that would never change. But that's not gonna be the same anymore, right? Because you can change it at runtime. So now, if I want to check, hey, what does the topology of my Zeebe cluster look like, I've got an API to do that.

What kind of failover do we have in case something goes wrong? Like, okay, here's my theory for you. So you've asked it to scale to a certain number of brokers, right? And, as you said, you're querying it, saying, "Hey, what's the current topology?" until you see the one you want, right?

If something goes wrong with scaling it–I'm not sure what; you can maybe tell me what kinds of things could potentially go wrong here, if anything. It could be flawless, I mean, I don't know. What are the ways of finding out what's going on?

[DEEPTHI]
So this query API also returns a lot of details about what step of the scaling you are in. So you can monitor it, and, if there are any failures, it will also show you at what point it failed.

Hopefully, there are no failures and everything works. There's also built-in retrying and everything–that's already been built within Zeebe. So for a restart, or maybe temporary network problems in the cluster, Zeebe will retry this until it succeeds.

[NIALL]  Oh, that’s really cool.

[DEEPTHI] 
In practice there could be maybe some bugs or something.

[NIALL] No, not in our software.

[DEEPTHI] LAUGHTER. Yeah, Zeebe is perfect. So it's fine.

I think, though, that's mostly the case where something fails and there's no way to recover from that kind of scaling.

This is also a bit tricky, and how we can recover from it is not perfect. So we also expose an API to cancel the scaling.

[NIALL] Oh, right!

[DEEPTHI] It's not a completely safe thing to do, because maybe you're in the middle of scaling–you have already moved some partitions but not all. So then you are in an intermediate state of the scaling.

[NIALL] Yep.

[DEEPTHI] So then you probably have to do some manual steps to recover– like to bring it to a state where you want. So that's it's not like a perfect solution yet. But there's certainly some way to recover from it.

[NIALL] That's really interesting. Yeah, it's great–of course, there's a lot in there for essentially the first iteration of a really popular feature.

And that brings us actually to the final thing I wanted to talk to you about, which is: what's next for dynamic scaling? What are the next things to expect over the next year or so?

[DEEPTHI] So, the next thing for dynamic scaling: what we have done so far is just the first step towards scaling. We can add new brokers, but we have to keep the number of partitions the same. So that is kind of a limit to how much you can scale, and you should have already provisioned enough partitions in advance so you can use this feature.

But, of course, we want infinite scaling in future, and that means being able to add new partitions to the brokers. Adding new partitions is a bit more challenging than what we have already done. The main problem is that now we have to cut partitions–you have to split existing partitions to move some data from one to the other.

[NIALL] Yeah, that makes a lot of sense.

[DEEPTHI] And that's a bit challenging. That's also one of the feature that I'm looking forward to see in Zeebe. And hopefully we can work on it soon. And I think that's also something that's requested by many users.

[NIALL] Yeah, because we already have a really nice, infinite scalability story, provided you give all the information upfront–because you can always just scale as much as you want, and we do have the capability. And it's all about: can we get that capability at runtime? That's the next sort of big step.

Cool. That sounds very complicated. I'm glad you're taking a break from doing something as simple as redefining the topology control and also how to move partitions, and now you're going to take an easy break and work out how to create new partitions across a cluster. So I'm glad you're taking it easy, Deepthi.

Beyond dynamic scaling, what other kinds of things are you looking forward to seeing from Zeebe in the next while?

[DEEPTHI] So, my team–the Zeebe Distributed Platform team–we are continuously working on trying to improve the performance, reliability, and availability of the system. We have some focus on how to support large state and stuff like that, which was a problem before–in relation to performance as well as stability. And recently, we have also implemented something called job push. Maybe you have heard about that. This is also having a big impact on better performance and scaling. And for the future, we are putting a bit more focus on disaster recovery with multiple regions.

This is an ongoing topic because there are a lot of customers who have to deploy Zeebe as a Camunda engine in multiple regions for many reasons–mostly for disaster recovery. So this is something that is ongoing, and hopefully we'll have something for that.

[NIALL] That's a massive topic. I remember years ago trying to help a very large customer of ours build an active-active setup–two different regions, two different data centers with Camunda 7–and it was virtually impossible without incredible delays, because you needed to replicate everything across the data centers via REST, which was a nightmare.

They didn't do that in the end, I think–I can't remember. It's a lot easier now, knowing that Zeebe is able to already do all that replication internally. And I guess then it's just a matter of: can we do that across data centers? And is active-active what you're looking to do?

[DEEPTHI] By Zeebe's architecture, we can already support three regions because, you know, with three replicas, you can already achieve active-active kinds of semantics–if one region is down, the other two regions can still continue working. That's Zeebe's architecture by default. And of course, performance is a problem, because if you have high network latency across regions, this will definitely have an impact on performance.

I don't have any numbers now on what the limit is, like how much we can support. But yeah, that's something we have to figure out–how much we can handle. But the main problem is also that we have customers who have only two regions. So we have to somehow make Camunda 8 work with two regions, and that's kind of challenging because, you know, by default, you need three regions to have a highly available setup.

[NIALL] Of course, yeah. Otherwise, one thing goes down and you don't have the backup of a third. Yeah, that makes a lot of sense. Great. This is really insightful. I really appreciate you taking the time to chat with me, Deepthi. I'm looking forward to seeing you chat more and also hopefully getting you back onto CamundaCon's stage at some stage.

Maybe to demonstrate some of the stuff you're talking about, because I can only imagine that this is an even more impressive discussion with some diagrammatic aids to let us see exactly how Zeebe deals with the incoming stuff.

But thank you so much–I greatly appreciate your time. This is really fascinating, and I'm sure we're going to get a lot of great feedback from the developer community for you. So thank you very much, and talk to you again.

[DEEPTHI] Thanks, Niall. It was nice talking to you on this podcast. And thanks for inviting me.


[NIALL] This was a great chat with Deepthi. I really did enjoy it. 

And it's one of the features I feel needs a little more of a spotlight. For those of you who are already using 8.4, unless you checked out the release notes or read the post, you may not even know this new feature exists–and it's such a powerful feature.

I always like to try to make sure some of the backend stuff that you may not be made aware of too easily is still prominently described and talked about.

So I hope this gives you even more faith in the glorious scalability of Zeebe. And, of course, there's more to come, as Deepthi mentioned. 8.5 and 8.6 will all have additional features around scaling, so it's getting better and better.

Of course, tickets are always sitting there to be explored. That's how I found this feature. So, if you're interested, there are some links below where you can check out what's going on with Zeebe and keep an eye on the different new features that are coming out.

Until then, why not subscribe to the podcast if you haven’t already? That makes me happy. And, also, the next time you want to hear about orchestration or BPMN or FEEL or DMN or whatever sort of fun stuff that I tend to talk about, then you’ll get it immediately. 

Thanks very much. I’ve been Niall the whole time. Bye-bye. 


[NIALL] I wanted to talk to you because you are involved in that. I contacted you about it, and you very kindly said, “Yes, leave me alone for a few months while I actually do this work, and then I'll talk to you afterwards,” which was a smart idea. So now I'm here.


Welcome
Meet Deepthi Akkoorath, Camunda Senior Software Engineer
What is dynamic scaling of Zeebe brokers?
What hurdles did you overcome in building this feature?
How long does it take to dynamically scale Zeebe?
Is it possible to scale down Zeebe?
What other hard programming problems did you encounter?
Why did we go with a distributed approach?
What can developers tell Zeebe?
What kind of failover do we have?
What's next for dynamic scaling?
Sign-off