Looking for more podcasts? Tune in to the Salesforce Developer podcast to hear short and insightful stories for developers, from developers.
105. Event Sourcing and CQRS
Hosted by Robert Blumen, with guest Andrzej Ludwikowski.
Organizing data into a sequence of CRUD operations have a long history in software. But with newer and never-ending data streams, different models are emerging. Guest Andrzej Ludwikowski, a software architect at SoftwareMill joins host Robert Blumen to discuss the architecture patterns of event sourcing and CQRS as alternatives.
Robert Blumen is a DevOps engineer with Salesforce, and he's joined in conversation with Andrzej Ludwikowski, a software architect at SoftwareMill, a Scala development shop. Andrzej is introducing listeners to the concept of event sourcing against the more traditional pattern of CRUD, which stands for create-read-update-delete. CRUD systems are everywhere, and are most typically associated with SQL databases. In comparison, event sourcing is a simply a sequential list of every single action which occurred on a system. Whereas in a database, a row may be updated, erasing the previous data in a column, and event source system would have the old data kept indefinitely, and simply record a new action indicating that the data was updated. In a certain sense, you can get the state of your system at any point in time.
Each architectural pattern has its pros and cons. For one, an event source system can make it easier to track down bugs. If a customer notes an issue an production, rather than pouring through logs, developers can simply "rewind" the state of the application back to some earlier event and see if the faulty behavior is still there. On the flip side, since the event stream is immutable, fixes to previous data needs to be made at the end of the stream. You can modify old events or insert new ones into the flow.
CQRS, or Command Query Responsibility Segregation, builds on top of event sourcing. The idea is to separate the part of the application responsible for handling commands and writes from the part responsible for handling queries and reads. This separation is not only on a software level (different repositories and different deployments), but also on the hardware level ( different hosts and different databases). The motivation for this is to be able to scale each part independently. Maybe your app has more writes than reads, and thus requires different computing power. It allows for a separation of concerns, and can make overall operations more efficient, albeit at a complexity cost. Andrzej is quick to note that event sourcing and CQRS divisions are not necessary for every application. Teams, as always, need to understand how the data flows in their application and which architectural pattern is most efficient for the problems they are trying to solve.
Links from this episode
- SoftwareMill is a Scala development shop
- Martin Fowler gives a brief run-down on event sourcing
- On the SoftwareMill blog, Andrzej has a blog posts on entry-level event sourcing, keeping your domain clean in Event Sourcing, and the best serialization strategy for Event Sourcing
- CQRS.nu provides more educational resources
- Andrzej Ludwikowski's web site and talk Event Sourcing: What could go wrong?
Robert: For Code[ish], this is Robert Blumen. I am a DevOps engineer with Salesforce. I have with me Andrzej Ludwikowski, a software architect at SoftwareMill, which is a Scala development shop in Poland. He writes in blogs extensively. Andrzej and I will be talking about event sourcing, and CQRS. Andrzej, welcome to Code[ish].
Andrzej: Hello, Robert. Hello, everybody. Thank you for having me.
Robert: To get some basic concepts in place, could you give us a brief review of the C-R-U-D or CRUD model?
Robert: We're talking about an alternative model event sourcing. Let's start with the idea of an event. What is an event?
Andrzej: Okay, event is something that happened, a fact in the system, or a change to the application state.
Robert: Now, I think I can ask you what is event sourcing?
Andrzej: Let me start with the definition, because some people would say that this is a software design pattern, others could say that this is a software architectural style. I think both terms are correct. However, I would personally define event sourcing as a software design pattern that has a huge influence on the underlying system architecture. And maybe that's why programmers use these definitions interchangeably.
Robert: Okay, let's talk about as a design pattern, and then we can look at architecture.
Andrzej: Okay. So I think the easiest way to explain this pattern is to show it in a comparison to something that is familiar to every programmer, like CRUD systems. As I mentioned, in classic systems, entities are saved to the database as the most current state, whereas an application based on event sourcing, the most current state does not exist in the database. It can, for performance reasons. It's called snapshotting. But this is totally optional. Instead of state, we are persisting a sequence of events. So as I mentioned facts in the systems. And when you replay all these events for a particular entity, you will get the current state. Relatively easy concept.
Andrzej: And fun fact from Greg Yang presentation, by the way, Greg Yang is one of the event sourcing evangelists, is that humans have been using event sourcing for thousands of years. The first implementation of event sourcing was found on clay tablets from Mesopotamia 9000 years BC, and it was used for persisting marketplace transactions. So how many sheeps, grain, bread loaves, were bought or sold. And based on such tablets, a Mesopotamian merchant trader could very easily calculate the state of possession. Not necessarily by doing this manually, for example, by counting sheeps. And event sourcing for general bookkeeping has been used with great success until nowadays.
Robert: I love the sheep example. Could you think of another example that you could give us the entity, and some typical events on that entity?
Andrzej: Yeah, sure. So for example, we could have an order entity, and the order could be order shipped, order validated, order rejected, things like that.
Robert: You've explained, we don't update in place, we don't necessarily keep anywhere what is the most current state. You might have to look at every single event on the entity to find out what is the most current state. Now, many business rules depend on the current state. Like, if you want to make a withdrawal from an account, we have to see the current balance is greater than zero. Does that mean when a change comes in, you have to roll up the entire state history before you can determine if that change is valid?
Andrzej: Yes, exactly. You need your domain aggregate to validate the input, and to make the change, and to produce new events. And you need to replay all the events. But I suppose the next question will be, is the performance problem? Not necessary, because you should do this only once, when, for example, the application is restarted and it's empty, without any state. When you first request for a given entity, yeah, you should roll up the state. But then you should use the state from the memory, which is extremely fast, because you don't have to do any IO operation. So this is really fast if you need to create a low latency system.
Robert: For example, if I log into my bank, and it was using event sourcing, would it roll up my account balance when I create a session so that it's present?
Andrzej: I mean, it depends if you need to change your bank account somehow, so withdraw money or something like that, then yeah, it will need to replay all the events. But if you just want to look at the current state, you could optimize the whole process and use some view model of this data.
Robert: So Andrzej, we've been talking about this one area where perhaps CRUD is a bit better, because this current value of the state is already there. What are some advantages of event sourcing, things you cannot do or that are better than with CRUD?
Andrzej: First of all, let's start with a complete log of all state changes, which is something extremely powerful. If you never use it before, you might be thinking, "Okay, but do I really need it? I don't have any business requirements for audit logs whatsoever. So how will I benefit from it?" And then you need to solve a very nasty bug on production. You have no idea how this could happen, or where the problem is. Usually, in such cases, you will analyze bunch of logs, if you have them. Sometimes they are helpful, sometimes completely useless. Debugging is also an option if you know how to reproduce the problem. However, with event sourcing, you have something much better, a complete load of all state changes in the form of events journal. And you can replay events, event by event, and analyze what has happened with your application.
Andrzej: It's like Time Machine, you can go back to the any point in the past, and verify the entity state. A priceless feature from my perspective. I use this kind of debugging many times in production, to prove that the system was working properly, or to analyze what should be fixed in other cases. And if you actually need an out log because of the business or regulatory requirements, then you will get it by design. Also, from an analytics perspective, if you don't lose any information about the system, you can process all the historical events, and you can create a real data driven organization.
Robert: Could you give an example of an issue you solved in production by using the event log?
Andrzej: Okay. So, for example, we got a ticket from a user that he thinks that the application was not working fine, because he changed something a month ago. And we don't have logs. Basically, our logging infrastructure, we can go back for two weeks, something like that. And with event sourcing, we could read all the events. So a month ago, a year ago, that's not a problem. We could analyze the application state, and explain the user that everything was working okay.
Robert: Now, what happens if perhaps a bug in software, maybe you have a incorrect validation or some other kind of like that writes an incorrect event into the event log? How do you move forward to get to the state you want to be in in that case?
Andrzej: Okay. If the state is wrong, because we made bug in the source code, and we produce wrong events, so events are immutable objects. Once persisted, they shouldn't be updated under any circumstances. But the state is created from events, also bad events in this case. So how can I fix it? Relatively easy. We should use something called healing command, which is either an existing command or some additional semi-business, semi-technical thing that will allow us to fix the entity state. This command probably will produce a healing event or use an existing event, but this time in a correct way. And that's the recipe for fixing the state. Once again, updating events in database should be forbidden.
Robert: If for example, a bank statement it showed an incorrect deposit, then you don't remove the deposit, you would enter a compensating withdrawal of the incorrect amount to get the deposit back to where it's correct.
Robert: Wanted to talk a little bit about performance implications. Are there any advantages from a rate standpoint to doing append versus in place updates?
Andrzej: Definitely. Event sourcing from the right path perspective is something that any database would love to handle. Append only writes without updates, deletes, is the easiest and the fastest possible operation for majority of storage solutions. And I heard this on some presentation years ago that, if you need to work effectively with the hard drive, the best way is to look at if it were a cassette tape. And this way, you will gain maximum performance from it. So imagine that your hard drive is a cassette tape. If you don't have to jump between many different places, you can write and read from such tape without extra logs. Now, this metaphor is obviously correct for rotational hard drives with spinning disk, recording head moved by some coil motor. Surprisingly, it is also correct with nowadays fast SSD drives. Of course, in case of SSD, the problem of writing or reading data from different places is much smaller. But there is a phenomenon called write amplification, about the details you can read on one of my blog posts. But in general, append only writes writing data chunks close to each other, is a recipe for really low latency applications.
Robert: Andrzej, we've been talking about a comparison between CRUD and event sourcing. You're trying to decide your architecture and system. You're trying to decide which one to use. How are you going to go about making the decision of what's right in your case?
Andrzej: So first, your event sourcing is not a general purpose butter. Although some indicators might help you with the decision if event sourcing is suitable for the case. Let me start with domains where event sourcing is a good match. So bookkeeping systems already mentioned, trading systems might gain a lot of benefits from event sourcing. Actually, any system that handles money in some way, would be a candidate for event sourcing, simply because sooner or later, you will need to provide a full audit log. And if event sourcing of the log is something that you will get almost for free.
Andrzej: Also, any non-trivial domain could introduce event sourcing. I think this one needs some clarification. For example, you have a simple CRUD application where the user is sending a request, the application is updating something on the database, and reply with the response. Don't go with event sourcing, you don't need it. But if your domain is more complex, and after persisting something to database, you also need to send an email, then invoke some internal microservice then external microservice, to finally send the response as a WebSocket push to the user, because you don't want for a user to wait for all these steps to complete, and you will do this as synchronously. Maybe this domain with this complex process will be easier to implement with event sourcing.
Andrzej: I think the rule of thumb is that you don't have to use event sourcing for everything. Even in your current domain, you should choose some subset of this domain, probably the part which brings money, which is the most valuable, and maybe this will be a good candidate for event sourcing adaptation. From a different angle, if you have a very specific technical requirements like the speed, really low latency, scaling capabilities, then you should also consider event sourcing as a possible implementation. There is no magic here. It doesn't mean that event sourcing will give you all these features just like that. Although this pattern will at least not block you in this area of scaling and speed. And because of its mechanical sympathy, it will allow you to create really fast and really right solutions.
Andrzej: On the other hand, where event sourcing is not the best idea to use, so simple CRUD applications, that's for sure, where your events names would be like user updated, user deleted et cetera. Also, I noticed that applications or parts of the applications where you need to manage a lot of complex relations between entities, usually with very high level of consistency, and transactional updates, are not the best match for event sourcing as well.
Robert: Most of the popular web frameworks now give you a CRUD, pretty much out of the box. You can get a CRUD application up very quickly. Are you thinking in terms of, you have a situation where it's a good match for event sourcing, but there's a bit more cost to build so you have to decide if the cost is worth it to get the additional benefits?
Andrzej: Yeah, that's for sure. To be honest, I don't believe in auto generated systems, because sooner or later, you will pay the price for the very quick start of the application.
Robert: We have a pretty good introduction to event sourcing. I want to move on to the next major topic, CQRS, which is going to build on top of event sourcing. First question is, what does it stand for?
Andrzej: Okay. Command Query Responsibility Segregation. The idea is to separate the part of the application responsible for handling commands and writes, from the part responsible for handling queries and reads. And this separation is not only on a software level, so different repositories, different deployments, but also on the hardware level, so different hosts, different databases, et cetera.
Robert: C, standing for command. Is that our reference to the well known command pattern from the Gang of Four?
Robert: Give us a brief review of the command pattern.
Andrzej: It's quite easy. You simply encapsulate a request from the user as a command object. And this way, you can build command processors that will consume commands from different sources like HTTP request, Kafka messages, WebSocket messages, et cetera.
Robert: The main part of CQRS is the separation between the part of the system that handles writes from other parts of handle reads, what's the motivation for that, or where is a case where you would want to do that?
Andrzej: Where you want to scale those parts independently, or where you want to have a separate model for writes and for command handling, and separate model for reads.
Robert: I know in data modeling in a relational database, and we're not necessarily always talking about a relational database. But it's a trade off between a highly normalized model that's more efficient for writes, and a denormalized model is more efficient for reads. And it ends up doing is making some kind of a compromise of what you can live with for the reads, and what you can live with for the writes. Are we trying to avoid that compromise here by giving each side its own model that's optimized for what it needs to do?
Andrzej: Exactly. No more compromises like, okay, this single model or this single database, maybe it's not perfect, but at least it can support writes and most of the queries. With CQRS, you can choose the best model and the best database for writes, and the best solution for reads. And scaling capabilities of such a system are great. Of course, it won't be a piece of cake to scale it, you have a lot of moving pieces. But at least it's possible. Also, you can do this completely independently. If only reads are the problematic part, focus on them and add more resources for query services or for the database responsible for reads. Or create a new read model that will allow you to have a very fast reads. On the other hand, if commands processing is the bottleneck, analyze what can be done to fix this without touching the read site entirely.
Robert: The command processing side, which is the write model, is that going to be an event sourcing model with an append-only database?
Robert: And that is not necessarily a great model for doing a lot of queries against it, depending on what kind of queries. But it might be very unsuitable for query.
Andrzej: Definitely. And so it will be really, really hard to query event stream, and get some aggregations or some very specific queries. That's why event sourcing and CQRS are more or less used together.
Robert: You mentioned different databases. Would an example of that be, I'm going to have part of my read model be a document database or graph database or a key value store, that would be a specialized database, it's very optimized for a particular type of read, and it doesn't even have to be the same technology as processing the writes?
Andrzej: Exactly. For example, you have an application for managing users and their contacts, friends. And now, on the command handling side, it could be stored simply as a new entry in the users friends list, which is easy to model in SQL database. We need to also provide a way to run very complex queries about users relationships. I know Bob, you also know Bob. So from this application perspective, we are in the second level of relationship. And we need to support third and the fourth level as well. Such a query will kill SQL database. But for graph database, it's a pretty standard query. And in CQRS, we will create a dedicated read model that will work as a different view on the same data generated from the common handling site. And it's more efficient to get the information about users relationships from such a view, then querying databases responsible for handling commands.
Robert: Is it also the case, you could have multiple read models, each one optimized for different use cases?
Andrzej: You can have a separate read model for a single query, you can have a separate database for a single query, if that's the problem for you. If that's the only way to solve the performance problem, which is great, because you can use the best tool for the problem you want to solve. And in this case, very fast and efficient reads. I suppose the limit is only in your pocket. You can start with something simple, even a single read model, and slowly evolve your application to meet new requirements, some extra production level, et cetera.
Robert: The obvious question, you have this write model, which is the source of truth and you have one or more read models, which are only useful if they have a reasonably up-to-date information. How does the data get from the write model into one or more read models?
Andrzej: You should launch something called a projection, which will consume all the events from the common handling part, and update various read models in near real time.
Robert: So we're talking about this in a conceptual way as more of an architecture pattern. Are there some well-known tools or frameworks that you can build this on without a lot of custom work?
Andrzej: Yeah, I could recommend Akka Persistence or Akka Persistence Typed, both solutions are really mature. I use it in many production solutions, and I can honestly recommend them. So do not reinvent the wheel, use some existing solution developed and tested by many programmers, many users.
Robert: There is going to be some latency before data from the write model shows up in the read model. If you adopt some of the more proven technologies and with some tuning, how far behind is the read model going to be?
Andrzej: It should be less than I would say 10 millisecond. If you have bigger latency, maybe the problem is with your read models. Maybe the read models, updates are slow or something like that. But the whole process of moving the event from one part, so from the command handling side, to the read handling side, should be really, really fast if you use good tools for that.
Robert: Many enterprises are doing something like this where they have transaction databases, data gets pulled out of them into say Hadoop jobs and data warehouses. They may not be calling it CQRS, but is this putting a name and a bit of a sharper point on something that most organizations are doing anyway, and just not calling it that?
Andrzej: Probably because this pattern is really heavily used. For example, any database internals is using this pattern. So write something to the log, and then update some read models that could be queried by users. So I suppose many companies are using event sourcing, or similar solutions to event sourcing without knowing that they are using event sourcing.
Robert: How widespread is the adoption of CQRS by people who've been influenced by the idea of CQRS, and they set out to do that, and if you ask them what they're doing, they'd say CQRS?
Andrzej: Okay. So my personal observation is that more and more systems require real time processing, low latency, high throughput. And with CQRS and event sourcing, you can address all these requirements, or be ready for them when you hit a certain threshold. Also, stream processing, stream platforms, as a way to build systems are getting more and more popular each month. And event sourcing naturally matches such solutions. And from a completely different perspective, functional programming community has been growing from quite some time. And here event sourcing is also a pretty good match. So in the table events, avoiding side effects. Not to mention about the state, which is nothing more than a fold left on apply event function. Also Reactive Manifesto, probably you heard about it, it's built on top of asynchronous message processing. So building lasting, resilient, responsive applications, leverage the message driven approach for creating software. And if your messages or events, it's even easier to do this right.
Robert: I want to change direction once again, and talk a little bit about what type of persistent store would you put your events in? I can imagine you could put them in a lot of different types of persistent stores. But are there some either specialized event sourcing or of the more well-known open-source databases, are there some that are a better fit?
Andrzej: Let me start with relational databases. Because I wouldn't cross out SQL database as an event store. You would be shocked how fast append-only writes will be handled by classic SQL database. Of course, SQL databases were designed to work well on a single host. Sooner or later, you will hit two possible limits. So the first one is, of course, the CPU power. And the second one is required disk space. And from my perspective, the second one is even more often the problem than the first one. This shouldn't be a surprise, since in event sourcing systems, we are not deleting any data, and events might be pretty fat structures. Of course, vertically scaling CPU or disk space is extremely expensive from some level. However, if you need to handle a moderate load, then start with SQL database. It will save you a lot of time and probably a lot of money. Once you grow so big that SQL database is not enough, you will probably also have enough money to migrate to a different storage.
Andrzej: Now, if you know based on some data that storage size will be a problem, or you need to handle thousands of writes per second, then you should start thinking about distributed databases, which can be scaled horizontally. And a perfect example of such database for event sourcing is Cassandra, or any database with similar concepts under the hood. So ScyllaDB, DynamoDB, et cetera.
Robert: Andrzej, you wrote a blog post about event serialization. Why would I want to serialize my events?
Andrzej: Well, first of all, you don't want to map event attributes to columns. Because it's like the worst idea. In most cases, events will have a completely different fields. So you'll end up with a lot of nullable columns. Also, it's not very efficient to read such a schema, and then parse column by column. From time to time, you want to also change something in the event, so as to remove some fields or add a completely new event. Good luck with schema evolution backward forward compatibility when using column per field approach. That's why you should learn and analyze serialization for event sourcing.
Robert: What are some of the more popular serialization options?
Andrzej: So the two main categories are plain text formats like JSON, XML, YAML, and binary formats like Avro, Kryo, ProtoBuf, Thrift, and others. And most people usually choose JSON for event serialization, because it's well known, anybody can read it, produce it, et cetera. It's okay. But from my experience, it's worth spending some time and at least analyze binary formats. Why? Because they are faster when it comes to serialization and deserialization, produce smaller payloads. It's important when you want to save disk space. Also, they will give you very good support for schema evolution. So if you want to be always backward-forward compatible, it will be much easier with some binary formats than with JSON.
Robert: The last topic, you brought up a little earlier was snapshotting. I can guess what that is. But why don't you describe what it is when you would use it?
Andrzej: Snapshotting is a performance optimization that consists in saving the current state every X of events. It could be 1000, or 10000, or more or less, depending on your case. This way, you don't have to load all events from database to get this current state. You just need to load the most current snapshot, and then all events created after this snapshot.
Robert: Great. Before we wrap up, is there anything else you wanted the listeners to know about event sourcing or CQRS that we haven't covered?
Andrzej: Okay. So I've been developing event source systems for over four or even five years. And I must say that it was one of the best decision to try an event sourcing approach. It has really changed my mind, in the sense that when I'm thinking about developing a system or designing a system, I'm not so focused on the state itself, which is very common for many programmers, and leads to many problems. I pay more attention to the events, the overall business process, the events flow in the whole system. Thanks to that, I have learned a lot of techniques to overcome many challenges when building a distributed system. These are not single-use patterns, they are used everywhere where speed and scaling is a problem.
Andrzej: Event sourcing, as someone noticed, is turning your database inside out, which is interesting idea. Because when you're creating an event, event source systems, you're actually creating a very specific one-of-a-kind database that will solve your business problems. The inside-out approach might also be a good candidate for the entire microservice asynchronous communication pattern. By making the events and the schema public, each microservice can arbitrarily use data emitted by others. Of course, a translation layer will be useful. We don't want to expose the internal schema of the events. And as a developer, I can only say that extending your toolbox with event sourcing is something that you won't regret, and you will benefit from it for a long time.
Robert: If listeners would like to find you on the internet, where should they look?
Andrzej: You can find me on LinkedIn, you can check my blogs on Medium, or my homepage.
Robert: Andrzej, thank you for speaking to Code[ish].
Andrzej: Thank you very much.
A podcast brought to you by the developer advocate team at Heroku, exploring code, technology, tools, tips, and the life of the developer.
Lead DevOps Engineer, Salesforce
Robert Blumen is a dev ops engineer at Salesforce and podcast host for Code[ish] and for Software Engineering Radio.
More episodes from Code[ish]
Karan Gupta and Marcus Blankenship
How can applying the right technology choices at the right time impact your coding and business choices? Karan Gupta explains how practicing “pragmatic engineering” can have an oversized impact on business and business efficiency. →
The episode focuses on managing a certificate authority (CA) within an enterprise. The internal CA is compared on many points to PKI on the public internet. →
James Dong and Chris Castle
How much can a day of coding help others? James Dong created a platform to help small businesses impacted by the COVID-19 pandemic sell gift cards online. Learn how this platform, built on Heroku, provided a way for residents to support... →