53. Scaling Telecommunications Data with a Service Mesh
Hosted by Julián Duque, with guest Luca Maraschi.
When you're one of the largest telecommunications companies in Canada, you're responsible for building and maintaining services that can handle a volume of data many times greater than the average web server. Julián Duque had a chance to sit down with Luca Maraschi, a chief architect at TELUS Digital, during NodeConfEU. On this episode, they talk about the frameworks and tech stack Luca has chosen to build the service mesh which facilitates data flow between their microservices and clients.
Julián Duque is a senior developer advocate at Heroku, and he's interviewing Luca Maraschi, a chief architect at TELUS Digital. TELUS is a large communications company in Canada, which processes a massive amount of data produced by millions of customers all across the country. One of their challenges, for example, was to design a single datastore which provided a consistent experience across online services and offline ones, such as call centers or brick-and-mortar stores. In the end, Luca describes how they were able to break apart the existing monolith in favor of many different microservices.
Luca notes that the data sources are not uniform; they can be coming from Excel files, Cassandra clusters, PostgreSQL, MongoDB, and so on. He talks about the technologies they used to accomplish this feat, settling on GraphQL as a main component of the stack, as their services already act as edges in a graph. TELUS also makes use of Kuma and Kong, two relatively new projects. Luca argues that new technologies are not something to be afraid of; because they are often open source software, adopting them becomes an opportunity to influence the direction those projects take.
The conversation wraps up with a discussion on the future of distributed systems. In the end, Luca believes that more responsibility for data manipulation should be placed on the client, rather than the server identifying what it believes to be the "perfect" data set. This allows for greater flexibility, and opportunities for changing backend strategies as needs evolve, without user-facing disruption.
Links from this episode
- Luca's talk at NodeConfEU 2019 titled "Scaling data from the lake to the mesh via OSS"
- Kuma (which ties microservices together) and Kong (which provides an API layer for them) are two technologies which Luca loves to use for TELUS' needs
- GraphQL helps TELUS with its goal of providing data as nodes in a graph
Julián: Welcome to Code[ish]. It's me again, Julián Duque. I'm a senior developer advocate at Heroku. And we are continuing recording from NodeConfEU, from the beautiful city of Kilkenny in Ireland. And I have the pleasure to be talking today with Luca Maraschi. Luca Maraschi is a chief architect at TELUS Digital. And he's very passionate about distributed systems. He gave an amazing presentation about architecture and how they are solving problems at scale. So today we are going to have a very technical and interesting episode about a subject that I'm also very passionate about. So Luca, tell us a little bit more about what you do at your job and the kind of problems you are solving.
Luca: Yeah, so I'm bringing some Italian sparkle and craziness into the Canadian ground. I'm working at TELUS Digital, like you said. And basically, my job is to help everyone to architect the best platform that we can, one that can scale to all our customers. And basically, converge the world of developer experience with the world of customer satisfaction. And at the same time, try to innovate in a 100-year-old company and industry. So it's very challenging and different from my past experience.
Julián: I think TELUS is a telecommunication company, right?
Julián: I come from a telecommunications background; I'm a telecommunications engineer myself. And I know that working with a telecommunications company, you are dealing with a lot of data, with different customers that are connecting to your systems. So you need to be thinking at a really big scale. It's not your regular, common web application that is getting traffic. You need to start thinking more about complex systems, how to get those systems interconnected, how to think in microservices, and to have a very good approach to be able to serve.
Julián: So why don't you tell us about what you were speaking about today at NodeConfEU? And how are you solving that problem from a technical perspective?
Luca: You nailed it. So we are dealing with massive quantities of data that come in real time from our consumer users. So from the networking side, from all our customer interactions. So you go on the website, you click ... and there are millions of interactions that are going across the whole system. So it has been a very interesting journey, and that's what I was trying to tell the audience this morning.
Luca: The journey started when we looked back into what we had. So this complicated network of systems, duplicated systems, systems that needed to interact across different sides of Canada. And we took a journey to think, how can we change this? How can we make it more accessible and more self-serve? How can we improve the process of data quality, for example?
Luca: And at the end of the day, this is a fantastic exercise when you look at big data or AI systems. But on the other side, it also fits perfectly the picture of serving APIs at scale. Because we have millions of users that are using our online services, and we have many that are also using our offline services like the call center, for example. Or they're going directly into a store, and all these systems are connected. And the experience needs to be a single experience. So we took a 10,000-foot overview of the system and we said, "Okay, where shall we start?" And we looked at TELUS and the more enterprise world as an evolving monolith. So we started thinking, how can we break this monolith into pieces and bring in values that are coming from modern and evolving technologies like functions, like more actor-model architectures?
Luca: And bring them to a problem that looks simple, but is super complicated, which is data. If I tell you data, you think about a database, but we are thinking about, really, hundreds of different data sources, which can even be an Excel file. It can be an Excel file, it can be a Cassandra cluster, it can be a Mongo instance, it can be everything. Oh, let's not forget about Oracle, Postgres, and millions of others. And in the talk, I tried to walk the whole audience through this process, where we divide and conquer the problem and we look at it as a pure distributed-systems problem. So that was probably the nice engineering mad science, a little bit of craziness.
Luca: I always say that I love some pepper on my pizza because it gives a little better flavor. And, as an Italian, I found, "I need some pepper there." We took a very progressive approach to solving the data distribution problem. We could talk for hours about data, right? But to get to the nice, juicy part of this nice steak, we actually looked at a mesh approach. So we know that our services are moving from microservices to more of a mesh. And meshing services is a very important problem: we need to solve the distribution, we need to solve the acceleration of developers delivering features, and we need to solve the logistical complexity of how to coordinate all the services.
Luca: And we said data is no different than a microservice; we can do exactly the same. We cannot create a data function, but if you look at data structures, we have the graph, which is a very interesting mesh. So that's how the whole journey started: thinking of the data mesh, from a monolith-like store to a mesh of ephemeral data that continues to get reiterated with an actor-model pattern into this massive-scale, secure system, where for read operations we use a purely physical approach. I have a nuclear physics background, so I wanted to reuse some of my years of university and say, "Well, action, reaction. That's why I started." And so we have this streaming process that continually keeps hydrating these ephemeral graphs into Elasticsearch.
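The action/reaction idea Luca describes, a stream of events continually hydrating an ephemeral, queryable view rather than mutating the source of record, can be sketched conceptually. This is a minimal illustration in TypeScript, not TELUS code; the event shape and all names here are hypothetical:

```typescript
// Conceptual sketch: events flowing off a stream (e.g. Kafka, per the
// episode) hydrate an ephemeral read-side view. The source of record
// is never touched; the view can be rebuilt at any time by replaying.
type CustomerEvent = { customerId: string; field: string; value: string };

class EphemeralGraph {
  private nodes = new Map<string, Record<string, string>>();

  // "Reaction": each incoming event updates the read-side node.
  hydrate(event: CustomerEvent): void {
    const node = this.nodes.get(event.customerId) ?? {};
    node[event.field] = event.value;
    this.nodes.set(event.customerId, node);
  }

  read(customerId: string): Record<string, string> | undefined {
    return this.nodes.get(customerId);
  }
}

const graph = new EphemeralGraph();
// "Action": a batch of events arriving from the stream.
const stream: CustomerEvent[] = [
  { customerId: "c1", field: "plan", value: "mobile-basic" },
  { customerId: "c1", field: "region", value: "BC" },
];
stream.forEach((e) => graph.hydrate(e));
console.log(graph.read("c1")); // plan and region for customer c1
```

Because the graph is ephemeral, changing its shape is cheap: drop it, change the hydration logic, and replay the stream.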
Julián: So tell us the technologies and the stack you're using to solve these very interesting problems. So the different pieces that you are adding, or the ingredients, as you were mentioning when you were giving the talk. And we are going to be sharing your presentation with our audience as well. You were pretty much cooking, very engagingly adding a lot of different ingredients to this special dish you were preparing. So tell us a little more about the different ingredients and the technology behind the architecture you are using.
Luca: Yeah. Working in a kitchen and being a chef, I experienced that life and it was the most inspiring change in my life. I really learned about leadership and how to drive a team to success, inspiring them, delivering quality. And that's why I'm so passionate about comparing architecture to food and cooking in the kitchen.
Luca: So when we look at this problem, if you go on the market and you Google it or Bing it or use whatever search engine you want, you will find that there are many products on the market that can solve this problem, and they are all paid products. So we wanted to take a more provocative approach to the problem and say, "Okay, why do we go for commercial products when we can go for organic-like products? Something that is vibrant, something that we know where it comes from."
Luca: I wanted to have almost that kind of farm experience. I want to know exactly where the different components were created and who made them. So you touched a very important point. For me, it was very important to connect my team and the organization with the makers and have that kind of one-on-one relationship. So we took a purely open source approach and we decided, "Let's go open source first and cloud native as much as we can." So we are using, like I said, Fastify as a framework of reference to structure all our workers, all our APIs. And we are using Elasticsearch as a very controversial graph database because, like I said, we don't really care about persistence. We care more about the speed and the way we can change the data structures without impacting the normal delivery processes, the development process. Basically, the governance around our data.
Luca: We are also using GraphQL as an edge technology. And why? Because GraphQL is a graph, Elasticsearch is a graph. So we said, why not marry two technologies that are very similar? GraphQL is definitely one of the main components in this stack. We're using TypeScript, not really to build applications but just to define contracts. Because, at the end of the day, when you consume data, I never believed in the utopia of fully denormalized data, what people might call NoSQL or document-based data sets. They give us such a limited set of opportunities to change that it becomes even more costly and more difficult to change than our original database. That's why we decided to use something where we could change the schema and the source, but the consumption is purely strict contracts.
Luca: So take a step back to Thrift or Avro, or any other schemaful contract system. We are still using HTTP as the protocol for transporting data. But we are clearly exploring; we want to take it two steps ahead and say, "Why don't we just use gRPC?" And use HTTP just as a simple, I call it, purely public interface. And privately, we just work with a pure schemaful approach and more of an RPC model. Because it fits perfectly this picture of the evolution of our mesh goals. The mesh is transforming us from thinking in models and entities into a purely RPC model, because it scales better. We can work on way more organic and better-designed algorithms to scale tiny functionalities independently. We can use cold starts for something that is not used so much.
Luca: So there is a lot of thought about modeling this data, and that's why TypeScript. It's a nice DSL, nothing more than a DSL that we compile into different targets like GraphQL schemas or Elastic schemas, or purely ORM-based mapping schemas. Because at the end of the day, data comes from a Kafka stream, but it still represents the data set that was in the source of record. So we don't apply any massive transformation at the source. We prefer to do it in this stream processing layer. Because, once again, you never want to change the source of record, and we have so much flexibility at the edge that it is way cheaper and faster.
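The idea of one TypeScript-defined contract compiled into several targets can be sketched roughly as follows. This is a hedged illustration of the concept, not TELUS's actual tooling; the contract shape and the type mappings are assumptions for the example:

```typescript
// A tiny "DSL": one contract definition, two compilation targets.
type FieldType = "string" | "int";

interface Contract {
  name: string;
  fields: Record<string, FieldType>;
}

// Target 1: a GraphQL type definition (SDL string).
function toGraphQL(c: Contract): string {
  const gqlType: Record<FieldType, string> = { string: "String", int: "Int" };
  const body = Object.entries(c.fields)
    .map(([name, t]) => `  ${name}: ${gqlType[t]}`)
    .join("\n");
  return `type ${c.name} {\n${body}\n}`;
}

// Target 2: an Elasticsearch-style index mapping object.
function toElasticMapping(c: Contract): object {
  const esType: Record<FieldType, string> = { string: "keyword", int: "integer" };
  const properties: Record<string, { type: string }> = {};
  for (const [name, t] of Object.entries(c.fields)) {
    properties[name] = { type: esType[t] };
  }
  return { mappings: { properties } };
}

const customer: Contract = {
  name: "Customer",
  fields: { id: "string", visits: "int" },
};
console.log(toGraphQL(customer));
```

The point of the pattern is that the source schema can churn freely while consumers only ever see the generated, strictly typed contracts.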
Julián: I saw that you were also using another technology to manage this, it was called Kuma.
Luca: Kuma, yeah.
Julián: Can you talk a little bit more about it?
Luca: Well, I'm very passionate about Kuma and Kong. Apart from having a connection with the founders, I have a very strong connection with the technology behind them. So Kong, for whoever doesn't know what it is, is an API gateway. It's basically NGINX with OpenResty and Lua and a lot of plugins on top. And I think it's such an elegant, well-written, well-designed, and versatile system for solving this challenge, this nice adventure of terminating HTTP at a pure network level and going down with gRPC. It's also so well integrated with Kubernetes. So all these points were drivers in choosing Kong as our gateway.
Luca: And Kuma is a natural evolution when you think about scaling services east-west; Kong does mostly north-south. And for this east-west traffic, we looked at what the best tool was that we could use. We knew that we wanted something Envoy-based, and then the choice automatically came down to Istio or Kuma. So the choice was very difficult. Istio is a very well adopted and known mesh system. Kuma was the new boy in town, and we really wanted something that fit this fast-evolving, more cloud-forward mindset.
Luca: And Kuma, well, I cannot say anything bad about it. It is fantastic to manage, easy to install. The experience in monitoring and tracing is phenomenal. It's new, but I never see newness as a negative; I see it as a huge positive, a chance to influence the community around it. It's fully open sourced, so we are able to contribute to it, to help move it forward. If you look at the architecture, I think Kuma is definitely the right choice, especially when you look into scaling across data centers and making a federated mesh. I cannot stop recommending it; I feel like I'm almost a salesman for Kong. But I think it's such an elegant piece of technology that I cannot be more enthusiastic about talking to everyone about it.
Julián: You've got me curious about it, and we will definitely be checking it out, because it sounds super interesting.
Julián: This is what I love about open source, and you were saying it at the beginning. You took the chance to innovate and to adopt open source, to be a little bit disruptive, instead of just adding a black box to your solution where you don't know what is going on inside. Yeah, maybe it's solving a problem, but you are lacking control, you are lacking the ability to mix and match different products and technologies, to create the architecture and solve the problem the way you want to, and to change it, iterate over it, and have a better version every time.
Julián: So that's why open source is so interesting, and I've been involved in the open source community for a while for the same exact reason. You were mentioning some modules around Fastify and GraphQL. I remember the Fastify Elasticsearch one, and the Elasticsearch GraphQL connector was-
Julián: Compose, yeah, it was the other one.
Julián: Were those modules created by your team just to solve the problem, or did you find them in the community already? Had somebody else solved those problems?
Luca: We actually found them on the open internet, the most beautiful tool that we have and the most powerful one. My team and the organization didn't have to build anything. It's mostly that beautiful exercise of taking nice food, putting it together in the right order, and making the most amazing pasta or risotto, or whatever Italian food you like. The sauce, and how the sauce blended together, was so great, because the ingredients that we found were all high quality ingredients. Honestly, it was so easy to take the architecture from a conceptual point, which was purely on paper, on a whiteboard, down to reality, because we found exactly the tools that we were looking for.
Julián: And I also saw you are using Apollo as a GraphQL server. I had the opportunity to work before with schema stitching, because that was the way you were able to connect different services into one. But it was not performant, because by trying to connect those multiple things into one huge one, at the end, you are going to end up having a monolith. But you mentioned something that I hadn't heard before, which is Apollo Federation. Can you talk a little bit more about it?
Luca: Yeah. So like I told you, I really love distributed systems. I think I could talk for days about it. And one of the problems that we had when selecting the technology and the tools for solving that API complexity was exactly what you just said. We went for GraphQL and we had only two choices. We either duplicate services, which means we have a lot of endpoints, which means that we basically go back to the microservices world. So the mesh, boom, would disappear. Or we would use schema stitching.
Luca: And schema stitching, like you said, is basically creating yet another monolith. Because how it works is that you take all the schemas, there is a stitching process that puts all the schemas together, and then they are deployed into a service. So when you want to scale the service, you cannot apply any smart routing or balancing algorithm; what you're doing is just going for the old traditional round robin, because they're all copies of each other.
Luca: So the question is, is the mesh, then, the right thing to do? And I would say no. So Apollo was just launching Federation. And like I said, the sparkle of craziness. We said, "Let's try it. It fits the picture." Because with Apollo Federation, you actually centralize only the routing logic. And at this moment in time, I still think that there is a long way to go before schema federation is really usable at large-scale consumption, because you still have a single point of failure, which is the Apollo gateway. The gateway is a sort of router that knows all the different schemas and where the schemas are deployed, and does, basically, the resolver routing.
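Conceptually, the gateway's job of centralizing only the routing logic can be illustrated with a toy query planner: map each requested field to the service that owns it. This is not Apollo's implementation (which builds a query plan from composed subgraph SDL); the service names and the field-ownership map below are hypothetical:

```typescript
// Toy federation router: the gateway owns no data, only the mapping
// of fields to the services that resolve them.
type ServiceName = "accounts" | "billing";

// The gateway's composed view: field -> owning service.
const fieldOwners: Record<string, ServiceName> = {
  name: "accounts",
  email: "accounts",
  balance: "billing",
};

// Group the requested fields by the service responsible for each,
// producing one sub-request per owning service.
function planQuery(fields: string[]): Map<ServiceName, string[]> {
  const plan = new Map<ServiceName, string[]>();
  for (const field of fields) {
    const owner = fieldOwners[field];
    if (!owner) throw new Error(`unknown field: ${field}`);
    plan.set(owner, [...(plan.get(owner) ?? []), field]);
  }
  return plan;
}

const plan = planQuery(["name", "balance", "email"]);
console.log(plan.get("accounts")); // [ 'name', 'email' ]
console.log(plan.get("billing")); // [ 'balance' ]
```

Unlike stitching, each owning service can then be scaled and deployed independently; only this thin routing layer is shared, which is also why Luca calls the gateway a single point of failure.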
Luca: But we actually see a huge opportunity. I mentioned Kong to you before, and we are using, clearly, Kubernetes, and Kong is an Ingress controller. So I actually see a huge opportunity to do some nice community experiments and build a nice Kong plugin that can support the Apollo gateway. The challenge there is that the gateway is clearly ... I don't recall if it's fully open sourced; I think they have a piece that is not open source. Clearly, there's IP. But we are actually looking into creating a case to basically open communications with Apollo and say, "Well, why don't we try this thing?" Because it would be massively impactful for everyone in the world.
Luca: I honestly believe that if you're asking me what would be one of the breakthrough things that can happen, I would definitely tell you: the moment that we can scale this Federation, we are solving a massive problem in providing that access to developers in a self-serve and cost-effective way. So you really can draw the demarcation line between, I would say, backend and API development and front-end consumption. And you can see how React is growing and React Native is growing, and all these kinds of technologies are growing into a place where the demarcation line is clearly set, and it's purely the responsibility of the client to retrieve the data, and not really of the server side to provide the perfect data set to the client.
Luca: So we share the responsibility between client and server, which is amazing, because you basically take one step further in pushing this data mesh even closer to the edge.
Julián: Yeah, this is fascinating. I just want to get back into practice and start building distributed systems, because I really, really love this subject.
Julián: So I hope everybody listening to this episode also enjoys Luca's talk. We are going to be sharing that talk in the description of the episode, and the video as well, so you can see the beautiful presentation and the architecture they have been working on at TELUS. And Luca, are there any final words or recommendations you have for our audience to get them as passionate and as engaged as you are in distributed systems, solving problems at this scale?
Luca: I actually want to quote what you just said: go back and try it with your own hands. It's a beautiful world. Honestly, we are so lucky to live in this moment in technology. The technology landscape is changing fast. And I think distributed systems are really the future of developer experience, of customer experience. And I would really tell everyone, just try it with your own hands. It's something that is very infectious, because you literally feel the passion and the spirit growing inside of you, and you really want to scale it and try to find new ways. And that's what I would actually tell everyone: just try, and try to be as crazy as you can.
Luca: Because there's no limit to creativity. And another important point is, don't isolate yourself in trying distributed systems. The community is the best place to test it. There are many people that know different sides of it, and the best solution comes from being connected with others and getting inspired. And getting, sometimes, that small sparkle of craziness that will make you create the next generation of a system.
Julián: So this was another Code[ish] episode. Let's wait until the next one. Thank you and bye-bye.
A podcast brought to you by the developer advocate team at Heroku, exploring code, technology, tools, tips, and the life of the developer.