
Message processing failed... But what's the root cause?

This session was presented at .NET Enterprise Developer Day, hosted by AWS.

How do you investigate failures in a distributed system? If your first thought is to look at the call stack, then good luck… In a distributed system, there is no such thing as a single call stack! Instead, the work is scattered across multiple services that each tackle a specific concern and communicate through a continuous stream of messages flowing through the system. That call stack becomes a haystack, so how do you find the proverbial needle?

Luckily, there are techniques and tools to regain the overview we lost. In this session, we’ll look at modeling techniques, integration testing, and a deep dive into instrumentation with OpenTelemetry to help create visibility into your entire distributed system. And even if you’re not (yet) using messaging in your architecture, you’ll walk away with concrete takeaways around system observability that you can use in other architectures as well.

Also check the samples related to the use case mentioned in the presentation, for AWS and Azure.

Transcription

00:01 Laila Bougria
Hello everyone. My name is Laila Bougria and I'm super excited to be here with you at the .NET Enterprise Developer Day. Now, to be completely fair, I'm probably currently sleeping and I'm on vacation, but I still really wanted to be a part of this conference. What I did is pre-record this session for you. I hope you will enjoy it. With that being said, let me just get started. I've built this session based on my previous experiences when I was starting to get into message-based systems and distributed systems in general, and also the sort of pain points that I have seen lots of other teams go through throughout my career and while I was building up this knowledge. Let's get started with the story.
00:50 Laila Bougria
Once upon a time there was a retail business that decided to build an online shop for their offering. They had a few physical shops doing very well, so they decided to go online as well. They started pretty small, but as their business venture grew, so did their system. They quickly understood that in order to keep up with all of that increasing complexity and those constantly changing requirements, they had to start decoupling the system. Now they decoupled it into separate components, and slowly it grew into what we today know as a distributed system.
01:25 Laila Bougria
Now that also meant that we had multiple teams in the house at that point, Team Lorem, Ipsum, and Dolor. For those of you who don't know, Dolor means pain because there's always one team that suffers a little bit more than others. Now, as a way to communicate between these autonomous components that are part of this now larger system, they chose to use messaging as their main communication pattern. So let's take a look at one of the most crucial business processes that is part of this system or application: placing an order. We sort of know how this works, but I'll quickly go over it anyway.
01:59 Laila Bougria
We want to store the order in some type of database. Then we want to charge the credit card so we can make some money, package that order, ship it, also send an invoice or a bill to our customer, and then adjust the stock so that it reflects the sale that was just completed. Maybe figure out if we need to order some more stock if something is about to deplete or something like that. And finally, we also want to verify the customer status. Maybe they're eligible for a discount code or maybe they've been placing so many orders the last month that they're now a gold customer.
02:31 Laila Bougria
So if you think of these responsibilities in terms of the teams that were managing them, then we have Team Lorem responsible for everything regarding that order fulfillment, if you will. Then we have Team Ipsum that is responsible for everything regarding the stock. Finally we have Team Dolor responsible for everything regarding customer care, hence the name. In order to understand how such a business process would work in a distributed system with many different components, we need to start thinking about those components in terms of how they interact with each other.
03:10 Laila Bougria
So this drawing gives us an overview of how that looks. We have a place order message, a command that starts the entire business process. It goes to Sales, which makes sure that we get the money by sending a command to Payments, and when that is done, the `OrderPlaced` event is published. Other components in the system can then subscribe to that event and make whatever changes make sense within the context of their individual business components. We're making use of the publish-subscribe pattern here, and the advantage of splitting things up this way is that we end up with autonomous components that can more easily evolve on their own.
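To make that interaction a little more concrete, here is a minimal sketch of that flow using NServiceBus-style handlers. The message and handler names are illustrative assumptions, not taken from the actual samples, and the flow is simplified: in the talk, `OrderPlaced` is only published once the payment has completed.

```csharp
// Sketch: a command handler that does its own work and then publishes an event
// that other autonomous components subscribe to (illustrative names).
using System;
using System.Threading.Tasks;
using NServiceBus;

public class PlaceOrder : ICommand { public Guid OrderId { get; set; } }
public class ChargeCreditCard : ICommand { public Guid OrderId { get; set; } }
public class OrderPlaced : IEvent { public Guid OrderId { get; set; } }

public class PlaceOrderHandler : IHandleMessages<PlaceOrder>
{
    public async Task Handle(PlaceOrder message, IMessageHandlerContext context)
    {
        // Store the order, then ask Payments to charge the credit card.
        await context.Send(new ChargeCreditCard { OrderId = message.OrderId });

        // Once payment completes (simplified here), publish the event.
        // Shipping, Billing, Stock, and Customer Care each subscribe independently.
        await context.Publish(new OrderPlaced { OrderId = message.OrderId });
    }
}
```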
03:53 Laila Bougria
Building a system that's based on autonomous components does bring us quite a few advantages. It supports the idea of having our teams operate autonomously, for example, which means that our teams can do their own planning, including their individual deployment planning. They can use whatever tools, frameworks, even programming languages for that matter, that best suit the needs of the components that they are building.
04:20 Laila Bougria
They should also be able to operate in isolation. What does that mean? Well, it basically means that if you have a running system in production and there's one component that is down for whatever reason, the rest of the system should be able to continue operating without too much friction. It shouldn't bring the whole system down, right?
04:40 Laila Bougria
As long as the interactions that are happening between those components are happening based on a well-known and stable contract, which in our case are messages, then they should also be able to evolve independently. Now there's a caveat because anything that can evolve independently can also fail independently.
05:00 Laila Bougria
So let's consider a relatively simple error that our Team Ipsum ran into. Here we have a call stack, basically pretty simple. We run into an `InvalidOperationException` as we try to get a product from a set of products; in this case, that product just doesn't exist. Now looking at this, our minds start racing through a bunch of questions, and the engineers on Team Ipsum start to go through all of these and consider what happened, where did this happen, in what context did this happen, and why?
05:36 Laila Bougria
Usually debugging is the next step, because that's how we can build a step-by-step context of what's happening in the system: what state is coming in, how are we transforming it, in which order are we calling the next operations, and how may that be affecting things. That's how we usually get to the root cause of problems. However, in a message-based system, you're going to have to ask yourself a bunch more questions. First of all, we need to have a very good understanding of which specific actions started the chain or sequence of messages that we are seeing in this production environment, because as we saw earlier, that could be multiple components, right?
06:19 Laila Bougria
We need to have a good understanding of which processes are involved. Again, we have multiple components participating in this single business process, and we need to understand which ones for each specific use case. We also need to have a very good understanding of the specific chain of messages that was sent, because sometimes there are minor differences in use cases that could mean that chain is a little bit different. And finally, we need to understand in which order those messages were processed and whether that is affecting the behavior that we're perceiving in some way.
06:55 Laila Bougria
So, let me step into the shoes of the engineer that was assigned this issue, and you can decide for yourself if I'm drawing from experience or not. So, I'm assigned this issue/ticket, whatever it is, and I'm looking at that exception message and I'm like, "Okay, product does not exist. Okay, seems something...probably something obvious. Let me just quickly debug through this."
07:22 Laila Bougria
So I start the failing component in which I have that failed message. I drag it back to the input queue and boom, immediately I see the exception. I'm looking at any inner exceptions that might be there, and I think to myself, "This is odd, how did I get here?" I start scrolling down the call stack to understand how I got there, only to realize that at the bottom of the call stack, I'm ingesting a message off a queue, so that's not really giving me much insight, right?
07:50 Laila Bougria
Okay, no problem. I'll just start the initiating component that started this entire business process, right? And then I should be able to F11, F12, and step into the next method and just debug my way through this. But that's not going to work either, because at some point we are sending a message to a queue and that's being handled by another component. You know what, I'll just start the entire solution with all of the components in it, and then I should be able to debug everything. But that's not going to work either, because we decided to split everything up into different solutions and even different repositories, because reasons.
08:33 Laila Bougria
Okay, okay, okay. I'll just start eight different instances of Visual Studio or whatever your favorite IDE is. I'll attach the necessary processes together and then I should be able to debug everything, right? Well, it's not impossible. It's challenging because you have to take into account which specific infrastructure you need to have running for each individual component, but it's not impossible.
09:01 Laila Bougria
Two hours in, I have the whole system running in debug mode on my machine. At that point I have 158 breakpoints set all over the place and I'm debugging through all of it. So, I hit a specific breakpoint and I'm like, "Okay, F10, F12. Okay, I should be able to hit F5 until the next breakpoint. I'm probably safe here." I hit F5 and then everything just pauses for a second, and then I hit a breakpoint that I didn't expect and I'm like, "Come on, are we serious right now? I have done this 500 times already and I forgot another breakpoint." Now at that point (I have Mediterranean roots) I'm already cursing in at least three different languages. I'm like, "Okay, people, stay away. I need a break. Let me just take a moment and go breathe." I'm thinking, "Come on, Laila, this is so straightforward. Why is this so hard? What else can I do? How can I fix this? Oh, I know. Let me add some logging."
10:07 Laila Bougria
So, I go back in, sit at my desk, fresh energy, and I start adding log statements all over the place in my code, still debugging through the whole system and looking at those log statements that are being emitted. And, I'm seeing order 257, and I'm like, "Wait a second, 257. Wasn't I just debugging order 283?" So, I wonder, "Is this message even mine?" I look around at my colleagues and I'm like, "Is there anyone creating orders on the testing environment right now because I am debugging something!"
10:44 Laila Bougria
Now, at that moment, one of my colleagues looks at me in disbelief and is like, "Laila, what are you doing? Are you testing the whole system, debugging through all of it?" And I'm like, "Well, yeah, I'm trying to get this issue fixed and I need to understand what's going on, right?" And they look at me and say, "Well, do you think anyone at Amazon runs into a failure and thinks to themselves, oh, let me fire up the entire system so that I can debug through this?" Oh, well, I guess not when you put it like that, probably not.
11:21 Laila Bougria
But then the question is, what do we do instead? Well, we test. And if I'm stating the obvious to you right now, awesome, I'm happy to hear it. But I still feel it's worth saying, and saying again, because having a good testing strategy is important independent of any architectural style that you are using in a system. But if you're building a distributed system and you have multiple components that are evolving on their own, and you're even using asynchronous communication, in this case messaging, to communicate between those components, it's really, really essential to have a good testing strategy in place. Testing strategies are completely out of scope for this specific session, but I still wanted to hammer on this for a moment: if you run into any failure at any point, add a test. You're missing one, and probably even more tests, at that point.
12:18 Laila Bougria
Now, there is one thing that I want to add about testing in the context of message-based systems, and it's rethinking the rule of three. I've used patterns like Arrange-Act-Assert, the AAA syntax, or Given-When-Then. It doesn't really matter which flavor, right? At some point we'll prepare the test scenario, execute the test scenario, and finally we'll verify whether that piece of code did exactly what we expected it to do or not.
12:48 Laila Bougria
If you translate that into a message-based system, what we would do in that first phase is that we're going to prepare the message that would be processed. Next up, we're going to invoke the message handler that would process that specific message. And it's the verification phase where things get a little bit more interesting because not only do you want to verify any desired outcome in terms of data modifications or maybe API calls or whatnot, but you specifically want to verify all of the outgoing messages that were sent or published as part of handling this specific message.
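As a rough sketch of that verification phase, assuming the NServiceBus.Testing package and illustrative handler and message names (none of this is taken from the actual sample), a handler test could look something like this:

```csharp
// Sketch: Arrange-Act-Assert for a message handler, asserting the outgoing messages.
// Assumes the NServiceBus.Testing package; handler and message names are illustrative.
using System;
using System.Linq;
using System.Threading.Tasks;
using NServiceBus;
using NServiceBus.Testing;
using Xunit;

public class OrderPlaced : IEvent { public Guid OrderId { get; set; } public Guid ProductId { get; set; } }
public class UpdateStock : ICommand { public Guid ProductId { get; set; } public int Quantity { get; set; } }

// The handler under test: it reacts to OrderPlaced by telling Stock to adjust itself.
public class OrderPlacedHandler : IHandleMessages<OrderPlaced>
{
    public Task Handle(OrderPlaced message, IMessageHandlerContext context) =>
        context.Send(new UpdateStock { ProductId = message.ProductId, Quantity = -1 });
}

public class OrderPlacedHandlerTests
{
    [Fact]
    public async Task Handling_order_placed_sends_an_update_stock_message()
    {
        // Arrange: prepare the message and a testable handler context.
        var message = new OrderPlaced { OrderId = Guid.NewGuid(), ProductId = Guid.NewGuid() };
        var context = new TestableMessageHandlerContext();

        // Act: invoke the handler that processes the message.
        await new OrderPlacedHandler().Handle(message, context);

        // Assert: verify the outgoing messages, our handoff to the rest of the system.
        var updateStock = context.SentMessages.Select(sent => sent.Message).OfType<UpdateStock>().Single();
        Assert.Equal(message.ProductId, updateStock.ProductId);
    }
}
```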
13:26 Laila Bougria
Because if you think about it, back in our traditional monolithic applications that handoff would be calling the next method, and we would probably test something like that there. But now we are sending a message to a queue, and that's going to be handled by a completely different component and process in the system. Getting that handoff right is really important, to make sure that the data we're sending to the next component makes sense, because they're going to be basing decisions on the payload they receive on the other side. Okay, that's all good, but as always the devil is in the details, and by the details, I mean the order. Now, one common mistake that I see teams make a lot when they're new to messaging, and I was definitely guilty of this myself, is that the code that we write tends to make a lot of assumptions about the state of the system.
14:26 Laila Bougria
Now the thing is that at some point messages can and will arrive out of order. Even if you have a system in which this doesn't happen for months, it can start happening all of a sudden; you can never be certain that it won't, because handlers could fail, or there could be more latency than you expected for whatever reason, or there could be concurrent messages being processed, and things like that. And if your code makes assumptions about the things that have already happened and those assumptions don't hold up, then your handlers are just going to start failing all over the place and you'll be faced with a bunch of failed messages, not being able to quickly understand why.
15:12 Laila Bougria
Now let's take a look at a snippet in Team Ipsum's code base. It's a very naive stock update implementation: basically, looking at the message payload and, based on that information, increasing or decreasing the amount in stock that I have for a specific product. Now, what's interesting here is that guard clause at the bottom, because what we're basically doing there is checking, "Okay, if this operation led to a negative stock, then I'm going to throw an exception, because I can't accept having minus five PlayStations in stock. That just doesn't make any sense, right?"
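The slide code itself isn't in the transcript, but a sketch of that naive handler, with the guard clause at the bottom, might look roughly like this (illustrative names and an in-memory stand-in for the data store):

```csharp
// Sketch of the naive stock-update handler described above (illustrative names).
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using NServiceBus;

public class UpdateStock : ICommand
{
    public Guid ProductId { get; set; }
    public int Quantity { get; set; }   // positive: stock arrived, negative: stock sold
}

public class UpdateStockHandler : IHandleMessages<UpdateStock>
{
    readonly IDictionary<Guid, int> stockLevels; // stand-in for the real data store

    public UpdateStockHandler(IDictionary<Guid, int> stockLevels) => this.stockLevels = stockLevels;

    public Task Handle(UpdateStock message, IMessageHandlerContext context)
    {
        if (!stockLevels.TryGetValue(message.ProductId, out var amountInStock))
        {
            throw new InvalidOperationException("Product does not exist");
        }

        var newAmount = amountInStock + message.Quantity;

        // The guard clause: an assumption that will not hold up once messages
        // arrive out of order or are processed concurrently.
        if (newAmount < 0)
        {
            throw new InvalidOperationException("Stock cannot be negative");
        }

        stockLevels[message.ProductId] = newAmount;
        return Task.CompletedTask;
    }
}
```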
15:51 Laila Bougria
And although that makes sense from a general perspective, the thing is that we have to think in terms of a distributed system and the asynchronicity that comes with it. If you look at it from that point of view, that is really a false assumption. Now, let's take a look at how this system is currently modeled so that we can walk through it. What currently happens is that after an order is placed, the stock component, which subscribes to that event, will update the stock accordingly.
16:23 Laila Bougria
Now, imagine that it's PlayStation release day and we open sales at 9:00 AM because we know that there's a massive truck arriving at 9:30, and someone goes into the system and increases the stock, which adds a message to the queue. So there's a message to increase the stock waiting in the queue. Now at that point, there is an incredible number of concurrent sales happening at the same time, because people just want to get their hands on a PlayStation. From that perspective, with that piece of code in place, that assumption is just not going to hold up. We could run into a negative stock because we're already accepting sales and we have the stock physically available, but the system does not reflect it yet.
17:10 Laila Bougria
So instead of letting scenarios like these fail because of the assumptions in our code, we need to take that code out and think about how we can model the system to cater for scenarios like this. One option is to just let go of that assumption and accept that the stock can be negative, and solve the problem of an empty stock when we actually run into it, which is when we need to ship our products. This is actually something that is applied all the time in online shops.
17:45 Laila Bougria
I remember when it was COVID lockdown, everything was sold out everywhere, and we were all working from home, and I didn't have a decent screen. I was shopping, looking for a 49-inch screen, because it's 49 inches and I just wanted it, and I found one in stock in one shop somewhere, next-day delivery. Great. I quickly ordered it, assuming it would arrive the next day, and within the hour I had an email saying, "Well, sorry, we're going to back-order this for you. It's going to take a few days, but then we'll deliver this." They already had my money, so I ended up waiting for it. That's how businesses work.
18:27 Laila Bougria
Now, to implement a scenario like this, what we could do is basically accept the sale when it happens, and then at that point ask the stock component, "Hey, can you reserve the stock for me?" If stock then says, "Well, yes, I have your three PlayStations in stock, not a problem", then we'll just continue completing the business process as usual and everything is good in the world. However, if we run into a scenario where stock is telling us, "Well, I'm sorry, but I don't have sufficient stock to be able to ship this order", then we can say, "Okay, we're going to back-order that and take the necessary steps to make sure that we can ship it, but we also want our customers to have a good experience. So if we can't ship it within two working days, then we're just going to cancel and refund this part of the order."
19:25 Laila Bougria
Now, to do that, we could make use of something like a delayed message so we can basically send a message in the future. Now, there are some queuing systems out there that support this out of the box, but if you're running on a queuing system that doesn't, you can use messaging middleware frameworks like NServiceBus to help you with this.
19:44 Laila Bougria
What we would do is send a message in the future; when it arrives, we'll go check, has this been shipped? And if not, we'll just refund that part of the order. Now you're probably looking at this solution and thinking, "Oh, but I would model it this way or that way." That's perfectly valid. There are usually multiple plausible solutions to a single problem. But my point, and what I want you to take away from this section, is to think about those assumptions that we make in the code.
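One way to sketch that delayed check with NServiceBus: the delayed-delivery API (`SendOptions.DelayDeliveryWith`) is real, but the message and property names here are illustrative assumptions.

```csharp
// Sketch: scheduling a "check back later" message using NServiceBus delayed delivery.
using System;
using System.Threading.Tasks;
using NServiceBus;

public class BackOrderAccepted : IEvent { public Guid OrderId { get; set; } }
public class CheckShipmentStatus : ICommand { public Guid OrderId { get; set; } }

public class BackOrderAcceptedHandler : IHandleMessages<BackOrderAccepted>
{
    public async Task Handle(BackOrderAccepted message, IMessageHandlerContext context)
    {
        // Ask ourselves, two days from now (simplified from "two working days"),
        // whether the order has shipped.
        var options = new SendOptions();
        options.DelayDeliveryWith(TimeSpan.FromDays(2));
        options.RouteToThisEndpoint();

        await context.Send(new CheckShipmentStatus { OrderId = message.OrderId }, options);
    }
}

// When CheckShipmentStatus arrives, a handler would verify shipment and,
// if the order still hasn't shipped, cancel and refund that part of the order.
```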
20:11 Laila Bougria
This was a relatively simple example, but sometimes they are a lot more nuanced in a production environment. I tend to make a habit of looking at a line of code and thinking, "What if something that I expect here is just not there? How would the system react? What kind of consequences does that have? Is that something that I want to be dealing with in a production environment translating into failed messages, or do I need to model things a little bit differently?"
20:42 Laila Bougria
Okay, so let's recap this first part to hammer on it a little bit again. We need to invest in a very, very strong testing strategy, so whenever you run into a failure, add more tests. As part of your testing strategy, especially in a message-based system, ensure that you are asserting those outgoing messages: they're your handoff to the rest of the system, which will be making decisions based on the data it receives. And be aware of the assumptions that you are making in your code.
21:14 Laila Bougria
I'm not saying that you shouldn't have any, but it's important to be aware of them, so that when something fails, we can better understand why, and so that we can evaluate whether that assumption is even correct and maybe even validate it with business experts. Because if you took that back-ordering example I just gave you to a business expert, they would tell you, "Well, of course sell stuff even if we don't have the stock, we'll just back-order it, and if we can't ship it to them, then maybe we'll even give them a voucher. They might not love that, but at least we got their money and they'll spend it with us in another way, because even if we don't like it, that's how businesses work."
22:00 Laila Bougria
And finally, expect out-of-order messages to happen, so make sure that you're also taking those scenarios into account. Out-of-order messages will happen, whether it's due to concurrent processing, process failures, network latency happening at a point where you don't expect it, or even an individual component that is unavailable for a while, causing a buildup in certain queues that delays certain things. Okay, our teams take this very seriously. They have very good testing strategies, all of their tests look green, and they go to production because they're ready. They've actually been running in production quite successfully for a while, but then there's this guy Murphy, right? So a message fails in production.
22:50 Laila Bougria
What do we do then? Well, it's easy. We'll go and add more tests. That's what I said: run into a failure, add more tests. But what if you look at the message payload that is failing, and you're trying to add a test for it, and you're like, "Well, of course this failed, this doesn't make any sense. The data that we're receiving here is faulty or incorrect or corrupt or whatever it is." Let's take a look at an example.
23:18 Laila Bougria
So if we circle back to that call stack that we saw earlier and we revisit that error, we were basically trying to update the stock and we ran into an `InvalidOperationException` because the product does not exist. So the data that's coming in is already not correct. In order to understand what's going wrong, we need to go upstream to see where that data came from. That leads us to the order packaged event, where we're iterating over the order lines to then send an update stock message. That handler was just copying over some data that it had received in an incoming payload. So we need to go even further upstream, and at that point we can see, "Oh, it's when the `OrderPlaced` event is published that the product IDs were incorrectly mapped." Now, clearly this is a simple example, but in a production environment, this could be a chain of 10, 15 messages upstream, and then things get a lot more complex.
24:15 Laila Bougria
The question is, if you're running into situations like these, how would you solve these scenarios in a production environment? Now, the thing is that we can't really focus on a single step to find the root cause of problems like these. What we need to be able to do is follow a business transaction, if you will, through all of the participating subcomponents and subsystems that are part of your larger system, not only the individual components that are part of your system, but also databases and message brokers and certain REST APIs that you're calling and all of that.
24:55 Laila Bougria
Now, for each of those individual steps that are part of the larger business process, you also want to include some context, so that you can understand, "Oh, when we were getting this specific record from the database, what ID was being used to get it", and things like that, for example. Basically, what we need to be able to do is zoom out to see what's happening in the system and how things are flowing through it, and then zoom in to solve specific issues. That's exactly what distributed tracing is for.
25:27 Laila Bougria
Now, distributed tracing is quite the hot topic lately, especially with the rise of microservices, but it was actually created around the late 2000s, when all we were ever talking about was service-oriented architecture. The need for distributed tracing is completely independent of any specific architectural style that you're using, but think about it this way: once you start decoupling systems into separate components that interact together and run on different servers, different machines, different processes, we need to have a way to understand how those business transactions traverse all of those different tiers to get to the root cause of the problems that we find in our system. Okay, our teams say our application has grown and there are so many components interacting that we really need to invest in distributed tracing. They start taking a look at what's out there, and they find that there are many frameworks and tools available.
26:33 Laila Bougria
Now, if you remember from earlier, our teams operate autonomously. They could choose whatever tools and languages best suited their needs. What happens is that Team Lorem ends up going with Jaeger, and then Team Ipsum says, "Oh, we're going to use AWS X-Ray", for example. Now the thing is that although the information that they're storing is very valuable, it's stored in a vendor-specific format, and if we need to start reconciling that information across multiple systems, that becomes really hard. It's problems like those that led to a greater need for standardization.
27:14 Laila Bougria
So, what do we do as an industry? We create multiple standards, not one. What happened is that the Cloud Native Computing Foundation released OpenTracing, which is a specification for distributed tracing, and around the same timeframe, Google also open-sourced their OpenCensus specification, which is a spec for tracing and for metrics as well.
27:40 Laila Bougria
Now at this point, we finally learned from previous mistakes, and instead of continuing to create additional standards, they actually sat together and said, "Okay, let's look at the strengths of each specification and then come up with a new specification that is going to be the industry-wide standard", which is how OpenTelemetry came to be.
28:02 Laila Bougria
Now, you can think of OpenTelemetry as an open-source observability framework that is composed of multiple tools, APIs, and SDKs. The aim or goal of this project is to standardize how teams instrument, generate, transform, collect, and export telemetry information from their systems. The idea is that that way we'll be able to analyze and understand our systems' behavior and performance as well. Now, this is available cross-platform and cross-runtime. If I'm not mistaken, it's currently supported in 11 programming languages, which is super interesting when you're building distributed systems in which multiple stacks and languages are being used as well.
28:53 Laila Bougria
Okay, then the next question is, "What is observability?", because OpenTelemetry is an observability framework and that's a pretty new term, so I want to take a moment to quickly go over it. Now, for software to be deemed observable, we need to be able to understand the inner workings of a system and any state it might arrive at, whether that state is expected, the happy path, or whether it's unexpected, any failures that might occur at any moment. The tricky part is that we need to be able to achieve that without changing any code, without shipping any new versions. It's not like me earlier, "Oh, let me add some log statements." No, that is not what an observable system is.
29:38 Laila Bougria
An observable system means that you already have that in place: you can go into a sort of database that has data stored completely outside of the system, not inside your business tables and things like that, where it's just collecting telemetry about your system and how it's behaving. Then we can query and analyze that information with specialized tools so that we can extract actionable insights that can then help us improve and fix certain problems that are occurring in our software.
30:16 Laila Bougria
The OpenTelemetry project actually supports different ways to help us increase the observability of our systems. There are three so-called observability signals: traces, logs, and metrics. Now, you can view these signals as standalone components that you can configure independently of the others. If you just want to use logs, you can do that. If you want to use traces and metrics, you can do that. If you want to combine all of them, that is also a possibility. They each provide value in their own specific way; each has its own strengths. I just wanted to mention this because I'm mostly talking about distributed tracing in this session to keep it a little bit concise, but it's important to know that there are other signals that you can also start collecting from your applications. Let's focus on tracing.
31:08 Laila Bougria
If we think about what a trace is exactly, you can think of it as tracking the progression of a request, or, the way I like to think about it, a business transaction. So placing an order would be the larger business transaction, and something somewhere in there is failing. Now, that request is going to traverse process, network, and security boundaries. You can think of it as a tree of spans, where each individual span represents a specific unit of work that is happening within that larger business transaction, within that trace. That could be getting some data from the database, or ingesting a message from a queue, or sending or publishing an event somewhere, or accessing a REST API of some kind, or even a specific high-intensity calculation that's happening inside one of your components. Now, each span also carries a context that is propagated to tie things together. What is interesting is that you can also capture some things that are specific to that unit of work to get a little bit more insight into what is happening.
32:24 Laila Bougria
Now, I know that's a little bit abstract, so I drew a visual representation of what a trace could look like, independent of any specific tooling that you might use. As you can see on the left, it's the same trace ID. This is one trace, actually not even the whole trace. You can't see it completely because even with a 49-inch screen, I didn't have enough screen real estate. I'm kidding. No. Anyway, what we can see is that the trace ID is equal for all of those spans, and then each individual colored block represents a span that participates in that trace. They can also be emitted by different components, as you can see in the changing colors; those are all different components. If you take a step back, it actually looks pretty familiar. It's basically a distributed call stack in a visual representation.
33:17 Laila Bougria
And it gets even better, because what we are then able to do is zoom into a specific span and drill into it to get additional context. We would be able to see, "Oh, this individual step took this amount of time", and we can even see some information like, in this case, a message ID and the destination queue that this was being sent to, and some information about the order. We're getting a lot more insight into what was happening in that specific unit of work. Now, if we bring this back to the problem or ticket that I was completely unable to debug earlier: if I'd had distributed tracing in place at that point in time, I would have gone into my observability backend, which is where that telemetry data is stored, and searched for failed traces. These two spans would come up as failed, as an exception happened in there, and then I would be able to go upstream, travel up the tree, and see that it's when we published that `OrderPlaced` event that those product IDs were incorrectly mapped.
34:31 Laila Bougria
So in this case, I would have even been able to solve it without going back and debugging or whatever. Just having increased observability would have helped here. Now, you can see that this trace is combined across multiple components, different processes that are running, maybe even in different parts of the world. How do we get a trace that is connected across all of those boundaries? Well, for a trace to be distributed, and therefore propagate across multiple components, we need some kind of propagation mechanism. That usually happens through a standardized protocol. Now, one of the protocols that's recognized by the OpenTelemetry project is the W3C Trace Context for HTTP headers. Rolls off the tongue; I always feel like I need a full breath of air to be able to say that. They probably did not think about that. But anyway, what it really does is add a header called traceparent to every outgoing request, and it looks a little bit like this.
35:34 Laila Bougria
There are multiple components to that ID, the first part being the version of the protocol. The next part represents the trace ID. This ID is generated when that request or business transaction starts, and it's then copied all the way through. The next part represents the span ID, and this is unique for every single unit of work that participates in that larger trace. And then finally, there are also some trace flags that can indicate sampling decisions and things like that. Okay, if you're running on the .NET stack and you're like, "Yes, please get me started", then Microsoft has made this really easy for you. Why? Because they've built it into .NET. It's available from .NET 5 and later. But even if you're running on earlier versions or you still have applications running on .NET Framework, that's also not a problem. You can just pull in the `System.Diagnostics.DiagnosticSource` NuGet package and have access to that same set of APIs.
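For illustration (these are example values, not from the talk), a traceparent header carrying those four parts looks like this:

```
traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
             (version)-(trace-id, 32 hex chars)-(span-id, 16 hex chars)-(trace-flags)
```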
36:35 Laila Bougria
It's super easy to set up tracing. You just need to select the sources from which you want to capture telemetry, and then you need to set up an exporter to send all of that collected telemetry to an observability backend. In this example I'm using the .NET generic host, pulling in the `OpenTelemetry.Extensions.Hosting` package, which makes that `AddOpenTelemetry` method available. This allows me to register a tracer provider, which you can think of as the starting point of the telemetry pipeline, basically. Then you need to add a recognizable name for your component here. Remember that the telemetry data is being stored elsewhere, so you need to be able to quickly understand which component emitted that piece of information as well. Then I'm saying, "Okay, I want to collect tracing information, in this case from NServiceBus.Core, from ASP.NET, and from the AWS SDK." And then finally, I'm specifying where I want that data to be sent to. In this case, that's Jaeger.
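The slide code isn't reproduced in the transcript, but a minimal sketch of that setup could look roughly like this. Exact package and method names vary between OpenTelemetry releases (older versions use `AddOpenTelemetryTracing` instead of `AddOpenTelemetry().WithTracing(...)`), so treat this as an approximation:

```csharp
// Sketch: tracing setup on the .NET generic host (assumes the OpenTelemetry.Extensions.Hosting,
// OpenTelemetry.Exporter.Jaeger, OpenTelemetry.Instrumentation.AspNetCore and
// OpenTelemetry.Contrib.Instrumentation.AWS packages).
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Hosting;
using OpenTelemetry.Resources;
using OpenTelemetry.Trace;

var host = Host.CreateDefaultBuilder(args)
    .ConfigureServices(services =>
        services.AddOpenTelemetry().WithTracing(tracing => tracing
            // A recognizable name for this component, stored with every span it emits.
            .SetResourceBuilder(ResourceBuilder.CreateDefault().AddService("Stock"))
            // The sources to collect tracing information from.
            .AddSource("NServiceBus.Core")
            .AddAspNetCoreInstrumentation()
            .AddAWSInstrumentation()
            // Where the collected telemetry is sent; Jaeger in this example.
            .AddJaegerExporter()))
    .Build();

await host.RunAsync();
```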
37:40 Laila Bougria
Now, that is one option that you can use as an observability backend, but there are so many available out there; this is probably not even a complete list. The beauty of a standard is that it doesn't really matter which individual tool you use, but let's say that in this case we want to use CloudWatch and X-Ray, because we're already running on AWS Cloud and it just makes sense to have everything running in the same environment, so to speak. So we say, "Okay, how do we do that?" You would think that you could then, instead of adding a Jaeger exporter, just add some kind of AWS X-Ray exporter or a CloudWatch exporter. The thing is that it doesn't exist, and there are probably reasons for that. I don't know the exact reasoning why they decided not to, but I can guess, because although this type of exporter is super easy to set up, there are quite a few challenges with this approach.
38:44 Laila Bougria
So, let's take a look at a visual representation of how this would work. This is a sort of direct export model, which means that each individual service we have is collecting telemetry and then sending it to the observability backend. But there are a few challenges with this type of model, because there's an increased load on each individual service: it takes on the work of collecting the telemetry, maybe transforming some of that information, and then sending it to the observability backend, which also exposes it to communication issues with that backend, affecting the service as well. Now, that's the last thing you want. You just wanted some more observability; you don't want that to cause failures in your services. Another problem with this approach is that there's no central configuration point, so if there's something that you want to change about how you collect telemetry, or what you want to keep, or how you transform it, you're going to have to change the configuration of each individual service.
39:47 Laila Bougria
How do we solve these types of problems? Well, that brings me to the OpenTelemetry Collector, which is another main feature of the OpenTelemetry project. You can think of it as a vendor-agnostic, dedicated component that is able to receive, process, and export telemetry data. Its main goal is to be a sort of mediation layer between your application and your observability backend, or backends for that matter; you could be exporting that data to multiple backends as well. It helps centralize the configuration and management of how you want to handle your telemetry strategy, if you will. It also offloads all of that work from your individual services, and also the risk of failures. It can buffer data before sending it without impact to your services, and it can handle communication issues by implementing retries and things like that, without having to include all of those concerns in your individual services. It also allows for advanced scaling scenarios. Let's say that you are collecting mostly traces and just a few metrics, and you want to be able to scale that OpenTelemetry Collector to cater mostly for the tracing; it allows you to do that as well.
41:06 Laila Bougria
Now, I'm guessing that because this is a much stronger use case for running in a production environment, that's why there's no direct export available to AWS X-Ray. If we look at how this works visually, there's now a separate component sitting in the middle that our individual services send telemetry to, and then they can just move on with their day. The OpenTelemetry Collector then becomes responsible for managing those connectivity issues, and also for making sure that the data is transformed, that some things we don't want to send to the observability backend are left out, and so forth. It's a little bit more complex to set up, because now you have an additional moving piece in your system, but it solves all of those problems that we just talked about.
41:58 Laila Bougria
So to export to AWS X-Ray, and even CloudWatch, you will need to set up an OpenTelemetry Collector. That brings me to the ADOT Collector, or the AWS Distro for OpenTelemetry Collector, another long name. It's one of those days. You can think of this as an implementation of the OpenTelemetry Collector. This one is built and distributed by Amazon, and it supports sending telemetry to AWS CloudWatch and to AWS X-Ray as well. It also supports additional exporters, including Prometheus, Datadog, and I think Dynatrace as well. And it's compatible with AWS compute platforms including EC2, ECS, and EKS, which means that you can collect telemetry and insights across your system and your system infrastructure as well. What you would have to do is set up your specific configuration, and then you can just use the Docker image that they have available and run that. It's super easy. I'm not going to show you the code on how to do that exactly, but I do have a working sample available that I will share at the end of the session. You can even run that if you quickly set up those services, or your connection information to Amazon, and then you can just run it and play a little bit with it on your own.
43:25 Laila Bougria
So let's assume for a moment that we have this OpenTelemetry Collector. We have it up and running. It's all good, it's running and it's online. Now we still have to go back to our service configuration and see what we need to change there, because we still do require a few changes. First, we basically need a reference to the `OpenTelemetry.Contrib.Instrumentation.AWS` package. Again, I have this all listed at the end, so you don't need to remember this. What that does is make this `AddXRayTraceId` method available. This is really important because it changes the trace ID that is generated by default to be compatible with AWS X-Ray, which has specific requirements on what that ID should look like that are not fulfilled by the default trace ID that's being generated. This is something that you have to keep in mind to do, because otherwise it will just not work.
44:25 Laila Bougria
The same thing goes for setting that `AWSXRayPropagator`; it's also something that we need to do in order to make sure that we can export that information to the AWS backends. And then finally, instead of setting up the direct export, what I'm going to do is make use of the OTLP or OpenTelemetry Protocol exporter, and this will allow me to send all of that telemetry information to the OpenTelemetry Collector. I'm giving it some configuration of where to find the OpenTelemetry Collector, and finally, which protocol to use. In this case it's gRPC; it could be HTTP as well. That's really all you need to export data to AWS X-Ray. Awesome. Are we now ready to add some traces to our application?
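Pulling those pieces together, a sketch of that configuration might look like this. The `AddXRayTraceId` and `AWSXRayPropagator` pieces are assumed to come from the AWS X-Ray contrib extensions package, and the collector endpoint is an assumption; the sample linked at the end shows the exact wiring.

```csharp
// Sketch: exporting to AWS X-Ray via the OpenTelemetry Collector.
// Assumes the OpenTelemetry.Contrib.Extensions.AWSXRay and
// OpenTelemetry.Exporter.OpenTelemetryProtocol packages; names may differ per version.
using System;
using OpenTelemetry;
using OpenTelemetry.Contrib.Extensions.AWSXRay.Trace; // namespace may differ per package version
using OpenTelemetry.Exporter;
using OpenTelemetry.Resources;
using OpenTelemetry.Trace;

// X-Ray expects its own propagation headers.
Sdk.SetDefaultTextMapPropagator(new AWSXRayPropagator());

using var tracerProvider = Sdk.CreateTracerProviderBuilder()
    .AddXRayTraceId() // generate X-Ray compatible trace IDs instead of the defaults
    .SetResourceBuilder(ResourceBuilder.CreateDefault().AddService("Stock"))
    .AddSource("NServiceBus.Core")
    .AddAWSInstrumentation()
    // Send everything to the (ADOT) collector over OTLP instead of exporting directly.
    .AddOtlpExporter(options =>
    {
        options.Endpoint = new Uri("http://localhost:4317"); // collector address (assumed)
        options.Protocol = OtlpExportProtocol.Grpc;          // could be HTTP as well
    })
    .Build();
```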
45:14 Laila Bougria
No. I want to stop you right there, because one of the things that I would like you to do before you go changing your own code is to look at the libraries that you're using in your application, because most of them already have support for OpenTelemetry with dedicated instrumentation libraries, and some of these frameworks have even built it in by now. So look at the information that is emitted by those libraries. Don't reinvent the wheel, because if you think about it, one of the biggest gaps that we have in distributed systems is cross-service communication, and that is already covered by a lot of these instrumentation libraries. Look at what they're emitting, and then focus your effort and your time on further enriching and improving that information and see where you still have gaps.
46:02 Laila Bougria
Now, before we actually look at some code on how to add tracing, I want to take a moment first to think about what type of telemetry we want to add to our applications. If you think about adding traces, how are we going to action that? I usually tend to use a set of questions that can help put me in the right mindset. One example that I like to use, because it's so recognizable, is the number of issues that we end up closing because we can't reproduce them. I mean, we even have banners for it that say, "It works on my machine." I swear I even have such a sticker on my laptop. Now, the thing is that the user that's opening a ticket and saying, "This doesn't work", is probably not imagining things. They have seen something that is off. They're probably doing a poor job of explaining what was off or what they did in what sequence, but they did experience some friction.
46:57 Laila Bougria
But imagine that you could easily connect a specific user action to a trace, so that you could say, "Oh, that was John. It was Friday, 5:00 PM", and you would be able to easily find that trace based on a set of attributes. That would help you understand what happened to that specific user in that production environment. Another question to ask yourself is how easily you can group similar traces together, things that stand out in performance or latency, so that you can understand what is happening there. Another question: how easily can you identify the most load-generating operations? Do they happen at a specific time of day? Are they specific to a segment of your user base, or even a physical location in the world? How easily can you find the users that are stressing out your system? What are they doing? In what order are they doing things, and how is that affecting the behavior that we're seeing?
47:58 Laila Bougria
Finally, another one: how easily can you find suspicious or even faulty events that are happening in the system? I mean, it happens all the time that you run into a bug and you say, "Okay, well, sorry, I'll fix it." Mea culpa, you fix it. You deploy a new version and you think, "Okay, it's all good." But then someone from the business walks up to your desk and says, "Well, hello Laila. Can you tell me how many customers were affected by this issue?" And you're like, "Uhm, I don't know?" So it's important to get a sense of what your telemetry should be answering for you, so that you can understand what you need as well.
48:36 Laila Bougria
Finally, let's take a look at how you add some application-specific tracing. You're going to need an activity source. You create this once; it could be a static instance, and you can use it throughout your component. It needs to be uniquely named so that you can recognize where this data is coming from, and you can also version it, because as your application or component evolves, so will the instrumentation that it emits, right? It's still an opt-in model, so what you still need to do is `AddSource("Stock")`, and it's important that those string literals match so that you are actually collecting that information as well. Then, in a piece of code, I can start an activity based on that activity source that we just created, and that's going to be nullable. This is a performance optimization because, as I just said, it's an opt-in model: unless someone adds the "Stock" source, this is going to return a null reference, because there's no point instantiating objects that you don't need anyway.
49:35 Laila Bougria
I'm making use of the `using` statement not only to be a good .NET citizen, but also because I'm lazy and I don't want to call `StopActivity`; this will take care of it for me. I can also add tags, those span attributes we talked about earlier. Anything that you would be looking at if you were debugging this piece of code is a good candidate for a tag here. You can also add some events. Think of these as very lightweight logging, but don't make them too verbose; it's just maybe to signal something like, "Oh, there's a calculation starting." Then finally, I'm catching any exceptions that might occur here to set the appropriate tags as well, so that I can signal to the observability backend that something failed here. And don't forget to re-throw.
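The code from the slides isn't in the transcript, but a minimal sketch of that kind of application-specific tracing, with illustrative names, could look like this:

```csharp
// Sketch: application-specific tracing with System.Diagnostics (illustrative names).
using System;
using System.Diagnostics;

public static class StockTelemetry
{
    // Create once and reuse throughout the component; the name must match AddSource("Stock").
    public static readonly ActivitySource ActivitySource = new("Stock", "1.0.0");
}

public class StockCalculator
{
    public void UpdateStock(Guid productId, int quantity)
    {
        // Nullable: returns null unless a listener (e.g. AddSource("Stock")) is registered.
        using var activity = StockTelemetry.ActivitySource.StartActivity("UpdateStock");

        // Tags (span attributes): anything you'd look at while debugging this code.
        activity?.SetTag("stock.product.id", productId);
        activity?.SetTag("stock.quantity", quantity);

        // Events: very lightweight signals, not verbose logging.
        activity?.AddEvent(new ActivityEvent("Stock calculation starting"));

        try
        {
            // ... actual stock update logic ...
        }
        catch (Exception ex)
        {
            // Mark the span as failed so the observability backend can surface it.
            activity?.SetStatus(ActivityStatusCode.Error, ex.Message);
            throw; // don't forget to re-throw
        }
    }
}
```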
50:24 Laila Bougria
Now our teams look at this and they say, "This looks a lot like logging. So should we then stop logging altogether?" Well, the thing is that each signal has its own place in the observability ecosystem, but this question comes up a lot. What I tend to say is that tracing has a way richer data model and very, very strong correlation across multiple components. In general, I would say prefer tracing. But we've been logging for years, and that effort is not lost. We can still keep all of that information, but what's really important is that we connect our traces and our logs together so that we have cross-signal correlation: when we're looking at the logs, we can connect them back to our traces, and vice versa. I have a sample available at the end to show you how to do this.
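As a rough sketch of what connecting logs to traces can look like with OpenTelemetry's logging integration (assumed package and option names; the sample mentioned in the talk is the authoritative version):

```csharp
// Sketch: correlating ILogger output with traces via OpenTelemetry
// (assumes the OpenTelemetry and OpenTelemetry.Exporter.OpenTelemetryProtocol packages).
using Microsoft.Extensions.Hosting;
using Microsoft.Extensions.Logging;
using OpenTelemetry.Logs;

var host = Host.CreateDefaultBuilder(args)
    .ConfigureLogging(logging => logging.AddOpenTelemetry(options =>
    {
        // Log records emitted inside an Activity automatically carry its TraceId/SpanId,
        // so the backend can link them to the corresponding spans (and vice versa).
        options.IncludeFormattedMessage = true;
        options.AddOtlpExporter();
    }))
    .Build();

await host.RunAsync();
```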
51:15 Laila Bougria
That brings me to the end. I hope I have poked your brains to look into observability a bit more, and into the OpenTelemetry project. I like to think of observability as a way to replace the call stack that we gave up when we started building distributed systems. Instead of having those horrible horror stories of debugging problems, we can replace that with observability and then still do some focused, localized, and hopefully pain-free debugging. If you haven't done any distributed tracing before, look at those instrumentation libraries first. Start there, then connect that information to your logs if you have those, and then enrich and enhance from there. No need to ask for three sprints to add distributed tracing; you can do this in an incremental manner.
52:04 Laila Bougria
Cool, well, thank you for listening. I hope you enjoyed the session. If you scan the GitHub QR code, it will take you straight to one of my repositories where I have additional resources and samples available for you to try and play around with. Thank you for listening. If you have any questions, I'm happy to hear them and I will get back to them as soon as I'm awake. Have a good rest of the conference. Bye.