Telemetry, not Logs & Metrics

What's wrong with Logs and Metrics?

For decades we’ve worked with logs and metrics.

Our software developers have spewed out random pieces of text - some useful, some not so useful - and we’ve collected that information. They handily categorized it into levels: informational, warnings, errors, etc.

At the same time our infra has spewed out numbers that mean something: usually things like how busy the CPU is, how much disk space we have left, how many requests we’re getting.

Over time we got a little clever and started to include the magical Traces! These let us see a bit more detail about an operation, and how long function calls took.

They’ve served us well. So well that we gave them a grand title: The Three Pillars.

The thing is though, if we’re honest, they’re a bit shit really…..

If you know where to look, and what to look at, and you can pattern match across millions of log lines, and you squint a little bit and tilt your head, sometimes you can see…… something….

Metrics make some pretty charts though - and that’s very satisfying….

Except, they don’t tell you why something is happening. They don’t tell you anything in themselves; they just show you that something is happening. It’s on you to decide what any of that means. In some ways it’s a little like astrology.

Not very 21st century tech…..

What’s Better?

Charity Majors has written, spoken and (correctly) ranted about observability. And she has a point.

What we really need is Telemetry.

When I think about Telemetry, I think about Formula 1 cars and rockets.

I don’t remember the last time I saw people in the NASA control room grep’ing logs (although I’m sure it does happen!). The same goes for the track-side team at a Formula 1 race.

Telemetry is about collecting information that lets us understand what’s happening, even when we’re not there. It lets us understand the state of the system, and how it got there. It lets us understand why something is happening, not just that something is happening.

The funny thing is, it’s actually not that hard to do, but it does require a slight change of mindset.

The Properties of Telemetry

Telemetry becomes useful when it’s consistent, and when it carries context. In practice that means wide events with many fields (high dimensionality) whose values are specific (high cardinality).

Logs have a problem in that a single system may be concurrently processing millions of requests, and if we want to understand the state of a single request we need to be able to correlate all the logs that relate to that request.

Logs also have the problem that they only contain what the developer thought would be useful - and that’s generally based upon their mental model of the system, which may not be accurate.

Logs are also often the debugging method of last resort, where we output lots of “We got here” messages to trace what’s going on.

Untangling all this can be a nightmare.

Traces make this a bit better, but they still rely on the developer to have instrumented the code in the right way, and to have included the right information in the spans.

Structured Logging

The first step in untangling all of this is to structure our logs into a consistent, machine-readable format.

There are a few formats we could use, but the most common is JSON. This is both human and machine readable.
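As a sketch (using Python’s standard logging module; the formatter and field names here are just illustrative), structured logging can be as simple as a formatter that renders every record as one JSON object:

```python
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    """Render each log record as a single machine-readable JSON object."""

    def format(self, record):
        payload = {
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Merge any structured fields attached via extra={"fields": {...}}
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)

logger = logging.getLogger("shop")
handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("Product not found", extra={"fields": {"status": 404, "path": "/products/8365"}})
```

Every line that comes out is now parseable by a machine, but still readable by a human.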

Canonical Logs

The second step is to output one single log line per “request”. Any job should emit a single line containing all the information, end-to-end, for that job. This stops us having to pull together lots of different log lines to understand what’s going on.
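A minimal sketch of the idea (the class and field names are hypothetical): accumulate fields as the request progresses, then emit everything as one wide line at the end:

```python
import json
import time

class CanonicalLog:
    """Accumulates fields over the life of one request, emits a single wide log line."""

    def __init__(self):
        self._start = time.monotonic()
        self.fields = {}

    def add(self, **kwargs):
        self.fields.update(kwargs)

    def emit(self):
        # Total duration is computed once, when the request is done.
        self.fields["duration_ms"] = round((time.monotonic() - self._start) * 1000)
        return json.dumps(self.fields)

# One request, one line - fields get added wherever they become known.
log = CanonicalLog()
log.add(method="GET", path="/products/8365")
log.add(user_id="12345", plan_id="enterprise")
log.add(status=404, error_code="PRODUCT_NOT_FOUND")
print(log.emit())
```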

Dimensionality

The more fields we have in our structured, canonical logs, the more we can slice and dice our data to understand what’s going on. Knowledge is power - and this is where the power of telemetry comes from.

Cardinality

Cardinality is the number of distinct values a field can take. HTTP verbs are a simple example of low cardinality: a “method” field can only be GET, POST, PUT, DELETE and a handful of others, so it’s great for coarse filtering - show me all the GET requests. UserID, on the other hand, is highly cardinal because there are potentially millions of users, each with a unique ID - and high-cardinality fields are what let us narrow all the way down to a single user or request. For each field of dimensionality, we want it to be as cardinal as possible.
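To make the term concrete, here’s a tiny illustration with made-up events: cardinality is just the count of distinct values a field takes on.

```python
# Four made-up wide events.
events = [
    {"method": "GET", "user_id": "12345"},
    {"method": "GET", "user_id": "67890"},
    {"method": "POST", "user_id": "12345"},
    {"method": "GET", "user_id": "24680"},
]

# Cardinality = how many distinct values each field takes on.
cardinality = {
    field: len({e[field] for e in events})
    for field in ("method", "user_id")
}
print(cardinality)  # {'method': 2, 'user_id': 3}
```

At real scale, "method" stays stuck at a handful of values while "user_id" keeps growing - which is exactly what makes it so useful for narrowing things down.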

Correlation Ids

Correlation Ids are a special type of field that we include in our logs to allow us to correlate logs across different systems. For example, if we have a request that goes through multiple microservices, we can include a correlation ID in each log line that relates to that request. This allows us to pull together all the logs that relate to that request, even if they are spread across multiple systems.
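As a sketch (the header name here is a common convention, not a standard): each service reuses the inbound correlation ID if one exists, otherwise mints one, and always passes it on downstream:

```python
import uuid

CORRELATION_HEADER = "X-Correlation-Id"  # conventional name; assumed, not standardized

def ensure_correlation_id(headers: dict) -> str:
    """Reuse the inbound correlation ID if present; otherwise mint a new one."""
    return headers.get(CORRELATION_HEADER) or str(uuid.uuid4())

def outbound_headers(correlation_id: str) -> dict:
    """Propagate the same ID on every call to a downstream service."""
    return {CORRELATION_HEADER: correlation_id}

# At the edge of the system: no inbound ID, so we mint one.
cid = ensure_correlation_id({})
# Every downstream hop carries the same ID, so logs across services correlate.
print(outbound_headers(cid))
```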

What This Gives Us

This allows us to filter our logs to see all the requests from a particular user.

The more cardinal our fields are, the more dimensionality we tend to start adding - one feeds into another. For example we start by including UserID, but a user is part of a specific customer, so we add CustomerID. A customer is part of a specific region, so we add Region. A region is part of a specific data centre, so we add DataCentre. Our customer is on a specific payment plan, so we include PlanID. And so on….

Now we can query our logs to see all the requests from a particular user, or all the requests from a particular customer, or all the requests from a particular region, or all the requests from a particular payment plan. We can also start to correlate this with other information, such as metrics and traces, to get a more complete picture of what’s going on.

Dimensions can also be metrics - for example startTime, endTime, duration. Now we can query how many users in the US region took more than 1 second to complete their request on the Enterprise plan.
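With wide events in hand, that query is just a filter over fields. A toy in-memory version with made-up events (in practice your observability platform runs this for you):

```python
# Three made-up wide events.
events = [
    {"user_id": "12345", "region": "us-east-1", "plan_id": "enterprise", "duration": 1420},
    {"user_id": "67890", "region": "eu-west-1", "plan_id": "free", "duration": 230},
    {"user_id": "24680", "region": "us-east-1", "plan_id": "enterprise", "duration": 310},
]

# "Which users in the US region took more than 1 second on the Enterprise plan?"
slow_us_enterprise = [
    e["user_id"] for e in events
    if e["region"].startswith("us-")
    and e["plan_id"] == "enterprise"
    and e["duration"] > 1000
]
print(slow_us_enterprise)  # ['12345']
```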

Try that with the old school pillars of logs and metrics!

Isn’t This OpenTelemetry?

Yes and no. OpenTelemetry definitely advocates this approach, but really it’s just a transport mechanism for getting telemetry data from our systems to our observability platform. It doesn’t tell us how to structure our logs, or what fields to include, or how to correlate them. That’s up to us to figure out.

As an example, a simple HTTP request to an e-commerce system may show:

{
  "span_name": "GET /products/8365",
  "duration": 836,
  "status": 404,
  "message": "Product not found"
}

On the other hand, we could also show:

{
  "correlation_id": "019d3e53-4f79-715e-b4e3-182a400092ca",
  "method": "GET",
  "path": "/products/8365",
  "timestamp": "2026-03-30T10:03:51.387Z",
  "duration": 836,
  "status": 404,
  "message": "Product not found",
  "user": {
    "id": "12345",
    "customer_id": "67890",
    "region": "us-east-1",
    "plan_id": "enterprise"
  },
  "cart": {
    "item_count": 3,
    "total_value": 123.45
  },
  "flags": {
    "black_friday": true,
    "new_ui": false
  },
  "error": {
    "code": "PRODUCT_NOT_FOUND",
    "message": "Product not found"
  }
}

The same request, but the second example gives us far more information. Did we need all that? Maybe not this time, but if we had a problem with our product search, then having all that information would be invaluable in helping us understand what was going on, and how to fix it.

Won’t this create a lot of data?

Why yes, yes it will!

But that’s the point. The more data we have, the more we can understand what’s going on. The more we can understand what’s going on, the better we can fix it when things go wrong, and the better we can improve it when things are working well.

Tail Sampling

Tail sampling is a technique for reducing the amount of data we store. By making the storage decision after the request completes, based on its outcome, we can choose whether to keep it, how long to keep it, and where to keep it:

  • Keep 100% of requests that result in an error
  • Keep 75% of requests that take longer than 1 second
  • Keep 25% of requests that take longer than 500ms
  • Keep 5% of all other requests
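That policy sketched as code (the thresholds come from the list above; the shape of the event dict is assumed):

```python
import random

def keep_probability(event: dict) -> float:
    """Retention decision made after the request completes, based on its outcome."""
    if event.get("error"):
        return 1.00  # keep 100% of requests that result in an error
    duration = event.get("duration", 0)
    if duration > 1000:
        return 0.75  # 75% of requests slower than 1 second
    if duration > 500:
        return 0.25  # 25% of requests slower than 500ms
    return 0.05      # 5% of everything else

def should_keep(event: dict) -> bool:
    return random.random() < keep_probability(event)
```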

Importantly, because we have a correlation ID, we can keep all the telemetry when a failure occurs, not just the telemetry for one system. This lets us understand not only that a failure occurred, but why it occurred and where it occurred, even if that operation in itself wasn’t a failure.

The ROI

Implemented properly, the knowledge and understanding of a system that we can gain from telemetry is invaluable.

We can use it for debugging, on-call, understanding features, user behaviour, cost analysis, and so much more. The ROI is huge, and the cost of implementing it is relatively low, especially when compared to the cost of not having it when things go wrong.

What’s even better is we don’t have to get it right on day one. We can start small, with just a few fields, and then iterate and improve over time.

  • Each time there is a problem and data was missing - add it so we capture it for next time
  • When a product manager asks a question about user behaviour, add the fields to answer that question for next time

Summary

We’ve come a long way on the three pillars, but it’s time to move on. Telemetry is the future, and we should be embracing it. It gives us a much richer understanding of our systems, and allows us to fix problems faster and improve our systems more effectively.

What’s more, it helps us be lazy. Information is power, and giving it to the people who need it means engineers are freed from trying to wrangle insights from logs and metrics, and can instead focus on building new features and improving the system.

Every engineer should aspire to be lazy, and telemetry is a great way to achieve that.