The whole time I was learning/porting to Otel I felt like I was back in the Java world again. Every time I stepped through the code it felt like EnterpriseFizzBuzz. No discoverability. At all. And their own jargon that looks like it was made by people high on something.
And in NodeJS, about four times the CPU usage of StatsD. We ended up doing our own aggregation to tamp this down and to reduce tag proliferation (StatsD is fine having multiple processes reporting the same tags, OTEL clobbers). At peak load we had 1 CPU running at 60-80% utilization. Until something changes we couldn’t vertically scale. Other factors on that project mean that’s now unlikely to happen but it grates.
OTEL is actively hostile to any language that uses one process per core. What a joke.
Just go with Prometheus. It’s not like there are other contenders out there.
I'm fairly convinced that OTEL is in a form of 'vendor capture', i.e. the only way to get a standard was to compromise with various bigcorps and sloppy startups and glue-gun it all together.
I tried doing a simple otel setup in .NET and after a few hours of trying to grok the documentation of the vendor my org has chosen, hopped into a discord run by a colleague that has part of their business model around 'pay for the good otel on the OSS product' and immediately stated that whatever it cost, it was worth the money.
I'd rather build another reliable event/pubsub library without prior experience than try to implement OTEL.
> It’s not like there are other contenders out there.
Apache SkyWalking might be worth a look in some circumstances. It doesn't eat too many resources and is fairly straightforward to set up and run. It's admittedly somewhat jank (not the most polished UI or docs), but works okay: https://skywalking.apache.org/
Also I quite liked that a minimal setup is indeed pretty minimal: a web UI, a server instance and a DB that you already know https://skywalking.apache.org/docs/main/latest/en/setup/back...
In some ways, it's a lot like Zabbix in the monitoring space - neither will necessarily impress anyone, but both have a nice amount of utility.
This matches my conclusion as well. Just use Prometheus and whatever client library for your language of choice, it's 1000x simpler than the OTEL story.
Can you even achieve this with prometheus? Afaik it operates by exposing metrics that are scraped at some interval. High level stuff, not per-trace stuff.
How would you build the "holy grail" map that shows a trace of every sub component in a transaction broken down by start/stop time etc... for instance show the load balancer see a request, the request get handled by middlewares etc, then go onto some kind of handler/controller, the sub-queries inside of that like database calls or cache calls. I don't think that is possible with prometheus?
> Can you even achieve this with prometheus? Afaik it operates by exposing metrics that are scraped at some interval. High level stuff, not per-trace stuff.
Correct. Prometheus is just metrics.
The main argument for oTel is that instead of one proprietary vendor SDK or importing prometheus and jaeger and whatever you want to use for logging, just import oTel and all that will be done with a common / open data format.
I still believe in that dream but it's clear that the whole project needs some time/resources to mature a bit more.
If anybody remembers the Terraform/ToFu drama, it's been really wild to see how much support everybody pledged for ToFu but all the traditional observability providers have just kinda tolerated oTel :/
Yeah, part of the problem is it’s called OpenTelemetry and half of you are only talking about tracing, not metrics. Telemetry is metrics. It’s been metrics since at least the Mercury Program.
Metrics in OTEL is about three years old and it’s garbage for something that’s been in development for three years.
It looks like a hassle to implement, ngl.
Code traces are metrics. Run times per function call are metrics; counts of specific function calls are metrics.
Otel is an attempt to package such arithmetic.
Web apps have added so many layers of syntax sugar and semantic wank, we’ve lost sight of the fact that it’s all just the same old math operations relative to different math objects. Sets are not triangles but both are tested, quantified, and compared with the same old mathematical ops we learn by middle school.
No, code traces are not just metrics; and while you can knit together something approximating traces from metrics, you'll quickly run into the reason why traces are a distinct thing. First, in a distributed system, you'll discover that you can't rely on clocks to get the timing of subsecond events correct. Second, you'll be contextless about code paths. So, you might independently reinvent the idea of passing along a context - and now you're just making your own tracing system but without any of the benefit of building on years of existing discoveries in this field.
OTel does feel a little bit heavy, unless you're already used to e.g. New Relic, Dynatrace, etc. where you have to run an agent process and instrument your code to some extent; it's never going to be free to audit every function call! This is why (a) you sample down and don't keep every trace, and (b) unless your company is extremely flush with cash you probably don't run tracing in every environment. If you can get away with it just in a staging or perf test env you can reap most of the benefit without the production impact and cost.
All those things you describe are computable metrics. They have to be or Otel itself would not be able to compute them for consumption. All you described are cherry picked semantic indirections to obfuscate that it’s all just a computer computing metrics of its own memory states.
Sorry for knowing how computers actually work (EE grad not a CS grad). I know that can frustrate CS grads who think their preferred OS and favorite programming language is how a computer works. You’re describing how contemporary SWEs view their day job.
Edit: teleMETRY …what’s in a name? Oh right …meaning.
To be a smart-ass, one has to be smart first. Quit this.
As a no grad to EE grad: traces are a bundle of metrics that varies in structure, hence you can't store and process them as effectively as a list of counters unless you have a distinct bin for each possible trace. Combinatorial explosion, y'know.
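A rough back-of-the-envelope for the "distinct bin for each possible trace" problem, as a Python sketch. The label names and counts are invented for illustration; the only real rule used is that the worst-case number of time series for one metric is the product of the distinct values of each label.

```python
import math


def series_count(distinct_values_per_label):
    """Upper bound on time series for one metric: product of per-label cardinalities."""
    return math.prod(distinct_values_per_label.values())


# Hypothetical, fairly tame metric labels.
flat = {"service": 20, "endpoint": 50, "status": 5}

# Assumption for illustration: encode the call path as a label, with
# 10 possible operations at each of 4 levels of nesting.
with_call_path = dict(flat, call_path=10 ** 4)

print(series_count(flat))            # 5000
print(series_count(with_call_path))  # 50000000
```

That is the gap between a list of counters and a structure that varies per request.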
You know the conversation is going well for you when you resort to citing the "meaning" of a name instead of, you know, base reality. Who needs the territory, I've got my map right here.
Speaking of meaning, the best I can make of your point is that you're using a much broader definition of "metrics" than the rest of this conversation, and in particular broader than Prometheus (remember context? very important for "meaning"!) supports. That or you really just don't know what a "trace" is (in this context).
OpenTelemetry's traces are trees of spans. You cannot represent this efficiently without a combinatorial explosion of labels.
You may be thinking of metrics in the sense of counters and gauges, but that's not the data model that OpenTelemetry (and before it, Zipkin, Jaeger, and OpenCensus) uses for traces.
The data model for tracing is to emit events that provide a span ID and an optional parent span ID. The event collector can piece these together into a tree after the fact, which will work as long as the parent structure is maintained.
Prometheus is absolutely not suitable for this.
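For anyone who wants to see the span ID / parent span ID model in code, here is a minimal sketch in Python of how a collector could stitch events into trees after the fact. The event shape (dicts with `span_id`, `parent_id`, `name`) and the `build_trees` function are assumptions for illustration, not OpenTelemetry's actual wire format or collector code.

```python
from collections import defaultdict


def build_trees(events):
    """Assemble span events into nested trees.

    Each event is a dict with "span_id", "parent_id" (None for a root)
    and "name". Events may arrive in any order. Spans whose parent never
    arrived are treated as roots. Cycles are not handled.
    """
    by_id = {}
    children = defaultdict(list)
    for ev in events:
        by_id[ev["span_id"]] = ev
        children[ev["parent_id"]].append(ev["span_id"])

    def subtree(span_id):
        return {
            "name": by_id[span_id]["name"],
            "children": [subtree(c) for c in children.get(span_id, [])],
        }

    roots = [
        sid for sid, ev in by_id.items()
        if ev["parent_id"] is None or ev["parent_id"] not in by_id
    ]
    return [subtree(r) for r in roots]


# Events deliberately out of order, as they might reach a collector.
events = [
    {"span_id": "c", "parent_id": "b", "name": "db query"},
    {"span_id": "a", "parent_id": None, "name": "load balancer"},
    {"span_id": "b", "parent_id": "a", "name": "handler"},
    {"span_id": "d", "parent_id": "b", "name": "cache lookup"},
]
trees = build_trees(events)
```

Here `trees` holds a single root ("load balancer") with "handler" under it and the two sub-calls under that. Nothing in this depends on timestamps, only on the parent links being intact.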
Quibbling about the word "telemetry" doesn't really help here. OpenTelemetry supports three completely different subsets of functionality: metrics (counters, gauges, histograms), traces (span events in a tree structure), and logging (structured log events). They each have completely different client interfaces.
huh? I've always heard and read and experienced that "logs, traces, metrics" are the 3 legs of the observability stool.
Open teleMETRY
Any guesses as to etymology?
By this logic, you can say that logging, metrics and tracing are all fundamentally just different kinds of data and we should be calling it just plain databases and CRUD.
They're related, but people have a very specific idea and concept of what each is. You haven't actually provided a good argument why we should throw out these distinctions just because they somewhat resemble each other if you ignore a few details.
Prometheus is good, but let's be clear...you don't get tracing.
For tracing FOSS: Grafana Tempo.
https://grafana.com/oss/tempo/
Tempo's a backend/sink for traces, but if you click through to the Tempo docs and find out how to generate tracing data[1], you learn that you have two options: OpenTelemetry, which they recommend, and Zipkin, which they do not recommend.
[1] https://grafana.com/docs/tempo/latest/getting-started/instru...
"I don't want solutions, I want to be mad."
Tempo is a traces server. Prometheus is a metrics server.
Grafana, the same company that develops and sells Tempo created a horizontally scalable version of Prometheus called Mimir.
OpenTelemetry is an ecosystem, not just 1 app. It’s protocols, libraries, specs, a Collector (which acts as a clearinghouse for metrics+traces+logs data). It’s bigger than just Tempo. The intention of OTel seems to be to decouple the protocol from the app by having adapters for all of the pieces.
Prometheus is not only a metrics server, it's also become the de-facto metrics exposition format.
You probably don't understand what Otel is if you think that Prometheus is an alternative.
You'd do better to point out which distinction you think the parent poster is missing.
My guess is that Prometheus cannot do distributed tracing, while OpenTelemetry can. Is that what you meant?
Otel is a spec. You can create your own clients/aggregators/etc. The problem is that if nobody does it, there will be no tooling. So Otel created some tooling (and yes, it's bad) for people to use.
Some companies (e.g. Datadog) are contributing to the tooling but I think most companies would rather spend dev time on their own platforms than on something that anybody (including a competitor) can use.
From the user side, a spec isn't helpful unless it has implementations. And the official implementations are complicated compared to prometheus.
I worked on a team that produced a distributed tracing library. We were tasked with interoperating with OpenTelemetry, or at least figuring out what that means.
My teammate said that at a previous job he wanted to add OpenTelemetry tracing to some C++ code he was working on. He took one look at the reference implementation for C++ OpenTelemetry and decided instead to write his own tracing library that sends gRPC to the OpenTelemetry collector.
It's also worth noting that, at least last time I checked, the reference implementations per programming language are less like reference implementations of some specification, and more like "this is the code you use to do OpenTelemetry in this language."
Why Otel compared to prometheus+syslog+(favorite way to do request tagging, eg: MDC in slf4j)+grep?
Syslog is kinda a pain, but it's an hour of work and log aggregation is set up. Is the difference the pain of doing simple things with elastic compute and kubernetes?
Typically this is a subset of OTel that's being compared. Almost everything (aside from Datadog's proprietary stuff) is just smaller than OTel is, which is why it's often chosen for many different needs.
In my experience, it's often folks who have experience setting up metrics or log collection with something smaller (e.g., StatsD) and sometimes for purposes with less scope, who have the most frustration with OTel. All the concepts are different, carry different names, have different configs, have different quirks, etc. There's often an expectation that things will largely be the same as before and that they can carry over the cursed knowledge they had from the other toolset.
Simpler near-term, but more painful long term when you want to switch vendors/stacks.
Nine times out of ten, I've got more valuable problems to solve than a theoretical future change of our vendor/stack for telemetry. I'll gladly borrow from my future self's time if it means I can focus on something more important right now.
I did our migration from StatsD to OTEL because our third party StatsD service was getting flaky. The first person from Ops to get to me pushed OTEL. The rest were fine with Prometheus and it was late in the process before they realized what had happened. I believe if we had gone straight to Prometheus I would have been done in half the time and solved half the problems I had to solve anyway for OTEL. If someone had to replace it again in the future I fully believe it would have taken cumulatively as much time to go StatsD->Prometheus->OTEL as it took to go StatsD->OTEL, especially when you consider that OTEL is not quite baked.
Meanwhile functionality to retain and recruit new customers sat in the backlog.
Edit to add: also regarding the perf issues I saw: do you really want to pay for an extra server or half a server in your cluster just in case some day comes? These decisions were much fuzzier when you ordered hardware once every two years and just had to live with the capacity you got.
And switching log implementations can be a pain in the butt. Ask me how I know.
But I’d rather do that three more times before I want to see OpenTelemetry again.
Also Prometheus is getting OTEL interop.
Is this the same scam as "standard SQL"? Switching database products is never straightforward in practice, despite any marketing copy or wishful thinking.
Prometheus ecosystem is very interoperable, by the way.
It's not a "scam", the protocols and clients are 100% scrutable. Not sure why you used that word.
Using otel from the C++ side... To have cumulative metrics from multiple applications (e.g. not "statsd/delta") I create a relatively low cardinality process.vpid integer (and somehow coordinate this number to be unique as long as the app emitting it is still alive) - you can use some global object to coordinate it.
Then you can have something that sums, and removes the attribute.
With statsd/delta, if a signal gets lost in sending, all data gets skewed; with cumulative, you only lose precision.
edit... forgot to say - my use case is "push based" metrics as these are coming from "batch" tools, not long running processes that can be scraped.
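The cumulative-versus-delta point above can be sketched in a few lines of Python. This is an illustration under assumptions, not OTel SDK or collector code: the class names, the `jobs_done` metric and the `simulate` helper are invented, and the only thing taken from the comment is the idea of tagging each process with a unique `process.vpid`, summing across it, and dropping the attribute.

```python
class CumulativeAggregator:
    """Keeps the latest cumulative value per (metric, vpid), then sums across vpids."""

    def __init__(self):
        self._latest = {}

    def report(self, metric, vpid, cumulative_value):
        # A later report overwrites the earlier one, so a dropped report
        # is repaired by the next one that arrives.
        self._latest[(metric, vpid)] = cumulative_value

    def total(self, metric):
        # Summing over vpid is the "sum, then remove the attribute" step.
        return sum(v for (m, _vpid), v in self._latest.items() if m == metric)


class DeltaAggregator:
    """Adds up increments; a dropped increment is gone for good."""

    def __init__(self):
        self._totals = {}

    def report(self, metric, delta):
        self._totals[metric] = self._totals.get(metric, 0) + delta

    def total(self, metric):
        return self._totals.get(metric, 0)


def simulate(dropped):
    """Two processes each count 3 jobs; reports whose (vpid, step) is in `dropped` never arrive."""
    cumulative, delta = CumulativeAggregator(), DeltaAggregator()
    for vpid in (1, 2):
        for step in (1, 2, 3):
            if (vpid, step) in dropped:
                continue
            cumulative.report("jobs_done", vpid, step)  # running total so far
            delta.report("jobs_done", 1)                # increment since last report
    return cumulative.total("jobs_done"), delta.total("jobs_done")


print(simulate({(1, 2)}))  # middle report from process 1 is lost
```

With the middle report dropped, the cumulative side still ends at 6 while the delta side is stuck at 5. If the last report is the one that gets lost, the cumulative total is merely stale until the next report arrives, which is the "you only lose precision" case. Process restarts (counter resets) are not handled here.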
This matches my experience. Very difficult to understand what I needed to get the effect I wanted.
I wonder what your experience is with Sentry? Not just for error reporting but especially also their support for traces.
Also open-source & self-hostable.
Likely only a handful of people care, but Sentry hasn't been open source in quite a while https://github.com/getsentry/sentry/blob/24.12.1/LICENSE.md (I'd have to do tag-spelunking to find the last Apache 2 version)
Glitchtip is the Sentry compatible open source (MIT) one https://gitlab.com/glitchtip/glitchtip-backend/-/blob/v4.2.2... with the extra advantage that it doesn't require like 12 containers to deploy (e.g. https://github.com/getsentry/self-hosted/blob/24.12.1/docker... )
Sentry is not horizontally scalable, thus ~ not-scalable at all, if your company is big.
That's a fair point, but scaling it vertically can take you very far in my experience.
Quota/pricing.
Same. I implemented Otel once and exactly once. I wouldn't wish it on any company.
Otel is a design by committee garbage pile of half baked ideas.
There are a lot of Java programmers working on it.
(And some Go tbf.)
Yeah and a blind man can see this, it’s so loud.