Building an elaborate pile of technical debt is a great way to have an elaborate pile of technical debt, but the lifespan of services being 2-3 years gets painful as you start composing a stack out of enough products that every quarter you need to replace something big.
What's the most promising alternative to Prometheus/Grafana if you're developing a new solution around OTEL? If you could start today and pick tools, what would you go for?
I frequently use a docker-compose template with prometheus pushgateway + grafana for deploying on single node servers, as described at the start of the article. It works well and is trivial to set up, but the complexity explodes once your metric volume or cardinality requires something more scalable, like Prometheus alternatives a la Mimir.
I think this would not need to be an issue as frequently if prometheus had a more efficient publish/scraping mechanism. iirc there was once a protobuf metric format that was dropped, and now there is just the text format. While it wouldn't handle billions of unique labels like mimir, a compact binary metric format could certainly allow for millions at reasonable resolution instead of wasting all that scale potential on repeated name strings. I should be able to push or expose a bulk blob all at once with ordered labels or at least raw int keys.
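To make the "repeated name strings" point concrete, here is a rough Python sketch. It is not any format Prometheus ships or ever shipped: `encode_binary`, its string table, and its record layout are all made up for illustration, and the sample metric names are hypothetical. It just compares a text-style encoding, where every line repeats the metric and label names, with a blob that stores each string once and refers to it by integer id.

```python
import struct


def encode_text(samples):
    """Render samples as text lines: name{key="value",...} value."""
    lines = []
    for name, labels, value in samples:
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        lines.append(f"{name}{{{label_str}}} {value}")
    return ("\n".join(lines) + "\n").encode("utf-8")


def encode_binary(samples):
    """Hypothetical compact layout: one string table, then records of integer ids."""
    table = {}  # string -> integer id, assigned in insertion order

    def intern(s):
        return table.setdefault(s, len(table))

    records = []
    for name, labels, value in samples:
        pairs = [(intern(k), intern(v)) for k, v in sorted(labels.items())]
        rec = struct.pack("<IH", intern(name), len(pairs))
        for k_id, v_id in pairs:
            rec += struct.pack("<II", k_id, v_id)
        rec += struct.pack("<d", float(value))
        records.append(rec)

    header = struct.pack("<I", len(table))
    for s in table:  # dicts keep insertion order, so position equals id
        raw = s.encode("utf-8")
        header += struct.pack("<H", len(raw)) + raw
    return header + struct.pack("<I", len(records)) + b"".join(records)


def decode_binary(blob):
    """Inverse of encode_binary, to show the layout is self-describing."""
    offset = 0
    (n_strings,) = struct.unpack_from("<I", blob, offset)
    offset += 4
    strings = []
    for _ in range(n_strings):
        (length,) = struct.unpack_from("<H", blob, offset)
        offset += 2
        strings.append(blob[offset:offset + length].decode("utf-8"))
        offset += length
    (n_records,) = struct.unpack_from("<I", blob, offset)
    offset += 4
    samples = []
    for _ in range(n_records):
        name_id, n_pairs = struct.unpack_from("<IH", blob, offset)
        offset += 6
        labels = {}
        for _ in range(n_pairs):
            k_id, v_id = struct.unpack_from("<II", blob, offset)
            offset += 8
            labels[strings[k_id]] = strings[v_id]
        (value,) = struct.unpack_from("<d", blob, offset)
        offset += 8
        samples.append((strings[name_id], labels, value))
    return samples


# Hypothetical workload: 1000 series that share one metric name and label keys.
samples = [
    (
        "http_requests_total",
        {"service": "checkout", "method": "GET", "status": "200",
         "instance": f"node-{i % 50}"},
        float(i),
    )
    for i in range(1000)
]
text_blob = encode_text(samples)
binary_blob = encode_binary(samples)
print(len(text_blob), len(binary_blob))
```

In this toy case the binary blob comes out smaller than the text one, because each record is a handful of fixed-width integers instead of repeated strings. Take it with a grain of salt: gzip on the scrape would narrow the gap for the text format, and a real format would also need timestamps, types, and escaping.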
SigNoz is good and under active development: https://github.com/SigNoz/signoz
But are there good alternatives to grafana in the foss space nowadays?
I only know of https://perses.dev/ but haven't had a look at it for ~half a year. It was very barebones back then but I'm hopeful it can replace Grafana for at least basic dashboarding soon.
Not sure what's an alternative for Grafana in the open source world in terms of building dashboards for o11y? I'm not aware of one and Grafana is used very extensively in my company...
I remember that alternative free/FOSS products existed before Grafana (c. 2015), but many died while Grafana ended up everywhere. Now I can't even find the old alternatives. Vague memories of RRD and Nagios...
Munin was what we used for a while, along with a smattering of smokeping.
We're using a combination of Zabbix (alerting) and local Grafana/Prometheus/Loki (observability) at this point, but I've been worried about when Grafana will rug-pull for a while now. Hopefully enough people using their cloud offering sates their appetite and they leave the people running locally alone.
I ran with Centreon for a while because you got Nagios + integrated dashboarding out of the box and a Community option.
I'm out of that game now though so don't have the challenge.
https://www.centreon.com/
I mentioned it in another reply, but https://perses.dev/ is probably the most promising alternative.
Besides that, if you're feeling masochistic you could use Prometheus' console templates or VictoriaMetrics' built-in dashboards.
Though these are all obviously nowhere near as feature rich and capable as Grafana and would only be able to display metrics for the single Prom/VM node they're running on. Might be enough for some users.
https://github.com/opensearch-project/OpenSearch-Dashboards (Kibana fork) is one. But Grafana is still way better if you just stay away from anything that isn't the core product: data visualization and exploration (explorer and traces).
We’re using Graylog + Elasticsearch, which would totally replace a Loki-only stack.
o11y is not a word. What do you mean?
Observability, in the vein of accessibility, which has the silly nickname of a11y
... ugh, they actually made an `o11[a-z]` abbreviation? When I picked this nick, the only term I ever saw in the wild was `i18n`.
K8s (Kubernetes), a11y (accessibility)...
The kicker for me recently was hearing someone say "ally"
a16z, l10n, s11n,
Or without numbers,
authC/authN, authZ...
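For anyone who hasn't seen the pattern behind the numbered ones: keep the first and last letter and replace everything in between with the count of dropped letters. A tiny Python sketch (the function name `numeronym` is just my own label for it):

```python
def numeronym(word):
    """Abbreviate a word as first letter + number of omitted letters + last letter."""
    if len(word) < 4:
        # Too short for the abbreviation to save anything; leave it alone.
        return word
    return f"{word[0]}{len(word) - 2}{word[-1]}"


for word in ("observability", "accessibility", "internationalization",
             "localization", "kubernetes"):
    print(word, "->", numeronym(word))
```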
> "But I got it all working; now I can finally stop explaining to my boss why we need to re-structure the monitoring stack every year."
Prometheus and Grafana have each been progressing in their own way, each trying to become a full-stack solution, and then the OTEL thingy came along and ruined the party for everyone.
I still haven't got my head around how OTEL fits into a good open-source monitoring stack. Afaik, it is a protocol for metrics, traces, and logs. And we want our open-source monitoring services/dbs to support it, so they become pluggable. But, afaik, there's no one good DB for logs and metrics, so most of us use Prometheus for metrics and OpenSearch for logs.
Does OTEL mean we just need to replace all our collectors (like logstash for logs and all the native metrics collectors and pushgateway crap) and then reconfigure Prometheus and OpenSearch?
I think the answer is it doesn't fit in any definition of a _good_ monitoring stack, but we are stuck with it. It has largely become the blessed protocol, specification, and standard for OSS monitoring, along every axis (logging, tracing, collecting, instrumentation, etc.)... it's a bit like the efforts that resulted in J2EE and EJBs back in the day, only more diffuse and with more varied implementations.
And we don't really have a simpler alternative in sight... at least in the Java days there was the disgust and reaction via Struts, Spring, EJB3+, and of course other languages and communities.
Not sure exactly how we got into such an over-engineered mono-culture in terms of operations and monitoring and deployment for 80%+ of the industry (k8s + graf/loki/tempo + endless supporting tools or flavors), but it is really a sad state.
Then you have endless implementations handling bits and pieces of various parts of the spec, and of course you have the tools to actually ingest and analyze and report on them.
Off topic, but the Prometheus pushgateway is such a bad implementation (once you push metrics, they stay there until it's restarted; a counter does not increase, each push just replaces it with the new value) that we had to write our own metrics collector endpoint.
That is literally how it is supposed to work. Prometheus grabs metrics; that is how it works. If you for some reason find yourself unable to host an endpoint with metrics, you can use the fallback pushgateway to push metrics, where, yes, they will stay until restarted. Ask yourself how it could ever work if they were deleted after being read. How would multiple Prometheus agents be able to read from the same source?
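A toy in-memory model of the behaviour being argued about, in Python. To be clear, this is not the real Pushgateway or its API; `ToyPushgateway` and the metric names are made up. It only shows the two properties in question: a push replaces the stored value for its group rather than adding to it, and a scrape does not consume anything, so any number of scrapers see the same state. (As far as I know the real thing also lets you remove a group explicitly with an HTTP DELETE, which the `delete` method stands in for.)

```python
class ToyPushgateway:
    """Toy model only: keeps the last pushed metrics per (job, grouping labels)."""

    def __init__(self):
        self._groups = {}  # (job, sorted grouping label pairs) -> {metric: value}

    @staticmethod
    def _key(job, grouping):
        return (job, tuple(sorted((grouping or {}).items())))

    def push(self, job, metrics, grouping=None):
        # A push replaces whatever the group held before; nothing is summed.
        self._groups[self._key(job, grouping)] = dict(metrics)

    def delete(self, job, grouping=None):
        # Explicit removal; without it the last push stays visible indefinitely.
        self._groups.pop(self._key(job, grouping), None)

    def scrape(self):
        # Reading is non-destructive: return a copy and leave state untouched.
        return {key: dict(metrics) for key, metrics in self._groups.items()}


gw = ToyPushgateway()
gw.push("nightly_backup", {"rows_processed_total": 100})
gw.push("nightly_backup", {"rows_processed_total": 50})  # second run of the job

first_scrape = gw.scrape()
second_scrape = gw.scrape()
print(first_scrape)
```

Both scrapes report 50, not 150. If you want a counter that accumulates across pushes, something has to do the adding, which is presumably what the custom collector endpoint mentioned above ended up doing.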
The pushgateway is itself a horrible hack for the fact that prometheus is designed only for metrics scraping. Unfortunately the whole ecosystem around it is an utter mess.
Remote Write is a viable alternative in Prometheus and its drop-in replacements. I'm not a massive fan of it myself as I feel the pull-based approach is superior overall but still make heavy use of it.
The pushgateway's documentation itself calls out that there are only very limited circumstances where it makes sense.
I personally only used it in $old_job and only for batch jobs that could not use the node_exporter's textfile collector. I would not use it again and would even advise against it.
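For batch jobs that can use the textfile collector, the usual pattern is to write a `.prom` file in the text exposition format into the directory node_exporter watches (set with `--collector.textfile.directory`) and to rename it into place so a scrape never sees a half-written file. A minimal Python sketch, assuming a hypothetical helper `write_textfile_metrics` and made-up metric names; the demo writes to a temporary directory instead of a real collector directory:

```python
import os
import tempfile


def write_textfile_metrics(directory, filename, metrics):
    """Write metrics in the Prometheus text format, then rename into place atomically.

    `metrics` maps a metric name to a (help text, type, value) tuple.
    """
    lines = []
    for name, (help_text, metric_type, value) in metrics.items():
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} {metric_type}")
        lines.append(f"{name} {value}")
    body = "\n".join(lines) + "\n"

    target = os.path.join(directory, filename)
    # Write to a temp file in the same directory so the rename stays on one
    # filesystem; the ".tmp" suffix keeps it from being picked up as *.prom.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(body)
        # mkstemp creates the file owner-only; widen it so a collector
        # running as another user can read it.
        os.chmod(tmp_path, 0o644)
        os.replace(tmp_path, target)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    return target


# Demo with hypothetical metrics from a backup job.
demo_dir = tempfile.mkdtemp()
demo_path = write_textfile_metrics(
    demo_dir,
    "backup.prom",
    {
        "backup_last_success_timestamp_seconds": (
            "Unix time of the last successful backup.", "gauge", 1700000000),
        "backup_rows_processed": (
            "Rows processed by the last backup run.", "gauge", 12345),
    },
)
with open(demo_path, encoding="utf-8") as handle:
    demo_contents = handle.read()
print(demo_contents)
```

The job just overwrites the file on every run, so there is no separate service to keep alive, which is a big part of why I'd reach for this before the pushgateway.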
what are tested and fairly lightweight alternatives for Loki?
elastic stack is so heavy it's out of the question for smaller clusters; the loki integration with grafana is nice to have, but a separate capable dashboard would also be fine
This is pretty interesting to me, as I do use Grafana in my current role. But none of their other products, and not their helm chart (we're on the Bitnami chart if that's a thing).
So far it's pretty good. We're at least one major version behind, but hey everything still works.
I cannot imagine other products support as many data sources (though I'm starting to think they all suck, I just dump what I can in InfluxDB).
I agree. I think OP has made the mistake of using more than just Grafana for dashboards and perhaps user queries.
I operate a fairly large custom VictoriaMetrics-based Observability platform and have learned early on to only use Grafana as opposed to other Grafana products. Part of the stack used to use Mimir's frontend as a caching layer but even that died with Mimir v3.0, now that it can't talk to generic Prometheus APIs anymore (vanilla Prom, VictoriaMetrics, promxy etc.). I went back to Cortex for caching.
Such a custom stack is obviously not for everyone and takes much more time, knowledge and effort to deploy than some helm chart but overall I'd say it did save me some headache. At least when compared to the Google-like deprecation culture Grafana seems to have.
Sounds like grafana needed to fork
FTA > "I know for a fact that that pace is partially driven by career-driven development."
This isn't a Grafana problem, this is an industry-wide problem. Resume driven product design, resume driven engineering, resume driven marketing. Do your 2-3 years, pump out something big to inflate your resume. Apply elsewhere to get the pay bump that almost no company is handing out. After the departures there is no one left who knows the system and the next people in want to replace the things they don't understand to pad their resume for the next job.
Wash, rinse, repeat.
Loyalty simply goes unrewarded in a lot of places in our industry (and at many corporations). And the people who do stay... in many cases they turn into furniture that ends up holding potentially good evolution back. They lose out to the technological magpies that bring shiny things to management because it will "move the needle".
Sadly this is just one facet of the problems we are facing; from how we interview to how we run (or rent) our infrastructure, things have gotten rather silly...
without any stability, you really can’t blame the player for playing this game.
The days where you could devote your career to a firm and retire with a pension are long gone
The author of this article wants a boring tech stack that just works, and honestly after everything we’ve been through in the last five years, I kinda want a boring job I can keep until I retire, too
I find the color of the text so light as to be unreadable.
The font doesn't render correctly on my device. It seems as if some strokes are doubled, making lines inconsistent.
the "dark" theme having a *blaring* white banner at the top is an ... interesting design choice
It’s challenging to read for me too.
As someone who runs SaaS products, this post resonates painfully well.
The author is 100% correct: Monitoring should be the most boring tool in the stack. Its one and only job is to be more reliable than the thing it's monitoring.
The moment your monitoring stack requires a complex dependency like Kafka, or changes its entire agent flow every 18 months, it has failed its primary purpose. It has become the problem.
This sounds less like a technical evolution and more like the classic VC-funded push to get everyone onto a high-margin cloud product, even at the cost of the open-source soul.