Great question, and very timely. We've been tackling this exact problem at my startup, Markhub. The cost of hot storage for observability data, especially logs, can quickly become a significant line item.
We adopted a tiered approach that's working well for us so far:
1. Hot Tier (Last 7 Days): Elasticsearch. For our real-time debugging and immediate operational needs, nothing beats the query speed of Elasticsearch. We keep a rolling 7-day window of all logs here (retention sketch after this list). It's expensive, but essential.
2. Warm Tier (7-90 Days): AWS S3 Standard. After 7 days, our log shipper (Fluentd) automatically archives the logs to S3. If we need to investigate an older issue, we can still query these logs directly using AWS Athena (query sketch after this list). It's much slower than Elasticsearch, but for occasional, deep-dive investigations, the cost savings are massive.
3. Cold Tier (After 90 Days): S3 Glacier Deep Archive. After 90 days, the logs are transitioned to Glacier Deep Archive via S3's lifecycle policies (config sketch after this list). This is purely for long-term compliance and "break glass in case of emergency" scenarios. It's incredibly cheap to store, but we know that retrieving it would be a slow and deliberate process.
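On the hot tier (item 1): a rolling window like that is typically enforced with an index lifecycle management (ILM) policy that rolls indices over and deletes them once they age out. Here's a minimal sketch against Elasticsearch's ILM REST API; the endpoint, policy name (`logs-7d`), and rollover thresholds are assumptions, not our exact config.

```python
import requests

ES_URL = "http://localhost:9200"  # assumption: a reachable, unauthenticated dev cluster

# ILM policy: roll indices over daily, delete them once they are 7 days old.
policy = {
    "policy": {
        "phases": {
            "hot": {
                "actions": {
                    "rollover": {"max_age": "1d", "max_primary_shard_size": "50gb"}
                }
            },
            "delete": {
                "min_age": "7d",
                "actions": {"delete": {}},
            },
        }
    }
}

resp = requests.put(f"{ES_URL}/_ilm/policy/logs-7d", json=policy)
resp.raise_for_status()
print(resp.json())  # expect {"acknowledged": true}
```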
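On the warm tier (item 2): once the objects are in S3, an Athena query is plain SQL, assuming you've already defined an external table over the prefix (via a Glue crawler or a CREATE EXTERNAL TABLE statement). Rough sketch with boto3; the database, table, columns, partition key, and results bucket are placeholders, not our real names.

```python
import time
import boto3

athena = boto3.client("athena")

# Hypothetical Glue database.table and columns over the archived log objects.
query = """
SELECT event_time, level, message
FROM logs_archive.app_logs
WHERE dt BETWEEN '2024-01-01' AND '2024-01-07'  -- dt: hypothetical date partition
  AND level = 'ERROR'
LIMIT 100
"""

# Kick off the query and poll until it finishes.
qid = athena.start_query_execution(
    QueryString=query,
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)["QueryExecutionId"]

while True:
    state = athena.get_query_execution(QueryExecutionId=qid)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

if state == "SUCCEEDED":
    # First row returned is the column header row.
    for row in athena.get_query_results(QueryExecutionId=qid)["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])
```

Expect seconds-to-minutes of latency and pay-per-data-scanned pricing, which is why it suits occasional deep dives rather than interactive debugging.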
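On the cold tier (item 3): the 90-day hand-off is a single lifecycle rule on the archive bucket. Sketch with boto3; the bucket, prefix, and the optional expiration window are placeholders.

```python
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="example-logs-archive",  # hypothetical bucket the log shipper writes to
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "logs-tiering",
                "Status": "Enabled",
                "Filter": {"Prefix": "fluentd/"},  # hypothetical prefix
                # Warm -> cold: after 90 days in S3 Standard, move to Deep Archive.
                "Transitions": [
                    {"Days": 90, "StorageClass": "DEEP_ARCHIVE"}
                ],
                # Optional compliance cut-off (assumption: ~7-year retention).
                "Expiration": {"Days": 2555},
            }
        ]
    },
)
```

Keep in mind that getting data back out of Deep Archive is a restore request measured in hours, which is why we treat it strictly as break-glass storage.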
The key lesson for us was to be realistic about our actual query patterns. We found that over 95% of our queries were for logs less than 3 days old. This data-driven approach allowed us to be aggressive with our tiering strategy without sacrificing critical visibility.
Hm, interesting insight. Thanks for sharing!
But how did you get those query patterns? Did you use some Elasticsearch API, a proxy, or something else?
If your logging system uses object storage like S3 or Tigris for persistence, it can benefit from storage tiering automatically.
But those services don't offer intelligent tiering for hot, quick-access storage; S3 Intelligent-Tiering only spans the warm, cold, and archive tiers.
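For reference, Intelligent-Tiering shuffles objects between its frequent and infrequent access tiers automatically once they're stored in the INTELLIGENT_TIERING storage class; the archive tiers are a per-bucket opt-in. A rough sketch of that opt-in with boto3 (bucket name and day thresholds are placeholders):

```python
import boto3

s3 = boto3.client("s3")

# Opt the bucket's objects into the Intelligent-Tiering archive tiers.
# 90 / 180 days are the minimums AWS allows for these access tiers.
s3.put_bucket_intelligent_tiering_configuration(
    Bucket="example-logs",  # hypothetical
    Id="archive-old-logs",
    IntelligentTieringConfiguration={
        "Id": "archive-old-logs",
        "Status": "Enabled",
        "Tierings": [
            {"Days": 90, "AccessTier": "ARCHIVE_ACCESS"},
            {"Days": 180, "AccessTier": "DEEP_ARCHIVE_ACCESS"},
        ],
    },
)
```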
Splunk has done this forever
How? As far as I know, it only supports static retention policies out of the box...