Devy

Replacing Datadog with an LGTM Stack - OTel Sidecar with Grafana, Loki, Tempo, and Mimir

Tags: grafana, loki, tempo, mimir, opentelemetry, kubernetes, observability, devops
  1. From ECS to EKS - Building an Operating Environment with kustomize and Gateway API
  2. Replacing Datadog with an LGTM Stack - OTel Sidecar with Grafana, Loki, Tempo, and Mimir

Background

In the previous post, I covered the ECS to EKS migration. Monitoring changed as part of that work: the old system used Datadog.

Datadog is convenient. It gives logs, metrics, traces, dashboards, alerts, and many integrations without much setup. But as traffic and log volume grew, cost became hard to ignore.

Problems with Datadog

The main issue was pricing. Host-based and log-volume-based pricing is easy to start with, but the bill grows with the service.

The second issue was control. We wanted to control what data is collected, how noise is filtered, how long data is retained, and where it is stored.

The third issue was platform alignment. After moving to EKS, we already had Kubernetes manifests, IRSA, S3, and GitOps. It became natural to run observability as part of the platform.

Why LGTM

LGTM is the Grafana observability stack:

  • Loki for logs.
  • Grafana for dashboards.
  • Tempo for traces.
  • Mimir for metrics.

Together with OpenTelemetry, this gives a vendor-neutral observability path.

Overall Architecture

Data Flow

flowchart LR
    APP["Spring Boot app<br/>OTel Java Agent"] --> COL["OTel Collector sidecar"]
    COL --> LOKI["Loki<br/>logs"]
    COL --> TEMPO["Tempo<br/>traces"]
    COL --> MIMIR["Mimir<br/>metrics"]
    LOKI --> G["Grafana"]
    TEMPO --> G
    MIMIR --> G
    LOKI --> S3["S3"]
    TEMPO --> S3
    MIMIR --> S3

The application sends telemetry to a sidecar OTel Collector. The collector filters, enriches, and forwards data to the LGTM stack. Loki, Tempo, and Mimir persist their data to S3, with access granted through IRSA.

Infrastructure Components

The stack consists of:

  • OTel Java Agent attached to Spring Boot.
  • OTel Collector sidecar per application Pod.
  • Loki, Tempo, Mimir, and Grafana deployed through Helm.
  • S3 buckets for durable storage.
  • IRSA service accounts for AWS permissions.
  • Karpenter node pools for observability workloads.

OTel Java Agent - Application Instrumentation

Automatic Instrumentation

The OTel Java Agent instruments common libraries without code changes:

  • HTTP server and client calls.
  • JDBC queries.
  • Spring MVC request handling.
  • log correlation.
  • metrics export.

The Docker image downloads the agent and starts the app with -javaagent.

ADD https://github.com/open-telemetry/opentelemetry-java-instrumentation/releases/download/v2.10.0/opentelemetry-javaagent.jar /otel/opentelemetry-javaagent.jar

ENTRYPOINT ["java", "-javaagent:/otel/opentelemetry-javaagent.jar", "-jar", "/app/app.jar"]

Connection Through Environment Variables

env:
  - name: OTEL_SERVICE_NAME
    value: app-api
  - name: OTEL_EXPORTER_OTLP_ENDPOINT
    value: http://localhost:4317
  - name: OTEL_EXPORTER_OTLP_PROTOCOL
    value: grpc
  - name: OTEL_TRACES_SAMPLER
    value: parentbased_traceidratio
  - name: OTEL_TRACES_SAMPLER_ARG
    value: "0.1"

The endpoint is localhost because the collector runs as a sidecar in the same Pod.
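To make the localhost point concrete, here is a minimal sketch of a Deployment with both containers in one Pod. Containers in a Pod share a network namespace, so the app reaches the collector on localhost:4317. The image names, the collector version tag, and the ConfigMap name are assumptions for illustration, not the manifests used in this migration.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-api
spec:
  replicas: 1
  selector:
    matchLabels:
      app: app-api
  template:
    metadata:
      labels:
        app: app-api
    spec:
      containers:
        - name: app
          image: example.com/app-api:latest        # hypothetical image
          env:
            - name: OTEL_EXPORTER_OTLP_ENDPOINT
              value: http://localhost:4317         # sidecar in the same Pod
        - name: otel-collector
          image: otel/opentelemetry-collector-contrib:0.115.0   # assumed version
          args: ["--config=/etc/otelcol/config.yaml"]
          ports:
            - containerPort: 4317   # OTLP gRPC
            - containerPort: 4318   # OTLP HTTP
          volumeMounts:
            - name: otel-config
              mountPath: /etc/otelcol
      volumes:
        - name: otel-config
          configMap:
            name: otel-collector-config   # hypothetical; must contain a config.yaml key
```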

Sampling

Tracing every request is expensive. I used ratio-based sampling and adjusted the rate by environment.

| Environment | Sampling |
|---|---|
| dev | high |
| stage | medium |
| prod | low, with error visibility retained |

The goal was not to collect everything. The goal was to collect enough data to debug production issues while controlling cost.
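One way to vary the ratio by environment, assuming the app Deployment is managed with kustomize overlays, is a small strategic merge patch per overlay. The file path, Deployment name, container name, and ratio below are illustrative, not the values actually used.

```yaml
# overlays/prod/sampling-patch.yaml (hypothetical path)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-api
spec:
  template:
    spec:
      containers:
        - name: app          # must match the container name in the base
          env:
            - name: OTEL_TRACES_SAMPLER_ARG
              value: "0.1"   # illustrative prod ratio; a dev overlay might use "1.0"
```

OTEL_TRACES_SAMPLER_ARG is a ratio between 0 and 1. Note that parentbased_traceidratio decides at the root span, before the request outcome is known, so the ratio applies to failed and successful requests alike.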

OTel Collector Sidecar - Collect, Process, Export

Receiver

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

The sidecar receives OTLP data from the Java Agent.

Processor - Noise Filtering

Health checks and actuator endpoints create a lot of noise.

processors:
  filter/noise:
    traces:
      span:
        - 'attributes["http.route"] == "/actuator/health"'
        - 'attributes["url.path"] == "/actuator/prometheus"'

Authorization headers and other sensitive data must also be removed.

processors:
  attributes/security:
    actions:
      - key: http.request.header.authorization
        action: delete
      - key: http.request.header.cookie
        action: delete

Exporter

exporters:
  otlp/tempo:
    endpoint: tempo-distributor:4317
    tls:
      insecure: true
  loki:
    endpoint: http://loki-gateway/loki/api/v1/push
  prometheusremotewrite/mimir:
    endpoint: http://mimir-nginx/api/v1/push

Pipeline

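service:
  pipelines:
    # Every processor listed here must also be declared under "processors:",
    # including batch (for example "batch: {}"), or the collector fails to start.
    traces:
      receivers: [otlp]
      processors: [filter/noise, attributes/security, batch]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp]
      processors: [batch]
      exporters: [prometheusremotewrite/mimir]
    logs:
      receivers: [otlp]
      processors: [attributes/security, batch]
      exporters: [loki]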

The sidecar pattern makes the app deployment self-contained. Each application controls its own telemetry path and does not have to depend on a shared node-level agent.

Request Tracing with MDC

MDCFilter

MDC connects logs to requests.

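import java.io.IOException;
import java.util.UUID;
// Assumes the jakarta.servlet namespace (Spring Boot 3+); use javax.servlet on older versions.
import jakarta.servlet.Filter;
import jakarta.servlet.FilterChain;
import jakarta.servlet.ServletException;
import jakarta.servlet.ServletRequest;
import jakarta.servlet.ServletResponse;
import org.slf4j.MDC;

public class MDCFilter implements Filter {
    @Override
    public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
            throws IOException, ServletException {
        try {
            // Tag every log line written during this request with one ID.
            String requestId = UUID.randomUUID().toString();
            MDC.put("requestId", requestId);
            chain.doFilter(request, response);
        } finally {
            // Servlet threads are pooled, so clear the context before the thread is reused.
            MDC.clear();
        }
    }
}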

Async MDC Propagation

For async work, MDC does not automatically move to another thread. A task decorator copies the context.

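import java.util.Map;

import org.slf4j.MDC;
import org.springframework.core.task.TaskDecorator;

public class MdcTaskDecorator implements TaskDecorator {
    @Override
    public Runnable decorate(Runnable runnable) {
        // Runs on the submitting thread: snapshot its MDC (null if nothing is set).
        Map<String, String> contextMap = MDC.getCopyOfContextMap();
        return () -> {
            // Runs on the worker thread: restore the snapshot before the task.
            if (contextMap != null) MDC.setContextMap(contextMap);
            try {
                runnable.run();
            } finally {
                MDC.clear();
            }
        };
    }
}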

Logback Format

<pattern>%d{yyyy-MM-dd HH:mm:ss.SSS} [%thread] %-5level [%X{requestId}] %logger{36} - %msg%n</pattern>

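With the request ID in MDC, every log line can be tied to a request. The OTel Java Agent also adds trace_id and span_id to the Logback MDC, so including %X{trace_id} in the pattern makes it much easier to move between logs and traces.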

Deploying the LGTM Stack

Shared Infrastructure - IRSA + Karpenter

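Loki, Tempo, and Mimir need object storage. On EKS, the cleanest way to grant S3 access is IRSA (IAM Roles for Service Accounts): the ServiceAccount is annotated with an IAM role, and Pods using it receive that role's permissions without static credentials.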

apiVersion: v1
kind: ServiceAccount
metadata:
  name: loki
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/loki-s3-role

Karpenter also helps isolate observability workloads from application workloads.
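As a rough sketch of that isolation, assuming the Karpenter v1 API and an existing EC2NodeClass named default, a dedicated NodePool can label and taint its nodes so that only workloads that opt in land there. The pool name, label, and taint key are hypothetical, and older Karpenter versions use a different apiVersion and nodeClassRef shape.

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: observability            # hypothetical pool name
spec:
  template:
    metadata:
      labels:
        workload: observability  # hypothetical label used for scheduling
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default            # assumed to exist already
      taints:
        - key: workload
          value: observability
          effect: NoSchedule     # keeps application Pods off these nodes
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
---
# Pod-level scheduling fields an observability workload would need (fragment)
nodeSelector:
  workload: observability
tolerations:
  - key: workload
    operator: Equal
    value: observability
    effect: NoSchedule
```

Where the nodeSelector and tolerations go in Helm values differs per chart, so the second document is only the Pod-spec fragment.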

Loki - Log Storage

Loki stores log chunks in S3 and indexes only labels rather than full log content, which keeps the index compact.

loki:
  storage:
    type: s3
    bucketNames:
      chunks: observability-loki-chunks
      ruler: observability-loki-ruler

Tempo - Distributed Tracing

Tempo stores traces in object storage and can generate metrics from spans.

tempo:
  storage:
    trace:
      backend: s3
      s3:
        bucket: observability-tempo
        region: ap-northeast-2

Mimir - Long-Term Metrics

Mimir stores Prometheus-compatible metrics with long retention.

mimir:
  structuredConfig:
    common:
      storage:
        backend: s3

Grafana - Dashboard

Grafana connects the three data sources.

datasources:
  datasources.yaml:
    apiVersion: 1
    datasources:
      - name: Loki
        type: loki
        url: http://loki-gateway
      - name: Tempo
        type: tempo
        url: http://tempo-query-frontend:3100
      - name: Mimir
        type: prometheus
        url: http://mimir-nginx/prometheus

Kustomize by Environment

The LGTM stack also used base and overlays.

observability/
  base/
  overlays/
    dev/
    stage/
    prod/

Retention, replicas, resource requests, and sampling rates differ by environment. The shape stays the same, but the size changes.
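A minimal sketch of what one overlay could look like, assuming the base holds the rendered manifests and each patch file carries only the prod differences. The patch file names are hypothetical.

```yaml
# observability/overlays/prod/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
patches:
  # Each file overrides only what differs in prod; targets are
  # identified by the kind and name inside the patch itself.
  - path: retention.yaml   # hypothetical
  - path: replicas.yaml    # hypothetical
  - path: resources.yaml   # hypothetical
```

Running `kubectl kustomize observability/overlays/prod` renders the merged result without applying it, which is a cheap way to review what an overlay changes.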

Things That Went Wrong

Loki Data Source Connection Failure

Grafana could not connect to Loki at first. The issue was not Loki itself, but the service URL. Helm charts often expose multiple services, and the correct internal endpoint depends on the chart mode.

The fix was to check the generated services and use the gateway endpoint Grafana can reach.

Missing OTel Collector RBAC

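The collector was supposed to enrich telemetry with Kubernetes metadata, which means reading Pod and namespace information from the Kubernetes API. The RBAC permissions for that were missing, so pod and namespace metadata did not appear.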

Adding the proper ClusterRole and ClusterRoleBinding fixed it.
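A sketch of that RBAC, assuming the enrichment is done by the collector's k8sattributes processor and that the sidecar runs under the application's ServiceAccount. The resource names and namespace are hypothetical.

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: otel-collector           # hypothetical name
rules:
  - apiGroups: [""]
    resources: ["pods", "namespaces", "nodes"]
    verbs: ["get", "watch", "list"]
  - apiGroups: ["apps"]
    resources: ["replicasets"]   # used to resolve Deployment names
    verbs: ["get", "watch", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: otel-collector
subjects:
  - kind: ServiceAccount
    name: app-api                # the Pod's ServiceAccount, since the collector is a sidecar
    namespace: default           # assumed namespace
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: otel-collector
```

With sidecars, every application ServiceAccount that runs a collector needs a subject entry in the binding.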

Tempo metricsGenerator remoteWrite

Tempo's metrics generator derives metrics from spans, but it only delivers them if a remote write target is configured. Without it, traces were collected but span metrics did not appear in Mimir.
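For orientation, this is roughly what the relevant part looks like in Tempo's own configuration format. It is a sketch, not the Helm values used here: where it nests inside chart values depends on the chart, the overrides layout differs between Tempo versions, and the WAL path is an assumption. The URL reuses the Mimir push endpoint from the collector exporter.

```yaml
metrics_generator:
  storage:
    path: /var/tempo/generator/wal              # assumed WAL location
    remote_write:
      - url: http://mimir-nginx/api/v1/push     # same Mimir endpoint the collector uses
        send_exemplars: true
overrides:
  defaults:
    metrics_generator:
      # Processors are off unless enabled per tenant or as a default.
      processors: [service-graphs, span-metrics]
```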

Authorization Header Exposure

Automatic instrumentation can capture HTTP headers. That is useful until it captures secrets. Authorization and cookie headers must be removed before export.

This filtering belongs in the collector, not in every application.

Before/After

| Item | Before (Datadog) | After (LGTM) |
|---|---|---|
| Cost model | Host and volume based | Infrastructure and storage based |
| Data ownership | SaaS | S3 and cluster |
| Instrumentation | Agent | OTel Java Agent |
| Logs | Datadog Logs | Loki |
| Traces | Datadog APM | Tempo |
| Metrics | Datadog Metrics | Mimir |
| Dashboard | Datadog | Grafana |
| Filtering | Vendor config | OTel Collector processors |

Closing

Moving away from Datadog does not make observability free. The cost moves from the vendor bill to platform operation: Helm values, storage, upgrades, RBAC, and dashboards.

But the tradeoff was acceptable. We gained more control over collection, filtering, retention, and storage. More importantly, the observability stack became part of the same Kubernetes and GitOps operating model as the application.

That consistency was the biggest win.