At DeepXplore we build an AI-driven platform for performance testing, anomaly detection, and incident analysis. We continuously correlate telemetry (metrics, logs, traces) with engineering context from systems like GitHub, Kubernetes, and Jira to detect problems early and explain exactly what changed when incidents happen. When the platform can act safely on that signal, DeepXplore Code applies remediations immediately so teams spend less time in firefighting loops.
The platform is only as useful as the telemetry layer underneath it. That is why we picked GreptimeDB as the storage engine for the entire data foundation — from performance test metrics, all the way through to the data we hand to our AI coordinator during root cause analysis.
Why we picked GreptimeDB
When we evaluated backends, we kept coming back to one question: how much of our team’s time will go into operating the database itself instead of building product?
The conventional cloud-native answer is a stack of specialised systems — Thanos or Mimir for long-term metrics, Loki for logs, Tempo or Jaeger for traces, plus the glue between them. Each one is good at its job. Each one also brings its own deployment model, its own scaling story, its own configuration surface, and its own failure modes. Three or four systems quickly add up to an entire team’s worth of operational overhead.
GreptimeDB collapses that into one engine:
- one database for metrics, logs, traces, and wide events,
- standard protocols out of the box: OpenTelemetry, Prometheus remote write, SQL, PromQL,
- object-storage-native architecture, so storage scales independently of compute,
- runs as a single binary in standalone mode for development, or as a horizontally scalable cluster in production.
In practice, it is plug and play. We did not write integration code to make our existing Prometheus and OpenTelemetry pipelines talk to it — they connected straight in. The day-to-day operational footprint shrank from four systems to one. That difference shows up in fewer alerts at 3 a.m., shorter onboarding for new engineers, and more time spent on the AI features customers actually care about.
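To make that concrete, here is a minimal sketch of what the protocol story looks like from the query side. It assumes a standalone GreptimeDB instance on its default HTTP port (4000) and a hypothetical `http_requests_total` metric already ingested via Prometheus remote write; the same instance answers both SQL and PromQL.

```python
import requests

GREPTIMEDB = "http://localhost:4000"  # standalone instance, default HTTP port

# SQL over GreptimeDB's HTTP API: metrics written via remote write
# show up as ordinary tables.
sql = "SELECT * FROM http_requests_total LIMIT 5"
resp = requests.post(f"{GREPTIMEDB}/v1/sql", params={"db": "public"}, data={"sql": sql})
print(resp.json())

# PromQL through the Prometheus-compatible HTTP API: same data, same engine.
resp = requests.get(
    f"{GREPTIMEDB}/v1/prometheus/api/v1/query",
    params={"query": "sum(rate(http_requests_total[5m]))"},
)
print(resp.json())
```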
How GreptimeDB fits into our pipeline
Performance tests in DeepXplore generate a continuous stream of telemetry: load-test results, request-level metrics, infrastructure health signals, and the logs and traces produced by the systems under test. Everything lands in GreptimeDB through a small, focused ingestion pipeline:
- Grafana Alloy runs in the cluster, discovers the test pods, scrapes their Prometheus-format metrics, and tails container logs.
- Vector sits in front of GreptimeDB as an aggregator. It is where we apply lightweight transforms before storage — most notably a sharding tag that distributes high-volume time series across partitions to keep ingestion and queries fast (sketched just after this list).
- GreptimeDB ingests the result via standard Prometheus remote write and OTLP. From there, the same data powers every downstream surface we run.
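The sharding tag is the only transform with real logic in it. In production it is a remap step inside Vector, but the idea fits in a few lines of Python: derive a stable shard value from the series identity so that high-volume series spread evenly across partitions. The shard count and label names below are illustrative.

```python
import hashlib

SHARD_COUNT = 16  # illustrative; tuned per deployment in practice

def shard_tag(metric_name: str, labels: dict[str, str]) -> str:
    """Derive a stable shard value from the series identity (name + sorted labels)."""
    series_key = metric_name + "".join(f"{k}={v}" for k, v in sorted(labels.items()))
    digest = hashlib.sha256(series_key.encode()).digest()
    return str(int.from_bytes(digest[:4], "big") % SHARD_COUNT)

# Every sample of a given series always gets the same shard value, so the
# partitioning stays stable and no series is split across partitions.
labels = {"service": "checkout", "pod": "checkout-7d9f"}
enriched = {**labels, "shard": shard_tag("http_request_duration_seconds", labels)}
print(enriched)
```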
(Helios runs in K8s)"] --> alloy["Grafana Alloy
scrape metrics + logs"] alloy --> vector["Vector aggregator
sharding pipeline"] vector --> greptime[("GreptimeDB
metrics, logs, traces")] greptime --> threshold["Threshold validation"] greptime --> dashboards["Frontend dashboards"] greptime --> anomaly["Anomaly detection
(forecasting)"] greptime --> rca["AI RCA + DeepXplore Code"]
Once data is in GreptimeDB, it serves multiple purposes from the same store:
- Threshold validation during performance tests — pass/fail checks against the run’s defined SLAs (example below).
- Frontend dashboards — engineers see live test results and historical comparisons, read directly from GreptimeDB.
- Anomaly detection — our forecasting service pulls run-scoped series and flags deviations from learned baselines.
- Root cause analysis — DeepXplore agents query metrics, logs, and traces through the same surface during AI investigations.
There is no separate “long-term store” or “hot vs cold” sync to operate. Data is queryable as soon as it lands.
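As an example of the first consumer, threshold validation reduces to a run-scoped query plus a comparison. GreptimeDB speaks the MySQL wire protocol (port 4002 by default), so a plain MySQL client is enough. The table, columns, and run identifier below are illustrative rather than our actual schema, and the percentile aggregate is the DataFusion-style `approx_percentile_cont`.

```python
import pymysql

# GreptimeDB speaks the MySQL wire protocol, so an ordinary client works.
conn = pymysql.connect(host="greptimedb", port=4002, user="greptime", password="", database="public")

RUN_ID = "perf-run-2024-10-17-001"  # illustrative run identifier
P95_LATENCY_SLA_MS = 250.0          # the run's defined SLA

with conn.cursor() as cur:
    # Hypothetical table written by the load generator during the run.
    cur.execute(
        "SELECT approx_percentile_cont(latency_ms, 0.95) "
        "FROM request_metrics WHERE run_id = %s",
        (RUN_ID,),
    )
    (p95,) = cur.fetchone()

print("PASS" if p95 is not None and p95 <= P95_LATENCY_SLA_MS else "FAIL", p95)
```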
Why one engine matters for root cause analysis
When it comes to root cause analysis, we also rely on telemetry provided by the customer from their own datacenter. In many real environments that telemetry is scattered across separate systems and teams: metrics in one platform, logs in another, traces in a third, plus additional context in cluster tooling. During an incident, the bottleneck is rarely data availability — it is fragmentation and manual stitching across tools.
(metrics)"] loki["Loki
(logs)"] tempo["Tempo / Jaeger
(traces)"] glue["Custom correlation
and ETL glue"] thanosMimir --> glue loki --> glue tempo --> glue end subgraph afterState["After: one engine for all signals"] greptimeCore[("GreptimeDB
metrics, logs, traces")] deepxploreAI["DeepXplore RCA
+ DeepXplore Code"] customerActions["Actionable incident decisions"] greptimeCore --> deepxploreAI --> customerActions end
GreptimeDB removes those seams for customer data. Because customer metrics, logs, and traces can be analyzed through one engine with a shared time index, our RCA logic correlates evidence across signal types without custom bridge code. Traces correlate to logs by trace ID. Logs correlate to metrics by service and timestamp. Metrics correlate back to traces by service and label.
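Here is the shape of a correlation query an investigation might issue against that shared store, joining error logs to the spans they belong to by trace ID inside the incident window. Table and column names are illustrative; the real schemas depend on how the customer's OTLP data is mapped.

```python
import requests

# Hypothetical log and trace tables; the join key is the shared trace_id,
# and the shared time index scopes both sides to the incident window.
sql = """
SELECT l.ts, l.message, t.span_name, t.duration_ms
FROM app_logs AS l
JOIN trace_spans AS t ON l.trace_id = t.trace_id
WHERE l.service = 'checkout'
  AND l.level = 'ERROR'
  AND l.ts BETWEEN '2024-10-17 14:00:00' AND '2024-10-17 14:15:00'
ORDER BY l.ts
"""

resp = requests.post(
    "http://greptimedb:4000/v1/sql",
    params={"db": "public"},
    data={"sql": sql},
)
# Result rows sit under output[0].records.rows in GreptimeDB's HTTP API response.
for row in resp.json()["output"][0]["records"]["rows"]:
    print(row)
```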
That is what makes GreptimeDB our preferred analysis surface during customer RCA. The DeepXplore agent installed in the customer cluster reaches customer metrics, logs, and traces through the same query interface. When the AI coordinator builds an investigation, it does not need to care whether the next evidence point comes from a metric, a log, or a trace — only which service and time window to inspect. GreptimeDB returns the rest.
When the correlated evidence is strong enough and the runbook allows it, DeepXplore Code applies the remediation directly, closing the loop from anomaly detection to fix without a human in the middle.
From telemetry setup to RCA in minutes
A key part of the DeepXplore experience is how little friction teams face when wiring on-prem telemetry into root cause analysis. In practice, you configure the DeepXplore Agent once, define which telemetry sources should be available for investigation, and roll out the installer in your cluster.
When an RCA is triggered, DeepXplore posts investigation tasks and notifies the in-cluster agent that telemetry is ready to analyze. The DeepXplore Agent pulls those tasks outbound, fetches only the scoped telemetry from sources like GreptimeDB, and returns structured findings. This pull-based model avoids inbound internet exposure for the agent and helps keep customer data secure.
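A simplified sketch of that loop is below. Every endpoint path, payload field, and credential here is invented for illustration (the real agent protocol is DeepXplore-specific), but the structure is the point: the agent only makes outbound requests, and only for the telemetry a task scopes.

```python
import time
import requests

DEEPXPLORE_API = "https://api.deepxplore.example"  # hypothetical control-plane URL
GREPTIMEDB_SQL = "http://greptimedb:4000/v1/sql"   # in-cluster telemetry source
AGENT_TOKEN = "..."                                # agent credential, used outbound only

def poll_once() -> None:
    # Outbound only: the agent asks for work; nothing connects into the cluster.
    tasks = requests.get(
        f"{DEEPXPLORE_API}/v1/agent/tasks",  # hypothetical task endpoint
        headers={"Authorization": f"Bearer {AGENT_TOKEN}"},
        timeout=30,
    ).json()

    for task in tasks:
        # Fetch only the telemetry the task scopes (service + time window).
        evidence = requests.post(
            GREPTIMEDB_SQL,
            params={"db": "public"},
            data={"sql": task["scoped_query"]},  # hypothetical task field
            timeout=30,
        ).json()

        # Return structured findings instead of shipping raw telemetry wholesale.
        requests.post(
            f"{DEEPXPLORE_API}/v1/agent/findings",  # hypothetical findings endpoint
            headers={"Authorization": f"Bearer {AGENT_TOKEN}"},
            json={"task_id": task["id"], "evidence": evidence},
            timeout=30,
        )

while True:
    poll_once()
    time.sleep(15)  # polling interval; illustrative
```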
Teams that prefer a managed deployment can also connect to managed GreptimeDB directly as the telemetry backend. DeepXplore supports both options: in-cluster telemetry access via the DeepXplore Agent and direct integration with managed GreptimeDB.
This short demo walks through the full integration flow and shows how quickly on-premise GreptimeDB telemetry becomes usable in DeepXplore RCA.
What this partnership unlocks
This partnership is not just logo placement. It is a practical integration path for shared customers:
- a stronger telemetry foundation for teams currently struggling with fragmented observability,
- faster and more precise root cause analysis when incidents do happen,
- a clean path from raw telemetry to actionable engineering decisions — and where your runbooks allow, immediate remediation through DeepXplore Code.
If you are building systems where performance, reliability, and incident speed matter, deepxplore.io and greptime.com are a strong combination.