Datadog Announces GPU Monitoring to Help Businesses Optimize Spend and Performance as They Aim to Scale AI Projects
“GPU instances account for 14 percent of compute costs—which is a huge issue as companies are struggling to build AI-first technology in scalable and smart ways. While these companies can see their costs climbing, they can’t chargeback GPU spend across business units, see workload context or identify clear next steps for improvement. As a result, it is very challenging to budget and plan in thoughtful ways,” said
The launch of GPU Monitoring
“Smartly managing AI spend becomes a board-level conversation when capacity is misallocated, training and inference workloads stall, and costs escalate. We all know managing GPU costs is a huge problem we need to solve, but most companies are experimenting with solutions and it is still very difficult to get a single view of what is happening across the stack. GPU Monitoring fixes that with efficiency and reliability that we haven’t seen before,” said Li.
Today, most GPU tools provide high-level device health metrics, but they don’t surface cross-functional resource contention issues, explain why training and inference workloads fail, or provide visibility into which devices are idle or ineffectively used. This lack of visibility slows down investigations and means that teams overprovision as the safest default—leading to wasted spend.
GPU Monitoring streamlines this work by linking fleet telemetry directly to the workloads consuming those resources, and gives platform engineering and machine learning teams a shared view to investigate together, enabling them to:
- Scale AI without overspending: With visibility and forecasting based on the usage patterns of fleets and direct guidance on whether to buy new GPUs or free up existing ones, platform teams avoid expensive purchases and long procurement cycles, machine learning teams get capacity faster, and leadership gets better ROI with predictable spend.
- Accelerate AI delivery: Stalled workloads are correlated directly to the underlying GPUs, pods and processes running them so that teams can troubleshoot performance bottlenecks in minutes instead of hours, allowing engineers to focus on shipping AI projects.
- Avoid costly disruptions: Unhealthy GPUs are proactively identified before failures cascade across a cluster and cause training and inference delays.
- Maximize ROI on GPU spend: Teams are empowered and accountable for their GPU utilization and costs, and can easily pinpoint where they are overserving or underutilizing their GPUs. This allows teams to reclaim and reallocate resources in order to reduce wasted spend.
“Datadog GPU Monitoring has made it easy for us to stay on top of our multi-tenant GPU infrastructure. We get per-instance, per-device visibility into core utilization, memory, power and thermals right out of the box with no extra setup. The dashboards are rich out of the gate and simple to customize, and standing up isolated views per customer takes minutes,” said
GPU Monitoring is now generally available. To learn more, please visit: https://www.datadoghq.com/blog/datadog-gpu-monitoring/.
About
Forward-Looking Statements
This press release may include certain “forward-looking statements” within the meaning of Section 27A of the Securities Act of 1933, as amended, or the Securities Act, and Section 21E of the Securities Exchange Act of 1934, as amended including statements on the benefits of new products and features. These forward-looking statements reflect our current views about our plans, intentions, expectations, strategies and prospects, which are based on the information currently available to us and on assumptions we have made. Actual results may differ materially from those described in the forward-looking statements and are subject to a variety of assumptions, uncertainties, risks and factors that are beyond our control, including those risks detailed under the caption “Risk Factors” and elsewhere in our
Contact:
press@datadoghq.com
Source: Datadog, Inc.
