ATLANTA — Those pesky AI agents. You never know what trouble they’ll cause.
Sneaky or malicious ones will escalate their privileges and wreak who knows how much havoc on real systems.
Google LLC says it has an answer, however: the Google Kubernetes Engine (GKE) Agent Sandbox, which confines large language model (LLM)-generated code and tools to a restricted environment.
It’s one of a number of initiatives the company has undertaken to lure large-scale AI workloads to its cloud platform, which it is demonstrating at KubeCon + CloudNativeCon North America, held in Atlanta this week.
The company has also made a number of optimizations to its Kubernetes cloud service so it can process large-scale AI jobs more quickly.
“Our customers, especially some of the customers who are running AI workloads, are asking for greater scale, better performance, greater cost efficiency, lower latency,” said Nathan Beach, director of product management at Google, in an interview with TNS.
About 79% of senior IT leaders have adopted AI agents, and 88% plan to increase IT budgets in the coming year to accommodate agentic AI, according to PricewaterhouseCoopers LLP.
To this end, the company has released into general availability its GKE Inference Gateway, a set of optimizations (based on the Kubernetes Gateway API Inference Extension) for running AI workloads more quickly.
Early results look promising: the production version has cut time-to-first-token (TTFT) latency by 96% while using a quarter fewer tokens compared with standard GKE implementations.
Faster autoscaling has been another priority. The company has also raised the number of nodes GKE can support in a single cluster to 130,000, which should handle even the largest training workloads.
A Sandbox for Security, Governance and Isolation
The “Agent Sandbox is addressing what we’ve seen as one of the biggest gaps in the current agent ecosystem,” Beach said.
“Agents need to do things beyond simply what an existing tool is able to do,” he continued. “So an agent will need to execute, for example, LLM-generated code, which is not fully trusted.”
The GKE Agent Sandbox uses gVisor to keep LLM environments isolated from other workloads on the network. The sandbox also builds in snapshotting and container-optimized compute.
The admin sets what privileges an LLM may have. It could be given access to the internet, for example, though the sandbox keeps the agent from rummaging around in the internal system itself.
And in case something goes really wrong, sandboxes can be restored to their initial state in less than three seconds.
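GKE handles the isolation plumbing for you, but a rough Python sketch of the underlying pattern looks like this, shelling out to Docker with gVisor’s documented runsc runtime. The code string, container image and resource limits here are illustrative assumptions, not Agent Sandbox’s actual API:

```python
import subprocess
import tempfile

# Illustrative sketch only: run untrusted, LLM-generated code in a
# gVisor-isolated container with no network and capped resources.
untrusted_code = "print(sum(range(10)))"  # imagine an LLM produced this

with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
    f.write(untrusted_code)
    script_path = f.name

result = subprocess.run(
    [
        "docker", "run", "--rm",
        "--runtime=runsc",       # gVisor: syscalls intercepted in user space
        "--network=none",        # no internet unless the admin grants it
        "--memory=256m", "--cpus=0.5",  # resource caps
        "-v", f"{script_path}:/sandbox/task.py:ro",
        "python:3.12-slim",
        "python", "/sandbox/task.py",
    ],
    capture_output=True, text=True, timeout=30,
)
print(result.stdout, end="")
```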
GKE Inference Gateway
The GKE Inference Gateway has been customized for AI workloads, which have different load-balancing characteristics than most Kubernetes jobs and hence can get backlogged.
The Gateway optimizes two specific kinds of AI jobs. In Google’s words:
LLM-aware routing for applications like multiturn chat, which routes requests to the same accelerators to use cached context, avoiding latency spikes (see the sketch after this list).
Disaggregated serving, which separates the “prefill” (prompt processing) and “decode” (token generation) stages onto separate, optimized machine pools.
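To get a feel for the LLM-aware half of that, here’s a hypothetical Python sketch of session-affinity routing. The replica names and hashing scheme are assumptions for illustration, not the Gateway’s actual implementation:

```python
import hashlib

# Hypothetical sketch: pin every turn of a chat session to the same model
# replica so its cached context (KV cache) can be reused.
REPLICAS = ["model-pod-0", "model-pod-1", "model-pod-2"]

def route(session_id: str) -> str:
    """Deterministically map a chat session to one accelerator replica."""
    digest = hashlib.sha256(session_id.encode()).digest()
    return REPLICAS[int.from_bytes(digest[:8], "big") % len(REPLICAS)]

# Every request in session "chat-42" lands on the same pod, so the prompt
# prefix it already processed stays hot in that pod's cache.
assert route("chat-42") == route("chat-42")
print(route("chat-42"))
```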
“The Gateway allows customers to dramatically reduce the latency of serving LLMs, and to do so in a way that increases throughput and reduces the cost of inference,” Beach said.
Autoscaling Improvements
Elsewhere, autoscaling got an overhaul, with more node-provisioning operations done in parallel. Google can also set up a buffer of preprovisioned nodes, which can be handed to workloads almost instantly.
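Conceptually, the buffer works like a warm pool that scale-ups draw from, as in this illustrative Python sketch (names and timings are made up):

```python
import queue
import time

BUFFER_SIZE = 4
warm_nodes = queue.Queue()  # pool of already-booted nodes

def provision_node(name: str) -> str:
    time.sleep(2)  # stand-in for the slow boot-and-provision path
    return name

def refill_buffer() -> None:
    while warm_nodes.qsize() < BUFFER_SIZE:
        warm_nodes.put(provision_node(f"node-{warm_nodes.qsize()}"))

def acquire_node() -> str:
    try:
        return warm_nodes.get_nowait()      # near-instant: already booted
    except queue.Empty:
        return provision_node("cold-node")  # fall back to the slow path

refill_buffer()        # done ahead of time, off the critical path
print(acquire_node())  # returns immediately at scale-up time
```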
Even on the latest hardware, LLMs can take 10 minutes or more to start. As a way around this, Google has developed GKE Pod Snapshots: memory snapshots that can be used to restart a job, cutting start times by as much as 80%.
“Pod Snapshots is ideal for situations where you are horizontally scaling and creating new replicas,” Beach said.
The snapshot includes both CPU and GPU memory and is written to Google Cloud Storage.
“We restore that snapshot from cloud storage, which dramatically reduces the amount of time that it takes to scale out [additional] instances, because you don’t have to start from scratch,” he said.
With a snapshot, a 70-billion-parameter model can be loaded in 80 seconds, and an 8-billion-parameter model in just 16 seconds.
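Those figures roughly square with the claims above, as a back-of-the-envelope check shows (assumptions noted in the comments):

```python
# Back-of-the-envelope check of the quoted numbers (illustrative only).
cold_start_s = 10 * 60  # "10 minutes or more" to cold-start an LLM
savings = 0.80          # "as much as 80%" faster with a snapshot
restore_s = cold_start_s * (1 - savings)
print(restore_s)  # 120.0 seconds; the quoted 80s for a 70B model is better still
```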
Other time-saving tweaks include a revamp of the company’s GKE container image streaming that allows containerized applications to start running before the entire container image has been downloaded.
The company is also open-sourcing its multi-tier checkpointing (MTC) solution, which can store checkpoints on different types of storage, such as local SSDs, RAM and backup storage, allowing workloads to be recovered more quickly when needed.
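Here’s an illustrative Python sketch of the multi-tier idea, not Google’s actual MTC code: every checkpoint is written to each tier, and recovery reads from the fastest tier that still has it (the paths are assumptions):

```python
import pathlib
import pickle

# Tiers ordered fastest-first; restore tries them in this order.
TIERS = [
    pathlib.Path("/dev/shm/ckpt"),        # RAM-backed: fastest, least durable
    pathlib.Path("/mnt/local-ssd/ckpt"),  # local SSD
    pathlib.Path("/mnt/backup/ckpt"),     # backup storage: slowest, durable
]

def save_checkpoint(step: int, state: dict) -> None:
    blob = pickle.dumps(state)
    for tier in TIERS:
        tier.mkdir(parents=True, exist_ok=True)
        (tier / f"step-{step}.ckpt").write_bytes(blob)

def restore_checkpoint(step: int) -> dict | None:
    for tier in TIERS:  # try the fastest tier first
        path = tier / f"step-{step}.ckpt"
        if path.exists():
            return pickle.loads(path.read_bytes())
    return None  # nothing recoverable at this step
```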