Nov 13, 2023
How We Power the Largest AI Deployments on the Planet: Running Virtual Clusters at Scale - Brandon Jacobs, CoreWeave & Lukas Gentele, Loft Labs

Running and managing a large number of Kubernetes clusters on bare metal poses significant challenges, from security to GPU provisioning to scalability. Specialized cloud provider CoreWeave experienced these first-hand, operating 3,000+ Kubernetes clusters on top of 5,000 bare metal nodes with massive amounts of GPUs to power modern AI applications at scale.

In this session, we'll dive into these challenges and how CoreWeave partnered with Loft Labs, the maintainers of vcluster, to create this serverless Kubernetes experience for numerous companies running AI workloads at scale.

The session covers the pitfalls, design choices, and architectural challenges the teams dealt with over 3 years of evolving CoreWeave's serverless Kubernetes offering, including:

- Secure isolation of tenants on shared infrastructure
- Challenges in achieving 10-second autoscaling
- On-demand cluster and compute provisioning for tenants
- Day 2 operations and managing a fleet of clusters at scale


CNCF [Cloud Native Computing Foundation]
