AI Infrastructure Lead

Skills
  • Python
  • Go
  • ML
  • Kubernetes
  • Autoscaling
  • Containerization
  • Scheduling
  • Anomaly detection
  • LLM-driven optimizations
  • Time-series forecasting
About ScaleOps

ScaleOps, the leader in real-time automated cloud resource management, is redefining the way engineering teams run in the cloud. Our platform automatically allocates resources to match real-time demand, achieving 60–80% cost savings while improving performance and simplifying DevOps operations.

Backed by $80M from top-tier investors, and trusted by leading cloud-native innovators such as Wiz, CATO Networks, SentinelOne, and Orca Security, we’re rapidly expanding our core technology team.

About The Role

We are looking for an AI Infrastructure Lead to spearhead innovation at the intersection of AI systems, Kubernetes internals, and large-scale cloud infrastructure. This is a hands-on, high-impact role, combining deep research with prototyping and engineering leadership. You’ll design and implement next-generation infrastructure to power ScaleOps’s intelligent cloud automation engine, enabling smarter, faster, and more efficient decision-making across massive distributed systems.

This role sits at the frontier of AI + cloud infrastructure, driving the application of machine learning, scheduling intelligence, and performance optimization directly into the fabric of our platform.

What You’ll Be Doing

  • Lead research and development of advanced AI-powered strategies for resource orchestration, scheduling, and autoscaling in Kubernetes.
  • Build and validate POCs, simulations, and benchmarks that shape the future of AI-driven infrastructure management.
  • Collaborate with product, research, and engineering teams to translate cutting-edge insights into production-grade features.
  • Develop time-series forecasting models, LLM-driven optimizations, and ML-based anomaly detection to enhance cost and performance efficiency.
  • Stay at the forefront of the cloud-native, FinOps, and AI infrastructure ecosystems, bringing the latest trends and opportunities into the team's work.
  • Mentor engineers and foster a culture of innovation, exploration, and cross-team technical excellence.
  • Contribute to technical blogs, papers, and conference talks, sharing our breakthroughs with the broader community.

What You Bring

  • 7+ years of experience in engineering, infrastructure research, or large-scale distributed systems.
  • Deep knowledge of Kubernetes internals, containerization, scheduling, and autoscaling.
  • Strong coding experience in Go, Python, or a systems-level language.
  • Proven track record of researching, prototyping, and delivering innovative infrastructure or AI solutions.
  • Background in cloud cost optimization, performance engineering, or large-scale orchestration.
  • Ability to thrive in ambiguity, own technical domains, and deliver projects end-to-end.
  • Excellent communication skills, with the ability to influence technical direction and share knowledge.

Big Pluses:

  • Published research papers, patents, or open-source contributions.
  • Experience building machine learning systems (e.g., forecasting, reinforcement learning, or LLM-based optimization).
  • Background in security research or AI-driven threat detection.
  • Track record of presenting at top technical conferences.