What does Dataloop do?
Dataloop AI is on a mission to provide an all-in-one platform for AI/GenAI lifecycle management, specializing in unstructured data. Our platform offers advanced data management, annotation tools, MLOps workflows, and data pipelines for seamless development and production. Dataloop empowers organizations with a unique ML applications marketplace to easily deploy AI/GenAI solutions.
About the position:
We’re growing, and we need experienced DevOps engineers to help us grow faster and bigger.
We expect this position to be filled by someone who is ready to tackle large production systems and wants to learn what it takes to scale a system, manage it, maintain it, and keep it operational at all times.
This includes, but is not limited to, suggesting, planning, and executing tasks that help achieve this goal.
Key Responsibilities:
- Cloud Infrastructure Management: Provision, manage, and optimize resources across Azure (preferred), AWS, and GCP, ensuring high availability, cost-efficiency, and performance.
- Kubernetes Orchestration: Architect and operate advanced Kubernetes environments — both managed (AKS, EKS, GKE) and on-prem (RKE, Rancher, OpenShift).
- CI/CD Engineering: Develop and manage robust CI/CD pipelines using Bitbucket Pipelines, Argo Workflows, and Jenkins, automating build-test-deploy workflows.
- Monitoring & Logging: Implement and manage monitoring systems with Prometheus, Grafana, and centralized logging with ELK/EFK stacks.
- Infrastructure Automation: Leverage Terraform, Ansible, and scripting (Python/Bash) to build and manage infrastructure as code (IaC).
- On-Premise and Air-Gapped Deployments: Architect and support isolated environments using local DNS (PowerDNS), registries (Harbor), GitOps, and secure deployment practices.
- Security & Compliance: Implement IAM policies, secrets management (Vault), encryption, and secure software delivery pipelines.
- Documentation & System Design: Author detailed technical documentation including architectural blueprints, SOPs, and disaster recovery plans.
- Collaboration & Mentorship: Work cross-functionally with developers, product, and QA teams; mentor junior DevOps engineers; and drive a culture of excellence.
- Customer Engagement: Participate in technical discussions and workshops with enterprise clients to support onboarding and production success.
Minimum Qualifications:
- 5+ years of experience in a DevOps, Site Reliability, or Platform Engineering role.
- Strong command over Linux system administration, cloud networking, and container orchestration.
- Proven experience with Azure, AWS, and GCP cloud services, with Azure being a strong preference.
- Advanced skills in Kubernetes, with expertise in OpenShift, RKE, and Rancher.
- Familiarity with GitOps, Bitbucket Pipelines, Helm, and ArgoCD.
- Experience with observability using Prometheus, Grafana, Elasticsearch, Fluentd/Filebeat, and Kibana.
- Hands-on expertise with Terraform, Ansible, and scripting languages like Bash or Python.
- Knowledge of secure deployment practices, disaster recovery, and high availability designs.
Nice to Have:
- Red Hat, Kubernetes, or cloud certifications (e.g., RHCA, CKA, Azure DevOps Expert).
- Experience in GPU-enabled Kubernetes clusters.
- Familiarity with DNS, image registries, or service mesh technologies.
- Exposure to hybrid infrastructure environments.
- Understanding of DevSecOps and compliance standards.