DevJobs

Site Reliability Engineer

Skills
  • Azure DevOps
  • GitHub Actions
  • AWS
  • Kubernetes
  • Terraform
  • Grafana
  • Infrastructure as Code
  • ArgoCD
  • DataDog
  • Prometheus

Senior SRE


Company Profile:


AppCard Inc. is a technology and marketing company headquartered in Manhattan, NY. AppCard offers a powerful marketing tool that uses data acquired at the point of sale (POS) through an advanced rewards program to create retargeting campaigns that help businesses increase their bottom line. AppCard is unique in the loyalty space because of its patented technology, which allows businesses to capture shopper identity and item-level data in real time from purchases made in store and online. The benefit is twofold. Consumers receive offers, incentives, and coupons. Through shoppers’ interactions with these offers, AppCard’s platform records and learns shopper behavior, giving grocers the ability to make their data actionable, increase average basket size, and systematically increase repeat purchases.


About the role:


At AppCard, we power AI-driven customer loyalty and marketing solutions, processing millions of daily transactions. We are looking for a Senior Site Reliability Engineer (SRE) to help establish and lead SRE best practices within our CloudOps team. This is a hands-on role for an experienced professional who can maintain and optimize cloud production systems, ensure real-time monitoring, and implement robust security measures.


What you’ll do:


  • Maintain and optimize cloud production systems, ensuring high availability, reliability, and performance through robust monitoring, alerting, and backup and restore mechanisms


  • Implement and manage security tools (must-have experience), proactively identifying vulnerabilities and enforcing security best practices to protect infrastructure and data


  • Automate operations and build self-healing systems using Infrastructure as Code (IaC) tools like Terraform and CrossPlane, reducing manual effort and improving system efficiency


  • Enhance observability and monitoring with DataDog, Prometheus, and Grafana, developing intelligent alerting and anomaly detection to ensure system health and uptime


  • Lead incident response and troubleshooting, ensuring swift recovery, conducting Root Cause Analysis (RCA), and driving continuous improvements in system stability


  • Optimize and automate CI/CD pipelines with GitHub Actions, Azure DevOps, or ArgoCD, enabling smooth, reliable, and efficient deployments


  • Strengthen system resilience and scalability, applying chaos engineering principles and designing architectures that support high-scale, production-critical environments



What you have:


  • 5+ years of experience in SRE, DevOps, or Cloud Engineering, with a strong track record in managing and optimizing cloud infrastructure


  • Proven ability to maintain and enhance cloud production environments, ensuring high availability and performance (AWS preferred)


  • Expertise in security tools and best practices (mandatory), proactively identifying and mitigating vulnerabilities to protect infrastructure


  • Strong experience in monitoring, alerting, and backup and restore, ensuring system reliability, quick incident resolution, and data protection


  • Hands-on proficiency with Terraform, Kubernetes, and Infrastructure as Code (IaC) to automate deployments and infrastructure management


  • Deep knowledge of incident response, troubleshooting, and system performance tuning, with the ability to diagnose and resolve complex issues efficiently


  • Strong problem-solving mindset, thriving in fast-paced, production-critical environments, with the ability to balance operational stability and innovation


  • Creative problem-solver with an innovative mindset


  • Self-motivated team player and fast learner


  • Fluent in English
