We’re looking for a hands-on Observability & Monitoring Engineer who will own the visibility, health, and reliability of Kela’s production environments. This role is all about designing, building, and maintaining Kela’s monitoring and alerting stack, enabling the organization to detect, respond to, and prevent issues before they impact our customers.
Observability & Monitoring Stack Ownership:
- Design, implement, and maintain our observability stack, covering metrics, logs, and traces across all production sites and components (physical and software layers).
- Work with tools like Prometheus, Grafana, ELK, or similar.
- Ensure clear, accessible dashboards for both internal and customer-facing stakeholders, showing site/component health, uptime, and anomalies.
- Define and continuously tune alert thresholds, escalation paths, and severity levels.
- Reduce alert noise and focus on actionable signals.
- Ensure all critical services have effective monitoring coverage (availability, performance, resource usage, errors, etc.).
- Partner with R&D to define, request, and validate telemetry data: logs, custom metrics, traces, etc.
- Advocate for observability best practices in product and feature development.
- Influence logging and monitoring standards across engineering teams.
- Automate routine monitoring tasks, health checks, and anomaly detection scripts.
- Drive self-healing initiatives: build tools and automation for faster incident mitigation.
- Create and maintain runbooks for alert response, ensuring that Support and Delivery teams have clear operational guidance.
- Contribute to incident post-mortems and help drive continuous improvement based on lessons learned.
- Participate in incident response and on-call rotations (where applicable).
- Support real-time production incident bridges with data-driven analysis from the observability stack.
Must Have:
- 5+ years of experience in Observability, SRE, Production Engineering, or DevOps roles focused on monitoring and system reliability.
- Deep hands-on experience with monitoring tools like Datadog, Prometheus, Grafana, ELK, or equivalents.
- Strong experience with Linux systems and networking basics (HTTP, DNS, firewalls, proxies).
- Experience with Kubernetes, microservices architecture, or multi-cluster environments.
- Background working in hybrid environments (on-prem + cloud).
- Prior experience implementing self-healing automation in production environments.
- Proven track record in alerting design, threshold tuning, and incident detection at scale.
- Experience with log pipelines, metrics collection frameworks, and distributed tracing tools.
- Solid scripting and automation skills (Python, Bash, etc.).
- Experience participating in incident response processes, on-call rotations, and root cause analysis.
- Excellent communication skills: able to explain system status and health metrics to both engineers and non-technical stakeholders.