Observability & Monitoring Engineer

No longer accepting applications

Overview

Job Type: Hybrid

Experience: 5 years

Job Position: Cloud/DevOps

Updated: Aug 10, 2025

Location: Tel Aviv District

Salary: N/A

Skills

Bash
Python
Elasticsearch
Linux
Microservices
Kubernetes
Grafana
Datadog
Distributed tracing tools
Log pipelines
Metrics collection frameworks
Networking basics
Prometheus

We’re looking for a hands-on Observability & Monitoring Engineer to own the visibility, health, and reliability of Kela’s production environments. The role centers on designing, building, and maintaining Kela’s monitoring and alerting stack so that the organization can detect, respond to, and prevent issues before they impact our customers.

Observability & Monitoring Stack Ownership:

Design, implement, and maintain our observability stack, covering metrics, logs, and traces across all production sites and components (physical and software layers).
Work with tools like Prometheus, Grafana, ELK, or similar.
Ensure clear, accessible dashboards for both internal and customer-facing stakeholders, showing site/component health, uptime, and anomalies.
Define and continuously tune alert thresholds, escalation paths, and severity levels.
Reduce alert noise and focus on actionable signals.
Ensure all critical services have effective monitoring coverage (availability, performance, resource usage, errors, etc.).
Partner with R&D to define, request, and validate telemetry data: logs, custom metrics, traces, etc.
Advocate for observability best practices in product and feature development.
Influence logging and monitoring standards across engineering teams.
Automate routine monitoring tasks, health checks, and anomaly detection scripts.
Drive self-healing initiatives: build tools and automation for faster incident mitigation.
Create and maintain runbooks for alert response, ensuring that Support and Delivery teams have clear operational guidance.
Contribute to incident post-mortems and help drive continuous improvement based on lessons learned.
Participate in incident response and on-call rotations (where applicable).
Support real-time production incident bridges with data-driven analysis from the observability stack.

Must Have:

5+ years of experience in Observability, SRE, Production Engineering, or DevOps roles focused on monitoring and system reliability.
Deep hands-on experience with monitoring tools like Datadog, Prometheus, Grafana, ELK, or equivalents.
Strong experience with Linux systems and networking basics (HTTP, DNS, firewalls, proxies).
Experience with Kubernetes, microservices architecture, or multi-cluster environments.
Background working in hybrid environments (on-prem + cloud).
Prior experience implementing self-healing automation in production environments.
Proven track record in alerting design, threshold tuning, and incident detection at scale.
Experience with log pipelines, metrics collection frameworks, and distributed tracing tools.
Solid scripting and automation skills (Python, Bash, etc.).
Experience participating in incident response processes, on-call rotations, and root cause analysis.
Excellent communication skills: able to explain system status and health metrics to both engineers and non-technical stakeholders.
