MLOps Engineering
Experience operationalizing and managing ML/AI workloads in production environments
Distributed Tracing & Observability
Strong understanding of, and hands-on experience implementing, metrics, logs, and traces (the three pillars of observability)
Monitoring & Alerting
Production experience building Grafana dashboards and actionable alert systems, with the understanding that dashboards without alerts have little operational value
Azure Databricks Operations
Cluster management, performance optimization, timeout resolution, library troubleshooting, and compute issue resolution
Azure Cloud Services
Deep knowledge of Azure PaaS, AKS, cloud-native architectures, and the Azure monitoring and diagnostics ecosystem
Good-to-Have Skills
GCP Experience
Exposure to Google Cloud Platform services and telemetry collection
Multi-Cloud Operations: Experience across Azure, GCP, or AWS environments
Apache Airflow: Workflow orchestration experience (basic level acceptable; can be learned on the job)
Python/Scripting: Automation and scripting proficiency
MLOps Knowledge: Understanding of ML lifecycle management and MLOps practices
Technology Stack
Primary Cloud: Microsoft Azure
Key Platforms: Azure Databricks, Azure Kubernetes Service (AKS), Azure PaaS services
Observability: Grafana, distributed tracing tools, metrics/logs/traces platforms
Orchestration: Apache Airflow (basic usage)
Secondary Cloud: GCP services (limited scope)
Key Responsibilities
Design and implement comprehensive observability solutions using metrics, logs, and distributed traces
Build unified Grafana dashboards for single-pane-of-glass visibility across multi-cloud environments
Establish actionable alerting frameworks that drive incident response
Implement distributed tracing for AI/ML workloads and microservices
Proactively identify and remediate performance bottlenecks
Monitor, troubleshoot, and optimize Azure Databricks compute environments
Right-size clusters and resolve performance issues (timeouts, long-running jobs, library failures)
Build observability layers where current gaps exist
Manage and optimize AKS workloads and Azure PaaS offerings
Collect telemetry from Azure and GCP services and pipe it to the observability stack
Integrate diverse cloud services into unified monitoring infrastructure
Implement logging, metrics collection, and tracing across heterogeneous environments
Ensure comprehensive visibility across the entire technology stack
Create and manage support cases with Databricks and Microsoft
Provide technical support for AI/ML workloads on cloud infrastructure
Research and implement solutions for unfamiliar technologies

Keyskills: Grafana, Azure, Kubernetes, ML, MLOps, Kubeflow, AKS, ML Deployment, ML Pipelines, MLflow
CitiusTech is a specialist provider of healthcare technology services and solutions, with a strong presence across the globe. As a strategic partner to some of the world's largest healthcare organizations, CitiusTech plays a deep and meaningful role in accelerating technology innovation and shaping th...