Act as the highest-level technical authority for critical issues impacting AI/ML SaaS products.
Develop a detailed technical understanding of AI applications and services, and be equipped to resolve issues hands-on in any area.
Perform deep-dive diagnostics across microservices, APIs, AI/ML pipelines, and Azure resources.
Lead root cause analysis (RCA), create incident/problem records, and implement permanent fixes.
Ensure adherence to SLAs and SLOs for enterprise customers.
Interface with partners, professional services, and customers during escalations; lead debugging and provide resolution plans.
Support and troubleshoot AI/ML inference services, model deployments, and data pipelines.
Work with Azure Machine Learning, Azure Kubernetes Service (AKS), and Azure Data/Storage services.
Collaborate with data scientists and engineering teams on performance optimization and retraining impacts. Feed field issues into the product backlog.
Implement and fine-tune monitoring using Datadog, Elasticsearch/Kibana, and Azure Monitor.
Set up and maintain alerting for anomaly detection in AI/ML workloads.
Partner with engineering, DevOps, and SRE teams to improve architecture and prevent recurring incidents.
Create and maintain detailed runbooks, playbooks, and knowledge base articles.
Provide mentoring and technical guidance to junior engineers and operational teams.
Required Skills & Experience
Development and troubleshooting skills on the Microsoft platform, with expertise in C#, ASP.NET, MVC, SQL, jQuery, stored procedures, and Azure. Python skills are an added advantage.
Database: SQL debugging, query tuning; exposure to Cosmos DB or PostgreSQL preferred.
Cloud Infra: Deep hands-on expertise in Microsoft Azure (AKS, App Services, Functions, Storage, Networking), CI/CD pipelines.
AI/ML: Understanding of ML model deployment, inference pipelines, vector stores, RAG, and GPU/CPU optimization.
Observability & Logging: Proficiency with Datadog, Elasticsearch/Kibana, OpenTelemetry, and Azure Monitor.
Service Management: Strong knowledge of ITSM processes (incident, problem, change).
Experience: 12-15 years in technical architect/engineering roles, with at least 2 years in senior technical leadership positions in a cloud/SaaS environment.
Qualifications
B.E./B.Tech/MCA or equivalent degree. Certifications like Microsoft Azure Solutions Architect, Azure DevOps Engineer, or Datadog Observability are a plus.
Desired Attributes
Excellent problem-solving and incident management skills under pressure.
Strong customer communication for enterprise B2B clients.
Ability to collaborate with AI engineers, data scientists, and SRE/DevOps teams.
Passion for automation and continuous improvement.
Job Classification
Industry: IT Services & Consulting
Functional Area / Department: Engineering - Software & QA
Role Category: Software Development
Role: Technical Architect
Employment Type: Full time