Senior Site Reliability Engineer, DGX Cloud @ NVIDIA

Job Description

 
  • Build, implement, and support the operational and reliability aspects of large-scale Kubernetes clusters, with a focus on performance at scale, real-time monitoring, logging, and alerting
  • Define SLOs/SLIs, monitor error budgets, and streamline reporting
  • Support services before they launch through system design consulting, development of software tools, platforms, and frameworks, capacity management, and launch reviews
  • Maintain services once they are live by measuring and monitoring availability, latency and overall system health
  • Operate and optimize GPU workloads across AWS, GCP, Azure, OCI, and private clouds
  • Scale systems sustainably through mechanisms like automation, and evolve systems by pushing for changes that improve reliability and velocity
  • Lead triage and root-cause analysis of high-severity incidents
  • Practice balanced incident response and blameless postmortems
  • Participate in on-call rotation to support production services
What we need to see:
  • BS in Computer Science or related technical field, or equivalent experience
  • 10+ years of experience operating production services
  • Expert-level knowledge of Kubernetes administration, containerization, and microservices architecture
  • Experience with infrastructure automation tools (e.g., Terraform, Ansible, Chef, Puppet)
  • Proficiency in at least one high-level programming language (e.g., Python, Go)
  • In-depth knowledge of Linux operating systems, networking fundamentals (TCP/IP), and cloud security standards
  • Strong knowledge of SRE principles, including SLOs, SLIs, error budgets, and incident handling
  • Experience building and operating comprehensive observability stacks (monitoring, logging, tracing) using tools like OpenTelemetry, Prometheus, Grafana, ELK Stack, Lightstep, Splunk, etc.
Ways to stand out from the crowd:
  • Operating GPU-accelerated clusters with KubeVirt in production
  • Applying generative-AI techniques to reduce operational toil
  • Automating incident response with Shoreline or StackStorm

Job Classification

Industry: Electronic Components / Semiconductors
Functional Area / Department: Engineering - Software & QA
Role Category: DevOps
Role: Site Reliability Engineer
Employment Type: Full time

Contact Details:

Company: Nvidia
Location(s): Kolkata


Keyskills: Computer science, Capacity management, Networking, Linux, GCP, Consulting, Support services, Gaming, Operations, Python

₹ Not Disclosed
