Senior Site Reliability Engineer, DGX Cloud @ Nvidia



Nvidia
10 - 15 years
Kolkata
8 months ago

Job Description

Build, implement and support operational and reliability aspects of large-scale Kubernetes clusters with a focus on performance at scale, real-time monitoring, logging and alerting
Define SLOs/SLIs, monitor error budgets, and streamline reporting
Support services before they launch through system design consulting, developing software tools, platforms and frameworks, capacity management, and launch reviews
Maintain services once they are live by measuring and monitoring availability, latency and overall system health
Operate and optimize GPU workloads across AWS, GCP, Azure, OCI, and private clouds
Scale systems sustainably through mechanisms like automation and evolve systems by pushing for changes that improve reliability and velocity
Lead triage and root-cause analysis of high-severity incidents
Practice balanced incident response and blameless postmortems
Participate in on-call rotation to support production services

What we need to see:

BS in Computer Science or related technical field, or equivalent experience
10+ years of experience operating production services
Expert-level knowledge of Kubernetes administration, containerization, and microservices architecture
Experience with infrastructure automation tools (e.g., Terraform, Ansible, Chef, Puppet)
Proficiency in at least one high-level programming language (e.g., Python, Go)
In-depth knowledge of Linux operating systems, networking fundamentals (TCP/IP), and cloud security standards
Strong knowledge of SRE principles, including SLOs, SLIs, error budgets, and incident handling
Experience building and operating comprehensive observability stacks (monitoring, logging, tracing) using tools like OpenTelemetry, Prometheus, Grafana, ELK Stack, Lightstep, Splunk, etc.

Ways to stand out from the crowd:

Operating GPU-accelerated clusters with KubeVirt in production
Applying generative-AI techniques to reduce operational toil
Automating incidents with Shoreline or StackStorm

Job Classification

Industry: Electronic Components / Semiconductors
Functional Area / Department: Engineering - Software & QA
Role Category: DevOps
Role: Site Reliability Engineer
Employment Type: Full time

Contact Details:

Company: Nvidia
Location(s): Kolkata


Key skills: Computer science, Capacity management, Networking, Linux, GCP, Consulting, Support services, Gaming, Operations, Python


₹ Not Disclosed


