Lead the design, implementation, and maintenance of SRE practices, including monitoring, alerting, incident management, and automation.
Develop and maintain robust monitoring solutions using tools like Dynatrace and Grafana.
Automate manual processes and build self-healing systems using scripting languages such as PowerShell, and infrastructure-as-code tools such as Terraform, along with YAML-based configuration.
Provide 24/7 on-call support and participate in incident resolution.
Identify and address performance bottlenecks and scalability challenges.
Collaborate with development, operations, and security teams to improve system reliability and security.
Document standard operating procedures (SOPs) and best practices.
Lead and mentor a team of SRE engineers.
Drive the adoption of DevOps principles and CI/CD pipelines.
Establish SLAs for incident response and track performance against those goals.
Job Classification
Industry: IT Services & Consulting
Functional Area / Department: Engineering - Software & QA
Role Category: DevOps
Role: Site Reliability Engineer
Employment Type: Full time