Job Description
As a Site Reliability Engineer - Application Support, you will:
- Ensure System Reliability & Availability: Monitor, troubleshoot, and maintain critical backend applications and infrastructure to meet SLA/SLO targets and ensure high availability of trading platforms
- Implement SRE Best Practices: Design and implement monitoring, alerting, and observability solutions using tools like Grafana, Dynatrace, and Elasticsearch to proactively identify and resolve issues
- Automate Operations: Develop automation scripts and tools using Linux shell scripting and Python to reduce manual intervention, improve system efficiency, and eliminate toil
- Manage Cloud Infrastructure: Work with AWS services and Terraform to provision, manage, and optimize cloud infrastructure while ensuring cost efficiency and security
- Container Orchestration: Manage and troubleshoot Kubernetes clusters and deployments, ensuring optimal performance and resource utilization
- Incident Response & Management: Participate in on-call rotations, lead incident response efforts, perform root cause analysis, and implement preventive measures to reduce recurrence
- Performance Optimization: Conduct performance testing, capacity planning, and load testing to ensure systems can handle peak trading hours and scale effectively
- CI/CD Pipeline Understanding: Work with CI/CD tools like GitLab Runner and Argo CD to ensure smooth and reliable deployment processes
- Database Support: Troubleshoot and optimize Redis caching layers and Oracle databases, including writing and debugging PL/SQL queries for performance tuning
- Collaboration & Documentation: Work closely with development teams to improve application reliability, create runbooks and SOPs, and maintain comprehensive technical documentation
- Continuous Improvement: Analyze system metrics, identify bottlenecks, and propose architectural improvements to enhance reliability and performance
We are looking for someone with:
- 5-7 years of hands-on experience in SRE, DevOps, or Application Support roles, preferably in high-availability production environments
- Linux Administration: Strong experience with Linux systems, proficiency in shell scripting for automation, system monitoring, and troubleshooting
- Kubernetes: Hands-on experience managing Kubernetes clusters, troubleshooting pod issues, analyzing logs, configuring deployments, and understanding networking concepts
- AWS Cloud Services: Working knowledge of AWS services (EC2, S3, RDS, Lambda, CloudWatch, ECS, etc.) with experience in troubleshooting and optimizing cloud infrastructure
- Infrastructure as Code: Experience with Terraform or similar tools for provisioning and managing cloud resources
- Monitoring & Observability: Practical experience with APM tools (Dynatrace or similar), Grafana for dashboard creation, and log analysis using Elasticsearch/Kibana
- Database Management: Experience with Redis for caching solutions and Oracle databases, including basic PL/SQL querying and performance troubleshooting
- CI/CD Tools: Familiarity with GitLab, Jenkins, Argo CD, or similar CI/CD platforms for deployment automation
- Scripting & Programming: Proficiency in shell scripting; knowledge of Python or other scripting languages is a plus
- Incident Management: Experience with ServiceNow or similar ITSM tools, and an understanding of the ITIL framework for incident, problem, and change management
- SRE Principles: Understanding of SLIs, SLOs, SLAs, error budgets, and capacity planning concepts
- Problem-Solving Skills: Strong analytical and troubleshooting abilities with attention to detail
- Communication Skills: Ability to collaborate effectively with cross-functional teams and document technical processes clearly
- Education: Bachelor's degree in Computer Science, Information Technology, or equivalent practical experience
The following would be a plus:
- Prior experience in FinTech, Banking, or Financial Services industries with understanding of regulatory compliance requirements
- Experience with containerization technologies (Docker, Podman) and container security best practices
- Knowledge of API Gateway technologies (Kong, AWS API Gateway, etc.) for managing microservices communication
- Familiarity with chaos engineering and failure injection practices
- Experience with configuration management tools (Ansible, Chef, Puppet)
- Understanding of networking concepts, load balancers, and CDN technologies
- ITIL Foundation certification or strong working knowledge of ITIL processes
- Experience with security scanning tools and implementing security best practices in DevOps pipelines
- Contributions to open-source projects or active participation in technical communities
- Experience with disaster recovery planning and business continuity processes
Job Classification
Industry: Investment Banking / Venture Capital / Private Equity
Functional Area / Department: Project & Program Management
Role Category: Technology / IT
Role: Technology / IT - Other
Employment Type: Full time
Contact Details:
Company: HDFC Securities
Location(s): Mumbai
Keyskills:
Site Reliability Engineering
Terraform
SRE
Dynatrace
Splunk
AWS
Grafana
DevOps
Kubernetes
Python