Position summary
The Adobe Document Cloud Site Reliability team is responsible for delivering a scalable, reliable and secure computing environment to support the millions of transactions that happen every day. We are looking to expand our Site Reliability Engineering team as we embark on a new phase of growth for our product. We are a metrics-driven organization that strives to deliver world-class service both externally and internally. The team strongly believes in the DevOps methodology and works very closely with our peers on the development team.
Responsibilities
Implement and support Adobe Document Cloud hosted web applications, virtual machines, databases, storage systems, and service buses in cloud deployments by working with engineering organizations in support of development and test functions
Identify, implement and support application monitoring solutions for supported applications
Troubleshoot and solve complex problems
Support various UNIX-based services to ensure maximum uptime, performance and security
Assist in the creation and refinement of operational documentation
Use your expertise to support your fellow team members
Analyze performance trends across a variety of systems for capacity planning
Work closely with engineering and QA teams to roll out new products and services
Handle day-to-day system administration tasks such as account management, patching, application deployment, system installations, and other routine maintenance
Own and enforce security compliance processes and controls
Programmatically automate routine cloud deployment, administration, and monitoring tasks
Participate in 24x7 on-call pager rotation
Requirements
8+ years of experience in production (web-facing) Linux, Solaris or *BSD environments at medium to large scale
Deep experience with AWS and Azure, including migrating services to these platforms
Ability and determination to solve complex system/application problems
Relentless approach to getting to the bottom of any problem
Experience with MySQL, Java, Apache, & Tomcat
Experience with configuration management tools like Chef, Puppet or CFEngine
Experience with containerization using Docker and Kubernetes (EKS/AKS)
Experience with CI/CD using Jenkins and Groovy DSL
Familiarity with Prometheus, Cortex, Grafana, New Relic, Datadog, and Splunk
Knowledge of key protocols including TCP/IP, SSH, DNS, SMTP, SNMP, SSL, HTTP and LDAP
Experience with different caching architectures
Knowledge of security compliance frameworks, such as SOC 2, PCI, HIPAA, ISO 27001 and FedRAMP
Strong programming skills, particularly with Python, Java, and Go
Knowledge of well-known open source tools for monitoring, trending and configuration management
A desire to provide a reliable, secure and scalable environment that supports millions of users
Ability to architect and help create a highly automated environment
Participate in the incident management process
Manage our uptime and performance using service level indicators and objectives
Excellent verbal and written communication skills
Self-driven, eager to get things done
Employment Category:
Employment Type: Full time
Industry: Full time
Functional Area: Not Applicable
Role Category: IT
Role/Responsibilities: Sr. Site Reliability Engineer, Doc Cloud