Public Cloud SRE is responsible for engineering and operating JPMC’s cloud infrastructure and platforms, ensuring reliability, resiliency, and security. We have a Senior Software Engineer, Site Reliability position open to build the infrastructure and tooling for JPMC’s Public Cloud Platform.
As a Lead Site Reliability Engineer at JPMorgan Chase within Cloud Reliability Services, you hold a leadership role in your team, demonstrate strong knowledge across multiple technical domains, and advise others on the technical and business issues facing them. You lead and conduct resiliency design reviews, break complex problems into digestible work for other engineers, act as a technical lead for medium to large-sized products, and provide advice and mentoring to other engineers.
Job responsibilities
- Engage in and improve the lifecycle of cloud services, from inception and design through deployment and operation.
- Automate repeated manual tasks, and develop tools and automation to improve the efficiency of the platform and infrastructure.
- Analyze defects, propose improvements, and drive efficiencies in systems and processes.
- Help develop new cloud engineering strategies and implementations for the firm.
- As part of Site Reliability, ensure the reliability, availability, and performance of the cloud infrastructure and platform.
- Demonstrate site reliability principles and practices every day, and champion the adoption of site reliability throughout your team.
- Develop observability and telemetry tools.
- Author and improve the quality of technical engineering documentation.
- Debug and solve issues in a production environment.
- Participate in SRE on-call rotations and escalation workflows.
Required qualifications, capabilities, and skills
- Formal training or certification in software engineering or site reliability engineering and 5+ years of applied experience.
- Bachelor’s Degree in Computer Science or equivalent.
- Deep proficiency in reliability, scalability, performance, security, enterprise system architecture, toil reduction, and other site reliability best practices, with the ability to implement these practices within an application or platform.
- Expertise in building solutions with AWS cloud services.
- Knowledge of Infrastructure as Code tools such as Terraform.
- Fluency in at least one programming language, such as Python or Java.
- Proficiency and experience in observability, such as white-box and black-box monitoring, SLO alerting, and telemetry collection using tools such as Grafana, Dynatrace, Prometheus, Datadog, and Splunk.
- Proficiency in continuous integration and continuous delivery tools (e.g., Jenkins, GitLab, Terraform).
- Experience with containers and container orchestration (e.g., ECS, Kubernetes, Docker).
- Experience troubleshooting common networking technologies and issues.
- Ability to identify and solve problems related to complex data structures and algorithms.
- Drive to self-educate and evaluate new technology.
- Ability to teach new programming languages to team members.
- Ability to expand and collaborate across different levels and stakeholder groups.
- Excellent communication skills, working with stakeholders and domain experts across the company to design solutions to user problems.
- Self-disciplined, self-managed, and self-motivated, with a strong sense of ownership, urgency, and drive.
Preferred qualifications, capabilities, and skills
- AWS certifications will be a bonus.