Plano, TX, United States
17 hours ago
Lead Site Reliability Engineer

As a Production Support Lead, your commitment to innovation is crucial to maintaining and enhancing our company's operations. In this role, you will be instrumental in developing AI and machine learning solutions aimed at troubleshooting application issues, including identifying, escalating, and resolving incidents. Collaborating with Infrastructure Service Support team members, you will engage in root cause analysis, production changes, budgetary considerations, and staffing challenges. Your experience will be vital in managing and mentoring team members to drive strategic change, both within your team and in partnership with colleagues across JPMorgan Chase & Co.'s global network of innovators.

Key Responsibilities:

- Develop and support AI/ML solutions for troubleshooting and incident resolution.
- Collaborate with cross-functional teams to perform root cause analysis and implement production changes.
- Mentor and guide team members to foster innovation and strategic change.
- Coordinate incident management coverage to ensure effective resolution of application issues.

Required Skills and Capabilities:

- Expertise in application development and support with multiple technologies and design techniques.
- Experience in developing AI/ML solutions using public cloud architecture, specifically Azure and AWS.
- Advanced proficiency in Python for AI/ML modeling.
- Strong skills in automation and continuous delivery methods.
- Comprehensive understanding of the Software Development Life Cycle.
- Familiarity with agile methodologies, including CI/CD, application resiliency, and security.
- Experience in implementing GenAI services using Azure OpenAI models and AWS Bedrock service.
- Hands-on experience in system design, application development, testing, and operational stability.
- Knowledge of cloud platforms like AWS and Pivotal Cloud Foundry.
- Understanding of network topologies, load balancing, and content delivery networks.
- Familiarity with web and mobile application development.
- Awareness of risk controls and compliance with departmental and company-wide standards.
- Ability to work collaboratively in teams and build meaningful relationships to achieve common goals.
- Proficiency in running production incident conferences and managing incident resolution.

Incident Management:

- Coordinate incident management coverage to ensure appropriate response.
- Facilitate and coordinate communications during critical outage situations.
- Document calls, manage queues, analyze tickets, and interface with impacted lines of business for incident impact analysis.
- Serve as a single voice for our line of business and cross-line business incidents.
- Maintain an end-to-end view of issues for objectivity.
- Influence senior technology leads across organizations to ensure timely resolution of incidents.