Plano, TX, USA
3 days ago
Site Reliability Engineer

DESCRIPTION:

Duties: Responsible for the availability, performance, change management, monitoring, and capacity management of US Private Bank services/products. Create individualized Service Level Objectives and Service Level Indicators, dashboards, and observability solutions customized to the needs of the Product. Drive adoption of self-healing and resiliency patterns. Contribute to product or software in order to automate manual operational work. Troubleshoot priority incidents, facilitate blameless post-mortems, and support solutions for closure. Apply analytics to past data, such as incidents and usage patterns, to predict issues and take proactive action. Define and drive adoption of a best-in-class monitoring framework to accomplish end-to-end application or service monitoring and noiseless alerting. Deploy sustainable software, system, and product upgrades.

QUALIFICATIONS:

Minimum education and experience required: Master's degree in Management Science, Computer Science, any Engineering discipline, Mathematics and Sciences, Data Sciences, or related field of study plus 3 years of experience in the job offered or as Site Reliability Engineer III, Data Engineer, or related occupation. The employer will alternatively accept a Bachelor's degree in Management Science, Computer Science, any Engineering discipline, Mathematics and Sciences, Data Sciences, or related field of study plus 5 years of experience in the job offered or as Site Reliability Engineer III, Data Engineer, or related occupation.

Skills Required: This position requires 1 year of experience with the following: using Ansible to execute event-driven automation activities.

This position requires 2 years of experience with the following: managing and troubleshooting applications and services deployed across various environments, comprising: Physical Environments, Virtual Environments (VMWare or RedHat OpenShift), On-Premises Containerized Deployments in Private Cloud using Cloud Foundry, and Public Cloud Environments (AWS or Azure); programming with Python; working with Cloud, API, Event-Driven, and Micro-services technologies for application environments deployed on infrastructure comprising at least 100 computer cores (CPUs) and SAN storage; developing and managing infrastructure as code (IaC) using Terraform.

This position requires any amount of experience with the following: designing and maintaining CD pipelines using Spinnaker and Harness; automating release workflows, managing canary and blue-green deployments, and ensuring zero-downtime rollouts across the environment using Jenkins Pipeline and Harness; configuring and optimizing Jenkins for CI/CD workflows; conducting chaos engineering experiments using Gremlin; monitoring and optimizing application performance using Dynatrace; setting up and configuring Grafana for data visualization and monitoring; using Splunk for log management and analysis; utilizing SourceGraph for comprehensive code search and repository navigation; utilizing Python to collect, index, and analyze application data for issue troubleshooting and ensuring security and compliance; automating provisioning, implementing modular code practices, and ensuring compliance with security and governance standards using Ansible and Python; automating the building, testing, and deployment processes, integration with various development tools, and management of application development using Python and Terraform; simulating application failure scenarios to test system resilience, identify weaknesses, and improve overall reliability and fault tolerance using Gremlin and FIS; implementing end-to-end monitoring, analyzing performance metrics, and troubleshooting issues to ensure high availability and optimal performance; creating custom dashboards and alerting and integrating dashboards with various data sources; facilitating quick code discovery; tracking dependencies; performing impact assessments; defining and monitoring Service Level Indicators (SLIs); establishing Service Level Objectives (SLOs); managing Service Level Agreements (SLAs); utilizing error budgets to balance innovation and reliability; making data-driven decisions on when to prioritize feature development versus system stability improvements.

Job Location: 8181 Communications Pkwy, Plano, TX 75024
