Senior System Development Engineer, EC2 Nitro
Amazon.com
Join the EC2 Nitro Machine Learning Systems team at Amazon Web Services (AWS) as a System Development Engineer III and lead the development of operational visibility and tooling for EC2 supercomputer instance families. In this role, you'll apply your specialized knowledge of distributed systems to improve system automation and operational tooling spanning the infrastructure that hosts EC2 instances and the back-end control plane infrastructure.
This position offers a unique opportunity to work at the intersection of high-performance computing and machine learning infrastructure. You'll apply operations best practices at scale while developing tools and systems that enhance visibility, maintenance, and operations of customer-facing supercomputer instance types. Your work will directly impact how AWS customers leverage compute resources for their most demanding machine learning workloads.
Key job responsibilities
- Design and implement robust operational visibility solutions and tooling for EC2 supercomputer instance families, focusing on system reliability, performance optimization, and scalability across complex infrastructure
- Lead projects that require collaboration across multiple engineering teams to improve maintenance practices and operational efficiency for customer-facing supercomputer instance types
- Develop technical solutions for complex problems involving Nitro systems, considering multiple risks and roadblocks while keeping solutions as simple as possible
- Build and maintain high-quality systems by adopting best practices, owning operational metrics, and understanding the long-term impact on customer experience
- Balance speed of delivery with building a foundation for the future, identifying critical technical decisions and advocating for solutions that prioritize long-term software quality and maintainability
About the team
The EC2 Nitro Machine Learning Systems team is responsible for development, operations, and maintenance of scale-out machine learning platforms used for training and inference workloads. We build and optimize the infrastructure that powers some of the most computationally intensive AI/ML workloads in the cloud. Our team is passionate about creating reliable, high-performance systems that enable customers to push the boundaries of what's possible with machine learning.
Working with us means having the opportunity to influence the future of supercomputing in the cloud while solving complex technical challenges at massive scale. We collaborate closely with customers and internal teams to continuously improve our platforms and deliver innovations that accelerate machine learning workflows.