In-Office Position
POSITION SUMMARY:
The Senior Systems Engineer is a hands-on senior individual contributor responsible for designing, building, and operating TRISTAR’s core infrastructure platform with a strong emphasis on Linux systems, Kubernetes, and automation. This role will own the Kubernetes platform end-to-end—cluster build, lifecycle management, operational standards, reliability, and day-2 operations—while partnering closely with development teams as TRISTAR transitions toward a DevOps operating model. Success in this role requires deep technical ownership, strong troubleshooting skills across distributed systems, and the ability to improve reliability through thoughtful design, observability, and repeatable automation.
ESSENTIAL DUTIES AND RESPONSIBILITIES:
Kubernetes Platform Engineering & Lifecycle:
• Design, build, and operate Kubernetes clusters in production, including upgrades,
patching, scaling, and reliability improvements.
• Establish platform standards and operating practices as the environment matures
(cluster configuration, access patterns, resource governance, and runbooks).
• Serve as the senior escalation point for Kubernetes platform issues and drive resolution
through root-cause analysis and prevention.
Kubernetes Storage, Backup/Restore & Disaster Recovery:
• Design and implement Kubernetes storage patterns (StorageClasses, PV/PVC lifecycle,
capacity planning) and support stateful workloads.
• Implement, test, and maintain Kubernetes-native backup/restore and recovery
procedures.
• Integrate Kubernetes persistence needs with enterprise storage platforms, including Dell
ObjectScale and existing virtualization/storage systems.
Ingress, Load Balancing & Kubernetes Networking:
• Own Kubernetes traffic entry, including ingress controllers, load balancers, routing
patterns, and TLS/certificate handling.
• Define repeatable patterns for exposing services and troubleshooting connectivity across
platform components.
Linux Systems Engineering:
• Administer and harden Linux systems that support the platform, including patching,
performance tuning, service reliability, logging, and baseline configuration.
• Troubleshoot system and platform issues across compute, storage, and network
dependencies.
Automation, Scripting & API Integrations:
• Build automation to reduce manual work and increase consistency across infrastructure
operations using Python/PowerShell/Bash and API-driven workflows.
• Evaluate, recommend, and help implement an automation/configuration management
approach (tooling, patterns, and standards) to support repeatable tasks such as
provisioning, configuration enforcement, patching, drift detection, and validation.
• Develop reusable automation assets (modules/playbooks/templates/scripts) and
establish version-controlled workflows (Git), documentation, and operational handoff
practices.
• Leverage RESTful APIs to integrate systems and create operational workflows (health
checks, reporting, event-driven automations, and change validation).
Monitoring, Alert Response & Operational Reporting:
• Monitor alert sources and observability tooling (including SolarWinds on-prem),
investigate events, and drive issues to completion.
• Document incidents, actions taken, and final resolutions, and contribute to improved
alerting quality and operational visibility.
Data Center Support (Occasional):
• Provide occasional on-site support as needed in the data center for infrastructure prep
and troubleshooting (racking equipment, cabling, and physical connectivity verification).
• Maintain working familiarity with server hardware and data center best practices to
support rare hands-on needs.
Cloud Readiness & Future-State Hosting:
• Partner with development and infrastructure teams to plan and progress TRISTAR’s
long-term transition toward cloud-hosted deployments of the application stack.
• Contribute to cloud design discussions with a practical understanding of core cloud
concepts (networking, identity/access, security, reliability, scalability, and cost
considerations) across major providers (AWS/Azure/GCP).
• Translate application and platform requirements into cloud-ready operational patterns
(container orchestration in cloud, managed services vs self-managed tradeoffs,
environment isolation per client, and deployment repeatability).
• Support early-stage cloud initiatives such as proofs of concept, reference architectures,
and migration planning, including identifying skill/tooling gaps and recommending
realistic next steps.
• Apply Infrastructure-as-Code and automation principles to cloud readiness efforts to
ensure future deployments are repeatable, supportable, and auditable.
Documentation & Technical Standards:
• Create and maintain IT documentation, including platform runbooks, operational
procedures, and architecture/standards documentation.
Collaboration, Service Desk Support & Cross-Team Execution:
• Work with the Manager, Network Services and general IT staff to analyze and resolve
technical issues affecting infrastructure and applications.
• Partner closely with development teams as part of TRISTAR’s DevOps transition to
improve operability, deployment reliability, and platform usability.
• Work alongside the service desk to remedy end-user workstation issues; backfill and
answer service desk calls when required.
Schedule Flexibility & Travel:
• Perform night/day/weekend work as required to meet project objectives and support
maintenance windows.
• Travel to remote sites is rare but may be required as needed.
QUALIFICATIONS REQUIRED:
Education/Experience: Bachelor’s degree in a related field (preferred); minimum of 7 years of
related experience; or equivalent combination of education and experience.
Knowledge, Skills, and Abilities:
• 7+ years of progressively responsible experience in systems/infrastructure engineering
with strong production experience in Linux administration.
• Hands-on production experience with Kubernetes, including cluster build and lifecycle
management (architecture, upgrades, patching, scaling, troubleshooting).
• Strong understanding of Kubernetes storage and stateful workload operations, including
troubleshooting PV/PVC and storage provisioning patterns.
• Experience implementing Kubernetes-native backup/restore practices and validating
recovery procedures.
• Demonstrated automation experience using scripting (Python/PowerShell/Bash) and
leveraging RESTful APIs for systems integration and automation.
• Experience with monitoring/observability platforms and operational alerting; SolarWinds
experience strongly preferred.
• Strong troubleshooting skills across distributed systems, networking fundamentals, and
infrastructure dependencies.
• Strong written and verbal communication skills, including the ability to write clear
documentation, runbooks, and standards.
EQUIPMENT OPERATED/USED: Computer, 10-key, printer, copier, fax machine, and other
office equipment.
SPECIAL EQUIPMENT OR CLOTHING: Appropriate office attire.