You will personally build models, write production-quality code, analyze telemetry data, and deploy solutions that enable smart autoscaling, anomaly detection, seasonality modeling, and capacity optimization across thousands of applications. This is a high-impact, execution-focused role where models are expected to run in production and influence real-time system behavior.
About the team
The Site Reliability and Engineering (SRE) group at Walmart Labs plays a vital role in maintaining the mission-critical infrastructure and services that ensure high availability and reliability for Walmart's enterprise and e-commerce platforms. Site Reliability Engineers blend systems engineering and software development expertise to deliver scalable solutions and keep platforms running efficiently. The team collaborates cross-functionally to design fail-proof systems, reduce system downtime, and improve operations using metrics, automation, and advanced tooling. The team also owns solutions for continuous profiling, tracing, policy management, and related areas.
What You’ll Do
- Build, train, and deploy time-series models for:
  - Smart and predictive autoscaling of Kubernetes workloads
  - Traffic and resource demand forecasting
  - Seasonality detection (daily/weekly/annual patterns)
  - Anomaly detection in metrics, logs, and traces
- Perform deep exploratory data analysis (EDA) on large-scale telemetry data (CPU, memory, latency, errors, throughput).
- Select, implement, and tune statistical and ML techniques (ARIMA, Prophet, tree-based models, deep learning as appropriate).
- Continuously evaluate models using offline metrics and live production feedback.
- Write production-grade Python code for model training, inference, and evaluation.
- Integrate ML outputs directly into SRE workflows, including:
  - Kubernetes HPA/VPA and custom autoscaling controllers
  - Alerting and incident detection pipelines
  - Capacity planning and cost optimization tools
- Define safeguards, fallback logic, and confidence thresholds to ensure safe autonomous actions.
- Debug model and data issues using real production incidents and postmortems.
- Build and maintain feature pipelines from observability data sources (Prometheus, OpenTelemetry, logs, traces).
- Work with streaming and batch data pipelines to process high-cardinality, high-volume time-series data.
- Ensure data quality, freshness, and correctness for real-time decision systems.
- Design schemas and feature stores optimized for time-series ML workloads.
- Own models end to end: development → deployment → monitoring → retraining.
- Implement monitoring for:
  - Model accuracy and drift
  - Data drift and pipeline failures
  - Impact on system reliability and scaling behavior
- Automate retraining and validation pipelines where appropriate.
- Act as the go-to expert for applied ML in SRE contexts.
- Review and improve ML and data science code written by other team members.
- Partner closely with SREs to translate reliability problems into concrete modeling tasks.
- Drive adoption of ML solutions by proving value through metrics and outcomes.
What You’ll Bring
Core Experience
- 12+ years of experience in data science or applied machine learning.
- 5+ years deploying ML models in production, not just experimentation.
- Strong experience working with time-series data at scale.
- Proven track record of owning systems end to end in high-availability environments.
Technical Skills (Hands-On)
- Expert-level Python (NumPy, Pandas, SciPy, Scikit-learn).
- Strong experience with time-series forecasting and anomaly detection techniques.
- Practical understanding of Kubernetes autoscaling (HPA/VPA, custom metrics).
- Experience working with metrics, logs, and traces from distributed systems.
- Comfortable querying and analyzing large datasets using SQL and time-series databases.
Systems & Cloud Understanding
- Strong understanding of distributed systems behavior (latency, load, failures, cascading effects).
- Hands-on exposure to Kubernetes, cloud platforms (AWS, GCP, Azure), and production observability stacks.
- Ability to reason about trade-offs between accuracy, latency, reliability, and safety in ML-driven automation.
Nice to Have
- Agentic systems with evaluation strategies.
- Experience building predictive or adaptive autoscaling systems.
- Knowledge of reinforcement learning or control systems for resource optimization.
- Experience with AIOps, incident prediction, or self-healing platforms.
- Familiarity with streaming ML or online learning approaches.
About Walmart Global Tech
Imagine working in an environment where one line of code can make life easier for hundreds of millions of people. That’s what we do at Walmart Global Tech. We’re a team of software engineers, data scientists, cybersecurity experts and service professionals within the world’s leading retailer who make an epic impact and are at the forefront of the next retail disruption. People are why we innovate, and people power our innovations. We are people-led and tech-empowered.
We train our team in the skillsets of the future and bring in experts like you to help us grow. We have roles for those chasing their first opportunity as well as those looking for the opportunity that will define their career. Here, you can kickstart a great career in tech, gain new skills and experience for virtually every industry, or leverage your expertise to innovate at scale, impact millions and reimagine the future of retail.
Work
Walmart’s culture sets us apart, and we know being together helps us innovate, learn and grow great careers. This role is based in our [Bangalore/Chennai] office for daily work, with the flexibility for associates to manage their personal lives.
Benefits
Beyond our great compensation package, you can receive incentive awards for your performance. Other great perks include a host of best-in-class benefits, including maternity and parental leave, PTO, health benefits, and much more.
Belonging
We aim to create a culture where every associate feels valued for who they are, rooted in respect for the individual. Our goal is to foster a sense of belonging, to create opportunities for all our associates, customers and suppliers, and to be a Walmart for everyone.
At Walmart, our vision is "everyone included." By fostering a workplace culture where everyone is—and feels—included, everyone wins. Our associates and customers reflect the makeup of all 19 countries where we operate. By making Walmart a welcoming place where all people feel like they belong, we’re able to engage associates, strengthen our business, improve our ability to serve customers, and support the communities where we operate.
Equal Opportunity Employer
Walmart, Inc., is an Equal Opportunities Employer – By Choice. We believe we are best equipped to help our associates, customers and the communities we serve live better when we really know them. That means understanding, respecting and valuing unique styles, experiences, identities, ideas and opinions, while being inclusive of all people.
Minimum Qualifications
Outlined below are the required minimum qualifications for this position. If none are listed, there are no minimum qualifications.
Option 1: Bachelor's degree in Statistics, Economics, Analytics, Mathematics, Computer Science, Information Technology or related field and 5 years' experience in an analytics-related field.
Option 2: Master's degree in Statistics, Economics, Analytics, Mathematics, Computer Science, Information Technology or related field and 3 years' experience in an analytics-related field.
Option 3: 7 years' experience in an analytics or related field.
Preferred Qualifications
Outlined below are the optional preferred qualifications for this position. If none are listed, there are no preferred qualifications.
Primary Location
G, 1, 3, 4, 5 Floor, Building 11, SEZ, Cessna Business Park, Kadubeesanahalli Village, Varthur Hobli, India