
Principal, Software Engineer @ Walmart

Job Description

Principal Engineer (Private Cloud Storage - Ceph)
Job Summary
We are seeking a highly skilled Principal Engineer (Ceph Storage) with 15-18 years of deep technical experience in distributed storage systems. This role is focused on hands-on architecture, operations, performance tuning, and troubleshooting of multi-petabyte scale Ceph clusters in mission-critical environments. The ideal candidate will have strong expertise across Linux, networking, storage internals, and distributed systems, with the ability to diagnose complex issues spanning hardware, kernel, and Ceph layers.
This role requires a technical leader and subject matter expert (SME) who can architect resilient storage platforms, resolve production incidents under pressure, and drive innovation in private cloud storage at scale.
About the team
Our Private Cloud Storage Engineering team is responsible for building and operating some of the largest-scale Ceph storage clusters in the industry, supporting mission-critical applications across Walmart's global ecosystem. With hundreds of petabytes of data under management across multiple production clusters, we provide the backbone of reliable, secure, and high-performance storage for business operations, customer platforms, and innovation workloads.
The team works at the intersection of distributed storage systems, Linux internals, networking, and cloud infrastructure, solving some of the toughest technical challenges in scalability, performance, and resilience. We embrace a culture of deep technical expertise, hands-on problem solving, and continuous learning, while driving adoption of automation, observability, and next-generation storage technologies.
As part of this team, you will collaborate with world-class engineers across compute, networking, security, and cloud to design end-to-end solutions, shape the future of enterprise storage platforms, and contribute to the broader open-source storage community.
What You'll Do
Ceph Storage Architecture & Operations
  • Architect, deploy, and manage large-scale Ceph clusters across multiple production sites.
  • Ensure storage availability, data durability, and cluster resiliency through advanced CRUSH map configurations, erasure coding, and replication strategies.
  • Define upgrade strategy, cluster augmentation, node rebalancing, and hardware refreshes with minimal downtime.
  • Own end-to-end lifecycle management of Ceph, including OS/Kernel tuning, firmware upgrades, and hardware integration.
Performance, Debugging & Troubleshooting
  • Identify, diagnose, and resolve performance bottlenecks across Ceph, Linux kernel, networking, and hardware layers.
  • Utilize tools such as perf, blktrace, iostat, tcpdump, bpftrace, atop, ceph tell/ceph health detail for advanced debugging.
  • Perform deep analysis of OSD, MON, MDS, RGW performance and optimize cluster parameters.
  • Debug network congestion, packet loss, latency, and RDMA/Ethernet issues impacting Ceph.
  • Drive root cause analysis (RCA) for critical production issues and provide long-term remediation.
Automation & Observability
  • Build and standardise automation for cluster deployment, expansion, and monitoring using Ansible, Terraform, and custom Python/Shell scripts.
  • Develop observability views for real-time monitoring of IOPS, throughput, latency, and cluster health.
  • Automate alerting, log analysis, and anomaly detection for proactive incident response.
Scalability & Innovation
  • Design storage solutions to scale to hundreds of nodes and multiple petabytes while ensuring high availability and fault tolerance.
  • Collaborate with compute and networking teams to integrate Ceph with Kubernetes, OpenStack, and VM workloads.
  • Research and implement new features such as CephFS, RGW S3/Swift gateways, BlueStore optimizations, and RocksDB tuning.
  • Evaluate next-gen hardware (NVMe SSDs, RDMA NICs, high-density HDDs) and their impact on Ceph performance.
  • Evaluate next-gen server SKUs, perform benchmarking, and compare options to select the most appropriate storage hardware.
Security & Compliance
  • Implement encryption (at-rest and in-transit), access controls, and audit mechanisms for secure data management.
  • Ensure compliance with enterprise and regulatory standards (e.g., PCI-DSS, SOC, HIPAA).
Collaboration & Mentorship
  • Act as technical SME for Ceph within the organization, mentoring junior engineers.
  • Collaborate with cross-functional teams (Compute, Networking, Cloud, Security) to ensure seamless infrastructure integration.
  • Partner with vendors and the Ceph community to drive adoption of best practices and contribute to open-source improvements.
What You'll Bring
  • 15-18 years of experience in distributed storage systems, infrastructure engineering, and Linux systems.
  • 10+ years of hands-on experience with Ceph, including architecture, operations, and large-scale production support.
  • Proven experience managing clusters at petabyte scale with high performance and resiliency requirements.
  • Strong expertise in:
    • Linux Systems: Kernel tuning, cgroups, systemd, process/thread debugging.
    • Networking: TCP/IP, VLANs, BGP/OSPF, bonding, load balancing, RDMA, Jumbo Frames.
    • Storage Internals: OSD design, BlueStore, RocksDB tuning, journaling, caching layers.
    • Performance Tools: perf, iostat, atop, strace, tcpdump, Wireshark, eBPF.
    • Debugging: Core dump analysis, kernel crash dump (kdump), system call tracing.
  • Proficiency in Python and Shell scripting for automation and tooling.
  • Hands-on experience with configuration management (Ansible, Salt, Puppet) and IaC tools like Terraform.
  • Knowledge of containerization (Docker, Kubernetes, LXC) and their storage backends (CSI, RBD).
  • Experience with monitoring and logging stacks (Prometheus, Grafana, ELK, OpenObserve).
  • Familiarity with cloud platforms (Azure, GCP, OpenStack, AWS) and hybrid cloud storage.
Preferred Skills
  • Contributions to the Ceph community or other distributed storage projects.
  • Experience with large-scale data replication, backup, and disaster recovery strategies.
  • Exposure to AI/ML workloads on Ceph and performance optimization for GPU clusters.
  • Familiarity with hardware accelerators (NVMe-oF, SPDK, DPDK).
Why this role matters
This role is critical to ensuring that our next-generation private cloud storage platform is reliable, performant, and scalable to meet future business demands. As a Principal Engineer - Ceph Storage, you will be the go-to technical authority, solving some of the most complex distributed storage problems in the enterprise world.
Work
Walmart's culture sets us apart, and we know being together helps us innovate, learn and grow great careers. This role is based in our [Bangalore/Chennai] office for daily work, with the flexibility for associates to manage their personal lives.
Equal Opportunity Employer
Walmart, Inc., is an Equal Opportunities Employer - By Choice. We believe we are best equipped to help our associates, customers and the communities we serve live better when we really know them. That means understanding, respecting and valuing unique styles, experiences, identities, ideas and opinions - while being inclusive of all people.
Minimum Qualifications
Option 1: Bachelor's degree in computer science, computer engineering, computer information systems, software engineering, or related area and 5 years' experience in software engineering or related area.
Option 2: 7 years' experience in software engineering or related area.
Preferred Qualifications
Master's degree in computer science, computer engineering, computer information systems, software engineering, or related area and 3 years' experience in software engineering or related area. We value candidates with a background in creating inclusive digital experiences, demonstrating knowledge in implementing Web Content Accessibility Guidelines (WCAG) 2.2 AA standards, assistive technologies, and integrating digital accessibility seamlessly. The ideal candidate would have knowledge of accessibility best practices and join us as we continue to create accessible products and services following Walmart's accessibility standards and guidelines for supporting an inclusive culture.
Primary Location: G, 1, 3, 4, 5 Floor, Building 11, SEZ, Cessna Business Park, Kadubeesanahalli Village, Varthur Hobli, India

Job Classification

Industry: IT Services & Consulting
Functional Area / Department: Engineering - Software & QA
Role Category: Software Development
Role: Technical Architect
Employment Type: Full time

Contact Details:

Company: Walmart
Location(s): Bengaluru



Keyskills: Performance tuning, Automation, Linux, Production support, Configuration management, Shell scripting, Firmware, Troubleshooting, Open source, Python


₹ Not Disclosed


Walmart

If you're thinking scale, think bigger and don't stop there. At Walmart Global Tech India, we don't just innovate, we enable transformations across stores and different channels for the Walmart experience. A regular day at Walmart Global Tech India means using technology to deliver leading-edge i...