[Remote] Order Management System (OMS) Staff Engineer
Note: This is a remote position open to candidates in the USA. Levi Strauss & Co. is a company that values individuality and making a positive impact. They are seeking a Staff Engineer for their Order Management System (OMS) team, responsible for leading the design and architecture of complex systems, ensuring engineering excellence, and promoting operational reliability.
Responsibilities
- Lead the design and domain modeling of complex, distributed systems within the OMS ecosystem, producing clear, well-reasoned service boundaries, data contracts, and event-driven interaction patterns that stand up to scrutiny and scale
- Champion domain-driven design (DDD) principles, working with product and engineering peers to identify bounded contexts, eliminate implicit coupling, and surface shared language across teams
- Guide decomposition of monolithic or tightly-coupled components into well-defined, independently deployable services—reducing blast radius, improving team autonomy, and promoting faster iteration
- Author architecture decision records (ADRs) and technical design documents that communicate the "why" alongside the "what," helping teams make informed decisions over time
- Write, review, and guide production-quality code with an emphasis on clarity, testability, and long-term maintainability—setting the bar for engineering craft on the team
- Apply modern software engineering practices: CI/CD pipelines, automated testing strategies, feature flagging, progressive delivery, and trunk-based development
- Identify and eliminate technical debt systematically, balancing short-term velocity with long-term system health through well-argued, incremental improvement plans
- Establish and promote coding standards, patterns, and best practices across the OMS team that are practical, enforceable, and grounded in production experience
- Operate with full production ownership: you design with failure in mind, participate in on-call rotations, and take accountability for the health and reliability of the systems you ship
- Embed reliability engineering into the development lifecycle—defining SLOs, error budgets, and reliability targets upfront rather than as an afterthought
- Treat runbooks, strategies, and operational documentation as first-class engineering artifacts, keeping them accurate, applicable, and tightly coupled to the systems they describe
- Design and implement comprehensive observability strategies—structured logging, distributed tracing, and metrics—so that you can localize any failure mode in production
- Develop dashboards that give engineers, on-call responders, and partners genuine operational insight into system health: not just uptime pings, but meaningful golden signals and business-relevant goals
- Define and tune alerting strategies that are signal-rich and noise-poor—ensuring you wake on-call engineers for relevant events, not symptoms of unrelated upstream noise
- Champion observability as a design constraint, ensuring that new services are instrumented and that telemetry quality is part of every code review and launch checklist
- Design systems that can sustain peak commercial volumes—seasonal traffic spikes, flash sales, and global expansion—without degraded experience or unplanned downtime
- Apply scalability patterns: asynchronous messaging, event sourcing, CQRS, caching strategies, database sharding, and graceful degradation, selecting the right tool for each problem
- Conduct and lead capacity planning exercises, load testing, and performance profiling—translating production data into informed infrastructure and architectural decisions
- Be the senior technical resource during complex production incidents—methodically narrowing hypotheses, leading war rooms, and restoring service while preserving forensic evidence for root cause analysis
- Facilitate blameless post-incident reviews (PIRs) that produce durable improvements—not just immediate fixes, but systemic changes that reduce the likelihood or impact of recurrence
- Develop institutional troubleshooting knowledge: document failure modes, known issues, and diagnostic techniques so the entire team grows more capable with each incident
- Partner with product managers, architects, and other engineers to translate business requirements into clear, achievable technical roadmaps, bridging the gap between strategy and implementation
- Mentor and level up mid-level engineers through hands-on code review, design feedback, pairing sessions, and direct coaching—building engineering depth across the OMS team
- Stay current with industry trends in distributed systems, event-driven architecture, and operational tooling, bringing informed perspectives on when to adopt new approaches versus doubling down on proven patterns
Skills
- 10+ years of experience in software engineering with a focus on backend systems, distributed architectures, and platform/product engineering at scale
- Deep, practical experience designing and modeling complex distributed systems—you articulate trade-offs and make well-reasoned architectural choices under constraints
- You have experience operating in a 'you build it, you run it' engineering culture. You've been on-call for systems you've built, responded to incidents, and used that experience to make better engineering decisions
- Build for scale and run at scale—you've handled high-throughput, high-availability systems and have the scars and lessons to show for it
- Expert-level understanding of observability: you can instrument a system from scratch, build meaningful dashboards, tune alerting, and use telemetry data as a primary tool for engineering decisions
- Troubleshoot with a systematic, data-driven approach to diagnosing production issues—you stay calm and lead others when systems are on fire
- Demonstrated experience decoupling tightly-coupled systems—whether migrating a monolith, extracting a shared service, or replacing implicit temporal dependencies with well-defined async contracts
- Experience with event-driven architecture, domain-driven design, and modern API design patterns; you know where these patterns add value and where they add unnecessary complexity
- Mastery of CI/CD, automated testing, and DevOps practices; you view them as engineering fundamentals, not optional add-ons
- You can translate technical complexity for non-technical partners and write for engineering audiences—design docs, ADRs, incident reports, and code reviews all reflect your thinking
- Experience working with geographically distributed teams and navigating the complexities of multi-time zone collaboration
- Experience with Order Management Systems (OMS), fulfillment pipelines, or commerce platforms is a meaningful plus—familiarity with the domain accelerates your impact, but is not a prerequisite for the right engineer
Benefits
- Base pay
- Incentive plans
- 401(k) matching
- Paid leave
- Health insurance
- Product discounts