[Remote] Staff Site Reliability Engineer

Remote · USA · Full-time

Note: This is a remote position open to candidates in the USA. Thrive Market is an online, membership-based market focused on making healthy and sustainable living accessible. The company is seeking a Staff Site Reliability Engineer to establish its SRE practice, define reliability metrics, and ensure system scalability during rapid growth.

Responsibilities

  • Define, implement, and own Service Level Objectives (SLOs) and Service Level Indicators (SLIs) across critical platform services
  • Build and maintain comprehensive monitoring, alerting, and observability systems using tools like Datadog, Prometheus, Grafana, or similar platforms
  • Establish error budgets and use them to balance feature velocity with reliability investments
  • Lead incident response efforts, conduct blameless postmortems, and drive systemic improvements that prevent recurrence
  • Design and implement chaos engineering practices to proactively identify failure modes before they impact members
  • Architect and optimize our Kubernetes-based container orchestration platform for reliability, performance, and cost efficiency
  • Support large infrastructure migrations, ensuring a smooth transition with minimal disruption to business operations
  • Contribute to the evaluation and execution of potential platform migrations, with a focus on reliability planning and risk mitigation
  • Design and implement automated deployment pipelines that enable rapid, error-free releases with feature flags and built-in rollback/roll-forward capabilities
  • Develop and own disaster recovery plans, capacity planning models, and system hardening initiatives
  • Collaborate closely with product engineering teams to help them scale their infrastructure in AWS and adopt SRE best practices
  • Help establish SRE as a practice at Thrive Market, defining the team’s charter, processes, and engagement model with product engineering teams
  • Champion a culture of operational excellence, continuous improvement, and data-driven reliability decisions
  • Create and maintain technical documentation covering architecture decisions, runbooks, incident response procedures, and operational playbooks
  • Participate in weekly on-call rotations and help build sustainable on-call practices that avoid burnout
  • Identify systemic problems and inefficiencies across the engineering organization and make strategic recommendations for improvement

Skills

  • B.S. in Computer Science or equivalent professional experience
  • 7+ years of hands-on experience in SRE, DevOps, or Infrastructure Engineering, with a proven track record of improving reliability at rapidly growing companies
  • Deep expertise in Kubernetes (K8s) — including cluster management, Helm charts, service meshes, and production-grade container orchestration
  • Strong systems engineering background with advanced proficiency in Linux administration
  • Advanced scripting and automation skills in Bash, Python, Golang, Ruby, or similar languages
  • Extensive experience with core AWS services including EC2, ECS/EKS, S3, VPC, IAM, CloudWatch, Route 53, RDS, and Lambda
  • Strong experience with Infrastructure as Code tools (Terraform, CloudFormation, Pulumi, or similar)
  • Hands-on experience defining and implementing SLOs, SLIs, and error budgets in production environments
  • Deep understanding of CI/CD pipelines and deployment strategies (blue-green, canary, rolling deployments)
  • Expertise in monitoring and observability platforms (Datadog, Prometheus, Grafana, New Relic, or similar)
  • Strong knowledge of web application infrastructure, networking, load balancing, and security best practices
  • Excellent communication skills with the ability to lead incident response and facilitate blameless postmortems
  • Experience with e-commerce platforms (Magento, Shopify, or comparable) and the unique reliability challenges they present at scale
  • Experience with ConcourseCI, GitHub Actions (GHA), or similar deployment frameworks
  • Experience with chaos engineering tools and practices (Gremlin, Litmus, Chaos Monkey, or similar)
  • Familiarity with GitOps workflows (ArgoCD, Flux) and service mesh technologies (Istio, Linkerd)
  • Experience building and managing cost-optimization strategies for cloud infrastructure
  • Background in establishing SRE practices in organizations transitioning from traditional DevOps models
  • Experience with configuration management tools (Ansible, Chef, Puppet, or similar)

Benefits

  • Comprehensive health benefits (medical, dental, vision, life and disability)
  • Competitive salary (DOE) + equity
  • 401k plan
  • 9 Observed Holidays
  • Flexible Paid Time Off
  • Subsidized ClassPass Membership with access to fitness classes and wellness and beauty experiences
  • Ability to work in our beautiful office in Playa Vista
  • Free Thrive Market membership with exclusive employee discount
  • Coverage for Life Coaching & Therapy Sessions on our holistic mental health and well-being platform

Company Overview

  • Thrive Market is a membership-based online company that offers natural and organic food products. It was founded in 2013 and is headquartered in Los Angeles, California, USA, with a workforce of 501-1000 employees. Its website is https://thrivemarket.com.