[Remote] Staff Site Reliability Engineer

Remote · USA · Full-time

Note: This is a remote position open to candidates in the USA. Thrive Market is an online, membership-based market focused on making healthy and sustainable living accessible. The company is seeking a Staff Site Reliability Engineer to establish its SRE practice, define reliability metrics, and ensure system scalability during rapid growth.

Responsibilities

Define, implement, and own Service Level Objectives (SLOs) and Service Level Indicators (SLIs) across critical platform services
Build and maintain comprehensive monitoring, alerting, and observability systems using tools like Datadog, Prometheus, Grafana, or similar platforms
Establish error budgets and use them to balance feature velocity with reliability investments
Lead incident response efforts, conduct blameless postmortems, and drive systemic improvements that prevent recurrence
Design and implement chaos engineering practices to proactively identify failure modes before they impact members
Architect and optimize our Kubernetes-based container orchestration platform for reliability, performance, and cost efficiency
Support large infrastructure migrations, ensuring a smooth transition with minimal disruption to business operations
Contribute to the evaluation and execution of potential platform migrations, with a focus on reliability planning and risk mitigation
Design and implement automated deployment pipelines that enable rapid, error-free releases with feature flags and built-in rollback/roll-forward capabilities
Develop and own disaster recovery plans, capacity planning models, and system hardening initiatives
Collaborate closely with product engineering teams to help them scale their infrastructure in AWS and adopt SRE best practices
Help establish SRE as a practice at Thrive Market, defining the team’s charter, processes, and engagement model with product engineering teams
Champion a culture of operational excellence, continuous improvement, and data-driven reliability decisions
Create and maintain technical documentation covering architecture decisions, runbooks, incident response procedures, and operational playbooks
Participate in weekly on-call rotations and help build sustainable on-call practices that avoid burnout
Identify systemic problems and inefficiencies across the engineering organization and make strategic recommendations for improvement

Skills

B.S. in Computer Science or equivalent professional experience
7+ years of hands-on experience in SRE, DevOps, or Infrastructure Engineering, with a proven track record of improving reliability at rapidly growing companies
Deep expertise in Kubernetes (K8s), including cluster management, Helm charts, service meshes, and production-grade container orchestration
Strong systems engineering background with advanced proficiency in Linux administration
Advanced scripting and automation skills in Bash, Python, Golang, Ruby, or similar languages
Extensive experience with core AWS services including EC2, ECS/EKS, S3, VPC, IAM, CloudWatch, Route 53, RDS, and Lambda
Strong experience with Infrastructure as Code tools (Terraform, CloudFormation, Pulumi, or similar)
Hands-on experience defining and implementing SLOs, SLIs, and error budgets in production environments
Deep understanding of CI/CD pipelines and deployment strategies (blue-green, canary, rolling deployments)
Expertise in monitoring and observability platforms (Datadog, Prometheus, Grafana, New Relic, or similar)
Strong knowledge of web application infrastructure, networking, load balancing, and security best practices
Excellent communication skills with the ability to lead incident response and facilitate blameless postmortems
Experience with e-commerce platforms (Magento, Shopify, or comparable) and the unique reliability challenges they present at scale
Experience with Concourse CI, GitHub Actions (GHA), or similar deployment frameworks
Experience with chaos engineering tools and practices (Gremlin, Litmus, Chaos Monkey, or similar)
Familiarity with GitOps workflows (ArgoCD, Flux) and service mesh technologies (Istio, Linkerd)
Experience building and managing cost-optimization strategies for cloud infrastructure
Background in establishing SRE practices in organizations transitioning from traditional DevOps models
Experience with configuration management tools (Ansible, Chef, Puppet, or similar)

Benefits

Comprehensive health benefits (medical, dental, vision, life and disability)
Competitive salary (DOE) + equity
401(k) plan
9 Observed Holidays
Flexible Paid Time Off
Subsidized ClassPass Membership with access to fitness classes and wellness and beauty experiences
Ability to work in our beautiful office in Playa Vista
Free Thrive Market membership with exclusive employee discount
Coverage for Life Coaching & Therapy Sessions on our holistic mental health and well-being platform

Company Overview

Thrive Market is a membership-based online company that offers natural and organic food products. It was founded in 2013, and is headquartered in Los Angeles, California, USA, with a workforce of 501-1000 employees. Its website is https://thrivemarket.com.

