[Remote] Member of Technical Staff, Cluster Administration

Remote · USA · Full-time

Note: This is a remote role open to candidates in the USA. Inferact is focused on advancing AI inference technology through its vLLM engine. The company is seeking a hands-on cluster administration engineer to manage its high-performance GPU compute infrastructure, keeping it healthy and available so engineering teams stay productive.

Responsibilities

Own and operate the high-performance GPU compute infrastructure
Ensure that infrastructure is healthy, available, observable, and usable around the clock
Take ownership of cluster health, GPU availability, monitoring, alerting, scheduling, access, diagnostics, and incident response
Work closely with engineering leadership and infrastructure owners to standardize how we provision, operate, debug, and scale compute across providers

Skills

Bachelor's degree or equivalent experience in computer science, engineering, systems administration, or similar
Hands-on experience administering large compute clusters, HPC environments, university or research clusters, supercomputing systems, or production GPU clusters
Strong Linux systems administration fundamentals across networking, processes, storage, package management, shell scripting, logs, access control, and system debugging
Experience operating GPU servers, including driver management, GPU health monitoring, node failures, memory errors, scheduler issues, and hardware diagnostics
Experience with cluster scheduling and resource allocation using SLURM, Kubernetes, or equivalent tooling
Ability to own urgent infrastructure incidents end-to-end when compute issues are blocking engineering teams
Ability to automate operational workflows using Bash, Python, Ansible, Terraform, Helm, or similar tooling
Experience operating GPU compute across providers such as Lambda, CoreWeave, Crusoe, Nebius, Together, Fireworks, RunPod, or similar environments
Experience improving cluster utilization, reducing idle or unavailable GPU capacity, and debugging scheduling or resource contention issues
Familiarity with high-performance GPU networking such as InfiniBand, RoCE, NVLink / NVSwitch, RDMA, NCCL, or equivalent systems
Experience with storage for HPC or ML workloads, including NFS, Lustre, Ceph, distributed filesystems, or other high-throughput storage systems
Experience managing secure access, identity, permissions, SSH, VPNs, bastion hosts, secrets, and basic infrastructure security hygiene
Background in research computing, scientific computing, ML infrastructure, SRE, platform engineering, or infrastructure operations for engineering-heavy teams

Benefits

Offers Equity
Inferact offers generous health, dental, and vision benefits as well as 401(k) company match.

Company Overview

Inferact's mission is to accelerate AI progress by making inference cheaper and faster. Founded in 2025, the company is headquartered in San Francisco, California, USA, and has 11-50 employees. Its website is https://inferact.ai/.

Company H1B Sponsorship

Inferact has a track record of offering H1B sponsorships, with one sponsorship in 2026. This does not guarantee sponsorship for this specific role.
