[Remote] Member of Technical Staff, Cluster Administration
Note: This is a remote position open to candidates in the USA. Inferact is focused on advancing AI inference technology through its vLLM engine. The company is seeking a hands-on cluster administration engineer to manage its high-performance GPU compute infrastructure and keep it healthy and available so that engineering teams stay productive.
Responsibilities
- Own and operate the high-performance GPU compute infrastructure
- Ensure that infrastructure is healthy, available, observable, and usable around the clock
- Take ownership of cluster health, GPU availability, monitoring, alerting, scheduling, access, diagnostics, and incident response
- Work closely with engineering leadership and infrastructure owners to standardize how we provision, operate, debug, and scale compute across providers
Skills
- Bachelor's degree or equivalent experience in computer science, engineering, systems administration, or a similar field
- Hands-on experience administering large compute clusters, HPC environments, university or research clusters, supercomputing systems, or production GPU clusters
- Strong Linux systems administration fundamentals across networking, processes, storage, package management, shell scripting, logs, access control, and system debugging
- Experience operating GPU servers, including driver management, GPU health monitoring, node failures, memory errors, scheduler issues, and hardware diagnostics
- Experience with cluster scheduling and resource allocation using SLURM, Kubernetes, or equivalent tooling
- Ability to own urgent infrastructure incidents end-to-end when compute issues are blocking engineering teams
- Ability to automate operational workflows using Bash, Python, Ansible, Terraform, Helm, or similar tooling
- Experience operating GPU compute across providers such as Lambda, CoreWeave, Crusoe, Nebius, Together, Fireworks, RunPod, or similar environments
- Experience improving cluster utilization, reducing idle or unavailable GPU capacity, and debugging scheduling or resource contention issues
- Familiarity with high-performance GPU networking such as InfiniBand, RoCE, NVLink / NVSwitch, RDMA, NCCL, or equivalent systems
- Experience with storage for HPC or ML workloads, including NFS, Lustre, Ceph, distributed filesystems, or other high-throughput storage systems
- Experience managing secure access, identity, permissions, SSH, VPNs, bastion hosts, secrets, and basic infrastructure security hygiene
- Background in research computing, scientific computing, ML infrastructure, SRE, platform engineering, or infrastructure operations for engineering-heavy teams
Benefits
- Offers equity
- Inferact offers generous health, dental, and vision benefits, as well as a 401(k) company match.
Company Overview
Company H1B Sponsorship