[Remote] Infrastructure Software Engineer, Fleet & Automation
Note: This is a remote position open to candidates in the USA. Nscale is a GPU cloud company focused on AI infrastructure, providing high-performance solutions for AI development. As an Infrastructure Software Engineer for Fleet & Automation, you will ensure the performance and scalability of AI and High-Performance Computing environments by building and maintaining automation and control systems.
Responsibilities
- Drive the technical architecture, roadmap, and implementation for workflow automation systems, making architecture decisions that balance automation complexity, reliability, and maintainability
- Identify and resolve performance and scalability issues
- Establish technology and product direction in collaboration with other tech leads, managers, and senior leadership
- Own end-to-end delivery of device provisioning, validation, testing, and remediation workflows at scale
- Design and build workflow orchestration systems for hardware lifecycle management, including GPU nodes and network switches
- Partner with Infrastructure, Platform, and SRE teams to translate operational needs into robust, scalable automation
- Establish engineering standards for reliability, observability, and operational excellence across all services
- Help set up engineering best practices in collaboration with the broader engineering team
- Build production-grade Python systems for hardware lifecycle automation, leveraging AI tools to accelerate delivery
- Assess the impact of new hardware product programs on the team's software stack, and explore AI-driven process improvement and automation
- Collaborate with cross-functional teams (product, design, operations, infrastructure) to build efficient, interoperable, and maintainable automated systems
Skills
- Bachelor's degree in Computer Science, Computer Engineering, relevant technical field, or equivalent practical experience
- 5+ years of relevant experience building large-scale infrastructure applications, or similar experience
- Experience with languages such as C, C++, and Java, and scripting languages such as Python, including API design and unit testing techniques
- Deep understanding of Linux operating systems, networking fundamentals (TCP/IP, BGP), and familiarity with configuration management tools (e.g., Ansible, Terraform)
- Experience building, running and debugging large-scale infrastructure, stateful and stateless services for distributed systems or networks, and experience with compute technologies, storage, or hardware architecture
- Experience integrating with infrastructure tooling such as DCIMs, NetBox, OpenStack, and bare metal APIs (MAAS, Ironic, IPMI)
- Master's degree or PhD in Engineering, Computer Science, or a related technical field
- Experience designing, analyzing and improving efficiency, scalability, and performance of various system resources
- Direct experience with AI/HPC infrastructure, including NVIDIA GPUs, InfiniBand or high-speed Ethernet fabrics, and related management software (e.g., NCCL, SLURM)
- Experience with advanced observability and monitoring systems (Prometheus, Grafana, OpenTelemetry) for complex, high-cardinality telemetry data
- Familiarity with cloud-native technologies (Kubernetes, Docker) and infrastructure-as-code principles
- Demonstrated ability to integrate AI tools to optimize/redesign workflows and drive measurable impact (e.g., efficiency gains, quality improvements)
- Familiarity with SLO and metrics measurement, and with integrating logs, telemetry, and metrics into tools that improve the operator experience
Benefits
- Medical
- Dental
- Vision
- Flexible paid time off
- Parental leave
- Retirement plan participation
Company Overview