[Remote] Senior DGX Cloud AI Infrastructure Software Engineer
Note: This is a remote position open to candidates in the USA. NVIDIA is a leader in groundbreaking developments in Artificial Intelligence, High-Performance Computing, and Visualization. They are seeking a Senior DGX Cloud AI Infrastructure Software Engineer to contribute to the infrastructure that powers innovative AI research, with a focus on developing tools that optimize the efficiency and resiliency of AI workloads.
Responsibilities
- Develop infrastructure software and tools for large-scale pre-training, post-training, and inference
- Develop and optimize tools and libraries to improve infrastructure efficiency and resiliency
- Co-design and implement APIs for integration with NVIDIA's resiliency stacks
- Enhance infrastructure and products underpinning NVIDIA's AI platforms
- Define meaningful and actionable reliability metrics to track and improve system and service reliability
- Apply problem-solving, root cause analysis, and optimization skills
- Root-cause, analyze, and triage failures from the application level to the hardware level
Skills
- 8+ years of experience developing software infrastructure for large-scale AI systems
- Bachelor's degree or higher in Computer Science or a related technical field (or equivalent experience)
- Strong debugging skills and experience in analyzing and triaging AI applications from the application level to the hardware level
- Experience with observability platforms for monitoring and logging (e.g., ELK, Prometheus, Loki)
- Proven track record in building and scaling large-scale distributed systems
- Experience with AI training and inferencing infrastructure services
- Proficiency in programming languages such as Python and C/C++, and in scripting languages
- Experience in quality software engineering practices, including test development, defensive programming, version control, and CI
- Excellent communication and collaboration skills; a culture of diversity, intellectual curiosity, problem solving, and openness is essential
- Background in working with large-scale clusters
- Experience in defining and building an observability and telemetry software stack
- Experience with the RDMA software stack (NCCL, IB verbs, UCX, libfabric)
- Experience with root cause analysis of failures at datacenter scale
- Good understanding of the internals of DL frameworks such as PyTorch, TensorFlow, JAX, and Ray
Benefits
- Equity