Machine Learning Engineer — Inference Optimization

Featherless AI · Remote (world) · + Equity · 1mo ago

Remote WW · Artificial Intelligence · Machine Learning Engineer · CUDA · Triton · ONNX

Requirements

• Strong experience in ML inference optimization or high-performance ML systems
• Solid understanding of deep learning internals (attention, memory layout, compute graphs)
• Hands-on experience with PyTorch (or similar) and model deployment
• Familiarity with GPU performance tuning (CUDA, ROCm, Triton, or kernel-level optimizations)
• Experience scaling inference for real users (not just research benchmarks)
• Comfortable working in fast-moving startup environments with ownership and ambiguity
• Experience with LLM or long-context model inference
• Knowledge of inference frameworks (TensorRT, ONNX Runtime, vLLM, Triton)
• Experience optimizing across different hardware vendors
• Open-source contributions in ML systems or inference tooling
• Background in distributed systems or low-latency services

Responsibilities

• Optimize inference latency, throughput, and cost for large-scale ML models in production.
• Profile GPU/CPU inference pipelines and identify bottlenecks in memory usage, kernel execution, batching strategies, and I/O.
• Implement and tune quantization and reduced-precision formats (fp16, bf16, int8, fp8) to reduce model size and improve performance.
• Optimize KV-cache reuse in inference systems.
• Apply speculative decoding, batching, and streaming optimizations.
• Prune models or simplify architectures for inference efficiency.
• Collaborate closely with research engineers to bring new model architectures into production, ensuring they are fast and reliable enough for real user interaction.
• Build and maintain robust model-serving systems (e.g., Triton server or custom runtimes) that run on NVIDIA and AMD GPUs and on cloud infrastructure.
• Benchmark performance across hardware setups (GPUs and CPUs from vendors such as NVIDIA and AMD) and cloud environments.
• Improve system reliability and observability under real workload conditions.
• Improve the cost efficiency of inference in realistic user scenarios without compromising performance or accuracy.

Benefits

• Real ownership over performance-critical systems
• Direct impact on product reliability and unit economics
• Close collaboration with research, infra, and product
• Competitive compensation + meaningful equity at Series A
• A team that cares about engineering quality, not hype