DistServe: Disaggregating Prefill and Decoding for LLM Inference Optimized for...