Optimum-NVIDIA
Optimum-NVIDIA delivers the best inference performance on the NVIDIA platform through Hugging Face. Run LLaMA 2 at 1,200 tokens/second (up to 28x faster than the standard transformers framework) by changing just a single line in your existing transformers code.
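As a sketch of what that single-line change looks like, assuming the drop-in `AutoModelForCausalLM` class exposed by the `optimum.nvidia` package (the checkpoint and generation arguments here are illustrative):

```python
# Before: from transformers import AutoModelForCausalLM
# After (the single-line change):
from optimum.nvidia import AutoModelForCausalLM
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf")

# Everything downstream stays unchanged transformers-style code.
inputs = tokenizer("What is the fastest way to serve LLaMA 2?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```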
Optimum-NVIDIA is the first Hugging Face inference library to benefit from the new float8 format supported on the NVIDIA Ada Lovelace and Hopper architectures. FP8, in addition to the advanced compilation capabilities of NVIDIA TensorRT-LLM software, dramatically accelerates LLM inference. You can also enable FP8 quantization with a single flag, which allows you to run a larger model on a single GPU, at faster speeds, and without sacrificing accuracy.
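A minimal sketch of that single-flag FP8 path, assuming the flag is the `use_fp8` keyword argument to `from_pretrained` (the exact parameter name is an assumption here; FP8 itself requires an Ada Lovelace or Hopper GPU):

```python
from optimum.nvidia import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",
    use_fp8=True,  # assumed flag name: quantizes to float8, cutting memory so larger models fit on one GPU
)
```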