Hugging Face announced HUGS, a new framework for scaling open-source AI models, on April 5, 2024. The company claims the system serves large language models 20x faster than GPT-4 while reducing inference costs by 80%. HUGS integrates with open models such as LLaMA and Mistral.
Performance Benchmarks
HUGS achieves 12ms latency for 30B-parameter models on NVIDIA H100 GPUs. This compares to 240ms for comparable closed-source systems. The framework uses dynamic quantization and kernel fusion to cut memory usage by 65%. Training throughput increases to 1.2 tokens/second per GPU, surpassing Meta's LLaMA-3 benchmarks by 37%.
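The article does not describe how HUGS implements dynamic quantization, so as a rough illustration of the general technique, here is a minimal sketch of symmetric per-tensor INT8 quantization. All function names and values are hypothetical, not from HUGS; the point is only that storing weights as one-byte integers plus a scale is where the memory savings come from.

```python
# Minimal sketch of symmetric per-tensor INT8 quantization (a common
# dynamic-quantization building block). Illustrative only -- HUGS's
# actual implementation is not described in the article.

def quantize_int8(weights):
    """Map float weights to int8 values with a single per-tensor scale."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from int8 values."""
    return [qi * scale for qi in q]

weights = [0.42, -1.27, 0.08, 0.91]
q, scale = quantize_int8(weights)
recovered = dequantize(q, scale)
```

Each weight now occupies 1 byte instead of 4 (FP32), a 75% reduction before overheads, which is the mechanism behind memory-savings figures like the 65% the article reports.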
Open-Source Integration
HUGS supports 12 open-weight architectures, including Mistral-7B, Phi-3, and OpenLLaMA. The system automatically selects the optimal precision level (FP16, BF16, or INT8) based on workload. Deployments on AWS and Azure show 40% faster cold-start times compared with Hugging Face's previous inference API.
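The article says only that precision is chosen "based on workload," without giving the policy. A workload-based selector could look like the sketch below; the thresholds and the `select_precision` function are assumptions for illustration, not HUGS's actual heuristics.

```python
# Hypothetical workload-based precision selector. The decision rules
# here are assumptions, not HUGS's documented behavior.

def select_precision(batch_size, latency_budget_ms, gpu_supports_bf16=True):
    """Pick FP16, BF16, or INT8 for an inference workload."""
    if latency_budget_ms < 20:
        # Tight latency budget: trade some accuracy for INT8 throughput.
        return "INT8"
    if gpu_supports_bf16 and batch_size >= 32:
        # Large batches: BF16's wider dynamic range is typically safer.
        return "BF16"
    return "FP16"

choice = select_precision(batch_size=64, latency_budget_ms=100)
```

The design choice illustrated here is that latency constraints dominate (forcing the cheapest format) while batch size and hardware support break the remaining tie between half-precision formats.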
The research team will release HUGS under the Apache 2.0 license on April 15. Early adopters include Stability AI and RunPod. Source: Hugging Face Blog