The Hugging Face Blog announced the integration of Fireworks.ai into its model repository, positioning the startup as a challenger to major AI providers. Fireworks.ai claims text-generation latency as low as 12 ms, benchmarked against GPT-4's 240 ms. The platform serves open-weight models, including Llama 3 and Mistral derivatives, at an inference cost of $0.0003 per token versus $0.003 for competing services.
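For readers who want to try a hosted open-weight model, here is a minimal sketch of querying one through Hugging Face's hub client in Python. The provider string, model ID, and token placeholder are illustrative assumptions and are not confirmed details from the announcement; a recent version of `huggingface_hub` is assumed.

```python
# Minimal sketch: calling a Fireworks-hosted open-weight model via the
# Hugging Face Hub client. Provider string and model ID are assumptions
# for illustration, not details taken from the announcement.
from huggingface_hub import InferenceClient

client = InferenceClient(
    provider="fireworks-ai",   # assumed provider identifier
    api_key="hf_xxx",          # your Hugging Face access token
)

completion = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # assumed model ID
    messages=[{"role": "user", "content": "Summarize the Fireworks.ai launch in one sentence."}],
    max_tokens=64,
)
print(completion.choices[0].message.content)
```

Swapping the model ID points the same call at any other hosted Llama 3 or Mistral derivative.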
20x Speed Claim

Fireworks.ai's architecture combines model quantization with custom silicon design. Internal tests show 20x higher throughput on MHA-30B workloads compared to GPT-4. The 12 ms latency figure was measured during real-time video-captioning tests on 4K streams. Customers include startups that need sub-100 ms response times for chatbots and autonomous systems. Model weights are available via Hugging Face's API marketplace, with SLAs guaranteeing 99.9% uptime.
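Fireworks' serving stack is proprietary, but the quantization idea the post refers to can be sketched with open tooling. The snippet below is a rough illustration, assuming a 4-bit bitsandbytes setup and an arbitrary open-weight model ID; it also times a single generation call the way one might sanity-check a latency claim, which is not how the vendor's 12 ms figure was produced.

```python
# Illustrative only: 4-bit quantized loading with transformers + bitsandbytes,
# plus a crude wall-clock latency check. Model ID and settings are assumptions.
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mistral-7B-Instruct-v0.2"   # assumed open-weight model

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # 4-bit weights cut memory footprint
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 for speed
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=quant_config, device_map="auto"
)

inputs = tokenizer("Hello, Fireworks!", return_tensors="pt").to(model.device)

# Rough per-request latency measurement (single prompt, no batching).
start = time.perf_counter()
output = model.generate(**inputs, max_new_tokens=32)
elapsed_ms = (time.perf_counter() - start) * 1000
print(tokenizer.decode(output[0], skip_special_tokens=True))
print(f"end-to-end generation latency: {elapsed_ms:.1f} ms")
```

Numbers from a script like this depend heavily on hardware, batch size, and prompt length, so they are not directly comparable to the vendor-reported latencies above.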
$50M Funding Round

Tiger Global led the $50M Series A with a $45M investment, joined by $5M from Hugging Face's own venture fund; additional participants include Lightspeed Venture Partners and Coatue Management. The funds will expand Fireworks' data center footprint to 12 global locations by Q3 2024. The team plans to release a 100B-parameter model in early 2025, trained on 300PB of filtered web data. CTO Jordan Hoffmann stated, "We're building infrastructure to serve 100,000 concurrent users without latency spikes."
Fireworks.ai will integrate with Hugging Face's Transformers library by October 2024. The partnership also includes joint R&D on model compression techniques targeting mobile devices.

Source: Hugging Face Blog