Project overview
Modal LLM Fleet was built to provision the best open-source models for different AI needs, either independently or all at once. The platform exposes OpenAI-compatible endpoints for chat, vision, embeddings, image generation, video generation, text-to-speech, and upscaling, using Modal GPUs, bearer-token authentication, and idle auto-shutdown to control serving costs.

Challenge
Enable multi-agent projects without depending exclusively on proprietary cloud LLMs, while preserving data sovereignty, model-choice flexibility, GPU cost control, and the ability to customize models for specific domains.
Solution
Layered architecture with model registries as the single source of truth, vLLM for text, vision, and embeddings, diffusers for image and video, FastAPI OpenAI-compatible contracts, Modal Secrets for tokens, cache volumes, and independent per-model deployments.
Tech Stack
- Open-source LLM
- Modal
- GPU Inference
- Multi-agent
- Data Sovereignty
Technical scope
- OpenAI-compatible endpoints with bearer tokens
- LLMs, vision, embeddings, image, video, speech, and upscaling
- Independent per-model provisioning or full-fleet deployment
- Data sovereignty and foundation for multi-agent systems
