OnTimeStack | Modal LLM Fleet

Project overview

Modal LLM Fleet was built to provision the best open-source models for different AI needs, either independently or all at once. The platform exposes OpenAI-compatible endpoints for chat, vision, embeddings, image generation, video generation, text-to-speech, and upscaling, using Modal GPUs, bearer-token authentication, and idle auto-shutdown to control serving costs.

Modal LLM Fleet architecture showing model fleet, OpenAI-compatible endpoints, Modal GPU provisioning, stack, and developer workflow — The Modal LLM Fleet provisions secure OpenAI-compatible endpoints for specialized open-source models across text, vision, embeddings, image, video, speech, and upscaling workloads.

Challenge

Enable multi-agent projects without depending exclusively on proprietary cloud LLMs, while preserving data sovereignty, model-choice flexibility, GPU cost control, and the ability to customize models for specific domains.

Solution

Layered architecture with model registries as the single source of truth, vLLM for text, vision, and embeddings, diffusers for image and video, FastAPI OpenAI-compatible contracts, Modal Secrets for tokens, cache volumes, and independent per-model deployments.

Tech Stack

Open-source LLM
Modal
GPU Inference
Multi-agent
Data Sovereignty

Technical scope

OpenAI-compatible endpoints with bearer tokens
LLMs, vision, embeddings, image, video, speech, and upscaling
Independent per-model provisioning or full-fleet deployment
Data sovereignty and foundation for multi-agent systems

Let's build something amazing?

We are ready to understand your technical challenge and propose the best architecture. Contact us for an initial consultation without commitment.