Deploy a web app with a local small language model (SLM) as a sidecar container to run AI models entirely within your App Service environment. No outbound calls or external AI service dependencies required. This approach is ideal if you have strict data privacy or compliance requirements, as all AI processing and data remain local to your app. App Service offers high-performance, memory-optimized pricing tiers needed for running SLMs in sidecars.
## Overview
Small Language Models (SLMs) are compact AI models, such as Microsoft's Phi-3 and Phi-4 series, that can run efficiently with fewer computational resources compared to large language models. By deploying an SLM as a sidecar container alongside your web application in App Service, you can process AI requests entirely locally without making external API calls.
This architecture provides several advantages:
- Complete data privacy: All data and AI processing stays within your App Service environment
- Zero external dependencies: No reliance on external AI services or internet connectivity
- Predictable latency: Fast, consistent responses with no external network overhead
- Cost control: Pay only for App Service compute resources, with no per-token charges
- Regulatory compliance: Meet strict data residency and privacy requirements
## When to use local SLMs
Local SLMs are ideal for scenarios where:
- Data privacy is critical: Healthcare, finance, legal, or government applications that cannot send data to external services
- Offline capability is required: Applications that must function without internet connectivity
- Predictable costs are important: Fixed infrastructure costs instead of variable per-request pricing
- Low latency is essential: Local inference avoids external network round trips, and response times depend only on your chosen pricing tier and model size
- Moderate AI capabilities suffice: Tasks like classification, summarization, entity extraction, or simple Q&A that don't require the most powerful models
While SLMs may not match the capabilities of large models like GPT-4, they excel at focused, domain-specific tasks, and they can be fine-tuned for strong performance while you retain complete control over your data.
## Technical approach
App Service supports running SLMs through sidecar containers that deploy alongside your main application. The sidecar runs the model inference engine (such as ONNX Runtime or llama.cpp) and exposes a local endpoint your app can call. All traffic stays on the same App Service instance: the app and the sidecar share compute resources while remaining isolated in separate containers.
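As a minimal sketch of this pattern, the snippet below calls the sidecar from a Python app over localhost. The port (`11434`) and the OpenAI-compatible `/v1/chat/completions` path are assumptions for illustration; use whatever endpoint your sidecar image actually exposes.

```python
import json
import urllib.request

# Hypothetical local endpoint; adjust port and path to match your
# sidecar image's inference server.
SIDECAR_URL = "http://localhost:11434/v1/chat/completions"

def build_payload(prompt: str, model: str = "phi-4") -> dict:
    """Build an OpenAI-style chat completion request for the local sidecar."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": 256,
    }

def extract_reply(response_body: dict) -> str:
    """Pull the assistant message out of an OpenAI-style response."""
    return response_body["choices"][0]["message"]["content"]

def ask_slm(prompt: str) -> str:
    """Send a prompt to the sidecar; the request never leaves the instance."""
    req = urllib.request.Request(
        SIDECAR_URL,
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return extract_reply(json.load(resp))
```

Because the call targets localhost, no data leaves the App Service instance, which is what enables the privacy and offline guarantees described above.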