Advanced

Deploying AI Assistants

Once your assistant is built, you need to make it available to users. Learn about deployment channels, API design, streaming, authentication, and how to scale for production traffic.

Deployment Channels

Web Widget

The most common deployment channel: a chat widget embedded in your website.

Implementation: JavaScript widget that communicates with your backend via WebSocket or REST API
Advantages: Reaches all website visitors, customizable UI, no app installation
Platforms: Intercom, Crisp, custom builds with React/Vue components

Slack Integration

Implementation: Slack Bot using the Slack Events API and Web API
Best for: Internal assistants, team productivity, IT help desk
Features: Slash commands, threaded conversations, rich message formatting
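
Slack signs every request it sends to your Events API endpoint, so the bot should verify that signature before acting on an event. The sketch below follows Slack's documented scheme (HMAC-SHA256 over "v0:<timestamp>:<raw body>", compared against the X-Slack-Signature header). The function name, the parameter names, and the 5-minute replay window are illustrative choices, not part of any Slack SDK.

```python
import hashlib
import hmac
import time
from typing import Optional


def verify_slack_signature(
    signing_secret: str,
    timestamp: str,       # value of the X-Slack-Request-Timestamp header
    body: bytes,          # raw, unparsed request body
    signature: str,       # value of the X-Slack-Signature header
    now: Optional[float] = None,
    max_age_seconds: int = 300,
) -> bool:
    """Return True if the request carries a valid, recent Slack signature."""
    now = time.time() if now is None else now
    try:
        # Reject stale requests to limit replay attacks.
        if abs(now - int(timestamp)) > max_age_seconds:
            return False
    except ValueError:
        return False
    # Slack signs the string "v0:<timestamp>:<raw body>" with HMAC-SHA256.
    base = b"v0:" + timestamp.encode() + b":" + body
    digest = hmac.new(signing_secret.encode(), base, hashlib.sha256).hexdigest()
    expected = "v0=" + digest
    # Constant-time comparison avoids leaking information through timing.
    return hmac.compare_digest(expected.encode(), signature.encode())
```

Call this with the raw request body before parsing the JSON; re-serializing the parsed body can change the bytes and make a valid signature fail.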

Microsoft Teams

Implementation: Teams Bot Framework, Azure Bot Service
Best for: Enterprise environments using Microsoft 365
Features: Adaptive cards, meeting integration, Office document access

WhatsApp

Implementation: WhatsApp Business API (via Meta Cloud API or BSPs like Twilio)
Best for: Customer support, commerce, markets where WhatsApp is dominant
Limitations: Free-form replies are allowed only within 24 hours of the user's last message; outbound messages outside that window must use pre-approved templates
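
The 24-hour window means your backend has to decide, before each send, whether a free-form reply is allowed or a template is required. A minimal sketch of that check, assuming you store the timestamp of each user's last inbound message (the function name and the "freeform"/"template" return values are hypothetical):

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# The messaging window described above.
SERVICE_WINDOW = timedelta(hours=24)


def choose_message_type(
    last_user_message_at: Optional[datetime],
    now: Optional[datetime] = None,
) -> str:
    """Return "freeform" inside the 24-hour window, otherwise "template"."""
    now = now or datetime.now(timezone.utc)
    if last_user_message_at is not None and now - last_user_message_at <= SERVICE_WINDOW:
        return "freeform"
    # No inbound message yet, or the window has closed.
    return "template"
```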

Email

Implementation: Parse incoming emails, generate responses, send via SMTP/API
Best for: Asynchronous support, formal communications, document-heavy interactions
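
The parse-and-reply flow can be sketched with Python's standard email package. This assumes the incoming message arrives as a raw RFC 5322 string (for example from a webhook or mailbox poll); extract_body and build_reply are hypothetical helper names, and the reply text would come from your assistant.

```python
from email import message_from_string, policy
from email.message import EmailMessage
from email.utils import parseaddr


def extract_body(raw_email: str) -> str:
    """Return the plain-text body of an incoming email, or "" if none."""
    incoming = message_from_string(raw_email, policy=policy.default)
    part = incoming.get_body(preferencelist=("plain",))
    return part.get_content().strip() if part is not None else ""


def build_reply(raw_email: str, reply_text: str, from_addr: str) -> EmailMessage:
    """Build a reply that threads correctly in the sender's mail client."""
    incoming = message_from_string(raw_email, policy=policy.default)
    _, sender = parseaddr(str(incoming.get("From", "")))
    subject = str(incoming.get("Subject", ""))

    reply = EmailMessage()
    reply["From"] = from_addr
    reply["To"] = sender
    reply["Subject"] = subject if subject.lower().startswith("re:") else f"Re: {subject}"
    message_id = incoming.get("Message-ID")
    if message_id:
        # These headers let mail clients group the reply with the original.
        reply["In-Reply-To"] = str(message_id)
        reply["References"] = str(message_id)
    reply.set_content(reply_text)
    return reply
```

The resulting EmailMessage can be handed to smtplib's SMTP.send_message or serialized for an email provider's API.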

Voice

Implementation: Speech-to-text → LLM → Text-to-speech pipeline (Twilio, Vonage)
Best for: Phone support, accessibility, hands-free scenarios
Challenge: Latency management; users expect near-instant voice responses
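
One common way to reduce voice latency is to start text-to-speech on the first complete sentence instead of waiting for the full LLM response. The sketch below shows only that buffering step; it assumes the LLM yields text tokens one at a time, and the punctuation-based sentence detection is deliberately naive (abbreviations such as "e.g." would split early).

```python
from typing import Iterable, Iterator

SENTENCE_END = (".", "!", "?")


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Group streamed tokens into sentences that can be sent to TTS early."""
    buffer = ""
    for token in tokens:
        buffer += token
        if buffer.rstrip().endswith(SENTENCE_END):
            yield buffer.strip()
            buffer = ""
    # Flush whatever is left when the stream ends mid-sentence.
    if buffer.strip():
        yield buffer.strip()
```

Each yielded sentence can be synthesized and played while the LLM is still generating the next one.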

API Deployment

REST API

The simplest approach: the client sends a message and the server returns the complete response.

Python - FastAPI Endpoint

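from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

# Placeholder in-memory session store; use Redis or a database in production.
_sessions: dict[str, list[dict]] = {}

def load_session(session_id: str) -> list[dict]:
    return _sessions.get(session_id, [])

def save_session(session_id: str, session: list[dict]) -> None:
    _sessions[session_id] = session

class Assistant:
    """Placeholder assistant; replace chat() with a call to your LLM."""
    def chat(self, message: str, session: list[dict]) -> str:
        reply = f"You said: {message}"
        session.append({"user": message, "assistant": reply})
        return reply

assistant = Assistant()

class ChatRequest(BaseModel):
    session_id: str
    message: str

class ChatResponse(BaseModel):
    reply: str
    session_id: str

# Plain `def` (not `async def`): FastAPI runs it in a thread pool, so a
# blocking assistant.chat() call does not stall the event loop.
@app.post("/chat", response_model=ChatResponse)
def chat(request: ChatRequest):
    # Load session, run assistant, save session
    session = load_session(request.session_id)
    reply = assistant.chat(request.message, session)
    save_session(request.session_id, session)
    return ChatResponse(
        reply=reply,
        session_id=request.session_id
    )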

WebSocket

For real-time streaming responses. The server sends tokens as they are generated.

Advantages: Low latency, real-time typing effect, bidirectional communication
Best for: Web widgets, interactive chat interfaces

Streaming Responses

Stream tokens to the client as the LLM generates them. This dramatically improves perceived latency: users see the response forming rather than waiting for the complete answer.

Server-Sent Events (SSE): Simpler than WebSocket; one-directional streaming over HTTP
WebSocket: Full-duplex; useful when the client also needs to send real-time data
Provider support: Anthropic, OpenAI, and Google all offer streaming endpoints
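
To make the SSE option concrete, here is a minimal sketch of the wire format: each event is a "data:" line followed by a blank line. It assumes the tokens come from some iterable (in practice, your LLM provider's streaming response). The JSON payload shape and the final "done" event are illustrative conventions, not part of the SSE standard.

```python
import json
from typing import Iterable, Iterator


def sse_events(tokens: Iterable[str]) -> Iterator[str]:
    """Format streamed tokens as Server-Sent Events frames."""
    for token in tokens:
        # JSON-encode the token so embedded newlines cannot break the framing.
        payload = json.dumps({"token": token})
        yield f"data: {payload}\n\n"
    # Hypothetical end-of-stream marker so the client knows when to stop.
    yield "event: done\ndata: {}\n\n"
```

In FastAPI, a generator like this can be returned through StreamingResponse with media_type="text/event-stream"; in the browser, the EventSource API consumes it.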

Rate Limiting

Per-user limits: Prevent individual users from overwhelming the system (e.g., 20 messages/minute)
Global limits: Cap total API calls to stay within budget and LLM provider rate limits
Token limits: Set maximum input and output tokens per request
Implementation: Redis-based rate limiting, API gateway (Kong, Nginx), or cloud provider built-in
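
As an illustration of per-user limits, the sketch below implements a sliding-window limiter in memory. It is an assumption-laden stand-in for the Redis or gateway options listed above: it works only within a single process, and the class name and defaults (20 requests per 60 seconds, matching the example above) are illustrative.

```python
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict


class SlidingWindowLimiter:
    """Allow at most max_requests per user within the last window_seconds."""

    def __init__(
        self,
        max_requests: int = 20,
        window_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._clock = clock
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, user_id: str) -> bool:
        now = self._clock()
        hits = self._hits[user_id]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True
```

When allow() returns False, the endpoint would typically respond with HTTP 429. With multiple server instances, the counters need to live in shared storage such as Redis so that all instances see the same state.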

Authentication

API keys: For server-to-server communication
JWTs: For user-facing applications with login
Session tokens: For anonymous web widget users (tied to browser session)
Never expose LLM API keys to the client: Always proxy through your backend
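
For the anonymous widget case, one possible approach is a session token signed by your backend, so the server can trust the session ID without storing a login. The sketch below uses only the standard library. The token layout (session ID, expiry, HMAC signature joined by dots) is a made-up format for illustration, not a standard; use a JWT library if you need interoperability.

```python
import hashlib
import hmac
import secrets
import time
from typing import Optional


def _sign(secret: bytes, payload: str) -> str:
    return hmac.new(secret, payload.encode(), hashlib.sha256).hexdigest()


def issue_session_token(secret: bytes, ttl_seconds: int = 3600,
                        now: Optional[float] = None) -> str:
    """Create a token of the form "<session_id>.<expires>.<signature>"."""
    now = time.time() if now is None else now
    session_id = secrets.token_urlsafe(16)  # URL-safe, never contains "."
    payload = f"{session_id}.{int(now) + ttl_seconds}"
    return f"{payload}.{_sign(secret, payload)}"


def verify_session_token(secret: bytes, token: str,
                         now: Optional[float] = None) -> Optional[str]:
    """Return the session ID if the token is authentic and unexpired."""
    now = time.time() if now is None else now
    try:
        session_id, expires, signature = token.split(".")
        expires_at = int(expires)
    except ValueError:
        return None  # wrong number of parts or non-numeric expiry
    expected = _sign(secret, f"{session_id}.{expires}")
    if not hmac.compare_digest(expected.encode(), signature.encode()):
        return None
    if now >= expires_at:
        return None
    return session_id
```

The signing secret stays on the server alongside the LLM API key; the browser only ever holds the token.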

Analytics and Tracking

Track these metrics to understand usage and improve your assistant:

Conversation volume: Total conversations, messages per conversation
Resolution rate: Percentage of conversations resolved without human escalation
User satisfaction: Post-conversation ratings, sentiment analysis
Common topics: What users ask about most (informs knowledge base improvements)
Fallback rate: How often the assistant cannot answer a question
Cost per conversation: Token usage and API costs
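
Several of these metrics can be computed directly from conversation logs. The sketch below assumes a hypothetical log record with the fields shown, and it defines fallback rate as fallback replies divided by all assistant replies; your own schema and definitions may differ.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Conversation:
    # Hypothetical log record; adapt the fields to your own schema.
    messages: int            # total messages (user + assistant)
    assistant_replies: int
    fallback_replies: int    # replies where the assistant could not answer
    escalated: bool          # handed off to a human
    cost_usd: float


def summarize(conversations: List[Conversation]) -> Dict[str, float]:
    """Compute headline metrics over a batch of conversations."""
    total = len(conversations)
    if total == 0:
        return {}
    replies = sum(c.assistant_replies for c in conversations)
    return {
        "conversations": float(total),
        "messages_per_conversation": sum(c.messages for c in conversations) / total,
        "resolution_rate": sum(not c.escalated for c in conversations) / total,
        "fallback_rate": (
            sum(c.fallback_replies for c in conversations) / replies if replies else 0.0
        ),
        "cost_per_conversation": sum(c.cost_usd for c in conversations) / total,
    }
```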

Cost Estimation

Traffic Level	Conversations/Month	Estimated Monthly Cost*
Small	1,000	$10-50
Medium	10,000	$100-500
Large	100,000	$1,000-5,000
Enterprise	1,000,000+	$10,000+

*Estimates based on average conversation length of 5-10 messages using mid-tier models. Actual costs depend on model choice, message length, and tool usage.
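
To see where figures like these come from, the sketch below works through the arithmetic for one traffic level. Every number in the usage note is a placeholder: the per-token prices are hypothetical, not any provider's actual rates, and the token counts per turn are rough guesses that you should replace with measurements from your own logs.

```python
def estimate_monthly_cost(
    conversations_per_month: int,
    assistant_turns: int,            # assistant replies per conversation
    input_tokens_per_turn: int,      # prompt + history sent on each turn
    output_tokens_per_turn: int,
    input_price_per_mtok: float,     # USD per million input tokens
    output_price_per_mtok: float,    # USD per million output tokens
) -> float:
    """Estimate monthly LLM spend in USD from per-turn token usage."""
    input_tokens = assistant_turns * input_tokens_per_turn
    output_tokens = assistant_turns * output_tokens_per_turn
    cost_per_conversation = (
        input_tokens * input_price_per_mtok + output_tokens * output_price_per_mtok
    ) / 1_000_000
    return conversations_per_month * cost_per_conversation
```

For example, 1,000 conversations with 4 assistant replies each, 1,500 input and 200 output tokens per reply, at a hypothetical $3 and $15 per million tokens, comes to about $0.03 per conversation, or roughly $30 per month, which falls inside the Small row above. Input tokens usually dominate because the conversation history is re-sent on every turn.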

Scaling for Traffic

Horizontal scaling: Run multiple API server instances behind a load balancer
Caching: Cache common responses, knowledge base queries, and embeddings
Queue-based processing: For non-real-time channels (email, async), use job queues
CDN for static assets: Serve the widget JavaScript and CSS from a CDN
Database optimization: Use connection pooling, read replicas for conversation history
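
As one example from this list, caching common responses can be sketched as a small time-to-live (TTL) cache keyed on the normalized question. This is an in-process illustration with hypothetical names; a shared cache such as Redis would be needed once you run multiple server instances, and only answers that do not depend on user or session context are safe to cache this way.

```python
import time
from typing import Callable, Dict, Optional, Tuple


def normalize(question: str) -> str:
    """Lowercase and collapse whitespace so trivial variants share a key."""
    return " ".join(question.lower().split())


class TTLCache:
    """Cache answers for a fixed time so stale content eventually expires."""

    def __init__(self, ttl_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl_seconds = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, str]] = {}

    def get(self, question: str) -> Optional[str]:
        key = normalize(question)
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, answer = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # expired: evict and report a miss
            return None
        return answer

    def put(self, question: str, answer: str) -> None:
        self._entries[normalize(question)] = (self._clock() + self.ttl_seconds, answer)
```

On a cache miss, the request goes to the assistant as usual and the answer is stored with put(); on a hit, the LLM call is skipped entirely.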
