Advanced

Deploying AI Assistants

Once your assistant is built, you need to make it available to users. Learn about deployment channels, API design, streaming, authentication, and how to scale for production traffic.

Deployment Channels

Web Widget

The most common deployment channel: a chat widget embedded in your website.

  • Implementation: JavaScript widget that communicates with your backend via WebSocket or REST API
  • Advantages: Reaches all website visitors, customizable UI, no app installation
  • Platforms: Intercom, Crisp, custom builds with React/Vue components

Slack Integration

  • Implementation: Slack Bot using the Slack Events API and Web API
  • Best for: Internal assistants, team productivity, IT help desk
  • Features: Slash commands, threaded conversations, rich message formatting
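Because the Events API delivers events as HTTP POST requests to your endpoint, the bot should confirm that each request really comes from Slack. The following is a minimal sketch of Slack's v0 request-signing check using only the standard library. The function name, the `now` parameter, and the five-minute replay window are illustrative assumptions, not part of any SDK.

```python
import hashlib
import hmac
import time
from typing import Optional


def is_valid_slack_request(signing_secret: str, timestamp: str, body: bytes,
                           signature: str, max_age_seconds: int = 300,
                           now: Optional[float] = None) -> bool:
    """Verify a Slack request signature.

    timestamp is the X-Slack-Request-Timestamp header value, signature is the
    X-Slack-Signature header value, and body is the raw request body.
    """
    now = time.time() if now is None else now
    try:
        request_time = int(timestamp)
    except ValueError:
        return False
    # Reject stale requests to limit replay attacks (window is an assumption).
    if abs(now - request_time) > max_age_seconds:
        return False
    base_string = b"v0:" + timestamp.encode() + b":" + body
    expected = "v0=" + hmac.new(signing_secret.encode(), base_string,
                                hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected.encode(), signature.encode())
```

The check must run against the raw request bytes, before any JSON parsing, because re-serialized JSON may not match the bytes Slack signed.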

Microsoft Teams

  • Implementation: Microsoft Bot Framework with Azure Bot Service
  • Best for: Enterprise environments using Microsoft 365
  • Features: Adaptive cards, meeting integration, Office document access

WhatsApp

  • Implementation: WhatsApp Business API (via Meta Cloud API or BSPs like Twilio)
  • Best for: Customer support, commerce, markets where WhatsApp is dominant
  • Limitations: Free-form replies are allowed only within 24 hours of the user's last message; outbound messages outside that window require pre-approved template messages
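The 24-hour messaging window can be expressed as a small check that runs before each outbound send. This is a sketch only: `needs_template` and `SERVICE_WINDOW` are hypothetical names, and treating the exact 24-hour boundary as closed is a conservative assumption.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Length of the messaging window described above.
SERVICE_WINDOW = timedelta(hours=24)


def needs_template(last_user_message_at: Optional[datetime],
                   now: Optional[datetime] = None) -> bool:
    """Return True when an outbound message should use a template message."""
    now = now or datetime.now(timezone.utc)
    if last_user_message_at is None:
        # The user has never written to us, so the message is business-initiated.
        return True
    # Treat the boundary itself as outside the window (conservative choice).
    return now - last_user_message_at >= SERVICE_WINDOW
```

Storing the timestamp of each user's most recent inbound message alongside the session makes this check a single lookup.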

Email

  • Implementation: Parse incoming emails, generate responses, send via SMTP/API
  • Best for: Asynchronous support, formal communications, document-heavy interactions
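The parse-and-respond steps can be sketched with Python's standard `email` package. The helper names `extract_text` and `build_reply` are hypothetical, and the sketch assumes the incoming message has a plain-text body; real mail often needs HTML handling and quoted-reply stripping as well.

```python
from email import message_from_string, policy
from email.message import EmailMessage


def extract_text(raw_email: str) -> str:
    """Return the plain-text body of a raw RFC 5322 message ('' if none)."""
    incoming = message_from_string(raw_email, policy=policy.default)
    body = incoming.get_body(preferencelist=("plain",))
    return body.get_content().strip() if body else ""


def build_reply(raw_email: str, reply_text: str, from_addr: str) -> EmailMessage:
    """Build a reply that threads correctly in the sender's mail client."""
    incoming = message_from_string(raw_email, policy=policy.default)
    subject = str(incoming["Subject"] or "")
    reply = EmailMessage()
    reply["Subject"] = subject if subject.lower().startswith("re:") else f"Re: {subject}"
    reply["From"] = from_addr
    # Prefer Reply-To when the sender set one.
    reply["To"] = str(incoming["Reply-To"] or incoming["From"])
    if incoming["Message-ID"]:
        # These headers let mail clients group the reply with the original.
        reply["In-Reply-To"] = str(incoming["Message-ID"])
        reply["References"] = str(incoming["Message-ID"])
    reply.set_content(reply_text)
    return reply
```

The text from `extract_text` would be passed to the assistant, and the resulting `EmailMessage` can be handed to `smtplib.SMTP.send_message` or to a transactional email API.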

Voice

  • Implementation: Speech-to-text → LLM → Text-to-speech pipeline (Twilio, Vonage)
  • Best for: Phone support, accessibility, hands-free scenarios
  • Challenge: Latency management; users expect near-instant voice responses

API Deployment

REST API

The simplest approach: the client sends a message and the server returns the complete response.

Python - FastAPI Endpoint
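from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

# Placeholder storage and assistant so the example runs on its own;
# swap in your real session store and LLM-backed assistant.
SESSIONS: dict[str, list[dict]] = {}

def load_session(session_id: str) -> list[dict]:
    return SESSIONS.get(session_id, [])

def save_session(session_id: str, session: list[dict]) -> None:
    SESSIONS[session_id] = session

class EchoAssistant:
    def chat(self, message: str, session: list[dict]) -> str:
        reply = f"You said: {message}"
        session.append({"user": message, "assistant": reply})
        return reply

assistant = EchoAssistant()

class ChatRequest(BaseModel):
    session_id: str
    message: str

class ChatResponse(BaseModel):
    reply: str
    session_id: str

@app.post("/chat", response_model=ChatResponse)
def chat(request: ChatRequest) -> ChatResponse:
    # Plain def: FastAPI runs it in a thread pool, so a blocking
    # assistant call does not stall the event loop.
    session = load_session(request.session_id)  # load conversation history
    reply = assistant.chat(request.message, session)  # run the assistant
    save_session(request.session_id, session)  # persist the updated history
    return ChatResponse(reply=reply, session_id=request.session_id)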

WebSocket

Use WebSocket for real-time streaming responses: the server sends tokens as they are generated.

  • Advantages: Low latency, real-time typing effect, bidirectional communication
  • Best for: Web widgets, interactive chat interfaces
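One way to structure token-by-token delivery over a WebSocket is to send each token as a small JSON frame, followed by a final frame that marks the end of the reply. The sketch below is transport-agnostic: `stream_reply`, the frame shape, and `fake_tokens` (a stand-in for an LLM streaming call) are all assumptions made for illustration.

```python
import asyncio
import json
from typing import AsyncIterator, Awaitable, Callable


async def stream_reply(send_text: Callable[[str], Awaitable[None]],
                       tokens: AsyncIterator[str]) -> str:
    """Send each token as its own JSON frame, then a final 'done' frame."""
    parts = []
    async for token in tokens:
        parts.append(token)
        await send_text(json.dumps({"type": "token", "text": token}))
    full_reply = "".join(parts)
    # The final frame carries the whole reply so the client can reconcile.
    await send_text(json.dumps({"type": "done", "text": full_reply}))
    return full_reply


async def fake_tokens() -> AsyncIterator[str]:
    """Hypothetical stand-in for a streaming LLM call."""
    for token in ["Hel", "lo", "!"]:
        await asyncio.sleep(0)  # yield control, as a real network call would
        yield token
```

In a FastAPI WebSocket endpoint, the connection's `send_text` method could be passed as `send_text`; the returned string is what would be saved to the session.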

Streaming Responses

Stream tokens to the client as the LLM generates them. This reduces perceived latency: users see the response forming rather than waiting for the complete answer.

  • Server-Sent Events (SSE): Simpler than WebSocket; one-directional streaming over HTTP
  • WebSocket: Full-duplex; useful when the client also needs to send real-time data
  • Provider support: Anthropic, OpenAI, and Google all offer streaming endpoints
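For the SSE option, each chunk is written in the `text/event-stream` wire format: optional `event:` lines, one `data:` line per line of payload, and a blank line to end the event. A minimal sketch follows; the `[DONE]` sentinel and the `done` event name are illustrative conventions chosen here, not part of the SSE format.

```python
from typing import Iterable, Iterator, Optional


def format_sse(data: str, event: Optional[str] = None) -> str:
    """Encode one Server-Sent Event in the text/event-stream format."""
    lines = []
    if event:
        lines.append(f"event: {event}")
    # Multi-line payloads need one "data:" line per line of text.
    for line in data.split("\n"):
        lines.append(f"data: {line}")
    # A blank line terminates the event.
    return "\n".join(lines) + "\n\n"


def sse_stream(tokens: Iterable[str]) -> Iterator[str]:
    """Yield one SSE event per token, then a final 'done' event."""
    for token in tokens:
        yield format_sse(token)
    yield format_sse("[DONE]", event="done")
```

With FastAPI, a generator like `sse_stream` could be wrapped in a `StreamingResponse` with `media_type="text/event-stream"`; in the browser, `EventSource` reads the stream.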

Rate Limiting

  • Per-user limits: Prevent individual users from overwhelming the system (e.g., 20 messages/minute)
  • Global limits: Cap total API calls to stay within budget and LLM provider rate limits
  • Token limits: Set maximum input and output tokens per request
  • Implementation: Redis-based rate limiting, API gateway (Kong, Nginx), or cloud provider built-in
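A per-user limit such as 20 messages per minute can be sketched as a sliding-window counter. This in-memory version is an illustration for a single process; the class name is hypothetical, and with several server instances the counters would need to live in shared storage such as Redis.

```python
import time
from collections import defaultdict, deque
from typing import Optional


class SlidingWindowLimiter:
    """Allow at most max_requests per user within any window_seconds span."""

    def __init__(self, max_requests: int = 20, window_seconds: float = 60.0):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._hits = defaultdict(deque)  # user_id -> timestamps of recent requests

    def allow(self, user_id: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        hits = self._hits[user_id]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False
        hits.append(now)
        return True
```

A request handler would call `limiter.allow(user_id)` first and return HTTP 429 when it yields `False`.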

Authentication

  • API keys: For server-to-server communication
  • JWTs: For user-facing applications with login
  • Session tokens: For anonymous web widget users (tied to browser session)
  • Never expose LLM API keys to the client: Always proxy through your backend
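For anonymous widget users, a session token can be as simple as a session ID and expiry time signed by the backend. The sketch below uses an HMAC over a JSON payload; it is a simplified stand-in, not a JWT implementation, and `issue_token`, `verify_token`, and the claim names are assumptions. In production a maintained JWT or session library is usually the safer choice.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional


def _b64(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def _unb64(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def issue_token(secret: bytes, session_id: str, ttl_seconds: int = 3600,
                now: Optional[float] = None) -> str:
    """Return '<payload>.<signature>' for the given session."""
    now = time.time() if now is None else now
    claims = {"sid": session_id, "exp": int(now) + ttl_seconds}
    payload = _b64(json.dumps(claims).encode())
    sig = _b64(hmac.new(secret, payload.encode(), hashlib.sha256).digest())
    return f"{payload}.{sig}"


def verify_token(secret: bytes, token: str,
                 now: Optional[float] = None) -> Optional[str]:
    """Return the session ID if the token is authentic and unexpired."""
    now = time.time() if now is None else now
    try:
        payload, sig = token.split(".")
        expected = _b64(hmac.new(secret, payload.encode(), hashlib.sha256).digest())
        # Check the signature before trusting anything in the payload.
        if not hmac.compare_digest(expected.encode(), sig.encode()):
            return None
        claims = json.loads(_unb64(payload))
    except ValueError:
        return None
    if claims["exp"] < now:
        return None
    return claims["sid"]
```

The signing secret stays on the server, in the same way as the LLM API key; the browser only ever holds the signed token.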

Analytics and Tracking

Track these metrics to understand usage and improve your assistant:

  • Conversation volume: Total conversations, messages per conversation
  • Resolution rate: Percentage of conversations resolved without human escalation
  • User satisfaction: Post-conversation ratings, sentiment analysis
  • Common topics: What users ask about most (informs knowledge base improvements)
  • Fallback rate: How often the assistant cannot answer a question
  • Cost per conversation: Token usage and API costs
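Several of these metrics reduce to simple aggregates over per-conversation records. The sketch below assumes a hypothetical `Conversation` record, and it defines fallback rate as the share of conversations with at least one unanswered question, which is one reasonable reading among several.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Conversation:
    messages: int     # total messages exchanged
    escalated: bool   # handed off to a human agent
    fallbacks: int    # replies where the assistant could not answer
    cost_usd: float   # token and API cost for this conversation


def summarize(conversations: List[Conversation]) -> Dict[str, float]:
    """Aggregate usage metrics over a batch of conversations."""
    total = len(conversations)
    if total == 0:
        return {}
    return {
        "conversations": total,
        "messages_per_conversation": sum(c.messages for c in conversations) / total,
        "resolution_rate": sum(not c.escalated for c in conversations) / total,
        "fallback_rate": sum(c.fallbacks > 0 for c in conversations) / total,
        "cost_per_conversation": sum(c.cost_usd for c in conversations) / total,
    }
```

Running this per day or per week gives trend lines; a rising fallback rate usually points to gaps in the knowledge base.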

Cost Estimation

Traffic Level | Conversations/Month | Estimated Monthly Cost*
Small         | 1,000               | $10-50
Medium        | 10,000              | $100-500
Large         | 100,000             | $1,000-5,000
Enterprise    | 1,000,000+          | $10,000+

*Estimates based on average conversation length of 5-10 messages using mid-tier models. Actual costs depend on model choice, message length, and tool usage.
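To produce an estimate for your own traffic, multiply conversations by assistant turns and by the token cost of each turn. The sketch below shows the arithmetic; every number in the example call (turns, token counts, and per-million-token prices) is a placeholder assumption rather than a real provider price.

```python
def estimate_monthly_cost(conversations: int, turns_per_conversation: int,
                          input_tokens_per_turn: int, output_tokens_per_turn: int,
                          input_price_per_mtok: float,
                          output_price_per_mtok: float) -> float:
    """Estimate monthly LLM spend in dollars.

    Prices are in dollars per million tokens. input_tokens_per_turn should
    include the system prompt and any conversation history resent each turn.
    """
    price_weighted_tokens_per_turn = (
        input_tokens_per_turn * input_price_per_mtok
        + output_tokens_per_turn * output_price_per_mtok
    )
    total = conversations * turns_per_conversation * price_weighted_tokens_per_turn
    return total / 1_000_000


# Placeholder inputs: 10,000 conversations, 4 assistant turns each,
# 1,500 input and 300 output tokens per turn, $3 and $15 per million tokens.
print(estimate_monthly_cost(10_000, 4, 1_500, 300, 3.0, 15.0))  # 360.0
```

With these placeholder inputs the result falls inside the Medium row of the table. Input tokens tend to dominate as conversations get longer, because the history is resent on every turn.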

Scaling for Traffic

  • Horizontal scaling: Run multiple API server instances behind a load balancer
  • Caching: Cache common responses, knowledge base queries, and embeddings
  • Queue-based processing: For non-real-time channels such as email, put incoming messages on a job queue and let workers process them asynchronously
  • CDN for static assets: Serve the widget JavaScript and CSS from a CDN
  • Database optimization: Use connection pooling, read replicas for conversation history
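Caching common responses can start as a small time-to-live cache keyed on a normalized form of the question. This is an in-memory sketch with hypothetical names; it assumes the cached answers do not depend on the user or on conversation history, and a multi-instance deployment would move the entries to a shared store.

```python
import time
from typing import Dict, Optional, Tuple


def normalize(question: str) -> str:
    """Lowercase and collapse whitespace so trivial variants share a key."""
    return " ".join(question.lower().split())


class ResponseCache:
    """Cache replies for a fixed time-to-live."""

    def __init__(self, ttl_seconds: float = 3600.0):
        self.ttl_seconds = ttl_seconds
        self._entries: Dict[str, Tuple[float, str]] = {}

    def get(self, question: str, now: Optional[float] = None) -> Optional[str]:
        now = time.monotonic() if now is None else now
        key = normalize(question)
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, reply = entry
        if now >= expires_at:
            del self._entries[key]  # expired: evict and report a miss
            return None
        return reply

    def put(self, question: str, reply: str, now: Optional[float] = None) -> None:
        now = time.monotonic() if now is None else now
        self._entries[normalize(question)] = (now + self.ttl_seconds, reply)
```

The handler would call `get` before invoking the LLM and `put` after a successful reply, so repeated questions skip the model call entirely.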