Building a Production-Ready Multi-Model NLP API with Hugging Face & Docker

MLOps Project Series: Part 3

Introduction

Machine Learning Operations (MLOps) is the discipline of delivering machine learning systems to production reliably and efficiently. This comprehensive guide walks you through building a real-world MLOps project: a production-ready multi-model NLP API with 4 Hugging Face models running in Docker.

By the end of this series, you’ll have:

  • ✅ A working NLP API with 4 different models
  • ✅ Docker containerization for easy deployment
  • ✅ FastAPI REST endpoints
  • ✅ Production-ready error handling
  • ✅ Knowledge of MLOps best practices

Table of Contents

  1. Part 1: Project Setup & Architecture
  2. Part 2: Understanding the 4 NLP Models
  3. Part 3: Building the FastAPI Server
  4. Part 4: Containerization with Docker
  5. Part 5: Testing & Verification

Part 1: Project Setup & Architecture

What We’re Building

┌──────────────────────────────────────────────────┐
│         Multi-Model NLP API (Production)         │
├──────────────────────────────────────────────────┤
│  FastAPI Server (Python 3.11)                    │
│  ├─ /analyze               (Sentiment Analysis)  │
│  ├─ /analyze/ner           (Entity Recognition)  │
│  ├─ /analyze/zero-shot     (Classification)      │
│  └─ /analyze/summarize     (Summarization)       │
│                                                  │
│  Models (Hugging Face)                           │
│  ├─ distilbert (Sentiment)                       │
│  ├─ bert-base-NER (NER)                          │
│  ├─ bart-large-mnli (Zero-Shot)                  │
│  └─ bart-large-cnn (Summarization)               │
│                                                  │
│  Infrastructure                                  │
│  └─ Docker Container (python:3.11-slim)          │
└──────────────────────────────────────────────────┘

Project Structure

nlp_end2end/
├── api.py                 # FastAPI application (4 models)
├── call_api.py           # Test script
├── requirements.txt      # Python dependencies
├── Dockerfile            # Container configuration
└── .dockerignore        # Docker optimization

Prerequisites

Before starting, ensure you have:

# Check Python version
python3 --version          # Should be 3.8+ (the Docker image uses 3.11)

# Install Docker
docker --version           # Should be 20.10+

# Verify pip
pip3 --version            # Should be 20.0+

Step 1: Create Project Directory

# Clone the repo containing the complete code

git clone git@github.com:rjshk013/mlops-project.git
cd ~/mlops-project/huggingface-project/nlp_huggingface

# Create virtual environment
python3 -m venv venv
source venv/bin/activate

# Verify
which python
python --version

Step 2: Initialize Dependencies

# Create requirements.txt
cat > requirements.txt << 'EOF'
fastapi==0.104.1
uvicorn==0.24.0
pydantic==2.5.0
transformers==4.36.2
torch==2.1.1
requests==2.31.0
EOF

# Install dependencies
pip install -r requirements.txt

# Verify installation
python -c "from fastapi import FastAPI; print('✓ FastAPI installed')"
python -c "from transformers import pipeline; print('✓ Transformers installed')"

MLOps Principle #1: Environment Consistency

Why it matters: The requirements.txt with pinned versions ensures:

  • Same environment across machines
  • Reproducible builds
  • No “works on my machine” problems
  • Production stability

Best Practice:

❌ Bad:  transformers
✅ Good: transformers==4.36.2
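
To make the pins enforceable at runtime, you can add a small startup guard that compares installed packages against the pinned versions. A minimal sketch (the check_env.py file and its fail-fast behavior are an illustration, not part of the project code):

# check_env.py: optional startup guard (a sketch, not in the project)
# Fail fast if installed versions drift from the pins in requirements.txt
from importlib.metadata import version

PINNED = {"fastapi": "0.104.1", "transformers": "4.36.2", "torch": "2.1.1"}

for package, expected in PINNED.items():
    installed = version(package)
    # torch may carry a local build tag such as "2.1.1+cpu"; compare the base version
    if installed.split("+")[0] != expected:
        raise RuntimeError(f"{package}: expected {expected}, found {installed}")

print("✓ Environment matches pinned versions")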


Part 2: Understanding the 4 NLP Models

Model 1: Sentiment Analysis (DistilBERT)

What It Does

Classifies text as POSITIVE or NEGATIVE

When to Use

  • Product reviews
  • Customer feedback
  • Social media monitoring
  • Sentiment trending

Code Example

from transformers import pipeline

sentiment_pipeline = pipeline(
    "sentiment-analysis",
    model="distilbert-base-uncased-finetuned-sst-2-english"
)

result = sentiment_pipeline("I love this product!")
# Output: [{'label': 'POSITIVE', 'score': 0.9999}]
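
The pipeline also accepts a list of strings and returns one result per input, which is useful for batch scoring (a small extension of the example above):

# Batch scoring: pass a list, get one result per input
reviews = [
    "I love this product!",
    "Terrible experience, would not recommend."
]
results = sentiment_pipeline(reviews)
# Output: [{'label': 'POSITIVE', 'score': ...}, {'label': 'NEGATIVE', 'score': ...}]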

Performance

  • Size: 268 MB
  • Speed: <100ms per request
  • Accuracy: 98%+
  • Use Case: Real-time sentiment analysis

Model 2: Named Entity Recognition (BERT-NER)

What It Does

Extracts entities (names, organizations, locations) from text

When to Use

  • Information extraction
  • Resume parsing
  • Document analysis
  • Knowledge graph building

Code Example

ner_pipeline = pipeline(
    "ner",
    model="dslim/bert-base-NER",
    aggregation_strategy="simple"
)

result = ner_pipeline("Apple CEO Tim Cook announced in California")
# Output: [
#   {'word': 'Apple', 'entity_group': 'ORG', 'score': 0.9998},
#   {'word': 'Tim Cook', 'entity_group': 'PER', 'score': 0.9987},
#   {'word': 'California', 'entity_group': 'LOC', 'score': 0.9995}
# ]

Entities Recognized

  • PER: Person names
  • ORG: Organizations/companies
  • LOC: Locations/places
  • MISC: Miscellaneous
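
Because each entity carries a score and an entity_group, a common post-processing step is to keep only high-confidence entities of the types you care about. An illustrative snippet (not part of the API code):

# Keep only high-confidence people and organizations from the NER output
entities = ner_pipeline("Apple CEO Tim Cook announced in California")
wanted_types = {"PER", "ORG"}
filtered = [
    e for e in entities
    if e["entity_group"] in wanted_types and e["score"] > 0.9
]
# -> Apple (ORG) and Tim Cook (PER); California (LOC) is dropped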

Performance

  • Size: 420 MB
  • Speed: 150-300ms
  • Accuracy: 96%+
  • Use Case: Data extraction at scale

Model 3: Zero-Shot Classification (BART-MNLI)

What It Does

Classifies text into custom categories WITHOUT training

When to Use

  • Email filtering (spam/legitimate/phishing)
  • Content moderation
  • Intent detection
  • Dynamic classification

Code Example

zero_shot_pipeline = pipeline(
    "zero-shot-classification",
    model="facebook/bart-large-mnli"
)

result = zero_shot_pipeline(
    "Click here to claim your prize!",
    candidate_labels=["spam", "legitimate", "phishing"]
)
# Output: {
#   'sequence': 'Click here to claim your prize!',
#   'labels': ['spam', 'phishing', 'legitimate'],
#   'scores': [0.9985, 0.0009, 0.0004]
# }

Key Advantage

Custom labels – Define your own categories on the fly!
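
By default the scores are normalized across the candidate labels (they sum to 1), which assumes exactly one label applies. When categories can overlap, the pipeline's multi_label=True option scores each label independently:

# Non-exclusive categories: score each label on its own
result = zero_shot_pipeline(
    "Click here to claim your prize!",
    candidate_labels=["spam", "urgent", "marketing"],
    multi_label=True
)
# With multi_label=True each score lies in [0, 1] independently;
# the scores no longer sum to 1 across labels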

Performance

  • Size: 1.6 GB
  • Speed: 200-500ms
  • Flexibility: Unlimited label combinations
  • Use Case: Dynamic content classification

Model 4: Text Summarization (BART-CNN)

What It Does

Creates shorter summaries of long text

When to Use

  • News summarization
  • Document summarization
  • Meeting notes
  • Article previews

Code Example

summarization_pipeline = pipeline(
    "summarization",
    model="facebook/bart-large-cnn"
)

text = """
Artificial Intelligence is transforming industries worldwide.
Machine learning enables computers to learn from data without 
explicit programming. Deep learning uses neural networks to 
process complex patterns. These technologies revolutionize 
healthcare, finance, transportation and education.
"""

result = summarization_pipeline(text, max_length=50, min_length=10)
# Output: [{'summary_text': 'Machine learning enables computers to 
# learn from data. Deep learning uses neural networks. These 
# technologies revolutionize healthcare and finance.'}]
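
One practical caveat: bart-large-cnn accepts roughly 1,024 input tokens, so longer documents must be truncated or chunked. Passing truncation=True asks the pipeline's tokenizer to cut the input at the model's limit instead of raising an error (a sketch; the very_long_text variable is hypothetical, and chunking strategies are beyond this guide):

# Truncate over-long inputs at the model's limit instead of erroring out
result = summarization_pipeline(
    very_long_text,  # hypothetical variable holding a long document
    max_length=50,
    min_length=10,
    truncation=True
)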

Performance

  • Size: 1.6 GB
  • Speed: 1-3 seconds for long text
  • Quality: Good extractive summaries
  • Use Case: Bulk summarization

Model Comparison Matrix

| Feature      | Sentiment | NER       | Zero-Shot | Summarization |
|--------------|-----------|-----------|-----------|---------------|
| Size         | 268 MB    | 420 MB    | 1.6 GB    | 1.6 GB        |
| Speed        | <100ms    | 150-300ms | 200-500ms | 1-3s          |
| Accuracy     | 98%+      | 96%+      | 90%+      | 85%+          |
| Customizable | No        | No        | YES       | No            |
| Batch Ready  | Yes       | Yes       | Yes       | Limited       |
| Real-time    | Yes       | Yes       | Yes       | No            |

Part 3: Building the FastAPI Server

Why FastAPI?

Traditional Flask:
  ❌ Synchronous (WSGI) by default
  ❌ Type hints not used for validation
  ❌ Manual request validation

Modern FastAPI:
  ✅ Significantly faster under concurrent load
  ✅ Auto-generated docs (Swagger UI at /docs)
  ✅ Built-in validation via Pydantic
  ✅ ASGI async support (see the sketch below)
  ✅ Production-ready
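
To make the async point concrete, here is a minimal sketch of an endpoint that declares async def and offloads blocking model inference to a thread pool, so one slow request does not stall the event loop. The /analyze/async-example route is hypothetical; app, SentimentRequest, and sentiment_pipeline are defined in the steps that follow:

# Sketch: offload blocking inference so the event loop stays responsive
from fastapi.concurrency import run_in_threadpool

@app.post("/analyze/async-example")  # hypothetical route, for illustration only
async def analyze_async(request: SentimentRequest):
    result = await run_in_threadpool(sentiment_pipeline, request.input_string)
    return {"result": result[0]}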

Step 1: Create API File

Create api.py:

from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline
from typing import List
import logging

# Setup logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Initialize FastAPI
app = FastAPI(title="Multi-Model NLP API")

# ============================================================================
# INITIALIZE MODELS
# ============================================================================

try:
    logger.info("Loading sentiment model...")
    sentiment_pipeline = pipeline(
        "sentiment-analysis",
        model="distilbert-base-uncased-finetuned-sst-2-english"
    )
    logger.info("✓ Sentiment model loaded")
except Exception as e:
    logger.error(f"Error loading sentiment model: {e}")
    sentiment_pipeline = None

try:
    logger.info("Loading NER model...")
    ner_pipeline = pipeline(
        "ner",
        model="dslim/bert-base-NER",
        aggregation_strategy="simple"
    )
    logger.info("✓ NER model loaded")
except Exception as e:
    logger.error(f"Error loading NER model: {e}")
    ner_pipeline = None

try:
    logger.info("Loading zero-shot model...")
    zero_shot_pipeline = pipeline(
        "zero-shot-classification",
        model="facebook/bart-large-mnli"
    )
    logger.info("✓ Zero-shot model loaded")
except Exception as e:
    logger.error(f"Error loading zero-shot model: {e}")
    zero_shot_pipeline = None

try:
    logger.info("Loading summarization model...")
    summarization_pipeline = pipeline(
        "summarization",
        model="facebook/bart-large-cnn"
    )
    logger.info("✓ Summarization model loaded")
except Exception as e:
    logger.error(f"Error loading summarization model: {e}")
    summarization_pipeline = None

logger.info("All models initialized!")

Step 2: Define Data Models (Pydantic)

# Data Models for request validation
class SentimentRequest(BaseModel):
    input_string: str

class NERRequest(BaseModel):
    input_string: str

class ZeroShotRequest(BaseModel):
    input_string: str
    labels: List[str]
    multi_label: bool = False

class SummarizationRequest(BaseModel):
    input_string: str
    max_length: int = 50
    min_length: int = 10

MLOps Principle #2: Type Safety

Pydantic provides:

  • ✅ Automatic request validation
  • ✅ Type checking
  • ✅ Error messages
  • ✅ Auto-documentation
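
Constraints can also be declared directly on fields, in which case FastAPI rejects bad requests with a 422 response before your handler runs. A sketch of a stricter variant (the ZeroShotRequestStrict model and its limits are illustrative, not part of the project):

from pydantic import BaseModel, Field
from typing import List

class ZeroShotRequestStrict(BaseModel):
    """Hypothetical stricter variant of ZeroShotRequest."""
    input_string: str = Field(min_length=1, max_length=5000)
    labels: List[str] = Field(min_length=1)  # require at least one label
    multi_label: bool = False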

Step 3: Define Endpoints

# ============================================================================
# ENDPOINTS
# ============================================================================

@app.get("/health")
def health_check():
    """Health check endpoint"""
    return {
        "status": "healthy",
        "models": [
            "sentiment-analysis",
            "named-entity-recognition",
            "zero-shot-classification",
            "summarization"
        ]
    }

@app.post("/analyze")
def analyze(request: SentimentRequest):
    """Primary endpoint: Sentiment analysis"""
    if sentiment_pipeline is None:
        return {"error": "Sentiment model not loaded"}
    
    try:
        result = sentiment_pipeline(request.input_string)
        return {
            "result": {
                "sentiment": result[0]["label"],
                "score": round(float(result[0]["score"]), 4)
            }
        }
    except Exception as e:
        logger.error(f"Error: {e}")
        return {"error": str(e)}

@app.post("/analyze/ner")
def analyze_ner(request: NERRequest):
    """Extract named entities"""
    if ner_pipeline is None:
        return {"error": "NER model not loaded"}
    
    try:
        entities = ner_pipeline(request.input_string)
        return {
            "input": request.input_string,
            "model": "bert-base-NER",
            "entities": entities,
            "total_entities": len(entities)
        }
    except Exception as e:
        logger.error(f"Error: {e}")
        return {"error": str(e)}

@app.post("/analyze/zero-shot")
def analyze_zero_shot(request: ZeroShotRequest):
    """Custom classification with labels"""
    if zero_shot_pipeline is None:
        return {"error": "Zero-shot model not loaded"}
    
    try:
        result = zero_shot_pipeline(
            request.input_string,
            candidate_labels=request.labels,
            multi_label=request.multi_label
        )
        return {
            "input": request.input_string,
            "model": "bart-large-mnli",
            "labels": result["labels"],
            "scores": [round(float(s), 4) for s in result["scores"]]
        }
    except Exception as e:
        logger.error(f"Error: {e}")
        return {"error": str(e)}

@app.post("/analyze/summarize")
def analyze_summarization(request: SummarizationRequest):
    """Summarize text"""
    if summarization_pipeline is None:
        return {"error": "Summarization model not loaded"}
    
    try:
        if len(request.input_string.split()) < 20:
            return {"error": "Text too short (minimum 20 words)"}
        
        summary = summarization_pipeline(
            request.input_string,
            max_length=request.max_length,
            min_length=request.min_length,
            do_sample=False
        )
        
        return {
            "input": request.input_string,
            "model": "bart-large-cnn",
            "summary": summary[0]["summary_text"],
            "original_length": len(request.input_string),
            "summary_length": len(summary[0]["summary_text"])
        }
    except Exception as e:
        logger.error(f"Error: {e}")
        return {"error": str(e)}

Step 4: Test Locally

# Activate virtual environment
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

# Run server
uvicorn api:app --reload --host 0.0.0.0 --port 8000

Expected output:

INFO:     Uvicorn running on http://0.0.0.0:8000
INFO:     Application startup complete

Step 5: Test Endpoints

# In another terminal:

# Health check
curl http://localhost:8000/health

# Sentiment
curl -X POST http://localhost:8000/analyze \
  -H 'Content-Type: application/json' \
  -d '{"input_string": "I love this!"}'

# NER
curl -X POST http://localhost:8000/analyze/ner \
  -H 'Content-Type: application/json' \
  -d '{"input_string": "Apple CEO Tim Cook"}'

# Zero-shot
curl -X POST http://localhost:8000/analyze/zero-shot \
  -H 'Content-Type: application/json' \
  -d '{"input_string":"Spam email","labels":["spam","legitimate"]}'

# Summarization
curl -X POST http://localhost:8000/analyze/summarize \
  -H 'Content-Type: application/json' \
  -d '{"input_string":"AI is transforming industries. Machine learning enables computers to learn from data. Deep learning uses neural networks. These technologies revolutionize healthcare, finance, transportation and education."}'

MLOps Principle #3: Error Handling

Every endpoint includes:

try:
    # Process request
except Exception as e:
    # Log error
    logger.error(f"Error: {e}")
    # Return error response
    return {"error": str(e)}

This ensures:

  • ✅ Server doesn’t crash
  • ✅ Errors are logged
  • ✅ Client gets feedback
  • ✅ Production stability
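
One refinement worth knowing: returning {"error": ...} from a handler still produces an HTTP 200, so clients must inspect the body to detect failures. Raising HTTPException instead sets a proper status code. A sketch of that alternative pattern (the /analyze/strict route is illustrative, not what the project currently ships):

from fastapi import HTTPException

@app.post("/analyze/strict")  # hypothetical variant of /analyze
def analyze_strict(request: SentimentRequest):
    if sentiment_pipeline is None:
        # 503: the service is up but the model is unavailable
        raise HTTPException(status_code=503, detail="Sentiment model not loaded")
    try:
        result = sentiment_pipeline(request.input_string)
    except Exception as e:
        logger.error(f"Error: {e}")
        raise HTTPException(status_code=500, detail=str(e))
    return {
        "sentiment": result[0]["label"],
        "score": round(float(result[0]["score"]), 4)
    }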

Part 4: Containerization with Docker

Why Docker for MLOps?

Without Docker:
  ❌ "Works on my machine"
  ❌ Different environments
  ❌ Hard to deploy
  ❌ Dependency conflicts

With Docker:
  ✅ Consistent environments
  ✅ Easy deployment
  ✅ Version control
  ✅ Scalable

Step 1: Create Dockerfile

Create Dockerfile:

# Use Python 3.11-slim (modern, lightweight, secure)
FROM python:3.11-slim

# Set working directory
WORKDIR /app

# Copy requirements first (improves Docker caching)
COPY requirements.txt requirements.txt

# Install dependencies
RUN pip install --no-cache-dir -r requirements.txt && \
    pip cache purge

# Copy application code
COPY . /app

# Expose port 8000
EXPOSE 8000

# Health check
HEALTHCHECK --interval=30s --timeout=10s --start-period=40s --retries=3 \
    CMD python -c "import requests; requests.get('http://localhost:8000/health').raise_for_status()" || exit 1

# Run the application
CMD ["uvicorn", "api:app", "--host", "0.0.0.0", "--port", "8000"]

Step 2: Create .dockerignore

Create .dockerignore:

__pycache__
*.pyc
*.pyo
*.pyd
.Python
venv/
.venv/
.git
.gitignore
.env
.DS_Store
README.md
*.md

Step 3: Build Image

# Build Docker image
docker build -t sentiment-app:latest .

# This will:
# 1. Download base Python 3.11 image (~150MB)
# 2. Install dependencies (~2GB)
# 3. Copy your code
# 4. Create image (~4GB total)

# Takes 3-5 minutes on first build
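
Note that the image contains your code but not the model weights: each pipeline downloads its model (roughly 4 GB across the four models) the first time the container starts. If you would rather trade a slower build for instant startups, you can bake the weights into the image with a small helper script (a hypothetical variation, not part of the original Dockerfile):

# download_models.py: hypothetical build-time helper that warms the
# Hugging Face cache inside the image so containers start without downloads
from transformers import pipeline

MODELS = [
    ("sentiment-analysis", "distilbert-base-uncased-finetuned-sst-2-english"),
    ("ner", "dslim/bert-base-NER"),
    ("zero-shot-classification", "facebook/bart-large-mnli"),
    ("summarization", "facebook/bart-large-cnn"),
]

for task, model in MODELS:
    # Instantiating a pipeline downloads and caches the model weights
    pipeline(task, model=model)

To use it, copy the script into the image and add RUN python download_models.py after the dependency install step; the image grows by several GB, but containers no longer download anything at startup.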

Step 4: Run Container

# Run container
docker run -d -p 8000:8000 sentiment-app:latest

# Verify it's running
docker ps

# Check logs
docker logs $(docker ps -q)

# View last 50 lines
docker logs --tail 50 $(docker ps -q)

# Follow logs in real-time
docker logs -f $(docker ps -q)

Docker Best Practices

Layer Caching

# ❌ Bad: Code changes require re-installing all dependencies
COPY . /app
RUN pip install -r requirements.txt

# ✅ Good: Dependencies only rebuild if requirements.txt changes
COPY requirements.txt requirements.txt
RUN pip install -r requirements.txt
COPY . /app

Image Size Optimization

# ❌ Bad: 500MB+
FROM python:3.11

# ✅ Good: ~150MB
FROM python:3.11-slim

Security

# ✅ Use specific versions (not latest)
FROM python:3.11-slim

# ✅ Create non-root user (production)
RUN useradd -m appuser
USER appuser

MLOps Principle #4: Infrastructure as Code

Dockerfile is code that defines:

  • Environment
  • Dependencies
  • Configuration
  • Runtime behavior

Benefits:

  • ✅ Reproducible builds
  • ✅ Version control
  • ✅ Collaboration
  • ✅ Automation

Part 5: Testing & Verification

Step 1: Health Check

# Test if API is responding
curl http://localhost:8000/health

# Expected response:
# {"status":"healthy","models":["sentiment-analysis",...]}

Step 2: Test Each Endpoint

#!/bin/bash

echo "Testing Sentiment Analysis..."
curl -X POST http://localhost:8000/analyze \
  -H 'Content-Type: application/json' \
  -d '{"input_string": "This is amazing!"}'

echo "Testing NER..."
curl -X POST http://localhost:8000/analyze/ner \
  -H 'Content-Type: application/json' \
  -d '{"input_string": "Apple Inc"}'

echo "Testing Zero-Shot..."
curl -X POST http://localhost:8000/analyze/zero-shot \
  -H 'Content-Type: application/json' \
  -d '{"input_string":"Spam","labels":["spam","legitimate"]}'

echo "Testing Summarization..."
curl -X POST http://localhost:8000/analyze/summarize \
  -H 'Content-Type: application/json' \
  -d '{"input_string":"Long text about AI and machine learning technologies..."}'

Step 3: Create Test Script

Create call_api.py:

import requests
import json

BASE_URL = "http://localhost:8000"

def test_sentiment():
    response = requests.post(
        f"{BASE_URL}/analyze",
        json={"input_string": "I love this!"}
    )
    print("Sentiment:", json.dumps(response.json(), indent=2))

def test_ner():
    response = requests.post(
        f"{BASE_URL}/analyze/ner",
        json={"input_string": "Apple CEO Tim Cook"}
    )
    print("NER:", json.dumps(response.json(), indent=2))

def test_zero_shot():
    response = requests.post(
        f"{BASE_URL}/analyze/zero-shot",
        json={
            "input_string": "Click here!",
            "labels": ["spam", "legitimate"]
        }
    )
    print("Zero-Shot:", json.dumps(response.json(), indent=2))

def test_summarization():
    response = requests.post(
        f"{BASE_URL}/analyze/summarize",
        json={
            "input_string": "AI is transforming industries. Machine learning enables computers to learn from data. Deep learning uses neural networks. These technologies revolutionize healthcare, finance, transportation and education."
        }
    )
    print("Summarization:", json.dumps(response.json(), indent=2))

if __name__ == "__main__":
    print("Testing API endpoints...\n")
    test_sentiment()
    print("\n")
    test_ner()
    print("\n")
    test_zero_shot()
    print("\n")
    test_summarization()
    print("\n✓ All tests completed!")

Run tests:

python3 call_api.py
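
If you want these checks to run in CI instead of by hand, the same calls translate directly into pytest tests. A minimal sketch (the test_api.py file is hypothetical and assumes the API is reachable at BASE_URL):

# test_api.py: hypothetical pytest version of the checks in call_api.py
import requests

BASE_URL = "http://localhost:8000"

def test_health():
    response = requests.get(f"{BASE_URL}/health")
    assert response.status_code == 200
    assert response.json()["status"] == "healthy"

def test_sentiment_positive():
    response = requests.post(
        f"{BASE_URL}/analyze",
        json={"input_string": "I love this!"}
    )
    assert response.status_code == 200
    assert response.json()["result"]["sentiment"] == "POSITIVE"

Run it with pytest test_api.py while the container is up.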

Conclusion

What You’ve Built

| Component        | Status         |
|------------------|----------------|
| 4 NLP Models     | ✅ Integrated  |
| FastAPI Server   | ✅ Working     |
| Docker Container | ✅ Deployed    |
| REST Endpoints   | ✅ Tested      |
| Error Handling   | ✅ Implemented |
| Health Checks    | ✅ Configured  |

MLOps Principles Learned

  1. Environment Consistency – Reproducible builds
  2. Type Safety – Validation & error prevention
  3. Error Handling – Graceful failures
  4. Infrastructure as Code – Version control everything
  5. Monitoring – Know what’s happening

Next Steps

Beginner

  • [ ] Customize models for your domain
  • [ ] Add web UI (React, Vue)
  • [ ] Deploy to local machine

Intermediate

  • [ ] Add authentication (API keys)
  • [ ] Implement caching (Redis)
  • [ ] Add database (PostgreSQL)

Advanced

  • [ ] Setup CI/CD (GitHub Actions)
  • [ ] Kubernetes deployment
  • [ ] ML model monitoring
  • [ ] A/B testing framework

Final Checklist

  • ✅ API running locally
  • ✅ All endpoints tested
  • ✅ Docker image built
  • ✅ Container deployed
  • ✅ Health checks passing
  • ✅ Documentation complete
