Media Processing

Last updated: 2026-05-26

Overview

Monstermessenger supports four media processing capabilities that enhance the chatbot’s ability to understand user context:

Capability           Service                                Purpose
Audio Transcription  AudioTranscriptionService              Convert speech to text for voice-first clients
Image Attachments    Chat service + encode_image_to_base64  Pass screenshots/photos as multimodal input to the LLM
File Upload          POST /files/upload                     General-purpose file upload with validation
Image Analysis       ImageAnalysisService                   Extract metadata from uploaded images (dimensions, format)

These features are critical for cyberbullying detection: users can share screenshots of harmful messages or describe situations by voice.

Audio Transcription

Architecture

Client (voice recorder)
    │
    │  POST /api/v1/audio/transcribe
    │  Content-Type: multipart/form-data
    │  Body: audio file (webm, mp3, wav, ogg, m4a)
    ▼
┌─────────────────────────────────┐
│    AudioTranscriptionService    │
│                                 │
│  ┌───────────────────────────┐  │
│  │ MIME type validation      │  │
│  │ → normalize (strip codec) │  │
│  └───────────┬───────────────┘  │
│              ▼                  │
│  ┌───────────────────────────┐  │
│  │ Gemini 3.1 Flash Lite     │  │
│  │ (Vertex AI, europe-west9) │  │
│  │ model: gemini-3.1-flash-  │  │
│  │        lite               │  │
│  └───────────┬───────────────┘  │
│              ▼                  │
│  ┌───────────────────────────┐  │
│  │ Clean response            │  │
│  │ → strip prefixes          │  │
│  │ → return plain text       │  │
│  └───────────────────────────┘  │
└─────────────────┬───────────────┘
                  │
                  ▼
    {"transcribed_text": "...",
     "original_filename": "recording.webm",
     "mime_type": "audio/webm"}

Endpoint

POST /api/v1/audio/transcribe
Content-Type: multipart/form-data
Body: audio file (single UploadFile)

Response:

{
    "transcribed_text": "Hier j'ai reçu un message méchant sur Instagram...",
    "original_filename": "voice_note.webm",
    "mime_type": "audio/webm;codecs=opus"
}

Supported Formats

Format  MIME Types                          Notes
WebM    audio/webm, audio/webm;codecs=opus  Default browser recording format
MP3     audio/mp3, audio/mpeg               Common mobile format
WAV     audio/wav                           Uncompressed, large files
OGG     audio/ogg, audio/ogg;codecs=opus    Open format
M4A     audio/m4a                           iOS recording format

Codec stripping: MIME types with codec parameters (e.g., audio/webm;codecs=opus) are normalized to the base type (audio/webm) before being passed to Gemini, which prefers simpler format identifiers.
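
The normalization step can be sketched as follows. This is a minimal illustration, not the service's actual code: the names normalize_mime_type, is_supported_audio, and SUPPORTED_AUDIO_TYPES are hypothetical, and the set simply mirrors the base types listed in the table above.

```python
# Base MIME types from the Supported Formats table (codec parameters removed).
SUPPORTED_AUDIO_TYPES = frozenset(
    {"audio/webm", "audio/mp3", "audio/mpeg", "audio/wav", "audio/ogg", "audio/m4a"}
)


def normalize_mime_type(content_type: str) -> str:
    """Drop parameters such as ';codecs=opus' and lowercase the base type."""
    return content_type.split(";", 1)[0].strip().lower()


def is_supported_audio(content_type: str) -> bool:
    """Check the normalized type against the supported list."""
    return normalize_mime_type(content_type) in SUPPORTED_AUDIO_TYPES
```

With this sketch, normalize_mime_type("audio/webm;codecs=opus") yields "audio/webm", which is the form the document says is passed to Gemini.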

Size limit: 25 MB per audio file (enforced by the endpoint).

Transcription Model

The service uses gemini-3.1-flash-lite (configured via settings.llm.transcription_model) with Vertex AI in the europe-west9 region. The prompt instructs the model in French:

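“Retranscris cet audio en texte. Retourne uniquement le texte transcrit sans aucun commentaire ou formatage supplémentaire.”

(English: “Transcribe this audio into text. Return only the transcribed text, with no additional commentary or formatting.”)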

After transcription, the service strips common Gemini prefixes (Transcription:, Text:) to return clean text.
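
A rough sketch of the prefix-stripping step is shown below. The function name clean_transcription and the exact matching rules (case-insensitive, leading label only) are assumptions made for illustration; the real service may handle more or different prefixes.

```python
import re

# Matches a leading "Transcription:" or "Text:" label, the two prefixes named above.
# Case-insensitive matching is an assumption of this sketch.
_PREFIX_PATTERN = re.compile(r"^\s*(?:transcription|text)\s*:\s*", re.IGNORECASE)


def clean_transcription(raw: str) -> str:
    """Remove one leading label, if present, and trim surrounding whitespace."""
    return _PREFIX_PATTERN.sub("", raw, count=1).strip()
```

Only a label at the very start of the response is removed, so a transcript that happens to contain "text:" later in a sentence is left alone.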

Client Integration

Voice-first clients should:

  1. Record audio in a supported format (WebM for browsers, M4A for iOS)
  2. POST the audio blob to /api/v1/audio/transcribe
  3. Use the transcribed_text field as the message content for POST /api/v1/chat

Example flow:

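// 1. Record audio → blob
// getBlob() is assumed to come from an app-level recorder wrapper; the native
// MediaRecorder API delivers audio chunks through its "dataavailable" event.
const audioBlob = await mediaRecorder.getBlob();

// 2. Transcribe
const formData = new FormData();
formData.append('audio_file', audioBlob, 'recording.webm');
const response = await fetch('/api/v1/audio/transcribe', {
    method: 'POST',
    body: formData,
    headers: { Authorization: `Bearer ${token}` }
});
if (!response.ok) {
    // 400 = rejected upload, 500 = transcription failure (see Error Handling)
    throw new Error(`Transcription failed with status ${response.status}`);
}
const { transcribed_text } = await response.json();

// 3. Send to chat
await sendMessage(transcribed_text);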

Error Handling

Status  Condition
400     File is not an audio type (content_type doesn’t start with audio/)
400     Unsupported format (MIME type not in supported list)
400     File exceeds 25 MB
500     Gemini transcription failure (model error, network, etc.)
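
The three 400 conditions in the table above could be checked along these lines. This is a sketch under assumptions: the function name check_audio_upload, the order of the checks, the detail strings, and treating exactly 25 MB as acceptable are not taken from the service code. The 500 case arises later, during the model call, and is not covered here.

```python
from typing import Optional, Tuple

MAX_AUDIO_BYTES = 25 * 1024 * 1024  # 25 MB limit stated in this document
SUPPORTED_AUDIO_TYPES = frozenset(
    {"audio/webm", "audio/mp3", "audio/mpeg", "audio/wav", "audio/ogg", "audio/m4a"}
)


def check_audio_upload(
    content_type: Optional[str], size_bytes: int
) -> Optional[Tuple[int, str]]:
    """Return (status, detail) for the first failed check, or None if acceptable."""
    if not content_type or not content_type.lower().startswith("audio/"):
        return 400, "File is not an audio type"
    # Compare the base type only, ignoring parameters such as ';codecs=opus'.
    base_type = content_type.split(";", 1)[0].strip().lower()
    if base_type not in SUPPORTED_AUDIO_TYPES:
        return 400, "Unsupported audio format"
    if size_bytes > MAX_AUDIO_BYTES:
        return 400, "File exceeds 25 MB"
    return None
```

Returning a plain tuple keeps the sketch independent of any web framework; an endpoint would translate a non-None result into its own HTTP error response.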

Image Attachments

Flow

Users can attach screenshots or photos to chat messages. The flow:

1. User uploads image
   POST /api/v1/files/upload  →  {"file_path": "<uuid>.png", ...}
                    │
2. User sends chat message with attachment reference
   POST /api/v1/chat  {
       "message": "Look at this",
       "session_id": "...",
       "attachments": ["<uuid>.png"]
   }
                    │
3. Chat service resolves attachment
   _prepare_message_with_attachments()
   → reads file from uploads/<uuid>.png
   → encode_image_to_base64()
   → attaches as image_url content block
                    │
4. HumanMessage contains multimodal content
   [
       {"type": "text", "text": "Look at this"},
       {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}}
   ]
                    │
5. Gemini processes text + image together

Image Encoding

The encode_image_to_base64() utility (in api/utils/image_processing.py) handles:

  • Format preservation: Opens the image with PIL, preserves the original format (PNG, JPEG, WebP, etc.)
  • Format fallback: If PIL cannot detect the format, falls back to JPEG
  • Base64 encoding: Returns a (base64_string, format) tuple for constructing the data URI
  • Validation: A companion function validate_image_file() uses PIL’s verify() to check image integrity

from utils.image_processing import encode_image_to_base64

image_str, img_format = encode_image_to_base64("uploads/screenshot.png")
# image_str = "iVBORw0KGgo..." (base64)
# img_format = "PNG"

data_uri = f"data:image/{img_format.lower()};base64,{image_str}"

Chat Service Integration

In ChatService._prepare_message_with_attachments(), each attachment filename is resolved against the uploads/ directory, encoded, and appended to the HumanMessage content as an image_url block. The LLM receives both text and image in a single multimodal request.

Error tolerance: If an attachment file is missing or corrupt, the service logs a warning and proceeds with text-only content — it does not fail the entire message.
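
The resolve, encode, and tolerate-errors behavior might look roughly like the sketch below. It is not the actual ChatService code: build_message_content is a hypothetical name, the MIME type is guessed from the file extension with the standard library instead of PIL, and only unreadable files are skipped (integrity checks for corrupt images are left out). It also skips just the failing attachment, which reduces to text-only content when there is a single attachment; whether the real service drops all attachments in that case is not stated here.

```python
import base64
import logging
import mimetypes
from pathlib import Path
from typing import Any, Dict, List

logger = logging.getLogger(__name__)


def build_message_content(
    text: str, attachments: List[str], upload_dir: Path
) -> List[Dict[str, Any]]:
    """Build multimodal content: one text block plus one image_url block per readable file."""
    content: List[Dict[str, Any]] = [{"type": "text", "text": text}]
    for name in attachments:
        path = upload_dir / name
        try:
            encoded = base64.b64encode(path.read_bytes()).decode("ascii")
        except OSError as exc:
            # Missing or unreadable file: log and continue without this attachment.
            logger.warning("Skipping attachment %s: %s", name, exc)
            continue
        # Fall back to JPEG when the extension is unknown (assumption of this sketch).
        mime_type = mimetypes.guess_type(name)[0] or "image/jpeg"
        content.append(
            {"type": "image_url", "image_url": {"url": f"data:{mime_type};base64,{encoded}"}}
        )
    return content
```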

File Upload API

Endpoints

Method  Path                      Purpose
POST    /api/v1/files/upload      Upload one or more files
GET     /api/v1/files/{filename}  Download an uploaded file
DELETE  /api/v1/files/{filename}  Delete an uploaded file

Upload (POST /files/upload)

Accepts multiple files as multipart/form-data. Each file is:

  1. Extension-validated against settings.storage.allowed_file_types
  2. Size-checked against settings.storage.max_file_size (default: 10 MB)
  3. Renamed to a UUID + original extension (e.g., a3f2b1c4-d5e6-7890-abcd-ef1234567890.png) to prevent collisions
  4. Saved to the configured upload directory (settings.storage.upload_dir, default: uploads/)
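
The four steps above can be sketched as a single function. Everything named here (store_upload, the module-level constants, raising ValueError on rejection) is hypothetical and chosen for illustration; the real endpoint reads its limits from settings.storage and works with uploaded file objects instead of raw bytes.

```python
import uuid
from pathlib import Path

# Defaults copied from the Configuration section of this document.
ALLOWED_FILE_TYPES = {".jpg", ".jpeg", ".png", ".gif", ".webp", ".svg", ".bmp", ".pdf"}
MAX_FILE_SIZE = 10 * 1024 * 1024  # 10 MB


def store_upload(original_name: str, data: bytes, upload_dir: Path) -> str:
    """Validate, rename, and save one file; return the stored file name."""
    # 1. Extension check (lowercasing the extension is an assumption of this sketch)
    extension = Path(original_name).suffix.lower()
    if extension not in ALLOWED_FILE_TYPES:
        raise ValueError(f"File type {extension or '(none)'} is not allowed")
    # 2. Size check
    if len(data) > MAX_FILE_SIZE:
        raise ValueError("File exceeds the maximum size")
    # 3. UUID-based name keeps the extension and avoids collisions
    stored_name = f"{uuid.uuid4()}{extension}"
    # 4. Save to the upload directory
    upload_dir.mkdir(parents=True, exist_ok=True)
    (upload_dir / stored_name).write_bytes(data)
    return stored_name
```

The returned name plays the role of the file_path field in the upload response.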

Request:

POST /api/v1/files/upload
Content-Type: multipart/form-data
Body: files[] = screenshot1.png, files[] = screenshot2.jpg

Response:

[
    {
        "filename": "screenshot1.png",
        "file_path": "a3f2b1c4-d5e6-7890-abcd-ef1234567890.png",
        "file_size": 245760,
        "content_type": "image/png"
    },
    {
        "filename": "screenshot2.jpg",
        "file_path": "b4f3c2d5-e6f7-8901-bcde-f12345678901.jpg",
        "file_size": 128000,
        "content_type": "image/jpeg"
    }
]

The file_path value is what you pass in the attachments array of POST /chat.
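
As a small illustration of that hand-off, the helper below turns an upload response into a chat request body. The function name build_chat_payload is hypothetical; the field names follow the request and response examples in this document.

```python
from typing import Any, Dict, List


def build_chat_payload(
    message: str, session_id: str, uploads: List[Dict[str, Any]]
) -> Dict[str, Any]:
    """Collect each upload's file_path into the attachments array of a chat request."""
    return {
        "message": message,
        "session_id": session_id,
        "attachments": [item["file_path"] for item in uploads],
    }
```

Note that the original filename field is not used: the chat endpoint is given the UUID-based file_path values.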

Retrieve (GET /files/{filename})

Returns the raw file with appropriate Content-Type. Returns 404 if the file does not exist.

Delete (DELETE /files/{filename})

Removes the file from the upload directory. Returns {"message": "File deleted successfully"} or 404 if not found.

Configuration

All storage settings are in api/config.py under StorageSettings:

Setting             Default                                                             Description
upload_dir          "uploads"                                                           Directory for uploaded files
max_file_size       10 * 1024 * 1024 (10 MB)                                            Maximum file size in bytes
allowed_file_types  [".jpg", ".jpeg", ".png", ".gif", ".webp", ".svg", ".bmp", ".pdf"]  Allowed file extensions
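
For orientation, the defaults in the table could be modeled as below. This is only a sketch: it uses a standard-library dataclass, whereas the actual StorageSettings in api/config.py may be built on a settings library and may load values from the environment.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class StorageSettings:
    """Illustrative stand-in carrying the documented defaults."""

    upload_dir: str = "uploads"
    max_file_size: int = 10 * 1024 * 1024  # bytes (10 MB)
    # default_factory gives each instance its own list instead of a shared one.
    allowed_file_types: List[str] = field(
        default_factory=lambda: [
            ".jpg", ".jpeg", ".png", ".gif", ".webp", ".svg", ".bmp", ".pdf",
        ]
    )
```

Individual values can then be overridden per instance, for example StorageSettings(max_file_size=1024) in a test.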

Image Analysis Service

ImageAnalysisService provides metadata extraction for uploaded images. It currently implements basic dimension and format detection; the analysis methods are designed to be extended with more sophisticated content analysis.

from services.image_analysis import image_analyzer

results = await image_analyzer.analyze_image(image_bytes)
# {
#     "width": 1920,
#     "height": 1080,
#     "format": "PNG",
#     "mode": "RGBA"
# }

Current Capabilities

Method                Purpose
analyze_image(bytes)  Extract width, height, format, and color mode from raw bytes
analyze_file(Path)    Same as above, but reads from a file path

Implementation

  • Uses OpenCV (cv2.imdecode) for decoding raw bytes
  • Uses PIL (Image.fromarray) for format and mode detection
  • Lightweight — no GPU or model loading required

Future Extensions

The service is designed for expansion. Potential additions:

  • Screenshot classification: Detect if an image is a social media screenshot (platform logos, UI patterns)
  • Text extraction (OCR): Extract visible text from screenshots for analysis
  • Harmful content detection: Flag images containing potentially harmful material
  • Metadata extraction: EXIF data, geolocation, timestamp for forensic analysis

Integration Summary

Client Need       Step 1              Step 2                                    Step 3
Voice input       Record audio        POST /audio/transcribe                    POST /chat with transcribed text
Share screenshot  POST /files/upload  POST /chat with attachments: [file_path]  LLM processes text + image
Upload file       POST /files/upload  Use returned file_path                    GET /files/{name} to retrieve
Analyze image     POST /files/upload  Call image_analyzer.analyze_file()        Use metadata for routing/validation