Media Processing

Last updated: 2026-05-26

Overview

Monstermessenger supports four media processing capabilities that enhance the chatbot’s ability to understand user context:

Capability	Service	Purpose
Audio Transcription	`AudioTranscriptionService`	Convert speech to text for voice-first clients
Image Attachments	Chat service + `encode_image_to_base64`	Pass screenshots/photos as multimodal input to the LLM
File Upload	`POST /files/upload`	General-purpose file upload with validation
Image Analysis	`ImageAnalysisService`	Extract metadata from uploaded images (dimensions, format)

These features are critical for cyberbullying detection: users can share screenshots of harmful messages or describe situations by voice.

Audio Transcription

Architecture

Client (voice recorder)
    │
    │  POST /api/v1/audio/transcribe
    │  Content-Type: multipart/form-data
    │  Body: audio file (webm, mp3, wav, ogg, m4a)
    ▼
┌─────────────────────────────────┐
│    AudioTranscriptionService    │
│                                 │
│  ┌───────────────────────────┐  │
│  │ MIME type validation      │  │
│  │ → normalize (strip codec) │  │
│  └───────────┬───────────────┘  │
│              ▼                  │
│  ┌───────────────────────────┐  │
│  │ Gemini 3.1 Flash Lite     │  │
│  │ (Vertex AI, europe-west9) │  │
│  │ model: gemini-3.1-flash-  │  │
│  │        lite               │  │
│  └───────────┬───────────────┘  │
│              ▼                  │
│  ┌───────────────────────────┐  │
│  │ Clean response            │  │
│  │ → strip prefixes          │  │
│  │ → return plain text       │  │
│  └───────────────────────────┘  │
└─────────────────┬───────────────┘
                  │
                  ▼
    {"transcribed_text": "...",
     "original_filename": "recording.webm",
     "mime_type": "audio/webm"}

Endpoint

POST /api/v1/audio/transcribe
Content-Type: multipart/form-data
Body: audio file (single UploadFile)

Response:

{
    "transcribed_text": "Hier j'ai reçu un message méchant sur Instagram...",
    "original_filename": "voice_note.webm",
    "mime_type": "audio/webm;codecs=opus"
}

Supported Formats

Format	MIME Types	Notes
WebM	`audio/webm`, `audio/webm;codecs=opus`	Default browser recording format
MP3	`audio/mp3`, `audio/mpeg`	Common mobile format
WAV	`audio/wav`	Uncompressed, large files
OGG	`audio/ogg`, `audio/ogg;codecs=opus`	Open format
M4A	`audio/m4a`	iOS recording format

Codec stripping: MIME types with codec parameters (e.g., audio/webm;codecs=opus) are normalized to the base type (audio/webm) before being passed to Gemini, which prefers simpler format identifiers.
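
The normalization step can be sketched as follows. This is a minimal illustration, assuming the service only needs to drop everything after the first semicolon and lowercase the result; the function name `normalize_mime_type` is hypothetical and not taken from the service code.

```python
def normalize_mime_type(content_type: str) -> str:
    """Return the base MIME type without parameters such as ';codecs=opus'."""
    # Hypothetical helper: everything after the first ';' is treated as a parameter.
    base, _, _params = content_type.partition(";")
    return base.strip().lower()


print(normalize_mime_type("audio/webm;codecs=opus"))  # audio/webm
print(normalize_mime_type("audio/wav"))               # audio/wav
```

Types without parameters pass through unchanged, so the same function can be applied to every upload.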

Size limit: 25 MB per audio file (enforced by the endpoint).

Transcription Model

The service uses gemini-3.1-flash-lite (configured via settings.llm.transcription_model) on Vertex AI in the europe-west9 region. The prompt is written in French:

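“Retranscris cet audio en texte. Retourne uniquement le texte transcrit sans aucun commentaire ou formatage supplémentaire.”

(English: “Transcribe this audio into text. Return only the transcribed text, with no additional commentary or formatting.”)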

After transcription, the service strips common Gemini prefixes (Transcription:, Text:) to return clean text.
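
A rough sketch of the cleanup step, assuming the prefixes are matched case-insensitively at the start of the response only. The helper name `clean_transcription` and the regular expression are illustrative assumptions; the actual service may handle additional prefixes.

```python
import re

# Only the two prefixes named in this document; matching is case-insensitive
# by assumption.
_PREFIX_PATTERN = re.compile(r"^\s*(?:transcription|text)\s*:\s*", re.IGNORECASE)


def clean_transcription(raw: str) -> str:
    """Remove a leading 'Transcription:' or 'Text:' label and outer whitespace."""
    return _PREFIX_PATTERN.sub("", raw, count=1).strip()


print(clean_transcription("Transcription: Bonjour"))  # Bonjour
```

Anchoring the pattern at the start of the string keeps occurrences of "text:" inside the spoken content intact.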

Client Integration

Voice-first clients should:

1. Record audio in a supported format (WebM for browsers, M4A for iOS).
2. POST the audio blob to /api/v1/audio/transcribe.
3. Use the transcribed_text field as the message content for POST /api/v1/chat.

Example flow:

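// 1. Record audio → blob
// getBlob() is assumed to be a helper on the app's recorder wrapper; the
// standard MediaRecorder API delivers data through `dataavailable` events.
const audioBlob = await mediaRecorder.getBlob();

// 2. Transcribe
const formData = new FormData();
formData.append('audio_file', audioBlob, 'recording.webm');
const response = await fetch('/api/v1/audio/transcribe', {
    method: 'POST',
    body: formData,
    headers: { Authorization: `Bearer ${token}` }
});
if (!response.ok) {
    // 400 for invalid uploads, 500 for transcription failures
    throw new Error(`Transcription failed with status ${response.status}`);
}
const { transcribed_text } = await response.json();

// 3. Send to chat
await sendMessage(transcribed_text);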

Error Handling

Status	Condition
`400`	File is not an audio type (`content_type` doesn’t start with `audio/`)
`400`	Unsupported format (MIME type not in supported list)
`400`	File exceeds 25 MB
`500`	Gemini transcription failure (model error, network, etc.)
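
The three 400 conditions in the table could be checked along these lines. This is a sketch only: the function name `check_audio_upload`, the order of the checks, and the detail strings are assumptions, while the supported types and the 25 MB limit come from this document.

```python
from typing import Optional, Tuple

MAX_AUDIO_BYTES = 25 * 1024 * 1024  # 25 MB limit stated above
SUPPORTED_AUDIO_TYPES = {
    "audio/webm", "audio/mp3", "audio/mpeg",
    "audio/wav", "audio/ogg", "audio/m4a",
}


def check_audio_upload(content_type: str, size_bytes: int) -> Optional[Tuple[int, str]]:
    """Return (status, detail) for a rejected upload, or None if it passes."""
    if not content_type.startswith("audio/"):
        return 400, "File is not an audio type"
    # Compare on the base type so that codec parameters do not cause a rejection.
    base_type = content_type.split(";", 1)[0].strip().lower()
    if base_type not in SUPPORTED_AUDIO_TYPES:
        return 400, "Unsupported audio format"
    if size_bytes > MAX_AUDIO_BYTES:
        return 400, "File exceeds 25 MB"
    return None
```

A 500 response is not modeled here because it originates from the Gemini call rather than from request validation.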

Image Attachments

Flow

Users can attach screenshots or photos to chat messages. The flow:

1. User uploads image
   POST /api/v1/files/upload  →  {"file_path": "<uuid>.png", ...}
                    │
2. User sends chat message with attachment reference
   POST /api/v1/chat  {
       "message": "Look at this",
       "session_id": "...",
       "attachments": ["<uuid>.png"]
   }
                    │
3. Chat service resolves attachment
   _prepare_message_with_attachments()
   → reads file from uploads/<uuid>.png
   → encode_image_to_base64()
   → attaches as image_url content block
                    │
4. HumanMessage contains multimodal content
   [
       {"type": "text", "text": "Look at this"},
       {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}}
   ]
                    │
5. Gemini processes text + image together

Image Encoding

The encode_image_to_base64() utility (in api/utils/image_processing.py) handles:

Format preservation: Opens the image with PIL, preserves the original format (PNG, JPEG, WebP, etc.)
Format fallback: If PIL cannot detect the format, falls back to JPEG
Base64 encoding: Returns a (base64_string, format) tuple for constructing the data URI
Validation: A companion function validate_image_file() uses PIL’s verify() to check image integrity

from utils.image_processing import encode_image_to_base64

image_str, img_format = encode_image_to_base64("uploads/screenshot.png")
# image_str = "iVBORw0KGgo..." (base64)
# img_format = "PNG"

data_uri = f"data:image/{img_format.lower()};base64,{image_str}"

Chat Service Integration

In ChatService._prepare_message_with_attachments(), each attachment filename is resolved against the uploads/ directory, encoded, and appended to the HumanMessage content as an image_url block. The LLM receives both text and image in a single multimodal request.

Error tolerance: If an attachment file is missing or corrupt, the service logs a warning and proceeds with text-only content — it does not fail the entire message.
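
The attachment handling described above might look roughly like this. It is a simplified sketch, not the ChatService code: the function name `build_multimodal_content` is hypothetical, and plain base64 encoding of the raw bytes stands in for encode_image_to_base64(), with the format guessed from the file extension instead of detected by PIL.

```python
import base64
import logging
from pathlib import Path
from typing import Any, Dict, List

logger = logging.getLogger(__name__)


def build_multimodal_content(
    text: str, attachments: List[str], upload_dir: Path
) -> List[Dict[str, Any]]:
    """Build text + image_url content blocks, skipping unreadable attachments."""
    content: List[Dict[str, Any]] = [{"type": "text", "text": text}]
    for name in attachments:
        path = upload_dir / name
        try:
            # Stand-in for encode_image_to_base64(): bytes are encoded as-is.
            encoded = base64.b64encode(path.read_bytes()).decode("ascii")
        except OSError as exc:
            # Missing or unreadable file: log and continue without this image.
            logger.warning("Skipping attachment %s: %s", name, exc)
            continue
        fmt = path.suffix.lstrip(".").lower() or "jpeg"
        if fmt == "jpg":
            fmt = "jpeg"
        content.append(
            {"type": "image_url",
             "image_url": {"url": f"data:image/{fmt};base64,{encoded}"}}
        )
    return content
```

If every attachment fails, the result contains only the text block, which matches the text-only fallback described above.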

File Upload API

Endpoints

Method	Path	Purpose
`POST`	`/api/v1/files/upload`	Upload one or more files
`GET`	`/api/v1/files/{filename}`	Download an uploaded file
`DELETE`	`/api/v1/files/{filename}`	Delete an uploaded file

Upload (`POST /files/upload`)

Accepts multiple files as multipart/form-data. Each file is:

Extension-validated against settings.storage.allowed_file_types
Size-checked against settings.storage.max_file_size (default: 10 MB)
Renamed to a UUID + original extension (e.g., a3f2b1c4-d5e6-7890-abcd-ef1234567890.png) to prevent collisions
Saved to the configured upload directory (settings.storage.upload_dir, default: uploads/)
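
The four steps above can be sketched as a single function. The name `store_upload`, the use of ValueError for rejections, and the lowercasing of the extension are assumptions made for illustration; the allowed extensions and size limit are the documented defaults.

```python
import uuid
from pathlib import Path

ALLOWED_FILE_TYPES = {".jpg", ".jpeg", ".png", ".gif", ".webp", ".svg", ".bmp", ".pdf"}
MAX_FILE_SIZE = 10 * 1024 * 1024  # 10 MB default


def store_upload(original_name: str, data: bytes, upload_dir: Path) -> Path:
    """Validate an upload and save it under a UUID-based name."""
    extension = Path(original_name).suffix.lower()
    if extension not in ALLOWED_FILE_TYPES:
        raise ValueError(f"File type {extension or '(none)'} is not allowed")
    if len(data) > MAX_FILE_SIZE:
        raise ValueError("File exceeds the maximum allowed size")
    upload_dir.mkdir(parents=True, exist_ok=True)
    # A fresh UUID per file prevents collisions between identical original names.
    target = upload_dir / f"{uuid.uuid4()}{extension}"
    target.write_bytes(data)
    return target
```

Because the stored name never includes the original filename, two users uploading "screenshot.png" end up with distinct files.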

Request:

POST /api/v1/files/upload
Content-Type: multipart/form-data
Body: files[] = screenshot1.png, files[] = screenshot2.jpg

Response:

[
    {
        "filename": "screenshot1.png",
        "file_path": "a3f2b1c4-d5e6-7890-abcd-ef1234567890.png",
        "file_size": 245760,
        "content_type": "image/png"
    },
    {
        "filename": "screenshot2.jpg",
        "file_path": "b4f3c2d5-e6f7-8901-bcde-f12345678901.jpg",
        "file_size": 128000,
        "content_type": "image/jpeg"
    }
]

The file_path value is what you pass in the attachments array of POST /chat.
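
As an illustration of that hand-off, the sketch below builds a chat request body from an upload response. The helper name `build_chat_payload` is hypothetical; the field names follow the request and response examples in this document.

```python
from typing import Any, Dict, List


def build_chat_payload(
    message: str, session_id: str, upload_results: List[Dict[str, Any]]
) -> Dict[str, Any]:
    """Combine a message with the file_path values returned by the upload endpoint."""
    return {
        "message": message,
        "session_id": session_id,
        # Use the server-generated file_path, not the original filename.
        "attachments": [item["file_path"] for item in upload_results],
    }


uploads = [{"filename": "screenshot1.png",
            "file_path": "a3f2b1c4-d5e6-7890-abcd-ef1234567890.png"}]
payload = build_chat_payload("Look at this", "session-123", uploads)
print(payload["attachments"])  # ['a3f2b1c4-d5e6-7890-abcd-ef1234567890.png']
```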

Retrieve (`GET /files/{filename}`)

Returns the raw file with appropriate Content-Type. Returns 404 if the file does not exist.

Delete (`DELETE /files/{filename}`)

Removes the file from the upload directory. Returns {"message": "File deleted successfully"} or 404 if not found.

Configuration

All storage settings are in api/config.py under StorageSettings:

Setting	Default	Description
`upload_dir`	`"uploads"`	Directory for uploaded files
`max_file_size`	`10 * 1024 * 1024` (10 MB)	Maximum file size in bytes
`allowed_file_types`	`[".jpg", ".jpeg", ".png", ".gif", ".webp", ".svg", ".bmp", ".pdf"]`	Allowed file extensions
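
For orientation, the defaults in the table correspond to a settings object shaped roughly like the one below. This is an illustrative dataclass only; the real StorageSettings in api/config.py may be built on a different base class and may load values from the environment.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class StorageSettings:
    """Illustrative mirror of the documented defaults, not the project's class."""
    upload_dir: str = "uploads"
    max_file_size: int = 10 * 1024 * 1024  # bytes
    allowed_file_types: List[str] = field(
        default_factory=lambda: [
            ".jpg", ".jpeg", ".png", ".gif", ".webp", ".svg", ".bmp", ".pdf",
        ]
    )


settings = StorageSettings()
print(settings.max_file_size)  # 10485760
```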

Image Analysis Service

ImageAnalysisService provides metadata extraction for uploaded images. It currently implements basic dimension, format, and color mode detection; the analysis methods are designed to be extended with more sophisticated content analysis.

from services.image_analysis import image_analyzer

results = await image_analyzer.analyze_image(image_bytes)
# {
#     "width": 1920,
#     "height": 1080,
#     "format": "PNG",
#     "mode": "RGBA"
# }

Current Capabilities

Method	Purpose
`analyze_image(bytes)`	Extract width, height, format, and color mode from raw bytes
`analyze_file(Path)`	Same as above, but reads from a file path

Implementation

Uses OpenCV (cv2.imdecode) for decoding raw bytes
Uses PIL (Image.fromarray) for format and mode detection
Lightweight — no GPU or model loading required

Future Extensions

The service is designed for expansion. Potential additions:

Screenshot classification: Detect if an image is a social media screenshot (platform logos, UI patterns)
Text extraction (OCR): Extract visible text from screenshots for analysis
Harmful content detection: Flag images containing potentially harmful material
Metadata extraction: EXIF data, geolocation, timestamp for forensic analysis

Integration Summary

Client Need	Step 1	Step 2	Step 3
Voice input	Record audio	`POST /audio/transcribe`	`POST /chat` with transcribed text
Share screenshot	`POST /files/upload`	`POST /chat` with `attachments: [file_path]`	LLM processes text + image
Upload file	`POST /files/upload`	Use returned `file_path`	`GET /files/{name}` to retrieve
Analyze image	`POST /files/upload`	Call `image_analyzer.analyze_file()`	Use metadata for routing/validation