Media Processing
Last updated: 2026-05-26
Overview
Monstermessenger supports four media processing capabilities that enhance the chatbot’s ability to understand user context:
| Capability | Service | Purpose |
|---|---|---|
| Audio Transcription | AudioTranscriptionService | Convert speech to text for voice-first clients |
| Image Attachments | Chat service + encode_image_to_base64 | Pass screenshots/photos as multimodal input to the LLM |
| File Upload | POST /files/upload | General-purpose file upload with validation |
| Image Analysis | ImageAnalysisService | Extract metadata from uploaded images (dimensions, format) |
These features are critical for cyberbullying detection: users can share screenshots of harmful messages or describe situations by voice.
Audio Transcription
Architecture
Client (voice recorder)
│
│ POST /api/v1/audio/transcribe
│ Content-Type: multipart/form-data
│ Body: audio file (webm, mp3, wav, ogg, m4a)
▼
┌─────────────────────────────────┐
│ AudioTranscriptionService │
│ │
│ ┌───────────────────────────┐ │
│ │ MIME type validation │ │
│ │ → normalize (strip codec) │ │
│ └───────────┬───────────────┘ │
│ ▼ │
│ ┌───────────────────────────┐ │
│ │ Gemini 3.1 Flash Lite │ │
│ │ (Vertex AI, europe-west9) │ │
│ │ model: gemini-3.1-flash- │ │
│ │ lite │ │
│ └───────────┬───────────────┘ │
│ ▼ │
│ ┌───────────────────────────┐ │
│ │ Clean response │ │
│ │ → strip prefixes │ │
│ │ → return plain text │ │
│ └───────────────────────────┘ │
└─────────────────┬───────────────┘
│
▼
{"transcribed_text": "...",
"original_filename": "recording.webm",
"mime_type": "audio/webm"}
Endpoint
POST /api/v1/audio/transcribe
Content-Type: multipart/form-data
Body: audio file (single UploadFile)
Response:
{
"transcribed_text": "Hier j'ai reçu un message méchant sur Instagram...",
"original_filename": "voice_note.webm",
"mime_type": "audio/webm;codecs=opus"
}
Supported Formats
| Format | MIME Types | Notes |
|---|---|---|
| WebM | audio/webm, audio/webm;codecs=opus | Default browser recording format |
| MP3 | audio/mp3, audio/mpeg | Common mobile format |
| WAV | audio/wav | Uncompressed, large files |
| OGG | audio/ogg, audio/ogg;codecs=opus | Open format |
| M4A | audio/m4a | iOS recording format |
Codec stripping: MIME types with codec parameters (e.g., audio/webm;codecs=opus) are normalized to the base type (audio/webm) before being passed to Gemini, which prefers simpler format identifiers.
Size limit: 25 MB per audio file (enforced by the endpoint).
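The normalization and checks described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the service's actual code: the names `SUPPORTED_AUDIO_TYPES`, `MAX_AUDIO_BYTES`, `normalize_mime_type`, and `validate_audio_upload` are hypothetical, and the supported set is copied from the Supported Formats table.

```python
# Hypothetical sketch; these names are assumptions, not the service's API.
SUPPORTED_AUDIO_TYPES = {
    "audio/webm", "audio/mp3", "audio/mpeg",
    "audio/wav", "audio/ogg", "audio/m4a",
}
MAX_AUDIO_BYTES = 25 * 1024 * 1024  # 25 MB limit stated above


def normalize_mime_type(content_type: str) -> str:
    """Drop parameters such as ';codecs=opus' and lowercase the base type."""
    return content_type.split(";", 1)[0].strip().lower()


def validate_audio_upload(content_type: str, size_bytes: int) -> str:
    """Return the normalized MIME type, or raise ValueError if a check fails."""
    base_type = normalize_mime_type(content_type)
    if not base_type.startswith("audio/"):
        raise ValueError("File is not an audio type")
    if base_type not in SUPPORTED_AUDIO_TYPES:
        raise ValueError(f"Unsupported audio format: {base_type}")
    if size_bytes > MAX_AUDIO_BYTES:
        raise ValueError("File exceeds 25 MB")
    return base_type
```

Each `ValueError` here corresponds to one of the 400 conditions the endpoint reports; how the real endpoint maps failures to HTTP responses is not shown.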
Transcription Model
The service uses gemini-3.1-flash-lite (configured via settings.llm.transcription_model) with Vertex AI in the europe-west9 region. The prompt instructs the model in French:
“Retranscris cet audio en texte. Retourne uniquement le texte transcrit sans aucun commentaire ou formatage supplémentaire.”
After transcription, the service strips common Gemini prefixes (Transcription:, Text:) to return clean text.
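The prefix cleanup could look roughly like the sketch below. The function name `clean_transcription` and the case-insensitive matching are assumptions made for illustration; only the two prefixes named above are handled.

```python
import re

# Prefixes named in the text above; the real service may handle more.
_PREFIX_PATTERN = re.compile(r"^\s*(?:Transcription|Text)\s*:\s*", re.IGNORECASE)


def clean_transcription(raw_text: str) -> str:
    """Remove one leading 'Transcription:' or 'Text:' label and trim whitespace."""
    return _PREFIX_PATTERN.sub("", raw_text, count=1).strip()
```

Anchoring the pattern at the start of the string means a label that appears mid-sentence is left untouched.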
Client Integration
Voice-first clients should:
- Record audio in a supported format (WebM for browsers, M4A for iOS)
- POST the audio blob to /api/v1/audio/transcribe
- Use the transcribed_text field as the message content for POST /api/v1/chat
Example flow:
// 1. Record audio → blob
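// recordedChunks (assumed name) holds the Blob parts collected from
// the MediaRecorder's dataavailable events.
const audioBlob = new Blob(recordedChunks, { type: 'audio/webm' });
// 2. Transcribe
const formData = new FormData();
formData.append('audio_file', audioBlob, 'recording.webm');
const response = await fetch('/api/v1/audio/transcribe', {
  method: 'POST',
  body: formData,
  headers: { Authorization: `Bearer ${token}` }
});
// Surface 400/500 responses instead of parsing an error body as a result
if (!response.ok) throw new Error(`Transcription failed: ${response.status}`);
const { transcribed_text } = await response.json();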
// 3. Send to chat
await sendMessage(transcribed_text);
Error Handling
| Status | Condition |
|---|---|
| 400 | File is not an audio type (content_type doesn’t start with audio/) |
| 400 | Unsupported format (MIME type not in supported list) |
| 400 | File exceeds 25 MB |
| 500 | Gemini transcription failure (model error, network, etc.) |
Image Attachments
Flow
Users can attach screenshots or photos to chat messages. The flow:
1. User uploads image
POST /api/v1/files/upload → {"file_path": "<uuid>.png", ...}
│
2. User sends chat message with attachment reference
POST /api/v1/chat {
"message": "Look at this",
"session_id": "...",
"attachments": ["<uuid>.png"]
}
│
3. Chat service resolves attachment
_prepare_message_with_attachments()
→ reads file from uploads/<uuid>.png
→ encode_image_to_base64()
→ attaches as image_url content block
│
4. HumanMessage contains multimodal content
[
{"type": "text", "text": "Look at this"},
{"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}}
]
│
5. Gemini processes text + image together
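Step 4 of the flow can be sketched as a small helper that assembles the multimodal content list. The function name `build_multimodal_content` is hypothetical; the block shapes are the ones shown in the flow above, and each image is assumed to arrive as a `(base64_string, format)` pair.

```python
from typing import Any


def build_multimodal_content(
    text: str, images: list[tuple[str, str]]
) -> list[dict[str, Any]]:
    """Build one text block followed by one image_url block per image.

    Each image is a (base64_string, format) pair, e.g. ("iVBORw0...", "PNG").
    """
    content: list[dict[str, Any]] = [{"type": "text", "text": text}]
    for base64_string, image_format in images:
        data_uri = f"data:image/{image_format.lower()};base64,{base64_string}"
        content.append({"type": "image_url", "image_url": {"url": data_uri}})
    return content
```

With no images the helper returns a single text block, so text-only messages go through the same code path.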
Image Encoding
The encode_image_to_base64() utility (in api/utils/image_processing.py) handles:
- Format preservation: Opens the image with PIL, preserves the original format (PNG, JPEG, WebP, etc.)
- Format fallback: If PIL cannot detect the format, falls back to JPEG
- Base64 encoding: Returns a (base64_string, format) tuple for constructing the data URI
- Validation: A companion function validate_image_file() uses PIL’s verify() to check image integrity
from utils.image_processing import encode_image_to_base64
image_str, img_format = encode_image_to_base64("uploads/screenshot.png")
# image_str = "iVBORw0KGgo..." (base64)
# img_format = "PNG"
data_uri = f"data:image/{img_format.lower()};base64,{image_str}"
Chat Service Integration
In ChatService._prepare_message_with_attachments(), each attachment filename is resolved against the uploads/ directory, encoded, and appended to the HumanMessage content as an image_url block. The LLM receives both text and image in a single multimodal request.
Error tolerance: If an attachment file is missing or corrupt, the service logs a warning and proceeds with text-only content — it does not fail the entire message.
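A minimal sketch of this tolerant behavior, assuming a hypothetical helper `encode_available_attachments` that takes the encoder as a parameter (in the real service that role would be played by `encode_image_to_base64()`; the actual control flow inside `ChatService` may differ):

```python
import logging
from pathlib import Path
from typing import Callable

logger = logging.getLogger(__name__)


def encode_available_attachments(
    filenames: list[str],
    upload_dir: Path,
    encoder: Callable[[Path], str],
) -> list[str]:
    """Encode each attachment, skipping any that are missing or fail to encode."""
    encoded: list[str] = []
    for name in filenames:
        path = upload_dir / name
        try:
            if not path.is_file():
                raise FileNotFoundError(name)
            encoded.append(encoder(path))
        except Exception as exc:  # tolerate any per-file failure
            logger.warning("Skipping attachment %s: %s", name, exc)
    return encoded
```

If every attachment fails, the returned list is empty and the caller can fall back to text-only content.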
File Upload API
Endpoints
| Method | Path | Purpose |
|---|---|---|
| POST | /api/v1/files/upload | Upload one or more files |
| GET | /api/v1/files/{filename} | Download an uploaded file |
| DELETE | /api/v1/files/{filename} | Delete an uploaded file |
Upload (POST /files/upload)
Accepts multiple files as multipart/form-data. Each file is:
- Extension-validated against settings.storage.allowed_file_types
- Size-checked against settings.storage.max_file_size (default: 10 MB)
- Renamed to a UUID + original extension (e.g., a3f2b1c4.png) to prevent collisions
- Saved to the configured upload directory (settings.storage.upload_dir, default: uploads/)
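The per-file checks above can be sketched as below. The function name `plan_stored_filename` and the constants are hypothetical, the defaults are copied from the configuration described in this document, and lowercasing the extension before comparison is an assumption about how the endpoint behaves.

```python
import uuid
from pathlib import Path

# Defaults copied from the documented StorageSettings values.
ALLOWED_FILE_TYPES = {".jpg", ".jpeg", ".png", ".gif", ".webp", ".svg", ".bmp", ".pdf"}
MAX_FILE_SIZE = 10 * 1024 * 1024  # 10 MB


def plan_stored_filename(original_name: str, size_bytes: int) -> str:
    """Validate an upload and return the UUID-based name it would be stored under."""
    extension = Path(original_name).suffix.lower()
    if extension not in ALLOWED_FILE_TYPES:
        raise ValueError(f"File type not allowed: {extension or '(none)'}")
    if size_bytes > MAX_FILE_SIZE:
        raise ValueError("File exceeds the maximum size")
    # A random UUID avoids collisions between uploads with the same original name.
    return f"{uuid.uuid4()}{extension}"
```

The returned name plays the role of the `file_path` value in the upload response.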
Request:
POST /api/v1/files/upload
Content-Type: multipart/form-data
Body: files[] = screenshot1.png, files[] = screenshot2.jpg
Response:
[
{
"filename": "screenshot1.png",
"file_path": "a3f2b1c4-d5e6-7890-abcd-ef1234567890.png",
"file_size": 245760,
"content_type": "image/png"
},
{
"filename": "screenshot2.jpg",
"file_path": "b4f3c2d5-e6f7-8901-bcde-f12345678901.jpg",
"file_size": 128000,
"content_type": "image/jpeg"
}
]
The file_path value is what you pass in the attachments array of POST /chat.
Retrieve (GET /files/{filename})
Returns the raw file with appropriate Content-Type. Returns 404 if the file does not exist.
Delete (DELETE /files/{filename})
Removes the file from the upload directory. Returns {"message": "File deleted successfully"} or 404 if not found.
Configuration
All storage settings are in api/config.py under StorageSettings:
| Setting | Default | Description |
|---|---|---|
| upload_dir | "uploads" | Directory for uploaded files |
| max_file_size | 10 * 1024 * 1024 (10 MB) | Maximum file size in bytes |
| allowed_file_types | [".jpg", ".jpeg", ".png", ".gif", ".webp", ".svg", ".bmp", ".pdf"] | Allowed file extensions |
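For orientation, the defaults above could be modeled as in the sketch below. This is an illustrative stand-in built on a plain dataclass; the real `StorageSettings` in api/config.py may use a different base class and loading mechanism.

```python
from dataclasses import dataclass, field


@dataclass
class StorageSettings:
    """Illustrative stand-in for the StorageSettings class in api/config.py."""

    upload_dir: str = "uploads"
    max_file_size: int = 10 * 1024 * 1024  # bytes (10 MB)
    allowed_file_types: list[str] = field(
        default_factory=lambda: [
            ".jpg", ".jpeg", ".png", ".gif", ".webp", ".svg", ".bmp", ".pdf",
        ]
    )


# Override a single default, for example to tighten the size limit to 5 MB.
settings = StorageSettings(max_file_size=5 * 1024 * 1024)
```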
Image Analysis Service
ImageAnalysisService provides metadata extraction for uploaded images. It currently implements basic dimension and format detection; the analysis methods are designed to be extended with more sophisticated content analysis.
from services.image_analysis import image_analyzer
results = await image_analyzer.analyze_image(image_bytes)
# {
# "width": 1920,
# "height": 1080,
# "format": "PNG",
# "mode": "RGBA"
# }
Current Capabilities
| Method | Purpose |
|---|---|
| analyze_image(bytes) | Extract width, height, format, and color mode from raw bytes |
| analyze_file(Path) | Same as above, but reads from a file path |
Implementation
- Uses OpenCV (cv2.imdecode) for decoding raw bytes
- Uses PIL (Image.fromarray) for format and mode detection
- Lightweight: no GPU or model loading required
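To illustrate how lightweight this kind of metadata extraction is, the sketch below reads the same four fields directly from a PNG header using only the standard library. It is a swapped-in technique for illustration, not the OpenCV/PIL path the service uses; `read_png_metadata` is a hypothetical name, only PNG is handled, and the mode mapping assumes 8-bit images.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
# PNG color type -> PIL-style mode name (8-bit images assumed)
_COLOR_TYPE_TO_MODE = {0: "L", 2: "RGB", 3: "P", 4: "LA", 6: "RGBA"}


def read_png_metadata(image_bytes: bytes) -> dict:
    """Parse width, height, and color mode from a PNG's IHDR chunk."""
    # Signature (8) + chunk length (4) + chunk type (4) + IHDR data (13) + CRC (4)
    if len(image_bytes) < 33 or image_bytes[:8] != PNG_SIGNATURE:
        raise ValueError("Not a PNG image")
    if image_bytes[12:16] != b"IHDR":
        raise ValueError("Missing IHDR chunk")
    width, height, _bit_depth, color_type = struct.unpack(
        ">IIBB", image_bytes[16:26]
    )
    return {
        "width": width,
        "height": height,
        "format": "PNG",
        "mode": _COLOR_TYPE_TO_MODE.get(color_type, "unknown"),
    }
```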
Future Extensions
The service is designed for expansion. Potential additions:
- Screenshot classification: Detect if an image is a social media screenshot (platform logos, UI patterns)
- Text extraction (OCR): Extract visible text from screenshots for analysis
- Harmful content detection: Flag images containing potentially harmful material
- Metadata extraction: EXIF data, geolocation, timestamp for forensic analysis
Integration Summary
| Client Need | Step 1 | Step 2 | Step 3 |
|---|---|---|---|
| Voice input | Record audio | POST /audio/transcribe | POST /chat with transcribed text |
| Share screenshot | POST /files/upload | POST /chat with attachments: [file_path] | LLM processes text + image |
| Upload file | POST /files/upload | Use returned file_path | GET /files/{name} to retrieve |
| Analyze image | POST /files/upload | Call image_analyzer.analyze_file() | Use metadata for routing/validation |