Data Layer

Last updated: 2026-06-01

Overview

The project uses a single PostgreSQL database to manage all persistent data, but this data is divided into four distinct logical schemas (namespaces):

Schema	Owner	Purpose
`app`	SQLAlchemy / Alembic	Users, sessions, conversations, collectives, feature flags
`langgraph`	LangGraph library	LangGraph checkpoint tables (managed automatically)
`rag`	SQLAlchemy	PGVectorStore tables for document embeddings
`cache`	SQLAlchemy	RAG semantic-cache and embedding-cache tables

This multi-schema approach keeps each domain cleanly isolated while sharing a single database connection pool.

Application Data Schema (`api/database/models/`)

The schema for core application data is defined using SQLAlchemy ORM models in api/database/models/. The init_db() function, called on application startup, creates ORM tables. Alembic migrations are run separately via alembic upgrade head (for deployments, this is handled by container startup command).

Entity-Relationship Diagram

Auto-generated from SQLAlchemy models by Paracelsus during documentation build.

erDiagram
  app_sessions {
    UUID id PK
    UUID user_id FK
    JSON capabilities "nullable"
    DATETIME created_at "nullable"
    DATETIME expires_at "nullable"
    DATETIME last_active_at "nullable"
    ENUM origin
    ENUM scope
  }

  chat_sessions {
    UUID id PK
    UUID conversation_id FK
    DATETIME created_at "nullable"
  }

  collective_memberships {
    UUID collective_id PK,FK
    UUID user_id PK,FK
    UUID invitation_id FK "nullable"
    DATETIME created_at "nullable"
    ENUM role
    DATETIME updated_at "nullable"
  }

  collectives {
    UUID id PK
    UUID parent_id FK "nullable"
    DATETIME created_at "nullable"
    VARCHAR(255) name
    BOOLEAN parental_consent_granted
    VARCHAR(50) type
    DATETIME updated_at "nullable"
  }

  consents {
    UUID id PK
    UUID user_id FK
    VARCHAR(45) consent_ip "nullable"
    DATETIME created_at
  }

  conversations {
    UUID id PK
    UUID user_id FK
    ENUM conversation_type "nullable"
    DATETIME created_at "nullable"
    VARCHAR(255) current_monster "nullable"
    TEXT summary "nullable"
    VARCHAR(255) title "nullable"
    DATETIME updated_at "nullable,indexed"
  }

  feature_flags {
    UUID id PK
    DATETIME created_at "nullable"
    VARCHAR description "nullable"
    ENUM environment
    BOOLEAN is_enabled "nullable"
    VARCHAR name UK
    JSON targeting_rules "nullable"
    DATETIME updated_at "nullable"
    ENUM variant
    ENUM visibility
  }

  feedbacks {
    UUID id PK
    TEXT comment "nullable"
    DATETIME created_at "nullable"
    ENUM incident_type "nullable"
    BOOLEAN is_positive "nullable"
    UUID session_id "nullable"
    UUID user_id "nullable"
    INTEGER value
  }

  idp_logins {
    ENUM idp PK
    VARCHAR(255) login_id PK
    UUID user_id FK
    DATETIME created_at "nullable"
  }

  invitations {
    UUID id PK
    UUID accepted_by FK "nullable"
    UUID collective_id FK
    UUID inviter_id FK
    DATETIME accepted_at "nullable"
    ENUM age_class "nullable"
    DATETIME created_at "nullable"
    INTEGER current_uses
    DATETIME expires_at "nullable"
    VARCHAR(255) label "nullable"
    VARCHAR(10) language
    ENUM status
    INTEGER usage_limit "nullable"
  }

  localized_contents {
    ENUM category PK
    VARCHAR(255) key PK
    VARCHAR(10) language PK
    VARCHAR(20) variant PK
    DATETIME updated_at "nullable"
    JSON value
  }

  parental_authorization_requests {
    UUID id PK
    UUID teenager_id FK
    DATETIME created_at
    DATETIME expires_at "nullable"
    ENUM status
  }

  parental_consents {
    UUID parent_id PK,FK
    UUID teenager_id PK,FK
    VARCHAR(45) consent_ip "nullable"
    DATETIME created_at "nullable"
  }

  users {
    UUID id PK
    ENUM age_class "nullable"
    VARCHAR(100) country "nullable"
    DATETIME created_at "nullable"
    VARCHAR(2) language
    DATETIME last_seen_at "nullable"
    DATETIME updated_at "nullable"
    ENUM user_type
  }

  users ||--o{ app_sessions : user_id
  conversations ||--o{ chat_sessions : conversation_id
  users ||--o| collective_memberships : user_id
  collectives ||--o| collective_memberships : collective_id
  invitations ||--o{ collective_memberships : invitation_id
  collectives ||--o{ collectives : parent_id
  users ||--o{ consents : user_id
  users ||--o{ conversations : user_id
  users ||--o{ idp_logins : user_id
  collectives ||--o{ invitations : collective_id
  users ||--o{ invitations : inviter_id
  users ||--o{ invitations : accepted_by
  users ||--o{ parental_authorization_requests : teenager_id
  users ||--o| parental_consents : parent_id
  users ||--o| parental_consents : teenager_id

`User`

Stores registered user profiles.

id (UUID, PK) — generated on registration.
user_type — enum: teenager or parent.
language — 2-letter language code.
age_class, country — optional profile fields.
created_at, updated_at, last_seen_at — timestamps.
Relationships: app_sessions (1→N, eager-loaded), conversations (1→N, eager-loaded), idp_logins (1→N, eager-loaded), memberships (1→N, eager-loaded), consent (1→1, lazy-loaded), parental_consent (1→1, lazy-loaded).

`IDPLogin`

Maps an external Identity Provider identity to a User. Composite primary key (idp, login_id).

idp — enum: mock-idp (supported via IDProvider enum).
login_id — opaque external user ID (e.g., the sub claim from a JWT).
user_id (FK → User.id).
created_at — registration timestamp.

`AppSession`

Represents a single visit / authentication event (Bearer Token session).

id (UUID, PK).
user_id (FK → User.id).
origin — client identifier (e.g., web_react, discord).
scope — enum (SessionScope): ONBOARDING, PENDING_CONSENT, or FULL. Reflects the user’s progress through the consent pipeline.
capabilities (JSON) — negotiated client capabilities for this session.
created_at, last_active_at, expires_at — session lifecycle management.
Relationships: user (N→1, joined-loaded).

`Conversation`

Represents one independent conversation thread belonging to a User.

id (UUID, PK).
user_id (FK → User.id).
title — optional thread title.
conversation_type — enum: incident or general.
summary (Text) — rolling summary persisted by the agent.
current_monster — tracks the active monster persona (e.g., “betty”).
updated_at — indexed for efficient recency queries.
Relationships: user (N→1, joined-loaded), chat_sessions (1→N, eager-loaded).

`ChatSession`

A specific interaction instance within a Conversation. Its id is used as the LangGraph thread_id.

id (UUID, PK).
conversation_id (FK → Conversation.id).
created_at — session start timestamp.
Relationships: conversation (N→1, joined-loaded).

`Collective`

Represents a group of users, such as a family or a school class.

id (UUID, PK).
name — display name.
type — type of collective (e.g., family, class).
parent_id — optional FK to a parent Collective (supports hierarchy).
parental_consent_granted — a flag indicating the parent has explicitly granted consent for all minors in this collective.
Relationships: collective_members (1→N, eager-loaded), invitations (1→N, eager-loaded).

`CollectiveMembership`

Joins a User to a Collective with a specific role.

Composite primary key: user_id (FK → User.id) + collective_id (FK → Collective.id).
role — enum: owner, admin, member.

`FeatureFlag`

Stores remotely-toggleable feature flags consumed by both backend and frontend.

id (UUID, PK), name (unique), description.
is_enabled (Boolean).
environment — enum: development, production, staging, all.
visibility — enum: frontend, backend, all.
variant — enum: teenager, parent, all.
targeting_rules (JSON) — optional key/value rules for fine-grained targeting.

See Backend API — Feature Flags for the full list of seeded flags.

`UserConsent` (`models/consent.py`)

Tracks TOS / general consent acceptance for every user.

id (UUID, PK).
user_id (FK → User.id, CASCADE).
consent_ip (String, nullable) — IP address from which consent was given.
created_at — consent timestamp.
Relationships: user (N→1).

`ParentalConsent` (`models/consent.py`)

Stores parental authorization linking a parent to a teenager. Composite primary key (parent_id, teenager_id).

parent_id (FK → User.id, CASCADE).
teenager_id (FK → User.id, CASCADE).
consent_ip (String, nullable).
created_at — consent timestamp.

`ParentalAuthorizationRequest` (`models/authorization_request.py`)

A temporary voucher for teenagers to request parental authorization. Enables secure linking without exposing internal user IDs in URLs.

id (UUID, PK).
teenager_id (FK → User.id, CASCADE).
status — enum: PENDING, COMPLETED, EXPIRED, REVOKED.
created_at, expires_at — lifecycle timestamps (null expires_at = eternal).
Relationships: teenager (N→1).

`Invitation` (`models/invitation.py`)

A shareable token used to invite users (typically teenagers) into a Collective. Invitations carry settings from the parent who creates them.

id (UUID, PK).
collective_id (FK → Collective.id, CASCADE).
inviter_id (FK → User.id, CASCADE) — the parent/owner who created the invitation.
age_class — optional target age group (child, preteen, teenager, adult).
language — UI language for the invited user.
label — optional display name.
usage_limit — max uses (null = unlimited). Age-class-restricted invites default to 1.
current_uses — number of times used.
status — enum: PENDING, ACTIVE, ACCEPTED, EXPIRED, REVOKED.
created_at, expires_at — lifecycle timestamps.
accepted_at, accepted_by (FK → User.id) — acceptance metadata.
Relationships: collective (N→1), inviter (N→1), accepted_user (N→1).

`LocalizedContent`

Stores dynamic i18n content that overrides filesystem YAML files.

category (Enum, PK) — prompts, prompt_snippets, ui_messages, context_questions, or tools.
language (String, PK) — 2-letter language code.
variant (String, PK) — teenager, parent, or all.
key (String, PK) — unique identifier within the category.
value (JSON) — the actual content (string or dictionary).
updated_at — last update timestamp.

Application Data Service (`services/app_data/`)

All database interactions go through the AppDataService, which is built on a repository pattern:

AppDataService
├── UserRepository
├── IDPLoginRepository
├── AppSessionRepository
├── ConversationRepository
├── ChatSessionRepository
├── CollectiveRepository
├── CollectiveMembershipRepository
├── InvitationRepository
├── UserConsentRepository
└── ParentalConsentRepository

UserAuthService also manages:

UserAuthService
├── UserRepository
├── IDPLoginRepository
├── AppSessionRepository
├── UserConsentRepository
├── ParentalConsentRepository
├── CollectiveRepository
├── CollectiveMembershipRepository
├── InvitationRepository
└── ParentalAuthorizationRequestRepository

Each repository exposes typed get_by_primary_key, create, update, and delete methods via BaseRepository.
Both services use the db_session_router pattern — methods accept an optional db_session parameter and create their own if none is provided, enabling transactional reuse.
FeatureFlagRepository is managed separately via FeatureFlagService.
LocalizedContentRepository is managed separately via LocalizedContentService.
BaseDataService provides shared helper methods (get_repo_record_by_primary_key, update helpers) to eliminate boilerplate.
Composite primary keys (e.g., IDPLogin, ParentalConsent, CollectiveMembership) are fully supported.
Relationships are eagerly loaded using selectin loading to avoid N+1 queries in async SQLAlchemy.

Database Connection (`api/database/database.py` & `api/database/schemas.py`)

The connection module exposes:

engine — synchronous SQLAlchemy engine (used by Alembic and legacy sync code).
async_engine — async SQLAlchemy engine for ORM operations.
get_async_pool() — returns the shared psycopg_pool.AsyncConnectionPool (named general_pool) used by PGVectorStore, PostgresRAGCache, and raw async queries.
get_db() — async generator that yields an AsyncSession, ensuring it is closed after each request.
schema_context(pool, schema) — async context manager that sets search_path on a raw connection for schema-specific operations (defined in api/database/schemas.py).
Schema enum — APP, LANGGRAPH, RAG, CACHE, PUBLIC — used as the canonical source of schema names throughout the codebase.

Connection pools have separate named instances:

Pool name	Used by
`general_pool`	Application ORM queries, RAG service, and RAG semantic cache
`langgraph_pool`	LangGraph checkpoint reads/writes

LangGraph Persistent State

Conversation state (message history, action, context) is managed by LangGraph’s MemorySaver (in-memory, per-server). This is the live working memory for active conversations.

Long-term memory is achieved through a two-layer stack:

PostgresStore (LangGraph built-in) — a key-value store in the langgraph schema. The MemoryService wraps this to read and write:
- Conversation summaries — stored asynchronously via the MemoryLake queue
- Strategy memory — serialized StrategyMemory objects under namespace ("strategy_memory", user_id) with key conversation_id, written synchronously on every strategy_management tool call (see Agent Framework — Strategy Memory)
MemoryLake (agents/service1/memory/lake.py) — a non-blocking async queue that sits in front of MemoryService. The agent drops a QueuedSummary into the lake and returns immediately; a pool of background workers flushes it to the database. This completely decouples the agent’s hot path from database latency. Note: strategy memory writes bypass the lake (aynchronous direct write), marked with #TODO.

RAG Vector Store (`services/rag/`)

PGVectorStore (langchain-postgres)

The RAG knowledge base has migrated from langchain_postgres.PGVector to langchain_postgres.PGVectorStore. Each user type has its own collection:

docs_youth — educational content for teenagers.
docs_adult — parenting guides for the adult / parent variant.

Tables live in the dedicated rag schema. PGEngine instances are created lazily per collection and cached in-process.

Multi-Level Semantic Caching

To reduce embedding costs and LLM latency a two-level cache sits in front of the vector store:

Level 1 — CacheBackedEmbeddings (services/rag/cached_embeddings.py):
Wraps the Gemini embedding model with a LangChain CacheBackedEmbeddings store backed by the cache schema. Identical query strings reuse stored embeddings without hitting the embedding API.

Level 2 — PostgresRAGCache (services/rag/postgres_rag_cache.py):
A custom semantic cache. When a new query arrives, its embedding is compared to previously cached query embeddings using cosine similarity in the cache.rag_cache PostgreSQL table. The cache entry ID is a hash of the query and user type: md5(query || '|' || user_type). If a sufficiently similar query is found (below a distance threshold), the previously retrieved document IDs are returned directly, skipping the vector store search entirely.

Cache updates are performed asynchronously in the background: the RAG service returns results to the agent immediately and schedules the cache write as a non-blocking task, keeping latency low.

Analytics Tables

To allow post-hoc analysis of chatbot interactions the schema also includes analytics tables managed by AnalyticsService (via FeedbackService):

ConversationAnalytics — high-level extracted data: bullying message, summary, mood, strategy, chatbot version, language.
Categorization tables (many-to-one with ConversationAnalytics): ConversationMessagesCategorized, ConversationMoodsCategorized, ConversationStrategiesCategorized, ConversationAnswerTypesCategorized.

`Feedback`

Stores user feedback (rating, comment, is_positive) linked by session_id UUID. There is intentionally no hard FK constraint to the LangGraph thread_id, keeping the application and LangGraph layers independent.

Database Migrations (Alembic)

Alembic is configured in api/alembic.ini and api/migrations/. Migrations are schema-aware and support pgvector extension creation. Run migrations with:

cd api
alembic upgrade head

Container startup auto-migration

On deployment, the application container automatically runs alembic upgrade head before starting the API server. This is configured in the Dockerfile’s entrypoint and ensures the database schema is always up-to-date with the deployed code. See Deployment for details.

Schema search path

The default database-level search_path is set to "$user", public for safer Alembic autogenerate behavior. Application models still resolve to the app schema via SQLAlchemy metadata, while dedicated contexts/pools set schema-specific paths where needed (e.g., LangGraph connections use search_path = langgraph).

Alembic baseline (zero-diff check)

A baseline migration (6c64fe7fdf14) was created as a reference point. The zero_diff_check.py revision ensures that after applying all migrations, Alembic’s autogenerate detects no differences between the ORM models and the database state — confirming the schema is fully aligned.

Multi-schema workaround: `Enum inherit_schema` patch

SQLAlchemy’s Enum column type creates the underlying PostgreSQL ENUM type in the public schema by default, ignoring the MetaData(schema=…) setting on Base. This causes type "xyz" does not exist errors at runtime and in Alembic autogenerate whenever the database search_path does not include public.

The fix — applied at the very top of both api/run.py and api/migrations/env.py, before any ORM imports — monkeypatches Enum.__init__ to default inherit_schema=True:

import sqlalchemy.sql.sqltypes

def _patch_sqlalchemy_enum():
    _orig_enum_init = sqlalchemy.sql.sqltypes.Enum.__init__

    def _patched_enum_init(self, *args, **kwargs):
        kwargs.setdefault("inherit_schema", True)   # co-locate ENUM with the referencing table's schema
        _orig_enum_init(self, *args, **kwargs)

    sqlalchemy.sql.sqltypes.Enum.__init__ = _patched_enum_init

_patch_sqlalchemy_enum()

With inherit_schema=True, PostgreSQL creates one copy of the ENUM type per schema that references it. For example, if a type usertype is used by tables in both app and langgraph, PostgreSQL will create app.usertype and langgraph.usertype as independent types. Currently all application enums are only referenced by tables in the app schema, so in practice only app.<typename> is created — but the behaviour generalises correctly if an enum is ever shared across schemas.

This must be applied before the ORM metadata is first constructed (i.e., before from database import …). It is duplicated in both entry points because run.py and migrations/env.py are independent Python processes. A future clean-up would centralise it in api/core/patches.py and import from there.

Note for contributors: when adding a new Enum column to any model, no additional action is needed — the patch ensures the type is automatically created in the schema of every table that references it. When writing a migration manually (e.g., to rename or drop an enum value), include the schema= argument matching the table’s schema in every DDL operation (e.g., schema='app' for application models).

Overview

Application Data Schema (api/database/models/)

Entity-Relationship Diagram

User

IDPLogin

AppSession

Conversation

ChatSession

Collective

CollectiveMembership

FeatureFlag

UserConsent (models/consent.py)

ParentalConsent (models/consent.py)

ParentalAuthorizationRequest (models/authorization_request.py)

Invitation (models/invitation.py)

LocalizedContent

Application Data Service (services/app_data/)

Database Connection (api/database/database.py & api/database/schemas.py)