Storage Engine¶

How mixpanel_data uses DuckDB for local data storage.

Explore on DeepWiki

Ask questions about DuckDB integration, concurrency, or storage internals.

Overview¶

The StorageEngine class wraps DuckDB to provide persistent local storage for fetched Mixpanel data. Understanding DuckDB's concurrency model helps avoid conflicts when running multiple mp commands.

Storage Modes¶

Three storage modes are available:

Mode	Description	Use Case
Persistent	Database file on disk (default)	Production use, data preservation
Ephemeral	Temp file deleted on close	Testing, one-off analysis
In-Memory	No file, RAM only	Quick scripts, no persistence needed

Mode Selection¶

# Persistent (default) - stored at ~/.mp/data/{project_id}.db
ws = Workspace()

# Custom path
ws = Workspace(path="/path/to/my.db")

# Ephemeral - temp file, deleted on close
ws = Workspace(ephemeral=True)

# In-memory - no file at all
ws = Workspace(in_memory=True)

DuckDB Concurrency Model¶

DuckDB uses a single-writer, multiple-reader concurrency model:

One write connection can be active at a time per database file
Multiple read connections can coexist with each other
Read and write connections cannot coexist on the same file

This differs from client-server databases (PostgreSQL, MySQL) where a server process mediates all access.

What This Means in Practice¶

Scenario	Result
One `mp fetch` command	Works normally
Two `mp fetch` commands to same database	Second command gets `DatabaseLockedError`
`mp fetch` + `mp query` to same database	Query command gets `DatabaseLockedError`
Two `mp query` commands to same database	Both work (when no write lock is held)
Two `mp inspect` commands (API-only)	Both work (no database access)

Lock Conflicts¶

When a second process tries to open a database that's already locked for writing, DuckDB raises an error. mixpanel_data catches this and raises a DatabaseLockedError:

Database locked: /home/user/.mp/data/12345.db
Another mp command may be running. Try again shortly.

Database Not Found¶

When opening a database in read-only mode, the file must already exist. If you run a read command (like mp query or mp inspect tables) before fetching any data, you'll get a DatabaseNotFoundError:

No data yet: /home/user/.mp/data/12345.db
Run 'mp fetch events' or 'mp fetch profiles' to create the database.

This is different from write mode, which creates the database file automatically.

Common Causes¶

Long-running fetch — Large date ranges take time; other commands must wait
Background processes — A previous command didn't exit cleanly
Multiple terminals — Different shells running concurrent mp commands

Resolution¶

Wait — Let the first operation complete
Check for stuck processes — ps aux | grep mp to find orphaned commands
Use separate databases — Specify different --path for concurrent work

Lazy Storage Initialization¶

To avoid unnecessary lock conflicts, Workspace initializes storage lazily:

# These DON'T open the database:
ws = Workspace()
ws.events()           # API call, no storage
ws.segmentation(...)  # API call, no storage
ws.funnels(...)       # API call, no storage

# These DO open the database (on first access):
ws.fetch_events(...)  # Writes to storage
ws.sql(...)           # Reads from storage
ws.tables()           # Reads metadata

This means API-only commands like mp inspect events never conflict with fetch operations, even when targeting the same project.

Avoiding Conflicts¶

Use Ephemeral Mode for Testing¶

# Won't conflict with your main database
mp fetch events --from 2025-01-01 --to 2025-01-07 --ephemeral

Use Separate Paths for Parallel Work¶

# Terminal 1
mp fetch events --from 2025-01-01 --to 2025-06-30 --path ./h1.db

# Terminal 2 (parallel)
mp fetch events --from 2025-07-01 --to 2025-12-31 --path ./h2.db

Combine into Single Commands¶

# Instead of two fetches, use date range in one command
mp fetch events --from 2025-01-01 --to 2025-12-31

Stream Instead of Store¶

If you don't need to query the data repeatedly:

# No database, no locks
mp fetch events --from 2025-01-01 --stdout | process_events.py

Connection Lifecycle¶

The StorageEngine manages its DuckDB connection:

# Workspace as context manager ensures cleanup
with Workspace() as ws:
    ws.fetch_events(from_date="2025-01-01", to_date="2025-01-31")
    df = ws.sql("SELECT * FROM events LIMIT 10")
# Connection closed, lock released

# Or explicit close
ws = Workspace()
try:
    ws.fetch_events(...)
finally:
    ws.close()

CLI commands handle this automatically.

Technical Details¶

Lock File¶

DuckDB creates a .wal (write-ahead log) file alongside the database during write operations. The lock is held for the duration of the connection.

Process Isolation¶

Within a single Python process, multiple Workspace instances can share the same database file (DuckDB handles internal locking). Lock conflicts occur between separate processes.

Read-Only Mode¶

Both StorageEngine and Workspace support a read_only parameter:

# Default: write access (matches DuckDB's native behavior)
ws = Workspace()  # read_only=False

# Explicit read-only for concurrent access
ws = Workspace(path="data.db", read_only=True)

Read-only connections:

Allow multiple reader processes to access the database concurrently (when no write lock is held)
Cannot execute INSERT, UPDATE, DELETE, or DDL statements
Still blocked by an active write lock (DuckDB write locks are exclusive)

The CLI uses this automatically:

Read commands (mp query, mp inspect tables, etc.) use read_only=True
Write commands (mp fetch, mp inspect drop) use read_only=False

Note: If a mp fetch is running, other commands will still be blocked until it completes. The benefit of read-only mode is enabling multiple concurrent read operations (e.g., two mp query commands).