Storage Engine¶
How mixpanel_data uses DuckDB for local data storage.
Explore on DeepWiki
🤖 StorageEngine Deep Dive →
Ask questions about DuckDB integration, concurrency, or storage internals.
Overview¶
The StorageEngine class wraps DuckDB to provide persistent local storage for fetched Mixpanel data. Understanding DuckDB's concurrency model helps avoid conflicts when running multiple mp commands.
Storage Modes¶
Three storage modes are available:
| Mode | Description | Use Case |
|---|---|---|
| Persistent | Database file on disk (default) | Production use, data preservation |
| Ephemeral | Temp file deleted on close | Testing, one-off analysis |
| In-Memory | No file, RAM only | Quick scripts, no persistence needed |
Mode Selection¶
# Persistent (default) - stored at ~/.mp/data/{project_id}.db
ws = Workspace()
# Custom path
ws = Workspace(path="/path/to/my.db")
# Ephemeral - temp file, deleted on close
ws = Workspace(ephemeral=True)
# In-memory - no file at all
ws = Workspace(in_memory=True)
DuckDB Concurrency Model¶
DuckDB uses a single-writer, multiple-reader concurrency model:
- One write connection can be active at a time per database file
- Multiple read connections can coexist with each other
- Read and write connections cannot coexist on the same file
This differs from client-server databases (PostgreSQL, MySQL) where a server process mediates all access.
What This Means in Practice¶
| Scenario | Result |
|---|---|
One mp fetch command |
Works normally |
Two mp fetch commands to same database |
Second command gets DatabaseLockedError |
mp fetch + mp query to same database |
Query command gets DatabaseLockedError |
Two mp query commands to same database |
Both work (when no write lock is held) |
Two mp inspect commands (API-only) |
Both work (no database access) |
Lock Conflicts¶
When a second process tries to open a database that's already locked for writing, DuckDB raises an error. mixpanel_data catches this and raises a DatabaseLockedError:
Database Not Found¶
When opening a database in read-only mode, the file must already exist. If you run a read command (like mp query or mp inspect tables) before fetching any data, you'll get a DatabaseNotFoundError:
No data yet: /home/user/.mp/data/12345.db
Run 'mp fetch events' or 'mp fetch profiles' to create the database.
This is different from write mode, which creates the database file automatically.
Common Causes¶
- Long-running fetch — Large date ranges take time; other commands must wait
- Background processes — A previous command didn't exit cleanly
- Multiple terminals — Different shells running concurrent
mpcommands
Resolution¶
- Wait — Let the first operation complete
- Check for stuck processes —
ps aux | grep mpto find orphaned commands - Use separate databases — Specify different
--pathfor concurrent work
Lazy Storage Initialization¶
To avoid unnecessary lock conflicts, Workspace initializes storage lazily:
# These DON'T open the database:
ws = Workspace()
ws.events() # API call, no storage
ws.segmentation(...) # API call, no storage
ws.funnels(...) # API call, no storage
# These DO open the database (on first access):
ws.fetch_events(...) # Writes to storage
ws.sql(...) # Reads from storage
ws.tables() # Reads metadata
This means API-only commands like mp inspect events never conflict with fetch operations, even when targeting the same project.
Avoiding Conflicts¶
Use Ephemeral Mode for Testing¶
# Won't conflict with your main database
mp fetch events --from 2025-01-01 --to 2025-01-07 --ephemeral
Use Separate Paths for Parallel Work¶
# Terminal 1
mp fetch events --from 2025-01-01 --to 2025-06-30 --path ./h1.db
# Terminal 2 (parallel)
mp fetch events --from 2025-07-01 --to 2025-12-31 --path ./h2.db
Combine into Single Commands¶
# Instead of two fetches, use date range in one command
mp fetch events --from 2025-01-01 --to 2025-12-31
Stream Instead of Store¶
If you don't need to query the data repeatedly:
Connection Lifecycle¶
The StorageEngine manages its DuckDB connection:
# Workspace as context manager ensures cleanup
with Workspace() as ws:
ws.fetch_events(from_date="2025-01-01", to_date="2025-01-31")
df = ws.sql("SELECT * FROM events LIMIT 10")
# Connection closed, lock released
# Or explicit close
ws = Workspace()
try:
ws.fetch_events(...)
finally:
ws.close()
CLI commands handle this automatically.
Technical Details¶
Lock File¶
DuckDB creates a .wal (write-ahead log) file alongside the database during write operations. The lock is held for the duration of the connection.
Process Isolation¶
Within a single Python process, multiple Workspace instances can share the same database file (DuckDB handles internal locking). Lock conflicts occur between separate processes.
Read-Only Mode¶
Both StorageEngine and Workspace support a read_only parameter:
# Default: write access (matches DuckDB's native behavior)
ws = Workspace() # read_only=False
# Explicit read-only for concurrent access
ws = Workspace(path="data.db", read_only=True)
Read-only connections:
- Allow multiple reader processes to access the database concurrently (when no write lock is held)
- Cannot execute INSERT, UPDATE, DELETE, or DDL statements
- Still blocked by an active write lock (DuckDB write locks are exclusive)
The CLI uses this automatically:
- Read commands (
mp query,mp inspect tables, etc.) useread_only=True - Write commands (
mp fetch,mp inspect drop) useread_only=False
Note: If a mp fetch is running, other commands will still be blocked until it completes. The benefit of read-only mode is enabling multiple concurrent read operations (e.g., two mp query commands).
See Also¶
- Design — Overall architecture
- Data Model — Table schemas and query patterns
- DuckDB Documentation — Full DuckDB reference