Analytical data warehouse solution for a jewelry and accessories e-commerce platform. The system processes raw API data, transforms it into analytical tables, and serves business intelligence dashboards through Metabase. The architecture enables real-time data ingestion, hourly ETL updates, and historical data tracking for comprehensive analytics on user behavior, catalog interactions, product views, and conversion funnels.
Features
- Real-time API data ingestion with batch processing
- Comprehensive session analytics tracking user interactions across catalog, products, and cart
- Historical data snapshots with versioning for tracking entity changes over time
- Automated ETL pipelines with hourly updates for analytical tables
- Business intelligence dashboards in Metabase for data visualization
- CLI tools for manual data reprocessing and debugging
- Partitioned data storage optimized for time-series queries
- User behavior analytics including device types, search queries, filters, and conversion funnels
Implementation details
Data Ingestion
Raw API calls are captured in real time and stored in the pd.api_calls table, which holds request/response JSON payloads, user session identifiers, endpoints, and timestamps. Data is written in batches with millisecond-precision timestamps.
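A minimal sketch of the batching pattern described above. The class and field names here are hypothetical; the real pipeline flushes each batch into the pd.api_calls ClickHouse table, while this sketch hands batches to a caller-supplied sink.

```python
import json
from datetime import datetime, timezone

class ApiCallBuffer:
    """Buffers raw API call records and flushes them in batches.

    Hypothetical sketch: the real system writes each batch to
    pd.api_calls; here flush() passes the batch to a sink callable.
    """

    def __init__(self, sink, batch_size=500):
        self.sink = sink          # callable receiving a list of rows
        self.batch_size = batch_size
        self._rows = []

    def record(self, session_id, endpoint, request, response):
        # Millisecond-precision UTC timestamp, as described above.
        ts = datetime.now(timezone.utc).isoformat(timespec="milliseconds")
        self._rows.append({
            "ts": ts,
            "session_id": session_id,
            "endpoint": endpoint,
            "request": json.dumps(request),
            "response": json.dumps(response),
        })
        if len(self._rows) >= self.batch_size:
            self.flush()

    def flush(self):
        if self._rows:
            self.sink(self._rows)
            self._rows = []
```

Buffering keeps insert frequency low, which suits ClickHouse's preference for large, infrequent inserts over many single-row writes.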
ETL Processing
Python ETL scripts, organized in jobs.py, transform the raw data into analytical tables:
- Sessions: Aggregated session data with device information, user types, and behavioral metrics
- Catalog Sessions: Enriched catalog browsing data including search queries, filters (material, section, metal, price ranges), pagination, and sorting preferences
- Product Sessions: Product-level analytics tracking views, favorites, cart additions, and purchase events
- Historical Snapshots: Versioned snapshots of production database entities (cart, items, favorites, stock) with change tracking
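The session roll-up above can be sketched as a pure-Python aggregation. Field names and endpoint prefixes are illustrative assumptions, not the project's actual schema; the real jobs.py works against ClickHouse tables.

```python
def build_sessions(events):
    """Roll raw API events up into per-session analytics rows.

    Hypothetical sketch of the aggregation step: each event is a dict
    with session_id, device, user_type, and endpoint; the output mirrors
    the sessions table (device info, user type, behavioral counters).
    """
    sessions = {}
    for ev in events:
        row = sessions.setdefault(ev["session_id"], {
            "session_id": ev["session_id"],
            "device": ev["device"],
            "user_type": ev["user_type"],
            "catalog_views": 0,
            "product_views": 0,
            "cart_events": 0,
        })
        # Classify each hit by endpoint prefix (illustrative routing).
        if ev["endpoint"].startswith("/catalog"):
            row["catalog_views"] += 1
        elif ev["endpoint"].startswith("/product"):
            row["product_views"] += 1
        elif ev["endpoint"].startswith("/cart"):
            row["cart_events"] += 1
    return list(sessions.values())
```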
Data Warehouse Schema
The ClickHouse database is organized into several schemas:
- pd: Raw source data from API calls
- dwh: Analytical tables optimized for BI tools and ad-hoc analysis
- history: Historical snapshots with versioning and deletion tracking
- mysql: Database links for seamless data migration from production MySQL
- etl: Temporary tables for ETL processing
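To illustrate the partitioned, time-series-optimized storage mentioned in the features, here is a hypothetical DDL for a dwh table. Column names and the monthly partition key are assumptions for illustration, not the project's actual schema.

```python
# Hypothetical ClickHouse DDL: MergeTree partitioned by month,
# ordered by date so time-range queries prune whole partitions.
SESSIONS_DDL = """
CREATE TABLE IF NOT EXISTS dwh.sessions
(
    session_date  Date,
    session_id    String,
    device        LowCardinality(String),
    user_type     LowCardinality(String),
    catalog_views UInt32,
    product_views UInt32,
    cart_events   UInt32
)
ENGINE = MergeTree
PARTITION BY toYYYYMM(session_date)
ORDER BY (session_date, session_id)
"""
```

Monthly partitions also make point-in-time reprocessing cheap: a single partition can be dropped and rebuilt without touching the rest of the table.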
Scheduled Jobs
main.py orchestrates the scheduled ETL jobs, which run hourly; a systemd service handles production deployment. Jobs support point-in-time reprocessing for specific dates.
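The scheduling and reprocessing behavior can be sketched as follows. The job names and loop body are hypothetical; the real main.py dispatches into jobs.py.

```python
from datetime import date, datetime, timedelta

def next_hourly_run(now):
    """Return the next top-of-hour run time after `now`."""
    return (now.replace(minute=0, second=0, microsecond=0)
            + timedelta(hours=1))

def run_jobs(target_date=None):
    """Hypothetical sketch of the hourly job loop body.

    With no argument it processes the current day; passing a date
    reprocesses that day end to end (point-in-time reprocessing).
    """
    day = target_date or date.today()
    executed = []
    for job in ("update_sessions", "update_catalog_sessions",
                "update_product_sessions", "snapshot_history"):
        executed.append((job, day))   # real code would call into jobs.py
    return executed
```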
CLI Interface
Command-line interface (cli.py) provides convenient access to ETL functions:
- User agent updates
- Session table updates for specific dates
- Catalog and product session reprocessing
- Historical snapshot generation
- Verbose logging for debugging
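A command layout like the one above could be built with argparse. Subcommand and flag names here are illustrative assumptions, not cli.py's actual interface.

```python
import argparse
from datetime import date

def build_parser():
    """Hypothetical sketch of a cli.py command layout."""
    parser = argparse.ArgumentParser(prog="cli.py")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="verbose logging for debugging")
    sub = parser.add_subparsers(dest="command", required=True)

    # Session table updates, optionally for a specific date.
    sessions = sub.add_parser("sessions", help="update session tables")
    sessions.add_argument("--date", type=date.fromisoformat,
                          help="reprocess a specific date (YYYY-MM-DD)")

    sub.add_parser("user-agents", help="update user agent data")
    sub.add_parser("snapshots", help="generate historical snapshots")
    return parser
```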
Business Intelligence
Metabase integration provides dashboards for:
- Conversion funnels (favorites, cart, purchases)
- User behavior analytics by device type and user status
- Catalog search and filter usage patterns
- Product performance metrics
- Real-time inventory tracking (icart, ifavs views)
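The funnel math behind the conversion dashboards can be sketched in a few lines. The stage names are illustrative; the real dashboards are built as Metabase queries over the dwh tables.

```python
def funnel_rates(stage_counts):
    """Compute step-to-step conversion rates for an ordered funnel.

    stage_counts is an ordered list of (stage, sessions) pairs,
    e.g. views -> favorites -> cart -> purchases.
    """
    rates = []
    for (prev_name, prev_n), (name, n) in zip(stage_counts, stage_counts[1:]):
        rate = n / prev_n if prev_n else 0.0
        rates.append((f"{prev_name}->{name}", round(rate, 4)))
    return rates
```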
Tech Stack
- Language: Python 3.10
- Database: ClickHouse (columnar analytical database)
- BI Tool: Metabase
- Orchestration: systemd service for scheduled jobs
- Data Sources: Production MySQL database, API logs