An analytical data warehouse for a jewelry and accessories e-commerce platform. The system ingests raw API data, transforms it into analytical tables, and serves business intelligence dashboards through Metabase. The architecture supports real-time data ingestion, hourly ETL updates, and historical data tracking, enabling comprehensive analytics on user behavior, catalog interactions, product views, and conversion funnels.

Features

  • Real-time API data ingestion with batch processing
  • Comprehensive session analytics tracking user interactions across catalog, products, and cart
  • Historical data snapshots with versioning for tracking entity changes over time
  • Automated ETL pipelines with hourly updates for analytical tables
  • Business intelligence dashboards in Metabase for data visualization
  • CLI tools for manual data reprocessing and debugging
  • Partitioned data storage optimized for time-series queries
  • User behavior analytics including device types, search queries, filters, and conversion funnels

Implementation details

Data Ingestion

Raw API calls are captured in real time and stored in the pd.api_calls table, which holds request/response JSON payloads, user session identifiers, endpoints, and timestamps. Data is written in batches with millisecond-precision timestamps.
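To make the batching concrete, here is a minimal sketch of a write buffer for pd.api_calls. The `ApiCallBuffer` class and its `flush_fn` callback are illustrative assumptions, not the actual ingestion code; in production, `flush_fn` would perform the ClickHouse batch insert.

```python
import json
import time
from typing import Callable

class ApiCallBuffer:
    """Buffered writer for raw API calls, flushed in fixed-size batches.

    Sketch only: `flush_fn` stands in for the real ClickHouse batch
    insert into pd.api_calls, and the row layout is illustrative.
    """

    def __init__(self, flush_fn: Callable[[list[dict]], None], batch_size: int = 1000):
        self.flush_fn = flush_fn
        self.batch_size = batch_size
        self._rows: list[dict] = []

    def add(self, session_id: str, endpoint: str, request: dict, response: dict) -> None:
        self._rows.append({
            "session_id": session_id,
            "endpoint": endpoint,
            "request": json.dumps(request),
            "response": json.dumps(response),
            # millisecond-precision timestamp, as stored in pd.api_calls
            "ts_ms": int(time.time() * 1000),
        })
        if len(self._rows) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        """Hand the buffered rows to the sink and reset the buffer."""
        if self._rows:
            self.flush_fn(self._rows)
            self._rows = []
```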

ETL Processing

Python-based ETL scripts organized in jobs.py process raw data into analytical tables:

  • Sessions: Aggregated session data with device information, user types, and behavioral metrics
  • Catalog Sessions: Enriched catalog browsing data including search queries, filters (material, section, metal, price ranges), pagination, and sorting preferences
  • Product Sessions: Product-level analytics tracking views, favorites, cart additions, and purchase events
  • Historical Snapshots: Versioned snapshots of production database entities (cart, items, favorites, stock) with change tracking
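As an illustration of the session aggregation these jobs perform, the sketch below rolls raw API call rows up into one row per session. The metric and endpoint names are assumptions for the example, not the real dwh schema:

```python
def build_session_rows(api_calls: list[dict]) -> list[dict]:
    """Roll raw API call rows up into one analytical row per session.

    A simplified sketch of the kind of aggregation done in jobs.py;
    field names and endpoint prefixes are illustrative.
    """
    sessions: dict[str, dict] = {}
    for call in api_calls:
        row = sessions.setdefault(call["session_id"], {
            "session_id": call["session_id"],
            "device": call.get("device", "unknown"),
            "catalog_views": 0,
            "product_views": 0,
            "cart_events": 0,
        })
        endpoint = call["endpoint"]
        if endpoint.startswith("/catalog"):
            row["catalog_views"] += 1
        elif endpoint.startswith("/product"):
            row["product_views"] += 1
        elif endpoint.startswith("/cart"):
            row["cart_events"] += 1
    return list(sessions.values())
```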

Data Warehouse Schema

The ClickHouse database is organized into multiple schemas:

  • pd: Raw source data from API calls
  • dwh: Analytical tables optimized for BI tools and ad-hoc analysis
  • history: Historical snapshots with versioning and deletion tracking
  • mysql: Database links to the production MySQL database for data migration
  • etl: Temporary tables for ETL processing
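The versioning and deletion tracking in the history schema can be sketched as a diff between two consecutive entity snapshots. The function and the version/is_deleted columns are illustrative assumptions, not the schema's exact layout:

```python
def snapshot_changes(prev: dict[int, dict], cur: dict[int, dict], version: int) -> list[dict]:
    """Versioned change rows between two consecutive entity snapshots.

    New and modified entities get a fresh row at `version`; entities
    missing from the current snapshot are recorded as deletions.
    """
    changes: list[dict] = []
    for entity_id, row in cur.items():
        if prev.get(entity_id) != row:  # new or modified entity
            changes.append({"id": entity_id, "version": version, "is_deleted": 0, **row})
    for entity_id in prev.keys() - cur.keys():  # dropped entities
        changes.append({"id": entity_id, "version": version, "is_deleted": 1})
    return changes
```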

Scheduled Jobs

main.py orchestrates scheduled ETL jobs running hourly, with systemd service integration for production deployment. Jobs are designed to support point-in-time reprocessing for specific dates.

CLI Interface

The command-line interface (cli.py) provides convenient access to ETL functions:

  • User agent updates
  • Session table updates for specific dates
  • Catalog and product session reprocessing
  • Historical snapshot generation
  • Verbose logging for debugging
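A plausible shape for such a CLI, sketched with argparse; the subcommand and flag names are assumptions based on the functions listed above, not the actual cli.py interface:

```python
import argparse
from datetime import date

def build_parser() -> argparse.ArgumentParser:
    """Sketch of a cli.py argument surface (subcommand names assumed)."""
    parser = argparse.ArgumentParser(prog="cli.py", description="Manual ETL runs and debugging")
    parser.add_argument("-v", "--verbose", action="store_true", help="enable verbose logging")
    sub = parser.add_subparsers(dest="command", required=True)

    # rebuild session tables for a specific date
    sessions = sub.add_parser("sessions", help="rebuild session tables for a date")
    sessions.add_argument("--date", type=date.fromisoformat, required=True)

    sub.add_parser("user-agents", help="refresh parsed user agent data")
    sub.add_parser("snapshots", help="generate historical snapshots")
    return parser
```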

Business Intelligence

Metabase integration provides dashboards for:

  • Conversion funnels (favorites, cart, purchases)
  • User behavior analytics by device type and user status
  • Catalog search and filter usage patterns
  • Product performance metrics
  • Real-time inventory tracking (icart, ifavs views)
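The step-to-step conversion rates behind a funnel dashboard can be computed as below; the stage names are illustrative, not the dashboard's actual fields:

```python
def funnel_rates(counts: dict[str, int]) -> dict[str, float]:
    """Step-to-step conversion rates for an ordered funnel.

    `counts` maps funnel stages (in order) to event counts, e.g.
    views -> favorites -> cart additions -> purchases.
    """
    stages = list(counts)
    rates: dict[str, float] = {}
    for prev, cur in zip(stages, stages[1:]):
        # guard against an empty upstream stage
        rates[f"{prev}->{cur}"] = counts[cur] / counts[prev] if counts[prev] else 0.0
    return rates
```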

Tech Stack

  • Language: Python 3.10
  • Database: ClickHouse (columnar analytical database)
  • BI Tool: Metabase
  • Orchestration: systemd service for scheduled jobs
  • Data Sources: Production MySQL database, API logs