Brand Monitoring BI is a real-time analytics tool designed for small companies looking to identify and engage with relevant discussions about their products. The system monitors Reddit posts matching specific search phrases, enabling founders to discover opportunities to share their product with interested communities. It provides real-time analytics about the volume of relevant Reddit posts and the percentage of posts that match the search criteria.

This project was developed as the final project for DE Zoomcamp 2023, inspired by a tweet from @AdriaanvRossum.

Features

  • Real-time monitoring of Reddit posts matching brand/product search phrases
  • Automated data pipeline from Reddit to analytics dashboard
  • Real-time analytics showing post volume and relevance metrics
  • Partitioned and clustered data storage for efficient querying
  • Scalable architecture supporting multiple data sources (currently Reddit, extensible to others)

Implementation details

Data Producer

Python-based Kafka producer that scrapes Reddit posts and publishes them to Kafka topics. Search phrases are configurable in the producer code.

Data Consumer

PySpark consumer that processes messages from Kafka, applies transformations, and loads data into BigQuery. The consumer handles data partitioning and clustering for optimal query performance.

Data Storage

BigQuery tables partitioned by date and clustered by subreddit for efficient analytics queries.

Infrastructure

Terraform configuration for automated infrastructure provisioning on Google Cloud Platform and Confluent Cloud.

Visualization

Looker Studio dashboards for real-time analytics visualization showing post volume trends and relevance metrics.

Tech Stack

  • Languages: Python, HCL (Terraform)
  • Streaming: Apache Kafka (Confluent Cloud)
  • Processing: PySpark
  • Storage: Google BigQuery
  • Infrastructure: Terraform, Google Cloud Platform
  • Visualization: Looker Studio

View on GitHub