
Real-Time Social Media Sentiment Pipeline

Hybrid cloud-academic architecture processing 1,000+ posts per hour with 95% cost reduction versus traditional cloud-native solutions

18s End-to-End Latency
1,000+ Posts per Hour
87.9% Processing Success Rate
95% Cost Reduction

Problem Statement & Technical Challenges

Traditional sentiment analysis systems force impossible tradeoffs between cost, accuracy, and latency

Real-Time Processing Constraints

Traditional batch-based sentiment analysis operates on scheduled intervals (hours or days), creating temporal gaps between data collection and actionable insights. This prevents organizations from tracking continuous sentiment evolution, detecting rapid opinion shifts, or generating temporal sentiment graphs essential for crisis management and campaign monitoring.

Solution: Implemented streaming architecture with 10-second micro-batch processing achieving 18-second end-to-end latency from collection to queryable storage.

Cloud GPU Cost Barriers

Real-time transformer model inference on cloud GPUs (Azure NC-series, AWS p3 instances) incurs prohibitive costs for sustained monitoring. Premium GPU instances cost $3-5/hour, making 24/7 processing economically infeasible for most organizations at scale.

Solution: Hybrid architecture separating managed cloud services (collection/storage) from academic T4 GPU compute, achieving 95% cost reduction while maintaining equivalent performance to premium cloud instances.

Accuracy-Speed Tradeoff

Lexicon-based methods (VADER) achieve only 60-61% accuracy on social media text due to sarcasm, context-dependent language, and evolving linguistic patterns. Transformer models provide superior accuracy but introduce latency and computational complexity unsuitable for streaming applications.

Solution: Deployed pre-trained RoBERTa and DistilRoBERTa models with chunked batch processing (size 32), achieving 17.5 posts/second throughput on academic hardware.

GitHub Data Bridge Innovation

Designed a novel bidirectional data transfer mechanism using GitHub's REST API that enables Databricks (Azure cloud) to exchange data with academic compute resources without direct network connectivity or VPN configuration. The bridge implements:

  • Export Logic: Databricks notebooks query Delta Lake for posts with timestamps newer than last successful export, encoding JSON data in Base64 format for transmission via GitHub API content creation endpoint
  • Incremental Processing: Timestamp-based watermarks track processing state, typically resulting in 100-500 post batches during regular operation
  • Bidirectional Flow: Raw data flows to incremental/ directory, enhanced ML results return through enhanced/ directory with manifest files ensuring data integrity
  • Academic Sync: Automated git pull operations every 30 minutes with file-based locking to prevent concurrent processing
  • Reliability: Retry logic for API rate limiting (5000 requests/hour) and commit SHA verification for successful uploads

Impact: Platform-independent integration pattern enabling flexible compute allocation while maintaining enterprise-grade reliability. This architecture validates that GitHub can serve as an effective data bridge for hybrid cloud-academic pipelines.
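The export step above can be sketched with only the standard library. This is a minimal illustration, not the project's actual notebook code: `build_export_payload` is a hypothetical helper, and the payload shape follows GitHub's contents-API convention of Base64-encoded file content with a commit message.

```python
import base64
import json
from datetime import datetime, timezone

def build_export_payload(posts, last_export_ts):
    """Build a GitHub contents-API payload for one incremental export.

    Only posts with a `created_at` ISO timestamp newer than the
    watermark `last_export_ts` are included, mirroring the
    timestamp-based incremental processing described above.
    """
    batch = [p for p in posts if p["created_at"] > last_export_ts]
    raw = json.dumps(batch).encode("utf-8")
    path = f"incremental/batch_{datetime.now(timezone.utc):%Y%m%dT%H%M%S}.json"
    payload = {
        "message": f"export {len(batch)} posts",
        # GitHub's contents endpoint requires Base64-encoded file content
        "content": base64.b64encode(raw).decode("ascii"),
    }
    return path, payload

posts = [
    {"post_id": "a", "created_at": "2024-05-01T10:00:00Z"},
    {"post_id": "b", "created_at": "2024-05-01T12:00:00Z"},
]
path, payload = build_export_payload(posts, "2024-05-01T11:00:00Z")
# Only the post newer than the watermark survives the filter
decoded = json.loads(base64.b64decode(payload["content"]))
```

A real export would `PUT` this payload to `/repos/{owner}/{repo}/contents/{path}` and record the returned commit SHA for verification.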

Technology Stack

Production-grade tools across cloud and academic infrastructure

Azure Functions (Python 3.11) · Event Hubs (Standard Tier) · Databricks (ML Runtime 13.3 LTS) · Delta Lake (ACID Storage) · GitHub REST API · RoBERTa Transformers · DistilRoBERTa · PyTorch · PySpark · T4 GPU (48-core, 124GB RAM) · SQL Analytics

Implementation Method

Four-stage production pipeline from collection to ML-enhanced analytics

Stage 1: Serverless Data Collection

Role: Automated social media data ingestion that runs continuously without manual intervention, converting raw API responses into structured messages ready for analysis.

Azure Functions (Python 3.11) execute on 5-minute timers, authenticating via the Bluesky AT Protocol to collect 25 posts per hashtag (#trump, #biden, #economy, #ai). The atproto SDK handles API queries while Azure Table Storage tracks pagination state to prevent collecting duplicate posts. Structured JSON messages flow to Event Hubs with exponential backoff retry logic, achieving a 94.7% success rate while processing 75 posts per execution.

Key Metrics: 300s intervals | 94.7% success | 75 posts/execution
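The exponential backoff retry behind that success rate can be sketched as a small stdlib helper. This is an illustrative pattern, not the project's source: `with_retries` and the delay schedule (1s, 2s, 4s, ...) are assumptions.

```python
import time

def with_retries(fn, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call `fn`, retrying with exponential backoff on any exception."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            sleep(base_delay * 2 ** attempt)

# Simulate a transient API failure that succeeds on the third call
calls = {"n": 0}
def flaky_send():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient Event Hubs failure")
    return "ok"

delays = []
result = with_retries(flaky_send, sleep=delays.append)  # capture sleeps instead of waiting
```

Injecting the `sleep` function keeps the helper unit-testable without real delays.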

Stage 2: Distributed Stream Processing

Role: Real-time message streaming and transformation that ensures data arrives reliably, gets deduplicated, and lands in queryable storage within seconds of collection.

Event Hubs (Standard tier, 3 partitions) provides 1MB/s ingress with 24-hour retention for replay capability. Databricks Structured Streaming consumes via 10-second micro-batches using ML Runtime 13.3 LTS. Spark processes 7 records/sec input → 4 records/sec output with 1,042ms batch duration. Delta Lake storage implements ACID transactions with post_id deduplication across 11.3k distinct keys, maintaining exactly-once semantics via DBFS checkpoints to prevent duplicate analysis.

Key Metrics: 7 rec/s input | 1,042ms batches | 11.3k dedup keys | Zero message loss
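The exactly-once deduplication described above can be illustrated in miniature: a persisted set of seen `post_id` keys standing in for Delta Lake's MERGE semantics and the DBFS checkpoint. This is a conceptual sketch, not Spark code; `process_batch` is a hypothetical helper.

```python
import json
import tempfile
from pathlib import Path

def process_batch(batch, checkpoint: Path):
    """Drop posts already seen in earlier micro-batches.

    Seen keys persist in a JSON checkpoint file, so replayed
    Event Hubs messages are filtered out instead of double-counted.
    """
    seen = set(json.loads(checkpoint.read_text())) if checkpoint.exists() else set()
    fresh = [p for p in batch if p["post_id"] not in seen]
    seen.update(p["post_id"] for p in fresh)
    checkpoint.write_text(json.dumps(sorted(seen)))
    return fresh

ckpt = Path(tempfile.mkdtemp()) / "checkpoint.json"
out1 = process_batch([{"post_id": "1"}, {"post_id": "2"}], ckpt)
# Post "2" replays in the next micro-batch and is dropped
out2 = process_batch([{"post_id": "2"}, {"post_id": "3"}], ckpt)
```

In the real pipeline, Structured Streaming maintains this state internally (the 11.3k distinct keys reported above) and the checkpoint lives in DBFS.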
Stage 3: GitHub Data Bridge

Role: Platform-independent connector that moves data between Azure cloud and academic GPU resources without requiring direct network access or VPN configuration.

Databricks exports incremental batches (100-500 posts) via GitHub REST API using timestamp watermarks to track what's already been processed. Base64-encoded JSON flows to incremental/ directory with commit SHA verification ensuring uploads succeed. Academic cluster polls via git pull every 30 minutes with file-based locking to prevent concurrent processing. Enhanced ML results return through enhanced/ directory. Manifest files maintain processing state across the hybrid architecture.

Key Metrics: 100-500 post batches | 30min sync | 5000 req/hr rate limit | Bidirectional flow
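The file-based locking that guards the 30-minute sync can be sketched with `os.open` and the `O_CREAT | O_EXCL` flags, which create the lock file atomically and fail if it already exists. This is an assumed implementation, not the project's actual sync script.

```python
import os
import tempfile
from contextlib import contextmanager

@contextmanager
def file_lock(path):
    """Hold an exclusive lock by atomically creating a lock file.

    A second sync job started while one is running gets
    FileExistsError instead of racing the first.
    """
    fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    try:
        yield
    finally:
        os.close(fd)
        os.unlink(path)  # release: remove the lock file

lock_path = os.path.join(tempfile.mkdtemp(), "sync.lock")
events = []
with file_lock(lock_path):
    events.append(os.path.exists(lock_path))  # lock file present while held
    try:
        with file_lock(lock_path):  # concurrent acquisition attempt
            events.append("reacquired")
    except FileExistsError:
        events.append("blocked")
```

A production version would also handle stale locks left by crashed jobs, e.g. by checking the lock file's age against the sync interval.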

Stage 4: GPU-Accelerated ML Processing

Role: Advanced sentiment and emotion classification using state-of-the-art transformer models, running on cost-effective academic hardware to avoid expensive cloud GPU bills.

Academic cluster (48-core CPU, 124GB RAM, T4 GPU) runs PyTorch transformers: cardiffnlp/twitter-roberta-base-sentiment-latest (sentiment) and j-hartmann/emotion-english-distilroberta-base (6-category emotions). Chunked batch processing (size 32) achieves 17.5 posts/sec with 87.9% success rate. Output adds ml_sentiment_score (-1.0 to +1.0 continuous scale), dominant_emotion classification, emotion probability distributions, and processing metadata. Enhanced data returns via GitHub bridge for Delta Lake integration and Mistral-7B-powered natural language insight generation.

Key Metrics: 17.5 posts/sec | 87.9% success | Batch size 32 | 6-emotion classification
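Two pieces of the ML stage are easy to show in isolation: splitting posts into size-32 chunks for GPU batching, and collapsing class probabilities into the continuous -1.0 to +1.0 score. The scoring convention shown (P(positive) - P(negative)) is an assumption; the paper's exact mapping may differ.

```python
def chunks(items, size=32):
    """Yield fixed-size chunks so each GPU batch stays within memory limits."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def continuous_score(probs):
    """Map sentiment class probabilities to a -1.0..+1.0 score.

    Assumed convention: P(positive) - P(negative); neutral
    probability mass pulls the score toward 0.
    """
    return probs["positive"] - probs["negative"]

texts = [f"post {i}" for i in range(70)]
batch_sizes = [len(c) for c in chunks(texts)]  # 70 posts -> batches of 32, 32, 6
score = continuous_score({"negative": 0.7, "neutral": 0.2, "positive": 0.1})
```

In the real pipeline the probabilities come from `cardiffnlp/twitter-roberta-base-sentiment-latest` run through PyTorch on the T4 GPU.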

Data Quality Assurance

Multi-level validation ensures production reliability: 20-character minimum filtering, UTF-8 encoding verification, Databricks exactly-once checkpointing, and academic cluster timeout management. Comprehensive metadata tracking (timestamps, model versions, hardware specs) enables quality auditing. Successfully processed 6,806 posts with 6,269 ML-enhanced (92.1% completion rate).
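The first two validation gates (20-character minimum and UTF-8 verification) can be sketched as a single predicate. `validate_post` is a hypothetical name for illustration; the real pipeline applies these checks inside its Spark and academic-cluster jobs.

```python
def validate_post(record):
    """Quality gate applied before a post reaches ML processing."""
    text = record.get("text")
    if not isinstance(text, str) or len(text) < 20:  # 20-character minimum
        return False
    try:
        text.encode("utf-8")  # reject lone surrogates that break UTF-8 encoding
    except UnicodeEncodeError:
        return False
    return True

records = [
    {"text": "short"},                                              # fails length gate
    {"text": "a post long enough to carry real sentiment signal"},  # passes
    {"text": "bad surrogate \ud800 pair that cannot be encoded"},   # fails UTF-8 gate
]
kept = [r for r in records if validate_post(r)]
```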

Complete System Architecture

Full System Architecture Diagram

Complete four-stage pipeline showing data flow from collection through ML enhancement

Results & Analysis

Production performance on 6,806 posts analyzing political sentiment patterns

Sentiment Distribution & Classification (#Trump)

Sentiment Distribution Chart

Analysis of 6,269 #trump posts revealed 65.3% negative, 23.2% positive, 11.6% neutral sentiment with mean score -0.256. The bimodal distribution shows a dominant negative sentiment cluster around -0.5 and a smaller neutral cluster near 0.0, indicating polarized public opinion. Fine-grained continuous scoring across 4,698 unique values (range: -0.958 to +0.985) demonstrates nuanced classification beyond binary positive/negative labels. High volatility (coefficient of variation: 1.75) validates the system's capability for detecting sentiment complexity in political discourse.

System Performance Metrics

Databricks Streaming Metrics

Databricks streaming metrics demonstrate stable production performance with 10-second micro-batch processing achieving 7 records/sec input rate and 4 records/sec processing rate. Consistent 1,042ms average batch duration with 11ms latest processing time validates real-time capabilities. Aggregation state maintained 11.3k distinct keys for deduplication across sustained multi-hour operation.

Key Findings

Comparative Analysis: #biden content (-0.366 avg, 69.6% negative) showed 43% stronger negative bias than #trump (-0.256 avg, 65.3% negative), despite smaller sample size (69 vs 6,269 posts).

Multi-Dimensional Classification: Six-category emotion analysis (joy, anger, sadness, fear, disgust, surprise) achieved average confidence above 0.7, with political content dominated by anger and disgust for negative posts, joy for positive posts.

Production Performance: Sustained 1,000+ posts/hour with 18s end-to-end latency, 87.9% ML success rate, zero message loss, and 95% cost reduction versus cloud GPU instances.

Automated Intelligence: Mistral-7B LLM analysis successfully generated summaries identifying positive themes ("gratitude and appreciation," "entertainment") and negative themes ("authoritarian tendencies," "political criticism") with emotional intensity classification.

Technical Deep Dive

Read the complete 24-page academic paper with full implementation details, literature review, and comprehensive performance analysis including Event Hubs metrics, Databricks streaming characteristics, and comparative cost breakdowns.

Download Full Paper (PDF) View Repository