Spaces: mv63/TheScopeBackend

Vedant Jigarbhai Mehta committed on Apr 7

Commit

1150f40

1 Parent(s): 0357317

sync ui and backend with main for compare feature and reframe

Files changed (20)

README.md +292 -1
backend/routes/clusters.py +5 -3
backend/routes/overview.py +145 -1
backend/routes/search.py +1 -1
backend/services/llm_service.py +218 -45
frontend/index.html +1 -1
frontend/src/App.jsx +2 -0
frontend/src/components/common/AISummary.jsx +11 -4
frontend/src/components/layout/Sidebar.jsx +1 -1
frontend/src/components/layout/TopNavbar.jsx +2 -1
frontend/src/pages/Clusters.jsx +6 -0
frontend/src/pages/Compare.jsx +235 -0
frontend/src/pages/Embeddings.jsx +23 -11
frontend/src/pages/Landing.jsx +2 -2
frontend/src/pages/Network.jsx +8 -7
frontend/src/pages/Overview.jsx +31 -11
frontend/src/pages/Search.jsx +7 -1
frontend/src/pages/TimeSeries.jsx +1 -1
frontend/src/services/api.js +8 -0
vedant-prompts.md +44 -1

README.md CHANGED Viewed

@@ -7,4 +7,295 @@ sdk: docker
 pinned: false
 ---
-# TheScope — Political Discourse Analysis Dashboard

 pinned: false
 ---
+# TheScope — Tracing Narratives Across Reddit Communities
+A full-stack investigative reporting dashboard analyzing how a curated set of Reddit communities discussed the 2024 US election and 2025 transition of power. Nine of the ten subreddits are explicitly political; the tenth (r/worldpolitics) is included as a documented case of unmoderated community drift.
+**Live Demo**: [https://huggingface.co/spaces/mv63/thescope-dashboard](https://huggingface.co/spaces/mv63/thescope-dashboard)
+---
+## What This Project Does
+This dashboard analyzes 8,799 Reddit posts from 10 subreddits collected between July 2024 and February 2025. Nine of the subreddits span the political spectrum from far-left (r/Anarchism, r/socialism) to far-right (r/Conservative, r/Republican); the tenth (r/worldpolitics) was originally political but has drifted into largely unmoderated, off-topic content and is included as a documented case study in community drift. The dashboard combines NLP (sentence embeddings, topic clustering, semantic search) with network analysis (PageRank, betweenness centrality, Louvain community detection) to trace how narratives spread across these communities.
+The research question: **How do politically diverse communities process the same events — and who bridges the divides?**
+### Key Findings from the Data
+- **87 bridge accounts** post in 2+ subreddits — potential cross-community influence nodes
+- **1,500% activity surge** after inauguration (Jan 20, 2025) — avg posts/day jumped from 13 to 217
+- **Media fragmentation**: r/Conservative shares breitbart.com (#1), r/politics shares nytimes.com (#1) — isolated information ecosystems
+- **High-velocity accounts**: M_i_c_K posted 246 times in 26 days (9+/day) — potential automated behavior
+---
+## Features
+### 1. Overview Page
+- Key metrics (posts, authors, date range, network stats)
+- Activity timeline with real political events annotated (Biden drops out, Election Day, Inauguration, Executive Orders spike)
+- Subreddit distribution and top news sources shared
+- Collapsible methodology section explaining data pipeline, NLP approach, network construction, and AI integration
+- AI-generated executive summary
+### 2. Time Series Analysis
+- Post volume over time by subreddit (filterable, adjustable granularity: day/week/month)
+- Average engagement score over time
+- Topic trends over time (KMeans clusters, adjustable k)
+- Dynamic AI-generated summaries beneath each chart
+### 3. Network Analysis
+- Interactive force-directed graph (react-force-graph-2d, HTML5 canvas)
+- Nodes colored by Louvain community, sized by PageRank
+- 3 edge types: crosspost links (weight 3.0), shared URL co-sharing (weight 2.0), co-subreddit activity (weight 1.0)
+- [deleted] accounts excluded to prevent false super-connectors
+- Click any node to inspect PageRank, betweenness, community, subreddits
+- **Node removal simulation**: remove an account and see how the network fragments (e.g., removing John3262005 splits the network from 72 to 83 components)
+- Min-degree filter slider
+- AI-generated network summary
+### 4. Topic Clusters
+- KMeans clustering on 384-dim sentence embeddings
+- Tunable cluster count (k slider: 2-50)
+- Donut chart showing cluster proportions (clickable to expand)
+- Expandable cluster detail: subreddit breakdown + top 10 posts with Reddit links
+- Handles extreme k values gracefully (clamped with warning)
+- AI-generated cluster summary
+### 5. Compare Communities
+- Side-by-side comparison of any two subreddits in the dataset
+- Each side shows: total posts, unique authors, average score and comments, top 10 news domains, top 5 discussion topics, top 10 most active authors (clickable to Reddit), top 5 highest-scoring posts
+- Overlapping line chart showing both communities' weekly post volume on the same axes
+- AI-generated 4-paragraph analytical comparison covering engagement, information ecosystems, topical focus, and a journalist-ready takeaway
+- Default comparison: r/Conservative vs r/socialism (maximum political contrast)
+### 6. SearchAI (Semantic Search Chatbot)
+- Results ranked by semantic similarity, not keyword matching
+- Chat-style interface with conversation history
+- Time-series chart showing matching posts over time (with day/week/month toggle)
+- Follow-up query suggestions (LLM-generated)
+- Clickable results link directly to Reddit posts
+- Handles edge cases: empty input, short queries, non-English input, gibberish
+### 7. Embedding Explorer
+- Interactive Datamapplot visualization of all 8,799 posts in 2D (UMAP projection)
+- Zoom, pan, and search within the embedding space
+- Posts near each other discuss similar themes
+- AI-generated 4-paragraph explanation of how to read the embedding map (for non-technical users)
+---
+## Semantic Search: Zero Keyword Overlap Examples
+The rubric requires queries with zero keyword overlap returning correct results. The 3 examples below are matched mainly on meaning, although each query shares one word or stem with its top result:
+### Example 1
+**Query**: "government overreach and civil liberties"
+**Top Result**: "Project 2025: An Unconstitutional Overreach" (59.2% similarity)
+**Why correct**: Both discuss government exceeding its authority. The only word shared between the query and the result title is "overreach"; the rest of the match is semantic.
+### Example 2
+**Query**: "economic hardship among workers"
+**Top Result**: "Can Worker-to-Worker Organizing Help Labor Survive The Trump" (46.4% similarity)
+**Why correct**: Both concern workers' economic struggles. Apart from the shared stem "worker", the query and title use different vocabulary.
+### Example 3
+**Query**: "online manipulation campaigns"
+**Top Result**: "The resistance, online coordination and the state of our par..." (49.8% similarity)
+**Why correct**: Both concern coordinated online activity. The only shared keyword is "online".
+---
+## ML/AI Component Specifications
+| Component | Model/Library | Key Parameters |
+|-----------|---------------|----------------|
+| **Sentence Embeddings** | all-MiniLM-L6-v2 (sentence-transformers) | 384 dimensions, L2-normalized, pre-computed for all 8,799 posts |
+| **Topic Clustering** | KMeans (scikit-learn) | k tunable 2-50, pre-computed for k=3,5,8,10,15,20,30,50, Euclidean distance on L2-normalized embeddings (same neighbor ordering as cosine distance) |
+| **Dimensionality Reduction** | UMAP (umap-learn) | n_components=2, n_neighbors=15, min_dist=0.1, metric=cosine, random_state=42 |
+| **Network Analysis** | PageRank + Betweenness centrality (NetworkX), Louvain community detection (python-louvain) | 3 edge types with weights 3.0/2.0/1.0, [deleted] excluded |
+| **LLM Summaries** | Gemma 3 27B (Google AI via google-generativeai) | temperature=0.3, max_tokens=500-700, in-memory caching, used for chart summaries, search answers, cluster analysis, network analysis, embedding explanations, and community comparison |
+| **Embedding Visualization** | Datamapplot | Interactive HTML with search, zoom, pan |
+| **Language Detection** | langdetect | Used for non-English query detection before LLM translation |
+| **Semantic Search** | Cosine similarity via numpy dot product on L2-normalized embeddings | Threshold 0.45 for quality, 0.30 for time-series matching |
+---
+## Tech Stack
+| Layer | Technology | Why |
+|-------|-----------|-----|
+| Backend | Flask (Python) | Lightweight, matches job requirements |
+| Frontend | React.js (Vite) + Tailwind CSS | Modern, fast builds, matches job requirements |
+| Database | SQLite | Scale-appropriate for 8.8K rows, ships as single file |
+| Charts | Recharts | React-native, clean time-series support |
+| Network Viz | react-force-graph-2d | HTML5 canvas rendering, handles hundreds of nodes |
+| Deployment | Hugging Face Spaces (Docker) | Free, supports ML model loading, 16GB RAM |
+---
+## Local Setup
+### Prerequisites
+- Python 3.11+
+- Node.js 22+
+### Backend
+```bash
+cd backend
+pip install -r requirements.txt
+python app.py
+```
+### Frontend
+```bash
+cd frontend
+npm install
+npm run dev
+```
+### Environment Variables
+Create a `.env` file in the project root:
+```
+GEMINI_API_KEY=your_google_ai_api_key
+GEMINI_MODEL=gemma-3-27b-it
+```
+### Data Pipeline (to regenerate pre-computed data)
+```bash
+cd backend
+python pipeline/ingest.py          # JSONL → SQLite
+python pipeline/embed.py           # Generate embeddings
+python pipeline/reduce_dims.py     # UMAP 2D projection
+python pipeline/build_graph.py     # Network graph
+python pipeline/cluster.py         # Topic clusters
+python pipeline/build_datamapplot.py  # Embedding visualization HTML
+```
+---
+## Architecture
+### System Overview
+```
+┌─────────────────────────────────────────────────────────────────────┐
+│                        CLIENT (Browser)                             │
+│                                                                     │
+│  React.js SPA (Vite build)                                         │
+│  ├── Landing     — animated hero with call to action               │
+│  ├── Overview    — metrics, timeline, key findings                 │
+│  ├── Time Series — post volume, engagement, topic trends           │
+│  ├── Network     — force-directed graph, node removal              │
+│  ├── Topics      — KMeans clusters, donut chart, detail panels     │
+│  ├── Compare     — side-by-side comparison of two communities      │
+│  ├── SearchAI    — semantic search chatbot, time-series chart      │
+│  └── Embeddings  — Datamapplot interactive visualization           │
+│                                                                     │
+│  Libraries: Recharts, react-force-graph-2d, Axios, Tailwind CSS    │
+└──────────────────────────┬──────────────────────────────────────────┘
+                           │ HTTP (same origin)
+                           ▼
+┌──────────────────────────────────────────────────────────────────────┐
+│                      FLASK SERVER (gunicorn)                         │
+│                                                                      │
+│  API Endpoints:                                                      │
+│  ├── /api/v1/overview/stats        — dataset statistics              │
+│  ├── /api/v1/compare               — side-by-side subreddit compare  │
+│  ├── /api/v1/embeddings/summary    — AI explanation of embedding map │
+│  ├── /api/v1/timeseries/posts      — post volume over time          │
+│  ├── /api/v1/timeseries/engagement — engagement metrics over time   │
+│  ├── /api/v1/timeseries/topics     — topic trends over time         │
+│  ├── /api/v1/search                — semantic search + LLM answer   │
+│  ├── /api/v1/search/timeseries     — search results over time       │
+│  ├── /api/v1/network/graph         — network with centrality        │
+│  ├── /api/v1/network/remove-node   — node removal simulation       │
+│  └── /api/v1/clusters              — topic clusters with tunable k  │
+│                                                                      │
+│  In-memory at startup:                                               │
+│  ├── embeddings.npy (8799 × 384)     — sentence embeddings          │
+│  ├── graph.json (320 nodes, 773 edges) — pre-computed network       │
+│  └── SentenceTransformer model        — for query embedding         │
+│                                                                      │
+│  On-disk: posts.db (SQLite)                                          │
+└──────────────────────────┬───────────────────────────────────────────┘
+                           │ API call (LLM only)
+                           ▼
+                  ┌─────────────────┐
+                  │  Google AI       │
+                  │  Gemma 3 27B     │
+                  │  - chart summaries│
+                  │  - search answers │
+                  │  - translations   │
+                  └─────────────────┘
+```
+### Data Pipeline (runs once during build)
+```
+data.jsonl (8,799 Reddit posts, 44MB)
+    │
+    ├── ingest.py ──────────→ posts.db (SQLite, 16MB)
+    │                         8,799 rows, indexed by subreddit/author/date
+    │
+    ├── embed.py ───────────→ embeddings.npy (8799 × 384, 13MB)
+    │                         all-MiniLM-L6-v2, L2-normalized
+    │
+    ├── reduce_dims.py ─────→ umap_coords.npy (8799 × 2)
+    │                         UMAP: n_neighbors=15, min_dist=0.1, cosine
+    │
+    ├── build_graph.py ─────→ graph.json (320 nodes, 773 edges)
+    │                         3 edge types, PageRank, betweenness, Louvain
+    │                         [deleted] excluded
+    │
+    ├── cluster.py ─────────→ cluster_assignments in SQLite
+    │                         KMeans for k=3,5,8,10,15,20,30,50
+    │
+    └── build_datamapplot.py → datamapplot.html
+                               Interactive embedding visualization
+```
+### Runtime Data Flow (per search query)
+```
+User types "immigration policy"
+    │
+    ├─ 1. Validate input (not empty, not greeting)
+    ├─ 2. Detect language → "en" (if non-English → translate via LLM)
+    ├─ 3. Embed query with all-MiniLM-L6-v2 → 384-dim vector (~5ms)
+    ├─ 4. Cosine similarity: query × 8,799 embeddings (<10ms)
+    ├─ 5. Rank by similarity, take top 20
+    ├─ 6. Fetch post details from SQLite
+    ├─ 7. LLM generates conversational answer (~3-5s)
+    └─ 8. Return: answer + results + follow-up suggestions + time-series
+```
+Pre-computed artifacts are generated once during the pipeline phase. At runtime, the only remaining work is query embedding (~5ms), cosine similarity search (<10ms), and LLM API calls (~3-5s).
+---
+## Edge Case Handling
+| Scenario | Behavior |
+|----------|----------|
+| Empty search query | "Please type a question..." + starter suggestions |
+| Very short query ("a") | "Too short for semantic search" + suggestion chips |
+| Non-English (Hindi, Japanese, Spanish, etc.) | Detect language, translate via LLM, search, show translation note |
+| Gibberish ("asdfghjkl") | "No strong matches found" + helpful suggestions |
+| Cluster k=100+ | Clamped to 50 with warning message |
+| Cluster k=-5, 0, 1 | Clamped to 2 with warning message |
+| Network node removal | Shows fragmentation impact (before/after component count) |
+| Non-existent node removal | 404 with "Author not found" message |
+| Disconnected graph components | Count displayed in stats, no crash |
+| Greeting ("hello", "hola") | Friendly intro + suggestion chips |
+---
+## A Note on r/worldpolitics
+r/worldpolitics is one of the 10 subreddits in this dataset and is a known case study in unmoderated community drift — what was originally a political discussion sub became almost entirely off-topic and NSFW content over time. Every post in this subreddit is flagged `over_18 = true` in the source data.
+I chose to keep r/worldpolitics in all analyses (stats, network graph, clusters, embeddings) because removing it would invalidate the comparison with the other 9 moderated political subreddits and change the network topology. The contrast between 9 moderated communities and 1 unmoderated one is itself a finding worth surfacing — it's exactly the kind of "actor or community drift" SimPPL builds tools to study.
+Where individual post titles are shown in the UI (Compare, Clusters, Search), I display a small contextual note next to r/worldpolitics content rather than filtering it out. The dashboard is a research tool, not a consumer product — hiding data would be editorializing.
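
Review note on the README's runtime search flow (steps 3 to 5): the ranking step can be sketched as below. This is a dependency-free illustration, not the project's code. The README says the backend uses a numpy dot product over 384-dimensional embeddings; the function names `l2_normalize` and `top_matches` and the toy 3-dimensional vectors are assumptions made for the example. The 0.45 threshold and top-20 cutoff are taken from the README.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def top_matches(query_vec, post_vecs, threshold=0.45, limit=20):
    """Rank posts by cosine similarity and keep those at or above the threshold."""
    q = l2_normalize(query_vec)
    scored = []
    for idx, vec in enumerate(post_vecs):
        # Dot product of two unit vectors is their cosine similarity.
        score = sum(a * b for a, b in zip(q, l2_normalize(vec)))
        if score >= threshold:
            scored.append((idx, score))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:limit]

# Toy 3-dimensional vectors standing in for the 384-dimensional embeddings.
posts = [[1.0, 0.0, 0.0], [0.6, 0.8, 0.0], [0.0, 0.0, 1.0]]
matches = top_matches([1.0, 0.0, 0.0], posts)
```

Here the third post is orthogonal to the query and falls below the threshold, so only the first two indices are returned, best match first.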

backend/routes/clusters.py CHANGED Viewed

@@ -50,7 +50,8 @@ def get_clusters():
                 placeholders = ','.join(['?' for _ in all_pids])
                 top = conn.execute(f"""
                     SELECT id, title, subreddit, score, author, permalink, created_date FROM posts
-                    WHERE id IN ({placeholders}) ORDER BY score DESC LIMIT 10
                 """, all_pids).fetchall()
                 clusters[cid]['top_posts'] = [
                     {'id': t[0], 'title': t[1], 'subreddit': t[2], 'score': t[3],
@@ -110,11 +111,12 @@ def get_clusters():
                 label = f"Cluster {i}"
         cluster_post_ids = [post_ids[j] for j in range(len(labels)) if labels[j] == i]
-        pids_sample = cluster_post_ids[:5]
         placeholders = ','.join(['?' for _ in pids_sample])
         top = conn.execute(f"""
             SELECT id, title, subreddit, score FROM posts
-            WHERE id IN ({placeholders}) ORDER BY score DESC
         """, pids_sample).fetchall()
         clusters[i] = {

                 placeholders = ','.join(['?' for _ in all_pids])
                 top = conn.execute(f"""
                     SELECT id, title, subreddit, score, author, permalink, created_date FROM posts
+                    WHERE id IN ({placeholders})
+                    ORDER BY score DESC LIMIT 10
                 """, all_pids).fetchall()
                 clusters[cid]['top_posts'] = [
                     {'id': t[0], 'title': t[1], 'subreddit': t[2], 'score': t[3],
                 label = f"Cluster {i}"
         cluster_post_ids = [post_ids[j] for j in range(len(labels)) if labels[j] == i]
+        pids_sample = cluster_post_ids[:10]
         placeholders = ','.join(['?' for _ in pids_sample])
         top = conn.execute(f"""
             SELECT id, title, subreddit, score FROM posts
+            WHERE id IN ({placeholders})
+            ORDER BY score DESC LIMIT 5
         """, pids_sample).fetchall()
         clusters[i] = {
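
Review note on the hunks above: both queries build an `IN (...)` list from `?` placeholders and then sort by score. A minimal, self-contained sketch of that pattern against an in-memory SQLite database follows. The helper name `top_posts_by_score`, the reduced three-column schema, and the sample rows are hypothetical; only the placeholder technique and the `ORDER BY score DESC LIMIT` shape come from the diff.

```python
import sqlite3

def top_posts_by_score(conn, post_ids, limit=5):
    """Return up to `limit` (id, title, score) rows for the given ids, highest score first."""
    if not post_ids:
        return []  # an empty IN () list is a SQL syntax error, so short-circuit
    # One "?" per id keeps the values parameterized instead of string-formatted.
    placeholders = ",".join("?" for _ in post_ids)
    query = (
        f"SELECT id, title, score FROM posts "
        f"WHERE id IN ({placeholders}) ORDER BY score DESC LIMIT ?"
    )
    return conn.execute(query, [*post_ids, limit]).fetchall()

# Hypothetical sample data for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id TEXT PRIMARY KEY, title TEXT, score INTEGER)")
conn.executemany(
    "INSERT INTO posts VALUES (?, ?, ?)",
    [("a", "first", 10), ("b", "second", 50), ("c", "third", 30), ("d", "fourth", 5)],
)
rows = top_posts_by_score(conn, ["a", "b", "c"], limit=2)
```

One thing worth checking in the second hunk: `cluster_post_ids[:10]` slices before the query runs, so `LIMIT 5` returns the five highest-scoring posts of that ten-id sample, which is not necessarily the top five of the whole cluster.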

backend/routes/overview.py CHANGED Viewed

@@ -3,11 +3,99 @@ Overview / stats endpoints.
 """
 import sqlite3
-from flask import Blueprint, jsonify, current_app
 overview_bp = Blueprint('overview', __name__)
 @overview_bp.route('/overview/stats')
 def get_stats():
     conn = sqlite3.connect(current_app.config['db_path'])
@@ -63,3 +151,59 @@ def get_stats():
     stats['executive_summary'] = generate_overview_summary(stats)
     return jsonify(stats)

 """
 import sqlite3
+from flask import Blueprint, jsonify, current_app, request
 overview_bp = Blueprint('overview', __name__)
+VALID_SUBREDDITS = {
+    'Anarchism', 'socialism', 'democrats', 'Liberal', 'politics',
+    'PoliticalDiscussion', 'neoliberal', 'worldpolitics', 'Conservative', 'Republican'
+}
+def _get_subreddit_stats(conn, subreddit):
+    """Fetch comprehensive stats for one subreddit."""
+    # Basic counts
+    counts = conn.execute("""
+        SELECT COUNT(*) as total,
+               COUNT(DISTINCT author) as authors,
+               AVG(score) as avg_score,
+               AVG(num_comments) as avg_comments,
+               MAX(score) as max_score,
+               SUM(score) as total_score
+        FROM posts WHERE subreddit = ?
+    """, (subreddit,)).fetchone()
+    # Top news domains
+    top_domains = conn.execute("""
+        SELECT domain, COUNT(*) as count FROM posts
+        WHERE subreddit = ?
+          AND domain NOT LIKE 'self.%'
+          AND domain != ''
+          AND domain != 'i.redd.it'
+          AND domain != 'v.redd.it'
+          AND domain != 'reddit.com'
+        GROUP BY domain ORDER BY count DESC LIMIT 10
+    """, (subreddit,)).fetchall()
+    # Top authors
+    top_authors = conn.execute("""
+        SELECT author, COUNT(*) as count, AVG(score) as avg_score
+        FROM posts
+        WHERE subreddit = ? AND author != '[deleted]'
+        GROUP BY author ORDER BY count DESC LIMIT 10
+    """, (subreddit,)).fetchall()
+    # Top topics from k=15 cluster assignments
+    top_topics = conn.execute("""
+        SELECT c.cluster_label, COUNT(*) as count
+        FROM posts p
+        JOIN cluster_assignments c ON p.id = c.post_id
+        WHERE p.subreddit = ? AND c.k = 15
+        GROUP BY c.cluster_label
+        ORDER BY count DESC
+        LIMIT 5
+    """, (subreddit,)).fetchall()
+    # Top posts (highest scoring)
+    top_posts = conn.execute("""
+        SELECT id, title, score, author, permalink, created_date
+        FROM posts
+        WHERE subreddit = ?
+        ORDER BY score DESC LIMIT 5
+    """, (subreddit,)).fetchall()
+    # Time series — weekly post volume
+    timeseries = conn.execute("""
+        SELECT strftime('%Y-%W', created_date) as week, COUNT(*) as count
+        FROM posts WHERE subreddit = ?
+        GROUP BY week ORDER BY week
+    """, (subreddit,)).fetchall()
+    return {
+        'name': subreddit,
+        'total_posts': counts[0],
+        'unique_authors': counts[1],
+        'avg_score': round(counts[2], 1) if counts[2] else 0,
+        'avg_comments': round(counts[3], 1) if counts[3] else 0,
+        'max_score': counts[4] or 0,
+        'total_score': counts[5] or 0,
+        'top_domains': [{'domain': d[0], 'count': d[1]} for d in top_domains],
+        'top_authors': [
+            {'author': a[0], 'count': a[1], 'avg_score': round(a[2], 1) if a[2] else 0}
+            for a in top_authors
+        ],
+        'top_topics': [{'label': t[0], 'count': t[1]} for t in top_topics],
+        'top_posts': [
+            {'id': p[0], 'title': p[1], 'score': p[2], 'author': p[3],
+             'permalink': p[4], 'date': p[5]}
+            for p in top_posts
+        ],
+        'timeseries': [{'date': t[0], 'count': t[1]} for t in timeseries],
+    }
 @overview_bp.route('/overview/stats')
 def get_stats():
     conn = sqlite3.connect(current_app.config['db_path'])
     stats['executive_summary'] = generate_overview_summary(stats)
     return jsonify(stats)
+@overview_bp.route('/compare')
+def compare_subreddits():
+    """Compare two subreddits side by side."""
+    sub1 = request.args.get('sub1', 'Conservative')
+    sub2 = request.args.get('sub2', 'socialism')
+    # Validate inputs
+    if sub1 not in VALID_SUBREDDITS or sub2 not in VALID_SUBREDDITS:
+        return jsonify({
+            'error': True,
+            'message': f'Invalid subreddit. Must be one of: {", ".join(sorted(VALID_SUBREDDITS))}'
+        }), 400
+    if sub1 == sub2:
+        return jsonify({
+            'error': True,
+            'message': 'Please select two different subreddits to compare.'
+        }), 400
+    conn = sqlite3.connect(current_app.config['db_path'])
+    sub1_stats = _get_subreddit_stats(conn, sub1)
+    sub2_stats = _get_subreddit_stats(conn, sub2)
+    conn.close()
+    # Generate comparison summary via LLM
+    from services.llm_service import generate_comparison_summary
+    summary = generate_comparison_summary(sub1_stats, sub2_stats)
+    return jsonify({
+        'sub1': sub1_stats,
+        'sub2': sub2_stats,
+        'summary': summary,
+    })
+@overview_bp.route('/embeddings/summary')
+def embeddings_summary():
+    """Return an AI-generated explanation of the embedding visualization."""
+    conn = sqlite3.connect(current_app.config['db_path'])
+    total_posts = conn.execute("SELECT COUNT(*) FROM posts").fetchone()[0]
+    subreddit_counts = conn.execute(
+        "SELECT subreddit, COUNT(*) as count FROM posts GROUP BY subreddit ORDER BY count DESC"
+    ).fetchall()
+    conn.close()
+    stats = {
+        'total_posts': total_posts,
+        'subreddits': [{'name': s[0], 'count': s[1]} for s in subreddit_counts],
+    }
+    from services.llm_service import generate_embeddings_summary
+    summary = generate_embeddings_summary(stats)
+    return jsonify({'summary': summary})
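
Review note on `/compare`: the two input checks in `compare_subreddits` can be exercised without Flask by pulling them into a pure function. The sketch below mirrors the order of the checks and the messages in the diff; the function name `validate_comparison` is hypothetical and is not part of the commit.

```python
VALID_SUBREDDITS = {
    'Anarchism', 'socialism', 'democrats', 'Liberal', 'politics',
    'PoliticalDiscussion', 'neoliberal', 'worldpolitics', 'Conservative', 'Republican'
}

def validate_comparison(sub1, sub2, valid=VALID_SUBREDDITS):
    """Return an error message, or None when the pair can be compared."""
    # Membership is case-sensitive, matching the set lookup in the route.
    if sub1 not in valid or sub2 not in valid:
        return "Invalid subreddit. Must be one of: " + ", ".join(sorted(valid))
    if sub1 == sub2:
        return "Please select two different subreddits to compare."
    return None

ok = validate_comparison("Conservative", "socialism")
same = validate_comparison("Conservative", "Conservative")
bad_case = validate_comparison("conservative", "socialism")
```

Because the lookup is case-sensitive, a request such as `?sub1=conservative` would be rejected with the 400 response. Whether that is intended is a question for the author.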

backend/routes/search.py CHANGED Viewed

@@ -39,7 +39,7 @@ def search():
     query_lower = query.lower().strip('?!. ')
     if query_lower in greetings:
         return jsonify({
-            'answer': 'Hey! I can help you explore a dataset of 8,799 Reddit posts from 10 political subreddits (July 2024 - Feb 2025). Ask me anything about political discourse, media sharing, or community behavior.',
             'results': [],
             'follow_up_queries': [
                 'What topics dominated after the inauguration?',

     query_lower = query.lower().strip('?!. ')
     if query_lower in greetings:
         return jsonify({
+            'answer': 'Hey! I can help you explore a dataset of 8,799 Reddit posts from 10 subreddits (July 2024 - Feb 2025). Ask me anything about how these communities discussed events, what news they shared, or how they overlap.',
             'results': [],
             'follow_up_queries': [
                 'What topics dominated after the inauguration?',

backend/services/llm_service.py CHANGED Viewed

@@ -67,12 +67,16 @@ def generate_timeseries_summary(series_data, metric, granularity, subreddits=Non
     # Aggregate totals per period
     period_totals = {}
     sub_totals = {}
     for item in series_data:
         date = item.get('date', '')
         count = item.get('count', item.get('avg', 0))
         sub = item.get('subreddit', '')
         period_totals[date] = period_totals.get(date, 0) + count
         sub_totals[sub] = sub_totals.get(sub, 0) + count
     periods = sorted(period_totals.keys())
     if not periods:
@@ -84,46 +88,81 @@ def generate_timeseries_summary(series_data, metric, granularity, subreddits=Non
     lowest_val = period_totals[lowest_period]
     top_sub = max(sub_totals, key=sub_totals.get) if sub_totals else "N/A"
     top_3_subs = sorted(sub_totals.items(), key=lambda x: -x[1])[:3]
-    prompt = f"""Write a 2-3 sentence plain-language summary explaining this chart to someone who cannot read charts. They should understand the key trend just by reading your summary.
-IMPORTANT: The dataset covers Reddit posts from July 2024 to February 2025 ONLY. Do NOT mention any dates outside this range.
 What the chart shows: {metric} per {granularity}, {sub_filter}
 Period covered: {periods[0]} to {periods[-1]}
 Number of {granularity}s shown: {len(periods)}
 Lowest point: {lowest_period} with {lowest_val:.0f}
 Highest point: {peak_period} with {peak_val:.0f}
-Top 3 subreddits by volume: {', '.join([f'r/{s} ({v:.0f})' for s, v in top_3_subs])}
 Starting value: {period_totals.get(periods[0], 0):.0f}
 Ending value: {period_totals.get(periods[-1], 0):.0f}
-Rules:
-- State findings directly, do NOT say "the chart shows" or "the data shows"
-- Use ONLY the numbers provided above — do not invent or hallucinate any numbers
-- Explain what happened in simple terms a journalist could use
-- Mention at least one specific subreddit name and one specific number"""
-    result = _call_llm(prompt, max_tokens=200)
     if result:
         return result
-    # Fallback: rule-based summary
-    change_pct = ((period_totals.get(periods[-1], 0) - period_totals.get(periods[0], 1)) / max(period_totals.get(periods[0], 1), 1)) * 100
-    direction = "increased" if change_pct > 0 else "decreased"
     return (
-        f"Activity {direction} from {period_totals.get(periods[0], 0):.0f} to {period_totals.get(periods[-1], 0):.0f} "
-        f"over the period ({periods[0]} to {periods[-1]}). "
-        f"The peak occurred at {peak_period} with {peak_val:.0f} {metric}. "
-        f"r/{top_sub} was the most active subreddit with {sub_totals.get(top_sub, 0):.0f} total."
     )
 def generate_search_response(query, results, history=None):
     """Generate a conversational response for search results."""
     if not results:
-        prompt = f"""The user searched for "{query}" in a dataset of Reddit political posts, but no strong matches were found.
-Write a brief, helpful response (2 sentences max) acknowledging this and suggesting what they could try instead. Be specific to political discourse topics."""
         result = _call_llm(prompt, max_tokens=100)
         return result or f'No strong matches found for "{query}". Try searching for specific political topics like immigration, tariffs, or executive orders.'
@@ -138,17 +177,24 @@ Write a brief, helpful response (2 sentences max) acknowledging this and suggest
     for r in results[:10]:
         sub_counts[r['subreddit']] = sub_counts.get(r['subreddit'], 0) + 1
-    prompt = f"""You are an analyst for a political discourse research dashboard. The user searched for: "{query}"
-Here are the top 10 most relevant Reddit posts from a dataset of 8,799 posts across 10 political subreddits (July 2024 - Feb 2025):
 {results_context}
 Subreddit distribution in results: {sub_counts}
-Write a 2-3 sentence analytical response summarizing what the data shows about "{query}". Be specific — mention subreddit names, post counts, and any patterns you notice (e.g., which communities discuss this topic most, what framing they use). Do NOT just list the results."""
-    result = _call_llm(prompt, max_tokens=250)
     if result:
         return result
@@ -175,7 +221,7 @@ def generate_follow_up_queries(query, results):
         for r in results[:5]
     ])
-    prompt = f"""The user searched for "{query}" in a Reddit political discourse dataset and got these top results:
 {results_context}
@@ -205,22 +251,31 @@ Return ONLY the 3 questions, one per line, no numbering or bullets."""
 def generate_overview_summary(stats):
     """Generate an executive summary for the overview page."""
-    prompt = f"""Write a plain-text summary (NO markdown, NO headers, NO #, NO bullet points with *) for a political discourse dashboard.
 Dataset: {stats['total_posts']} Reddit posts from {stats['total_authors']} authors
 Subreddits: {', '.join([f"r/{s['name']} ({s['count']})" for s in stats['subreddits']])}
 Date range: {stats['date_range']['start']} to {stats['date_range']['end']}
-Top news sources: {', '.join([f"{d['domain']} ({d['count']} shares)" for d in stats['top_domains'][:5]])}
-Network: {stats['network_stats']['num_nodes']} connected authors, {stats['network_stats']['num_components']} separate components
-Write exactly 3 short paragraphs, plain text only:
-1. What this dataset is and why the time period matters (2024 election + 2025 transition)
-2. One specific insight: which communities share which news sources (give exact names and numbers)
-3. One specific insight: what the fragmented network (72 components) tells us about cross-community dialogue
-Do NOT use any markdown formatting. Do NOT start with "Executive Summary" or any title. Just write the paragraphs directly."""
-    result = _call_llm(prompt, max_tokens=300)
     if result:
         # Strip any markdown the LLM might still add
         cleaned = result.strip()
@@ -237,7 +292,7 @@ Do NOT use any markdown formatting. Do NOT start with "Executive Summary" or any
     return (
         f"This dataset captures {stats['total_posts']} posts from {stats['total_authors']} authors "
-        f"across 10 politically diverse subreddits, spanning {stats['date_range']['start']} to {stats['date_range']['end']}. "
         f"The period covers the 2024 US presidential election through the first weeks of the new administration. "
         f"Top shared news sources include {', '.join([d['domain'] for d in stats['top_domains'][:3]])}."
     )
@@ -250,14 +305,21 @@ def generate_cluster_summary(clusters, k):
         for c in sorted(clusters, key=lambda x: -x['size'])[:10]
     ])
-    prompt = f"""Write a plain-text summary (NO markdown, NO headers, NO #) analyzing these topic clusters from Reddit political discourse data (8,799 posts, 10 subreddits, Jul 2024 - Feb 2025).
-{k} clusters were created using KMeans. Here are the largest ones:
 {cluster_desc}
-Write 2-3 sentences explaining: what are the dominant topics, which topics overlap or are surprising, and what this tells us about what Reddit was discussing during this period. Use specific cluster names and numbers. Do NOT use markdown formatting."""
-    result = _call_llm(prompt, max_tokens=200)
     if result:
         import re
         cleaned = re.sub(r'^#{1,4}\s+.*$', '', result, flags=re.MULTILINE).strip()
@@ -267,14 +329,125 @@ Write 2-3 sentences explaining: what are the dominant topics, which topics overl
 def generate_network_summary(stats):
     """Generate a summary of the network analysis."""
-    prompt = f"""Write a plain-text summary (NO markdown, NO headers, NO #) analyzing this author interaction network from Reddit political discourse data.
-Network stats: {stats['num_nodes']} connected authors, {stats['num_edges']} edges, {stats['num_components']} disconnected components, {stats.get('num_communities', 'unknown')} communities detected.
-Density: {stats.get('density', 'unknown')}
-Write 2-3 sentences explaining: what does the high number of components mean (fragmentation), what does the density tell us about how connected authors are, and what this implies about cross-community interaction on Reddit. Use specific numbers. Do NOT use markdown formatting."""
-    result = _call_llm(prompt, max_tokens=200)
     if result:
         import re
         cleaned = re.sub(r'^#{1,4}\s+.*$', '', result, flags=re.MULTILINE).strip()
@@ -284,7 +457,7 @@ Write 2-3 sentences explaining: what does the high number of components mean (fr
 def answer_chart_question(question, data_context):
     """Answer a user's follow-up question about a specific chart's data."""
-    prompt = f"""You are analyzing a chart from a Reddit political discourse dashboard (8,799 posts from 10 subreddits, Jul 2024 - Feb 2025).
 Chart data and context:
 {data_context}

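The summary functions in backend/services/llm_service.py post-process LLM output with the same regex before returning it. A minimal sketch of that cleanup as a standalone helper follows; the name `strip_markdown` is hypothetical and does not appear in the commit, and the `**` removal is taken from the newer summary functions in this diff.

```python
import re


def strip_markdown(text: str) -> str:
    """Remove markdown header lines and bold markers from LLM output.

    Hypothetical helper: the commit inlines these two steps in each
    summary function rather than sharing one function.
    """
    # Blank out whole lines that start with 1-4 '#' characters plus whitespace.
    without_headers = re.sub(r'^#{1,4}\s+.*$', '', text, flags=re.MULTILINE).strip()
    # Drop bold markers but keep the text they wrapped.
    return without_headers.replace('**', '')


if __name__ == "__main__":
    print(strip_markdown("# Summary\nActivity **rose** after the inauguration."))
```

One caveat with the pattern as written: `\s+` can also match a newline, so a line consisting only of `#` characters would consume the line after it. Using `[ \t]+` in place of `\s+` would avoid that if it ever matters in practice.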
     # Aggregate totals per period
     period_totals = {}
     sub_totals = {}
+    sub_period_totals = {}  # {sub: {period: count}}
     for item in series_data:
         date = item.get('date', '')
         count = item.get('count', item.get('avg', 0))
         sub = item.get('subreddit', '')
         period_totals[date] = period_totals.get(date, 0) + count
         sub_totals[sub] = sub_totals.get(sub, 0) + count
+        if sub not in sub_period_totals:
+            sub_period_totals[sub] = {}
+        sub_period_totals[sub][date] = sub_period_totals[sub].get(date, 0) + count
     periods = sorted(period_totals.keys())
     if not periods:
     lowest_val = period_totals[lowest_period]
     top_sub = max(sub_totals, key=sub_totals.get) if sub_totals else "N/A"
     top_3_subs = sorted(sub_totals.items(), key=lambda x: -x[1])[:3]
+    bottom_3_subs = sorted(sub_totals.items(), key=lambda x: x[1])[:3]
+    # Compute first half vs second half average
+    mid = len(periods) // 2
+    first_half_avg = sum(period_totals[p] for p in periods[:mid]) / max(mid, 1)
+    second_half_avg = sum(period_totals[p] for p in periods[mid:]) / max(len(periods) - mid, 1)
+    pct_change = ((second_half_avg - first_half_avg) / max(first_half_avg, 1)) * 100
+    # Find which subreddit had the biggest spike near the peak
+    peak_contributors = sorted(
+        [(s, sub_period_totals[s].get(peak_period, 0)) for s in sub_period_totals],
+        key=lambda x: -x[1]
+    )[:3]
+    total_volume = sum(period_totals.values())
+    avg_per_period = total_volume / max(len(periods), 1)
+    prompt = f"""Write a detailed 5-6 sentence plain-language summary explaining this time-series chart to someone who cannot read charts. The reader should understand the trend, the key shifts, who drove the activity, and what the data reveals — purely from your summary.
+IMPORTANT: The dataset covers Reddit posts from July 2024 to February 2025 ONLY. Do NOT mention any dates outside this range. Trump's inauguration was on January 20, 2025.
+CHART CONTEXT
 What the chart shows: {metric} per {granularity}, {sub_filter}
 Period covered: {periods[0]} to {periods[-1]}
 Number of {granularity}s shown: {len(periods)}
+Total volume across the entire period: {total_volume:.0f}
+Average per {granularity}: {avg_per_period:.1f}
+KEY POINTS
 Lowest point: {lowest_period} with {lowest_val:.0f}
 Highest point: {peak_period} with {peak_val:.0f}
 Starting value: {period_totals.get(periods[0], 0):.0f}
 Ending value: {period_totals.get(periods[-1], 0):.0f}
+First half average: {first_half_avg:.1f}
+Second half average: {second_half_avg:.1f}
+Change between halves: {pct_change:+.0f}%
+SUBREDDIT BREAKDOWN
+Top 3 subreddits by total volume: {', '.join([f'r/{s} ({v:.0f})' for s, v in top_3_subs])}
+Bottom 3 subreddits: {', '.join([f'r/{s} ({v:.0f})' for s, v in bottom_3_subs])}
+Top 3 subreddits driving the peak at {peak_period}: {', '.join([f'r/{s} ({v:.0f})' for s, v in peak_contributors])}
+INSTRUCTIONS
+- Write 5 to 6 sentences, in plain English, no markdown, no bullet points.
+- Sentence 1: Describe the overall shape of the trend (was it flat, growing, falling, spiky?) and the magnitude of change between halves.
+- Sentence 2: Pinpoint the peak moment and explain what subreddits drove it.
+- Sentence 3: Compare the most active and least active subreddits — what does this say about which communities dominated the conversation?
+- Sentence 4: Mention any clear inflection point (e.g. activity surge after January 20, 2025 inauguration).
+- Sentence 5-6: End with a takeaway — what does this trend reveal about how these communities discussed events during this period?
+- Use ONLY the numbers provided above. Do not invent any numbers, dates, or subreddit names.
+- Do NOT start with "The chart shows" or "This data shows". State findings directly.
+- Be analytical, like a journalist writing for a non-technical audience."""
+    result = _call_llm(prompt, max_tokens=500)
     if result:
         return result
+    # Fallback: richer rule-based summary
+    direction = "rose sharply" if pct_change > 30 else "declined" if pct_change < -30 else "stayed relatively stable"
     return (
+        f"Activity {direction} over the {len(periods)} {granularity}s shown, with the average shifting from "
+        f"{first_half_avg:.0f} in the first half to {second_half_avg:.0f} in the second half ({pct_change:+.0f}%). "
+        f"The peak occurred at {peak_period} with {peak_val:.0f} — driven primarily by "
+        f"{', '.join([f'r/{s}' for s, _ in peak_contributors[:2]])}. "
+        f"Across the entire period, r/{top_sub} dominated with {sub_totals.get(top_sub, 0):.0f} total, "
+        f"while r/{bottom_3_subs[0][0]} contributed only {bottom_3_subs[0][1]:.0f}. "
+        f"This concentration suggests conversation during this period was unevenly distributed across communities."
     )
 def generate_search_response(query, results, history=None):
     """Generate a conversational response for search results."""
     if not results:
+        prompt = f"""The user searched for "{query}" in a dataset of Reddit posts from 10 politically associated subreddits, but no strong matches were found.
+Write a brief, helpful response (2 sentences max) acknowledging this and suggesting what they could try instead. Be specific to political topics relevant to the 2024 US election and 2025 transition."""
         result = _call_llm(prompt, max_tokens=100)
         return result or f'No strong matches found for "{query}". Try searching for specific political topics like immigration, tariffs, or executive orders.'
     for r in results[:10]:
         sub_counts[r['subreddit']] = sub_counts.get(r['subreddit'], 0) + 1
+    prompt = f"""You are an analyst for a research dashboard tracing narratives across Reddit communities. The user searched for: "{query}"
+Here are the top 10 most relevant Reddit posts from a dataset of 8,799 posts across 10 subreddits (July 2024 - Feb 2025) collected for their political associations:
 {results_context}
 Subreddit distribution in results: {sub_counts}
+Write a detailed 4-5 sentence analytical response answering the user's query based on this data. Structure it like this:
+- Open with a direct answer to "{query}" based on what the results show
+- Describe which communities are most engaged with this topic and how the distribution skews
+- Highlight one or two specific posts that best illustrate the finding (cite post titles)
+- Note any contrast or pattern across communities (e.g., "left-leaning subs frame it differently from right-leaning")
+- End with a takeaway or what's notable about this finding
+Use ONLY the data above. Be specific with subreddit names, post titles, and counts. Do NOT use markdown headers or bullet points — write flowing prose."""
+    result = _call_llm(prompt, max_tokens=450)
     if result:
         return result
         for r in results[:5]
     ])
+    prompt = f"""The user searched for "{query}" in a dataset of Reddit posts from 10 politically associated subreddits and got these top results:
 {results_context}
 def generate_overview_summary(stats):
     """Generate an executive summary for the overview page."""
+    prompt = f"""Write a plain-text executive summary (NO markdown, NO headers, NO #, NO bullet points) for an investigative reporting dashboard tracing narratives across 10 Reddit communities collected for their political associations.
 Dataset: {stats['total_posts']} Reddit posts from {stats['total_authors']} authors
 Subreddits: {', '.join([f"r/{s['name']} ({s['count']})" for s in stats['subreddits']])}
 Date range: {stats['date_range']['start']} to {stats['date_range']['end']}
+Top news sources: {', '.join([f"{d['domain']} ({d['count']} shares)" for d in stats['top_domains'][:8]])}
+Network: {stats['network_stats']['num_nodes']} connected authors, {stats['network_stats']['num_edges']} edges, {stats['network_stats']['num_components']} separate components
+Write exactly 4 substantial paragraphs (3-4 sentences each), plain text only:
+Paragraph 1 — Setting the stage:
+Describe what this dataset captures and why the time period (July 2024 to February 2025) matters historically. Reference the 2024 US presidential election and the January 20, 2025 inauguration of Trump's second term. Mention that the 10 subreddits were collected for their political associations and span the full political spectrum.
+Paragraph 2 — Volume and concentration:
+Explain that 83% of all activity (7,286 of 8,799 posts) is concentrated in January-February 2025, after the inauguration. Average daily posting jumped from ~13 posts/day to ~217 posts/day after January 20 — a 1,500% surge. Explain why this matters for tracing how narratives spread.
+Paragraph 3 — Media ecosystem fragmentation:
+Use the top news sources data to show how different subreddits share fundamentally different sources. For example, r/Conservative shares breitbart.com and foxnews.com, while r/politics shares nytimes.com and theguardian.com. Reference at least 4 specific domains by name with their share counts. This is a sign of isolated information ecosystems.
+Paragraph 4 — Network structure:
+Explain what {stats['network_stats']['num_components']} disconnected components in a {stats['network_stats']['num_nodes']}-node network reveals about cross-community dialogue. Most communities operate in isolation, but ~87 cross-community authors act as bridges. Comment on what this fragmentation means for the spread of narratives.
+Do NOT use any markdown formatting. Do NOT start with "Executive Summary" or any title. Write each paragraph as a standalone block separated by a blank line."""
+    result = _call_llm(prompt, max_tokens=700)
     if result:
         # Strip any markdown the LLM might still add
         cleaned = result.strip()
     return (
         f"This dataset captures {stats['total_posts']} posts from {stats['total_authors']} authors "
+        f"across 10 subreddits collected for their political associations, spanning {stats['date_range']['start']} to {stats['date_range']['end']}. "
         f"The period covers the 2024 US presidential election through the first weeks of the new administration. "
         f"Top shared news sources include {', '.join([d['domain'] for d in stats['top_domains'][:3]])}."
     )
         for c in sorted(clusters, key=lambda x: -x['size'])[:10]
     ])
+    prompt = f"""Write a detailed plain-text analysis (NO markdown, NO headers, NO #) of these topic clusters from a Reddit dataset (8,799 posts, 10 subreddits collected for their political associations, July 2024 to February 2025, covering the 2024 US election and 2025 presidential transition).
+{k} clusters were created using KMeans on 384-dimensional sentence embeddings. Here are the largest clusters:
 {cluster_desc}
+Write 5 to 6 sentences covering:
+1. What are the dominant themes that emerge across the largest clusters? Name at least 3 specific clusters by their keywords.
+2. Which clusters reflect election-period concerns (campaigns, voting, candidates) versus post-inauguration governance (executive orders, immigration, federal workforce)?
+3. Are there any surprising or unexpected clusters — small ones, or topics that wouldn't normally appear in politically associated subreddits?
+4. What does the distribution of cluster sizes tell us — are a few topics dominating the conversation, or is the conversation spread evenly across many topics?
+5. End with a takeaway about what these communities were discussing during this seven-month window.
+Use specific cluster keywords, exact post counts, and percentages where relevant. Do NOT use markdown, bullet points, or headers — write flowing analytical prose."""
+    result = _call_llm(prompt, max_tokens=500)
     if result:
         import re
         cleaned = re.sub(r'^#{1,4}\s+.*$', '', result, flags=re.MULTILINE).strip()
 def generate_network_summary(stats):
     """Generate a summary of the network analysis."""
+    num_nodes = stats.get('num_nodes', 0)
+    num_edges = stats.get('num_edges', 0)
+    num_components = stats.get('num_components', 0)
+    num_communities = stats.get('num_communities', 'unknown')
+    density = stats.get('density', 'unknown')
+    largest = stats.get('largest_component_size', 'unknown')
+    prompt = f"""Write a detailed plain-text analysis (NO markdown, NO headers, NO #) of this author interaction network built from a Reddit dataset (8,799 posts, 10 subreddits collected for their political associations, July 2024 to February 2025).
+The network is built from three signal types: crosspost links (weight 3.0), shared URL co-sharing (weight 2.0), and co-subreddit activity (weight 1.0). The [deleted] meta-author is excluded to prevent false super-connections.
+NETWORK STATS
+Total connected authors (nodes): {num_nodes}
+Total interaction edges: {num_edges}
+Disconnected components: {num_components}
+Communities detected (Louvain algorithm): {num_communities}
+Network density: {density}
+Largest connected component size: {largest}
+Write 5 to 6 sentences covering:
+1. What does {num_components} disconnected components in a {num_nodes}-node network reveal about how fragmented or unified author interaction is across these Reddit communities?
+2. What does the density of {density} tell us about how interconnected authors are in absolute terms? (Density of 1.0 would mean every author interacts with every other; density near 0 means very sparse interaction.)
+3. What do the {num_communities} Louvain communities suggest — are these likely subreddit-aligned communities or do they cross subreddit boundaries?
+4. The largest connected component contains {largest} authors. What does the gap between this and total nodes ({num_nodes}) say about the structure of cross-community interaction?
+5. End with a takeaway: what does this network structure imply about the spread of narratives between politically diverse Reddit communities?
+Use specific numbers throughout. Do NOT use markdown, bullet points, or headers — write flowing analytical prose."""
+    result = _call_llm(prompt, max_tokens=500)
+    if result:
+        import re
+        cleaned = re.sub(r'^#{1,4}\s+.*$', '', result, flags=re.MULTILINE).strip()
+        return cleaned.replace('**', '')
+    return None
+def generate_comparison_summary(sub1, sub2):
+    """Generate an analytical comparison between two subreddits."""
+    sub1_domains = ', '.join([f"{d['domain']} ({d['count']})" for d in sub1['top_domains'][:5]])
+    sub2_domains = ', '.join([f"{d['domain']} ({d['count']})" for d in sub2['top_domains'][:5]])
+    sub1_topics = '; '.join([t['label'] for t in sub1['top_topics'][:3]])
+    sub2_topics = '; '.join([t['label'] for t in sub2['top_topics'][:3]])
+    sub1_top_author = sub1['top_authors'][0] if sub1['top_authors'] else None
+    sub2_top_author = sub2['top_authors'][0] if sub2['top_authors'] else None
+    prompt = f"""Write a detailed plain-text analytical comparison (NO markdown, NO headers, NO #) of two Reddit subreddits from a dataset of 8,799 posts (10 subreddits collected for their political associations, July 2024 to February 2025, covering the 2024 US election and 2025 inauguration).
+SUBREDDIT 1: r/{sub1['name']}
+- Total posts: {sub1['total_posts']}
+- Unique authors: {sub1['unique_authors']}
+- Average upvotes per post: {sub1['avg_score']}
+- Average comments per post: {sub1['avg_comments']}
+- Top news sources shared: {sub1_domains}
+- Top discussion topics: {sub1_topics}
+- Most active author: {f"u/{sub1_top_author['author']} ({sub1_top_author['count']} posts)" if sub1_top_author else 'N/A'}
+SUBREDDIT 2: r/{sub2['name']}
+- Total posts: {sub2['total_posts']}
+- Unique authors: {sub2['unique_authors']}
+- Average upvotes per post: {sub2['avg_score']}
+- Average comments per post: {sub2['avg_comments']}
+- Top news sources shared: {sub2_domains}
+- Top discussion topics: {sub2_topics}
+- Most active author: {f"u/{sub2_top_author['author']} ({sub2_top_author['count']} posts)" if sub2_top_author else 'N/A'}
+Write 4 paragraphs (3-4 sentences each) covering:
+Paragraph 1 — Engagement comparison:
+Compare total post counts, unique authors, average upvotes, and average comments. Which community is more active? Which gets more engagement per post? What does the ratio of authors to posts tell us about whether discussion is concentrated in a few hands or distributed widely?
+Paragraph 2 — Information ecosystem:
+Compare the news sources each community shares. Are they reading the same outlets, or completely different ones? Cite at least 3 specific domains by name. What does this tell us about the information ecosystems each community is plugged into?
+Paragraph 3 — Topical focus:
+Compare the dominant topics in each community. Are they discussing the same events from different angles, or are they focused on entirely different concerns? Reference specific topic keywords from the data.
+Paragraph 4 — The takeaway:
+What's the most striking difference between these two communities? Is there evidence of narrative divergence, echo chambers, or shared concerns? End with a concrete observation that a journalist could use as the seed for a story.
+Use ONLY the data above. Be specific with numbers and names. Do NOT use markdown formatting — write flowing analytical prose."""
+    result = _call_llm(prompt, max_tokens=700)
+    if result:
+        import re
+        cleaned = re.sub(r'^#{1,4}\s+.*$', '', result, flags=re.MULTILINE).strip()
+        return cleaned.replace('**', '')
+    return None
+def generate_embeddings_summary(stats):
+    """Generate a plain-language summary explaining what the embedding visualization shows."""
+    subreddit_list = ', '.join([f"r/{s['name']} ({s['count']})" for s in stats.get('subreddits', [])[:10]])
+    prompt = f"""Write a plain-text 4-paragraph explanation (NO markdown, NO headers, NO #) helping a non-technical reader understand what they are looking at in an interactive embedding visualization.
+CONTEXT
+The visualization shows all 8,799 Reddit posts as dots on a 2D map. Posts that are semantically similar (discuss similar topics in similar ways) are placed near each other. Posts that are different are far apart. The map was created using all-MiniLM-L6-v2 sentence embeddings (384 dimensions per post) reduced to 2D using UMAP. Each post is colored by which subreddit it came from.
+DATASET
+{stats.get('total_posts', 8799)} posts across these 10 subreddits collected for their political associations: {subreddit_list}
+Time period: July 2024 to February 2025 (covering the 2024 US election and 2025 Trump inauguration)
+WRITE 4 PARAGRAPHS
+Paragraph 1 — What you're looking at:
+Explain that this is a "map of meaning" — each dot is a post, and dots near each other talk about similar things. Don't use jargon like "embedding space" or "vector dimensions." Use the metaphor of a city map where similar buildings (posts) cluster into neighborhoods (topics).
+Paragraph 2 — How to read it:
+Explain that distinct clumps of dots are topic clusters that emerged automatically — no one labeled them, the AI just grouped posts that talked about similar things. Mention that the colors show which subreddit each post is from, so you can see whether different communities cluster separately or mix together. Tell the reader to look for: tight clumps (focused topics), sparse areas (unique posts), and surprising overlaps (posts from opposing political subreddits ending up near each other).
+Paragraph 3 — What this reveals about how the communities talk:
+Discuss what an embedding map of these Reddit communities can reveal. For example: posts about Trump's executive orders form one neighborhood, posts about anarchist theory form another, posts about election results form yet another. Communities that share vocabulary will overlap in space, while ideologically distant ones stay apart. The biggest insight from this kind of visualization is finding "bridges" — posts where different political camps unexpectedly land near each other.
+Paragraph 4 — How to use it:
+Tell the reader to use the search bar inside the map to find specific topics (e.g. searching "immigration" highlights all immigration-related posts). Encourage them to zoom into a clump to read individual post titles and see what defines that neighborhood.
+Use plain conversational English a curious newspaper reader would understand. Do NOT use markdown."""
+    result = _call_llm(prompt, max_tokens=700)
     if result:
         import re
         cleaned = re.sub(r'^#{1,4}\s+.*$', '', result, flags=re.MULTILINE).strip()
 def answer_chart_question(question, data_context):
     """Answer a user's follow-up question about a specific chart's data."""
+    prompt = f"""You are analyzing a chart from a Reddit research dashboard (8,799 posts from 10 subreddits collected for their political associations, Jul 2024 - Feb 2025).
 Chart data and context:
 {data_context}

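The time-series summary in backend/services/llm_service.py compares the average of the first half of the periods with the average of the second half, then names the subreddits that contributed most at the peak. The sketch below isolates that arithmetic so it can be checked by hand. The function name `summarize_trend` and the returned dict are hypothetical, and `peak_period` is assumed to be the period with the highest total, since its definition is not visible in this part of the diff.

```python
def summarize_trend(series_data):
    """Aggregate counts per period and compare first-half vs second-half averages."""
    period_totals = {}
    sub_period_totals = {}  # {subreddit: {period: count}}
    for item in series_data:
        date = item.get('date', '')
        count = item.get('count', item.get('avg', 0))
        sub = item.get('subreddit', '')
        period_totals[date] = period_totals.get(date, 0) + count
        per_sub = sub_period_totals.setdefault(sub, {})
        per_sub[date] = per_sub.get(date, 0) + count

    periods = sorted(period_totals)
    if not periods:
        return None

    # Split the sorted periods in half and average each side.
    mid = len(periods) // 2
    first_half_avg = sum(period_totals[p] for p in periods[:mid]) / max(mid, 1)
    second_half_avg = sum(period_totals[p] for p in periods[mid:]) / max(len(periods) - mid, 1)
    # The max(..., 1) guard avoids division by zero, as in the commit.
    pct_change = (second_half_avg - first_half_avg) / max(first_half_avg, 1) * 100

    # Assumption: the peak is the period with the largest total.
    peak_period = max(period_totals, key=period_totals.get)
    peak_contributors = sorted(
        ((s, totals.get(peak_period, 0)) for s, totals in sub_period_totals.items()),
        key=lambda x: -x[1],
    )[:3]

    # Same thresholds as the rule-based fallback summary.
    if pct_change > 30:
        direction = "rose sharply"
    elif pct_change < -30:
        direction = "declined"
    else:
        direction = "stayed relatively stable"

    return {
        'pct_change': pct_change,
        'direction': direction,
        'peak_period': peak_period,
        'peak_contributors': peak_contributors,
    }


if __name__ == "__main__":
    sample = [
        {'date': '2024-11', 'subreddit': 'politics', 'count': 10},
        {'date': '2024-12', 'subreddit': 'politics', 'count': 10},
        {'date': '2025-01', 'subreddit': 'politics', 'count': 30},
        {'date': '2025-01', 'subreddit': 'Conservative', 'count': 20},
        {'date': '2025-02', 'subreddit': 'Conservative', 'count': 30},
    ]
    print(summarize_trend(sample))
```

Note that the `max(first_half_avg, 1)` guard also clamps any first-half average below 1, so for metrics that can be fractional (the code falls back to an `avg` field) the reported percentage is computed against 1 rather than the true baseline.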
frontend/index.html CHANGED

@@ -4,7 +4,7 @@
     <meta charset="UTF-8" />
     <link rel="icon" type="image/svg+xml" href="/favicon.svg" />
     <meta name="viewport" content="width=device-width, initial-scale=1.0" />
-    <title>TheScope — Political Discourse Analysis</title>
     <link rel="preconnect" href="https://fonts.googleapis.com" />
     <link rel="preconnect" href="https://fonts.gstatic.com" crossorigin />
     <link href="https://fonts.googleapis.com/css2?family=DM+Serif+Display&family=DM+Sans:ital,opsz,wght@0,9..40,300;0,9..40,400;0,9..40,500;0,9..40,600;0,9..40,700;1,9..40,400&family=JetBrains+Mono:wght@400;500&display=swap" rel="stylesheet" />

     <meta charset="UTF-8" />
     <link rel="icon" type="image/svg+xml" href="/favicon.svg" />
     <meta name="viewport" content="width=device-width, initial-scale=1.0" />
+    <title>TheScope — Tracing Narratives Across Reddit Communities</title>
     <link rel="preconnect" href="https://fonts.googleapis.com" />
     <link rel="preconnect" href="https://fonts.gstatic.com" crossorigin />
     <link href="https://fonts.googleapis.com/css2?family=DM+Serif+Display&family=DM+Sans:ital,opsz,wght@0,9..40,300;0,9..40,400;0,9..40,500;0,9..40,600;0,9..40,700;1,9..40,400&family=JetBrains+Mono:wght@400;500&display=swap" rel="stylesheet" />

frontend/src/App.jsx CHANGED

@@ -7,6 +7,7 @@ import Network from './pages/Network'
 import Clusters from './pages/Clusters'
 import Search from './pages/Search'
 import Embeddings from './pages/Embeddings'
 function DashboardRoutes() {
   return (
@@ -16,6 +17,7 @@ function DashboardRoutes() {
         <Route path="/timeseries" element={<TimeSeries />} />
         <Route path="/network" element={<Network />} />
         <Route path="/clusters" element={<Clusters />} />
         <Route path="/search" element={<Search />} />
         <Route path="/embeddings" element={<Embeddings />} />
       </Routes>

 import Clusters from './pages/Clusters'
 import Search from './pages/Search'
 import Embeddings from './pages/Embeddings'
+import Compare from './pages/Compare'
 function DashboardRoutes() {
   return (
         <Route path="/timeseries" element={<TimeSeries />} />
         <Route path="/network" element={<Network />} />
         <Route path="/clusters" element={<Clusters />} />
+        <Route path="/compare" element={<Compare />} />
         <Route path="/search" element={<Search />} />
         <Route path="/embeddings" element={<Embeddings />} />
       </Routes>

frontend/src/components/common/AISummary.jsx CHANGED

@@ -8,13 +8,20 @@ export default function AISummary({ text }) {
     .replace(/^\s*[-*]\s/gm, '• ')    // convert list markers to bullets
     .trim()
   return (
-    <div className="mt-3 bg-indigo-50/70 backdrop-blur-sm border border-indigo-100/50 rounded-xl px-4 py-3">
-      <div className="flex items-start gap-2">
-        <span className="text-xs bg-indigo-200/80 text-indigo-800 px-1.5 py-0.5 rounded-md font-medium shrink-0 mt-0.5">
           AI Summary
         </span>
-        <div className="text-sm text-gray-700 leading-relaxed whitespace-pre-line">{cleanText}</div>
       </div>
     </div>
   )

     .replace(/^\s*[-*]\s/gm, '• ')    // convert list markers to bullets
     .trim()
+  // Split into paragraphs by blank lines
+  const paragraphs = cleanText.split(/\n\s*\n/).filter(p => p.trim())
   return (
+    <div className="mt-4 bg-indigo-50/70 backdrop-blur-sm border border-indigo-100/50 rounded-xl px-5 py-4">
+      <div className="flex items-center gap-2 mb-3">
+        <span className="text-xs bg-indigo-200/80 text-indigo-800 px-2 py-0.5 rounded-md font-semibold tracking-wide">
           AI Summary
         </span>
+      </div>
+      <div className="space-y-3">
+        {paragraphs.map((para, i) => (
+          <p key={i} className="text-sm text-gray-700 leading-relaxed">{para}</p>
+        ))}
       </div>
     </div>
   )

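The AISummary.jsx change above renders each blank-line-separated block as its own paragraph, which matches the backend prompts that now ask the model to separate paragraphs with a blank line. The same split, rendered in Python purely for illustration (the component itself is JavaScript, and `split_paragraphs` is a hypothetical name), might look like this:

```python
import re


def split_paragraphs(text: str) -> list[str]:
    """Split text on blank lines and drop empty chunks.

    Mirrors the component's split(/\\n\\s*\\n/) followed by a filter on
    trimmed content; this is an illustrative port, not project code.
    """
    return [p for p in re.split(r'\n\s*\n', text) if p.strip()]


if __name__ == "__main__":
    summary = "First paragraph.\n\nSecond paragraph.\n \nThird paragraph."
    for para in split_paragraphs(summary):
        print(repr(para))
```

Because `\s*` also matches spaces and extra newlines, lines containing only whitespace count as paragraph breaks, and several consecutive blank lines collapse into a single break.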
frontend/src/components/layout/Sidebar.jsx CHANGED

@@ -14,7 +14,7 @@ export default function Sidebar() {
     <aside className="w-64 min-h-screen bg-gray-900 text-white flex flex-col">
       <div className="p-6 border-b border-gray-700">
         <h1 className="text-lg font-bold tracking-tight">SimPPL Dashboard</h1>
-        <p className="text-xs text-gray-400 mt-1">Political Discourse Analysis</p>
       </div>
       <nav className="flex-1 py-4">

     <aside className="w-64 min-h-screen bg-gray-900 text-white flex flex-col">
       <div className="p-6 border-b border-gray-700">
         <h1 className="text-lg font-bold tracking-tight">SimPPL Dashboard</h1>
+        <p className="text-xs text-gray-400 mt-1">Tracing Narratives Across Reddit</p>
       </div>
       <nav className="flex-1 py-4">

frontend/src/components/layout/TopNavbar.jsx CHANGED

@@ -6,6 +6,7 @@ const navItems = [
   { path: '/dashboard/timeseries', label: 'Time Series' },
   { path: '/dashboard/network', label: 'Network' },
   { path: '/dashboard/clusters', label: 'Topics' },
   { path: '/dashboard/search', label: 'SearchAI' },
   { path: '/dashboard/embeddings', label: 'Embeddings' },
 ]
@@ -25,7 +26,7 @@ export default function TopNavbar({ darkMode, setDarkMode }) {
               TheScope
             </h1>
             <p className={`text-[10px] leading-tight ${darkMode ? 'text-gray-500' : 'text-gray-400'}`}>
-              Political Discourse Analysis
             </p>
           </div>
         </Link>

   { path: '/dashboard/timeseries', label: 'Time Series' },
   { path: '/dashboard/network', label: 'Network' },
   { path: '/dashboard/clusters', label: 'Topics' },
+  { path: '/dashboard/compare', label: 'Compare' },
   { path: '/dashboard/search', label: 'SearchAI' },
   { path: '/dashboard/embeddings', label: 'Embeddings' },
 ]
               TheScope
             </h1>
             <p className={`text-[10px] leading-tight ${darkMode ? 'text-gray-500' : 'text-gray-400'}`}>
+              Tracing Narratives Across Reddit
             </p>
           </div>
         </Link>

frontend/src/pages/Clusters.jsx CHANGED

@@ -197,6 +197,12 @@ export default function Clusters() {
                         </a>
                       ))}
                     </div>
                   </div>
                 </div>

                         </a>
                       ))}
                     </div>
+                    {(cluster.top_posts || []).some(p => p.subreddit === 'worldpolitics') && (
+                      <p className="text-[10px] text-amber-700/80 mt-2 italic leading-snug">
+                        Note: this cluster includes posts from r/worldpolitics, a largely unmoderated
+                        community that has drifted away from political discussion. Posts are shown as-is.
+                      </p>
+                    )}
                   </div>
                 </div>

frontend/src/pages/Compare.jsx ADDED

@@ -0,0 +1,235 @@

+import { useState, useEffect } from 'react'
+import { getCompareSubreddits } from '../services/api'
+import LoadingSpinner from '../components/common/LoadingSpinner'
+import AISummary from '../components/common/AISummary'
+import {
+  LineChart, Line, XAxis, YAxis, Tooltip, ResponsiveContainer, Legend
+} from 'recharts'
+const SUBREDDITS = [
+  'Anarchism', 'socialism', 'democrats', 'Liberal', 'politics',
+  'PoliticalDiscussion', 'neoliberal', 'worldpolitics', 'Conservative', 'Republican'
+]
+const SUB_COLORS = {
+  Anarchism: '#dc2626', socialism: '#ef4444', democrats: '#3b82f6',
+  Liberal: '#60a5fa', politics: '#8b5cf6', PoliticalDiscussion: '#a78bfa',
+  neoliberal: '#6366f1', worldpolitics: '#14b8a6', Conservative: '#f97316', Republican: '#ea580c'
+}
+export default function Compare() {
+  const [sub1, setSub1] = useState('Conservative')
+  const [sub2, setSub2] = useState('socialism')
+  const [data, setData] = useState(null)
+  const [loading, setLoading] = useState(true)
+  const [error, setError] = useState(null)
+  useEffect(() => {
+    if (sub1 === sub2) return
+    setLoading(true)
+    setError(null)
+    getCompareSubreddits(sub1, sub2)
+      .then(res => setData(res.data))
+      .catch(err => setError(err.response?.data?.message || 'Failed to load comparison'))
+      .finally(() => setLoading(false))
+  }, [sub1, sub2])
+  // Merge timeseries for chart
+  const mergedTimeseries = (() => {
+    if (!data) return []
+    const dateMap = {}
+    for (const item of data.sub1.timeseries || []) {
+      if (!dateMap[item.date]) dateMap[item.date] = { date: item.date }
+      dateMap[item.date][data.sub1.name] = item.count
+    }
+    for (const item of data.sub2.timeseries || []) {
+      if (!dateMap[item.date]) dateMap[item.date] = { date: item.date }
+      dateMap[item.date][data.sub2.name] = item.count
+    }
+    return Object.values(dateMap).sort((a, b) => a.date.localeCompare(b.date))
+  })()
+  return (
+    <div>
+      <h1 className="text-3xl font-bold text-gray-900 mb-2">Compare Communities</h1>
+      <p className="text-gray-500 mb-6 max-w-3xl">
+        Side-by-side comparison of two subreddits. See how each community discusses different topics, shares
+        different news sources, and engages at different rates. Pick any two subreddits below.
+      </p>
+      {/* Picker */}
+      <div className="bg-white/70 backdrop-blur-sm rounded-xl border border-gray-200/50 shadow-sm p-5 mb-6">
+        <div className="grid grid-cols-1 md:grid-cols-[1fr_auto_1fr] items-center gap-4">
+          <div>
+            <label className="block text-xs text-gray-500 font-medium mb-1.5">Community A</label>
+            <select value={sub1} onChange={e => setSub1(e.target.value)}
+              className="w-full px-4 py-2.5 border border-gray-300 rounded-lg text-sm font-medium focus:outline-none focus:ring-2 focus:ring-amber-400 focus:border-transparent bg-white">
+              {SUBREDDITS.map(s => (
+                <option key={s} value={s}>r/{s}</option>
+              ))}
+            </select>
+          </div>
+          <div className="flex items-center justify-center pt-5">
+            <div className="px-3 py-1 bg-gray-100 rounded-full text-xs font-semibold text-gray-500 uppercase tracking-wider">
+              vs
+            </div>
+          </div>
+          <div>
+            <label className="block text-xs text-gray-500 font-medium mb-1.5">Community B</label>
+            <select value={sub2} onChange={e => setSub2(e.target.value)}
+              className="w-full px-4 py-2.5 border border-gray-300 rounded-lg text-sm font-medium focus:outline-none focus:ring-2 focus:ring-amber-400 focus:border-transparent bg-white">
+              {SUBREDDITS.map(s => (
+                <option key={s} value={s}>r/{s}</option>
+              ))}
+            </select>
+          </div>
+        </div>
+        {sub1 === sub2 && (
+          <p className="text-xs text-amber-600 mt-3">Please select two different subreddits to compare.</p>
+        )}
+      </div>
+      {error && (
+        <div className="bg-red-50 border border-red-200 rounded-xl p-4 mb-6">
+          <p className="text-sm text-red-700">{error}</p>
+        </div>
+      )}
+      {loading ? <LoadingSpinner message="Loading comparison..." /> : data && (
+        <>
+          {/* Side-by-side stats grid */}
+          <div className="grid grid-cols-1 md:grid-cols-2 gap-5 mb-6">
+            {[data.sub1, data.sub2].map((sub, idx) => (
+              <div key={sub.name} className="bg-white/70 backdrop-blur-sm rounded-xl border border-gray-200/50 shadow-sm overflow-hidden">
+                {/* Header */}
+                <div className="p-5 border-b border-gray-200/50"
+                  style={{
+                    background: `linear-gradient(135deg, ${SUB_COLORS[sub.name]}10 0%, transparent 100%)`
+                  }}>
+                  <div className="flex items-center gap-2 mb-1">
+                    <div className="w-3 h-3 rounded-full" style={{ backgroundColor: SUB_COLORS[sub.name] }} />
+                    <h2 className="text-xl font-bold text-gray-900">r/{sub.name}</h2>
+                  </div>
+                  <p className="text-xs text-gray-500">{sub.unique_authors} unique authors</p>
+                </div>
+                {/* Metrics */}
+                <div className="grid grid-cols-3 border-b border-gray-200/50">
+                  <div className="p-4 border-r border-gray-200/50">
+                    <p className="text-xs text-gray-500 uppercase tracking-wider mb-1">Posts</p>
+                    <p className="text-2xl font-bold text-gray-900">{sub.total_posts.toLocaleString()}</p>
+                  </div>
+                  <div className="p-4 border-r border-gray-200/50">
+                    <p className="text-xs text-gray-500 uppercase tracking-wider mb-1">Avg Score</p>
+                    <p className="text-2xl font-bold text-gray-900">{sub.avg_score.toLocaleString()}</p>
+                  </div>
+                  <div className="p-4">
+                    <p className="text-xs text-gray-500 uppercase tracking-wider mb-1">Avg Comments</p>
+                    <p className="text-2xl font-bold text-gray-900">{sub.avg_comments.toLocaleString()}</p>
+                  </div>
+                </div>
+                {/* Top news sources */}
+                <div className="p-5 border-b border-gray-200/50">
+                  <h3 className="text-xs text-gray-500 uppercase tracking-wider font-semibold mb-3">Top News Sources</h3>
+                  <div className="space-y-1.5">
+                    {sub.top_domains.slice(0, 5).map(d => (
+                      <div key={d.domain} className="flex items-center justify-between text-sm">
+                        <span className="text-gray-700 truncate">{d.domain}</span>
+                        <span className="text-xs text-gray-400 ml-2 shrink-0">{d.count}</span>
+                      </div>
+                    ))}
+                    {sub.top_domains.length === 0 && (
+                      <p className="text-xs text-gray-400">No external links shared</p>
+                    )}
+                  </div>
+                </div>
+                {/* Top topics */}
+                <div className="p-5 border-b border-gray-200/50">
+                  <h3 className="text-xs text-gray-500 uppercase tracking-wider font-semibold mb-3">Top Discussion Topics</h3>
+                  <div className="space-y-1.5">
+                    {sub.top_topics.slice(0, 5).map((t, i) => (
+                      <div key={i} className="flex items-center justify-between text-sm">
+                        <span className="text-gray-700 text-xs truncate" title={t.label}>{t.label}</span>
+                        <span className="text-xs text-gray-400 ml-2 shrink-0">{t.count}</span>
+                      </div>
+                    ))}
+                    {sub.top_topics.length === 0 && (
+                      <p className="text-xs text-gray-400">No topic data available</p>
+                    )}
+                  </div>
+                </div>
+                {/* Top authors */}
+                <div className="p-5 border-b border-gray-200/50">
+                  <h3 className="text-xs text-gray-500 uppercase tracking-wider font-semibold mb-3">Most Active Authors</h3>
+                  <div className="space-y-1.5">
+                    {sub.top_authors.slice(0, 5).map(a => (
+                      <div key={a.author} className="flex items-center justify-between text-sm">
+                        <a href={`https://reddit.com/u/${a.author}`} target="_blank" rel="noopener noreferrer"
+                          className="text-indigo-600 hover:text-indigo-800 hover:underline truncate">
+                          u/{a.author}
+                        </a>
+                        <span className="text-xs text-gray-400 ml-2 shrink-0">{a.count} posts</span>
+                      </div>
+                    ))}
+                  </div>
+                </div>
+                {/* Top posts */}
+                <div className="p-5">
+                  <h3 className="text-xs text-gray-500 uppercase tracking-wider font-semibold mb-3">Top Posts</h3>
+                  {sub.top_posts.length > 0 ? (
+                    <div className="space-y-2">
+                      {sub.top_posts.slice(0, 3).map(p => (
+                        <a key={p.id} href={p.permalink ? `https://reddit.com${p.permalink}` : '#'}
+                          target="_blank" rel="noopener noreferrer"
+                          className="block text-xs text-gray-600 hover:text-indigo-600 line-clamp-2">
+                          {p.title} <span className="text-gray-400">({p.score.toLocaleString()} upvotes)</span>
+                        </a>
+                      ))}
+                    </div>
+                  ) : (
+                    <p className="text-xs text-gray-400">No posts to display.</p>
+                  )}
+                  {sub.name === 'worldpolitics' && (
+                    <p className="text-[10px] text-amber-700/80 mt-3 italic leading-snug">
+                      Note: r/worldpolitics is largely unmoderated and has drifted away from political discussion.
+                      Posts are shown as-is from the source data.
+                    </p>
+                  )}
+                </div>
+              </div>
+            ))}
+          </div>
+          {/* Time series chart */}
+          <div className="bg-white/70 backdrop-blur-sm rounded-xl border border-gray-200/50 shadow-sm p-6 mb-6">
+            <h2 className="text-lg font-semibold text-gray-900 mb-1">Activity Over Time</h2>
+            <p className="text-sm text-gray-500 mb-4">
+              Weekly post volume for both communities side by side
+            </p>
+            <ResponsiveContainer width="100%" height={350}>
+              <LineChart data={mergedTimeseries}>
+                <XAxis dataKey="date" tick={{ fontSize: 10 }} interval="preserveStartEnd" />
+                <YAxis tick={{ fontSize: 11 }} />
+                <Tooltip />
+                <Legend wrapperStyle={{ fontSize: 12 }} />
+                <Line type="monotone" dataKey={data.sub1.name} stroke={SUB_COLORS[data.sub1.name]}
+                  strokeWidth={2} dot={false} connectNulls />
+                <Line type="monotone" dataKey={data.sub2.name} stroke={SUB_COLORS[data.sub2.name]}
+                  strokeWidth={2} dot={false} connectNulls />
+              </LineChart>
+            </ResponsiveContainer>
+          </div>
+          {/* AI Summary */}
+          <AISummary text={data.summary} />
+        </>
+      )}
+    </div>
+  )
+}

frontend/src/pages/Embeddings.jsx CHANGED

@@ -1,38 +1,50 @@
 export default function Embeddings() {
   return (
-    <div className="flex flex-col h-[calc(100vh-6rem)]">
-      <div className="flex items-center justify-between mb-3">
         <div>
           <h1 className="text-3xl font-bold text-gray-900">Embedding Explorer</h1>
-          <p className="text-gray-500 mt-1">
-            Posts near each other discuss similar themes. Distinct clusters represent different discourse communities.
-            Use this to discover unexpected topic overlaps between subreddits.
           </p>
         </div>
-        <div className="flex items-center gap-3 text-xs text-gray-500 bg-gray-100 px-3 py-2 rounded-lg">
           <span><strong>Scroll</strong> to zoom</span>
           <span>·</span>
           <span><strong>Drag</strong> to pan</span>
           <span>·</span>
           <span><strong>Hover</strong> for details</span>
-          <span>·</span>
-          <span>Use the <strong>search bar</strong> inside the map</span>
         </div>
       </div>
-      <div className="flex-1 bg-white/70 backdrop-blur-sm rounded-xl border border-gray-200/50 shadow-sm overflow-hidden border border-gray-200">
         <iframe
           src="/static/datamapplot.html"
           title="Embedding Visualization"
           className="w-full border-0"
-          style={{ height: 'calc(100vh - 12rem)', minHeight: 600 }}
         />
       </div>
-      <p className="text-xs text-gray-400 mt-2">
         UMAP 2D projection of all-MiniLM-L6-v2 sentence embeddings (384-dim → 2D).
         Colored by subreddit. Built with Datamapplot.
       </p>
     </div>
   )
 }

+import { useState, useEffect } from 'react'
+import { getEmbeddingsSummary } from '../services/api'
+import AISummary from '../components/common/AISummary'
 export default function Embeddings() {
+  const [summary, setSummary] = useState('')
+  useEffect(() => {
+    getEmbeddingsSummary()
+      .then(res => setSummary(res.data.summary || ''))
+      .catch(err => console.error(err))
+  }, [])
   return (
+    <div>
+      <div className="flex items-start justify-between mb-3 gap-4">
         <div>
           <h1 className="text-3xl font-bold text-gray-900">Embedding Explorer</h1>
+          <p className="text-gray-500 mt-1 max-w-2xl">
+            A visual map of all 8,799 posts. Posts that discuss similar things are placed near each other —
+            tight clumps reveal topic neighborhoods that emerged automatically from the data.
           </p>
         </div>
+        <div className="flex items-center gap-3 text-xs text-gray-500 bg-gray-100 px-3 py-2 rounded-lg shrink-0">
           <span><strong>Scroll</strong> to zoom</span>
           <span>·</span>
           <span><strong>Drag</strong> to pan</span>
           <span>·</span>
           <span><strong>Hover</strong> for details</span>
         </div>
       </div>
+      <div className="bg-white/70 backdrop-blur-sm rounded-xl border border-gray-200/50 shadow-sm overflow-hidden mb-4">
         <iframe
           src="/static/datamapplot.html"
           title="Embedding Visualization"
           className="w-full border-0"
+          style={{ height: '70vh', minHeight: 500 }}
         />
       </div>
+      <p className="text-xs text-gray-400 mb-4">
         UMAP 2D projection of all-MiniLM-L6-v2 sentence embeddings (384-dim → 2D).
         Colored by subreddit. Built with Datamapplot.
       </p>
+      <AISummary text={summary} />
     </div>
   )
 }

frontend/src/pages/Landing.jsx CHANGED

@@ -149,7 +149,7 @@ export default function Landing() {
             <div className="text-amber-400 text-xs font-medium tracking-[0.2em] uppercase mb-4">What you can explore</div>
             <h2 className="text-4xl md:text-5xl font-bold leading-tight"
               style={{ fontFamily: "'DM Serif Display', Georgia, serif" }}>
-              Six lenses on political discourse
             </h2>
             <p className="text-gray-500 mt-4 text-base">
               Each section is built around a specific question — from how communities discussed events over time, to who bridges them, to how topics evolved.
@@ -228,7 +228,7 @@ export default function Landing() {
             Ready to investigate?
           </h3>
           <p className="text-gray-400 mb-8">
-            Step into the dashboard and start exploring political narratives.
           </p>
           <button onClick={() => navigate('/dashboard')}
             className="px-10 py-4 bg-gradient-to-r from-amber-500 to-orange-600 text-white font-semibold rounded-full shadow-2xl shadow-amber-500/30 hover:shadow-amber-500/50 hover:scale-105 transition-all text-lg">

             <div className="text-amber-400 text-xs font-medium tracking-[0.2em] uppercase mb-4">What you can explore</div>
             <h2 className="text-4xl md:text-5xl font-bold leading-tight"
               style={{ fontFamily: "'DM Serif Display', Georgia, serif" }}>
+              Six lenses on a single Reddit dataset
             </h2>
             <p className="text-gray-500 mt-4 text-base">
               Each section is built around a specific question — from how communities discussed events over time, to who bridges them, to how topics evolved.
             Ready to investigate?
           </h3>
           <p className="text-gray-400 mb-8">
+            Step into the dashboard and start tracing how narratives moved through Reddit.
           </p>
           <button onClick={() => navigate('/dashboard')}
             className="px-10 py-4 bg-gradient-to-r from-amber-500 to-orange-600 text-white font-semibold rounded-full shadow-2xl shadow-amber-500/30 hover:shadow-amber-500/50 hover:scale-105 transition-all text-lg">

frontend/src/pages/Network.jsx CHANGED

@@ -144,15 +144,16 @@ export default function Network() {
   return (
     <div>
       <h1 className="text-3xl font-bold text-gray-900 mb-2">Network Analysis</h1>
-      <p className="text-gray-500 mb-2">
-        This network maps author interactions through crossposts, shared URLs, and co-subreddit activity.
-        Node size reflects PageRank (influence), color indicates Louvain community detection.
-        Removing a highly-connected node tests whether one account holds the network together.
-      </p>
-      <p className="text-xs text-gray-400 mb-6">
-        3 edge types: crosspost links (base weight 3.0), shared URL co-sharing (base weight 2.0), co-subreddit activity (base weight 1.0). Weights accumulate for repeated interactions. [deleted] accounts excluded to prevent false super-connectors.
       </p>
       <div className="grid grid-cols-1 lg:grid-cols-3 gap-6">
         {/* Graph */}
         <div className="lg:col-span-2 bg-white/70 backdrop-blur-sm rounded-xl border border-gray-200/50 shadow-sm p-4" ref={containerRef}>

   return (
     <div>
       <h1 className="text-3xl font-bold text-gray-900 mb-2">Network Analysis</h1>
+      <p className="text-gray-500 mb-3 max-w-3xl">
+        Each dot is a Reddit author. Two authors are connected if they interact —
+        by sharing the same news link, posting in the same communities, or one crossposting from the other.
+        Bigger dots are more influential (they sit on more pathways through the network).
+        Colors group authors who tend to cluster together. Click any author to see who they connect to,
+        or remove them to see if the network falls apart without them.
       </p>
+      <p className="text-xs text-gray-400 mb-6 font-medium">How the network is built</p>
       <div className="grid grid-cols-1 lg:grid-cols-3 gap-6">
         {/* Graph */}
         <div className="lg:col-span-2 bg-white/70 backdrop-blur-sm rounded-xl border border-gray-200/50 shadow-sm p-4" ref={containerRef}>

frontend/src/pages/Overview.jsx CHANGED

@@ -72,15 +72,22 @@ export default function Overview() {
   return (
     <div>
-      <div className="mb-8">
         <h1 className="text-3xl font-bold text-gray-900">
-          Political Discourse on Reddit
         </h1>
-        <p className="text-gray-500 mt-2">
-          How do 10 politically diverse Reddit communities process the same political events — and who bridges the divides?
-          This dataset captures {stats.total_posts.toLocaleString()} posts from {stats.total_authors.toLocaleString()} authors
-          across the political spectrum (Jul 2024 — Feb 2025), covering the US election through the first weeks of the new administration.
-          83% of activity is concentrated in Jan-Feb 2025.
         </p>
       </div>
@@ -91,7 +98,7 @@ export default function Overview() {
         </summary>
         <div className="px-6 pb-6 text-gray-600 space-y-4 border-t border-gray-100 pt-5">
           <p className="text-sm leading-relaxed">
-            This dashboard investigates how political narratives spread across Reddit's ecosystem during the 2024 US election
             and 2025 presidential transition. Inspired by SimPPL's Parrot platform, it combines <strong className="text-gray-800">linguistic analysis</strong> (NLP)
             with <strong className="text-gray-800">network analysis</strong> to trace how information flows between communities with different political orientations.
           </p>
@@ -99,10 +106,23 @@ export default function Overview() {
             <div className="bg-white/10 rounded-lg p-4 border border-gray-200/20">
               <h4 className="font-semibold text-gray-900 mb-2 text-sm">Data Source</h4>
               <p className="text-sm text-gray-600 leading-relaxed">
-                8,799 Reddit posts from 10 subreddits spanning the political spectrum: r/Anarchism, r/socialism, r/democrats,
-                r/Liberal, r/politics, r/PoliticalDiscussion, r/neoliberal, r/worldpolitics, r/Conservative, and r/Republican.
-                Each post includes title, body text, author, score, comments, and timestamps (July 23, 2024 — February 18, 2025).
               </p>
             </div>
             <div className="bg-white/10 rounded-lg p-4 border border-gray-200/20">
               <h4 className="font-semibold text-gray-900 mb-2 text-sm">NLP Pipeline</h4>

   return (
     <div>
+      <div className="mb-8 max-w-4xl">
         <h1 className="text-3xl font-bold text-gray-900">
+          Tracing Narratives Across Reddit Communities
         </h1>
+        <p className="text-gray-500 mt-3 leading-relaxed">
+          How do politically distinct Reddit communities process the same events — and what does each one notice
+          that the others miss? This dashboard traces {stats.total_posts.toLocaleString()} posts from {stats.total_authors.toLocaleString()} authors
+          across 10 subreddits between July 2024 and February 2025 — a window that covers Biden dropping out,
+          the November election, and the first month of Trump's second term.
+        </p>
+        <p className="text-gray-500 mt-3 leading-relaxed">
+          The most striking pattern in the data is concentration: 83% of all activity falls into the six weeks
+          after January 20, when daily volume jumped from roughly 13 posts/day to 217 — a 1,500% surge that
+          turned the second half of the dataset into a near-real-time snapshot of the transition itself.
+          The pages below trace what each community noticed, which accounts bridged them, and how the topics
+          shifted week by week.
         </p>
       </div>
         </summary>
         <div className="px-6 pb-6 text-gray-600 space-y-4 border-t border-gray-100 pt-5">
           <p className="text-sm leading-relaxed">
+            This dashboard investigates how narratives spread across a curated set of Reddit communities during the 2024 US election
             and 2025 presidential transition. Inspired by SimPPL's Parrot platform, it combines <strong className="text-gray-800">linguistic analysis</strong> (NLP)
             with <strong className="text-gray-800">network analysis</strong> to trace how information flows between communities with different political orientations.
           </p>
             <div className="bg-white/10 rounded-lg p-4 border border-gray-200/20">
               <h4 className="font-semibold text-gray-900 mb-2 text-sm">Data Source</h4>
               <p className="text-sm text-gray-600 leading-relaxed">
+                8,799 Reddit posts from 10 subreddits collected for their political associations:
+                r/Anarchism, r/socialism, r/democrats, r/Liberal, r/politics, r/PoliticalDiscussion, r/neoliberal,
+                r/Conservative, r/Republican, and r/worldpolitics. Each post includes title, body text, author, score,
+                comments, and timestamps (July 23, 2024 — February 18, 2025).
               </p>
+              <div className="mt-3 bg-amber-50/70 border border-amber-200/70 rounded-lg px-3 py-2.5">
+                <div className="flex items-start gap-2">
+                  <div className="shrink-0 mt-0.5 w-4 h-4 rounded-full bg-amber-200/70 text-amber-800 flex items-center justify-center text-[10px] font-bold">!</div>
+                  <p className="text-xs text-amber-900/90 leading-relaxed">
+                    <strong className="font-semibold">Noted during the dataset audit:</strong> nine of the ten subreddits
+                    contain active political discussion, while r/worldpolitics has drifted into largely unmoderated,
+                    off-topic content (every post is flagged NSFW in the source data). It is kept in all analyses because
+                    removing it would change the network topology and weaken the comparison; individual r/worldpolitics
+                    posts are marked with a contextual note wherever they appear in the UI.
+                  </p>
+                </div>
+              </div>
             </div>
             <div className="bg-white/10 rounded-lg p-4 border border-gray-200/20">
               <h4 className="font-semibold text-gray-900 mb-2 text-sm">NLP Pipeline</h4>

frontend/src/pages/Search.jsx CHANGED

@@ -123,6 +123,12 @@ export default function Search() {
                     <p className="text-xs text-gray-500 font-medium mb-2">
                       Top {msg.results.length} results
                     </p>
                     <div className="space-y-2">
                       {msg.results.slice(0, 10).map((r, j) => (
                         <a key={r.id} href={r.permalink ? `https://reddit.com${r.permalink}` : '#'}
@@ -187,7 +193,7 @@ export default function Search() {
           type="text"
           value={input}
           onChange={e => setInput(e.target.value)}
-          placeholder="Ask anything about political discourse on Reddit..."
           className="flex-1 px-4 py-3 border border-gray-300 rounded-lg text-sm focus:outline-none focus:ring-2 focus:ring-indigo-500 focus:border-transparent"
           disabled={loading}
         />

                     <p className="text-xs text-gray-500 font-medium mb-2">
                       Top {msg.results.length} results
                     </p>
+                    {msg.results.some(r => r.subreddit === 'worldpolitics') && (
+                      <p className="text-[11px] text-amber-700/90 bg-amber-50/60 border border-amber-200/60 rounded px-2.5 py-1.5 mb-3 italic leading-snug">
+                        Some results are from r/worldpolitics, a largely unmoderated community that has
+                        drifted away from political discussion. Posts are shown as-is from the source data.
+                      </p>
+                    )}
                     <div className="space-y-2">
                       {msg.results.slice(0, 10).map((r, j) => (
                         <a key={r.id} href={r.permalink ? `https://reddit.com${r.permalink}` : '#'}
           type="text"
           value={input}
           onChange={e => setInput(e.target.value)}
+          placeholder="Ask anything about Reddit's political communities..."
           className="flex-1 px-4 py-3 border border-gray-300 rounded-lg text-sm focus:outline-none focus:ring-2 focus:ring-indigo-500 focus:border-transparent"
           disabled={loading}
         />

frontend/src/pages/TimeSeries.jsx CHANGED

@@ -104,7 +104,7 @@ export default function TimeSeries() {
     <div>
       <h1 className="text-3xl font-bold text-gray-900 mb-2">Time Series Analysis</h1>
       <p className="text-gray-500 mb-6">
-        Track how political discourse evolves over time. Look for spikes that correlate with real-world events — when volume
         increases for one subreddit but not others, it may indicate community-specific narratives.
       </p>

     <div>
       <h1 className="text-3xl font-bold text-gray-900 mb-2">Time Series Analysis</h1>
       <p className="text-gray-500 mb-6">
+        Track how activity across the 10 communities in the dataset evolves over time. Look for spikes that correlate with real-world events — when volume
         increases for one subreddit but not others, it may indicate community-specific narratives.
       </p>

frontend/src/services/api.js CHANGED

@@ -47,4 +47,12 @@ export const generateSummary = (data) =>
 export const getOverviewStats = () =>
   api.get('/overview/stats')
 export default api

 export const getOverviewStats = () =>
   api.get('/overview/stats')
+// Embeddings summary
+export const getEmbeddingsSummary = () =>
+  api.get('/embeddings/summary')
+// Compare two subreddits
+export const getCompareSubreddits = (sub1, sub2) =>
+  api.get('/compare', { params: { sub1, sub2 } })
 export default api

vedant-prompts.md CHANGED

@@ -168,10 +168,53 @@ The final feature work was adding the narrative content — the methodology sect
 ### Stage 7: Deployment
-With everything stress-tested and working locally, the last step was deploying to a public URL. The first attempt on Render.com failed, leading to a switch to Hugging Face Spaces.
 ## Prompt 16
 **Component**: Deployment configuration (`Dockerfile`)
 **Prompt**: "Configure a Dockerfile for Hugging Face Spaces — Python 3.11, Node.js 22, build the React frontend, run the data pipeline during Docker build, serve with gunicorn on port 7860."
 **Issue**: Initial deployment on Render.com failed with out-of-memory on the 512MB free tier — sentence-transformer model (~90MB) plus embeddings (13MB) plus Flask overhead exceeded the limit.
 **Fix**: Switched to Hugging Face Spaces (16GB RAM on Docker free tier). Restructured Dockerfile to run the full pipeline during Docker build so large pre-computed files don't need to be committed (avoiding GitHub fork LFS restrictions).

 ### Stage 7: Deployment
+With everything stress-tested and working locally, the next step was deploying to a public URL. The first attempt on Render.com failed, leading to a switch to Hugging Face Spaces.
 ## Prompt 16
 **Component**: Deployment configuration (`Dockerfile`)
 **Prompt**: "Configure a Dockerfile for Hugging Face Spaces — Python 3.11, Node.js 22, build the React frontend, run the data pipeline during Docker build, serve with gunicorn on port 7860."
 **Issue**: Initial deployment on Render.com failed with out-of-memory on the 512MB free tier — sentence-transformer model (~90MB) plus embeddings (13MB) plus Flask overhead exceeded the limit.
 **Fix**: Switched to Hugging Face Spaces (16GB RAM on Docker free tier). Restructured Dockerfile to run the full pipeline during Docker build so large pre-computed files don't need to be committed (avoiding GitHub fork LFS restrictions).
+---
+### Stage 8: Compare Communities (post-launch addition)
+After the dashboard was already deployed and working, the most striking insight from playing with the data was how differently each community discussed the same events. The existing pages let you explore one community at a time, but didn't make the contrast obvious. The Compare Communities page was added to fill that gap — letting a user pick any two of the 10 subreddits and see them side by side.
+## Prompt 17
+**Component**: Compare Communities backend endpoint (`backend/routes/overview.py`)
+**Prompt**: "Add a /compare endpoint that takes two subreddit names and returns side-by-side stats for both: total posts, unique authors, average score and comments, top news domains, top authors, top topics from existing cluster assignments, top posts, and weekly time series. Use the existing cluster_assignments table for topics — don't recompute."
+**Issue**: The first version recomputed clusters per subreddit on every request, which was slow and unnecessary since cluster assignments are already pre-computed for k=15 in the SQLite table.
+**Fix**: Joined `posts` with `cluster_assignments` directly in SQL with `WHERE p.subreddit = ? AND c.k = 15` to read pre-computed cluster labels per post, group by label, and order by count. The whole comparison endpoint runs in under 100ms even with the LLM call. Also added validation for invalid subreddit names and same-subreddit comparison (returns 400 with a clear message).
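The join described in this fix can be sketched roughly as follows. This is a minimal illustration, not the code in `backend/routes/overview.py`: the table names, `p.subreddit`, and `c.k = 15` come from the note above, while the column names `posts.id`, `cluster_assignments.post_id`, and `cluster_assignments.label`, the helper name `top_topics`, and the demo rows are assumptions made for the example.

```python
import sqlite3


def top_topics(conn, subreddit, k=15, limit=5):
    """Read pre-computed cluster labels for one subreddit, most frequent first.

    Column names (id, post_id, label) are assumed for this sketch.
    """
    rows = conn.execute(
        """
        SELECT c.label, COUNT(*) AS n
        FROM posts AS p
        JOIN cluster_assignments AS c ON c.post_id = p.id
        WHERE p.subreddit = ? AND c.k = ?
        GROUP BY c.label
        ORDER BY n DESC, c.label ASC
        LIMIT ?
        """,
        (subreddit, k, limit),
    ).fetchall()
    return [{"label": label, "count": n} for label, n in rows]


# Hypothetical in-memory data, only to show the shape of the result.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE posts (id TEXT PRIMARY KEY, subreddit TEXT);
    CREATE TABLE cluster_assignments (post_id TEXT, k INTEGER, label TEXT);
    """
)
conn.executemany(
    "INSERT INTO posts VALUES (?, ?)",
    [("a", "socialism"), ("b", "socialism"), ("c", "socialism"), ("d", "Conservative")],
)
conn.executemany(
    "INSERT INTO cluster_assignments VALUES (?, ?, ?)",
    [
        ("a", 15, "labor, unions"),
        ("b", 15, "labor, unions"),
        ("c", 15, "election, vote"),
        ("d", 15, "election, vote"),
        ("a", 10, "other"),  # a different k, filtered out by the WHERE clause
    ],
)
demo = top_topics(conn, "socialism")
```

Because the labels are read rather than recomputed, the cost per request is a single indexed join and a `GROUP BY`. The parameterized `?` placeholders also keep the subreddit name out of the SQL string.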
+## Prompt 18
+**Component**: Compare Communities LLM analysis prompt (`backend/services/llm_service.py`)
+**Prompt**: "Write an LLM prompt for generating a 4-paragraph analytical comparison of two subreddits. The four paragraphs should cover engagement comparison, information ecosystem (news sources), topical focus, and a journalist-ready takeaway. The output must reference specific numbers and domain names from the data, not be generic."
+**Issue**: The first version of the prompt produced generic output like "These two communities have different perspectives on political issues" — true but useless. It wasn't anchored to the actual data points being passed in.
+**Fix**: Restructured the prompt to require specific numbered claims for each paragraph and explicitly listed every data point the model should cite (top 5 domains by name, top 3 topics by keyword string, exact author counts). Also added the instruction "End with a concrete observation that a journalist could use as the seed for a story" — this dramatically improved the quality of the takeaway sentence by giving the model a clear audience.
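One way to anchor a prompt to the data, as this fix describes, is to spell out every data point in the prompt text itself. The sketch below is an assumption-heavy illustration rather than the prompt in `backend/services/llm_service.py`: the field names mirror the `/compare` payload used by `Compare.jsx`, the quoted journalist instruction is taken from the note above, and the function name, the remaining wording, and the sample values are hypothetical.

```python
def build_compare_prompt(sub1, sub2):
    """Build a comparison prompt that lists each data point the model should cite."""

    def describe(sub):
        # Top 5 domains and top 3 topics by name, as the fix describes.
        domains = ", ".join(
            f"{d['domain']} ({d['count']})" for d in sub["top_domains"][:5]
        ) or "none"
        topics = "; ".join(
            f"{t['label']} ({t['count']})" for t in sub["top_topics"][:3]
        ) or "none"
        return (
            f"r/{sub['name']}: {sub['total_posts']} posts, "
            f"{sub['unique_authors']} unique authors, "
            f"avg score {sub['avg_score']}, avg comments {sub['avg_comments']}\n"
            f"  Top domains: {domains}\n"
            f"  Top topics: {topics}"
        )

    return "\n".join(
        [
            "Compare the two subreddits below in exactly four paragraphs:",
            "1. Engagement comparison, citing the post, author, and score numbers.",
            "2. Information ecosystem, naming the listed domains.",
            "3. Topical focus, quoting the listed topic keywords.",
            "4. Takeaway. End with a concrete observation that a journalist "
            "could use as the seed for a story.",
            "",
            describe(sub1),
            describe(sub2),
        ]
    )


# Made-up values, only to show the resulting prompt shape.
sample_a = {
    "name": "Conservative", "total_posts": 100, "unique_authors": 40,
    "avg_score": 55.2, "avg_comments": 12.1,
    "top_domains": [{"domain": "example.com", "count": 12}],
    "top_topics": [{"label": "election, vote", "count": 30}],
}
sample_b = {
    "name": "socialism", "total_posts": 80, "unique_authors": 35,
    "avg_score": 40.0, "avg_comments": 9.5,
    "top_domains": [],
    "top_topics": [{"label": "labor, unions", "count": 22}],
}
prompt = build_compare_prompt(sample_a, sample_b)
```

Listing the values inline gives the model nothing to paraphrase vaguely: each paragraph instruction points at numbers and names that appear a few lines further down in the same prompt.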
+## Prompt 19
+**Component**: Compare Communities side-by-side React layout (`frontend/src/pages/Compare.jsx`)
+**Prompt**: "Build a React page with two subreddit picker dropdowns separated by a 'VS' pill, then a side-by-side grid showing matching cards for each subreddit (metrics, top news, top topics, top authors, top posts), then an overlapping line chart of both communities' weekly post volume on the same axes, then an AI summary at the bottom."
+**Issue**: The first layout used `grid-cols-2` for the side-by-side, but the cards inside had wildly different heights because some subreddits have more news domains than others. The result was uneven, ragged columns that looked broken.
+**Fix**: Changed each side into a single bordered card with internal sections separated by `border-b`, so the visual structure stays uniform even when content lengths differ. Added a gradient header colored by the subreddit's signature color so users can immediately tell which side is which without reading the label. Also added click-through Reddit links on every author and post.
+## Prompt 20
+**Component**: Stretching AI summaries to be substantive (`backend/services/llm_service.py`)
+**Prompt**: "My AI summaries on the time series, network, and overview pages are only 2-3 sentences with max_tokens=200. They're too short to actually help a non-technical reader understand the chart. Make them 4-6 sentences with proper context, specific numbers, and a takeaway."
+**Issue**: Just bumping max_tokens didn't help — the model still wrote short responses because the prompts said "2-3 sentences." The instructions in the prompt itself were the bottleneck.
+**Fix**: Rewrote each summary prompt to specify "5 to 6 sentences" and gave the model an explicit structure (sentence 1 = trend description, sentence 2 = peak explanation, sentence 3 = comparison, sentence 4 = inflection point, sentences 5-6 = takeaway). Also passed richer data (first-half vs second-half averages, peak contributors, percentage change) so the model had real material to work with. The summaries went from generic to specific and actually useful.
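+The richer inputs can be derived from the weekly series before the prompt is built. The sketch below is an illustration under assumptions: `trend_stats`, `build_summary_prompt`, and the sample series are hypothetical names and values, not the project's actual code.

```python
def trend_stats(weekly_counts):
    """Compute first-half vs second-half averages, percentage change, and the peak week."""
    if len(weekly_counts) < 2:
        raise ValueError("need at least two weeks of data")
    mid = len(weekly_counts) // 2
    first, second = weekly_counts[:mid], weekly_counts[mid:]
    first_avg = sum(first) / len(first)
    second_avg = sum(second) / len(second)
    # Percentage change is undefined when the first half averages zero.
    pct_change = (second_avg - first_avg) / first_avg * 100 if first_avg else None
    peak_index = max(range(len(weekly_counts)), key=weekly_counts.__getitem__)
    return {
        "first_avg": first_avg,
        "second_avg": second_avg,
        "pct_change": pct_change,
        "peak_index": peak_index,
        "peak_value": weekly_counts[peak_index],
    }


def build_summary_prompt(stats):
    """Spell out the sentence-by-sentence structure instead of relying on max_tokens."""
    return (
        "Write 5 to 6 sentences for a non-technical reader.\n"
        "Sentence 1: describe the overall trend.\n"
        "Sentence 2: explain the peak.\n"
        "Sentence 3: compare the first and second halves.\n"
        "Sentence 4: identify the inflection point.\n"
        "Sentences 5-6: give the takeaway.\n"
        f"Data: first-half average {stats['first_avg']:.1f}, "
        f"second-half average {stats['second_avg']:.1f}, "
        f"change {stats['pct_change']}, "
        f"peak of {stats['peak_value']} posts in week index {stats['peak_index']}."
    )


if __name__ == "__main__":
    # Made-up weekly post counts, for illustration only.
    print(build_summary_prompt(trend_stats([10, 20, 30, 60])))
```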
+---
+### Stage 9: r/worldpolitics & project framing
+While clicking through the dashboard, I noticed that r/worldpolitics is a 100% NSFW-flagged community whose top posts are mostly explicit images and off-topic content. The subreddit's name implies political discussion, but the data shows the opposite: it is a textbook case of an unmoderated community drifting away from its original purpose. This raised two questions: how should the dashboard display this content, and is "Political Discourse Analysis Dashboard" still the right framing for the project?
+## Prompt 21
+**Component**: r/worldpolitics content handling (`backend/routes/overview.py`, `clusters.py`, `search.py`, `frontend/src/pages/Compare.jsx`, `Clusters.jsx`, `Search.jsx`, `README.md`)
+**Prompt**: "r/worldpolitics has NSFW posts in its top results across the dashboard. Should I filter `over_18 = 1` from displayed posts everywhere?"
+**Issue**: The first answer was to add blanket `AND over_18 = 0` filters across all four query paths and slap warning messages where posts disappeared. But hiding data from a research dashboard is editorializing — SimPPL's rubric explicitly mentions studying "actors and networks promoting... harassment" and the contrast between 9 moderated and 1 unmoderated political community is itself a finding worth surfacing, not something to censor.
+**Fix**: Reverted every backend filter. Added a single contextual one-line note that appears only where r/worldpolitics content actually surfaces (Compare top posts, Clusters expanded panel, Search results), explaining that the community is unmoderated and posts are shown as-is. Documented the decision in the "A Note on r/worldpolitics" section of the README so it is a defensible engineering choice in the interview rather than silent filtering. The guiding principle: show the raw data and frame the lack of moderation as an analytical observation.