title: ArXiv Concept Tracker
emoji: 📚
colorFrom: blue
colorTo: indigo
sdk: gradio
sdk_version: 5.16.0
app_file: app.py
pinned: false
hf_oauth: true

ArXiv Concept Tracker

Track how research concepts evolve over time using AI-powered semantic embeddings and Kalman filtering.

Features

  • πŸ” Search ArXiv papers by keyword with optional year filtering
  • πŸ“Š Track concept evolution through time windows
  • 🧠 Semantic embeddings using sentence-transformers (MiniLM-L6-v2)
  • πŸ“ˆ Interactive timeline visualization with Gradio
  • πŸ—ΊοΈ t-SNE visualization showing concept trajectory through embedding space
  • 🎯 Kalman filter constraints prevent unrealistic concept jumps
  • πŸ’Ύ Local embeddings cache - no API costs
  • πŸ“₯ Export results to JSON or CSV
  • 🌐 Works locally and on HuggingFace Spaces

Quick Start

Local Installation

  1. Clone the repository:
git clone <your-repo-url>
cd concept_tracker
  2. Create and activate a virtual environment:
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
  3. Install dependencies:
pip install -r requirements.txt

This will install:

  • Gradio (web interface)
  • sentence-transformers (embeddings)
  • ArXiv API client
  • NumPy, scikit-learn (computations)
  • Pytest (testing)

Note: First run will download the embedding model (~400MB) automatically.

Run Locally

python app.py

This will:

  • Start a local server at http://127.0.0.1:7860
  • Open the interface in your browser
  • Generate embeddings cache at cache/embeddings/

Get a Public URL (for showcasing)

python app.py --share

This generates a temporary public URL (valid for 72 hours) that you can share:

Running on local URL:  http://127.0.0.1:7860
Running on public URL: https://abc123xyz.gradio.live  ← Share this!
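
For readers wiring up a similar flag in their own app, the sketch below shows one way a --share option could be passed through to Gradio. The parse_args and main helpers and the placeholder interface are assumptions for illustration, not the contents of app.py; launch(share=...) is the Gradio call that requests the public link.

```python
import argparse


def parse_args(argv=None):
    """Parse command-line flags; --share is the only flag this README mentions."""
    parser = argparse.ArgumentParser(description="ArXiv Concept Tracker")
    parser.add_argument(
        "--share",
        action="store_true",
        help="ask Gradio for a temporary public URL",
    )
    return parser.parse_args(argv)


def main(argv=None):
    """Build a placeholder UI and launch it (hypothetical stand-in for app.py)."""
    args = parse_args(argv)
    import gradio as gr  # imported lazily so parse_args works without Gradio

    with gr.Blocks(title="ArXiv Concept Tracker") as demo:
        gr.Markdown("ArXiv Concept Tracker")
    # share=True asks Gradio to create the temporary public link.
    demo.launch(share=args.share)
```

Calling main() with no arguments serves only the local URL; main(["--share"]) also requests the public one.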

Deploy to HuggingFace Spaces (permanent hosting)

  1. Create a Space on HuggingFace Spaces
  2. Push your code:
git remote add hf https://huggingface.co/spaces/YOUR-USERNAME/concept-tracker
git push hf main

Your app will be live at: https://huggingface.co/spaces/YOUR-USERNAME/concept-tracker

How to Use

1. Search for Seed Papers

  • Enter a search query (e.g., "attention is all you need", "diffusion models")
  • Optionally filter by year range
  • Select 1-5 papers that define your concept

2. Configure Tracking

  • End Date: How far forward to track (auto-calculated as seed date + 2 years)
  • Window Size: Time window for each step (default: 6 months)
  • Max Papers: Papers to fetch per window (500-2000 for GPU, 50-100 for CPU)
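
The default end date and the window boundaries come down to a little date arithmetic. The sketch below is a minimal illustration using only the standard library; add_months and time_windows are hypothetical helper names, and calendar-month windows are an assumption about how the tracker steps forward.

```python
import calendar
from datetime import date


def add_months(d: date, months: int) -> date:
    """Shift a date by whole months, clamping the day to the month's length."""
    index = d.year * 12 + (d.month - 1) + months
    year, month_index = divmod(index, 12)
    month = month_index + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def time_windows(seed: date, end: date, window_months: int = 6):
    """Yield consecutive (start, stop) windows from the seed date to the end date."""
    start = seed
    step = 0
    while start < end:
        step += 1
        # Offsets are measured from the seed so day clamping does not accumulate.
        stop = min(add_months(seed, step * window_months), end)
        yield start, stop
        start = stop


seed_date = date(2017, 6, 12)          # example seed date
end_date = add_months(seed_date, 24)   # default: seed date + 2 years
windows = list(time_windows(seed_date, end_date))
```

With a two-year span and six-month windows this produces four steps; the final window is shortened if the end date does not fall on a window boundary.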

3. View Results

  • Timeline: Concept evolution across time windows with expandable paper lists
  • Visualization: t-SNE plot showing concept trajectory through embedding space
  • Statistics: Total papers, time steps, and high/moderate/low confidence counts
  • Export: Download results as JSON (full data) or CSV (flattened paper records)
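
The difference between the two export formats can be illustrated with a short sketch. The result shape and field names below (window_start, papers, similarity, and so on) are assumptions made for the example, not the tracker's actual schema.

```python
import csv
import io
import json

# Hypothetical result shape: one entry per time window, each holding accepted papers.
results = [
    {
        "window_start": "2017-06-12",
        "window_end": "2017-12-12",
        "papers": [
            {"arxiv_id": "1706.03762", "title": "Attention Is All You Need", "similarity": 1.0},
        ],
    },
]


def to_json(results) -> str:
    """Full export: keep the nested window -> papers structure."""
    return json.dumps(results, indent=2)


def to_csv(results) -> str:
    """Flattened export: one row per paper, with its window copied onto the row."""
    fields = ["window_start", "window_end", "arxiv_id", "title", "similarity"]
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=fields)
    writer.writeheader()
    for step in results:
        for paper in step["papers"]:
            writer.writerow({
                "window_start": step["window_start"],
                "window_end": step["window_end"],
                **paper,
            })
    return buffer.getvalue()
```

JSON suits reloading a full run, while the flattened CSV opens directly in a spreadsheet.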

How It Works

Architecture

flowchart LR
    subgraph Input
        A[ArXiv API]
    end

    subgraph Processing
        B[Paper Fetching]
        C[Embedding Generation]
        D[Kalman Tracking]
    end

    subgraph Output
        E[Visualization]
    end

    subgraph Storage
        F[(Embedding Cache)]
    end

    A --> B
    B --> C
    C --> D
    D --> E

    C <--> F

Concept Tracking Algorithm

  1. Initialization: Start with 1-5 seed papers that define your concept
  2. Embedding: Generate semantic embeddings (title + abstract) using sentence-transformers
  3. Time Windows: Move forward in configurable windows (default: 6 months)
  4. For each window:
    • Fetch candidate papers from ArXiv (cs.LG, cs.CL, cs.AI)
    • Generate embeddings (cached after first generation)
    • Kalman Filtering: Evaluate each paper against physics-inspired constraints:
      • Similarity: Similarity to the current concept vector must be ≥ 0.50
      • Velocity: Concept drift must be < 0.05 (prevents sudden jumps)
      • Acceleration: Change in velocity must be < 0.02 (prevents direction shifts)
    • Accept papers that pass all constraints
    • Update concept vector as weighted mean of accepted papers
  5. Repeat until end date
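
The accept-and-update part of step 4 can be sketched in a few lines. This is a simplified illustration, not the project's tracker: it applies only the similarity gate (the velocity and acceleration checks are left out), uses plain Python lists instead of NumPy, and assumes that each accepted paper is weighted by its similarity to the current concept.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def weighted_mean(vectors: List[Vector], weights: List[float]) -> List[float]:
    """Component-wise mean of the vectors, weighted by the given weights."""
    total = sum(weights)
    dim = len(vectors[0])
    return [sum(w * v[i] for v, w in zip(vectors, weights)) / total for i in range(dim)]


def track_window(
    concept: Vector, candidates: List[Vector], min_similarity: float = 0.50
) -> Tuple[List[float], List[Tuple[float, Vector]]]:
    """Process one time window: accept similar papers, then update the concept."""
    scored = [(cosine_similarity(concept, emb), emb) for emb in candidates]
    accepted = [(sim, emb) for sim, emb in scored if sim >= min_similarity]
    if not accepted:
        return list(concept), []  # nothing accepted: the concept stays put
    new_concept = weighted_mean(
        [emb for _, emb in accepted],
        [sim for sim, _ in accepted],  # assumption: weight = similarity
    )
    return new_concept, accepted


# Toy 2-D "embeddings": two candidates are close to the concept, one is orthogonal.
concept = [1.0, 0.0]
candidates = [[0.8, 0.6], [0.0, 1.0], [1.0, 0.0]]
new_concept, accepted = track_window(concept, candidates)
```

In the toy run the orthogonal candidate is dropped and the concept moves slightly toward the first candidate.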

Kalman Filter Validation

The tracker uses physics-inspired constraints to reject papers that would cause unrealistic concept jumps:

  • Similarity < 0.50: Too dissimilar to current concept
  • Velocity > 0.05: Concept jumping too fast through embedding space
  • Acceleration > 0.02: Sudden change in direction

This ensures smooth, realistic concept evolution tracking.
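
To make the three rejection rules concrete, here is a minimal sketch of a validation function. The definitions are assumptions chosen for illustration: velocity is taken as the displacement from the current concept to the candidate embedding, and acceleration as the change relative to the previous step's velocity. The project's kalman_tracker.py may define these quantities differently.

```python
import math
from typing import List, Optional, Sequence

Vector = Sequence[float]


def norm(v: Vector) -> float:
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in v))


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two non-zero vectors."""
    return sum(x * y for x, y in zip(a, b)) / (norm(a) * norm(b))


def rejection_reason(
    concept: Vector,
    candidate: Vector,
    previous_velocity: Vector,
    *,
    min_similarity: float = 0.50,
    max_velocity: float = 0.05,
    max_acceleration: float = 0.02,
) -> Optional[str]:
    """Return None if the candidate passes, else the first violated constraint."""
    if cosine_similarity(concept, candidate) < min_similarity:
        return "similarity"
    # Assumed definition: velocity = move from the concept to the candidate.
    velocity: List[float] = [c - p for c, p in zip(candidate, concept)]
    if norm(velocity) > max_velocity:
        return "velocity"
    # Assumed definition: acceleration = change in velocity since the last step.
    acceleration = [v - pv for v, pv in zip(velocity, previous_velocity)]
    if norm(acceleration) > max_acceleration:
        return "acceleration"
    return None


concept = [1.0, 0.0]
at_rest = [0.0, 0.0]
small_step = rejection_reason(concept, [1.0, 0.01], at_rest)
big_jump = rejection_reason(concept, [1.0, 0.1], at_rest)
sudden_turn = rejection_reason(concept, [1.0, 0.04], at_rest)
steady_drift = rejection_reason(concept, [1.0, 0.04], [0.0, 0.03])
```

The last two calls use the same candidate: it is rejected when the concept was previously at rest, but accepted when the concept was already drifting in that direction.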

Configuration

The application uses sensible defaults. To customize, edit backend/config.py:

# Kalman Filter Parameters
max_velocity = 1.0        # Max concept drift per time step
max_acceleration = 0.6    # Max change in velocity

# Similarity Thresholds
threshold_auto_include = 0.85  # High confidence (auto-accept)
threshold_strong = 0.75        # Strong confidence
threshold_moderate = 0.60      # Moderate confidence
threshold_reject = 0.50        # Below this = reject
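
One way to see how the similarity thresholds partition scores is the sketch below. The TrackerConfig dataclass and the label names (including "weak" for scores between threshold_reject and threshold_moderate) are assumptions for illustration; only the parameter names and default values come from the listing above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrackerConfig:
    """Hypothetical container mirroring the defaults listed above."""
    max_velocity: float = 1.0
    max_acceleration: float = 0.6
    threshold_auto_include: float = 0.85
    threshold_strong: float = 0.75
    threshold_moderate: float = 0.60
    threshold_reject: float = 0.50


def confidence_label(similarity: float, config: TrackerConfig = TrackerConfig()) -> str:
    """Map a similarity score onto a confidence band (label names are assumed)."""
    if similarity < config.threshold_reject:
        return "reject"
    if similarity >= config.threshold_auto_include:
        return "auto_include"
    if similarity >= config.threshold_strong:
        return "strong"
    if similarity >= config.threshold_moderate:
        return "moderate"
    return "weak"  # accepted, but below the moderate threshold


strict = TrackerConfig(threshold_reject=0.60)
labels = [confidence_label(s) for s in (0.90, 0.80, 0.65, 0.55, 0.40)]
```

Raising threshold_reject to 0.60, as in strict, removes the lowest accepted band entirely.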

Tuning Tips

Too strict (rejecting true positives)?

  • Increase max_velocity (e.g., 1.5)
  • Increase max_acceleration (e.g., 0.8)
  • Lower threshold_reject (e.g., 0.45)

Too loose (accepting false positives)?

  • Decrease max_velocity (e.g., 0.5)
  • Decrease max_acceleration (e.g., 0.3)
  • Raise threshold_reject (e.g., 0.60)

Testing

# Run all tests
pytest tests/ -v

# Run specific test file
pytest tests/test_kalman.py -v

# Run integration tests (uses real ArXiv data)
pytest tests/test_arxiv_client.py -v

Project Structure

concept_tracker/
├── app.py                      # Gradio interface
├── backend/
│   ├── gradio_wrapper.py       # Gradio event handlers + export
│   ├── tracker.py              # Main orchestrator
│   ├── arxiv_client.py         # ArXiv API wrapper
│   ├── embedding_service.py    # Embeddings + cache
│   ├── kalman_tracker.py       # Kalman filter logic
│   ├── visualization.py        # t-SNE visualization
│   ├── config.py               # Configuration
│   ├── models.py               # Pydantic data models
│   └── utils/
│       └── cache.py            # Embedding cache
├── cache/                      # Auto-created embedding cache
├── tests/                      # Test suite
├── requirements.txt            # Dependencies
└── README.md                   # This file

Performance

First Run

  • Time: 10-15 minutes
  • Bottleneck: Model download (~400MB) + initial embedding generation

Subsequent Runs (Cached)

  • Time: 2-3 minutes
  • Bottleneck: ArXiv API queries + Kalman filtering

Optimizations

  • All embeddings permanently cached in cache/embeddings/
  • Cache grows ~4KB per paper
  • 10,000 papers ≈ 40MB cache
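
The compute-once behaviour behind the cache can be sketched as a small get-or-compute helper. The EmbeddingCache class, the one-JSON-file-per-paper layout, and keying by arXiv ID are assumptions for illustration; the real cache in backend/utils/cache.py may use a different format.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, List


class EmbeddingCache:
    """Tiny on-disk cache: one JSON file per paper, keyed by arXiv ID."""

    def __init__(self, root: Path):
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def _path(self, arxiv_id: str) -> Path:
        # Old-style arXiv IDs contain "/", which is not safe in a file name.
        return self.root / (arxiv_id.replace("/", "_") + ".json")

    def get_or_compute(self, arxiv_id: str, compute: Callable[[], List[float]]) -> List[float]:
        """Return the cached embedding, computing and storing it on a miss."""
        path = self._path(arxiv_id)
        if path.exists():
            return json.loads(path.read_text())
        embedding = compute()
        path.write_text(json.dumps(embedding))
        return embedding


calls = []


def fake_embed() -> List[float]:
    """Stand-in for the embedding model; records how often it is invoked."""
    calls.append(1)
    return [0.1, 0.2, 0.3]


with tempfile.TemporaryDirectory() as tmp:
    cache = EmbeddingCache(Path(tmp) / "embeddings")
    first = cache.get_or_compute("1706.03762", fake_embed)
    second = cache.get_or_compute("1706.03762", fake_embed)
```

The second lookup is served from disk, so the stand-in model runs only once.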

HuggingFace Spaces Performance

  • Free tier: CPU-only, 16GB RAM, slower but functional
  • ZeroGPU option: Faster inference (requires configuration)

Troubleshooting

Model won't download

Solution: Ensure at least 1GB of free disk space. The model downloads to `~/.cache/huggingface/`.

ArXiv API errors (429, timeouts)

Solution: The client already rate-limits requests (3-second delay between calls). If errors persist, increase arxiv_rate_limit in config.
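
For reference, a fixed delay between requests can be enforced with a small helper like the one below. This is a generic sketch, not the project's arxiv_client.py; the RateLimiter class and its injectable clock and sleep arguments are assumptions added so the behaviour can be checked without real waiting.

```python
import time
from typing import Callable, Optional


class RateLimiter:
    """Enforce a minimum gap (in seconds) between consecutive calls."""

    def __init__(
        self,
        min_interval: float = 3.0,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last_call: Optional[float] = None

    def wait(self) -> float:
        """Block until the interval has elapsed; return the seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last_call is not None:
            delay = max(0.0, self.min_interval - (now - self._last_call))
            if delay > 0:
                self._sleep(delay)
        # Record when this call was actually allowed to proceed.
        self._last_call = now + delay
        return delay
```

Calling limiter.wait() immediately before each API request keeps requests at least min_interval seconds apart; raising the interval corresponds to increasing arxiv_rate_limit.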

No papers accepted in tracking

Solution:

  1. Check console logs for rejection reasons
  2. Lower threshold_reject in config
  3. Increase max_velocity if velocity rejections are common

Out of memory

Solution: Reduce max_papers_per_window (try 50-100 on CPU, 500-2000 on GPU)

Example: Track Transformer Evolution

  1. Search: "attention is all you need"
  2. Select the 2017 paper (arxiv:1706.03762)
  3. Set end date: 2018-12-31
  4. Window: 6 months
  5. Max papers: 500
  6. Click "Track Concept Evolution"

Expected results:

  • Should find BERT and other transformer variants
  • Should NOT jump to unrelated NLP work (e.g., pure RNN papers)
  • Similarity should stay above 0.50
  • 2-3 time steps with 10-30 papers each

Future Enhancements

  • ✅ Linear tracking (current implementation)
  • 🔲 Tree branching with HDBSCAN clustering
  • 🔲 Bidirectional tracking (trace concepts to origins)
  • 🔲 Multi-signal validation (citations, author overlap)
  • ✅ Export results to JSON/CSV

License

MIT License

Acknowledgments

  • ArXiv for open access to research papers
  • HuggingFace for sentence-transformers and Spaces hosting
  • Gradio for the web interface framework