markgewhite / research_concept_tracker

Added Mermaid architecture diagram and type hints

2257df9 5 months ago

metadata

title: ArXiv Concept Tracker
emoji: 📚
colorFrom: blue
colorTo: indigo
sdk: gradio
sdk_version: 5.16.0
app_file: app.py
pinned: false
hf_oauth: true

ArXiv Concept Tracker

Track how research concepts evolve over time using AI-powered semantic embeddings and Kalman filtering.

Features

🔍 Search ArXiv papers by keyword with optional year filtering
📊 Track concept evolution through time windows
🧠 Semantic embeddings using sentence-transformers (MiniLM-L6-v2)
📈 Interactive timeline visualization with Gradio
🗺️ t-SNE visualization showing concept trajectory through embedding space
🎯 Kalman filter constraints prevent unrealistic concept jumps
💾 Local embeddings cache - no API costs
📥 Export results to JSON or CSV
🌐 Works locally and on HuggingFace Spaces

Quick Start

Local Installation

Clone the repository:

git clone <your-repo-url>
cd concept_tracker

Create and activate virtual environment:

python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

Install dependencies:

pip install -r requirements.txt

This will install:

Gradio (web interface)
sentence-transformers (embeddings)
ArXiv API client
NumPy, scikit-learn (computations)
Pytest (testing)

Note: First run will download the embedding model (~400MB) automatically.

Run Locally

python app.py

This will:

Start a local server at http://127.0.0.1:7860
Open the interface in your browser
Generate embeddings cache at cache/embeddings/

Get a Public URL (for showcasing)

python app.py --share

This generates a temporary public URL (valid for 72 hours) that you can share:

Running on local URL:  http://127.0.0.1:7860
Running on public URL: https://abc123xyz.gradio.live  ← Share this!

Deploy to HuggingFace Spaces (permanent hosting)

Create a Space on HuggingFace Spaces
Push your code:

git remote add hf https://huggingface.co/spaces/YOUR-USERNAME/concept-tracker
git push hf main

Your app will be live at: https://huggingface.co/spaces/YOUR-USERNAME/concept-tracker

How to Use

1. Search for Seed Papers

Enter a search query (e.g., "attention is all you need", "diffusion models")
Optionally filter by year range
Select 1-5 papers that define your concept
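A keyword search with an optional year filter can be written as a single arXiv API query string. The sketch below is an assumption about how such a query might be assembled, not the code in backend/arxiv_client.py. `build_query` and its parameters are hypothetical names. It uses only the `all:`, `cat:` and `submittedDate:[... TO ...]` fields of the arXiv query syntax.

```python
from typing import Optional, Sequence


def build_query(
    keywords: str,
    start_year: Optional[int] = None,
    end_year: Optional[int] = None,
    categories: Sequence[str] = (),
) -> str:
    """Build an arXiv API search_query string (hypothetical helper)."""
    # Quote the keywords so a multi-word query is matched as a phrase.
    parts = [f'all:"{keywords}"']
    if categories:
        cats = " OR ".join(f"cat:{c}" for c in categories)
        parts.append(f"({cats})")
    if start_year is not None or end_year is not None:
        # Open-ended ranges fall back to deliberately wide bounds.
        lo = f"{start_year or 1991}01010000"
        hi = f"{end_year or 2099}12312359"
        parts.append(f"submittedDate:[{lo} TO {hi}]")
    return " AND ".join(parts)


print(build_query("attention is all you need", 2017, 2018))
```

Passing `categories=("cs.LG", "cs.CL", "cs.AI")` would restrict the search to the categories the tracker fetches from.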

2. Configure Tracking

End Date: How far forward to track (auto-calculated as seed date + 2 years)
Window Size: Time window for each step (default: 6 months)
Max Papers: Papers to fetch per window (500-2000 for GPU, 50-100 for CPU)
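The End Date and Window Size settings together determine how many tracking steps are run. As a rough illustration, the sketch below splits the tracking period into consecutive windows. `add_months` and `make_windows` are hypothetical helpers, and clamping the day of month to 28 is a simplification made here, not something the app is known to do.

```python
from datetime import date
from typing import List, Optional, Tuple


def add_months(d: date, months: int) -> date:
    """Shift a date by whole months, clamping the day to 28 so it stays valid."""
    total = d.year * 12 + (d.month - 1) + months
    return date(total // 12, total % 12 + 1, min(d.day, 28))


def make_windows(
    seed_date: date,
    end_date: Optional[date] = None,
    window_months: int = 6,
) -> List[Tuple[date, date]]:
    """Split [seed_date, end_date] into consecutive (start, stop) windows."""
    if window_months <= 0:
        raise ValueError("window_months must be positive")
    if end_date is None:
        end_date = add_months(seed_date, 24)  # default: seed date + 2 years
    windows = []
    start = seed_date
    while start < end_date:
        # The last window is cut short if it would pass the end date.
        stop = min(add_months(start, window_months), end_date)
        windows.append((start, stop))
        start = stop
    return windows


for start, stop in make_windows(date(2017, 6, 12)):
    print(start, "->", stop)
```

With the defaults (2 years, 6-month windows) this yields four windows.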

3. View Results

Timeline: Concept evolution across time windows with expandable paper lists
Visualization: t-SNE plot showing concept trajectory through embedding space
Statistics: Total papers, time steps, and high/moderate/low confidence counts
Export: Download results as JSON (full data) or CSV (flattened paper records)
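The difference between the two export formats can be shown with a small sketch: JSON keeps the nested per-window structure, while CSV holds one row per paper. The field names (`steps`, `window_start`, `arxiv_id`, and so on) and the sample record are illustrative assumptions. The real export schema lives in backend/gradio_wrapper.py and may differ.

```python
import csv
import io
import json
from typing import Any, Dict, List


def flatten_results(results: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Turn nested per-window results into one row per paper."""
    rows = []
    for step in results["steps"]:
        for paper in step["papers"]:
            rows.append(
                {
                    "window_start": step["window_start"],
                    "window_end": step["window_end"],
                    "arxiv_id": paper["arxiv_id"],
                    "title": paper["title"],
                    "similarity": paper["similarity"],
                }
            )
    return rows


def to_csv(rows: List[Dict[str, Any]]) -> str:
    """Serialize flattened rows to CSV text with a header line."""
    if not rows:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Illustrative input only; the values are made up for the example.
example = {
    "steps": [
        {
            "window_start": "2017-06-12",
            "window_end": "2017-12-12",
            "papers": [
                {"arxiv_id": "1706.03762", "title": "Attention Is All You Need", "similarity": 1.0}
            ],
        }
    ]
}
full_json = json.dumps(example, indent=2)          # JSON export: full nested data
flat_csv = to_csv(flatten_results(example))        # CSV export: one row per paper
```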

How It Works

Architecture

flowchart LR
    subgraph Input
        A[ArXiv API]
    end

    subgraph Processing
        B[Paper Fetching]
        C[Embedding Generation]
        D[Kalman Tracking]
    end

    subgraph Output
        E[Visualization]
    end

    subgraph Storage
        F[(Embedding Cache)]
    end

    A --> B
    B --> C
    C --> D
    D --> E

    C <--> F

Concept Tracking Algorithm

Initialization: Start with 1-5 seed papers that define your concept
Embedding: Generate semantic embeddings (title + abstract) using sentence-transformers
Time Windows: Move forward in configurable windows (default: 6 months)
For each window:
- Fetch candidate papers from ArXiv (cs.LG, cs.CL, cs.AI)
- Generate embeddings (cached after first generation)
- Kalman Filtering: Evaluate each paper against physics-inspired constraints:
  - Similarity: Similarity to the current concept vector must be ≥ 0.50
  - Velocity: Concept drift must be < 0.05 (prevents sudden jumps)
  - Acceleration: Change in velocity must be < 0.02 (prevents direction shifts)
- Accept papers that pass all constraints
- Update concept vector as weighted mean of accepted papers
Repeat until end date
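The final update step, a weighted mean of the accepted papers, can be sketched as follows. Weighting each paper by its cosine similarity to the current concept vector is an assumption made for this example, since the weighting scheme is not specified here. Plain Python lists keep the sketch dependency-free, and `update_concept` is a hypothetical name.

```python
import math
from typing import List, Sequence

Vector = List[float]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def update_concept(concept: Vector, accepted: Sequence[Vector]) -> Vector:
    """Weighted mean of accepted embeddings; weights are similarities (assumption)."""
    if not accepted:
        return list(concept)  # nothing accepted: keep the concept unchanged
    weights = [cosine_similarity(concept, v) for v in accepted]
    total = sum(weights)
    if total <= 0:
        return list(concept)  # degenerate weights: keep the concept unchanged
    return [
        sum(w * v[i] for w, v in zip(weights, accepted)) / total
        for i in range(len(concept))
    ]


new_concept = update_concept([1.0, 0.0], [[1.0, 0.0], [0.6, 0.8]])
```

Papers closer to the current concept pull the new vector more strongly, so a single borderline paper cannot drag the concept far in one step.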

Kalman Filter Validation

The tracker uses physics-inspired constraints to reject papers that would cause unrealistic concept jumps:

Similarity < 0.50: Too dissimilar to current concept
Velocity > 0.05: Concept jumping too fast through embedding space
Acceleration > 0.02: Sudden change in direction

This ensures smooth, realistic concept evolution tracking.
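The three rejection rules can be expressed as one check that also reports why a paper was rejected. This is a minimal sketch under stated assumptions, not the logic in backend/kalman_tracker.py. Velocity is taken to be the length of the proposed concept displacement (`step`), and acceleration the length of the change from the previous displacement. The limits are passed in explicitly rather than hard-coded, and `check_constraints` is a hypothetical name.

```python
import math
from typing import Optional, Sequence, Tuple


def norm(v: Sequence[float]) -> float:
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in v))


def check_constraints(
    similarity: float,
    step: Sequence[float],
    prev_step: Optional[Sequence[float]],
    min_similarity: float,
    max_velocity: float,
    max_acceleration: float,
) -> Tuple[bool, str]:
    """Return (accepted, reason); `step` is the proposed concept displacement."""
    if similarity < min_similarity:
        return False, "similarity"
    if norm(step) > max_velocity:
        return False, "velocity"
    if prev_step is not None:
        # Acceleration is only defined once a previous step exists.
        acceleration = norm([s - p for s, p in zip(step, prev_step)])
        if acceleration > max_acceleration:
            return False, "acceleration"
    return True, "accepted"


limits = dict(min_similarity=0.50, max_velocity=0.05, max_acceleration=0.02)
print(check_constraints(0.8, [0.03, 0.0], [0.02, 0.0], **limits))
```

Returning the reason alongside the verdict makes it easy to count which constraint is responsible for most rejections when tuning.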

Configuration

The application uses sensible defaults. To customize, edit backend/config.py:

# Kalman Filter Parameters
max_velocity = 1.0        # Max concept drift per time step
max_acceleration = 0.6    # Max change in velocity

# Similarity Thresholds
threshold_auto_include = 0.85  # High confidence (auto-accept)
threshold_strong = 0.75        # Strong confidence
threshold_moderate = 0.60      # Moderate confidence
threshold_reject = 0.50        # Below this = reject
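To illustrate how the four similarity thresholds partition papers into confidence levels, here is a small sketch. The `Thresholds` dataclass and `confidence_tier` function are hypothetical, and the label "low" for the band between `threshold_reject` and `threshold_moderate` is an assumption based on the high/moderate/low counts shown in the statistics.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    """Similarity cut-offs mirroring the defaults listed above."""
    auto_include: float = 0.85
    strong: float = 0.75
    moderate: float = 0.60
    reject: float = 0.50


def confidence_tier(similarity: float, t: Thresholds = Thresholds()) -> str:
    """Map a similarity score to a confidence label, highest tier first."""
    if similarity >= t.auto_include:
        return "auto_include"
    if similarity >= t.strong:
        return "strong"
    if similarity >= t.moderate:
        return "moderate"
    if similarity >= t.reject:
        return "low"  # accepted, but only just above the reject line (assumed label)
    return "reject"


for score in (0.90, 0.78, 0.65, 0.55, 0.40):
    print(score, confidence_tier(score))
```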

Tuning Tips

Too strict (rejecting true positives)?

Increase max_velocity (e.g., 1.5)
Increase max_acceleration (e.g., 0.8)
Lower threshold_reject (e.g., 0.45)

Too loose (accepting false positives)?

Decrease max_velocity (e.g., 0.5)
Decrease max_acceleration (e.g., 0.3)
Raise threshold_reject (e.g., 0.60)

Testing

# Run all tests
pytest tests/ -v

# Run specific test file
pytest tests/test_kalman.py -v

# Run integration tests (uses real ArXiv data)
pytest tests/test_arxiv_client.py -v

Project Structure

concept_tracker/
├── app.py                      # Gradio interface
├── backend/
│   ├── gradio_wrapper.py       # Gradio event handlers + export
│   ├── tracker.py              # Main orchestrator
│   ├── arxiv_client.py         # ArXiv API wrapper
│   ├── embedding_service.py    # Embeddings + cache
│   ├── kalman_tracker.py       # Kalman filter logic
│   ├── visualization.py        # t-SNE visualization
│   ├── config.py               # Configuration
│   ├── models.py               # Pydantic data models
│   └── utils/
│       └── cache.py            # Embedding cache
├── cache/                      # Auto-created embedding cache
├── tests/                      # Test suite
├── requirements.txt            # Dependencies
└── README.md                   # This file

Performance

First Run

Time: 10-15 minutes
Bottleneck: Model download (~400MB) + initial embedding generation

Subsequent Runs (Cached)

Time: 2-3 minutes
Bottleneck: ArXiv API queries + Kalman filtering

Optimizations

All embeddings permanently cached in cache/embeddings/
Cache grows ~4KB per paper
10,000 papers ≈ 40MB cache
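The idea behind a permanent embedding cache is that each paper is embedded once and read back from disk afterwards. The sketch below shows one simple way to do this with one JSON file per paper. `EmbeddingCache` is a hypothetical class written for illustration. The actual storage format in backend/utils/cache.py is not described here and may well differ.

```python
import hashlib
import json
from pathlib import Path
from typing import Callable, List


class EmbeddingCache:
    """Tiny file-per-paper cache (illustrative, not the project's implementation)."""

    def __init__(self, root: Path) -> None:
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def _path(self, paper_id: str) -> Path:
        # Hash the ID so characters such as "/" never reach the filesystem.
        digest = hashlib.sha256(paper_id.encode("utf-8")).hexdigest()
        return self.root / f"{digest}.json"

    def get_or_compute(
        self, paper_id: str, compute: Callable[[], List[float]]
    ) -> List[float]:
        """Return the cached vector, calling `compute` only on a cache miss."""
        path = self._path(paper_id)
        if path.exists():
            return json.loads(path.read_text())
        vector = compute()
        path.write_text(json.dumps(vector))
        return vector
```

A caller would wrap the embedding call, for example `cache.get_or_compute(paper_id, lambda: embed(text))`, where `embed` stands in for whatever function produces the vector.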

HuggingFace Spaces Performance

Free tier: CPU-only, 16GB RAM, slower but functional
ZeroGPU option: Faster inference (requires configuration)

Troubleshooting

Model won't download

Solution: Ensure ~1GB free disk space. The model downloads to ~/.cache/huggingface/

ArXiv API errors (429, timeouts)

Solution: The client already rate-limits requests (3 sec delay between calls). If errors persist, increase arxiv_rate_limit in backend/config.py.
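For readers curious what a fixed delay between requests looks like in code, here is a minimal sketch. `RateLimiter` is a hypothetical helper, not the project's implementation. The injectable `clock` and `sleep` arguments exist only to make the class easy to test without real waiting.

```python
import time
from typing import Callable, Optional


class RateLimiter:
    """Enforce a minimum interval between successive calls to wait()."""

    def __init__(
        self,
        min_interval: float = 3.0,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None  # time of the previous call, if any

    def wait(self) -> float:
        """Block until the interval has elapsed; return the seconds slept."""
        delay = 0.0
        if self._last is not None:
            elapsed = self._clock() - self._last
            delay = max(0.0, self.min_interval - elapsed)
        if delay > 0:
            self._sleep(delay)
        self._last = self._clock()
        return delay
```

Calling `limiter.wait()` before each API request spaces the requests at least `min_interval` seconds apart. Raising the interval trades speed for fewer 429 responses.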

No papers accepted in tracking

Solution:

Check console logs for rejection reasons
Lower threshold_reject in config
Increase max_velocity if velocity rejections are common

Out of memory

Solution: Reduce max_papers_per_window (try 50-100 on CPU, 500-2000 on GPU)

Example: Track Transformer Evolution

Search: "attention is all you need"
Select the 2017 paper (arxiv:1706.03762)
Set end date: 2018-12-31
Window: 6 months
Max papers: 500
Click "Track Concept Evolution"

Expected results:

Should find BERT and other transformer variants
Should NOT jump to unrelated NLP (pure RNN papers)
Similarity should stay above 0.50
2-3 time steps with 10-30 papers each

Future Enhancements

✅ Linear tracking (current implementation)
🔲 Tree branching with HDBSCAN clustering
🔲 Bidirectional tracking (trace concepts to origins)
🔲 Multi-signal validation (citations, author overlap)
✅ Export results to JSON/CSV

License

MIT License

Acknowledgments

ArXiv for open access to research papers
HuggingFace for sentence-transformers and Spaces hosting
Gradio for the web interface framework