title: ArXiv Concept Tracker
emoji: π
colorFrom: blue
colorTo: indigo
sdk: gradio
sdk_version: 5.16.0
app_file: app.py
pinned: false
hf_oauth: true
ArXiv Concept Tracker
Track how research concepts evolve over time using AI-powered semantic embeddings and Kalman filtering.
Features
- Search ArXiv papers by keyword with optional year filtering
- Track concept evolution through time windows
- Semantic embeddings using sentence-transformers (MiniLM-L6-v2)
- Interactive timeline visualization with Gradio
- t-SNE visualization showing concept trajectory through embedding space
- Kalman filter constraints prevent unrealistic concept jumps
- Local embeddings cache - no API costs
- Export results to JSON or CSV
- Works locally and on HuggingFace Spaces
Quick Start
Local Installation
- Clone the repository:
git clone <your-repo-url>
cd concept_tracker
- Create and activate virtual environment:
python3 -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
- Install dependencies:
pip install -r requirements.txt
This will install:
- Gradio (web interface)
- sentence-transformers (embeddings)
- ArXiv API client
- NumPy, scikit-learn (computations)
- Pytest (testing)
Note: First run will download the embedding model (~400MB) automatically.
Run Locally
python app.py
This will:
- Start a local server at http://127.0.0.1:7860
- Open the interface in your browser
- Generate embeddings cache at cache/embeddings/
Get a Public URL (for showcasing)
python app.py --share
This generates a temporary public URL (valid for 72 hours) that you can share:
Running on local URL: http://127.0.0.1:7860
Running on public URL: https://abc123xyz.gradio.live ← Share this!
Deploy to HuggingFace Spaces (permanent hosting)
- Create a Space on HuggingFace Spaces
- Push your code:
git remote add hf https://huggingface.co/spaces/YOUR-USERNAME/concept-tracker
git push hf main
Your app will be live at: https://huggingface.co/spaces/YOUR-USERNAME/concept-tracker
How to Use
1. Search for Seed Papers
- Enter a search query (e.g., "attention is all you need", "diffusion models")
- Optionally filter by year range
- Select 1-5 papers that define your concept
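A keyword search with a year range could be expressed in the arXiv API's query syntax roughly as follows. This is a sketch, not the project's arxiv_client.py: `build_query` is a hypothetical helper, and the choice of the `all:`, `cat:`, and `submittedDate:` field prefixes is an assumption about how the app might query arXiv.

```python
from datetime import date
from typing import Optional, Sequence


def build_query(
    keywords: str,
    categories: Sequence[str] = (),
    year_from: Optional[int] = None,
    year_to: Optional[int] = None,
) -> str:
    """Build an arXiv API search_query string (hypothetical helper)."""
    # Quote the phrase so a multi-word query is matched as a unit.
    parts = [f'all:"{keywords}"']

    if categories:
        cats = " OR ".join(f"cat:{c}" for c in categories)
        parts.append(f"({cats})")

    if year_from is not None or year_to is not None:
        # Open-ended ranges fall back to 1991 (arXiv's first year) or today.
        first = year_from if year_from is not None else 1991
        last = year_to if year_to is not None else date.today().year
        # submittedDate ranges use the form [YYYYMMDDHHMM TO YYYYMMDDHHMM].
        parts.append(f"submittedDate:[{first}01010000 TO {last}12312359]")

    return " AND ".join(parts)
```

For example, `build_query("diffusion models", ["cs.LG"], 2020, 2022)` restricts the search to cs.LG papers submitted between 2020 and 2022.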
2. Configure Tracking
- End Date: How far forward to track (auto-calculated as seed date + 2 years)
- Window Size: Time window for each step (default: 6 months)
- Max Papers: Papers to fetch per window (500-2000 for GPU, 50-100 for CPU)
3. View Results
- Timeline: Concept evolution across time windows with expandable paper lists
- Visualization: t-SNE plot showing concept trajectory through embedding space
- Statistics: Total papers, time steps, and high/moderate/low confidence counts
- Export: Download results as JSON (full data) or CSV (flattened paper records)
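The difference between the two export formats can be sketched as follows. The result schema used here (`steps`, `papers`, `arxiv_id`, `similarity`, and so on) is hypothetical and chosen only for illustration; the real export fields may differ.

```python
import csv
import io
import json
from typing import Any, Dict

CSV_FIELDS = ["step", "window_start", "window_end", "arxiv_id", "title", "similarity"]


def to_json(results: Dict[str, Any]) -> str:
    """Serialize the full nested result structure."""
    return json.dumps(results, indent=2)


def to_csv(results: Dict[str, Any]) -> str:
    """Flatten the nested steps into one CSV row per paper."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=CSV_FIELDS)
    writer.writeheader()
    for step in results["steps"]:
        for paper in step["papers"]:
            writer.writerow(
                {
                    "step": step["step"],
                    "window_start": step["window_start"],
                    "window_end": step["window_end"],
                    "arxiv_id": paper["arxiv_id"],
                    "title": paper["title"],
                    "similarity": paper["similarity"],
                }
            )
    return buffer.getvalue()
```

JSON keeps the step/paper nesting intact, while CSV repeats the window columns on every paper row so the file can be opened directly in a spreadsheet.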
How It Works
Architecture
flowchart LR
subgraph Input
A[ArXiv API]
end
subgraph Processing
B[Paper Fetching]
C[Embedding Generation]
D[Kalman Tracking]
end
subgraph Output
E[Visualization]
end
subgraph Storage
F[(Embedding Cache)]
end
A --> B
B --> C
C --> D
D --> E
C <--> F
Concept Tracking Algorithm
- Initialization: Start with 1-5 seed papers that define your concept
- Embedding: Generate semantic embeddings (title + abstract) using sentence-transformers
- Time Windows: Move forward in configurable windows (default: 6 months)
- For each window:
  - Fetch candidate papers from ArXiv (cs.LG, cs.CL, cs.AI)
  - Generate embeddings (cached after first generation)
  - Kalman Filtering: Evaluate each paper against physics-inspired constraints:
    - Similarity: Must be ≥ 0.50 to current concept vector
    - Velocity: Concept drift must be < 0.05 (prevents sudden jumps)
    - Acceleration: Change in velocity must be < 0.02 (prevents direction shifts)
  - Accept papers that pass all constraints
  - Update concept vector as weighted mean of accepted papers
- Repeat until end date
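The update step ("weighted mean of accepted papers") might look like the sketch below. Two things here are assumptions rather than facts about tracker.py: the weights are taken to be each paper's similarity score, and the result is re-normalized to unit length so cosine similarities stay comparable between windows.

```python
import math
from typing import List, Sequence


def normalize(vector: Sequence[float]) -> List[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)


def weighted_mean(
    vectors: Sequence[Sequence[float]], weights: Sequence[float]
) -> List[float]:
    """Combine accepted paper embeddings into a new unit-length concept vector."""
    if not vectors or len(vectors) != len(weights):
        raise ValueError("need one weight per vector and at least one vector")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    dim = len(vectors[0])
    mean = [
        sum(w * v[i] for v, w in zip(vectors, weights)) / total
        for i in range(dim)
    ]
    return normalize(mean)
```

Papers with larger weights pull the concept vector further in their direction, so a highly similar paper influences the next window more than a borderline one.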
Kalman Filter Validation
The tracker uses physics-inspired constraints to reject papers that would cause unrealistic concept jumps:
- Similarity < 0.50: Too dissimilar to current concept
- Velocity > 0.05: Concept jumping too fast through embedding space
- Acceleration > 0.02: Sudden change in direction
This ensures smooth, realistic concept evolution tracking.
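A minimal sketch of these three checks is shown below, under stated assumptions: similarity is cosine similarity between the candidate paper and the current concept vector, velocity is the displacement of the concept vector if the candidate were accepted, and acceleration is the change in that displacement relative to the previous step. `check_candidate` is a hypothetical function, the thresholds are plain parameters whose defaults mirror the values quoted above, and the actual logic in kalman_tracker.py may define these quantities differently.

```python
import math
from typing import Optional, Sequence, Tuple

Vector = Sequence[float]


def norm(v: Vector) -> float:
    return math.sqrt(sum(x * x for x in v))


def cosine(a: Vector, b: Vector) -> float:
    denom = norm(a) * norm(b)
    return sum(x * y for x, y in zip(a, b)) / denom if denom else 0.0


def check_candidate(
    concept: Vector,
    candidate: Vector,
    proposed_concept: Vector,
    prev_velocity: Optional[Vector] = None,
    *,
    min_similarity: float = 0.50,
    max_velocity: float = 0.05,
    max_acceleration: float = 0.02,
) -> Tuple[bool, str]:
    """Return (accepted, reason) for one candidate paper."""
    # 1. Similarity: the paper itself must be close enough to the concept.
    if cosine(concept, candidate) < min_similarity:
        return False, "similarity"

    # 2. Velocity: how far the concept would move if this paper were accepted.
    velocity = [p - c for p, c in zip(proposed_concept, concept)]
    if norm(velocity) > max_velocity:
        return False, "velocity"

    # 3. Acceleration: how much that movement differs from the previous step.
    if prev_velocity is not None:
        change = [v - pv for v, pv in zip(velocity, prev_velocity)]
        if norm(change) > max_acceleration:
            return False, "acceleration"

    return True, "accepted"
```

Returning the failed constraint by name makes it easy to log why a paper was rejected, which helps when tuning the thresholds.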
Configuration
The application uses sensible defaults. To customize, edit backend/config.py:
# Kalman Filter Parameters
max_velocity = 1.0 # Max concept drift per time step
max_acceleration = 0.6 # Max change in velocity
# Similarity Thresholds
threshold_auto_include = 0.85 # High confidence (auto-accept)
threshold_strong = 0.75 # Strong confidence
threshold_moderate = 0.60 # Moderate confidence
threshold_reject = 0.50 # Below this = reject
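One way to read the similarity thresholds is as bands that map a score to a confidence label, sketched below. The label names are illustrative, and treating scores between threshold_reject and threshold_moderate as a separate "weak" band is an assumption; the application may group the bands differently.

```python
def confidence_tier(
    similarity: float,
    *,
    auto_include: float = 0.85,
    strong: float = 0.75,
    moderate: float = 0.60,
    reject: float = 0.50,
) -> str:
    """Map a similarity score to a confidence label (labels are hypothetical)."""
    if similarity >= auto_include:
        return "auto_include"
    if similarity >= strong:
        return "strong"
    if similarity >= moderate:
        return "moderate"
    if similarity >= reject:
        return "weak"  # assumed band: above the reject cutoff, below moderate
    return "reject"
```

Checking the thresholds from highest to lowest means each score lands in exactly one band, and changing a single threshold only moves the boundary between two neighbouring bands.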
Tuning Tips
Too strict (rejecting true positives)?
- Increase max_velocity (e.g., 1.5)
- Increase max_acceleration (e.g., 0.8)
- Lower threshold_reject (e.g., 0.45)
Too loose (accepting false positives)?
- Decrease max_velocity (e.g., 0.5)
- Decrease max_acceleration (e.g., 0.3)
- Raise threshold_reject (e.g., 0.60)
Testing
# Run all tests
pytest tests/ -v
# Run specific test file
pytest tests/test_kalman.py -v
# Run integration tests (uses real ArXiv data)
pytest tests/test_arxiv_client.py -v
Project Structure
concept_tracker/
├── app.py                    # Gradio interface
├── backend/
│   ├── gradio_wrapper.py     # Gradio event handlers + export
│   ├── tracker.py            # Main orchestrator
│   ├── arxiv_client.py       # ArXiv API wrapper
│   ├── embedding_service.py  # Embeddings + cache
│   ├── kalman_tracker.py     # Kalman filter logic
│   ├── visualization.py      # t-SNE visualization
│   ├── config.py             # Configuration
│   ├── models.py             # Pydantic data models
│   └── utils/
│       └── cache.py          # Embedding cache
├── cache/                    # Auto-created embedding cache
├── tests/                    # Test suite
├── requirements.txt          # Dependencies
└── README.md                 # This file
Performance
First Run
- Time: 10-15 minutes
- Bottleneck: Model download (~400MB) + initial embedding generation
Subsequent Runs (Cached)
- Time: 2-3 minutes
- Bottleneck: ArXiv API queries + Kalman filtering
Optimizations
- All embeddings permanently cached in cache/embeddings/
- Cache grows ~4KB per paper
- 10,000 papers ≈ 40MB cache
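The get-or-compute pattern behind such a cache can be sketched as below. `EmbeddingCache` is a hypothetical class, and storing one JSON file per paper ID is an assumption made for readability; the project's backend/utils/cache.py may use a different layout or file format.

```python
import json
from pathlib import Path
from typing import Callable, List


class EmbeddingCache:
    """Hypothetical on-disk cache: one JSON file per paper ID."""

    def __init__(self, root: Path) -> None:
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def _path(self, paper_id: str) -> Path:
        # Old-style arXiv IDs contain "/", which is not safe in a file name.
        return self.root / (paper_id.replace("/", "_") + ".json")

    def get_or_compute(
        self, paper_id: str, compute: Callable[[], List[float]]
    ) -> List[float]:
        """Return the cached vector, computing and storing it on a miss."""
        path = self._path(paper_id)
        if path.exists():
            return json.loads(path.read_text())
        vector = compute()
        path.write_text(json.dumps(vector))
        return vector
```

Because `compute` runs only on a cache miss, the embedding model is invoked once per paper no matter how many tracking runs include it.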
HuggingFace Spaces Performance
- Free tier: CPU-only, 16GB RAM, slower but functional
- ZeroGPU option: Faster inference (requires configuration)
Troubleshooting
Model won't download
Solution: Ensure at least 1GB of free disk space. The model downloads to `~/.cache/huggingface/`
ArXiv API errors (429, timeouts)
Solution: Built-in rate limiting (3 sec delay). If errors persist, increase arxiv_rate_limit in config.
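For reference, a fixed-delay rate limiter of the kind described can be sketched as follows. `RateLimiter` is a hypothetical class, not the project's implementation; the 3-second default mirrors the delay mentioned above, and the injectable `clock` and `sleep` arguments exist only to make the sketch testable without waiting.

```python
import time
from typing import Callable, Optional


class RateLimiter:
    """Enforce a minimum interval between consecutive requests."""

    def __init__(
        self,
        min_interval: float = 3.0,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None

    def wait(self) -> None:
        """Block until at least min_interval has passed since the last call."""
        if self._last is not None:
            remaining = self.min_interval - (self._clock() - self._last)
            if remaining > 0:
                self._sleep(remaining)
        self._last = self._clock()
```

Calling `wait()` immediately before each API request keeps requests spaced out; raising `min_interval` has the same effect as raising the configured delay.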
No papers accepted in tracking
Solution:
- Check console logs for rejection reasons
- Lower threshold_reject in config
- Increase max_velocity if velocity rejections are common
Out of memory
Solution: Reduce max_papers_per_window (try 50-100 on CPU, 500-2000 on GPU)
Example: Track Transformer Evolution
- Search: "attention is all you need"
- Select the 2017 paper (arxiv:1706.03762)
- Set end date: 2018-12-31
- Window: 6 months
- Max papers: 500
- Click "Track Concept Evolution"
Expected results:
- Should find BERT and other transformer variants
- Should NOT jump to unrelated NLP (pure RNN papers)
- Similarity should stay above 0.50
- 2-3 time steps with 10-30 papers each
Future Enhancements
- ✅ Linear tracking (current implementation)
- 🔲 Tree branching with HDBSCAN clustering
- 🔲 Bidirectional tracking (trace concepts to origins)
- 🔲 Multi-signal validation (citations, author overlap)
- ✅ Export results to JSON/CSV
License
MIT License
Acknowledgments
- ArXiv for open access to research papers
- HuggingFace for sentence-transformers and Spaces hosting
- Gradio for the web interface framework