Commit ·
220d27f
1
Parent(s): f210eef
Updated README for Gradio-only deployment
README.md
CHANGED
|
@@ -15,25 +15,22 @@ Track how research concepts evolve over time using AI-powered semantic embedding
|
|
| 15 |
|
| 16 |
## Features
|
| 17 |
|
| 18 |
-
- **Search ArXiv papers** by keyword
|
| 19 |
- **Track concept evolution** through time windows
|
| 20 |
-
- 🧠 **Semantic
|
| 21 |
-
- **Interactive timeline** visualization
|
| 22 |
-
- 🎯 **Kalman filter**
|
| 23 |
-
- **
|
| 24 |
-
-
|
| 25 |
-
- **Kalman filtering**: Velocity and acceleration constraints prevent unrealistic concept jumps
|
| 26 |
-
- **ArXiv integration**: Automatic paper fetching and metadata extraction
|
| 27 |
-
- **REST API**: FastAPI backend with JSON responses
|
| 28 |
-
- **Comprehensive caching**: Embeddings are cached locally for fast repeated runs
|
| 29 |
|
| 30 |
## Quick Start
|
| 31 |
|
| 32 |
-
### Installation
|
| 33 |
|
| 34 |
-
1. **Clone
|
| 35 |
```bash
|
| 36 |
-
|
|
|
|
| 37 |
```
|
| 38 |
|
| 39 |
2. **Create and activate a virtual environment**:
|
|
@@ -48,136 +45,81 @@ pip install -r requirements.txt
|
|
| 48 |
```
|
| 49 |
|
| 50 |
This will install:
|
| 51 |
-
-
|
| 52 |
-
-
|
| 53 |
- ArXiv API client
|
| 54 |
-
- NumPy, scikit-learn
|
| 55 |
-
- Pytest
|
| 56 |
|
| 57 |
-
**Note**: First run will download the
|
| 58 |
|
| 59 |
-
###
|
| 60 |
-
|
| 61 |
-
The application uses sensible defaults. To customize, copy `.env.example` to `.env` and edit:
|
| 62 |
|
| 63 |
```bash
|
| 64 |
-
|
| 65 |
```
|
| 66 |
|
| 67 |
-
|
| 68 |
-
|
| 69 |
-
|
| 70 |
-
|
| 71 |
-
max_velocity = 0.05 # Max concept drift per time step
|
| 72 |
-
max_acceleration = 0.02 # Max change in velocity
|
| 73 |
-
|
| 74 |
-
# Similarity Thresholds
|
| 75 |
-
threshold_auto_include = 0.85 # High confidence (auto-accept)
|
| 76 |
-
threshold_strong = 0.75 # Moderate confidence
|
| 77 |
-
threshold_moderate = 0.65 # Low confidence (minimum)
|
| 78 |
-
```
|
| 79 |
|
| 80 |
-
##
|
| 81 |
-
|
| 82 |
-
### Start the Server
|
| 83 |
|
| 84 |
```bash
|
| 85 |
-
|
| 86 |
```
|
| 87 |
|
| 88 |
-
|
| 89 |
-
|
| 90 |
-
|
| 91 |
-
|
| 92 |
-
### API Endpoints
|
| 93 |
-
|
| 94 |
-
#### 1. Search Papers
|
| 95 |
-
|
| 96 |
-
Find potential seed papers:
|
| 97 |
-
|
| 98 |
-
```bash
|
| 99 |
-
curl "http://localhost:8000/api/search?query=attention%20is%20all%20you%20need&limit=5"
|
| 100 |
```
|
| 101 |
|
| 102 |
-
###
|
| 103 |
-
|
| 104 |
-
Get details for a specific paper:
|
| 105 |
|
|
|
|
|
|
|
| 106 |
```bash
|
| 107 |
-
|
|
|
|
| 108 |
```
|
| 109 |
|
| 110 |
-
|
| 111 |
|
| 112 |
-
|
| 113 |
|
| 114 |
-
|
| 115 |
-
curl -X POST "http://localhost:8000/api/track" \
|
| 116 |
-
-H "Content-Type: application/json" \
|
| 117 |
-
-d '{
|
| 118 |
-
"seed_paper_ids": ["1706.03762"],
|
| 119 |
-
"end_date": "2018-12-31",
|
| 120 |
-
"window_months": 6,
|
| 121 |
-
"max_papers_per_window": 50
|
| 122 |
-
}'
|
| 123 |
-
```
|
| 124 |
|
| 125 |
-
|
| 126 |
-
-
|
| 127 |
-
-
|
| 128 |
-
- `window_months`: Time window size (default: 6 months)
|
| 129 |
-
- `max_papers_per_window`: Max papers to fetch per window (default: 50)
|
| 130 |
|
| 131 |
-
###
|
| 132 |
|
| 133 |
-
|
| 134 |
-
|
| 135 |
-
|
| 136 |
-
-H "Content-Type: application/json" \
|
| 137 |
-
-d '{
|
| 138 |
-
"seed_paper_ids": ["1706.03762"],
|
| 139 |
-
"end_date": "2018-12-31",
|
| 140 |
-
"window_months": 6,
|
| 141 |
-
"similarity_threshold": 0.65,
|
| 142 |
-
"max_papers_per_window": 50
|
| 143 |
-
}' | python -m json.tool
|
| 144 |
-
```
|
| 145 |
|
| 146 |
-
|
| 147 |
-
|
| 148 |
-
|
| 149 |
-
|
| 150 |
-
|
| 151 |
-
{
|
| 152 |
-
"step_number": 1,
|
| 153 |
-
"start_date": "2017-06-12T...",
|
| 154 |
-
"end_date": "2017-12-12T...",
|
| 155 |
-
"papers": [...],
|
| 156 |
-
"avg_similarity": 0.78,
|
| 157 |
-
"num_high_confidence": 12,
|
| 158 |
-
"num_moderate": 8,
|
| 159 |
-
"num_low": 3
|
| 160 |
-
},
|
| 161 |
-
...
|
| 162 |
-
],
|
| 163 |
-
"total_papers": 45,
|
| 164 |
-
"num_steps": 3
|
| 165 |
-
}
|
| 166 |
-
```
|
| 167 |
|
| 168 |
## How It Works
|
| 169 |
|
| 170 |
### Concept Tracking Algorithm
|
| 171 |
|
| 172 |
-
1. **Initialization**: Start with 1-5 seed papers
|
| 173 |
-
2. **Embedding**: Generate semantic embeddings (title + abstract) using
|
| 174 |
3. **Time Windows**: Move forward in configurable windows (default: 6 months)
|
| 175 |
4. **For each window**:
|
| 176 |
-
- Fetch candidate papers from ArXiv
|
| 177 |
- Generate embeddings (cached after first generation)
|
| 178 |
- **Kalman Filtering**: Evaluate each paper against physics-inspired constraints:
|
| 179 |
-
- **Similarity**: Must be
|
| 180 |
-
- **Velocity**:
|
| 181 |
- **Acceleration**: Change in velocity must be < 0.02 (prevents direction shifts)
|
| 182 |
- Accept papers that pass all constraints
|
| 183 |
- Update concept vector as weighted mean of accepted papers
|
|
@@ -185,20 +127,43 @@ curl -X POST "http://localhost:8000/api/track" \
|
|
| 185 |
|
| 186 |
### Kalman Filter Validation
|
| 187 |
|
| 188 |
-
The tracker
|
| 189 |
|
| 190 |
-
- **Similarity < 0.
|
| 191 |
- **Velocity > 0.05**: Concept jumping too fast through embedding space
|
| 192 |
- **Acceleration > 0.02**: Sudden change in direction
|
| 193 |
|
| 194 |
-
|
| 195 |
-
|
| 196 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 197 |
```
|
| 198 |
|
| 199 |
-
##
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 200 |
|
| 201 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 202 |
|
| 203 |
```bash
|
| 204 |
# Run all tests
|
|
@@ -207,142 +172,96 @@ pytest tests/ -v
|
|
| 207 |
# Run specific test file
|
| 208 |
pytest tests/test_kalman.py -v
|
| 209 |
|
| 210 |
-
# Run
|
| 211 |
-
pytest tests/
|
| 212 |
```
|
| 213 |
|
| 214 |
-
### Test Coverage
|
| 215 |
-
|
| 216 |
-
- `test_arxiv_client.py`: ArXiv API integration
|
| 217 |
-
- `test_kalman.py`: Kalman filter constraints
|
| 218 |
-
- `test_api.py`: FastAPI endpoints
|
| 219 |
-
|
| 220 |
## Project Structure
|
| 221 |
|
| 222 |
```
|
| 223 |
concept_tracker/
|
|
|
|
| 224 |
├── backend/
|
| 225 |
-
│   ├──
|
| 226 |
-
│   ├──
|
| 227 |
-
│   ├──
|
| 228 |
-
│   ├──
|
| 229 |
-
│   ├──
|
| 230 |
-
│   ├──
|
| 231 |
-
│   ├──
|
| 232 |
-
│   ├── tracker.py # Main orchestrator
|
| 233 |
│   └── utils/
|
| 234 |
-
│
|
| 235 |
-
|
| 236 |
-
├──
|
| 237 |
-
├──
|
| 238 |
-
|
| 239 |
-
├── .env.example # Configuration template
|
| 240 |
-
└── README.md # This file
|
| 241 |
```
|
| 242 |
|
| 243 |
## Performance
|
| 244 |
|
| 245 |
### First Run
|
| 246 |
-
- **Time**: 10-15 minutes
|
| 247 |
-
- **Bottleneck**:
|
| 248 |
|
| 249 |
### Subsequent Runs (Cached)
|
| 250 |
- **Time**: 2-3 minutes
|
| 251 |
-
- **Bottleneck**: ArXiv API queries
|
| 252 |
|
| 253 |
### Optimizations
|
| 254 |
-
- All embeddings
|
| 255 |
-
- Cache grows ~4KB per paper
|
| 256 |
-
- 10,000 papers
|
| 257 |
|
| 258 |
-
##
|
| 259 |
-
|
| 260 |
-
|
| 261 |
-
|
| 262 |
-
### Too Strict (Rejecting True Positives)
|
| 263 |
-
|
| 264 |
-
Edit `backend/config.py`:
|
| 265 |
-
```python
|
| 266 |
-
max_velocity = 0.07 # Increase from 0.05
|
| 267 |
-
max_acceleration = 0.03 # Increase from 0.02
|
| 268 |
-
threshold_moderate = 0.60 # Decrease from 0.65
|
| 269 |
-
```
|
| 270 |
-
|
| 271 |
-
### Too Loose (Accepting False Positives)
|
| 272 |
-
|
| 273 |
-
Edit `backend/config.py`:
|
| 274 |
-
```python
|
| 275 |
-
max_velocity = 0.03 # Decrease from 0.05
|
| 276 |
-
max_acceleration = 0.01 # Decrease from 0.02
|
| 277 |
-
threshold_moderate = 0.70 # Increase from 0.65
|
| 278 |
-
```
|
| 279 |
-
|
| 280 |
-
Restart the server after changes:
|
| 281 |
-
```bash
|
| 282 |
-
uvicorn backend.main:app --reload
|
| 283 |
-
```
|
| 284 |
|
| 285 |
## Troubleshooting
|
| 286 |
|
| 287 |
-
###
|
| 288 |
-
|
| 289 |
-
**Solution**: Ensure you have ~1GB free disk space. Model downloads to `~/.cache/huggingface/`
|
| 290 |
-
|
| 291 |
-
### Issue: ArXiv API errors (429, timeouts)
|
| 292 |
|
| 293 |
-
|
| 294 |
-
|
| 295 |
-
### Issue: No papers accepted in tracking
|
| 296 |
|
|
|
|
| 297 |
**Solution**:
|
| 298 |
-
1. Check logs for rejection reasons
|
| 299 |
-
2. Lower `
|
| 300 |
3. Increase `max_velocity` if velocity rejections are common
|
| 301 |
|
| 302 |
-
###
|
| 303 |
-
|
| 304 |
-
**Solution**: Reduce `max_papers_per_window` in tracking request
|
| 305 |
|
| 306 |
-
##
|
| 307 |
|
| 308 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 309 |
|
| 310 |
-
|
| 311 |
-
|
| 312 |
-
-H "Content-Type: application/json" \
|
| 313 |
-
-d '{
|
| 314 |
-
"seed_paper_ids": ["1706.03762"],
|
| 315 |
-
"end_date": "2018-06-30",
|
| 316 |
-
"window_months": 6,
|
| 317 |
-
"max_papers_per_window": 50
|
| 318 |
-
}'
|
| 319 |
-
```
|
| 320 |
-
|
| 321 |
-
**Expected**:
|
| 322 |
-
- Should find BERT-related papers (1810.04805)
|
| 323 |
-
- Should find other transformer variants
|
| 324 |
- Should NOT jump to unrelated NLP (pure RNN papers)
|
| 325 |
-
- Similarity should stay above 0.
|
| 326 |
- 2-3 time steps with 10-30 papers each
|
| 327 |
|
| 328 |
-
## Future Enhancements
|
| 329 |
|
| 330 |
-
- ✅ **Linear tracking** (current
|
| 331 |
- 🔲 **Tree branching** with HDBSCAN clustering
|
| 332 |
-
- 🔲 **
|
| 333 |
-
- 🔲 **Bidirectional tracking** (trace concepts to their origins)
|
| 334 |
- 🔲 **Multi-signal validation** (citations, author overlap)
|
|
|
|
| 335 |
|
| 336 |
## License
|
| 337 |
|
| 338 |
-
MIT License
|
| 339 |
-
|
| 340 |
-
## Contributing
|
| 341 |
-
|
| 342 |
-
This is an MVP/prototype. For issues or suggestions, please open an issue on GitHub.
|
| 343 |
|
| 344 |
## Acknowledgments
|
| 345 |
|
| 346 |
- ArXiv for open access to research papers
|
| 347 |
-
-
|
| 348 |
-
-
|
|
|
|
| 15 |
|
| 16 |
## Features
|
| 17 |
|
| 18 |
+
- **Search ArXiv papers** by keyword with optional year filtering
|
| 19 |
- **Track concept evolution** through time windows
|
| 20 |
+
- 🧠 **Semantic embeddings** using sentence-transformers (MiniLM-L6-v2)
|
| 21 |
+
- **Interactive timeline** visualization with Gradio
|
| 22 |
+
- 🎯 **Kalman filter** constraints prevent unrealistic concept jumps
|
| 23 |
+
- 💾 **Local embeddings cache** - no API costs
|
| 24 |
+
- **Works locally and on HuggingFace Spaces**
|
|
|
|
|
|
|
|
|
|
|
|
|
| 25 |
|
| 26 |
## Quick Start
|
| 27 |
|
| 28 |
+
### Local Installation
|
| 29 |
|
| 30 |
+
1. **Clone the repository**:
|
| 31 |
```bash
|
| 32 |
+
git clone <your-repo-url>
|
| 33 |
+
cd concept_tracker
|
| 34 |
```
|
| 35 |
|
| 36 |
2. **Create and activate a virtual environment**:
|
|
|
|
| 45 |
```
|
| 46 |
|
| 47 |
This will install:
|
| 48 |
+
- Gradio (web interface)
|
| 49 |
+
- sentence-transformers (embeddings)
|
| 50 |
- ArXiv API client
|
| 51 |
+
- NumPy, scikit-learn (computations)
|
| 52 |
+
- Pytest (testing)
|
| 53 |
|
| 54 |
+
**Note**: First run will download the embedding model (~400MB) automatically.
|
| 55 |
|
| 56 |
+
### Run Locally
|
|
|
|
|
|
|
| 57 |
|
| 58 |
```bash
|
| 59 |
+
python app.py
|
| 60 |
```
|
| 61 |
|
| 62 |
+
This will:
|
| 63 |
+
- Start a local server at `http://127.0.0.1:7860`
|
| 64 |
+
- Open the interface in your browser
|
| 65 |
+
- Generate embeddings cache at `cache/embeddings/`
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 66 |
|
| 67 |
+
### Get a Public URL (for showcasing)
|
|
|
|
|
|
|
| 68 |
|
| 69 |
```bash
|
| 70 |
+
python app.py --share
|
| 71 |
```
|
| 72 |
|
| 73 |
+
This generates a temporary public URL (valid for 72 hours) that you can share:
|
| 74 |
+
```
|
| 75 |
+
Running on local URL: http://127.0.0.1:7860
|
| 76 |
+
Running on public URL: https://abc123xyz.gradio.live ← Share this!
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 77 |
```
|
| 78 |
|
| 79 |
+
### Deploy to HuggingFace Spaces (permanent hosting)
|
|
|
|
|
|
|
| 80 |
|
| 81 |
+
1. **Create a Space** on [HuggingFace Spaces](https://huggingface.co/spaces)
|
| 82 |
+
2. **Push your code**:
|
| 83 |
```bash
|
| 84 |
+
git remote add hf https://huggingface.co/spaces/YOUR-USERNAME/concept-tracker
|
| 85 |
+
git push hf main
|
| 86 |
```
|
| 87 |
|
| 88 |
+
Your app will be live at: `https://huggingface.co/spaces/YOUR-USERNAME/concept-tracker`
|
| 89 |
|
| 90 |
+
## How to Use
|
| 91 |
|
| 92 |
+
### 1. Search for Seed Papers
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 93 |
|
| 94 |
+
- Enter a search query (e.g., "attention is all you need", "diffusion models")
|
| 95 |
+
- Optionally filter by year range
|
| 96 |
+
- Select 1-5 papers that define your concept
|
|
|
|
|
|
|
| 97 |
|
| 98 |
+
### 2. Configure Tracking
|
| 99 |
|
| 100 |
+
- **End Date**: How far forward to track (auto-calculated as seed date + 2 years)
|
| 101 |
+
- **Window Size**: Time window for each step (default: 6 months)
|
| 102 |
+
- **Max Papers**: Papers to fetch per window (500-2000 for GPU, 50-100 for CPU)
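As a rough illustration of how the end date and window size interact, the sketch below splits the span between a seed date and an end date into consecutive windows. The helper names `add_months` and `time_windows` are hypothetical and not taken from this repository; the actual tracker may compute its windows differently.

```python
import calendar
from datetime import date


def add_months(d: date, months: int) -> date:
    """Shift a date forward by whole months, clamping to the month's last day."""
    month_index = d.month - 1 + months
    year = d.year + month_index // 12
    month = month_index % 12 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(d.day, last_day))


def time_windows(seed_date: date, end_date: date, window_months: int = 6):
    """Split the span from seed_date to end_date into consecutive windows."""
    if window_months < 1:
        raise ValueError("window_months must be at least 1")
    windows = []
    start = seed_date
    while start < end_date:
        # The final window is cut short so it never runs past end_date.
        stop = min(add_months(start, window_months), end_date)
        windows.append((start, stop))
        start = stop
    return windows


if __name__ == "__main__":
    seed = date(2017, 6, 12)
    # Assumed default from the list above: end date = seed date + 2 years.
    print("default end date:", add_months(seed, 24))
    for start, stop in time_windows(seed, date(2018, 12, 31)):
        print(start, "->", stop)
```

With a seed dated 2017-06-12 and an end date of 2018-12-31, six-month windows give three full steps plus a shorter final one.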
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 103 |
|
| 104 |
+
### 3. View Results
|
| 105 |
+
|
| 106 |
+
- Timeline showing concept evolution across time windows
|
| 107 |
+
- Papers ranked by similarity to concept
|
| 108 |
+
- Statistics: high/moderate/low confidence counts
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 109 |
|
| 110 |
## How It Works
|
| 111 |
|
| 112 |
### Concept Tracking Algorithm
|
| 113 |
|
| 114 |
+
1. **Initialization**: Start with 1-5 seed papers that define your concept
|
| 115 |
+
2. **Embedding**: Generate semantic embeddings (title + abstract) using sentence-transformers
|
| 116 |
3. **Time Windows**: Move forward in configurable windows (default: 6 months)
|
| 117 |
4. **For each window**:
|
| 118 |
+
- Fetch candidate papers from ArXiv (cs.LG, cs.CL, cs.AI)
|
| 119 |
- Generate embeddings (cached after first generation)
|
| 120 |
- **Kalman Filtering**: Evaluate each paper against physics-inspired constraints:
|
| 121 |
+
- **Similarity**: Must be ≥ 0.50 to current concept vector
|
| 122 |
+
- **Velocity**: Concept drift must be < 0.05 (prevents sudden jumps)
|
| 123 |
- **Acceleration**: Change in velocity must be < 0.02 (prevents direction shifts)
|
| 124 |
- Accept papers that pass all constraints
|
| 125 |
- Update concept vector as weighted mean of accepted papers
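The per-window acceptance step above can be sketched as follows. This is a minimal illustration, not the code in `backend/kalman_tracker.py`: the names `Limits`, `evaluate_candidate`, and `update_concept` are hypothetical, velocity is assumed to be the step from the current concept vector to the candidate, and acceleration is assumed to be the change relative to the previous step. Thresholds are passed in explicitly rather than hard-coded.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Limits:
    min_similarity: float    # reject below this cosine similarity
    max_velocity: float      # reject at or above this step size
    max_acceleration: float  # reject at or above this change in step


def norm(v):
    return math.sqrt(sum(x * x for x in v))


def cosine_similarity(a, b):
    denom = norm(a) * norm(b)
    if denom == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / denom


def evaluate_candidate(concept, prev_velocity, candidate, limits):
    """Return (accepted, reason) for one candidate embedding."""
    if cosine_similarity(concept, candidate) < limits.min_similarity:
        return False, "similarity"
    # Assumption: velocity is the step from the concept vector to the candidate.
    velocity = [c - k for c, k in zip(candidate, concept)]
    if norm(velocity) >= limits.max_velocity:
        return False, "velocity"
    # Assumption: acceleration is the change relative to the previous step.
    acceleration = [v - p for v, p in zip(velocity, prev_velocity)]
    if norm(acceleration) >= limits.max_acceleration:
        return False, "acceleration"
    return True, "accepted"


def update_concept(vectors, weights):
    """Weighted mean of the accepted embeddings."""
    total = sum(weights)
    if not vectors or total == 0:
        raise ValueError("need at least one vector with non-zero total weight")
    dims = len(vectors[0])
    return [
        sum(w * v[i] for v, w in zip(vectors, weights)) / total
        for i in range(dims)
    ]
```

Returning the rejection reason alongside the verdict makes it easy to log why a paper was dropped, which helps when tuning the thresholds.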
|
|
|
|
| 127 |
|
| 128 |
### Kalman Filter Validation
|
| 129 |
|
| 130 |
+
The tracker uses physics-inspired constraints to reject papers that would cause unrealistic concept jumps:
|
| 131 |
|
| 132 |
+
- **Similarity < 0.50**: Too dissimilar to current concept
|
| 133 |
- **Velocity > 0.05**: Concept jumping too fast through embedding space
|
| 134 |
- **Acceleration > 0.02**: Sudden change in direction
|
| 135 |
|
| 136 |
+
This ensures smooth, realistic concept evolution tracking.
|
| 137 |
+
|
| 138 |
+
## Configuration
|
| 139 |
+
|
| 140 |
+
The application uses sensible defaults. To customize, edit `backend/config.py`:
|
| 141 |
+
|
| 142 |
+
```python
|
| 143 |
+
# Kalman Filter Parameters
|
| 144 |
+
max_velocity = 1.0 # Max concept drift per time step
|
| 145 |
+
max_acceleration = 0.6 # Max change in velocity
|
| 146 |
+
|
| 147 |
+
# Similarity Thresholds
|
| 148 |
+
threshold_auto_include = 0.85 # High confidence (auto-accept)
|
| 149 |
+
threshold_strong = 0.75 # Strong confidence
|
| 150 |
+
threshold_moderate = 0.60 # Moderate confidence
|
| 151 |
+
threshold_reject = 0.50 # Below this = reject
|
| 152 |
```
|
| 153 |
|
| 154 |
+
### Tuning Tips
|
| 155 |
+
|
| 156 |
+
**Too strict (rejecting true positives)?**
|
| 157 |
+
- Increase `max_velocity` (e.g., 1.5)
|
| 158 |
+
- Increase `max_acceleration` (e.g., 0.8)
|
| 159 |
+
- Lower `threshold_reject` (e.g., 0.45)
|
| 160 |
|
| 161 |
+
**Too loose (accepting false positives)?**
|
| 162 |
+
- Decrease `max_velocity` (e.g., 0.5)
|
| 163 |
+
- Decrease `max_acceleration` (e.g., 0.3)
|
| 164 |
+
- Raise `threshold_reject` (e.g., 0.60)
|
| 165 |
+
|
| 166 |
+
## Testing
|
| 167 |
|
| 168 |
```bash
|
| 169 |
# Run all tests
|
|
|
|
| 172 |
# Run specific test file
|
| 173 |
pytest tests/test_kalman.py -v
|
| 174 |
|
| 175 |
+
# Run integration tests (uses real ArXiv data)
|
| 176 |
+
pytest tests/test_arxiv_client.py -v
|
| 177 |
```
|
| 178 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 179 |
## Project Structure
|
| 180 |
|
| 181 |
```
|
| 182 |
concept_tracker/
|
| 183 |
+
├── app.py # Gradio interface
|
| 184 |
├── backend/
|
| 185 |
+
│   ├── gradio_wrapper.py # Gradio event handlers
|
| 186 |
+
│   ├── tracker.py # Main orchestrator
|
| 187 |
+
│   ├── arxiv_client.py # ArXiv API wrapper
|
| 188 |
+
│   ├── embedding_service.py # Embeddings + cache
|
| 189 |
+
│   ├── kalman_tracker.py # Kalman filter logic
|
| 190 |
+
│   ├── config.py # Configuration
|
| 191 |
+
│   ├── models.py # Pydantic data models
|
|
|
|
| 192 |
│   └── utils/
|
| 193 |
+
│       └── cache.py # Embedding cache
|
| 194 |
+
├── cache/ # Auto-created embedding cache
|
| 195 |
+
├── tests/ # Test suite
|
| 196 |
+
├── requirements.txt # Dependencies
|
| 197 |
+
└── README.md # This file
|
|
|
|
|
|
|
| 198 |
```
|
| 199 |
|
| 200 |
## Performance
|
| 201 |
|
| 202 |
### First Run
|
| 203 |
+
- **Time**: 10-15 minutes
|
| 204 |
+
- **Bottleneck**: Model download (~400MB) + initial embedding generation
|
| 205 |
|
| 206 |
### Subsequent Runs (Cached)
|
| 207 |
- **Time**: 2-3 minutes
|
| 208 |
+
- **Bottleneck**: ArXiv API queries + Kalman filtering
|
| 209 |
|
| 210 |
### Optimizations
|
| 211 |
+
- All embeddings permanently cached in `cache/embeddings/`
|
| 212 |
+
- Cache grows ~4KB per paper
|
| 213 |
+
- 10,000 papers ≈ 40MB cache
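A minimal sketch of a disk-backed embedding cache of this kind, assuming one file per paper keyed by a hash of the paper ID. The class name `EmbeddingCache`, the JSON file format, and the `compute` callback are illustrative assumptions; the repository's `backend/utils/cache.py` may store embeddings differently, so file sizes here will not necessarily match the figures above.

```python
import hashlib
import json
from pathlib import Path


class EmbeddingCache:
    """Stores one embedding per paper as a JSON file under a root directory."""

    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def _path(self, paper_id: str) -> Path:
        # Hash the ID so any characters in it are safe to use as a file name.
        digest = hashlib.sha256(paper_id.encode("utf-8")).hexdigest()
        return self.root / f"{digest}.json"

    def get_or_compute(self, paper_id: str, compute):
        """Return the cached vector, or call compute() once and store the result."""
        path = self._path(paper_id)
        if path.exists():
            return json.loads(path.read_text(encoding="utf-8"))
        vector = [float(x) for x in compute()]
        path.write_text(json.dumps(vector), encoding="utf-8")
        return vector
```

Passing the embedding step in as a callback keeps the cache independent of the model, so the expensive call only happens on a cache miss.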
|
| 214 |
|
| 215 |
+
### HuggingFace Spaces Performance
|
| 216 |
+
- **Free tier**: CPU-only, 16GB RAM, slower but functional
|
| 217 |
+
- **ZeroGPU option**: Faster inference (requires configuration)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 218 |
|
| 219 |
## Troubleshooting
|
| 220 |
|
| 221 |
+
### Model won't download
|
| 222 |
+
**Solution**: Ensure you have ~1GB of free disk space. The model downloads to `~/.cache/huggingface/`.
|
|
|
|
|
|
|
|
|
|
| 223 |
|
| 224 |
+
### ArXiv API errors (429, timeouts)
|
| 225 |
+
**Solution**: The client has built-in rate limiting (3-second delay). If errors persist, increase `arxiv_rate_limit` in the config.
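Delay-based rate limiting of this kind can be approximated as follows. This is a sketch only: `RateLimiter` is a hypothetical helper, not the project's actual client code, and the 3-second interval in the usage line is taken from the note above. The clock and sleep functions are injectable so the behaviour can be checked without real waiting.

```python
import time


class RateLimiter:
    """Enforces a minimum interval between consecutive calls to wait()."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Block until the interval has elapsed; return the seconds slept."""
        now = self._clock()
        slept = 0.0
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                slept = remaining
                now = self._clock()
        self._last = now
        return slept


if __name__ == "__main__":
    limiter = RateLimiter(3.0)
    for query in ["attention is all you need", "diffusion models"]:
        limiter.wait()  # call before each request to the API
        print("would query:", query)
```

Raising the interval trades throughput for fewer 429 responses.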
|
|
|
|
| 226 |
|
| 227 |
+
### No papers accepted in tracking
|
| 228 |
**Solution**:
|
| 229 |
+
1. Check console logs for rejection reasons
|
| 230 |
+
2. Lower `threshold_reject` in config
|
| 231 |
3. Increase `max_velocity` if velocity rejections are common
|
| 232 |
|
| 233 |
+
### Out of memory
|
| 234 |
+
**Solution**: Reduce `max_papers_per_window` (try 50-100 on CPU, 500-2000 on GPU)
|
|
|
|
| 235 |
|
| 236 |
+
## Example: Track Transformer Evolution
|
| 237 |
|
| 238 |
+
1. Search: "attention is all you need"
|
| 239 |
+
2. Select the 2017 paper (arxiv:1706.03762)
|
| 240 |
+
3. Set end date: 2018-12-31
|
| 241 |
+
4. Window: 6 months
|
| 242 |
+
5. Max papers: 500
|
| 243 |
+
6. Click "Track Concept Evolution"
|
| 244 |
|
| 245 |
+
**Expected results**:
|
| 246 |
+
- Should find BERT and other transformer variants
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 247 |
- Should NOT jump to unrelated NLP (pure RNN papers)
|
| 248 |
+
- Similarity should stay above 0.50
|
| 249 |
- 2-3 time steps with 10-30 papers each
|
| 250 |
|
| 251 |
+
## Future Enhancements
|
| 252 |
|
| 253 |
+
- ✅ **Linear tracking** (current implementation)
|
| 254 |
- 🔲 **Tree branching** with HDBSCAN clustering
|
| 255 |
+
- 🔲 **Bidirectional tracking** (trace concepts to origins)
|
|
|
|
| 256 |
- 🔲 **Multi-signal validation** (citations, author overlap)
|
| 257 |
+
- 🔲 **Export results** to JSON/CSV
|
| 258 |
|
| 259 |
## License
|
| 260 |
|
| 261 |
+
MIT License
|
|
|
|
|
|
|
|
|
|
|
|
|
| 262 |
|
| 263 |
## Acknowledgments
|
| 264 |
|
| 265 |
- ArXiv for open access to research papers
|
| 266 |
+
- HuggingFace for sentence-transformers and Spaces hosting
|
| 267 |
+
- Gradio for the web interface framework
|