markgewhite committed on
Commit 220d27f · 1 Parent(s): f210eef

Updated README for Gradio-only deployment

Files changed (1)
  1. README.md +132 -213
README.md CHANGED
@@ -15,25 +15,22 @@ Track how research concepts evolve over time using AI-powered semantic embedding
15
 
16
  ## Features
17
 
18
- - 🔍 **Search ArXiv papers** by keyword
19
  - 📊 **Track concept evolution** through time windows
20
- - 🧠 **Semantic similarity** with embeddings (MiniLM-L6-v2)
21
- - 📈 **Interactive timeline** visualization
22
- - 🎯 **Kalman filter** for smooth concept tracking
23
- - **Linear concept tracking**: Follow concept evolution from seed papers forward through time
24
- - **Local embeddings**: sentence-transformers (no API costs)
25
- - **Kalman filtering**: Velocity and acceleration constraints prevent unrealistic concept jumps
26
- - **ArXiv integration**: Automatic paper fetching and metadata extraction
27
- - **REST API**: FastAPI backend with JSON responses
28
- - **Comprehensive caching**: Embeddings are cached locally for fast repeated runs
29
 
30
  ## Quick Start
31
 
32
- ### Installation
33
 
34
- 1. **Clone or navigate to the project directory**:
35
  ```bash
36
- cd /Users/markgewhite/Documents/MyFiles/Projects/training/ztm/llm_web_apps/concept_tracker
 
37
  ```
38
 
39
  2. **Create and activate virtual environment**:
@@ -48,136 +45,81 @@ pip install -r requirements.txt
48
  ```
49
 
50
  This will install:
51
- - FastAPI & Uvicorn (web framework)
52
- - Qwen3 embeddings via sentence-transformers
53
  - ArXiv API client
54
- - NumPy, scikit-learn for computations
55
- - Pytest for testing
56
 
57
- **Note**: First run will download the Qwen3 model (~400MB) automatically.
58
 
59
- ### Configuration
60
-
61
- The application uses sensible defaults. To customize, copy `.env.example` to `.env` and edit:
62
 
63
  ```bash
64
- cp .env.example .env
65
  ```
66
 
67
- Key parameters in `backend/config.py`:
68
-
69
- ```python
70
- # Kalman Filter Parameters
71
- max_velocity = 0.05 # Max concept drift per time step
72
- max_acceleration = 0.02 # Max change in velocity
73
-
74
- # Similarity Thresholds
75
- threshold_auto_include = 0.85 # High confidence (auto-accept)
76
- threshold_strong = 0.75 # Moderate confidence
77
- threshold_moderate = 0.65 # Low confidence (minimum)
78
- ```
79
 
80
- ## Usage
81
-
82
- ### Start the Server
83
 
84
  ```bash
85
- uvicorn backend.main:app --reload
86
  ```
87
 
88
- The API will be available at `http://localhost:8000`
89
-
90
- Interactive API documentation: `http://localhost:8000/docs`
91
-
92
- ### API Endpoints
93
-
94
- #### 1. Search Papers
95
-
96
- Find potential seed papers:
97
-
98
- ```bash
99
- curl "http://localhost:8000/api/search?query=attention%20is%20all%20you%20need&limit=5"
100
  ```
101
 
102
- #### 2. Get Single Paper
103
-
104
- Get details for a specific paper:
105
 
 
 
106
  ```bash
107
- curl "http://localhost:8000/api/paper/1706.03762"
 
108
  ```
109
 
110
- #### 3. Track Concept Evolution
111
 
112
- Track a concept from seed papers forward:
113
 
114
- ```bash
115
- curl -X POST "http://localhost:8000/api/track" \
116
- -H "Content-Type: application/json" \
117
- -d '{
118
- "seed_paper_ids": ["1706.03762"],
119
- "end_date": "2018-12-31",
120
- "window_months": 6,
121
- "max_papers_per_window": 50
122
- }'
123
- ```
124
 
125
- **Parameters**:
126
- - `seed_paper_ids`: 1-5 ArXiv IDs to start tracking from
127
- - `end_date`: End date (ISO format: "YYYY-MM-DD")
128
- - `window_months`: Time window size (default: 6 months)
129
- - `max_papers_per_window`: Max papers to fetch per window (default: 50)
130
 
131
- ### Example: Track Transformer Evolution
132
 
133
- ```bash
134
- # Track from "Attention is All You Need" (2017) to end of 2018
135
- curl -X POST "http://localhost:8000/api/track" \
136
- -H "Content-Type: application/json" \
137
- -d '{
138
- "seed_paper_ids": ["1706.03762"],
139
- "end_date": "2018-12-31",
140
- "window_months": 6,
141
- "similarity_threshold": 0.65,
142
- "max_papers_per_window": 50
143
- }' | python -m json.tool
144
- ```
145
 
146
- **Expected output**:
147
- ```json
148
- {
149
- "seed_papers": [...],
150
- "timeline": [
151
- {
152
- "step_number": 1,
153
- "start_date": "2017-06-12T...",
154
- "end_date": "2017-12-12T...",
155
- "papers": [...],
156
- "avg_similarity": 0.78,
157
- "num_high_confidence": 12,
158
- "num_moderate": 8,
159
- "num_low": 3
160
- },
161
- ...
162
- ],
163
- "total_papers": 45,
164
- "num_steps": 3
165
- }
166
- ```
167
 
168
  ## How It Works
169
 
170
  ### Concept Tracking Algorithm
171
 
172
- 1. **Initialization**: Start with 1-5 seed papers (e.g., "Attention is All You Need")
173
- 2. **Embedding**: Generate semantic embeddings (title + abstract) using Qwen3
174
  3. **Time Windows**: Move forward in configurable windows (default: 6 months)
175
  4. **For each window**:
176
- - Fetch candidate papers from ArXiv
177
  - Generate embeddings (cached after first generation)
178
  - **Kalman Filtering**: Evaluate each paper against physics-inspired constraints:
179
- - **Similarity**: Must be > 0.65 to current concept vector
180
- - **Velocity**: Change must be < 0.05 (prevents sudden jumps)
181
  - **Acceleration**: Change in velocity must be < 0.02 (prevents direction shifts)
182
  - Accept papers that pass all constraints
183
  - Update concept vector as weighted mean of accepted papers
@@ -185,20 +127,43 @@ curl -X POST "http://localhost:8000/api/track" \
185
 
186
  ### Kalman Filter Validation
187
 
188
- The tracker rejects papers that would cause unrealistic concept jumps:
189
 
190
- - **Similarity < 0.65**: Too dissimilar to current concept
191
  - **Velocity > 0.05**: Concept jumping too fast through embedding space
192
  - **Acceleration > 0.02**: Sudden change in direction
193
 
194
- Check logs for rejection reasons:
195
- ```bash
196
- uvicorn backend.main:app --log-level=debug
197
  ```
198
 
199
- ## Testing
 
 
 
 
 
200
 
201
- ### Run Tests
 
 
 
 
 
202
 
203
  ```bash
204
  # Run all tests
@@ -207,142 +172,96 @@ pytest tests/ -v
207
  # Run specific test file
208
  pytest tests/test_kalman.py -v
209
 
210
- # Run slow integration tests (uses real ArXiv data)
211
- pytest tests/test_api.py -v -s --tb=short
212
  ```
213
 
214
- ### Test Coverage
215
-
216
- - `test_arxiv_client.py`: ArXiv API integration
217
- - `test_kalman.py`: Kalman filter constraints
218
- - `test_api.py`: FastAPI endpoints
219
-
220
  ## Project Structure
221
 
222
  ```
223
  concept_tracker/
 
224
  ├── backend/
225
- │   ├── __init__.py
226
- │   ├── main.py  # FastAPI app & endpoints
227
- │   ├── config.py  # Kalman parameters & settings
228
- │   ├── models.py  # Pydantic data models
229
- │   ├── arxiv_client.py  # ArXiv API wrapper
230
- │   ├── embedding_service.py  # Qwen3 embeddings + cache
231
- │   ├── kalman_tracker.py  # Core tracking algorithm
232
- │   ├── tracker.py  # Main orchestrator
233
  │   └── utils/
234
- │       ├── __init__.py
235
- │       └── cache.py  # Pickle-based cache
236
- ├── cache/  # Embedding storage (auto-created)
237
- ├── tests/  # Test suite
238
- ├── requirements.txt  # Python dependencies
239
- ├── .env.example  # Configuration template
240
- └── README.md  # This file
241
  ```
242
 
243
  ## Performance
244
 
245
  ### First Run
246
- - **Time**: 10-15 minutes (one-time embedding generation + download)
247
- - **Bottleneck**: Qwen3 model download (~400MB) and embedding generation
248
 
249
  ### Subsequent Runs (Cached)
250
  - **Time**: 2-3 minutes
251
- - **Bottleneck**: ArXiv API queries and Kalman filtering
252
 
253
  ### Optimizations
254
- - All embeddings are permanently cached in `cache/embeddings/`
255
- - Cache grows ~4KB per paper (1024 floats × 4 bytes)
256
- - 10,000 papers = ~40MB cache (acceptable)
257
 
258
- ## Tuning Kalman Parameters
259
-
260
- If tracking results are not satisfactory:
261
-
262
- ### Too Strict (Rejecting True Positives)
263
-
264
- Edit `backend/config.py`:
265
- ```python
266
- max_velocity = 0.07 # Increase from 0.05
267
- max_acceleration = 0.03 # Increase from 0.02
268
- threshold_moderate = 0.60 # Decrease from 0.65
269
- ```
270
-
271
- ### Too Loose (Accepting False Positives)
272
-
273
- Edit `backend/config.py`:
274
- ```python
275
- max_velocity = 0.03 # Decrease from 0.05
276
- max_acceleration = 0.01 # Decrease from 0.02
277
- threshold_moderate = 0.70 # Increase from 0.65
278
- ```
279
-
280
- Restart the server after changes:
281
- ```bash
282
- uvicorn backend.main:app --reload
283
- ```
284
 
285
  ## Troubleshooting
286
 
287
- ### Issue: Qwen3 model won't download
288
-
289
- **Solution**: Ensure you have ~1GB free disk space. Model downloads to `~/.cache/huggingface/`
290
-
291
- ### Issue: ArXiv API errors (429, timeouts)
292
 
293
- **Solution**: The client includes rate limiting (3 sec delay). If you still get errors, increase `arxiv_rate_limit` in config.
294
-
295
- ### Issue: No papers accepted in tracking
296
 
 
297
  **Solution**:
298
- 1. Check logs for rejection reasons
299
- 2. Lower `threshold_moderate` in config
300
  3. Increase `max_velocity` if velocity rejections are common
301
 
302
- ### Issue: Out of memory during embedding
303
-
304
- **Solution**: Reduce `max_papers_per_window` in tracking request
305
 
306
- ## Validation Example
307
 
308
- Test with known concept evolution (Transformers 2017-2018):
 
 
 
 
 
309
 
310
- ```bash
311
- curl -X POST "http://localhost:8000/api/track" \
312
- -H "Content-Type: application/json" \
313
- -d '{
314
- "seed_paper_ids": ["1706.03762"],
315
- "end_date": "2018-06-30",
316
- "window_months": 6,
317
- "max_papers_per_window": 50
318
- }'
319
- ```
320
-
321
- **Expected**:
322
- - Should find BERT-related papers (1810.04805)
323
- - Should find other transformer variants
324
  - Should NOT jump to unrelated NLP (pure RNN papers)
325
- - Similarity should stay above 0.65
326
  - 2-3 time steps with 10-30 papers each
327
 
328
- ## Future Enhancements (Post-MVP)
329
 
330
- - ✅ **Linear tracking** (current MVP)
331
  - 🔲 **Tree branching** with HDBSCAN clustering
332
- - 🔲 **Web UI** with D3.js visualization
333
- - 🔲 **Bidirectional tracking** (trace concepts to their origins)
334
  - 🔲 **Multi-signal validation** (citations, author overlap)
 
335
 
336
  ## License
337
 
338
- MIT License - See LICENSE file
339
-
340
- ## Contributing
341
-
342
- This is an MVP/prototype. For issues or suggestions, please open an issue on GitHub.
343
 
344
  ## Acknowledgments
345
 
346
  - ArXiv for open access to research papers
347
- - Qwen team for the embedding model
348
- - FastAPI and sentence-transformers communities
 
15
 
16
  ## Features
17
 
18
+ - 🔍 **Search ArXiv papers** by keyword with optional year filtering
19
  - 📊 **Track concept evolution** through time windows
20
+ - 🧠 **Semantic embeddings** using sentence-transformers (MiniLM-L6-v2)
21
+ - 📈 **Interactive timeline** visualization with Gradio
22
+ - 🎯 **Kalman filter** constraints prevent unrealistic concept jumps
23
+ - 💾 **Local embeddings cache** - no API costs
24
+ - 🌐 **Works locally and on HuggingFace Spaces**
 
 
 
 
25
 
26
  ## Quick Start
27
 
28
+ ### Local Installation
29
 
30
+ 1. **Clone the repository**:
31
  ```bash
32
+ git clone <your-repo-url>
33
+ cd concept_tracker
34
  ```
35
 
36
  2. **Create and activate virtual environment**:
 
45
  ```
46
 
47
  This will install:
48
+ - Gradio (web interface)
49
+ - sentence-transformers (embeddings)
50
  - ArXiv API client
51
+ - NumPy, scikit-learn (computations)
52
+ - Pytest (testing)
53
 
54
+ **Note**: First run will download the embedding model (~400MB) automatically.
55
 
56
+ ### Run Locally
 
 
57
 
58
  ```bash
59
+ python app.py
60
  ```
61
 
62
+ This will:
63
+ - Start a local server at `http://127.0.0.1:7860`
64
+ - Open the interface in your browser
65
+ - Generate embeddings cache at `cache/embeddings/`
 
 
 
 
 
 
 
 
66
 
67
+ ### Get a Public URL (for showcasing)
 
 
68
 
69
  ```bash
70
+ python app.py --share
71
  ```
72
 
73
+ This generates a temporary public URL (valid for 72 hours) that you can share:
74
+ ```
75
+ Running on local URL: http://127.0.0.1:7860
76
+ Running on public URL: https://abc123xyz.gradio.live ← Share this!
 
 
 
 
 
 
 
 
77
  ```
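
How `app.py` handles `--share` is not shown in this README. A minimal sketch, assuming the flag is simply forwarded to Gradio's `launch(share=...)`; the function names and the placeholder UI are hypothetical:

```python
import argparse


def parse_args(argv=None):
    """Parse command-line flags; only --share is assumed here."""
    parser = argparse.ArgumentParser(description="Concept tracker (Gradio)")
    parser.add_argument(
        "--share",
        action="store_true",
        help="ask Gradio for a temporary public URL",
    )
    return parser.parse_args(argv)


def main(argv=None):
    # Imported lazily so argument parsing works even without Gradio installed.
    import gradio as gr

    args = parse_args(argv)
    with gr.Blocks() as demo:
        gr.Markdown("Concept tracker")  # placeholder UI, not the real interface
    demo.launch(share=args.share)
```

Calling `main()` at the bottom of the script would start the server; without `--share` only the local URL is served.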
78
 
79
+ ### Deploy to HuggingFace Spaces (permanent hosting)
 
 
80
 
81
+ 1. **Create a Space** on [HuggingFace Spaces](https://huggingface.co/spaces)
82
+ 2. **Push your code**:
83
  ```bash
84
+ git remote add hf https://huggingface.co/spaces/YOUR-USERNAME/concept-tracker
85
+ git push hf main
86
  ```
87
 
88
+ Your app will be live at: `https://huggingface.co/spaces/YOUR-USERNAME/concept-tracker`
89
 
90
+ ## How to Use
91
 
92
+ ### 1. Search for Seed Papers
 
 
 
 
 
 
 
 
 
93
 
94
+ - Enter a search query (e.g., "attention is all you need", "diffusion models")
95
+ - Optionally filter by year range
96
+ - Select 1-5 papers that define your concept
 
 
97
 
98
+ ### 2. Configure Tracking
99
 
100
+ - **End Date**: How far forward to track (auto-calculated as seed date + 2 years)
101
+ - **Window Size**: Time window for each step (default: 6 months)
102
+ - **Max Papers**: Papers to fetch per window (500-2000 for GPU, 50-100 for CPU)
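
To make the end date and window size concrete, here is a rough sketch of how the tracking windows could be laid out. The helper names are hypothetical, and clipping the last window to the end date is an assumption, not necessarily what the tracker does:

```python
import calendar
from datetime import date


def add_months(d, months):
    """Shift a date by whole months, clamping the day to the month's length."""
    month_index = d.month - 1 + months
    year = d.year + month_index // 12
    month = month_index % 12 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def time_windows(seed_date, end_date, window_months=6):
    """Return consecutive (start, stop) pairs from seed_date up to end_date."""
    windows = []
    k = 0
    while True:
        start = add_months(seed_date, k * window_months)
        if start >= end_date:
            break
        # Assumption: a final partial window is clipped to end_date.
        stop = min(add_months(seed_date, (k + 1) * window_months), end_date)
        windows.append((start, stop))
        k += 1
    return windows


seed = date(2017, 6, 12)
default_end = add_months(seed, 24)  # seed date + 2 years
windows = time_windows(seed, default_end)
```

With the default end date (seed + 2 years) and 6-month windows this gives four consecutive windows.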
 
 
 
 
 
 
 
 
 
103
 
104
+ ### 3. View Results
105
+
106
+ - Timeline showing concept evolution across time windows
107
+ - Papers ranked by similarity to concept
108
+ - Statistics: high/moderate/low confidence counts
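
The ranking can be pictured as sorting papers by similarity to the concept vector. A small sketch, assuming cosine similarity over embedding vectors (the README does not name the metric) and a hypothetical `rank_by_similarity` helper:

```python
import numpy as np


def rank_by_similarity(concept, papers):
    """Sort papers by cosine similarity to the concept vector, highest first.

    `papers` maps a paper ID to its embedding vector.
    """
    c = np.asarray(concept, dtype=float)
    c = c / np.linalg.norm(c)
    scored = []
    for paper_id, vector in papers.items():
        v = np.asarray(vector, dtype=float)
        scored.append((paper_id, float(np.dot(c, v / np.linalg.norm(v)))))
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy 2-D vectors stand in for real embeddings.
ranked = rank_by_similarity(
    [1.0, 0.0],
    {"a": [1.0, 0.0], "b": [1.0, 1.0], "c": [0.0, 1.0]},
)
```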
109
 
110
  ## How It Works
111
 
112
  ### Concept Tracking Algorithm
113
 
114
+ 1. **Initialization**: Start with 1-5 seed papers that define your concept
115
+ 2. **Embedding**: Generate semantic embeddings (title + abstract) using sentence-transformers
116
  3. **Time Windows**: Move forward in configurable windows (default: 6 months)
117
  4. **For each window**:
118
+ - Fetch candidate papers from ArXiv (cs.LG, cs.CL, cs.AI)
119
  - Generate embeddings (cached after first generation)
120
  - **Kalman Filtering**: Evaluate each paper against physics-inspired constraints:
121
+ - **Similarity**: Must be ≥ 0.50 to current concept vector
122
+ - **Velocity**: Concept drift must be < 0.05 (prevents sudden jumps)
123
  - **Acceleration**: Change in velocity must be < 0.02 (prevents direction shifts)
124
  - Accept papers that pass all constraints
125
  - Update concept vector as weighted mean of accepted papers
 
127
 
128
  ### Kalman Filter Validation
129
 
130
+ The tracker uses physics-inspired constraints to reject papers that would cause unrealistic concept jumps:
131
 
132
+ - **Similarity < 0.50**: Too dissimilar to current concept
133
  - **Velocity > 0.05**: Concept jumping too fast through embedding space
134
  - **Acceleration > 0.02**: Sudden change in direction
135
 
136
+ This ensures smooth, realistic concept evolution tracking.
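
The three checks and the concept update can be sketched as follows. This is an illustration only: it takes similarity, velocity and acceleration as already-computed numbers (how the tracker measures them is not described here), uses the thresholds listed above as defaults, and assumes the updated concept vector is re-normalized to unit length. Both function names are hypothetical.

```python
import numpy as np


def rejection_reason(similarity, velocity, acceleration,
                     min_similarity=0.50, max_velocity=0.05,
                     max_acceleration=0.02):
    """Return the name of the first failed constraint, or None if accepted."""
    if similarity < min_similarity:
        return "similarity"
    if velocity > max_velocity:
        return "velocity"
    if acceleration > max_acceleration:
        return "acceleration"
    return None


def update_concept(paper_vectors, weights):
    """Weighted mean of the accepted embeddings, scaled back to unit length."""
    vectors = np.asarray(paper_vectors, dtype=float)
    mean = np.average(vectors, axis=0, weights=weights)
    return mean / np.linalg.norm(mean)  # re-normalizing is an assumption
```

For example, `rejection_reason(0.7, 0.01, 0.01)` returns `None` (accepted), while `rejection_reason(0.4, 0.01, 0.01)` returns `"similarity"`.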
137
+
138
+ ## Configuration
139
+
140
+ The application uses sensible defaults. To customize, edit `backend/config.py`:
141
+
142
+ ```python
143
+ # Kalman Filter Parameters
144
+ max_velocity = 1.0 # Max concept drift per time step
145
+ max_acceleration = 0.6 # Max change in velocity
146
+
147
+ # Similarity Thresholds
148
+ threshold_auto_include = 0.85 # High confidence (auto-accept)
149
+ threshold_strong = 0.75 # Strong confidence
150
+ threshold_moderate = 0.60 # Moderate confidence
151
+ threshold_reject = 0.50 # Below this = reject
152
  ```
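
As an illustration of how these thresholds partition similarity scores, a small sketch. The function name, the tier labels, the "low" band between `threshold_reject` and `threshold_moderate`, and the inclusive boundaries are all assumptions:

```python
def confidence_tier(similarity, auto_include=0.85, strong=0.75,
                    moderate=0.60, reject=0.50):
    """Map a similarity score to a confidence tier (boundaries inclusive)."""
    if similarity >= auto_include:
        return "auto_include"
    if similarity >= strong:
        return "strong"
    if similarity >= moderate:
        return "moderate"
    if similarity >= reject:
        return "low"  # assumed label for the band just above rejection
    return "reject"
```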
153
 
154
+ ### Tuning Tips
155
+
156
+ **Too strict (rejecting true positives)?**
157
+ - Increase `max_velocity` (e.g., 1.5)
158
+ - Increase `max_acceleration` (e.g., 0.8)
159
+ - Lower `threshold_reject` (e.g., 0.45)
160
 
161
+ **Too loose (accepting false positives)?**
162
+ - Decrease `max_velocity` (e.g., 0.5)
163
+ - Decrease `max_acceleration` (e.g., 0.3)
164
+ - Raise `threshold_reject` (e.g., 0.60)
165
+
166
+ ## Testing
167
 
168
  ```bash
169
  # Run all tests
 
172
  # Run specific test file
173
  pytest tests/test_kalman.py -v
174
 
175
+ # Run integration tests (uses real ArXiv data)
176
+ pytest tests/test_arxiv_client.py -v
177
  ```
178
 
 
 
 
 
 
 
179
  ## Project Structure
180
 
181
  ```
182
  concept_tracker/
183
+ ├── app.py  # Gradio interface
184
  ├── backend/
185
+ │   ├── gradio_wrapper.py  # Gradio event handlers
186
+ │   ├── tracker.py  # Main orchestrator
187
+ │   ├── arxiv_client.py  # ArXiv API wrapper
188
+ │   ├── embedding_service.py  # Embeddings + cache
189
+ │   ├── kalman_tracker.py  # Kalman filter logic
190
+ │   ├── config.py  # Configuration
191
+ │   ├── models.py  # Pydantic data models
192
  │   └── utils/
193
+ │       └── cache.py  # Embedding cache
194
+ ├── cache/  # Auto-created embedding cache
195
+ ├── tests/  # Test suite
196
+ ├── requirements.txt  # Dependencies
197
+ └── README.md  # This file
 
 
198
  ```
199
 
200
  ## Performance
201
 
202
  ### First Run
203
+ - **Time**: 10-15 minutes
204
+ - **Bottleneck**: Model download (~400MB) + initial embedding generation
205
 
206
  ### Subsequent Runs (Cached)
207
  - **Time**: 2-3 minutes
208
+ - **Bottleneck**: ArXiv API queries + Kalman filtering
209
 
210
  ### Optimizations
211
+ - All embeddings permanently cached in `cache/embeddings/`
212
+ - Cache grows ~4KB per paper
213
+ - 10,000 papers ≈ 40MB cache
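
A quick check of the cache arithmetic, assuming each embedding is stored as 1024 float32 values (the figure given by the removed line in this diff). File and pickle overhead are ignored, and a model with a different embedding size would scale the result proportionally:

```python
def cache_size_bytes(num_papers, dimensions=1024, bytes_per_float=4):
    """Raw size of the stored vectors, excluding any file overhead."""
    return num_papers * dimensions * bytes_per_float


per_paper_bytes = cache_size_bytes(1)         # 4096 bytes, about 4KB
total_mb = cache_size_bytes(10_000) / 1e6     # 40.96, about 40MB
```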
214
 
215
+ ### HuggingFace Spaces Performance
216
+ - **Free tier**: CPU-only, 16GB RAM, slower but functional
217
+ - **ZeroGPU option**: Faster inference (requires configuration)
218
 
219
  ## Troubleshooting
220
 
221
+ ### Model won't download
222
+ **Solution**: Ensure ~1GB free disk space. Model downloads to `~/.cache/huggingface/`
 
 
 
223
 
224
+ ### ArXiv API errors (429, timeouts)
225
+ **Solution**: Built-in rate limiting (3 sec delay). If errors persist, increase `arxiv_rate_limit` in config.
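
For reference, a fixed-delay rate limiter of this kind can be sketched as below. The `RateLimiter` class is hypothetical and is not the client's actual code; `min_interval` plays the role of `arxiv_rate_limit`:

```python
import time


class RateLimiter:
    """Enforce a minimum interval between consecutive calls."""

    def __init__(self, min_interval=3.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock    # injectable so the limiter can be tested
        self._sleep = sleep
        self._last_call = None

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_call = now
```

Calling `wait()` before each ArXiv request spaces requests at least `min_interval` seconds apart; raising the interval trades speed for fewer 429 responses.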
 
226
 
227
+ ### No papers accepted in tracking
228
  **Solution**:
229
+ 1. Check console logs for rejection reasons
230
+ 2. Lower `threshold_reject` in config
231
  3. Increase `max_velocity` if velocity rejections are common
232
 
233
+ ### Out of memory
234
+ **Solution**: Reduce `max_papers_per_window` (try 50-100 on CPU, 500-2000 on GPU)
 
235
 
236
+ ## Example: Track Transformer Evolution
237
 
238
+ 1. Search: "attention is all you need"
239
+ 2. Select the 2017 paper (arxiv:1706.03762)
240
+ 3. Set end date: 2018-12-31
241
+ 4. Window: 6 months
242
+ 5. Max papers: 500
243
+ 6. Click "Track Concept Evolution"
244
 
245
+ **Expected results**:
246
+ - Should find BERT and other transformer variants
247
  - Should NOT jump to unrelated NLP (pure RNN papers)
248
+ - Similarity should stay above 0.50
249
  - 2-3 time steps with 10-30 papers each
250
 
251
+ ## Future Enhancements
252
 
253
+ - ✅ **Linear tracking** (current implementation)
254
  - 🔲 **Tree branching** with HDBSCAN clustering
255
+ - 🔲 **Bidirectional tracking** (trace concepts to origins)
256
  - 🔲 **Multi-signal validation** (citations, author overlap)
257
+ - 🔲 **Export results** to JSON/CSV
258
 
259
  ## License
260
 
261
+ MIT License
 
 
 
 
262
 
263
  ## Acknowledgments
264
 
265
  - ArXiv for open access to research papers
266
+ - HuggingFace for sentence-transformers and Spaces hosting
267
+ - Gradio for the web interface framework