| # Vocabulary Optimization & Unification |
|
|
| ## Problem Solved |
|
|
| Previously, the crossword system had **vocabulary redundancy**, drawing on three separate sources: |
| - **SentenceTransformer Model Vocabulary**: ~30K tokens → ~8-12K actual words after filtering |
| - **NLTK Words Corpus**: 41,998 words for embeddings in the thematic generator |
| - **WordFreq Database**: 319,938 words for frequency data |
|
|
| This created inconsistencies, memory waste, and limited vocabulary coverage. |
|
|
| ## Solution: Unified Architecture |
|
|
| ### New Design |
| - **Single Vocabulary Source**: WordFreq database (319,938 words) |
| - **Single Embedding Model**: all-mpnet-base-v2 (generates embeddings for any text) |
| - **Unified Filtering**: Consistent crossword-suitable word filtering |
| - **Shared Caching**: Single vocabulary + embeddings + frequency cache |
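
To make the single-model idea concrete, here is a minimal sketch of ranking vocabulary words against a topic by cosine similarity. It is an illustration only, not the project's implementation: the function names are hypothetical, and the toy 3-dimensional vectors stand in for the 768-dimensional embeddings that all-mpnet-base-v2 produces. The `0.3` default mirrors the `WORD_SIMILARITY_THRESHOLD` default listed under Configuration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(topic_vec, word_vecs, threshold=0.3, top_k=10):
    """Return (word, similarity) pairs at or above the threshold, best match first."""
    scored = [(word, cosine_similarity(topic_vec, vec)) for word, vec in word_vecs.items()]
    kept = [(word, score) for word, score in scored if score >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:top_k]


# Toy vectors for illustration; real embeddings would come from the embedding model.
toy_vectors = {
    "physics": [0.9, 0.1, 0.0],
    "chemistry": [0.8, 0.3, 0.1],
    "banana": [0.0, 0.2, 0.9],
}
ranked = rank_by_similarity([1.0, 0.0, 0.0], toy_vectors)
```

Because every word and every topic is embedded by the same model, one similarity function and one threshold apply everywhere.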
|
|
| ### Key Components |
|
|
| #### 1. VocabularyManager (`hack/thematic_word_generator.py`) |
| - Loads and filters WordFreq vocabulary |
| - Applies crossword-suitable filtering (3-12 chars, alphabetic, excludes boring words) |
| - Generates frequency data with 10-tier classification |
| - Handles caching for performance |
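
The filtering rules above could look roughly like the following sketch. The function names, the ASCII-only check, and the `excluded` set are assumptions made for illustration; the actual `VocabularyManager` may apply different or additional rules. In practice the candidate list would come from WordFreq (for example via `wordfreq.top_n_list`), but a plain list is used here so the sketch stays self-contained.

```python
def is_crossword_suitable(word, min_len=3, max_len=12, excluded=frozenset()):
    """True if the word is ASCII-alphabetic, within the length bounds, and not excluded."""
    return (
        min_len <= len(word) <= max_len
        and word.isascii()
        and word.isalpha()
        and word.lower() not in excluded
    )


def filter_vocabulary(words, limit=None, excluded=frozenset()):
    """Keep suitable words in input order, lowercased and de-duplicated."""
    seen = set()
    kept = []
    for word in words:
        normalized = word.lower()
        if normalized in seen:
            continue
        if not is_crossword_suitable(normalized, excluded=excluded):
            continue
        seen.add(normalized)
        kept.append(normalized)
        if limit is not None and len(kept) >= limit:
            break
    return kept


# Hypothetical candidates; "the" stands in for an excluded word.
candidates = ["the", "Ocean", "it's", "co-op", "ab", "ocean", "electroencephalogram", "quartz"]
filtered = filter_vocabulary(candidates, excluded=frozenset({"the"}))
```

Keeping input order matters when the source list is sorted by frequency, since a size limit then keeps the most common suitable words.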
|
|
| #### 2. UnifiedThematicWordGenerator (`hack/thematic_word_generator.py`) |
| - Uses WordFreq vocabulary instead of NLTK words |
| - Generates all-mpnet-base-v2 embeddings for WordFreq words |
| - Maintains 10-tier frequency classification system |
| - Provides both hack tool API and backend-compatible API |
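
One simple way to build a 10-tier classification is to rank words by frequency and split the ranking into equal buckets, as sketched below. This is an assumption about the approach, not the generator's actual logic: the real tier boundaries and the descriptive tier names (such as `tier_5_common`) are not specified here, so the sketch uses plain `tier_1` to `tier_10` labels.

```python
def assign_frequency_tiers(frequencies, num_tiers=10):
    """Map each word to 'tier_1' (most frequent) through 'tier_N' by rank bucket."""
    ranked = sorted(frequencies, key=frequencies.get, reverse=True)
    total = len(ranked)
    return {
        word: f"tier_{rank * num_tiers // total + 1}"
        for rank, word in enumerate(ranked)
    }


# Hypothetical frequencies: word0 is the most common, word19 the rarest.
sample = {f"word{i}": 1.0 / (i + 1) for i in range(20)}
tiers = assign_frequency_tiers(sample)
```

Rank-based buckets give every tier roughly the same number of words; thresholds on the raw frequency values would instead produce tiers of uneven size.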
|
|
| #### 3. UnifiedWordService (`crossword-app/backend-py/src/services/unified_word_service.py`) |
| - Bridge adapter for backend integration |
| - Compatible with existing VectorSearchService interface |
| - Uses comprehensive WordFreq vocabulary instead of limited model vocabulary |
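
The bridge-adapter idea can be sketched as a thin async wrapper around a synchronous generator. Everything below is hypothetical: `StubGenerator` and `WordServiceAdapter` are illustrative names, the return format is simply passed through, and the mapping from a difficulty label to frequency tiers is left out because it is not described here.

```python
import asyncio


class StubGenerator:
    """Hypothetical stand-in for a synchronous thematic word generator."""

    def generate_thematic_words(self, topic, num_words=10):
        pool = [("tiger", 0.81, "tier_5_common"), ("zebra", 0.74, "tier_5_common")]
        return pool[:num_words]


class WordServiceAdapter:
    """Exposes an async find_similar_words() on top of a blocking generator."""

    def __init__(self, generator):
        self._generator = generator

    async def find_similar_words(self, topic, difficulty, max_words=15):
        # `difficulty` is accepted for interface compatibility only; how it maps
        # to tiers is not specified, so this sketch does not use it.
        # The blocking call runs in a worker thread so the event loop stays free.
        return await asyncio.to_thread(
            self._generator.generate_thematic_words, topic, max_words
        )


adapter = WordServiceAdapter(StubGenerator())
words = asyncio.run(adapter.find_similar_words("animal", "medium", max_words=1))
```

`asyncio.to_thread` requires Python 3.9 or newer.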
|
|
| ## Usage |
|
|
| ### For Hack Tools |
| ```python |
| from thematic_word_generator import UnifiedThematicWordGenerator |
| |
| # Initialize with desired vocabulary size |
| generator = UnifiedThematicWordGenerator(vocab_size_limit=100000) |
| generator.initialize() |
| |
| # Generate thematic words with tier info |
| results = generator.generate_thematic_words( |
|     topic="science", |
|     num_words=10, |
|     difficulty_tier="tier_5_common"  # Optional tier filtering |
| ) |
| |
| for word, similarity, tier in results: |
|     print(f"{word}: {similarity:.3f} ({tier})") |
| ``` |
|
|
| ### For Backend Integration |
|
|
| #### Option 1: Replace VectorSearchService |
| ```python |
| # In crossword_generator.py |
| from .unified_word_service import create_unified_word_service |
| |
| # Initialize (must run inside an async function, since the factory is awaited) |
| vector_service = await create_unified_word_service(vocab_size_limit=100000) |
| crossword_gen = CrosswordGenerator(vector_service=vector_service) |
| ``` |
|
|
| #### Option 2: Direct Usage |
| ```python |
| from .unified_word_service import UnifiedWordService |
| |
| service = UnifiedWordService(vocab_size_limit=100000) |
| await service.initialize() |
| |
| # Compatible with existing interface |
| words = await service.find_similar_words("animal", "medium", max_words=15) |
| ``` |
|
|
| ## Performance Improvements |
|
|
| ### Memory Usage |
| - **Before**: 3 separate vocabularies + embeddings (~500MB+) |
| - **After**: Single vocabulary + embeddings (~200MB) |
| - **Reduction**: ~60% memory usage reduction |
|
|
| ### Vocabulary Coverage |
| - **Before**: Limited to ~8-12K words from model tokenizer |
| - **After**: 100K+ filtered words from the WordFreq database (configurable via the vocabulary size limit) |
| - **Improvement**: 10x+ vocabulary coverage |
|
|
| ### Consistency |
| - **Before**: Different words available in hack tools vs backend |
| - **After**: Same comprehensive vocabulary across all components |
| - **Benefit**: Consistent word quality and availability |
|
|
| ## Configuration |
|
|
| ### Environment Variables |
| - `MAX_VOCABULARY_SIZE`: Maximum vocabulary size (default: 100000) |
| - `EMBEDDING_MODEL`: Model name (default: all-mpnet-base-v2) |
| - `WORD_SIMILARITY_THRESHOLD`: Minimum similarity (default: 0.3) |
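
As a rough sketch, the three variables above could be read with their documented defaults like this. The `load_config` function and the dictionary keys are hypothetical names chosen for illustration; only the variable names and default values come from the list above.

```python
import os


def load_config(env=None):
    """Read the vocabulary settings from the environment, falling back to the defaults."""
    env = os.environ if env is None else env
    return {
        "max_vocabulary_size": int(env.get("MAX_VOCABULARY_SIZE", "100000")),
        "embedding_model": env.get("EMBEDDING_MODEL", "all-mpnet-base-v2"),
        "similarity_threshold": float(env.get("WORD_SIMILARITY_THRESHOLD", "0.3")),
    }


# Passing a dict instead of os.environ keeps the example deterministic.
defaults = load_config(env={})
overridden = load_config(env={"MAX_VOCABULARY_SIZE": "50000"})
```

Converting to `int` and `float` at load time means a malformed value fails at startup rather than deep inside word generation.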
|
|
| ### Vocabulary Size Options |
| - **Small (10K)**: Fast initialization, basic vocabulary |
| - **Medium (50K)**: Balanced performance and coverage |
| - **Large (100K)**: Comprehensive coverage, slower initialization |
| - **Full (319K)**: Complete WordFreq database, longest initialization |
|
|
| ## Migration Guide |
|
|
| ### For Existing Hack Tools |
| 1. Update imports: `from thematic_word_generator import UnifiedThematicWordGenerator` |
| 2. Replace `ThematicWordGenerator` with `UnifiedThematicWordGenerator` |
| 3. The API remains compatible but now draws on the comprehensive WordFreq vocabulary |
|
|
| ### For Backend Services |
| 1. Import: `from .unified_word_service import UnifiedWordService` |
| 2. Replace `VectorSearchService` initialization with `UnifiedWordService` |
| 3. All existing methods remain compatible |
| 4. Benefits: Better vocabulary coverage, consistent frequency data |
|
|
| ### Backwards Compatibility |
| - All existing APIs maintained |
| - Same method signatures and return formats |
| - Gradual migration is possible: both systems can run in parallel |
|
|
| ## Benefits Summary |
|
|
| - ✅ **Eliminates Redundancy**: Single vocabulary source instead of 3 separate ones |
| - ✅ **Improves Coverage**: 100K+ words vs previous 8-12K words |
| - ✅ **Reduces Memory**: ~60% reduction in memory usage |
| - ✅ **Ensures Consistency**: Same vocabulary across hack tools and backend |
| - ✅ **Maintains Performance**: Smart caching and batch processing |
| - ✅ **Preserves Features**: 10-tier frequency classification, difficulty filtering |
| - ✅ **Enables Growth**: Easy to add new features with unified architecture |
|
|
| ## Cache Management |
|
|
| ### Cache Locations |
| - **Hack tools**: `hack/model_cache/` |
| - **Backend**: `crossword-app/backend-py/cache/unified_generator/` |
|
|
| ### Cache Files |
| - `unified_vocabulary_<size>.pkl`: Filtered vocabulary |
| - `unified_frequencies_<size>.pkl`: Frequency data |
| - `unified_embeddings_<model>_<size>.npy`: Pre-computed embeddings |
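
The naming scheme above can be expressed as a small helper, sketched below. The `cache_paths` function is a hypothetical name, and two details are assumptions: that `<size>` is the plain integer limit, and that any `/` in a model name is replaced so the result stays a single file name.

```python
from pathlib import Path


def cache_paths(cache_dir, model_name, vocab_size):
    """Build the three cache file paths for a given model and vocabulary size."""
    base = Path(cache_dir)
    safe_model = model_name.replace("/", "_")  # assumption: keep the name a single path component
    return {
        "vocabulary": base / f"unified_vocabulary_{vocab_size}.pkl",
        "frequencies": base / f"unified_frequencies_{vocab_size}.pkl",
        "embeddings": base / f"unified_embeddings_{safe_model}_{vocab_size}.npy",
    }


paths = cache_paths("hack/model_cache", "all-mpnet-base-v2", 100000)
```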
|
|
| ### Cache Invalidation |
| Caches are automatically rebuilt if: |
| - Vocabulary size limit changes |
| - Embedding model changes |
| - WordFreq database updates (rare) |
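
One plausible way the first two conditions work is that the size and model are part of the cache file name, so changing either simply misses the old file and triggers a rebuild. The sketch below shows that load-or-build pattern; `load_or_build` and `build_vocab` are hypothetical names, and this is an assumption about the mechanism rather than a description of the actual code. Note that a file name alone would not detect a WordFreq database update, which would need a separate check such as a stored version tag.

```python
import pickle
import tempfile
from pathlib import Path


def load_or_build(path, build):
    """Return the cached value at `path`, or call `build()` and cache its result."""
    path = Path(path)
    if path.exists():
        with path.open("rb") as handle:
            return pickle.load(handle)
    value = build()
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as handle:
        pickle.dump(value, handle)
    return value


build_calls = []


def build_vocab():
    """Hypothetical expensive build step; records each call for the demo."""
    build_calls.append(1)
    return ["ocean", "quartz"]


with tempfile.TemporaryDirectory() as tmp:
    small = Path(tmp) / "unified_vocabulary_10000.pkl"
    large = Path(tmp) / "unified_vocabulary_50000.pkl"
    first = load_or_build(small, build_vocab)   # cache miss: builds and writes
    second = load_or_build(small, build_vocab)  # cache hit: loads from disk
    third = load_or_build(large, build_vocab)   # different size: rebuilds
```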
|
|
| ## Future Enhancements |
|
|
| 1. **Semantic Clustering**: Group words by semantic similarity |
| 2. **Dynamic Difficulty**: Real-time difficulty adjustment based on user performance |
| 3. **Topic Expansion**: Automatic topic discovery and expansion |
| 4. **Multilingual Support**: Extend to other languages using WordFreq |
| 5. **Custom Vocabularies**: Allow domain-specific vocabulary additions |