Fabrice Fils-Aime's picture

Fabrice Fils-Aime

fabthebest

·

AI & ML interests

None yet

Recent Activity

updated a Space 2 days ago

fabthebest/aqles

published a Space 3 days ago

fabthebest/aqles

reacted to reaperdoesntknow's post with 👍 3 days ago

We present a methodology for training small language models on CPU at FP32 precision that achieves capability-per-dollar efficiency orders of magnitude beyond GPU-based training. Across15modelsspanningfournovelarchitecturefamilies—MixtureofAttentions(MoA),cross- architecture fusion (Qemma), swarm intelligence (SAGI), and metric-space causal language models (DiscoverLM)—total compute cost was $24 on a single AMD EPYC 9454P proces- sor. We introduce seven methodological pillars: (1) FP32 precision preservation, with exper- iments demonstrating 5,810×single-operation error and 23,225×compounding error ratio for FP16 at network depth; (2) sparse cognitive architectures where 0.02–7% of parameters activate per token, matching CPU branching rather than GPU SIMD; (3) developmental curriculum training progressing from language to logic to transfer to depth; (4) continuous belt-fed data ingestion eliminating truncation waste; (5) hardware-native optimization for AMD Zen 4 via AOCL/OpenMP/NUMA-aware allocation; (6) self-regulating thermodynamic governance with emergent temperature measurement grounded in L2-star discrepancy; and (7) open-standard compute (AVX2 SIMD at FP32) free of proprietary vendor dependency. We argue that transformers were designed for GPU hardware rather than mathematical optimality, and that architecture designed for geometric correctness—metric-space attention, triangle inequality enforcement, sparse expert routing—naturally favor CPU execution. For sub-2B parameter models, CPU training produces more capable models at a fraction of the cost.

View all activity

Organizations

None yet

updated a Space 2 days ago

Aqles

published a Space 3 days ago

Aqles

reactedto reaperdoesntknow's post with 👍 3 days ago

Post

3237

We present a methodology for training small language models on CPU at FP32 precision
that achieves capability-per-dollar efficiency orders of magnitude beyond GPU-based training.
Across15modelsspanningfournovelarchitecturefamilies—MixtureofAttentions(MoA),cross-
architecture fusion (Qemma), swarm intelligence (SAGI), and metric-space causal language
models (DiscoverLM)—total compute cost was $24 on a single AMD EPYC 9454P proces-
sor. We introduce seven methodological pillars: (1) FP32 precision preservation, with exper-
iments demonstrating 5,810×single-operation error and 23,225×compounding error ratio for
FP16 at network depth; (2) sparse cognitive architectures where 0.02–7% of parameters activate
per token, matching CPU branching rather than GPU SIMD; (3) developmental curriculum
training progressing from language to logic to transfer to depth; (4) continuous belt-fed data
ingestion eliminating truncation waste; (5) hardware-native optimization for AMD Zen 4 via
AOCL/OpenMP/NUMA-aware allocation; (6) self-regulating thermodynamic governance with
emergent temperature measurement grounded in L2-star discrepancy; and (7) open-standard
compute (AVX2 SIMD at FP32) free of proprietary vendor dependency. We argue that transformers were designed for GPU hardware rather than mathematical optimality, and that architecture designed for geometric correctness—metric-space attention, triangle inequality enforcement, sparse expert routing—naturally favor CPU execution. For sub-2B parameter models, CPU training produces more capable models at a fraction of the cost.

6 replies

·

published a dataset 19 days ago

fabthebest/aqles-probe-v1

Updated 19 days ago • 10

updated a Space 5 months ago

ClickMaster Pro 🎯

Click buttons to track and animate clicks

published a Space 5 months ago

ClickMaster Pro 🎯

Click buttons to track and animate clicks