| --- |
| license: other |
| license_name: raml-v1.0 |
| datasets: |
| - ReactiveAI/Beta-Pre-Train-Corpus |
| language: |
| - en |
| - pl |
| pipeline_tag: text-generation |
| tags: |
| - agent |
| gated: true |
| extra_gated_prompt: >- |
| Accept [Reactive AI Model & Architecture License (RAML) |
| v1.0](https://github.com/RxAI-dev/rxlm/blob/main/MODELS_LICENSE.md) terms to |
| access the repository and use model. Reactive Transformer (pending patent |
| #P.453260) is available for free for non-commercial usage. For commercial |
| usage please contact Reactive AI at licensing@rxai.dev |
| extra_gated_fields: |
| Company: text |
| Country: country |
| I want to use this model for: |
| type: select |
| options: |
| - Research |
| - Education |
| - label: Other |
| value: other |
| I agree to use this model for non-commercial use ONLY: checkbox |
| extra_gated_heading: >- |
| You need to agree to use this model only for research or education purposes |
| under Reactive AI Model & Architecture License (RAML) v1.0 |
| extra_gated_description: The repository will be available instantly after accepting license terms |
| extra_gated_button_content: Accept license terms |
| --- |
| |
| <img src="https://huggingface.co/ReactiveAI/RxT-Beta-Decoder-Base/resolve/main/logo_rxt_beta.png" width="512" /> |
|
|
| # RxT-Beta Decoder Base (2.85B A190M) |
| **RxT-Beta** is the world's first real-scale stateful **Reactive Language Model (RxLM)** with infinite memory & context, built to confirm the new **Reactive Transformer (RxT)** |
| scaling laws and to solve **all** of the biggest problems of stateless LLMs. **RxT** models are natively conversational (and agentic) - instead of reprocessing the entire |
| conversation history (chat template) as LLMs do, they process only a single interaction at a time, in real time, and move the context to a dedicated embedding-based memory |
| that is updated asynchronously between interactions. RxT introduces unique features such as: |
| - infinite conversation & global context through Mixture-of-Memory (MoM) |
| - live continual learning from interactions in real-time |
| - true real-time processing with near-zero latency |
| - linear conversation cost scaling |
| - fixed computational cost and memory usage for each interaction |
| - increasing quality of responses with subsequent steps of dialogue, without "long-term hallucinations" |
| - natively encoded memory, impossible to read without the model |
| - extreme pre-training efficiency |
| - hybrid stateful reasoning |
|
|
| In the first small-scale experiments, **RxT-Alpha** models achieved about **50% higher accuracy** and almost **2x lower perplexity** than a stateless decoder-only baseline |
| of the same size trained on the same simple synthetic dataset (and the decoder-only model was additionally pre-trained on 5x more tokens). These results were |
| then confirmed on a small 10B-token subset of real-world data with ~0.3B models (**RxT-Beta Micro**), where the **RxT** advantage was even bigger. These promising |
| results, along with all the unique features, indicate that the **Reactive Transformer** is a revolutionary generational leap and a crucial milestone on the |
| path to **Artificial General Intelligence (AGI)**, provided that we can confirm them at scale, which is what we plan to do with **RxT-Beta**. |
|
|
| The goal is to compete with dense stateless LLMs of ~1-3B params, pre-trained on trillions of tokens, using a model with only 190M active parameters and about 400B |
| training tokens, and to significantly outperform them on long multi-turn conversations. |
|
|
| ## Base models |
| **Reactive Transformer** models require a new, dedicated training pipeline to handle their asynchronous memory and reversed decoder-encoder order. Base models are |
| the result of the first supervised stage - _**Joint LM Pre-Training with "cheated context" teacher forcing**_ (more info in the Training Process section). |
|
|
| The base decoder (this model) is not a typical generative model. It requires further training and has to be connected with the encoder and memory attention network, so |
| this model is only the starting point for the next stages. It's pre-trained for general knowledge (with a focus on reasoning) on textbook-quality datasets, and it |
| can be further fine-tuned for custom use cases (under the terms of the [RAML v1.0 license](https://huggingface.co/ReactiveAI/RxT-Beta-Decoder-Base/blob/main/LICENSE.md)). |
|
|
| ## Decoder architecture |
| - layers: 25 (21 stateful MoE + 3 stateless MoE + 1 stateless dense) |
| - dim: 512 |
| - self-attention: Gated Sparse Query Attention (SQA) with 8/16 query heads & 4/16 key/value heads |
| - memory cross-attention: Sparse Query Attention (SQA) with 8/16 query heads & 4/16 key/value heads |
| - feed forward: Sparse Mixture-of-Experts (MoE) with gated shared experts |
| - routed experts: 384 |
| - active experts: 10 |
| - routed expert dim: 192 |
| - shared experts: 2 with softmax gating |
| - shared expert dim: 384 |
| - activation: SwiGLU |
| - dense layer: 1536 dim with SwiGLU activation |
| - vocab: 65k (English + Polish) |
| - params: 2.85B with 190M activated per token |
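
The feed-forward hyperparameters listed above can be cross-checked with some quick arithmetic. The snippet below is only a back-of-envelope sketch: it assumes every SwiGLU block holds three bias-free weight matrices (gate, up, down) and it ignores attention, embeddings, routers and normalization layers, so it is not an exact parameter count of the released model.

```python
# Back-of-envelope feed-forward parameter count from the listed hyperparameters.
# Assumption: each SwiGLU block holds three bias-free weight matrices
# (gate, up, down), i.e. 3 * dim * hidden parameters.

DIM = 512
MOE_LAYERS = 21 + 3          # stateful MoE + stateless MoE layers
ROUTED_EXPERTS = 384
ACTIVE_EXPERTS = 10
ROUTED_HIDDEN = 192
SHARED_EXPERTS = 2
SHARED_HIDDEN = 384
DENSE_HIDDEN = 1536          # the single stateless dense layer


def swiglu_params(dim: int, hidden: int) -> int:
    """Weights in one SwiGLU block: gate, up and down projections."""
    return 3 * dim * hidden


routed = swiglu_params(DIM, ROUTED_HIDDEN)
shared = SHARED_EXPERTS * swiglu_params(DIM, SHARED_HIDDEN)
dense = swiglu_params(DIM, DENSE_HIDDEN)

# All routed experts count towards the total, only the active ones per token.
total_ffn = MOE_LAYERS * (ROUTED_EXPERTS * routed + shared) + dense
active_ffn = MOE_LAYERS * (ACTIVE_EXPERTS * routed + shared) + dense

print(f"total FFN params:  {total_ffn / 1e9:.2f}B")
print(f"active FFN params: {active_ffn / 1e6:.0f}M")
```

Under these assumptions the feed-forward weights alone come to roughly 2.75B parameters, of which roughly 101M are active per token. The rest of the listed 2.85B / 190M would then come from the components this estimate leaves out.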
|
|
| <img src="https://huggingface.co/ReactiveAI/RxT-Beta-Decoder-Base/resolve/main/RxT-Beta-Decoder.png" width="600" /> |
|
|
| ## Decoder Innovations |
| ### Reactive Transformer (RxT) with additional stateless layers |
| Reactive Transformer ([Adam Filipek, 2025](https://arxiv.org/abs/2510.03561)) is our flagship innovation that redefines conversational and agentic AI by making |
| it natively stateful. Unlike external agentic memory systems, it treats memory as an integral part of the model. Memory is not text added to the prompt, but a set of |
| dynamic vector embeddings, accessed through the decoder's memory cross-attention layers and updated asynchronously after the answer is generated (by the encoder and memory attention). |
| That makes it far more expressive and compressible than any existing agentic memory. |
|
|
| While the **RxT** decoder is similar to the one in the original encoder-decoder Transformer, the cross-attention inputs are not just the encoder hidden states; they are accumulated from |
| all previous interactions - that's why we call it _memory cross-attention_. We also don't apply positional encoding to the memory cross-attention _keys_, because memory |
| doesn't have spatial relationships - instead, it has to implicitly learn a timestep-based encoding. |
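
A minimal single-head sketch of this idea follows. It is not the RxLM implementation: the rotary encoding on the query side, the slot count and all function names are assumptions made for illustration. The point it shows is that when keys carry no positional encoding, memory slots behave as an unordered set, so shuffling them does not change the result.

```python
import numpy as np


def rope(x: np.ndarray) -> np.ndarray:
    """Rotate-half rotary encoding over the sequence axis; x is (seq, dim)."""
    seq, dim = x.shape
    half = dim // 2
    freqs = 1.0 / (10000.0 ** (np.arange(half) / half))
    angles = np.arange(seq)[:, None] * freqs[None, :]
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=1)


def memory_cross_attention(queries, memory_keys, memory_values):
    """Single-head attention from sequence tokens to memory slots."""
    q = rope(queries)                      # positions only on the query side (assumed RoPE)
    scores = q @ memory_keys.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ memory_values


rng = np.random.default_rng(0)
tokens = rng.normal(size=(6, 32))          # 6 tokens of the current interaction
mem_k = rng.normal(size=(10, 32))          # 10 hypothetical memory slots
mem_v = rng.normal(size=(10, 32))

out = memory_cross_attention(tokens, mem_k, mem_v)

# Shuffle the slots (keys and values together) and attend again.
perm = rng.permutation(10)
out_shuffled = memory_cross_attention(tokens, mem_k[perm], mem_v[perm])
print(out.shape, np.allclose(out, out_shuffled))
```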
|
|
| Since the **RxT-Alpha** models introduced in the paper, we have added initial and final stateless layers that use only self-attention and feed forward, without memory |
| cross-attention. In **RxT-Beta** we have: |
| - two initial stateless layers, designed to improve resolving relations inside the current query (they don't have access to previous messages, as that would go against |
| the RxT real-time processing ideas) and between the query and the answer, before accessing any past information from memory. This helps with question understanding. |
| - the first initial stateless layer uses a dense MLP, which is a standard solution in modern Mixture-of-Experts architectures. All other layers use MoE |
| - two final stateless layers, made to summarize all the reasoning after current and past information has been combined in the stateful layers |
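
The resulting layer order can be written down as a small sketch. The tuple labels are hypothetical names used only for illustration; the counts follow the list above and the architecture summary.

```python
from collections import Counter

# (memory access, feed-forward type) per layer, in forward order.
layers = (
    [("stateless", "dense")]            # first initial layer: dense MLP
    + [("stateless", "moe")]            # second initial layer
    + [("stateful", "moe")] * 21        # layers with memory cross-attention
    + [("stateless", "moe")] * 2        # final summarizing layers
)

counts = Counter(layers)
print(len(layers), dict(counts))
```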
|
|
| ### Sparse Query Attention (SQA) |
| Sparse Query Attention ([Adam Filipek, 2025](https://arxiv.org/abs/2510.01817)) is our solution for computationally efficient attention, and it's especially useful in **RxT**. |
| Unlike common sparse attention patterns, such as Sliding Window Attention (SWA), **SQA** is based on structural sparsity instead of spatial sparsity. By reducing the number of |
| query heads, it uses partial information from all tokens and performs _scaled dot product attention_ in lower dimensionality (reducing the number of matrix multiplications). |
| **SQA** is optimized especially for compute-bound full-sequence processing scenarios, such as the prompt phase or bidirectional encoder attention. |
|
|
| In **RxT-Beta** we use 50% of the query heads, so the attention computation costs about 2x less than in the baseline GQA (16 query & 4 key/value heads), while the quality decrease is negligible. We |
| keep the same number of key/value heads, so the memory access cost in autoregressive generation stays at the same level. However, in **RxT** the KV-cache is limited to a single |
| interaction, so it's no longer a bottleneck. Instead, we have 3 new bidirectional attention layers for each transformer block (one in the encoder and two in memory attention), |
| where **SQA** outperforms other solutions. |
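
The head configuration described above (8 query heads and 4 key/value heads out of a nominal 16) can be sketched in NumPy as follows. This is an illustrative sketch rather than the released implementation: the head dimension of 32 (512 / 16), the weight shapes and the simplified FLOP formula are assumptions.

```python
import numpy as np


def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def sqa(x, wq, wk, wv, wo, q_heads=8, kv_heads=4):
    """Bidirectional attention with fewer query heads than the full head count."""
    seq = x.shape[0]
    head_dim = wq.shape[1] // q_heads
    q = (x @ wq).reshape(seq, q_heads, head_dim).transpose(1, 0, 2)
    k = (x @ wk).reshape(seq, kv_heads, head_dim).transpose(1, 0, 2)
    v = (x @ wv).reshape(seq, kv_heads, head_dim).transpose(1, 0, 2)
    # Each key/value head is shared by q_heads // kv_heads query heads.
    k = np.repeat(k, q_heads // kv_heads, axis=0)
    v = np.repeat(v, q_heads // kv_heads, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)   # (q_heads, seq, seq)
    ctx = softmax(scores) @ v                                # (q_heads, seq, head_dim)
    ctx = ctx.transpose(1, 0, 2).reshape(seq, q_heads * head_dim)
    return ctx @ wo                                          # project back to model dim


dim, head_dim, seq = 512, 32, 16          # head_dim = 512 / 16 is an assumption
rng = np.random.default_rng(0)
x = rng.normal(size=(seq, dim))
wq = rng.normal(size=(dim, 8 * head_dim)) * 0.02
wk = rng.normal(size=(dim, 4 * head_dim)) * 0.02
wv = rng.normal(size=(dim, 4 * head_dim)) * 0.02
wo = rng.normal(size=(8 * head_dim, dim)) * 0.02

y = sqa(x, wq, wk, wv, wo)


def attention_flops(heads, seq_len, hd):
    """Rough FLOPs of QK^T plus the weighted sum over values (projections ignored)."""
    return 2 * 2 * heads * seq_len * seq_len * hd


ratio = attention_flops(8, seq, head_dim) / attention_flops(16, seq, head_dim)
print(y.shape, ratio)
```

Because the score and weighted-sum cost grows linearly with the number of query heads, halving the query heads halves that part of the computation, while the key/value projections stay the same size as in the GQA baseline.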
|
|
| #### Sparse Attention for RxT |
| Spatially sparse attention solutions are useful for very long context windows in stateless LLMs and for chat history reprocessing. In **RxT** we achieved infinite context by... |
| making the context window shorter. It may look counterintuitive, but when the context window is limited to a single query and answer, it doesn't need to be as long as in LLMs, |
| where it has to fit the whole chat history. Full **SQA** attention is then fast enough, and token relations inside the current interaction are naturally the strongest. Furthermore, |
| **RxT** has a _native sliding window_ that is bounded not by a fixed number of tokens but by the current interaction, which is a natural boundary. |
|
|
| On the other hand, sparse attention is designed for unidirectional/autoregressive attention in decoder-only models, so its compatibility with the bidirectional encoder and memory |
| attention is rather weak, especially for memory, which doesn't have spatial relations. |
|
|
| #### Linear Attention for RxT |
| We tested new **Linear Attention** solutions and a hybrid attention architecture for **RxT-Beta** self-attention, but for the short single-interaction sequences used in the MVP (1-8k tokens), |
| training was about 2-3x slower than with the full **SQA** baseline, due to the overhead of architectural complexity. We believe that it will become valuable in future generations, when |
| we extend interaction length to 32k+ tokens, and we plan to integrate intra-sequence recurrence (Linear Attention state) with inter-sequence recurrence (RxT memory) in our |
| custom solution called **Memory-driven Gated DeltaNet**. |
|
|
| ### Gated Self-Attention |
| Following [Alibaba/Qwen Team research](https://arxiv.org/abs/2505.06708), we added sigmoid gates to our **SQA** self-attention layers (in both decoder and |
| encoder). As in the Qwen Team solution, gate values are based on the query and applied before the final output projection. The only difference is that in **SQA** the gate has reduced dimensionality, |
| the same as the query and the attention calculation. |
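
As a rough sketch of where the gate sits, the snippet below applies an elementwise sigmoid gate to the concatenated head outputs just before the output projection. It assumes the gate is a linear projection of the same layer input that feeds the queries and that the reduced width is 256 (8 query heads of an assumed dimension 32). The attention outputs are random stand-ins and all names are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def gated_output(x, attn_ctx, w_gate, w_out):
    """Gate the concatenated head outputs, then project back to model dim."""
    gate = sigmoid(x @ w_gate)            # (seq, reduced), values in (0, 1)
    return (attn_ctx * gate) @ w_out


dim, reduced, seq = 512, 256, 16          # 256 = 8 query heads * 32 (assumed head dim)
rng = np.random.default_rng(0)
x = rng.normal(size=(seq, dim))             # layer input, the same source as the queries
attn_ctx = rng.normal(size=(seq, reduced))  # stand-in for SQA head outputs
w_gate = rng.normal(size=(dim, reduced)) * 0.02
w_out = rng.normal(size=(reduced, dim)) * 0.02

y = gated_output(x, attn_ctx, w_gate, w_out)
print(y.shape)
```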
|
|
| We also tested gates in cross-attention, but the results were a lot worse than the baseline without gates, probably because of the different input sources - the gate is based on the query, which comes from the |
| processed sequence, while the attention results are based on values from memory. In the end, we use gates only in self-attention layers. |
|
|
| ### Sparse Mixture-of-Experts (MoE) with gated shared experts |
| Recent models, such as Kimi K2 or Qwen3-Next, have demonstrated the effectiveness of architectures with a large number of smaller experts and highly sparse activation for each token. |
| We follow the same direction in the **RxT-Beta** Mixture-of-Experts, with 10 of 384 experts activated per token. We extend it with two bigger shared experts with a softmax gate |
| for even better expressiveness. Both shared experts are used for all tokens, but the gate decides which one is more important for each token - we plan to introduce task-aware shared |
| expert load balancing in the next training stages, to specialize one expert in reasoning and dedicate the other to fast answers, for better balanced hybrid reasoning abilities. |
| Shared experts are 2x bigger than routed experts. |
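
The routing scheme described above can be sketched for a single token as follows. This is a scaled-down illustration, not the model's code: the sizes are hypothetical (the model itself uses 384 routed experts with 10 active), and renormalizing the router weights over the chosen experts and summing the routed and shared outputs are assumptions.

```python
import numpy as np


def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def swiglu(x, w_gate, w_up, w_down):
    g = x @ w_gate
    return ((g / (1.0 + np.exp(-g))) * (x @ w_up)) @ w_down   # SiLU(g) * up, then down


def moe_token(x, router, experts, shared_gate, shared, top_k):
    """Feed-forward output for one token vector x of shape (dim,)."""
    logits = x @ router                               # one logit per routed expert
    chosen = np.argsort(logits)[-top_k:]              # indices of the top-k experts
    weights = softmax(logits[chosen])                 # renormalize over the chosen ones
    routed_out = sum(w * swiglu(x, *experts[i]) for w, i in zip(weights, chosen))
    # Both shared experts always run; a softmax gate sets their relative weight.
    share = softmax(x @ shared_gate)
    shared_out = sum(s * swiglu(x, *e) for s, e in zip(share, shared))
    return routed_out + shared_out, chosen


def make_expert(rng, dim, hidden):
    return (rng.normal(size=(dim, hidden)) * 0.1,
            rng.normal(size=(dim, hidden)) * 0.1,
            rng.normal(size=(hidden, dim)) * 0.1)


# Scaled-down hypothetical sizes, chosen only to keep the demo small.
dim, n_experts, top_k, hidden = 16, 12, 3, 8
rng = np.random.default_rng(0)
experts = [make_expert(rng, dim, hidden) for _ in range(n_experts)]
shared = [make_expert(rng, dim, 2 * hidden) for _ in range(2)]   # 2x bigger than routed
router = rng.normal(size=(dim, n_experts))
shared_gate = rng.normal(size=(dim, 2))

y, chosen = moe_token(rng.normal(size=dim), router, experts, shared_gate, shared, top_k)
print(y.shape, sorted(chosen.tolist()))
```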
|
|
| ### Bidirectional Masked Language Modeling (MLM) in decoder pre-training |
| In the unique **RxT** pre-training method, the decoder learns with both unidirectional autoregressive language modeling (self-attention) and bidirectional modeling (cross-attention). |
| This boosts training effectiveness with a "super-convergence" effect, but it also makes training too easy, which leads to a quick loss plateau at an early training stage. To prevent this, |
| we make the decoder's task harder by adding random noise to the encoder's outputs. In early experiments we used small noise levels, like 0.15-0.2, but in **RxT-Beta** we increased |
| it to 0.5 as a starting point. Additionally, we apply random masking to the encoder outputs, to make predicting the tokens at masked positions even harder. This adds |
| another objective to the decoder's training, close to the masked language modeling used in encoder training. |
|
|
| To make the training even more effective, we introduced a progressive increase of the noise level and masking probability - with this solution, even a loss plateau is "healthy", because |
| the objective becomes harder with each step. |
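
One way to picture the progressive corruption of encoder outputs is the sketch below. Only the starting noise level of 0.5 comes from the text. The linear schedule, the end values, the masking probabilities, the Gaussian noise and masking by zeroing whole positions are all assumptions made for illustration.

```python
import numpy as np


def linear_schedule(step, total_steps, start, end):
    """Linearly interpolate from start to end over training."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + frac * (end - start)


def corrupt_encoder_outputs(hidden, noise_level, mask_prob, rng):
    """Add Gaussian noise, then zero out a random subset of positions."""
    noisy = hidden + noise_level * rng.normal(size=hidden.shape)
    keep = rng.random(hidden.shape[0]) >= mask_prob     # one decision per position
    return noisy * keep[:, None], keep


rng = np.random.default_rng(0)
hidden = rng.normal(size=(128, 512))      # (sequence length, dim) encoder outputs

total_steps = 1000
for step in (0, 500, 1000):
    # Noise starts at 0.5 as in the text; every other value here is hypothetical.
    noise = linear_schedule(step, total_steps, start=0.5, end=0.9)
    mask_p = linear_schedule(step, total_steps, start=0.2, end=0.8)
    corrupted, keep = corrupt_encoder_outputs(hidden, noise, mask_p, rng)
    print(step, round(noise, 2), round(mask_p, 2), int((~keep).sum()))
```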
|
|
| Even with high noise and masking rates, the decoder quickly reaches over 90% prediction accuracy; it then spends about 99% of the training time learning to correctly predict |
| the remaining 10% (responsible for the most important knowledge), finally reaching a 98-99% accuracy level. That level is impossible to reach in classic decoder-only LLM training - we |
| believe that this is the main reason for the extreme training efficiency of **RxT**. More details in the training process description below. |
|
|
| ## Training Process |
| Description in progress |
|
|
| <img src="https://huggingface.co/ReactiveAI/RxT-Beta-Decoder-Base/resolve/main/RxT-Beta-Joint-Training.png" /> |