	---
	license: other
	license_name: raml-v1.0
	datasets:
	- ReactiveAI/Beta-Pre-Train-Corpus
	language:
	- en
	- pl
	pipeline_tag: text-generation
	tags:
	- agent
	gated: true
	extra_gated_prompt: >-
	  Accept [Reactive AI Model & Architecture License (RAML)
	  v1.0](https://github.com/RxAI-dev/rxlm/blob/main/MODELS_LICENSE.md) terms to
	  access the repository and use model. Reactive Transformer (pending patent
	  #P.453260) is available for free for non-commercial usage. For commercial
	  usage please contact Reactive AI at licensing@rxai.dev
	extra_gated_fields:
	  Company: text
	  Country: country
	  I want to use this model for:
	    type: select
	    options:
	      - Research
	      - Education
	      - label: Other
	        value: other
	  I agree to use this model for non-commercial use ONLY: checkbox
	extra_gated_heading: >-
	  You need to agree to use this model only for research or education purposes
	  under Reactive AI Model & Architecture License (RAML) v1.0
	extra_gated_description: The repository will be available instantly after accepting license terms
	extra_gated_button_content: Accept license terms
	---

	<img src="https://huggingface.co/ReactiveAI/RxT-Beta-Decoder-Base/resolve/main/logo_rxt_beta.png" width="512" />

	# RxT-Beta Decoder Base (2.85B A190M)
	RxT-Beta is the world's first real-scale stateful Reactive Language Model (RxLM) with infinite memory & context, built to confirm the new Reactive Transformer (RxT)
	scaling laws and to address the biggest problems of stateless LLMs. RxT models are natively conversational (and agentic): instead of reprocessing the whole
	conversation history (chat template) like stateless LLMs, they process only a single interaction in real time and move the context to a dedicated embedding-based memory
	that is updated asynchronously between interactions. This introduces unique features such as:
	- infinite conversation & global context through Mixture-of-Memory (MoM)
	- live continual learning from interactions in real-time
	- true real-time processing with near-zero latency
	- linear conversation cost scaling
	- fixed computational cost and memory usage for each interaction
	- increasing quality of responses with subsequent steps of dialogue, without "long-term hallucinations"
	- natively encoded memory, impossible to read without the model
	- extreme pre-training efficiency
	- hybrid stateful reasoning
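
The linear cost scaling claim in the list above can be illustrated with a small token-count comparison. This is a minimal sketch, assuming every interaction has the same length and counting only processed tokens (not FLOPs or the fixed-cost memory update); the function names are hypothetical and not part of any RxLM API.

```python
def stateless_tokens_processed(turns: int, tokens_per_turn: int) -> int:
    """Tokens processed when every turn re-reads the full chat history."""
    # Turn t has to process t interactions: the history plus the current one.
    return sum(t * tokens_per_turn for t in range(1, turns + 1))


def reactive_tokens_processed(turns: int, tokens_per_turn: int) -> int:
    """Tokens processed when each turn only sees the current interaction."""
    return turns * tokens_per_turn


for turns in (1, 10, 100):
    stateless = stateless_tokens_processed(turns, 1_000)
    reactive = reactive_tokens_processed(turns, 1_000)
    print(f"{turns:>3} turns: stateless={stateless:>9,} reactive={reactive:>7,}")
```

Under these assumptions the stateless total grows quadratically with the number of turns, while the per-interaction total grows linearly.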

	In the first small-scale experiments, RxT-Alpha models achieved about 50% higher accuracy and almost 2x lower perplexity than a stateless decoder-only
	baseline of the same size, trained on the same simple synthetic dataset (additionally, the decoder-only model was pre-trained on 5x more tokens). These results were
	then confirmed on a small 10B-token subset of real-world data with ~0.3B models (RxT-Beta Micro), where the RxT advantage was even bigger. These promising
	results, along with all the unique features, suggest that Reactive Transformer is a revolutionary generational leap and a crucial milestone on the
	path to Artificial General Intelligence (AGI), provided we can confirm them at scale, which is what we plan to do with RxT-Beta.

	The goal is to compete with ~1-3B-parameter dense stateless LLMs pre-trained on trillions of tokens, using a model with only 190M active parameters and about 400B
	training tokens, and to significantly outperform them on long multi-turn conversations.

	## Base models
	Reactive Transformer models require a new, dedicated training pipeline to handle their asynchronous memory and reversed decoder-encoder order. Base models are
	the result of the first supervised stage - _Joint LM Pre-Training with "cheated context" teacher forcing_ (more info in the Training Process section).

	The base decoder (this model) is not a typical generative model. It requires further training and must be connected to the encoder and memory attention network, so
	this model is only the starting point for the next stages. It is pre-trained for general knowledge (with a focus on reasoning) on textbook-quality datasets and
	can be further fine-tuned for custom use cases (under the terms of the [RAML v1.0 license](https://huggingface.co/ReactiveAI/RxT-Beta-Decoder-Base/blob/main/LICENSE.md)).

	## Decoder architecture
	- layers: 25 (21 stateful MoE + 3 stateless MoE + 1 stateless dense)
	- dim: 512
	- self-attention: Gated Sparse Query Attention (SQA) with 8/16 query heads & 4/16 key/value heads
	- memory cross-attention: Sparse Query Attention (SQA) with 8/16 query heads & 4/16 key/value heads
	- feed forward: Sparse Mixture-of-Experts (MoE) with gated shared experts
	  - routed experts: 384
	  - active experts: 10
	  - routed expert dim: 192
	  - shared experts: 2 with softmax gating
	  - shared expert dim: 384
	  - activation: SwiGLU
	- dense layer: 1536 dim with SwiGLU activation
	- vocab: 65k (English + Polish)
	- params: 2.85B with 190M activated per token
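
As a sanity check, the parameter figures can be roughly reproduced from the dimensions listed above. This is a back-of-the-envelope sketch, not the official accounting: it assumes a vocabulary of exactly 65,536 tokens, untied input and output embeddings, bias-free projections, one extra gate projection in self-attention, a head dim of 512/16 = 32, and it ignores normalization layers and the shared-expert gate.

```python
# Values taken from the architecture list above.
DIM = 512
LAYERS = 25
STATEFUL_LAYERS = 21            # layers with memory cross-attention
MOE_LAYERS = 24                 # 21 stateful + 3 stateless
HEADS, Q_HEADS, KV_HEADS = 16, 8, 4
ROUTED_EXPERTS, ACTIVE_EXPERTS, ROUTED_DIM = 384, 10, 192
SHARED_EXPERTS, SHARED_DIM = 2, 384
DENSE_DIM = 1536

# Assumptions (not stated in the model card).
VOCAB = 65_536                  # "65k" read as 2**16
TIED_EMBEDDINGS = False         # separate input embedding and output head
HEAD_DIM = DIM // HEADS         # 32


def swiglu_params(dim: int, hidden: int) -> int:
    """Gate, up and down projections of a SwiGLU feed-forward block."""
    return 3 * dim * hidden


def sqa_params(gated: bool) -> int:
    """Query, key, value and output projections, plus an optional gate."""
    q_width = Q_HEADS * HEAD_DIM
    kv_width = KV_HEADS * HEAD_DIM
    params = DIM * q_width + 2 * DIM * kv_width + q_width * DIM
    if gated:
        params += DIM * q_width
    return params


def count_params() -> tuple[int, int]:
    """Return (total, active-per-token) parameter estimates."""
    attention = LAYERS * sqa_params(gated=True) + STATEFUL_LAYERS * sqa_params(gated=False)
    embeddings = VOCAB * DIM * (1 if TIED_EMBEDDINGS else 2)
    dense = swiglu_params(DIM, DENSE_DIM)
    router = DIM * ROUTED_EXPERTS
    shared = SHARED_EXPERTS * swiglu_params(DIM, SHARED_DIM)
    routed_expert = swiglu_params(DIM, ROUTED_DIM)

    always_on = attention + embeddings + dense + MOE_LAYERS * (router + shared)
    total = always_on + MOE_LAYERS * ROUTED_EXPERTS * routed_expert
    active = always_on + MOE_LAYERS * ACTIVE_EXPERTS * routed_expert
    return total, active


total, active = count_params()
print(f"total ~ {total / 1e9:.2f}B, active ~ {active / 1e6:.0f}M")
```

Under these assumptions the estimate lands close to the stated 2.85B total and 190M active parameters, with almost all of the total sitting in the routed experts.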

	<img src="https://huggingface.co/ReactiveAI/RxT-Beta-Decoder-Base/resolve/main/RxT-Beta-Decoder.png" width="600" />

	## Decoder Innovations
	### Reactive Transformer (RxT) with additional stateless layers
	Reactive Transformer ([Adam Filipek, 2025](https://arxiv.org/abs/2510.03561)) is our flagship innovation, which redefines conversational and agentic AI by making
	it natively stateful. Unlike external agentic memory systems, it treats memory as an integral part of the model. Memory is not text added to the prompt, but a set of
	dynamic vector embeddings, accessed through the decoder's memory cross-attention layers and updated asynchronously after the answer is generated (by the encoder and memory attention).
	That makes it far more expressive and compressible than text-based agentic memory.
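
The control flow described above (answer first, memory update in the background) can be sketched with `asyncio`. This is only an illustration of the ordering: the class, its methods and the arithmetic inside them are hypothetical placeholders, not the RxLM API, and the "memory" here is a toy list of floats rather than learned embeddings.

```python
import asyncio


class ReactiveSession:
    """Toy stand-in for the RxT interaction cycle; all names are hypothetical."""

    def __init__(self, slots: int = 4):
        self.memory = [0.0] * slots   # fixed-size memory state
        self._pending = None          # background memory update, if any

    async def _update_memory(self, query: str, answer: str) -> None:
        # Placeholder for "encoder + memory attention": fold the finished
        # interaction into the fixed-size state.
        await asyncio.sleep(0)        # yield control, as a real update would
        signal = float(len(query) + len(answer))
        self.memory = [0.5 * m + 0.5 * signal for m in self.memory]

    async def interact(self, query: str) -> str:
        # The previous update must be finished before memory is read again.
        if self._pending is not None:
            await self._pending
        # Placeholder for decoder generation conditioned on memory.
        answer = f"answer to {query!r} (memory={self.memory[0]:.1f})"
        # The answer is returned first; memory is updated in the background.
        self._pending = asyncio.create_task(self._update_memory(query, answer))
        return answer


async def demo() -> list[str]:
    session = ReactiveSession()
    answers = [await session.interact(q) for q in ("hi", "and then?")]
    await session._pending            # let the last update finish
    return answers


if __name__ == "__main__":
    for line in asyncio.run(demo()):
        print(line)
```

The second answer reads a memory state that already contains the first interaction, while the memory size stays fixed regardless of how many interactions have passed.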

	While the RxT decoder is similar to the original encoder-decoder Transformer, the cross-attention inputs are not just encoder hidden states: they are accumulated from
	all previous interactions, which is why we call it _memory cross-attention_. We also don't apply positional encoding to memory cross-attention _keys_, because memory
	has no spatial relationships; instead, it has to implicitly learn a timestep-based encoding.

	Since the RxT-Alpha models introduced in the paper, we have added initial and final stateless layers that use only self-attention and feed forward, without memory
	cross-attention. In RxT-Beta we have:
	- two initial stateless layers, designed to improve resolving relations inside the current query (they don't have access to previous messages, as that would go against
	the RxT real-time processing idea) and between query and answer, before accessing any past information from memory. This helps with question understanding.
	- the first initial stateless layer uses a dense MLP, which is a standard solution in modern Mixture-of-Experts architectures. All other layers use MoE.
	- two final stateless layers, made to summarize all the reasoning after current and past information is combined in the stateful layers
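
The resulting layer stack can be written down as a simple layout table. This is a sketch that only encodes the counts given in this card (2 initial stateless, 21 stateful, 2 final stateless, first layer dense); `LayerSpec` and `build_decoder_layout` are illustrative names, not part of the RxLM framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LayerSpec:
    index: int
    memory_cross_attention: bool   # True for stateful layers
    feed_forward: str              # "dense" or "moe"


def build_decoder_layout(initial_stateless: int = 2,
                         stateful: int = 21,
                         final_stateless: int = 2) -> list[LayerSpec]:
    """Lay out stateless and stateful layers in decoder order."""
    layers = []
    for i in range(initial_stateless + stateful + final_stateless):
        is_stateful = initial_stateless <= i < initial_stateless + stateful
        # Only the very first layer uses a dense MLP; all others use MoE.
        feed_forward = "dense" if i == 0 else "moe"
        layers.append(LayerSpec(i, is_stateful, feed_forward))
    return layers


layout = build_decoder_layout()
for spec in layout[:3] + layout[-3:]:
    print(spec)
```

Counting the entries gives 21 stateful MoE layers, 3 stateless MoE layers and 1 stateless dense layer, matching the 25 layers in the architecture list.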

	### Sparse Query Attention (SQA)
	Sparse Query Attention ([Adam Filipek, 2025](https://arxiv.org/abs/2510.01817)) is our solution for computationally efficient attention, which is especially useful in RxT.
	Unlike common sparse attention patterns, such as Sliding Window Attention (SWA), SQA is based on structural sparsity rather than spatial sparsity. By reducing the number of
	query heads, it uses partial information from all tokens and performs _scaled dot product attention_ in lower dimensionality (reducing the number of matrix multiplications).
	SQA is optimized for compute-bound full-sequence processing scenarios, such as the prompt phase or bidirectional encoder attention.

	In RxT-Beta we use 50% of the query heads, so attention has 2x lower computational cost than the GQA baseline (16 query & 4 key/value heads), while the quality decrease is negligible. We
	keep the same number of key/value heads, so memory access cost in autoregressive generation stays at the same level. However, in RxT the KV-cache is limited to a single
	interaction, so it's no longer a bottleneck. Instead, we have 3 new bidirectional attention layers for each transformer block (one in the encoder and two in memory attention),
	where SQA outperforms other solutions.
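
The 2x figure follows directly from the head counts. The sketch below counts only the two attention matrix multiplications (scores and weighted values) and assumes a head dim of 512/16 = 32; projection costs are left out, and the helper names are illustrative.

```python
def attention_flops(seq_len: int, query_heads: int, head_dim: int) -> int:
    """Approximate FLOPs of QK^T plus the attention-weighted sum over V."""
    # Each query head does two (seq_len x seq_len x head_dim) matmuls,
    # at 2 FLOPs per multiply-add.
    return 4 * seq_len * seq_len * head_dim * query_heads


def kv_cache_elements(seq_len: int, kv_heads: int, head_dim: int) -> int:
    """Number of cached key and value elements per layer."""
    return 2 * seq_len * kv_heads * head_dim


HEAD_DIM = 512 // 16
SEQ_LEN = 8_192

gqa = attention_flops(SEQ_LEN, query_heads=16, head_dim=HEAD_DIM)
sqa = attention_flops(SEQ_LEN, query_heads=8, head_dim=HEAD_DIM)
print(f"GQA/SQA attention FLOPs ratio: {gqa / sqa:.1f}")
print("KV-cache elements (both):", kv_cache_elements(SEQ_LEN, 4, HEAD_DIM))
```

Because both variants keep 4 key/value heads, the KV-cache size is identical; only the compute in the attention matmuls is halved.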

	#### Sparse Attention for RxT
	Spatially sparse attention solutions are useful for very long context windows in stateless LLMs and for chat history reprocessing. In RxT we achieved infinite context by...
	making the context window shorter. It may look counterintuitive, but when the context window is limited to a single query and answer, it doesn't need to be as long as in LLMs,
	where it has to fit the whole chat history. Full SQA attention is then fast enough, and token relations inside the current interaction are naturally the strongest. Furthermore,
	RxT has a _native sliding window_ that is bounded not by a fixed number of tokens but by the current interaction, which is a more natural boundary.

	On the other hand, sparse attention is designed for unidirectional/autoregressive attention in decoder-only models, so its compatibility with the bidirectional encoder and memory
	attention is rather weak, especially for memory, which has no spatial relations.

	#### Linear Attention for RxT
	We tested new Linear Attention solutions and a hybrid attention architecture for RxT-Beta self-attention, but for the short single-interaction sequences used for the MVP (1-8k tokens),
	training was about 2-3x slower than with the full SQA baseline, due to architectural complexity overhead. We believe it will become valuable in future generations, when
	we extend interaction length to 32k+ tokens, and we plan to integrate intra-sequence recurrence (Linear Attention state) with inter-sequence recurrence (RxT memory) in our
	custom solution called Memory-driven Gated DeltaNet.

	### Gated Self-Attention
	Following the direction from [Alibaba/Qwen Team research](https://arxiv.org/abs/2505.06708), we added sigmoid gates to our SQA self-attention layers (in both decoder and
	encoder). As in the Qwen Team solution, gate values are based on the query and applied before the final output projection. The only difference is that in SQA the gate has reduced dimensionality,
	the same as the query and the attention calculation.

	We also tested it in cross-attention, but results were a lot worse than baseline without gates, probably because of different input sources - gate is based on query, which is the
	processed sequence, while attention results are based on values from memory. So finally, we are using gates only for self-attention layers.
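
The self-attention gating can be sketched for a single token as follows. This is a minimal pure-Python illustration, assuming the gate is a separate linear projection of the token's query-side input followed by a sigmoid, with the reduced SQA width (query heads x head dim) rather than the full model dim; the function and weight names are hypothetical.

```python
import math


def matvec(matrix: list[list[float]], vector: list[float]) -> list[float]:
    """Multiply a (rows x cols) matrix by a vector of length cols."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def gated_attention_output(x, attn_out, w_gate, w_out):
    """Apply a sigmoid gate to attention results before the output projection.

    x:        token input, length dim
    attn_out: concatenated head outputs, length q_heads * head_dim
    w_gate:   (q_heads * head_dim) x dim gate projection
    w_out:    dim x (q_heads * head_dim) output projection
    """
    gate = [sigmoid(g) for g in matvec(w_gate, x)]
    gated = [g * a for g, a in zip(gate, attn_out)]
    return matvec(w_out, gated)


# Tiny example: dim = 4, reduced attention width = 2.
x = [1.0, 2.0, 3.0, 4.0]
attn_out = [2.0, -4.0]
w_gate = [[0.0] * 4 for _ in range(2)]      # zero weights -> gate of 0.5
w_out = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
print(gated_attention_output(x, attn_out, w_gate, w_out))
```

With zero gate weights every gate value is 0.5, so the example simply halves the attention output before projecting it back to the model dim.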

	### Sparse Mixture-of-Experts (MoE) with gated shared experts
	Recent models, like Kimi K2 or Qwen3-Next, demonstrated the effectiveness of architectures with a large number of small experts and highly sparse per-token activation.
	We follow the same direction in the RxT-Beta Mixture-of-Experts, with 10 of 384 experts activated per token. We extend it with two bigger shared experts with a softmax gate,
	for even better expressiveness. Both shared experts are used for all tokens, but the gate decides which one is more important for each token. We plan to introduce task-aware shared
	expert load balancing in the next training stages, specializing one expert in reasoning and dedicating the second one to fast answers, to better balance hybrid reasoning abilities.
	Shared experts are 2x bigger than routed experts.
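
The routing decision for one token can be sketched as below. This is an illustration only: it assumes softmax router probabilities renormalized over the selected top-10 experts, which the card does not specify, and `route_token` is a hypothetical helper that returns mixing weights without running any expert.

```python
import math
import random


def softmax(values: list[float]) -> list[float]:
    peak = max(values)                       # subtract max for stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(router_logits, shared_gate_logits, top_k=10):
    """Return ({expert index: weight} for routed experts, shared-expert weights)."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:top_k]
    norm = sum(probs[i] for i in top)
    # Assumption: weights are renormalized over the selected experts.
    routed = {i: probs[i] / norm for i in top}
    # Both shared experts always run; the softmax gate only weights them.
    shared = softmax(shared_gate_logits)
    return routed, shared


rng = random.Random(0)
logits = [rng.gauss(0.0, 1.0) for _ in range(384)]
routed, shared = route_token(logits, shared_gate_logits=[0.3, -0.1])
print(len(routed), "routed experts selected; shared weights:", shared)
```

The token's feed-forward output would then be the weighted sum of the 10 selected routed experts plus the gate-weighted sum of both shared experts.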

	### Bidirectional Masked Language Modeling (MLM) in decoder pre-training
	In the unique RxT pre-training method, the decoder learns with both unidirectional autoregressive language modeling (self-attention) and bidirectional modeling (cross-attention).
	This boosts training effectiveness with a "super-convergence" effect, but also makes training too easy, which leads to a quick loss plateau early in training. To prevent this,
	we make the decoder's task harder by adding random noise to the encoder's outputs. In early experiments we used small noise levels like 0.15-0.2, but in RxT-Beta we increased
	it to 0.5 as a starting point. Additionally, we added random masking of encoder outputs to make predicting tokens at masked positions even harder. This adds
	another objective to the decoder's training, close to the masked language modeling used in encoder training.

	To make the training even more effective, we introduced a progressive increase of the noise level and masking probability. With this solution even a loss plateau is "healthy", because
	the objective becomes harder with each step.
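
A progressive corruption schedule of this kind can be sketched as follows. The details are assumptions made for illustration: a linear ramp, Gaussian noise whose standard deviation is the noise level, masking implemented as zeroing the whole position, and end values (0.8 noise, 0.2 to 0.4 masking) that are purely hypothetical. Only the 0.5 starting noise level comes from this card.

```python
import random


def progressive_value(step: int, total_steps: int, start: float, end: float) -> float:
    """Linear ramp from start to end over the course of training."""
    fraction = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * fraction


def corrupt_encoder_outputs(hidden, noise_level, mask_prob, rng):
    """Mask some positions and add Gaussian noise to the rest.

    hidden: list of per-token vectors (lists of floats).
    """
    corrupted = []
    for vector in hidden:
        if rng.random() < mask_prob:
            corrupted.append([0.0] * len(vector))   # masked position
        else:
            corrupted.append([v + rng.gauss(0.0, noise_level) for v in vector])
    return corrupted


rng = random.Random(0)
hidden = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
for step in (0, 5_000, 10_000):
    noise = progressive_value(step, 10_000, start=0.5, end=0.8)       # end is hypothetical
    mask_prob = progressive_value(step, 10_000, start=0.2, end=0.4)   # both hypothetical
    noisy = corrupt_encoder_outputs(hidden, noise, mask_prob, rng)
    print(f"step {step}: noise={noise:.2f} mask_prob={mask_prob:.2f} first={noisy[0]}")
```

The decoder would attend to the corrupted outputs through cross-attention, so the same training batch becomes a harder prediction problem as the step count grows.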

	Even with high noise and masking rates, the decoder quickly achieves over 90% prediction accuracy, then spends about 99% of training time learning to correctly predict
	the remaining 10% (responsible for the most important knowledge), finally reaching 98-99% accuracy. This level is not reachable in classic decoder-only LLM training - we
	believe this is the main reason for RxT's extreme training efficiency. More details are in the training process description below.

	## Training Process
	Description in progress

	<img src="https://huggingface.co/ReactiveAI/RxT-Beta-Decoder-Base/resolve/main/RxT-Beta-Joint-Training.png" />