new

Get trending papers in your email inbox!

Subscribe

Daily Papers

byAK and the research community

Apr 23

KnowRL: Teaching Language Models to Know What They Know

Truly reliable AI requires more than simply scaling up knowledge; it demands the ability to know what it knows and when it does not. Yet recent research shows that even the best LLMs misjudge their own competence in more than one in five cases, making any response born of such internal uncertainty impossible to fully trust. Inspired by self-improvement reinforcement learning techniques that require minimal data, we present a simple but powerful framework KnowRL that strengthens a model's internal understanding of its own feasibility boundaries, enabling safer and more responsible behaviour. Our framework combines two components: (i) introspection, where the model generates and classifies tasks it judges feasible or infeasible, and (ii) consensus-based rewarding, where stability of self-knowledge assessment is reinforced through internal agreement. By using internally generated data, this design strengthens consistency in self-knowledge and entirely avoids costly external supervision. In experiments on LLaMA-3.1-8B and Qwen-2.5-7B, KnowRL steadily improved self-knowledge, validated by both intrinsic self-consistency and extrinsic benchmarking. With nothing more than a small seed set and no external supervision, our method drove gains as high as 28% in accuracy and 12% in F1, outperforming baselines in just a few iterations. Our framework essentially unlocks the untapped capacity of LLMs to self-improve their knowledge awareness, opening the door to reliable, more accountable AI and safer deployment in critical applications. Owing to its simplicity and independence from external effort, we encourage applying this reliability-enhancing process to all future models.

  • 2 authors
·
Oct 13, 2025

X-Boundary: Establishing Exact Safety Boundary to Shield LLMs from Multi-Turn Jailbreaks without Compromising Usability

Despite the rapid development of safety alignment techniques for LLMs, defending against multi-turn jailbreaks is still a challenging task. In this paper, we conduct a comprehensive comparison, revealing that some existing defense methods can improve the robustness of LLMs against multi-turn jailbreaks but compromise usability, i.e., reducing general capabilities or causing the over-refusal problem. From the perspective of mechanism interpretability of LLMs, we discover that these methods fail to establish a boundary that exactly distinguishes safe and harmful feature representations. Therefore, boundary-safe representations close to harmful representations are inevitably disrupted, leading to a decline in usability. To address this issue, we propose X-Boundary to push harmful representations away from boundary-safe representations and obtain an exact distinction boundary. In this way, harmful representations can be precisely erased without disrupting safe ones. Experimental results show that X-Boundary achieves state-of-the-art defense performance against multi-turn jailbreaks, while reducing the over-refusal rate by about 20% and maintaining nearly complete general capability. Furthermore, we theoretically prove and empirically verify that X-Boundary can accelerate the convergence process during training. Please see our code at: https://github.com/AI45Lab/X-Boundary.

  • 5 authors
·
Feb 14, 2025

Numerical Approximation Capacity of Neural Networks with Bounded Parameters: Do Limits Exist, and How Can They Be Measured?

The Universal Approximation Theorem posits that neural networks can theoretically possess unlimited approximation capacity with a suitable activation function and a freely chosen or trained set of parameters. However, a more practical scenario arises when these neural parameters, especially the nonlinear weights and biases, are bounded. This leads us to question: Does the approximation capacity of a neural network remain universal, or does it have a limit when the parameters are practically bounded? And if it has a limit, how can it be measured? Our theoretical study indicates that while universal approximation is theoretically feasible, in practical numerical scenarios, Deep Neural Networks (DNNs) with any analytic activation functions (such as Tanh and Sigmoid) can only be approximated by a finite-dimensional vector space under a bounded nonlinear parameter space (NP space), whether in a continuous or discrete sense. Based on this study, we introduce the concepts of ε outer measure and Numerical Span Dimension (NSdim) to quantify the approximation capacity limit of a family of networks both theoretically and practically. Furthermore, drawing on our new theoretical study and adopting a fresh perspective, we strive to understand the relationship between back-propagation neural networks and random parameter networks (such as the Extreme Learning Machine (ELM)) with both finite and infinite width. We also aim to provide fresh insights into regularization, the trade-off between width and depth, parameter space, width redundancy, condensation, and other related important issues.

  • 3 authors
·
Sep 25, 2024