Computation and Language 76
☆ Requirements Elicitation Follow-Up Question Generation
Interviews are a widely used requirements elicitation technique for gathering
stakeholder needs, preferences, and expectations for a software system.
Effective interviewing requires skilled interviewers to formulate appropriate
interview questions in real time while facing multiple challenges, including
lack of familiarity with the domain, excessive cognitive load, and information
overload that hinders how humans process stakeholders' speech. Recently, large
language models (LLMs) have exhibited state-of-the-art performance in multiple
natural language processing tasks, including text summarization and entailment.
To support interviewers, we investigate the application of GPT-4o to generate
follow-up interview questions during requirements elicitation by building on a
framework of common interviewer mistake types. In addition, we describe methods
to generate questions based on interviewee speech. We report a controlled
experiment to evaluate LLM-generated and human-authored questions with minimal
guidance, and a second controlled experiment to evaluate the LLM-generated
questions when generation is guided by interviewer mistake types. Our findings
demonstrate that, for both experiments, the LLM-generated questions are no
worse than the human-authored questions with respect to clarity, relevancy, and
informativeness. In addition, LLM-generated questions outperform human-authored
questions when guided by common mistake types. This highlights the potential
of using LLMs to help interviewers improve the quality and ease of requirements
elicitation interviews in real time.
comment: 13 pages, 2 figures, accepted at the 33rd IEEE International
  Requirements Engineering Conference (RE 2025)
☆ Answer Matching Outperforms Multiple Choice for Language Model Evaluation
Multiple choice benchmarks have long been the workhorse of language model
evaluation because grading multiple choice is objective and easy to automate.
However, we show multiple choice questions from popular benchmarks can often be
answered without even seeing the question. These shortcuts arise from a
fundamental limitation of discriminative evaluation not shared by evaluations
of the model's free-form, generative answers. Until recently, there appeared to
be no viable, scalable alternative to multiple choice -- but we show that this
has changed. We consider generative evaluation via what we call answer
matching: Give the candidate model the question without the options, have it
generate a free-form response, then use a modern language model with the
reference answer to determine if the response matches the reference. To compare
the validity of different evaluation strategies, we annotate MMLU-Pro and
GPQA-Diamond to obtain human grading data, and measure the agreement of each
evaluation approach. We find answer matching using recent models--even small
ones--achieves near-perfect agreement, in the range of inter-annotator
agreement. In contrast, both multiple choice evaluation and using
LLM-as-a-judge without reference answers align poorly with human grading.
Improving evaluations via answer matching is not merely a conceptual concern:
the rankings of several models change significantly when evaluating their
free-form responses with answer matching. In light of these findings, we
discuss how to move the evaluation ecosystem from multiple choice to answer
matching.
comment: 34 pages, Code is available at
https://github.com/nikhilchandak/answer-matching
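The answer-matching protocol in the abstract above is simple enough to sketch. The snippet below is a minimal illustration, assuming an OpenAI-compatible chat API; the model names, grading prompt, and yes/no parsing are placeholders, not the authors' released implementation.

```python
# Sketch of generative evaluation via answer matching (illustrative, not the paper's code).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def free_form_answer(question: str, candidate_model: str) -> str:
    """Step 1: ask the candidate model the question *without* the answer options."""
    resp = client.chat.completions.create(
        model=candidate_model,
        messages=[{"role": "user", "content": question}],
    )
    return resp.choices[0].message.content

def answer_matches(question: str, reference: str, response: str, matcher_model: str) -> bool:
    """Step 2: a matcher model judges whether the free-form response matches the reference."""
    prompt = (
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Model response: {response}\n"
        "Does the response express the same answer as the reference? Reply yes or no."
    )
    verdict = client.chat.completions.create(
        model=matcher_model,
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content
    return verdict.strip().lower().startswith("yes")
```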
☆ MOTIF: Modular Thinking via Reinforcement Fine-tuning in LLMs
Recent advancements in the reasoning capabilities of large language models
(LLMs) show that employing the group relative policy optimization (GRPO) algorithm
for reinforcement learning (RL) training allows the models to use more
thinking/reasoning tokens for generating better responses. However, LLMs can
generate only a finite amount of tokens while maintaining attention to the
previously generated tokens. This limit, also known as the context size of an
LLM, is a bottleneck in LLM reasoning with an arbitrarily large number of tokens.
To think beyond the limit of context size, an LLM must employ a modular
thinking strategy to reason over multiple rounds. In this work, we propose
$\textbf{MOTIF: Modular Thinking via Reinforcement Finetuning}$ -- an RL
training method for generating thinking tokens in multiple rounds, effectively
allowing the model to think with additional context size. We trained the
open-source model Qwen2.5-3B-Instruct on the GSM8K dataset via parameter-efficient
fine-tuning and tested its accuracy on the MATH500 and AIME2024 benchmarks. Our
experiments show 3.8\% and 3.3\% improvements over vanilla GRPO-based training
in the respective benchmarks. Furthermore, this improvement was achieved with
only 15\% of samples, thus demonstrating sample efficiency of MOTIF. Our code
and models are available at https://github.com/purbeshmitra/MOTIF and
https://huggingface.co/purbeshmitra/MOTIF, respectively.
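The multi-round ("modular") thinking loop that MOTIF trains for can be sketched abstractly. In the sketch below, `generate()` is a placeholder for an LLM call, and the round budget, summary prompt, and FINAL marker are illustrative assumptions rather than MOTIF's exact recipe.

```python
# Illustrative multi-round thinking loop; generate() wraps an LLM inference call.
def generate(prompt: str, max_tokens: int) -> str:
    raise NotImplementedError("wrap your LLM inference call here")

def modular_think(question: str, rounds: int = 3, tokens_per_round: int = 1024) -> str:
    carry = ""  # condensed reasoning carried across rounds instead of the full history
    for _ in range(rounds):
        prompt = (
            f"Question: {question}\n"
            f"Summary of reasoning so far: {carry or '(none)'}\n"
            "Continue reasoning. If you can, end with 'FINAL:' followed by the answer."
        )
        step = generate(prompt, max_tokens=tokens_per_round)
        if "FINAL:" in step:
            return step.split("FINAL:", 1)[1].strip()
        # compress this round so the next round fits in the same context budget
        carry = generate(f"Summarize the key intermediate results:\n{step}", max_tokens=256)
    return carry  # no explicit answer produced within the round budget
```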
☆ LLM Hypnosis: Exploiting User Feedback for Unauthorized Knowledge Injection to All Users
We describe a vulnerability in language models (LMs) trained with user
feedback, whereby a single user can persistently alter LM knowledge and
behavior given only the ability to provide prompts and upvote / downvote
feedback on LM outputs. To implement the attack, the attacker prompts the LM to
stochastically output either a "poisoned" or benign response, then upvotes the
poisoned response or downvotes the benign one. When these feedback signals are
used in subsequent preference tuning, LMs exhibit increased probability
of producing poisoned responses even in contexts without malicious prompts. We
show that this attack can be used to (1) insert factual knowledge the model did
not previously possess, (2) modify code generation patterns in ways that
introduce exploitable security flaws, and (3) inject fake financial news. Our
findings identify both a new qualitative feature of language model preference
tuning (showing that even highly restricted forms of preference data can be
used to exert fine-grained control over behavior) and a new attack mechanism
for LMs trained with user feedback (extending work on pretraining-time data
poisoning and deployment-time prompt injection).
☆ Legal Requirements Translation from Law
Ensuring that software systems comply with legal regulations is a
resource-intensive task, particularly for small organizations and startups
lacking dedicated legal expertise. Extracting metadata from regulations to
elicit legal requirements for software is a critical step to ensure compliance.
However, it is a cumbersome task due to the length and complex nature of legal
text. Although prior work has pursued automated methods for extracting
structural and semantic metadata from legal text, key limitations remain: they
do not consider the interplay and interrelationships among attributes
associated with these metadata types, and they rely on manual labeling or
heuristic-driven machine learning, which does not generalize well to new
documents. In this paper, we introduce an approach based on textual entailment
and in-context learning for automatically generating a canonical representation
of legal text, encodable and executable as Python code. Our representation is
instantiated from a manually designed Python class structure that serves as a
domain-specific metamodel, capturing both structural and semantic legal
metadata and their interrelationships. This design choice reduces the need for
large, manually labeled datasets and enhances applicability to unseen
legislation. We evaluate our approach on 13 U.S. state data breach notification
laws, demonstrating that our generated representations pass approximately 89.4%
of test cases and achieve a precision and recall of 82.2 and 88.7,
respectively.
comment: 13 pages, 7 figures, Accepted at the 33rd IEEE International
  Requirements Engineering Conference (RE 2025)
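The abstract above mentions a manually designed Python class structure serving as a domain-specific metamodel. The sketch below shows what such a metamodel could look like for data breach notification laws; the class and field names are invented for illustration and are not the authors' schema.

```python
# Hypothetical legal-requirements metamodel; field names are illustrative only.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Condition:
    """A triggering condition in a provision, e.g. 'breach affects more than 500 residents'."""
    description: str

@dataclass
class Obligation:
    """Who must do what, by when, under which conditions."""
    actor: str                    # e.g. "data collector"
    action: str                   # e.g. "notify affected residents"
    deadline_days: Optional[int]  # e.g. 45; None if the statute gives no deadline
    conditions: List[Condition] = field(default_factory=list)

@dataclass
class Provision:
    """A structural unit of the statute together with its semantic metadata."""
    citation: str                 # e.g. "Cal. Civ. Code Sec. 1798.82(a)"
    obligations: List[Obligation] = field(default_factory=list)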
☆ Visual Contextual Attack: Jailbreaking MLLMs with Image-Driven Context Injection
With the emergence of strong visual-language capabilities, multimodal large
language models (MLLMs) have demonstrated tremendous potential for real-world
applications. However, the security vulnerabilities exhibited by the visual
modality pose significant challenges to deploying such models in open-world
environments. Recent studies have successfully induced harmful responses from
target MLLMs by encoding harmful textual semantics directly into visual inputs.
However, in these approaches, the visual modality primarily serves as a trigger
for unsafe behavior, often exhibiting semantic ambiguity and lacking grounding
in realistic scenarios. In this work, we define a novel setting: visual-centric
jailbreak, where visual information serves as a necessary component in
constructing a complete and realistic jailbreak context. Building on this
setting, we propose the VisCo (Visual Contextual) Attack. VisCo fabricates
contextual dialogue using four distinct visual-focused strategies, dynamically
generating auxiliary images when necessary to construct a visual-centric
jailbreak scenario. To maximize attack effectiveness, it incorporates automatic
toxicity obfuscation and semantic refinement to produce a final attack prompt
that reliably triggers harmful responses from the target black-box MLLMs.
Specifically, VisCo achieves a toxicity score of 4.78 and an Attack Success
Rate (ASR) of 85% on MM-SafetyBench against GPT-4o, significantly outperforming
the baseline, which achieves a toxicity score of 2.48 and an ASR of 22.2%. The
code is available at https://github.com/Dtc7w3PQ/Visco-Attack.
comment: 16 pages
☆ StepHint: Multi-level Stepwise Hints Enhance Reinforcement Learning to Reason
Reinforcement learning with verifiable rewards (RLVR) is a promising approach
for improving the complex reasoning abilities of large language models (LLMs).
However, current RLVR methods face two significant challenges: the near-miss
reward problem, where a small mistake can invalidate an otherwise correct
reasoning process, greatly hindering training efficiency; and exploration
stagnation, where models tend to focus on solutions within their ``comfort
zone,'' lacking the motivation to explore potentially more effective
alternatives. To address these challenges, we propose StepHint, a novel RLVR
algorithm that utilizes multi-level stepwise hints to help models explore the
solution space more effectively. StepHint generates valid reasoning chains from
stronger models and partitions these chains into reasoning steps using our
proposed adaptive partitioning method. The initial few steps are used as hints,
and simultaneously, multiple-level hints (each comprising a different number of
steps) are provided to the model. This approach directs the model's exploration
toward a promising solution subspace while preserving its flexibility for
independent exploration. By providing hints, StepHint mitigates the near-miss
reward problem, thereby improving training efficiency. Additionally, the
external reasoning pathways help the model develop better reasoning abilities,
enabling it to move beyond its ``comfort zone'' and mitigate exploration
stagnation. StepHint outperforms competitive RLVR enhancement methods across
six mathematical benchmarks, while also demonstrating superior generalization
and excelling over baselines on out-of-domain benchmarks.
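The hint construction described above can be illustrated once a reasoning chain has already been partitioned into steps. The sketch below abstracts away the paper's adaptive partitioning method; the example steps and hint levels are illustrative.

```python
# Build multi-level stepwise hints from an already-partitioned reasoning chain.
from typing import List

def build_hints(reasoning_steps: List[str], levels: List[int]) -> List[str]:
    """Return one hint prefix per level; level k exposes the first k steps."""
    hints = []
    for k in levels:
        k = min(k, len(reasoning_steps))
        hints.append("\n".join(reasoning_steps[:k]))
    return hints

steps = ["Let x be the unknown quantity.",
         "Set up the equation 3x + 4 = 19.",
         "Subtract 4 from both sides: 3x = 15.",
         "Divide by 3: x = 5."]
for hint in build_hints(steps, levels=[1, 2, 3]):
    prompt = f"Problem: ...\nHint:\n{hint}\nContinue the solution."
    # each hinted prompt would be rolled out and scored with a verifiable reward
```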
☆ ExPO: Unlocking Hard Reasoning with Self-Explanation-Guided Reinforcement Learning
Recent advances in large language models have been driven by reinforcement
learning (RL)-style post-training, which improves reasoning by optimizing model
outputs based on reward or preference signals. GRPO-style approaches implement
this by using self-generated samples labeled by an outcome-based verifier.
However, these methods depend heavily on the model's initial ability to produce
positive samples. They primarily refine what the model already knows
(distribution sharpening) rather than enabling the model to solve problems
where it initially fails. This limitation is especially problematic in
early-stage RL training and on challenging reasoning tasks, where positive
samples are unlikely to be generated. To unlock reasoning ability in such
settings, the model must explore new reasoning trajectories beyond its current
output distribution. Such exploration requires access to sufficiently good
positive samples to guide the learning. While expert demonstrations seem like a
natural solution, we find that they are often ineffective in RL post-training.
Instead, we identify two key properties of effective positive samples: they
should (1) be likely under the current policy, and (2) increase the model's
likelihood of predicting the correct answer. Based on these insights, we
propose $\textbf{Self-Explanation Policy Optimization (ExPO)}$ -- a simple and
modular framework that generates such samples by conditioning on the
ground-truth answer. ExPO enables efficient exploration and guides the model to
produce reasoning trajectories more aligned with its policy than expert-written
CoTs, while ensuring higher quality than its own (incorrect) samples.
Experiments show that ExPO improves both learning efficiency and final
performance on reasoning benchmarks, surpassing expert-demonstration-based
methods in challenging settings such as MATH level-5, where the model initially
struggles the most.
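The two properties of effective positive samples named in the abstract can be turned into a simple filter. The sketch below is an assumption-laden illustration: the helper functions and the threshold are placeholders, and the paper's exact sample-construction procedure may differ.

```python
# Illustrative ExPO-style positive-sample construction (helpers and threshold are assumed).
def generate_self_explanation(question: str, answer: str) -> str:
    """Condition the model on the ground-truth answer to elicit an explanation."""
    raise NotImplementedError("e.g. prompt: 'The answer is {answer}. Explain step by step why.'")

def avg_logprob_under_policy(text: str) -> float:
    raise NotImplementedError("average token log-prob of `text` under the current policy")

def answer_logprob(question: str, explanation: str, answer: str) -> float:
    raise NotImplementedError("log-prob of `answer` given the question plus explanation")

def make_positive_sample(question: str, answer: str, min_policy_lp: float = -2.0):
    expl = generate_self_explanation(question, answer)
    on_policy = avg_logprob_under_policy(expl) > min_policy_lp                      # property (1)
    helpful = answer_logprob(question, expl, answer) > answer_logprob(question, "", answer)  # property (2)
    return expl if (on_policy and helpful) else None
```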
☆ Generalizing Verifiable Instruction Following
Valentina Pyatkin, Saumya Malik, Victoria Graf, Hamish Ivison, Shengyi Huang, Pradeep Dasigi, Nathan Lambert, Hannaneh Hajishirzi
A crucial factor for successful human and AI interaction is the ability of
language models or chatbots to follow human instructions precisely. A common
feature of instructions is output constraints like ``only answer with yes or
no'' or ``mention the word `abrakadabra' at least 3 times'' that the user adds to
craft a more useful answer. Even today's strongest models struggle with
fulfilling such constraints. We find that most models strongly overfit on a
small set of verifiable constraints from the benchmarks that test these
abilities, a skill called precise instruction following, and are not able to
generalize well to unseen output constraints. We introduce a new benchmark,
IFBench, to evaluate precise instruction following generalization on 58 new,
diverse, and challenging verifiable out-of-domain constraints. In addition, we
perform an extensive analysis of how and on what data models can be trained to
improve precise instruction following generalization. Specifically, we
carefully design constraint verification modules and show that reinforcement
learning with verifiable rewards (RLVR) significantly improves instruction
following. In addition to IFBench, we release 29 additional new hand-annotated
training constraints and verification functions, RLVR training prompts, and
code.
comment: 11 pages
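The kind of constraint verification function described above is easy to picture for the two constraints quoted in the abstract. The toy verifiers below are written for illustration and are not the released IFBench verification functions.

```python
# Toy verifiers for the two example output constraints quoted in the abstract.
import re

def verify_yes_no_only(response: str) -> bool:
    """Constraint: 'only answer with yes or no'."""
    return response.strip().lower() in {"yes", "no"}

def verify_min_word_count(response: str, word: str = "abrakadabra", n: int = 3) -> bool:
    """Constraint: 'mention the word X at least n times'."""
    return len(re.findall(rf"\b{re.escape(word)}\b", response.lower())) >= n

assert verify_yes_no_only("Yes")
assert verify_min_word_count("abrakadabra abrakadabra ... abrakadabra")
```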
☆ SynapseRoute: An Auto-Route Switching Framework on Dual-State Large Language Model
Wencheng Zhang, Shiqin Qiao, Lingjie Luo, Yinfeng Li, Chuanyang Zheng, Qian Xu, Meng Li, Yong Gui, Yijun He, Jianing Qiu, Jindong Hong, Jiankai Sun
With the widespread adoption of large language models (LLMs) in practical
applications, selecting an appropriate model requires balancing not only
performance but also operational cost. The emergence of reasoning-capable
models has further widened the cost gap between "thinking" (high reasoning) and
"non-thinking" (fast, low-cost) modes. In this work, we reveal that
approximately 58% of medical questions can be accurately answered by the
non-thinking mode alone, without requiring the high-cost reasoning process.
This highlights a clear dichotomy in problem complexity and suggests that
dynamically routing queries to the appropriate mode based on complexity could
optimize accuracy, cost-efficiency, and overall user experience. Based on this,
we further propose SynapseRoute, a machine learning-based dynamic routing
framework that intelligently assigns input queries to either thinking or
non-thinking modes. Experimental results on several medical datasets
demonstrate that SynapseRoute not only improves overall accuracy (0.8390 vs.
0.8272) compared to the thinking mode alone but also reduces inference time by
36.8% and token consumption by 39.66%. Importantly, qualitative analysis
indicates that over-reasoning on simpler queries can lead to unnecessary delays
and even decreased accuracy, a pitfall avoided by our adaptive routing.
Finally, this work introduces the Accuracy-Inference-Token (AIT) index
to comprehensively evaluate the trade-offs among accuracy, latency, and token
cost.
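A complexity-based router of the kind described above can be sketched with a small classifier, and the accuracy/latency/token trade-off can be summarized by a scalar score. The features, classifier, and AIT weighting below are assumptions for illustration; the abstract does not give SynapseRoute's exact definitions.

```python
# Illustrative router between "thinking" and "non-thinking" modes, plus a toy AIT score.
import numpy as np
from sklearn.linear_model import LogisticRegression

def featurize(query: str) -> list[float]:
    # toy features standing in for a learned complexity signal
    return [len(query.split()), query.count("?"),
            float(any(w in query.lower() for w in ("mechanism", "differential", "why")))]

# toy training data: 1 = query needs the thinking mode, 0 = non-thinking suffices
X = np.array([[4, 1, 0], [35, 2, 1], [6, 1, 0], [40, 1, 1]])
y = np.array([0, 1, 0, 1])
router = LogisticRegression().fit(X, y)

def route(query: str) -> str:
    return "thinking" if router.predict([featurize(query)])[0] == 1 else "non-thinking"

def ait_index(accuracy: float, latency_s: float, tokens: int,
              w_lat: float = 0.01, w_tok: float = 1e-4) -> float:
    """Toy Accuracy-Inference-Token score: reward accuracy, penalize time and tokens."""
    return accuracy - w_lat * latency_s - w_tok * tokens

print(route("What is the recommended adult dose of ibuprofen?"))
```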
☆ Multimodal Mathematical Reasoning with Diverse Solving Perspective
Recent progress in large-scale reinforcement learning (RL) has notably
enhanced the reasoning capabilities of large language models (LLMs), especially
in mathematical domains. However, current multimodal LLMs (MLLMs) for
mathematical reasoning often rely on one-to-one image-text pairs and
single-solution supervision, overlooking the diversity of valid reasoning
perspectives and internal reflections. In this work, we introduce MathV-DP, a
novel dataset that captures multiple diverse solution trajectories for each
image-question pair, fostering richer reasoning supervision. We further propose
Qwen-VL-DP, a model built upon Qwen-VL, fine-tuned with supervised learning and
enhanced via group relative policy optimization (GRPO), a rule-based RL
approach that integrates correctness discrimination and diversity-aware reward
functions. Our method emphasizes learning from varied reasoning perspectives
and distinguishing between correct yet distinct solutions. Extensive
experiments on the MathVista's minitest and Math-V benchmarks demonstrate that
Qwen-VL-DP significantly outperforms prior base MLLMs in both accuracy and
generative diversity, highlighting the importance of incorporating diverse
perspectives and reflective reasoning in multimodal mathematical reasoning.
comment: 8 pages
☆ Is Reasoning All You Need? Probing Bias in the Age of Reasoning Language Models
Reasoning Language Models (RLMs) have gained traction for their ability to
perform complex, multi-step reasoning tasks through mechanisms such as
Chain-of-Thought (CoT) prompting or fine-tuned reasoning traces. While these
capabilities promise improved reliability, their impact on robustness to social
biases remains unclear. In this work, we leverage the CLEAR-Bias benchmark,
originally designed for Large Language Models (LLMs), to investigate the
adversarial robustness of RLMs to bias elicitation. We systematically evaluate
state-of-the-art RLMs across diverse sociocultural dimensions, using an
LLM-as-a-judge approach for automated safety scoring and leveraging jailbreak
techniques to assess the strength of built-in safety mechanisms. Our evaluation
addresses three key questions: (i) how the introduction of reasoning
capabilities affects model fairness and robustness; (ii) whether models
fine-tuned for reasoning exhibit greater safety than those relying on CoT
prompting at inference time; and (iii) how the success rate of jailbreak
attacks targeting bias elicitation varies with the reasoning mechanisms
employed. Our findings reveal a nuanced relationship between reasoning
capabilities and bias safety. Surprisingly, models with explicit reasoning,
whether via CoT prompting or fine-tuned reasoning traces, are generally more
vulnerable to bias elicitation than base models without such mechanisms,
suggesting reasoning may unintentionally open new pathways for stereotype
reinforcement. Reasoning-enabled models appear somewhat safer than those
relying on CoT prompting, which are particularly prone to contextual reframing
attacks through storytelling prompts, fictional personas, or reward-shaped
instructions. These results challenge the assumption that reasoning inherently
improves robustness and underscore the need for more bias-aware approaches to
reasoning design.
☆ From Long Videos to Engaging Clips: A Human-Inspired Video Editing Framework with Multimodal Narrative Understanding
Xiangfeng Wang, Xiao Li, Yadong Wei, Xueyu Song, Yang Song, Xiaoqiang Xia, Fangrui Zeng, Zaiyi Chen, Liu Liu, Gu Xu, Tong Xu
The rapid growth of online video content, especially on short video
platforms, has created a growing demand for efficient video editing techniques
that can condense long-form videos into concise and engaging clips. Existing
automatic editing methods predominantly rely on textual cues from ASR
transcripts and end-to-end segment selection, often neglecting the rich visual
context and leading to incoherent outputs. In this paper, we propose a
human-inspired automatic video editing framework (HIVE) that leverages
multimodal narrative understanding to address these limitations. Our approach
incorporates character extraction, dialogue analysis, and narrative
summarization through multimodal large language models, enabling a holistic
understanding of the video content. To further enhance coherence, we apply
scene-level segmentation and decompose the editing process into three subtasks:
highlight detection, opening/ending selection, and pruning of irrelevant
content. To facilitate research in this area, we introduce DramaAD, a novel
benchmark dataset comprising over 800 short drama episodes and 500
professionally edited advertisement clips. Experimental results demonstrate
that our framework consistently outperforms existing baselines across both
general and advertisement-oriented editing tasks, significantly narrowing the
quality gap between automatic and human-edited videos.
☆ Self-Correction Bench: Revealing and Addressing the Self-Correction Blind Spot in LLMs
Although large language models (LLMs) have become transformative, they still
make mistakes and can explore unproductive reasoning paths. Self-correction is
an important capability for a trustworthy LLM, particularly an autoregressive
LLM. While LLMs can identify errors in user input, they exhibit a systematic
'Self-Correction Blind Spot' - failing to correct identical errors in their own
outputs. To study this phenomenon systematically, we introduce Self-Correction
Bench, a framework that measures it through controlled error injection at three
complexity levels. Testing 14 models, we find an average 64.5% blind spot rate.
Multiple lines of evidence indicate that this limitation
relates to training data composition: human training demonstrations
predominantly show error-free responses rather than error-correction sequences,
unlike RL-trained models that learn error correction through outcome feedback.
Remarkably, simply appending "Wait" reduces blind spots by 89.3%, suggesting
that the capability exists but requires activation. Our work highlights a
critical limitation in current LLMs and offers potential avenues for improving
their reliability and trustworthiness.
comment: 31 pages, 18 figures
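The blind-spot measurement and the "Wait" intervention described above can be sketched with a generic generation wrapper. The prompt handling and success check below are simplifying assumptions, not the benchmark's released evaluation code.

```python
# Illustrative measurement of the self-correction blind spot and the "Wait" intervention.
def generate(prompt: str) -> str:
    raise NotImplementedError("wrap your LLM inference call here")

def blind_spot_rate(cases, use_wait: bool = False) -> float:
    """cases: list of (prompt_with_injected_error, corrected_answer) pairs."""
    missed = 0
    for flawed_prompt, correct_answer in cases:
        prompt = flawed_prompt + ("\nWait" if use_wait else "")
        continuation = generate(prompt)
        if correct_answer not in continuation:
            missed += 1  # the model kept building on the injected error
    return missed / len(cases)

# Comparing blind_spot_rate(cases) with blind_spot_rate(cases, use_wait=True)
# mirrors, in spirit, the reported reduction after appending "Wait".
```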
☆ DeSTA2.5-Audio: Toward General-Purpose Large Audio Language Model with Self-Generated Cross-Modal Alignment
Ke-Han Lu, Zhehuai Chen, Szu-Wei Fu, Chao-Han Huck Yang, Sung-Feng Huang, Chih-Kai Yang, Chee-En Yu, Chun-Wei Chen, Wei-Chih Chen, Chien-yu Huang, Yi-Cheng Lin, Yu-Xiang Lin, Chi-An Fu, Chun-Yi Kuan, Wenze Ren, Xuanjun Chen, Wei-Ping Huang, En-Pei Hu, Tzu-Quan Lin, Yuan-Kuei Wu, Kuan-Po Huang, Hsiao-Ying Huang, Huang-Cheng Chou, Kai-Wei Chang, Cheng-Han Chiang, Boris Ginsburg, Yu-Chiang Frank Wang, Hung-yi Lee
We introduce DeSTA2.5-Audio, a general-purpose Large Audio Language Model
(LALM) designed for robust auditory perception and instruction-following,
without requiring task-specific audio instruction-tuning. Recent LALMs
typically augment Large Language Models (LLMs) with auditory capabilities by
training on large-scale, manually curated or LLM-synthesized audio-instruction
datasets. However, these approaches have often suffered from the catastrophic
forgetting of the LLM's original language abilities. To address this, we
revisit the data construction pipeline and propose DeSTA, a self-generated
cross-modal alignment strategy in which the backbone LLM generates its own
training targets. This approach preserves the LLM's native language proficiency
while establishing effective audio-text alignment, thereby enabling zero-shot
generalization without task-specific tuning. Using DeSTA, we construct
DeSTA-AQA5M, a large-scale, task-agnostic dataset containing 5 million training
samples derived from 7,000 hours of audio spanning 50 diverse datasets,
including speech, environmental sounds, and music. DeSTA2.5-Audio achieves
state-of-the-art or competitive performance across a wide range of
audio-language benchmarks, including Dynamic-SUPERB, MMAU, SAKURA,
Speech-IFEval, and VoiceBench. Comprehensive comparative studies demonstrate
that our self-generated strategy outperforms widely adopted data construction
and training strategies in both auditory perception and instruction-following
capabilities. Our findings underscore the importance of carefully designed data
construction in LALM development and offer practical insights for building
robust, general-purpose LALMs.
comment: Model and code available at:
https://github.com/kehanlu/DeSTA2.5-Audio
☆ Measurement of the Granularity of Vowel Production Space By Just Producible Different (JPD) Limens
A body of work over the past several decades has demonstrated that the
complex and coordinated articulatory movements of human vowel production are
governed (at least in part) by control mechanisms whose targets are regions of
auditory space. Within the target region, control at the sub-phonemic level has
also been demonstrated. But the degree of accuracy of that control is unknown.
The current work investigates this question by asking how far apart two vowel
stimuli must lie in auditory space in order to yield reliably different
imitations. This distance is termed 'Just Producible Difference' (JPD). The
current study uses a vowel mimicry paradigm to derive the first measurement of
JPD among two sets of English speakers during front vowel production. JPD is
estimated at between 14 and 51 mels in F1 X F2 space. This finding has
implications for episodic theories of speech production. It also clarifies the
possible structures of human vowel systems, by setting a theoretical lower
bound for how close two vowel phonemes may be in a speaker's formant space, and
hence provides a psychophysical explanation of observed trends in the number and patterns of
possible vowel phonemes.
☆ Early Signs of Steganographic Capabilities in Frontier LLMs
Monitoring Large Language Model (LLM) outputs is crucial for mitigating risks
from misuse and misalignment. However, LLMs could evade monitoring through
steganography: encoding hidden information within seemingly benign generations.
In this paper, we evaluate the steganography capabilities in frontier LLMs to
better understand the risk they pose. We focus on two types of steganography:
passing encoded messages and performing encoded reasoning. We find that current
models are unable to encode short messages in their outputs without a monitor
noticing under standard affordances. They can succeed, however, if given
additional affordances such as using an unmonitored scratchpad and coordinating
on what encoding scheme to use. We additionally find early signs that models
can perform basic encoded reasoning in a simple state-tracking problem. This
includes some ability to reason with their own and pre-defined schemes,
including encoding schemes such as Hexadecimal. Despite this, they can rarely
hide reasoning subtly within a cover task to fool a monitor. Overall, our
results indicate that current LLMs exhibit nascent steganographic capabilities.
While these capabilities are likely insufficient to bypass well-designed
monitors at present, this could change in the future.
☆ Can LLMs Identify Critical Limitations within Scientific Research? A Systematic Evaluation on AI Research Papers
Peer review is fundamental to scientific research, but the growing volume of
publications has intensified the challenges of this expertise-intensive
process. While LLMs show promise in various scientific tasks, their potential
to assist with peer review, particularly in identifying paper limitations,
remains understudied. We first present a comprehensive taxonomy of limitation
types in scientific research, with a focus on AI. Guided by this taxonomy, we
present LimitGen, the first comprehensive benchmark
for evaluating LLMs' capability to support early-stage feedback and complement
human peer review. Our benchmark consists of two subsets: LimitGen-Syn, a
synthetic dataset carefully created through controlled perturbations of
high-quality papers, and LimitGen-Human, a collection of real human-written
limitations. To improve the ability of LLM systems to identify limitations, we
augment them with literature retrieval, which is essential for grounding the
identified limitations in prior scientific findings. Our approach enhances the
capabilities of LLM systems to generate limitations in research papers,
enabling them to provide more concrete and constructive feedback.
☆ Exploring Gender Bias Beyond Occupational Titles
In this work, we investigate the correlation between gender and contextual
biases, focusing on elements such as action verbs, object nouns, and
particularly on occupations. We introduce a novel dataset, GenderLexicon, and a
framework that can estimate contextual bias and its related gender bias. Our
model can interpret the bias with a score and thus improve the explainability
of gender bias. Also, our findings confirm the existence of gender biases
beyond occupational stereotypes. To validate our approach and demonstrate its
effectiveness, we conduct evaluations on five diverse datasets, including a
Japanese dataset.
comment: Work in progress
☆ ASDA: Audio Spectrogram Differential Attention Mechanism for Self-Supervised Representation Learning
In recent advancements in audio self-supervised representation learning, the
standard Transformer architecture has emerged as the predominant approach, yet
its attention mechanism often allocates a portion of attention weights to
irrelevant information, potentially impairing the model's discriminative
ability. To address this, we introduce a differential attention mechanism,
which effectively mitigates ineffective attention allocation through the
integration of dual-softmax operations and appropriately tuned differential
coefficients. Experimental results demonstrate that our ASDA model achieves
state-of-the-art (SOTA) performance across multiple benchmarks, including audio
classification (49.0% mAP on AS-2M, 41.5% mAP on AS20K), keyword spotting
(98.3% accuracy on SPC-2), and environmental sound classification (96.1%
accuracy on ESC-50). These results highlight ASDA's effectiveness in audio
tasks, paving the way for broader applications.
comment: Accepted at Interspeech2025
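The differential attention mechanism described above (dual softmax maps combined with a differential coefficient) can be written compactly in PyTorch. The single-head sketch below is illustrative: the dimensions, the learnable scalar coefficient, and its initialization are assumptions, not the ASDA configuration.

```python
# Minimal differential-attention head (single head, batch-first); shapes are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DiffAttention(nn.Module):
    def __init__(self, dim: int, head_dim: int = 64):
        super().__init__()
        self.q = nn.Linear(dim, 2 * head_dim)        # two query maps for the dual softmax
        self.k = nn.Linear(dim, 2 * head_dim)        # two key maps
        self.v = nn.Linear(dim, head_dim)
        self.lam = nn.Parameter(torch.tensor(0.5))   # differential coefficient
        self.scale = head_dim ** -0.5

    def forward(self, x):                            # x: (batch, seq, dim)
        q1, q2 = self.q(x).chunk(2, dim=-1)
        k1, k2 = self.k(x).chunk(2, dim=-1)
        v = self.v(x)
        a1 = F.softmax(q1 @ k1.transpose(-2, -1) * self.scale, dim=-1)
        a2 = F.softmax(q2 @ k2.transpose(-2, -1) * self.scale, dim=-1)
        attn = a1 - self.lam * a2                    # subtract the second map to cancel noise
        return attn @ v                              # (batch, seq, head_dim)

out = DiffAttention(dim=128)(torch.randn(2, 50, 128))
```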
☆ OmniDraft: A Cross-vocabulary, Online Adaptive Drafter for On-device Speculative Decoding
Ramchalam Kinattinkara Ramakrishnan, Zhaocong Yuan, Shaojie Zhuo, Chen Feng, Yicheng Lin, Chenzheng Su, Xiaopeng Zhang
Speculative decoding generally dictates having a small, efficient draft model
that is either pretrained or distilled offline to a particular target model
series, for instance, Llama or Qwen models. However, within online deployment
settings, there are two major challenges: 1) usage of a target model that is
incompatible with the draft model; 2) expectation of latency improvements over
usage and time. In this work, we propose OmniDraft, a unified framework that
enables a single draft model to operate with any target model and adapt
dynamically to user data. We introduce an online n-gram cache with hybrid
distillation fine-tuning to address the cross-vocabulary mismatch across draft
and target models; and further improve decoding speed by leveraging adaptive
drafting techniques. OmniDraft is particularly suitable for on-device LLM
applications where model cost, efficiency and user customization are the major
points of contention. This further highlights the need to tackle the above
challenges and motivates the \textit{``one drafter for all''} paradigm. We
showcase the proficiency of the OmniDraft framework by performing online
learning on math reasoning, coding and text generation tasks. Notably,
OmniDraft enables a single Llama-68M model to pair with various target models
including Vicuna-7B, Qwen2-7B and Llama3-8B models for speculative decoding;
and additionally provides up to 1.5-2x speedup.
☆ Decoupled Planning and Execution: A Hierarchical Reasoning Framework for Deep Search
Complex information needs in real-world search scenarios demand deep
reasoning and knowledge synthesis across diverse sources, which traditional
retrieval-augmented generation (RAG) pipelines struggle to address effectively.
Current reasoning-based approaches suffer from a fundamental limitation: they
use a single model to handle both high-level planning and detailed execution,
leading to inefficient reasoning and limited scalability. In this paper, we
introduce HiRA, a hierarchical framework that separates strategic planning from
specialized execution. Our approach decomposes complex search tasks into
focused subtasks, assigns each subtask to domain-specific agents equipped with
external tools and reasoning capabilities, and coordinates the results through
a structured integration mechanism. This separation prevents execution details
from disrupting high-level reasoning while enabling the system to leverage
specialized expertise for different types of information processing.
Experiments on four complex, cross-modal deep search benchmarks demonstrate
that HiRA significantly outperforms state-of-the-art RAG and agent-based
systems. Our results show improvements in both answer quality and system
efficiency, highlighting the effectiveness of decoupled planning and execution
for multi-step information seeking tasks. Our code is available at
https://github.com/ignorejjj/HiRA.
comment: 9 pages
☆ Strategic Intelligence in Large Language Models: Evidence from Evolutionary Game Theory
Are Large Language Models (LLMs) a new form of strategic intelligence, able
to reason about goals in competitive settings? We present compelling supporting
evidence. The Iterated Prisoner's Dilemma (IPD) has long served as a model for
studying decision-making. We conduct the first-ever series of evolutionary IPD
tournaments, pitting canonical strategies (e.g., Tit-for-Tat, Grim Trigger)
against agents from the leading frontier AI companies OpenAI, Google, and
Anthropic. By varying the termination probability in each tournament (the
"shadow of the future"), we introduce complexity and chance, confounding
memorisation.
Our results show that LLMs are highly competitive, consistently surviving and
sometimes even proliferating in these complex ecosystems. Furthermore, they
exhibit distinctive and persistent "strategic fingerprints": Google's Gemini
models proved strategically ruthless, exploiting cooperative opponents and
retaliating against defectors, while OpenAI's models remained highly
cooperative, a trait that proved catastrophic in hostile environments.
Anthropic's Claude emerged as the most forgiving reciprocator, showing
remarkable willingness to restore cooperation even after being exploited or
successfully defecting. Analysis of nearly 32,000 prose rationales provided by
the models reveals that they actively reason about both the time horizon and
their opponent's likely strategy, and we demonstrate that this reasoning is
instrumental to their decisions. This work connects classic game theory with
machine psychology, offering a rich and granular view of algorithmic
decision-making under uncertainty.
comment: 29 pages, 27 tables, 4 figures
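The tournament setup above hinges on a probabilistic termination rule (the "shadow of the future"). The toy match below illustrates the mechanics with standard IPD payoffs; the `llm_move()` stub stands in for a prompted frontier model and is not the paper's agent implementation.

```python
# Toy IPD match with probabilistic termination; payoffs are the standard (T,R,P,S)=(5,3,1,0).
import random

PAYOFF = {("C", "C"): (3, 3), ("C", "D"): (0, 5),
          ("D", "C"): (5, 0), ("D", "D"): (1, 1)}

def tit_for_tat(my_hist, opp_hist):
    return "C" if not opp_hist else opp_hist[-1]

def llm_move(my_hist, opp_hist):
    # placeholder for a model-backed agent; here it defects with small probability
    return "D" if random.random() < 0.1 else "C"

def play_match(p_end: float = 0.1, seed: int = 0):
    random.seed(seed)
    h1, h2, s1, s2 = [], [], 0, 0
    while True:
        m1, m2 = tit_for_tat(h1, h2), llm_move(h2, h1)
        r1, r2 = PAYOFF[(m1, m2)]
        s1, s2 = s1 + r1, s2 + r2
        h1.append(m1); h2.append(m2)
        if random.random() < p_end:      # "shadow of the future": match may end each round
            return s1, s2

print(play_match(p_end=0.05))
```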
☆ MPF: Aligning and Debiasing Language Models post Deployment via Multi Perspective Fusion ICML 2025
Multiperspective Fusion (MPF) is a novel posttraining alignment framework for
large language models (LLMs) developed in response to the growing need for easy
bias mitigation. Built on top of the SAGED pipeline, an automated system for
constructing bias benchmarks and extracting interpretable baseline
distributions, MPF leverages multiperspective generations to expose and align
biases in LLM outputs with nuanced, humanlike baselines. By decomposing
baselines, such as sentiment distributions from HR professionals, into
interpretable perspective components, MPF guides generation through sampling
and balancing of responses, weighted by the probabilities obtained in the
decomposition. Empirically, we demonstrate its ability to align LLM sentiment
distributions with both counterfactual baselines (absolute equality) and the HR
baseline (biased for Top University), resulting in small KL divergence,
reduction of calibration error and generalization to unseen questions. This
shows that MPF offers a scalable and interpretable method for alignment and
bias mitigation, compatible with deployed LLMs and requiring no extensive
prompt engineering or finetuning.
comment: Accepted at ICML 2025 AIW Workshop
☆ Revisiting Active Learning under (Human) Label Variation
Access to high-quality labeled data remains a limiting factor in applied
supervised learning. While label variation (LV), i.e., differing labels for the
same instance, is common, especially in natural language processing, annotation
frameworks often still rest on the assumption of a single ground truth. This
overlooks human label variation (HLV), the occurrence of plausible differences
in annotations, as an informative signal. Similarly, active learning (AL), a
popular approach to optimizing the use of limited annotation budgets in
training ML models, often relies on at least one of several simplifying
assumptions, which rarely hold in practice when acknowledging HLV. In this
paper, we examine foundational assumptions about truth and label nature,
highlighting the need to decompose observed LV into signal (e.g., HLV) and
noise (e.g., annotation error). We survey how the AL and (H)LV communities have
addressed -- or neglected -- these distinctions and propose a conceptual
framework for incorporating HLV throughout the AL loop, including instance
selection, annotator choice, and label representation. We further discuss the
integration of large language models (LLMs) as annotators. Our work aims to lay
a conceptual foundation for HLV-aware active learning, better reflecting the
complexities of real-world annotation.
☆ WebSailor: Navigating Super-human Reasoning for Web Agent
Kuan Li, Zhongwang Zhang, Huifeng Yin, Liwen Zhang, Litu Ou, Jialong Wu, Wenbiao Yin, Baixuan Li, Zhengwei Tao, Xinyu Wang, Weizhou Shen, Junkai Zhang, Dingchu Zhang, Xixi Wu, Yong Jiang, Ming Yan, Pengjun Xie, Fei Huang, Jingren Zhou
Transcending human cognitive limitations represents a critical frontier in
LLM training. Proprietary agentic systems like DeepResearch have demonstrated
superhuman capabilities on extremely complex information-seeking benchmarks
such as BrowseComp, a feat previously unattainable. We posit that their success
hinges on a sophisticated reasoning pattern absent in open-source models: the
ability to systematically reduce extreme uncertainty when navigating vast
information landscapes. Based on this insight, we introduce WebSailor, a
complete post-training methodology designed to instill this crucial capability.
Our approach involves generating novel, high-uncertainty tasks through
structured sampling and information obfuscation, RFT cold start, and an
efficient agentic RL training algorithm, Duplicating Sampling Policy
Optimization (DUPO). With this integrated pipeline, WebSailor significantly
outperforms all open-source agents in complex information-seeking tasks,
matching proprietary agents' performance and closing the capability gap.
☆ IndianBailJudgments-1200: A Multi-Attribute Dataset for Legal NLP on Indian Bail Orders
Legal NLP remains underdeveloped in regions like India due to the scarcity of
structured datasets. We introduce IndianBailJudgments-1200, a new benchmark
dataset comprising 1200 Indian court judgments on bail decisions, annotated
across 20+ attributes including bail outcome, IPC sections, crime type, and
legal reasoning. Annotations were generated using a prompt-engineered GPT-4o
pipeline and verified for consistency. This resource supports a wide range of
legal NLP tasks such as outcome prediction, summarization, and fairness
analysis, and is the first publicly available dataset focused specifically on
Indian bail jurisprudence.
comment: 9 pages, 9 figures, 2 tables. Dataset available at Hugging Face and
GitHub. Submitted to arXiv for open access
☆ A Cookbook for Community-driven Data Collection of Impaired Speech in Low-Resource Languages
Sumaya Ahmed Salihs, Isaac Wiafe, Jamal-Deen Abdulai, Elikem Doe Atsakpo, Gifty Ayoka, Richard Cave, Akon Obu Ekpezu, Catherine Holloway, Katrin Tomanek, Fiifi Baffoe Payin Winful
This study presents an approach for collecting speech samples to build
Automatic Speech Recognition (ASR) models for impaired speech, particularly in
low-resource languages. It aims to democratize ASR technology and data
collection by developing a "cookbook" of best practices and training for
community-driven data collection and ASR model building. As a proof-of-concept,
this study curated the first open-source dataset of impaired speech in Akan: a
widely spoken indigenous language in Ghana. The study involved participants
from diverse backgrounds with speech impairments. The resulting dataset, along
with the cookbook and open-source tools, are publicly available to enable
researchers and practitioners to create inclusive ASR technologies tailored to
the unique needs of speech-impaired individuals. In addition, this study
presents the initial results of fine-tuning open-source ASR models to better
recognize impaired speech in Akan.
comment: This version has been reviewed and accepted for presentation at the
InterSpeech 2025 conference to be held in Rotterdam from 17 to 21 August. 5
pages and 3 tables
☆ Benchmarking Akan ASR Models Across Domain-Specific Datasets: A Comparative Evaluation of Performance, Scalability, and Adaptability
Mark Atta Mensah, Isaac Wiafe, Akon Ekpezu, Justice Kwame Appati, Jamal-Deen Abdulai, Akosua Nyarkoa Wiafe-Akenten, Frank Ernest Yeboah, Gifty Odame
Most existing automatic speech recognition (ASR) research evaluates models
using in-domain datasets but seldom examines how these models generalize
across diverse speech contexts. This study addresses this gap by benchmarking
seven Akan ASR models built on transformer architectures, such as Whisper and
Wav2Vec2, using four Akan speech corpora to determine their performance. These
datasets encompass various domains, including culturally relevant image
descriptions, informal conversations, biblical scripture readings, and
spontaneous financial dialogues. A comparison of the word error rate and
character error rate highlighted domain dependency, with models performing
optimally only within their training domains while showing marked accuracy
degradation in mismatched scenarios. This study also identified distinct error
behaviors between the Whisper and Wav2Vec2 architectures. Whereas fine-tuned
Whisper Akan models led to more fluent but potentially misleading transcription
errors, Wav2Vec2 produced more obvious yet less interpretable outputs when
encountering unfamiliar inputs. This trade-off between readability and
transparency in ASR errors should be considered when selecting architectures
for low-resource language (LRL) applications. These findings highlight the need
for targeted domain adaptation techniques, adaptive routing strategies, and
multilingual training frameworks for Akan and other LRLs.
comment: This version has been reviewed and accepted for presentation at the
Future Technologies Conference (FTC) 2025, to be held on 6 & 7 November 2025
in Munich, Germany. 17 pages, 4 figures, 1 table
☆ JoyTTS: LLM-based Spoken Chatbot With Voice Cloning
JoyTTS is an end-to-end spoken chatbot that combines large language models
(LLM) with text-to-speech (TTS) technology, featuring voice cloning
capabilities. This project is built upon the open-source MiniCPM-o and
CosyVoice2 models and trained on 2000 hours of conversational data. We have
also provided the complete training code to facilitate further development and
optimization by the community. On the seed-tts-zh test set, it achieves
a SS (speaker similarity) score of 0.73 and a WER (Word Error Rate) of 5.09.
The code and models, along with training and inference scripts, are available
at https://github.com/jdh-algo/JoyTTS.git.
☆ Efficient Code LLM Training via Distribution-Consistent and Diversity-Aware Data Selection
Recent advancements in large language models (LLMs) have significantly
improved code generation and program comprehension, accelerating the evolution
of software engineering. Current methods primarily enhance model performance by
leveraging vast amounts of data, focusing on data quantity while often
overlooking data quality, thereby reducing training efficiency. To address
this, we introduce an approach that utilizes a parametric model for code data
selection, aimed at improving both training efficiency and model performance.
Our method optimizes the parametric model to ensure distribution consistency
and diversity within the selected subset, guaranteeing high-quality data.
Experimental results demonstrate that using only 10K samples, our method
achieves gains of 2.4% (HumanEval) and 2.3% (MBPP) over the 92K full-sampled
baseline, outperforming other sampling approaches in both performance and
efficiency. This underscores that our method effectively boosts model
performance while significantly reducing computational costs.
☆ QFFN-BERT: An Empirical Study of Depth, Performance, and Data Efficiency in Hybrid Quantum-Classical Transformers
Parameterized quantum circuits (PQCs) have recently emerged as promising
components for enhancing the expressibility of neural architectures. In this
work, we introduce QFFN-BERT, a hybrid quantum-classical transformer where the
feedforward network (FFN) modules of a compact BERT variant are replaced by
PQC-based layers. This design is motivated by the dominant parameter
contribution of FFNs, which account for approximately two-thirds of the
parameters within standard Transformer encoder blocks. While prior studies have
primarily integrated PQCs into self-attention modules, our work focuses on the
FFN and systematically investigates the trade-offs between PQC depth,
expressibility, and trainability. Our final PQC architecture incorporates a
residual connection, both $R_Y$ and $R_Z$ rotations, and an alternating
entanglement strategy to ensure stable training and high expressibility. Our
experiments, conducted on a classical simulator with the SST-2 and DBpedia
benchmarks, demonstrate two key findings. First, a carefully configured
QFFN-BERT achieves up to 102.0% of the baseline accuracy, surpassing its
classical counterpart in a full-data setting while reducing FFN-specific
parameters by over 99%. Second, our model exhibits a consistent and competitive
edge in few-shot learning scenarios, confirming its potential for superior data
efficiency. These results, supported by an ablation study on a non-optimized
PQC that failed to learn, confirm that PQCs can serve as powerful and
parameter-efficient alternatives to classical FFNs when co-designed with
foundational deep learning principles.
☆ Coling-UniA at SciVQA 2025: Few-Shot Example Retrieval and Confidence-Informed Ensembling for Multimodal Large Language Models ACL 2025
This paper describes our system for the SciVQA 2025 Shared Task on Scientific
Visual Question Answering. Our system employs an ensemble of two Multimodal
Large Language Models and various few-shot example retrieval strategies. The
model and few-shot setting are selected based on the figure and question type.
We also select answers based on the models' confidence levels. On the blind
test data, our system ranks third out of seven with an average F1 score of
85.12 across ROUGE-1, ROUGE-L, and BERTScore. Our code is publicly available.
comment: Accepted at 5th Workshop on Scholarly Document Processing @ ACL 2025
☆ DoMIX: An Efficient Framework for Exploiting Domain Knowledge in Fine-Tuning ACL 2025
Domain-Adaptive Pre-training (DAP) has recently gained attention for its
effectiveness in fine-tuning pre-trained models. Building on this, continual
DAP has been explored to develop pre-trained models capable of incrementally
incorporating different domain datasets. However, existing continual DAP
methods face several limitations: (1) high computational cost and GPU memory
usage during training; (2) sensitivity to incremental data order; and (3)
providing a single, generalized model for all end tasks, which contradicts the
essence of DAP. In this paper, we propose DoMIX, a novel approach that
addresses these challenges by leveraging LoRA modules, a representative
parameter-efficient fine-tuning (PEFT) method. Our approach enables efficient
and parallel domain-adaptive pre-training that is robust to domain order and
effectively utilizes accumulated knowledge to provide tailored pre-trained
models for specific tasks. We also demonstrate that our method can be extended
beyond the DAP setting to standard LLM fine-tuning scenarios. Code is available
at https://github.com/dohoonkim-ai/DoMIX.
comment: 22 pages, 5 figures, ACL 2025 Main
☆ Seeing Through Green: Text-Based Classification and the Firm's Returns from Green Patents
This paper introduces a Natural Language Processing approach for identifying
``true'' green patents from official supporting documents. We start from about
12.4 million patents that had been classified as green in previous literature
and then train a simple neural network to enlarge a baseline
dictionary through vector representations of expressions related to
environmental technologies. After testing, we find that ``true'' green patents
represent about 20\% of the total of patents classified as green from previous
literature. We show heterogeneity by technological classes, and then check that
``true'' green patents are about 1\% less cited by subsequent inventions. In the
second part of the paper, we test the relationship between patenting and a
dashboard of firm-level financial accounts in the European Union. After
controlling for reverse causality, we show that holding at least one ``true''
green patent raises sales, market shares, and productivity. If we restrict the
analysis to high-novelty ``true'' green patents, we find that they also yield
higher profits. Our findings underscore the importance of using text analyses
to gauge finer-grained patent classifications that are useful for policymaking
in different domains.
☆ MemAgent: Reshaping Long-Context LLM with Multi-Conv RL-based Memory Agent
Hongli Yu, Tinghong Chen, Jiangtao Feng, Jiangjie Chen, Weinan Dai, Qiying Yu, Ya-Qin Zhang, Wei-Ying Ma, Jingjing Liu, Mingxuan Wang, Hao Zhou
Despite improvements by length extrapolation, efficient attention and memory
modules, handling infinitely long documents with linear complexity without
performance degradation during extrapolation remains the ultimate challenge in
long-text processing. We directly optimize for long-text tasks in an end-to-end
fashion and introduce a novel agent workflow, MemAgent, which reads text in
segments and updates the memory using an overwrite strategy. We extend the DAPO
algorithm to facilitate training via independent-context multi-conversation
generation. MemAgent has demonstrated superb long-context capabilities, being
able to extrapolate from an 8K context trained on 32K text to a 3.5M QA task
with performance loss < 5% and achieves 95%+ accuracy on the 512K RULER test.
comment: Project Page: https://memagent-sialab.github.io/
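The segment-reading loop with an overwrite memory strategy described above can be sketched with a generic generation wrapper. The prompts and segment size below are illustrative assumptions, not MemAgent's trained workflow.

```python
# Illustrative MemAgent-style loop: read the document in segments, overwriting a
# fixed-budget memory after each one. generate() is a placeholder for an LLM call.
def generate(prompt: str, max_tokens: int = 512) -> str:
    raise NotImplementedError("wrap your LLM inference call here")

def answer_long_document(document: str, question: str, seg_chars: int = 8000) -> str:
    memory = ""  # fully rewritten (overwritten) after every segment
    for start in range(0, len(document), seg_chars):
        segment = document[start:start + seg_chars]
        memory = generate(
            f"Question: {question}\n"
            f"Current memory: {memory or '(empty)'}\n"
            f"New text segment:\n{segment}\n"
            "Rewrite the memory, keeping only information relevant to the question."
        )
    return generate(f"Question: {question}\nMemory: {memory}\nAnswer the question.")
```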
☆ GDC Cohort Copilot: An AI Copilot for Curating Cohorts from the Genomic Data Commons
Motivation: The Genomic Data Commons (GDC) provides access to high quality,
harmonized cancer genomics data through a unified curation and analysis
platform centered around patient cohorts. While GDC users can interactively
create complex cohorts through the graphical Cohort Builder, users (especially
new ones) may struggle to find specific cohort descriptors across hundreds of
possible fields and properties. However, users may be better able to describe
their desired cohort in free-text natural language.
Results: We introduce GDC Cohort Copilot, an open-source copilot tool for
curating cohorts from the GDC. GDC Cohort Copilot automatically generates the
GDC cohort filter corresponding to a user-input natural language description of
their desired cohort, before exporting the cohort back to the GDC for further
analysis. An interactive user interface allows users to further refine the
generated cohort. We develop and evaluate multiple large language models (LLMs)
for GDC Cohort Copilot and demonstrate that our locally-served, open-source GDC
Cohort LLM achieves better results than GPT-4o prompting in generating GDC
cohorts.
Availability and implementation: The standalone docker image for GDC Cohort
Copilot is available at https://quay.io/repository/cdis/gdc-cohort-copilot.
Source code is available at https://github.com/uc-cdis/gdc-cohort-copilot. GDC
Cohort LLM weights are available at https://huggingface.co/uc-ctds.
comment: 11 pages, 1 figure, 7 tables
☆ SciGA: A Comprehensive Dataset for Designing Graphical Abstracts in Academic Papers
Graphical Abstracts (GAs) play a crucial role in visually conveying the key
findings of scientific papers. While recent research has increasingly
incorporated visual materials such as Figure 1 as de facto GAs, their potential
to enhance scientific communication remains largely unexplored. Moreover,
designing effective GAs requires advanced visualization skills, creating a
barrier to their widespread adoption. To tackle these challenges, we introduce
SciGA-145k, a large-scale dataset comprising approximately 145,000 scientific
papers and 1.14 million figures, explicitly designed for supporting GA
selection and recommendation as well as facilitating research in automated GA
generation. As a preliminary step toward GA design support, we define two
tasks: 1) Intra-GA recommendation, which identifies figures within a given
paper that are well-suited to serve as GAs, and 2) Inter-GA recommendation,
which retrieves GAs from other papers to inspire the creation of new GAs. We
provide reasonable baseline models for these tasks. Furthermore, we propose
Confidence Adjusted top-1 ground truth Ratio (CAR), a novel recommendation
metric that offers a fine-grained analysis of model behavior. CAR addresses
limitations in traditional ranking-based metrics by considering cases where
multiple figures within a paper, beyond the explicitly labeled GA, may also
serve as GAs. By unifying these tasks and metrics, our SciGA-145k establishes a
foundation for advancing visual scientific communication while contributing to
the development of AI for Science.
comment: 21 pages, 15 figures, 4 tables. Project Page:
https://iyatomilab.github.io/SciGA/
♻ ☆ Improved Unbiased Watermark for Large Language Models ACL 2025
As artificial intelligence surpasses human capabilities in text generation,
the necessity to authenticate the origins of AI-generated content has become
paramount. Unbiased watermarks offer a powerful solution by embedding
statistical signals into language model-generated text without distorting the
quality. In this paper, we introduce MCmark, a family of unbiased,
Multi-Channel-based watermarks. MCmark works by partitioning the model's
vocabulary into segments and promoting token probabilities within a selected
segment based on a watermark key. We demonstrate that MCmark not only preserves
the original distribution of the language model but also offers significant
improvements in detectability and robustness over existing unbiased watermarks.
Our experiments with widely-used language models demonstrate an improvement in
detectability of over 10% using MCmark, compared to existing state-of-the-art
unbiased watermarks. This advancement underscores MCmark's potential in
enhancing the practical application of watermarking in AI-generated texts.
comment: ACL 2025 Main Conference
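The vocabulary-partitioning idea described above can be illustrated with a simplified sampling step. The sketch below is only in the spirit of a multi-channel scheme: the segment selection, key derivation, and logit boost are illustrative, and unlike MCmark itself this naive version is not unbiased (it does distort the output distribution).

```python
# Simplified watermark sampling step: promote tokens in a key-selected vocabulary segment.
import hashlib
import torch

def select_segment(prev_token_id: int, key: str, n_segments: int) -> int:
    digest = hashlib.sha256(f"{key}:{prev_token_id}".encode()).hexdigest()
    return int(digest, 16) % n_segments

def watermarked_sample(logits: torch.Tensor, prev_token_id: int, key: str,
                       n_segments: int = 4, boost: float = 2.0) -> int:
    vocab = logits.shape[-1]
    seg = select_segment(prev_token_id, key, n_segments)
    # tokens whose id falls in the selected segment get their logits promoted
    in_segment = (torch.arange(vocab) % n_segments) == seg
    probs = torch.softmax(logits + boost * in_segment.float(), dim=-1)
    return int(torch.multinomial(probs, 1))

next_id = watermarked_sample(torch.randn(50_000), prev_token_id=123, key="secret")
```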
♻ ☆ From Web Search towards Agentic Deep Research: Incentivizing Search with Reasoning Agents
Weizhi Zhang, Yangning Li, Yuanchen Bei, Junyu Luo, Guancheng Wan, Liangwei Yang, Chenxuan Xie, Yuyao Yang, Wei-Chieh Huang, Chunyu Miao, Henry Peng Zou, Xiao Luo, Yusheng Zhao, Yankai Chen, Chunkit Chan, Peilin Zhou, Xinyang Zhang, Chenwei Zhang, Jingbo Shang, Ming Zhang, Yangqiu Song, Irwin King, Philip S. Yu
Information retrieval is a cornerstone of modern knowledge acquisition,
enabling billions of queries each day across diverse domains. However,
traditional keyword-based search engines are increasingly inadequate for
handling complex, multi-step information needs. Our position is that Large
Language Models (LLMs), endowed with reasoning and agentic capabilities, are
ushering in a new paradigm termed Agentic Deep Research. These systems
transcend conventional information search techniques by tightly integrating
autonomous reasoning, iterative retrieval, and information synthesis into a
dynamic feedback loop. We trace the evolution from static web search to
interactive, agent-based systems that plan, explore, and learn. We also
introduce a test-time scaling law to formalize the impact of computational
depth on reasoning and search. Supported by benchmark results and the rise of
open-source implementations, we demonstrate that Agentic Deep Research not only
significantly outperforms existing approaches, but is also poised to become the
dominant paradigm for future information seeking. All the related resources,
including industry products, research papers, benchmark datasets, and
open-source implementations, are collected for the community in
https://github.com/DavidZWZ/Awesome-Deep-Research.
♻ ☆ GPAS: Accelerating Convergence of LLM Pretraining via Gradient-Preserving Activation Scaling
Tianhao Chen, Xin Xu, Zijing Liu, Pengxiang Li, Xinyuan Song, Ajay Kumar Jaiswal, Fan Zhang, Jishan Hu, Yang Wang, Hao Chen, Shizhe Diao, Shiwei Liu, Yu Li, Lu Yin, Can Yang
Modern Large Language Models, such as the LLaMA, Qwen and DeepSeek series,
predominantly adopt the Pre-LayerNorm (Pre-LN) Transformer architecture. While
being stable during pretraining and scalable to large model sizes, Pre-LN
suffers from an exponential growth in activation variance across layers,
causing the shortcut to dominate over sub-layer outputs in the residual
connection and limiting the learning capacity of deeper layers. To mitigate
this issue, we propose Gradient-Preserving Activation Scaling (GPAS), a simple
technique that can be used in combination with existing approaches. GPAS works
by scaling down the intermediate activations while keeping their gradients
unchanged. This leaves information in the activations intact, and avoids the
gradient vanishing problem associated with gradient downscaling. Extensive
experiments across various model sizes from 71M to 1B show that GPAS achieves
consistent performance gains. Beyond enhancing Pre-LN Transformers, GPAS also
shows promise in improving alternative architectures such as Sandwich-LN and
DeepNorm, demonstrating its versatility and potential for improving training
dynamics in a wide range of settings. Our code is available at
https://github.com/dandingsky/GPAS.
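Taken at face value, "scaling down the intermediate activations while keeping their gradients unchanged" can be sketched in PyTorch with a stop-gradient (detach) trick; this is a minimal illustration of that stated idea, not the authors' exact layer placement or scale parameterization.

import torch

def gradient_preserving_scale(x: torch.Tensor, alpha: float) -> torch.Tensor:
    # Forward: returns alpha * x. Backward: the scaled term is detached,
    # so d(out)/d(x) = 1 and gradients pass through unchanged.
    return x + ((alpha - 1.0) * x).detach()

x = torch.randn(3, requires_grad=True)
y = gradient_preserving_scale(x, alpha=0.5)
y.sum().backward()
print(y.detach())  # activations scaled by 0.5
print(x.grad)      # all ones: gradients unaffected by the downscaling
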
♻ ☆ Enhancing Clinical Multiple-Choice Questions Benchmarks with Knowledge Graph Guided Distractor Generation
Clinical tasks such as diagnosis and treatment require strong decision-making
abilities, highlighting the importance of rigorous evaluation benchmarks to
assess the reliability of large language models (LLMs). In this work, we
introduce a knowledge-guided data augmentation framework that enhances the
difficulty of clinical multiple-choice question (MCQ) datasets by generating
distractors (i.e., incorrect choices that are similar to the correct one and
may confuse existing LLMs). Using our KG-based pipeline, the generated choices
are both clinically plausible and deliberately misleading. Our approach
involves multi-step, semantically informed walks on a medical knowledge graph
to identify distractor paths (associations that are medically relevant but
factually incorrect), which then guide the LLM in crafting more deceptive
distractors. We apply the resulting knowledge graph guided distractor generation
(KGGDG) pipeline to six widely used medical QA benchmarks and show that it
consistently reduces the accuracy of state-of-the-art LLMs. These findings
establish KGGDG as a powerful tool for enabling more robust and diagnostic
evaluations of medical LLMs.
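A rough sketch of the kind of graph walk described above: start from an entity mentioned in the question, take a few hops that stay medically related, and keep endpoints that differ from the correct answer as seeds for LLM-written distractors. The toy graph, hop heuristic, and downstream prompt are hypothetical placeholders, not the paper's pipeline.

import random
import networkx as nx

# toy medical knowledge graph (hypothetical edges)
kg = nx.Graph()
kg.add_edges_from([
    ("pneumonia", "fever"), ("pneumonia", "cough"),
    ("fever", "influenza"), ("cough", "bronchitis"),
    ("bronchitis", "asthma"),
])

def distractor_candidates(graph, seed_entity, correct_answer, hops=2, n_walks=20):
    # Random multi-hop walks from the seed entity; endpoints that are related
    # but not the correct answer become distractor candidates.
    candidates = set()
    for _ in range(n_walks):
        node = seed_entity
        for _ in range(hops):
            node = random.choice(list(graph.neighbors(node)))
        if node not in (seed_entity, correct_answer):
            candidates.add(node)
    return sorted(candidates)

print(distractor_candidates(kg, seed_entity="pneumonia", correct_answer="influenza"))
# the candidates would then be passed to an LLM prompt asking for plausible
# but incorrect answer options grounded in these related concepts
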
♻ ☆ Batch-Max: Higher LLM Throughput using Larger Batch Sizes and KV Cache Compression
Several works have developed eviction policies to remove key-value (KV) pairs
from the KV cache for more efficient inference. The focus has been on
compressing the KV cache after the input prompt has been processed for faster
token generation. In settings with limited GPU memory, and when the input
context is longer than the generation length, we show that by also compressing
the KV cache during the input processing phase, larger batch sizes can be used
resulting in significantly higher throughput while still maintaining the
original model's accuracy.
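The idea of compressing the KV cache while the prompt is still being processed can be sketched as chunked prefill with eviction of low-scoring entries after each chunk, which caps cache memory and thus allows larger batches. The importance score used here (a random stand-in; in practice something like cumulative attention mass) and the fixed budget are assumptions, not the paper's eviction policy.

import numpy as np

def compress_kv(keys, values, scores, budget):
    # Keep only the `budget` cached positions with the highest importance scores.
    # keys/values: (seq, dim); scores: (seq,), e.g. cumulative attention received.
    if len(scores) <= budget:
        return keys, values, scores
    keep = np.argsort(scores)[-budget:]
    keep.sort()  # preserve positional order of the survivors
    return keys[keep], values[keep], scores[keep]

# toy prefill: process the prompt in chunks and compress after each chunk,
# so peak cache size stays at `budget` and more sequences fit per batch
dim, budget = 8, 16
cache_k = np.empty((0, dim)); cache_v = np.empty((0, dim)); cache_s = np.empty(0)
for chunk in range(4):
    new_k, new_v = np.random.randn(10, dim), np.random.randn(10, dim)
    new_s = np.random.rand(10)                    # stand-in importance scores
    cache_k = np.vstack([cache_k, new_k])
    cache_v = np.vstack([cache_v, new_v])
    cache_s = np.concatenate([cache_s, new_s])
    cache_k, cache_v, cache_s = compress_kv(cache_k, cache_v, cache_s, budget)
print(cache_k.shape)  # (16, 8): the cache never exceeds the budget during prefill
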
♻ ☆ Mind2Web 2: Evaluating Agentic Search with Agent-as-a-Judge
Boyu Gou, Zanming Huang, Yuting Ning, Yu Gu, Michael Lin, Weijian Qi, Andrei Kopanev, Botao Yu, Bernal Jiménez Gutiérrez, Yiheng Shu, Chan Hee Song, Jiaman Wu, Shijie Chen, Hanane Nour Moussa, Tianshu Zhang, Jian Xie, Yifei Li, Tianci Xue, Zeyi Liao, Kai Zhang, Boyuan Zheng, Zhaowei Cai, Viktor Rozgic, Morteza Ziyadi, Huan Sun, Yu Su
Agentic search systems such as Deep Research, in which agents autonomously browse
the web, synthesize information, and return comprehensive citation-backed
answers, represent a major shift in how users interact with web-scale
information. While promising greater efficiency and cognitive offloading, the
growing complexity and open-endedness of agentic search have outpaced existing
evaluation benchmarks and methodologies, which largely assume short search
horizons and static answers. In this paper, we introduce Mind2Web 2, a
benchmark of 130 realistic, high-quality, and long-horizon tasks that require
real-time web browsing and extensive information synthesis, constructed with
over 1000 hours of human labor. To address the challenge of evaluating
time-varying and complex answers, we propose a novel Agent-as-a-Judge
framework. Our method constructs task-specific judge agents based on a
tree-structured rubric design to automatically assess both answer correctness
and source attribution. We conduct a comprehensive evaluation of ten frontier
agentic search systems and human performance, along with a detailed error
analysis to draw insights for future development. The best-performing system,
OpenAI Deep Research, can already achieve 50-70% of human performance while
spending half the time, highlighting its great potential. Altogether, Mind2Web
2 provides a rigorous foundation for developing and benchmarking the next
generation of agentic search systems.
comment: Project Homepage: https://osu-nlp-group.github.io/Mind2Web-2/
♻ ☆ On Characterizations for Language Generation: Interplay of Hallucinations, Breadth, and Stability
We study language generation in the limit, introduced by Kleinberg and
Mullainathan [KM24], building on the classical works of Gold [Gol67] and Angluin
[Ang79]. [KM24]'s main result is an algorithm for generating from any countable
language collection in the limit. While their algorithm eventually generates
unseen strings from the target language $K$, it sacrifices coverage or breadth,
i.e., its ability to generate a rich set of strings. Recent work introduces
different notions of breadth and explores when generation with breadth is
possible, leaving a full characterization of these notions open. Our first set
of results settles this by characterizing generation for existing notions of
breadth and their natural extensions. Interestingly, our lower bounds are very
flexible and hold for many performance metrics beyond breadth; for instance,
showing that, in general, it is impossible to train generators which achieve a
higher perplexity or lower hallucination rate for $K$ compared to other
languages. Next, we study language generation with breadth and stable
generators - algorithms that eventually stop changing after seeing an arbitrary
but finite number of strings - and prove unconditional lower bounds for such
generators, strengthening the results of [KMV25] and demonstrating that
generation with many existing notions of breadth becomes equally hard when
stability is required. This gives a separation for generation with approximate
breadth, between stable and unstable generators, highlighting the rich
interplay between breadth, stability, and consistency in language generation.
comment: v2 improves exposition and simplifies proofs
♻ ☆ Next-Token Prediction Task Assumes Optimal Data Ordering for LLM Training in Proof Generation
Chenyang An, Shima Imani, Feng Yao, Chengyu Dong, Ali Abbasi, Harsh Shrivastava, Samuel Buss, Jingbo Shang, Gayathri Mahalingam, Pramod Sharma, Maurice Diesendruck
In the field of large language model (LLM)-based proof generation, despite
extensive training on large datasets such as ArXiv, LLMs still exhibit only
modest performance on proving tasks of moderate difficulty. We believe that
this is partly due to the widespread presence of suboptimal ordering within the
data for each proof used in training. For example, published proofs often
follow a purely logical order, where each step logically proceeds from the
previous steps based on the deductive rules. This order is designed to
facilitate the verification of the proof's soundness, rather than to help
people and models learn the discovery process of the proof. In proof
generation, we argue that the optimal order for one training data sample occurs
when the relevant intermediate supervision for a particular proof step in the
proof is always positioned to the left of that proof step. We call such order
the intuitively sequential order. We validate our claims using two tasks:
intuitionistic propositional logic theorem-proving and digit multiplication.
Our experiments verify the order effect and provide support for our
explanations. We demonstrate that training is most effective when the proof is
in the intuitively sequential order. Moreover, the order effect and the
performance gap between models trained on different data orders can be
substantial: in the propositional logic theorem-proving task, models trained on
the optimal order achieve an 11 percent higher proof success rate than models
trained on the worst order. Lastly, we define a common type of
order issue in advanced math proofs and find that 17.3 percent of theorems with
nontrivial proofs in the first two chapters of a widely used graduate-level
mathematics textbook suffer from this issue. A detailed list of those proofs is
provided in the appendix.
♻ ☆ Code2Logic: Game-Code-Driven Data Synthesis for Enhancing VLMs General Reasoning NeurIPS 2025
Jingqi Tong, Jixin Tang, Hangcheng Li, Yurong Mou, Ming Zhang, Jun Zhao, Yanbo Wen, Fan Song, Jiahao Zhan, Yuyang Lu, Chaoran Tao, Zhiyuan Guo, Jizhou Yu, Tianhao Cheng, Changhao Jiang, Zhen Wang, Tao Liang, Zhihui Fei, Mingyang Wan, Guojun Ma, Weifeng Ge, Guanhua Chen, Tao Gui, Xipeng Qiu, Qi Zhang, Xuanjing Huang
Visual-language Chain-of-Thought (CoT) data resources are relatively scarce
compared to text-only counterparts, limiting the improvement of reasoning
capabilities in Vision Language Models (VLMs). Moreover, high-quality
vision-language reasoning data is expensive and labor-intensive to annotate. To
address this issue, we leverage a promising resource: game code, which
naturally contains logical structures and state transition processes.
Therefore, we propose Code2Logic, a novel game-code-driven approach for
multimodal reasoning data synthesis. Our approach leverages Large Language
Models (LLMs) to adapt game code, enabling automatic acquisition of reasoning
processes and results through code execution. Using the Code2Logic approach, we
developed the GameQA dataset to train and evaluate VLMs. GameQA is
cost-effective and scalable, offers controllable difficulty gradation and is
diverse, covering 30 games and 158 tasks. Surprisingly, despite training solely
on game data, VLMs demonstrate out-of-domain generalization; for example,
Qwen2.5-VL-7B improves performance by 2.33% across 7 diverse vision-language
benchmarks. Our code, dataset, and models are available at
https://github.com/tongjingqi/Code2Logic.
comment: 63 pages, 23 figures, submitted to NeurIPS 2025
♻ ☆ Explainable Compliance Detection with Multi-Hop Natural Language Inference on Assurance Case Structure
Ensuring complex systems meet regulations typically requires checking the
validity of assurance cases through a claim-argument-evidence framework. Some
challenges in this process include the complicated nature of legal and
technical texts, the need for model explanations, and limited access to
assurance case data. We propose a compliance detection approach based on
Natural Language Inference (NLI): EXplainable CompLiance detection with
Argumentative Inference of Multi-hop reasoning (EXCLAIM). We formulate the
claim-argument-evidence structure of an assurance case as a multi-hop inference
for explainable and traceable compliance detection. We address the limited
number of assurance cases by generating them using large language models
(LLMs). We introduce metrics that measure coverage and structural
consistency. We demonstrate the effectiveness of the generated assurance cases
from GDPR requirements in a multi-hop inference task as a case study. Our
results highlight the potential of NLI-based approaches in automating the
regulatory compliance process.
♻ ☆ Direct Preference Optimization Using Sparse Feature-Level Constraints
Qingyu Yin, Chak Tou Leong, Hongbo Zhang, Minjun Zhu, Hanqi Yan, Qiang Zhang, Yulan He, Wenjie Li, Jun Wang, Yue Zhang, Linyi Yang
The alignment of large language models (LLMs) with human preferences remains
a key challenge. While post-training techniques like Reinforcement Learning
from Human Feedback (RLHF) and Direct Preference Optimization (DPO) have
achieved notable success, they often introduce computational inefficiencies and
training instability. In this paper, we propose Feature-level constrained
Preference Optimization (FPO), a novel method designed to simplify the
alignment process while ensuring stability. FPO leverages pre-trained Sparse
Autoencoders (SAEs) and introduces feature-level constraints, allowing for
efficient, sparsity-enforced alignment. Our approach gains efficiency by using
sparse features activated in a well-trained sparse autoencoder, and preserves the
quality of sequential KL divergence by using a feature-level offline reference.
Experimental results on benchmark datasets demonstrate that FPO achieves a
5.08% absolute improvement in win rate with much lower computational cost
compared to state-of-the-art baselines, making it a promising solution for
efficient and controllable LLM alignment.
♻ ☆ Symbolic or Numerical? Understanding Physics Problem Solving in Reasoning LLMs
Navigating the complexities of physics reasoning has long been a difficult
task for Large Language Models (LLMs), requiring a synthesis of profound
conceptual understanding and adept problem-solving techniques. In this study,
we investigate the application of advanced instruction-tuned reasoning models,
such as Deepseek-R1, to address a diverse spectrum of physics problems curated
from the challenging SciBench benchmark. Our comprehensive experimental
evaluation reveals the remarkable capabilities of reasoning models. Not only do
they achieve state-of-the-art accuracy in answering intricate physics
questions, but they also generate distinctive reasoning patterns that emphasize
symbolic derivation. Furthermore, our findings indicate that even for these
highly sophisticated reasoning models, the strategic incorporation of few-shot
prompting can still yield measurable improvements in overall accuracy,
highlighting the potential for continued performance gains.
♻ ☆ MedAide: Information Fusion and Anatomy of Medical Intents via LLM-based Agent Collaboration
Dingkang Yang, Jinjie Wei, Mingcheng Li, Jiyao Liu, Lihao Liu, Ming Hu, Junjun He, Yakun Ju, Wei Zhou, Yang Liu, Lihua Zhang
In healthcare intelligence, the ability to fuse heterogeneous, multi-intent
information from diverse clinical sources is fundamental to building reliable
decision-making systems. Large Language Model (LLM)-driven information
interaction systems are currently showing promise in the healthcare
domain. Nevertheless, they often suffer from information redundancy and
coupling when dealing with complex medical intents, leading to severe
hallucinations and performance bottlenecks. To this end, we propose MedAide, an
LLM-based medical multi-agent collaboration framework designed to enable
intent-aware information fusion and coordinated reasoning across specialized
healthcare domains. Specifically, we introduce a regularization-guided module
that combines syntactic constraints with retrieval augmented generation to
decompose complex queries into structured representations, facilitating
fine-grained clinical information fusion and intent resolution. Additionally, a
dynamic intent prototype matching module is proposed to utilize dynamic
prototype representation with a semantic similarity matching mechanism to
achieve adaptive recognition and updating of the agent's intent in multi-round
healthcare dialogues. Ultimately, we design a rotation agent collaboration
mechanism that introduces dynamic role rotation and decision-level information
fusion across specialized medical agents. Extensive experiments are conducted
on four medical benchmarks with composite intents. Experimental results from
automated metrics and expert doctor evaluations show that MedAide outperforms
current LLMs and improves their medical proficiency and strategic reasoning.
comment: LLM-based Multi-Agent Collaboration for Medical Applications
♻ ☆ AI Flow: Perspectives, Scenarios, and Approaches
Hongjun An, Wenhan Hu, Sida Huang, Siqi Huang, Ruanjun Li, Yuanzhi Liang, Jiawei Shao, Yiliang Song, Zihan Wang, Cheng Yuan, Chi Zhang, Hongyuan Zhang, Wenhao Zhuang, Xuelong Li
Pioneered by the foundational information theory by Claude Shannon and the
visionary framework of machine intelligence by Alan Turing, the convergent
evolution of information and communication technologies (IT/CT) has created an
unbroken wave of connectivity and computation. This synergy has sparked a
technological revolution, now reaching its peak with large artificial
intelligence (AI) models that are reshaping industries and redefining
human-machine collaboration. However, the realization of ubiquitous
intelligence faces considerable challenges due to substantial resource
consumption in large models and high communication bandwidth demands. To
address these challenges, AI Flow has been introduced as a multidisciplinary
framework that integrates cutting-edge IT and CT advancements, with a
particular emphasis on the following three key points. First, a device-edge-cloud
framework serves as the foundation, integrating end devices, edge servers,
and cloud clusters to optimize scalability and efficiency for low-latency model
inference. Second, we introduce the concept of familial models, which refers to
a series of different-sized models with aligned hidden features, enabling
effective collaboration and the flexibility to adapt to varying resource
constraints and dynamic scenarios. Third, connectivity- and interaction-based
intelligence emergence is a novel paradigm of AI Flow. By leveraging
communication networks to enhance connectivity, the collaboration among AI
models across heterogeneous nodes achieves emergent intelligence that surpasses
the capability of any single model. The innovations of AI Flow provide enhanced
intelligence, timely responsiveness, and ubiquitous accessibility to AI
services, paving the way for the tighter fusion of AI techniques and
communication systems.
comment: Authors are with Institute of Artificial Intelligence (TeleAI), China
Telecom, China. Author names are listed alphabetically by surname. This work
was conducted at TeleAI, facilitated by Dr. Jiawei Shao (e-mail:
shaojw2@chinatelecom.cn) under the leadership of Prof. Xuelong Li. The
corresponding author is Prof. Xuelong Li (e-mail: xuelong li@ieee.org), the
CTO and Chief Scientist of China Telecom
♻ ☆ Traveling Across Languages: Benchmarking Cross-Lingual Consistency in Multimodal LLMs
The rapid evolution of multimodal large language models (MLLMs) has
significantly enhanced their real-world applications. However, achieving
consistent performance across languages, especially when integrating cultural
knowledge, remains a significant challenge. To better assess this issue, we
introduce two new benchmarks: KnowRecall and VisRecall, which evaluate
cross-lingual consistency in MLLMs. KnowRecall is a visual question answering
benchmark designed to measure factual knowledge consistency in 15 languages,
focusing on cultural and historical questions about global landmarks. VisRecall
assesses visual memory consistency by asking models to describe landmark
appearances in 9 languages without access to images. Experimental results
reveal that state-of-the-art MLLMs, including proprietary ones, still struggle
to achieve cross-lingual consistency. This underscores the need for more robust
approaches that produce truly multilingual and culturally aware models.
comment: https://github.com/nlp-waseda/traveling-across-languages
♻ ☆ Self-Guided Process Reward Optimization with Redefined Step-wise Advantage for Process Reinforcement Learning
Process Reinforcement Learning~(PRL) has demonstrated considerable potential
in enhancing the reasoning capabilities of Large Language Models~(LLMs).
However, introducing additional process reward models incurs substantial
computational overhead, and there is no unified theoretical framework for
process-level advantage estimation. To bridge this gap, we propose
\textbf{S}elf-Guided \textbf{P}rocess \textbf{R}eward
\textbf{O}ptimization~(\textbf{SPRO}), a novel framework that enables
process-aware RL through two key innovations: (1) we first theoretically
demonstrate that process rewards can be derived intrinsically from the policy
model itself, and (2) we introduce well-defined cumulative process rewards and
\textbf{M}asked \textbf{S}tep \textbf{A}dvantage (\textbf{MSA}), which
facilitates rigorous step-wise action advantage estimation within shared-prompt
sampling groups. Our experimental results demonstrate that SPRO outperforms
vanilla GRPO with 3.4x higher training efficiency and a 17.5\% test accuracy
improvement. Furthermore, SPRO maintains a stable and elevated policy entropy
throughout training while reducing the average response length by approximately
$1/3$, evidencing sufficient exploration and prevention of reward hacking.
Notably, SPRO incurs no additional computational overhead compared to
outcome-supervised RL methods such as GRPO, which benefits industrial
deployment.
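The abstract states that process rewards are derived from the policy itself and that Masked Step Advantage normalizes step-wise advantages within shared-prompt sampling groups, but it does not give the formulas. The snippet below is only a loose illustration of that general pattern (group-normalized cumulative step rewards with a validity mask), not SPRO's actual definition.

import numpy as np

def masked_step_advantage(step_rewards, mask):
    # Loose illustration: for a group of responses to the same prompt, compute
    # cumulative per-step rewards, then normalize each step against the group
    # (mean/std over responses), ignoring padded steps.
    # step_rewards, mask: (group_size, max_steps)
    returns = np.cumsum(step_rewards * mask, axis=1)          # cumulative process reward
    grp_mean = (returns * mask).sum(0) / np.clip(mask.sum(0), 1, None)
    grp_std = np.sqrt(((returns - grp_mean) ** 2 * mask).sum(0)
                      / np.clip(mask.sum(0), 1, None)) + 1e-6
    return (returns - grp_mean) / grp_std * mask              # masked step-wise advantage

rewards = np.random.rand(4, 6)          # 4 sampled responses, up to 6 reasoning steps
mask = (np.arange(6) < np.array([[6], [4], [5], [3]])).astype(float)
print(masked_step_advantage(rewards, mask).shape)   # (4, 6)
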
♻ ☆ Robustness of Misinformation Classification Systems to Adversarial Examples Through BeamAttack
We extend BeamAttack, an adversarial attack algorithm designed to evaluate
the robustness of text classification systems through word-level modifications
guided by beam search. Our extensions include support for word deletions and
the option to skip substitutions, enabling the discovery of minimal
modifications that alter model predictions. We also integrate LIME to better
prioritize word replacements. Evaluated across multiple datasets and victim
models (BiLSTM, BERT, and adversarially trained RoBERTa) within the BODEGA
framework, our approach achieves over a 99\% attack success rate while
preserving the semantic and lexical similarity of the original texts. Through
both quantitative and qualitative analysis, we highlight BeamAttack's
effectiveness and its limitations. Our implementation is available at
https://github.com/LucK1Y/BeamAttack
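A condensed sketch of beam search over word-level edits (substitution, deletion, or skipping a word) against a black-box classifier, in the spirit of the extensions described above; the victim model, candidate generator, and stopping rule are placeholder assumptions rather than the released implementation.

def beam_attack(tokens, victim_prob, candidates_for, beam_width=3, max_edits=3):
    # Beam search over word-level edits that most reduce the victim model's
    # probability for the original label. victim_prob(tokens) -> float in [0, 1];
    # candidates_for(word) -> list of replacement words ([] allows deletion only).
    beam = [(victim_prob(tokens), tuple(tokens))]
    for _ in range(max_edits):
        expanded = []
        for _, hyp in beam:
            for i, word in enumerate(hyp):
                options = candidates_for(word) + [None, word]  # None = delete; word = skip
                for repl in options:
                    new = list(hyp)
                    if repl is None:
                        del new[i]
                    else:
                        new[i] = repl
                    expanded.append((victim_prob(new), tuple(new)))
        beam = sorted(expanded, key=lambda x: x[0])[:beam_width]  # keep lowest-prob hyps
        if beam[0][0] < 0.5:                                      # label likely flipped
            break
    return list(beam[0][1]), beam[0][0]

# toy usage with a dummy classifier that dislikes the word "awful"
toks = "the movie was awful".split()
prob = lambda t: 0.9 if "awful" in t else 0.2
cands = lambda w: ["bad", "poor"] if w == "awful" else []
print(beam_attack(toks, prob, cands))
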
comment: 12 pages main text, 27 pages total including references and
appendices. 13 figures, 10 tables. Accepted for publication in the LNCS
proceedings of CLEF 2025 (Best-of-Labs track)
♻ ☆ Task Prompt Vectors: Effective Initialization through Multi-Task Soft-Prompt Transfer
Prompt tuning is an efficient solution for training large language models
(LLMs). However, current soft-prompt-based methods often sacrifice multi-task
modularity, requiring the training process to be fully or partially repeated
for each newly added task. While recent work on task vectors applied arithmetic
operations on full model weights to achieve the desired multi-task performance,
a similar approach for soft-prompts is still missing. To this end, we introduce
Task Prompt Vectors, created by element-wise difference between weights of
tuned soft-prompts and their random initialization. Experimental results on 12
NLU datasets show that task prompt vectors can be used in low-resource settings
to effectively initialize prompt tuning on similar tasks. In addition, we show
that task prompt vectors are independent of the random initialization of prompt
tuning on 2 different language model architectures. This allows prompt
arithmetics with the pre-trained vectors from different tasks. In this way, we
provide a competitive alternative to state-of-the-art baselines by arithmetic
addition of task prompt vectors from multiple tasks.
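Since the core operation is arithmetic on soft-prompt weights, a minimal sketch (with hypothetical prompt length and hidden size) looks as follows; the tuned prompts are simulated here rather than actually trained.

import torch

prompt_len, dim = 20, 768                       # hypothetical soft-prompt shape
init = torch.randn(prompt_len, dim)             # shared random initialization

# suppose these were obtained by prompt tuning from the same initialization
tuned_task_a = init + 0.1 * torch.randn(prompt_len, dim)
tuned_task_b = init + 0.1 * torch.randn(prompt_len, dim)

# task prompt vector = tuned soft-prompt minus its initialization
tpv_a = tuned_task_a - init
tpv_b = tuned_task_b - init

# prompt arithmetic: add task prompt vectors to the initialization to warm-start
# prompt tuning on a new, related task (multi-task combination by addition)
combined_prompt = init + tpv_a + tpv_b
print(combined_prompt.shape)  # torch.Size([20, 768])
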
♻ ☆ Crafting Hanzi as Narrative Bridges: An AI Co-Creation Workshop for Elderly Migrants
This paper explores how older adults, particularly aging migrants in urban
China, can engage AI-assisted co-creation to express personal narratives that
are often fragmented, underrepresented, or difficult to verbalize. Through a
pilot workshop combining oral storytelling and the symbolic reconstruction of
Hanzi, participants shared memories of migration and recreated new character
forms using Xiaozhuan glyphs, suggested by the Large Language Model (LLM),
together with physical materials. Supported by human facilitation and a soft AI
presence, participants transformed lived experience into visual and tactile
expressions without requiring digital literacy. This approach offers new
perspectives on human-AI collaboration and aging by repositioning AI not as a
content producer but as a supportive mechanism, and by supporting narrative
agency within sociotechnical systems.
comment: A version of this manuscript has been submitted to the [IASDR 2025
Conference](https://iasdr2025.org/) and is currently under review
♻ ☆ Delving into LLM-assisted writing in biomedical publications through excess vocabulary
Large language models (LLMs) like ChatGPT can generate and revise text with
human-level performance. These models come with clear limitations: they can
produce inaccurate information, reinforce existing biases, and be easily
misused. Yet, many scientists use them for their scholarly writing. But how
widespread is such LLM usage in the academic literature? To answer this
question for the field of biomedical research, we present an unbiased,
large-scale approach: we study vocabulary changes in over 15 million biomedical
abstracts from 2010--2024 indexed by PubMed, and show how the appearance of
LLMs led to an abrupt increase in the frequency of certain style words. This
excess word analysis suggests that at least 13.5% of 2024 abstracts were
processed with LLMs. This lower bound differed across disciplines, countries,
and journals, reaching 40% for some subcorpora. We show that LLMs have had an
unprecedented impact on scientific writing in biomedical research, surpassing
the effect of major world events such as the Covid pandemic.
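A schematic of the excess-word analysis: track a word's yearly frequency in the corpus, extrapolate a pre-LLM baseline, and treat the gap in 2024 as excess usage. The numbers and the linear extrapolation are toy stand-ins for the paper's estimates.

# toy yearly frequencies (fraction of abstracts containing the word "delve")
freq = {2019: 0.0021, 2020: 0.0022, 2021: 0.0023, 2022: 0.0024,
        2023: 0.0080, 2024: 0.0150}

# baseline: simple linear extrapolation from pre-ChatGPT years (a toy choice)
pre = [freq[y] for y in (2019, 2020, 2021, 2022)]
slope = (pre[-1] - pre[0]) / 3
expected_2024 = pre[-1] + slope * 2

excess = freq[2024] - expected_2024
excess_ratio = excess / freq[2024]
print(f"expected {expected_2024:.4f}, observed {freq[2024]:.4f}, "
      f"excess share {excess_ratio:.1%}")
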
comment: v5: Reverting to v3
♻ ☆ AIn't Nothing But a Survey? Using Large Language Models for Coding German Open-Ended Survey Responses on Survey Motivation
The recent development and wider accessibility of LLMs have spurred
discussions about how they can be used in survey research, including
classifying open-ended survey responses. Due to their linguistic capacities, it
is possible that LLMs are an efficient alternative to time-consuming manual
coding and the pre-training of supervised machine learning models. As most
existing research on this topic has focused on English-language responses
relating to non-complex topics or on single LLMs, it is unclear whether its
findings generalize and how the quality of these classifications compares to
established methods. In this study, we investigate to what extent different
LLMs can be used to code open-ended survey responses in other contexts, using
German data on reasons for survey participation as an example. We compare
several state-of-the-art LLMs and several prompting approaches, and evaluate
the LLMs' performance by using human expert codings. Overall performance
differs greatly between LLMs, and only a fine-tuned LLM achieves satisfactory
levels of predictive performance. Performance differences between prompting
approaches are conditional on the LLM used. Finally, LLMs' unequal
classification performance across different categories of reasons for survey
participation results in different categorical distributions when not using
fine-tuning. We discuss the implications of these findings, both for
methodological research on coding open-ended responses and for their
substantive analysis, and for practitioners processing or substantively
analyzing such data. Finally, we highlight the many trade-offs researchers need
to consider when choosing automated methods for open-ended response
classification in the age of LLMs. In doing so, our study contributes to the
growing body of research about the conditions under which LLMs can be
efficiently, accurately, and reliably leveraged in survey research.
comment: to appear in Survey Research Methods
♻ ☆ Improving the Robustness of Distantly-Supervised Named Entity Recognition via Uncertainty-Aware Teacher Learning and Student-Student Collaborative Learning ACL 2024
Distantly-Supervised Named Entity Recognition (DS-NER) is widely used in
real-world scenarios. It can effectively alleviate the burden of annotation by
matching entities in existing knowledge bases with snippets in the text but
suffer from the label noise. Recent works attempt to adopt the teacher-student
framework to gradually refine the training labels and improve the overall
robustness. However, these teacher-student methods achieve limited performance
because the poor calibration of the teacher network produces incorrectly
pseudo-labeled samples, leading to error propagation. Therefore, we propose:
(1) Uncertainty-Aware Teacher Learning that leverages the prediction
uncertainty to reduce the number of incorrect pseudo labels in the
self-training stage; (2) Student-Student Collaborative Learning that allows the
transfer of reliable labels between two student networks instead of
indiscriminately relying on all pseudo labels from its teacher, and further
enables a full exploration of mislabeled samples rather than simply filtering
unreliable pseudo-labeled samples. We evaluate our proposed method on five
DS-NER datasets, demonstrating that our method is superior to the
state-of-the-art DS-NER methods.
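A simplified illustration of the first component, filtering the teacher's pseudo-labels by predictive uncertainty before self-training; the entropy threshold and tensor shapes are hypothetical, and the student-student collaborative transfer is not shown.

import torch

def filter_pseudo_labels(teacher_probs: torch.Tensor, max_entropy: float = 0.5):
    # teacher_probs: (num_tokens, num_labels) softmax outputs of the teacher.
    # Returns pseudo-labels and a boolean mask keeping only low-entropy
    # (confident) predictions for the student's self-training loss.
    entropy = -(teacher_probs * teacher_probs.clamp_min(1e-12).log()).sum(-1)
    pseudo_labels = teacher_probs.argmax(-1)
    keep = entropy < max_entropy
    return pseudo_labels, keep

probs = torch.softmax(torch.randn(8, 5) * 3, dim=-1)   # toy teacher outputs
labels, keep = filter_pseudo_labels(probs)
print(labels, keep)   # uncertain tokens are masked out of the training signal
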
comment: ACL 2024 (Findings)
♻ ☆ Incorporating LLMs for Large-Scale Urban Complex Mobility Simulation
Yu-Lun Song, Chung-En Tsern, Che-Cheng Wu, Yu-Ming Chang, Syuan-Bo Huang, Wei-Chu Chen, Michael Chia-Liang Lin, Yu-Ta Lin
This study presents an innovative approach to urban mobility simulation by
integrating a Large Language Model (LLM) with Agent-Based Modeling (ABM).
Unlike traditional rule-based ABM, the proposed framework leverages LLM to
enhance agent diversity and realism by generating synthetic population
profiles, allocating routine and occasional locations, and simulating
personalized routes. Using real-world data, the simulation models individual
behaviors and large-scale mobility patterns in Taipei City. Key insights, such
as route heat maps and mode-specific indicators, provide urban planners with
actionable information for policy-making. Future work focuses on establishing
robust validation frameworks to ensure accuracy and reliability in urban
planning applications.
comment: 8 pages, 8 figures. This paper is reviewed and accepted by the CUPUM
(Computational Urban Planning and Urban Management) Conference held by
University College London (UCL) in 2025
♻ ☆ Decision-Oriented Text Evaluation
Natural language generation (NLG) is increasingly deployed in high-stakes
domains, yet common intrinsic evaluation methods, such as n-gram overlap or
sentence plausibility, weakly correlate with actual decision-making efficacy.
We propose a decision-oriented framework for evaluating generated text by
directly measuring its influence on human and large language model (LLM)
decision outcomes. Using market digest texts--including objective morning
summaries and subjective closing-bell analyses--as test cases, we assess
decision quality based on the financial performance of trades executed by human
investors and autonomous LLM agents informed exclusively by these texts. Our
findings reveal that neither humans nor LLM agents consistently surpass random
performance when relying solely on summaries. However, richer analytical
commentaries enable collaborative human-LLM teams to outperform individual
human or agent baselines significantly. Our approach underscores the importance
of evaluating generated text by its ability to facilitate synergistic
decision-making between humans and LLMs, highlighting critical limitations of
traditional intrinsic metrics.
♻ ☆ Token Prepending: A Training-Free Approach for Eliciting Better Sentence Embeddings from LLMs ACL 2025
Extracting sentence embeddings from large language models (LLMs) is a
promising direction, as LLMs have demonstrated stronger semantic understanding
capabilities. Previous studies typically focus on prompt engineering to elicit
sentence embeddings from LLMs by prompting the model to encode sentence
information into the embedding of the last token. However, LLMs are mostly
decoder-only models with causal attention and the earlier tokens in the
sentence cannot attend to the latter tokens, resulting in biased encoding of
sentence information and cascading effects on the final decoded token. To this
end, we propose a novel Token Prepending (TP) technique that prepends each
layer's decoded sentence embedding to the beginning of the sentence in the next
layer's input, allowing earlier tokens to attend to the complete sentence
information under the causal attention mechanism. The proposed TP technique is
a plug-and-play and training-free technique, which means it can be seamlessly
integrated with various prompt-based sentence embedding methods and
autoregressive LLMs. Extensive experiments on various Semantic Textual
Similarity (STS) tasks and downstream classification tasks demonstrate that our
proposed TP technique can significantly improve the performance of existing
prompt-based sentence embedding methods across different LLMs, while incurring
negligible additional inference cost.
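A toy sketch of the prepending operation between layers: after each layer, the hidden state of the last token (the "decoded" sentence embedding) is placed at the front of the next layer's input so earlier positions can attend to it under causal masking. The stand-in encoder layers, shapes, and placeholder token are illustrative; the actual technique runs inside a pretrained autoregressive LLM with a prompt template.

import torch
import torch.nn as nn

layers = nn.ModuleList([nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
                        for _ in range(4)])          # stand-in for decoder layers

hidden = torch.randn(1, 10, 64)                      # (batch, seq_len, d_model) token states
placeholder = torch.zeros(1, 1, 64)                  # slot for the prepended embedding
x = torch.cat([placeholder, hidden], dim=1)
causal = nn.Transformer.generate_square_subsequent_mask(x.size(1))

for layer in layers:
    x = layer(x, src_mask=causal)
    sentence_emb = x[:, -1:, :]                      # last-token state = decoded sentence embedding
    # Token Prepending: put this embedding at the front of the next layer's input
    x = torch.cat([sentence_emb, x[:, 1:, :]], dim=1)

print(x[:, -1, :].shape)                             # final sentence embedding, torch.Size([1, 64])
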
comment: Accept to ACL 2025 (Oral). Code are available on
https://github.com/fuyuchenIfyw/token_prepending.git
♻ ☆ Layered Insights: Generalizable Analysis of Authorial Style by Leveraging All Transformer Layers
We propose a new approach for the authorship attribution task that leverages
the various linguistic representations learned at different layers of
pre-trained transformer-based models. We evaluate our approach on three
datasets, comparing it to a state-of-the-art baseline in in-domain and
out-of-domain scenarios. We found that utilizing various transformer layers
improves the robustness of authorship attribution models when tested on
out-of-domain data, resulting in new state-of-the-art results. Our analysis
gives further insights into how our model's different layers get specialized in
representing certain stylistic features that benefit the model when tested out
of domain.
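A small sketch of the feature-extraction side of such an approach: request hidden states from every layer of a pretrained encoder and pool one vector per layer, then feed the stacked features to an attribution classifier (not shown). The model choice and mean pooling are assumptions for illustration.

import torch
from transformers import AutoModel, AutoTokenizer

name = "roberta-base"                               # illustrative encoder choice
tok = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name, output_hidden_states=True)

text = "An example document whose authorial style we want to represent."
inputs = tok(text, return_tensors="pt")
with torch.no_grad():
    out = model(**inputs)

# hidden_states: tuple of (num_layers + 1) tensors, each (1, seq_len, hidden)
per_layer = torch.stack([h.mean(dim=1) for h in out.hidden_states], dim=1)
print(per_layer.shape)   # (1, 13, 768) for roberta-base: one pooled vector per layer
# these layer-wise features would then feed an authorship classifier, letting it
# draw on lexical, syntactic, and semantic cues from different depths
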
♻ ☆ Skywork-Reward-V2: Scaling Preference Data Curation via Human-AI Synergy
Chris Yuhao Liu, Liang Zeng, Yuzhen Xiao, Jujie He, Jiacai Liu, Chaojie Wang, Rui Yan, Wei Shen, Fuxiang Zhang, Jiacheng Xu, Yang Liu, Yahui Zhou
Despite the critical role of reward models (RMs) in reinforcement learning
from human feedback (RLHF), current state-of-the-art open RMs perform poorly on
most existing evaluation benchmarks, failing to capture the spectrum of nuanced
and sophisticated human preferences. Even approaches that incorporate advanced
training techniques have not yielded meaningful performance improvements. We
hypothesize that this brittleness stems primarily from limitations in
preference datasets, which are often narrowly scoped, synthetically labeled, or
lack rigorous quality control. To address these challenges, we present a
large-scale preference dataset comprising 40 million preference pairs, named
SynPref-40M. To enable data curation at scale, we design a human-AI synergistic
two-stage pipeline that leverages the complementary strengths of human
annotation quality and AI scalability. In this pipeline, humans provide
verified annotations, while large language models perform automatic curation
based on human guidance. Training on this preference mixture, we introduce
Skywork-Reward-V2, a suite of eight reward models ranging from 0.6B to 8B
parameters, trained on a carefully curated subset of 26 million preference
pairs from SynPref-40M. We demonstrate that Skywork-Reward-V2 is versatile
across a wide range of capabilities, including alignment with human
preferences, objective correctness, safety, resistance to stylistic biases, and
best-of-N scaling, achieving state-of-the-art performance across seven major
reward model benchmarks. Ablation studies confirm that the effectiveness of our
approach stems not only from data scale but also from high-quality curation.
The Skywork-Reward-V2 series represents substantial progress in open reward
models, highlighting the untapped potential of existing preference datasets and
demonstrating how human-AI curation synergy can unlock significantly higher
data quality.
♻ ☆ Aligning Frozen LLMs by Reinforcement Learning: An Iterative Reweight-then-Optimize Approach
Xinnan Zhang, Chenliang Li, Siliang Zeng, Jiaxiang Li, Zhongruo Wang, Kaixiang Lin, Songtao Lu, Alfredo Garcia, Mingyi Hong
Aligning large language models (LLMs) with human preferences usually requires
fine-tuning methods such as RLHF and DPO. These methods directly optimize the
model parameters, so they cannot be used at test time to improve model
performance, nor are they applicable when the model weights are not accessible.
In contrast, test-time methods sidestep weight updates by leveraging reward
functions to guide and improve output quality. However, they incur high
inference costs, and their one-shot guidance is often based on imperfect reward
or value functions, leading to suboptimal outputs. In this work, we present a
method named Iterative Reweight-then-Optimize (IRO), a reinforcement learning
(RL) framework that performs RL-style alignment of the (frozen) base model
without touching its parameters. During training, each iteration (i) samples
candidates from the base model, (ii) resamples using current value functions,
and (iii) trains a new lightweight value function that guides the next decoding
pass. At test time, the value functions are used to guide the base model
generation via a search-based optimization process. Notably, users can apply
IRO to align a model on their own dataset, similar to OpenAI's reinforcement
fine-tuning (RFT), but without requiring access to the model weights.
♻ ☆ Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding
Chengyue Wu, Hao Zhang, Shuchen Xue, Zhijian Liu, Shizhe Diao, Ligeng Zhu, Ping Luo, Song Han, Enze Xie
Diffusion-based large language models (Diffusion LLMs) have shown promise for
non-autoregressive text generation with parallel decoding capabilities.
However, the practical inference speed of open-sourced Diffusion LLMs often
lags behind autoregressive models due to the lack of Key-Value (KV) Cache and
quality degradation when decoding multiple tokens simultaneously. To bridge
this gap, we introduce a novel block-wise approximate KV Cache mechanism
tailored for bidirectional diffusion models, enabling cache reuse with
negligible performance drop. Additionally, we identify the root cause of
generation quality degradation in parallel decoding as the disruption of token
dependencies under the conditional independence assumption. To address this, we
propose a confidence-aware parallel decoding strategy that selectively decodes
tokens exceeding a confidence threshold, mitigating dependency violations and
maintaining generation quality. Experimental results on LLaDA and Dream models
across multiple LLM benchmarks demonstrate up to \textbf{27.6$\times$
throughput} improvement with minimal accuracy loss, closing the performance gap
with autoregressive models and paving the way for practical deployment of
Diffusion LLMs.
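The confidence-aware parallel decoding rule can be schematized as follows: at each denoising step, commit only masked positions whose top-1 probability clears a threshold and leave the rest for later steps. The threshold, shapes, and mask convention are placeholders; the block-wise approximate KV cache is not shown.

import torch

MASK_ID = -1

def parallel_decode_step(probs: torch.Tensor, tokens: torch.Tensor,
                         threshold: float = 0.9) -> torch.Tensor:
    # probs: (seq_len, vocab) model distributions for every position;
    # tokens: (seq_len,) current sequence with MASK_ID at undecided slots.
    # Commits, in parallel, every masked position whose top-1 probability
    # exceeds the threshold; low-confidence positions stay masked.
    conf, pred = probs.max(dim=-1)
    commit = (tokens == MASK_ID) & (conf >= threshold)
    out = tokens.clone()
    out[commit] = pred[commit]
    return out

# toy step: 6 positions, vocabulary of 50, half the slots already decoded
probs = torch.softmax(torch.randn(6, 50) * 4, dim=-1)
tokens = torch.tensor([12, 7, MASK_ID, MASK_ID, 3, MASK_ID])
print(parallel_decode_step(probs, tokens))
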
♻ ☆ Bypass Back-propagation: Optimization-based Structural Pruning for Large Language Models via Policy Gradient ACL2025
Recent Large-Language Models (LLMs) pruning methods typically operate at the
post-training phase without the expensive weight finetuning, however, their
pruning criteria often rely on heuristically hand-crafted metrics, potentially
leading to suboptimal performance. We instead propose a novel
optimization-based structural pruning that learns the pruning masks in a
probabilistic space directly by optimizing the loss of the pruned model. To
preserve efficiency, our method eliminates the back-propagation through the LLM
per se during optimization, requiring only the forward pass of the LLM. We
achieve this by learning an underlying Bernoulli distribution to sample binary
pruning masks, where we decouple the Bernoulli parameters from LLM loss,
facilitating efficient optimization via policy gradient estimator without
back-propagation. Thus, our method can 1) support global and heterogeneous
pruning (i.e., automatically determine different redundancy for different
layers), and 2) optionally initialize with a metric-based method (for our
Bernoulli distributions). Extensive experiments conducted on LLaMA, LLaMA-2,
LLaMA-3, Vicuna, and Mistral models using the C4 and WikiText2 datasets
demonstrate the promising performance of our method in efficiency and
effectiveness. Code is available at
https://github.com/ethanygao/backprop-free_LLM_pruning.
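A compact, forward-only illustration of the policy-gradient idea: sample binary pruning masks from per-unit Bernoulli parameters, score the pruned model without backpropagating through it, and update the Bernoulli logits with a REINFORCE-style estimator. The toy loss stands in for evaluating a pruned LLM, and the sparsity penalty and hyperparameters are assumptions.

import torch

num_units, sparsity_target = 32, 0.5
logits = torch.zeros(num_units, requires_grad=True)      # Bernoulli parameters to learn
opt = torch.optim.Adam([logits], lr=0.05)

def pruned_model_loss(mask: torch.Tensor) -> torch.Tensor:
    # Stand-in for evaluating the pruned model with a forward pass only
    # (pretend the first half of the units matter most).
    importance = torch.linspace(1.0, 0.0, num_units)
    return (importance * (1 - mask)).sum() / num_units

baseline = None
for step in range(200):
    probs = torch.sigmoid(logits)
    mask = torch.bernoulli(probs).detach()                # sample a binary pruning mask
    loss = pruned_model_loss(mask) + 0.1 * (mask.mean() - sparsity_target).abs()
    baseline = loss.item() if baseline is None else 0.9 * baseline + 0.1 * loss.item()
    # REINFORCE: no gradient flows through the model, only through log-prob of the mask
    log_prob = (mask * probs.clamp_min(1e-6).log()
                + (1 - mask) * (1 - probs).clamp_min(1e-6).log()).sum()
    surrogate = (loss.item() - baseline) * log_prob
    opt.zero_grad()
    surrogate.backward()
    opt.step()

print(torch.sigmoid(logits).round())   # learned keep-probabilities favor important units
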
comment: ACL2025 Main Accepted
♻ ☆ REINFORCE++: An Efficient RLHF Algorithm with Robustness to Both Prompt and Reward Models
Jian Hu, Xibin Wu, Wei Shen, Jason Klein Liu, Zilin Zhu, Weixun Wang, Songlin Jiang, Haoran Wang, Hao Chen, Bin Chen, Weikai Fang, Xianyu, Yu Cao, Haotian Xu
Large Language Models (LLMs) fine-tuned via Reinforcement Learning from Human
Feedback (RLHF) and Reinforcement Learning with Verifiable Rewards (RLVR)
significantly improve the alignment of human-AI values and further raise the
upper bound of AI capabilities, particularly in reasoning-intensive,
long-context Chain-of-Thought (long-CoT) tasks. However, existing RLHF (or
RLVR) frameworks commonly face challenges such as inference bottlenecks and
complexity barriers, restricting their accessibility for newcomers. To bridge
this gap, we introduce \textbf{OpenRLHF}, a user-friendly, scalable, and
easy-to-learn open-source RLHF framework built upon Ray, vLLM, DeepSpeed, and
HuggingFace Transformers, featuring a simplified design, clear code structure,
and comprehensive documentation to facilitate entry for researchers and
practitioners. Experimental results show that OpenRLHF achieves superior
training efficiency with speedups ranging from 1.22x to 1.68x across different
model sizes compared to state-of-the-art frameworks, while requiring
significantly fewer lines of code for implementation. OpenRLHF is publicly
available at https://github.com/OpenRLHF/OpenRLHF, and has already been adopted
by leading institutions to accelerate RLHF research and learning.
comment: fix typo
♻ ☆ Commander-GPT: Fully Unleashing the Sarcasm Detection Capability of Multi-Modal Large Language Models
Sarcasm detection, as a crucial research direction in the field of Natural
Language Processing (NLP), has attracted widespread attention. Traditional
sarcasm detection tasks have typically focused on single-modal approaches
(e.g., text), but due to the implicit and subtle nature of sarcasm, such
methods often fail to yield satisfactory results. In recent years, researchers
have shifted the focus of sarcasm detection to multi-modal approaches. However,
effectively leveraging multi-modal information to accurately identify sarcastic
content remains a challenge that warrants further exploration. Leveraging the
powerful integrated processing capabilities of Multi-Modal Large Language
Models (MLLMs) for various information sources, we propose an innovative
multi-modal Commander-GPT framework. Inspired by military strategy, we first
decompose the sarcasm detection task into six distinct sub-tasks. A central
commander (decision-maker) then assigns the best-suited large language model to
address each specific sub-task. Ultimately, the detection results from each
model are aggregated to identify sarcasm. We conducted extensive experiments on
MMSD and MMSD 2.0, utilizing four multi-modal large language models and six
prompting strategies. Our experiments demonstrate that our approach achieves
state-of-the-art performance, with a 19.3% improvement in F1 score, without
necessitating fine-tuning or ground-truth rationales.
comment: Our original goal was to use Commander-GPT: Dividing and Routing for
Multimodal Sarcasm Detection (arXiv:2506.19420) to replace Commander-GPT:
Fully Unleashing the Sarcasm Detection Capability of Multi-Modal Large
Language Models (arXiv:2503.18681). Due to various reasons, both versions
were released, so we would like to withdraw the latter
♻ ☆ Prompt-Guided Turn-Taking Prediction
Turn-taking prediction models are essential components in spoken dialogue
systems and conversational robots. Recent approaches leverage transformer-based
architectures to predict speech activity continuously and in real-time. In this
study, we propose a novel model that enables turn-taking prediction to be
dynamically controlled via textual prompts. This approach allows intuitive and
explicit control through instructions such as "faster" or "calmer", adapting
dynamically to conversational partners and contexts. The proposed model builds
upon a transformer-based voice activity projection (VAP) model, incorporating
textual prompt embeddings into both channel-wise transformers and a
cross-channel transformer. We evaluated the feasibility of our approach using
over 950 hours of human-human spoken dialogue data. Since textual prompt data
for the proposed approach was not available in existing datasets, we utilized a
large language model (LLM) to generate synthetic prompt sentences. Experimental
results demonstrated that the proposed model improved prediction accuracy and
effectively varied turn-taking timing behaviors according to the textual
prompts.
comment: This paper has been accepted for presentation at SIGdial Meeting on
Discourse and Dialogue 2025 (SIGDIAL 2025) and represents the author's
version of the work
♻ ☆ Optimal strategies to perform multilingual analysis of social content for a novel dataset in the tourism domain
Maxime Masson, Rodrigo Agerri, Christian Sallaberry, Marie-Noelle Bessagnet, Annig Le Parc Lacayrelle, Philippe Roose
The rising influence of social media platforms in various domains, including
tourism, has highlighted the growing need for efficient and automated Natural
Language Processing (NLP) strategies to take advantage of this valuable
resource. However, the transformation of multilingual, unstructured, and
informal texts into structured knowledge still poses significant challenges,
most notably the never-ending requirement for manually annotated data to train
deep learning classifiers. In this work, we study different NLP techniques to
establish the best ones to obtain competitive performances while keeping the
need for training annotated data to a minimum. To do so, we built the first
publicly available multilingual dataset (French, English, and Spanish) for the
tourism domain, composed of tourism-related tweets. The dataset includes
multilayered, manually revised annotations for Named Entity Recognition (NER)
for Locations and Fine-grained Thematic Concepts Extraction mapped to the
Thesaurus of Tourism and Leisure Activities of the World Tourism Organization,
as well as for Sentiment Analysis at the tweet level. Extensive experimentation
comparing various few-shot and fine-tuning techniques with modern language
models demonstrates that modern few-shot techniques allow us to obtain
competitive results for all three tasks with very little annotation data: 5
tweets per label (15 in total) for Sentiment Analysis, 30 tweets for Named
Entity Recognition of Locations and 1K tweets annotated with fine-grained
thematic concepts, a highly fine-grained sequence labeling task based on an
inventory of 315 classes. We believe that our results, grounded in a novel
dataset, pave the way for applying NLP to new domain-specific applications,
reducing the need for manual annotations and circumventing the complexities of
rule-based, ad-hoc solutions.
♻ ☆ Causal Representation Learning with Generative Artificial Intelligence: Application to Texts as Treatments
In this paper, we demonstrate how to enhance the validity of causal inference
with unstructured high-dimensional treatments like texts, by leveraging the
power of generative Artificial Intelligence (GenAI). Specifically, we propose
to use a deep generative model such as large language models (LLMs) to
efficiently generate treatments and use their internal representation for
subsequent causal effect estimation. We show that the knowledge of this true
internal representation helps disentangle the treatment features of interest,
such as specific sentiments and certain topics, from other possibly unknown
confounding features. Unlike existing methods, the proposed GenAI-Powered
Inference (GPI) methodology eliminates the need to learn causal representation
from the data, and hence produces more accurate and efficient estimates. We
formally establish the conditions required for the nonparametric identification
of the average treatment effect, propose an estimation strategy that avoids the
violation of the overlap assumption, and derive the asymptotic properties of
the proposed estimator through the application of double machine learning.
Finally, using an instrumental variables approach, we extend the proposed
methodology to the settings in which the treatment feature is based on human
perception. The proposed GPI methodology is also applicable to text reuse where
an LLM is used to regenerate existing texts. We conduct simulation and
empirical studies, using the generated text data from an open-source LLM, Llama
3, to illustrate the advantages of our estimator over state-of-the-art causal
representation learning algorithms.
♻ ☆ SMARTe: Slot-based Method for Accountable Relational Triple extraction
Relational Triple Extraction (RTE) is a fundamental task in Natural Language
Processing (NLP). However, prior research has primarily focused on optimizing
model performance, with limited efforts to understand the internal mechanisms
driving these models. Many existing methods rely on complex preprocessing to
induce specific interactions, often resulting in opaque systems that may not
fully align with their theoretical foundations. To address these limitations,
we propose SMARTe: a Slot-based Method for Accountable Relational Triple
extraction. SMARTe introduces intrinsic interpretability through a slot
attention mechanism and frames the task as a set prediction problem. Slot
attention consolidates relevant information into distinct slots, ensuring all
predictions can be explicitly traced to learned slot representations and the
tokens contributing to each predicted relational triple. While emphasizing
interpretability, SMARTe achieves performance comparable to state-of-the-art
models. Evaluations on the NYT and WebNLG datasets demonstrate that adding
interpretability does not compromise performance. Furthermore, we conducted
qualitative assessments to showcase the explanations provided by SMARTe, using
attention heatmaps that map to their respective tokens. We conclude with a
discussion of our findings and propose directions for future research. Our code
is available at https://github.com/Chen-XueWen/SMARTe.
♻ ☆ Circuit-tuning: A Mechanistic Approach for Identifying Parameter Redundancy and Fine-tuning Neural Networks
The study of mechanistic interpretability aims to reverse-engineer a model to
explain its behaviors. While recent studies have focused on the static
mechanism of a certain behavior, the learning dynamics inside a model remain to
be explored. In this work, we develop an interpretable fine-tuning method for
analyzing the mechanism behind learning. We first introduce the concept of
node-level intrinsic dimensionality to describe the learning process of a model
in a computational graph. Based on our theory, we propose circuit-tuning, a
two-stage algorithm that iteratively builds the minimal subgraph for a specific
task and updates the key parameters in a heuristic way. Experimental results
confirm the existence of the intrinsic dimensionality at the node level and
demonstrate the effectiveness of our method for transparent and interpretable
fine-tuning. We visualize and analyze the circuits before, during, and after
fine-tuning, providing new insights into the self-organization mechanism of a
neural network in the learning process.
♻ ☆ Mixture of Reasonings: Teach Large Language Models to Reason with Adaptive Strategies
Large language models (LLMs) excel in complex tasks through advanced
prompting techniques like Chain-of-Thought (CoT) and Tree-of-Thought (ToT), but
their reliance on manually crafted, task-specific prompts limits adaptability
and efficiency. We introduce Mixture of Reasoning (MoR), a training framework
that embeds diverse reasoning strategies into LLMs for autonomous,
task-adaptive reasoning without external prompt engineering. MoR has two
phases: Thought Generation, creating reasoning chain templates with models like
GPT-4o, and SFT Dataset Construction, pairing templates with benchmark datasets
for supervised fine-tuning. Our experiments show that MoR significantly
enhances performance, with MoR150 achieving 0.730 (a 2.2% improvement) with CoT
prompting and 0.734 (a 13.5% improvement) over the baselines. MoR eliminates
the need for task-specific prompts, offering a generalizable solution for
robust reasoning across diverse tasks.