Computation and Language 74
☆ StreamBridge: Turning Your Offline Video Large Language Model into a Proactive Streaming Assistant
Haibo Wang, Bo Feng, Zhengfeng Lai, Mingze Xu, Shiyu Li, Weifeng Ge, Afshin Dehghan, Meng Cao, Ping Huang
We present StreamBridge, a simple yet effective framework that seamlessly
transforms offline Video-LLMs into streaming-capable models. It addresses two
fundamental challenges in adapting existing models to online scenarios: (1)
limited capability for multi-turn real-time understanding, and (2) lack of
proactive response mechanisms. Specifically, StreamBridge incorporates (1) a
memory buffer combined with a round-decayed compression strategy, supporting
long-context multi-turn interactions, and (2) a decoupled, lightweight
activation model that can be effortlessly integrated into existing Video-LLMs,
enabling continuous proactive responses. To further support StreamBridge, we
construct Stream-IT, a large-scale dataset tailored for streaming video
understanding, featuring interleaved video-text sequences and diverse
instruction formats. Extensive experiments show that StreamBridge significantly
improves the streaming understanding capabilities of offline Video-LLMs across
various tasks, outperforming even proprietary models such as GPT-4o and Gemini
1.5 Pro. Simultaneously, it achieves competitive or superior performance on
standard video understanding benchmarks.
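As a rough illustration of the round-decayed compression idea, the following is a minimal, hypothetical buffer in which older dialogue rounds keep progressively fewer tokens; the class names, uniform subsampling, and decay factor are assumptions for illustration, not the paper's implementation.

```python
# Hypothetical sketch of a round-decayed memory buffer: older dialogue
# rounds are subsampled more aggressively than recent ones. The decay rule,
# names, and uniform subsampling are assumptions, not the paper's method.

class RoundDecayedBuffer:
    def __init__(self, max_tokens=4096, decay=0.5):
        self.max_tokens = max_tokens
        self.decay = decay        # fraction of a round's tokens kept per pass
        self.rounds = []          # one list of token ids per dialogue round

    def append_round(self, tokens):
        self.rounds.append(list(tokens))
        self._compress()

    def _compress(self):
        # Shrink the oldest rounds first; drop the oldest entirely once
        # every round has reached its minimum size.
        while self.rounds and sum(len(r) for r in self.rounds) > self.max_tokens:
            for r in self.rounds:
                keep = max(1, int(len(r) * self.decay))
                if keep < len(r):
                    step = len(r) / keep  # uniform subsample as a stand-in
                    r[:] = [r[int(i * step)] for i in range(keep)]
                    break
            else:
                self.rounds.pop(0)

    def context(self):
        return [t for r in self.rounds for t in r]
```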
☆ ComPO: Preference Alignment via Comparison Oracles
Direct alignment methods are increasingly used for aligning large language
models (LLMs) with human preferences. However, these methods suffer from the
issues of verbosity and likelihood displacement, which can be driven by the
noisy preference pairs that induce similar likelihood for preferred and
dispreferred responses. The contributions of this paper are two-fold. First, we
propose a new preference alignment method based on comparison oracles and
provide a convergence guarantee for its basic scheme. Second, we improve the
method with several heuristics and conduct experiments to demonstrate the
flexibility and compatibility of the practical scheme in improving the
performance of LLMs trained on noisy preference pairs. Evaluations are conducted across multiple
base and instruction-tuned models (Mistral-7B, Llama-3-8B and Gemma-2-9B) with
benchmarks (AlpacaEval 2, MT-Bench and Arena-Hard). Experimental results show
the effectiveness of our method as an alternative for addressing the limitations
of existing direct alignment methods. A highlight of our work is that we
demonstrate the importance of designing specialized methods for preference pairs
with distinct likelihood margins, which complements the recent findings in
\citet{Razin-2025-Unintentional}.
comment: 25 pages
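For intuition, a generic zeroth-order update driven by a comparison oracle, in the spirit of the basic scheme described above, can be sketched as below; the probe scale, step size, and probe averaging are illustrative assumptions rather than the paper's algorithm.

```python
# Illustrative zeroth-order update driven by a comparison oracle:
# oracle(a, b) returns +1 if parameters a are preferred over b, else -1.
# A generic comparison-based scheme for intuition, not the paper's algorithm.
import numpy as np

def comparison_oracle_step(theta, oracle, step=1e-2, probe=1e-3, n_probes=8):
    direction = np.zeros_like(theta)
    for _ in range(n_probes):
        u = np.random.randn(*theta.shape)
        u /= np.linalg.norm(u)
        # Ask only which perturbed model is better; no gradients needed.
        direction += oracle(theta + probe * u, theta - probe * u) * u
    direction /= np.linalg.norm(direction) + 1e-12
    return theta + step * direction
```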
☆ Bring Reason to Vision: Understanding Perception and Reasoning through Model Merging ICML 2025
Vision-Language Models (VLMs) combine visual perception with the general
capabilities, such as reasoning, of Large Language Models (LLMs). However, the
mechanisms by which these two abilities combine and contribute remain
poorly understood. In this work, we explore how to compose perception and reasoning
through model merging, which connects the parameters of different models. Unlike
previous works that often focus on merging models of the same kind, we propose
merging models across modalities, enabling the incorporation of the reasoning
capabilities of LLMs into VLMs. Through extensive experiments, we demonstrate
that model merging offers a successful pathway to transfer reasoning abilities
from LLMs to VLMs in a training-free manner. Moreover, we utilize the merged
models to understand the internal mechanism of perception and reasoning and how
merging affects it. We find that perception capabilities are predominantly
encoded in the early layers of the model, whereas reasoning is largely
facilitated by the middle-to-late layers. After merging, we observe that all
layers begin to contribute to reasoning, whereas the distribution of perception
abilities across layers remains largely unchanged. These observations shed
light on the potential of model merging as a tool for multimodal integration
and interpretation.
comment: ICML 2025. Our code is publicly available at
https://github.com/shiqichen17/VLM_Merging
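As a concrete picture of training-free merging across modalities, the sketch below linearly interpolates a VLM's language-backbone weights with those of a reasoning-tuned LLM; the shared parameter naming and the interpolation weight alpha are assumptions, not the paper's exact recipe.

```python
# Minimal sketch of cross-modality merging: interpolate the VLM's language
# backbone with a reasoning-tuned LLM wherever parameter names and shapes
# match. The shared naming and the weight alpha are assumptions.
import torch

def merge_state_dicts(vlm_sd, llm_sd, alpha=0.5):
    merged = {}
    for name, w in vlm_sd.items():
        if name in llm_sd and llm_sd[name].shape == w.shape:
            merged[name] = (1 - alpha) * w + alpha * llm_sd[name]
        else:
            merged[name] = w  # vision tower and projector stay untouched
    return merged
```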
☆ UKElectionNarratives: A Dataset of Misleading Narratives Surrounding Recent UK General Elections AAAI
Fatima Haouari, Carolina Scarton, Nicolò Faggiani, Nikolaos Nikolaidis, Bonka Kotseva, Ibrahim Abu Farha, Jens Linge, Kalina Bontcheva
Misleading narratives play a crucial role in shaping public opinion during
elections, as they can influence how voters perceive candidates and political
parties. This entails the need to detect these narratives accurately. To
address this, we introduce the first taxonomy of common misleading narratives
that circulated during recent elections in Europe. Based on this taxonomy, we
construct and analyse UKElectionNarratives: the first dataset of
human-annotated misleading narratives which circulated during the UK General
Elections in 2019 and 2024. We also benchmark Pre-trained and Large Language
Models (focusing on GPT-4o), studying their effectiveness in detecting
election-related misleading narratives. Finally, we discuss potential use cases
and make recommendations for future research directions using the proposed
codebook and dataset.
comment: This work was accepted at the International AAAI Conference on Web
and Social Media (ICWSM 2025)
☆ Adaptive Markup Language Generation for Contextually-Grounded Visual Document Understanding CVPR2025
Han Xiao, Yina Xie, Guanxin Tan, Yinghao Chen, Rui Hu, Ke Wang, Aojun Zhou, Hao Li, Hao Shao, Xudong Lu, Peng Gao, Yafei Wen, Xiaoxin Chen, Shuai Ren, Hongsheng Li
Visual Document Understanding has become essential with the increase of
text-rich visual content. This field poses significant challenges due to the
need for effective integration of visual perception and textual comprehension,
particularly across diverse document types with complex layouts. Moreover,
existing fine-tuning datasets for this domain often fall short of providing the
detailed contextual information needed for robust understanding, leading to
hallucinations and limited comprehension of spatial relationships among visual
elements. To address these challenges, we propose an innovative pipeline that
utilizes adaptive generation of markup languages, such as Markdown, JSON, HTML,
and TiKZ, to build highly structured document representations and deliver
contextually-grounded responses. We introduce two fine-grained structured
datasets: DocMark-Pile, comprising approximately 3.8M pretraining data pairs
for document parsing, and DocMark-Instruct, featuring 624k fine-tuning data
annotations for grounded instruction following. Extensive experiments
demonstrate that our proposed model significantly outperforms existing
state-of-the-art MLLMs across a range of visual document understanding
benchmarks, facilitating advanced reasoning and comprehension capabilities in
complex visual scenarios. Our code and models are released at
https://github.com/Euphoria16/DocMark.
comment: CVPR2025
☆ clem:todd: A Framework for the Systematic Benchmarking of LLM-Based Task-Oriented Dialogue System Realisations
The emergence of instruction-tuned large language models (LLMs) has advanced
the field of dialogue systems, enabling both realistic user simulations and
robust multi-turn conversational agents. However, existing research often
evaluates these components in isolation, either focusing on a single user
simulator or on a specific system design, which limits the generalisability of
insights across architectures and configurations. In this work, we propose clem todd
(chat-optimized LLMs for task-oriented dialogue systems development), a
flexible framework for systematically evaluating dialogue systems under
consistent conditions. clem todd enables detailed benchmarking across
combinations of user simulators and dialogue systems, whether existing models
from literature or newly developed ones. It supports plug-and-play integration
and ensures uniform datasets, evaluation metrics, and computational
constraints. We showcase clem todd's flexibility by re-evaluating existing
task-oriented dialogue systems within this unified setup and integrating three
newly proposed dialogue systems into the same evaluation pipeline. Our results
provide actionable insights into how architecture, scale, and prompting
strategies affect dialogue performance, offering practical guidance for
building efficient and effective conversational AI systems.
comment: 30 pages
☆ Ultra-FineWeb: Efficient Data Filtering and Verification for High-Quality LLM Training Data
Yudong Wang, Zixuan Fu, Jie Cai, Peijun Tang, Hongya Lyu, Yewei Fang, Zhi Zheng, Jie Zhou, Guoyang Zeng, Chaojun Xiao, Xu Han, Zhiyuan Liu
With the rapid development of large language models (LLMs), data quality has
become a key factor in enhancing model performance. Model-driven data filtering
has increasingly become a primary approach for acquiring high-quality data.
However, it still faces two main challenges: (1) the lack of an efficient data
verification strategy makes it difficult to provide timely feedback on data
quality; and (2) the selection of seed data for training classifiers lacks
clear criteria and relies heavily on human expertise, introducing a degree of
subjectivity. To address the first challenge, we introduce an efficient
verification strategy that enables rapid evaluation of the impact of data on
LLM training with minimal computational cost. To tackle the second challenge,
we build upon the assumption that high-quality seed data is beneficial for LLM
training, and by integrating the proposed verification strategy, we optimize
the selection of positive and negative samples and propose an efficient data
filtering pipeline. This pipeline not only improves filtering efficiency,
classifier quality, and robustness, but also significantly reduces experimental
and inference costs. In addition, to efficiently filter high-quality data, we
employ a lightweight classifier based on fastText, and successfully apply the
filtering pipeline to two widely used pre-training corpora, the FineWeb and
Chinese FineWeb datasets, resulting in the higher-quality Ultra-FineWeb
dataset. Ultra-FineWeb contains approximately 1 trillion English tokens and 120
billion Chinese tokens. Empirical results demonstrate that the LLMs trained on
Ultra-FineWeb exhibit significant performance improvements across multiple
benchmark tasks, validating the effectiveness of our pipeline in enhancing both
data quality and training efficiency.
comment: The datasets are available on
https://huggingface.co/datasets/openbmb/UltraFineWeb
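The fastText-based filtering step maps directly onto the fasttext library; the seed file, label name, and threshold below are placeholder assumptions, with each training line of the form "__label__hq <document text>".

```python
# Sketch of the classifier-based filtering step with the fasttext library.
# Seed file, label name, and threshold are placeholder assumptions.
import fasttext

model = fasttext.train_supervised("seed_data.txt")

def keep(document, threshold=0.9):
    # fastText predict() cannot handle newlines, so flatten the document.
    labels, probs = model.predict(document.replace("\n", " "))
    return labels[0] == "__label__hq" and probs[0] >= threshold

corpus = ["a well-structured, informative article ...", "keyword spam ..."]
filtered = [doc for doc in corpus if keep(doc)]
```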
☆ TransProQA: an LLM-based literary Translation evaluation metric with Professional Question Answering
The impact of Large Language Models (LLMs) has extended into literary
domains. However, existing evaluation metrics prioritize mechanical accuracy
over artistic expression and tend to overrate machine translation (MT) as
superior to experienced professional human translation. In the long run, this
bias could result in a permanent decline in translation quality and cultural
authenticity. In response to the urgent need for a specialized literary
evaluation metric, we introduce TransProQA, a novel, reference-free, LLM-based
question-answering (QA) framework designed specifically for literary
translation evaluation. TransProQA uniquely integrates insights from
professional literary translators and researchers, focusing on critical
elements in literary quality assessment such as literary devices, cultural
understanding, and authorial voice. Our extensive evaluation shows that while
literary-finetuned XCOMET-XL yields marginal gains, TransProQA substantially
outperforms current metrics, achieving a gain of up to 0.07 in correlation (ACC-EQ
and Kendall's tau) and surpassing the best state-of-the-art (SOTA) metrics by
over 15 points in adequacy assessments. Incorporating professional translator
insights as weights further improves performance, highlighting the value of
translator inputs. Notably, TransProQA approaches human-level evaluation
performance comparable to trained linguistic annotators. It demonstrates broad
applicability to open-source models such as LLaMA3.3-70b and Qwen2.5-32b,
indicating its potential as an accessible and training-free literary evaluation
metric and a valuable tool for evaluating texts that require local processing
due to copyright or ethical considerations.
comment: WIP
☆ TokLIP: Marry Visual Tokens to CLIP for Multimodal Comprehension and Generation
Haokun Lin, Teng Wang, Yixiao Ge, Yuying Ge, Zhichao Lu, Ying Wei, Qingfu Zhang, Zhenan Sun, Ying Shan
Pioneering token-based works such as Chameleon and Emu3 have established a
foundation for multimodal unification but face challenges of high training
computational overhead and limited comprehension performance due to a lack of
high-level semantics. In this paper, we introduce TokLIP, a visual tokenizer
that enhances comprehension by semanticizing vector-quantized (VQ) tokens and
incorporating CLIP-level semantics while enabling end-to-end multimodal
autoregressive training with standard VQ tokens. TokLIP integrates a low-level
discrete VQ tokenizer with a ViT-based token encoder to capture high-level
continuous semantics. Unlike previous approaches (e.g., VILA-U) that discretize
high-level features, TokLIP disentangles training objectives for comprehension
and generation, allowing the direct application of advanced VQ tokenizers
without the need for tailored quantization operations. Our empirical results
demonstrate that TokLIP achieves exceptional data efficiency, empowering visual
tokens with high-level semantic understanding while enhancing low-level
generative capacity, making it well-suited for autoregressive Transformers in
both comprehension and generation tasks. The code and models are available at
https://github.com/TencentARC/TokLIP.
comment: Technical Report
☆ Reasoning Models Don't Always Say What They Think
Yanda Chen, Joe Benton, Ansh Radhakrishnan, Jonathan Uesato, Carson Denison, John Schulman, Arushi Somani, Peter Hase, Misha Wagner, Fabien Roger, Vlad Mikulik, Samuel R. Bowman, Jan Leike, Jared Kaplan, Ethan Perez
Chain-of-thought (CoT) offers a potential boon for AI safety as it allows
monitoring a model's CoT to try to understand its intentions and reasoning
processes. However, the effectiveness of such monitoring hinges on CoTs
faithfully representing models' actual reasoning processes. We evaluate CoT
faithfulness of state-of-the-art reasoning models across 6 reasoning hints
presented in the prompts and find: (1) for most settings and models tested,
CoTs reveal their usage of hints in at least 1% of examples where they use the
hint, but the reveal rate is often below 20%, (2) outcome-based reinforcement
learning initially improves faithfulness but plateaus without saturating, and
(3) when reinforcement learning increases how frequently hints are used (reward
hacking), the propensity to verbalize them does not increase, even without
training against a CoT monitor. These results suggest that CoT monitoring is a
promising way of noticing undesired behaviors during training and evaluations,
but that it is not sufficient to rule them out. They also suggest that in
settings like ours where CoT reasoning is not necessary, test-time monitoring
of CoTs is unlikely to reliably catch rare and catastrophic unexpected
behaviors.
☆ Crosslingual Reasoning through Test-Time Scaling
Zheng-Xin Yong, M. Farid Adilazuarda, Jonibek Mansurov, Ruochen Zhang, Niklas Muennighoff, Carsten Eickhoff, Genta Indra Winata, Julia Kreutzer, Stephen H. Bach, Alham Fikri Aji
Reasoning capabilities of large language models are primarily studied for
English, even when pretrained models are multilingual. In this work, we
investigate to what extent English reasoning finetuning with long
chain-of-thoughts (CoTs) can generalize across languages. First, we find that
scaling up inference compute for English-centric reasoning language models
(RLMs) improves multilingual mathematical reasoning across many languages
including low-resource languages, to an extent where they outperform models
twice their size. Second, we reveal that while English-centric RLMs' CoTs are
naturally predominantly English, they consistently follow a quote-and-think
pattern to reason about quoted non-English inputs. Third, we discover an
effective strategy to control the language of long CoT reasoning, and we
observe that models reason better and more efficiently in high-resource
languages. Finally, we observe poor out-of-domain reasoning generalization, in
particular from STEM to cultural commonsense knowledge, even for English.
Overall, we demonstrate the potential, study the mechanisms, and outline the
limitations of crosslingual generalization of English reasoning test-time
scaling. We conclude that practitioners should let English-centric RLMs reason
in high-resource languages, while further work is needed to improve reasoning
in low-resource languages and out-of-domain contexts.
☆ Frame In, Frame Out: Do LLMs Generate More Biased News Headlines than Humans?
Framing in media critically shapes public perception by selectively
emphasizing some details while downplaying others. With the rise of large
language models in automated news and content creation, there is growing
concern that these systems may introduce or even amplify framing biases
compared to human authors. In this paper, we explore how framing manifests in
both out-of-the-box and fine-tuned LLM-generated news content. Our analysis
reveals that, particularly in politically and socially sensitive contexts, LLMs
tend to exhibit more pronounced framing than their human counterparts. In
addition, we observe significant variation in framing tendencies across
different model architectures, with some models displaying notably higher
biases. These findings point to the need for effective post-training mitigation
strategies and tighter evaluation frameworks to ensure that automated news
content upholds the standards of balanced reporting.
☆ ICon: In-Context Contribution for Automatic Data Selection
Data selection for instruction tuning is essential for improving the
performance of Large Language Models (LLMs) and reducing training cost.
However, existing automated selection methods either depend on computationally
expensive gradient-based measures or manually designed heuristics, which may
fail to fully exploit the intrinsic attributes of data. In this paper, we
propose In-context Learning for Contribution Measurement (ICon), a novel
gradient-free method that takes advantage of the implicit fine-tuning nature of
in-context learning (ICL) to measure sample contribution without gradient
computation or manual indicator engineering. ICon offers a computationally
efficient alternative to gradient-based methods and reduces human inductive
bias inherent in heuristic-based approaches. ICon comprises three components
and identifies high-contribution data by assessing performance shifts under
implicit learning through ICL. Extensive experiments on three LLMs across 12
benchmarks and 5 pairwise evaluation sets demonstrate the effectiveness of
ICon. Remarkably, on LLaMA3.1-8B, models trained on 15% of ICon-selected data
outperform those trained on the full datasets by 5.42 percentage points and
exceed the best performance of widely used selection methods by 2.06
percentage points. We further analyze
high-contribution samples selected by ICon, which show both diverse tasks and
appropriate difficulty levels, rather than just the hardest ones.
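The core measurement, a performance shift under implicit learning through ICL, can be illustrated as follows; `lm_loglikelihood` is a hypothetical scoring helper returning log p(target | prompt), and the prompt format is an assumption.

```python
# Conceptual sketch of contribution measurement via ICL: score a probe set
# with and without the candidate prepended as a demonstration, and treat
# the average shift as the sample's contribution.

def icl_contribution(candidate, probe_set, lm_loglikelihood):
    shift = 0.0
    for question, answer in probe_set:
        base = lm_loglikelihood(prompt=question, target=answer)
        with_demo = lm_loglikelihood(
            prompt=f"{candidate['q']}\n{candidate['a']}\n\n{question}",
            target=answer,
        )
        shift += with_demo - base  # implicit-learning gain from this sample
    return shift / len(probe_set)
```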
☆ Scalable Chain of Thoughts via Elastic Reasoning
Large reasoning models (LRMs) have achieved remarkable progress on complex
tasks by generating extended chains of thought (CoT). However, their
uncontrolled output lengths pose significant challenges for real-world
deployment, where inference-time budgets on tokens, latency, or compute are
strictly constrained. We propose Elastic Reasoning, a novel framework for
scalable chain of thoughts that explicitly separates reasoning into two
phases, thinking and solution, with independently allocated budgets. At test
time, Elastic Reasoning prioritizes the completeness of the solution segment,
significantly improving reliability under tight resource constraints. To train
models that are robust to truncated thinking, we introduce a lightweight
budget-constrained rollout strategy, integrated into GRPO, which teaches the
model to reason adaptively when the thinking process is cut short and
generalizes effectively to unseen budget constraints without additional
training. Empirical results on mathematical (AIME, MATH500) and programming
(LiveCodeBench, Codeforces) benchmarks demonstrate that Elastic Reasoning
performs robustly under strict budget constraints, while incurring
significantly lower training cost than baseline methods. Remarkably, our
approach also produces more concise and efficient reasoning even in
unconstrained settings. Elastic Reasoning offers a principled and practical
solution to the pressing challenge of controllable reasoning at scale.
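The two-phase, separately budgeted decoding can be sketched as below, assuming an OpenAI-style `generate(prompt, max_tokens, stop)` helper; the tag markers and budget split are illustrative, not the paper's exact recipe.

```python
# Sketch of separately budgeted two-phase decoding with an assumed
# `generate(prompt, max_tokens, stop)` helper.

def elastic_generate(generate, problem, think_budget=512, solution_budget=256):
    # Phase 1: thinking, hard-capped at its own budget.
    thinking = generate(
        prompt=problem + "\n<think>",
        max_tokens=think_budget,
        stop=["</think>"],
    )
    # Phase 2: the solution always receives its full budget, even when the
    # thinking phase was truncated mid-stream.
    return generate(
        prompt=problem + "\n<think>" + thinking + "</think>\n<solution>",
        max_tokens=solution_budget,
        stop=["</solution>"],
    )
```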
☆ Toward Reasonable Parrots: Why Large Language Models Should Argue with Us by Design
Elena Musi, Nadin Kokciyan, Khalid Al-Khatib, Davide Ceolin, Emmanuelle Dietz, Klara Gutekunst, Annette Hautli-Janisz, Cristian Manuel Santibañez Yañez, Jodi Schneider, Jonas Scholz, Cor Steging, Jacky Visser, Henning Wachsmuth
In this position paper, we advocate for the development of conversational
technology that is inherently designed to support and facilitate argumentative
processes. We argue that, at present, large language models (LLMs) are
inadequate for this purpose, and we propose an ideal technology design aimed at
enhancing argumentative skills. This involves re-framing LLMs as tools for
exercising our critical thinking rather than as replacements for it. We introduce the
concept of 'reasonable parrots' that embody the fundamental principles of
relevance, responsibility, and freedom, and that interact through argumentative
dialogical moves. These principles and moves arise out of millennia of work in
argumentation theory and should serve as the starting point for LLM-based
technology that incorporates basic principles of argumentation.
☆ T-T: Table Transformer for Tagging-based Aspect Sentiment Triplet Extraction IJCAI2025
Kun Peng, Chaodong Tong, Cong Cao, Hao Peng, Qian Li, Guanlin Wu, Lei Jiang, Yanbing Liu, Philip S. Yu
Aspect sentiment triplet extraction (ASTE) aims to extract triplets composed
of aspect terms, opinion terms, and sentiment polarities from given sentences.
The table tagging method is a popular approach to addressing this task, which
encodes a sentence into a 2-dimensional table, allowing for the tagging of
relations between any two words. Previous efforts have focused on designing
various downstream relation learning modules to better capture interactions
between tokens in the table, revealing that a stronger capability to capture
relations can lead to greater improvements in the model. Motivated by this, we
attempt to directly utilize transformer layers as downstream relation learning
modules. Given the powerful semantic modeling capability of transformers, this
can be expected to yield substantial improvements. However, owing to
the quadratic relation between the length of the table and the length of the
input sentence sequence, using transformers directly faces two challenges:
overly long table sequences and unfair local attention interaction. To address
these challenges, we propose a novel Table-Transformer (T-T) for the
tagging-based ASTE method. Specifically, we introduce a stripe attention
mechanism with a loop-shift strategy to tackle these challenges. The former
modifies the global attention mechanism to only attend to a 2-dimensional local
attention window, while the latter facilitates interaction between different
attention windows. Extensive and comprehensive experiments demonstrate that the
T-T, as a downstream relation learning module, achieves state-of-the-art
performance with lower computational costs.
comment: Accepted by IJCAI2025
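One way to picture stripe attention with a loop-shift is as a banded mask over the flattened table that is cyclically rotated between layers; the band shape and shift schedule below are assumptions for illustration, not the paper's exact design.

```python
# Illustrative banded mask over the flattened n x n tag table: each cell
# attends only within a band of `stripe` rows, and the band assignment is
# cyclically shifted between layers so neighbouring windows interact.
import torch

def stripe_mask(n, stripe=4, shift=0):
    rows = (torch.arange(n) + shift) % n       # loop-shifted row indices
    band = rows // stripe                      # stripe id of every table row
    cell_band = band.repeat_interleave(n)      # stripe id per flattened cell
    return cell_band[:, None] == cell_band[None, :]   # (n*n, n*n) bool mask

mask_layer0 = stripe_mask(n=16, stripe=4, shift=0)
mask_layer1 = stripe_mask(n=16, stripe=4, shift=2)    # shifted in next layer
```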
☆ QualBench: Benchmarking Chinese LLMs with Localized Professional Qualifications for Vertical Domain Evaluation
The rapid advancement of Chinese large language models (LLMs) underscores the
need for domain-specific evaluations to ensure reliable applications. However,
existing benchmarks often lack coverage in vertical domains and offer limited
insights into the Chinese working context. Leveraging qualification exams as a
unified framework for human expertise evaluation, we introduce QualBench, the
first multi-domain Chinese QA benchmark dedicated to localized assessment of
Chinese LLMs. The dataset includes over 17,000 questions across six vertical
domains, with data selections grounded in 24 Chinese qualifications to closely
align with national policies and working standards. Through comprehensive
evaluation, the Qwen2.5 model outperformed the more advanced GPT-4o, with
Chinese LLMs consistently surpassing non-Chinese models, highlighting the
importance of localized domain knowledge in meeting qualification requirements.
The best performance of 75.26% reveals the current gaps in domain coverage
within model capabilities. Furthermore, we present the failure of LLM
collaboration with crowdsourcing mechanisms and suggest the opportunities for
multi-domain RAG knowledge enhancement and vertical domain LLM training with
Federated Learning.
☆ Revealing Weaknesses in Text Watermarking Through Self-Information Rewrite Attacks ICML 2025
Text watermarking aims to subtly embed statistical signals into text by
controlling the sampling process of a large language model (LLM), enabling
watermark detectors to verify that the output was generated by the specified
model. The robustness of these watermarking algorithms has become a key factor
in evaluating their effectiveness. Current text watermarking algorithms embed
watermarks in high-entropy tokens to ensure text quality. In this paper, we
reveal that this seemingly benign design can be exploited by attackers, posing
a significant risk to the robustness of the watermark. We introduce a generic
efficient paraphrasing attack, the Self-Information Rewrite Attack (SIRA),
which exploits this vulnerability by calculating the self-information of each
token to identify potential pattern tokens and perform targeted attacks. Our
work exposes a widely prevalent vulnerability in current watermarking
algorithms. The experimental results show SIRA achieves nearly 100% attack
success rates on seven recent watermarking methods at a cost of only 0.88 USD
per million tokens. Our approach does not require any access to the watermark
algorithms or the watermarked LLM and can seamlessly transfer to any LLM as the
attack model, even mobile-level models. Our findings highlight the urgent need
for more robust watermarking.
comment: Accepted by ICML 2025
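The self-information computation at the core of the attack is standard: score each token with -log p(token | prefix) under a small reference model (GPT-2 here, as an assumption) so that high-scoring, high-entropy tokens can be targeted for rewriting.

```python
# Per-token self-information under a small reference LM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def token_self_information(text):
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = lm(ids).logits
    logp = torch.log_softmax(logits[0, :-1], dim=-1)   # predicts next token
    targets = ids[0, 1:]
    info = -logp[torch.arange(len(targets)), targets]  # -log p(t | prefix)
    return list(zip(tok.convert_ids_to_tokens(targets.tolist()), info.tolist()))
```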
☆ A Benchmark Dataset and a Framework for Urdu Multimodal Named Entity Recognition
The emergence of multimodal content, particularly text and images on social
media, has positioned Multimodal Named Entity Recognition (MNER) as an
increasingly important area of research within Natural Language Processing.
Despite progress in high-resource languages such as English, MNER remains
underexplored for low-resource languages like Urdu. The primary challenges
include the scarcity of annotated multimodal datasets and the lack of
standardized baselines. To address these challenges, we introduce the U-MNER
framework and release the Twitter2015-Urdu dataset, a pioneering resource for
Urdu MNER. Adapted from the widely used Twitter2015 dataset, it is annotated
with Urdu-specific grammar rules. We establish benchmark baselines by
evaluating both text-based and multimodal models on this dataset, providing
comparative analyses to support future research on Urdu MNER. The U-MNER
framework integrates textual and visual context using Urdu-BERT for text
embeddings and ResNet for visual feature extraction, with a Cross-Modal Fusion
Module to align and fuse information. Our model achieves state-of-the-art
performance on the Twitter2015-Urdu dataset, laying the groundwork for further
MNER research in low-resource languages.
comment: 16 pages, 5 figures. Preprint
☆ Understanding In-context Learning of Addition via Activation Subspaces
To perform in-context learning, language models must extract signals from
individual few-shot examples, aggregate these into a learned prediction rule,
and then apply this rule to new examples. How is this implemented in the
forward pass of modern transformer models? To study this, we consider a
structured family of few-shot learning tasks for which the true prediction rule
is to add an integer $k$ to the input. We find that Llama-3-8B attains high
accuracy on this task for a range of $k$, and localize its few-shot ability to
just three attention heads via a novel optimization approach. We further show
the extracted signals lie in a six-dimensional subspace, where four of the
dimensions track the unit digit and the other two dimensions track overall
magnitude. We finally examine how these heads extract information from
individual few-shot examples, identifying a self-correction mechanism in which
mistakes from earlier examples are suppressed by later examples. Our results
demonstrate how tracking low-dimensional subspaces across a forward pass can
provide insight into fine-grained computational structures.
comment: 16 pages
☆ Unveiling Language-Specific Features in Large Language Models via Sparse Autoencoders
The mechanisms behind multilingual capabilities in Large Language Models
(LLMs) have been examined using neuron-based or internal-activation-based
methods. However, these methods often face challenges such as superposition and
layer-wise activation variance, which limit their reliability. Sparse
Autoencoders (SAEs) offer a more nuanced analysis by decomposing the
activations of LLMs into sparse linear combinations of SAE features. We
introduce a novel metric to assess the monolinguality of features obtained from
SAEs, discovering that some features are strongly related to specific
languages. Additionally, we show that ablating these SAE features significantly
reduces an LLM's abilities in only one language, leaving others almost
unaffected. Interestingly, we find that some languages have multiple synergistic
SAE features, and ablating them together yields a greater effect than ablating
them individually. Moreover, we leverage these SAE-derived language-specific
features to enhance steering vectors, achieving control over the language
generated by LLMs.
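Feature ablation with an SAE reduces to zeroing one coordinate of the sparse code before reconstruction, roughly as follows; a generic ReLU SAE is assumed, and the paper's exact architecture and hook points may differ.

```python
# Ablating a language-specific SAE feature: encode the activation into the
# sparse code, zero the target coordinate, and decode.
import torch

def ablate_feature(h, W_enc, b_enc, W_dec, b_dec, feature_idx):
    # h: (d_model,) residual-stream activation at some layer
    f = torch.relu(W_enc @ h + b_enc)   # sparse feature activations
    f[feature_idx] = 0.0                # knock out the language feature
    return W_dec @ f + b_dec            # reconstructed, feature-ablated h
```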
☆ X-Driver: Explainable Autonomous Driving with Vision-Language Models
End-to-end autonomous driving has advanced significantly, offering benefits
such as system simplicity and stronger driving performance than conventional
pipelines in both open-loop and closed-loop settings. However, existing
frameworks still suffer from low success rates in closed-loop evaluations,
highlighting their limitations in real-world deployment. In this paper, we
introduce X-Driver, a unified multi-modal large language model (MLLM)
framework designed for closed-loop autonomous driving, leveraging
Chain-of-Thought (CoT) reasoning and autoregressive modeling to enhance
perception and decision-making. We validate X-Driver across multiple
autonomous driving tasks using public benchmarks in the CARLA simulation
environment, including Bench2Drive [6]. Our experimental results demonstrate
superior closed-loop performance, surpassing the current state-of-the-art
(SOTA) while improving the
interpretability of driving decisions. These findings underscore the importance
of structured reasoning in end-to-end driving and establish X-Driver as a
strong baseline for future research in closed-loop autonomous driving.
☆ Reliably Bounding False Positives: A Zero-Shot Machine-Generated Text Detection Framework via Multiscaled Conformal Prediction
The rapid advancement of large language models has raised significant
concerns regarding their potential misuse by malicious actors. As a result,
developing effective detectors to mitigate these risks has become a critical
priority. However, most existing detection methods focus excessively on
detection accuracy, often neglecting the societal risks posed by high false
positive rates (FPRs). This paper addresses this issue by leveraging Conformal
Prediction (CP), which provides a statistical upper bound on the FPR. While
directly applying CP constrains the FPR, it also leads to a significant reduction
in detection performance. To overcome this trade-off, this paper proposes a
Zero-Shot Machine-Generated Text Detection Framework via Multiscaled Conformal
Prediction (MCP), which both enforces the FPR constraint and improves detection
performance. This paper also introduces RealDet, a high-quality dataset that
spans a wide range of domains, ensuring realistic calibration and enabling
superior detection performance when combined with MCP. Empirical evaluations
demonstrate that MCP effectively constrains FPRs, significantly enhances
detection performance, and increases robustness against adversarial attacks
across multiple detectors and datasets.
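The vanilla CP step that the framework builds on can be sketched in a few lines: calibrate a detection threshold on human-written texts so the FPR is bounded by alpha, up to the usual finite-sample correction; `score` can be any detector statistic that is larger for machine-generated text.

```python
# Conformal calibration of a detection threshold with a bounded FPR.
import numpy as np

def calibrate_threshold(human_scores, alpha=0.01):
    n = len(human_scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))        # conformal quantile rank
    return np.sort(np.asarray(human_scores))[min(k, n) - 1]

def flag_machine_generated(score, threshold):
    return score > threshold                       # FPR on human text <= alpha

tau = calibrate_threshold(np.random.randn(1000), alpha=0.05)  # toy scores
```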
☆ Performance Evaluation of Large Language Models in Bangla Consumer Health Query Summarization
Consumer Health Queries (CHQs) in Bengali (Bangla), a low-resource language,
often contain extraneous details, complicating efficient medical responses.
This study investigates the zero-shot performance of nine advanced large
language models (LLMs): GPT-3.5-Turbo, GPT-4, Claude-3.5-Sonnet,
Llama3-70b-Instruct, Mixtral-8x22b-Instruct, Gemini-1.5-Pro,
Qwen2-72b-Instruct, Gemma-2-27b, and Athene-70B, in summarizing Bangla CHQs.
Using the BanglaCHQ-Summ dataset comprising 2,350 annotated query-summary
pairs, we benchmarked these LLMs using ROUGE metrics against Bangla T5, a
fine-tuned state-of-the-art model. Mixtral-8x22b-Instruct emerged as the
top-performing model on ROUGE-1 and ROUGE-L, while Bangla T5 excelled on ROUGE-2.
The results demonstrate that zero-shot LLMs can rival fine-tuned models,
achieving high-quality summaries even without task-specific training. This work
underscores the potential of LLMs in addressing challenges in low-resource
languages, providing scalable solutions for healthcare query summarization.
☆ CodeMixBench: Evaluating Large Language Models on Code Generation with Code-Mixed Prompts
Large Language Models (LLMs) have achieved remarkable success in code
generation tasks, powering various applications like code completion,
debugging, and programming assistance. However, existing benchmarks such as
HumanEval, MBPP, and BigCodeBench primarily evaluate LLMs on English-only
prompts, overlooking the real-world scenario where multilingual developers
often use code-mixed language while interacting with LLMs. To address this gap,
we introduce CodeMixBench, a novel benchmark designed to evaluate the
robustness of LLMs on code generation from code-mixed prompts. Built upon
BigCodeBench, CodeMixBench introduces controlled degrees of code-mixing (CMD) into the
natural language parts of prompts across three language pairs: Hinglish
(Hindi-English), Spanish-English, and Chinese Pinyin-English. We
comprehensively evaluate a diverse set of open-source code generation models
ranging from 1.5B to 15B parameters. Our results show that code-mixed prompts
consistently degrade Pass@1 performance compared to their English-only
counterparts, with performance drops increasing under higher CMD levels for
smaller models. CodeMixBench provides a realistic evaluation framework for
studying multilingual code generation and highlights new challenges and
directions for building robust code generation models that generalize well
across diverse linguistic settings.
☆ Teochew-Wild: The First In-the-wild Teochew Dataset with Orthographic Annotations
This paper reports the construction of Teochew-Wild, a speech corpus of
the Teochew dialect. The corpus includes 18.9 hours of in-the-wild Teochew
speech data from multiple speakers, covering both formal and colloquial
expressions, with precise orthographic and pinyin annotations. Additionally, we
provide supplementary text processing tools and resources to propel research
and applications in speech tasks for this low-resource language, such as
automatic speech recognition (ASR) and text-to-speech (TTS). To the best of our
knowledge, this is the first publicly available Teochew dataset with accurate
orthographic annotations. We conduct experiments on the corpus, and the results
validate its effectiveness in ASR and TTS tasks.
☆ Image-Text Relation Prediction for Multilingual Tweets
Various social networks have allowed media uploads for over a decade now.
Still, it has not always been clear how uploaded media relate to the posted
text, or whether there is any relation at all. In this work, we explore how multilingual
vision-language models tackle the task of image-text relation prediction in
different languages, and construct a dedicated balanced benchmark data set from
Twitter posts in Latvian along with their manual translations into English. We
compare our results to previous work and show that the more recently released
vision-language model checkpoints are becoming increasingly capable at this
task, but there is still much room for further improvement.
☆ G-FOCUS: Towards a Robust Method for Assessing UI Design Persuasiveness
Evaluating user interface (UI) design effectiveness extends beyond aesthetics
to influencing user behavior, a principle central to Design Persuasiveness. A/B
testing is the predominant method for determining which UI variations drive
higher user engagement, but it is costly and time-consuming. While recent
Vision-Language Models (VLMs) enable automated UI analysis, current
approaches focus on isolated design attributes rather than comparative
persuasiveness, the key factor in optimizing user interactions. To address this,
we introduce WiserUI-Bench, a benchmark designed for the pairwise UI design
persuasiveness assessment task, featuring 300 real-world UI image pairs labeled
with A/B test results and expert rationales. Additionally, we propose G-FOCUS,
a novel inference-time reasoning strategy that enhances VLM-based
persuasiveness assessment by reducing position bias and improving evaluation
accuracy. Experimental results show that G-FOCUS surpasses existing inference
strategies in consistency and accuracy for pairwise UI evaluation. Through
promoting VLM-driven evaluation of UI persuasiveness, our work offers an
approach to complement A/B testing, propelling progress in scalable UI
preference modeling and design optimization. Code and data will be released
publicly.
comment: 31 pages, 17 figures
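For contrast, the simplest mitigation for position bias in pairwise judging is to query both orderings and keep only consistent verdicts, as sketched below; G-FOCUS itself is a more elaborate inference-time strategy not reproduced here.

```python
# Order-swapped pairwise judging: a baseline position-bias mitigation.

def debiased_pairwise_judge(judge, ui_a, ui_b):
    first = judge(left=ui_a, right=ui_b)    # judge returns "left" or "right"
    second = judge(left=ui_b, right=ui_a)
    if (first == "left") == (second == "right"):
        return "A" if first == "left" else "B"
    return "inconsistent"                   # the verdict flipped with position
```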
☆ Scalable Multi-Stage Influence Function for Large Language Models via Eigenvalue-Corrected Kronecker-Factored Parameterization IJCAI 2025
Pre-trained large language models (LLMs) are commonly fine-tuned to adapt to
downstream tasks. Since the majority of knowledge is acquired during
pre-training, attributing the predictions of fine-tuned LLMs to their
pre-training data may provide valuable insights. Influence functions have been
proposed as a means to explain model predictions based on training data.
However, existing approaches fail to compute "multi-stage" influence and lack
scalability to billion-scale LLMs.
In this paper, we propose the multi-stage influence function to attribute the
downstream predictions of fine-tuned LLMs to pre-training data under the
full-parameter fine-tuning paradigm. To enhance the efficiency and practicality
of our multi-stage influence function, we leverage Eigenvalue-corrected
Kronecker-Factored (EK-FAC) parameterization for efficient approximation.
Empirical results validate the superior scalability of EK-FAC approximation and
the effectiveness of our multi-stage influence function. Additionally, case
studies on a real-world LLM, dolly-v2-3b, demonstrate its interpretive power,
with exemplars illustrating insights provided by multi-stage influence
estimates. Our code is public at
https://github.com/colored-dye/multi_stage_influence_function.
comment: 9 pages, accepted by IJCAI 2025
☆ The Pitfalls of Growing Group Complexity: LLMs and Social Choice-Based Aggregation for Group Recommendations
Large Language Models (LLMs) are increasingly applied in recommender systems
aimed at both individuals and groups. Previously, Group Recommender Systems
(GRS) often used social choice-based aggregation strategies to derive a single
recommendation based on the preferences of multiple people. In this paper, we
investigate under which conditions language models can perform these strategies
correctly based on zero-shot learning and analyse whether the formatting of the
group scenario in the prompt affects accuracy. We specifically focused on the
impact of group complexity (number of users and items), different LLMs,
different prompting conditions, including In-Context learning or generating
explanations, and the formatting of group preferences. Our results show that
performance starts to deteriorate when considering more than 100 ratings.
However, not all language models were equally sensitive to growing group
complexity. Additionally, we showed that In-Context Learning (ICL) can
significantly increase the performance at higher degrees of group complexity,
while other prompt modifications, such as specifying domain cues or prompting
for explanations, did not impact accuracy. We conclude that future research
should include group complexity as a factor in GRS evaluation due to its effect
on LLM performance. Furthermore, we showed that formatting the group scenarios
differently, such as rating lists per user or per item, affected accuracy. All
in all, our study implies that smaller LLMs are capable of generating group
recommendations under the right conditions, making the case for using smaller
models that require less computing power and costs.
comment: To be published in: Adjunct Proceedings of the 33rd ACM Conference on
User Modeling, Adaptation and Personalization (UMAP Adjunct '25), June
16--19, 2025, New York City, NY, USA. Accepted at the 4th Workshop on Group
Modeling, Adaptation and Personalization (GMAP), co-located at UMAP 2025
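Two classic social choice strategies of the kind the LLMs are asked to reproduce, additive aggregation and least misery, are a few lines each over ratings of the form {user: {item: score}}.

```python
# Two standard social choice-based aggregation strategies for groups.

def additive(ratings):
    totals = {}
    for user_ratings in ratings.values():
        for item, score in user_ratings.items():
            totals[item] = totals.get(item, 0) + score
    return max(totals, key=totals.get)          # highest summed score wins

def least_misery(ratings):
    minima = {}
    for user_ratings in ratings.values():
        for item, score in user_ratings.items():
            minima[item] = min(minima.get(item, score), score)
    return max(minima, key=minima.get)          # best worst-case rating wins

group = {"u1": {"a": 5, "b": 3}, "u2": {"a": 1, "b": 4}}
print(additive(group), least_misery(group))     # -> b b
```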
☆ Rethinking Invariance in In-context Learning
In-Context Learning (ICL) has emerged as a pivotal capability of
auto-regressive large language models, yet it is hindered by a notable
sensitivity to the ordering of context examples regardless of their mutual
independence. To address this issue, recent studies have introduced several
variant algorithms of ICL that achieve permutation invariance. However, many of
these do not match the performance of the standard auto-regressive
ICL algorithm. In this work, we identify two crucial elements in the design of
an invariant ICL algorithm: information non-leakage and context
interdependence, which are not simultaneously achieved by any of the existing
methods. These investigations lead us to the proposed Invariant ICL (InvICL), a
methodology designed to achieve invariance in ICL while ensuring the two
properties. Empirically, our findings reveal that InvICL surpasses previous
models, both invariant and non-invariant, in most benchmark datasets,
showcasing superior generalization capabilities across varying input lengths.
Code is available at https://github.com/PKU-ML/InvICL.
☆ Latent Preference Coding: Aligning Large Language Models via Discrete Latent Codes
Large language models (LLMs) have achieved remarkable success, yet aligning
their generations with human preferences remains a critical challenge. Existing
approaches to preference modeling often rely on an explicit or implicit reward
function, overlooking the intricate and multifaceted nature of human
preferences that may encompass conflicting factors across diverse tasks and
populations. To address this limitation, we introduce Latent Preference Coding
(LPC), a novel framework that models the implicit factors as well as their
combinations behind holistic preferences using discrete latent codes. LPC
seamlessly integrates with various offline alignment algorithms, automatically
inferring the underlying factors and their importance from data without relying
on pre-defined reward functions and hand-crafted combination weights. Extensive
experiments on multiple benchmarks demonstrate that LPC consistently improves
upon three alignment algorithms (DPO, SimPO, and IPO) using three base models
(Mistral-7B, Llama3-8B, and Llama3-8B-Instruct). Furthermore, deeper analysis
reveals that the learned latent codes effectively capture the differences in
the distribution of human preferences and significantly enhance the robustness
of alignment against noise in data. By providing a unified representation for
the multifarious preference factors, LPC paves the way towards developing more
robust and versatile alignment techniques for the responsible deployment of
powerful LLMs.
☆ Rethinking the Relationship between the Power Law and Hierarchical Structures
Statistical analysis of corpora provides an approach to quantitatively
investigate natural languages. This approach has revealed that several power
laws consistently emerge across different corpora and languages, suggesting
universal principles underlying language. In particular, the power-law decay of
correlation has been interpreted as evidence for underlying hierarchical
structures in syntax, semantics, and discourse. This perspective has also been
extended to child languages and animal signals. However, the argument
supporting this interpretation has not been empirically tested. To address this
problem, this study examines the validity of the argument for syntactic
structures. Specifically, we test whether the statistical properties of parse
trees align with the implicit assumptions in the argument. Using English
corpora, we analyze the mutual information, deviations from probabilistic
context-free grammars (PCFGs), and other properties in parse trees, as well as
in the PCFG that approximates these trees. Our results indicate that the
assumptions do not hold for syntactic structures and that it is difficult to
apply the proposed argument to child languages and animal signals, highlighting
the need to reconsider the relationship between the power law and hierarchical
structures.
comment: 13 pages, 11 figures
☆ General Transform: A Unified Framework for Adaptive Transform to Enhance Representations
Discrete transforms, such as the discrete Fourier transform, are widely used
in machine learning to improve model performance by extracting meaningful
features. However, with numerous transforms available, selecting an appropriate
one often depends on understanding the dataset's properties, making the
approach less effective when such knowledge is unavailable. In this work, we
propose General Transform (GT), an adaptive transform-based representation
designed for machine learning applications. Unlike conventional transforms, GT
learns a data-driven mapping tailored to the dataset and task of interest. Here,
we demonstrate that models incorporating GT outperform conventional
transform-based approaches across computer vision and natural language
processing tasks, highlighting its effectiveness in diverse learning scenarios.
☆ Chain-of-Thought Tokens are Computer Program Variables
Chain-of-thought (CoT) prompting requires large language models (LLMs) to
generate intermediate steps before reaching the final answer, and has proven
effective in helping LLMs solve complex reasoning tasks. However, the inner
mechanism of CoT remains largely unclear. In this paper, we empirically
study the role of CoT tokens in LLMs on two compositional tasks: multi-digit
multiplication and dynamic programming. While CoT is essential for solving
these problems, we find that preserving only tokens that store intermediate
results would achieve comparable performance. Furthermore, we observe that
storing intermediate results in an alternative latent form will not affect
model performance. We also randomly intervene on some values in the CoT and notice
that subsequent CoT tokens and the final answer would change correspondingly.
These findings suggest that CoT tokens may function like variables in computer
programs but with potential drawbacks like unintended shortcuts and
computational complexity limits between tokens. The code and data are available
at https://github.com/solitaryzero/CoTs_are_Variables.
☆ Prompt-Based LLMs for Position Bias-Aware Reranking in Personalized Recommendations
Recommender systems are essential for delivering personalized content across
digital platforms by modeling user preferences and behaviors. Recently, large
language models (LLMs) have been adopted for prompt-based recommendation due to
their ability to generate personalized outputs without task-specific training.
However, LLM-based methods face limitations such as limited context window
size, inefficient pointwise and pairwise prompting, and difficulty handling
listwise ranking due to token constraints. LLMs can also be sensitive to
position bias, as they may overemphasize earlier items in the prompt regardless
of their true relevance. To address and investigate these issues, we propose a
hybrid framework that combines a traditional recommendation model with an LLM
for reranking top-k items using structured prompts. We evaluate the effects of
user history reordering and instructional prompts for mitigating position bias.
Experiments on MovieLens-100K show that randomizing user history improves
ranking quality, but LLM-based reranking does not outperform the base model.
Explicit instructions to reduce position bias are also ineffective. Our
evaluations reveal limitations in LLMs' ability to model ranking context and
mitigate bias. Our code is publicly available at
https://github.com/aminul7506/LLMForReRanking.
☆ T2VTextBench: A Human Evaluation Benchmark for Textual Control in Video Generation Models
Thanks to recent advancements in scalable deep architectures and large-scale
pretraining, text-to-video generation has achieved unprecedented capabilities
in producing high-fidelity, instruction-following content across a wide range
of styles, enabling applications in advertising, entertainment, and education.
However, these models' ability to render precise on-screen text, such as
captions or mathematical formulas, remains largely untested, posing significant
challenges for applications requiring exact textual accuracy. In this work, we
introduce T2VTextBench, the first human-evaluation benchmark dedicated to
evaluating on-screen text fidelity and temporal consistency in text-to-video
models. Our suite of prompts integrates complex text strings with dynamic scene
changes, testing each model's ability to maintain detailed instructions across
frames. We evaluate ten state-of-the-art systems, ranging from open-source
solutions to commercial offerings, and find that most struggle to generate
legible, consistent text. These results highlight a critical gap in current
video generators and provide a clear direction for future research aimed at
enhancing textual manipulation in video synthesis.
☆ Perception, Reason, Think, and Plan: A Survey on Large Multimodal Reasoning Models
Yunxin Li, Zhenyu Liu, Zitao Li, Xuanyu Zhang, Zhenran Xu, Xinyu Chen, Haoyuan Shi, Shenyuan Jiang, Xintong Wang, Jifang Wang, Shouzheng Huang, Xinping Zhao, Borui Jiang, Lanqing Hong, Longyue Wang, Zhuotao Tian, Baoxing Huai, Wenhan Luo, Weihua Luo, Zheng Zhang, Baotian Hu, Min Zhang
Reasoning lies at the heart of intelligence, shaping the ability to make
decisions, draw conclusions, and generalize across domains. In artificial
intelligence, as systems increasingly operate in open, uncertain, and
multimodal environments, reasoning becomes essential for enabling robust and
adaptive behavior. Large Multimodal Reasoning Models (LMRMs) have emerged as a
promising paradigm, integrating modalities such as text, images, audio, and
video to support complex reasoning capabilities and aiming to achieve
comprehensive perception, precise understanding, and deep reasoning. As
research advances, multimodal reasoning has rapidly evolved from modular,
perception-driven pipelines to unified, language-centric frameworks that offer
more coherent cross-modal understanding. While instruction tuning and
reinforcement learning have improved model reasoning, significant challenges
remain in omni-modal generalization, reasoning depth, and agentic behavior. To
address these issues, we present a comprehensive and structured survey of
multimodal reasoning research, organized around a four-stage developmental
roadmap that reflects the field's shifting design philosophies and emerging
capabilities. First, we review early efforts based on task-specific modules,
where reasoning was implicitly embedded across stages of representation,
alignment, and fusion. Next, we examine recent approaches that unify reasoning
into multimodal LLMs, with advances such as Multimodal Chain-of-Thought (MCoT)
and multimodal reinforcement learning enabling richer and more structured
reasoning chains. Finally, drawing on empirical insights from challenging
benchmarks and experimental cases of OpenAI O3 and O4-mini, we discuss the
conceptual direction of native large multimodal reasoning models (N-LMRMs),
which aim to support scalable, agentic, and adaptive reasoning and planning in
complex, real-world environments.
comment: 75 Pages,10 figures; Project:
https://github.com/HITsz-TMG/Awesome-Large-Multimodal-Reasoning-Models
☆ An Open-Source Dual-Loss Embedding Model for Semantic Retrieval in Higher Education
Recent advances in AI have catalyzed the adoption of intelligent educational
tools, yet many semantic retrieval systems remain ill-suited to the unique
linguistic and structural characteristics of academic content. This study
presents two open-source embedding models fine-tuned for educational question
answering, particularly in the context of course syllabi. A synthetic dataset
of 3,197 sentence pairs, spanning synonymous terminology, paraphrased
questions, and implicit-explicit mappings, was constructed through a
combination of manual curation and large language model (LLM)-assisted
generation. Two training strategies were evaluated: (1) a baseline model
fine-tuned using MultipleNegativesRankingLoss (MNRL), and (2) a dual-loss model
that combines MNRL with CosineSimilarityLoss to improve both semantic ranking
and similarity calibration. Evaluations were conducted on 28 university course
syllabi using a fixed set of natural language questions categorized into
course, faculty, and teaching assistant information. Results demonstrate that
both fine-tuned models outperform strong open-source baselines, including
all-MiniLM-L6-v2 and multi-qa-MiniLM-L6-cos-v1, and that the dual-loss model
narrows the performance gap with high-performing proprietary embeddings such as
OpenAI's text-embedding-3 series. This work contributes reusable,
domain-aligned embedding models and provides a replicable framework for
educational semantic retrieval, supporting downstream applications such as
academic chatbots, retrieval-augmented generation (RAG) systems, and learning
management system (LMS) integrations.
comment: 17 pages, 3 Tables
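The dual-loss setup maps directly onto the sentence-transformers multi-objective training API; the base checkpoint, example pairs, and label below are placeholders, not the paper's data.

```python
# Joint MNRL + CosineSimilarityLoss training with sentence-transformers.
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

model = SentenceTransformer("all-MiniLM-L6-v2")

mnrl_examples = [InputExample(texts=["When are office hours?",
                                     "Office hours are Tuesdays 2-4 pm."])]
cosine_examples = [InputExample(texts=["Who is the TA?",
                                       "Teaching assistant: J. Doe"],
                                label=0.9)]

model.fit(
    train_objectives=[
        (DataLoader(mnrl_examples, shuffle=True, batch_size=16),
         losses.MultipleNegativesRankingLoss(model)),
        (DataLoader(cosine_examples, shuffle=True, batch_size=16),
         losses.CosineSimilarityLoss(model)),
    ],
    epochs=1,
)
```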
☆ Enigme: Generative Text Puzzles for Evaluating Reasoning in Language Models
Transformer-decoder language models are a core innovation in text-based
generative artificial intelligence. These models are being deployed as
general-purpose intelligence systems in many applications. Central to their
utility is the capacity to understand natural language commands and exploit the
reasoning embedded in human text corpora to apply some form of reasoning
process to a wide variety of novel tasks. To understand the limitations of this
approach to generating reasoning we argue that we need to consider the
architectural constraints of these systems. Consideration of the latent
variable structure of transformer-decoder models allows us to design reasoning
tasks that should probe the boundary of their capacity to reason. We present
enigme, an open-source library for generating text-based puzzles to be used in
training and evaluating reasoning skills within transformer-decoder models and
future AI architectures.
comment: To be published in the proceedings of The 2025 11th International
Conference on Engineering, Applied Sciences, and Technology (ICEAST)
☆ SpatialPrompting: Keyframe-driven Zero-Shot Spatial Reasoning with Off-the-Shelf Multimodal Large Language Models
This study introduces SpatialPrompting, a novel framework that harnesses the
emergent reasoning capabilities of off-the-shelf multimodal large language
models to achieve zero-shot spatial reasoning in three-dimensional (3D)
environments. Unlike existing methods that rely on expensive 3D-specific
fine-tuning with specialized 3D inputs such as point clouds or voxel-based
features, SpatialPrompting employs a keyframe-driven prompt generation
strategy. This framework uses metrics such as vision-language similarity,
Mahalanobis distance, field of view, and image sharpness to select a diverse
and informative set of keyframes from image sequences and then integrates them
with corresponding camera pose data to effectively abstract spatial
relationships and infer complex 3D structures. The proposed framework not only
establishes a new paradigm for flexible spatial reasoning that utilizes
intuitive visual and positional cues but also achieves state-of-the-art
zero-shot performance on benchmark datasets, such as ScanQA and SQA3D, across
several metrics. The proposed method effectively eliminates the need for
specialized 3D inputs and fine-tuning, offering a simpler and more scalable
alternative to conventional approaches.
comment: 18 pages, 11 figures
☆ ConCISE: Confidence-guided Compression in Step-by-step Efficient Reasoning
Ziqing Qiao, Yongheng Deng, Jiali Zeng, Dong Wang, Lai Wei, Fandong Meng, Jie Zhou, Ju Ren, Yaoxue Zhang
Large Reasoning Models (LRMs) perform strongly in complex reasoning tasks via
Chain-of-Thought (CoT) prompting, but often suffer from verbose outputs caused
by redundant content, increasing computational overhead, and degrading user
experience. Existing compression methods either operate post-hoc pruning,
risking disruption to reasoning coherence, or rely on sampling-based selection,
which fails to intervene effectively during generation. In this work, we
introduce a confidence-guided perspective to explain the emergence of redundant
reflection in LRMs, identifying two key patterns: Confidence Deficit, where the
model reconsiders correct steps due to low internal confidence, and Termination
Delay, where reasoning continues even after reaching a confident answer. Based
on this analysis, we propose ConCISE (Confidence-guided Compression In
Step-by-step Efficient Reasoning), a framework that simplifies reasoning chains
by reinforcing the model's confidence during inference, thus preventing the
generation of redundant reflection steps. It integrates Confidence Injection to
stabilize intermediate steps and Early Stopping to terminate reasoning when
confidence is sufficient. Extensive experiments demonstrate that fine-tuning
LRMs on ConCISE-generated data yields significantly shorter outputs, reducing
length by up to approximately 50% under SimPO, while maintaining high task
accuracy. ConCISE consistently outperforms existing baselines across multiple
reasoning benchmarks.
♻ ☆ Correctness Coverage Evaluation for Medical Multiple-Choice Question Answering Based on the Enhanced Conformal Prediction Framework
Large language models (LLMs) are increasingly adopted in medical
question-answering (QA) scenarios. However, LLMs can generate hallucinations
and nonfactual information, undermining their trustworthiness in high-stakes
medical tasks. Conformal Prediction (CP) provides a statistically rigorous
framework for marginal (average) coverage guarantees but has limited
exploration in medical QA. This paper proposes an enhanced CP framework for
medical multiple-choice question-answering (MCQA) tasks. By associating the
non-conformance score with the frequency score of correct options and
leveraging self-consistency, the framework addresses internal model opacity and
incorporates a risk control strategy with a monotonic loss function. Evaluated
on MedMCQA, MedQA, and MMLU datasets using four off-the-shelf LLMs, the
proposed method meets specified error rate guarantees while reducing average
prediction set size with increased risk level, offering a promising uncertainty
evaluation metric for LLMs.
comment: Published by Mathematics
♻ ☆ DEGAP: Dual Event-Guided Adaptive Prefixes for Templated-Based Event Argument Extraction with Slot Querying
Recent advancements in event argument extraction (EAE) involve incorporating
useful auxiliary information into models during training and inference, such as
retrieved instances and event templates. These methods face two challenges: (1)
the retrieval results may be irrelevant and (2) templates are developed
independently for each event without considering their possible relationship.
In this work, we propose DEGAP to address these challenges through a simple yet
effective components: dual prefixes, i.e. learnable prompt vectors, where the
instance-oriented prefix and template-oriented prefix are trained to learn
information from different event instances and templates. Additionally, we
propose an event-guided adaptive gating mechanism, which can adaptively
leverage possible connections between different events and thus capture
relevant information from the prefix. Finally, these event-guided prefixes
provide relevant information as cues to EAE model without retrieval. Extensive
experiments demonstrate that our method achieves new state-of-the-art
performance on four datasets (ACE05, RAMS, WIKIEVENTS, and MLEE). Further
analysis shows the impact of different components.
comment: Published as a conference paper in COLING 2025
♻ ☆ TiC-LM: A Web-Scale Benchmark for Time-Continual LLM Pretraining
Jeffrey Li, Mohammadreza Armandpour, Iman Mirzadeh, Sachin Mehta, Vaishaal Shankar, Raviteja Vemulapalli, Samy Bengio, Oncel Tuzel, Mehrdad Farajtabar, Hadi Pouransari, Fartash Faghri
Large Language Models (LLMs) trained on historical web data inevitably become
outdated. We investigate evaluation strategies and update methods for LLMs as
new data becomes available. We introduce a web-scale dataset for time-continual
pretraining of LLMs derived from 114 dumps of Common Crawl (CC) - orders of
magnitude larger than previous continual language modeling benchmarks. We also
design time-stratified evaluations across both general CC data and specific
domains (Wikipedia, StackExchange, and code documentation) to assess how well
various continual learning methods adapt to new data while retaining past
knowledge. Our findings demonstrate that, on general CC data, autoregressive
meta-schedules combined with a fixed-ratio replay of older data can achieve
comparable held-out loss to re-training from scratch, while requiring
significantly less computation (2.6x). However, the optimal balance between
incorporating new data and replaying old data differs as replay is crucial to
avoid forgetting on generic web data but less so on specific domains.
comment: Code available at: https://github.com/apple/ml-tic-lm
♻ ☆ TCAN: Text-oriented Cross Attention Network for Multimodal Sentiment Analysis
Multimodal Sentiment Analysis (MSA) endeavors to understand human sentiment
by leveraging language, visual, and acoustic modalities. Despite the remarkable
performance exhibited by previous MSA approaches, the presence of inherent
multimodal heterogeneities poses a challenge, with the contribution of
different modalities varying considerably. Past research predominantly focused
on improving representation learning techniques and feature fusion strategies.
However, many of these efforts overlooked the variation in semantic richness
among different modalities, treating each modality uniformly. This approach may
lead to underestimating the significance of strong modalities while
overemphasizing the importance of weak ones. Motivated by these insights, we
introduce a Text-oriented Cross-Attention Network (TCAN), emphasizing the
predominant role of the text modality in MSA. Specifically, for each multimodal
sample, by taking unaligned sequences of the three modalities as inputs, we
initially allocate the extracted unimodal features into a visual-text and an
acoustic-text pair. Subsequently, we implement self-attention on the text
modality and apply text-queried cross-attention to the visual and acoustic
modalities. To mitigate the influence of noise signals and redundant features,
we incorporate a gated control mechanism into the framework. Additionally, we
introduce unimodal joint learning to gain a deeper understanding of homogeneous
emotional tendencies across diverse modalities through backpropagation.
Experimental results demonstrate that TCAN consistently outperforms
state-of-the-art MSA methods on two datasets (CMU-MOSI and CMU-MOSEI).
♻ ☆ SAPIENT: Mastering Multi-turn Conversational Recommendation with Strategic Planning and Monte Carlo Tree Search NAACL 2025
Conversational Recommender Systems (CRS) proactively engage users in
interactive dialogues to elicit user preferences and provide personalized
recommendations. Existing methods train Reinforcement Learning (RL)-based agent
with greedy action selection or sampling strategy, and may suffer from
suboptimal conversational planning. To address this, we present a novel Monte
Carlo Tree Search (MCTS)-based CRS framework SAPIENT. SAPIENT consists of a
conversational agent (S-agent) and a conversational planner (S-planner).
S-planner builds a conversational search tree with MCTS based on the initial
actions proposed by S-agent to find conversation plans. The best conversation
plans from S-planner are used to guide the training of S-agent, creating a
self-training loop where S-agent can iteratively improve its capability for
conversational planning. Furthermore, we propose an efficient variant SAPIENT
for trade-off between training efficiency and performance. Extensive
experiments on four benchmark datasets validate the effectiveness of our
approach, showing that SAPIENT outperforms the state-of-the-art baselines. Our
code and data are accessible through https://github.com/ninglab/SAPIENT.
comment: Accepted to NAACL 2025 Main Conference
♻ ☆ Applications of Artificial Intelligence for Cross-language Intelligibility Assessment of Dysarthric Speech
Purpose: Speech intelligibility is a critical outcome in the assessment and
management of dysarthria, yet most research and clinical practices have focused
on English, limiting their applicability across languages. This commentary
introduces a conceptual framework--and a demonstration of how it can be
implemented--leveraging artificial intelligence (AI) to advance cross-language
intelligibility assessment of dysarthric speech. Method: We propose a
two-tiered conceptual framework consisting of a universal speech model that
encodes dysarthric speech into acoustic-phonetic representations, followed by a
language-specific intelligibility assessment model that interprets these
representations within the phonological or prosodic structures of the target
language. We further identify barriers to cross-language intelligibility
assessment of dysarthric speech, including data scarcity, annotation
complexity, and limited linguistic insights into dysarthric speech, and outline
potential AI-driven solutions to overcome these challenges. Conclusion:
Advancing cross-language intelligibility assessment of dysarthric speech
necessitates models that are both efficient and scalable, yet constrained by
linguistic rules to ensure accurate and language-sensitive assessment. Recent
advances in AI provide the foundational tools to support this integration,
shaping future directions toward generalizable and linguistically informed
assessment frameworks.
comment: 14 pages, 2 figure, 2 tables
♻ ☆ SmallPlan: Leverage Small Language Models for Sequential Path Planning with Simulation-Powered, LLM-Guided Distillation
Efficient path planning in robotics, particularly within large-scale, dynamic
environments, remains a significant hurdle. While Large Language Models (LLMs)
offer strong reasoning capabilities, their high computational cost and limited
adaptability in dynamic scenarios hinder real-time deployment on edge devices.
We present SmallPlan -- a novel framework leveraging LLMs as teacher models to
train lightweight Small Language Models (SLMs) for high-level path planning
tasks. In SmallPlan, the SLMs provide optimal action sequences to navigate
across scene graphs that compactly represent full-scaled 3D scenes. The SLMs
are trained in a simulation-powered, interleaved manner with LLM-guided
supervised fine-tuning (SFT) and reinforcement learning (RL). This strategy not
only enables SLMs to successfully complete navigation tasks but also makes them
aware of important factors like travel distance and number of trials. Through
experiments, we demonstrate that the fine-tuned SLMs perform competitively with
larger models like GPT-4o on sequential path planning, without suffering from
hallucination and overfitting. SmallPlan is resource-efficient, making it
well-suited for edge-device deployment and advancing practical autonomous
robotics.
comment: Paper is under review
♻ ☆ Re-evaluating Open-ended Evaluation of Large Language Models ICLR 2025
Evaluation has traditionally focused on ranking candidates for a specific
skill. Modern generalist models, such as Large Language Models (LLMs),
decidedly outpace this paradigm. Open-ended evaluation systems, where candidate
models are compared on user-submitted prompts, have emerged as a popular
solution. Despite their many advantages, we show that the current Elo-based
rating systems can be susceptible to and even reinforce biases in data,
intentional or accidental, due to their sensitivity to redundancies. To address
this issue, we propose evaluation as a 3-player game, and introduce novel
game-theoretic solution concepts to ensure robustness to redundancy. We show
that our method leads to intuitive ratings and provide insights into the
competitive landscape of LLM development.
comment: Published at ICLR 2025
♻ ☆ Benchmarking Open-Source Large Language Models on Healthcare Text Classification Tasks
The application of large language models (LLMs) to healthcare information
extraction has emerged as a promising approach. This study evaluates the
classification performance of five open-source LLMs: GEMMA-3-27B-IT,
LLAMA3-70B, LLAMA4-109B, DEEPSEEK-R1-DISTILL-LLAMA-70B, and
DEEPSEEK-V3-0324-UD-Q2_K_XL, across six healthcare-related classification tasks
involving both social media data (breast cancer, changes in medication regimen,
adverse pregnancy outcomes, potential COVID-19 cases) and clinical data (stigma
labeling, medication change discussion). We report precision, recall, and F1
scores with 95% confidence intervals for all model-task combinations. Our
findings reveal significant performance variability between LLMs, with
DeepSeekV3 emerging as the strongest overall performer, achieving the highest
F1 scores in four tasks. Notably, models generally performed better on social
media tasks compared to clinical data tasks, suggesting potential
domain-specific challenges. GEMMA-3-27B-IT demonstrated exceptionally high
recall despite its smaller parameter count, while LLAMA4-109B showed
surprisingly underwhelming performance compared to its predecessor LLAMA3-70B,
indicating that larger parameter counts do not guarantee improved
classification results. We observed distinct precision-recall trade-offs across
models, with some favoring sensitivity over specificity and vice versa. These
findings highlight the importance of task-specific model selection for
healthcare applications, considering the particular data domain and
precision-recall requirements rather than model size alone. As healthcare
increasingly integrates AI-driven text classification tools, this comprehensive
benchmarking provides valuable guidance for model selection and implementation
while underscoring the need for continued evaluation and domain adaptation of
LLMs in healthcare contexts.
comment: 5 pages
♻ ☆ Faster, Cheaper, Better: Multi-Objective Hyperparameter Optimization for LLM and RAG Systems
While Retrieval Augmented Generation (RAG) has emerged as a popular technique
for improving Large Language Model (LLM) systems, it introduces a large number
of choices, parameters and hyperparameters that must be made or tuned. This
includes the LLM, embedding, and ranker models themselves, as well as
hyperparameters governing individual RAG components. Yet, collectively
optimizing the entire configuration in a RAG or LLM system remains
under-explored - especially in multi-objective settings - due to intractably
large solution spaces, noisy objective evaluations, and the high cost of
evaluations. In this work, we introduce the first approach for multi-objective
parameter optimization of cost, latency, safety and alignment over entire LLM
and RAG systems. We find that Bayesian optimization methods significantly
outperform baseline approaches, obtaining a superior Pareto front on two new
RAG benchmark tasks. We conclude our work with important considerations for
practitioners who are designing multi-objective RAG systems, highlighting
nuances such as how optimal configurations may not generalize across tasks and
objectives.
♻ ☆ Large Language Models Understanding: an Inherent Ambiguity Barrier
A lively ongoing debate is taking place, since the extraordinary emergence of
Large Language Models (LLMs) with regards to their capability to understand the
world and capture the meaning of the dialogues in which they are involved.
Arguments and counter-arguments have been proposed based upon thought
experiments, anecdotal conversations between LLMs and humans, statistical
linguistic analysis, philosophical considerations, and more. In this brief
paper we present a counter-argument based upon a thought experiment and
semi-formal considerations leading to an inherent ambiguity barrier which
prevents LLMs from having any understanding of what their amazingly fluent
dialogues mean.
comment: submitted to NEURAL COMPUTATION
♻ ☆ Combating Confirmation Bias: A Unified Pseudo-Labeling Framework for Entity Alignment
Entity alignment (EA) aims at identifying equivalent entity pairs across
different knowledge graphs (KGs) that refer to the same real-world identity. To
circumvent the shortage of seed alignments provided for training, recent EA
models utilize pseudo-labeling strategies to iteratively add unaligned entity
pairs predicted with high confidence to the seed alignments for model training.
However, the adverse impact of confirmation bias during pseudo-labeling has
been largely overlooked, thus hindering entity alignment performance. To
systematically combat confirmation bias for pseudo-labeling-based entity
alignment, we propose a Unified Pseudo-Labeling framework for Entity Alignment
(UPL-EA) that explicitly eliminates pseudo-labeling errors to boost the
accuracy of entity alignment. UPL-EA consists of two complementary components:
(1) Optimal Transport (OT)-based pseudo-labeling uses discrete OT modeling as
an effective means to determine entity correspondences and reduce erroneous
matches across two KGs. An effective criterion is derived to infer
pseudo-labeled alignments that satisfy one-to-one correspondences; (2) Parallel
pseudo-label ensembling refines pseudo-labeled alignments by combining
predictions over multiple models independently trained in parallel. The
ensembled pseudo-labeled alignments are thereafter used to augment seed
alignments to reinforce subsequent model training for alignment inference. The
effectiveness of UPL-EA in eliminating pseudo-labeling errors is both
theoretically supported and experimentally validated. Our extensive results and
in-depth analyses demonstrate the superiority of UPL-EA over 15 competitive
baselines and its utility as a general pseudo-labeling framework for entity
alignment.
♻ ☆ Large Language Models for Outpatient Referral: Problem Definition, Benchmarking and Challenges
Xiaoxiao Liu, Qingying Xiao, Junying Chen, Xiangyi Feng, Xiangbo Wu, Bairui Zhang, Xiang Wan, Jian Chang, Guangjun Yu, Yan Hu, Benyou Wang
Large language models (LLMs) are increasingly applied to outpatient referral
tasks across healthcare systems. However, there is a lack of standardized
evaluation criteria to assess their effectiveness, particularly in dynamic,
interactive scenarios. In this study, we systematically examine the
capabilities and limitations of LLMs in managing tasks within Intelligent
Outpatient Referral (IOR) systems and propose a comprehensive evaluation
framework specifically designed for such systems. This framework comprises two
core tasks: static evaluation, which focuses on evaluating the ability of
predefined outpatient referrals, and dynamic evaluation, which evaluates
capabilities of refining outpatient referral recommendations through iterative
dialogues. Our findings suggest that LLMs offer limited advantages over
BERT-like models, but show promise in asking effective questions during
interactive dialogues.
♻ ☆ SeriesBench: A Benchmark for Narrative-Driven Drama Series Understanding CVPR 2025
Chenkai Zhang, Yiming Lei, Zeming Liu, Haitao Leng, Shaoguo Liu, Tingting Gao, Qingjie Liu, Yunhong Wang
With the rapid development of Multi-modal Large Language Models (MLLMs), an
increasing number of benchmarks have been established to evaluate the video
understanding capabilities of these models. However, these benchmarks focus on
standalone videos and mainly assess "visual elements" like human actions and
object states. In reality, contemporary videos often encompass complex and
continuous narratives, typically presented as a series. To address this
challenge, we propose SeriesBench, a benchmark consisting of 105 carefully
curated narrative-driven series, covering 28 specialized tasks that require
deep narrative understanding. Specifically, we first select a diverse set of
drama series spanning various genres. Then, we introduce a novel long-span
narrative annotation method, combined with a full-information transformation
approach to convert manual annotations into diverse task formats. To further
enhance model capacity for detailed analysis of plot structures and character
relationships within series, we propose a novel narrative reasoning framework,
PC-DCoT. Extensive results on SeriesBench indicate that existing MLLMs still
face significant challenges in understanding narrative-driven series, while
PC-DCoT enables these MLLMs to achieve performance improvements. Overall, our
SeriesBench and PC-DCoT highlight the critical necessity of advancing model
capabilities to understand narrative-driven series, guiding the future
development of MLLMs. SeriesBench is publicly available at
https://github.com/zackhxn/SeriesBench-CVPR2025.
comment: 29 pages, 15 figures, CVPR 2025
♻ ☆ Exploring the Trade-Offs: Quantization Methods, Task Difficulty, and Model Size in Large Language Models From Edge to Giant IJCAI 2025
Quantization has gained attention as a promising solution for the
cost-effective deployment of large and small language models. However, most
prior work has been limited to perplexity or basic knowledge tasks and lacks a
comprehensive evaluation of recent models like Llama-3.3. In this paper, we
conduct a comprehensive evaluation of instruction-tuned models spanning 1B to
405B parameters, applying four quantization methods across 13 datasets. Our
findings reveal that (1) quantized models generally surpass smaller FP16
baselines, yet they often struggle with instruction-following and hallucination
detection; (2) FP8 consistently emerges as the most robust option across tasks,
and AWQ tends to outperform GPTQ in weight-only quantization; (3) smaller
models can suffer severe accuracy drops at 4-bit quantization, while 70B-scale
models maintain stable performance; (4) notably, \textit{hard} tasks do not
always experience the largest accuracy losses, indicating that quantization
magnifies a model's inherent weaknesses rather than simply correlating with
task difficulty; and (5) an LLM-based judge (MT-Bench) highlights significant
performance declines in coding and STEM tasks, though reasoning may sometimes
improve.
comment: Accepted in IJCAI 2025, 21 pages, 2 figure
♻ ☆ WorldCuisines: A Massive-Scale Benchmark for Multilingual and Multicultural Visual Question Answering on Global Cuisines NAACL 2025
Genta Indra Winata, Frederikus Hudi, Patrick Amadeus Irawan, David Anugraha, Rifki Afina Putri, Yutong Wang, Adam Nohejl, Ubaidillah Ariq Prathama, Nedjma Ousidhoum, Afifa Amriani, Anar Rzayev, Anirban Das, Ashmari Pramodya, Aulia Adila, Bryan Wilie, Candy Olivia Mawalim, Ching Lam Cheng, Daud Abolade, Emmanuele Chersoni, Enrico Santus, Fariz Ikhwantri, Garry Kuwanto, Hanyang Zhao, Haryo Akbarianto Wibowo, Holy Lovenia, Jan Christian Blaise Cruz, Jan Wira Gotama Putra, Junho Myung, Lucky Susanto, Maria Angelica Riera Machin, Marina Zhukova, Michael Anugraha, Muhammad Farid Adilazuarda, Natasha Santosa, Peerat Limkonchotiwat, Raj Dabre, Rio Alexander Audino, Samuel Cahyawijaya, Shi-Xiong Zhang, Stephanie Yulia Salim, Yi Zhou, Yinxuan Gui, David Ifeoluwa Adelani, En-Shiun Annie Lee, Shogo Okada, Ayu Purwarianti, Alham Fikri Aji, Taro Watanabe, Derry Tanti Wijaya, Alice Oh, Chong-Wah Ngo
Vision Language Models (VLMs) often struggle with culture-specific knowledge,
particularly in languages other than English and in underrepresented cultural
contexts. To evaluate their understanding of such knowledge, we introduce
WorldCuisines, a massive-scale benchmark for multilingual and multicultural,
visually grounded language understanding. This benchmark includes a visual
question answering (VQA) dataset with text-image pairs across 30 languages and
dialects, spanning 9 language families and featuring over 1 million data
points, making it the largest multicultural VQA benchmark to date. It includes
tasks for identifying dish names and their origins. We provide evaluation
datasets in two sizes (12k and 60k instances) alongside a training dataset (1
million instances). Our findings show that while VLMs perform better with
correct location context, they struggle with adversarial contexts and
predicting specific regional cuisines and languages. To support future
research, we release a knowledge base with annotated food entries and images
along with the VQA data.
comment: Best Theme Paper at NAACL 2025
♻ ☆ Accurate and Diverse LLM Mathematical Reasoning via Automated PRM-Guided GFlowNets
Achieving both accuracy and diverse reasoning remains challenging for Large
Language Models (LLMs) in complex domains like mathematics. A key bottleneck is
evaluating intermediate reasoning steps to guide generation without costly
human annotations. To address this, we first introduce a novel Process Reward
Model (PRM) trained automatically using Monte Carlo Tree Search coupled with a
similarity-based data augmentation technique, effectively capturing step-level
reasoning quality. Leveraging this PRM, we then adapt Generative Flow Networks
(GFlowNets) to operate at the reasoning step level. Unlike traditional
reinforcement learning focused on maximizing a single reward, GFlowNets
naturally sample diverse, high-quality solutions proportional to their rewards,
as measured by our PRM. Empirical evaluation shows strong improvements in both
accuracy and solution diversity on challenging mathematical benchmarks (e.g.,
+2.59% absolute accuracy on MATH Level 5 for Llama3.2-3B), with effective
generalization to unseen datasets (+9.4% absolute on SAT MATH). Our work
demonstrates the potential of PRM-guided, step-level GFlowNets for developing
more robust and versatile mathematical reasoning in LLMs.
♻ ☆ The Power of Stories: Narrative Priming Shapes How LLM Agents Collaborate and Compete
Gerrit Großmann, Larisa Ivanova, Sai Leela Poduru, Mohaddeseh Tabrizian, Islam Mesabah, David A. Selby, Sebastian J. Vollmer
According to Yuval Noah Harari, large-scale human cooperation is driven by
shared narratives that encode common beliefs and values. This study explores
whether such narratives can similarly nudge LLM agents toward collaboration. We
use a finitely repeated public goods game in which LLM agents choose either
cooperative or egoistic spending strategies. We prime agents with stories
highlighting teamwork to different degrees and test how this influences
negotiation outcomes. Our experiments explore four questions:(1) How do
narratives influence negotiation behavior? (2) What differs when agents share
the same story versus different ones? (3) What happens when the agent numbers
grow? (4) Are agents resilient against self-serving negotiators? We find that
story-based priming significantly affects negotiation strategies and success
rates. Common stories improve collaboration, benefiting each agent. By
contrast, priming agents with different stories reverses this effect, and those
agents primed toward self-interest prevail. We hypothesize that these results
carry implications for multi-agent system design and AI alignment.
comment: 16 pages, 8 figures. Code available at
https://github.com/storyagents25/story-agents
♻ ☆ E2E-AFG: An End-to-End Model with Adaptive Filtering for Retrieval-Augmented Generation
Retrieval-augmented generation methods often neglect the quality of content
retrieved from external knowledge bases, resulting in irrelevant information or
potential misinformation that negatively affects the generation results of
large language models. In this paper, we propose an end-to-end model with
adaptive filtering for retrieval-augmented generation (E2E-AFG), which
integrates answer existence judgment and text generation into a single
end-to-end framework. This enables the model to focus more effectively on
relevant content while reducing the influence of irrelevant information and
generating accurate answers. We evaluate E2E-AFG on six representative
knowledge-intensive language datasets, and the results show that it
consistently outperforms baseline models across all tasks, demonstrating the
effectiveness and robustness of the proposed approach.
comment: 13 pages, 3 figures, 5 tables
♻ ☆ A Survey on Inference Engines for Large Language Models: Perspectives on Optimization and Efficiency
Large language models (LLMs) are widely applied in chatbots, code generators,
and search engines. Workloads such as chain-of-thought, complex reasoning, and
agent services significantly increase the inference cost by invoking the model
repeatedly. Optimization methods such as parallelism, compression, and caching
have been adopted to reduce costs, but the diverse service requirements make it
hard to select the right method. Recently, specialized LLM inference engines
have emerged as a key component for integrating the optimization methods into
service-oriented infrastructures. However, a systematic study on inference
engines is still lacking. This paper provides a comprehensive evaluation of 25
open-source and commercial inference engines. We examine each inference engine
in terms of ease-of-use, ease-of-deployment, general-purpose support,
scalability, and suitability for throughput- and latency-aware computation.
Furthermore, we explore the design goals of each inference engine by
investigating the optimization techniques it supports. In addition, we assess
the ecosystem maturity of open source inference engines and handle the
performance and cost policy of commercial solutions. We outline future research
directions that include support for complex LLM-based services, support of
various hardware, and enhanced security, offering practical guidance to
researchers and developers in selecting and designing optimized LLM inference
engines. We also provide a public repository to continually track developments
in this fast-evolving field:
https://github.com/sihyeong/Awesome-LLM-Inference-Engine
comment: Under review; 65 pages; 27 figures
♻ ☆ HORAE: A Domain-Agnostic Language for Automated Service Regulation
Yutao Sun, Mingshuai Chen, Tiancheng Zhao, Kangjia Zhao, He Li, Jintao Chen, Zhongyi Wang, Liqiang Lu, Xinkui Zhao, Shuiguang Deng, Jianwei Yin
Artificial intelligence is rapidly encroaching on the field of service
regulation. However, existing AI-based regulation techniques are often tailored
to specific application domains and thus are difficult to generalize in an
automated manner. This paper presents Horae, a unified specification language
for modeling (multimodal) regulation rules across a diverse set of domains. We
showcase how Horae facilitates an intelligent service regulation pipeline by
further exploiting a fine-tuned large language model named RuleGPT that
automates the Horae modeling process, thereby yielding an end-to-end framework
for fully automated intelligent service regulation. The feasibility and
effectiveness of our framework are demonstrated over a benchmark of various
real-world regulation domains. In particular, we show that our open-sourced,
fine-tuned RuleGPT with 7B parameters suffices to outperform GPT-3.5 and
perform on par with GPT-4o.
♻ ☆ Optimizing LLMs for Resource-Constrained Environments: A Survey of Model Compression Techniques
Sanjay Surendranath Girija, Shashank Kapoor, Lakshit Arora, Dipen Pradhan, Aman Raj, Ankit Shetgaonkar
Large Language Models (LLMs) have revolutionized many areas of artificial
intelligence (AI), but their substantial resource requirements limit their
deployment on mobile and edge devices. This survey paper provides a
comprehensive overview of techniques for compressing LLMs to enable efficient
inference in resource-constrained environments. We examine three primary
approaches: Knowledge Distillation, Model Quantization, and Model Pruning. For
each technique, we discuss the underlying principles, present different
variants, and provide examples of successful applications. We also briefly
discuss complementary techniques such as mixture-of-experts and early-exit
strategies. Finally, we highlight promising future directions, aiming to
provide a valuable resource for both researchers and practitioners seeking to
optimize LLMs for edge deployment.
comment: Accepted to IEEE COMPSAC 2025
♻ ☆ Lexicon3D: Probing Visual Foundation Models for Complex 3D Scene Understanding NeurIPS 2024
Complex 3D scene understanding has gained increasing attention, with scene
encoding strategies playing a crucial role in this success. However, the
optimal scene encoding strategies for various scenarios remain unclear,
particularly compared to their image-based counterparts. To address this issue,
we present a comprehensive study that probes various visual encoding models for
3D scene understanding, identifying the strengths and limitations of each model
across different scenarios. Our evaluation spans seven vision foundation
encoders, including image-based, video-based, and 3D foundation models. We
evaluate these models in four tasks: Vision-Language Scene Reasoning, Visual
Grounding, Segmentation, and Registration, each focusing on different aspects
of scene understanding. Our evaluations yield key findings: DINOv2 demonstrates
superior performance, video models excel in object-level tasks, diffusion
models benefit geometric tasks, and language-pretrained models show unexpected
limitations in language-related tasks. These insights challenge some
conventional understandings, provide novel perspectives on leveraging visual
foundation models, and highlight the need for more flexible encoder selection
in future vision-language and scene-understanding tasks. Code:
https://github.com/YunzeMan/Lexicon3D
comment: NeurIPS 2024. Project page: https://yunzeman.github.io/lexicon3d
Github: https://github.com/YunzeMan/Lexicon3D
♻ ☆ Quantifying Risk Propensities of Large Language Models: Ethical Focus and Bias Detection through Role-Play
As Large Language Models (LLMs) become more prevalent, concerns about their
safety, ethics, and potential biases have risen. Systematically evaluating
LLMs' risk decision-making tendencies and attitudes, particularly in the
ethical domain, has become crucial. This study innovatively applies the
Domain-Specific Risk-Taking (DOSPERT) scale from cognitive science to LLMs and
proposes a novel Ethical Decision-Making Risk Attitude Scale (EDRAS) to assess
LLMs' ethical risk attitudes in depth. We further propose a novel approach
integrating risk scales and role-playing to quantitatively evaluate systematic
biases in LLMs. Through systematic evaluation and analysis of multiple
mainstream LLMs, we assessed the "risk personalities" of LLMs across multiple
domains, with a particular focus on the ethical domain, and revealed and
quantified LLMs' systematic biases towards different groups. This research
helps understand LLMs' risk decision-making and ensure their safe and reliable
application. Our approach provides a tool for identifying and mitigating
biases, contributing to fairer and more trustworthy AI systems. The code and
data are available.
comment: Accepted by CogSci 2025
♻ ☆ Position: AI Evaluation Should Learn from How We Test Humans ICML 2025
Yan Zhuang, Qi Liu, Zachary A. Pardos, Patrick C. Kyllonen, Jiyun Zu, Zhenya Huang, Shijin Wang, Enhong Chen
As AI systems continue to evolve, their rigorous evaluation becomes crucial
for their development and deployment. Researchers have constructed various
large-scale benchmarks to determine their capabilities, typically against a
gold-standard test set and report metrics averaged across all items. However,
this static evaluation paradigm increasingly shows its limitations, including
high evaluation costs, data contamination, and the impact of low-quality or
erroneous items on evaluation reliability and efficiency. In this Position,
drawing from human psychometrics, we discuss a paradigm shift from static
evaluation methods to adaptive testing. This involves estimating the
characteristics or value of each test item in the benchmark, and tailoring each
model's evaluation instead of relying on a fixed test set. This paradigm
provides robust ability estimation, uncovering the latent traits underlying a
model's observed scores. This position paper analyze the current possibilities,
prospects, and reasons for adopting psychometrics in AI evaluation. We argue
that psychometrics, a theory originating in the 20th century for human
assessment, could be a powerful solution to the challenges in today's AI
evaluations.
comment: Accepted by ICML 2025
♻ ☆ Drift: Decoding-time Personalized Alignments with Implicit User Preferences
Personalized alignments for individual users have been a long-standing goal
in large language models (LLMs). We introduce Drift, a novel framework that
personalizes LLMs at decoding time with implicit user preferences. Traditional
Reinforcement Learning from Human Feedback (RLHF) requires thousands of
annotated examples and expensive gradient updates. In contrast, Drift
personalizes LLMs in a training-free manner, using only a few dozen examples to
steer a frozen model through efficient preference modeling. Our approach models
user preferences as a composition of predefined, interpretable attributes and
aligns them at decoding time to enable personalized generation. Experiments on
both a synthetic persona dataset (Perspective) and a real human-annotated
dataset (PRISM) demonstrate that Drift significantly outperforms RLHF baselines
while using only 50-100 examples. Our results and analysis show that Drift is
both computationally efficient and interpretable.
comment: 19 pages, 6 figures
♻ ☆ Self-Correction is More than Refinement: A Learning Framework for Visual and Language Reasoning Tasks
While Vision-Language Models (VLMs) have shown remarkable abilities in visual
and language reasoning tasks, they invariably generate flawed responses.
Self-correction that instructs models to refine their outputs presents a
promising solution to this issue. Previous studies have mainly concentrated on
Large Language Models (LLMs), while the self-correction abilities of VLMs,
particularly concerning both visual and linguistic information, remain largely
unexamined. This study investigates the self-correction capabilities of VLMs
during both inference and fine-tuning stages. We introduce a Self-Correction
Learning (SCL) approach that enables VLMs to learn from their self-generated
self-correction data through Direct Preference Optimization (DPO) without
relying on external feedback, facilitating self-improvement. Specifically, we
collect preferred and disfavored samples based on the correctness of initial
and refined responses, which are obtained by two-turn self-correction with VLMs
during the inference stage. Experimental results demonstrate that although VLMs
struggle to self-correct effectively during iterative inference without
additional fine-tuning and external feedback, they can enhance their
performance and avoid previous mistakes through preference fine-tuning when
their self-generated self-correction data are categorized into preferred and
disfavored samples. This study emphasizes that self-correction is not merely a
refinement process; rather, it should enhance the reasoning abilities of models
through additional training, enabling them to generate high-quality responses
directly without further refinement.
♻ ★ Mixture-of-Transformers: A Sparse and Scalable Architecture for Multi-Modal Foundation Models
Weixin Liang, Lili Yu, Liang Luo, Srinivasan Iyer, Ning Dong, Chunting Zhou, Gargi Ghosh, Mike Lewis, Wen-tau Yih, Luke Zettlemoyer, Xi Victoria Lin
The development of large language models (LLMs) has expanded to multi-modal
systems capable of processing text, images, and speech within a unified
framework. Training these models demands significantly larger datasets and
computational resources compared to text-only LLMs. To address the scaling
challenges, we introduce Mixture-of-Transformers (MoT), a sparse multi-modal
transformer architecture that significantly reduces pretraining computational
costs. MoT decouples non-embedding parameters of the model by modality --
including feed-forward networks, attention matrices, and layer normalization --
enabling modality-specific processing with global self-attention over the full
input sequence. We evaluate MoT across multiple settings and model scales. In
the Chameleon 7B setting (autoregressive text-and-image generation), MoT
matches the dense baseline's performance using only 55.8\% of the FLOPs. When
extended to include speech, MoT reaches speech performance comparable to the
dense baseline with only 37.2\% of the FLOPs. In the Transfusion setting, where
text and image are trained with different objectives, a 7B MoT model matches
the image modality performance of the dense baseline with one third of the
FLOPs, and a 760M MoT model outperforms a 1.4B dense baseline across key image
generation metrics. System profiling further highlights MoT's practical
benefits, achieving dense baseline image quality in 47.2\% of the wall-clock
time and text quality in 75.6\% of the wall-clock time (measured on AWS
p4de.24xlarge instances with NVIDIA A100 GPUs).
comment: Accepted to TMLR 2025; 48 pages
♻ ☆ Unified Attacks to Large Language Model Watermarks: Spoofing and Scrubbing in Unauthorized Knowledge Distillation
Watermarking has emerged as a critical technique for combating misinformation
and protecting intellectual property in large language models (LLMs). A recent
discovery, termed watermark radioactivity, reveals that watermarks embedded in
teacher models can be inherited by student models through knowledge
distillation. On the positive side, this inheritance allows for the detection
of unauthorized knowledge distillation by identifying watermark traces in
student models. However, the robustness of watermarks against scrubbing attacks
and their unforgeability in the face of spoofing attacks under unauthorized
knowledge distillation remain largely unexplored. Existing watermark attack
methods either assume access to model internals or fail to simultaneously
support both scrubbing and spoofing attacks. In this work, we propose
Contrastive Decoding-Guided Knowledge Distillation (CDG-KD), a unified
framework that enables bidirectional attacks under unauthorized knowledge
distillation. Our approach employs contrastive decoding to extract corrupted or
amplified watermark texts via comparing outputs from the student model and
weakly watermarked references, followed by bidirectional distillation to train
new student models capable of watermark removal and watermark forgery,
respectively. Extensive experiments show that CDG-KD effectively performs
attacks while preserving the general performance of the distilled model. Our
findings underscore critical need for developing watermarking schemes that are
robust and unforgeable.
♻ ☆ Safety Evaluation of DeepSeek Models in Chinese Contexts
Wenjing Zhang, Xuejiao Lei, Zhaoxiang Liu, Ning Wang, Zhenhong Long, Peijun Yang, Jiaojiao Zhao, Minjie Hua, Chaoyang Ma, Kai Wang, Shiguo Lian
Recently, the DeepSeek series of models, leveraging their exceptional
reasoning capabilities and open-source strategy, is reshaping the global AI
landscape. Despite these advantages, they exhibit significant safety
deficiencies. Research conducted by Robust Intelligence, a subsidiary of Cisco,
in collaboration with the University of Pennsylvania, revealed that DeepSeek-R1
has a 100\% attack success rate when processing harmful prompts. Additionally,
multiple safety companies and research institutions have confirmed critical
safety vulnerabilities in this model. As models demonstrating robust
performance in Chinese and English, DeepSeek models require equally crucial
safety assessments in both language contexts. However, current research has
predominantly focused on safety evaluations in English environments, leaving a
gap in comprehensive assessments of their safety performance in Chinese
contexts. In response to this gap, this study introduces CHiSafetyBench, a
Chinese-specific safety evaluation benchmark. This benchmark systematically
evaluates the safety of DeepSeek-R1 and DeepSeek-V3 in Chinese contexts,
revealing their performance across safety categories. The experimental results
quantify the deficiencies of these two models in Chinese contexts, providing
key insights for subsequent improvements. It should be noted that, despite our
efforts to establish a comprehensive, objective, and authoritative evaluation
benchmark, the selection of test samples, characteristics of data distribution,
and the setting of evaluation criteria may inevitably introduce certain biases
into the evaluation results. We will continuously optimize the evaluation
benchmark and periodically update this report to provide more comprehensive and
accurate assessment outcomes. Please refer to the latest version of the paper
for the most recent evaluation results and conclusions.
comment: 12 pages, 2 tables, 7 figures
♻ ☆ ValuesRAG: Enhancing Cultural Alignment Through Retrieval-Augmented Contextual Learning
Ensuring cultural values alignment in Large Language Models (LLMs) remains a
critical challenge, as these models often embed Western-centric biases from
their training data, leading to misrepresentations and fairness concerns in
cross-cultural applications. Existing approaches such as role assignment and
few-shot learning struggle to address these limitations effectively due to
their reliance on pre-trained knowledge, limited scalability, and inability to
capture nuanced cultural values. To address these issues, we propose ValuesRAG,
a novel and effective framework that applies Retrieval-Augmented Generation
(RAG) with In-Context Learning (ICL) to integrate cultural and demographic
knowledge dynamically during text generation. Leveraging the World Values
Survey (WVS) dataset, ValuesRAG first generates summaries of values for each
individual. We subsequently curate several representative regional datasets to
serve as test datasets and retrieve relevant summaries of values based on
demographic features, followed by a reranking step to select the top-k relevant
summaries. We evaluate ValuesRAG using 6 diverse regional datasets and show
that it consistently outperforms baselines: including zero-shot,
role-assignment, few-shot, and hybrid methods, both in main experiments and
ablation settings. Notably, ValuesRAG achieves the best overall performance
over prior methods, demonstrating its effectiveness in fostering culturally
aligned and inclusive AI systems. Our findings underscore the potential of
dynamic retrieval-based methods to bridge the gap between global LLM
capabilities and localized cultural values.
comment: preprint
♻ ☆ Scaling Synthetic Data Creation with 1,000,000,000 Personas
We propose a novel persona-driven data synthesis methodology that leverages
various perspectives within a large language model (LLM) to create diverse
synthetic data. To fully exploit this methodology at scale, we introduce
Persona Hub -- a collection of 1 billion diverse personas automatically curated
from web data. These 1 billion personas (~13% of the world's total population),
acting as distributed carriers of world knowledge, can tap into almost every
perspective encapsulated within the LLM, thereby facilitating the creation of
diverse synthetic data at scale for various scenarios. By showcasing Persona
Hub's use cases in synthesizing high-quality mathematical and logical reasoning
problems, instructions (i.e., user prompts), knowledge-rich texts, game NPCs
and tools (functions) at scale, we demonstrate persona-driven data synthesis is
versatile, scalable, flexible, and easy to use, potentially driving a paradigm
shift in synthetic data creation and applications in practice, which may have a
profound impact on LLM research and development.
comment: Work in progress