Computation and Language 49
☆ The Design of Informative Take-Over Requests for Semi-Autonomous Cyber-Physical Systems: Combining Spoken Language and Visual Icons in a Drone-Controller Setting
The question of how cyber-physical systems should interact with human
partners that can take over control or exert oversight is becoming more
pressing, as these systems are deployed for an ever larger range of tasks.
Drawing on the literatures on handing over control during semi-autonomous
driving and human-robot interaction, we propose a design of a take-over request
that combines an abstract pre-alert with an informative TOR: Relevant sensor
information is highlighted on the controller's display, while a spoken message
verbalizes the reason for the TOR. We use a semi-autonomous drone control
scenario as our testbed. The goal of our online
study is to assess in more detail what form a language-based TOR should take.
Specifically, we compare a full sentence condition to shorter fragments, and
test whether the visual highlighting should be done synchronously or
asynchronously with the speech. Participants showed a higher accuracy in
choosing the correct solution with our bi-modal TOR and felt that they were
better able to recognize the critical situation. Using only fragments in the
spoken message rather than full sentences did not lead to improved accuracy or
faster reactions. Also, synchronizing the visual highlighting with the spoken
message did not result in better accuracy, and response times even increased
in this condition.
comment: 21 pages, 8 figures
☆ Source2Synth: Synthetic Data Generation and Curation Grounded in Real Data Sources
Alisia Lupidi, Carlos Gemmell, Nicola Cancedda, Jane Dwivedi-Yu, Jason Weston, Jakob Foerster, Roberta Raileanu, Maria Lomeli
Large Language Models still struggle in challenging scenarios that leverage
structured data, complex reasoning, or tool usage. In this paper, we propose
Source2Synth: a new method that can be used for teaching LLMs new skills
without relying on costly human annotations. Source2Synth takes as input a
custom data source and produces synthetic data points with intermediate
reasoning steps grounded in real-world sources. Source2Synth improves the
dataset quality by discarding low-quality generations based on their
answerability. We demonstrate the generality of this approach by applying it to
two challenging domains: we test reasoning abilities in multi-hop question
answering (MHQA), and tool usage in tabular question answering (TQA). Our
method improves performance by 25.51% for TQA on WikiSQL and 22.57% for MHQA on
HotPotQA compared to the fine-tuned baselines.
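The answerability-based curation step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the retry count `k`, the exact-match comparison, and the names `filter_by_answerability` and `answer_fn` are all assumptions made for illustration. The idea shown is only that a synthetic example is kept when a model can reproduce its reference answer within a few attempts.

```python
from typing import Callable, Dict, List


def filter_by_answerability(
    examples: List[Dict[str, str]],
    answer_fn: Callable[[str], str],
    k: int = 3,
) -> List[Dict[str, str]]:
    """Keep examples whose reference answer is reproduced within k attempts.

    `answer_fn` is a hypothetical stand-in for a model call; exact match
    after lowercasing and stripping is an assumed comparison rule.
    """
    kept = []
    for ex in examples:
        target = ex["answer"].strip().lower()
        attempts = (answer_fn(ex["question"]).strip().lower() for _ in range(k))
        if any(attempt == target for attempt in attempts):
            kept.append(ex)
    return kept


# Toy stand-in for a model: a lookup table that knows a single answer.
toy_knowledge = {"What is 2 + 2?": "4"}


def toy_model(question: str) -> str:
    return toy_knowledge.get(question, "")


synthetic = [
    {"question": "What is 2 + 2?", "answer": "4"},
    {"question": "Which city hosts the fictional event?", "answer": "Nowhere"},
]
curated = filter_by_answerability(synthetic, toy_model)
```

In this toy run only the first example survives, because the stand-in model cannot answer the second question.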
☆ LLM Honeypot: Leveraging Large Language Models as Advanced Interactive Honeypot Systems
The rapid evolution of cyber threats necessitates innovative solutions for
detecting and analyzing malicious activity. Honeypots, which are decoy systems
designed to lure and interact with attackers, have emerged as a critical
component in cybersecurity. In this paper, we present a novel approach to
creating realistic and interactive honeypot systems using Large Language Models
(LLMs). By fine-tuning a pre-trained open-source language model on a diverse
dataset of attacker-generated commands and responses, we developed a honeypot
capable of sophisticated engagement with attackers. Our methodology involved
several key steps: data collection and processing, prompt engineering, model
selection, and supervised fine-tuning to optimize the model's performance.
Evaluation through similarity metrics and live deployment demonstrated that our
approach effectively generates accurate and informative responses. The results
highlight the potential of LLMs to revolutionize honeypot technology, providing
cybersecurity professionals with a powerful tool to detect and analyze
malicious activity, thereby enhancing overall security infrastructure.
comment: 7 pages, 5 figures
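The abstract above mentions evaluation through similarity metrics without naming them. As a hedged illustration, the sketch below scores a generated terminal response against an expected one using cosine similarity over whitespace-token counts. The choice of metric, the tokenization, and the sample strings are assumptions and are not taken from the paper.

```python
import math
from collections import Counter


def cosine_similarity(expected: str, generated: str) -> float:
    """Cosine similarity between bag-of-token count vectors of two strings."""
    a, b = Counter(expected.split()), Counter(generated.split())
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty string has no direction to compare
    return dot / (norm_a * norm_b)


# Illustrative strings only; not data from the paper.
same = cosine_similarity("uid=0(root) gid=0(root)", "uid=0(root) gid=0(root)")
disjoint = cosine_similarity("uid=0(root)", "command not found")
partial = cosine_similarity("total 4 drwxr-xr-x", "total 8 drwxr-xr-x")
```

Identical responses score close to 1, responses with no shared tokens score 0, and partially overlapping responses fall in between.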
☆ What Makes a Maze Look Like a Maze?
A unique aspect of human visual understanding is the ability to flexibly
interpret abstract concepts: acquiring lifted rules explaining what they
symbolize, grounding them across familiar and unfamiliar contexts, and making
predictions or reasoning about them. While off-the-shelf vision-language models
excel at making literal interpretations of images (e.g., recognizing object
categories such as tree branches), they still struggle to make sense of such
visual abstractions (e.g., how an arrangement of tree branches may form the
walls of a maze). To address this challenge, we introduce Deep Schema Grounding
(DSG), a framework that leverages explicit structured representations of visual
abstractions for grounding and reasoning. At the core of DSG are
schemas--dependency graph descriptions of abstract concepts that decompose them
into more primitive-level symbols. DSG uses large language models to extract
schemas, then hierarchically grounds concrete to abstract components of the
schema onto images with vision-language models. The grounded schema is used to
augment visual abstraction understanding. We systematically evaluate DSG and
different methods in reasoning on our new Visual Abstractions Dataset, which
consists of diverse, real-world images of abstract concepts and corresponding
question-answer pairs labeled by humans. We show that DSG significantly
improves the abstract visual reasoning performance of vision-language models,
and is a step toward human-aligned understanding of visual abstractions.
☆ AudioBERT: Audio Knowledge Augmented Language Model
Recent studies have identified that language models, pretrained on text-only
datasets, often lack elementary visual knowledge, e.g., colors of
everyday objects. Motivated by this observation, we ask whether a similar
shortcoming exists in terms of auditory knowledge. To answer this
question, we construct a new dataset called AuditoryBench, which consists of
two novel tasks for evaluating auditory knowledge. Based on our analysis using
the benchmark, we find that language models also suffer from a severe lack of
auditory knowledge. To address this limitation, we propose AudioBERT, a novel
method to augment the auditory knowledge of BERT through a retrieval-based
approach. First, we detect auditory knowledge spans in prompts to query our
retrieval model efficiently. Then, we inject audio knowledge into BERT and
switch on low-rank adaptation for effective adaptation when audio knowledge is
required. Our experiments demonstrate that AudioBERT is quite effective,
achieving superior performance on the AuditoryBench. The dataset and code are
available at https://github.com/HJ-Ok/AudioBERT.
comment: Preprint
☆ Fine-tuning Large Language Models for Entity Matching
Generative large language models (LLMs) are a promising alternative to
pre-trained language models for entity matching due to their high zero-shot
performance and their ability to generalize to unseen entities. Existing
research on using LLMs for entity matching has focused on prompt engineering
and in-context learning. This paper explores the potential of fine-tuning LLMs
for entity matching. We analyze fine-tuning along two dimensions: 1) The
representation of training examples, where we experiment with adding different
types of LLM-generated explanations to the training set, and 2) the selection
and generation of training examples using LLMs. In addition to the matching
performance on the source dataset, we investigate how fine-tuning affects the
model's ability to generalize to other in-domain datasets as well as across
topical domains. Our experiments show that fine-tuning significantly improves
the performance of the smaller models while the results for the larger models
are mixed. Fine-tuning also improves the generalization to in-domain datasets
while hurting cross-domain transfer. We show that adding structured
explanations to the training set has a positive impact on the performance of
three out of four LLMs, while the proposed example selection and generation
methods only improve the performance of Llama 3.1 8B while decreasing the
performance of GPT-4o Mini.
comment: 8 pages, 4 figures. For related code and data, see
https://github.com/wbsg-uni-mannheim/TailorMatch
☆ On the Role of Context in Reading Time Prediction
We present a new perspective on how readers integrate context during
real-time language comprehension. Our proposals build on surprisal theory,
which posits that the processing effort of a linguistic unit (e.g., a word) is
an affine function of its in-context information content. We first observe that
surprisal is only one out of many potential ways that a contextual predictor
can be derived from a language model. Another one is the pointwise mutual
information (PMI) between a unit and its context, which turns out to yield the
same predictive power as surprisal when controlling for unigram frequency.
Moreover, both PMI and surprisal are correlated with frequency. This means that
neither PMI nor surprisal contains information about context alone. In response
to this, we propose a technique where we project surprisal onto the orthogonal
complement of frequency, yielding a new contextual predictor that is
uncorrelated with frequency. Our experiments show that the proportion of
variance in reading times explained by context is much smaller when context is
represented by the orthogonalized predictor. From an interpretability
standpoint, this indicates that previous studies may have overstated the role
that context has in predicting reading times.
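The quantities in this abstract can be made concrete with a small sketch. Surprisal is -log p(u | c), PMI is log p(u | c) / p(u), and the two differ exactly by the unigram surprisal -log p(u), which is why both correlate with frequency. The orthogonalization below regresses surprisal on the frequency predictor (with an intercept, which is an assumption of this sketch) and keeps the residual. The probabilities are made-up toy values, and the helper names are hypothetical.

```python
import math
from typing import List


def surprisal(p_cond: float) -> float:
    """In-context information content of a unit, in bits."""
    return -math.log2(p_cond)


def pmi(p_cond: float, p_unigram: float) -> float:
    """Pointwise mutual information between a unit and its context, in bits."""
    return math.log2(p_cond / p_unigram)


def residualize(y: List[float], x: List[float]) -> List[float]:
    """Part of y that is orthogonal to x (and to a constant term)."""
    n = len(y)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return [(yi - mean_y) - slope * (xi - mean_x) for xi, yi in zip(x, y)]


# Toy probabilities (illustrative only): p(u) and p(u | c) for four words.
p_unigram = [0.05, 0.01, 0.002, 0.0005]
p_cond = [0.2, 0.02, 0.05, 0.001]

freq = [surprisal(p) for p in p_unigram]       # unigram surprisal, -log p(u)
surp = [surprisal(p) for p in p_cond]          # contextual surprisal
pmis = [pmi(pc, pu) for pc, pu in zip(p_cond, p_unigram)]
orth = residualize(surp, freq)                 # uncorrelated with freq by construction
```

By construction, `orth` has zero sample covariance with `freq`, so any variance it explains in reading times cannot be attributed to frequency.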
☆ LLM-POTUS Score: A Framework of Analyzing Presidential Debates with Large Language Models
Large language models have demonstrated remarkable capabilities in natural
language processing, yet their application to political discourse analysis
remains underexplored. This paper introduces a novel approach to evaluating
presidential debate performances using LLMs, addressing the longstanding
challenge of objectively assessing debate outcomes. We propose a framework that
analyzes candidates' "Policies, Persona, and Perspective" (3P) and how they
resonate with the "Interests, Ideologies, and Identity" (3I) of four key
audience groups: voters, businesses, donors, and politicians. Our method
employs large language models to generate the LLM-POTUS Score, a quantitative
measure of debate performance based on the alignment between 3P and 3I. We
apply this framework to analyze transcripts from recent U.S. presidential
debates, demonstrating its ability to provide nuanced, multi-dimensional
assessments of candidate performances. Our results reveal insights into the
effectiveness of different debating strategies and their impact on various
audience segments. This study not only offers a new tool for political analysis
but also explores the potential and limitations of using LLMs as impartial
judges in complex social contexts. In addition, this framework provides
individual citizens with an independent tool to evaluate presidential debate
performances, which enhances democratic engagement and reduces reliance on
potentially biased media interpretations and institutional influence, thereby
strengthening the foundation of informed civic participation.
☆ WhisperNER: Unified Open Named Entity and Speech Recognition
Integrating named entity recognition (NER) with automatic speech recognition
(ASR) can significantly enhance transcription accuracy and informativeness. In
this paper, we introduce WhisperNER, a novel model that allows joint speech
transcription and entity recognition. WhisperNER supports open-type NER,
enabling recognition of diverse and evolving entities at inference. Building on
recent advancements in open NER research, we augment a large synthetic dataset
with synthetic speech samples. This allows us to train WhisperNER on a large
number of examples with diverse NER tags. During training, the model is
prompted with NER labels and optimized to output the transcribed utterance
along with the corresponding tagged entities. To evaluate WhisperNER, we
generate synthetic speech for commonly used NER benchmarks and annotate
existing ASR datasets with open NER tags. Our experiments demonstrate that
WhisperNER outperforms natural baselines on both out-of-domain open type NER
and supervised finetuning.
☆ The Faetar Benchmark: Speech Recognition in a Very Under-Resourced Language
Michael Ong, Sean Robertson, Leo Peckham, Alba Jorquera Jimenez de Aberasturi, Paula Arkhangorodsky, Robin Huo, Aman Sakhardande, Mark Hallap, Naomi Nagy, Ewan Dunbar
We introduce the Faetar Automatic Speech Recognition Benchmark, a benchmark
corpus designed to push the limits of current approaches to low-resource speech
recognition. Faetar, a Franco-Provençal variety spoken primarily in Italy,
has no standard orthography, has virtually no existing textual or speech
resources other than what is included in the benchmark, and is quite different
from other forms of Franco-Provençal. The corpus comes from field
recordings, most of which are noisy, for which only 5 hrs have matching
transcriptions, and for which forced alignment is of variable quality. The
corpus contains an additional 20 hrs of unlabelled speech. We report baseline
results from state-of-the-art multilingual speech foundation models with a best
phone error rate of 30.4%, using a pipeline that continues pre-training on the
foundation model using the unlabelled set.
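For readers unfamiliar with the metric, phone error rate is conventionally computed as the edit distance between hypothesized and reference phone sequences, divided by the total number of reference phones. The sketch below follows that standard definition; whether the benchmark's scoring applies any further normalization is not stated here, and the phone strings are invented placeholders rather than Faetar data.

```python
from typing import List, Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance (substitutions, insertions, deletions cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference phone
                curr[j - 1] + 1,         # insertion of a hypothesis phone
                prev[j - 1] + (r != h),  # match or substitution
            ))
        prev = curr
    return prev[-1]


def phone_error_rate(refs: List[Sequence[str]], hyps: List[Sequence[str]]) -> float:
    """Total edit errors divided by total reference phones over a corpus."""
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total = sum(len(r) for r in refs)
    return errors / total


# Placeholder phone sequences (not taken from the corpus).
refs = [["k", "a", "t"], ["a", "b", "c", "d"]]
hyps = [["k", "a", "d"], ["a", "c", "d"]]
per = phone_error_rate(refs, hyps)  # one substitution + one deletion over 7 phones
```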
☆ The CLC-UKET Dataset: Benchmarking Case Outcome Prediction for the UK Employment Tribunal
This paper explores the intersection of technological innovation and access
to justice by developing a benchmark for predicting case outcomes in the UK
Employment Tribunal (UKET). To address the challenge of extensive manual
annotation, the study employs a large language model (LLM) for automatic
annotation, resulting in the creation of the CLC-UKET dataset. The dataset
consists of approximately 19,000 UKET cases and their metadata. Comprehensive
legal annotations cover facts, claims, precedent references, statutory
references, case outcomes, reasons and jurisdiction codes. Facilitated by the
CLC-UKET data, we examine a multi-class case outcome prediction task in the
UKET. Human predictions are collected to establish a performance reference for
model comparison. Empirical results from baseline models indicate that
finetuned transformer models outperform zero-shot and few-shot LLMs on the UKET
prediction task. The performance of zero-shot LLMs can be enhanced by
integrating task-related information into few-shot examples. We hope that the
CLC-UKET dataset, along with human annotations and empirical findings, can
serve as a valuable benchmark for employment-related dispute resolution.
☆ TravelAgent: An AI Assistant for Personalized Travel Planning
As global tourism expands and artificial intelligence technology advances,
intelligent travel planning services have emerged as a significant research
focus. Within dynamic real-world travel scenarios with multi-dimensional
constraints, services that support users in automatically creating practical
and customized travel itineraries must address three key objectives:
Rationality, Comprehensiveness, and Personalization. However, existing systems
with rule-based combinations or LLM-based planning methods struggle to fully
satisfy these criteria. To overcome the challenges, we introduce TravelAgent, a
travel planning system powered by large language models (LLMs) designed to
provide reasonable, comprehensive, and personalized travel itineraries grounded
in dynamic scenarios. TravelAgent comprises four modules: Tool-usage,
Recommendation, Planning, and Memory Module. We evaluate TravelAgent's
performance with human and simulated users, demonstrating its overall
effectiveness in three criteria and confirming the accuracy of personalized
recommendations.
☆ Enhanced Online Grooming Detection Employing Context Determination and Message-Level Analysis
Online Grooming (OG) is a prevalent threat facing predominately children
online, with groomers using deceptive methods to prey on the vulnerability of
children on social media/messaging platforms. These attacks can have severe
psychological and physical impacts, including a tendency towards
revictimization. Current technical measures are inadequate, especially with the
advent of end-to-end encryption which hampers message monitoring. Existing
solutions focus on the signature analysis of child abuse media, which does not
effectively address real-time OG detection. This paper proposes that OG attacks
are complex, requiring the identification of specific communication patterns
between adults and children. It introduces a novel approach leveraging advanced
models such as BERT and RoBERTa for Message-Level Analysis and a Context
Determination approach for classifying actor interactions, including the
introduction of Actor Significance Thresholds and Message Significance
Thresholds. The proposed method aims to enhance accuracy and robustness in
detecting OG by considering the dynamic and multi-faceted nature of these
attacks. Cross-dataset experiments evaluate the robustness and versatility of
our approach. This paper's contributions include improved detection
methodologies and the potential for application in various scenarios,
addressing gaps in current literature and practices.
☆ A corpus-based investigation of pitch contours of monosyllabic words in conversational Taiwan Mandarin
In Mandarin, the tonal contours of monosyllabic words produced in isolation
or in careful speech are characterized by four lexical tones: a high-level tone
(T1), a rising tone (T2), a dipping tone (T3) and a falling tone (T4). However,
in spontaneous speech, the actual tonal realization of monosyllabic words can
deviate significantly from these canonical tones due to intra-syllabic
co-articulation and inter-syllabic co-articulation with adjacent tones. In
addition, Chuang et al. (2024) recently reported that the tonal contours of
disyllabic Mandarin words with T2-T4 tone pattern are co-determined by their
meanings. Following up on their research, we present a corpus-based
investigation of how the pitch contours of monosyllabic words are realized in
spontaneous conversational Mandarin, focusing on the effects of contextual
predictors on the one hand, and the way in which words' meanings co-determine pitch
contours on the other hand. We analyze the F0 contours of 3824 tokens of 63
different word types in a spontaneous Taiwan Mandarin corpus, using the
generalized additive (mixed) model to decompose a given observed pitch contour
into a set of component pitch contours. We show that the tonal context
substantially modifies a word's canonical tone. Once the effect of tonal context
is controlled for, T2 and T3 emerge as low flat tones, contrasting with T1 as a
high tone, and with T4 as a high-to-mid falling tone. The neutral tone (T0),
which in standard descriptions is realized based on the preceding tone,
emerges as a low tone in its own right, modified by the other predictors in the
same way as the standard tones T1, T2, T3, and T4. We also show that word, and
even more so, word sense, co-determine words' F0 contours. Analyses of variable
importance using random forests further supported the substantial effect of
tonal context and an effect of word sense.
☆ Learning Rules from KGs Guided by Language Models
Advances in information extraction have enabled the automatic construction of
large knowledge graphs (e.g., Yago, Wikidata or Google KG), which are widely
used in many applications like semantic search or data analytics. However, due
to their semi-automatic construction, KGs are often incomplete. Rule learning
methods, concerned with the extraction of frequent patterns from KGs and
casting them into rules, can be applied to predict potentially missing facts. A
crucial step in this process is rule ranking. Ranking of rules is especially
challenging over highly incomplete or biased KGs (e.g., KGs predominantly
storing facts about famous people), as in this case biased rules might fit the
data best and be ranked at the top based on standard statistical metrics like
rule confidence. To address this issue, prior works proposed to rank rules not
only relying on the original KG but also facts predicted by a KG embedding
model. At the same time, with the recent rise of Language Models (LMs), several
works have claimed that LMs can be used as alternative means for KG completion.
In this work, our goal is to verify to what extent the exploitation of LMs is
helpful for improving the quality of rule learning systems.
comment: proof of concept
☆ FPMT: Enhanced Semi-Supervised Model for Traffic Incident Detection
For traffic incident detection, the acquisition of data and labels is notably
resource-intensive, rendering semi-supervised traffic incident detection both a
formidable and consequential challenge. Thus, this paper focuses on traffic
incident detection using semi-supervised learning. It proposes a
semi-supervised learning model named FPMT within the framework of MixText. The
data augmentation module introduces Generative Adversarial Networks to balance
and expand the dataset. During the mix-up process in the hidden space, it
employs a probabilistic pseudo-mixing mechanism to enhance regularization and
elevate model precision. In terms of training strategy, it begins with
unsupervised training on all data, followed by supervised fine-tuning on a
subset of labeled data, ultimately completing the goal of semi-supervised
training. Through empirical validation on four authentic datasets, our FPMT
model exhibits outstanding performance across various metrics. Particularly
noteworthy is its robust performance even in scenarios with low label rates.
comment: 14 pages, 3 figures, accepted by ICPR 2024
☆ Online vs Offline: A Comparative Study of First-Party and Third-Party Evaluations of Social Chatbots
This paper explores the efficacy of online versus offline evaluation methods
in assessing conversational chatbots, specifically comparing first-party direct
interactions with third-party observational assessments. By extending a
benchmarking dataset of user dialogs with empathetic chatbots with offline
third-party evaluations, we present a systematic comparison between the
feedback from online interactions and the more detached offline third-party
evaluations. Our results reveal that offline human evaluations fail to capture
the subtleties of human-chatbot interactions as effectively as online
assessments. In comparison, automated third-party evaluations using a GPT-4
model offer a better approximation of first-party human judgments given
detailed instructions. This study highlights the limitations of third-party
evaluations in grasping the complexities of user experiences and advocates for
the integration of direct interaction feedback in conversational AI evaluation
to enhance system development and user satisfaction.
☆ Controllable Synthetic Clinical Note Generation with Privacy Guarantees
In the field of machine learning, domain-specific annotated data is an
invaluable resource for training effective models. However, in the medical
domain, this data often includes Personal Health Information (PHI), raising
significant privacy concerns. The stringent regulations surrounding PHI limit
the availability and sharing of medical datasets, which poses a substantial
challenge for researchers and practitioners aiming to develop advanced machine
learning models. In this paper, we introduce a novel method to "clone" datasets
containing PHI. Our approach ensures that the cloned datasets retain the
essential characteristics and utility of the original data without compromising
patient privacy. By leveraging differential-privacy techniques and a novel
fine-tuning task, our method produces datasets that are free from identifiable
information while preserving the statistical properties necessary for model
training. We conduct utility testing to evaluate the performance of machine
learning models trained on the cloned datasets. The results demonstrate that
our cloned datasets not only uphold privacy standards but also enhance model
performance compared to those trained on traditional anonymized datasets. This
work offers a viable solution for the ethical and effective utilization of
sensitive medical data in machine learning, facilitating progress in medical
research and the development of robust predictive models.
☆ Full-text Error Correction for Chinese Speech Recognition with Large Language Model
Large Language Models (LLMs) have demonstrated substantial potential for
error correction in Automatic Speech Recognition (ASR). However, most research
focuses on utterances from short-duration speech recordings, which are the
predominant form of speech data for supervised ASR training. This paper
investigates the effectiveness of LLMs for error correction in full-text
generated by ASR systems from longer speech recordings, such as transcripts
from podcasts, news broadcasts, and meetings. First, we develop a Chinese
dataset for full-text error correction, named ChFT, utilizing a pipeline that
involves text-to-speech synthesis, ASR, and an error-correction pair extractor.
This dataset enables us to correct errors across contexts, including both
full-text and segment, and to address a broader range of error types, such as
punctuation restoration and inverse text normalization, thus making the
correction process comprehensive. Second, we fine-tune a pre-trained LLM on the
constructed dataset using a diverse set of prompts and target formats, and
evaluate its performance on full-text error correction. Specifically, we design
prompts based on full-text and segment, considering various output formats,
such as directly corrected text and JSON-based error-correction pairs. Through
various test settings, including homogeneous, up-to-date, and hard test sets,
we find that the fine-tuned LLMs perform well in the full-text setting with
different prompts, each presenting its own strengths and weaknesses. This
establishes a promising baseline for further research. The dataset is available
on the website.
☆ Stable Language Model Pre-training by Reducing Embedding Variability
Stable pre-training is essential for achieving better-performing language
models. However, tracking pre-training stability by calculating gradient
variance at every step is impractical due to the significant computational
costs. We explore Token Embedding Variability (TEV) as a simple and efficient
proxy for assessing pre-training stability in language models with pre-layer
normalization, given that shallower layers are more prone to gradient explosion
(section 2.2). Moreover, we propose Multi-head Low-Rank Attention (MLRA) as an
architecture to alleviate such instability by limiting the exponential growth
of output embedding variance, thereby preventing the gradient explosion
(section 3.2). Empirical results on GPT-2 with MLRA demonstrate increased
stability and lower perplexity, particularly in deeper models.
☆ Supporting Online Discussions: Integrating AI Into the adhocracy+ Participation Platform To Enhance Deliberation
Online spaces allow people to discuss important issues and make joint
decisions, regardless of their location or time zone. However, without proper
support and thoughtful design, these discussions often lack structure and
politeness during the exchanges of opinions. Artificial intelligence (AI)
represents an opportunity to support both participants and organizers of
large-scale online participation processes. In this paper, we present an
extension of adhocracy+, a large-scale open source participation platform, that
provides two additional debate modules that are supported by AI to enhance the
discussion quality and participant interaction.
☆ Top-down Activity Representation Learning for Video Question Answering
Capturing complex hierarchical human activities, from atomic actions (e.g.,
picking up one present, moving to the sofa, unwrapping the present) to
contextual events (e.g., celebrating Christmas) is crucial for achieving
high-performance video question answering (VideoQA). Recent works have expanded
multimodal models (e.g., CLIP, LLaVA) to process continuous video sequences,
enhancing the model's temporal reasoning capabilities. However, these
approaches often fail to capture contextual events that can be decomposed into
multiple atomic actions non-continuously distributed over relatively long-term
sequences. In this paper, to leverage the spatial visual context representation
capability of the CLIP model for obtaining non-continuous visual
representations in terms of contextual events in videos, we convert long-term
video sequences into a spatial image domain and finetune the multimodal model
LLaVA for the VideoQA task. Our approach achieves competitive performance on
the STAR task and, in particular, a 78.4% accuracy score on the NExTQA task,
exceeding the current state-of-the-art score by 2.8 points.
comment: presented at MIRU2024
☆ Multi-object event graph representation learning for Video Question Answering
Video question answering (VideoQA) is a task to predict the correct answer to
questions posed about a given video. The system must comprehend spatial and
temporal relationships among objects extracted from videos to perform causal
and temporal reasoning. While prior works have focused on modeling individual
object movements using transformer-based methods, they falter when capturing
complex scenarios involving multiple objects (e.g., "a boy is throwing a ball
in a hoop"). We propose a contrastive language event graph representation
learning method called CLanG to address this limitation. Aiming to capture
event representations associated with multiple objects, our method employs a
multi-layer GNN-cluster module for adversarial graph representation learning,
enabling contrastive learning between the question text and its relevant
multi-object event graph. Our method outperforms a strong baseline, achieving
up to 2.2% higher accuracy on two challenging VideoQA datasets, NExT-QA and
TGIF-QA-R. In particular, it is 2.8% better than baselines in handling causal
and temporal questions, highlighting its strength in reasoning about multiple
object-based events.
comment: presented at MIRU2024
☆ Ruri: Japanese General Text Embeddings
We report the development of Ruri, a series of Japanese general text
embedding models. While the development of general-purpose text embedding
models in English and multilingual contexts has been active in recent years,
model development in Japanese remains insufficient. The primary reasons for
this are the lack of datasets and the absence of necessary expertise. In this
report, we provide a detailed account of the development process of Ruri.
Specifically, we discuss the training of embedding models using synthesized
datasets generated by LLMs, the construction of the reranker for dataset
filtering and knowledge distillation, and the performance evaluation of the
resulting general-purpose text embedding models.
☆ Experimenting with Legal AI Solutions: The Case of Question-Answering for Access to Justice ICML 2024
Generative AI models, such as the GPT and Llama series, have significant
potential to assist laypeople in answering legal questions. However, little
prior work focuses on the data sourcing, inference, and evaluation of these
models in the context of laypersons. To this end, we propose a human-centric
legal NLP pipeline, covering data sourcing, inference, and evaluation. We
introduce and release a dataset, LegalQA, with real and specific legal
questions spanning from employment law to criminal law, corresponding answers
written by legal experts, and citations for each answer. We develop an
automatic evaluation protocol for this dataset, then show that
retrieval-augmented generation from only 850 citations in the train set can
match or outperform internet-wide retrieval, despite containing 9 orders of
magnitude less data. Finally, we propose future directions for open-sourced
efforts, which fall behind closed-sourced models.
comment: Accepted into GenLaw '24 (ICML 2024 workshop)
☆ DSBench: How Far Are Data Science Agents to Becoming Data Science Experts?
Liqiang Jing, Zhehui Huang, Xiaoyang Wang, Wenlin Yao, Wenhao Yu, Kaixin Ma, Hongming Zhang, Xinya Du, Dong Yu
Large Language Models (LLMs) and Large Vision-Language Models (LVLMs) have
demonstrated impressive language/vision reasoning abilities, igniting the
recent trend of building agents for targeted applications such as shopping
assistants or AI software engineers. Recently, many data science benchmarks
have been proposed to investigate their performance in the data science domain.
However, existing data science benchmarks still fall short when compared to
real-world data science applications due to their simplified settings. To
bridge this gap, we introduce DSBench, a comprehensive benchmark designed to
evaluate data science agents with realistic tasks. This benchmark includes 466
data analysis tasks and 74 data modeling tasks, sourced from Eloquence and
Kaggle competitions. DSBench offers a realistic setting by encompassing long
contexts, multimodal task backgrounds, reasoning with large data files and
multi-table structures, and performing end-to-end data modeling tasks. Our
evaluation of state-of-the-art LLMs, LVLMs, and agents shows that they struggle
with most tasks, with the best agent solving only 34.12% of data analysis tasks
and achieving a 34.74% Relative Performance Gap (RPG). These findings
underscore the need for further advancements in developing more practical,
intelligent, and autonomous data science agents.
☆ Enhancing Q&A Text Retrieval with Ranking Models: Benchmarking, fine-tuning and deploying Rerankers for RAG
Gabriel de Souza P. Moreira, Ronay Ak, Benedikt Schifferer, Mengyao Xu, Radek Osmulski, Even Oldridge
Ranking models play a crucial role in enhancing overall accuracy of text
retrieval systems. These multi-stage systems typically utilize either dense
embedding models or sparse lexical indices to retrieve relevant passages based
on a given query, followed by ranking models that refine the ordering of the
candidate passages by their relevance to the query.
This paper benchmarks various publicly available ranking models and examines
their impact on ranking accuracy. We focus on text retrieval for
question-answering tasks, a common use case for Retrieval-Augmented Generation
systems. The models we benchmark include some that are
commercially viable for industrial applications.
We introduce a state-of-the-art ranking model, NV-RerankQA-Mistral-4B-v3,
which achieves a significant accuracy increase of ~14% compared to pipelines
with other rerankers. We also provide an ablation study comparing the
fine-tuning of ranking models with different sizes, losses and self-attention
mechanisms.
Finally, we discuss challenges of text retrieval pipelines with ranking
models in real-world industry applications, in particular the trade-offs among
model size, ranking accuracy and system requirements like indexing and serving
latency / throughput.
comment: Accepted for the 1st Workshop on GenAI and RAG Systems for Enterprise
@ CIKM 2024
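The two-stage design described in this abstract (a cheap retriever followed by a ranking model that reorders the candidates) can be sketched as follows. This is a toy illustration only: `lexical_score` stands in for a sparse or dense retriever, `toy_reranker` stands in for a ranking model, and neither name nor scoring rule comes from the paper.

```python
def lexical_score(query, passage):
    """Stage 1 stand-in: number of distinct query terms found in the passage."""
    query_terms = set(query.lower().split())
    return len(query_terms & set(passage.lower().split()))


def toy_reranker(query, passage):
    """Stage 2 stand-in for a ranking model: term overlap normalized by passage length."""
    return lexical_score(query, passage) / len(passage.split())


def retrieve_then_rerank(query, corpus, k=2):
    """Keep the top-k passages from stage 1, then reorder them with stage 2."""
    candidates = sorted(corpus, key=lambda p: lexical_score(query, p), reverse=True)[:k]
    return sorted(candidates, key=lambda p: toy_reranker(query, p), reverse=True)


corpus = [
    "rerankers refine the order of retrieved passages for a query in a retrieval pipeline",
    "rerankers refine passages",
    "bananas are yellow",
]
ranked = retrieve_then_rerank("how do rerankers refine passages", corpus, k=2)
print(ranked)
```

The retriever only has to be good enough to put relevant passages among the top k; the more expensive ranking step then runs on k passages rather than on the whole corpus, which is where the size, accuracy and latency trade-offs mentioned in the abstract come in.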
☆ An Unsupervised Dialogue Topic Segmentation Model Based on Utterance Rewriting
Dialogue topic segmentation plays a crucial role in various types of dialogue
modeling tasks. The state-of-the-art unsupervised DTS methods learn topic-aware
discourse representations from conversation data through adjacent discourse
matching and pseudo segmentation to further mine useful clues in unlabeled
conversational relations. However, in multi-round dialogs, discourses often
have co-references or omissions, so using these discourses directly for
representation learning may negatively affect the semantic
similarity computation in the neighboring discourse matching task. In order to
fully utilize the useful cues in conversational relations, this study proposes
a novel unsupervised dialog topic segmentation method that combines the
Utterance Rewriting (UR) technique with an unsupervised learning algorithm to
efficiently utilize the useful cues in unlabeled dialogs by rewriting the
dialogs in order to recover the co-referents and omitted words. Compared with
existing unsupervised models, the proposed Discourse Rewriting Topic
Segmentation Model (UR-DTS) significantly improves the accuracy of topic
segmentation. The main finding is that the performance on DialSeg711 improves
by about 6% in terms of absolute error score and WD, achieving 11.42% in terms
of absolute error score and 12.97% in terms of WD. On Doc2Dial, the absolute
error score and WD improve by about 3% and 2%, respectively, resulting in SOTA
reaching 35.17% in terms of absolute error score and 38.49% in terms of WD.
This shows that the model is very effective in capturing the nuances of
conversational topics, as well as the usefulness and challenges of utilizing
unlabeled conversations.
comment: in Chinese language
♻ ☆ A Transfer Attack to Image Watermarks
Watermarks have been widely deployed by industry to detect AI-generated images.
The robustness of such watermark-based detectors against evasion attacks in the
white-box and black-box settings is well understood in the literature. However,
the robustness in the no-box setting is much less understood. In this work, we
propose a new transfer evasion attack on image watermarks in the no-box setting.
Our transfer attack adds a perturbation to a watermarked image to evade
multiple surrogate watermarking models trained by the attacker itself, and the
perturbed watermarked image also evades the target watermarking model. Our
major contribution is to show that, both theoretically and empirically,
a watermark-based AI-generated image detector is not robust to evasion attacks
even if the attacker has access to neither the watermarking model nor the
detection API.
♻ ☆ TeXBLEU: Automatic Metric for Evaluate LaTeX Format
LaTeX is suitable for creating specially formatted documents in science,
technology, mathematics, and computer science. Although the use of mathematical
expressions in LaTeX format along with language models is increasing, there are
no proper evaluation metrics to evaluate them. In this study, we propose
TeXBLEU, a metric for evaluating mathematical expressions in the LaTeX format
built on the n-gram-based BLEU metric widely used in translation tasks. The
proposed TeXBLEU consists of a predefined tokenizer trained on the arXiv paper
dataset and a fine-tuned embedding model with positional encoding. The TeXBLEU
score was calculated by replacing BLEU's modified precision score with the
similarity of n-gram-based tokens. TeXBLEU showed improvements of 86%, 121%,
and 610% over traditional evaluation metrics, such as BLEU, sacreBLEU, and
Rouge, respectively, on the MathBridge dataset with 1,000 data points. The code
is available at https://github.com/KyuDan1/TeXBLEU.
comment: 5 pages, 4 figures
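For readers unfamiliar with the quantity TeXBLEU replaces, here is a minimal sketch of the standard BLEU modified (clipped) n-gram precision. It is not the TeXBLEU computation itself: the whitespace tokenization and the function names are assumptions made for illustration, and the embedding-based similarity from the abstract is not reproduced.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, reference, n):
    """Clipped n-gram precision: each candidate n-gram counts at most as often
    as it occurs in the reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    return clipped / total


# Whitespace-tokenized LaTeX strings (an assumed, simplistic tokenization).
candidate = r"\frac { a } { b }".split()
reference = r"\frac { a } { c }".split()
p1 = modified_precision(candidate, reference, 1)
p2 = modified_precision(candidate, reference, 2)
print(p1, p2)
```

Here six of the seven candidate unigrams and four of the six candidate bigrams are matched, so a single wrong symbol is penalized purely by exact token mismatch; replacing this exact matching with a similarity between token embeddings is the change the abstract describes.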
♻ ☆ Profiling checkpointing schedules in adjoint ST-AD
Laurent Hascoët, Jean-Luc Bouchot, Shreyas Sunil Gaikwad, Sri Hari Krishna Narayanan, Jan Hückelheim
Checkpointing is a cornerstone of data-flow reversal in adjoint algorithmic
differentiation. Checkpointing is a storage/recomputation trade-off that can be
applied at different levels, one of which is the call tree. We are looking
for good placements of checkpoints onto the call tree of a given application,
to reduce run time and memory footprint of its adjoint. There is no known
optimal solution to this problem other than a combinatorial search on all
placements. We propose a heuristic based on run-time profiling of the adjoint
code. We describe the implementation of this profiling tool in an existing
source-transformation AD tool. We demonstrate the interest of this approach on
test cases taken from the MITgcm ocean and atmospheric global circulation
model. We discuss the limitations of our approach and propose directions to
lift them.
♻ ☆ GSIFN: A Graph-Structured and Interlaced-Masked Multimodal Transformer-based Fusion Network for Multimodal Sentiment Analysis
Multimodal Sentiment Analysis (MSA) leverages multiple data modalities to analyze
human sentiment. Existing MSA models generally employ cutting-edge multimodal
fusion and representation learning-based methods to promote MSA capability.
However, there are two key challenges: (i) in existing multimodal fusion
methods, the decoupling of modal combinations and tremendous parameter
redundancy lead to insufficient fusion performance and efficiency; (ii) a
challenging trade-off exists between representation capability and
computational overhead in unimodal feature extractors and encoders. Our
proposed GSIFN incorporates two main components to solve these problems: (i) a
graph-structured and interlaced-masked multimodal Transformer. It adopts the
Interlaced Mask mechanism to construct robust multimodal graph embedding,
achieve all-modal-in-one Transformer-based fusion, and greatly reduce the
computational overhead; (ii) a self-supervised learning framework with low
computational overhead and high performance, which utilizes a parallelized LSTM
with matrix memory to enhance non-verbal modal features for unimodal label
generation. Evaluated on the MSA datasets CMU-MOSI, CMU-MOSEI, and CH-SIMS,
GSIFN demonstrates superior performance with significantly lower computational
overhead compared with previous state-of-the-art models.
♻ ☆ DrugAgent: Explainable Drug Repurposing Agent with Large Language Model-based Reasoning
Drug repurposing offers a promising avenue for accelerating drug development
by identifying new therapeutic potentials of existing drugs. In this paper, we
propose a multi-agent framework to enhance the drug repurposing process using
state-of-the-art machine learning techniques and knowledge integration. Our
framework comprises several specialized agents: an AI Agent trains robust
drug-target interaction (DTI) models; a Knowledge Graph Agent utilizes the
drug-gene interaction database (DGIdb), DrugBank, Comparative Toxicogenomics
Database (CTD), and Search Tool for Interactions of Chemicals (STITCH) to
systematically extract DTIs; and a Search Agent interacts with biomedical
literature to annotate and verify computational predictions. By integrating
outputs from these agents, our system effectively harnesses diverse data
sources, including external databases, to propose viable repurposing
candidates. Preliminary results demonstrate the potential of our approach in
not only predicting drug-disease interactions but also in reducing the time and
cost associated with traditional drug discovery methods. This paper highlights
the scalability of multi-agent systems in biomedical research and their role in
driving innovation in drug repurposing. Our approach not only outperforms
existing methods in predicting drug repurposing potential but also provides
interpretable results, paving the way for more efficient and cost-effective
drug discovery processes.
comment: 18 pages, 1 figure
♻ ☆ What is the Role of Small Models in the LLM Era: A Survey
Large Language Models (LLMs) have made significant progress in advancing
artificial general intelligence (AGI), leading to the development of
increasingly large models such as GPT-4 and LLaMA-405B. However, scaling up
model sizes results in exponentially higher computational costs and energy
consumption, making these models impractical for academic researchers and
businesses with limited resources. At the same time, Small Models (SMs) are
frequently used in practical settings, although their significance is currently
underestimated. This raises important questions about the role of small models
in the era of LLMs, a topic that has received limited attention in prior
research. In this work, we systematically examine the relationship between LLMs
and SMs from two key perspectives: Collaboration and Competition. We hope this
survey provides valuable insights for practitioners, fostering a deeper
understanding of the contribution of small models and promoting more efficient
use of computational resources. The code is available at
https://github.com/tigerchen52/role_of_small_models
comment: a survey paper of small models
♻ ☆ Minimum projective linearizations of trees in linear time
The Minimum Linear Arrangement problem (MLA) consists of finding a mapping
$\pi$ from vertices of a graph to distinct integers that minimizes
$\sum_{\{u,v\}\in E}|\pi(u) - \pi(v)|$. In that setting, vertices are often
assumed to lie on a horizontal line and edges are drawn as semicircles above
said line. For trees, various algorithms are available to solve the problem in
polynomial time in $n=|V|$. There exist variants of the MLA in which the
arrangements are constrained. Iordanskii, and later Hochberg and Stallmann
(HS), put forward $O(n)$-time algorithms that solve the problem when
arrangements are constrained to be planar (also known as one-page book
embeddings). We also consider linear arrangements of rooted trees that are
constrained to be projective (planar embeddings where the root is not covered
by any edge). Gildea and Temperley (GT) sketched an algorithm for projective
arrangements which they claimed runs in $O(n)$ but did not provide any
justification of its cost. In contrast, Park and Levy claimed that GT's
algorithm runs in $O(n \log d_{max})$ where $d_{max}$ is the maximum degree but
did not provide sufficient detail. Here we correct an error in HS's algorithm
for the planar case, show its relationship with the projective case, and derive
simple algorithms for the projective and planar cases that run without a doubt
in $O(n)$ time.
comment: Here we have corrected a mistake we made in the previous version. In
particular, line 7 of Algorithm 3.2 used to say: "For i = 1 to |C_v| ..."; it
should be "For i = 2 to |C_v| ..." (notice the change from 'i=1' to 'i=2')
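To make the objective $\sum_{\{u,v\}\in E}|\pi(u) - \pi(v)|$ concrete, the sketch below evaluates the cost of a given arrangement and finds the unconstrained minimum by exhaustive search on a tiny tree. It is not one of the linear-time algorithms discussed in the abstract, and it does not enforce planarity or projectivity; the function names are illustrative assumptions.

```python
from itertools import permutations


def arrangement_cost(edges, pi):
    """Sum of |pi(u) - pi(v)| over all edges; pi maps each vertex to its position."""
    return sum(abs(pi[u] - pi[v]) for u, v in edges)


def brute_force_mla(vertices, edges):
    """Minimum cost over all n! arrangements (exponential; tiny inputs only)."""
    best_cost, best_pi = None, None
    for perm in permutations(vertices):
        pi = {v: position for position, v in enumerate(perm)}
        cost = arrangement_cost(edges, pi)
        if best_cost is None or cost < best_cost:
            best_cost, best_pi = cost, pi
    return best_cost, best_pi


# A star tree: centre 0 joined to leaves 1, 2 and 3.
star_edges = [(0, 1), (0, 2), (0, 3)]
centre_first = {0: 0, 1: 1, 2: 2, 3: 3}
cost_centre_first = arrangement_cost(star_edges, centre_first)
min_cost, _ = brute_force_mla([0, 1, 2, 3], star_edges)
print(cost_centre_first, min_cost)
```

Placing the centre at one end costs 1 + 2 + 3 = 6, while moving it to an inner position brings the cost down to 4, which is the kind of improvement the minimization is after.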
♻ ☆ Large Language Models and Cognitive Science: A Comprehensive Review of Similarities, Differences, and Challenges
This comprehensive review explores the intersection of Large Language Models
(LLMs) and cognitive science, examining similarities and differences between
LLMs and human cognitive processes. We analyze methods for evaluating LLMs'
cognitive abilities and discuss their potential as cognitive models. The review
covers applications of LLMs in various cognitive fields, highlighting insights
gained for cognitive science research. We assess cognitive biases and
limitations of LLMs, along with proposed methods for improving their
performance. The integration of LLMs with cognitive architectures is examined,
revealing promising avenues for enhancing artificial intelligence (AI)
capabilities. Key challenges and future research directions are identified,
emphasizing the need for continued refinement of LLMs to better align with
human cognition. This review provides a balanced perspective on the current
state and future potential of LLMs in advancing our understanding of both
artificial and human intelligence.
comment: 10 pages, 1 figure
♻ ☆ Predictability maximization and the origins of word order harmony
We address the linguistic problem of the sequential arrangement of a head and
its dependents from an information theoretic perspective. In particular, we
consider the optimal placement of a head that maximizes the predictability of
the sequence. We assume that dependents are statistically independent given a
head, in line with the open-choice principle and the core assumptions of
dependency grammar. We demonstrate the optimality of harmonic order, i.e.,
placing the head last maximizes the predictability of the head whereas placing
the head first maximizes the predictability of dependents. We also show that
postponing the head is the optimal strategy to maximize its predictability
while bringing it forward is the optimal strategy to maximize the
predictability of dependents. We unravel the advantages of the strategy of
maximizing the predictability of the head over maximizing the predictability of
dependents. Our findings shed light on the placements of the head adopted by
real languages or emerging in different kinds of experiments.
comment: Typos corrected; new references added
♻ ☆ How Easily do Irrelevant Inputs Skew the Responses of Large Language Models?
By leveraging the retrieval of information from external knowledge databases,
Large Language Models (LLMs) exhibit enhanced capabilities for accomplishing
many knowledge-intensive tasks. However, due to the inherent flaws of current
retrieval systems, there might exist irrelevant information within the
retrieved top-ranked passages. In this work, we present a comprehensive
investigation into the robustness of LLMs to different types of irrelevant
information under various conditions. We initially introduce a framework to
construct high-quality irrelevant information that ranges from semantically
unrelated, through partially related, to related to questions. Furthermore, our
analysis demonstrates that the constructed irrelevant information not only
scores highly on similarity metrics, being highly retrieved by existing
systems, but also bears semantic connections to the context. Our investigation
reveals that current LLMs still face challenges in discriminating highly
semantically related information and can be easily distracted by this
irrelevant yet misleading content. Besides, we also find that current solutions
for handling irrelevant information have limitations in improving the
robustness of LLMs to such distractions. All the resources are available on
GitHub at https://github.com/Di-viner/LLM-Robustness-to-Irrelevant-Information.
comment: COLM 2024
♻ ☆ Linear Adversarial Concept Erasure ICML 2022
Modern neural models trained on textual data rely on pre-trained
representations that emerge without direct supervision. As these
representations are increasingly being used in real-world applications, the
inability to control their content becomes an increasingly important
problem. We formulate the problem of identifying and erasing a linear subspace
that corresponds to a given concept, in order to prevent linear predictors from
recovering the concept. We model this problem as a constrained, linear maximin
game, and show that existing solutions are generally not optimal for this task.
We derive a closed-form solution for certain objectives, and propose a convex
relaxation, \method, that works well for others. When evaluated in the context
of binary gender removal, the method recovers a low-dimensional subspace whose
removal mitigates bias by intrinsic and extrinsic evaluation. We show that the
method is highly expressive, effectively mitigating bias in deep nonlinear
classifiers while maintaining tractability and interpretability.
comment: Accepted in ICML 2022; a revised version
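The basic operation behind erasing a linear subspace can be illustrated with a plain orthogonal projection that removes a single direction from a representation. This sketch assumes the concept direction is already known; it does not implement the maximin game or the convex relaxation the abstract uses to find the subspace, and all names and vectors are hypothetical.

```python
def dot(a, b):
    """Euclidean inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def erase_direction(x, v):
    """Project x onto the orthogonal complement of direction v:
    x' = x - (<x, v> / <v, v>) v."""
    scale = dot(x, v) / dot(v, v)
    return [xi - scale * vi for xi, vi in zip(x, v)]


concept_dir = [1.0, 1.0, 0.0]  # hypothetical concept direction
x = [3.0, 1.0, 2.0]            # hypothetical representation
x_erased = erase_direction(x, concept_dir)
print(x_erased, dot(x_erased, concept_dir))
```

After the projection, any linear predictor whose weight vector lies along `concept_dir` sees a constant score of zero, so it can no longer separate examples along that direction.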
♻ ☆ I Know About "Up"! Enhancing Spatial Reasoning in Visual Language Models Through 3D Reconstruction
Visual Language Models (VLMs) are essential for various tasks, particularly
visual reasoning tasks, due to their robust multi-modal information
integration, visual reasoning capabilities, and contextual awareness. However,
existing VLMs' visual spatial reasoning capabilities are often inadequate,
struggling even with basic tasks such as distinguishing left from right. To
address this, we propose the ZeroVLM model, designed to enhance the visual
spatial reasoning abilities of VLMs. ZeroVLM employs Zero-1-to-3, a 3D
reconstruction model for obtaining different views of the input images and
incorporates a prompting mechanism to further improve visual spatial reasoning.
Experimental results on four visual spatial reasoning datasets show that our
ZeroVLM achieves up to 19.48% accuracy improvement, which indicates the
effectiveness of the 3D reconstruction and prompting mechanisms of our ZeroVLM.
♻ ☆ Do These LLM Benchmarks Agree? Fixing Benchmark Evaluation with BenchBench
Yotam Perlitz, Ariel Gera, Ofir Arviv, Asaf Yehudai, Elron Bandel, Eyal Shnarch, Michal Shmueli-Scheuer, Leshem Choshen
Recent advancements in Language Models (LMs) have catalyzed the creation of
multiple benchmarks, designed to assess these models' general capabilities. A
crucial task, however, is assessing the validity of the benchmarks themselves.
This is most commonly done via Benchmark Agreement Testing (BAT), where new
benchmarks are validated against established ones using some agreement metric
(e.g., rank correlation). Despite the crucial role of BAT for benchmark
builders and consumers, there are no standardized procedures for such agreement
testing. This deficiency can lead to invalid conclusions, fostering mistrust in
benchmarks and upending the ability to properly choose the appropriate
benchmark to use. By analyzing over 40 prominent benchmarks, we demonstrate how
some overlooked methodological choices can significantly influence BAT results,
potentially undermining the validity of conclusions. To address these
inconsistencies, we propose a set of best practices for BAT and demonstrate how
utilizing these methodologies greatly improves BAT robustness and validity. To
foster adoption and facilitate future research, we introduce BenchBench, a
Python package for BAT, and release the BenchBench-leaderboard, a
meta-benchmark designed to evaluate benchmarks using their peers. Our findings
underscore the necessity for standardized BAT, ensuring the robustness and
validity of benchmark evaluations in the evolving landscape of language model
research.
BenchBench Package: github.com/IBM/BenchBench
Leaderboard: hf.co/spaces/IBM/BenchBench
comment: Under Review
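As an illustration of the agreement metric mentioned in the abstract (rank correlation between a new benchmark and an established one), here is a minimal pure-Python Kendall tau computation. It is a sketch, not the BenchBench API: the function name, the absence of tie correction, and the model scores are all assumptions made for the example.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall tau-a between two benchmarks scoring the same set of models.
    No tie correction is applied; tied pairs count as neither concordant
    nor discordant."""
    models = sorted(scores_a)
    concordant = discordant = 0
    for m1, m2 in combinations(models, 2):
        direction = (scores_a[m1] - scores_a[m2]) * (scores_b[m1] - scores_b[m2])
        if direction > 0:
            concordant += 1
        elif direction < 0:
            discordant += 1
    n_pairs = len(models) * (len(models) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical scores of four models on a new and an established benchmark.
new_benchmark = {"m1": 0.71, "m2": 0.64, "m3": 0.58, "m4": 0.40}
established = {"m1": 0.80, "m2": 0.62, "m3": 0.66, "m4": 0.35}
tau = kendall_tau(new_benchmark, established)
print(round(tau, 3))
```

The two benchmarks disagree only on the order of `m2` and `m3`, giving five concordant pairs and one discordant pair. With so few models, one swapped pair moves the statistic a lot, which suggests why choices such as which models and which reference benchmark are used can affect the conclusion.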
♻ ☆ The SAME score: Improved cosine based bias score for word embeddings
With the enormous popularity of large language models, many researchers have
raised ethical concerns regarding social biases incorporated in such models.
Several methods to measure social bias have been introduced, but apparently
these methods do not necessarily agree regarding the presence or severity of
bias. Furthermore, some works have shown theoretical issues or severe
limitations with certain bias measures. For that reason, we introduce SAME, a
novel bias score for semantic bias in embeddings. We conduct a thorough
theoretical analysis as well as experiments to show its benefits compared to
similar bias scores from the literature. We further highlight a substantial
relation of semantic bias measured by SAME with downstream bias, a connection
that has recently been argued to be negligible. Instead, we show that SAME is
capable of measuring semantic bias and identify potential causes for social
bias in downstream tasks.
comment: 12 pages, 3 figures
♻ ☆ Semantic Properties of cosine based bias scores for word embeddings
Plenty of works have brought social biases in language models to attention
and proposed methods to detect such biases. As a result, the literature
contains a great deal of different bias tests and scores, each introduced with
the premise to uncover yet more biases that other scores fail to detect. What
the literature severely lacks, however, are comparative studies that analyse
such bias scores and help researchers to understand the benefits or limitations
of the existing methods. In this work, we aim to close this gap for cosine
based bias scores. By building on a geometric definition of bias, we propose
requirements for bias scores to be considered meaningful for quantifying
biases. Furthermore, we formally analyze cosine based scores from the
literature with regard to these requirements. We underline these findings with
experiments to show that the bias scores' limitations have an impact in the
application case.
comment: 11 pages, 3 figures. arXiv admin note: text overlap with
arXiv:2111.07864
♻ ☆ Evaluating Metrics for Bias in Word Embeddings
Over the last years, word and sentence embeddings have become established as
text preprocessing for all kinds of NLP tasks and have improved performance
significantly. Unfortunately, it has also been shown that these embeddings
inherit various kinds of biases from the training data and thereby pass on
biases present in society to NLP solutions. Many papers attempted to quantify
bias in word or sentence embeddings to evaluate debiasing methods or compare
different embedding models, usually with cosine-based metrics. However, lately
some works have raised doubts about these metrics showing that even though such
metrics report low biases, other tests still show biases. In fact, there is a
great variety of bias metrics or tests proposed in the literature without any
consensus on the optimal solutions. Yet we lack works that evaluate bias
metrics on a theoretical level or elaborate the advantages and disadvantages of
different bias metrics. In this work, we will explore different cosine based
bias metrics. We formalize a bias definition based on the ideas from previous
works and derive conditions for bias metrics. Furthermore, we thoroughly
investigate the existing cosine-based metrics and their limitations to show why
these metrics can fail to report biases in some cases. Finally, we propose a
new metric, SAME, to address the shortcomings of existing metrics and
mathematically prove that SAME behaves appropriately.
comment: 32 pages, 8 figures
♻ ☆ StyleSinger: Style Transfer for Out-of-Domain Singing Voice Synthesis AAAI 2024
Yu Zhang, Rongjie Huang, Ruiqi Li, JinZheng He, Yan Xia, Feiyang Chen, Xinyu Duan, Baoxing Huai, Zhou Zhao
Style transfer for out-of-domain (OOD) singing voice synthesis (SVS) focuses
on generating high-quality singing voices with unseen styles (such as timbre,
emotion, pronunciation, and articulation skills) derived from reference singing
voice samples. However, the endeavor to model the intricate nuances of singing
voice styles is an arduous task, as singing voices possess a remarkable degree
of expressiveness. Moreover, existing SVS methods encounter a decline in the
quality of synthesized singing voices in OOD scenarios, as they rest upon the
assumption that the target vocal attributes are discernible during the training
phase. To overcome these challenges, we propose StyleSinger, the first singing
voice synthesis model for zero-shot style transfer of out-of-domain reference
singing voice samples. StyleSinger incorporates two critical approaches for
enhanced effectiveness: 1) the Residual Style Adaptor (RSA) which employs a
residual quantization module to capture diverse style characteristics in
singing voices, and 2) the Uncertainty Modeling Layer Normalization (UMLN) to
perturb the style attributes within the content representation during the
training phase and thus improve the model generalization. Our extensive
evaluations in zero-shot style transfer undeniably establish that StyleSinger
outperforms baseline models in both audio quality and similarity to the
reference singing voice samples. Access to singing voice samples can be found
at https://stylesinger.github.io/.
comment: Accepted by AAAI 2024
♻ ☆ On Leveraging Large Language Models for Enhancing Entity Resolution: A Cost-efficient Approach
Entity resolution, the task of identifying and merging records that refer to
the same real-world entity, is crucial in sectors like e-commerce, healthcare,
and law enforcement. Large Language Models (LLMs) introduce an innovative
approach to this task, capitalizing on their advanced linguistic capabilities
and a ``pay-as-you-go'' model that provides significant advantages to those
without extensive data science expertise. However, current LLMs are costly due
to per-API request billing. Existing methods often either lack quality or
become prohibitively expensive at scale. To address these problems, we propose
an uncertainty reduction framework using LLMs to improve entity resolution
results. We first initialize the possible partitions of records into clusters
that refer to the same entity, and define the uncertainty of the result. Then, we reduce
the uncertainty by selecting a few valuable matching questions for LLM
verification. Upon receiving the answers, we update the probability
distribution of the possible partitions. To further reduce costs, we design an
efficient algorithm to judiciously select the most valuable matching pairs to
query. Additionally, we create error-tolerant techniques to handle LLM mistakes
and a dynamic adjustment method to reach truly correct partitions. Experimental
results show that our method is efficient and effective, offering promising
applications in real-world tasks.
comment: 9 pages, preprint under review
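The loop described in this abstract (keep a probability distribution over candidate partitions, ask a matching question, update the distribution, measure the remaining uncertainty) can be sketched with a simple Bayesian update. The abstract does not state the actual update or selection rules, so everything here is an assumption for illustration: uncertainty is taken to be Shannon entropy, the answer is modeled as wrong with a fixed `error_rate`, and the records and priors are made up.

```python
import math


def entropy(dist):
    """Shannon entropy (in bits) of a distribution over partitions."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def same_cluster(partition, a, b):
    """True if records a and b share a block in the partition."""
    return any(a in block and b in block for block in partition)


def update(dist, a, b, answer_is_match, error_rate=0.1):
    """Bayes update after an answer to 'do a and b match?', where the
    answer is assumed to be wrong with probability error_rate."""
    posterior = {}
    for partition, prior in dist.items():
        agrees = same_cluster(partition, a, b) == answer_is_match
        posterior[partition] = prior * ((1 - error_rate) if agrees else error_rate)
    total = sum(posterior.values())
    return {partition: weight / total for partition, weight in posterior.items()}


# Two hypothetical candidate partitions of records r1, r2, r3.
merged = frozenset({frozenset({"r1", "r2"}), frozenset({"r3"})})
split = frozenset({frozenset({"r1"}), frozenset({"r2"}), frozenset({"r3"})})
prior = {merged: 0.5, split: 0.5}

posterior = update(prior, "r1", "r2", answer_is_match=True)
print(posterior[merged], entropy(prior), entropy(posterior))
```

A single answer moves the probability of the merged partition from 0.5 to 0.9 and lowers the entropy from 1 bit to roughly 0.47 bits. Because a nonzero error rate never drives a partition's probability to zero, a later contradicting answer can still shift the distribution back, which is one simple way to tolerate occasional wrong answers.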
♻ ☆ RoboUniView: Visual-Language Model with Unified View Representation for Robotic Manipulation
Utilizing Vision-Language Models (VLMs) for robotic manipulation represents a
novel paradigm, aiming to enhance the model's ability to generalize to new
objects and instructions. However, due to variations in camera specifications
and mounting positions, existing methods exhibit significant performance
disparities across different robotic platforms. To address this challenge, we
propose RoboUniView in this paper, an innovative approach that decouples visual
feature extraction from action learning. We first learn a unified view
representation from multi-perspective views by pre-training on readily
accessible data, and then derive actions from this unified view representation
to control robotic manipulation. This unified view representation more
accurately mirrors the physical world and is not constrained by the robotic
platform's camera parameters. Thanks to this methodology, we achieve
state-of-the-art performance on the demanding CALVIN benchmark, enhancing the
success rate in the $D \to D$ setting from 93.0% to 96.2%, and in the $ABC \to
D$ setting from 92.2% to 94.2%. Moreover, our model exhibits outstanding
adaptability and flexibility: it maintains high performance under unseen camera
parameters, can utilize multiple datasets with varying camera parameters, and
is capable of joint cross-task learning across datasets. Code is provided for
re-implementation. https://github.com/liufanfanlff/RoboUniview
♻ ☆ Representational Analysis of Binding in Large Language Models
Entity tracking is essential for complex reasoning. To perform in-context
entity tracking, language models (LMs) must bind an entity to its attribute
(e.g., bind a container to its content) to recall the attribute for a given entity.
For example, given a context mentioning ``The coffee is in Box Z, the stone is
in Box M, the map is in Box H'', to infer ``Box Z contains the coffee'' later,
LMs must bind ``Box Z'' to ``coffee''. To explain the binding behaviour of LMs,
Feng and Steinhardt (2023) introduce a Binding ID mechanism and state that LMs
use an abstract concept called Binding ID (BI) to internally mark
entity-attribute pairs. However, they have not directly captured the BI
determinant information from entity activations. In this work, we provide a
novel view of the Binding ID mechanism by localizing the prototype of BI
information. Specifically, we discover that there exists a low-rank subspace in
the hidden state (or activation) of LMs, that primarily encodes the order of
entity and attribute and which is used as the prototype of BI to causally
determine the binding. To identify this subspace, we choose principal component
analysis as our first attempt and it is empirically proven to be effective.
Moreover, we also discover that when editing representations along directions
in the subspace, LMs tend to bind a given entity to other attributes
accordingly. For example, by patching activations along the BI encoding
direction we can make the LM infer ``Box Z contains the stone'' and ``Box Z
contains the map''.
comment: The key phrase "BI Subspace" might be misleading, because it sounds
like the subspace that directly encodes BI, which differs from its
intended meaning: the subspace that is the base (or prototype) of BI.
Therefore, the naming of the subspace and its corresponding wording needs
further discussion and review
♻ ☆ A Survey of Backdoor Attacks and Defenses on Large Language Models: Implications for Security Measures
Shuai Zhao, Meihuizi Jia, Zhongliang Guo, Leilei Gan, Xiaoyu Xu, Xiaobao Wu, Jie Fu, Yichao Feng, Fengjun Pan, Luu Anh Tuan
Large Language Models (LLMs), which bridge the gap between human language
understanding and complex problem-solving, achieve state-of-the-art performance
on several NLP tasks, particularly in few-shot and zero-shot settings. Despite
the demonstrable efficacy of LLMs, due to constraints on computational
resources, users have to engage with open-source language models or outsource
the entire training process to third-party platforms. However, research has
demonstrated that language models are susceptible to potential security
vulnerabilities, particularly in backdoor attacks. Backdoor attacks are
designed to introduce targeted vulnerabilities into language models by
poisoning training samples or model weights, allowing attackers to manipulate
model responses through malicious triggers. While existing surveys on backdoor
attacks provide a comprehensive overview, they lack an in-depth examination of
backdoor attacks specifically targeting LLMs. To bridge this gap and grasp the
latest trends in the field, this paper presents a novel perspective on backdoor
attacks for LLMs by focusing on fine-tuning methods. Specifically, we
systematically classify backdoor attacks into three categories: full-parameter
fine-tuning, parameter-efficient fine-tuning, and no fine-tuning. Based on
insights from a substantial review, we also discuss crucial issues for future
research on backdoor attacks, such as further exploring attack algorithms that
do not require fine-tuning, or developing more covert attack algorithms.