Computation and Language 59
☆ OneLLM: One Framework to Align All Modalities with Language
Jiaming Han, Kaixiong Gong, Yiyuan Zhang, Jiaqi Wang, Kaipeng Zhang, Dahua Lin, Yu Qiao, Peng Gao, Xiangyu Yue
Multimodal large language models (MLLMs) have gained significant attention
due to their strong multimodal understanding capability. However, existing
works rely heavily on modality-specific encoders, which usually differ in
architecture and are limited to common modalities. In this paper, we present
OneLLM, an MLLM that aligns eight modalities to language using a unified
framework. We achieve this through a unified multimodal encoder and a
progressive multimodal alignment pipeline. In detail, we first train an image
projection module to connect a vision encoder with the LLM. Then, we build a
universal projection module (UPM) by mixing multiple image projection modules
with dynamic routing. Finally, we progressively align more modalities to the
LLM with the UPM. To fully leverage the potential of OneLLM in following
instructions, we also curate a comprehensive multimodal instruction dataset,
including 2M items from image, audio, video, point cloud, depth/normal map, IMU
and fMRI brain activity. OneLLM is evaluated on 25 diverse benchmarks,
encompassing tasks such as multimodal captioning, question answering and
reasoning, where it delivers excellent performance. Code, data, model and
online demo are available at https://github.com/csuhan/OneLLM
comment: Code: https://github.com/csuhan/OneLLM
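
As a rough illustration of the mechanism the abstract describes, here is a minimal sketch of a universal projection module that mixes several projection experts with a dynamic router; all dimensions, names, and the soft routing rule are illustrative assumptions, not OneLLM's actual implementation.

    import torch
    import torch.nn as nn

    class UniversalProjection(nn.Module):
        def __init__(self, in_dim=1024, llm_dim=4096, num_experts=3):
            super().__init__()
            # Each expert plays the role of one image projection module.
            self.experts = nn.ModuleList(
                [nn.Linear(in_dim, llm_dim) for _ in range(num_experts)])
            # The router predicts soft mixing weights from the input tokens.
            self.router = nn.Linear(in_dim, num_experts)

        def forward(self, x):  # x: (batch, tokens, in_dim) from a unified encoder
            weights = torch.softmax(self.router(x), dim=-1)        # (B, T, E)
            outs = torch.stack([e(x) for e in self.experts], -1)   # (B, T, D, E)
            return (outs * weights.unsqueeze(2)).sum(-1)           # (B, T, D)

    tokens = torch.randn(2, 16, 1024)  # e.g., audio or depth-map features
    print(UniversalProjection()(tokens).shape)  # torch.Size([2, 16, 4096])
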
☆ PROMISE: A Framework for Model-Driven Stateful Prompt Orchestration
Wenyuan Wu, Jasmin Heierli, Max Meisterhans, Adrian Moser, Andri Färber, Mateusz Dolata, Elena Gavagnin, Alexandre de Spindler, Gerhard Schwabe
The advent of increasingly powerful language models has raised expectations
for language-based interactions. However, controlling these models is a
challenge, emphasizing the need to investigate the feasibility and
value of their application. We present PROMISE, a framework that facilitates
the development of complex language-based interactions with information
systems. Its use of state machine modeling concepts enables model-driven,
dynamic prompt orchestration across hierarchically nested states and
transitions. This improves the control of the behavior of language models and
thus enables their effective and efficient use. We show the benefits of PROMISE
in the context of application scenarios within health information systems and
demonstrate its ability to handle complex interactions.
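
To make the state-machine idea concrete, the following is a minimal sketch of prompt orchestration driven by states and guarded transitions; the State/run_turn API below is hypothetical, not the PROMISE framework's actual interface.

    from dataclasses import dataclass, field

    @dataclass
    class State:
        name: str
        system_prompt: str
        transitions: list = field(default_factory=list)  # (predicate, next_state)

    def run_turn(state, user_msg, llm):
        reply = llm(state.system_prompt, user_msg)  # llm: any chat-model callable
        for predicate, nxt in state.transitions:
            if predicate(user_msg, reply):          # guard decides the transition
                return reply, nxt
        return reply, state                         # otherwise stay in this state

    intake = State("intake", "Collect the patient's symptoms, one question at a time.")
    triage = State("triage", "Summarize the symptoms and suggest a next step.")
    intake.transitions.append((lambda u, r: "done" in u.lower(), triage))
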
☆ Evaluating and Mitigating Discrimination in Language Model Decisions
Alex Tamkin, Amanda Askell, Liane Lovitt, Esin Durmus, Nicholas Joseph, Shauna Kravec, Karina Nguyen, Jared Kaplan, Deep Ganguli
As language models (LMs) advance, interest is growing in applying them to
high-stakes societal decisions, such as determining financing or housing
eligibility. However, their potential for discrimination in such contexts
raises ethical concerns, motivating the need for better methods to evaluate
these risks. We present a method for proactively evaluating the potential
discriminatory impact of LMs in a wide range of use cases, including
hypothetical use cases where they have not yet been deployed. Specifically, we
use an LM to generate a wide array of potential prompts that decision-makers
may input into an LM, spanning 70 diverse decision scenarios across society,
and systematically vary the demographic information in each prompt. Applying
this methodology reveals patterns of both positive and negative discrimination
in the Claude 2.0 model in select settings when no interventions are applied.
While we do not endorse or permit the use of language models to make automated
decisions for the high-risk use cases we study, we demonstrate techniques to
significantly decrease both positive and negative discrimination through
careful prompt engineering, providing pathways toward safer deployment in use
cases where they may be appropriate. Our work enables developers and
policymakers to anticipate, measure, and address discrimination as language
model capabilities and applications continue to expand. We release our dataset
and prompts at https://huggingface.co/datasets/Anthropic/discrim-eval
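
A minimal sketch of the evaluation recipe the abstract describes: hold a decision scenario fixed and systematically vary only the demographic attributes in the prompt. The template and attribute values below are illustrative, not the released dataset's exact contents.

    from itertools import product

    TEMPLATE = ("The applicant is a {age}-year-old {gender} {race} person "
                "applying for a small business loan. Should the loan be "
                "approved? Answer yes or no.")

    ages = [20, 40, 60]
    genders = ["male", "female", "non-binary"]
    races = ["white", "Black", "Asian", "Hispanic"]

    prompts = [TEMPLATE.format(age=a, gender=g, race=r)
               for a, g, r in product(ages, genders, races)]

    # Discrimination shows up as systematic differences in the model's
    # decisions across variants that differ only in demographics; each
    # prompt would be sent to the model under test and P("yes") compared.
    print(len(prompts), "prompt variants for one scenario")
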
☆ An Integration of Pre-Trained Speech and Language Models for End-to-End Speech Recognition
Advances in machine learning have made it possible to perform various text
and speech processing tasks, including automatic speech recognition (ASR), in
an end-to-end (E2E) manner. Since typical E2E approaches require large amounts
of training data and resources, leveraging pre-trained foundation models
instead of training from scratch is gaining attention. Although there have been
attempts to use pre-trained speech and language models in ASR, most of them are
limited to using one or the other. This paper explores the potential of integrating a
pre-trained speech representation model with a large language model (LLM) for
E2E ASR. The proposed model enables E2E ASR by autoregressively generating text
tokens conditioned on speech representations used as speech prompts, taking
advantage of the vast knowledge provided by the LLM. Furthermore, the proposed
model can incorporate remarkable developments for LLM utilization, such as
inference optimization and parameter-efficient domain adaptation. Experimental
results show that the proposed model achieves performance comparable to modern
E2E ASR models.
comment: 6 pages, 2 figures, 3 tables, The model is available at
https://huggingface.co/rinna/nue-asr
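
A hedged sketch of the "speech representations as speech prompts" idea: projected speech features are prepended to the LM's token embeddings before autoregressive decoding. Module sizes are placeholders and the tiny Transformer merely stands in for the pre-trained LLM.

    import torch
    import torch.nn as nn

    class SpeechPromptedLM(nn.Module):
        def __init__(self, speech_dim=768, llm_dim=2048, vocab=32000):
            super().__init__()
            self.proj = nn.Linear(speech_dim, llm_dim)  # bridges the two models
            self.embed = nn.Embedding(vocab, llm_dim)   # stands in for the LLM
            self.lm = nn.TransformerEncoder(
                nn.TransformerEncoderLayer(llm_dim, 8, batch_first=True), 2)
            self.head = nn.Linear(llm_dim, vocab)

        def forward(self, speech_feats, token_ids):
            prefix = self.proj(speech_feats)  # (B, S, D) speech prompt
            text = self.embed(token_ids)      # (B, T, D) text decoded so far
            hidden = self.lm(torch.cat([prefix, text], dim=1))
            return self.head(hidden[:, prefix.size(1):])  # logits for text positions

    logits = SpeechPromptedLM()(torch.randn(1, 50, 768), torch.tensor([[1, 2, 3]]))
    print(logits.shape)  # torch.Size([1, 3, 32000])
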
☆ Generative agent-based modeling with actions grounded in physical, social, or digital space using Concordia
Alexander Sasha Vezhnevets, John P. Agapiou, Avia Aharon, Ron Ziv, Jayd Matyas, Edgar A. Duéñez-Guzmán, William A. Cunningham, Simon Osindero, Danny Karmon, Joel Z. Leibo
Agent-based modeling has been around for decades, and applied widely across
the social and natural sciences. The scope of this research method is now
poised to grow dramatically as it absorbs the new affordances provided by Large
Language Models (LLMs). Generative Agent-Based Models (GABMs) are not just
classic Agent-Based Models (ABMs) where the agents talk to one another. Rather,
GABMs are constructed using an LLM to apply common sense to situations, act
"reasonably", recall common semantic knowledge, produce API calls to control
digital technologies like apps, and communicate both within the simulation and
to researchers viewing it from the outside. Here we present Concordia, a
library to facilitate constructing and working with GABMs. Concordia makes it
easy to construct language-mediated simulations of physically- or
digitally-grounded environments. Concordia agents produce their behavior using
a flexible component system which mediates between two fundamental operations:
LLM calls and associative memory retrieval. A special agent called the Game
Master (GM), which was inspired by tabletop role-playing games, is responsible
for simulating the environment where the agents interact. Agents take actions
by describing what they want to do in natural language. The GM then translates
their actions into appropriate implementations. In a simulated physical world,
the GM checks the physical plausibility of agent actions and describes their
effects. In digital environments simulating technologies such as apps and
services, the GM may handle API calls to integrate with external tools such as
general AI assistants (e.g., Bard, ChatGPT), and digital apps (e.g., Calendar,
Email, Search, etc.). Concordia was designed to support a wide array of
applications both in scientific research and for evaluating performance of real
digital services by simulating users and/or generating synthetic data.
comment: 31 pages, 5 figures
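
A toy sketch of the Game Master pattern: agents act in natural language and the GM translates each utterance into an environment effect. The rule-based resolver below is a stand-in for an LLM call; Concordia's actual components and associative memory are far richer.

    def gm_resolve(action_text, world):
        # Stand-in for an LLM judging plausibility and narrating effects.
        if "open the door" in action_text and not world["door_locked"]:
            world["door_open"] = True
            return "The door creaks open."
        return "Nothing happens."

    world = {"door_locked": False, "door_open": False}
    print(gm_resolve("Alice tries to open the door", world), world)
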
★ Interpretability Illusions in the Generalization of Simplified Models
A common method to study deep learning systems is to use simplified model
representations -- for example, using singular value decomposition to visualize
the model's hidden states in a lower dimensional space. This approach assumes
that the results of these simplified representations are faithful to the original model. Here,
we illustrate an important caveat to this assumption: even if the simplified
representations can accurately approximate the full model on the training set,
they may fail to accurately capture the model's behavior out of distribution --
the understanding developed from simplified representations may be an illusion.
We illustrate this by training Transformer models on controlled datasets with
systematic generalization splits. First, we train models on the Dyck
balanced-parenthesis languages. We simplify these models using tools like
dimensionality reduction and clustering, and then explicitly test how these
simplified proxies match the behavior of the original model on various
out-of-distribution test sets. We find that the simplified proxies are
generally less faithful out of distribution. In cases where the original model
generalizes to novel structures or greater depths, the simplified versions may
fail, or may even generalize better. This finding holds even if the simplified
representations do not directly depend on the training distribution. Next, we
study a more naturalistic task: predicting the next character in a dataset of
computer code. We find similar generalization gaps between the original model
and simplified proxies, and conduct further analysis to investigate which
aspects of the code completion task are associated with the largest gaps.
Together, our results raise questions about the extent to which mechanistic
interpretations derived using tools like SVD can reliably predict what a model
will do in novel situations.
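
The caveat can be reproduced in a few lines on synthetic data: a proxy built from the top principal components of the training states agrees with the "full model" in distribution but diverges on shifted states. Everything below is a synthetic stand-in, not the paper's Transformer setup.

    import numpy as np

    rng = np.random.default_rng(0)
    d, k = 64, 8
    basis = np.linalg.qr(rng.normal(size=(d, d)))[0]  # orthonormal directions
    w = rng.normal(size=d)                            # the "model's" readout

    def model(h):                                     # full-model behavior
        return (h @ w > 0).astype(int)

    # Training states live in k directions; OOD states excite all of them.
    h_train = rng.normal(size=(2000, k)) @ basis[:, :k].T
    h_ood = rng.normal(size=(2000, d)) @ basis.T

    # Simplified proxy: project states onto the top-k principal components
    # of the *training* set before applying the same readout.
    mean = h_train.mean(0)
    U = np.linalg.svd(h_train - mean, full_matrices=False)[2][:k]

    def proxy(h):
        return (((h - mean) @ U.T @ U + mean) @ w > 0).astype(int)

    for name, h in [("in-dist", h_train), ("OOD", h_ood)]:
        print(name, "proxy/model agreement:", (model(h) == proxy(h)).mean())
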
☆ Not All Large Language Models (LLMs) Succumb to the "Reversal Curse": A Comparative Study of Deductive Logical Reasoning in BERT and GPT Models
The "Reversal Curse" refers to the scenario where auto-regressive decoder
large language models (LLMs), such as ChatGPT, trained on "A is B" fail to
learn "B is A", demonstrating a basic failure of logical deduction. This raises
a red flag in the use of GPT models for certain general tasks such as
constructing knowledge graphs, considering their adherence to this symmetric
principle. In our study, we examined a bidirectional LLM, BERT, and found that
it is immune to the reversal curse. Driven by ongoing efforts to construct
biomedical knowledge graphs with LLMs, we also embarked on evaluating more
complex but essential deductive reasoning capabilities. This process included
first training encoder and decoder language models to master the intersection
($\cap$) and union ($\cup$) operations on two sets and then moving on to assess
their capability to infer different combinations of union ($\cup$) and
intersection ($\cap$) operations on three newly created sets. The findings
showed that while both encoder and decoder language models, trained for tasks
involving two sets (union/intersection), were proficient in such scenarios,
they encountered difficulties when dealing with operations that included three
sets (various combinations of union and intersection). Our research highlights
the distinct characteristics of encoder and decoder models in simple and
complex logical reasoning. In practice, the choice between BERT and GPT should
be guided by the specific requirements and nature of the task at hand,
leveraging their respective strengths in bidirectional context comprehension
and sequence prediction.
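
A sketch of the kind of probe data described: train on two-set union/intersection and hold out three-set compositions. Entity names and phrasing are illustrative, not the paper's exact prompts.

    import random

    random.seed(0)
    universe = [f"item{i}" for i in range(20)]

    def sample_set(k=5):
        return set(random.sample(universe, k))

    def verbalize(s):
        return "{" + ", ".join(sorted(s)) + "}"

    def two_set_example():                    # seen during training
        a, b = sample_set(), sample_set()
        op = random.choice(["union", "intersection"])
        result = a | b if op == "union" else a & b
        prompt = f"A = {verbalize(a)}. B = {verbalize(b)}. What is the {op} of A and B?"
        return prompt, verbalize(result)

    def three_set_example():                  # held-out compositional test
        a, b, c = sample_set(), sample_set(), sample_set()
        prompt = (f"A = {verbalize(a)}. B = {verbalize(b)}. C = {verbalize(c)}. "
                  "What is the union of A with the intersection of B and C?")
        return prompt, verbalize(a | (b & c))

    print(two_set_example()[0])
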
☆ Improving Bias Mitigation through Bias Experts in Natural Language Understanding EMNLP 2023
Biases in the dataset often enable the model to achieve high performance on
in-distribution data, while poorly performing on out-of-distribution data. To
mitigate the detrimental effect of the bias on the networks, previous works
have proposed debiasing methods that down-weight the biased examples identified
by an auxiliary model, which is trained with explicit bias labels. However,
identifying the type of bias present in a dataset is costly. Therefore, recent
studies have attempted to make the auxiliary model biased without the guidance
(or annotation) of bias labels, by constraining the model's training
environment or the capability of the model itself. Despite the promising
debiasing results of recent works, the multi-class learning objective, which
has been naively used to train the auxiliary model, may harm the bias
mitigation effect due to its regularization effect and competitive nature
across classes. As an alternative, we propose a new debiasing framework that
introduces binary classifiers between the auxiliary model and the main model,
coined bias experts. Specifically, each bias expert is trained on a binary
classification task derived from the multi-class classification task via the
One-vs-Rest approach. Experimental results demonstrate that our proposed
strategy improves the bias identification ability of the auxiliary model.
Consequently, our debiased model consistently outperforms the state-of-the-art
on various challenge datasets.
comment: Accepted in EMNLP 2023 as a long paper
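
A hedged sketch of the bias-experts idea: replace the multi-class auxiliary model with One-vs-Rest binary classifiers and down-weight examples they find easy. The features, classifiers, and weighting rule below are illustrative stand-ins, not the paper's setup.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=600, n_classes=3, n_informative=6,
                               random_state=0)

    experts = [LogisticRegression(max_iter=1000).fit(X, (y == c).astype(int))
               for c in range(3)]             # one binary expert per class

    # High expert confidence on the gold class suggests the example is
    # solvable from shallow (likely biased) cues, so it gets a low weight
    # when training the main model.
    p_gold = np.array([experts[c].predict_proba(X[i:i + 1])[0, 1]
                       for i, c in enumerate(y)])
    example_weights = 1.0 - p_gold
    print("mean training weight:", example_weights.mean().round(3))
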
☆ XAIQA: Explainer-Based Data Augmentation for Extractive Question Answering ML4H
Joel Stremmel, Ardavan Saeedi, Hamid Hassanzadeh, Sanjit Batra, Jeffrey Hertzberg, Jaime Murillo, Eran Halperin
Extractive question answering (QA) systems can enable physicians and
researchers to query medical records, a foundational capability for designing
clinical studies and understanding patient medical history. However, building
these systems typically requires expert-annotated QA pairs. Large language
models (LLMs), which can perform extractive QA, depend on high-quality data in
their prompts, specialized for the application domain. We introduce a novel
approach, XAIQA, for generating synthetic QA pairs at scale from data naturally
available in electronic health records. Our method uses the idea of a
classification model explainer to generate questions and answers about medical
concepts corresponding to medical codes. In an expert evaluation with two
physicians, our method identifies $2.2\times$ more semantic matches and
$3.8\times$ more clinical abbreviations than two popular approaches that use
sentence transformers to create QA pairs. In an ML evaluation, adding our QA
pairs improves performance of GPT-4 as an extractive QA model, including on
difficult questions. In both the expert and ML evaluations, we examine
trade-offs between our method and sentence transformers for QA pair generation
depending on question difficulty.
comment: Extended Abstract presented at Machine Learning for Health (ML4H)
symposium 2023, December 10th, 2023, New Orleans, United States, 8 pages
☆ Holmes: Towards Distributed Training Across Clusters with Heterogeneous NIC Environment
Large language models (LLMs) such as GPT-3, OPT, and LLaMA have demonstrated
remarkable accuracy in a wide range of tasks. However, training these models
can incur significant expenses, often requiring tens of thousands of GPUs for
months of continuous operation. Typically, this training is carried out in
specialized GPU clusters equipped with homogeneous high-speed Remote Direct
Memory Access (RDMA) network interface cards (NICs). The acquisition and
maintenance of such dedicated clusters is challenging. Current LLM training
frameworks, like Megatron-LM and Megatron-DeepSpeed, focus primarily on
optimizing training within homogeneous cluster settings. In this paper, we
introduce Holmes, a training framework for LLMs that employs thoughtfully
crafted data and model parallelism strategies over the heterogeneous NIC
environment. Our primary technical contribution lies in a novel scheduling
method that intelligently allocates distinct computational tasklets in LLM
training to specific groups of GPU devices based on the characteristics of
their connected NICs. Furthermore, our proposed framework, utilizing pipeline
parallel techniques, demonstrates scalability to multiple GPU clusters, even in
scenarios without high-speed interconnects between nodes in distinct clusters.
We conducted comprehensive experiments that involved various scenarios in the
heterogeneous NIC environment. In most cases, our framework achieves
performance levels close to those achievable with homogeneous RDMA-capable
networks (InfiniBand or RoCE), significantly exceeding training efficiency
within the pure Ethernet environment. Additionally, we verified that our
framework outperforms other mainstream LLM frameworks under heterogeneous NIC
environments in terms of training efficiency and can be seamlessly integrated
with them.
comment: 14 pages
☆ Sig-Networks Toolkit: Signature Networks for Longitudinal Language Modelling
Talia Tseriotou, Ryan Sze-Yin Chan, Adam Tsakalidis, Iman Munire Bilal, Elena Kochkina, Terry Lyons, Maria Liakata
We present an open-source, pip installable toolkit, Sig-Networks, the first
of its kind for longitudinal language modelling. A central focus is the
incorporation of Signature-based Neural Network models, which have recently
shown success in temporal tasks. We apply and extend published research
providing a full suite of signature-based models. Their components can be used
as PyTorch building blocks in future architectures. Sig-Networks enables
task-agnostic dataset plug-in, seamless pre-processing for sequential data,
parameter flexibility, and automated tuning across a range of models. We examine
signature networks under three different NLP tasks of varying temporal
granularity: counselling conversations, rumour stance switch and mood changes
in social media threads, showing SOTA performance in all three, and provide
guidance for future tasks. We release the Toolkit as a PyTorch package with an
introductory video, Git repositories for preprocessing and modelling, including
sample notebooks on the modelled NLP tasks.
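
The mathematical object underlying these models is the path signature. As a hedged, minimal illustration, the depth-2 signature of a piecewise-linear path can be computed with Chen's identity (the toolkit itself relies on optimized signature libraries rather than this loop):

    import numpy as np

    def signature_depth2(path):
        """path: (T, d) array of points; returns level-1 and level-2 terms."""
        d = path.shape[1]
        s1 = np.zeros(d)        # increments: S^i = x_T^i - x_0^i
        s2 = np.zeros((d, d))   # iterated integrals S^{i,j}
        for dx in np.diff(path, axis=0):   # one linear segment at a time
            # Chen's identity: the new level-2 term adds the cross term
            # s1 (x) dx plus the segment's own 0.5 * dx (x) dx.
            s2 += np.outer(s1, dx) + 0.5 * np.outer(dx, dx)
            s1 += dx
        return s1, s2

    t = np.linspace(0, 1, 100)
    path = np.stack([t, t ** 2], axis=1)   # the 2-d path (t, t^2)
    s1, s2 = signature_depth2(path)
    # The antisymmetric part of s2 is the path's Levy area (here ~1/6).
    print(s1, (s2[0, 1] - s2[1, 0]) / 2)
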
☆ Exploring Answer Information Methods for Question Generation with Transformers
There has been a lot of work in question generation, where different methods
of providing target answers as input have been employed. This experimentation
has been mostly carried out for RNN based models. We use three different
methods and their combinations for incorporating answer information and explore
their effect on several automatic evaluation metrics. The methods used are
answer prompting, a custom product method combining answer embeddings and
encoder outputs, choosing sentences from the input paragraph that contain
answer-related information, and a separate cross-attention block in the
decoder which attends to the answer. We observe that answer prompting without
any additional methods obtains the best ROUGE and METEOR scores. Additionally,
we use a custom metric to calculate how many of the generated questions have
the same answer as the one used to generate them.
☆ AMR Parsing is Far from Solved: GrAPES, the Granular AMR Parsing Evaluation Suite EMNLP 2023
We present the Granular AMR Parsing Evaluation Suite (GrAPES), a challenge
set for Abstract Meaning Representation (AMR) parsing with accompanying
evaluation metrics. AMR parsers now obtain high scores on the standard AMR
evaluation metric Smatch, close to or even above reported inter-annotator
agreement. But that does not mean that AMR parsing is solved; in fact, human
evaluation in previous work indicates that current parsers still quite
frequently make errors on node labels or graph structure that substantially
distort sentence meaning. Here, we provide an evaluation suite that tests AMR
parsers on a range of phenomena of practical, technical, and linguistic
interest. Our 36 categories range from seen and unseen labels, to structural
generalization, to coreference. GrAPES reveals in depth the abilities and
shortcomings of current AMR parsers.
comment: Accepted at EMNLP 2023. For the associated GitHub repository, see
https://github.com/jgroschwitz/GrAPES
☆ DBCopilot: Scaling Natural Language Querying to Massive Databases
Text-to-SQL simplifies database interactions by enabling non-experts to
convert their natural language (NL) questions into Structured Query Language
(SQL) queries. While recent advances in large language models (LLMs) have
improved the zero-shot text-to-SQL paradigm, existing methods face scalability
challenges when dealing with massive, dynamically changing databases. This
paper introduces DBCopilot, a framework that addresses these challenges by
employing a compact and flexible copilot model for routing across massive
databases. Specifically, DBCopilot decouples the text-to-SQL process into
schema routing and SQL generation, leveraging a lightweight
sequence-to-sequence neural network-based router to formulate database
connections and navigate natural language questions through databases and
tables. The routed schemas and questions are then fed into LLMs for efficient
SQL generation. Furthermore, DBCopilot introduces a reverse
schema-to-question generation paradigm that can learn and adapt the router over
massive databases automatically, without requiring manual intervention.
Experimental results demonstrate that DBCopilot is a scalable and effective
solution for real-world text-to-SQL tasks, providing a significant advancement
in handling large-scale schemas.
comment: Code and data are available at https://github.com/tshu-w/DBCopilot
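
A minimal sketch of the decoupled pipeline: a lightweight router first picks a schema, and only the routed schema goes into the SQL-generation prompt. The keyword-overlap router below is a crude stand-in for DBCopilot's seq2seq router.

    def route(question, catalogs):
        """Crude stand-in for the router: pick the schema whose table and
        column names overlap most with the question."""
        words = set(question.lower().replace("?", " ").split())

        def score(schema):
            names = {n.lower() for t in schema["tables"]
                     for n in [t["name"], *t["columns"]]}
            return len(words & names)
        return max(catalogs, key=score)

    def build_sql_prompt(question, schema):
        tables = "\n".join(f"{t['name']}({', '.join(t['columns'])})"
                           for t in schema["tables"])
        return f"Schema:\n{tables}\n\nQuestion: {question}\nSQL:"

    catalogs = [
        {"db": "sales", "tables": [{"name": "orders", "columns": ["id", "total", "date"]}]},
        {"db": "hr", "tables": [{"name": "employees", "columns": ["id", "name", "salary"]}]},
    ]
    q = "What is the average salary of employees?"
    print(build_sql_prompt(q, route(q, catalogs)))  # this prompt goes to the LLM
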
☆ Think from Words (TFW): Initiating Human-Like Cognition in Large Language Models Through Think from Words for Japanese Text-level Classification
The proliferation of Large Language Models (LLMs) has spurred extensive
research into LLM-related prompting techniques, such as Instruction Learning
(IL), In-context Learning (ICL), and Chain-of-Thought (CoT). These approaches
aim to improve LLMs' responses by enabling them to provide concise statements
or examples for deeper contemplation when addressing questions. However,
independent thinking by LLMs can introduce variability in their thought
processes, leading to potential inaccuracies. In response, our study seeks to
bridge the gap between LLM and human-like thinking processes, recognizing that
text comprehension begins with understanding individual words. To tackle this
challenge, we have expanded the CoT method to cater to a specific domain. Our
approach, known as "Think from Words" (TFW), initiates the comprehension
process at the word level and then extends it to encompass the entire text. We
also propose "TFW with Extra word-level information" (TFW Extra), augmenting
comprehension with additional word-level data. To assess our methods, we employ
text classification on six Japanese datasets comprising text-level and
word-level elements. Our findings not only validate the effectiveness of TFW
but also shed light on the impact of various word-level information types on
LLMs' text comprehension, offering insights into their potential to cause
misinterpretations and errors in the overall comprehension of the final text.
☆ Compressed Context Memory For Online Language Model Interaction
This paper presents a novel context compression method for Transformer
language models in online scenarios such as ChatGPT, where the context
continually expands. As the context lengthens, the attention process requires
more memory and computational resources, which in turn reduces the throughput
of the language model. To this end, we propose a compressed context memory
system that continually compresses the growing context into a compact memory
space. The compression process simply involves integrating a lightweight
conditional LoRA into the language model's forward pass during inference. Based
on the compressed context memory, the language model can perform inference with
reduced memory and attention operations. Through evaluations on conversation,
personalization, and multi-task learning, we demonstrate that our approach
achieves the performance level of a full context model with $5\times$ smaller
context memory space. Code is available at
https://github.com/snu-mllab/context-memory.
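
A hedged sketch of the compression step only: the growing context is distilled into a fixed number of memory slots with attention, so later turns attend over the small memory rather than the full history. The paper's actual method attaches a lightweight conditional LoRA to the LM itself; this standalone module and its sizes are illustrative.

    import torch
    import torch.nn as nn

    class ContextCompressor(nn.Module):
        def __init__(self, dim=512, num_slots=8, heads=8):
            super().__init__()
            self.slots = nn.Parameter(torch.randn(num_slots, dim) * 0.02)
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

        def forward(self, memory, new_context):
            """memory: (B, S, D) compressed so far; new_context: (B, T, D)."""
            q = memory + self.slots                       # recurrent slot queries
            kv = torch.cat([memory, new_context], dim=1)  # old memory + new turn
            out, _ = self.attn(q, kv, kv)
            return out                                    # (B, S, D) new memory

    comp = ContextCompressor()
    memory = torch.zeros(1, 8, 512)
    for _ in range(3):                     # three dialogue turns arrive
        turn = torch.randn(1, 40, 512)     # hidden states of the new turn
        memory = comp(memory, turn)        # memory stays fixed at 8 slots
    print(memory.shape)                    # torch.Size([1, 8, 512])
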
☆ A Text-to-Text Model for Multilingual Offensive Language Identification ACL 2023
The ubiquity of offensive content on social media is a growing cause for
concern among companies and government organizations. Recently,
transformer-based models such as BERT, XLNET, and XLM-R have achieved
state-of-the-art performance in detecting various forms of offensive content
(e.g. hate speech, cyberbullying, and cyberaggression). However, the majority
of these models are limited in their capabilities due to their encoder-only
architecture, which restricts the number and types of labels in downstream
tasks. Addressing these limitations, this study presents the first pre-trained
model with encoder-decoder architecture for offensive language identification
with text-to-text transformers (T5) trained on two large offensive language
identification datasets: SOLID and CCTK. We investigate the effectiveness of
combining two datasets and selecting an optimal threshold in semi-supervised
instances in SOLID in the T5 retraining step. Our pre-trained T5 model
outperforms other transformer-based models fine-tuned for offensive language
detection, such as fBERT and HateBERT, in multiple English benchmarks.
Following a similar approach, we also train the first multilingual pre-trained
model for offensive language identification using mT5 and evaluate its
performance on a set of six different languages (German, Hindi, Korean,
Marathi, Sinhala, and Spanish). The results demonstrate that this multilingual
model achieves a new state-of-the-art on all the above datasets, showing its
usefulness in multilingual scenarios. Our proposed T5-based models will be made
freely available to the community.
comment: Accepted to Findings of IJCNLP-AACL 2023
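
A usage sketch of offensive language identification as text-to-text with the Hugging Face API. The checkpoint name and the "classify offensive:" prefix are placeholders; the released models may use different names and label strings.

    from transformers import AutoTokenizer, T5ForConditionalGeneration

    name = "t5-small"  # placeholder; substitute the released checkpoint
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = T5ForConditionalGeneration.from_pretrained(name)

    text = "classify offensive: you are a wonderful person"
    inputs = tokenizer(text, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=4)  # decodes a label string
    print(tokenizer.decode(out[0], skip_special_tokens=True))
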
☆ Lazy-k: Decoding for Constrained Token Classification EMNLP
We explore the possibility of improving probabilistic models in structured
prediction. Specifically, we combine the models with constrained decoding
approaches in the context of token classification for information extraction.
The decoding methods search for constraint-satisfying label-assignments while
maximizing the total probability. To do this, we evaluate several existing
approaches, as well as propose a novel decoding method called Lazy-$k$. Our
findings demonstrate that constrained decoding approaches can significantly
improve the models' performance, especially when using smaller models. The
Lazy-$k$ approach allows for more flexibility between decoding time and
accuracy. The code for using Lazy-$k$ decoding can be found here:
https://github.com/ArthurDevNL/lazyk.
comment: Accepted EMNLP Main 2023
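
A hedged sketch in the Lazy-k spirit: enumerate label assignments in order of decreasing total probability and return the first one that satisfies the constraint. The paper's algorithm is more refined; this is the naive best-first version of the same idea.

    import heapq
    import numpy as np

    def best_first_decode(log_probs, satisfies):
        """log_probs: (T, L) per-position label log-probabilities."""
        order = np.argsort(-log_probs, axis=1)          # labels ranked per position
        ranked = np.take_along_axis(log_probs, order, axis=1)
        T, L = log_probs.shape
        start = (0,) * T                                # rank indices (all best)
        heap = [(-ranked[range(T), start].sum(), start)]
        seen = {start}
        while heap:                                     # pops in decreasing probability
            neg_score, ranks = heapq.heappop(heap)
            labels = [int(order[t, r]) for t, r in enumerate(ranks)]
            if satisfies(labels):
                return labels, -neg_score
            for t in range(T):                          # relax one position at a time
                if ranks[t] + 1 < L:
                    nxt = ranks[:t] + (ranks[t] + 1,) + ranks[t + 1:]
                    if nxt not in seen:
                        seen.add(nxt)
                        delta = ranked[t, nxt[t]] - ranked[t, ranks[t]]
                        heapq.heappush(heap, (neg_score - delta, nxt))
        return None, -np.inf

    # Toy constraint: the sequence may not start with label 1 (e.g., an I- tag).
    logp = np.log(np.array([[0.2, 0.7, 0.1], [0.5, 0.3, 0.2], [0.1, 0.8, 0.1]]))
    print(best_first_decode(logp, lambda ls: ls[0] != 1))
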
☆ KhabarChin: Automatic Detection of Important News in the Persian Language
Being aware of important news is crucial for staying up to date and making
well-informed decisions efficiently. Natural Language Processing (NLP)
approaches can significantly automate this process. This paper introduces the
task of important news detection, a previously unexplored area, and presents a
new benchmarking dataset (Khabarchin) for detecting important news in the
Persian language. We define important news articles as those deemed significant
for a considerable portion of society, capable of influencing their mindset or
decision-making. The news articles are obtained from seven different prominent
Persian news agencies, resulting in the annotation of 7,869 samples and the
creation of the dataset. We faced two challenges, high annotator disagreement
and class imbalance, and provide solutions for both. We also
propose several learning-based models, ranging from conventional machine
learning to state-of-the-art transformer models, to tackle this task.
Furthermore, we introduce the second task of important sentence detection in
news articles, as they often come with a significant contextual length that
makes it challenging for readers to identify important information. We identify
these sentences in a weakly supervised manner.
comment: 8 pages, 2 figures
☆ Teaching Specific Scientific Knowledge into Large Language Models through Additional Training
Through additional training, we explore embedding specialized scientific
knowledge into the Llama 2 Large Language Model (LLM). Key findings reveal that
effective knowledge integration requires reading texts from multiple
perspectives, especially in instructional formats. We utilize text augmentation
to tackle the scarcity of specialized texts, including style conversions and
translations. Hyperparameter optimization proves crucial, with different size
models (7b, 13b, and 70b) reasonably undergoing additional training. Validating
our methods, we construct a dataset of 65,000 scientific papers. Although we
have succeeded in partially embedding knowledge, the study highlights the
complexities and limitations of incorporating specialized information into
LLMs, suggesting areas for further improvement.
☆ Topic and genre in dialogue
In this paper we argue that topic plays a fundamental role in conversations,
and that the concept is needed in addition to that of genre to define
interactions. In particular, the concepts of genre and topic need to be
separated and orthogonally defined. This would enable modular, reliable and
controllable flexible-domain dialogue systems.
☆ Measuring Misogyny in Natural Language Generation: Preliminary Results from a Case Study on two Reddit Communities EMNLP 2023
Generic `toxicity' classifiers continue to be used for evaluating the
potential for harm in natural language generation, despite mounting evidence of
their shortcomings. We consider the challenge of measuring misogyny in natural
language generation, and argue that generic `toxicity' classifiers are
inadequate for this task. We use data from two well-characterised `Incel'
communities on Reddit that differ primarily in their degrees of misogyny to
construct a pair of training corpora which we use to fine-tune two language
models. We show that an open source `toxicity' classifier is unable to
distinguish meaningfully between generations from these models. We contrast
this with a misogyny-specific lexicon recently proposed by feminist
subject-matter experts, demonstrating that, despite the limitations of simple
lexicon-based approaches, this shows promise as a benchmark to evaluate
language models for misogyny, and that it is sensitive enough to reveal the
known differences in these Reddit communities. Our preliminary findings
highlight the limitations of a generic approach to evaluating harms, and
further emphasise the need for careful benchmark design and selection in
natural language evaluation.
comment: This extended abstract was presented at the Generation, Evaluation
and Metrics workshop at Empirical Methods in Natural Language Processing in
2023 (GEM@EMNLP 2023) in Singapore
☆ Optimizing Two-Pass Cross-Lingual Transfer Learning: Phoneme Recognition and Phoneme to Grapheme Translation
This research optimizes two-pass cross-lingual transfer learning in
low-resource languages by enhancing phoneme recognition and phoneme-to-grapheme
translation models. Our approach optimizes these two stages to improve speech
recognition across languages. We optimize phoneme vocabulary coverage by
merging phonemes based on shared articulatory characteristics, thus improving
recognition accuracy. Additionally, we introduce a global phoneme noise
generator for realistic ASR noise during phoneme-to-grapheme training to reduce
error propagation. Experiments on the CommonVoice 12.0 dataset show significant
reductions in Word Error Rate (WER) for low-resource languages, highlighting
the effectiveness of our approach. This research contributes to the
advancement of two-pass ASR systems in low-resource languages, offering the
potential for improved cross-lingual transfer learning.
comment: 8 pages, ASRU 2023 Accepted
☆ Dyport: Dynamic Importance-based Hypothesis Generation Benchmarking Technique
This paper presents a novel benchmarking framework, Dyport, for evaluating
biomedical hypothesis generation systems. Utilizing curated datasets, our
approach tests these systems under realistic conditions, enhancing the
relevance of our evaluations. We integrate knowledge from the curated databases
into a dynamic graph, accompanied by a method to quantify discovery importance.
This not only assesses hypothesis accuracy but also their potential impact in
biomedical research which significantly extends traditional link prediction
benchmarks. Applicability of our benchmarking process is demonstrated on
several link prediction systems applied to biomedical semantic knowledge
graphs. Being flexible, our benchmarking system is designed for broad
application in hypothesis generation quality verification, aiming to expand the
scope of scientific discovery within the biomedical research community.
Availability and implementation: Dyport framework is fully open-source. All
code and datasets are available at: https://github.com/IlyaTyagin/Dyport
☆ Can language agents be alternatives to PPO? A Preliminary Empirical Study On OpenAI Gym
The formidable capacity for zero- or few-shot decision-making in language
agents encourages us to pose a compelling question: Can language agents be
alternatives to PPO agents in traditional sequential decision-making tasks? To
investigate this, we first take environments collected in OpenAI Gym as our
testbeds and ground them to textual environments that construct the TextGym
simulator. This allows for straightforward and efficient comparisons between
PPO agents and language agents, given the widespread adoption of OpenAI Gym. To
ensure a fair and effective benchmarking, we introduce $5$ levels of scenarios
for precise control of domain knowledge and a unified RL-inspired framework
for language agents. Additionally, we propose an innovative
explore-exploit-guided language (EXE) agent to solve tasks within TextGym.
Through numerical experiments and ablation studies, we extract valuable
insights into the decision-making capabilities of language agents and make a
preliminary evaluation of their potential to be alternatives to PPO in
classical sequential decision-making problems. This paper sheds light on the
performance of language agents and paves the way for future research in this
exciting domain. Our code is publicly available
at~\url{https://github.com/mail-ecnu/Text-Gym-Agents}.
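
A minimal sketch of grounding a Gym environment to text, assuming the gymnasium package: observations are verbalized for a language agent whose textual reply is parsed back into a discrete action. The prompt and the rule-based agent stub are illustrative, not TextGym's actual design.

    import gymnasium as gym

    def verbalize(obs):
        x, v, theta, omega = obs
        return (f"Cart position {x:.2f}, velocity {v:.2f}, pole angle {theta:.2f}, "
                f"angular velocity {omega:.2f}. Push left or right?")

    def language_agent(text):
        # Stand-in for an LLM call: lean the cart toward the pole's tilt.
        return "left" if "angle -" in text else "right"

    env = gym.make("CartPole-v1")
    obs, _ = env.reset(seed=0)
    total = 0.0
    for _ in range(500):
        action = 1 if language_agent(verbalize(obs)) == "right" else 0
        obs, reward, terminated, truncated, _ = env.step(action)
        total += reward
        if terminated or truncated:
            break
    print("episode return:", total)
    env.close()
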
☆ Rethinking E-Commerce Search
E-commerce search and recommendation usually operate on structured data such
as product catalogs and taxonomies. However, creating better search and
recommendation systems often requires a large variety of unstructured data
including customer reviews and articles on the web. Traditionally, the solution
has always been converting unstructured data into structured data through
information extraction, and conducting search over the structured data.
However, this is a costly approach that often has low quality. In this paper,
we envision a solution that does entirely the opposite. Instead of converting
unstructured data (web pages, customer reviews, etc) to structured data, we
instead convert structured data (product inventory, catalogs, taxonomies, etc)
into textual data, which can be easily integrated into the text corpus that
trains LLMs. Then, search and recommendation can be performed through a Q/A
mechanism through an LLM instead of using traditional information retrieval
methods over structured data.
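
A small sketch of the proposed direction: verbalizing a structured catalog record into text that an LLM can be trained or prompted on. The record fields are illustrative.

    def product_to_text(p):
        sentences = [
            f"{p['name']} is a {p['category']} product sold for ${p['price']:.2f}.",
            f"It is made by {p['brand']} and has an average rating of "
            f"{p['rating']} from {p['num_reviews']} reviews.",
        ]
        if p.get("in_stock"):
            sentences.append("It is currently in stock.")
        return " ".join(sentences)

    record = {"name": "TrailRunner 2", "category": "running shoe",
              "price": 89.99, "brand": "Acme", "rating": 4.6,
              "num_reviews": 1280, "in_stock": True}
    print(product_to_text(record))
    # Search/recommendation then becomes question answering with an LLM
    # trained on (or prompted with) such text, instead of retrieval over tables.
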
☆ Detecting Rumor Veracity with Only Textual Information by Double-Channel Structure
Kyle (1985) proposes two types of rumors: informed rumors which are based on
some private information and uninformed rumors which are not based on any
information (i.e. bluffing). Also, prior studies find that when people have a
credible source of information, they are likely to use a more confident textual
tone when spreading rumors. Motivated by these theoretical findings, we
propose a double-channel structure to determine the ex-ante veracity of rumors
on social media. Our ultimate goal is to classify each rumor into true, false,
or unverifiable category. We first assign each text to either the certain
(informed rumor) or the uncertain (uninformed rumor) category. Then, we apply a
lie detection algorithm to informed rumors and a thread-reply agreement
detection algorithm to uninformed rumors. Using the dataset of SemEval 2019 Task 7, which
requires ex-ante threefold classification (true, false, or unverifiable) of
social media rumors, our model yields a macro-F1 score of 0.4027, outperforming
all the baseline models and the second-place winner (Gorrell et al., 2019).
Furthermore, we empirically validate that the double-channel structure
outperforms single-channel structures that apply either the lie detection or
the agreement detection algorithm to all posts.
☆ Corporate Bankruptcy Prediction with Domain-Adapted BERT
This study applies BERT, a representative contextualized language model, to
corporate disclosure data to predict impending bankruptcies. Prior literature
on bankruptcy prediction mainly
focuses on developing more sophisticated prediction methodologies with
financial variables. However, in our study, we focus on improving the quality
of the input dataset. Specifically, we employ a BERT model to perform sentiment
analysis on MD&A disclosures. We show that BERT outperforms dictionary-based
predictions and Word2Vec-based predictions in terms of adjusted R-square in
logistic regression, k-nearest neighbor (kNN-5), and linear kernel support
vector machine (SVM). Further, instead of pre-training the BERT model from
scratch, we apply self-learning with confidence-based filtering to corporate
disclosure data (10-K). We achieve an accuracy of 91.56% and demonstrate
that the domain adaptation procedure brings a significant improvement in
prediction accuracy.
♻ ☆ KPI Extraction from Maintenance Work Orders -- A Comparison of Expert Labeling, Text Classification and AI-Assisted Tagging for Computing Failure Rates of Wind Turbines
Marc-Alexander Lutz, Bastian Schäfermeier, Rachael Sexton, Michael Sharp, Alden Dima, Stefan Faulstich, Jagan Mohini Aluri
Maintenance work orders are commonly used to document information about wind
turbine operation and maintenance. This includes details about proactive and
reactive wind turbine downtimes, such as preventative and corrective
maintenance. However, the information contained in maintenance work orders is
often unstructured and difficult to analyze, presenting challenges for
decision-makers wishing to use it for optimizing operation and maintenance. To
address this issue, this work compares three different approaches for
calculating reliability key performance indicators from maintenance work
orders. The first
approach involves manual labeling of the maintenance work orders by domain
experts, using the schema defined in an industrial guideline to assign the
label accordingly. The second approach involves the development of a model that
automatically labels the maintenance work orders using text classification
methods. Through this method, we are able to achieve macro average and weighted
average F1-Scores of 0.75 and 0.85 respectively. The third technique uses an
AI-assisted tagging tool to tag and structure the raw maintenance information,
together with a novel rule-based approach for extracting relevant maintenance
work orders for failure rate calculation. In our experiments, the AI-assisted
tool leads to an 88% drop in tagging time in comparison to the other two
approaches, while expert labeling and text classification are more accurate in
KPI extraction. Overall, our findings make extracting maintenance information
from maintenance work orders more efficient, enable the assessment of
reliability key performance indicators and therefore support the optimization
of wind turbine operation and maintenance.
♻ ☆ Consensus, dissensus and synergy between clinicians and specialist foundation models in radiology report generation
Ryutaro Tanno, David G. T. Barrett, Andrew Sellergren, Sumedh Ghaisas, Sumanth Dathathri, Abigail See, Johannes Welbl, Karan Singhal, Shekoofeh Azizi, Tao Tu, Mike Schaekermann, Rhys May, Roy Lee, SiWai Man, Zahra Ahmed, Sara Mahdavi, Danielle Belgrave, Vivek Natarajan, Shravya Shetty, Pushmeet Kohli, Po-Sen Huang, Alan Karthikesalingam, Ira Ktena
Radiology reports are an instrumental part of modern medicine, informing key
clinical decisions such as diagnosis and treatment. The worldwide shortage of
radiologists, however, restricts access to expert care and imposes heavy
workloads, contributing to avoidable errors and delays in report delivery.
While recent progress in automated report generation with vision-language
models offers clear potential for ameliorating the situation, the path to
real-world adoption has been stymied by the challenge of evaluating the
clinical quality of AI-generated reports. In this study, we build a
state-of-the-art report generation system for chest radiographs,
\textit{Flamingo-CXR}, by fine-tuning a well-known vision-language foundation
model on radiology data. To evaluate the quality of the AI-generated reports, a
group of 16 certified radiologists provide detailed evaluations of AI-generated
and human written reports for chest X-rays from an intensive care setting in
the United States and an inpatient setting in India. At least one radiologist
(out of two per case) preferred the AI report to the ground truth report in
over 60$\%$ of cases for both datasets. Amongst the subset of AI-generated
reports that contain errors, the most frequently cited reasons were related to
the location and finding, whereas for human written reports, most mistakes were
related to severity and finding. This disparity suggested potential
complementarity between our AI system and human experts, prompting us to
develop an assistive scenario in which \textit{Flamingo-CXR} generates a
first-draft report, which is subsequently revised by a clinician. This is the
first demonstration of clinician-AI collaboration for report writing, and the
resultant reports are assessed to be equivalent or preferred by at least one
radiologist to reports written by experts alone in 80$\%$ of in-patient cases
and 60$\%$ of intensive care cases.
♻ ☆ LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models EMNLP 2023
Large language models (LLMs) have been applied in various applications due to
their astonishing capabilities. With advancements in technologies such as
chain-of-thought (CoT) prompting and in-context learning (ICL), the prompts fed
to LLMs are becoming increasingly lengthy, even exceeding tens of thousands of
tokens. To accelerate model inference and reduce cost, this paper presents
LLMLingua, a coarse-to-fine prompt compression method that involves a budget
controller to maintain semantic integrity under high compression ratios, a
token-level iterative compression algorithm to better model the interdependence
between compressed contents, and an instruction tuning based method for
distribution alignment between language models. We conduct experiments and
analysis over four datasets from different scenarios, i.e., GSM8K, BBH,
ShareGPT, and Arxiv-March23; showing that the proposed approach yields
state-of-the-art performance and allows for up to 20x compression with little
performance loss. Our code is available at https://aka.ms/LLMLingua.
comment: Accepted at EMNLP 2023
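
A hedged sketch of the core scoring-and-pruning step behind perplexity-based compression, assuming GPT-2 as the small scoring LM: tokens the LM predicts easily carry little information and are dropped first. LLMLingua's actual method adds a budget controller and iterative token-level compression on top of this.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

    prompt = ("Question: Natalia sold clips to 48 of her friends in April, "
              "and then she sold half as many clips in May.")
    ids = tok(prompt, return_tensors="pt").input_ids[0]

    with torch.no_grad():
        logits = lm(ids.unsqueeze(0)).logits[0]
    # Surprisal of each token given its prefix (the first token has none).
    logp = torch.log_softmax(logits[:-1], dim=-1)
    surprisal = -logp[torch.arange(len(ids) - 1), ids[1:]]

    keep_ratio = 0.6                       # the compression budget
    k = int(surprisal.numel() * keep_ratio)
    keep = surprisal.topk(k).indices.sort().values + 1  # most surprising tokens
    print(tok.decode(torch.cat([ids[:1], ids[keep]])))
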
♻ ☆ Conditions for Length Generalization in Learning Reasoning Skills
Reasoning is a fundamental capability of AI agents. Recently, large language
models (LLMs) have shown remarkable abilities to perform reasoning tasks.
However, numerous evaluations of the reasoning capabilities of LLMs have also
shown some limitations. An outstanding limitation is length generalization,
meaning that when trained on reasoning problems of smaller lengths or sizes,
the resulting models struggle with problems of larger sizes or lengths. This
potentially indicates some theoretical limitations of generalization in
learning reasoning skills. These evaluations and their observations motivated
us to perform a theoretical study of the length generalization problem. This
work focuses on reasoning tasks that can be formulated as Markov dynamic
processes (MDPs) and/or directed acyclic graphs (DAGs). It identifies and
proves conditions that decide whether the length generalization problem can be
solved or not for a reasoning task in a particular representation. Experiments
are also conducted to verify the theoretical results.
♻ ☆ TIGERScore: Towards Building Explainable Metric for All Text Generation Tasks
We present TIGERScore, a \textbf{T}rained metric that follows
\textbf{I}nstruction \textbf{G}uidance to perform \textbf{E}xplainable, and
\textbf{R}eference-free evaluation over a wide spectrum of text generation
tasks. Different from other automatic evaluation methods that only provide
arcane scores, TIGERScore is guided by natural language instruction to provide
error analysis to pinpoint the mistakes in the generated text. Our metric is
based on LLaMA-2, trained on our meticulously curated instruction-tuning
dataset MetricInstruct which covers 6 text generation tasks and 23 text
generation datasets. The dataset consists of 42K quadruples in the form of
(instruction, input, system output $\rightarrow$ error analysis). We collected
the `system outputs' from a large variety of models to cover different
types of errors. To quantitatively assess our metric, we evaluate its
correlation with human ratings on 5 held-in datasets and 2 held-out datasets,
and show that TIGERScore can achieve the open-source SoTA correlation with human
ratings across these datasets and almost approaches GPT-4 evaluator. As a
reference-free metric, its correlation can even surpass the best existing
reference-based metrics. To further qualitatively assess the rationale
generated by our metric, we conduct human evaluation on the generated
explanations and find that they are 70.8\% accurate. Through these
experimental results, we believe TIGERScore demonstrates the possibility of
building universal explainable metrics to evaluate any text generation task.
♻ ☆ Entailment Semantics Can Be Extracted from an Ideal Language Model
Language models are often trained on text alone, without additional
grounding. There is debate as to how much of natural language semantics can be
inferred from such a procedure. We prove that entailment judgments between
sentences can be extracted from an ideal language model that has perfectly
learned its target distribution, assuming the training sentences are generated
by Gricean agents, i.e., agents who follow fundamental principles of
communication from the linguistic theory of pragmatics. We also show entailment
judgments can be decoded from the predictions of a language model trained on
such Gricean data. Our results reveal a pathway for understanding the semantic
information encoded in unlabeled linguistic data and a potential framework for
extracting semantics from language models.
comment: Accepted at CONLL 2022. Updated Dec 4, 2023 with erratum
♻ ☆ A Comprehensive Review of Visual-Textual Sentiment Analysis from Social Media Networks
Social media networks have become a significant aspect of people's lives,
serving as a platform for their ideas, opinions and emotions. Consequently,
automated sentiment analysis (SA) is critical for recognising people's feelings
in ways that other information sources cannot. The analysis of these feelings
has revealed various applications, including brand evaluations, YouTube film
reviews and healthcare applications. As social media continues to develop,
people post a massive amount of information in different forms, including text,
photos, audio and video. Thus, traditional SA algorithms have become limited,
as they do not consider the expressiveness of other modalities. By including
such characteristics from various material sources, these multimodal data
streams provide new opportunities for optimising the expected results beyond
text-based SA. Our study focuses on the forefront field of multimodal SA, which
examines visual and textual data posted on social media networks. Many people
are more likely to utilise this information to express themselves on these
platforms. To serve as a resource for academics in this rapidly growing field,
we introduce a comprehensive overview of textual and visual SA, including data
pre-processing, feature extraction techniques, sentiment benchmark datasets,
and the efficacy of multiple classification methodologies suited to each field.
We also provide a brief introduction of the most frequently utilised data
fusion strategies and a summary of existing research on visual-textual SA.
Finally, we highlight the most significant challenges and investigate several
important sentiment applications.
♻ ☆ Is This the Subspace You Are Looking for? An Interpretability Illusion for Subspace Activation Patching NeurIPS 2023
Mechanistic interpretability aims to understand model behaviors in terms of
specific, interpretable features, often hypothesized to manifest as
low-dimensional subspaces of activations. Specifically, recent studies have
explored subspace interventions (such as activation patching) as a way to
simultaneously manipulate model behavior and attribute the features behind it
to given subspaces.
In this work, we demonstrate that these two aims diverge, potentially leading
to an illusory sense of interpretability. Counterintuitively, even if a
subspace intervention makes the model's output behave as if the value of a
feature was changed, this effect may be achieved by activating a dormant
parallel pathway leveraging another subspace that is causally disconnected from
model outputs. We demonstrate this phenomenon in a distilled mathematical
example, in two real-world domains (the indirect object identification task and
factual recall), and present evidence for its prevalence in practice. In the
context of factual recall, we further show a link to rank-1 fact editing,
providing a mechanistic explanation for previous work observing an
inconsistency between fact editing performance and fact localization.
However, this does not imply that activation patching of subspaces is
intrinsically unfit for interpretability. To contextualize our findings, we
also show what a success case looks like in a task (indirect object
identification) where prior manual circuit analysis informs an understanding of
the location of a feature. We explore the additional evidence needed to argue
that a patched subspace is faithful.
comment: NeurIPS 2023 Workshop on Attributing Model Behavior at Scale
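
For reference, the intervention itself is simple: replace an activation's component along a direction v with the source run's component, leaving the orthogonal complement untouched. The illusion discussed above concerns what such a patch does or does not establish causally; this sketch shows only the mechanics.

    import torch

    def patch_subspace(act, act_src, v):
        """act, act_src: (..., d) activations; v: (d,) direction to patch."""
        v = v / v.norm()
        coeff = act @ v          # base-run component along v
        coeff_src = act_src @ v  # source-run component along v
        return act + (coeff_src - coeff).unsqueeze(-1) * v

    act = torch.randn(2, 16)       # activations from the base prompt
    act_src = torch.randn(2, 16)   # activations from the source prompt
    v = torch.randn(16)
    patched = patch_subspace(act, act_src, v)
    # Along v the patched run now matches the source run exactly.
    print(torch.allclose(patched @ (v / v.norm()), act_src @ (v / v.norm()),
                         atol=1e-6))
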
♻ ☆ Error Detection for Text-to-SQL Semantic Parsing EMNLP 2023
Despite remarkable progress in text-to-SQL semantic parsing in recent years,
the performance of existing parsers is still far from perfect. Specifically,
modern text-to-SQL parsers based on deep learning are often over-confident,
thus casting doubt on their trustworthiness when deployed for real use. In this
paper, we propose a parser-independent error detection model for text-to-SQL
semantic parsing. Using a language model of code as its bedrock, we enhance our
error detection model with graph neural networks that learn structural features
of both natural language questions and SQL queries. We train our model on
realistic parsing errors collected from a cross-domain setting, which leads to
stronger generalization ability. Experiments with three strong text-to-SQL
parsers featuring different decoding mechanisms show that our approach
outperforms parser-dependent uncertainty metrics. Our model could also
effectively improve the performance and usability of text-to-SQL semantic
parsers regardless of their architectures. (Our implementation is available at
https://github.com/OSU-NLP-Group/Text2SQL-Error-Detection)
comment: EMNLP 2023 (Findings); Updated with new experiment results
♻ ☆ Completeness, Recall, and Negation in Open-World Knowledge Bases: A Survey
General-purpose knowledge bases (KBs) are a cornerstone of knowledge-centric
AI. Many of them are constructed pragmatically from Web sources, and are thus
far from complete. This poses challenges for the consumption as well as the
curation of their content. While several surveys target the problem of
completing incomplete KBs, the first problem is arguably to know whether and
where the KB is incomplete in the first place, and to what degree.
In this survey we discuss how knowledge about completeness, recall, and
negation in KBs can be expressed, extracted, and inferred. We cover (i) the
logical foundations of knowledge representation and querying under partial
closed-world semantics; (ii) the estimation of this information via statistical
patterns; (iii) the extraction of information about recall from KBs and text;
(iv) the identification of interesting negative statements; and (v) relaxed
notions of relative recall.
This survey is targeted at two types of audiences: (1) practitioners who are
interested in tracking KB quality, focusing extraction efforts, and building
quality-aware downstream applications; and (2) data management, knowledge base
and semantic web researchers who wish to understand the state of the art of
knowledge bases beyond the open-world assumption. Consequently, our survey
presents both fundamental methodologies and how they work, and gives
practice-oriented recommendations on how to choose between different approaches
for a problem at hand.
comment: 42 pages, 8 figures, 5 tables
♻ ☆ Strahler Number of Natural Language Sentences in Comparison with Random Trees
The Strahler number was originally proposed to characterize the complexity of
river bifurcation and has found various applications. This article proposes
computation of the Strahler number's upper and lower limits for natural
language sentence tree structures. Through empirical measurements across
grammatically annotated data, the Strahler number of natural language sentences
is shown to be almost 3 or 4, similarly to the case of river bifurcation as
reported by Strahler (1957). From the theory behind the number, we show that it
is one kind of lower limit on the amount of memory required to process
sentences. We consider the Strahler number to provide reasoning that explains
reports showing that the number of required memory areas to process sentences
is 3 to 4 for parsing (Schuler et al., 2010), and reports indicating a
psychological "magical number" of 3 to 5 (Cowan, 2001). An analytical and
empirical analysis shows that the Strahler number is not constant but grows
logarithmically; therefore, the Strahler number of sentences derives from the
range of sentence lengths. Furthermore, the Strahler number is not different
for random trees, which could suggest that its origin is not specific to
natural language.
comment: 34 pages, 12 figures, 11 tables
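
For readers unfamiliar with the measure: a leaf has Strahler number 1, and an internal node takes the maximum over its children, plus one when that maximum is attained by two or more children. A minimal computation on nested-list trees:

    def strahler(tree):
        """tree: a nested list; [] is a leaf."""
        if not tree:
            return 1
        child_numbers = [strahler(c) for c in tree]
        top = max(child_numbers)
        return top + 1 if child_numbers.count(top) >= 2 else top

    leaf = []
    balanced = [[[leaf, leaf], [leaf, leaf]], [[leaf, leaf], [leaf, leaf]]]
    chain = [leaf, [leaf, [leaf, leaf]]]   # right-branching, like many parses
    print(strahler(balanced), strahler(chain))  # 4 2
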
♻ ☆ Visual Data-Type Understanding does not emerge from Scaling Vision-Language Models
Recent advances in the development of vision-language models (VLMs) are
yielding remarkable success in recognizing visual semantic content, including
impressive instances of compositional image understanding. Here, we introduce
the novel task of Visual Data-Type Identification, a basic perceptual skill
with implications for data curation (e.g., noisy data-removal from large
datasets, domain-specific retrieval) and autonomous vision (e.g.,
distinguishing changing weather conditions from camera lens staining). We
develop two datasets consisting of animal images altered across a diverse set
of 27 visual data-types, spanning four broad categories. An extensive zero-shot
evaluation of 39 VLMs, ranging from 100M to 80B parameters, shows a nuanced
performance landscape. While VLMs are reasonably good at identifying certain
stylistic \textit{data-types}, such as cartoons and sketches, they struggle
with simpler data-types arising from basic manipulations like image rotations
or additive noise. Our findings reveal that (i) model scaling alone yields
marginal gains for contrastively-trained models like CLIP, and (ii) there is a
pronounced drop in performance for the largest auto-regressively trained VLMs
like OpenFlamingo. This finding points to a blind spot in current frontier
VLMs: they excel in recognizing semantic content but fail to acquire an
understanding of visual data-types through scaling. By analyzing the
pre-training distributions of these models and incorporating data-type
information into the captions during fine-tuning, we achieve a significant
enhancement in performance. By exploring this previously uncharted task, we aim
to set the stage for further advancing VLMs to equip them with visual data-type
understanding. Code and datasets are released at
https://github.com/bethgelab/DataTypeIdentification.
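
A sketch of the zero-shot protocol for a CLIP-style model, using the Hugging Face CLIP API: score an image against textual descriptions of data-types and take the best match. The prompts and placeholder image are illustrative, not the paper's exact evaluation set.

    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    data_type_prompts = [
        "a photo of an animal",
        "a cartoon drawing of an animal",
        "a rotated photo of an animal",
        "a noisy photo of an animal",
        "a blurred photo of an animal",
    ]
    image = Image.new("RGB", (224, 224), "gray")  # placeholder image

    inputs = processor(text=data_type_prompts, images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        probs = model(**inputs).logits_per_image.softmax(dim=-1)
    print(data_type_prompts[probs.argmax().item()])
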
♻ ☆ Assessing Language Disorders using Artificial Intelligence: a Paradigm Shift
Speech, language, and communication deficits are present in most
neurodegenerative syndromes. They enable the early detection, diagnosis,
treatment planning, and monitoring of neurocognitive disease progression as
part of traditional neurological assessment. Nevertheless, standard speech and
language evaluation is time-consuming and resource-intensive for clinicians. We
argue that using machine learning methodologies, natural language processing,
and modern artificial intelligence (AI) for Language Assessment is an
improvement over conventional manual assessment. Using these methodologies,
Computational Language Assessment (CLA) accomplishes three goals: (i) provides
a neuro-cognitive evaluation of speech, language, and communication in elderly
and high-risk individuals for dementia; (ii) facilitates the diagnosis,
prognosis, and therapy efficacy in at-risk and language-impaired populations;
and (iii) allows easier extensibility to assess patients from a wide range of
languages. By employing AI models, CLA may inform neurocognitive theory on the
relationship between language symptoms and their neural bases. Finally, it
signals a paradigm shift by significantly advancing our ability to optimize the
prevention and treatment of elderly individuals with communication disorders,
allowing them to age gracefully with social engagement.
comment: 36 pages, 2 figures, to be submitted
♻ ☆ Modeling Empathic Similarity in Personal Narratives EMNLP 2023
The most meaningful connections between people are often fostered through
expression of shared vulnerability and emotional experiences in personal
narratives. We introduce a new task of identifying similarity in personal
stories based on empathic resonance, i.e., the extent to which two people
empathize with each other's experiences, as opposed to raw semantic or lexical
similarity, as has predominantly been studied in NLP. Using insights from
social psychology, we craft a framework that operationalizes empathic
similarity in terms of three key features of stories: main events, emotional
trajectories, and overall morals or takeaways. We create EmpathicStories, a
dataset of 1,500 personal stories annotated with our empathic similarity
features, and 2,000 pairs of stories annotated with empathic similarity scores.
Using our dataset, we fine-tune a model to compute empathic similarity of story
pairs, and show that this outperforms semantic similarity models on automated
correlation and retrieval metrics. Through a user study with 150 participants,
we also assess the effect our model has on retrieving stories that users
empathize with, compared to naive semantic similarity-based retrieval, and find
that participants empathized significantly more with stories retrieved by our
model. Our work has strong implications for the use of empathy-aware models to
foster human connection and empathy between people.
comment: Published at EMNLP 2023
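One natural way to fine-tune a model on pairs annotated with similarity
scores is a bi-encoder regressed to the labels. The sketch below assumes the
sentence-transformers (pre-3.0) training API; the model name and the toy
story pair are placeholders, not the paper's setup or data.
```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("all-MiniLM-L6-v2")
train = [InputExample(texts=["I lost my father last spring.",
                             "My mother passed away recently."],
                      label=0.9)]  # empathic similarity score in [0, 1]
loader = DataLoader(train, batch_size=1, shuffle=True)
loss = losses.CosineSimilarityLoss(model)  # regress cosine sim to the label
model.fit(train_objectives=[(loader, loss)], epochs=1)
```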
♻ ☆ SwiftSage: A Generative Agent with Fast and Slow Thinking for Complex Interactive Tasks NeurIPS 2023
Bill Yuchen Lin, Yicheng Fu, Karina Yang, Faeze Brahman, Shiyu Huang, Chandra Bhagavatula, Prithviraj Ammanabrolu, Yejin Choi, Xiang Ren
We introduce SwiftSage, a novel agent framework inspired by the dual-process
theory of human cognition, designed to excel in action planning for complex
interactive reasoning tasks. SwiftSage integrates the strengths of behavior
cloning and prompting large language models (LLMs) to enhance task completion
performance. The framework comprises two primary modules: the Swift module,
representing fast and intuitive thinking, and the Sage module, emulating
deliberate thought processes. The Swift module is a small encoder-decoder LM
fine-tuned on the oracle agent's action trajectories, while the Sage module
employs LLMs such as GPT-4 for subgoal planning and grounding. We develop a
heuristic method to harmoniously integrate the two modules, resulting in a more
efficient and robust problem-solving process. In 30 tasks from the ScienceWorld
benchmark, SwiftSage significantly outperforms other methods such as SayCan,
ReAct, and Reflexion, demonstrating its effectiveness in solving complex
interactive tasks.
comment: Accepted to NeurIPS 2023 (spotlight). Project website:
https://swiftsage.github.io
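The dual-process idea lends itself to a compact controller sketch: a small
fast policy acts by default, and a slower LLM planner is consulted when the
fast module is unsure. The confidence threshold and both callables below are
hypothetical stand-ins, not the paper's actual integration heuristic.
```python
from typing import Callable, Tuple

def act(observation: str,
        swift: Callable[[str], Tuple[str, float]],  # (action, confidence)
        sage: Callable[[str], str],                  # deliberate planner
        threshold: float = 0.8) -> str:
    action, confidence = swift(observation)
    if confidence >= threshold:
        return action          # fast, intuitive path
    return sage(observation)   # fall back to slow, deliberate planning

# Usage with stub modules:
swift = lambda obs: ("open door", 0.95)
sage = lambda obs: "plan: find key; then open door"
print(act("you are in a hallway", swift, sage))
```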
♻ ☆ Clickbait Detection via Large Language Models
Clickbait, which lures users with surprising and even thrilling headlines to
increase click-through rates, permeates almost all online content publishers,
such as news portals and social media. Recently, Large Language Models (LLMs)
have emerged as a powerful instrument and achieved tremendous success in a
series of downstream NLP tasks. However, it is not yet known whether LLMs can
serve as a high-quality clickbait detection system. In this paper, we analyze
the performance of LLMs in few-shot and zero-shot scenarios on several
English and Chinese benchmark datasets. Experimental results show that LLMs
do not match the state-of-the-art deep and fine-tuned PLM methods. Contrary
to human intuition, the experiments also demonstrate that LLMs cannot detect
clickbait satisfactorily from headlines alone.
♻ ☆ Inference-Time Policy Adapters (IPA): Tailoring Extreme-Scale LMs without Fine-tuning EMNLP 2023
Ximing Lu, Faeze Brahman, Peter West, Jaehun Jang, Khyathi Chandu, Abhilasha Ravichander, Lianhui Qin, Prithviraj Ammanabrolu, Liwei Jiang, Sahana Ramnath, Nouha Dziri, Jillian Fisher, Bill Yuchen Lin, Skyler Hallinan, Xiang Ren, Sean Welleck, Yejin Choi
While extreme-scale language models have demonstrated exceptional performance
on a variety of language tasks, the degree of control over these language
models through pure prompting can often be limited. Directly fine-tuning such
language models can be effective for tailoring them, but it can be either
extremely costly (e.g., GPT-3) or not even feasible for the broader community
(e.g., GPT-4).
We propose Inference-time Policy Adapters (IPA), which efficiently tailors a
language model such as GPT-3 without fine-tuning it. IPA guides a large base
model during decoding time through a lightweight policy adapter trained to
optimize an arbitrary user objective with reinforcement learning.
On five challenging text generation tasks, such as toxicity reduction and
lexically constrained generation, IPA consistently brings significant
improvements over off-the-shelf language models. It outperforms competitive
baseline methods, sometimes even including expensive fine-tuning. In
particular, tailoring GPT-2 with IPA can outperform GPT-3, while tailoring
GPT-3 with IPA brings a major performance boost over GPT-3 (and sometimes even
over GPT-4). Our promising results highlight the potential of IPA as a
lightweight alternative to tailoring extreme-scale language models.
comment: EMNLP 2023
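Guiding a frozen base model with a small policy at decoding time can be
sketched as combining per-step logits. Treating the combination as a sum of
log-probabilities (a product-of-experts) is an assumption for illustration,
not necessarily the exact IPA formulation.
```python
import torch

@torch.no_grad()
def adapted_step(base_logits: torch.Tensor,
                 adapter_logits: torch.Tensor,
                 beta: float = 1.0) -> torch.Tensor:
    """Combine per-step vocabulary logits and sample the next token."""
    log_p = torch.log_softmax(base_logits, dim=-1)
    log_q = torch.log_softmax(adapter_logits, dim=-1)
    combined = log_p + beta * log_q      # adapter steers the base LM
    probs = torch.softmax(combined, dim=-1)
    return torch.multinomial(probs, num_samples=1)

vocab = 50257  # GPT-2-sized vocabulary, for illustration
next_token = adapted_step(torch.randn(1, vocab), torch.randn(1, vocab))
```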
♻ ☆ All the World's a (Hyper)Graph: A Data Drama
We introduce Hyperbard, a dataset of diverse relational data representations
derived from Shakespeare's plays. Our representations range from simple graphs
capturing character co-occurrence in single scenes to hypergraphs encoding
complex communication settings and character contributions as hyperedges with
edge-specific node weights. By making multiple intuitive representations
readily available for experimentation, we facilitate rigorous representation
robustness checks in graph learning, graph mining, and network analysis,
highlighting the advantages and drawbacks of specific representations.
Leveraging the data released in Hyperbard, we demonstrate that many solutions
to popular graph mining problems are highly dependent on the representation
choice, thus calling current graph curation practices into question. As an
homage to our data source, and asserting that science can also be art, we
present all our points in the form of a play.
comment: This is the full version of our paper; an abridged version appears in
Digital Scholarship in the Humanities. Landing page for code and data:
https://hyperbard.net/
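The simplest representation the abstract describes, a character
co-occurrence graph, takes a few lines to build. The scene list below is
invented for illustration; the released dataset uses richer (hyper)graph
encodings with edge-specific node weights.
```python
import itertools
import networkx as nx

scenes = [{"Romeo", "Benvolio"},
          {"Romeo", "Juliet"},
          {"Juliet", "Nurse", "Romeo"}]

G = nx.Graph()
for scene in scenes:
    for u, v in itertools.combinations(sorted(scene), 2):
        # Edge weight counts how many scenes the pair shares.
        w = G.edges[u, v]["weight"] + 1 if G.has_edge(u, v) else 1
        G.add_edge(u, v, weight=w)

print(G.edges(data=True))
```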
♻ ☆ BoschAI @ PLABA 2023: Leveraging Edit Operations in End-to-End Neural Sentence Simplification
Automatic simplification can help laypeople to comprehend complex scientific
text. Language models are frequently applied to this task by translating from
complex to simple language. In this paper, we describe our system based on
Llama 2, which ranked first in the PLABA shared task addressing the
simplification of biomedical text. We find that the large portion of shared
tokens between input and output leads to weak training signals and models
that edit only conservatively. To mitigate these issues, we propose
sentence-level and token-level loss weights. They give higher weight to
modified tokens, indicated by edit distance and edit operations, respectively.
We conduct an empirical evaluation on the PLABA dataset and find that both
approaches lead to simplifications closer to those created by human annotators
(+1.8% / +3.5% SARI), simpler language (-1 / -1.1 FKGL) and more edits (1.6x /
1.8x edit distance) compared to the same model fine-tuned with standard cross
entropy. We furthermore show that the hyperparameter $\lambda$ in token-level
loss weights can be used to control the edit distance and the simplicity level
(FKGL).
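The token-level weighting can be sketched as a weighted cross entropy:
tokens flagged as edited get weight lambda, unchanged tokens weight 1. How
tokens are flagged via edit operations is simplified here, and the tensors
are placeholders, not the PLABA training setup.
```python
import torch
import torch.nn.functional as F

def weighted_ce(logits: torch.Tensor,    # (T, vocab)
                targets: torch.Tensor,   # (T,)
                edited: torch.Tensor,    # (T,) bool: token was modified
                lam: float = 4.0) -> torch.Tensor:
    per_token = F.cross_entropy(logits, targets, reduction="none")
    weights = torch.where(edited, torch.full_like(per_token, lam),
                          torch.ones_like(per_token))
    return (weights * per_token).sum() / weights.sum()

loss = weighted_ce(torch.randn(6, 100), torch.randint(0, 100, (6,)),
                   torch.tensor([0, 0, 1, 1, 0, 1], dtype=torch.bool))
```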
♻ ☆ BiasTestGPT: Using ChatGPT for Social Bias Testing of Language Models
Pretrained Language Models (PLMs) harbor inherent social biases that can
result in harmful real-world implications. Such social biases are measured
through the probability values that PLMs output for different social groups and
attributes appearing in a set of test sentences. However, bias testing is
currently cumbersome since the test sentences are generated either from a
limited set of manual templates or need expensive crowd-sourcing. We instead
propose using ChatGPT for the controllable generation of test sentences, given
any arbitrary user-specified combination of social groups and attributes
appearing in the test sentences. When compared to template-based methods, our
approach using ChatGPT for test sentence generation is superior in detecting
social bias, especially in challenging settings such as intersectional biases.
We present an open-source comprehensive bias testing framework (BiasTestGPT),
hosted on HuggingFace, that can be plugged into any open-source PLM for bias
testing. User testing with domain experts from various fields has shown their
interest in being able to test modern AI for social biases. Our tool has
significantly improved their awareness of such biases in PLMs, proving to be
learnable and user-friendly. We thus enable seamless open-ended social bias
testing of PLMs by domain experts through an automatic large-scale generation
of diverse test sentences for any combination of social categories and
attributes.
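The measurement step the abstract describes, comparing the probabilities a
PLM assigns to the same sentence with different social groups substituted
in, can be sketched with a causal LM's log-likelihood. The template and
model choice are illustrative assumptions, not the BiasTestGPT pipeline.
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def sentence_logprob(text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    out = lm(ids, labels=ids)  # loss = mean per-token NLL
    return -out.loss.item() * (ids.shape[1] - 1)

template = "{} are good at math."
for group in ["Men", "Women"]:
    print(group, sentence_logprob(template.format(group)))
```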
♻ ☆ TPPoet: Transformer-Based Persian Poem Generation using Minimal Data and Advanced Decoding Techniques
Recent advances in language models (LMs) have demonstrated significant
efficacy in tasks related to the arts and humanities. While LMs have exhibited
exceptional performance across a wide range of natural language processing
tasks, there are notable challenges associated with their utilization on small
datasets and their ability to replicate more creative human capacities. In this
study, we aim to address these challenges by training a Persian classical
poetry generation model using a transformer architecture on a specialized
dataset with no pretraining. Additionally, we propose a novel decoding method
to enhance coherence and meaningfulness in the generated poetry, effectively
managing the tradeoff between diversity and quality. Furthermore, our
training approach and the proposed decoding method are evaluated through a
comprehensive set of automatic and human evaluations, which show their
superior capability to generate coherent and meaningful poetry compared to
other decoding methods and an existing Persian large language model (LLM).
♻ ★ AlpacaFarm: A Simulation Framework for Methods that Learn from Human Feedback
Yann Dubois, Xuechen Li, Rohan Taori, Tianyi Zhang, Ishaan Gulrajani, Jimmy Ba, Carlos Guestrin, Percy Liang, Tatsunori B. Hashimoto
Large language models (LLMs) such as ChatGPT have seen widespread adoption
due to their ability to follow user instructions well. Developing these LLMs
involves a complex yet poorly understood workflow requiring training with human
feedback. Replicating and understanding this instruction-following process
faces three major challenges: the high cost of data collection, the lack of
trustworthy evaluation, and the absence of reference method implementations. We
address these challenges with AlpacaFarm, a simulator that enables research and
development for learning from feedback at a low cost. First, we design LLM
prompts to simulate human feedback that are 45x cheaper than crowdworkers and
display high agreement with humans. Second, we propose an automatic evaluation
and validate it against human instructions obtained on real-world interactions.
Third, we contribute reference implementations for several methods (PPO, DPO,
best-of-n, expert iteration, and more) that learn from pairwise feedback.
Finally, as an end-to-end validation of AlpacaFarm, we train and evaluate
eleven models on 10k pairs of real human feedback and show that rankings of
models trained in AlpacaFarm match rankings of models trained on human data. As
a demonstration of the research possible in AlpacaFarm, we find that methods
that use a reward model can substantially improve over supervised fine-tuning
and that our reference PPO implementation leads to a +10% improvement in
win-rate against Davinci003. We release all components of AlpacaFarm at
https://github.com/tatsu-lab/alpaca_farm.
♻ ☆ In-Context Learning for Text Classification with Many Labels
In-context learning (ICL) using large language models for tasks with many
labels is challenging due to the limited context window, which makes it
difficult to fit a sufficient number of examples in the prompt. In this paper,
we use a pre-trained dense retrieval model to bypass this limitation, giving
the model only a partial view of the full label space for each inference call.
Testing with recent open-source LLMs (OPT, LLaMA), we set new
state-of-the-art performance in few-shot settings on three common intent
classification
datasets, with no finetuning. We also surpass fine-tuned performance on
fine-grained sentiment classification in certain cases. We analyze the
performance across the number of in-context examples and different model scales,
showing that larger models are necessary to effectively and consistently make
use of larger context lengths for ICL. By running several ablations, we analyze
the model's use of: a) the similarity of the in-context examples to the current
input, b) the semantic content of the class names, and c) the correct
correspondence between examples and labels. We demonstrate that all three are
needed to varying degrees depending on the domain, contrary to certain recent
works.
comment: 12 pages, 4 figures
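The core mechanism, retrieving only the nearest training examples so the
prompt shows a partial view of a large label space, can be sketched with a
dense retriever. The model name and the toy example pool are illustrative
assumptions, not the paper's datasets.
```python
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")
pool = [("set an alarm for 7 am", "alarm_set"),
        ("play some jazz", "play_music"),
        ("what's the weather tomorrow", "weather_query")]

def build_prompt(query: str, k: int = 2) -> str:
    texts = [t for t, _ in pool]
    sims = util.cos_sim(encoder.encode(query, convert_to_tensor=True),
                        encoder.encode(texts, convert_to_tensor=True))[0]
    top = sims.topk(k).indices.tolist()  # k nearest demonstrations
    demos = "\n".join(f"Input: {pool[i][0]}\nLabel: {pool[i][1]}"
                      for i in top)
    return f"{demos}\nInput: {query}\nLabel:"

print(build_prompt("wake me up at six"))
```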
♻ ★ Let's Think Outside the Box: Exploring Leap-of-Thought in Large Language Models with Creative Humor Generation
Chain-of-Thought (CoT) guides large language models (LLMs) to reason
step-by-step and can elicit their logical reasoning ability. While effective
for logical tasks, CoT is not conducive to creative problem-solving, which
often requires out-of-the-box thinking and is crucial for innovation. In
this paper, we explore the Leap-of-Thought (LoT) abilities within LLMs -- a
non-sequential, creative paradigm involving strong associations and knowledge
leaps. To this end, we study LLMs on the popular Oogiri game, which requires
participants to have good creativity and strong associative thinking to
respond unexpectedly and humorously to a given image, text, or both, and
which is thus suitable for studying LoT. To investigate LLMs' LoT ability in
the Oogiri game, we first build a multimodal and multilingual Oogiri-GO
dataset containing over 130,000 samples from the game, and observe that most
existing LLMs show insufficient LoT ability or fail outright on it.
Accordingly, we introduce a creative Leap-of-Thought (CLoT) paradigm to improve
an LLM's LoT ability. CLoT first formulates the Oogiri-GO dataset into
LoT-oriented instruction-tuning data to train a pretrained LLM, endowing it
with LoT humor generation and discrimination abilities. CLoT then designs an
explorative self-refinement stage that encourages the LLM to generate more
creative LoT data by exploring parallels between seemingly unrelated concepts
and selects high-quality data to train itself further. CLoT not only
excels at humor generation in the Oogiri game but also boosts creative
abilities in various tasks, such as the cloud-guessing game and the divergent
association task. These findings advance our understanding and offer a
pathway to improve
LLMs' creative capacities for innovative applications across domains. The
dataset, code, and models will be released online.
https://zhongshsh.github.io/CLoT/.
comment: Technical report
♻ ☆ TheoremQA: A Theorem-driven Question Answering dataset EMNLP 2023
Recent LLMs like GPT-4 and PaLM-2 have made tremendous progress in
solving fundamental math problems like GSM8K by achieving over 90% accuracy.
However, their capabilities to solve more challenging math problems which
require domain-specific knowledge (i.e. theorem) have yet to be investigated.
In this paper, we introduce TheoremQA, the first theorem-driven
question-answering dataset designed to evaluate AI models' capabilities to
apply theorems to solve challenging science problems. TheoremQA is curated by
domain experts and contains 800 high-quality questions covering 350 theorems
(e.g. Taylor's theorem, Lagrange's theorem, Huffman coding, Quantum Theorem,
Elasticity Theorem, etc) from Math, Physics, EE&CS, and Finance. We evaluate a
wide spectrum of 16 large language and code models with different prompting
strategies like Chain-of-Thoughts and Program-of-Thoughts. We found that
GPT-4's capabilities to solve these problems are unparalleled, achieving an
accuracy of 51% with Program-of-Thoughts Prompting. All the existing
open-sourced models are below 15%, barely surpassing the random-guess baseline.
Given the diversity and broad coverage of TheoremQA, we believe it can be used
as a better benchmark to evaluate LLMs' capabilities to solve challenging
science problems. The data and code are released in
https://github.com/wenhuchen/TheoremQA.
comment: Accepted to Main Conference of EMNLP 2023
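Program-of-Thoughts prompting, the strategy that gives GPT-4 its best score
here, asks the model to answer by writing code that is then executed. The
sketch below stubs the LLM call (`call_llm` is hypothetical) and uses bare
`exec`, which would need sandboxing in practice.
```python
def call_llm(prompt: str) -> str:  # placeholder for an actual model API
    return "import math\nanswer = math.comb(10, 3)"

def program_of_thoughts(question: str) -> object:
    prompt = (f"Question: {question}\n"
              "Write Python code that stores the final result in a "
              "variable named `answer`.\n")
    code = call_llm(prompt)
    namespace: dict = {}
    exec(code, namespace)  # sandbox untrusted model output in practice
    return namespace["answer"]

print(program_of_thoughts(
    "How many 3-element subsets does a 10-element set have?"))  # -> 120
```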
♻ ☆ D-Bot: Database Diagnosis System using Large Language Models
Xuanhe Zhou, Guoliang Li, Zhaoyan Sun, Zhiyuan Liu, Weize Chen, Jianming Wu, Jiesi Liu, Ruohang Feng, Guoyang Zeng
Database administrators (DBAs) play an important role in managing,
maintaining and optimizing database systems. However, it is hard and tedious
for DBAs to manage a large number of databases and give timely responses
(waiting for hours is intolerable in many online cases). In addition,
existing empirical methods support only limited diagnosis scenarios, and
updating their diagnosis rules for new database versions is labor-intensive.
Recently, large language models (LLMs) have shown great potential in various
fields. Thus, we propose D-Bot, an LLM-based database diagnosis system that can
automatically acquire knowledge from diagnosis documents, and generate
reasonable and well-founded diagnosis reports (i.e., identifying the root causes
and solutions) within acceptable time (e.g., under 10 minutes compared to hours
by a DBA). The techniques in D-Bot include (i) offline knowledge extraction
from documents, (ii) automatic prompt generation (e.g., knowledge matching,
tool retrieval), (iii) root cause analysis using tree search algorithm, and
(iv) collaborative mechanism for complex anomalies with multiple root causes.
We verify D-Bot on real benchmarks (including 539 anomalies of six typical
applications), and the results show that D-Bot can effectively analyze the root
causes of unseen anomalies and significantly outperforms traditional methods
and vanilla models like GPT-4.
♻ ☆ TraSE: Towards Tackling Authorial Style from a Cognitive Science Perspective
Stylistic analysis of text is a key task in research areas ranging from
authorship attribution to forensic analysis and personality profiling. The
existing approaches for stylistic analysis are plagued by issues like topic
influence, lack of discriminability for large number of authors and the
requirement for large amounts of diverse data. In this paper, the sources of
these issues are identified, along with the necessity of a cognitive
perspective on authorial style in addressing them. A novel feature
representation, called Trajectory-based Style Estimation (TraSE), is introduced
to support this purpose. Authorship attribution experiments with over 27,000
authors and 1.4 million samples in a cross-domain scenario resulted in 90%
attribution accuracy, suggesting that the feature representation is immune to
such negative influences and an excellent candidate for stylistic analysis.
Finally, a qualitative analysis is performed on TraSE using physical human
characteristics, like age, to validate its claim on capturing cognitive traits.
comment: Experimental results in the paper are incorrectly reported due to an
unforeseen glitch in the software prototype. The paper and its findings are
withdrawn
♻ ☆ HalOmi: A Manually Annotated Benchmark for Multilingual Hallucination and Omission Detection in Machine Translation
David Dale, Elena Voita, Janice Lam, Prangthip Hansanti, Christophe Ropers, Elahe Kalbassi, Cynthia Gao, Loïc Barrault, Marta R. Costa-jussà
Hallucinations in machine translation are translations that contain
information completely unrelated to the input. Omissions are translations that
do not include some of the input information. While both cases tend to be
catastrophic errors undermining user trust, annotated data with these types of
pathologies is extremely scarce and is limited to a few high-resource
languages. In this work, we release an annotated dataset for the hallucination
and omission phenomena covering 18 translation directions with varying resource
levels and scripts. Our annotation covers different levels of partial and full
hallucinations as well as omissions both at the sentence and at the word level.
Additionally, we revisit previous methods for hallucination and omission
detection, show that conclusions made based on a single language pair largely
do not hold for a large-scale evaluation, and establish new solid baselines.
♻ ☆ Polyglot or Not? Measuring Multilingual Encyclopedic Knowledge in Foundation Models EMNLP 2023
In this work, we assess the ability of foundation models to recall
encyclopedic knowledge across a wide range of linguistic contexts. To support
this, we: 1) produce a 20-language dataset that contains 303k factual
associations paired with counterfactuals, 2) evaluate 5 models in a
multilingual test, and 3) benchmark a diverse set of 24 models in an
English-only test. Meta's LLaMA achieves the highest scores in both
multilingual and English-only evaluations. Yet, an analysis of LLaMA's errors
reveals significant limitations in its ability to recall facts in languages
other than English, plus difficulties related to the location and gender of
fact subjects. Overall, our findings suggest that today's foundation models are
far from polyglots.
comment: EMNLP 2023 (Main)
♻ ☆ Model-tuning Via Prompts Makes NLP Models Adversarially Robust EMNLP 2023
In recent years, NLP practitioners have converged on the following practice:
(i) import an off-the-shelf pretrained (masked) language model; (ii) append a
multilayer perceptron atop the CLS token's hidden representation (with randomly
initialized weights); and (iii) fine-tune the entire model on a downstream task
(MLP-FT). This procedure has produced massive gains on standard NLP benchmarks,
but these models remain brittle, even to mild adversarial perturbations. In
this work, we demonstrate surprising gains in adversarial robustness enjoyed by
Model-tuning Via Prompts (MVP), an alternative method of adapting to downstream
tasks. Rather than appending an MLP head to make output prediction, MVP appends
a prompt template to the input, and makes prediction via text
infilling/completion. Across 5 NLP datasets, 4 adversarial attacks, and 3
different models, MVP improves performance against adversarial substitutions by
an average of 8% over standard methods and even outperforms adversarial
training-based state-of-the-art defenses by 3.5%. By combining MVP with adversarial
training, we achieve further improvements in adversarial robustness while
maintaining performance on unperturbed examples. Finally, we conduct ablations
to investigate the mechanism underlying these gains. Notably, we find that the
main causes of vulnerability of MLP-FT can be attributed to the misalignment
between pre-training and fine-tuning tasks, and the randomly initialized MLP
parameters.
comment: Accepted to the EMNLP 2023 Conference
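Prediction via infilling, rather than an MLP head, can be sketched by
appending a prompt template with a mask slot and comparing the masked-LM
logits of verbalizer tokens. The template, the verbalizers ("great" /
"terrible"), and the model are illustrative assumptions, not the paper's
exact configuration.
```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("roberta-base")
mlm = AutoModelForMaskedLM.from_pretrained("roberta-base").eval()
verbalizers = {" great": "positive", " terrible": "negative"}

@torch.no_grad()
def classify(text: str) -> str:
    prompt = f"{text} It was {tok.mask_token}."
    ids = tok(prompt, return_tensors="pt").input_ids
    pos = (ids == tok.mask_token_id).nonzero()[0, 1]  # mask position
    logits = mlm(ids).logits[0, pos]
    scores = {lab: logits[tok.encode(w, add_special_tokens=False)[0]].item()
              for w, lab in verbalizers.items()}
    return max(scores, key=scores.get)

print(classify("The movie was a delight from start to finish."))
```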
♻ ☆ DADA: Dialect Adaptation via Dynamic Aggregation of Linguistic Rules EMNLP 2023
Existing large language models (LLMs) that mainly focus on Standard American
English (SAE) often lead to significantly worse performance when applied
to other English dialects. While existing mitigations tackle discrepancies for
individual target dialects, they assume access to high-accuracy dialect
identification systems. The boundaries between dialects are inherently
flexible, making it difficult to categorize language into discrete predefined
categories. In this paper, we propose DADA (Dialect Adaptation via Dynamic
Aggregation), a modular approach to imbue SAE-trained models with
multi-dialectal robustness by composing adapters which handle specific
linguistic features. The compositional architecture of DADA allows for both
targeted adaptation to specific dialect variants and simultaneous adaptation to
various dialects. We show that DADA is effective for both single task and
instruction finetuned language models, offering an extensible and interpretable
framework for adapting existing LLMs to different English dialects.
comment: EMNLP 2023
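Composing per-feature adapters over a frozen backbone can be sketched as a
weighted sum of adapter outputs with a residual connection. The dimensions
and the softmax gating scheme below are illustrative assumptions, not the
paper's architecture.
```python
import torch
import torch.nn as nn

class AdapterComposition(nn.Module):
    def __init__(self, hidden: int = 768, n_adapters: int = 4,
                 rank: int = 64):
        super().__init__()
        # One bottleneck adapter per linguistic feature.
        self.adapters = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden, rank), nn.ReLU(),
                          nn.Linear(rank, hidden))
            for _ in range(n_adapters))
        self.gate = nn.Parameter(torch.zeros(n_adapters))  # mixing weights

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        w = torch.softmax(self.gate, dim=0)
        delta = sum(w[i] * a(h) for i, a in enumerate(self.adapters))
        return h + delta  # residual: backbone plus aggregated adapters

h = torch.randn(2, 16, 768)  # (batch, seq, hidden) from a frozen model
out = AdapterComposition()(h)
```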