\section{Background \& Related Work}\label{sec:background}
\subsection{Dialog State Tracking (DST)}
\paragraph{} Task-oriented dialog systems, both modular and end-to-end, can handle a wide range of tasks (such as ticket booking, restaurant booking, etc.) across various domains. A task-oriented dialog system has stricter requirements for its responses because it needs to accurately understand and process the user's message. Therefore, modular methods were suggested as a way to generate responses in a more controlled manner. The architecture of a typical modular task-oriented dialog system is depicted in Figure \ref{figure:1}. Such a system uses a modular pipeline with four modules that execute sequentially: Natural Language Understanding (NLU), Dialog State Tracking (DST), Policy Learning (PL), and Natural Language Generation (NLG). The NLU module extracts semantic values from user messages and performs intent detection and domain classification. The DST module takes the extracted values and fills the slot-value pairs based on the entire dialog history. The PL module takes the slot-value pairs and decides the next action to be performed by the dialog system. The NLG module converts the dialog actions received from the PL module into natural language text, which is usually the system response to the user.
\vspace{0.5cm}
\begin{figure}[h!]
\centering
\includegraphics[width=\linewidth]{images/modular_tod}
\caption{Modular-based task-oriented dialog system \citep{ni2021dlds}}
\label{figure:1}
\end{figure}
The DST module is essential for enabling the system to comprehend the user's requests by tracking them in the form of slots and values (belief states) at every turn. In recent years, some dialog systems pass the user utterances directly to the DST module. This approach fills the slot-value pairs directly from the raw user message and eliminates the need for an NLU module. For example, given the user message \textquote{\textit{Plan a train trip to Berlin this Friday for two people}}, the DST module is expected to fill the (\textit{slot, value}) pairs as follows: \{(\textit{destination, Berlin}), (\textit{day, Friday}), (\textit{people, 2})\}.
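As a minimal illustration of this behaviour, the sketch below represents the belief state as a Python dictionary and merges per-turn predictions into the accumulated dialog state; the helper function and slot names are hypothetical and only mirror the example above.
\begin{verbatim}
# Illustrative sketch: a belief state is a mapping from slots to
# values, accumulated over the turns of a dialog (hypothetical
# helper, not tied to any specific DST implementation).
def update_belief_state(belief_state, turn_prediction):
    """Merge the slot-value pairs predicted for the current turn
    into the dialog-level belief state."""
    updated = dict(belief_state)
    updated.update(turn_prediction)
    return updated

# Turn 1: "Plan a train trip to Berlin this Friday for two people"
state = update_belief_state({}, {"destination": "Berlin",
                                 "day": "Friday",
                                 "people": "2"})
# Turn 2: "Actually, make it Saturday" -> only the changed slot moves
state = update_belief_state(state, {"day": "Saturday"})
# state == {"destination": "Berlin", "day": "Saturday", "people": "2"}
\end{verbatim}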
\paragraph{} A typical task-oriented dialog system can assist users across multiple domains (restaurant, hotel, train, booking). Each dialog domain has an ontology, which represents the knowledge of the domain and the information required for its tasks. The ontology of a domain consists of a pre-defined set of slots and all the possible values for each slot. Neural-based models were proposed to solve the DST task as multi-class classification, where the model predicts the correct value for each slot from the set of candidates. This approach depends on the ontology of the domains and needs to track a large number of slot-value pairs. An ontology is hard to obtain in real-world scenarios, especially for new domains. Such neural-based DST models also need a lot of training data, which is rarely available for new domains.
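For illustration, an ontology can be viewed as a mapping from each slot to its set of admissible values; the restaurant slots and values in the sketch below are hypothetical examples rather than an actual dataset ontology.
\begin{verbatim}
# Hypothetical fragment of a domain ontology: every slot enumerates
# all values a classification-based DST model must choose from.
restaurant_ontology = {
    "food":       ["italian", "chinese", "indian", "british"],
    "pricerange": ["cheap", "moderate", "expensive"],
    "area":       ["centre", "north", "south", "east", "west"],
}
# An ontology-based model scores every candidate value per slot,
# which becomes impractical when slots have many or unseen values.
\end{verbatim}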
\paragraph{} A dialog state or belief state in DST contains the information the system requires to process the user's request. At each turn of the dialog, the dialog state can contain \textit{informable slots} and \textit{requestable slots}. Informable slots capture the user's preferences and requirements for the system action. For example, in the restaurant domain, the user can ask for a specific type of food or a desired price range when booking a table; the belief state slots with such information are called \textit{informable slots}. Users can also ask the dialog system for the address or phone number of a restaurant; such slots are known as \textit{requestable slots}. This thesis focuses on the DST module to extract informable slots and their values without depending on the ontology.
\subsection{Pre-trained Language Models (PLMs)}
\paragraph{} Large pre-trained language models are trained on huge amounts of textual data and have achieved state-of-the-art performance in a variety of NLP tasks, such as machine translation, text classification, text generation, and summarization. PLMs trained on such large-scale datasets encode significant linguistic knowledge in their large number of parameters. Pre-trained language models based on the transformer architecture \citep{vaswani2017attention}, such as BERT \citep{devlin2019bert} and GPT \citep{radford2018gpt}, have achieved state-of-the-art results across many of these tasks. GPT-2 \citep{radford2019gpt2} is a transformer-based left-to-right auto-regressive language model trained on large amounts of open web text. Its main training objective is to predict the next word given all the previous words, which is equivalent to assigning a probability to word sequences. For a sequence of words $x = x_1, x_2, \ldots, x_n$, the probability can be factorized using the chain rule from left to right:
$$
P(x) = P\left(x_1\right) \times P\left(x_2 \mid x_1\right) \times \ldots \times P\left(x_n \mid x_1 \cdots x_{n-1}\right)
$$
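As a concrete illustration of this factorization, the following sketch scores a word sequence with GPT-2 by summing per-token log-probabilities. It assumes the Hugging Face \texttt{transformers} library, which is only one possible implementation choice and not necessarily the setup used elsewhere in this thesis.
\begin{verbatim}
# Sketch: score a sequence under GPT-2 via the chain rule, i.e.
# log P(x) = sum_t log P(x_t | x_<t) (the first token is excluded,
# since GPT-2 assigns no explicit start-of-sequence probability).
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

ids = tokenizer("Plan a train trip to Berlin",
                return_tensors="pt").input_ids
with torch.no_grad():
    logits = model(ids).logits                    # (1, T, vocab)

log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
token_lp = log_probs.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
sequence_log_prob = token_lp.sum()   # log P(x_2 ... x_T | x_1)
\end{verbatim}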
\paragraph{} PLMs trained on large amounts of text can be fine-tuned with task-specific data to solve downstream tasks efficiently. Previous work by \citet{wu2020tod-bert} pre-trained the BERT model on nine different task-oriented dialog datasets and later fine-tuned it on downstream tasks. This approach improved the performance on downstream tasks over fine-tuning BERT directly. \textsc{Soloist} \citep{peng2021soloist} used a similar approach, pre-training the GPT-2 model on two task-oriented dialog corpora and fine-tuning the pre-trained \textsc{Soloist} on the DST task.
The pre-trained \textsc{Soloist}, which uses the publicly available 12-layer GPT-2 (117M) model, is the baseline model of this thesis. The prompt-based methods in this thesis also fine-tune the pre-trained \textsc{Soloist} for the prompt-based DST task.
\subsection{SOLOIST Model} \label{subsec:soloist}
\paragraph{} \textsc{Soloist} \citep{peng2021soloist} uses the \textsl{pre-train, fine-tune} paradigm for building a task-oriented dialog system on top of the auto-regressive language model GPT-2 \citep{radford2019gpt2}. The dialog system is built in two phases. In the pre-training phase, \textsc{Soloist} is initialized with GPT-2 and further trained on two large task-oriented datasets, Schema and Taskmaster. The primary goal at this stage is to learn task completion skills such as \textit{belief prediction} and \textit{response generation}. In the belief prediction task of the pre-training stage, the \textsc{Soloist} model takes the dialog history as input and generates the belief state as a sequence of words. The generated belief state sequence takes the form \textquote{\textit{belief: $slot_1 = value_1; slot_2 = value_2; \ldots$}}. The pre-training objective for predicting belief states is:
$$
\mathcal{L}=\log P(b \mid s)=\sum_{t=1}^{T_b} \log P\left(b_t \mid b_{<t}, s\right)
$$
where $T_b$ is the length of the generated belief state sequence, $b_{<t}$ denotes all tokens before position $t$, and $s$ is the dialog history up to the current turn. Overall, $\log P(b \mid s)$ is the log-probability of generating the belief state sequence given the dialog history.
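A simplified sketch of how such an objective can be implemented is shown below: the dialog history and the belief string are concatenated, and the language-modelling loss is restricted to the belief tokens by masking the history positions. This is an illustrative approximation, not the original \textsc{Soloist} implementation.
\begin{verbatim}
# Simplified sketch of the belief-prediction objective: train the
# model to generate the belief string conditioned on the dialog
# history; history tokens are masked with -100 so only belief tokens
# contribute to the loss.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

history = "user: plan a train trip to berlin this friday for two"
belief  = " belief: destination = berlin; day = friday; people = 2"

history_ids = tokenizer(history, return_tensors="pt").input_ids
belief_ids  = tokenizer(belief, return_tensors="pt").input_ids
input_ids   = torch.cat([history_ids, belief_ids], dim=1)

labels = input_ids.clone()
labels[:, :history_ids.size(1)] = -100   # ignore history positions

loss = model(input_ids, labels=labels).loss  # ~ -log P(b | s) / T_b
loss.backward()
\end{verbatim}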
\paragraph{} In the fine-tuning stage, the pre-trained \textsc{Soloist} model can be adapted to new tasks using only a small number of task-specific dialogs. The belief prediction task of the \textsc{Soloist} model can be fine-tuned on new task-oriented dialog datasets to solve DST tasks. At inference time, the fine-tuned \textsc{Soloist} uses top-K \citep{fan2018topk} and nucleus \citep{holtzman2020topp} sampling to generate belief states as a sequence of words. In top-K sampling, only the K most likely next words are kept, and the probability mass is redistributed among those K words. In nucleus sampling (also known as \textit{top-p} sampling), the smallest set of words whose cumulative probability exceeds the threshold $p$ is kept. This approach of generating belief states does not depend on an ontology of slots and values. Fine-tuning the belief prediction task of the pre-trained \textsc{Soloist} is the baseline model for this thesis, and the same pre-trained \textsc{Soloist} is used to fine-tune the prompt-based methods.
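The following sketch shows a generic implementation of top-K and nucleus filtering applied to the next-token logits before sampling; it is not the exact decoding code used by \textsc{Soloist}.
\begin{verbatim}
# Generic top-K / nucleus (top-p) filtering of next-token logits.
import torch

def filter_logits(logits, k=50, p=0.9):
    # Top-K: keep only the K most likely tokens.
    if k > 0:
        kth_best = torch.topk(logits, k).values[..., -1, None]
        logits = logits.masked_fill(logits < kth_best, float("-inf"))
    # Nucleus: keep the smallest set of tokens whose cumulative
    # probability exceeds p.
    if p < 1.0:
        sorted_logits, sorted_idx = torch.sort(logits, descending=True)
        cumulative = torch.softmax(sorted_logits, dim=-1).cumsum(dim=-1)
        to_remove = cumulative > p
        to_remove[..., 1:] = to_remove[..., :-1].clone()
        to_remove[..., 0] = False        # always keep the best token
        remove_mask = to_remove.scatter(-1, sorted_idx, to_remove)
        logits = logits.masked_fill(remove_mask, float("-inf"))
    return logits

# probs = torch.softmax(filter_logits(logits), dim=-1)
# next_token = torch.multinomial(probs, num_samples=1)
\end{verbatim}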
\subsection{Prompt Learning}
\paragraph{} Prompt-based learning is a new way of using pre-trained language models more efficiently for solving language tasks. It reformulates the task using textual prompts so that the language model generates the desired output directly from the prompt. The main idea behind this approach is to exploit the generation capabilities of PLMs. Table \ref{table:1} introduces the terminology and notation together with an emotion classification example. The original input $x$ is modified using the \textit{prompting function}, which generates the \textit{prompt} $x^{\prime}$. The prompting function or \textit{prompt template} typically contains text and two slots: the input slot $[X]$ for filling in the input $x$ and the answer slot $[Z]$ for generating the answer $z$. The prompt $x^{\prime}$ is given to the PLM to directly generate the answer $z$. For tasks such as emotion classification, an additional answer-mapping step is required to obtain the final output $y$ from the answer $z$. For example, multiple emotion-related words (such as \textit{happy, joyful, delighted, pleased}) can belong to the same output class (e.g. \textquote{\textit{joy}}); in this case, if the PLM generates the answer \textquote{\textit{happy}}, it is mapped to the output class \textquote{\textit{joy}}. For some text generation tasks, answer mapping is usually not required; the generated answer $z$ directly becomes the output $y$.
\vspace{0.5cm}
\begin{table}[!ht]
\centering
\begingroup
\setlength{\tabcolsep}{12pt} % Default value: 6pt
\renewcommand{\arraystretch}{1.4} % Default value: 1
\begin{tabular}{l c l}
\hline
\textbf{Name} & \textbf{Notation} & \textbf{Example}\\
\hline
\textit{Input} & $x$ & I missed the bus today. \\
\textit{Output} & $y$ & sad \\
\hline
\textit{Prompt Function} & $f_{prompt}(x)$ & $[X]$ I felt so $[Z]$ \\
\hline
\textit{Prompt} & $x^{\prime}$ & I missed the bus today. I felt so $[Z]$ \\
\textit{Answered Prompt} & $f_{fill}(x^{\prime}, z)$ & I missed the bus today. I felt so sad \\
\hline
\textit{Answer} & $z$ & \textit{happy}, \textit{sad}, \textit{scared} \\
\hline
\end{tabular}
\endgroup
\caption{Terminology and notations of prompting methods}
\label{table:1}
\end{table}
\paragraph{} Consider the emotion classification example from Table \ref{table:1}, where the goal is to recognize the emotion in the text. The input is $x$ = \textquote{I missed the bus today.} and the prompting function is \textquote{$[X]$ I felt so $[Z]$}. After $[X]$ is filled with the input $x$, the prompt $x^{\prime}$ becomes \textquote{I missed the bus today. I felt so $[Z]$}, and the PLM is expected to fill the slot $[Z]$ with the emotion word \textquote{sad}.
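This example can be written as a small prompting function with an explicit answer-mapping step; the answer-to-class mapping below is a hypothetical illustration of the mapping described in the previous paragraph.
\begin{verbatim}
# Prompting function for the emotion example: fill the input slot
# [X], let the PLM fill [Z], then map the answer to an output class.
TEMPLATE = "[X] I felt so [Z]"

# Hypothetical answer-to-class mapping (several answers per class).
ANSWER_MAP = {"happy": "joy", "joyful": "joy", "delighted": "joy",
              "pleased": "joy", "sad": "sad", "scared": "fear"}

def make_prompt(x):
    """f_prompt(x): insert the input into the template."""
    return TEMPLATE.replace("[X]", x).replace("[Z]", "").strip()

def map_answer(z):
    """Map the PLM's answer z to the final output class y."""
    return ANSWER_MAP.get(z, "unknown")

prompt = make_prompt("I missed the bus today.")
# prompt == "I missed the bus today. I felt so"
# if the PLM generates z = "sad", then y = map_answer("sad") == "sad"
\end{verbatim}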
\paragraph{Prompt types} There are two main varieties of prompts: \textit{prefix prompts} and \textit{cloze prompts}. In prefix prompts, the entire prompt text comes before the slot $[Z]$. For example, in the prompt \textquote{I like this movie. The movie is $[Z]$}, the slot $[Z]$ is at the end. In cloze prompts, the slot to be filled $[Z]$ appears in the middle or at the beginning of the prompt text. For example, in the prompt \textquote{Berlin is the capital of Germany. $[Z]$ is the capital of Japan}, the slot $[Z]$ is in the middle of the prompt text. For tasks solved with a left-to-right auto-regressive language model, prefix prompts are more suitable because they align with the left-to-right generation order of the model. There are multiple ways of creating prompts: \textit{manual prompts, discrete prompts, and continuous prompts}. For \textit{manual prompts}, the templates are hand-crafted by humans based on intuition about the task. These manual prompts generally contain a few natural language phrases and are usually task-specific. This approach can be time-consuming and often requires extensive experimentation. For \textit{discrete prompts}, the templates are found with automated methods such as prompt mining, gradient-based search, and generation from a language model. These templates are also natural language phrases. This approach might require a large amount of training data to find good prompts. For \textit{continuous prompts}, the templates are expressed directly in the embedding space of the language model. These prompts have their own parameters and can be tuned on the training data of the task.
\paragraph{Training strategy} Prompting methods can be used without any training of the PLM for the downstream task. This is done by taking a suitable pre-trained LM and applying the prompts directly to the task inputs, an approach traditionally known as \textit{zero-shot learning}. However, the zero-shot approach carries a risk of bias, as the PLM is not fine-tuned for the task, and it is less effective on tasks that differ from what the PLM was trained on. \textit{Few-shot learning} is another approach, in which only a small number of task-specific training samples are used to train the language model. Prompting methods are particularly useful in this setting, when there is not enough task-specific training data to fully train the model. There are different training strategies for the prompts and the LM: fix the LM parameters and fine-tune the prompts, fix the prompts and update the LM parameters, or fine-tune both the LM parameters and the prompts. Among these strategies, \textit{fixed-prompt LM tuning} improves the PLM by fine-tuning it with fixed prompts. This is similar to the standard fine-tuning paradigm: fixed prompts are applied to the training inputs, and the PLM is fine-tuned on the prompted examples. This approach helps the PLM understand the downstream task and can lead to improvements under few-shot settings.
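A rough sketch of fixed-prompt LM tuning is given below: a hand-written template is kept fixed, every training example is wrapped with it, and only the PLM parameters are updated. The template and helper are illustrative assumptions, not a specific published recipe.
\begin{verbatim}
# Sketch of fixed-prompt LM tuning: the template is fixed (not
# trainable); only the PLM parameters are updated on the prompted
# training examples.
PROMPT = "[X] I felt so"          # fixed, hand-crafted template

def prompted_dataset(examples):
    # examples: list of (input_text, answer_word) pairs
    return [PROMPT.replace("[X]", x) + " " + z for x, z in examples]

# The prompted strings are then used as ordinary language-modelling
# training data: the PLM is fine-tuned to generate the answer word
# after the fixed prompt, exactly as in standard fine-tuning.
\end{verbatim}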
\subsection{Prompt-based DST}
\paragraph{} In previous work on TOD-BERT \citep{wu2020tod-bert}, the BERT language model is pre-trained on nine different task-oriented dialog datasets. The pre-trained TOD-BERT is then fine-tuned on multiple task-oriented dialog tasks (intent recognition, dialog state tracking) and evaluated under few-shot settings. The DST task in this work is treated as a multi-class classification problem, predicting the slots and values from a pre-defined ontology. \textsc{Soloist} \citep{peng2021soloist} used a similar approach, pre-training the GPT-2 language model on two dialog datasets. The downstream DST task of the \textsc{Soloist} model does not depend on a domain ontology and directly generates belief states as word sequences (described in section \ref{subsec:soloist}). Neither TOD-BERT nor \textsc{Soloist} uses prompt-based methods, and both perform poorly under extremely low-resource settings. The pre-trained \textsc{Soloist} model is adopted as the baseline in this thesis to explore prompt-based methods for DST.
\paragraph{} Previous work by \citet{lee2021sdp} used prompting methods on a PLM to solve the DST task. This work introduced schema-driven prompting (a \textit{slot-based prompt}) that fills the prompts with domain names, slots, and natural language descriptions of the slots, and generates the corresponding values. This method relies on the complete ontology of the domains and their slot descriptions, and it requires a large amount of training data to fine-tune the PLM. \citet{yang2022prompt} proposed a new prompt-learning framework for DST that uses values in the prompts (a \textit{value-based prompt}) and generates the slots directly from the PLM. This value-based prompt approach does not depend on the ontology of the domains. The work designed a \textit{value-based prompt} and an \textit{inverse prompt} to help the PLM solve the DST task during the fine-tuning stage. Figure \ref{figure:2} shows an overview of the value-based prompt and inverse prompt for DST.
\vspace{0.4cm}
\begin{figure}[h!]
\centering
\includegraphics[width=\linewidth]{images/prompt_dst}
\caption{Overview of value-based prompt and inverse prompt mechanism.}
\label{figure:2}
\end{figure}
\paragraph{} First, the belief state slots are generated using the value-based prompt. The generated slots are then given to the inverse prompt to generate the values back. The inverse prompt function can be considered an auxiliary task that helps the PLM understand the DST task. The losses from the value-based prompt and the inverse prompt are combined during the training phase. At inference time, the value-based prompt is used directly to generate the slots without depending on the ontology. \citet{yang2022prompt} showed that this prompt-learning framework can learn the DST task efficiently even under extremely low-resource settings. These prompt-based methods are further explored in this thesis. The experimental methods for prompt-based DST are detailed in section \ref{subsec:prompt_dst}.
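The two prompt directions can be sketched as simple template functions as shown below; the template wording is an assumption for illustration and may differ from the exact prompts used by \citet{yang2022prompt}.
\begin{verbatim}
# Rough sketch of the two prompt directions used during fine-tuning.
# The template wording is illustrative, not the exact published one.
def value_based_prompt(history, value):
    # The PLM should generate the slot name for the given value.
    return f"{history} The value {value} belongs to slot"

def inverse_prompt(history, slot):
    # Auxiliary direction: the PLM should generate the value back.
    return f"{history} The slot {slot} has value"

history = "user: plan a train trip to berlin this friday"
forward = value_based_prompt(history, "berlin")   # -> "destination"
inverse = inverse_prompt(history, "destination")  # -> "berlin"
# During training, the LM losses of the forward and inverse prompts
# are combined; at inference only the value-based prompt is used.
\end{verbatim}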
\subsection{MultiWOZ Dataset}
\paragraph{} MultiWOZ \citep{budzianowski2018multiwoz} is a widely used dialog dataset for benchmarking task-oriented dialog systems. MultiWOZ is a collection of written human-human conversations centered around a wide range of topics, such as hotel booking, restaurant reservation, attraction recommendation, and train booking. It contains over 10K dialogues across 8 domains, each with multiple turns. Every dialogue contains user and system utterances together with belief states (slot-value pairs) for each turn. \citet{eric2019multiwoz} released MultiWOZ 2.1 after fixing noisy dialog state annotations and utterances that negatively impact the performance of DST models. In this thesis, the MultiWOZ 2.1 dataset is used to evaluate the baseline and the prompt-based methods on the DST task.
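For orientation, a single annotated turn can be pictured as in the simplified sketch below; the field names and the example restaurant are hypothetical and do not reproduce the exact MultiWOZ JSON schema.
\begin{verbatim}
# Simplified, hypothetical rendering of one annotated MultiWOZ-style
# turn; the real dataset uses a more detailed JSON schema.
turn = {
    "user":   "I need a cheap italian restaurant in the centre.",
    "system": "There is a cheap italian place in the centre. "
              "Shall I book a table?",
    "belief_state": {
        "restaurant-food":       "italian",
        "restaurant-pricerange": "cheap",
        "restaurant-area":       "centre",
    },
}
\end{verbatim}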