\section{Background \& Related Work}\label{sec:background}
\subsection{Dialog State Tracking (DST)}
\paragraph{} Task-oriented dialog systems, whether modular or end-to-end, handle a wide range of tasks (such as ticket booking, restaurant booking, etc.) across various domains. A task-oriented dialog system has stricter requirements for its responses because it must accurately understand and process the user's message. Modular methods were therefore proposed as a way to generate responses in a more controlled manner. A typical modular system uses a pipeline of four modules that execute sequentially: Natural Language Understanding (NLU), Dialog State Tracking (DST), Policy Learning (POL), and Natural Language Generation (NLG). The DST module is essential for enabling the system to comprehend the user's requests by tracking them as slots and values (belief states) at every turn. For instance, in a dialog system that helps users book flights, the system might track slots such as destination, departure, travel date, and number of travelers. By keeping track of these slots and their values, the system can understand the user's requirements and provide this information to the next module. For example, given the user message \textquote{\textit{Plan a train trip to Berlin this Friday for two people}}, the DST module is expected to extract (\textit{slot, value}) pairs as follows: \{(\textit{destination, Berlin}), (\textit{day, Friday}), (\textit{people, 2})\}. In this thesis, the focus is on the DST module for extracting slots and values.
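\paragraph{} The belief state itself can be viewed as a simple key-value mapping that is updated at every turn. The following Python sketch illustrates this view for the example above; the slot names and the update logic are purely illustrative and do not correspond to an actual NLU component.
\begin{verbatim}
# Illustrative sketch: the belief state as a mapping from slot to value,
# merged with the (slot, value) pairs extracted at each new turn.
def update_belief_state(belief_state: dict, turn_slots: dict) -> dict:
    """Merge newly extracted (slot, value) pairs into the belief state."""
    belief_state.update(turn_slots)
    return belief_state

state = {}
# Turn 1: "Plan a train trip to Berlin this Friday for two people"
state = update_belief_state(state, {"destination": "Berlin",
                                    "day": "Friday",
                                    "people": "2"})
print(state)  # {'destination': 'Berlin', 'day': 'Friday', 'people': '2'}
\end{verbatim}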
\subsection{Pre-trained Language Models (PLMs)}
\paragraph{} Large pre-trained language models are trained on huge amounts of textual data and have achieved state-of-the-art performance in a variety of NLP tasks, such as machine translation, text classification, text generation, and summarization. By training on such large corpora, these models learn a probability distribution over sequences of words. Pre-trained language models based on the transformer architecture \citep{vaswani2017attention}, such as BERT \citep{devlin2019bert} and GPT \citep{radford2018gpt}, are prominent examples. GPT-2 \citep{radford2019gpt2} is a transformer-based auto-regressive language model trained on large amounts of open web text. GPT-2 is trained with a simple objective: predict the next word, given all previous words in the text. The architecture and training objective of a PLM play an important role in determining its applicability to particular prompting tasks \citep{liu2021ppp}. For example, left-to-right auto-regressive LMs assign a probability to a sequence of words by predicting each next word in turn; they therefore mesh well with \textit{prefix} prompts, where the generated text follows the entire prompt string.
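\paragraph{} As a concrete illustration of left-to-right generation from a prefix prompt, the following sketch uses the HuggingFace \texttt{transformers} library with the public 117M-parameter GPT-2 checkpoint; the prefix and the decoding settings are illustrative and are not the configuration used in this thesis.
\begin{verbatim}
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")  # 117M-parameter GPT-2
model = GPT2LMHeadModel.from_pretrained("gpt2")

prefix = "Plan a train trip to Berlin this Friday for"
input_ids = tokenizer(prefix, return_tensors="pt").input_ids

# Greedy decoding: each step predicts the next token given all previous ones.
output_ids = model.generate(input_ids, max_new_tokens=10, do_sample=False,
                            pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(output_ids[0]))
\end{verbatim}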
\paragraph{} The baseline model of this thesis, \textsc{Soloist} \citep{peng2021soloist}, uses a 12-layer GPT-2 to build the task-oriented dialog system. \textsc{Soloist} uses the publicly available 117M-parameter GPT-2 as initialization for task-grounded pre-training. The prompt-based methods in this thesis utilize the pre-trained \textsc{Soloist} and fine-tune it on the downstream DST task.
\subsection{SOLOIST Model}
\paragraph{} \textsc{Soloist} \citep{peng2021soloist} is a task-oriented dialog system that uses transfer learning and machine teaching to build task bots at scale. \textsc{Soloist} follows the \textit{pre-train, fine-tune} paradigm for building end-to-end dialog systems with the transformer-based auto-regressive language model GPT-2 \citep{radford2019gpt2}, subsuming the different dialog modules (i.e., NLU, DST, POL, NLG) into a single model. In the \textit{pre-train, fine-tune} paradigm, a \textit{pre-trained} LM is adapted to different downstream tasks by \textit{fine-tuning} its parameters using task-specific objective functions. In the pre-training stage, \textsc{Soloist} is initialized with the 12-layer GPT-2 (117M parameters) and further trained on large heterogeneous dialog corpora. The primary goal at this stage is to learn task completion skills such as belief state prediction (DST) and response generation. In the fine-tuning stage, the pre-trained \textsc{Soloist} model can be adapted to new tasks using just a handful of task-specific dialogs.
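\paragraph{} Conceptually, \textsc{Soloist} serializes each training dialog turn into a single token sequence consisting of the dialog history, the belief state, the database result, and the response, and trains the auto-regressive LM on this sequence. The sketch below shows one possible serialization; the delimiter tokens and the example strings are illustrative assumptions and do not reproduce the exact special tokens of the \textsc{Soloist} release.
\begin{verbatim}
# Illustrative serialization of one training example into a single
# sequence for auto-regressive training (delimiter tokens are assumed).
def serialize_example(history, belief, db, response):
    return (f"{history} <|belief|> {belief} "
            f"<|db|> {db} <|response|> {response}")

print(serialize_example(
    history="user : plan a train trip to berlin this friday for two people",
    belief="train { destination = berlin ; day = friday ; people = 2 }",
    db="3 matches",
    response="there are 3 trains to berlin on friday ."))
\end{verbatim}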
\paragraph{} In this thesis, the pre-trained \textsc{Soloist} is the baseline model for generating belief states. For the baseline DST task, the pre-trained \textsc{Soloist} is fine-tuned on the belief prediction task as open-ended text generation. For the prompt-based methods, the same \textsc{Soloist} model is fine-tuned to generate belief state slots using prompts. The results and outputs of the baseline model are compared with those of the prompt-based model for detailed analyses.
\subsection{Prompt Learning}
\paragraph{} Prompt-based learning (also dubbed \textit{\textquote{pre-train, prompt, and predict}}) is a new paradigm that aims to utilize PLMs more efficiently to solve downstream NLP tasks \citep{liu2021ppp}. In this paradigm, instead of adapting pre-trained LMs to downstream tasks through task-specific training objectives, the downstream tasks are reformulated to look more like those solved during the original LM training with the help of a textual \textit{prompt}. To perform prediction tasks, the original input $x$ is modified using a \textit{template} into a textual \textit{prompt} $x^{\prime}$ that has one or more unfilled slots; the PLM is then used to probabilistically fill these slots, yielding a final string $z$ from which the output $y$ can be derived. For text generation tasks, the generated answer $z$ itself is the output $y$. Table~\ref{table:1} summarizes the terminology.
\vspace{4pt}
\begin{table}[!ht]
\centering
\begingroup
\setlength{\tabcolsep}{12pt} % Default value: 6pt
\renewcommand{\arraystretch}{1.4} % Default value: 1
\begin{tabular}{l c l}
\hline
\textbf{Name} & \textbf{Notation} & \textbf{Example}\\
\hline
\textit{Input} & $x$ & I missed the bus today. \\
\textit{Output} & $y$ & sad \\
\hline
\textit{Prompt Function} & $f_{prompt}(x)$ & $[X]$ I felt so $[Z]$ \\
\hline
\textit{Prompt} & $x^{\prime}$ & I missed the bus today. I felt so $[Z]$ \\
\textit{Answered Prompt} & $f_{fill}(x^{\prime}, z)$ & I missed the bus today. I felt so sad \\
\hline
\textit{Answer} & $z$ & \textit{happy}, \textit{sad}, \textit{scared} \\
\hline
\end{tabular}
\endgroup
\caption{Terminology and notation of prompting methods}
\label{table:1}
\end{table}
\paragraph{} For example, to recognize the emotion in a text with \textit{input} $x = $ \textquote{I missed the bus today.}, the \textit{prompt function} (also called \textit{template}) may take a form such as \textquote{$[X]$ I felt so $[Z]$}, where $[X]$ takes the input text and $[Z]$ is to be generated by the LM. The \textit{prompt} $x^{\prime}$ then becomes \textquote{I missed the bus today. I felt so $[Z]$}, and the PLM is asked to fill the slot $[Z]$ with an emotion-bearing word. For some text generation tasks, the answer mapping from $z$ to $y$ is not required, as the generated text itself becomes the output. There are two main varieties of prompts: \textit{cloze prompts}, where the slot $[Z]$ is filled in the middle of the text, and \textit{prefix prompts}, where the input text comes entirely before $[Z]$. In general, for tasks that are solved with a standard auto-regressive LM, prefix prompts tend to be more helpful, as they mesh well with the left-to-right nature of the model.
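\paragraph{} The following sketch makes the prompting pipeline of Table~\ref{table:1} concrete. The template syntax and the candidate answers are illustrative; a real system would score each answered prompt with the PLM.
\begin{verbatim}
TEMPLATE = "[X] I felt so [Z]"

def f_prompt(x: str) -> str:
    """Apply the template to the input, leaving the answer slot unfilled."""
    return TEMPLATE.replace("[X]", x)

def f_fill(x_prime: str, z: str) -> str:
    """Fill the answer slot to obtain an answered prompt."""
    return x_prime.replace("[Z]", z)

x = "I missed the bus today."
x_prime = f_prompt(x)  # "I missed the bus today. I felt so [Z]"
for z in ["happy", "sad", "scared"]:  # candidate answers
    print(f_fill(x_prime, z))  # each answered prompt is scored by the LM
\end{verbatim}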
\paragraph{} Prompt-based methods can be used without any explicit training of the LM for the downstream task, simply by taking a suitable pre-trained LM and applying the prompts defined for the task; this approach is traditionally called \textit{zero-shot learning}. In \textit{few-shot learning}, by contrast, a small number of data samples are used to train the language model. Prompting methods are particularly useful in few-shot settings, as there is generally not enough training data to fully specify the desired behavior. \textit{Fixed-prompt LM tuning} is a training strategy that fine-tunes the parameters of the LM, as in the standard \textit{pre-train, fine-tune} paradigm, while using discrete prompts (\textit{hard prompts}) to help the PLM understand the downstream task. This approach can lead to improvements, particularly in few-shot scenarios.
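\paragraph{} In fixed-prompt LM tuning, every training example is wrapped in the same discrete prompt before the LM is fine-tuned with the standard objective. The prompt wording in the sketch below is an assumption for illustration, not the template used in this thesis.
\begin{verbatim}
HARD_PROMPT = "belief states :"  # assumed hard prompt, for illustration

def wrap(dialog_history: str) -> str:
    """Append the fixed hard prompt to every training input."""
    return f"{dialog_history} {HARD_PROMPT}"

train_pairs = [("user : i need a train to berlin on friday",
                "destination = berlin ; day = friday")]
for history, target in train_pairs:
    model_input = wrap(history)
    # model_input and target feed the standard LM fine-tuning loss
    print(model_input, "->", target)
\end{verbatim}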
\subsection{Prompt-based DST}
\paragraph{} Previous work by \citet{lee2021sdp} places belief state slots in the prompts, along with natural language descriptions of the schema, to generate the corresponding values. This \textit{slot-based} prompt DST approach uses an encoder-decoder LM with a bi-directional encoder. The method relies on a known slot ontology and requires a large amount of training data for fine-tuning the PLM. In real-world applications, defining all possible slots in advance is difficult, as new domains and user needs continually emerge. \citet{yang2022prompt} proposed a prompt-learning framework for DST that places values in the prompts (\textit{value-based}) and generates the slots directly from the PLM. This value-based prompt approach relies neither on the slot ontology nor on natural language descriptions of the slots. Prompt-based DST methods for task-oriented dialog systems are still under-explored. In this thesis, the value-based prompt approach is applied to few-shot DST.
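\paragraph{} The two prompt styles can be contrasted with a short sketch. Both prompt wordings below are illustrative paraphrases, not the exact templates of the cited papers.
\begin{verbatim}
utterance = "plan a train trip to berlin this friday for two people"

# Slot-based prompting: the slot and its natural language description
# are placed in the prompt, and the value is generated by the LM.
slot_prompt = (utterance +
               " [slot] destination : the city the user travels to [value]")
# expected generation: "berlin"

# Value-based prompting: a candidate value is placed in the prompt,
# and the slot name is generated directly.
value_prompt = utterance + " [value] berlin [slot]"
# expected generation: "destination"
print(slot_prompt)
print(value_prompt)
\end{verbatim}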
\subsection{MultiWOZ Dataset}
\paragraph{} MultiWOZ \citep{budzianowski2018multiwoz} is a multi-domain task-oriented dialog dataset that contains over 10K dialogs across 8 domains. It is a fully-labeled collection of human-human written conversations and has been widely used for benchmarking DST methods. \citet{eric2019multiwoz} released MultiWOZ 2.1, which fixes noisy dialog state annotations and utterances that negatively impact the performance of DST models. In this thesis, MultiWOZ 2.1 is used to benchmark both the baseline and the prompt-based methods.