\section{Methods}\label{sec:methods}

This section describes the research methods and experimental setup of the work conducted in this thesis. The work can be divided into the following tasks: \textsc{Soloist} baseline implementation for few-shot DST, prompt-based methods for few-shot DST, evaluation and analysis of belief state predictions, and multi-prompt methods for DST.

\subsection{Dataset}
The baseline and prompt-based methods are benchmarked on the MultiWOZ 2.1 dataset \citep{eric2019multiwoz}. The MultiWOZ dataset contains 8438/1000/1000 single-domain and multi-domain dialogues for training/validation/testing, respectively. Each dialogue can have multiple turns, and each turn can include multiple \textit{(slot, value)} pairs. Dialogues from only five domains (\textit{Restaurant, Hotel, Attraction, Taxi, Train}) and one sub-domain (\textit{Booking}) are used in the experiments, as the other two domains (\textit{Hospital, Police}) only appear in the training set. To observe the performance under few-shot settings, dialogues are randomly sampled for each domain and six different data splits are created. Each data split contains dialogues from all five domains, and the dialogues are evenly distributed across the domains. Only single-domain dialogues (including the \textit{Booking} sub-domain) are picked when creating the data splits. The validation and test sets are only filtered by domain and are not down-sampled. Table \ref{table:2} provides the data statistics and a summary of the data splits used in the few-shot experiments.

\vspace{0.25cm}
\begin{table}[h!]
\centering
\begingroup
\setlength{\tabcolsep}{10pt} % Default value: 6pt
\renewcommand{\arraystretch}{1.2} % Default value: 1
\begin{tabular}{|l|c|c|c|}
\hline
\textbf{Data Splits} & \textbf{\# Dialogues} & \textbf{\# Total Turns} & \textbf{\# (slot, value)} \\ \hline
\textsl{5-dpd} & 25 & 100 & 294 \\ \hline
\textsl{10-dpd} & 50 & 234 & 758 \\ \hline
\textsl{50-dpd} & 250 & 1114 & 3535 \\ \hline
\textsl{100-dpd} & 500 & 2292 & 7408 \\ \hline
\textsl{125-dpd} & 625 & 2831 & 9053 \\ \hline
\textsl{250-dpd} & 1125 & 5187 & 17214 \\ \hline
\textsl{valid} & 190 & 900 & 3106 \\ \hline
\textsl{test} & 193 & 894 & 3411 \\ \hline
\end{tabular}
\endgroup
\caption{Data statistics and data split summary for the few-shot experiments. The term \textsl{dpd} means \textsl{\textquote{dialogues per domain}}. Each split contains dialogues for all five domains. In the data split \textsl{250-dpd}, the domain \textquote{\textit{Attraction}} contains only 125 dialogues.}
\label{table:2}
\end{table}

In the MultiWOZ 2.1 dataset, 16 dialog slots are used to capture the user requirements. For the prompt-based experiments, these slots are converted into natural-language words for fine-tuning the slot generation process. Table \ref{table:3} lists the slots from all five domains and the \textit{Booking} sub-domain.

\begin{table}[!ht]
\centering
\begin{tabular}{l}
\hline
\multicolumn{1}{c}{\textbf{Slots}} \\ \hline
\textsl{area, arrive, day, departure, destination, food, internet, leave,} \\
\textsl{name, parking, people, price, stars, stay, time, type} \\ \hline
\end{tabular}
\caption{Slots from the MultiWOZ 2.1 dataset used in the prompt-based experiments}
\label{table:3}
\end{table}

\subsection{SOLOIST Baseline}
\textsc{Soloist} \citep{peng2021soloist} is the baseline model for the prompt-based methods. \textsc{Soloist} is initialized with the 12-layer GPT-2 \citep{radford2019gpt2} and further trained on two task-oriented dialog corpora (Schema and Taskmaster). This task-grounded pre-training enables the \textsc{Soloist} model to solve two dialog-related tasks: \textit{belief state prediction} and \textit{response generation}. In the belief state prediction task, the model takes the dialog history as input and generates the belief states as a sequence of words. In this thesis, for the baseline implementation, the pre-trained \textsc{Soloist} is fine-tuned on the MultiWOZ 2.1 data splits to perform belief state prediction. At inference time, the fine-tuned \textsc{Soloist} baseline does not need a pre-defined set of slots and their possible values, and it uses top-K \citep{fan2018topk} and top-p (nucleus) \citep{holtzman2020topp} sampling to generate the belief states. For the prompt-based DST task, the same pre-trained \textsc{Soloist} model is fine-tuned for prompt-based slot generation.
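As a minimal illustration of this decoding setup, the following Python sketch generates a belief state sequence with top-K and top-p sampling through the Huggingface Transformers API used in this thesis. The checkpoint name, the prompt separator, and the sampling hyperparameters are illustrative assumptions, not the exact configuration of the \textsc{Soloist} baseline.

\begin{verbatim}
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# "gpt2" is a placeholder; the baseline would load the released
# SOLOIST weights (a 12-layer GPT-2 checkpoint) instead.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def generate_belief_state(history: str, max_new_tokens: int = 64) -> str:
    """Decode a belief state sequence from the dialog history with
    top-K and top-p (nucleus) sampling."""
    # The "=> belief states:" separator is an assumed prompt format.
    inputs = tokenizer(history + " => belief states:", return_tensors="pt")
    with torch.no_grad():
        out = model.generate(
            **inputs,
            do_sample=True,
            top_k=50,    # assumed value of K
            top_p=0.9,   # assumed value of p
            max_new_tokens=max_new_tokens,
            pad_token_id=tokenizer.eos_token_id,
        )
    return tokenizer.decode(out[0, inputs.input_ids.size(-1):],
                            skip_special_tokens=True)
\end{verbatim}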
\subsection{Prompt-based few-shot DST}
\label{subsec:prompt_dst}
This task aims to apply the prompt-based methods proposed by \citet{yang2022prompt} and reproduce the results. It utilizes the \textit{value-based prompt} and the \textit{inverse prompt} to fine-tune the pre-trained \textsc{Soloist}, which can then generate the belief state slots directly at inference time. The prompt-based methods are evaluated on the same data splits (Table \ref{table:2}) of the MultiWOZ 2.1 dataset.

\paragraph{Value-based prompt}
An intuitive idea for generating (\textit{slot, value}) pairs is to use slots in the prompts and generate the corresponding values \citep{lee2021sdp}. For example, given the utterance \textquote{\textsl{Plan a trip to Berlin}} and the slot (\textsl{destination}), the prompt to the PLM could become \textquote{\textsl{Plan a trip to Berlin. destination = [z]}}, and the PLM is expected to generate \textsl{[z]} as \textquote{\textsl{Berlin}}. However, this approach relies on the ontology of the slots, and the fixed set of slots can change in real-world applications. \citet{yang2022prompt} proposed the \textit{value-based prompt}, which uses values in the prompts and generates the corresponding slots. This method does not require any pre-defined set of slots and can generate slots directly from the PLM. Consider the prompt template \textquote{\textsl{belief states: value = [v], slot = [s]}}. The prompt function $f$ can then take the form $f(v) = $ \textsl{[dialog history] belief states: value = [v], slot = [s]}; given the value candidate $v = $ \textquote{\textsl{Berlin}}, the PLM is expected to generate the slot \textsl{[s] = \textquote{destination}}. The overall training objective of value-based prompt generation is to minimize the negative log-likelihood of the slots in the training dataset $D$:
\begin{equation} \label{eq:1} \mathcal{L}=-\sum_{t}^{|D|} \log P\left(s_{t} \mid c_{t}, f\left(v_{t}\right)\right) \end{equation}
where $P\left(s_{t} \mid c_{t}, f\left(v_{t}\right)\right)$ is the probability of generating the slot $s_t$ given the dialog history $c_t$ and the prompt function $f$ filled with the value $v_t$ at turn $t$. The loss $\mathcal{L}$ from this step is combined with the loss from the inverse prompt (described next) to compute the final loss. During training, the annotated values from the dataset are used to fill in the value-based prompts.
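The following sketch shows how the loss in Equation \ref{eq:1} could be computed for a single turn, reusing the \texttt{tokenizer} and \texttt{model} from the earlier sketch. Masking the prompt positions with the ignore index $-100$ ensures that only the slot tokens contribute to the negative log-likelihood; the template string follows Table \ref{table:4}.

\begin{verbatim}
import torch
# tokenizer and model as loaded in the earlier sketch

def value_prompt(history: str, value: str) -> str:
    # f(v) = "[dialog history] belief states: value = [v], slot = [s]"
    return f"{history} belief states: value = {value}, slot ="

def slot_nll(history: str, value: str, slot: str) -> torch.Tensor:
    """Negative log-likelihood of the gold slot tokens (Eq. 1)."""
    prompt_ids = tokenizer(value_prompt(history, value),
                           return_tensors="pt").input_ids
    slot_ids = tokenizer(" " + slot, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, slot_ids], dim=-1)
    labels = input_ids.clone()
    labels[:, :prompt_ids.size(-1)] = -100  # mask out the prompt positions
    return model(input_ids, labels=labels).loss
\end{verbatim}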
\paragraph{Inverse Prompt}
The \textit{inverse prompt} mechanism \citep{yang2022prompt} aims to generate the values by filling the prompts with the generated slots. After a slot $s$ is generated using the value-based prompt, it is presented to the inverse prompt function $I$. The inverse prompt aims to generate a value $v^{\prime}$ that is expected to be close to the original value $v$. The prompt template for the inverse prompt function can take the form $I = $ \textquote{\textsl{belief states: slot = [s], value = [v]}}, where \textsl{[s]} is filled with the slot generated by the value-based prompt and \textsl{[v]} is expected to be generated by the PLM. The inverse prompt can be considered an auxiliary task for the value-based prompt; it can improve performance by helping the PLM understand the task, especially under low-resource scenarios. The loss function $\tilde{\mathcal{L}}$ for the inverse prompt mechanism is:
\begin{equation} \label{eq:2} \tilde{\mathcal{L}}=-\sum_{t}^{|D|} \log P\left(v^{\prime}_{t} \mid c_{t}, I\left(s_{t}\right)\right) \end{equation}
where $P\left(v^{\prime}_{t} \mid c_{t}, I\left(s_{t}\right)\right)$ is the probability of generating the value $v^{\prime}_{t}$ after filling the inverse prompt $I\left(s_{t}\right)$ with the generated slot $s_{t}$.

\noindent The final loss $\mathcal{L}^{*}$ is computed by combining the value-based prompt loss $\mathcal{L}$ and the inverse prompt loss $\tilde{\mathcal{L}}$:
\begin{equation} \label{eq:3} \mathcal{L}^{*} = \mathcal{L} + w *\tilde{\mathcal{L}} \end{equation}
where $w \in (0,1)$ is a weight that adjusts the influence of the inverse prompt.

\paragraph{Training}
To train the prompt-based methods, the pre-trained \textsc{Soloist} (GPT-2, 117M parameters) is fine-tuned with the value-based prompt and the inverse prompt. All MultiWOZ 2.1 data splits (Table \ref{table:2}) are used in the fine-tuning process in order to evaluate the performance under few-shot settings. The training strategy \textit{fixed-prompt LM tuning} is adopted for tuning the prompt-based methods, where fixed discrete prompts are used while fine-tuning the parameters of the LM. Table \ref{table:4} shows the prompt templates used in the fine-tuning process. The prompts are appended to the dialog history before being provided as input to the PLM, which probabilistically generates the missing slots. The inverse prompt is only used during the training phase. Experiments are performed to evaluate the influence of the inverse prompt by setting multiple values for the inverse prompt weight $w$ in Equation \ref{eq:3} and also by omitting it completely during training.

\vspace{0.3cm}
\begin{table}[h!]
\centering
\begingroup
\setlength{\tabcolsep}{10pt} % Default value: 6pt
\renewcommand{\arraystretch}{1.25} % Default value: 1
\begin{tabular}{ll}
\hline
\multicolumn{1}{c}{\textbf{Type}} & \multicolumn{1}{c}{\textbf{Prompt templates}} \\ \hline
value-based prompt & belief states: value = [v], slot = [s] \\
inverse prompt & belief states: slot = [s], value = [v] \\ \hline
\end{tabular}
\endgroup
\caption{Prompt templates used during the training phase.}
\label{table:4}
\end{table}

\paragraph{Testing (Slot Generation)}
During the testing phase, only value-based prompts are used to generate the slots. The filled prompt, together with the dialog history, is given as input to the PLM, and the next word with the highest probability is taken as the generated slot. At test time, the value candidates are not known, so a set of rules is applied to extract candidate values directly from the user utterances. This kind of value extraction from utterances was previously explored by \citet{min2020dsi}.
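As a sketch of this test-time procedure, under the same assumptions as the earlier snippets, the slot can be decoded greedily from the filled value-based prompt; the cap on the number of generated tokens is an assumption, since the slots in Table \ref{table:3} are single words.

\begin{verbatim}
import torch
# tokenizer and model as in the earlier sketches

def generate_slot(history: str, value: str) -> str:
    """Greedy decoding of the slot from the filled value-based prompt."""
    prompt = f"{history} belief states: value = {value}, slot ="
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model.generate(**inputs, do_sample=False,
                             max_new_tokens=3,  # slots are short, single words
                             pad_token_id=tokenizer.eos_token_id)
    return tokenizer.decode(out[0, inputs.input_ids.size(-1):],
                            skip_special_tokens=True).strip()
\end{verbatim}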
\paragraph{Value Extraction}
Value candidates are extracted directly from the dialog history and provided to the value-based prompts for generating slots at inference time. The Stanza toolkit \citep{qi2020stanza} from the Stanford NLP group is first used to extract POS tags and named entities; a set of rules is then applied to extract the candidate values (a sketch of these rules is given after Table \ref{table:5}):
\begin{itemize}
\item Adjectives (JJ) and adverbs (RB) are considered possible values
\begin{itemize} \item[$\circ$] E.g., \textsl{expensive}, \textsl{moderate}, \textsl{important} \end{itemize}
\item A preceding negator \textsl{not} is kept as part of the value
\begin{itemize} \item[$\circ$] E.g., \textsl{not expensive}, \textsl{not important (= dont care)} \end{itemize}
\item All named entities are considered (names of places, times, dates/days, numbers)
\begin{itemize} \item[$\circ$] E.g., \textsl{cambridge}, \textsl{friday}, \textsl{08:30} \end{itemize}
\item A custom set of regex NER rules is applied to recognize additional named entities
\begin{itemize} \item[$\circ$] E.g., \textsl{restaurant names}, \textsl{attraction names} \end{itemize}
\item Stop words and repeated candidate values are filtered out
\end{itemize}

\paragraph{Prompt Decomposition}
For utterances where multiple \textsl{(slot, value)} pairs are expected to be predicted, directly using a single prompt to generate multiple slots is challenging. Prompt decomposition is a multi-prompt method that breaks the prompt down into sub-prompts and generates the slot separately for each sub-prompt. For each value extracted from the utterance, a value-based prompt is constructed and the corresponding slot is generated (Table \ref{table:5}). This kind of prompt decomposition was explored by \citet{cui2021template} for the named entity recognition (NER) task. The approach is applied in both the training and testing phases.

\vspace{0.25cm}
\begin{table}[h!]
\centering
\begingroup
\setlength{\tabcolsep}{8pt} % Default value: 6pt
\renewcommand{\arraystretch}{1.2} % Default value: 1
\begin{tabular}{ c l }
\hline
Utterance: & Book a flight to Berlin on friday at 08:30.\\ \hline
Prompt 1: & belief states: value = \textsl{Berlin}, slot = [s]\\
Prompt 2: & belief states: value = \textsl{friday}, slot = [s]\\
Prompt 3: & belief states: value = \textsl{08:30}, slot = [s]\\ \hline
\end{tabular}
\endgroup
\caption{Sub-prompts for an utterance with multiple values.}
\label{table:5}
\end{table}
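The following sketch illustrates how the extraction rules and the prompt decomposition could be combined, using Stanza's part-of-speech and named entity annotations. The custom regex NER rules and the stop-word filtering are omitted for brevity, and the sub-prompt template again follows Table \ref{table:4}.

\begin{verbatim}
import stanza
# assumes the English models were downloaded via stanza.download("en")
nlp = stanza.Pipeline("en", processors="tokenize,pos,ner")

def extract_values(utterance: str) -> list:
    """Rule-based value candidate extraction (custom regex NER rules
    and stop-word filtering are omitted in this sketch)."""
    doc = nlp(utterance)
    candidates = []
    for sent in doc.sentences:
        words = sent.words
        for i, word in enumerate(words):
            # adjectives (JJ*) and adverbs (RB*), with a preceding "not"
            if word.xpos and word.xpos[:2] in ("JJ", "RB"):
                neg = ("not " if i > 0
                       and words[i - 1].text.lower() == "not" else "")
                candidates.append(neg + word.text)
    candidates += [ent.text for ent in doc.ents]  # named entities
    return list(dict.fromkeys(candidates))        # drop repeated candidates

def decompose_prompts(history: str, utterance: str) -> list:
    """One value-based sub-prompt per extracted value, as in Table 5."""
    return [f"{history} belief states: value = {v}, slot ="
            for v in extract_values(utterance)]
\end{verbatim}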
\subsection{Multi-prompt methods for DST}
The \textsl{value-based} prompt described in the previous section utilizes a \textsl{single} prompt for making predictions. However, a significant body of research has demonstrated that the use of multiple prompts can further improve the efficacy of prompting methods \citep{liu2021ppp}. There are different ways to extend single-prompt learning to multiple prompts. This task explores two further multi-prompt learning methods: \textit{Prompt Ensembling} and \textit{Prompt Augmentation}. Experiments are performed on all the data splits of the MultiWOZ 2.1 dataset. This task aims to answer the following questions: \textsf{Q1:} Can different \textsl{multi-prompt} techniques together help the PLM better understand the DST task? \textsf{Q2:} Can the use of multiple discrete prompts improve the performance of the prompt-based model?

\paragraph{Prompt Ensembling}
This method uses multiple \textit{value-based} prompts during training and inference. It can leverage the complementary advantages of different prompts and stabilize performance on the downstream task. \citet{yang2022prompt} applied prompt ensembling to the value-based prompt by training a separate model for each prompt. Another way is to train a single model with multiple prompts, which is much faster and more memory-efficient than training a separate model for each prompt \citep{schick2021pet}. Prompt ensembling is applied only to the value-based prompts; the inverse prompt uses a single prompt. In this task, four hand-crafted prompt templates are chosen for the value-based prompts and trained in a single model. The probability of the generated slot $s_t$ over multiple prompt functions is calculated as a weighted average of the probabilities of the individual prompts:
\begin{equation} \label{eq:4} P\left(s_{t} \mid c_{t}\right)=\sum_{k}^{|K|} \alpha_{k} * P\left(s_{t} \mid c_{t}, f_{k}\left(v_{t}\right)\right) \end{equation}
where $|K|$ is the number of prompt functions, $f_{k}$ is the $k$-th prompt function, and $\alpha_{k}$ is the weight of prompt $k$. During inference, simple majority voting is used to pick the generated slot from the multiple prompts; when there is no simple majority among the generated slots, the slot with the highest probability is picked. Table \ref{table:6} lists all the prompt templates used in prompt ensembling.

\vspace{0.25cm}
\begin{table}[h!]
\centering
\begingroup
\setlength{\tabcolsep}{8pt} % Default value: 6pt
\renewcommand{\arraystretch}{1.2} % Default value: 1
\begin{tabular}{c l}
\hline
\multicolumn{2}{c}{\textbf{Prompt ensemble templates}}\\ \hline
$f_{1}$ & belief states: [v] = [s]\\
$f_{2}$ & [v] is the value of [s]\\
$f_{3}$ & [v] is of slot type [s]\\
$f_{4}$ & belief states: value = [v], slot = [s]\\ \hline
\end{tabular}
\endgroup
\caption{Prompt templates used for the prompt ensemble.}
\label{table:6}
\end{table}
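A sketch of the inference-time ensemble is given below: each template of Table \ref{table:6} generates a slot, the slots are aggregated by majority vote, and ties fall back to the slot with the highest probability. Scoring a candidate by the probability of its first generated token is a simplifying assumption of this sketch.

\begin{verbatim}
import torch
from collections import Counter
# tokenizer and model as in the earlier sketches

TEMPLATES = [                # Table 6, with the dialog history prepended
    "{h} belief states: {v} =",
    "{h} {v} is the value of",
    "{h} {v} is of slot type",
    "{h} belief states: value = {v}, slot =",
]

def ensemble_slot(history: str, value: str) -> str:
    """Majority vote over the slots generated by the ensemble prompts;
    ties fall back to the highest first-token probability."""
    results = []
    for tpl in TEMPLATES:
        inputs = tokenizer(tpl.format(h=history, v=value),
                           return_tensors="pt")
        with torch.no_grad():
            out = model.generate(**inputs, do_sample=False,
                                 max_new_tokens=3,
                                 output_scores=True,
                                 return_dict_in_generate=True,
                                 pad_token_id=tokenizer.eos_token_id)
        slot = tokenizer.decode(out.sequences[0, inputs.input_ids.size(-1):],
                                skip_special_tokens=True).strip()
        prob = out.scores[0].softmax(-1).max().item()
        results.append((slot, prob))
    votes = Counter(slot for slot, _ in results)
    slot, count = votes.most_common(1)[0]
    if count > len(TEMPLATES) // 2:             # simple majority
        return slot
    return max(results, key=lambda r: r[1])[0]  # highest probability
\end{verbatim}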
\paragraph{Prompt Augmentation}
\textit{Prompt Augmentation}, also known as \textit{demonstration learning} \citep{gao2021lmbff}, provides a few additional answered prompts that demonstrate to the PLM how the actual task is performed. These demonstrations take advantage of the language models’ ability to learn repetitive patterns. \citet{brown2020gpt3} also explored this idea by specifying few-shot demonstrations for multiple language tasks and achieved strong performance; in that work, the demonstration samples were treated as additional context, concatenated to the input, and the GPT-3 model generated outputs without any gradient updates or fine-tuning. Table \ref{table:7} provides an example of demonstration learning.

\begin{table}[h!]
\centering
\begin{tabular}{ r l }
\hline
\multicolumn{2}{c}{\textbf{Demonstration learning}} \\ \hline
Book a cheap flight to Frankfurt. & \textit{Frankfurt} is of slot \textit{destination}\\
Plan a train trip to Berlin. & \textit{Berlin} is of slot \textit{destination}\\
Book a taxi to the University. & \textit{University} is of slot \textit{destination}\\
Book a train to Stuttgart. & \textit{Stuttgart} is of slot [s]\\ \hline
\end{tabular}
\caption{An example of demonstration learning for the DST task}
\label{table:7}
\end{table}

\paragraph{}
Experiments are performed on two \textit{sets} of demonstration samples to understand the importance of sample selection. The demonstration samples are hand-picked and hand-crafted from the training data. Each demonstration sample contains multiple answered prompts (\textit{dialog history + belief states}) covering all the domains and slots. These demonstration examples were specifically chosen based on insights from the initial error analysis of the prompt-based methods. The first sample set contains 8 examples, and the second contains 5; the number of demonstration examples in a single sample set is bounded by the GPT-2 maximum input length of 1024 tokens. The demonstrations from each sample set are concatenated to the input and given to the fine-tuned prompt model for generating the slots.

\subsection{Evaluation Metrics}
The standard evaluation metric, joint goal accuracy (JGA), is adopted to evaluate the belief state predictions of the baseline and the prompt-based methods. This metric compares all the predicted belief states to the ground-truth states at each turn. A prediction is correct only if all the predicted belief states match the ground-truth states; both slots and values must match exactly. The rule-based methods used in value extraction can lead to many false positives among the value candidates. To exclude the influence of wrongly extracted values, \citet{yang2022prompt} proposed JGA*, in which the joint goal accuracy is computed only over the belief states whose values were extracted correctly. These evaluation metrics answer the following questions: \textsf{Q1:} How do the prompt-based methods perform overall compared to the \textsc{Soloist} baseline? \textsf{Q2:} Can the prompt-based methods perform better under low-resource settings? \textsf{Q3:} For the prompt-based methods, does the JGA* metric yield a higher score than JGA? \textsf{Q4:} Can multi-prompt techniques together perform better than a single prompt?

\paragraph{Analysis}
The belief state predictions from the \textsc{Soloist} baseline and the prompt-based methods are analyzed to identify potential improvements and drawbacks. A detailed qualitative analysis is performed on the incorrect belief state predictions. Additionally, an error analysis is performed on the rule-based value extraction methods to identify their impact on the slot generation process.

\subsection{Implementation Details}
For the \textsc{Soloist} baseline, the existing implementation by \citet{peng2021soloist} is adapted to the few-shot experiments conducted in this thesis. As there is no publicly available implementation of the prompt-based methods for DST, the Huggingface Transformers library \citep{wolf2020transformers} is used to implement the prompt-based DST methods from scratch. The Adam optimization algorithm \citep{kingma2015adam} is used during the fine-tuning of both the baseline and the prompt-based methods. The rule-based value extraction methods are implemented using the Stanza toolkit \citep{qi2020stanza}. Experiments are performed with multiple inverse prompt weights $w$ (\textit{0.1, 0.3, 0.5, 0.7}) in Eq. \ref{eq:3}. The prompt weight $\alpha_{k}$ in Eq. \ref{eq:4} is set to the same value ($1/4$) for all the prompts used in prompt ensembling.
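To make the training setup concrete, the following sketch combines the earlier loss computation into a single fine-tuning step on the objective of Eq. \ref{eq:3}, using the Adam optimizer. The learning rate is an assumed value, and the inverse prompt weight shown is one of the tested values listed above.

\begin{verbatim}
from torch.optim import Adam
# model, tokenizer, and slot_nll as defined in the earlier sketches

optimizer = Adam(model.parameters(), lr=5e-5)  # assumed learning rate
w = 0.5  # inverse prompt weight in Eq. (3); one of the tested values

def training_step(history: str, value: str, slot: str) -> float:
    """One fine-tuning step on the combined loss of Eq. (3)."""
    loss = slot_nll(history, value, slot)  # value-based prompt loss (Eq. 1)
    # Inverse prompt loss (Eq. 2): generate the value given the slot.
    inv_prompt = f"{history} belief states: slot = {slot}, value ="
    full_ids = tokenizer(inv_prompt + " " + value,
                         return_tensors="pt").input_ids
    labels = full_ids.clone()
    prompt_len = tokenizer(inv_prompt, return_tensors="pt").input_ids.size(-1)
    labels[:, :prompt_len] = -100  # only the value tokens contribute
    inv_loss = model(full_ids, labels=labels).loss
    total = loss + w * inv_loss
    optimizer.zero_grad()
    total.backward()
    optimizer.step()
    return total.item()
\end{verbatim}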