\section{Methods}
The main goal of this thesis is to explore the prompt learning framework for few-shot DST designed by \citet{yang2022prompt} and to propose improvements. The thesis work is subdivided into three tasks: (1) applying the prompt learning framework to few-shot DST, (2) evaluating and analyzing belief state predictions, and (3) exploring multi-prompt learning methods.
\subsection{Prompt learning framework for few-shot DST} \label{task1}
\paragraph{} This task aims to reproduce the results from \citet{yang2022prompt} and to apply minor improvements by utilizing multi-prompt methods. Since there is no publicly available implementation of this prompt learning framework, this task implements it on top of the \textsc{Soloist} baseline.
\paragraph{Dataset} The baseline and prompt-based methods are evaluated on the MultiWOZ 2.1 dataset \citep{eric2019multiwoz}. MultiWOZ 2.0, originally released by \citet{budzianowski2018multiwoz}, is a fully-labeled dataset of human-human written conversations spanning multiple domains and topics. \citet{eric2019multiwoz} added fixes and improvements to the dialogue utterances and released MultiWOZ 2.1, which contains 8438/1000/1000 dialogues for training/validation/testing, respectively. \citet{yang2022prompt} excluded two domains that only appear in the training set. Under few-shot settings, only a portion of the training data is utilized, in order to observe the performance in a low-resource scenario.
\paragraph{SOLOIST Baseline} \textsc{Soloist} \citep{peng2021soloist} serves as the baseline for the prompt-based approach. \textsc{Soloist} is initialized with the 12-layer GPT-2 \citep{radford2019gpt2} and further trained on two task-oriented dialog corpora (Schema and Taskmaster). This task-grounded pre-training helps \textsc{Soloist} solve two dialog-related tasks: \textit{belief state prediction} and \textit{response generation}. For the baseline implementation, the pre-trained \textsc{Soloist} will be fine-tuned on the MultiWOZ 2.1 dataset to perform the belief state prediction task for DST. While the main focus of this thesis is on prompt-based methods, the \textsc{Soloist} baseline implementation is required for comparing the belief state predictions and the performance of prompt learning.
\paragraph{Value-based Prompt} A general idea for generating (\textit{slot, value}) pairs is to use slots in the prompts and generate the corresponding values \citep{lee2021sdp}. For example, given the utterance ``\textit{Plan a trip to Berlin}'' and the slot (\textit{destination}), the prompt to the PLM could become ``\textit{Plan a trip to Berlin. destination = [z]}'', and the PLM is expected to generate \textit{[z]} as ``\textit{Berlin}''. However, this approach relies on the ontology of the slots, and the fixed set of slots can change in real-world applications. \citet{yang2022prompt} proposed a \textit{value-based prompt} that uses values in the prompt and generates the corresponding slots. This method does not require any pre-defined set of slots and can also generate unseen slots. Consider the prompt template ``\textit{belief states: value = [v], slot = [s]}'': the prompt function $f$ can be of the form $f(v) = $ \textit{[utterances] belief states: value = [v], slot = [s]}. Given the value candidate $v = $ ``\textit{London}'', the PLM should be able to generate the slot \textit{[s] = ``destination''}. The overall training objective of value-based prompt generation is to maximize the log-likelihood of the slots in the training dataset $D$:
$$\mathcal{L}=\sum_{t}^{|D|} \log P\left(s_{t} \mid c_{t}, f\left(v_{t}\right)\right)$$
where $P\left(s_{t} \mid c_{t}, f\left(v_{t}\right)\right)$ is the probability of slot $s_t$ given the dialog history $c_t$ and the prompt function $f$ filled with value $v_t$ for each turn $t$.
The loss $\mathcal{L}$ from this step is combined with the loss from the next step to compute the final loss. During training, the values from the annotated training dataset are utilized to construct the prompts.
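To make the template concrete, the prompt function $f$ can be sketched as a simple template filler. This is a hypothetical helper for illustration, not the authors' implementation; the function name and signature are assumptions.

```python
def value_prompt(dialog_history, value):
    # Hypothetical prompt function f(v): the value slot is filled in,
    # and the PLM is expected to continue the sequence with the slot name.
    return f"{dialog_history} belief states: value = {value}, slot ="

# The PLM would be expected to complete this prompt with "destination".
prompt = value_prompt("Plan a trip to Berlin.", "Berlin")
```
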
\paragraph{Inverse Prompt} The \textit{inverse prompt} mechanism is used to generate values by prompting with slots. After a slot $s$ is generated using the value-based prompt (previous step), it is presented to the inverse prompt function $I$. The inverse prompt aims to generate a value $v^{\prime}$ that is supposed to be close to the original value $v$. The template for the inverse prompt function is $I = $ ``\textit{belief states: slot = [s], value = [v]}''. The inverse prompt can be considered an auxiliary task for this prompt-based approach, which can improve performance by helping the PLM understand the task and tune the slot generation process. The loss function $\tilde{\mathcal{L}}$ for the inverse prompt mechanism is:
$$\tilde{\mathcal{L}}=\sum_{t}^{|D|} \log P\left(v^{\prime}_{t} \mid c_{t}, I\left(s_{t}\right)\right)$$
The final loss $\mathcal{L}^{*}$ is computed by combining the value-based prompt loss $\mathcal{L}$ and the inverse prompt loss $\tilde{\mathcal{L}}$:
$$ \mathcal{L}^{*} = \mathcal{L} + w *\tilde{\mathcal{L}} $$
where $w \in (0,1)$ is a weight that adjusts the influence of the inverse prompt.
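In code, the loss combination is a straightforward weighted sum. The sketch below uses plain floats for clarity; in practice these values would be framework tensors, and the function name is an assumption.

```python
def combined_loss(value_prompt_loss, inverse_prompt_loss, w=0.5):
    # Final loss L* = L + w * L~, where w in (0, 1) scales the auxiliary
    # inverse-prompt objective against the main value-based objective.
    assert 0.0 < w < 1.0
    return value_prompt_loss + w * inverse_prompt_loss
```
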
\paragraph{Training} For training the above prompt-based approach, the pre-trained \textsc{Soloist} model (117M parameters) will be utilized and fine-tuned on the prompt-based slot generation process. \textsc{Soloist} is fine-tuned on the MultiWOZ 2.1 dataset; the dialog history and values are given directly to the prompts. The inverse prompt is only used during the training phase. To evaluate the prompt-based model's ability to generate slots under low-resource data settings, few-shot experiments will be performed during training. Experiments will be conducted by choosing random samples of the training data (1\%, 5\%, 10\%, and 25\%) for each domain, on both the \textsc{Soloist} baseline and the prompt-based model.
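The few-shot splits described above can be drawn as in the following sketch. This is a hypothetical helper under the assumption of uniform per-domain sampling; the exact procedure in the original work may differ.

```python
import random

def few_shot_split(dialogs_by_domain, fraction, seed=42):
    # Randomly sample the given fraction of training dialogs per domain,
    # keeping at least one dialog so no domain is left empty.
    rng = random.Random(seed)
    return {
        domain: rng.sample(dialogs, max(1, int(len(dialogs) * fraction)))
        for domain, dialogs in dialogs_by_domain.items()
    }
```

Fixing the seed keeps the sampled subset reproducible across experiment runs.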
\paragraph{Testing} In the testing phase, only value-based prompts are utilized for slot generation. At test time, the candidate values are not known. Following existing work \citep{min2020dsi}, values can be extracted directly from the utterance: first, POS tags, named entities, and co-references are extracted; then a set of rules is applied to extract candidate values from the POS and entity patterns, e.g., considering adjectives and adverbs and filtering stop words and repeated candidates.
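A crude approximation of this rule-based extraction, assuming only stop-word filtering and deduplication, looks as follows. The stop-word list is illustrative; a real pipeline would additionally use a POS tagger and an NER model as described above.

```python
# Illustrative stop-word list; a real system would use a full list
# plus POS-tag and named-entity filters.
STOP_WORDS = {"a", "an", "the", "i", "to", "me", "want", "plan", "find", "book"}

def extract_candidate_values(utterance):
    # Naive candidate extraction: strip punctuation, drop stop words
    # and repeated candidates, keep the remaining tokens in order.
    seen, candidates = set(), []
    for token in utterance.split():
        word = token.strip(".,!?")
        if not word or word.lower() in STOP_WORDS or word.lower() in seen:
            continue
        seen.add(word.lower())
        candidates.append(word)
    return candidates
```
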
\subsection{Evaluation \& Analyses} \label{task2}
\paragraph{Evaluation Metrics} The standard metric joint goal accuracy (JGA) will be adopted to evaluate the belief state predictions. This metric compares all the predicted belief states to the ground-truth states at each turn: a prediction is correct only if all the predicted states match the ground-truth states, i.e., both slots and values must match. To omit the influence of value extraction, \citet{yang2022prompt} proposed JGA*, where the accuracy is computed only over the belief states whose values are correctly identified. These evaluation metrics can answer the following questions: \textbf{Q1}: How do the prompt-based methods perform overall compared to the SoTA \textsc{Soloist}? \textbf{Q2}: Can the prompt-based model perform better under few-shot settings? \textbf{Q3}: Does JGA* yield a better score than JGA?
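JGA can be sketched as a minimal per-turn check, assuming each turn's belief state is represented as a slot-value dictionary (the representation and function name are assumptions for illustration):

```python
def joint_goal_accuracy(predicted_states, gold_states):
    # A turn counts as correct only if every predicted (slot, value)
    # pair matches the ground truth exactly, with none missing or extra.
    correct = sum(
        1 for pred, gold in zip(predicted_states, gold_states)
        if pred == gold
    )
    return correct / len(gold_states)
```
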
\paragraph{Analyses of belief state predictions} The main goal of this task is to analyze the belief state predictions. The predictions from the \textsc{Soloist} baseline and the prompt-based methods will be compared and analyzed to identify improvements and drawbacks. A detailed error analysis will be performed on the wrong belief state predictions.
\subsection{Multi-prompt learning methods} \label{task3}
The \textit{value-based} prompt described in the previous sections utilizes a \textit{single} prompt for making predictions. However, a significant body of research has demonstrated that using multiple prompts can further improve the efficacy of prompting methods \citep{liu2021ppp}. There are different ways to extend single-prompt learning to multiple prompts. This task will explore three multi-prompt learning methods: \textit{prompt ensembling}, \textit{prompt augmentation}, and \textit{prompt decomposition}. It aims to answer the following questions: \textbf{Q1}: Can combining different \textit{multi-prompt} techniques help the PLM better understand the DST task? \textbf{Q2}: How do various hand-crafted prompt functions influence the prompt-based model?
\paragraph{Prompt Ensembling} This method uses multiple \textit{unanswered} prompts at inference time to make predictions \citep{liu2021ppp}. The idea can leverage the complementary advantages of different prompts and stabilize the performance on downstream tasks. \citet{yang2022prompt} applied prompt ensembling to the value-based prompt to effectively utilize four different prompts. A simple way of ensembling is to train a separate model for each prompt and generate the output by applying weighted averaging to the slot generation probabilities. The probability of slot $s_t$ can be calculated via:
$$P\left(s_{t} \mid c_{t}\right)=\sum_{k}^{|K|} \alpha_{k} * P\left(s_{t} \mid c_{t}, f_{k}\left(v_{t}\right)\right)$$
where $|K|$ represents the number of prompt functions, $f_{k}$ is the $k$-th prompt function, and $\alpha_{k}$ is the weight of prompt $k$. This task will utilize prompt ensembling differently from \citet{yang2022prompt}, by combining it with other multi-prompt methods. Experiments will be performed on various prompt templates (see Table \ref{table:1}) to find the most effective and suitable prompts in combination with other multi-prompt methods.
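The weighted averaging over prompt functions can be sketched as follows, assuming each of the $|K|$ models exposes its slot distribution as a dictionary mapping candidate slots to $P(s_t \mid c_t, f_k(v_t))$ (a hypothetical representation for illustration):

```python
def ensemble_slot_probs(per_prompt_probs, weights):
    # Weighted average over |K| prompt functions: per_prompt_probs[k]
    # maps each candidate slot to P(s | c, f_k(v)); weights[k] is alpha_k.
    combined = {}
    for probs, alpha in zip(per_prompt_probs, weights):
        for slot, p in probs.items():
            combined[slot] = combined.get(slot, 0.0) + alpha * p
    return combined
```

The final slot prediction is then the argmax of the combined distribution.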
\begin{table}[h!]
\centering
\begin{tabular}{ c l }
$f_{1}$ & belief states: value = [v], slot = [s]\\
$f_{2}$ & belief states: [v] = [s]\\
$f_{3}$ & [v] is of slot type [s]\\
$f_{4}$ & [v] is the value of [s]\\
\vdots &
\end{tabular}
\caption{Examples of different prompt functions for ensembling}
\label{table:1}
\end{table}
\paragraph{Prompt Augmentation} \textit{Prompt augmentation}, sometimes called \textit{demonstration learning} \citep{gao2021lmbff}, provides a few additional \textit{answered prompts} that demonstrate to the PLM how the actual prompt slot can be answered. The answered prompts will be hand-picked manually from the training data, and experiments will be conducted on different sets of samples. Table \ref{table:2} provides an example of prompt augmentation.
\begin{table}[h!]
\centering
\begin{tabular}{ r l }
I want to book a cheap hotel. & \textit{cheap} is of slot \textit{price range}\\
Plan a train trip to Berlin. & \textit{Berlin} is of slot \textit{destination}\\
Find me an Italian restaurant. & \textit{Italian} is of slot \textit{food}\\
Recommend a movie at Cinemaxx. & \textit{Cinemaxx} is of slot [s]
\end{tabular}
\caption{Examples of prompt augmentation with answered prompts}
\label{table:2}
\end{table}
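Assembling an augmented prompt from answered demonstrations can be sketched as below. The helper and the demonstration triples are illustrative assumptions, mirroring the style of Table \ref{table:2}.

```python
def augmented_prompt(demonstrations, utterance, value):
    # Prepend answered prompts (demonstrations) to the actual unanswered
    # prompt, so the PLM can infer the task format from the examples.
    demos = " ".join(
        f"{u} {v} is of slot {s}." for u, v, s in demonstrations
    )
    return f"{demos} {utterance} {value} is of slot"

demos = [
    ("I want to book a cheap hotel.", "cheap", "price range"),
    ("Plan a train trip to Berlin.", "Berlin", "destination"),
]
prompt = augmented_prompt(demos, "Recommend a movie at Cinemaxx.", "Cinemaxx")
```
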
\paragraph{Prompt Decomposition} For utterances in which multiple slot values should be predicted, directly using a single prompt to generate multiple slots is challenging. One intuitive method is to break the prompt down into sub-prompts and generate the slots for each sub-prompt separately: for each candidate value in the utterance, a \textit{value-based} prompt is constructed and the slot is generated. This approach will be utilized in both the training and testing phases. This sort of \textit{prompt decomposition} has been explored by \citet{cui2021template} for the named entity recognition (NER) task.
\begin{table}[h!]
\centering
\begin{tabular}{ c l }
Utterance: & Book a flight to Stuttgart tomorrow evening.\\
Prompt 1: & belief states: \textit{Stuttgart} = [s]\\
Prompt 2: & belief states: \textit{tomorrow} = [s]\\
Prompt 3: & belief states: \textit{evening} = [s]\\
\end{tabular}
\caption{Prompt decomposition example}
\label{table:3}
\end{table}
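Generating one sub-prompt per candidate value, as in Table \ref{table:3}, can be sketched as follows (a hypothetical helper using the $f_2$-style template from Table \ref{table:1}):

```python
def decompose_prompts(values):
    # Build one value-based sub-prompt per candidate value; each
    # sub-prompt asks the PLM for the slot of a single value.
    return [f"belief states: {v} = " for v in values]

sub_prompts = decompose_prompts(["Stuttgart", "tomorrow", "evening"])
```

Each sub-prompt is then answered independently, and the resulting (slot, value) pairs are merged into the turn's belief state.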