\section{Results}\label{sec:results}
This section presents the experimental results for all the methods described in the previous sections. Few-shot experiments are performed on every data split (see Table \ref{table:2}) for each method. The baseline \textsc{Soloist} model is evaluated only on the JGA metric; for the prompt-based methods, JGA* is computed in addition to JGA.
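As a point of reference, the listing below sketches how the JGA metric can be computed: a dialogue turn counts as correct only if its entire predicted belief state exactly matches the gold annotation. The dictionary representation of a belief state is an illustrative assumption, not the exact format used in our pipeline.
\begin{verbatim}
def joint_goal_accuracy(predicted, gold):
    """predicted/gold: lists of per-turn belief states, each a dict
    mapping (domain, slot) pairs to values, e.g.
    {("hotel", "area"): "north", ("hotel", "stars"): "4"}."""
    assert len(predicted) == len(gold)
    # A turn is correct only if every slot-value pair matches exactly.
    correct = sum(p == g for p, g in zip(predicted, gold))
    return 100.0 * correct / len(gold)
\end{verbatim}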
\subsection{\textsc{Soloist} baseline}
Table \ref{table:8} shows the results of the baseline model in the few-shot experiments. The baseline performs poorly and struggles to generate belief states under low-resource settings (\textsl{5-dpd}, \textsl{10-dpd}, \textsl{50-dpd}): the limited number of training samples makes it difficult for the model to generate unseen belief states. The results also suggest that more data may be necessary, as the model achieves better results on the splits with more training samples (\textsl{125-dpd}, \textsl{250-dpd}).
\begin{table}[h!]
\centering
\begingroup
\setlength{\tabcolsep}{16pt} % Default value: 6pt
\renewcommand{\arraystretch}{1.15} % Default value: 1
\begin{tabular}{lc}
\hline
\textbf{\makecell{Data split (\# dialogs)}} & \textbf{JGA} \\
\hline
\textsl{5-dpd} (25) & 9.06 \\
\hline
\textsl{10-dpd} (50) & 14.20 \\
\hline
\textsl{50-dpd} (250) & 28.64 \\
\hline
\textsl{100-dpd} (500) & 33.11 \\
\hline
\textsl{125-dpd} (625) & 35.79 \\
\hline
\textsl{250-dpd} (1125) & \textbf{40.38} \\
\hline
\end{tabular}
\endgroup
\caption{Few-shot experimental results of the \textsc{Soloist} baseline model. The term \textquote{\textsl{dpd}} stands for \textquote{\textsl{dialogues per domain}}.}
\label{table:8}
\end{table}
\subsection{Prompt-based methods}
Table \ref{table:9} shows the results of the prompt-based model in the few-shot experiments; only a single value-based prompt is used. The prompt-based model significantly outperforms the baseline on all data splits. On the low-resource splits \textsl{5-dpd}, \textsl{10-dpd}, and \textsl{50-dpd}, it improves over the baseline by roughly \textit{21}, \textit{28}, and \textit{18} JGA points, respectively.
\vspace{0.25cm}
\begin{table}[h!]
\centering
\begingroup
\setlength{\tabcolsep}{14pt} % Default value: 6pt
\renewcommand{\arraystretch}{1.15} % Default value: 1
\begin{tabular}{lcc}
\hline
\textbf{\makecell{Data split (\# dialogs)}} & \textbf{JGA} & \textbf{JGA*}\\
\hline
\textsl{5-dpd} (25) & 30.66 & 71.04 \\
\hline
\textsl{10-dpd} (50) & 42.65 & 86.43 \\
\hline
\textsl{50-dpd} (250) & 47.06 & 91.63 \\
\hline
\textsl{100-dpd} (500) & \textbf{47.74} & \textbf{92.31} \\
\hline
\textsl{125-dpd} (625) & 46.49 & 91.86 \\
\hline
\textsl{250-dpd} (1125) & 47.06 & 92.08 \\
\hline
\end{tabular}
\endgroup
\caption{Few-shot experimental results from the prompt-based model. Only a single \textit{value-based prompt} is used. The term \textquote{\textsl{dpd}} stands for \textquote{\textsl{dialogues per domain}}.}
\label{table:9}
\end{table}
\paragraph{} The prompt-based results also show that increasing the number of data samples yields only minor performance improvements; for example, the prompt-based method performs nearly identically on the \textsl{50-dpd} and \textsl{250-dpd} splits. This suggests the prompt-based approach captures the DST task well even in low-resource scenarios. The considerably higher values of the JGA* metric across all data splits point to the potential drawbacks of the rule-based value extraction methods.
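To illustrate the kind of failure that can widen the gap between JGA and JGA*, the sketch below shows a deliberately simple rule-based value extractor that matches generated text against a fixed list of candidate values. Both the candidate list and the matching rule are illustrative assumptions; the extraction rules in our pipeline are more elaborate.
\begin{verbatim}
def extract_value(generated, candidate_values):
    """Return the first candidate value found verbatim in the
    generated text, or None. Purely illustrative: exact substring
    matching misses paraphrases, e.g. a generation saying
    "in the middle of town" when the gold value is "centre"."""
    text = generated.lower()
    for value in candidate_values:
        if value.lower() in text:
            return value
    return None
\end{verbatim}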
\subsection{Multi-prompt methods}
\subsubsection{Prompt ensembling results}
Table \ref{table:10} shows the results of prompt ensembling under few-shot settings. The prompt ensemble shows a slight improvement over a single value-based prompt but, contrary to expectations, no significant gain on the JGA metric. The results also show that the ensemble's performance is similar across the larger data splits, i.e., \textsl{50-dpd}, \textsl{100-dpd}, \textsl{125-dpd}, and \textsl{250-dpd}.
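As a rough illustration of how predictions from several value-based prompts can be combined, the sketch below aggregates per-prompt slot values by majority vote. The voting rule is an assumption made for illustration, not necessarily the aggregation used in our implementation.
\begin{verbatim}
from collections import Counter

def ensemble_slot_value(per_prompt_predictions):
    """per_prompt_predictions: values generated for one slot by the
    four value-based prompts, e.g. ["north", "north", "south",
    "north"]. Ties fall back to the value encountered first."""
    counts = Counter(per_prompt_predictions)
    return counts.most_common(1)[0][0]
\end{verbatim}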
\vspace{0.25cm}
\begin{table}[h!]
\centering
\begingroup
\setlength{\tabcolsep}{14pt} % Default value: 6pt
\renewcommand{\arraystretch}{1.15} % Default value: 1
\begin{tabular}{lcc}
\hline
\textbf{\makecell{Data split (\# dialogs)}} & \textbf{JGA} & \textbf{JGA*}\\
\hline
\textsl{5-dpd} (25) & 30.09 & 69.23 \\
\hline
\textsl{10-dpd} (50) & 42.84 & 86.99 \\
\hline
\textsl{50-dpd} (250) & 47.62 & 91.74 \\
\hline
\textsl{100-dpd} (500) & \textbf{48.08} & \textbf{92.87} \\
\hline
\textsl{125-dpd} (625) & 46.96 & 92.08 \\
\hline
\textsl{250-dpd} (1125) & \textbf{48.08} & \textbf{92.87} \\
\hline
\end{tabular}
\endgroup
\caption{Few-shot experimental results from prompt ensembling (multi-prompt method). Four \textit{value-based prompts} are used at training and inference time. The term \textquote{\textsl{dpd}} stands for \textquote{\textsl{dialogues per domain}}.}
\label{table:10}
\end{table}
\subsubsection{Prompt augmentation results}
Table \ref{table:11} shows the results of prompt augmentation under few-shot settings.
\vspace{0.25cm}
\begin{table}[h!]
\centering
\begingroup
\setlength{\tabcolsep}{14pt} % Default value: 6pt
\renewcommand{\arraystretch}{1.15} % Default value: 1
\begin{tabular}{lcc}
\hline
\textbf{\makecell{Data split (\# dialogs)}} & \textbf{JGA} & \textbf{JGA*}\\
\hline
\textsl{5-dpd} (25) & 27.80 & 68.10 \\
\hline
\textsl{10-dpd} (50) & 38.91 & 74.43 \\
\hline
\textsl{50-dpd} (250) & 39.52 & 82.81 \\
\hline
\textsl{100-dpd} (500) & \textbf{42.42} & \textbf{83.71} \\
\hline
\textsl{125-dpd} (625) & 40.16 & 82.92 \\
\hline
\textsl{250-dpd} (1125) & 41.52 & 85.07 \\
\hline
\end{tabular}
\endgroup
\caption{Few-shot experimental results from prompt augmentation, also called demonstration learning (multi-prompt method). The term \textquote{\textsl{dpd}} stands for \textquote{\textsl{dialogues per domain}}.}
\label{table:11}
\end{table}
Prompt augmentation (also called \textit{demonstration learning}) provides additional context to the language model in the form of \textquote{\textsl{answered prompts}} at inference time. These hand-crafted answered prompts are intended to help the language model understand the DST task and generate accurate responses. The results show, however, that demonstration learning struggles to generate belief states accurately: its performance trails the other prompt-based methods on all data splits. Because the maximum input sequence length of the GPT-2 LM is 1024 tokens, only a limited number of answered prompts can be provided, which biases the slot generation process.
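The sketch below shows one way such a demonstration context could be assembled under the 1024-token budget: answered prompts are greedily appended until the limit would be exceeded. The tokenizer call uses the Hugging Face \textit{transformers} library; the greedy packing strategy itself is an illustrative assumption.
\begin{verbatim}
from transformers import GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
MAX_LEN = 1024  # GPT-2's maximum input sequence length

def build_context(answered_prompts, query_prompt):
    """Greedily add answered prompts (demonstrations) while the
    total input stays within the 1024-token budget."""
    budget = MAX_LEN - len(tokenizer.encode(query_prompt))
    context = []
    for demo in answered_prompts:
        cost = len(tokenizer.encode(demo))
        if cost > budget:
            break  # only a limited number of demonstrations fit
        context.append(demo)
        budget -= cost
    return "\n".join(context + [query_prompt])
\end{verbatim}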
\paragraph{} Overall, the multi-prompt methods (prompt ensembling and prompt augmentation) struggled to improve performance on the DST task, although prompt ensembling with multiple value-based prompts showed minor improvements over a single value-based prompt.