\section{Results}\label{sec:results}
This section presents the experimental results for all the methods described in the previous sections. Few-shot experiments are performed on every data split (see Table \ref{table:2}) for each method. The baseline \textsc{Soloist} model is evaluated only on the JGA metric; for the prompt-based methods, JGA* is computed in addition to JGA.
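For reference, joint goal accuracy counts a turn as correct only if the entire predicted belief state matches the gold annotation. A minimal sketch of this computation is given below; the helper is illustrative only, and the actual evaluation script may differ in detail (e.g., in value normalization).
\begin{verbatim}
# Illustrative JGA computation (the actual evaluation script
# may normalize values differently). A turn is correct only
# if the full predicted belief state equals the gold state.
def joint_goal_accuracy(predictions, references):
    # each element: dict mapping (domain, slot) -> value
    correct = sum(1 for pred, gold in zip(predictions, references)
                  if pred == gold)
    return 100.0 * correct / len(references)
\end{verbatim}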
\subsection{\textsc{Soloist} Baseline}
Table \ref{table:8} shows the results of the fine-tuned \textsc{Soloist} baseline model under few-shot experiments. The baseline performed poorly and struggled to generate belief states under the low-resource settings (\textsl{5-dpd}, \textsl{10-dpd}, \textsl{50-dpd}): with so few samples, generating unseen belief states proved challenging. Performance improves on the larger data splits (\textsl{125-dpd}, \textsl{250-dpd}), indicating the baseline may simply need more data.
\vspace{0.5cm}
\begin{table}[h!]
\centering
\begingroup
\setlength{\tabcolsep}{16pt} % Default value: 6pt
\renewcommand{\arraystretch}{1.25} % Default value: 1
\begin{tabular}{lc}
\hline
\textbf{\makecell{Data split (\# dialogues)}} & \textbf{JGA} \\
\hline
\textsl{5-dpd} (25) & 9.06 \\
\textsl{10-dpd} (50) & 14.20 \\
\textsl{50-dpd} (250) & 28.64 \\
\textsl{100-dpd} (500) & 33.11 \\
\textsl{125-dpd} (625) & 35.79 \\
\textsl{250-dpd} (1125) & \textbf{40.38} \\
\hline
\end{tabular}
\endgroup
\caption{Few-shot experimental results of the \textsc{Soloist} baseline model. The term \textquote{\textsl{dpd}} stands for \textquote{\textsl{dialogues per domain}}.}
\label{table:8}
\end{table}
\subsection{Prompt-based methods}
This section presents the evaluation results for the prompt-based methods. A single value-based prompt is used together with the inverse prompt. Experiments are performed on all the MultiWOZ data splits and evaluated on the JGA and JGA* metrics. To analyze the influence of the inverse prompt mechanism, experiments are performed with inverse prompt loss weights $w = 0.1$, $0.3$, $0.5$, and $0.7$.
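Assuming Eq. \ref{eq:3} takes the usual weighted-sum form (restated here only for readability; the exact formulation is given where the equation is introduced), the training objective combines the two losses as
\[
\mathcal{L} = \mathcal{L}_{\text{prompt}} + w \cdot \mathcal{L}_{\text{inverse}},
\]
so larger values of $w$ put more weight on reconstructing the prompt from the generated output.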
\begin{table}[h!]
\centering
\small
\begingroup
\setlength{\tabcolsep}{6pt} % Default value: 6pt
\renewcommand{\arraystretch}{1.1} % Default value: 1
\begin{tabular}{l ccc ccc ccc ccc}
\hline
& & \multicolumn{2}{c}{\textbf{w = 0.1}} & & \multicolumn{2}{c}{\textbf{w = 0.3}} & & \multicolumn{2}{c}{\textbf{w = 0.5}} & & \multicolumn{2}{c}{\textbf{w = 0.7}}\\
\textbf{Data split} & & JGA & JGA* & & JGA & JGA* & & JGA & JGA* & & JGA & JGA*\\
\hline
\textsl{5-dpd} & & 30.66 & 71.04 & & \textbf{31.67} & \textbf{73.19} & & 30.77 & 72.85 & & 29.98 & 70.93\\
\textsl{10-dpd} & & \textbf{42.65} & \textbf{86.43} & & 41.18 & 83.48 & & 40.05 & 80.77 & & 40.38 & 85.18\\
\textsl{50-dpd} & & \textbf{47.06} & \textbf{91.63} & & 46.49 & 91.18 & & 47.04 & 91.18 & & 46.27 & 90.05\\
\textsl{100-dpd} & & 47.74 & 92.31 & & \textbf{48.42} & \textbf{92.42} & & 48.19 & 92.65 & & 48.30 & 92.65\\
\textsl{125-dpd} & & 46.49 & 91.86 & & 46.15 & 91.18 & & \textbf{46.83} & \textbf{91.74} & & 46.15 & 90.95\\
\textsl{250-dpd} & & 47.06 & 92.08 & & \textbf{47.62} & \textbf{92.65} & & 47.40 & 92.31 & & 47.17 & 92.09\\
\hline
\end{tabular}
\endgroup
\caption{Few-shot experimental results from the prompt-based model. Only a single \textit{value-based prompt} is used together with the \textit{inverse prompt}. Evaluation results for different inverse prompt weights $w$ (in Eq. \ref{eq:3}) are shown in the table. The term \textquote{\textsl{dpd}} stands for \textquote{\textsl{dialogues per domain}}.}
\label{table:9}
\end{table}
\paragraph{} Table \ref{table:9} shows the evaluation results of the prompt-based model under few-shot experiments. The prompt-based model significantly outperformed the baseline across all data splits. On the low-resource splits \textsl{5-dpd}, \textsl{10-dpd}, and \textsl{50-dpd}, it improves over the baseline by roughly \textsl{22}, \textsl{28}, and \textsl{18} JGA points, respectively. These results clearly demonstrate the effectiveness of the prompt-based methods over the baseline, especially under extremely low-resource settings.
\vspace{-6pt}
\paragraph{} The results also show that increasing the number of data samples yields only minor performance improvements, which suggests the prompt-based methods capture the DST task well even under low-resource settings. For example, the performance on the \textsl{50-dpd} and \textsl{250-dpd} splits is nearly identical. The different inverse prompt loss weights $w$ resulted in similar performance, with $w = 0.1$ and $w = 0.3$ performing slightly better than the other weights. The consistently higher JGA* values across all data splits indicate that the rule-based value extraction method has some limitations.
\subsection{Multi-prompt methods}
\subsubsection{Prompt Ensembling results}
For the prompt ensemble experiments, the four value-based prompts shown in Table \ref{table:6} are used. Only a single inverse prompt with weight $w = 0.1$ (Eq. \ref{eq:3}) is used during training. Table \ref{table:10} shows the results of prompt ensembling under few-shot settings. Compared to a single value-based prompt, the ensemble shows only a slight improvement on some data splits; contrary to expectations, it did not yield a significant improvement on the JGA metric. Its performance is also nearly constant across the larger data splits (\textsl{50-dpd}, \textsl{100-dpd}, \textsl{125-dpd}, \textsl{250-dpd}).
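To make the ensembling step concrete, the sketch below scores each candidate value under every prompt and combines the scores. Averaging log-probabilities across prompts is an assumed combination rule, and \texttt{model.log\_prob} is a hypothetical scoring helper, not the implementation used here.
\begin{verbatim}
# Sketch of prompt ensembling over four value-based prompts.
# Averaging log-probabilities is an assumed combination rule;
# model.log_prob is a hypothetical scoring helper.
def ensemble_predict(model, context, prompts, candidate_values):
    scores = {v: 0.0 for v in candidate_values}
    for prompt in prompts:
        for v in candidate_values:
            scores[v] += model.log_prob(context + prompt.format(value=v))
    # average over prompts; the argmax is unaffected by the division
    return max(scores, key=lambda v: scores[v] / len(prompts))
\end{verbatim}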
\vspace{0.25cm}
\begin{table}[h!]
\centering
\begingroup
\setlength{\tabcolsep}{12pt} % Default value: 6pt
\renewcommand{\arraystretch}{1.2} % Default value: 1
\begin{tabular}{lcc}
\hline
\textbf{\makecell{Data split}} & \textbf{JGA} & \textbf{JGA*}\\
\hline
\textsl{5-dpd} & 30.09 & 69.23 \\
\textsl{10-dpd} & 42.84 & 86.99 \\
\textsl{50-dpd} & 47.62 & 91.74 \\
\textsl{100-dpd} & 48.08 & 93.10 \\
\textsl{125-dpd} & 46.96 & 92.08 \\
\textsl{250-dpd} & \textbf{48.30} & \textbf{93.44} \\
\hline
\end{tabular}
\endgroup
\caption{Few-shot experimental results from prompt ensembling (multi-prompt method). Four \textit{value-based prompts} are used at training and inference time. The term \textquote{\textsl{dpd}} stands for \textquote{\textsl{dialogues per domain}}.}
\label{table:10}
\end{table}
Fine-tuning the value-based prompt together with the inverse prompt already performs exceptionally well on the DST task, while the rule-based value extraction approach reaches a turn-level accuracy of only \textsl{49\%} owing to its limitations. These extraction errors cap the achievable accuracy and leave little room for the prompt ensemble model to improve over a single value-based prompt.
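To illustrate why the extractor caps performance, a rule-based extractor of this kind typically relies on surface string matching against the ontology; the sketch below follows that assumption and is not the exact rule set used here.
\begin{verbatim}
# Assumed string-matching value extractor (the actual rules may
# differ). Paraphrased or implicit values, e.g. "doesn't matter"
# for "dontcare", are missed, limiting turn-level accuracy.
def extract_values(utterance, ontology):
    found = {}
    text = utterance.lower()
    for (domain, slot), values in ontology.items():
        for value in values:
            if value.lower() in text:
                found[(domain, slot)] = value
    return found
\end{verbatim}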
\subsubsection{Prompt Augmentation results}
\vspace{0.2cm}
\begin{table}[h!]
\centering
\begingroup
\setlength{\tabcolsep}{14pt} % Default value: 6pt
\renewcommand{\arraystretch}{1.15} % Default value: 1
\begin{tabular}{l cc c cc}
\hline
& \multicolumn{2}{c}{\textbf{Sample 1}} & & \multicolumn{2}{c}{\textbf{Sample 2}}\\
\textbf{Data split} & JGA & JGA* & & JGA & JGA*\\
\hline
\textsl{5-dpd} & 26.02 & 58.60 & & 27.60 & 59.39 \\
\textsl{10-dpd} & 33.26 & 70.14 & & 34.95 & 77.94 \\
\textsl{50-dpd} & 38.80 & 71.38 & & \textbf{39.77} & \textbf{74.55} \\
\textsl{100-dpd} & 35.97 & 70.89 & & 38.46 & 74.89 \\
\textsl{125-dpd} & 36.09 & 73.08 & & 36.18 & 76.47 \\
\textsl{250-dpd} & 35.63 & 72.90 & & 38.91 & 76.70 \\
\hline
\end{tabular}
\endgroup
\caption{Few-shot experimental results from prompt augmentation (\textit{demonstration learning}, a multi-prompt method). \textsl{Sample 1} contains 8 demonstrations and \textsl{Sample 2} contains 5 demonstrations. The term \textquote{\textsl{dpd}} stands for \textquote{\textsl{dialogues per domain}}.}
\label{table:11}
\end{table}
\paragraph{} Prompt augmentation (or \textit{demonstration learning}) provides additional context to the language model in the form of \textquote{answered prompts} at inference time. Two sets of hand-picked demonstrations are concatenated to the input to help the language model generate belief states more accurately. Table \ref{table:11} shows the results for the two demonstration sets (\textsl{Sample 1} \& \textsl{Sample 2}). The demonstration-learning approach struggled to generate belief states accurately, and its performance lags behind the other prompt-based methods across all data splits. The selection of demonstrations plays an important role: \textsl{Sample 2}, with 5 demonstrations, performs slightly better than \textsl{Sample 1} with 8. Because GPT-2's maximum input length is 1024 tokens, only a limited number of demonstrations can be provided, which biased slot generation at inference time.
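For illustration, the augmented input can be assembled as sketched below; the demonstration format and the drop-oldest truncation policy are assumptions, with GPT-2's 1024-token limit as the binding constraint.
\begin{verbatim}
# Sketch of demonstration-augmented input construction (format
# and truncation policy are assumptions). GPT-2 accepts at most
# 1024 tokens, which bounds how many demonstrations fit.
MAX_TOKENS = 1024

def build_input(demos, context, prompt, tokenizer):
    text = "\n".join(demos + [context, prompt])
    while demos and len(tokenizer.encode(text)) > MAX_TOKENS:
        demos = demos[1:]  # drop a demonstration to fit
        text = "\n".join(demos + [context, prompt])
    return text
\end{verbatim}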
\clearpage
\subsection{Comparison of results}
This section summarizes and compares the experimental results across all methods. Table \ref{table:12} presents the top results from each method.
\begin{table}[h!]
\centering
\small
\begingroup
\setlength{\tabcolsep}{4pt} % Default value: 6pt
\renewcommand{\arraystretch}{1.2} % Default value: 1
\begin{tabular}{l c cc ccc ccc ccc}
\hline
& \textbf{\textsl{Baseline}} & \multicolumn{2}{c}{\textbf{\textsl{VbP}}} & & \multicolumn{2}{c}{\textbf{\textsl{VbP+Inv}}} & & \multicolumn{2}{c}{\textbf{\textsl{PrEns}}} & & \multicolumn{2}{c}{\textbf{\textsl{PrAug}}}\\
\textbf{Data split} & JGA & JGA & JGA* & & JGA & JGA* & & JGA & JGA* & & JGA & JGA*\\
\hline
\textsl{5-dpd} & 9.06 & 26.81 & 64.25 & & \textbf{31.67} & \textbf{73.19} & & 30.09 & 69.23 & & 27.60 & 59.39 \\
\textsl{10-dpd} & 14.20 & 41.10 & 82.35 & & 42.65 & 86.43 & & \textbf{42.84} & \textbf{86.99} & & 34.95 & 77.94 \\
\textsl{50-dpd} & 28.64 & 45.70 & 90.70 & & 47.06 & 91.63 & & \textbf{47.62} & \textbf{91.74} & & 39.77 & 74.55 \\
\textsl{100-dpd} & 33.11 & 47.74 & 91.86 & & \textbf{48.42} & 92.42 & & 48.08 & \textbf{93.10} & & 38.46 & 74.89 \\
\textsl{125-dpd} & 35.79 & 45.02 & 90.61 & & 46.83 & 91.74 & & \textbf{46.96} & \textbf{92.08} & & 36.18 & 76.47 \\
\textsl{250-dpd} & 40.38 & 46.15 & 91.40 & & 47.62 & 92.65 & & \textbf{48.30} & \textbf{93.44} & & 38.91 & 76.70 \\
\hline
\end{tabular}
\endgroup
\caption{Top evaluation results from all the few-shot experiments. \textsl{Baseline}: \textsc{Soloist} baseline model; \textsl{VbP}: \textsl{value-based prompt} without inverse prompt; \textsl{VbP+Inv}: \textsl{value-based prompt} and \textsl{inverse prompt}; \textsl{PrEns}: \textsl{prompt ensembling} with \textsl{inverse prompt}; \textsl{PrAug}: \textsl{prompt augmentation}; \textsl{dpd}: \textsl{dialogues per domain}.}
\label{table:12}
\end{table}
The prompt-based methods performed significantly better than the baseline model. For the lowest-resource data split, \textsl{5-dpd}, the single value-based prompt together with the inverse prompt outperformed all other methods. For the remaining data splits (\textsl{10-dpd}, \textsl{50-dpd}, \textsl{100-dpd}, \textsl{125-dpd}, and \textsl{250-dpd}), the prompt ensembling model achieved the top results, although its improvements over the single value-based prompt are only minor. The inverse prompt mechanism has a noticeable impact on the prompt-based model, especially under the low-resource data splits (\textsl{5-dpd}, \textsl{10-dpd}). The prompt augmentation approach struggled to take advantage of the demonstration samples.
\clearpage