\section{Results}\label{sec:results}
This section presents the experimental results for all the methods described in the previous sections. Few-shot experiments are performed on every data split (see Table \ref{table:2}) for each method. The \textsc{Soloist} baseline model is evaluated only on the JGA metric; for the prompt-based methods, JGA* is computed in addition to JGA.
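As a point of reference, JGA is the percentage of dialogue turns whose predicted belief state exactly matches the gold annotation. The following minimal sketch illustrates this computation; the function name and the representation of belief states as slot--value dictionaries are assumptions made for illustration, not the evaluation code used in these experiments.
\begin{verbatim}
def joint_goal_accuracy(predicted_states, gold_states):
    # A belief state is modelled here as a dict mapping
    # (domain, slot) pairs to values, e.g.
    # {("hotel", "area"): "north", ("hotel", "stars"): "4"}.
    correct = sum(1 for pred, gold in zip(predicted_states, gold_states)
                  if pred == gold)
    return 100.0 * correct / len(gold_states)
\end{verbatim}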
\subsection{SOLOIST Baseline}
Table \ref{table:8} shows the results of the fine-tuned \textsc{Soloist} baseline model under few-shot settings. The baseline performed poorly and struggled to generate belief states under the low-resource settings (\textsl{5-dpd}, \textsl{10-dpd}, \textsl{50-dpd}), where the limited number of training samples made it challenging for the model to generate unseen belief states. The results also suggest that the baseline requires more data, as it achieves better results on the data splits with a higher number of data samples (\textsl{125-dpd}, \textsl{250-dpd}).
\vspace{0.5cm}
\begin{table}[h!]
\centering
\begingroup
\setlength{\tabcolsep}{16pt} % Default value: 6pt
\renewcommand{\arraystretch}{1.25} % Default value: 1
\begin{tabular}{lc}
\hline
\textbf{\makecell{Data split (\# dialogues)}} & \textbf{JGA} \\
\hline
\textsl{5-dpd} (25) & 9.06 \\
\textsl{10-dpd} (50) & 14.20 \\
\textsl{50-dpd} (250) & 28.64 \\
\textsl{100-dpd} (500) & 33.11 \\
\textsl{125-dpd} (625) & 35.79 \\
\textsl{250-dpd} (1125) & \textbf{40.38} \\
\hline
\end{tabular}
\endgroup
\caption{Few-shot experimental results of the \textsc{Soloist} baseline model. The term \textquote{\textsl{dpd}} stands for \textquote{\textsl{dialogues per domain}}.}
\label{table:8}
\end{table}
\subsection{Prompt-based methods}
This section presents the evaluation results for the prompt-based methods. A single value-based prompt is used together with the inverse prompt. Experiments are performed on all the MultiWOZ data splits and evaluated with the JGA and JGA* metrics. To analyze the influence of the inverse prompt mechanism, experiments are repeated with inverse prompt loss weights $w \in \{0.1, 0.3, 0.5, 0.7\}$.
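For reference, this weighting corresponds to a combined training objective of the form
\begin{equation*}
\mathcal{L} = \mathcal{L}_{\text{value}} + w \cdot \mathcal{L}_{\text{inv}},
\end{equation*}
where $\mathcal{L}_{\text{value}}$ is the loss of the value-based prompt and $\mathcal{L}_{\text{inv}}$ the loss of the auxiliary inverse prompt. This is only a paraphrase of Eq. \ref{eq:3}; the symbols $\mathcal{L}_{\text{value}}$ and $\mathcal{L}_{\text{inv}}$ are introduced here for readability.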
\begin{table}[h!]
\centering
\small
\begingroup
\setlength{\tabcolsep}{6pt} % Default value: 6pt
\renewcommand{\arraystretch}{1.1} % Default value: 1
\begin{tabular}{l ccc ccc ccc ccc}
\hline
& & \multicolumn{2}{c}{\textbf{$w$ = 0.1}} & & \multicolumn{2}{c}{\textbf{$w$ = 0.3}} & & \multicolumn{2}{c}{\textbf{$w$ = 0.5}} & & \multicolumn{2}{c}{\textbf{$w$ = 0.7}}\\
\textbf{Data split} & & JGA & JGA* & & JGA & JGA* & & JGA & JGA* & & JGA & JGA*\\
\hline
\textsl{5-dpd} & & 30.66 & 71.04 & & \textbf{31.67} & \textbf{73.19} & & 30.77 & 72.85 & & 29.98 & 70.93\\
\textsl{10-dpd} & & \textbf{42.65} & \textbf{86.43} & & 41.18 & 83.48 & & 40.05 & 80.77 & & 40.38 & 85.18\\
\textsl{50-dpd} & & \textbf{47.06} & \textbf{91.63} & & 46.49 & 91.18 & & 47.04 & 91.18 & & 46.27 & 90.05\\
\textsl{100-dpd} & & 47.74 & 92.31 & & \textbf{48.42} & \textbf{92.42} & & 48.19 & 92.65 & & 48.30 & 92.65\\
\textsl{125-dpd} & & 46.49 & 91.86 & & 46.15 & 91.18 & & \textbf{46.83} & \textbf{91.74} & & 46.15 & 90.95\\
\textsl{250-dpd} & & 47.06 & 92.08 & & \textbf{47.62} & \textbf{92.65} & & 47.40 & 92.31 & & 47.17 & 92.09\\
\hline
\end{tabular}
\endgroup
\caption{Few-shot experimental results from the prompt-based model. Only a single \textit{value-based prompt} is used together with the \textit{inverse prompt}. Evaluation results for different inverse prompt weights $w$ (in Eq. \ref{eq:3}) are shown in the table. The term \textquote{\textsl{dpd}} stands for \textquote{\textsl{dialogues per domain}}.}
\label{table:9}
\end{table}
\vspace{-4pt}
\paragraph{} Table \ref{table:9} shows the evaluation results of the prompt-based model under few-shot experiments. The prompt-based model significantly outperformed the baseline across all data splits. Increasing the number of data samples (\textsl{100-dpd}, \textsl{125-dpd}, \textsl{250-dpd}) yielded only minor performance improvements; for example, the JGA scores on the \textsl{50-dpd} and \textsl{250-dpd} splits differ by only about \textsl{1\%}. This suggests that the prompt-based method learns the DST task efficiently even under very low-resource settings. The JGA* score is higher than the JGA score in every experiment across all data splits, which points to limitations in the current rule-based value-extraction method.
\vspace{-6pt}
\paragraph{} The different inverse prompt loss weights $w$ yield similar performance, with only a small difference of 1--2\% in the JGA scores. Since the inverse prompt is only an auxiliary task that supports the main value-based prompt, increasing its loss weight $w$ did not have a positive impact on model performance. Instead, the prompt-based model achieved better results when the inverse prompt loss weight $w$ was less than \textsl{0.5}.
\subsection{Multi-prompt methods}
\subsubsection{Prompt Ensembling results}
For the prompt ensemble experiments, the four value-based prompts shown in Table \ref{table:6} are used. Only a single inverse prompt with weight $w = 0.1$ (Eq. \ref{eq:3}) is utilized during training. Table \ref{table:10} shows the results of prompt ensembling under few-shot settings. Compared to a single value-based prompt, the prompt ensemble model shows only a slight improvement on some data splits. Contrary to expectations, it did not achieve a significant improvement on the JGA metric. Its performance is also similar across the larger data splits, i.e., \textsl{50-dpd}, \textsl{100-dpd}, \textsl{125-dpd}, and \textsl{250-dpd}.
\vspace{0.25cm}
\begin{table}[h!]
\centering
\begingroup
\setlength{\tabcolsep}{12pt} % Default value: 6pt
\renewcommand{\arraystretch}{1.2} % Default value: 1
\begin{tabular}{lcc}
\hline
\textbf{\makecell{Data split}} & \textbf{JGA} & \textbf{JGA*}\\
\hline
\textsl{5-dpd} & 30.09 & 69.23 \\
\textsl{10-dpd} & 42.84 & 86.99 \\
\textsl{50-dpd} & 47.62 & 91.74 \\
\textsl{100-dpd} & 48.08 & 93.10 \\
\textsl{125-dpd} & 46.96 & 92.08 \\
\textsl{250-dpd} & \textbf{48.30} & \textbf{93.44} \\
\hline
\end{tabular}
\endgroup
\caption{Few-shot experimental results from prompt ensembling (multi-prompt method). Four \textit{value-based prompts} are used at training and inference time. The term \textquote{\textsl{dpd}} stands for \textquote{\textsl{dialogues per domain}}.}
\label{table:10}
\end{table}
Fine-tuning the value-based prompt together with the inverse prompt already performs exceptionally well on the DST task, while the rule-based value-extraction approach reaches a turn-level accuracy of only \textsl{49\%} because of its limitations. These limitations leave little room for the prompt ensemble model to improve over a single value-based prompt.
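For illustration, the sketch below shows one plausible way to aggregate the four value-based prompts at inference time. The majority vote over decoded values and the \texttt{generate} wrapper around the fine-tuned language model are assumptions made for this sketch, not necessarily the aggregation used in the experiments.
\begin{verbatim}
from collections import Counter

def ensemble_slot_value(context, slot, prompts, generate):
    # Decode the slot value once per value-based prompt and
    # return the value produced most often (majority vote).
    candidates = [generate(p.format(context=context, slot=slot))
                  for p in prompts]
    return Counter(candidates).most_common(1)[0][0]
\end{verbatim}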
\subsubsection{Prompt Augmentation results}
\begin{table}[h!]
\centering
\begingroup
\setlength{\tabcolsep}{14pt} % Default value: 6pt
\renewcommand{\arraystretch}{1.15} % Default value: 1
\begin{tabular}{l ccc ccc}
\hline
& \multicolumn{2}{c}{\textbf{Sample 1}} & & \multicolumn{2}{c}{\textbf{Sample 2}}\\
\textbf{Data split} & JGA & JGA* & & JGA & JGA*\\
\hline
\textsl{5-dpd} & 26.02 & 58.60 & & 27.60 & 59.39 \\
\textsl{10-dpd} & 33.26 & 70.14 & & 34.95 & 77.94 \\
\textsl{50-dpd} & 38.80 & 71.38 & & \textbf{39.77} & \textbf{74.55} \\
\textsl{100-dpd} & 35.97 & 70.89 & & 38.46 & 74.89 \\
\textsl{125-dpd} & 36.09 & 73.08 & & 36.18 & 76.47 \\
\textsl{250-dpd} & 35.63 & 72.90 & & 38.91 & 76.70 \\
\hline
\end{tabular}
\endgroup
\caption{Few-shot experimental results from prompt augmentation (multi-prompt method). \textsl{Sample 1} contains 8 demonstrations and \textsl{Sample 2} contains 5 demonstrations.}
\label{table:11}
\end{table}
\vspace{-4pt}
Prompt augmentation provides additional context to the language model in the form of \textquote{answered prompts} at inference time. Two sets of hand-picked demonstration samples are concatenated to the input to help the language model generate belief states more accurately. Table \ref{table:11} shows the experimental results of the prompt augmentation method for the two demonstration sets (\textsl{sample 1} \& \textsl{sample 2}); the sets used in the experiments are listed in Appendices \ref{appendix:a1} and \ref{appendix:a2}. The results show that demonstration learning struggled to generate the belief states accurately, and its performance is inadequate across all data splits when compared to the other prompt-based methods. The selection of demonstrations plays an important role in model performance: \textsl{sample 2}, with 5 demonstrations, performs slightly better than \textsl{sample 1}, and the top score is achieved when \textsl{sample 2} is used with the \textsl{50-dpd} fine-tuned model. The fine-tuned GPT-2 language model used in the experiments has a maximum input length of \textsl{1024} tokens, so only a limited number of demonstration examples can be appended to the input. With so few demonstrations, the language model struggled to grasp the DST task, and the bias introduced by the demonstration examples further degraded performance.
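A minimal sketch of fitting demonstrations into the \textsl{1024}-token budget is given below. The separator, the greedy order in which demonstrations are kept, and the \texttt{tokenizer.encode} interface (in the style of the Hugging Face tokenizers) are assumptions for illustration.
\begin{verbatim}
def build_augmented_input(demos, context, prompt, tokenizer,
                          max_len=1024):
    # Reserve room for the dialogue context and the prompt itself,
    # then prepend as many demonstrations as the budget allows.
    budget = max_len - len(tokenizer.encode(context + prompt))
    kept = []
    for demo in demos:
        cost = len(tokenizer.encode(demo + "\n"))
        if cost > budget:
            break  # no room for further demonstrations
        kept.append(demo)
        budget -= cost
    return "\n".join(kept + [context, prompt])
\end{verbatim}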
\clearpage
\subsection{Comparison of results}
This section summarizes all the experimental results and compares the different methods. Table \ref{table:12} presents the top results from all the experimental methods.
\begin{table}[h!]
\centering
\small
\begingroup
\setlength{\tabcolsep}{4pt} % Default value: 6pt
\renewcommand{\arraystretch}{1.2} % Default value: 1
\begin{tabular}{l c cc ccc ccc ccc}
\hline
& \textbf{\textsl{Baseline}} & \multicolumn{2}{c}{\textbf{\textsl{VbP}}} & & \multicolumn{2}{c}{\textbf{\textsl{VbP+Inv}}} & & \multicolumn{2}{c}{\textbf{\textsl{PrEns}}} & & \multicolumn{2}{c}{\textbf{\textsl{PrAug}}}\\
\textbf{Data split} & JGA & JGA & JGA* & & JGA & JGA* & & JGA & JGA* & & JGA & JGA*\\
\hline
\textsl{5-dpd} & 9.06 & 26.81 & 64.25 & & \textbf{31.67} & \textbf{73.19} & & 30.09 & 69.23 & & 27.60 & 59.39 \\
\textsl{10-dpd} & 14.20 & 41.10 & 82.35 & & 42.65 & 86.43 & & \textbf{42.84} & \textbf{86.99} & & 34.95 & 77.94 \\
\textsl{50-dpd} & 28.64 & 45.70 & 90.70 & & 47.06 & 91.63 & & \textbf{47.62} & \textbf{91.74} & & 39.77 & 74.55 \\
\textsl{100-dpd} & 33.11 & 47.74 & 91.86 & & \textbf{48.42} & 92.42 & & 48.08 & \textbf{93.10} & & 38.46 & 74.89 \\
\textsl{125-dpd} & 35.79 & 45.02 & 90.61 & & 46.83 & 91.74 & & \textbf{46.96} & \textbf{92.08} & & 36.18 & 76.47 \\
\textsl{250-dpd} & 40.38 & 46.15 & 91.40 & & 47.62 & 92.65 & & \textbf{48.30} & \textbf{93.44} & & 38.91 & 76.70 \\
\hline
\end{tabular}
\endgroup
\caption{Top evaluation results from all the few-shot experiments. \textsl{Baseline}: \textsc{Soloist} baseline model; \textsl{VbP}: \textsl{value-based prompt} without inverse prompt; \textsl{VbP+Inv}: \textsl{value-based prompt} and \textsl{inverse prompt}; \textsl{PrEns}: \textsl{prompt ensembling} with \textsl{inverse prompt}; \textsl{PrAug}: \textsl{prompt augmentation}; \textsl{dpd}: \textsl{dialogues per domain}.}
\label{table:12}
\end{table}
The prompt-based methods performed significantly better than the baseline model. On the lowest-resource data split, \textsl{5-dpd}, the single value-based prompt together with the inverse prompt outperformed all other methods. On all the other data splits (\textsl{10-dpd}, \textsl{50-dpd}, \textsl{100-dpd}, \textsl{125-dpd}, and \textsl{250-dpd}), the prompt ensembling model achieved the top results. These results clearly demonstrate the effectiveness of the prompt-based methods over the baseline, especially under extremely low-resource settings. However, it is important to note that the prompt ensemble model achieved only minor improvements over the single value-based prompt. The inverse prompt mechanism also has a noticeable impact on the prompt-based model, especially on the low-resource data splits (\textsl{5-dpd}, \textsl{10-dpd}). The prompt augmentation approach, in contrast, struggled to take advantage of the demonstration samples.
\clearpage