diff --git a/writing/latex/images/ims_logo.png b/writing/latex/images/ims_logo.png new file mode 100644 index 0000000..222380f Binary files /dev/null and b/writing/latex/images/ims_logo.png differ diff --git a/writing/latex/sections/04_methods.tex b/writing/latex/sections/04_methods.tex index 4904696..a1d0020 100644 --- a/writing/latex/sections/04_methods.tex +++ b/writing/latex/sections/04_methods.tex @@ -185,7 +185,7 @@ where $|K|$ represents the number of prompt functions, $f_{k}$ is the $k$-th pro \label{table:7} \end{table} -\paragraph{} Experiments are performed on two \textit{sets} of demonstration samples to understand the importance of sample selection. The \textit{sample selection} of the demonstrations is manually hand-picked and hand-crafted from the training data. Each demonstration sample contains multiple answered prompts (\textit{dialog history + belief states}) covering all the domains and slots. These demonstration examples are specifically chosen after gaining insights from the initial error analysis of the prompt-based methods. The first sample set contains 8 examples, and the second sample contains 5 examples. The number of demonstration examples that can be picked in a single sample set is bounded by the GPT-2 input length of 1024. The demonstrations from each sample set are concatenated to the input and given to the fine-tuned prompt model for generating the slots. +\paragraph{} Experiments are performed on two \textit{sets} (listed in Appendix \ref{appendix:a1} and \ref{appendix:a2}) of demonstration samples to understand the importance of sample selection. The \textit{sample selection} of the demonstrations is performed manually by hand-picking examples from the training data. Each sample set contains multiple answered prompts (\textit{dialog history + belief states}) covering all the domains and slots. These demonstration examples are specifically chosen after gaining insights from the initial error analysis of the prompt-based methods. The first sample set contains 8 examples, and the second sample set contains 5 examples. The number of demonstration examples that can be picked in a single sample set is bounded by the GPT-2 maximum input length of 1024 tokens. The demonstrations from each sample set are concatenated to the input and given to the fine-tuned prompt model for generating the slots. \subsection{Evaluation Metrics} The standard evaluation metric joint goal accuracy (JGA) is adopted to evaluate the belief state predictions of baseline and prompt-based methods. This metric compares all the predicted belief states to the ground-truth states at each turn. The prediction is correct only if all the predicted belief states match the ground-truth states. Both slots and values must exactly match for the belief state prediction to be correct. The rule-based methods used in value extraction can lead to many false positives in the value candidates. In order to exclude the influence of wrongly extracted values, \citet{yang2022prompt} proposed JGA*, the joint goal accuracy is computed only for the belief states where the values are extracted correctly. These evaluation metrics answer the following questions: \textsf{Q1:} How do the prompt-based methods perform overall compared to the Soloist baseline? \textsf{Q2:} Can the prompt-based methods perform better under low-resource settings? \textsf{Q3:} For prompt-based methods, does JGA* metric hold a better score than JGA? \textsf{Q4:} Can multi-prompt techniques together perform better than a single-prompt?
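For reference, the turn-level matching rule described in the evaluation-metrics paragraph above (a turn counts as correct only when every predicted slot and value matches the ground truth exactly) can be sketched directly. This is a minimal illustration, not the evaluation code used in the thesis; the dictionary representation of belief states, the function names, and the values_extracted_correctly flags used for JGA* are assumptions made for the sketch.

from typing import Dict, List

# A belief state is assumed to be a mapping from slot name to value,
# e.g. {"area": "south", "food": "italian"}.
BeliefState = Dict[str, str]

def joint_goal_accuracy(predictions: List[BeliefState],
                        references: List[BeliefState]) -> float:
    """JGA: fraction of turns whose predicted belief state matches the
    ground truth exactly, i.e. every slot and every value must agree."""
    if not references:
        return 0.0
    correct = sum(1 for pred, gold in zip(predictions, references) if pred == gold)
    return correct / len(references)

def joint_goal_accuracy_star(predictions: List[BeliefState],
                             references: List[BeliefState],
                             values_extracted_correctly: List[bool]) -> float:
    """JGA*: the same turn-level check, computed only over turns whose value
    candidates were extracted correctly by the rule-based extractor
    (following the description of Yang et al., 2022)."""
    scored = [(p, g) for p, g, ok in zip(predictions, references, values_extracted_correctly) if ok]
    if not scored:
        return 0.0
    return sum(1 for p, g in scored if p == g) / len(scored)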
diff --git a/writing/latex/sections/05_results.tex b/writing/latex/sections/05_results.tex index 2e47c3e..46c4d8e 100644 --- a/writing/latex/sections/05_results.tex +++ b/writing/latex/sections/05_results.tex @@ -64,10 +64,10 @@ This section presents the evaluation results for the prompt-based methods. A sin \caption{Few-shot experimental results from the prompt-based model. Only a single \textit{value-based prompt} is used together with the \textit{inverse prompt}. Evaluation results for different inverse prompt weights $w$ (in Eq. \ref{eq:3}) are shown in the table. The term \textquote{\textsl{dpd}} stands for \textquote{\textsl{dialogues per domain}}.} \label{table:9} \end{table} - -\paragraph{} Table \ref{table:9} shows the evaluation results of the prompt-based model under few-shot experiments. Experimental results show the prompt-based model significantly outperformed the baseline model across all data splits. For low-resource data splits like \textsl{5-dpd}, \textsl{10-dpd}, and \textsl{50-dpd}, the prompt-based model shows substantial improvement over the baseline, achieving a performance gain in the JGA metric by \textsl{22}, \textsl{28}, and \textsl{18} points respectively. The results clearly demonstrate the effectiveness of the prompt-based methods over baseline, especially under extremely low-resource settings. +\vspace{-4pt} +\paragraph{} Table \ref{table:9} shows the evaluation results of the prompt-based model under few-shot experiments. Experimental results show the prompt-based model significantly outperformed the baseline model across all data splits. Results also show that increasing the number of data samples in the few-shot experiments (i.e., \textsl{100-dpd}, \textsl{125-dpd}, \textsl{250-dpd}) brought only minor performance improvements. For example, the performance on data splits \textsl{50-dpd} and \textsl{250-dpd} is similar, with only around a \textsl{1\%} difference in the JGA score. This suggests the prompt-based methods can learn the DST task efficiently even under very low-resource settings. In addition, the JGA* metric achieved a better score than JGA in every experiment across all data splits, which indicates limitations in the current rule-based value-extraction methods. \vspace{-6pt} -\paragraph{} The results from the prompt-based methods also show that by increasing the number of data samples in the few-shot experiments, the model only achieved minor performance improvements. This suggests the prompt-based methods can understand the DST task better under low-resource settings. For example, the performance of prompt-based methods under data splits \textsl{50-dpd} and \textsl{250-dpd} is nearly identical. Different inverse prompt loss weights $w$ resulted in a similar performance, with $w = 0.1$ and $w = 0.3$ performing slightly better than other weights. The higher values of the JGA* metric across all data splits indicates the rule-based value extraction methods have some limitations. +\paragraph{} The experiments with different inverse prompt weights $w$ yielded similar performance, with only a small difference of 1--2\% in the JGA scores. Since the inverse prompt is only an auxiliary task that supports the main value-based prompt, increasing its loss weight $w$ did not have any positive impact on the model performance. Instead, the prompt-based model achieved better results when the inverse prompt loss weight $w$ was less than \textsl{0.5}.
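As a rough illustration of the role of the weight $w$ discussed above, the fine-tuning objective can be read as the main value-based prompt loss plus a scaled auxiliary inverse-prompt loss. The exact form is defined by Eq. 3 in the thesis; the additive combination and the argument names below are assumptions made only for this sketch, and small weights (w < 0.5) are the ones the results above favour.

def total_finetuning_loss(value_prompt_loss, inverse_prompt_loss, w=0.1):
    """Assumed form of the combined objective: the main value-based prompt
    loss plus the auxiliary inverse-prompt loss scaled by w.
    Works with plain floats as well as framework tensors."""
    return value_prompt_loss + w * inverse_prompt_loss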
\subsection{Multi-prompt methods} @@ -106,7 +106,6 @@ The fine-tuning of the value-based prompt together with the inverse prompt perfo \subsubsection{Prompt Augmentation results} -\vspace{0.2cm} \begin{table}[h!] \centering \begingroup @@ -131,12 +130,12 @@ The fine-tuning of the value-based prompt together with the inverse prompt perfo \hline \end{tabular} \endgroup -\caption{Experimental results from demonstration learning (multi-prompt method). \textsl{Sample 1} contains 8 demonstrations and \textsl{Sample 2} contains 5 demonstrations. The term \textquote{\textsl{dpd}} stands for \textquote{\textsl{dialogues per domain}}.} +\caption{Experimental results from demonstration learning (multi-prompt method). \textsl{Sample 1} contains 8 demonstrations and \textsl{Sample 2} contains 5 demonstrations.} \label{table:11} \end{table} -\paragraph{} Prompt augmentation (or \textit{demonstration learning}) provides additional context to the language models in the form of \textquote{answered prompts} at inference time. Two sets of demonstration samples are hand-picked and concatenated to the input to help the language model generate belief states more accurately. Table \ref{table:11} shows the experimental results from the prompt augmentation method on two demonstration samples (sample 1 \& sample 2). Results show the demonstration learning struggled to generate the belief states accurately. The performance is inadequate across all data splits when compared to other prompt-based methods. The sample selection of demonstrations plays an important role in the model performance. The results from \textsl{sample 2} with 5 demonstrations perform slightly better than \textsl{sample 1}. Only a limited number of demonstrations can be provided to the GPT-2 language model due to its maximum input sequence length of 1024, which led to bias during the slot generation at inference time. - +\vspace{-4pt} +Prompt augmentation provides additional context to the language models in the form of \textquote{answered prompts} at inference time. Two sets of demonstration samples are hand-picked and concatenated to the input to help the language model generate belief states more accurately. Table \ref{table:11} shows the experimental results from the prompt augmentation method on two demonstration samples (sample 1 \& sample 2). The demonstration sample sets used in the experiments are listed in Appendix \ref{appendix:a1} and \ref{appendix:a2}. Results show that demonstration learning struggled to generate the belief states accurately. The performance is inadequate across all data splits when compared to other prompt-based methods. The sample selection of demonstrations plays an important role in the model performance. The results from \textsl{sample 2} with 5 demonstrations are slightly better than those from \textsl{sample 1}. The best performance is achieved when \textsl{sample 2} is used with the \textsl{50-dpd} fine-tuned model. The fine-tuned GPT-2 language model used in the experiments has a maximum input length of \textsl{1024} tokens. This input-size restriction allows only a limited number of demonstration examples to be appended to the input. With only a few demonstrations provided, the language model struggled to understand the DST task, and the bias introduced by the demonstration examples negatively impacted the performance.
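The 1024-token restriction discussed above translates into a simple packing step at inference time. A sketch under stated assumptions: it uses the Hugging Face GPT-2 tokenizer and plain newline concatenation of answered prompts in front of the current dialog input; the actual prompt formatting and selection order used in the experiments may differ, and the helper name is illustrative.

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
MAX_INPUT_TOKENS = 1024  # GPT-2 maximum input length

def build_augmented_input(demonstrations, dialog_input):
    """Prepend as many answered prompts (demonstrations) as fit into the
    1024-token budget, always keeping room for the current dialog input."""
    budget = MAX_INPUT_TOKENS - len(tokenizer.encode(dialog_input))
    selected = []
    for demo in demonstrations:
        demo_tokens = len(tokenizer.encode(demo))
        if demo_tokens > budget:
            break  # no room left for this or any further demonstration
        selected.append(demo)
        budget -= demo_tokens
    return "\n".join(selected + [dialog_input])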
\clearpage \subsection{Comparison of results} @@ -171,7 +170,7 @@ The section summarizes all the experimental results and compares them with diffe \label{table:12} \end{table} -The prompt-based methods achieved better results and performed significantly better than the baseline model. For the low-resource data split \textsl{5-dpd}, the single value-based prompt together with the inverse prompt outperformed all the other methods. For all the other data splits \textsl{10-dpd}, \textsl{50-dpd}, \textsl{100-dpd}, \textsl{125-dpd}, and \textsl{250-dpd}, the prompt ensembling model achieved the top results. However, it's important to note the prompt ensemble model only achieved minor improvements over the single value-based prompt. The inverse prompt mechanism also has a noticeable impact on the prompt-based model, especially under low-resource data splits (i.e., \textsl{5-dpd}, \textsl{10-dpd}). The prompt augmentation approach struggled to take advantage of the demonstration samples. +The prompt-based methods performed significantly better than the baseline model. For the low-resource data split \textsl{5-dpd}, the single value-based prompt together with the inverse prompt outperformed all the other methods. For all the other data splits \textsl{10-dpd}, \textsl{50-dpd}, \textsl{100-dpd}, \textsl{125-dpd}, and \textsl{250-dpd}, the prompt ensembling model achieved the top results. The results clearly demonstrate the effectiveness of the prompt-based methods over the baseline, especially under extremely low-resource settings. However, it is important to note that the prompt ensemble model achieved only minor improvements over the single value-based prompt. The inverse prompt mechanism also had a noticeable impact on the prompt-based model, especially under low-resource data splits (i.e., \textsl{5-dpd}, \textsl{10-dpd}). The prompt augmentation approach struggled to take advantage of the demonstration samples. \clearpage \ No newline at end of file diff --git a/writing/latex/sections/08_appendix.tex b/writing/latex/sections/08_appendix.tex new file mode 100644 index 0000000..edb1033 --- /dev/null +++ b/writing/latex/sections/08_appendix.tex @@ -0,0 +1,75 @@ +\section{Appendix} + +\subsection{Demonstrations used in sample set 1} \label{appendix:a1} + +\begin{table}[h!] +\centering +\begin{tabular}{l} +\hline +\textsf{user:} i need to be picked up from city centre after 16:30.\\ +\textsl{belief states:} city centre = departure, 16:30 = leave \\ [0.5cm] + +\textsf{user:} i am looking for a table at rice house restaurant \\for a party of 8 at 11:15 on thursday.\\ +\textsl{belief states:} rice house = name, thursday = day, 8 = people, 11:15 = time \\ [0.5cm] + +\textsf{user:} i need a train from cambridge that can arrive by 16:15.\\ +\textsf{system:} where is your destination? \textsf{user:} i want to go to broxbourne.\\ +\textsl{belief states:} cambridge = departure, 16:15 = arrive, broxbourne = destination \\ [0.5cm] + +\textsf{user:} hi, i need to leave from frankfurt airport.
can you find a train after 20:15?\\ +\textsl{belief states:} frankfurt airport = departure, 20:15 = leave \\ [0.5cm] + +\textsf{user:} i would like a restaurant in the south part of town that serves italian food.\\ +\textsl{belief states:} italian = food, south = area \\ [0.5cm] + +\textsf{user:} i'm looking for a 4-star place to stay on the west side that offers free wifi.\\ +\textsl{belief states:} west = area, 4 = stars, yes = internet \\ [0.5cm] + +\textsf{user:} can you please help me get information on cityroomz?\\ +i need to book it for 3 people and 2 nights starting tuesday.\\ +\textsl{belief states:} cityroomz = name, tuesday = day, 3 = people, 2 = stay \\ [0.5cm] + +\textsf{user:} i am looking for a museum in the town centre.\\ +\textsl{belief states:} museum = type, centre = area \\ + +\hline +\end{tabular} +\caption{Prompt Augmentation: Demonstration examples used in sample set 1} +\label{table:A.1} +\end{table} + +\clearpage + +\subsection{Demonstrations used in sample set 2} \label{appendix:a2} + +\begin{table}[h!] +\centering +\begin{tabular}{l} +\hline + +\textsf{user:} hello, i'm looking for a 4 star place on the west side to stay at.\\ +\textsf{system:} do you have a price range ?\\ +\textsf{user:} yes, i am looking in the expensive price range. also, i need free parking.\\ +\textsl{belief states:} west = area, 4 = stars, expensive = price, yes = parking \\ [0.5cm] + +\textsf{user:} i am looking for a train, it should go to cambridge \\and should depart from norwich.\\ +\textsf{system:} what time and day are you looking to travel?\\ +\textsf{user:} yes, i would like travel on monday and i would need to arrive by 08:30. \\ +\textsl{belief states:} cambridge = destination, norwich = departure, \\monday = day, 08:30 = arrive \\ [0.5cm] + +\textsf{user:} i would like an expensive place to dine, centre of town.\\ +\textsf{system:} what type of food would you like to eat?\\ +\textsf{user:} the type of food doesn't matter, but i need a reservation \\for 8 people on wednesday at 12:45\\ +\textsl{belief states:} expensive = price, centre = area, dont care = food, \\8 = people, wednesday = day, 12:45 = time \\ [0.5cm] + +\textsf{user:} i need a taxi to pick me up at curry prince at 08:15.\\ +\textsl{belief states:} curry prince = departure, 08:15 = leave \\ [0.5cm] + +\textsf{user:} i am looking for a place in the centre of town that is a nightclub.\\ +\textsl{belief states:} night club = type, centre = area \\ + +\hline +\end{tabular} +\caption{Prompt Augmentation: Demonstration examples used in sample set 2} +\label{table:A.2} +\end{table} \ No newline at end of file diff --git a/writing/latex/thesis.pdf b/writing/latex/thesis.pdf index fadd3d9..ca5f6f6 100644 Binary files a/writing/latex/thesis.pdf and b/writing/latex/thesis.pdf differ diff --git a/writing/latex/thesis.tex b/writing/latex/thesis.tex index 409869c..dc5128a 100644 --- a/writing/latex/thesis.tex +++ b/writing/latex/thesis.tex @@ -131,4 +131,10 @@ %% Add all your BibTex citations in references.bib file \bibliography{references} +\clearpage + +\appendix + +\input{sections/08_appendix} + \end{document} \ No newline at end of file
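The demonstrations in both appendix tables write belief states as comma-separated "value = slot" pairs (e.g., "west = area, 4 = stars"). A small helper for moving between that textual form and a slot dictionary is sketched below; the separator handling is read off the tables above, and the function names are illustrative rather than part of the thesis code.

def parse_belief_states(text):
    """Turn a 'value = slot, value = slot' string, as printed in the
    appendix tables, into a slot -> value dictionary."""
    states = {}
    for pair in text.split(","):
        if "=" not in pair:
            continue
        value, slot = (part.strip() for part in pair.split("=", 1))
        states[slot] = value
    return states

def format_belief_states(states):
    """Render a slot -> value dictionary back into the 'value = slot'
    notation used in the demonstrations."""
    return ", ".join(f"{value} = {slot}" for slot, value in states.items())

# For example, from the first demonstration of sample set 2:
# parse_belief_states("west = area, 4 = stars, expensive = price, yes = parking")
# -> {"area": "west", "stars": "4", "price": "expensive", "parking": "yes"}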