Updated LaTeX code

Branch: main
Author: Pavan Mandava, 3 years ago
parent 7fd36b9134, commit c4ddc0d386

@ -44,8 +44,8 @@ where $T_b$ is the generated belief states sequence length, $b_{<t}$ indicates a
\subsection{Prompt Learning}
\paragraph{} Prompt-based learning is a new way of using pre-trained language models more efficiently for solving language tasks. It involves reformulating the task using textual prompts, and the language model generates the desired output directly from the prompts. \citet{brown2020gpt3} explored this approach on GPT-3 by providing a textual template and a few demonstrations for each task. The main idea behind this approach is to efficiently use the generation capabilities of PLMs. Table \ref{table:1} introduces some terminology, notation, and an emotion classification example. The original input $x$ is modified using the \textit{prompting function}, which generates the \textit{prompt} $x^{\prime}$. The \textit{prompt function} or \textit{prompt template} typically contains text and two slots: the input slot $[X]$ for filling in the input $x$ and the answer slot $[Z]$ for generating the answer $z$. The prompt $x^{\prime}$ is given to the PLM to directly generate the answer $z$. For tasks such as emotion classification, a further step of answer mapping is required to get from the answer $z$ to the final output $y$. For example, multiple emotion-related words (such as \textit{happy, joyful, delighted, pleased}) can belong to the same output class (e.g., \textquote{\textit{joy}}). In this case, if the PLM generates the answer \textquote{\textit{happy}}, it is mapped to the output class \textquote{\textit{joy}}. For some tasks involving text generation, answer mapping is usually not required; the generated answer $z$ becomes the output $y$.
\vspace{0.25cm}
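\paragraph{} To make the terminology concrete, the following Python sketch shows a prompting function and the answer mapping step for the emotion classification example. It is not part of the thesis code; the template text, answer words, and function names are illustrative assumptions.
\begin{verbatim}
# Sketch of prompt construction and answer mapping.
# Template, answer words, and names are illustrative assumptions.

def prompting_function(x: str) -> str:
    # Fill the input slot [X]; the answer slot [Z] is left
    # for the PLM to generate.
    return f"{x} Overall, it made me feel [Z]."

# Several answer words can map to one output class.
ANSWER_MAP = {"happy": "joy", "joyful": "joy",
              "delighted": "joy", "pleased": "joy"}

def answer_mapping(z: str) -> str:
    # For text generation tasks this step is skipped and y = z.
    return ANSWER_MAP.get(z.lower().strip(), z)

x = "I finally passed the exam."
x_prime = prompting_function(x)   # prompt x'
z = "happy"                       # answer generated by the PLM
y = answer_mapping(z)             # final output class: "joy"
\end{verbatim}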
\begin{table}[!ht]
\centering

@ -70,13 +70,14 @@ where $P\left(v^{\prime}_{t} \mid c_{t}, I\left(s_{t}\right)\right)$ is the prob
\end{equation}
where $w$ is a weight in the interval $(0,1)$ which adjusts the influence of the inverse prompt.
\paragraph{Training} For training the prompt-based methods, the pre-trained \textsc{Soloist} model (GPT-2 117M) is fine-tuned on the value-based prompt and the inverse prompt. All the MultiWOZ 2.1 data splits (Table 2) are used in the fine-tuning process in order to evaluate the performance under few-shot settings. The training strategy \textit{fixed-prompt LM tuning} is adopted for tuning the prompt-based methods, where fixed discrete prompts are used while fine-tuning the parameters of the LM. Table \ref{table:4} shows the prompt templates used in the fine-tuning process. The prompts are appended to the dialog history before being provided as input to the PLM, which probabilistically generates the missing slots. The inverse prompt is only used during the training phase. Experiments are performed to evaluate the influence of the inverse prompt by setting multiple values for the inverse prompt weight $w$ in Eq. \ref{eq:3} and also by completely omitting it during the training phase.
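\paragraph{} As a minimal sketch of how the inverse prompt can enter the training objective of Eq. \ref{eq:3}, assuming the two per-batch LM losses are already computed; all names and values here are hypothetical.
\begin{verbatim}
import torch

def combined_loss(value_prompt_loss: torch.Tensor,
                  inverse_prompt_loss: torch.Tensor,
                  w: float = 0.1) -> torch.Tensor:
    # w in (0, 1) adjusts the influence of the inverse prompt.
    return value_prompt_loss + w * inverse_prompt_loss

# Dummy loss values for illustration:
vp = torch.tensor(2.31, requires_grad=True)
inv = torch.tensor(1.87, requires_grad=True)
loss = combined_loss(vp, inv, w=0.1)
loss.backward()
\end{verbatim}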
\vspace{0.3cm}
\begin{table}[h!]
\centering
\begingroup
\setlength{\tabcolsep}{10pt} % Default value: 6pt
\renewcommand{\arraystretch}{1.25} % Default value: 1
\begin{tabular}{ll}
\hline
\multicolumn{1}{c}{\textbf{Type}} & \multicolumn{1}{c}{\textbf{Prompt templates}} \\
@ -166,7 +167,7 @@ where $|K|$ represents the number of prompt functions, $f_{k}$ is the $k$-th pro
\label{table:6}
\end{table}
\paragraph{Prompt Augmentation} \textit{Prompt augmentation}, also known as \textit{demonstration learning} \citep{gao2021lmbff}, provides a few additional answered prompts that demonstrate to the PLM how the actual task can be performed. These demonstrations take advantage of the language models' ability to learn repetitive patterns. \citet{brown2020gpt3} also explored this idea by specifying few-shot demonstrations for multiple language tasks and achieved strong performance. That work considered the demonstration samples as additional context, concatenated them to the input, and let the GPT-3 model generate outputs without any gradient updates or fine-tuning. Table \ref{table:7} below provides an example of demonstration learning.
\begin{table}[h!]
\centering
@ -180,15 +181,17 @@ where $|K|$ represents the number of prompt functions, $f_{k}$ is the $k$-th pro
Book a train to Stuttgart. & \textit{Stuttgart} is of slot [s]\\
\hline
\end{tabular}
\caption{An example of demonstration learning for the DST task}
\label{table:7}
\end{table}
\paragraph{} Experiments are performed on two \textit{sets} of demonstration samples to understand the importance of sample selection. The demonstrations are hand-picked and hand-crafted from the training data. Each demonstration sample contains multiple answered prompts (\textit{dialog history + belief states}) covering all the domains and slots. These demonstration examples are specifically chosen after gaining insights from the initial error analysis of the prompt-based methods. The first sample set contains 8 examples, and the second sample set contains 5 examples. The number of demonstration examples that can be placed in a single sample set is bounded by the GPT-2 input length of 1024 tokens. The demonstrations from each sample set are concatenated to the input and given to the fine-tuned prompt model for generating the slots, as sketched below.
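\paragraph{} A sketch of the concatenation step under the 1024-token limit, using the Huggingface GPT-2 tokenizer. The policy of dropping demonstrations from the front is an assumption, not necessarily the exact procedure used in the experiments.
\begin{verbatim}
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

def build_augmented_input(demonstrations, dialog_history,
                          prompt, max_len=1024):
    # Prepend answered prompts; drop demonstrations from the
    # front (assumed policy) until the whole input fits.
    demos = list(demonstrations)
    while demos:
        text = " ".join(demos + [dialog_history, prompt])
        if len(tokenizer.encode(text)) <= max_len:
            return text
        demos.pop(0)
    return " ".join([dialog_history, prompt])
\end{verbatim}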
\subsection{Evaluation Metrics}
The standard evaluation metric joint goal accuracy (JGA) is adopted to evaluate the belief state predictions of the baseline and prompt-based methods. This metric compares the predicted belief states to the ground-truth states at each turn: a prediction is correct only if every predicted (slot, value) pair exactly matches the ground truth. The rule-based methods used in value extraction can lead to many false positives among the value candidates. In order to exclude the influence of wrongly extracted values, \citet{yang2022prompt} proposed JGA*, where joint goal accuracy is computed only over the belief states whose values are extracted correctly. These evaluation metrics answer the following questions: \textsf{Q1:} How do the prompt-based methods perform overall compared to the \textsc{Soloist} baseline? \textsf{Q2:} Can the prompt-based methods perform better under low-resource settings? \textsf{Q3:} For prompt-based methods, does the JGA* metric hold a better score than JGA? \textsf{Q4:} Can multi-prompt techniques together perform better than a single prompt?
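\paragraph{} A sketch of the two metrics, assuming each turn's belief state is represented as a set of (slot, value) pairs; the flag marking turns with correctly extracted values is a hypothetical input.
\begin{verbatim}
def jga(predictions, gold_states):
    # A turn counts as correct only if ALL predicted
    # (slot, value) pairs exactly match the ground truth.
    correct = sum(set(p) == set(g)
                  for p, g in zip(predictions, gold_states))
    return correct / len(gold_states)

def jga_star(predictions, gold_states, values_ok):
    # JGA*: computed only over turns where the rule-based
    # extraction recovered every gold value.
    kept = [(p, g) for p, g, ok
            in zip(predictions, gold_states, values_ok) if ok]
    return jga(*zip(*kept)) if kept else 0.0
\end{verbatim}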
\paragraph{Analysis} The belief state predictions from the \textsc{Soloist} baseline and the prompt-based methods are analyzed to identify potential improvements and drawbacks. A detailed qualitative analysis is performed on the wrong belief state predictions. Additionally, error analysis is performed on the rule-based value extraction methods to identify their impact on the slot generation process.
\subsection{Implementation Details}
For the \textsc{Soloist} baseline, the existing implementation by \citet{peng2021soloist} is adapted to the few-shot experiments conducted in this thesis. There is no publicly available implementation of the prompt-based methods for DST, so the Huggingface Transformers library \citep{wolf2020transformers} is used to implement the prompt-based DST methods from scratch. The Adam optimization algorithm \citep{kingma2015adam} is used during the fine-tuning of both the baseline and the prompt-based methods. The rule-based value extraction methods are implemented using the Stanford CoreNLP client stanza \citep{qi2020stanza}. Experiments are performed with multiple inverse prompt weights $w$ (\textit{0.1, 0.3, 0.5, 0.7}) in Eq. \ref{eq:3}. The prompt weight $\alpha_{k}$ in Eq. \ref{eq:4} is set to the same value (1/4) for all the prompts used in prompt ensembling; a sketch of this weighted combination follows.
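\paragraph{} A sketch of the weighted combination used in prompt ensembling (Eq. \ref{eq:4}), under the assumption that each prompt function yields a probability distribution over candidate slots represented as a dictionary; the representation and names are assumptions.
\begin{verbatim}
def ensemble_slot_probs(per_prompt_probs, alphas=None):
    # per_prompt_probs: list of K dicts {slot: P(slot | prompt_k)}.
    # With alpha_k = 1/K (here 1/4) this is a uniform average.
    k = len(per_prompt_probs)
    alphas = alphas or [1.0 / k] * k
    slots = {s for probs in per_prompt_probs for s in probs}
    return {s: sum(a * p.get(s, 0.0)
                   for a, p in zip(alphas, per_prompt_probs))
            for s in slots}
\end{verbatim}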

@ -3,27 +3,27 @@ This section presents the experimental results evaluated on all the methods desc
\subsection{SOLOIST Baseline}
Table \ref{table:8} shows the results of the fine-tuned \textsc{Soloist} baseline model under the few-shot experiments. The results show that the baseline model performed poorly and struggled to generate belief states under low-resource settings (\textsl{5-dpd}, \textsl{10-dpd}, \textsl{50-dpd}). Under the low-resource data splits, the limited number of data samples made it challenging for the baseline to generate unseen belief states. The results also show that more data may be necessary, as the model achieves better results on the data splits with a higher number of data samples (\textsl{125-dpd}, \textsl{250-dpd}).
\vspace{0.5cm}
\begin{table}[h!]
\centering
\begingroup
\setlength{\tabcolsep}{16pt} % Default value: 6pt
\renewcommand{\arraystretch}{1.25} % Default value: 1
\begin{tabular}{lc}
\hline
\textbf{\makecell{Data split (\# dialogs)}} & \textbf{JGA} \\
\hline
\textsl{5-dpd} (25) & 9.06 \\
\hline
\textsl{10-dpd} (50) & 14.20 \\
\hline
\textsl{50-dpd} (250) & 28.64 \\
\hline
\textsl{100-dpd} (500) & 33.11 \\
\hline
\textsl{125-dpd} (625) & 35.79 \\
\hline
\textsl{250-dpd} (1125) & \textbf{40.38} \\
\hline
\end{tabular}
@ -34,63 +34,67 @@ Table \ref{table:8} shows the results of the baseline model under few-shot exper
\subsection{Prompt-based methods}
This section presents the evaluation results for the prompt-based methods. A single value-based prompt together with the inverse prompt is used. Experiments are performed on all the MultiWOZ data splits and evaluated on the JGA and JGA* metrics. In order to analyze the influence of the inverse prompt mechanism, experiments are performed with different inverse prompt loss weights $w$ = 0.1, 0.3, 0.5, and 0.7.
\vspace{0.25cm}
\begin{table}[h!]
\centering
\small
\begingroup
\setlength{\tabcolsep}{6pt} % Default value: 6pt
\renewcommand{\arraystretch}{1.1} % Default value: 1
\begin{tabular}{l ccc ccc ccc ccc}
\hline
& & \multicolumn{2}{c}{\textbf{w = 0.1}} & & \multicolumn{2}{c}{\textbf{w = 0.3}} & & \multicolumn{2}{c}{\textbf{w = 0.5}} & & \multicolumn{2}{c}{\textbf{w = 0.7}}\\
\textbf{Data split} & & JGA & JGA* & & JGA & JGA* & & JGA & JGA* & & JGA & JGA*\\
\hline
\textsl{5-dpd} & & 30.66 & 71.04 & & \textbf{31.67} & \textbf{73.19} & & 30.77 & 72.85 & & 29.98 & 70.93\\
\textsl{10-dpd} & & \textbf{42.65} & \textbf{86.43} & & 41.18 & 83.48 & & 40.05 & 80.77 & & 40.38 & 85.18\\
\textsl{50-dpd} & & \textbf{47.06} & \textbf{91.63} & & 46.49 & 91.18 & & 47.04 & 91.18 & & 46.27 & 90.05\\
\textsl{100-dpd} & & 47.74 & 92.31 & & \textbf{48.42} & \textbf{92.42} & & 48.19 & 92.65 & & 48.30 & 92.65\\
\textsl{125-dpd} & & 46.49 & 91.86 & & 46.15 & 91.18 & & \textbf{46.83} & \textbf{91.74} & & 46.15 & 90.95\\
\textsl{250-dpd} & & 47.06 & 92.08 & & \textbf{47.62} & \textbf{92.65} & & 47.40 & 92.31 & & 47.17 & 92.09\\
\hline
\end{tabular}
\endgroup
\caption{Few-shot experimental results from the prompt-based model. Only a single \textit{value-based prompt} is used together with the \textit{inverse prompt}. Evaluation results for different inverse prompt weights $w$ (in Eq. \ref{eq:3}) are shown in the table. The term \textquote{\textsl{dpd}} stands for \textquote{\textsl{dialogues per domain}}.}
\label{table:9}
\end{table}
\paragraph{} Table \ref{table:9} shows the evaluation results of the prompt-based model under the few-shot experiments. The results show that the prompt-based model significantly outperformed the baseline model across all data splits. For low-resource data splits like \textsl{5-dpd}, \textsl{10-dpd}, and \textsl{50-dpd}, the prompt-based model shows a substantial improvement over the baseline, achieving JGA gains of \textsl{22}, \textsl{28}, and \textsl{18} points, respectively. The results clearly demonstrate the effectiveness of the prompt-based methods over the baseline, especially under extremely low-resource settings.
\vspace{-6pt}
\paragraph{} The results also show that increasing the number of data samples in the few-shot experiments yields only minor performance improvements; for example, the performance under the data splits \textsl{50-dpd} and \textsl{250-dpd} is nearly identical. This suggests the prompt-based methods can learn the DST task well under low-resource settings. The different inverse prompt loss weights $w$ resulted in similar performance, with $w = 0.1$ and $w = 0.3$ performing slightly better than the other weights. The higher values of the JGA* metric across all data splits indicate that the rule-based value extraction methods have some limitations.
\subsection{Multi-prompt methods}
\subsubsection{Prompt Ensembling results}
For the prompt ensemble experiments, the four value-based prompts shown in Table \ref{table:6} are used. Only a single inverse prompt with weight $w = 0.1$ (Eq. \ref{eq:3}) is utilized during training. Table \ref{table:10} shows the results of prompt ensembling under few-shot settings. The prompt ensemble model shows only a slight improvement over a single value-based prompt on some data splits. Contrary to expectations, it did not show a significant performance improvement on the JGA metric. The results also show that the performance of the prompt ensemble model is similar when trained on the larger data splits, i.e., \textsl{50-dpd}, \textsl{100-dpd}, \textsl{125-dpd}, and \textsl{250-dpd}.
\vspace{0.25cm}
\begin{table}[h!]
\centering
\begingroup
\setlength{\tabcolsep}{12pt} % Default value: 6pt
\renewcommand{\arraystretch}{1.2} % Default value: 1
\begin{tabular}{lcc}
\hline
\textbf{\makecell{Data split}} & \textbf{JGA} & \textbf{JGA*}\\
\hline
\textsl{5-dpd} & 30.09 & 69.23 \\
\textsl{10-dpd} & 42.84 & 86.99 \\
\textsl{50-dpd} & 47.62 & 91.74 \\
\textsl{100-dpd} & 48.08 & 93.10 \\
\textsl{125-dpd} & 46.96 & 92.08 \\
\textsl{250-dpd} & \textbf{48.30} & \textbf{93.44} \\
\hline
\end{tabular}
\endgroup
@ -98,37 +102,76 @@ Table \ref{table:10} shows the results of prompt ensembling under few-shot setti
\label{table:10}
\end{table}
The fine-tuning of the value-based prompt together with the inverse prompt already performs strongly on the DST task. Moreover, the rule-based value extraction approach has a turn-level accuracy of only \textsl{49\%}, and these extraction errors leave little room for the prompt ensemble model to improve over a single value-based prompt.
\subsubsection{Prompt Augmentation results}
Table \ref{table:11} shows the results of prompt augmentation under few-shot settings.
\vspace{0.2cm}
\begin{table}[h!]
\centering
\begingroup
\setlength{\tabcolsep}{14pt} % Default value: 6pt
\renewcommand{\arraystretch}{1.15} % Default value: 1
\begin{tabular}{l ccc ccc}
\hline
& \multicolumn{2}{c}{\textbf{Sample 1}} & & \multicolumn{2}{c}{\textbf{Sample 2}}\\
\textbf{Data split} & JGA & JGA* & & JGA & JGA*\\
\hline
\textsl{5-dpd} & 26.02 & 58.60 & & 27.60 & 59.39 \\
\textsl{10-dpd} & 33.26 & 70.14 & & 34.95 & 77.94 \\
\textsl{50-dpd} & 38.80 & 71.38 & & \textbf{39.77} & \textbf{74.55} \\
\textsl{100-dpd} & 35.97 & 70.89 & & 38.46 & 74.89 \\
\textsl{125-dpd} & 36.09 & 73.08 & & 36.18 & 76.47 \\
\textsl{250-dpd} & 35.63 & 72.90 & & 38.91 & 76.70 \\
\hline
\end{tabular}
\endgroup
\caption{Experimental results from demonstration learning (multi-prompt method). \textsl{Sample 1} contains 8 demonstrations and \textsl{Sample 2} contains 5 demonstrations. The term \textquote{\textsl{dpd}} stands for \textquote{\textsl{dialogues per domain}}.}
\label{table:11}
\end{table}
\paragraph{} Prompt augmentation (or \textit{demonstration learning}) provides additional context to the language model in the form of \textquote{answered prompts} at inference time. Two sets of demonstration samples are hand-picked and concatenated to the input to help the language model generate belief states more accurately. Table \ref{table:11} shows the experimental results from the prompt augmentation method on the two demonstration samples (\textsl{Sample 1} and \textsl{Sample 2}). The results show that demonstration learning struggled to generate the belief states accurately; its performance is inadequate across all data splits when compared to the other prompt-based methods. The sample selection of demonstrations plays an important role in the model performance: the results from \textsl{Sample 2} with 5 demonstrations are slightly better than those from \textsl{Sample 1}. Only a limited number of demonstrations can be provided to the GPT-2 language model due to its maximum input sequence length of 1024 tokens, which led to bias during slot generation at inference time.
\clearpage
\subsection{Comparison of results}
This section summarizes all the experimental results and compares the different methods. Table \ref{table:12} presents the top results from all the experimental methods.
\begin{table}[h!]
\centering
\small
\begingroup
\setlength{\tabcolsep}{4pt} % Default value: 6pt
\renewcommand{\arraystretch}{1.2} % Default value: 1
\begin{tabular}{l c cc ccc ccc ccc}
\hline
& \textbf{\textsl{Baseline}} & \multicolumn{2}{c}{\textbf{\textsl{VbP}}} & & \multicolumn{2}{c}{\textbf{\textsl{VbP+Inv}}} & & \multicolumn{2}{c}{\textbf{\textsl{PrEns}}} & & \multicolumn{2}{c}{\textbf{\textsl{PrAug}}}\\
\textbf{Data split} & JGA & JGA & JGA* & & JGA & JGA* & & JGA & JGA* & & JGA & JGA*\\
\hline
\textsl{5-dpd} & 9.06 & 26.81 & 64.25 & & \textbf{31.67} & \textbf{73.19} & & 30.09 & 69.23 & & 27.60 & 59.39 \\
\textsl{10-dpd} & 14.20 & 41.10 & 82.35 & & 42.65 & 86.43 & & \textbf{42.84} & \textbf{86.99} & & 34.95 & 77.94 \\
\textsl{50-dpd} & 28.64 & 45.70 & 90.70 & & 47.06 & 91.63 & & \textbf{47.62} & \textbf{91.74} & & 39.77 & 74.55 \\
\textsl{100-dpd} & 33.11 & 47.74 & 91.86 & & \textbf{48.42} & 92.42 & & 48.08 & \textbf{93.10} & & 38.46 & 74.89 \\
\textsl{125-dpd} & 35.79 & 45.02 & 90.61 & & 46.83 & 91.74 & & \textbf{46.96} & \textbf{92.08} & & 36.18 & 76.47 \\
\textsl{250-dpd} & 40.38 & 46.15 & 91.40 & & 47.62 & 92.65 & & \textbf{48.30} & \textbf{93.44} & & 38.91 & 76.70 \\
\hline
\end{tabular}
\endgroup
\caption{Top evaluation results from all the few-shot experiments. \textsl{Baseline}: \textsc{Soloist} baseline model; \textsl{VbP}: \textsl{value-based prompt} without inverse prompt; \textsl{VbP+Inv}: \textsl{value-based prompt} and \textsl{inverse prompt}; \textsl{PrEns}: \textsl{prompt ensembling} with \textsl{inverse prompt}; \textsl{PrAug}: \textsl{prompt augmentation}; \textsl{dpd}: \textsl{dialogues per domain}.}
\label{table:12}
\end{table}
The prompt-based methods performed significantly better than the baseline model across all data splits. For the low-resource data split \textsl{5-dpd}, the single value-based prompt together with the inverse prompt outperformed all the other methods. For the remaining data splits (\textsl{10-dpd}, \textsl{50-dpd}, \textsl{100-dpd}, \textsl{125-dpd}, and \textsl{250-dpd}), the prompt ensembling model achieved the top results. However, it is worth noting that the prompt ensemble model achieved only minor improvements over the single value-based prompt. The inverse prompt mechanism also has a noticeable impact on the prompt-based model, especially under the low-resource data splits (i.e., \textsl{5-dpd}, \textsl{10-dpd}). The prompt augmentation approach struggled to take advantage of the demonstration samples.
\paragraph{} Overall, the multi-prompt methods (prompt ensembling and prompt augmentation) struggled to improve performance on the DST task. However, the prompt ensembling approach with multiple value-based prompts showed minor improvements over a single value-based prompt.
\clearpage

@ -28,11 +28,11 @@
\end{tabular}
\endgroup
\caption{Examples of wrongly generated belief states by the baseline model.}
\label{table:13}
\end{table}
\vspace{0.5cm}
\noindent The belief prediction task of the \textsc{Soloist} baseline utilizes \textsl{top-k} and \textsl{top-p} sampling in order to generate the \textsl{(slot, value)} pairs. As the baseline model uses open-ended generation, it is susceptible to generating random slot-value pairs that are not relevant. The baseline performance was also affected by repeated slot generations and, in some cases, incorrect values. Table \ref{table:13} shows examples of some of the errors made by the baseline model. In the first example, the baseline system missed two true states and generated an entirely incorrect belief state. In the second example, the slot \textit{area} is repeated with a different value, and the value for the slot \textit{pricerange} is incorrectly generated.
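\paragraph{} For reference, open-ended generation with \textsl{top-k} and \textsl{top-p} sampling corresponds to a Huggingface Transformers call of the following form; the checkpoint, context string, and sampling values here are illustrative assumptions, not the exact \textsc{Soloist} settings.
\begin{verbatim}
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

context = "user : i need a cheap hotel . => belief :"
input_ids = tokenizer.encode(context, return_tensors="pt")

# Sampling instead of greedy decoding; values are assumptions.
output = model.generate(input_ids,
                        do_sample=True,
                        top_k=50,
                        top_p=0.9,
                        max_new_tokens=40,
                        pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(output[0], skip_special_tokens=True))
\end{verbatim}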
\subsection{Analysis of prompt-based methods}
@ -59,43 +59,10 @@
\end{tabular}
\endgroup
\caption{Incorrect belief states generated by the value-based prompt.}
\label{table:14}
\end{table}
\noindent The value-based prompt trained on the low-resource data splits (i.e., \textsl{5-dpd}, \textsl{10-dpd}) struggled to distinguish between slots such as \textit{departure} vs. \textit{destination} and \textit{leave} vs. \textit{arrive}. In many instances, it wrongly generated the slot \textit{destination} instead of \textit{departure} and the slot \textit{arrive} instead of \textit{leave}. Table \ref{table:14} shows some example outputs where the slots are generated incorrectly; in both examples, the slot \textit{arrive} is incorrectly generated. These incorrect slot generations are due to the limited training data available for these examples. Overall, the prompt-based methods perform significantly better than the baseline even under low-resource settings, due to the constrained generation of slots using value-based prompts.
\subsubsection{Repeated values in Belief States}
In the prompt-based methods, the value-based prompt takes the candidate values and generates the corresponding slots. However, the user requirements may lead to repeated values across the belief state (slot, value) pairs.
@ -156,7 +123,7 @@ In the prompt-based methods, the value-based prompt takes the candidate values a
\label{table:16}
\end{table}
At inference time, the value-based prompt requires the belief state values in order to generate slots. The value extraction methods apply a set of rules on POS tags and named entities to extract value candidates directly from the utterances. The rule-based extraction has an accuracy of \textsl{79\%} over all the values and a turn-level accuracy of \textsl{49\%} on the test split. Table \ref{table:16} highlights instances where the values cannot be extracted using rule-based methods. In the first example, the value \textquote{\textit{dont care}} does not appear in the utterances and cannot be extracted from POS tags. When the user requirement is \textit{free} wifi or \textit{free} parking, the existing annotation scheme for belief states records it as the value \textquote{\textit{yes}}, while the rule-based methods can only extract the value \textquote{\textit{free}} from the utterances. The values \textquote{\textit{dont care}} and \textquote{\textit{yes}} also occur twice in the examples shown in Table \ref{table:16}; as described in the previous section (Sec. \ref{subsec:value_errors}), the value-based prompt cannot handle repeated values for slot generation.
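\paragraph{} A simplified sketch of rule-based candidate extraction with stanza; the actual rules in this work are more involved, and the POS and NER filters shown here are assumptions.
\begin{verbatim}
import stanza

# One-time model download: stanza.download("en")
nlp = stanza.Pipeline("en", processors="tokenize,pos,ner")

def extract_value_candidates(utterance):
    # Assumed rules: keep named entities plus nouns,
    # proper nouns, and adjectives as candidate values.
    doc = nlp(utterance)
    candidates = [ent.text for ent in doc.ents]
    for sent in doc.sentences:
        for word in sent.words:
            if word.upos in {"NOUN", "PROPN", "ADJ"}:
                candidates.append(word.text)
    return candidates

# Values like "dont care" or "yes" never appear in the
# utterance, so rules of this kind cannot extract them.
print(extract_value_candidates("Book a cheap hotel in Cambridge."))
\end{verbatim}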
\vspace{0.5cm}
\begin{table}[h!]

@ -1,3 +1,5 @@
\section{Conclusion}\label{sec:conclusion}
This work explored the use of prompt-based methods for dialog state tracking (DST) in task-oriented dialogue systems. The prompt-based methods, which include the value-based prompt and the inverse prompt, learned the DST task efficiently under low-resource few-shot settings without relying on a pre-defined set of slots and values. Experiments show that the prompt-based methods significantly outperformed the baseline \textsc{Soloist} model under low-resource settings. Analysis of the generated belief states shows that the prompt-based approach has some limitations. Additionally, multi-prompt methods such as prompt ensembling and prompt augmentation are applied to the DST task. Results show that the prompt ensemble model achieved minor improvements, and that the performance of prompt augmentation is limited due to the bias in the answered prompts. Error analysis of value extraction highlights the limitations of the rule-based methods. Further research is necessary to overcome the limitations of the prompt-based and value extraction methods.
\paragraph{} In conclusion, prompt-based methods can be used to solve the DST task directly by prompting language models. However, further research is necessary to improve this prompt learning framework. Future work can explore automated prompt search methods for choosing the right prompts instead of manually creating the templates. Future work can also improve value extraction by treating it as a few-shot text summarization and semantic tagging task. Another interesting direction is to explore whether larger language models can perform better on the DST task.


@ -14,11 +14,6 @@
%% You can change the Bibliographic styles here
\bibliographystyle{plainnat}
\renewcommand{\baselinestretch}{1.3}
\parskip = \medskipamount
\frenchspacing
@ -41,7 +36,6 @@
% Full Name & other details
\newcommand{\name}{Mandava, Sai Pavan}
\newcommand{\matrikNummer}{3461015}
\newcommand{\myEmail}{st169661@stud.uni-stuttgart.de}
% start date and end date of the Thesis
@ -58,11 +52,14 @@
\begin{document}
\begin{titlepage}
\newgeometry{left=3.5cm,right=2cm}
\begin{center}
\begin{figure}[h!]
\centering
\includegraphics[width=.2\linewidth]{images/ims_logo}
\end{figure}
\large{Institut f{\"u}r Maschinelle Sprachverarbeitung\\Universit{\"a}t Stuttgart\\Pfaffenwaldring 5b\\70569 Stuttgart}\\
@ -74,25 +71,25 @@
\large{\startDate} \\
\large{\finishDate} \\
\vspace{1cm}
\Large{\textbf{\name}} \\ [2pt]
\large{M.Sc. Computational Linguistics} \\ [1pt]
\large{Mat.Nr.: \matrikNummer} \\ [1pt]
\normalsize{\myEmail}
\vspace{1cm}
\vfill
\begin{table}[h!]
\centering
\begingroup
\setlength{\tabcolsep}{16pt} % Default value: 6pt
\renewcommand{\arraystretch}{1.3} % Default value: 1
\begin{tabular}{lr}
\large{\textbf{Supervisor/Examiner}} & \large{\supervisor}\\
\large{\textbf{Examiner}} & \large{\examiner}\\
\end{tabular}
\endgroup
\end{table}
@ -101,6 +98,11 @@
\end{center}
\end{titlepage}
% back to original geometry
\newgeometry{left=3cm,right=3cm}
\pagestyle{empty} % remove page numbers for first few pages
%% Thesis Main Content Starts here
%% Create separate files for each section for better organization
%% You can add/modify/delete the below sections according to your needs
@ -111,6 +113,8 @@
%% abstract
\input{sections/01_abstract}
\pagestyle{plain} % change page style back to plain
%% add table of contents here
\tableofcontents
\newpage
