Update to LaTeX code

3 years ago · 21fd9cac84
parent adc0e0b3ff
commit 21fd9cac84
7 changed files with 112 additions and 47 deletions
--- a/writing/latex/references.bib
+++ b/writing/latex/references.bib
@ -324,3 +324,38 @@ url={https://openreview.net/forum?id=rygGQyrFvH}
  year      = {2015},
  url       = {http://arxiv.org/abs/1412.6980}
 }
@article{ni2021dlds,
  author    = {Jinjie Ni and
               Tom Young and
               Vlad Pandelea and
               Fuzhao Xue and
               Vinay Adiga and
               Erik Cambria},
  title     = {Recent Advances in Deep Learning Based Dialogue Systems: {A} Systematic
               Survey},
  journal   = {CoRR},
  volume    = {abs/2105.04387},
  year      = {2021},
  url       = {https://arxiv.org/abs/2105.04387},
  eprinttype = {arXiv},
  eprint    = {2105.04387},
  timestamp = {Mon, 31 May 2021 08:19:46 +0200},
  biburl    = {https://dblp.org/rec/journals/corr/abs-2105-04387.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
 }
@inproceedings{wu2020tod-bert,
    title = "{TOD}-{BERT}: Pre-trained Natural Language Understanding for Task-Oriented Dialogue",
    author = "Wu, Chien-Sheng  and
      Hoi, Steven C.H.  and
      Socher, Richard  and
      Xiong, Caiming",
    booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.emnlp-main.66",
    doi = "10.18653/v1/2020.emnlp-main.66",
    pages = "917--929",
    abstract = "The underlying difference of linguistic patterns between general text and task-oriented dialogue makes existing pre-trained language models less useful in practice. In this work, we unify nine human-human and multi-turn task-oriented dialogue datasets for language modeling. To better model dialogue behavior during pre-training, we incorporate user and system tokens into the masked language modeling. We propose a contrastive objective function to simulate the response selection task. Our pre-trained task-oriented dialogue BERT (TOD-BERT) outperforms strong baselines like BERT on four downstream task-oriented dialogue applications, including intention recognition, dialogue state tracking, dialogue act prediction, and response selection. We also show that TOD-BERT has a stronger few-shot ability that can mitigate the data scarcity problem for task-oriented dialogue.",
 }
--- a/writing/latex/sections/02_intro.tex
+++ b/writing/latex/sections/02_intro.tex
@ -4,24 +4,12 @@
 \paragraph{} Prompt-based learning \textit{(\textquote{pre-train, prompt, and predict})} is a new paradigm in NLP that aims to predict the probability of text directly from the pre-trained LM. This framework is powerful as it allows the language model to be \textit{pre-trained} on massive amounts of raw text, and by defining a new prompting function the model can perform \textit{few-shot} or even \textit{zero-shot} learning \citep{liu2021ppp}. The large pre-trained language models (PLMs) are supposed to be useful in few-shot scenarios where the task-related training data is limited, as they can be probed for task-related knowledge efficiently by using a prompt. One example of such large pre-trained language models is GPT-3 \citep{brown2020gpt3} - \textit{\textquote{Language Models are Few-Shot Learners}}. \citet{madotto2021fsb} created an end-to-end chatbot (Few-Shot Bot) using \textit{prompt-based few-shot learning} and achieved results comparable to the state of the art. Prompting methods are particularly helpful in few-shot learning where domain-related data is limited. \textit{Fixed-prompt LM tuning} is a fine-tuning strategy for downstream tasks, where the LM parameters are tuned with fixed prompts to help the LM understand the task. This can be achieved by applying a discrete textual prompt template to the data used for fine-tuning the PLM.
-\paragraph{} Prompt-based learning for few-shot DST with limited labeled domains is still under-explored. Recently, \citet{yang2022prompt} proposed a new prompt learning framework for few-shot DST. This work designed a \textit{value-based prompt} and an \textit{inverse prompt} mechanism to efficiently train a DST model for domains with limited training data. This approach doesn't depend on the ontology of slots and the results show that it can generate slots by prompting the tuned PLM and outperforms the existing state-of-the-art methods under few-shot settings. In this thesis, the prompt-based few-shot methods for DST are explored by implementing the following three tasks:
+\paragraph{} Prompt-based learning for few-shot DST with limited labeled domains is still under-explored. Recently, \citet{yang2022prompt} proposed a new prompt learning framework for few-shot DST. This work designed a \textit{value-based prompt} and an \textit{inverse prompt} mechanism to efficiently train a DST model for domains with limited training data. This approach doesn't depend on the ontology of slots and the results show that it can generate slots by prompting the tuned PLM and outperforms the existing state-of-the-art methods under few-shot settings. 
-\begin{enumerate}
+
-	\item Prompt-based few-shot DST - reproduce the results from \citet{yang2022prompt}
+\paragraph{} The main research objective of this thesis is to investigate the effectiveness of prompt-based methods for DST and to understand the limitations of this approach. Prompt-based methods are adopted for the DST task to answer the following research questions: \textsf{Q:} Can the dialogue belief states be extracted directly from the PLM using prompt-based methods? \textsf{Q:} Can the prompt-based methods learn the DST task under low-resource settings without depending on the ontology of domains? \textsf{Q:} How does the prompt-based approach perform overall compared to a baseline model? \textsf{Q:} What are the drawbacks and limitations of prompt-based methods? \textsf{Q:} Can different multi-prompt techniques help the PLM understand the DST task better? \textsf{Q:} What impact do various multi-prompt methods have on the performance of the DST task?
-	\begin{itemize}
+
-		\item[--] Implement prompt-based methods for DST task under few-shot settings
+\paragraph{} To accomplish the research objectives, the prompt learning framework designed by \citet{yang2022prompt}, which includes a \textit{value-based prompt} and \textit{inverse prompt}, is utilized to generate the belief states by prompting the PLM. Few-shot experiments are performed on different proportions of data to evaluate the prompt-based methods under low-resource settings. A baseline model, which also does not depend on the ontology of dialogue domains, is trained on the DST task to compare with the prompt-based methods. A detailed error analysis is conducted to identify the limitations of prompt-based methods. Further, multi-prompt methods are adopted to help the PLM better understand the DST task.
-		\item[--] Implement a baseline model for comparing the prompt-based methods
+
-	\end{itemize}
+\paragraph{} This section gave an overview of the thesis topic, motivation, and research objectives. The next section presents the background and related work (section \ref{sec:background}) with details on the following topics: dialog state tracking (DST), pre-trained language models (PLMs), the baseline model, prompting methods, and the dataset used. The research methods used in the thesis experiments, including the few-shot experiments of baseline and prompt-based methods, multi-prompt methods, and evaluation metrics, are described in section \ref{sec:methods}. Section \ref{sec:results} provides all the few-shot experimental results of the research methods adopted. Analysis and discussion of results are presented in section \ref{sec:analysis}. Finally, the conclusion (section \ref{sec:conclusion}) summarizes the main findings.
 	\item Evaluation and analyses of belief state predictions
 	\begin{itemize}
 		\item[--] Evaluate the DST task using Joint Goal Accuracy (JGA) metric
 		\item[--] Improvements observed from the prompt-based methods
 		\item[--] Drawbacks of the prompt-based methods
 	\end{itemize}
 	\item Extend prompt-based methods to utilize various \textit{multi-prompt} techniques
 	\begin{itemize}
 		\item[--] Can different multi-prompt techniques help the PLM better understand the DST  task?
 		\item[--] Evaluation of multi-prompt methods and what's the influence of various multi-prompt techniques?
 	\end{itemize} 
 \end{enumerate}
 \clearpage
--- a/writing/latex/sections/03_background.tex
+++ b/writing/latex/sections/03_background.tex
@ -2,25 +2,50 @@
 \subsection{Dialog State Tracking (DST)}
-\paragraph{} Task-oriented dialog systems, both modular and end-to-end systems, are capable of handling a wide range of tasks (such as ticket booking, restaurant booking, etc.) across various domains. A task-oriented dialogue system has stricter requirements for responses because it needs to accurately understand and process the user's message. Therefore, modular methods were suggested as a way to generate responses in a more controlled manner. A typical modular-based system uses a modular pipeline, which has four modules that execute sequentially - Natural Language Understanding (NLU), Dialog State Tracking (DST), Policy Learning (POL), and Natural Language Generation (NLG). The DST module is essential for enabling the system to comprehend the user's requests by tracking them in the form of slots and values (belief states) at every turn. For instance, in a dialogue system that helps users book flights, the system might track slots such as destination, departure, travel date, and number of travelers. By keeping track of these slots and their values, the system can understand the user requirements and provides this information to the next module. For example, consider the user message - \textquote{\textit{Plan a train trip to Berlin this Friday for two people}} - the DST module is supposed to extract (\textit{slot, value}) pairs as follows: \{(\textit{destination, Berlin}), (\textit{day, Friday}), (\textit{people, 2})\}. In this thesis, the focus is on the DST module for extracting slots and values.
+\paragraph{} Task-oriented dialog systems, both modular and end-to-end systems, are capable of handling a wide range of tasks (such as ticket booking, restaurant booking, etc.) across various domains. A task-oriented dialog system has stricter requirements for responses because it needs to accurately understand and process the user's message. Therefore, modular methods were suggested as a way to generate responses in a more controlled manner. The architecture of a typical modular-based task-oriented dialog system is depicted in Figure \ref{figure:1}. Such a system uses a pipeline of four modules that execute sequentially - Natural Language Understanding (NLU), Dialog State Tracking (DST), Policy Learning (PL), and Natural Language Generation (NLG). The NLU module extracts the semantic values from user messages, together with intent detection and domain classification. The DST module takes the extracted values and fills the slot-value pairs based on the entire dialog history. The Policy Learning (PL) module takes the slot-value pairs and decides the next action to be performed by the dialog system. The NLG module converts the dialog actions received from the PL module into natural language text, which is usually the system response to the user.
 \vspace{0.5cm}
 \begin{figure}[h!]
    \centering
    \includegraphics[width=\linewidth]{images/modular_tod}
    \caption{Modular-based task-oriented dialog system \citep{ni2021dlds}}
    \label{figure:1}
 \end{figure}
 The DST module is essential for enabling the system to comprehend the user's requests by tracking them in the form of slots and values (belief states) at every turn. In recent years, some dialog systems have passed the user utterances directly to the DST module. This approach fills the slot-value pairs directly from the raw user message and eliminates the need for the NLU module. For example, consider the user message - \textquote{\textit{Plan a train trip to Berlin this Friday for two people}} - the DST module is supposed to fill (\textit{slot, value}) pairs as follows: \{(\textit{destination, Berlin}), (\textit{day, Friday}), (\textit{people, 2})\}.
 \paragraph{} A typical task-oriented dialog system can assist users across multiple domains (restaurant, hotel, train, booking). Each dialog domain contains an ontology, which represents the knowledge of the domain and information required for specific tasks. The ontology of a domain consists of a pre-defined set of slots and all the possible values for each slot. Neural-based models were proposed to solve the DST task by multi-class classification, where the model predicts the correct class from multiple values. This approach depends on the ontology of the domains and needs to track a lot of slot-value pairs. Ontology is hard to obtain in real-world scenarios, especially for new domains. The neural-based DST model also needs a lot of training data which is rarely available for new domains.
 \paragraph{} A dialog state or belief state in DST contains the required information for the system to process the user's request. At each turn of the dialogue, the dialog state can have \textit{informable slots} and \textit{requestable slots}. Informable slots capture the preferences and requirements that the user specifies for the system action. For example, in the restaurant domain, the user can ask for a specific type of food or a desired price range when booking a table. The belief state slots with such information are called \textit{informable slots}. Users can also ask the dialog system for the address or phone number of a restaurant; such slots are known as \textit{requestable slots}. This thesis focuses on the DST module to extract informable slots and their values without depending on the ontology.
 \subsection{Pre-trained Language Models (PLMs)}
-\paragraph{} Large pre-trained language models are trained on huge amounts of textual data and have achieved state-of-the-art performance in a variety of NLP tasks, such as machine translation, text classification, text generation, and summarization. These models are trained on large datasets and are able to learn the probability distribution of the words. Pre-trained language models based on transformer architectures \citep{vaswani2017attention}, such as BERT \citep{devlin2019bert} and GPT \citep{radford2018gpt}, have also achieved state-of-the-art performance on many NLP tasks. GPT-2 \citep{radford2019gpt2} is a transformer-based auto-regressive language model trained on large amounts of open web text data. GPT-2 is trained with a simple objective: predict the next word, given all previous words within some text. The architecture and training objective of the PLMs plays an important role in determining their applicability to particular prompting tasks \citep{liu2021ppp}. For example, left-to-right auto-regressive LMs predict the next word by assigning a probability to the sequence of words. For tasks that require the PLM to generate text from \textit{prefix} prompts (the entire prompt string followed by generated text), the left-to-right LMs tend to mesh well with the left-to-right nature of the language model.
+\paragraph{} Large pre-trained language models are trained on huge amounts of textual data and have achieved state-of-the-art performance in a variety of NLP tasks, such as machine translation, text classification, text generation, and summarization. These PLMs, trained on large-scale datasets, can encode significant linguistic knowledge in their large number of parameters. Pre-trained language models based on transformer architectures \citep{vaswani2017attention}, such as BERT \citep{devlin2019bert} and GPT \citep{radford2018gpt}, have also achieved state-of-the-art performance on many NLP tasks. GPT-2 \citep{radford2019gpt2} is a transformer-based left-to-right auto-regressive language model trained on large amounts of open web text data. The main training objective of GPT-2 is to predict the next word, given all the previous words. More generally, a left-to-right auto-regressive language model predicts the next word given all the previous words, or equivalently assigns a probability to a sequence of words. For a sequence of words $x = x_1, x_2, \ldots, x_n$, this probability can be factorized from left to right using the chain rule:
 $$
 P(x) = P\left(x_1\right) \times P\left(x_2 \mid x_1\right) \times \ldots \times P\left(x_n \mid x_1 \cdots x_{n-1}\right)
 $$
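+\paragraph{} As a small illustration of this factorization (the three-word sequence is chosen here purely as an example and does not come from the dataset), the probability of the sequence \textquote{book a table} expands to:
+$$
+P(\mbox{book a table}) = P(\mbox{book}) \times P(\mbox{a} \mid \mbox{book}) \times P(\mbox{table} \mid \mbox{book a})
+$$
+Each factor conditions only on the words to its left, which is why such a model can generate text one word at a time.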
 \paragraph{} PLMs trained on large amounts of text can be fine-tuned using task-specific data to solve downstream tasks efficiently. Previous work by \citet{wu2020tod-bert} further pre-trained the BERT model on nine different task-oriented dialog datasets and then fine-tuned it on downstream tasks. This approach improved downstream performance over fine-tuning BERT directly. \textsc{Soloist} \citep{peng2021soloist} used a similar approach to pre-train the GPT-2 model on two task-oriented dialog corpora and fine-tuned the pre-trained \textsc{Soloist} on the DST task.
 The pre-trained \textsc{Soloist}, which uses the publicly available 12-layer GPT-2 (117M) model, is the baseline model of this thesis. The prompt-based methods in this thesis also start from the pre-trained \textsc{Soloist} and fine-tune it on the prompt-based DST task.
-\paragraph{} The baseline model of this thesis, \textsc{Soloist} \citep{peng2021soloist}, uses a 12-layer GPT-2 for building the task-oriented dialog system. \textsc{Soloist} uses the publicly available 117M-parameter GPT-2 as initialization for task-grounded pre-training. The prompt-based methods in this thesis utlize the pre-trained \textsc{Soloist} and fine-tune it to the downstream DST task.
+\subsection{SOLOIST Model} \label{subsec:soloist}
-\subsection{SOLOIST Model}
+\paragraph{} \textsc{Soloist} \citep{peng2021soloist} uses the \textsl{pre-train, fine-tune} paradigm for building a task-oriented dialog system using the auto-regressive language model GPT-2 \citep{radford2019gpt2}. This dialog system is built in two phases. In the pre-training phase, \textsc{Soloist} is initialized with GPT-2 and further trained on two large task-oriented datasets, Schema and Taskmaster. The primary goal at this stage is to learn task completion skills such as \textit{belief prediction} and \textit{response generation}. In the belief prediction task of the pre-training stage, the \textsc{Soloist} model takes the dialog history as input and generates belief states as a sequence of words. The generated belief state sequences take the form - \textquote{\textit{belief: $slot_1 = value_1; slot_2 = value_2; \ldots$}}. The pre-training objective for predicting belief states is:
-\paragraph{} \textsc{Soloist} \citep{peng2021soloist} is a task-oriented dialog system that uses transfer learning and machine teaching to build task bots at scale. \textsc{Soloist} uses the \textit{pre-train, fine-tune} paradigm for building end-to-end dialog systems using a transformer-based auto-regressive language model GPT-2 \citep{radford2019gpt2}, which subsumes different dialog modules (i.e., NLU, DST, POL, NLG) into a single model. In a \textit{pre-train, fine-tune} paradigm, a fixed \textit{pre-trained} LM is adapted to different downstream tasks by introducing additional parameters and \textit{fine-tuning} them using task-specific objective functions. In the pre-training stage, \textsc{Soloist} is initialized with the 12-layer GPT-2 (117M parameters) and further trained on large heterogeneous dialog corpora. The primary goal at this stage is to learn task completion skills such as belief state prediction (DST) and response generation. In the fine-tuning stage, the pre-trained \textsc{Soloist} model can be used to solve new tasks by just using a handful of task-specific dialogs. 
+$$
 \mathcal{L}=\log P(b \mid s)=\sum_{t=1}^{T_b} \log P\left(b_t \mid b_{<t}, s\right)
 $$
-\paragraph{} In this thesis, the pre-trained \textsc{Soloist} is the baseline model for generating the belief states. For the baseline DST task, the pre-trained \textsc{Soloist} is fine-tuned on the belief predictions task for open-ended text generation. For prompt-based methods, the baseline \textsc{Soloist} is fine-tuned for generating belief state slots using prompts. The results and outputs from the baseline model are compared to the prompt-based model for detailed analyses.
+where $T_b$ is the length of the generated belief state sequence, $b_{<t}$ denotes all tokens before position $t$, and $s$ is the dialog history up to the current turn. Overall, $\log P(b \mid s)$ is the log-probability of generating the belief state sequence given the dialog history.
 \paragraph{} In the fine-tuning stage, the pre-trained \textsc{Soloist} model can be used to solve new tasks using only a small number of task-specific dialogs. The belief prediction task of the \textsc{Soloist} model can be fine-tuned on new task-oriented dialog datasets to solve DST tasks. At inference time, the fine-tuned \textsc{Soloist} uses top-K \citep{fan2018topk} and nucleus \citep{holtzman2020topp} sampling to generate belief states as a sequence of words. In top-K sampling, only the K most likely next words are kept and the probability mass is redistributed among those K words. In nucleus sampling (also known as \textit{top-p} sampling), the next word is sampled from the smallest set of most likely words whose cumulative probability exceeds the threshold $p$. This approach of generating belief states does not depend on the ontology of slots and values. The pre-trained \textsc{Soloist} fine-tuned on the belief prediction task is the baseline model for this thesis. The same pre-trained \textsc{Soloist} is also fine-tuned with the prompt-based methods.
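+\paragraph{} The two sampling strategies can be illustrated with a hypothetical next-word distribution (the probabilities below are invented for this example and are not taken from \textsc{Soloist}). Suppose the model assigns the probabilities $(0.5, 0.3, 0.1, 0.06, 0.04)$ to five candidate words. Top-K sampling with $K = 2$ keeps the two most likely words and renormalizes their probabilities:
+$$
+P^{\prime} = \left(\frac{0.5}{0.8}, \frac{0.3}{0.8}\right) = (0.625, 0.375)
+$$
+Nucleus sampling with $p = 0.85$ keeps the smallest set of most likely words whose cumulative probability reaches $p$. Since $0.5 + 0.3 = 0.8 < 0.85$ and $0.5 + 0.3 + 0.1 = 0.9 \geq 0.85$, three words are kept and renormalized:
+$$
+P^{\prime} = \left(\frac{0.5}{0.9}, \frac{0.3}{0.9}, \frac{0.1}{0.9}\right) \approx (0.556, 0.333, 0.111)
+$$
+Unlike top-K, the number of words kept by nucleus sampling varies with the shape of the distribution.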
 \subsection{Prompt Learning}
-\paragraph{} Prompt-based learning (also dubbed as \textit{\textquote{pre-train, prompt, and predict}}) is a new paradigm that aims to utilize PLMs more efficiently to solve downstream NLP tasks \citep{liu2021ppp}. In this paradigm, instead of adapting pre-trained LMs to downstream tasks by designing the task-specific training objectives, downstream tasks are reformulated to look more like those solved during the original LM training with the help of a textual \textit{prompt}. To perform prediction tasks, the original input $x$ is modified using a \textit{template} into a textual \textit{prompt} $x^{\prime}$ that has some unfilled slots, and then the PLM is used to probabilistically fill the unfilled information to obtain a final string $z$, from which the final output $y$ can be derived. For text generation tasks, the generated answer $z$ itself is the output $y$.
+\paragraph{} Prompt-based learning is a new way of using pre-trained language models more efficiently for solving language tasks. It reformulates the task using textual prompts, and the language model generates the desired output directly from the prompts. The main idea behind this approach is to make efficient use of the generation capabilities of PLMs. Table \ref{table:1} introduces some terminology, notations, and an emotion classification example. The original input $x$ is modified using the \textit{prompting function}, which generates the \textit{prompt} $x^{\prime}$. The \textit{prompting function} or \textit{prompt template} typically contains text and two slots: the input slot $[X]$ for filling in the input $x$ and the answer slot $[Z]$ for generating the answer $z$. The prompt $x^{\prime}$ is given to the PLM to directly generate the answer $z$. For tasks such as emotion classification, an additional answer mapping step is required to obtain the final output $y$ from the answer $z$. For example, multiple emotion-related words (such as \textit{happy, joyful, delighted, pleased}) can belong to the same output class (e.g. \textquote{\textit{joy}}). In this case, if the PLM generates the answer \textquote{\textit{happy}}, it is mapped to the output class \textquote{\textit{joy}}. For some text generation tasks, answer mapping is usually not required and the generated answer $z$ becomes the output $y$.
-\vspace{4pt}
+\vspace{0.5cm}
 \begin{table}[!ht]
 \centering
@ -47,21 +72,30 @@
 \label{table:1}
 \end{table}
-\paragraph{} For example, to recognize the emotion in the text, where \textit{input} $x = $\textquote{I missed the bus today.}, \textit{the prompt function} (also called \textit{template}) may take the form such as \textquote{$[X]$ I felt so $[Z]$}. $[X]$ takes the input text and $[Z]$ is supposed to be generated by the LM. Then, the \textit{prompt} $x^{\prime}$ would become \textquote{I missed the bus today. I felt so $[Z]$} and ask the PLM to fill the slot $[Z]$ with an emotion-bearing word. For some text generation tasks, the answer mapping from $z$ to $y$ may not be required, as the generated text itself becomes output. There are two main varieties of prompts: \textit{cloze prompts}, where the slot $[Z]$ is to be filled in the middle of the text, and \textit{prefix prompts}, where the input text comes entirely before $[Z]$. In general, for tasks that are being solved using a standard auto-regressive LM, prefix prompts tend to be more helpful, as they mesh well with the left-to-right nature of the model. 
+\paragraph{} Consider the emotion classification example from Table \ref{table:1}, where the goal is to recognize the emotion in the input $x$ = \textquote{I missed the bus today.} Given the prompt function \textquote{[X] I felt so [Z]}, the slot $[X]$ is filled with the input $x$, so the prompt $x^{\prime}$ becomes \textquote{I missed the bus today. I felt so [Z]}, and the PLM is expected to fill the slot $[Z]$ with an emotion word such as \textquote{sad}.
 \paragraph{} Prompt-based methods can be used without any explicit training of the LM for the downstream task, simply by taking a suitable pre-trained LM and applying the prompts defined for the task. This approach is traditionally called \textit{zero-shot learning}. \textit{Few-shot learning} is another approach where only a small number of data samples are used to train the language model. Prompting methods are particularly useful under few-shot settings, as there is generally not enough training data to fully specify the desired behavior. \textit{Fixed-prompt LM tuning} is a training strategy that fine-tunes the parameters of the LM, as in the standard \textit{pre-train fine-tune} paradigm, by using discrete prompts (\textit{hard prompts}) to help PLM understand the downstream task. This approach can potentially lead to improvements, particularly in few-shot scenarios. 
-\subsection{Prompt-based DST}
+\paragraph{Prompt types} There are two main varieties of prompts: \textit{prefix prompts} and \textit{cloze prompts}. In prefix prompts, the entire prompt text comes before the slot $[Z]$. For example, in the prompt \textquote{I like this movie. The movie is $[Z]$}, the slot $[Z]$ is at the end. In cloze prompts, the slot $[Z]$ to be filled appears in the middle or at the beginning of the prompt text. For example, in the prompt \textquote{Berlin is the capital of Germany. $[Z]$ is the capital of Japan}, the slot $[Z]$ is in the middle of the prompt text. For tasks that are solved using a left-to-right auto-regressive language model, prefix prompts are more helpful because they are well-suited to the left-to-right nature of the language model. There are multiple ways of creating prompts: \textit{manual prompts, discrete prompts, and continuous prompts}. For \textit{manual prompts}, the templates are hand-crafted by humans based on intuition about the task. These manual prompts generally contain a few natural language phrases and are usually task-specific. This approach can be time-consuming and often requires a lot of experimentation. For \textit{discrete prompts}, the templates are found using automated methods such as prompt mining, gradient-based search, and generation from an LM. These templates are also in the form of natural language phrases. This approach might require a large amount of training data to find prompts. For \textit{continuous prompts}, the templates are expressed directly in the embedding space of the language model. These prompts have their own parameters and can be tuned based on the training data of the task.
 \paragraph{} Previous work by \citet{lee2021sdp} uses belief state slots in the prompts, along with the natural language descriptions of the schema for generating the corresponding values. This \textit{slot-based} prompt DST approach uses encoder-decoder LM with a bi-directional encoder. This method relies on the known ontology of the slots and requires a lot of training data for fine-tuning PLM. In real-world applications, defining all possible slots is difficult due to the rising new domains and users' continuous needs. \citet{yang2022prompt} proposed a new prompt-learning framework for DST that uses values in prompts (\textit{value-based}) and generates slots directly from the PLM. This \textit{value-based} prompt approach does not rely on the ontology of the slots and their natural language descriptions. In task-oriented dialog systems, the prompt-based DST methods are still under-explored. In this thesis, the value-based prompt approach is applied for few-shot DST.
-\subsection{MultiWOZ Dataset}
+\paragraph{Training strategy} Prompting methods can be used without any training of the PLM for the downstream tasks. This can be done by taking a suitable pre-trained LM and applying the prompts directly to the inputs of the task. This approach is traditionally known as \textit{zero-shot learning}. However, the zero-shot approach carries a risk of bias, since the PLM is not fine-tuned for the task, and it is less effective on tasks that differ from those the PLM was trained on. \textit{Few-shot learning} is another approach where only a small number of task-specific training samples are used to train the language model. Prompting methods are particularly useful in this setting, when there is not enough task-specific training data to fully train the model. There are different training methods to fine-tune the prompts and the LM: fix the LM parameters and fine-tune the prompts, fix the prompts and update the LM parameters, or fine-tune both the LM parameters and the prompts. Among these training strategies, \textit{fixed-prompt LM tuning} improves the PLM by fine-tuning it with the prompts. It is similar to the standard fine-tuning paradigm, except that fixed prompts are applied to the training inputs before the PLM is fine-tuned on them. This approach helps the PLM understand the downstream task and can potentially lead to improvements under few-shot settings.
 \paragraph{} MultiWOZ \citep{budzianowski2018multiwoz} is a multi-domain task-oriented dialogue dataset that contains over 10K dialogues across 8 domains. It is a fully-labeled collection of human-human written conversations and has been a widely used dataset for benchmarking DST methods. \citet{eric2019multiwoz} released MultiWOZ 2.1 after fixing the noisy dialog state annotations and utterances that negatively impact the performance of DST models. In this thesis, MultiWOZ 2.1 is used to benchmark both baseline and prompt-based methods.
 \subsection{Prompt-based DST}
 \paragraph{} In the previous work TOD-BERT \citep{wu2020tod-bert}, the BERT language model is pre-trained on nine different task-oriented dialogue datasets. The pre-trained TOD-BERT is fine-tuned on multiple task-oriented dialogue tasks (intent recognition, dialog state tracking) and evaluated under few-shot settings. The DST task in this work is treated as a multi-class classification problem that predicts the slots and values from the pre-defined ontology. \textsc{Soloist} \citep{peng2021soloist} used a similar approach by pre-training the GPT-2 language model on two different dialogue datasets. The downstream DST task of the \textsc{Soloist} model does not depend on the ontology of domains and directly generates belief states as word sequences (described in section \ref{subsec:soloist}). Neither TOD-BERT nor \textsc{Soloist} uses prompt-based methods, and both perform poorly under extremely low-resource settings. The pre-trained \textsc{Soloist} model is adopted as a baseline in this thesis to explore the prompt-based methods for DST.
 \paragraph{} Previous work by \citet{lee2021sdp} used prompting methods on a PLM to solve the DST task. This work introduced schema-driven prompting (\textit{slot-based prompt}) that takes domain names, slots, and natural language descriptions of slots to fill in the prompts and generates the corresponding values. This method relies on the complete ontology of the domains and their slot descriptions. This method also requires a lot of training data to fine-tune the PLM. \citet{yang2022prompt} proposed a new prompt-learning framework for DST that uses values in prompts (\textit{value-based prompt}) and generates the slots directly from the PLM. This value-based prompt approach does not depend on the ontology of the domains. This work designed a \textit{value-based prompt} and \textit{inverse prompt} to help the PLM solve the DST task during the fine-tuning stage. Figure \ref{figure:2} shows an overview of value-based prompt and inverse prompt for DST.
 \vspace{0.4cm}
 \begin{figure}[h!]
    \centering
    \includegraphics[width=\linewidth]{images/prompt_dst}
    \caption{Overview of value-based prompt and inverse prompt mechanism.}
    \label{figure:2}
 \end{figure}
 \paragraph{} First, the belief state slots are generated using value-based prompts. The generated slots are then given to the inverse prompt to generate the values back. The inverse prompt function can be considered an auxiliary task that helps the PLM understand the DST task. The losses from the value-based prompt and the inverse prompt are combined during the training phase. At inference time, the value-based prompt is used directly to generate the slots without depending on the ontology. \citet{yang2022prompt} showed that the prompt-learning framework can efficiently learn the DST task even under extremely low-resource settings. These prompt-based methods are further explored in this thesis. The experimental methods for prompt-based DST are detailed in section \ref{subsec:prompt_dst}.
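 \paragraph{} As an illustrative sketch of how the two training losses could be combined (the weighted-sum form, the weight $w$, and the symbol names are assumptions made here for illustration, not a statement of the exact objective used in the cited work), let $\mathcal{L}_{\mathrm{value}}$ denote the value-based prompt loss and $\mathcal{L}_{\mathrm{inverse}}$ the inverse prompt loss:
 \[
   \mathcal{L} = \mathcal{L}_{\mathrm{value}} + w \, \mathcal{L}_{\mathrm{inverse}}, \qquad w \ge 0
 \]
 Under this assumed form, $w = 0$ would reduce training to the value-based prompt alone, while a larger $w$ would give the auxiliary inverse task more influence on the PLM updates.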
 \subsection{MultiWOZ Dataset}
 \paragraph{} MultiWOZ \citep{budzianowski2018multiwoz} is a widely used dialogue dataset for benchmarking task-oriented dialog systems. MultiWOZ is a collection of human-human written conversations centered around a wide range of topics, such as hotel booking, restaurant reservation, attraction recommendations, and booking a train. It contains over 10K dialogues across 8 domains, with multiple turns for each dialogue. Each dialogue contains multiple user and system utterances, and each turn is annotated with belief states as slot-value pairs. \citet{eric2019multiwoz} released MultiWOZ 2.1 after fixing the noisy dialog state annotations and utterances that negatively impact the performance of DST models. In this thesis, the MultiWOZ 2.1 dataset is used to evaluate the baseline and prompt-based methods on the DST task.
--- a/writing/latex/sections/04_methods.tex
+++ b/writing/latex/sections/04_methods.tex
@ -3,8 +3,8 @@
 This section describes the research methods and experimental setup of the work conducted in this thesis. This thesis work can be divided into the following tasks: \textsc{Soloist} baseline implementation for few-shot DST, prompt-based methods for few-shot DST, evaluation and analysis of belief state predictions, and multi-prompt methods for DST.
 \subsection{Dataset}
-The baseline and prompt-based methods are benchmarked on MultiWOZ 2.1 \citep{eric2019multiwoz} dataset. The MultiWOZ dataset contains 8438/1000/1000 single-domain and multi-domain dialogues for training/validation/testing respectively. Each dialogue can have multiple turns and each turn can include multiple \textit{(slot, value)} pairs. Dialogues from only five domains (\textit{Restaurant, Hotel, Attraction, Taxi, Train}) and one sub-domain (\textit{Booking}) are used in the experiments, as the other two domains (\textit{Hospital, Police}) only appear in the training set. To observe the performance under few-shot settings, dialogues are randomly sampled for each domain and six different data splits are created. Each data split contains all five domains and dialogues are evenly distributed. Only single-domain dialogues including booking sub-domain are picked for creating the data splits. Validation and test sets are not sampled after domain filtering. Table \ref{table:2} provides data statistics and the summary of data splits used in few-shot experiments.
+The baseline and prompt-based methods are benchmarked on MultiWOZ 2.1 \citep{eric2019multiwoz} dataset. The MultiWOZ dataset contains 8438/1000/1000 single-domain and multi-domain dialogues for training/validation/testing, respectively. Each dialogue can have multiple turns and each turn can include multiple \textit{(slot, value)} pairs. Dialogues from only five domains (\textit{Restaurant, Hotel, Attraction, Taxi, Train}) and one sub-domain (\textit{Booking}) are used in the experiments, as the other two domains (\textit{Hospital, Police}) only appear in the training set. To observe the performance under few-shot settings, dialogues are randomly sampled for each domain and six different data splits are created. Each data split contains dialogues from all five domains, with the dialogues evenly distributed across the domains. Only single-domain dialogues, including the booking sub-domain, are picked for creating the data splits. Validation and test sets are not sampled after domain filtering. Table \ref{table:2} provides data statistics and the summary of data splits used in few-shot experiments.
-
+\vspace{0.25cm}
 \begin{table}[h!]
 \centering
 \begingroup
@ -28,7 +28,7 @@ The baseline and prompt-based methods are benchmarked on MultiWOZ 2.1 \citep{eri
 \label{table:2}
 \end{table}
-\paragraph{} In the MultiWOZ 2.1 dataset, 16 dialog slots are used to understand the user requirements. For the prompt-based experiments, these slots are converted to look like natural language words for fine-tuning the slot generation process. Table \ref{table:3} lists the slots from all five domains and \textit{booking} sub-domain.
+In the MultiWOZ 2.1 dataset, 16 dialog slots are used to understand the user requirements. For the prompt-based experiments, these slots are converted to look like natural language words for fine-tuning the slot generation process. Table \ref{table:3} lists the slots from all five domains and \textit{booking} sub-domain.
 \begin{table}[!ht]
 \centering
@ -48,7 +48,7 @@ The baseline and prompt-based methods are benchmarked on MultiWOZ 2.1 \citep{eri
 \subsection{SOLOIST Baseline}
 \textsc{Soloist} \citep{peng2021soloist} is the baseline model for the prompt-based methods. \textsc{Soloist} is initialized with the 12-layer GPT-2 \citep{radford2019gpt2} and further trained on two task-oriented dialog corpora (Schema and Taskmaster). The task-grounded pre-training helps the \textsc{Soloist} model solve two dialog-related tasks: \textit{belief state prediction} and \textit{response generation}. In the belief state prediction task, the model takes the dialog history as input and generates the belief states as a sequence of words. In this thesis, for the baseline implementation, the pre-trained \textsc{Soloist} is fine-tuned on the MultiWOZ 2.1 data splits to perform the belief state prediction task. At inference time, the fine-tuned \textsc{Soloist} baseline does not need the pre-defined set of slots and their possible values, and it uses top-K \citep{fan2018topk} and top-p (nucleus) \citep{holtzman2020topp} sampling to generate the belief states. In the prompt-based DST task, the same pre-trained \textsc{Soloist} model is fine-tuned for prompt-based slot generation.
-\subsection{Prompt-based few-shot DST}
+\subsection{Prompt-based few-shot DST} \label{subsec:prompt_dst}
 This task aims to apply the prompt-based methods proposed by \citet{yang2022prompt} and reproduce their results. It uses the \textit{value-based prompt} and \textit{inverse prompt} to fine-tune the pre-trained \textsc{Soloist}, which can then generate the belief state slots directly at inference time. The prompt-based methods are evaluated on the same data splits (Table \ref{table:2}) of the MultiWOZ 2.1 dataset.
 \paragraph{Value-based prompt} An intuitive idea for generating (\textit{slot, value}) pairs is to use slots in prompts and generate the corresponding values \citep{lee2021sdp}. For example, given the utterance \textquote{\textsl{Plan a trip to Berlin}} and the slot (\textsl{destination}), the prompt to the PLM could become \textquote{\textsl{Plan a trip to Berlin. destination = [z]}}, and the PLM is expected to generate \textsl{[z]} as \textquote{\textsl{Berlin}}. However, this approach relies on the ontology of the slots, and the fixed set of slots can change in real-world applications. \citet{yang2022prompt} proposed the \textit{value-based prompt}, which uses values in the prompts and generates the corresponding slots. This method does not require any pre-defined set of slots and can generate slots directly from the PLM. Consider the prompt template \textquote{\textsl{belief states: value = [v], slot = [s]}}. The prompt function $f$ can take the form $f(v) = $ \textsl{[dialog history] belief states: value = [v], slot = [s]}. Given the value candidate $v = $ \textquote{\textsl{Berlin}}, the PLM can generate \textsl{slot [s] = \textquote{destination}}. The overall training objective of value-based prompt generation is minimizing the negative log-likelihood of slots in the training dataset $D$:
--- a/writing/latex/sections/07_conclusion.tex
+++ b/writing/latex/sections/07_conclusion.tex
@ -1,3 +1,3 @@
 \section{Conclusion}\label{sec:conclusion}
-This work explored the use of prompt-based methods for dialog state tracking (DST) in task-oriented dialogue systems. The prompt-based methods, which include value-based prompt and inverse prompt, learned the DST task efficiently under low-resource few-shot settings without relying on the pre-defined set of slots and values. Experiments show that the prompt-based methods significantly outperformed the baseline Soloist model under low-resource settings. Analysis of generated belief states shows the prompt-based approach has some limitations. Additionally, multi-prompt methods such as prompt ensembling and prompt augmentation are applied to the DST task. Results show that the prompt ensemble model achieved minor improvements, and the performance of prompt augmentation is limited due to the bias in answered prompts. Error analysis of value extraction highlights the limitations of the rule-based methods. Further research is necessary to overcome the limitations of prompt-based methods and value extraction methods.
+This work explored the use of prompt-based methods for dialog state tracking (DST) in task-oriented dialogue systems. The prompt-based methods, which include value-based prompt and inverse prompt, learned the DST task efficiently under low-resource few-shot settings without relying on the pre-defined set of slots and values. Experiments show that the prompt-based methods significantly outperformed the baseline \textsc{Soloist} model under low-resource settings. Analysis of generated belief states shows the prompt-based approach has some limitations. Additionally, multi-prompt methods such as prompt ensembling and prompt augmentation are applied to the DST task. Results show that the prompt ensemble model achieved minor improvements, and the performance of prompt augmentation is limited due to the bias in answered prompts. Error analysis of value extraction highlights the limitations of the rule-based methods. Further research is necessary to overcome the limitations of prompt-based methods and value extraction methods.
--- a/writing/latex/thesis.pdf
+++ b/writing/latex/thesis.pdf
--- a/writing/latex/thesis.tex
+++ b/writing/latex/thesis.tex
@ -50,12 +50,9 @@
 % Thesis Supervisor
 \newcommand{\supervisor}{Prof. Dr. Thang Vu}
 \newcommand{\supervisorEmail}{thang.vu@ims.uni-stuttgart.de}
 %% 2nd supervisor/examiner details here
 %%
 %%
 %% 2nd examiner details here
 \newcommand{\examiner}{Dr. Antje Schweitzer}
 % Start document
 \begin{document}
@ -86,9 +83,20 @@
 \vspace{0.5cm}
-\large{\textbf{Supervisor}}\\
+\vfill
-\supervisor\\ [1pt]
+
-\normalsize{\supervisorEmail}\\
+\begin{table}[h!]
 \centering
 \begingroup
 \setlength{\tabcolsep}{14pt} % Default value: 6pt
 \renewcommand{\arraystretch}{1.25} % Default value: 1
 \begin{tabular}{lr}
 \large{\textbf{Supervisor/Examiner}} & \supervisor\\
 \large{\textbf{Examiner}} & \examiner\\
 \end{tabular}
 \endgroup
 \end{table}
 \end{center}
 \end{titlepage}