\section{Background \& Related Work}\label{sec:background}

\subsection{Dialog State Tracking (DST)}
\paragraph{}
Task-oriented dialog systems, whether modular or end-to-end, can handle a wide range of tasks (such as ticket booking or restaurant booking) across various domains. A task-oriented dialog system has stricter requirements on its responses because it must accurately understand and process the user's message. Modular methods were therefore proposed as a way to generate responses in a more controlled manner. The architecture of a typical modular task-oriented dialog system is depicted in Figure \ref{figure:1}. Such a system uses a pipeline of four modules that execute sequentially: Natural Language Understanding (NLU), Dialog State Tracking (DST), Policy Learning (PL), and Natural Language Generation (NLG). The NLU module extracts semantic values from user messages and performs intent detection and domain classification. The DST module takes the extracted values and fills slot-value pairs based on the entire dialog history. The PL module takes the slot-value pairs and decides the next action to be performed by the dialog system. The NLG module converts the dialog actions received from the PL module into natural language text, which is usually the system response to the user.

\vspace{0.5cm}
\begin{figure}[h!]
\centering
\includegraphics[width=\linewidth]{images/modular_tod}
\caption{Modular-based task-oriented dialog system \citep{ni2021dlds}}
\label{figure:1}
\end{figure}

The DST module is essential for enabling the system to comprehend the user's requests by tracking them in the form of slots and values (belief states) at every turn. In recent years, some dialog systems have passed the user utterances directly to the DST module. This approach fills the slot-value pairs directly from the raw user message and eliminates the need for an NLU module. For example, given the user message \textquote{\textit{Plan a train trip to Berlin this Friday for two people}}, the DST module is expected to fill the (\textit{slot, value}) pairs as follows: \{(\textit{destination, Berlin}), (\textit{day, Friday}), (\textit{people, 2})\}.

\paragraph{}
A typical task-oriented dialog system can assist users across multiple domains (restaurant, hotel, train, booking). Each dialog domain has an ontology, which represents the knowledge of the domain and the information required for its tasks. The ontology of a domain consists of a pre-defined set of slots and all the possible values for each slot. Neural models were proposed to solve the DST task as multi-class classification, where the model predicts the correct value for each slot from a fixed set of candidates. This approach depends on the domain ontology and has to track a large number of slot-value pairs. An ontology is hard to obtain in real-world scenarios, especially for new domains, and such neural DST models also need large amounts of training data, which are rarely available for new domains.

\paragraph{}
A dialog state or belief state in DST contains the information the system requires to process the user's request. At each turn of the dialog, the dialog state can contain \textit{informable slots} and \textit{requestable slots}. Informable slots capture the preferences and constraints the user specifies for the system's action. For example, in the restaurant domain, the user can ask for a specific type of food or a desired price range when booking a table; the belief state slots carrying such information are called \textit{informable slots}. Users can also ask the dialog system for the address or phone number of a restaurant; such slots are known as \textit{requestable slots}.
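
As a purely illustrative sketch, such a dialog state for a single restaurant-domain turn could be represented as a simple nested mapping; the slot names and values below are invented for illustration and are not taken from any particular dataset:

\begin{verbatim}
# Purely illustrative single-turn dialog state for the restaurant domain;
# the slot names and values are invented and not taken from any dataset.
dialog_state = {
    "restaurant": {
        # informable slots: constraints the user has provided so far
        "informable": {"food": "italian", "pricerange": "moderate"},
        # requestable slots: attributes the user has asked the system for
        "requested": ["address", "phone"],
    }
}
\end{verbatim}
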
This thesis work focuses on the DST module and on extracting informable slots and their values without depending on the ontology.

\subsection{Pre-trained Language Models (PLMs)}
\paragraph{}
Large pre-trained language models are trained on huge amounts of textual data and have achieved state-of-the-art performance on a variety of NLP tasks, such as machine translation, text classification, text generation, and summarization. PLMs trained on such large-scale datasets encode significant linguistic knowledge into their large number of parameters. In particular, pre-trained language models based on the transformer architecture \citep{vaswani2017attention}, such as BERT \citep{devlin2019bert} and GPT \citep{radford2018gpt}, have achieved state-of-the-art results on many NLP tasks. GPT-2 \citep{radford2019gpt2} is a transformer-based left-to-right auto-regressive language model trained on large amounts of open web text. Its main training objective is to predict the next word given all the previous words; in doing so, a left-to-right auto-regressive language model assigns a probability to any sequence of words. For a sequence of words $x = x_1, x_2, \ldots, x_n$, this probability can be factorized with the chain rule from left to right:
$$ P(x) = P\left(x_1\right) \times P\left(x_2 \mid x_1\right) \times \ldots \times P\left(x_n \mid x_1 \cdots x_{n-1}\right) $$

\paragraph{}
PLMs trained on large amounts of text can be fine-tuned with task-specific data to solve downstream tasks efficiently. Previous work by \citet{wu2020tod-bert} further pre-trained the BERT model on nine different task-oriented dialog datasets and then fine-tuned it on downstream tasks; this improved the performance on those tasks over fine-tuning BERT directly. \textsc{Soloist} \citep{peng2021soloist} used a similar approach, pre-training the GPT-2 model on two task-oriented dialog corpora and fine-tuning the resulting model on the DST task. The pre-trained \textsc{Soloist}, which uses the publicly available 12-layer GPT-2 (117M) model, is the baseline model of this thesis. The prompt-based methods in this thesis also start from the pre-trained \textsc{Soloist} and fine-tune it on the prompt-based DST task.

\subsection{SOLOIST Model}\label{subsec:soloist}
\paragraph{}
\textsc{Soloist} \citep{peng2021soloist} follows the \textsl{pre-train, fine-tune} paradigm for building a task-oriented dialog system on top of the auto-regressive language model GPT-2 \citep{radford2019gpt2}. The system is built in two phases. In the pre-training phase, \textsc{Soloist} is initialized with GPT-2 and further trained on two large task-oriented datasets, Schema and Taskmaster. The primary goal at this stage is to learn task-completion skills such as \textit{belief prediction} and \textit{response generation}. In the belief prediction task of the pre-training stage, the \textsc{Soloist} model takes the dialog history as input and generates the belief state as a sequence of words. The generated belief state sequence takes the form \textquote{\textit{belief: $slot_1 = value_1; slot_2 = value_2, \ldots$}}.
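
A minimal sketch of how a set of slot-value pairs might be linearized into such a target sequence is shown below; the helper name and the exact delimiters are assumptions derived from the quoted template, not the actual \textsc{Soloist} implementation:

\begin{verbatim}
def linearize_belief(belief: dict) -> str:
    """Serialize slot-value pairs into a SOLOIST-style belief string.

    The delimiters are assumed from the template
    'belief: slot_1 = value_1; slot_2 = value_2, ...'.
    """
    pairs = "; ".join(f"{slot} = {value}" for slot, value in belief.items())
    return f"belief: {pairs}".rstrip()

# The train-booking example from the DST subsection:
state = {"destination": "Berlin", "day": "Friday", "people": "2"}
print(linearize_belief(state))
# belief: destination = Berlin; day = Friday; people = 2
\end{verbatim}

The resulting string is what the model is trained to generate token by token from the dialog history, which leads to the objective below.
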
The pre-training objective for predicting belief states is:
$$ \mathcal{L} = \log P(b \mid s) = \sum_{t=1}^{T_b} \log P\left(b_t \mid b_{<t}, s\right) $$
where $s$ denotes the dialog history, $b = (b_1, \ldots, b_{T_b})$ the belief state sequence of length $T_b$, and $b_{<t}$ the belief tokens preceding position $t$.
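
To make the objective concrete, the following is a minimal sketch of how the belief-prediction loss can be computed with the HuggingFace \texttt{transformers} implementation of GPT-2. The input formatting is simplified relative to the actual \textsc{Soloist} model (which uses additional special tokens and segments); masking the history positions with the label value \texttt{-100} restricts the cross-entropy loss to the belief tokens:

\begin{verbatim}
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Dialog history s and target belief sequence b (formatting simplified).
history = "user: plan a train trip to berlin this friday for two people"
belief = " belief: destination = berlin; day = friday; people = 2"

history_ids = tokenizer(history, return_tensors="pt").input_ids
belief_ids = tokenizer(belief, return_tensors="pt").input_ids
input_ids = torch.cat([history_ids, belief_ids], dim=1)

# Ignore the history positions (-100) so that the cross-entropy loss is
# computed only over the belief tokens, i.e. over log P(b_t | b_<t, s).
labels = input_ids.clone()
labels[:, : history_ids.size(1)] = -100

# The returned loss is the mean (not the sum) negative log-likelihood of
# the belief tokens; one gradient step on it is one pre-training update.
loss = model(input_ids, labels=labels).loss
loss.backward()
\end{verbatim}
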