
\section{Analysis}\label{sec:analysis}

\subsection{Error analysis of baseline model}

\begin{table}[h!]
\centering
\begingroup
\setlength{\tabcolsep}{6pt} % Default value: 6pt
\renewcommand{\arraystretch}{1.3} % Default value: 1
\begin{tabular}{lp{10.25cm}}
\hline
\multicolumn{2}{c}{\textbf{Wrong belief state predictions}}\\
\hline
\textbf{Dialog History} & \parbox{10.25cm}{
\vspace{.25\baselineskip}
\textsf{user:} we need to find a guesthouse of moderate price.\newline \textsf{system:} do you have any special area you would like to stay? or possibly a star request for the guesthouse?\newline \textsf{user:} i would like it to have a 3 star rating.} \\
\textbf{True belief states} & \textsl{(type, guesthouse) (pricerange, moderate) (stars, 3)} \\
\textbf{Generated states} & \textsl{(parking, yes) (stars, 3)} \\
\hline
\textbf{Dialog History} & \parbox{10.25cm}{
\vspace{.25\baselineskip}
\textsf{user:} i need an expensive place to eat in the west.\newline
\textsf{system:} is there a specific type of food you would like?\newline
 \textsf{user:} yes, i would like eat indian food.} \\
\textbf{True belief states} &\textsl{(area, west) (food, indian) (pricerange, expensive)} \\
\textbf{Generated states} &\textsl{(area, west) (food, indian) (pricerange, cheap) (area, east)} \\
\hline
\end{tabular}
\endgroup
\caption{Examples of wrongly generated belief states by the baseline model.}
\label{table:13}
\end{table}
\vspace{0.5cm}

\noindent The belief prediction task of the \textsc{Soloist} baseline uses \textsl{top-k} and \textsl{top-p} sampling to generate the \textsl{(slot, value)} pairs. Because the baseline relies on open-ended generation, it is susceptible to generating slot-value pairs that are irrelevant to the dialog. Its performance was also affected by repeated slot generations and, in some cases, incorrect values. Table \ref{table:13} shows examples of these errors. In the first example, the baseline missed two of the three true states, \textsl{(type, guesthouse)} and \textsl{(pricerange, moderate)}, and generated an unsupported state, \textsl{(parking, yes)}. In the second example, the slot \textit{area} is repeated with a different value (\textit{east}), and the value of the slot \textit{pricerange} is generated incorrectly (\textit{cheap} instead of \textit{expensive}).
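
\noindent The effect of sampling on open-ended generation can be illustrated with a minimal Python sketch. It is not the \textsc{Soloist} implementation: the filtering function, the toy distribution over slot names, and the settings $k=4$ and $p=0.8$ are assumptions chosen purely for illustration. The sketch shows that a slot unrelated to the dialog (here \textit{parking}) can survive both filters and therefore remains a possible sample.

```python
import random

def top_k_top_p_filter(probs, k, p):
    """Keep the k most probable tokens, then the shortest prefix whose mass reaches p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    kept, mass = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        mass += prob
        if mass >= p:  # nucleus reached: drop the remaining tail
            break
    total = sum(prob for _, prob in kept)
    # Renormalize so the surviving probabilities sum to one.
    return {token: prob / total for token, prob in kept}

def sample(probs, rng):
    """Draw one token according to the (filtered) distribution."""
    tokens = list(probs)
    weights = [probs[token] for token in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

# Hypothetical next-token distribution over slot names (illustrative numbers only).
next_slot = {"pricerange": 0.40, "type": 0.30, "parking": 0.15, "stars": 0.10, "area": 0.05}
filtered = top_k_top_p_filter(next_slot, k=4, p=0.8)
drawn = sample(filtered, random.Random(0))
```

\noindent With these toy numbers, \textit{parking} keeps a renormalized probability of roughly 0.18, so it would be drawn in a non-negligible share of samples.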

\subsection{Analysis of prompt-based methods}

\subsubsection{Value-based Prompt}
\begin{table}[h!]
\centering
\begingroup
\setlength{\tabcolsep}{6pt} % Default value: 6pt
\renewcommand{\arraystretch}{1.25} % Default value: 1
\begin{tabular}{lp{10.2cm}}
\hline
\textbf{Dialog History} & \parbox{10.2cm}{
\vspace{.25\baselineskip}
\textsf{user:} I need to be picked up from pizza hut city centre after 04:30} \\
\textbf{True belief states} & \textsl{(departure, pizza hut city centre) (leave, 04:30)} \\
\textbf{Generated states} & \textsl{(destination, pizza hut city centre) (arrive, 04:30)} \\
\hline
\textbf{Dialog History} & \parbox{10.2cm}{
\vspace{.25\baselineskip}
\textsf{user:} I need a taxi to arrive by 16:45 to take me to the parkside police station.}\\
\textbf{True belief states} &\textsl{(destination, parkside police station) (leave, 16:45)}\\
\textbf{Generated states} &\textsl{(destination, parkside police station) (arrive, 16:45)}\\
\hline
\end{tabular}
\endgroup
\caption{Incorrect belief states generated by the value-based prompt.}
\label{table:14}
\end{table}

\noindent The value-based prompt trained on low-resource data splits (i.e., \textsl{5-dpd}, \textsl{10-dpd}) struggled to distinguish between similar slots such as \textit{departure} vs. \textit{destination} and \textit{leave} vs. \textit{arrive}. In many instances, it generated the slot \textit{destination} instead of \textit{departure} and the slot \textit{arrive} instead of \textit{leave}. Table \ref{table:14} shows example outputs where the slots are generated incorrectly. In both examples, the slot \textit{arrive} is generated instead of \textit{leave}; in the first example, \textit{destination} is additionally generated instead of \textit{departure}. These incorrect slot generations can be attributed to the limited training data available for these slots. Overall, the prompt-based methods perform significantly better than the baseline even under low-resource settings, owing to the constrained generation of slots using value-based prompts.

\subsubsection{Repeated values in Belief States}
In the prompt-based methods, the value-based prompt takes the candidate values and generates the corresponding slots. However, the user requirements can lead to the same value occurring in more than one \textsl{(slot, value)} pair of a belief state.

\begin{table}[h!]
\centering
\begingroup
\setlength{\tabcolsep}{5pt} % Default value: 6pt
\renewcommand{\arraystretch}{1.2} % Default value: 1
\begin{tabular}{lp{11.25cm}}
\hline
\textbf{History} & \parbox{11.25cm}{
\vspace{.25\baselineskip}
\textsf{user:} hi, can you help me find a 3 star place to stay?\newline
\textsf{system:} Is there a particular area or price range you prefer?\newline
\textsf{user:} how about a place in centre of town that is of type hotel.\newline
\textsf{system:} how long would you like to stay, and how many people?\newline
\textsf{user:} I'll arrive on saturday and stay for 3 nights with 3 people.} \\
\textbf{True states} & \textsl{(area, centre) (stars, 3) (type, hotel) (day, saturday) (stay, 3) (people, 3)} \\
\hline
\end{tabular}
\endgroup
\caption{An example instance with repeated values in the (slot, value) pairs.}
\label{table:15}
\end{table}

\noindent The data instance listed in Table \ref{table:15} contains six (slot, value) pairs, three of which share the same value: the belief slots \textsl{stars}, \textsl{stay}, and \textsl{people} all take the value 3. Given the repeated value 3, the value-based prompt can generate only one slot, so the other two pairs are lost. This is a major drawback of the value-based prompt under the existing belief state annotation system.
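
\noindent The following Python sketch illustrates why repeated values are lost. It assumes, as a simplification, that the prompt yields exactly one slot per distinct candidate value; the lookup table standing in for the prompt-based model and its choice of \textsl{stars} for the value 3 are hypothetical.

```python
def slots_from_values(candidate_values, generate_slot):
    """Generate one slot per distinct candidate value (order of first occurrence kept)."""
    distinct_values = dict.fromkeys(candidate_values)
    return {value: generate_slot(value) for value in distinct_values}

# True belief state of the example instance with repeated values.
true_state = [
    ("area", "centre"), ("stars", "3"), ("type", "hotel"),
    ("day", "saturday"), ("stay", "3"), ("people", "3"),
]

# Hypothetical stand-in for the value-based prompt: one slot per value.
toy_prompt = {"centre": "area", "3": "stars", "hotel": "type", "saturday": "day"}.get

values = [value for _, value in true_state]
predicted = {(slot, value) for value, slot in slots_from_values(values, toy_prompt).items()}
missed = set(true_state) - predicted
```

\noindent Under this assumption, the pairs for \textsl{stay} and \textsl{people} end up in the set of missed states, whichever single slot the model selects for the value 3.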


\subsubsection{Error Analysis of Value Extraction} \label{subsec:value_errors}

\begin{table}[h!]
\centering
\begingroup
\setlength{\tabcolsep}{6pt} % Default value: 6pt
\renewcommand{\arraystretch}{1.1} % Default value: 1
\begin{tabular}{lp{11.25cm}}
\hline
\textbf{History} & \parbox{11.25cm}{
\vspace{.25\baselineskip}
\textsf{user:} I want a place to stay that has free wifi and free parking.\newline
\textsf{system:} do you have a preference for area or price range?\newline
\textsf{user:} I don't have a preference. I want a hotel not guesthouse.}\\
\textbf{True states} & \textsl{(area, \underline{dont care}) (internet, \underline{yes}) (parking, \underline{yes}) (price, \underline{dont care}) (type, hotel)} \\
\textbf{\makecell[l]{Extracted\\values}} & \textsl{free}, \textsl{hotel} \\
\hline
\textbf{History} & \parbox{11.25cm}{
\vspace{.25\baselineskip}
\textsf{user:} I need a guesthouse with free wifi please.\newline
\textsf{system:} which area would you prefer?\newline
\textsf{user:} I also need free parking, and I prefer a 4 star place.}\\
\textbf{True states} & \textsl{(internet, \underline{yes}) (parking, \underline{yes}) (stars, 4) (type, guesthouse)} \\
\textbf{\makecell[l]{Extracted\\values}} & \textsl{free}, \textsl{guesthouse}, \textsl{4} \\
\hline
\end{tabular}
\endgroup
\caption{Example data instances where values cannot be extracted (underlined).}
\label{table:16}
\end{table}

At inference time, the value-based prompt requires the belief state values in order to generate slots. The value extraction methods apply a set of rules on POS tags and named entities to extract value candidates directly from the utterances. On the test split, the rule-based extraction has an accuracy of \textsl{79\%} over all values and a turn-level accuracy of \textsl{49\%}. Table \ref{table:16} highlights instances where the values cannot be extracted using rule-based methods. In the first example, the value \textquote{\textit{dont care}} does not appear in the utterances and therefore cannot be extracted from POS tags. When the user asks for \textit{free} wifi or \textit{free} parking, the existing annotation system for belief states records the value \textquote{\textit{yes}}, whereas the rule-based methods can only extract the value \textquote{\textit{free}} from the utterances. In addition, the values \textquote{\textit{dont care}} and \textquote{\textit{yes}} each occur twice in the first example of Table \ref{table:16}, and \textquote{\textit{yes}} occurs twice in the second. As described in the preceding subsection on repeated values, the value-based prompt cannot handle repeated values for slot generation.
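
\noindent To make the two reported figures concrete, the sketch below computes a value-level and a turn-level accuracy for the two instances of Table \ref{table:16}. The definitions are assumptions made for illustration and may differ from the evaluation used in this thesis: a gold value counts as found if it appears among the extracted candidates, and a turn counts as correct only if all of its gold values are found.

```python
def extraction_accuracy(turns):
    """Return (value-level, turn-level) accuracy for (gold_values, extracted_values) turns.

    Assumed definitions: a gold value is found if it appears among the extracted
    candidates; a turn is correct only if every gold value of that turn is found.
    """
    found = total = correct_turns = 0
    for gold, extracted in turns:
        candidates = set(extracted)
        hits = sum(value in candidates for value in gold)
        found += hits
        total += len(gold)
        correct_turns += hits == len(gold)
    return found / total, correct_turns / len(turns)

# Gold values (from the true states) and extracted values of the two example instances.
turns = [
    (["dont care", "yes", "yes", "dont care", "hotel"], ["free", "hotel"]),
    (["yes", "yes", "4", "guesthouse"], ["free", "guesthouse", "4"]),
]
value_acc, turn_acc = extraction_accuracy(turns)
```

\noindent For these two instances, three of the nine gold values are found and neither turn is fully covered, which illustrates why the turn-level figure is much lower than the value-level one.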

\vspace{0.5cm}
\begin{table}[h!]
\centering
\begingroup
\setlength{\tabcolsep}{8pt} % Default value: 6pt
\renewcommand{\arraystretch}{1.25} % Default value: 1
\begin{tabular}{lp{10cm}}
\hline
\textbf{History} & \parbox{10cm}{
\vspace{.3\baselineskip}
\textsf{user:} I kind of need some help finding a nice hotel in the north part of town.
}\\
\textbf{True states} & \textsl{(area, north) (price, expensive) (type, hotel)} \\
\textbf{Extracted values} & \textsl{\underline{kind}}, \textsl{\underline{nice}}, \textsl{hotel}, \textsl{north} \\
\hline
%\textbf{History} & \parbox{11.25cm}{
%\vspace{.25\baselineskip}
%\textsf{user:} Hi, are there any expensive restaurants in the city centre?.\newline
%\textsf{system:} Is there a particular type of food you are looking for?\newline
%\textsf{user:} No, can you choose one for me and provide me the address.}\\
%\textbf{True states} & \textsl{(area, centre) (price, expensive) (food, dont care)}\\
%\textbf{Extracted values} & \textsl{expensive}, \textsl{centre}, \textsl{1} \\
%\hline
\end{tabular}
\endgroup
\caption{Example instance where values are extracted incorrectly (underlined).}
\label{table:17}
\end{table}

\paragraph{} After extracting POS tags using the CoreNLP client, all \textsl{adjectives} and \textsl{adverbs} in the utterances are considered as candidate values. This approach can lead to false positives among the value candidates. Table \ref{table:17} shows a data instance where some values are extracted incorrectly. The existing annotation system associates the user utterance \textquote{a nice hotel} with the value \textquote{\textit{expensive}} for the slot \textit{price}; the current rule-based methods cannot recover this value and extract \textquote{\textit{nice}} instead. The value \textquote{\textit{kind}} is also extracted incorrectly because all \textsl{adverbs} are treated as possible values. These limitations of the rule-based value extraction methods used in this thesis degraded the performance of prompt-based DST.
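
\noindent The adjective and adverb rule can be sketched in Python as follows. This is a simplified illustration rather than the extraction code used in this thesis: the POS tags are assigned by hand (assuming Penn Treebank tags) instead of being obtained from CoreNLP, and the other extraction rules, such as the one that yields \textsl{hotel}, are not shown.

```python
# Penn Treebank tag prefixes for adjectives (JJ, JJR, JJS) and adverbs (RB, RBR, RBS).
CANDIDATE_TAG_PREFIXES = ("JJ", "RB")

def adjective_adverb_candidates(tagged_tokens):
    """Return every token whose POS tag marks it as an adjective or an adverb."""
    return [token for token, tag in tagged_tokens if tag.startswith(CANDIDATE_TAG_PREFIXES)]

# Hand-assigned tags for the example utterance (an assumption, not actual CoreNLP output).
tagged = [
    ("I", "PRP"), ("kind", "RB"), ("of", "IN"), ("need", "VBP"), ("some", "DT"),
    ("help", "NN"), ("finding", "VBG"), ("a", "DT"), ("nice", "JJ"), ("hotel", "NN"),
    ("in", "IN"), ("the", "DT"), ("north", "JJ"), ("part", "NN"), ("of", "IN"),
    ("town", "NN"),
]
candidates = adjective_adverb_candidates(tagged)
```

\noindent Under these assumed tags, the rule returns \textsl{kind}, \textsl{nice}, and \textsl{north}: one correct value and two false positives, and no candidate that could be mapped to the annotated value \textquote{\textit{expensive}}.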