Analysis of results and outputs

Baseline (SOLOIST)

The baseline SOLOIST is fine-tuned on different data splits to evaluate belief state prediction under low-resource settings. The results show that the baseline performed well when fine-tuned on relatively large data samples but poorly with low-resource training data (especially 25 and 50 dialogs).

The belief state prediction task of SOLOIST uses top-k and top-p sampling to generate the belief state slots and values. Because the baseline uses open-ended generation, it is susceptible to generating random slot-value pairs that are not relevant to the dialog history. In the example below, the baseline generated one slot-value pair that is not relevant to the user's goals and missed two correct slot-value pairs.

| Dialog History | True belief states | Generated belief states |
|---|---|---|
| user: we need to find a guesthouse of moderate price.<br>system: do you have any special area you would like to stay? or possibly a star request for the guesthouse?<br>user: i would like it to have a 3 star rating. | type = guesthouse<br>pricerange = moderate<br>stars = 3 | parking = yes<br>stars = 3 |
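To illustrate why sampling-based decoding can emit an irrelevant slot, here is a minimal sketch of top-k and top-p filtering over a single next-token distribution. The function names, the token probabilities, and the parameter values are assumptions made for illustration; they are not taken from the SOLOIST code.

```python
import random


def filter_top_k_top_p(probs, top_k=0, top_p=1.0):
    """Return a renormalized distribution restricted by top-k and top-p.

    probs: dict mapping token -> probability (assumed to sum to 1).
    """
    # Rank tokens from most to least probable.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)

    # Top-k: keep only the k most probable tokens (0 disables the filter).
    if top_k > 0:
        ranked = ranked[:top_k]

    # Top-p: keep the smallest prefix whose cumulative mass reaches top_p.
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break

    # Renormalize so the kept probabilities sum to 1 again.
    total = sum(p for _, p in kept)
    return {token: p / total for token, p in kept}


def sample(probs, rng=random):
    """Draw one token according to the given distribution."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


# Hypothetical next-slot distribution; the numbers are invented.
next_token = {"stars": 0.45, "type": 0.30, "parking": 0.15, "internet": 0.10}
filtered = filter_top_k_top_p(next_token, top_k=3, top_p=0.9)
print(filtered)
print(sample(filtered))
```

In this sketch the unrelated slot `parking` survives both filters with a non-zero probability, so a sampling step can still pick it even though the dialog history never mentions parking.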

Prompt-based Methods

Value-based prompt & Inverse prompt

The value-based prompt uses the dialog history and a value to generate the corresponding slot, so it does not rely on a slot ontology. During training, both value-based prompts and inverse prompts are used to compute the loss. The inverse prompt complements the value-based prompt in generating the correct slots: the JGA score drops by 5-10% (depending on the data split used for training) when the inverse prompt is not applied during training.

The experimental results show a significant difference in performance between the baseline SOLOIST and the prompt-based methods. The prompt-based methods significantly outperformed the baseline under low-resource settings (5-dpd, 10-dpd and 50-dpd).
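The relationship between the two prompts can be sketched as follows. The prompt templates, the function names, and the weighted-sum form of the loss are all assumptions for illustration; the text above only states that both prompts contribute to the training loss.

```python
def value_based_prompt(history, value):
    # Hypothetical template: the model is asked to continue with the slot.
    return f"{history} belief states: value = {value}, slot ="


def inverse_prompt(history, slot):
    # Hypothetical template: the model is asked to recover the value
    # from the slot it generated.
    return f"{history} belief states: slot = {slot}, value ="


def combined_loss(slot_loss, value_loss, inverse_weight=1.0):
    # Assumed combination: a weighted sum of the two losses.
    # inverse_weight = 0 corresponds to training without the inverse prompt.
    return slot_loss + inverse_weight * value_loss


history = "user: i would like it to have a 3 star rating."
print(value_based_prompt(history, "3"))
print(inverse_prompt(history, "stars"))
print(combined_loss(slot_loss=1.0, value_loss=2.0, inverse_weight=0.5))
```

The inverse prompt acts as a consistency check: a slot is only a good answer for a value if the value can be recovered from that slot.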

destination vs departure & leave vs arrive

Under low-resource settings, the prompt-based model struggled to generate slots such as departure|destination and leave|arrive. In many instances, it wrongly generated destination instead of departure and vice versa. Below is one example where the slots are wrongly generated.

| Dialog History | True belief states | Generated belief states |
|---|---|---|
| user: I need to be picked up from pizza hut city centre after 04:30 | leave = 04:30<br>departure = pizza hut city centre | arrive = 04:30<br>destination = pizza hut city centre |

Repeated values

Because the value-based prompt generates a slot from each value, it cannot handle a value that belongs to several slots: only one slot can be generated for a repeated value. Consider the following example:

| Dialog History | True belief states |
|---|---|
| user: hi, can you help me find a 3 star place to stay?<br>system: Is there a particular area or price range you would like?<br>user: how about a place in the centre of town that is of type hotel<br>system: how long would you like to stay, and how many are in your party?<br>user: I'll be arriving saturday and staying for 3 nights. there are 3 of us. | area = centre<br>stars = 3<br>type = hotel<br>day = saturday<br>people = 3<br>stay = 3 |

The repeated value 3 in the above example can only generate one slot with the value-based prompt, because the word with the highest probability is picked as the generated slot. This suggests that the existing belief state annotations don't work well with the value-based prompt.
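The limitation can be reproduced with a small sketch. The `predict_slot` function is a hypothetical stand-in for the model: it returns a fixed most-likely slot per value, which mirrors picking the highest-probability word. The gold pairs are copied from the example above.

```python
# Gold belief state from the example above, as (slot, value) pairs.
gold = [
    ("area", "centre"), ("stars", "3"), ("type", "hotel"),
    ("day", "saturday"), ("people", "3"), ("stay", "3"),
]


def predict_slot(value):
    # Hypothetical stand-in for the model: the same value (with the same
    # dialog history) always yields the same highest-probability slot.
    most_likely = {
        "centre": "area", "3": "stars", "hotel": "type", "saturday": "day",
    }
    return most_likely[value]


# Each distinct value is a single candidate, so "3" is queried only once.
candidates = {value for _, value in gold}
predicted = {(predict_slot(v), v) for v in candidates}

# Gold pairs that can never be produced for the repeated value.
missed = sorted(set(gold) - predicted)
print(missed)
```

Even with a perfect guess for every distinct value, the pairs for `people` and `stay` are unreachable, so the turn can never match the gold belief state exactly.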

Multi-prompt methods

Applying multi-prompt methods such as prompt ensemble and prompt augmentation gives similar results, with only a minor improvement in the JGA scores. Different samples of prompts and answered prompts were applied to the value-based prompt; some yield good results, while others add bias during slot generation and degrade the performance.
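One common way to combine several prompts is to average the slot probabilities they produce. The sketch below assumes that averaging scheme, and the probabilities are invented; the text above does not specify how the ensemble is combined.

```python
from collections import defaultdict


def ensemble(distributions):
    """Average slot probabilities over several prompts and pick the best slot."""
    totals = defaultdict(float)
    for dist in distributions:
        for slot, p in dist.items():
            totals[slot] += p / len(distributions)
    best = max(totals, key=totals.get)
    return best, dict(totals)


# Hypothetical slot probabilities for the value "04:30" under three prompts.
per_prompt = [
    {"leave": 0.55, "arrive": 0.45},
    {"leave": 0.40, "arrive": 0.60},  # this prompt is biased toward "arrive"
    {"leave": 0.70, "arrive": 0.30},
]
best_slot, averaged = ensemble(per_prompt)
print(best_slot, averaged)
```

Averaging dampens the effect of a single biased prompt, but if most prompts share the same bias the ensemble inherits it, which is consistent with the small gains reported above.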

JGA and JGA* Scores

Higher JGA* scores suggest the current methods of extracting value candidates need improvements.
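For reference, joint goal accuracy (JGA) counts a turn as correct only when the predicted belief state matches the gold belief state exactly. The sketch below covers standard JGA only; JGA* is not defined in this section and is not reproduced here. The function name and the predictions are illustrative assumptions, with gold states borrowed from the examples in this document.

```python
def joint_goal_accuracy(gold_states, predicted_states):
    """Fraction of turns whose predicted belief state equals the gold one."""
    if not gold_states:
        return 0.0
    correct = sum(
        1 for gold, pred in zip(gold_states, predicted_states) if gold == pred
    )
    return correct / len(gold_states)


gold_states = [
    {"type": "guesthouse", "pricerange": "moderate", "stars": "3"},
    {"leave": "04:30", "departure": "pizza hut city centre"},
]
# Hypothetical predictions: the first turn is fully correct, the second
# has the right values attached to the wrong slots.
predicted_states = [
    {"type": "guesthouse", "pricerange": "moderate", "stars": "3"},
    {"arrive": "04:30", "destination": "pizza hut city centre"},
]
jga = joint_goal_accuracy(gold_states, predicted_states)
print(jga)
```

Because the match is all-or-nothing per turn, one wrong or missing slot-value pair makes the whole turn count as incorrect.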

Value Extraction

Stanford CoreNLP client stanza is used to extract the values from user utterances. A set of rules is used to extract values from POS tags and named entities. Considering all adjectives (JJ) and adverbs (RB) can lead to many false positives among the value candidates, even after filtering out common stopwords. Another drawback of this approach concerns slots like parking and internet. When the user asks for free internet and free parking, the current belief state annotations use "yes" as the value, while the value extraction rules can only extract "free" from the user utterance. This is also a drawback of the existing annotations in the MultiWoZ dataset.
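The mismatch between extracted values and annotations can be seen in a minimal sketch of a POS-based rule. To stay self-contained, it works on tokens that are already tagged instead of calling stanza; the stopword list, the rule, and the tagged utterance are assumptions for illustration and are not the project's actual rules.

```python
# Hypothetical stopword list and candidate tags (adjectives and adverbs).
STOPWORDS = {"very", "also", "just", "really", "please"}
CANDIDATE_TAGS = {"JJ", "RB"}


def extract_value_candidates(tagged_tokens):
    """Keep adjectives and adverbs that are not stopwords."""
    return [
        word.lower()
        for word, tag in tagged_tokens
        if tag in CANDIDATE_TAGS and word.lower() not in STOPWORDS
    ]


# Hand-tagged utterance: "i need a hotel with free parking and free internet"
tagged = [
    ("i", "PRP"), ("need", "VBP"), ("a", "DT"), ("hotel", "NN"),
    ("with", "IN"), ("free", "JJ"), ("parking", "NN"), ("and", "CC"),
    ("free", "JJ"), ("internet", "NN"),
]
candidates = extract_value_candidates(tagged)
print(candidates)
```

The rule yields "free" as the only candidate, while the annotations expect parking = yes and internet = yes, so no extracted candidate can ever match the gold value for these slots.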