---
gitea: none
include_toc: true
---
## Analysis of results and outputs
### Baseline (SOLOIST)
The baseline SOLOIST model is fine-tuned on different data splits to evaluate the belief state prediction task under low-resource settings. The results show that the baseline performs well when *fine-tuned* on relatively large data samples, but poorly when the training data is scarce (especially the 25- and 50-dialog splits).

For belief state prediction, SOLOIST uses *top-k* and *top-p* sampling to generate the belief state slots and values. Since this is open-ended generation, the baseline is susceptible to producing slot-value pairs that are not grounded in the dialog history. Below is an example where the baseline generated a slot-value pair that is irrelevant to the user's goals and completely missed two correct slot-value pairs.

| Dialog History | True belief states | Generated belief states |
|----------------|--------------------|-------------------------|
| **user:** we need to find a guesthouse of moderate price.<br />**system:** do you have any special area you would like to stay?<br />or possibly a star request for the guesthouse?<br />**user:** i would like it to have a 3 star rating. | type = guesthouse<br />pricerange = moderate<br />stars = 3 | parking = yes<br />stars = 3 |
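
The sampling step described above can be sketched as follows. This is a generic illustration of combined top-k/top-p (nucleus) filtering over a toy next-token distribution, not SOLOIST's actual decoding code.

```python
import numpy as np

def top_k_top_p_filter(probs, k=5, p=0.9):
    """Keep only the top-k tokens that fall inside the nucleus of
    cumulative probability p, then renormalize before sampling."""
    order = np.argsort(probs)[::-1]          # token ids, most probable first
    sorted_probs = probs[order]
    cumulative = np.cumsum(sorted_probs)
    # nucleus: smallest prefix whose cumulative mass reaches p
    nucleus = np.searchsorted(cumulative, p) + 1
    keep = order[:min(k, nucleus)]
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()

# toy distribution over a 6-token vocabulary
vocab_probs = np.array([0.50, 0.25, 0.15, 0.06, 0.03, 0.01])
filtered = top_k_top_p_filter(vocab_probs, k=3, p=0.9)
next_token = np.random.choice(len(filtered), p=filtered)
```

Because the tail of the distribution is truncated before sampling, only the nucleus tokens can ever be drawn; the baseline's irrelevant slot-value pairs arise when that nucleus still contains tokens unsupported by the dialog history.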
### Prompt-based Methods
#### Value-based prompt & Inverse prompt
The value-based prompt uses the dialog history and a value to generate the corresponding slot, so it does not rely on a slot ontology. During training, both the value-based prompt and the inverse prompt are used to compute the loss. The inverse prompt mechanism complements the value-based prompt in generating the correct slots, especially under low-resource data splits.

The experimental results show a significant difference in performance between the baseline SOLOIST and the prompt-based methods: the latter significantly outperform the baseline under low-resource settings (*5-dpd*, *10-dpd* and *50-dpd*).
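The two prompt directions can be illustrated with a minimal template sketch. The exact template strings and the loss combination are assumptions here, not the templates used in the experiments.

```python
def value_prompt(history: str, value: str) -> str:
    # forward direction: given an extracted value, ask the model for its slot
    return f'{history} The value "{value}" belongs to slot:'

def inverse_prompt(history: str, slot: str) -> str:
    # inverse direction: given the generated slot, ask for its value, so the
    # model is penalized when the slot does not recover the original value
    return f'{history} The slot "{slot}" has value:'

history = "user: i would like it to have a 3 star rating."
fwd = value_prompt(history, "3")
inv = inverse_prompt(history, "stars")
# during training, the total loss would combine the LM loss of both
# directions, e.g. loss = lm_loss(fwd -> "stars") + lm_loss(inv -> "3")
```

The inverse direction acts as a consistency check: a slot that cannot regenerate its own value is discouraged, which is what helps most in the low-resource splits.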
#### destination vs departure & leave vs arrive
Under low-resource settings, the prompt-based model struggled to generate slots such as *departure*/*destination* and *leave*/*arrive*. In many instances it generated *destination* instead of *departure*, and vice versa. Below is one example where the slots are generated incorrectly.

| Dialog History | True belief states | Generated belief states |
|----------------|--------------------|-------------------------|
| **user:** I need to be picked up from pizza hut city centre after 04:30 | leave = 04:30<br/>departure = pizza hut city centre | arrive = 04:30<br/>destination = pizza hut city centre |
#### Repeated values
Since the value-based prompt generates a slot from the corresponding value, it cannot generate multiple slots for a repeated value: only one slot is produced per value. Consider the following example:

| Dialog History | True belief states |
|----------------|--------------------|
| **user:** hi, can you help me find a 3 star place to stay?<br />**system:** Is there a particular area or price range you would like?<br />**user:** how about a place in the centre of town that is of type hotel<br />**system:** how long would you like to stay, and how many are in your party?<br />**user:** I'll be arriving saturday and staying for 3 nights. there are 3 of us. | area = centre<br/>stars = 3<br/>type = hotel<br />day = saturday<br/>people = 3<br/>stay = 3 |

The repeated value `3` in the above example yields only one slot with the value-based prompt, since the word with the highest probability is picked as the generated slot. This suggests that the existing belief state annotations do not work well with the value-based prompt.
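The failure mode can be made concrete: one prompt is issued per distinct extracted value, so repeated mentions collapse into a single query and all but one gold slot are lost. A minimal sketch, with the winning slot for `"3"` assumed to be `stars`:

```python
# Gold annotation for the turn above: three distinct slots share the value "3".
gold = {"area": "centre", "stars": "3", "type": "hotel",
        "day": "saturday", "people": "3", "stay": "3"}

# The value-based prompt is issued once per *distinct value*, so the three
# mentions of "3" produce a single query, and only the highest-probability
# slot (assumed here to be "stars") survives.
extracted_values = set(gold.values())            # {"3", "centre", "hotel", "saturday"}
predicted = {}
for value in extracted_values:
    if value == "3":
        slot = "stars"                           # assumed argmax slot for "3"
    else:
        slot = next(s for s, v in gold.items() if v == value)
    predicted[slot] = value

missed = set(gold) - set(predicted)              # the two dropped slots
```

No matter which slot wins the argmax, two of the three `"3"` slots are unrecoverable, which is why the annotation scheme and the value-based prompt are at odds here.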
#### Multi-prompt methods
Applying multi-prompt methods such as *prompt ensemble* and *prompt augmentation* gives similar results, with only a minor improvement in JGA scores. Different samples of prompts and answered prompts are applied to the value-based prompt; while some yield good results, others add bias to slot generation and degrade performance.
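Prompt ensembling can be sketched as averaging the model's slot scores across several paraphrased prompts. The paraphrase count and all scores below are illustrative only, but they show both effects described above: averaging can recover the right slot, while a biased paraphrase (prompt 2) drags the ensemble toward the wrong one.

```python
import numpy as np

slots = ["departure", "destination", "leave", "arrive"]
# hypothetical per-prompt slot scores for the same (history, value) pair,
# one row per paraphrased prompt
scores = np.array([
    [0.55, 0.30, 0.10, 0.05],   # prompt 1 favors the correct slot
    [0.40, 0.45, 0.10, 0.05],   # prompt 2 is biased toward "destination"
    [0.60, 0.25, 0.10, 0.05],   # prompt 3 favors the correct slot
])
ensembled = scores.mean(axis=0)              # uniform prompt ensemble
predicted = slots[int(ensembled.argmax())]   # "departure"
```

With enough biased paraphrases in the mix, the averaged distribution flips, which is one way a prompt ensemble can degrade rather than improve slot generation.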
#### JGA and JGA* Scores
The higher JGA* scores suggest that the current method of extracting value candidates needs improvement.
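For reference, joint goal accuracy (JGA) counts a turn as correct only when the full predicted belief state matches the gold state exactly; a single wrong, missing, or extra slot-value pair zeroes out the whole turn. A minimal implementation:

```python
def joint_goal_accuracy(gold_states, predicted_states):
    """Fraction of turns whose predicted slot-value mapping matches gold exactly."""
    correct = sum(1 for g, p in zip(gold_states, predicted_states) if g == p)
    return correct / len(gold_states)

gold = [
    {"type": "guesthouse", "pricerange": "moderate", "stars": "3"},
    {"departure": "pizza hut city centre", "leave": "04:30"},
]
pred = [
    {"type": "guesthouse", "pricerange": "moderate", "stars": "3"},  # exact match
    {"destination": "pizza hut city centre", "arrive": "04:30"},     # wrong slots
]
jga = joint_goal_accuracy(gold, pred)   # 1 of 2 turns correct -> 0.5
```

The all-or-nothing matching is why the *destination*/*departure* confusions above are so costly: one swapped slot fails the entire turn.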
### Value Extraction
The Stanford NLP library [stanza](https://stanfordnlp.github.io/stanza/index.html) is used to extract values from user utterances, with a set of rules applied over POS tags and named entities. Treating all adjectives (JJ) and adverbs (RB) as candidates leads to many false positives (even after filtering out common stopwords). Another drawback of this approach concerns slots like *parking* and *internet*: when the user asks for *free* internet and *free* parking, the current belief state annotations use "yes" as the value, while the extraction rules can only recover "free" from the user utterance. This is also a drawback of the existing annotations in the MultiWOZ dataset.
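The rule layer can be sketched over pre-tagged tokens. The tags below stand in for the output of a stanza pipeline (in practice, `stanza.Pipeline` with the `tokenize`, `pos`, and `ner` processors); the rule set and stopword list are simplified assumptions.

```python
# Each token is (text, pos_tag, ner_tag); in practice these come from stanza.
STOPWORDS = {"really", "very", "also", "just", "there"}

def extract_value_candidates(tagged_tokens):
    """Collect candidate values from named entities plus adjectives/adverbs."""
    candidates = []
    for text, pos, ner in tagged_tokens:
        if ner != "O":                                       # keep named entities
            candidates.append(text)
        elif pos in {"JJ", "RB"} and text.lower() not in STOPWORDS:
            candidates.append(text)                          # JJ/RB minus stopwords
    return candidates

utterance = [
    ("i", "PRP", "O"), ("need", "VBP", "O"), ("free", "JJ", "O"),
    ("parking", "NN", "O"), ("in", "IN", "O"), ("cambridge", "NNP", "GPE"),
]
values = extract_value_candidates(utterance)   # ["free", "cambridge"]
```

Here `"free"` is extracted as a candidate, but the gold annotation expects `parking = yes`, illustrating the mismatch between the surface-level extraction rules and the MultiWOZ annotation scheme described above.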