diff --git a/ANALYSIS.md b/ANALYSIS.md
index 55b47a8..3ba1a80 100644
--- a/ANALYSIS.md
+++ b/ANALYSIS.md
@@ -9,8 +9,8 @@ The baseline SOLOIST is fine-tuned on different data splits to evaluate the perf
The belief state prediction task of SOLOIST utilizes *top-k* and *top-p* sampling to generate the belief state slots and values. Since the baseline SOLOIST uses open-ended generation, it's susceptible to generating random slot-value pairs that are not relevant to the dialog history. Below is an example where the baseline model generated a slot-value pair that is not relevant to the user's goals and completely missed two correct slot-value pairs.

-| Dialog History | True belief states | Generated belief states |
-| ----- | ----- | ----- |
+| Dialog History | True belief states | Generated belief states |
+|----------------|--------------------|-------------------------|
| **user:** we need to find a guesthouse of moderate price. <br> **system:** do you have any special area you would like to stay? <br> or possibly a star request for the guesthouse? <br> **user:** i would like it to have a 3 star rating. | type = guesthouse <br> pricerange = moderate <br> stars = 3 | parking = yes <br> stars = 3 |

@@ -18,23 +18,23 @@ The belief state prediction task of SOLOIST utilizes *top-k* and *top-p* samplin
### Prompt-based Methods

#### Value-based prompt & Inverse prompt
-Value-based prompt utilizes the dialog history and value to generate corresponding slots. This approach doesn't rely on the ontology of the slots. While training, both value-based prompts and inverse prompts are used to compute the training loss. The inverse prompt mechanism helped complementing the value-based prompt in generating the correct slots. It's worth mentioning that there's a 5-10% drop (depending on the data split trained on) in the JGA score when inverse prompt mechanism is not applied during training.
+The value-based prompt uses the dialog history and a value to generate the corresponding slot, so this approach doesn't rely on the ontology of the slots. During training, both value-based prompts and inverse prompts are used to compute the training loss. The inverse prompt mechanism helped to complement the value-based prompt in generating the correct slots, especially under low-resource data splits.

The experimental results show a significant difference in performance between the baseline SOLOIST and the prompt-based methods: the prompt-based methods significantly outperformed the baseline model under low-resource settings (*5-dpd*, *10-dpd* and *50-dpd*).

#### destination vs departure & leave vs arrive
Under low-resource settings, the prompt-based model struggled to generate slots like *departure*|*destination* and *leave*|*arrive*. For many instances, it wrongly generated *destination* instead of *departure* and vice versa. Below is one example where slots are wrongly generated.

-| Dialog History | True belief states | Generated belief states |
-|-------------------------------------------------------------------------| ----- | ----- |
+| Dialog History | True belief states | Generated belief states |
+|-------------------------------------------------------------------------|--------------------------------------|------------------------------------------|
| **user:** I need to be picked up from pizza hut city centre after 04:30 | leave = 04:30 <br> departure = pizza hut city centre | arrive = 04:30 <br> destination = pizza hut city centre |

#### Repeated values
Since the value-based prompt generates a slot from its corresponding value, it can't generate separate slots for repeated values: only one slot can be generated for a repeated value. Consider the following example:

-| Dialog History | True belief states |
-| ----- | ----- |
-| **user:** hi, can you help me find a 3 star place to stay? <br> **system:** Is there a particular area or price range you would like? <br> **user:** how about a place in the centre of town that is of type hotel <br> **system:** how long would you like to stay, and how many are in your party? <br> **user:** I'll be arriving saturday and staying for 3 nights. there are 3 of us.| area = centre <br> stars = 3 <br> type = hotel <br> day = saturday <br> people = 3 <br> stay = 3|
+| Dialog History | True belief states |
+|----------------|--------------------|
+| **user:** hi, can you help me find a 3 star place to stay? <br> **system:** Is there a particular area or price range you would like? <br> **user:** how about a place in the centre of town that is of type hotel <br> **system:** how long would you like to stay, and how many are in your party? <br> **user:** I'll be arriving saturday and staying for 3 nights. there are 3 of us. | area = centre <br> stars = 3 <br> type = hotel <br> day = saturday <br> people = 3 <br> stay = 3 |

The repeated value `3` in the above example can only generate one slot using the value-based prompt, as the word with the highest probability is picked as the generated slot. This suggests that the existing belief state annotations don't work well with the value-based prompt.

diff --git a/README.md b/README.md
index 1a7dd3f..3305453 100644
--- a/README.md
+++ b/README.md
@@ -260,14 +260,8 @@ python evaluate.py -o path/to/outputs/file
```

### Results from prompt-based belief state generations
-|data-split| JGA | JGA* |
-|--|:--:|:--:|
-| 5-dpd | 30.66 | 71.04 |
-| 10-dpd | 42.65 | 86.43 |
-| 50-dpd | 47.06 | 91.63 |
-| 100-dpd | **47.74** | **92.31** |
-| 125-dpd | 46.49 | 91.86 |
-| 250-dpd | 47.06 | 92.08 |
+
+| Dataset | JGA (w = 0.1) | JGA* (w = 0.1) | JGA (w = 0.3) | JGA* (w = 0.3) | JGA (w = 0.5) | JGA* (w = 0.5) | JGA (w = 0.7) | JGA* (w = 0.7) |
+|---------|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
+| 5-dpd   | 30.66 | 71.04 | 31.67 | 73.19 | 30.77 | 72.85 | 29.98 | 70.93 |
+| 10-dpd  | 42.65 | 86.43 | 41.18 | 83.48 | 40.05 | 80.77 | 40.38 | 85.18 |
+| 50-dpd  | 47.06 | 91.63 | 46.49 | 91.18 | 47.04 | 91.18 | 46.27 | 90.05 |
+| 100-dpd | 47.74 | 92.31 | 48.42 | 92.42 | 48.19 | 92.65 | 48.30 | 92.65 |
+| 125-dpd | 46.49 | 91.86 | 46.15 | 91.18 | 46.83 | 91.74 | 46.15 | 90.95 |
+| 250-dpd | 47.06 | 92.08 | 47.62 | 92.65 | 47.40 | 92.31 | 47.17 | 92.09 |
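
For readers unfamiliar with the metric, JGA (joint goal accuracy) counts a dialog turn as correct only when the full predicted belief state matches the ground truth exactly. Below is a minimal sketch of that computation; the per-turn dictionary layout is an assumption for illustration and does not reflect the actual schema of the output JSON files or of `evaluate.py`, and the sketch covers plain JGA only.

```python
# Minimal sketch of joint goal accuracy (JGA): a turn counts as correct only
# if the predicted belief state matches the gold belief state exactly.
# The dictionary layout is an assumption, not this repository's JSON schema.

def joint_goal_accuracy(turns):
    """turns: list of {"true": {slot: value}, "pred": {slot: value}} dicts."""
    if not turns:
        return 0.0
    correct = sum(t["pred"] == t["true"] for t in turns)
    return 100.0 * correct / len(turns)

example_turns = [
    # Baseline error from ANALYSIS.md: one irrelevant pair, two pairs missed.
    {"true": {"type": "guesthouse", "pricerange": "moderate", "stars": "3"},
     "pred": {"parking": "yes", "stars": "3"}},
    # A fully correct turn.
    {"true": {"leave": "04:30", "departure": "pizza hut city centre"},
     "pred": {"leave": "04:30", "departure": "pizza hut city centre"}},
]

print(joint_goal_accuracy(example_turns))  # 50.0
```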
> **Note:** All the generated output files for the above reported results are available in this repository. Check [outputs/prompt-learning](outputs/prompt-learning) directory to see the output JSON files for each data-split.

@@ -306,6 +300,18 @@ Script for generating belief states (slots) using prompt-ensemble remains the sa
```
sh test_prompting.sh -m
```
+
+#### Results from Prompt Ensembling
+
+| Dataset | JGA | JGA* |
+|---------|-------|-------|
+| 5-dpd   | 30.09 | 69.23 |
+| 10-dpd  | 42.84 | 86.99 |
+| 50-dpd  | 47.62 | 91.74 |
+| 100-dpd | 48.08 | 93.10 |
+| 125-dpd | 46.96 | 92.08 |
+| 250-dpd | 48.30 | 93.44 |
+
+
### Prompt Augmentation

Prompt Augmentation, also called *demonstration learning*, provides a few additional *answered prompts* that demonstrate to the PLM how the actual prompt slot can be answered. The answered prompts are hand-crafted and hand-picked manually, and experiments are performed on different sets of *answered prompts*.

@@ -315,20 +321,13 @@ Edit the [test_prompting.sh](prompt-learning/test_prompting.sh) file and add `--
sh test_prompting.sh -m
```

-### Results from multi-prompt methods
-|data-split| JGA | JGA* |
-|--|:--:|:--:|
-| 5-dpd | 30.09 | 69.23 |
-| 10-dpd | 42.84 | 86.99 |
-| 50-dpd | 47.62 | 91.74 |
-| 100-dpd | **48.08** | **92.87** |
-| 125-dpd | 46.96 | 92.08 |
-| 250-dpd | **48.08** | **92.87** |
+#### Results from Prompt Augmentation
+
+| Data | JGA (Sample 1) | JGA* (Sample 1) | JGA (Sample 2) | JGA* (Sample 2) |
+|---------|:---:|:---:|:---:|:---:|
+| 5-dpd   | 26.02 | 58.60 | 27.60 | 59.39 |
+| 10-dpd  | 33.26 | 70.14 | 34.95 | 77.94 |
+| 50-dpd  | 38.80 | 71.38 | 39.77 | 74.55 |
+| 100-dpd | 35.97 | 70.89 | 38.46 | 74.89 |
+| 125-dpd | 36.09 | 73.08 | 36.18 | 76.47 |
+| 250-dpd | 35.63 | 72.90 | 38.91 | 76.70 |
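
As a rough illustration of the demonstration-learning idea described under *Prompt Augmentation* above, the sketch below prepends a few hand-picked answered prompts to the actual prompt before querying the PLM. The prompt template and helper names are assumptions made for this example and do not correspond to the exact format used by `test_prompting.sh`; different choices of demonstrations are presumably what the *Sample 1* and *Sample 2* columns above correspond to.

```python
# Illustrative sketch of prompt augmentation (demonstration learning):
# a few hand-picked answered prompts are prepended to the actual prompt so
# the PLM can imitate the answer pattern. The template below is an assumed
# format, not the exact prompt layout used in this repository.

ANSWERED_PROMPTS = [
    # (dialog snippet, value, gold slot) -- hand-picked demonstrations
    ("i would like it to have a 3 star rating.", "3", "stars"),
    ("i need to be picked up from pizza hut city centre after 04:30", "04:30", "leave"),
]

def build_augmented_prompt(dialog_history: str, value: str) -> str:
    demos = "\n\n".join(
        f"dialog: {d}\nvalue: {v}\nslot: {s}" for d, v, s in ANSWERED_PROMPTS
    )
    query = f"dialog: {dialog_history}\nvalue: {value}\nslot:"
    return demos + "\n\n" + query

print(build_augmented_prompt(
    "i'll be arriving saturday and staying for 3 nights.", "saturday"))
```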
> **Note:** All the generated output files for the above reported results are available in this repository. Check [outputs/multi-prompt](outputs/multi-prompt) directory to see the output JSON files for each data-split.
-
## Analysis
Analyses of the results and belief state generations (outputs) can be found [here](ANALYSIS.md).
\ No newline at end of file