diff --git a/ANALYSIS.md b/ANALYSIS.md
index 55b47a8..3ba1a80 100644
--- a/ANALYSIS.md
+++ b/ANALYSIS.md
@@ -9,8 +9,8 @@ The baseline SOLOIST is fine-tuned on different data splits to evaluate the perf
The belief state prediction task of SOLOIST utilizes *top-k* and *top-p* sampling to generate the belief state slots and values. Since the baseline SOLOIST uses open-ended generation, it is susceptible to generating random slot-value pairs that are not relevant to the dialog history. Below is an example where the baseline model generated a slot-value pair that is irrelevant to the user's goals while completely missing two correct slot-value pairs (a decoding sketch follows the table).
-| Dialog History | True belief states | Generated belief states |
-| ----- | ----- | ----- |
+| Dialog History | True belief states | Generated belief states |
+|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------|-----------------------------|
| **user:** we need to find a guesthouse of moderate price. <br> **system:** do you have any special area you would like to stay? or possibly a star request for the guesthouse? <br> **user:** i would like it to have a 3 star rating. | type = guesthouse <br> pricerange = moderate <br> stars = 3 | parking = yes <br> stars = 3 |
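
For reference, the snippet below is a minimal sketch of this open-ended decoding setup using the HuggingFace `transformers` API. The `gpt2` checkpoint and the prompt format are illustrative stand-ins, not the actual fine-tuned SOLOIST model.

```python
# Minimal sketch of open-ended belief state decoding with top-k / top-p
# sampling. The checkpoint and prompt below are illustrative stand-ins,
# not the fine-tuned SOLOIST model.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

history = ("user: we need to find a guesthouse of moderate price. "
           "system: do you have any special area you would like to stay? "
           "user: i would like it to have a 3 star rating.")
inputs = tokenizer(history + " => belief state:", return_tensors="pt")

# do_sample=True with top_k/top_p makes decoding open-ended: nothing
# constrains the sampled tokens to valid slot names, which is how
# irrelevant pairs such as `parking = yes` can appear.
outputs = model.generate(**inputs, do_sample=True, top_k=50, top_p=0.9,
                         max_new_tokens=40,
                         pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```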
@@ -18,23 +18,23 @@ The belief state prediction task of SOLOIST utilizes *top-k* and *top-p* samplin
### Prompt-based Methods
#### Value-based prompt & Inverse prompt
-Value-based prompt utilizes the dialog history and value to generate corresponding slots. This approach doesn't rely on the ontology of the slots. While training, both value-based prompts and inverse prompts are used to compute the training loss. The inverse prompt mechanism helped complementing the value-based prompt in generating the correct slots. It's worth mentioning that there's a 5-10% drop (depending on the data split trained on) in the JGA score when inverse prompt mechanism is not applied during training.
+The value-based prompt utilizes the dialog history and a value to generate the corresponding slot, so this approach doesn't rely on a slot ontology. During training, both value-based prompts and inverse prompts are used to compute the training loss. The inverse prompt mechanism helped complement the value-based prompt in generating the correct slots, especially under low-resource data splits.
The experimental results show a significant difference in performance between the baseline SOLOIST and the prompt-based methods: the prompt-based methods significantly outperformed the baseline model under low-resource settings (*5-dpd*, *10-dpd* and *50-dpd*).
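
As a rough sketch of how the two prompts might be combined during training: the prompt templates, the lack of loss masking, and the weight `w` below are assumptions for illustration, not the repository's exact implementation.

```python
def joint_prompt_loss(model, tokenizer, history, value, slot, w=0.3):
    """Hypothetical joint training loss: a value-based prompt asks the
    model for the slot, an inverse prompt asks it for the value, and the
    two causal-LM losses are mixed with an assumed weight w."""
    value_prompt = f"{history} The value '{value}' belongs to slot"
    inverse_prompt = f"{history} The slot '{slot}' has value"

    def lm_loss(prompt, target):
        # Standard HuggingFace causal-LM loss over the whole sequence;
        # a real implementation would likely mask the prompt tokens.
        ids = tokenizer(prompt + " " + target, return_tensors="pt").input_ids
        return model(input_ids=ids, labels=ids).loss

    return lm_loss(value_prompt, slot) + w * lm_loss(inverse_prompt, value)
```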
#### destination vs departure & leave vs arrive
Under low-resource settings, the prompt-based model struggled to generate slots like *departure*/*destination* and *leave*/*arrive*. For many instances, it wrongly generated *destination* instead of *departure*, and vice versa. Below is one example where the slots are wrongly generated.
-| Dialog History | True belief states | Generated belief states |
-|-------------------------------------------------------------------------| ----- | ----- |
+| Dialog History | True belief states | Generated belief states |
+|-------------------------------------------------------------------------|-----------------------------------------------------|--------------------------------------------------------|
| **user:** I need to be picked up from pizza hut city centre after 04:30 | leave = 04:30 <br> departure = pizza hut city centre | arrive = 04:30 <br> destination = pizza hut city centre |
#### Repeated values
Since the value-based prompt generates slots from their corresponding values, it cannot generate separate slots for repeated values; only one slot can be generated per distinct value. Consider the following example:
-| Dialog History | True belief states |
-| ----- | ----- |
-| **user:** hi, can you help me find a 3 star place to stay? <br> **system:** Is there a particular area or price range you would like? <br> **user:** how about a place in the centre of town that is of type hotel <br> **system:** how long would you like to stay, and how many are in your party? <br> **user:** I'll be arriving saturday and staying for 3 nights. there are 3 of us.| area = centre <br> stars = 3 <br> type = hotel <br> day = saturday <br> people = 3 <br> stay = 3|
+| Dialog History | True belief states |
+|----------------|--------------------|
+| **user:** hi, can you help me find a 3 star place to stay? <br> **system:** Is there a particular area or price range you would like? <br> **user:** how about a place in the centre of town that is of type hotel <br> **system:** how long would you like to stay, and how many are in your party? <br> **user:** I'll be arriving saturday and staying for 3 nights. there are 3 of us. | area = centre <br> stars = 3 <br> type = hotel <br> day = saturday <br> people = 3 <br> stay = 3 |
Using the value-based prompt, the repeated value `3` in the above example can generate only one slot, as the word with the highest probability is picked as the generated slot. This suggests that the existing belief state annotations don't work well with the value-based prompt.
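
A toy sketch makes the collision concrete: if each distinct value string can map to only one slot, two of the six annotated slots are unrecoverable. This is illustrative Python, not the repository's code.

```python
# Toy illustration of the repeated-value problem: the value-based prompt
# maps each distinct value string to a single best-scoring slot, so the
# value "3" can recover only one of stars/people/stay.
true_belief = {"area": "centre", "stars": "3", "type": "hotel",
               "day": "saturday", "people": "3", "stay": "3"}

generated = {}
for slot, value in true_belief.items():
    # Value-keyed: a repeated value overwrites any earlier slot,
    # mimicking "the word with the highest probability is picked".
    generated[value] = slot

recovered = {slot: value for value, slot in generated.items()}
print(recovered)  # only 4 of the 6 slots survive
print(len(true_belief) - len(recovered), "slots lost to repeated values")  # 2
```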
diff --git a/README.md b/README.md
index 1a7dd3f..3305453 100644
--- a/README.md
+++ b/README.md
@@ -260,14 +260,8 @@ python evaluate.py -o path/to/outputs/file
```
### Results from prompt-based belief state generations
-|data-split| JGA | JGA* |
-|--|:--:|:--:|
-| 5-dpd | 30.66 | 71.04 |
-| 10-dpd | 42.65 | 86.43 |
-| 50-dpd | 47.06 | 91.63 |
-| 100-dpd | **47.74** | **92.31** |
-| 125-dpd | 46.49 | 91.86 |
-| 250-dpd | 47.06 | 92.08 |
+
+
+| Dataset | JGA (w = 0.1) | JGA* (w = 0.1) | JGA (w = 0.3) | JGA* (w = 0.3) | JGA (w = 0.5) | JGA* (w = 0.5) | JGA (w = 0.7) | JGA* (w = 0.7) |
+|---------|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
+| 5-dpd   | 30.66 | 71.04 | 31.67 | 73.19 | 30.77 | 72.85 | 29.98 | 70.93 |
+| 10-dpd  | 42.65 | 86.43 | 41.18 | 83.48 | 40.05 | 80.77 | 40.38 | 85.18 |
+| 50-dpd  | 47.06 | 91.63 | 46.49 | 91.18 | 47.04 | 91.18 | 46.27 | 90.05 |
+| 100-dpd | 47.74 | 92.31 | 48.42 | 92.42 | 48.19 | 92.65 | 48.30 | 92.65 |
+| 125-dpd | 46.49 | 91.86 | 46.15 | 91.18 | 46.83 | 91.74 | 46.15 | 90.95 |
+| 250-dpd | 47.06 | 92.08 | 47.62 | 92.65 | 47.40 | 92.31 | 47.17 | 92.09 |
> **Note:** All the generated output files for the above reported results are available in this repository. Check [outputs/prompt-learning](outputs/prompt-learning) directory to see the output JSON files for each data-split.
@@ -306,6 +300,18 @@ Script for generating belief states (slots) using prompt-ensemble remains the sa
sh test_prompting.sh -m
```
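
Conceptually, a prompt ensemble scores each candidate under several prompt phrasings and averages the resulting distributions. The sketch below shows one plausible way to do this; the templates and the uniform averaging are assumptions, not the repository's exact implementation.

```python
import torch

def ensemble_slot_scores(model, tokenizer, history, value, templates):
    """Hypothetical prompt ensemble: score the next token under several
    prompt phrasings and average the probability distributions."""
    probs = []
    for template in templates:
        prompt = template.format(history=history, value=value)
        ids = tokenizer(prompt, return_tensors="pt").input_ids
        with torch.no_grad():
            logits = model(input_ids=ids).logits[0, -1]  # next-token logits
        probs.append(torch.softmax(logits, dim=-1))
    return torch.stack(probs).mean(dim=0)                # uniform average

# Illustrative templates (assumed, not the repository's exact prompts):
templates = [
    "{history} The value '{value}' belongs to slot",
    "{history} '{value}' is the value of slot",
]
```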
+#### Results from Prompt Ensembling
+
+| Dataset | JGA | JGA* |
+|---------|-------|-------|
+| 5-dpd | 30.09 | 69.23 |
+| 10-dpd | 42.84 | 86.99 |
+| 50-dpd | 47.62 | 91.74 |
+| 100-dpd | 48.08 | 93.10 |
+| 125-dpd | 46.96 | 92.08 |
+| 250-dpd | 48.30 | 93.44 |
+
+
### Prompt Augmentation
Prompt Augmentation, also called *demonstration learning*, provides a few additional *answered prompts* that demonstrate to the PLM how the actual prompt slot can be answered. The answered prompts are hand-crafted and selected manually. Experiments are performed on different sets of *answered prompts*.
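
As a rough illustration, the snippet below builds a demonstration-augmented prompt. The answered prompts and the template are invented for illustration; they are not the hand-picked samples used in the experiments.

```python
# Hypothetical demonstration-augmented prompt: hand-picked answered
# prompts are prepended so the PLM can imitate the answer format.
answered_prompts = [
    "history: i want a cheap hotel. The value 'cheap' belongs to slot pricerange.",
    "history: book it for 2 people. The value '2' belongs to slot people.",
]

def augment_prompt(history, value):
    demos = "\n".join(answered_prompts)
    return f"{demos}\nhistory: {history} The value '{value}' belongs to slot"

print(augment_prompt("i need a 3 star guesthouse.", "3"))
```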
@@ -315,20 +321,13 @@ Edit the [test_prompting.sh](prompt-learning/test_prompting.sh) file and add `--
sh test_prompting.sh -m
```
-### Results from multi-prompt methods
-|data-split| JGA | JGA* |
-|--|:--:|:--:|
-| 5-dpd | 30.09 | 69.23 |
-| 10-dpd | 42.84 | 86.99 |
-| 50-dpd | 47.62 | 91.74 |
-| 100-dpd | **48.08** | **92.87** |
-| 125-dpd | 46.96 | 92.08 |
-| 250-dpd | **48.08** | **92.87** |
+#### Results from Prompt Augmentation
+| Data | JGA (Sample 1) | JGA* (Sample 1) | JGA (Sample 2) | JGA* (Sample 2) |
+|---------|:---:|:---:|:---:|:---:|
+| 5-dpd   | 26.02 | 58.60 | 27.60 | 59.39 |
+| 10-dpd  | 33.26 | 70.14 | 34.95 | 77.94 |
+| 50-dpd  | 38.80 | 71.38 | 39.77 | 74.55 |
+| 100-dpd | 35.97 | 70.89 | 38.46 | 74.89 |
+| 125-dpd | 36.09 | 73.08 | 36.18 | 76.47 |
+| 250-dpd | 35.63 | 72.90 | 38.91 | 76.70 |
> **Note:** All the generated output files for the above reported results are available in this repository. Check [outputs/multi-prompt](outputs/multi-prompt) directory to see the output JSON files for each data-split.
-
## Analysis
Analyses of the results and belief state generations (outputs) can be found [here](ANALYSIS.md).
\ No newline at end of file