Added more baseline outputs, updated README

main
Pavan Mandava 3 years ago
parent 9be77c2de6
commit 1fe3d89476

@@ -5,10 +5,12 @@ Repository for my master thesis at the University of Stuttgart (IMS).
Refer to this thesis [proposal](proposal/proposal_submission_1st.pdf) document for a detailed explanation of the thesis experiments.
## Dataset
MultiWOZ 2.1 [dataset](https://github.com/budzianowski/multiwoz/blob/master/data/MultiWOZ_2.1.zip) is used for training and evaluation of the baseline/prompt-based methods. MultiWOZ is a fully-labeled dataset with a collection of human-human written conversations spanning multiple domains and topics. Only single-domain dialogues are used in this setup for training and testing. Each dialogue contains multiple turns and may also contain a subdomain *booking*. Five domains - *Hotel, Train, Restaurant, Attraction, Taxi* - are used in the experiments; the other two domains are excluded as they only appear in the training set. Under few-shot settings, only a portion of the training data is utilized to measure the performance of the DST task in a low-resource scenario. Dialogues are randomly picked for each domain. The table below contains some statistics of the dataset and the data splits for the few-shot experiments.
| Data Split | # Dialogues | # Total Turns |
|--|:--:|:--:|
| 5-dpd | 25 | 100 |
| 10-dpd | 50 | 234 |
| 50-dpd | 250 | 1114 |
| 100-dpd | 500 | 2292 |
| 125-dpd | 625 | 2831 |
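
As an illustration of how such few-shot splits could be constructed, below is a minimal sketch that samples N single-domain dialogues per domain. The file name `data.json`, the layout of the `goal` field, and the `sample_few_shot` helper are assumptions made for this example; they are not the repository's preprocessing code.

```python
import json
import random

# Domains used in the experiments; the other two MultiWOZ domains are excluded.
DOMAINS = ["hotel", "train", "restaurant", "attraction", "taxi"]

def sample_few_shot(data_path, dialogues_per_domain, seed=42):
    """Randomly pick N single-domain dialogues per domain (illustrative only).

    Assumes a MultiWOZ-style JSON where each dialogue has a `goal` dict whose
    non-empty domain entries indicate which domains the dialogue covers.
    """
    with open(data_path, "r", encoding="utf-8") as f:
        data = json.load(f)

    random.seed(seed)
    per_domain = {d: [] for d in DOMAINS}
    for dial_id, dialogue in data.items():
        active = [d for d in DOMAINS if dialogue.get("goal", {}).get(d)]
        if len(active) == 1:  # keep only single-domain dialogues
            per_domain[active[0]].append(dial_id)

    # e.g. dialogues_per_domain=50 yields the 250-dialogue 50-dpd split
    return {
        domain: random.sample(ids, min(dialogues_per_domain, len(ids)))
        for domain, ids in per_domain.items()
    }
```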
@@ -113,7 +115,7 @@ Train a separate model for each data split. Edit the [train_baseline.sh](baselin
```shell
sh train_baseline.sh -d <data-split-name>
```
Pass the data split name to the `-d` flag. Possible values are: `5-dpd`, `10-dpd`, `50-dpd`, `100-dpd`, `125-dpd`, `250-dpd`
Example training command: `sh train_baseline.sh -d 50-dpd`
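
To train models for every split in one go, a small driver loop could wrap the same script. This is a hypothetical convenience wrapper, assuming it is run from the directory that contains `train_baseline.sh`:

```python
import subprocess

# Data splits accepted by the -d flag of train_baseline.sh
SPLITS = ["5-dpd", "10-dpd", "50-dpd", "100-dpd", "125-dpd", "250-dpd"]

for split in SPLITS:
    # Equivalent to running: sh train_baseline.sh -d <split>
    subprocess.run(["sh", "train_baseline.sh", "-d", split], check=True)
```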
@@ -130,7 +132,7 @@ Generate belief states by running the decode script
```shell
sh decode_baseline.sh
```
The generated predictions are saved under the `OUTPUTS_DIR_BASELINE` folder. Some of the generated belief state predictions are uploaded to this repository and can be found under the [outputs](outputs) folder.
### Baseline Evaluation
@@ -140,12 +142,13 @@ Edit the [evaluate.py](baseline/evaluate.py) to set the predictions output file
```shell
python evaluate.py
```
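
For context, Joint Goal Accuracy (JGA) counts a turn as correct only when the complete predicted belief state matches the gold belief state exactly (the table below reports it as a percentage). The sketch below shows the metric in its usual form; the per-turn representation as `(slot, value)` pairs is an assumption for illustration and not necessarily how `evaluate.py` stores predictions.

```python
def joint_goal_accuracy(predictions, references):
    """Fraction of turns whose predicted belief state equals the gold state exactly.

    Both arguments are lists with one entry per turn, each entry being an
    iterable of (domain-slot, value) pairs; this layout is assumed here.
    """
    assert len(predictions) == len(references)
    correct = sum(
        1 for pred, gold in zip(predictions, references)
        if set(pred) == set(gold)
    )
    return correct / len(references) if references else 0.0

# Example: one of two turns fully correct -> JGA = 0.5
pred = [[("hotel-area", "north")], [("hotel-area", "north"), ("hotel-stars", "4")]]
gold = [[("hotel-area", "north")], [("hotel-area", "north"), ("hotel-stars", "3")]]
print(joint_goal_accuracy(pred, gold))  # 0.5
```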
#### Results from baseline evaluation
| Data Split | JGA (%) |
|--|:--:|
| 5-dpd | 9.06 |
| 10-dpd | 14.20 |
| 50-dpd | 28.64 |
| 100-dpd | 33.11 |
| 125-dpd | 35.79 |
| 250-dpd | 40.38 |
> Note: The above preliminary results will change based on further experiments.

File diff suppressed because it is too large

File diff suppressed because it is too large