@ -5,10 +5,12 @@ Repository for my master thesis at the University of Stuttgart (IMS).
Refer to the thesis [proposal](proposal/proposal_submission_1st.pdf) for a detailed explanation of the experiments.
## Dataset
MultiWOZ 2.1 [dataset](https://github.com/budzianowski/multiwoz/blob/master/data/MultiWOZ_2.1.zip) is used for training and evaluation of the baseline/prompt-based methods. MultiWOZ is a fully-labeled dataset with a collection of human-human written conversations spanning over multiple domains and topics. Only single-domain dialogues are used in this setup for training and testing. Each dialogue contains multiple turns and may also contain a sub-domain *booking*. Five domains - *Hotel, Train, Restaurant, Attraction, Taxi* are used in the experiments and excluded the other two domains as they only appear in the training set. Under few-shot settings, only a portion of the training data is utilized to measure the performance of the DST task in a low-resource scenario. Dialogues are randomly picked for each domain. The below table contains some statistics of the dataset and data splits for the few-shot experiments.
The MultiWOZ 2.1 [dataset](https://github.com/budzianowski/multiwoz/blob/master/data/MultiWOZ_2.1.zip) is used to train and evaluate the baseline and prompt-based methods. MultiWOZ is a fully labeled collection of human-human written conversations spanning multiple domains and topics. Only single-domain dialogues are used for training and testing in this setup. Each dialogue contains multiple turns and may also contain the subdomain *booking*. Five domains (*Hotel, Train, Restaurant, Attraction, Taxi*) are used in the experiments; the other two domains are excluded because they appear only in the training set. Under few-shot settings, only a portion of the training data is used, in order to measure the performance of the dialogue state tracking (DST) task in a low-resource scenario. Dialogues are picked at random for each domain. The table below gives statistics of the dataset and the data splits for the few-shot experiments.
| Data Split | # Dialogues | # Total Turns |
|--|:--:|:--:|
| 5-dpd | 25 | 100 |
| 10-dpd | 50 | 234 |
| 50-dpd | 250 | 1114 |
| 100-dpd | 500 | 2292 |
| 125-dpd | 625 | 2831 |
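
The per-domain random selection described above can be sketched as follows. This is a minimal illustration, not the script used in this repository: it reads `dpd` as "dialogues per domain" (which matches the table, since 5 domains x 5 dialogues = 25), and the function name `sample_few_shot_split`, the input layout, the seed, and the toy dialogue IDs are all hypothetical.

```python
import random

DOMAINS = ["hotel", "train", "restaurant", "attraction", "taxi"]


def sample_few_shot_split(dialogues_by_domain, dialogues_per_domain, seed=0):
    """Randomly pick `dialogues_per_domain` dialogue IDs for every domain.

    `dialogues_by_domain` is assumed to map a domain name to a list of
    single-domain dialogue IDs (hypothetical layout, for illustration only).
    """
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    split = {}
    for domain in DOMAINS:
        candidates = sorted(dialogues_by_domain[domain])
        # Sample without replacement so no dialogue is picked twice.
        split[domain] = rng.sample(candidates, dialogues_per_domain)
    return split


# Toy data: 20 placeholder dialogue IDs per domain.
toy_data = {d: [f"{d}_{i:03d}" for i in range(20)] for d in DOMAINS}
split_5_dpd = sample_few_shot_split(toy_data, dialogues_per_domain=5)
total_dialogues = sum(len(ids) for ids in split_5_dpd.values())
print(total_dialogues)  # 5 domains x 5 dialogues = 25
```

Fixing the seed keeps a split stable across runs, which makes results for the same split name comparable between experiments.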
@ -113,7 +115,7 @@ Train a separate model for each data split. Edit the [train_baseline.sh](baselin
```shell
sh train_baseline.sh -d <data-split-name>
```
Pass the data split name to `-d` flag. Possible values are: `50-dpd`, `100-dpd`, `125-dpd`, `250-dpd`
Pass the data split name to the `-d` flag. Possible values are `5-dpd`, `10-dpd`, `50-dpd`, `100-dpd`, `125-dpd`, and `250-dpd`.
Example training command: `sh train_baseline.sh -d 50-dpd`
@ -130,7 +132,7 @@ Generate belief states by running decode script
```shell
sh decode_baseline.sh
```
The generated predictions are saved under `OUTPUTS_DIR_BASELINE` folder. Some of the generated belief state predictions are uploaded to this repository and can found under [outputs](outputs) folder.
The generated predictions are saved under the `OUTPUTS_DIR_BASELINE` folder. Some generated belief state predictions are uploaded to this repository and can be found under the [outputs](outputs) folder.
### Baseline Evaluation
@ -140,12 +142,13 @@ Edit the [evaluate.py](baseline/evaluate.py) to set the predictions output file
```shell
python evaluate.py
```
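
For readers unfamiliar with the metric reported below, joint goal accuracy (JGA) counts a turn as correct only when the whole predicted belief state matches the gold belief state exactly. The following is a minimal sketch of that definition, not the code in `evaluate.py`: it assumes each belief state is a dictionary of slot names to values, and the function name and the toy slots are hypothetical.

```python
def joint_goal_accuracy(predictions, references):
    """Percentage of turns whose predicted belief state equals the gold one."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    # A turn counts only if every slot and value matches exactly.
    correct = sum(1 for pred, gold in zip(predictions, references) if pred == gold)
    return 100.0 * correct / len(references)


# Toy example with illustrative slot names (one dict per dialogue turn).
gold = [
    {"hotel-area": "north"},
    {"hotel-area": "north", "hotel-stars": "4"},
    {"train-day": "friday"},
    {"taxi-destination": "airport"},
]
pred = [
    {"hotel-area": "north"},
    {"hotel-area": "north", "hotel-stars": "3"},  # one wrong value fails the turn
    {"train-day": "friday"},
    {"taxi-destination": "airport", "taxi-leaveat": "17:00"},  # extra slot fails too
]
print(joint_goal_accuracy(pred, gold))  # 50.0
```

Because a single wrong or extra slot invalidates the whole turn, JGA is a strict metric, which is one reason scores stay low in the smallest data splits.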
#### Preliminary results of baseline evaluation
#### Results from baseline evaluation
| Data Split | JGA (%) |
|--|:--:|
| 5-dpd | 9.06 |
| 10-dpd | 14.20 |
| 50-dpd | 28.64 |
| 100-dpd | 33.11 |
| 125-dpd | 35.79 |
| 250-dpd | 40.38 |
> Note: The above preliminary results will change based on further experiments