
Commit 2463179

Merge pull request #196 from GRIT621/main: unid2t_code

2 parents 90a76ef + 4fad00c

115 files changed: 12,275 additions, 1 deletion

Some content (including several file names) is hidden by default for large commits.

unid2t/.gitignore

Lines changed: 3 additions & 0 deletions

@@ -0,0 +1,3 @@
+data_preprocess/__pycache__/
+.idea/
+tools/__pycache__/

unid2t/README.md

Lines changed: 49 additions & 1 deletion

@@ -1 +1,49 @@
-The code is currently in the approval process, and the full version will be announced on subsequent dates as soon as possible.
+[//]: # (#Unified Data-to-Text Pretraining)
+
+
+## Unified Structured Data as Graph for Data-to-Text Pretraining
+
+## Prepare Environment
+You can create an environment for UniD2T and install the required Python packages with:
+```
+pip install -r requirements.txt
+```
+
+
+## Data Preprocessing
+You can download the original data from the original websites:
+[ToTTo](https://github.com/google-research-datasets/ToTTo),
+[CoSQL](https://yale-lily.github.io/cosql),
+[WebNLG](https://gitlab.com/shimorina/webnlg-dataset/-/tree/master/release_v3.0),
+[DART](https://github.com/Yale-LILY/DART),
+[WikiBio](https://rlebret.github.io/wikipedia-biography-dataset/),
+[WikiTableT](https://github.com/mingdachen/WikiTableT).
+
+Then put the downloaded data in the ```/orig_datasets/``` directory and use the code in ```/data_preprocess/``` to process each dataset. The processed data will be saved in ```cleanout_datasets```; for example, for the ToTTo dataset:
+```
+python /data_preprocess/totto/convert_totto_to_unified_graph.py
+```
+
+## Pretrain
+Merge the data processed in the previous step:
+```
+python /data_preprocess/convert_totto_to_unified_graph.py
+```
+Pre-training on multiple GPUs:
+```
+torchrun \
+    --nproc_per_node=4 \
+    ./pretrain.py \
+    --config /pretrain_config/**.yml
+```
+
+
+## Finetune
+Finetune on a single GPU:
+```
+python finetune.py --config config/**.yml
+```
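For illustration, the README steps above can also be driven from a small Python script. This is only a sketch: it shells out to the exact commands shown in the README, and the finetune config name is a hypothetical placeholder for whichever `config/*.yml` is actually used.

```python
# Minimal sketch of the README workflow driven from Python via subprocess.
# It reuses the commands shown above; "config/totto.yml" is a hypothetical
# placeholder, not a file guaranteed to exist in this commit.
import subprocess

# 1. Preprocess the raw ToTTo download (placed under /orig_datasets/) into the
#    unified-graph format; output is written to cleanout_datasets.
subprocess.run(
    ["python", "/data_preprocess/totto/convert_totto_to_unified_graph.py"],
    check=True,
)

# 2. Finetune on a single GPU with one of the YAML configs from this commit.
subprocess.run(
    ["python", "finetune.py", "--config", "config/totto.yml"],
    check=True,
)
```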
Lines changed: 54 additions & 0 deletions

@@ -0,0 +1,54 @@
+# basic
+seed: 42
+device: 'cuda'
+model_name: 't5'
+datatype: 'linear'
+enable_uda_relative_pos: False
+# tokenizer_path: '/data/nt12_ssd_gluster/myself/pretrained_models/t5-small'
+tokenizer_path: 't5-large'
+special_token_path: '/root/data/cleanout_datasets/special_tokens.txt'
+data_processor: 'linear'
+# task_source_prefix: 'Describe the following data: '
+modified_default_plm_config: True
+plms_dropout_rate: 0.1
+
+# training
+train_type: 'finetune'
+dist_train: False
+experiment_name: 'finetuning_t5_base_on_cosql_2e-4'
+init_model_path: 't5-large'
+max_epochs: 80
+max_steps: -1
+early_stopping_patience: 8
+start_eval_from: 0
+eval_every: 1
+max_keep_checkpoints: -1
+report_every: 100
+saved_dir: '/root/data/guanbao/finetuning/cosql_T5large_linear'
+
+learner: fairseq_adafactor
+learning_rate: 2e-04
+adam_epsilon: 0.00000001
+max_grad_norm: 2.0
+lr_scheduler: 'none'
+warmup_steps: 0
+
+# training data
+train_file_src: '/root/data/cleanout_datasets/cosql_with_unified_graph/cosql_train.json'
+train_n_example: -1
+train_batch_size: 16
+max_source_length: 1024
+max_target_length: -1
+train_num_workers: 5
+
+
+# evaluate data
+eval_noise_data: False
+val_metric: bleu
+eval_file_src: '/root/data/cleanout_datasets/cosql_with_unified_graph/cosql_dev.json'
+eval_n_example: -1
+eval_batch_size: 32
+num_beams: 5
+eval_max_source_length: 1024
+eval_max_target_length: 128
+eval_num_workers: 5
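Before launching a run, it can help to sanity-check a config like the one above. The snippet below is a sketch that loads it with PyYAML and prints the fields that typically change between experiments; the file name is an assumed placeholder, while the keys are taken from the YAML shown above.

```python
# Sketch: load a finetune config such as the CoSQL one above and print the
# settings that usually vary between runs. The path is an assumed placeholder;
# the key names match the YAML in this commit.
import yaml  # PyYAML

with open("finetune_config/cosql.yml") as f:  # hypothetical file name
    cfg = yaml.safe_load(f)

for key in ("experiment_name", "init_model_path", "learning_rate",
            "train_batch_size", "max_epochs", "train_file_src", "saved_dir"):
    print(f"{key}: {cfg.get(key)}")
```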
Lines changed: 54 additions & 0 deletions

@@ -0,0 +1,54 @@
+# basic
+seed: 42
+device: 'cuda'
+model_name: 't5'
+datatype: 'linear'
+enable_uda_relative_pos: False
+# tokenizer_path: '/data/nt12_ssd_gluster/myself/pretrained_models/t5-small'
+tokenizer_path: 't5-large'
+special_token_path: './cleanout_datasets/special_tokens.txt'
+data_processor: 'linear'
+# task_source_prefix: 'Describe the following data: '
+modified_default_plm_config: True
+plms_dropout_rate: 0.1
+
+# training
+train_type: 'finetune'
+dist_train: False
+experiment_name: 'finetuning_t5_base_on_dart_2e-4'
+init_model_path: 't5-large'
+max_epochs: 80
+max_steps: -1
+early_stopping_patience: 8
+start_eval_from: 0
+eval_every: 1
+max_keep_checkpoints: -1
+report_every: 100
+saved_dir: '/root/data/guanbao/finetuning/dart_T5large_linear'
+
+learner: fairseq_adafactor
+learning_rate: 2e-04
+adam_epsilon: 0.00000001
+max_grad_norm: 2.0
+lr_scheduler: 'none'
+warmup_steps: 0
+
+# training data
+train_file_src: '/root/data/cleanout_datasets/dart/dart-v1.1.1-full-train_with_unified_graph_simplified_and_lower_relationt.json'
+train_n_example: -1
+train_batch_size: 16
+max_source_length: 1024
+max_target_length: -1
+train_num_workers: 5
+
+
+# evaluate data
+eval_noise_data: False
+val_metric: bleu
+eval_file_src: '/root/data/cleanout_datasets/dart/dart-v1.1.1-full-dev_with_unified_graph_simplified_and_lower_relationt.json'
+eval_n_example: -1
+eval_batch_size: 32
+num_beams: 5
+eval_max_source_length: 1024
+eval_max_target_length: 128
+eval_num_workers: 5
Lines changed: 54 additions & 0 deletions

@@ -0,0 +1,54 @@
+# basic
+seed: 42
+device: 'cuda'
+model_name: 't5'
+datatype: 'linear'
+enable_uda_relative_pos: False
+# tokenizer_path: '/data/nt12_ssd_gluster/myself/pretrained_models/t5-small'
+tokenizer_path: 't5-large'
+special_token_path: './cleanout_datasets/special_tokens.txt'
+data_processor: 'linear'
+# task_source_prefix: 'Describe the following data: '
+modified_default_plm_config: True
+plms_dropout_rate: 0.1
+
+# training
+train_type: 'finetune'
+dist_train: False
+experiment_name: 'finetuning_t5_base_on_totto_2e-4'
+init_model_path: 't5-large'
+max_epochs: 80
+max_steps: -1
+early_stopping_patience: 8
+start_eval_from: 0
+eval_every: 1
+max_keep_checkpoints: -1
+report_every: 100
+saved_dir: '/root/data/guanbao/finetuning/totto_T5_large_linear'
+
+learner: fairseq_adafactor
+learning_rate: 2e-04
+adam_epsilon: 0.00000001
+max_grad_norm: 2.0
+lr_scheduler: 'none'
+warmup_steps: 0
+
+# training data
+train_file_src: '/root/data/cleanout_datasets/totto_with_unified_graph/totto_train_data.jsonl'
+train_n_example: -1
+train_batch_size: 16
+max_source_length: 1024
+max_target_length: -1
+train_num_workers: 5
+
+
+# evaluate data
+eval_noise_data: False
+val_metric: bleu
+eval_file_src: '/root/data/cleanout_datasets/totto_with_unified_graph/totto_dev_data.jsonl'
+eval_n_example: -1
+eval_batch_size: 32
+num_beams: 5
+eval_max_source_length: 1024
+eval_max_target_length: 128
+eval_num_workers: 5
Lines changed: 55 additions & 0 deletions

@@ -0,0 +1,55 @@
+# basic
+seed: 42
+device: 'cuda'
+model_name: 't5'
+datatype: 'linear'
+enable_uda_relative_pos: False
+# tokenizer_path: '/data/nt12_ssd_gluster/myself/pretrained_models/t5-small'
+tokenizer_path: 't5-large'
+special_token_path: '/root/data/liliang/experiments/UnifiedData2TextPretrain/cleanout_datasets/special_tokens.txt'
+data_processor: 'linear'
+# task_source_prefix: 'Describe the following data: '
+modified_default_plm_config: True
+plms_dropout_rate: 0.1
+
+# training
+train_type: 'finetune'
+dist_train: False
+experiment_name: 'finetuning_t5_large_on_webnlg17-4e-5'
+init_model_path: 't5-large'
+max_epochs: 10
+max_steps: -1
+early_stopping_patience: 8
+start_eval_from: 0
+eval_every: 1
+max_keep_checkpoints: -1
+report_every: 100
+saved_dir: '/root/data/guanbao/finetuning/webnlg17_T5_large_linear'
+
+learner: fairseq_adafactor
+learning_rate: 4e-5
+adam_epsilon: 0.00000001
+max_grad_norm: 2.0
+lr_scheduler: 'none'
+warmup_steps: 0
+
+
+# training data
+train_file_src: '/root/data/cleanout_datasets/cleanout_webnlg17/train.json'
+train_n_example: -1
+train_batch_size: 16
+max_source_length: 1024
+max_target_length: -1
+train_num_workers: 5
+
+
+# evaluate data
+eval_noise_data: False
+val_metric: bleu
+eval_file_src: '/root/data/cleanout_datasets/cleanout_webnlg17/test.json'
+eval_n_example: -1
+eval_batch_size: 32
+num_beams: 5
+eval_max_source_length: 1024
+eval_max_target_length: 128
+eval_num_workers: 5
Lines changed: 54 additions & 0 deletions

@@ -0,0 +1,54 @@
+# basic
+seed: 42
+device: 'cuda'
+model_name: 't5'
+datatype: 'linear'
+enable_uda_relative_pos: False
+# tokenizer_path: '/data/nt12_ssd_gluster/myself/pretrained_models/t5-small'
+tokenizer_path: 't5-large'
+special_token_path: './cleanout_datasets/special_tokens.txt'
+data_processor: 'linear'
+# task_source_prefix: 'Describe the following data: '
+modified_default_plm_config: True
+plms_dropout_rate: 0.1
+
+# training
+train_type: 'finetune'
+dist_train: False
+experiment_name: 'finetuning_t5_base_on_wikibio_2e-4'
+init_model_path: 't5-large'
+max_epochs: 80
+max_steps: -1
+early_stopping_patience: 8
+start_eval_from: 0
+eval_every: 1
+max_keep_checkpoints: -1
+report_every: 100
+saved_dir: '/root/data/guanbao/finetuning/wikibio_T5_large_linear'
+
+learner: fairseq_adafactor
+learning_rate: 2e-04
+adam_epsilon: 0.00000001
+max_grad_norm: 2.0
+lr_scheduler: 'none'
+warmup_steps: 0
+
+# training data
+train_file_src: '/root/data/cleanout_datasets/wikibio/train.json'
+train_n_example: -1
+train_batch_size: 16
+max_source_length: 1024
+max_target_length: -1
+train_num_workers: 5
+
+
+# evaluate data
+eval_noise_data: False
+val_metric: bleu
+eval_file_src: '/root/data/cleanout_datasets/wikibio/test.json'
+eval_n_example: -1
+eval_batch_size: 32
+num_beams: 5
+eval_max_source_length: 1024
+eval_max_target_length: 128
+eval_num_workers: 5
Lines changed: 55 additions & 0 deletions

@@ -0,0 +1,55 @@
+# basic
+seed: 42
+device: 'cuda'
+model_name: 't5'
+datatype: 'linear'
+enable_uda_relative_pos: False
+# tokenizer_path: '/data/nt12_ssd_gluster/myself/pretrained_models/t5-small'
+tokenizer_path: 't5-large'
+#special_token_path: '/root/data/cleanout_datasets/special_tokens.txt'
+special_token_path: '/root/data/cleanout_datasets/special_tokens.txt'
+data_processor: 'linear'
+# task_source_prefix: 'Describe the following data: '
+modified_default_plm_config: True
+plms_dropout_rate: 0.1
+
+# training
+train_type: 'finetune'
+dist_train: False
+experiment_name: 'finetuning_t5_base_on_wikitableT_2e-4'
+init_model_path: 't5-large'
+max_epochs: 80
+max_steps: -1
+early_stopping_patience: 8
+start_eval_from: 0
+eval_every: 1
+max_keep_checkpoints: -1
+report_every: 100
+saved_dir: '/root/data/guanbao/finetuning/wikitableT_T5_large_linear'
+
+learner: fairseq_adafactor
+learning_rate: 2e-04
+adam_epsilon: 0.00000001
+max_grad_norm: 2.0
+lr_scheduler: 'none'
+warmup_steps: 0
+
+# training data
+train_file_src: '/root/data/cleanout_datasets/WikitableT/train_udt.json'
+train_n_example: -1
+train_batch_size: 16
+max_source_length: 512
+max_target_length: 208
+train_num_workers: 5
+
+
+# evaluate data
+eval_noise_data: False
+val_metric: bleu
+eval_file_src: '/root/data/cleanout_datasets/WikitableT/dev_udt.json'
+eval_n_example: -1
+eval_batch_size: 32
+num_beams: 5
+eval_max_source_length: 512
+eval_max_target_length: 128
+eval_num_workers: 5
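The six finetune configs in this commit share almost all of their fields and differ mainly in dataset paths, experiment name, learning rate, epoch budget, and sequence lengths. A short sketch like the one below makes those differences explicit; the file names are assumed placeholders for wherever the YAML files live in the repository.

```python
# Sketch: show which settings differ between two of the finetune configs above
# (e.g., CoSQL vs. WebNLG17). File names are assumed placeholders.
import yaml

def load_config(path):
    with open(path) as f:
        return yaml.safe_load(f)

cosql = load_config("finetune_config/cosql.yml")       # hypothetical path
webnlg = load_config("finetune_config/webnlg17.yml")   # hypothetical path

for key in sorted(set(cosql) | set(webnlg)):
    if cosql.get(key) != webnlg.get(key):
        print(f"{key}: {cosql.get(key)!r} -> {webnlg.get(key)!r}")
```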
