
Add two popular datasets for character level LM #254

Open
entron wants to merge 2 commits into karpathy:master from entron:master

Conversation

@entron entron commented Apr 25, 2023

Added data preparation and example training config scripts for two popular datasets: text8 and enwik8.
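As a rough illustration of what such a prepare step involves (this is a hedged sketch in the style of nanoGPT's existing data/shakespeare_char/prepare.py, not the PR's actual scripts; the download URLs and exact split ratios used in the PR are not shown here):

```python
# Minimal sketch of character-level dataset preparation: build a char
# vocabulary, encode the raw text as uint16 token ids, and split into
# train/val arrays. Assumes the raw text8/enwik8 text is already loaded.
import numpy as np

def prepare_char_dataset(data: str, train_frac: float = 0.9):
    """Build a char vocab and encode the text into uint16 id arrays."""
    chars = sorted(set(data))
    stoi = {ch: i for i, ch in enumerate(chars)}
    itos = {i: ch for ch, i in stoi.items()}
    ids = np.array([stoi[c] for c in data], dtype=np.uint16)
    n = int(train_frac * len(ids))
    meta = {"vocab_size": len(chars), "stoi": stoi, "itos": itos}
    return ids[:n], ids[n:], meta

# A per-dataset prepare.py would then dump the arrays for train.py, e.g.:
#   train_ids.tofile("train.bin"); val_ids.tofile("val.bin")
# and pickle `meta` to meta.pkl so sample.py can decode generations.
```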

entron (Author) commented Apr 25, 2023

Also tried out feeding the output of each layer back into the same layer multiple times in the 2nd commit.
For the shakespeare_char dataset, this actually gives a better val loss of 1.4543 with only 1 layer and 1.8M parameters.
For bigger datasets such as text8, this also gives better results when the
number of parameters is the same. Haven't tested on GPT-2 yet.
The 2nd commit may not be so relevant though.
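The idea described above can be sketched as weight-tied layer reuse: each block's output is fed back through the same block several times, so effective depth grows while the parameter count stays fixed. This is a hedged toy sketch, not the commit's actual code; the name `n_layer_update` is taken from the review comment below, and the simplified `Block` stands in for nanoGPT's transformer block.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """Toy residual block standing in for a transformer block."""
    def __init__(self, n_embd: int):
        super().__init__()
        self.ln = nn.LayerNorm(n_embd)
        self.mlp = nn.Linear(n_embd, n_embd)

    def forward(self, x):
        return x + self.mlp(self.ln(x))  # residual update

class RecurrentStack(nn.Module):
    """Apply each block n_layer_update times with shared weights."""
    def __init__(self, n_layer: int, n_layer_update: int, n_embd: int):
        super().__init__()
        self.blocks = nn.ModuleList(Block(n_embd) for _ in range(n_layer))
        self.n_layer_update = n_layer_update

    def forward(self, x):
        for block in self.blocks:
            for _ in range(self.n_layer_update):  # weight-tied reuse
                x = block(x)
        return x
```

Because the inner loop reuses the same module, increasing `n_layer_update` adds compute and effective depth but no parameters, which is consistent with the 1-layer / 1.8M-parameter result reported above.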

karpathy (Owner) commented:
That's nice, but I'd prefer we keep n_layer_update separate

entron (Author) commented Apr 26, 2023

I have removed the 2nd commit.

Andrei-Aksionov commented:
Maybe it's about time to have a separate .py file with the shared logic?
All the prepare.py files for shakespeare and these two new datasets basically do the same thing.
I understand that it's sometimes better to have some code duplication for the sake of simplicity and ease of understanding, but this is not the case here (in my opinion).

I am open to hearing why I am wrong (again 😄).
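The refactor suggested above might look something like the following hedged sketch; the module name, the `DatasetSpec` fields, and the per-dataset call pattern are all hypothetical, not anything in the repo:

```python
# Hypothetical shared prepare_common.py: each dataset's prepare.py
# (shakespeare_char, text8, enwik8) would shrink to one call into this,
# passing only the bits that differ per dataset.
from dataclasses import dataclass

@dataclass
class DatasetSpec:
    name: str          # e.g. "text8" or "enwik8" (hypothetical field)
    url: str           # where the raw text would be downloaded from
    train_frac: float = 0.9

def prepare(spec: DatasetSpec, text: str) -> dict:
    """Shared core each per-dataset script would delegate to."""
    chars = sorted(set(text))
    n = int(spec.train_frac * len(text))
    return {"vocab_size": len(chars), "train_len": n,
            "val_len": len(text) - n}
```

Each per-dataset prepare.py would then only define its `DatasetSpec` and call `prepare`, removing the duplicated vocab-building and splitting code the comment objects to.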
