Three minor updates to the Fine-Tuning Scheduler tutorial to better support the release of Lightning 1.9.x (supported by Fine-Tuning Scheduler 0.4.x)
speediedan committed Jan 25, 2023
commit aa838bbdbe0d4b0c8e056e937a2ca9b5a016f661
9 changes: 4 additions & 5 deletions lightning_examples/finetuning-scheduler/.meta.yml
@@ -1,20 +1,19 @@
title: Fine-Tuning Scheduler
author: "[Dan Dale](https://github.com/speediedan)"
created: 2021-11-29
-updated: 2022-09-20
+updated: 2023-01-24
license: CC BY-SA
build: 0
tags:
- Fine-Tuning
description: |
This notebook introduces the [Fine-Tuning Scheduler](https://finetuning-scheduler.readthedocs.io/en/stable/index.html) extension
-and demonstrates the use of it to fine-tune a small foundational model on the
+and demonstrates the use of it to fine-tune a small foundation model on the
[RTE](https://huggingface.co/datasets/viewer/?dataset=super_glue&config=rte) task of
[SuperGLUE](https://super.gluebenchmark.com/) with iterative early-stopping defined according to a user-specified
schedule. It uses Hugging Face's ``datasets`` and ``transformers`` libraries to retrieve the relevant benchmark data
-and foundational model weights. The required dependencies are installed via the finetuning-scheduler ``[examples]`` extra.
+and foundation model weights. The required dependencies are installed via the finetuning-scheduler ``[examples]`` extra.
requirements:
-- finetuning-scheduler[examples]>=0.3.0
-- datasets<2.8.0 # todo: AttributeError: module 'datasets.arrow_dataset' has no attribute 'Batch'
+- finetuning-scheduler[examples]>=0.4.0
accelerator:
- GPU
45 changes: 23 additions & 22 deletions lightning_examples/finetuning-scheduler/finetuning-scheduler.py
@@ -21,8 +21,8 @@
# <div style="display:inline" id="a1">
#
# Fundamentally, [Fine-Tuning Scheduler](https://finetuning-scheduler.readthedocs.io/en/stable/index.html) enables
-# scheduled, multi-phase, fine-tuning of foundational models. Gradual unfreezing (i.e. thawing) can help maximize
-# foundational model knowledge retention while allowing (typically upper layers of) the model to
+# scheduled, multi-phase, fine-tuning of foundation models. Gradual unfreezing (i.e. thawing) can help maximize
+# foundation model knowledge retention while allowing (typically upper layers of) the model to
# optimally adapt to new tasks during transfer learning [1, 2, 3](#f1)
#
# </div>
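As a minimal sketch of how this looks in code (assuming only the public ``FinetuningScheduler`` callback API and a user-supplied ``LightningModule``; none of this is taken verbatim from the tutorial), the callback is simply attached to a ``Trainer``:

```python
import pytorch_lightning as pl
from finetuning_scheduler import FinetuningScheduler

# With no ft_schedule provided, FinetuningScheduler generates a default schedule that
# gradually unfreezes (thaws) parameter groups in reverse depth order, one phase at a time.
trainer = pl.Trainer(callbacks=[FinetuningScheduler()])
# trainer.fit(my_lightning_module, datamodule=my_datamodule)  # hypothetical user objects
```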
@@ -111,7 +111,7 @@
#
#
#
-# The end-to-end example in this notebook ([Scheduled Fine-Tuning For SuperGLUE](#superglue)) uses [FinetuningScheduler](https://finetuning-scheduler.readthedocs.io/en/stable/api/finetuning_scheduler.fts.html#finetuning_scheduler.fts.FinetuningScheduler) in explicit mode to fine-tune a small foundational model on the [RTE](https://huggingface.co/datasets/viewer/?dataset=super_glue&config=rte) task of [SuperGLUE](https://super.gluebenchmark.com/).
+# The end-to-end example in this notebook ([Scheduled Fine-Tuning For SuperGLUE](#superglue)) uses [FinetuningScheduler](https://finetuning-scheduler.readthedocs.io/en/stable/api/finetuning_scheduler.fts.html#finetuning_scheduler.fts.FinetuningScheduler) in explicit mode to fine-tune a small foundation model on the [RTE](https://huggingface.co/datasets/viewer/?dataset=super_glue&config=rte) task of [SuperGLUE](https://super.gluebenchmark.com/).
# Please see the [official Fine-Tuning Scheduler documentation](https://finetuning-scheduler.readthedocs.io/en/stable/index.html) if you are interested in a similar [CLI-based example](https://finetuning-scheduler.readthedocs.io/en/stable/index.html#example-scheduled-fine-tuning-for-superglue) using the LightningCLI.
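To make "explicit mode" more concrete before the full example below, here is a hedged sketch of a user-defined schedule passed via the ``ft_schedule`` parameter; the phase layout and parameter-name patterns are invented placeholders, not the schedule used later in this notebook:

```python
import pytorch_lightning as pl
from finetuning_scheduler import FinetuningScheduler

# Hypothetical three-phase schedule: phase 0 thaws only the task head, later phases
# thaw progressively deeper layers. Parameter-name patterns are placeholders.
example_schedule = """
0:
  params:
  - model.classifier.*
1:
  params:
  - model.pooler.*
2:
  params:
  - model.encoder.*
"""
with open("example_ft_schedule.yaml", "w") as f:
    f.write(example_schedule)

trainer = pl.Trainer(callbacks=[FinetuningScheduler(ft_schedule="example_ft_schedule.yaml")])
```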

# %% [markdown]
@@ -147,23 +147,24 @@
#
# **Note:** Currently, [FinetuningScheduler](https://finetuning-scheduler.readthedocs.io/en/stable/api/finetuning_scheduler.fts.html#finetuning_scheduler.fts.FinetuningScheduler) supports the following strategy types:
#
-# - ``DP``
-# - ``DDP``
-# - ``DDP_FORK`` (and its aliases e.g. ``ddp_notebook``)
-# - ``DDP_SPAWN``
-# - ``DDP_SHARDED``
-# - ``DDP_SHARDED_SPAWN``
+# - ``ddp`` (and alias ``ddp_find_unused_parameters_false``)
+# - ``fsdp_native`` (and alias ``fsdp_native_full_shard_offload``)
+# - ``ddp_spawn`` (and aliases ``ddp_fork``, ``ddp_notebook``)
+# - ``dp``
+# - ``ddp_sharded`` (deprecated, to be removed in 2.0)
+# - ``ddp_sharded_spawn`` (deprecated, to be removed in 2.0)
#
# Custom or officially unsupported strategies can be used by setting [FinetuningScheduler.allow_untested](https://finetuning-scheduler.readthedocs.io/en/stable/api/finetuning_scheduler.fts.html?highlight=allow_untested#finetuning_scheduler.fts.FinetuningScheduler.params.allow_untested) to ``True``.
-# Note that most currently unsupported strategies are so because they require varying degrees of modification to be compatible (e.g. ``deepspeed`` requires an ``add_param_group`` method, ``tpu_spawn`` an override of the current broadcast method to include python objects)
+# Note that most currently unsupported strategies are so because they require varying degrees of modification to be compatible. For example, ``deepspeed`` will require a [StrategyAdapter](https://finetuning-scheduler.readthedocs.io/en/stable/api/finetuning_scheduler.strategy_adapters.html#finetuning_scheduler.strategy_adapters.StrategyAdapter) to be written (similar to the one for ``FSDP``, [FSDPStrategyAdapter](https://finetuning-scheduler.readthedocs.io/en/stable/api/finetuning_scheduler.strategy_adapters.html#finetuning_scheduler.strategy_adapters.FSDPStrategyAdapter)) before support can be added (PRs welcome!),
+# while ``tpu_spawn`` would require an override of the current broadcast method to include python objects.
# </div>
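For readers who want to see what that opt-in looks like, a minimal hedged sketch (the custom strategy object is hypothetical; only ``allow_untested`` itself is a documented parameter):

```python
import pytorch_lightning as pl
from finetuning_scheduler import FinetuningScheduler

# A strategy from the supported list above can be combined with the callback directly.
trainer = pl.Trainer(strategy="ddp_notebook", callbacks=[FinetuningScheduler()])

# A custom or officially untested strategy must be explicitly opted into; compatibility
# is then the user's responsibility.
# trainer = pl.Trainer(strategy=my_custom_strategy,  # hypothetical strategy instance
#                      callbacks=[FinetuningScheduler(allow_untested=True)])
```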

# %% [markdown]
# <div id="superglue"></div>
#
# ## Scheduled Fine-Tuning For SuperGLUE
#
-# The following example demonstrates the use of [FinetuningScheduler](https://finetuning-scheduler.readthedocs.io/en/stable/api/finetuning_scheduler.fts.html#finetuning_scheduler.fts.FinetuningScheduler) to fine-tune a small foundational model on the [RTE](https://huggingface.co/datasets/viewer/?dataset=super_glue&config=rte) task of [SuperGLUE](https://super.gluebenchmark.com/). Iterative early-stopping will be applied according to a user-specified schedule.
+# The following example demonstrates the use of [FinetuningScheduler](https://finetuning-scheduler.readthedocs.io/en/stable/api/finetuning_scheduler.fts.html#finetuning_scheduler.fts.FinetuningScheduler) to fine-tune a small foundation model on the [RTE](https://huggingface.co/datasets/viewer/?dataset=super_glue&config=rte) task of [SuperGLUE](https://super.gluebenchmark.com/). Iterative early-stopping will be applied according to a user-specified schedule.
#

# %%
@@ -180,7 +181,7 @@
import pytorch_lightning as pl
import torch
from datasets import logging as datasets_logging
-from lightning_lite.accelerators.cuda import is_cuda_available
+from lightning_fabric.accelerators.cuda import is_cuda_available
from pytorch_lightning.callbacks import EarlyStopping, ModelCheckpoint
from pytorch_lightning.loggers.tensorboard import TensorBoardLogger
from pytorch_lightning.utilities import rank_zero_warn
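The import change above reflects the ``lightning_lite`` to ``lightning_fabric`` rename that accompanied Lightning 1.9. If a notebook needs to tolerate both versions, a guarded import along these lines (a sketch, not part of the tutorial) would work:

```python
# Version-tolerant import of the helper used above; the module was renamed between
# Lightning 1.8.x and 1.9.x.
try:
    from lightning_fabric.accelerators.cuda import is_cuda_available  # Lightning >= 1.9
except ImportError:
    from lightning_lite.accelerators.cuda import is_cuda_available  # Lightning 1.8.x
```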
@@ -251,8 +252,8 @@ def __init__(
model_name_or_path (str):
Can be either:
- A string, the ``model id`` of a pretrained model hosted inside a model repo on huggingface.co.
-Valid model ids can be located at the root-level, like ``bert-base-uncased``, or namespaced under
-a user or organization name, like ``dbmdz/bert-base-german-cased``.
+Valid model ids can be located at the root-level, like ``bert-base-uncased``, or namespaced
+under a user or organization name, like ``dbmdz/bert-base-german-cased``.
- A path to a ``directory`` containing model weights saved using
:meth:`~transformers.PreTrainedModel.save_pretrained`, e.g., ``./my_model_directory/``.
task_name (str, optional): Name of the SuperGLUE task to execute. This module supports 'rte' or 'boolq'.
@@ -261,7 +262,7 @@ def __init__(
train_batch_size (int, optional): Training batch size. Defaults to 16.
eval_batch_size (int, optional): Batch size to use for validation and testing splits. Defaults to 16.
tokenizers_parallelism (bool, optional): Whether to use parallelism in the tokenizer. Defaults to True.
-\**dataloader_kwargs: Arguments passed when initializing the dataloader
+\**dataloader_kwargs: Arguments passed when initializing the dataloader.
"""
super().__init__()
task_name = task_name if task_name in TASK_NUM_LABELS.keys() else DEFAULT_TASK
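As a hedged usage sketch of the keyword arguments documented above (the datamodule class name is assumed for illustration; extra keywords such as ``num_workers`` and ``pin_memory`` are simply forwarded to the dataloaders via ``**dataloader_kwargs``):

```python
# Class name assumed for illustration; keyword arguments mirror the docstring above.
dm = RteBoolqDataModule(
    model_name_or_path="bert-base-uncased",
    task_name="rte",
    train_batch_size=16,
    eval_batch_size=16,
    tokenizers_parallelism=True,
    num_workers=2,    # forwarded to the DataLoaders via **dataloader_kwargs
    pin_memory=True,  # likewise forwarded
)
```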
@@ -300,7 +301,7 @@ def train_dataloader(self):
def val_dataloader(self):
return DataLoader(self.dataset["validation"], batch_size=self.hparams.eval_batch_size, **self.dataloader_kwargs)

-def _convert_to_features(self, example_batch: datasets.arrow_dataset.Batch) -> BatchEncoding:
+def _convert_to_features(self, example_batch: datasets.arrow_dataset.LazyDict) -> BatchEncoding:
"""Convert raw text examples to a :class:`~transformers.tokenization_utils_base.BatchEncoding` container
(derived from python dict) of features that includes helpful methods for translating between word/character
space and token space.
@@ -309,7 +310,7 @@ def _convert_to_features(self, example_batch: datasets.arrow_dataset.Batch) -> B
example_batch ([type]): The set of examples to convert to token space.

Returns:
-``BatchEncoding``: A batch of encoded examples (note default tokenizer batch_size=1000)
+``BatchEncoding``: A batch of encoded examples (note default tokenizer batch_size=1000).
"""
text_pairs = list(zip(example_batch[self.text_fields[0]], example_batch[self.text_fields[1]]))
# Tokenize the text/text pairs
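For orientation, the tokenization step this hunk truncates produces the ``BatchEncoding`` discussed in the docstring; a standalone sketch (model name, lengths, and padding choices are illustrative only):

```python
from transformers import AutoTokenizer

# Illustrative only: encode premise/hypothesis pairs into a BatchEncoding, mirroring
# the text_pairs zip shown above.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
text_pairs = [("A premise sentence.", "A hypothesis sentence.")]
encoding = tokenizer.batch_encode_plus(text_pairs, max_length=128, padding="max_length", truncation=True)
print(type(encoding))  # <class 'transformers.tokenization_utils_base.BatchEncoding'>
```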
@@ -323,8 +324,8 @@ def _convert_to_features(self, example_batch: datasets.arrow_dataset.Batch) -> B

# %%
class RteBoolqModule(pl.LightningModule):
"""A ``LightningModule`` that can be used to fine-tune a foundational model on either the RTE or BoolQ
SuperGLUE tasks using Hugging Face implementations of a given model and the `SuperGLUE Hugging Face dataset."""
"""A ``LightningModule`` that can be used to fine-tune a foundation model on either the RTE or BoolQ SuperGLUE
tasks using Hugging Face implementations of a given model and the `SuperGLUE Hugging Face dataset."""

def __init__(
self,
@@ -337,9 +338,9 @@ def __init__(
):
"""
Args:
-model_name_or_path (str): Path to pretrained model or identifier from https://huggingface.co/models
+model_name_or_path (str): Path to pretrained model or identifier from https://huggingface.co/models.
optimizer_init (Dict[str, Any]): The desired optimizer configuration.
-lr_scheduler_init (Dict[str, Any]): The desired learning rate scheduler config
+lr_scheduler_init (Dict[str, Any]): The desired learning rate scheduler config.
model_cfg (Optional[Dict[str, Any]], optional): Defines overrides of the default model config. Defaults to
``None``.
task_name (str, optional): The SuperGLUE task to execute, one of ``'rte'``, ``'boolq'``. Defaults to "rte".
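A hedged sketch of how the ``optimizer_init``/``lr_scheduler_init`` dictionaries documented above might be populated; the keys must match whatever optimizer and LR scheduler ``configure_optimizers`` ultimately instantiates, so treat these values purely as placeholders:

```python
# Placeholder configuration dicts; key names depend on the optimizer/scheduler chosen.
optimizer_init = {"lr": 1e-05, "eps": 1e-07, "weight_decay": 1e-05}
lr_scheduler_init = {"T_max": 20}  # e.g. for torch.optim.lr_scheduler.CosineAnnealingLR

module = RteBoolqModule(
    model_name_or_path="bert-base-uncased",
    optimizer_init=optimizer_init,
    lr_scheduler_init=lr_scheduler_init,
    task_name="rte",
)
```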
@@ -495,7 +496,7 @@ def configure_optimizers(self):
# Though other optimizers can arguably yield some marginal advantage contingent on the context,
# the Adam optimizer (and the [AdamW version](https://pytorch.org/docs/stable/_modules/torch/optim/adamw.html#AdamW) which
# implements decoupled weight decay) remains robust to hyperparameter choices and is commonly used for fine-tuning
-# foundational language models. See [(Sivaprasad et al., 2020)](#f2) and [(Mosbach, Andriushchenko & Klakow, 2020)](#f3) for theoretical and systematic empirical justifications of Adam and its use in fine-tuning
+# foundation language models. See [(Sivaprasad et al., 2020)](#f2) and [(Mosbach, Andriushchenko & Klakow, 2020)](#f3) for theoretical and systematic empirical justifications of Adam and its use in fine-tuning
# large transformer-based language models. The values used here have some justification
# in the referenced literature but have been largely empirically determined and, while a good
# starting point, could be further tuned.
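As a concrete illustration of the decoupled weight decay mentioned above, AdamW is available directly from ``torch.optim``; the hyperparameter values here are placeholders rather than the tuned settings used in the full example:

```python
import torch

# Minimal sketch: AdamW decouples weight decay from the adaptive gradient update,
# rather than folding it into the loss as L2 regularization the way plain Adam does.
params = [torch.nn.Parameter(torch.randn(2, 2))]  # stand-in for model.parameters()
optimizer = torch.optim.AdamW(params, lr=1e-05, eps=1e-07, weight_decay=1e-05)
```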