[Repost] Understanding Checkpoints and Their Use in TensorFlow, Keras, and PyTorch

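The author uses pausing and resuming a video game to explain clearly what checkpoints are for, then demonstrates practical usage in three mainstream frameworks. Thumbs up.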

 

Reposted from: https://blog.floydhub.com/checkpointing-tutorial-for-tensorflow-keras-and-pytorch/

This post will demonstrate how to checkpoint your training models on FloydHub so that you can resume your experiments from these saved states.

Wait, but why?


If you've ever played a video game, you might already understand why checkpoints are useful. For example, sometimes you'll want to save your game right before a big boss castle - just in case everything goes terribly wrong inside and you need to try again. Checkpoints in machine learning and deep learning experiments are essentially the same thing - a way to save the current state of your experiment so that you can pick up from where you left off.

Trust me, you're going to have a bad time if you lose one or more of your experiments due to a power outage, OS fault, job preemption, or any other type of unexpected error. Other times, even if you don't experience an unforeseen error, you might just want to resume from a particular state of the training for a new experiment - or try different things from a given state.

That's why you need checkpoints!

But, wait - there's one more reason, and it's a big one. If you don't checkpoint your training models at the end of a job, you'll have lost all of your results! Like, they're just gone. Simply put, if you'd like to make use of your trained models, you're going to need some checkpoints.

So what is a checkpoint really?

The Keras docs provide a great explanation of checkpoints (that I'm going to gratuitously leverage here):

  • The architecture of the model, allowing you to re-create the model
  • The weights of the model
  • The training configuration (loss, optimizer, epochs, and other meta-information)
  • The state of the optimizer, allowing you to resume training exactly where you left off.

Again, a checkpoint contains the information you need to save your current experiment state so that you can resume training from this point. Just like in that infernal Zelda II: The Adventure of Link game from my childhood.
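
To make that list concrete, here is a minimal, framework-agnostic sketch of a checkpoint as a plain dictionary written to disk. The key names, the JSON format, and the write_checkpoint/read_checkpoint helpers are all assumptions made for illustration; real frameworks use their own binary formats and store far larger weight arrays.

```python
import json
import os
import tempfile

# Hypothetical key names, one per item in the list above.
REQUIRED_KEYS = ("architecture", "weights", "training_config", "optimizer_state")


def write_checkpoint(state, path):
    """Write the experiment state to a single file, refusing incomplete states."""
    missing = [key for key in REQUIRED_KEYS if key not in state]
    if missing:
        raise ValueError(f"checkpoint is missing: {missing}")
    with open(path, "w") as f:
        json.dump(state, f)


def read_checkpoint(path):
    """Read a previously written experiment state back into a dictionary."""
    with open(path) as f:
        return json.load(f)


# Made-up values standing in for a real model and optimizer.
state = {
    "architecture": {"layers": ["conv", "pool", "dense"]},
    "weights": {"dense/kernel": [0.1, -0.2, 0.3]},
    "training_config": {"loss": "categorical_crossentropy", "epochs_done": 3},
    "optimizer_state": {"learning_rate": 0.001, "step": 1200},
}

path = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
write_checkpoint(state, path)
restored = read_checkpoint(path)
```

If any of the four pieces is missing you can still reload a model, but you can't pick training up exactly where it stopped. Without the optimizer state, for example, training restarts with fresh optimizer statistics.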


At this point, I'll assume I've convinced you that checkpoints need to be a vital part of your deep learning workflow. So, let's talk strategy.

You can employ different checkpoint strategies according to the type of training regime you're performing:

  • Short Training Regime (minutes to hours)
  • Normal Training Regime (hours to a day)
  • Long Training Regime (days to weeks)

Short Training Regime

The typical practice is to save a checkpoint only at the end of the training, or at the end of every epoch.

Normal Training Regime

In this case, it's common to save multiple checkpoints every n_epochs and keep track of the best one with respect to some validation metric that we care about. Usually, there's a fixed maximum number of checkpoints so as to not take up too much disk space (for example, restricting your maximum number of checkpoints to 10, where the new ones will replace the earliest ones).
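
Here's a rough sketch of that rotation policy in plain Python. The RotatingCheckpoints class, the file names, and the accuracy values are hypothetical, and JSON files stand in for real model checkpoints; the point is only the bookkeeping: save every n_epochs, delete the oldest file once the limit is reached, and keep a separate copy of the best state.

```python
import json
import os
import tempfile
from collections import deque


class RotatingCheckpoints:
    """Hypothetical helper: keep the newest `max_to_keep` files plus the best one."""

    def __init__(self, directory, max_to_keep=10):
        self.directory = directory
        self.max_to_keep = max_to_keep
        self.recent = deque()  # paths of periodic checkpoints, oldest first
        self.best_metric = float("-inf")
        self.best_path = os.path.join(directory, "best.json")

    def save(self, epoch, state, val_metric):
        path = os.path.join(self.directory, f"epoch-{epoch:04d}.json")
        with open(path, "w") as f:
            json.dump(state, f)
        self.recent.append(path)
        # Drop the oldest periodic checkpoint once the limit is exceeded.
        while len(self.recent) > self.max_to_keep:
            os.remove(self.recent.popleft())
        # Keep a separate copy of the best state so rotation never deletes it.
        if val_metric > self.best_metric:
            self.best_metric = val_metric
            with open(self.best_path, "w") as f:
                json.dump(state, f)


n_epochs = 2  # checkpoint every 2 epochs
manager = RotatingCheckpoints(tempfile.mkdtemp(), max_to_keep=3)

# Made-up validation accuracies, one per epoch.
val_accuracy = [0.90, 0.93, 0.95, 0.97, 0.96, 0.98, 0.975, 0.97, 0.96, 0.965]
for epoch, acc in enumerate(val_accuracy, start=1):
    if epoch % n_epochs == 0:
        manager.save(epoch, {"epoch": epoch, "val_acc": acc}, acc)

kept = sorted(os.listdir(manager.directory))
```

After ten epochs only the checkpoints from epochs 6, 8, and 10 remain, plus best.json, which holds the state from epoch 6.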

Long Training Regime

In this type of training regime, you'll likely want to employ a similar strategy to the Normal regime - where you're saving multiple checkpoints every n_epochs and keeping track of the best one with respect to the validation metric that you care about. In this case, since the training can be very long, it's common to save checkpoints less frequently but maintain a greater number of checkpoints.

Which regime is right for me?

The tradeoff among these strategies is between how frequently you save checkpoints and how many checkpoint files you keep. Let's take a look at what happens when we vary these two parameters:

FREQUENCY | NUMBER OF CHECKPOINTS | CONS | PROS
High | High | You need a lot of space! | You can resume very quickly in almost all the interesting training states
High | Low | You could have lost precious states | Minimize the storage space you need
Low | High | It will take time to get to intermediate states | You can resume the experiments in a lot of interesting states
Low | Low | You could have lost precious states | Minimize the storage space you need
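
The same tradeoff can be put into rough numbers. The checkpoint_tradeoff function below is a hypothetical back-of-the-envelope helper, and the 200 MB checkpoint size and epoch counts are made-up inputs; it assumes a checkpoint is written at the end of every save_every-th epoch and that only the newest max_to_keep files are retained.

```python
def checkpoint_tradeoff(total_epochs, save_every, max_to_keep, size_mb):
    """Rough cost/benefit numbers for one checkpointing policy (illustrative only)."""
    written = total_epochs // save_every  # checkpoints written over the whole run
    kept = min(written, max_to_keep)      # files still on disk at the end
    return {
        "disk_mb": kept * size_mb,        # storage needed at the end of the run
        # Completed epochs lost if a crash hits just before the next save.
        "max_epochs_lost": save_every - 1,
        "resumable_states": kept,         # distinct states you can restart from
    }


# High frequency, many checkpoints kept.
high_high = checkpoint_tradeoff(total_epochs=100, save_every=1, max_to_keep=50, size_mb=200)
# Low frequency, few checkpoints kept.
low_low = checkpoint_tradeoff(total_epochs=100, save_every=20, max_to_keep=2, size_mb=200)
```

With these inputs the first policy needs 10,000 MB and never loses a completed epoch, while the second needs only 400 MB but can lose up to 19 completed epochs and leaves just two states to resume from.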

Hopefully, now you have a good intuition about what might be the best checkpoint strategy for your training regime. It should go without saying that you can obviously develop your own custom checkpoint strategy based on your experiment needs! These are just tips and best practices that I take into consideration for my own projects.

Save and Resume on FloydHub

Now, let's dive into some code on FloydHub. I'll show you how to save checkpoints in three popular deep learning frameworks available on FloydHub: TensorFlow, Keras, and PyTorch.

Before you start, log into the FloydHub command-line tool with the floyd login command, then fork and init the project:

$ git clone https://github.com/floydhub/save-and-resume.git
$ cd save-and-resume
$ floyd init save-and-resume

For our checkpointing examples, we'll be using the Hello, World of deep learning: the MNIST classification task using a Convolutional Neural Network model.

Because it's always important to be clear about our checkpointing strategy up-front, I'll state the approach we're going to be taking:

  • Keep only one checkpoint
  • Trigger the strategy at the end of every epoch
  • Save the one with the best (maximum) validation accuracy

Considering this toy example, we can employ the Short Training Regime strategy. Feel free to adapt this for your own more complicated experiments!

The commands

Before we dive into specific working examples, let's outline the basic commands you'll need. When starting a new job, your first command will look something like this:

floyd run \
    [--gpu] \
    --env <env> \
    --data <your_dataset>:<mounting_point_dataset> \
    "python <script_and_parameters>"

Important note: within your python script, you'll want to make sure that the checkpoint is being saved to the /output folder. FloydHub will automatically save the contents of the /output directory as a job's Output, which is how you'll be able to leverage these checkpoints to resume jobs.
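
A small sketch of how a script might honor this convention: write new checkpoints under one directory and, at startup, check whether an earlier checkpoint is available in another. The checkpoint_paths helper and the checkpoint.bin file name are hypothetical. On FloydHub the two directories would be /output and wherever a previous job's output is mounted (/model in the examples in this post); temporary directories stand in for them here so the sketch runs anywhere.

```python
import os
import tempfile


def checkpoint_paths(output_dir, resume_dir, filename):
    """Return (path to write new checkpoints to, path to resume from or None)."""
    save_path = os.path.join(output_dir, filename)
    candidate = os.path.join(resume_dir, filename)
    resume_path = candidate if os.path.isfile(candidate) else None
    return save_path, resume_path


# Temporary directories stand in for /output and a mounted /model.
output_dir = tempfile.mkdtemp()
resume_dir = tempfile.mkdtemp()

# First run: nothing is available to resume from, so training starts from scratch.
save_path, resume_path = checkpoint_paths(output_dir, resume_dir, "checkpoint.bin")
first_run_resume = resume_path

# Simulate a finished job whose output is available to the next run.
with open(os.path.join(resume_dir, "checkpoint.bin"), "wb") as f:
    f.write(b"fake checkpoint bytes")
save_path, resume_path = checkpoint_paths(output_dir, resume_dir, "checkpoint.bin")
```

Keeping the write path and the resume path separate means the same script works unchanged for both the first run and every later run.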

Once your job has completed, you'll then be able to mount that job's output as an input to your next job - allowing your script to leverage the checkpoint you created in the next run of this project.

floyd run \
    [--gpu] \
    --env <env> \
    --data <your_dataset>:<mounting_point_dataset> \
    --data <output_of_previous_job>:<mounting_point_model> \
    "python <script_and_parameters>"

Okay, enough of that. Let's see how to make this tangible using three of the most popular frameworks on FloydHub.

TensorFlow


View full example on a FloydHub Jupyter Notebook

TensorFlow provides different ways to save and resume a checkpoint. In our example, we will use the tf.estimator.Estimator API, which uses tf.train.Saver, tf.train.CheckpointSaverHook, and tf.saved_model.builder.SavedModelBuilder behind the scenes.

More precisely, the Estimator uses the first to write the checkpoint, the second to act according to the adopted checkpointing strategy, and the last to export the model for serving through the export_savedmodel() method.

Let's dig in.

Saving a TensorFlow checkpoint

Before initializing an Estimator, we have to define the checkpoint strategy. To do so, we have to create a configuration for the Estimator using the tf.estimator.RunConfig API. Here's an example of how we might do this:

# Save the checkpoint in the /output folder
filepath = "/output/mnist_convnet_model"

# Checkpoint Strategy configuration
run_config = tf.contrib.learn.RunConfig(
    model_dir=filepath,
    keep_checkpoint_max=1)

In this way, we're telling the estimator which directory to save or resume a checkpoint from, and also how many checkpoints to keep.

Next, we have to provide this configuration at the initialization of the Estimator:

# Create the Estimator
mnist_classifier = tf.estimator.Estimator(
      model_fn=cnn_model_fn, config=run_config)

That's it. Seriously. We're now set up to save checkpoints in our TensorFlow code.

Resuming a TensorFlow checkpoint

Guess what? We're also already set up to resume from checkpoints in our next experiment run. If the Estimator finds a checkpoint inside the given model folder, it will load from the last checkpoint.
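
To give a feel for what "the last checkpoint" means, here is a plain-Python sketch that picks the newest checkpoint in a model folder by its step number. It assumes the usual model.ckpt-<global_step>.index file naming, and the newest_checkpoint_prefix helper is hypothetical. TensorFlow itself does not scan file names like this: tf.train.latest_checkpoint reads the small "checkpoint" state file that lives in the same folder. The sketch only mimics the outcome.

```python
import os
import re
import tempfile

# Assumed naming scheme: model.ckpt-<global_step>.index
INDEX_FILE = re.compile(r"^(model\.ckpt-(\d+))\.index$")


def newest_checkpoint_prefix(model_dir):
    """Return the checkpoint prefix with the highest global step, or None."""
    best_step, best_prefix = -1, None
    for name in os.listdir(model_dir):
        match = INDEX_FILE.match(name)
        # Compare steps as integers: as strings, "2000" would sort after "10000".
        if match and int(match.group(2)) > best_step:
            best_step = int(match.group(2))
            best_prefix = os.path.join(model_dir, match.group(1))
    return best_prefix


model_dir = tempfile.mkdtemp()
empty_result = newest_checkpoint_prefix(model_dir)  # no checkpoint yet

# Create empty placeholder files for three made-up training steps.
for step in (1, 2000, 20000):
    open(os.path.join(model_dir, f"model.ckpt-{step}.index"), "w").close()
newest = newest_checkpoint_prefix(model_dir)
```

An empty folder yields no checkpoint, which corresponds to the Estimator starting training from scratch.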

Okay, let me try

Don't take my word for it - try it out yourself. Here are the steps to run the TensorFlow checkpointing example on FloydHub.

Via FloydHub's Command Mode

First time training command:

floyd run \
    --gpu \
    --env tensorflow-1.3 \
    --data redeipirati/datasets/mnist/1:input \
    'python tf_mnist_cnn.py'

  • The --env flag specifies the environment that this project should run on (TensorFlow 1.3.0 + Keras 2.0.6 on Python 3.6)
  • The --data flag specifies that the mnist dataset should be available at the /input directory
  • The --gpu flag is actually optional here - unless you want to start right away with running the code on a GPU machine

Resuming from your checkpoint:

floyd run \
    --gpu \
    --env tensorflow-1.3 \
    --data redeipirati/datasets/mnist/1:input \
    --data <your-username>/projects/save-and-resume/<jobs>/output:/model \
    'python tf_mnist_cnn.py'

  • The --env flag specifies the environment that this project should run on (TensorFlow 1.3.0 + Keras 2.0.6 on Python 3.6)
  • The first --data flag specifies that the mnist dataset should be available at the /input directory
  • The second --data flag specifies that the output of a previous job should be available at the /model directory
  • The --gpu flag is actually optional here - unless you want to start right away with running the code on a GPU machine

Via FloydHub's Jupyter Notebook Mode

floyd run \
    --gpu \
    --env tensorflow-1.3 \
    --data redeipirati/datasets/mnist/1:input \
    --mode jupyter

  • The --env flag specifies the environment that this project should run on (TensorFlow 1.3.0 + Keras 2.0.6 on Python 3.6)
  • The --data flag specifies that the mnist dataset should be available at the /input directory
  • The --gpu flag is actually optional here - unless you want to start right away with running the code on a GPU machine
  • The --mode flag specifies that this job should provide a Jupyter notebook instance

Resuming from your checkpoint:

Just add --data <your-username>/projects/save-and-resume/<jobs>/output:/model to the previous command if you want to load a checkpoint from a previous job in your Jupyter notebook.

Keras


View full example on a FloydHub Jupyter Notebook

Keras provides a great API for saving and loading checkpoints. Let's take a look:

Saving a Keras checkpoint

Keras provides a set of functions called callbacks: you can think of callbacks as events that will be triggered at certain training states. The callback we need for checkpointing is the ModelCheckpoint which provides all the features we need according to the checkpointing strategy we adopted in our example.

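Note: with its default setting (save_weights_only=False), ModelCheckpoint saves the full model, not just the weights; pass save_weights_only=True if you only want the weights. In this example we only load the weights back when resuming. If you want to save the entire model or some of its components in other ways, you can take a look at the Keras docs on saving a model.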

First up, we have to import the callback functions:

from keras.callbacks import ModelCheckpoint

Next, just before the call to model.fit(...), it's time to prepare the checkpoint strategy.

# Save the checkpoint in the /output folder
filepath = "/output/mnist-cnn-best.hdf5"

# Keep only a single checkpoint, the best over test accuracy.
checkpoint = ModelCheckpoint(filepath,
                            monitor='val_acc',
                            verbose=1,
                            save_best_only=True,
                            mode='max')

  • filepath="/output/mnist-cnn-best.hdf5": Remember, FloydHub will save the contents of the /output folder! See more on job output in the FloydHub docs.
  • monitor='val_acc': This is the metric we care about - validation accuracy.
  • verbose=1: Print a message whenever the callback checks the metric.
  • save_best_only=True: Keep only the best checkpoint (in terms of maximum validation accuracy).
  • mode='max': Save the checkpoint with the maximum validation accuracy.

By default, the period (or checkpointing frequency) is set to 1, which means at the end of every epoch.

For more information (such as filepath formatting options, checkpointing period, and more), you can explore the Keras ModelCheckpoint API.
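
As a quick illustration of the filepath formatting options: the path you pass to ModelCheckpoint may contain named placeholders such as {epoch:02d} or {val_acc:.4f}, which Keras fills in with Python's str.format using the one-based epoch number and the logged metrics. The sketch below reproduces only that string expansion in plain Python. The template and the metric values are made up, and the metric key is assumed to be val_acc as in this post (newer Keras versions name it val_accuracy).

```python
# Hypothetical template with named placeholders.
filepath_template = "/output/mnist-cnn-{epoch:02d}-{val_acc:.4f}.hdf5"

# Made-up metrics of the kind Keras hands to callbacks at the end of an epoch.
logs = {"loss": 0.0412, "acc": 0.9871, "val_loss": 0.0350, "val_acc": 0.9893}
epoch_index = 2  # callbacks receive a zero-based epoch index

# The file name uses the one-based epoch number.
resolved = filepath_template.format(epoch=epoch_index + 1, **logs)
print(resolved)  # /output/mnist-cnn-03-0.9893.hdf5
```

A templated path produces a new file for every saved checkpoint, whereas the fixed path used in this post is overwritten each time, which is what gives us the "keep only one checkpoint" behavior.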

Finally, we are ready to see this checkpointing strategy applied during model training. In order to do this, we need to pass the callback variable to the model.fit(...) call:

# Train
model.fit(x_train, y_train,
                batch_size=batch_size,
                epochs=epochs,
                verbose=1,
                validation_data=(x_test, y_test),
                callbacks=[checkpoint])  # <- Apply our checkpoint strategy

According to our chosen strategy, you will see:

# This line when the training reaches a new max
Epoch < n_epoch >: val_acc improved from < previous val_acc > to < new max val_acc >, saving model to /output/mnist-cnn-best.hdf5

# Or this line
Epoch < n_epoch >: val_acc did not improve

That's it - you're now set up to save your Keras checkpoints.

Resuming a Keras checkpoint

Keras models provide the load_weights() method, which loads the weights from an HDF5 file.

To load the model's weights, you just need to add this line after the model definition:

... # Model Definition

model.load_weights(resume_weights)

Okay, let me try

Here's how you can run this Keras example on FloydHub:

Via FloydHub's Command Mode

First time training command:

floyd run \
    --gpu \
    --env tensorflow-1.3 \
    'python keras_mnist_cnn.py'

  • The --env flag specifies the environment that this project should run on (TensorFlow 1.3.0 + Keras 2.0.6 on Python 3.6)
  • The --gpu flag is actually optional here - unless you want to start right away with running the code on a GPU machine

Keras provides an API to handle MNIST data, so we can skip the dataset mounting in this case.

Resuming from your checkpoint:

floyd run \
    --gpu \
    --env tensorflow-1.3 \
    --data <your-username>/projects/save-and-resume/<jobs>/output:/model \
    'python keras_mnist_cnn.py'

  • The --env flag specifies the environment that this project should run on (TensorFlow 1.3.0 + Keras 2.0.6 on Python 3.6)
  • The --data flag specifies that the output of a previous job should be available at the /model directory
  • The --gpu flag is actually optional here - unless you want to start right away with running the code on a GPU machine

Via FloydHub's Jupyter Notebook Mode

floyd run \
    --gpu \
    --env tensorflow-1.3 \
    --mode jupyter

  • The --env flag specifies the environment that this project should run on (TensorFlow 1.3.0 + Keras 2.0.6 on Python 3.6)
  • The --gpu flag is actually optional here - unless you want to start right away with running the code on a GPU machine
  • The --mode flag specifies that this job should provide us a Jupyter notebook.

Resuming from your checkpoint:

Just add --data <your-username>/projects/save-and-resume/<jobs>/output:/model if you want to load a checkpoint from a previous job.

PyTorch


View full example on a FloydHub Jupyter Notebook

Unfortunately, at the moment, PyTorch does not have a checkpointing API as easy to use as the one in Keras. We'll need to write our own solution according to our chosen checkpointing strategy.

Saving a PyTorch checkpoint

PyTorch does not provide an all-in-one API to define a checkpointing strategy, but it does provide a simple way to save and resume a checkpoint. According to the official docs on serialization semantics, the best practice is to save only the weights: a fully pickled model is tied to the exact classes and directory structure used when it was saved, so it can break after the code is refactored.

Therefore, let's take a look at how to save the model weights in PyTorch.

First up, let's define a save_checkpoint function which handles all the instructions about the number of checkpoints to keep and the serialization on file:

import torch

def save_checkpoint(state, is_best, filename='/output/checkpoint.pth.tar'):
    """Save checkpoint if a new best is achieved"""
    if is_best:
        print("=> Saving a new best")
        torch.save(state, filename)  # save checkpoint
    else:
        print("=> Validation Accuracy did not improve")

Then, inside the training (which is usually a for-loop of the number of epochs), we define the checkpoint frequency (in our case, at the end of every epoch) and the information we'd like to store (the epochs, model weights, and best accuracy achieved):

...

# Training the Model
for epoch in range(num_epochs):
    train(...)  # Train
    acc = eval(...)  # Evaluate after every epoch

    # Some stuff with acc(accuracy)
    ...

    # Get bool not ByteTensor
    is_best = bool(acc.numpy() > best_accuracy.numpy())
    # Get greater Tensor to keep track best acc
    best_accuracy = torch.FloatTensor(max(acc.numpy(), best_accuracy.numpy()))
    # Save checkpoint if is a new best
    save_checkpoint({
        'epoch': start_epoch + epoch + 1,
        'state_dict': model.state_dict(),
        'best_accuracy': best_accuracy
    }, is_best)

That's it! You can now save checkpoints in your PyTorch experiments.

Resuming a PyTorch checkpoint

To resume a PyTorch checkpoint, we have to load the weights and the meta information we need before the training:

# Assumes cuda = torch.cuda.is_available() and resume_weights (the checkpoint path) were set earlier
if cuda:
    checkpoint = torch.load(resume_weights)
else:
    # Load GPU model on CPU
    checkpoint = torch.load(resume_weights,
                            map_location=lambda storage, loc: storage)
start_epoch = checkpoint['epoch']
best_accuracy = checkpoint['best_accuracy']
model.load_state_dict(checkpoint['state_dict'])
print("=> loaded checkpoint '{}' (trained for {} epochs)".format(resume_weights, checkpoint['epoch']))

For more information on loading GPU-trained weights on a CPU instance, you can check out this PyTorch discussion.
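
To see how the saved epoch and best accuracy carry over between runs, here is a stripped-down sketch of the same save-and-resume bookkeeping. It is deliberately free of PyTorch so it runs anywhere: JSON files stand in for torch.save and torch.load, a list of made-up accuracies stands in for training and evaluation, and the train_with_resume function is hypothetical. It also assumes the accuracy is a plain Python float (in newer PyTorch versions you could get one from a one-element tensor with .item()).

```python
import json
import os
import tempfile


def train_with_resume(num_epochs, accuracies, ckpt_path):
    """Run `num_epochs` more epochs, resuming the counters from `ckpt_path` if present."""
    start_epoch, best_accuracy = 0, 0.0
    if os.path.isfile(ckpt_path):
        with open(ckpt_path) as f:  # torch.load(...) in a real script
            checkpoint = json.load(f)
        start_epoch = checkpoint["epoch"]
        best_accuracy = checkpoint["best_accuracy"]

    for epoch in range(num_epochs):
        acc = accuracies[start_epoch + epoch]  # stand-in for train(...) + evaluation
        is_best = acc > best_accuracy
        best_accuracy = max(acc, best_accuracy)
        if is_best:
            # Same fields as in the post, minus the model weights.
            state = {"epoch": start_epoch + epoch + 1, "best_accuracy": best_accuracy}
            with open(ckpt_path, "w") as f:  # torch.save(state, path) in a real script
                json.dump(state, f)
    return start_epoch + num_epochs, best_accuracy


accuracies = [0.91, 0.95, 0.94, 0.97, 0.96, 0.98]  # made-up per-epoch accuracies
ckpt_path = os.path.join(tempfile.mkdtemp(), "checkpoint.json")

first = train_with_resume(3, accuracies, ckpt_path)   # fresh start
second = train_with_resume(3, accuracies, ckpt_path)  # resumes from the saved state
with open(ckpt_path) as f:
    saved = json.load(f)
```

One consequence of keeping only the best checkpoint: the first run trains for three epochs, but its best model comes from epoch 2, so the second run resumes from epoch 2 and repeats the third epoch. If you need to continue from the most recent state instead, save a "latest" checkpoint alongside the best one.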

Okay, let me try

Here's how you can run this PyTorch example on FloydHub:

Via FloydHub's Command Mode

First time training command:

floyd run \
    --gpu \
    --env pytorch-0.2 \
    --data redeipirati/datasets/pytorch-mnist/1:input \
    'python pytorch_mnist_cnn.py'

  • The --env flag specifies the environment that this project should run on (PyTorch 0.2.0 on Python 3)
  • The --data flag specifies that the pytorch-mnist dataset should be available at the /input directory
  • The --gpu flag is actually optional here - unless you want to start right away with running the code on a GPU machine

Resuming from your checkpoint:

floyd run \
    --gpu \
    --env pytorch-0.2 \
    --data redeipirati/datasets/pytorch-mnist/1:input \
    --data <your-username>/projects/save-and-resume/<jobs>/output:/model \
    'python pytorch_mnist_cnn.py'

  • The --env flag specifies the environment that this project should run on (PyTorch 0.2.0 on Python 3)
  • The first --data flag specifies that the pytorch-mnist dataset should be available at the /input directory
  • The second --data flag specifies that the output of a previous job should be available at the /model directory
  • The --gpu flag is actually optional here - unless you want to start right away with running the code on a GPU machine

Via FloydHub's Jupyter Notebook Mode

floyd run \
    --gpu \
    --env pytorch-0.2 \
    --data redeipirati/datasets/pytorch-mnist/1:input \
    --mode jupyter

  • The --env flag specifies the environment that this project should run on (PyTorch 0.2.0 on Python 3)
  • The --data flag specifies that the pytorch-mnist dataset should be available at the /input directory
  • The --gpu flag is actually optional here - unless you want to start right away with running the code on a GPU machine
  • The --mode flag specifies that this job should provide us a Jupyter notebook.

Resuming from your checkpoint:

Just add --data <your-username>/projects/save-and-resume/<jobs>/output:/model if you want to load a checkpoint from a previous Job.

 
