[Repost] Understanding Checkpoints and How to Use Them in TensorFlow, Keras, and PyTorch

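The author uses the pause-and-resume mechanic of video games to explain what checkpoints are for, then demonstrates practical usage in three mainstream frameworks. Thumbs up from me.

Reposted from: https://blog.floydhub.com/checkpointing-tutorial-for-tensorflow-keras-and-pytorch/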

This post will demonstrate how to checkpoint your training models on FloydHub so that you can resume your experiments from these saved states.

Wait, but why?

If you've ever played a video game, you might already understand why checkpoints are useful. For example, sometimes you'll want to save your game right before a big boss castle - just in case everything goes terribly wrong inside and you need to try again. Checkpoints in machine learning and deep learning experiments are essentially the same thing - a way to save the current state of your experiment so that you can pick up from where you left off.

Trust me, you're going to have a bad time if you lose one or more of your experiments due to a power outage, OS fault, job preemption, or any other type of unexpected error. Other times, even if you don't experience an unforeseen error, you might just want to resume a particular state of the training for a new experiment - or try different things from a given state.

That's why you need checkpoints!

But, wait - there's one more reason, and it's a big one. If you don't checkpoint your training models at the end of a job, you'll have lost all of your results! Like, they're just gone. Simply put, if you'd like to make use of your trained models, you're going to need some checkpoints.

So what is a checkpoint really?

The Keras docs provide a great explanation of checkpoints (that I'm going to gratuitously leverage here):

  • The architecture of the model, allowing you to re-create the model
  • The weights of the model
  • The training configuration (loss, optimizer, epochs, and other meta-information)
  • The state of the optimizer, allowing you to resume training exactly where you left off.

Again, a checkpoint contains the information you need to save your current experiment state so that you can resume training from this point. Just like in that infernal Zelda II: The Adventure of Link game from my childhood.
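To make the four components listed above concrete, here is a minimal, framework-free sketch of a checkpoint as a plain dictionary that is written to disk and read back. The field names, the toy values, and the use of pickle are all assumptions chosen for illustration; each framework discussed later has its own on-disk format.

```python
import os
import pickle
import tempfile

# Hypothetical training state; in a real experiment these values
# would come from your framework's model and optimizer objects.
state = {
    "architecture": {"layers": ["conv", "pool", "dense"]},   # how to rebuild the model
    "weights": [0.12, -0.53, 0.98],                           # learned parameters
    "training_config": {"loss": "cross_entropy", "epoch": 4}, # meta-information
    "optimizer_state": {"lr": 0.01, "momentum": [0.0, 0.1, -0.2]},
}

def save_state(state, path):
    """Serialize the whole experiment state to a single file."""
    with open(path, "wb") as f:
        pickle.dump(state, f)

def load_state(path):
    """Read the experiment state back from disk."""
    with open(path, "rb") as f:
        return pickle.load(f)

checkpoint_path = os.path.join(tempfile.mkdtemp(), "checkpoint.pkl")
save_state(state, checkpoint_path)
restored = load_state(checkpoint_path)
print(restored["training_config"]["epoch"])
```

Everything needed to continue, including the epoch counter and the optimizer state, travels in one file, which is what makes resuming "exactly where you left off" possible.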

At this point, I'll assume I've convinced you that checkpoints need to be a vital part of your deep learning workflow. So, let's talk strategy.

You can employ different checkpoint strategies according to the type of experiment training regime you're performing:

  • Short Training Regime (minutes to hours)
  • Normal Training Regime (hours to a day)
  • Long Training Regime (days to weeks)

Short Training Regime

The typical practice is to save a checkpoint only at the end of the training, or at the end of every epoch.

Normal Training Regime

In this case, it's common to save multiple checkpoints every n_epochs and keep track of the best one with respect to some validation metric that we care about. Usually, there's a fixed maximum number of checkpoints so as to not take up too much disk space (for example, restricting your maximum number of checkpoints to 10, where the new ones will replace the earliest ones).
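The strategy described above (save every n_epochs, cap the number of files, remember the best one) can be sketched without any deep learning framework. The CheckpointKeeper class, its file names, and the accuracy values below are hypothetical and only stand in for real model weights and metrics.

```python
import os
import tempfile

class CheckpointKeeper:
    """Keep the newest `max_to_keep` checkpoints plus a copy of the best one."""

    def __init__(self, directory, max_to_keep=10):
        self.directory = directory
        self.max_to_keep = max_to_keep
        self.saved = []          # paths of periodic checkpoints, oldest first
        self.best_metric = None  # best validation metric seen so far
        self.best_epoch = None

    def save(self, epoch, metric, payload):
        path = os.path.join(self.directory, "ckpt-{:04d}.txt".format(epoch))
        with open(path, "w") as f:
            f.write(payload)
        self.saved.append(path)
        # Drop the earliest checkpoint once the limit is exceeded.
        while len(self.saved) > self.max_to_keep:
            os.remove(self.saved.pop(0))
        # Track the best checkpoint by validation metric (higher is better).
        if self.best_metric is None or metric > self.best_metric:
            self.best_metric = metric
            self.best_epoch = epoch
            with open(os.path.join(self.directory, "best.txt"), "w") as f:
                f.write(payload)

n_epochs = 2  # checkpoint every 2 epochs
keeper = CheckpointKeeper(tempfile.mkdtemp(), max_to_keep=3)
val_accuracies = [0.80, 0.85, 0.83, 0.90, 0.88, 0.87, 0.89, 0.86, 0.84, 0.82]
for epoch, acc in enumerate(val_accuracies, start=1):
    if epoch % n_epochs == 0:
        keeper.save(epoch, acc, "weights after epoch {}".format(epoch))

kept = sorted(os.listdir(keeper.directory))
print(kept, keeper.best_epoch)
```

The best checkpoint is written to its own file so that rotation never deletes it: in this run the best epoch (4) has already been rotated out of the periodic files, yet best.txt still holds it.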

Long Training Regime

In this type of training regime, you'll likely want to employ a similar strategy to the Normal regime - where you're saving multiple checkpoints every n_epochs and keeping track of the best one with respect to the validation metric that you care about. In this case, since the training can be very long, it's common to save checkpoints less frequently but maintain a greater number of checkpoints.

Which regime is right for me?

The tradeoff among these strategies is between how frequently you checkpoint and how many checkpoint files you keep. Let's take a look at what happens when we vary these two parameters:

| Frequency | Number of checkpoints | Cons | Pros |
|-----------|-----------------------|------|------|
| High | High | You need a lot of space!! | You can resume very quickly in almost all the interesting training states |
| High | Low | You could have lost precious states | Minimize the storage space you need |
| Low | High | It will take time to get to intermediate states | You can resume the experiments in a lot of interesting states |
| Low | Low | You could have lost precious states | Minimize the storage space you need |
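A rough back-of-the-envelope calculation can put numbers on the four rows above. The helper below is a hypothetical sketch: the 100-epoch run and the 50 MB checkpoint size are made-up inputs, and "worst-case epochs to redo" assumes a failure just before the next scheduled save.

```python
def checkpoint_tradeoff(total_epochs, save_every, max_to_keep, size_mb):
    """Return (files_on_disk, disk_mb, worst_case_epochs_to_redo)."""
    written = total_epochs // save_every  # checkpoints written over the run
    on_disk = min(written, max_to_keep)   # older ones are replaced
    disk_mb = on_disk * size_mb
    # A failure just before the next save loses the work since the last one.
    worst_case_redo = save_every
    return on_disk, disk_mb, worst_case_redo

# (frequency, number kept) -> arguments; all values are illustrative.
scenarios = {
    ("high", "high"): (100, 1, 100, 50),
    ("high", "low"): (100, 1, 5, 50),
    ("low", "high"): (100, 10, 10, 50),
    ("low", "low"): (100, 10, 2, 50),
}
for name, args in scenarios.items():
    print(name, checkpoint_tradeoff(*args))
```

Under these assumptions, frequency controls how much work you can lose, while the number of files kept controls both disk usage and how many earlier states remain available to branch from.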

Hopefully, now you have a good intuition about what might be the best checkpoint strategy for your training regime. It should go without saying that you can obviously develop your own custom checkpoint strategy based on your experiment needs! These are just tips and best practices that I take into consideration for my own projects.

Save and Resume on FloydHub

Now, let's dive into some code on FloydHub. I'll show you how to save checkpoints in three popular deep learning frameworks available on FloydHub: TensorFlow, Keras, and PyTorch.

Before you start, log into the FloydHub command-line-tool with the floyd login command, then fork and init the project:

$ git clone https://github.com/floydhub/save-and-resume.git
$ cd save-and-resume
$ floyd init save-and-resume

For our checkpointing examples, we'll be using the Hello, World of deep learning: the MNIST classification task using a Convolutional Neural Network model.

Because it's always important to be clear about our checkpointing strategy up-front, I'll state the approach we're going to be taking:

  • Keep only one checkpoint
  • Trigger the strategy at the end of every epoch
  • Save the one with the best (maximum) validation accuracy

Considering this toy example, we can employ the Short Training Regime strategy. Feel free to adapt this for your own more complicated experiments!

The commands

Before we dive into specific working examples, let's outline the basic commands you'll need. When starting a new job, your first command will look something like this:

floyd run \
    [--gpu] \
    --env <env> \
    --data <your_dataset>:<mounting_point_dataset> \
    "python <script_and_parameters>"

Important note: within your Python script, you'll want to make sure that the checkpoint is being saved to the /output folder. FloydHub will automatically save the contents of the /output directory as a job's Output, which is how you'll be able to leverage these checkpoints to resume jobs.

Once your job has been completed, you'll then be able to mount that job's output as an input to your next job - allowing your script to load the checkpoint you created in the next run of this project.

floyd run \
    [--gpu] \
    --env <env> \
    --data <your_dataset>:<mounting_point_dataset> \
    --data <output_of_previous_job>:<mounting_point_model> \
    "python <script_and_parameters>"

Okay, enough of that. Let's see how to make this tangible using three of the most popular frameworks on FloydHub.

TensorFlow

View full example on a FloydHub Jupyter Notebook

TensorFlow provides different ways to save and resume a checkpoint. In our example, we will use the tf.estimator.Estimator API, which uses tf.train.Saver, tf.train.CheckpointSaverHook and tf.saved_model.builder.SavedModelBuilder behind the scenes.

To be more clear, the Estimator uses the first to save the checkpoint, the second to act according to the adopted checkpointing strategy, and the last to export the model for serving with the export_savedmodel() method.

Let's dig in.

Saving a TensorFlow checkpoint

Before initializing an Estimator, we have to define the checkpoint strategy. To do so, we have to create a configuration for the Estimator using the RunConfig API (the example below uses tf.contrib.learn.RunConfig). Here's an example of how we might do this:

import tensorflow as tf

# Save the checkpoint in the /output folder
filepath = "/output/mnist_convnet_model"

# Checkpoint Strategy configuration
run_config = tf.contrib.learn.RunConfig(
    model_dir=filepath,
    keep_checkpoint_max=1)

In this way, we're telling the estimator which directory to save or resume a checkpoint from, and also how many checkpoints to keep.

Next, we have to provide this configuration at the initialization of the Estimator:

# Create the Estimator
mnist_classifier = tf.estimator.Estimator(
      model_fn=cnn_model_fn, config=run_config)

That's it. Seriously. We're now set up to save checkpoints in our TensorFlow code.

Resuming a TensorFlow checkpoint

Guess what? We're also already set up to resume from checkpoints in our next experiment run. If the Estimator finds a checkpoint inside the given model folder, it will load from the last checkpoint.
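To build some intuition for what "load from the last checkpoint" means, here is a small, TensorFlow-free sketch that picks the newest checkpoint in a folder by its step number. The helper latest_checkpoint_prefix is hypothetical and is not how the Estimator does it internally; it only assumes the model.ckpt-<step>.index file naming that TensorFlow 1.x typically writes. In real code, tf.train.latest_checkpoint(model_dir) performs this lookup for you.

```python
import os
import re
import tempfile

def latest_checkpoint_prefix(model_dir):
    """Return the checkpoint prefix with the highest step number, or None."""
    pattern = re.compile(r"^(model\.ckpt-(\d+))\.index$")
    best_step, best_prefix = -1, None
    for name in os.listdir(model_dir):
        match = pattern.match(name)
        if match and int(match.group(2)) > best_step:
            best_step = int(match.group(2))
            best_prefix = os.path.join(model_dir, match.group(1))
    return best_prefix

# Simulate a model folder containing three checkpoints (empty stand-in files).
model_dir = tempfile.mkdtemp()
for step in (1, 200, 1000):
    open(os.path.join(model_dir, "model.ckpt-{}.index".format(step)), "w").close()

print(latest_checkpoint_prefix(model_dir))
```

An empty folder yields None, which corresponds to the first-time training case where the Estimator starts from scratch.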

Okay, let me try

Don't take my word for it - try it out yourself. Here are the steps to run the TensorFlow checkpointing example on FloydHub.

Via FloydHub's Command Mode

First time training command:

floyd run \
    --gpu \
    --env tensorflow-1.3 \
    --data redeipirati/datasets/mnist/1:input \
    'python tf_mnist_cnn.py'

  • The --env flag specifies the environment that this project should run on (TensorFlow 1.3.0 + Keras 2.0.6 on Python 3.6)
  • The --data flag specifies that the mnist dataset should be available at the /input directory
  • The --gpu flag is optional - include it if you want to run the code on a GPU machine right away

Resuming from your checkpoint:

floyd run \
    --gpu \
    --env tensorflow-1.3 \
    --data redeipirati/datasets/mnist/1:input \
    --data <your-username>/projects/save-and-resume/<jobs>/output:/model \
    'python tf_mnist_cnn.py'

  • The --env flag specifies the environment that this project should run on (TensorFlow 1.3.0 + Keras 2.0.6 on Python 3.6)
  • The first --data flag specifies that the mnist dataset should be available at the /input directory
  • The second --data flag specifies that the output of a previous Job should be available at the /model directory
  • The --gpu flag is optional - include it if you want to run the code on a GPU machine right away

Via FloydHub's Jupyter Notebook Mode

floyd run \
    --gpu \
    --env tensorflow-1.3 \
    --data redeipirati/datasets/mnist/1:input \
    --mode jupyter

  • The --env flag specifies the environment that this project should run on (TensorFlow 1.3.0 + Keras 2.0.6 on Python 3.6)
  • The --data flag specifies that the mnist dataset should be available at the /input directory
  • The --gpu flag is optional - include it if you want to run the code on a GPU machine right away
  • The --mode flag specifies that this job should provide a Jupyter notebook instance

Resuming from your checkpoint:

Just add --data <your-username>/projects/save-and-resume/<jobs>/output:/model to the previous command if you want to load a checkpoint from a previous Job in your Jupyter notebook.

Keras

View full example on a FloydHub Jupyter Notebook

Keras provides a great API for saving and loading checkpoints. Let's take a look:

Saving a Keras checkpoint

Keras provides a set of functions called callbacks: you can think of callbacks as events that will be triggered at certain training states. The callback we need for checkpointing is the ModelCheckpoint which provides all the features we need according to the checkpointing strategy we adopted in our example.

Note: by default, ModelCheckpoint saves the full model (architecture, weights, and optimizer state); pass save_weights_only=True if you only want the weights. If you want to save other combinations of components, you can take a look at the Keras docs on saving a model.

First up, we have to import the callback functions:

from keras.callbacks import ModelCheckpoint

Next, just before the call to model.fit(...), it's time to prepare the checkpoint strategy.

# Save the checkpoint in the /output folder
filepath = "/output/mnist-cnn-best.hdf5"

# Keep only a single checkpoint, the best over test accuracy.
checkpoint = ModelCheckpoint(filepath,
                            monitor='val_acc',
                            verbose=1,
                            save_best_only=True,
                            mode='max')
  • filepath="/output/mnist-cnn-best.hdf5": Remember, FloydHub will save the contents of the /output folder! See more on job output in the FloydHub docs
  • monitor='val_acc': This is the metric we care about - validation accuracy
  • verbose=1: Print more information during training
  • save_best_only=True: Keep only the best checkpoint (in terms of maximum validation accuracy)
  • mode='max': Save the checkpoint with max validation accuracy

By default, the period (or checkpointing frequency) is set to 1, which means at the end of every epoch.

For more information (such as filepath formatting options, checkpointing period, and more), you can explore the Keras ModelCheckpoint API.
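As a quick illustration of the filepath formatting options mentioned above: the ModelCheckpoint filepath may contain named fields that are filled in from the epoch number and the logged metrics, using ordinary Python string formatting. The template and metric values below are made up for the example, and no Keras import is needed to see how the name is built.

```python
# A filepath template with named fields, in the style the Keras docs describe.
template = "/output/weights.{epoch:02d}-{val_acc:.2f}.hdf5"

# Hypothetical values for one epoch; during training Keras supplies the
# epoch number and the logged metrics.
logs = {"val_acc": 0.9876, "val_loss": 0.0412}
filename = template.format(epoch=3, **logs)
print(filename)  # /output/weights.03-0.99.hdf5
```

With a fixed filename, as in our example, each new best checkpoint overwrites the previous one; with a template like this, each saved epoch gets its own file, so keep an eye on disk usage.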

Finally, we are ready to see this checkpointing strategy applied during model training. In order to do this, we need to pass the callback variable to the model.fit(...) call:

# Train
model.fit(x_train, y_train,
                batch_size=batch_size,
                epochs=epochs,
                verbose=1,
                validation_data=(x_test, y_test),
                callbacks=[checkpoint])  # <- Apply our checkpoint strategy

According to our chosen strategy, you will see:

# This line when the training reaches a new max
Epoch < n_epoch >: val_acc improved from < previous val_acc > to < new max val_acc >, saving model to /output/mnist-cnn-best.hdf5

# Or this line
Epoch < n_epoch >: val_acc did not improve

That's it - you're now set up to save your Keras checkpoints.

Resuming a Keras checkpoint

Keras models provide the load_weights() method, which loads the weights from an HDF5 file.

To load the model's weights, you just need to add this line after the model definition:

... # Model Definition

model.load_weights(resume_weights)
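Since the same script is used for both the first run and the resumed run, it helps to load the weights only when the checkpoint file is actually there. Below is a minimal sketch of that guard. The maybe_load_weights helper and the FakeModel stand-in are hypothetical (FakeModel exists only so the snippet runs without Keras installed); with a real Keras model you would pass the model itself.

```python
import os
import tempfile

def maybe_load_weights(model, resume_weights):
    """Load weights if the checkpoint file exists; return True when loaded."""
    if os.path.isfile(resume_weights):
        model.load_weights(resume_weights)
        return True
    return False

class FakeModel:
    """Stand-in for a Keras model, exposing only load_weights()."""
    def __init__(self):
        self.loaded_from = None

    def load_weights(self, path):
        self.loaded_from = path

# First run: nothing is mounted, so the path does not exist.
missing = os.path.join(tempfile.mkdtemp(), "mnist-cnn-best.hdf5")
first = maybe_load_weights(FakeModel(), missing)

# Resumed run: a previous job's output is available (simulated by an empty file).
present = os.path.join(tempfile.mkdtemp(), "mnist-cnn-best.hdf5")
open(present, "wb").close()
model = FakeModel()
second = maybe_load_weights(model, present)
print(first, second)
```

On FloydHub, assuming the previous job's output is mounted at /model as in the commands in this post, resume_weights would presumably be /model/mnist-cnn-best.hdf5.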

Okay, let me try

Here's how you can run this Keras example on FloydHub:

Via FloydHub's Command Mode

First time training command:

floyd run \
    --gpu \
    --env tensorflow-1.3 \
    'python keras_mnist_cnn.py'

  • The --env flag specifies the environment that this project should run on (TensorFlow 1.3.0 + Keras 2.0.6 on Python 3.6)
  • The --gpu flag is optional - include it if you want to run the code on a GPU machine right away

Keras provides an API to handle MNIST data, so we can skip the dataset mounting in this case.

Resuming from your checkpoint:

floyd run \
    --gpu \
    --env tensorflow-1.3 \
    --data <your-username>/projects/save-and-resume/<jobs>/output:/model \
    'python keras_mnist_cnn.py'

  • The --env flag specifies the environment that this project should run on (TensorFlow 1.3.0 + Keras 2.0.6 on Python 3.6)
  • The --data flag specifies that the output of a previous Job should be available at the /model directory
  • The --gpu flag is optional - include it if you want to run the code on a GPU machine right away

Via FloydHub's Jupyter Notebook Mode

floyd run \
    --gpu \
    --env tensorflow-1.3 \
    --mode jupyter

  • The --env flag specifies the environment that this project should run on (TensorFlow 1.3.0 + Keras 2.0.6 on Python 3.6)
  • The --gpu flag is optional - include it if you want to run the code on a GPU machine right away
  • The --mode flag specifies that this job should provide a Jupyter notebook instance

Resuming from your checkpoint:

Just add --data <your-username>/projects/save-and-resume/<jobs>/output:/model if you want to load a checkpoint from a previous job.

PyTorch

View full example on a FloydHub Jupyter Notebook

Unfortunately, at the moment, PyTorch does not have as easy an API as Keras for checkpointing. We'll need to write our own solution according to our chosen checkpointing strategy.

Saving a PyTorch checkpoint

PyTorch does not provide an all-in-one API to define a checkpointing strategy, but it does provide a simple way to save and resume a checkpoint. According to the official docs on serialization semantics, the best practice is to save only the weights, because a fully pickled model is tied to the exact classes and directory layout in use at save time and can break after code refactoring.

Therefore, let's take a look at how to save the model weights in PyTorch.

First up, let's define a save_checkpoint function which handles all the instructions about the number of checkpoints to keep and the serialization on file:

import torch

def save_checkpoint(state, is_best, filename='/output/checkpoint.pth.tar'):
    """Save checkpoint if a new best is achieved"""
    if is_best:
        print("=> Saving a new best")
        torch.save(state, filename)  # save checkpoint
    else:
        print("=> Validation Accuracy did not improve")

Then, inside the training (which is usually a for-loop of the number of epochs), we define the checkpoint frequency (in our case, at the end of every epoch) and the information we'd like to store (the epochs, model weights, and best accuracy achieved):

...

# Training the Model
for epoch in range(num_epochs):
    train(...)  # Train
    acc = eval(...)  # Evaluate after every epoch

    # Some stuff with acc(accuracy)
    ...

    # Convert the comparison result to a Python bool (not a ByteTensor)
    is_best = bool(acc.numpy() > best_accuracy.numpy())
    # Keep the larger value as a Tensor to track the best accuracy
    best_accuracy = torch.FloatTensor(max(acc.numpy(), best_accuracy.numpy()))
    # Save a checkpoint if this is a new best
    save_checkpoint({
        'epoch': start_epoch + epoch + 1,
        'state_dict': model.state_dict(),
        'best_accuracy': best_accuracy
    }, is_best)

That's it! You can now save checkpoints in your PyTorch experiments.

Resuming a PyTorch checkpoint

To resume a PyTorch checkpoint, we have to load the weights and the meta information we need before the training:

cuda = torch.cuda.is_available()
if cuda:
    checkpoint = torch.load(resume_weights)
else:
    # Load GPU model on CPU
    checkpoint = torch.load(resume_weights,
                            map_location=lambda storage, loc: storage)
start_epoch = checkpoint['epoch']
best_accuracy = checkpoint['best_accuracy']
model.load_state_dict(checkpoint['state_dict'])
print("=> loaded checkpoint '{}' (trained for {} epochs)".format(resume_weights, checkpoint['epoch']))

For more information on loading GPU-trained weights on a CPU instance, you can check out this PyTorch discussion.
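To see how start_epoch and best_accuracy carry over between jobs, here is a toy simulation of two consecutive runs that follows the same save-only-the-best strategy as the code above. It is a sketch under stated assumptions: pickle stands in for torch.save / torch.load, the accuracy values are invented, and run_training is a hypothetical helper, not part of the original example.

```python
import os
import pickle
import tempfile

def run_training(num_epochs, accuracies, checkpoint_path):
    """Toy training loop that resumes from checkpoint_path if it exists."""
    start_epoch, best_accuracy = 0, 0.0
    if os.path.isfile(checkpoint_path):
        with open(checkpoint_path, "rb") as f:
            checkpoint = pickle.load(f)
        start_epoch = checkpoint["epoch"]
        best_accuracy = checkpoint["best_accuracy"]

    for epoch in range(num_epochs):
        acc = accuracies[start_epoch + epoch]  # stand-in for train + eval
        is_best = acc > best_accuracy
        best_accuracy = max(acc, best_accuracy)
        if is_best:  # save only when a new best is achieved
            with open(checkpoint_path, "wb") as f:
                pickle.dump({"epoch": start_epoch + epoch + 1,
                             "best_accuracy": best_accuracy}, f)
    return start_epoch, best_accuracy

path = os.path.join(tempfile.mkdtemp(), "checkpoint.pkl")
accuracies = [0.90, 0.95, 0.93, 0.96, 0.97]  # accuracy after epochs 1..5

first = run_training(3, accuracies, path)   # fresh start, runs epochs 1-3
second = run_training(2, accuracies, path)  # resumed run
print(first, second)
```

Note a consequence of keeping only the best checkpoint: the first run's best result came at epoch 2, so the resumed run starts from epoch 2 and repeats epoch 3 rather than continuing from the last epoch that was trained.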

Okay, let me try

Here's how you can run this PyTorch example on FloydHub:

Via FloydHub's Command Mode

First time training command:

floyd run \
    --gpu \
    --env pytorch-0.2 \
    --data redeipirati/datasets/pytorch-mnist/1:input \
    'python pytorch_mnist_cnn.py'

  • The --env flag specifies the environment that this project should run on (PyTorch 0.2.0 on Python 3)
  • The --data flag specifies that the pytorch-mnist dataset should be available at the /input directory
  • The --gpu flag is optional - include it if you want to run the code on a GPU machine right away

Resuming from your checkpoint:

floyd run \
    --gpu \
    --env pytorch-0.2 \
    --data redeipirati/datasets/pytorch-mnist/1:input \
    --data <your-username>/projects/save-and-resume/<jobs>/output:/model \
    'python pytorch_mnist_cnn.py'

  • The --env flag specifies the environment that this project should run on (PyTorch 0.2.0 on Python 3)
  • The first --data flag specifies that the pytorch-mnist dataset should be available at the /input directory
  • The second --data flag specifies that the output of a previous Job should be available at the /model directory
  • The --gpu flag is optional - include it if you want to run the code on a GPU machine right away

Via FloydHub's Jupyter Notebook Mode

floyd run \
    --gpu \
    --env pytorch-0.2 \
    --data redeipirati/datasets/pytorch-mnist/1:input \
    --mode jupyter

  • The --env flag specifies the environment that this project should run on (PyTorch 0.2.0 on Python 3)
  • The --data flag specifies that the pytorch-mnist dataset should be available at the /input directory
  • The --gpu flag is optional - include it if you want to run the code on a GPU machine right away
  • The --mode flag specifies that this job should provide a Jupyter notebook instance

Resuming from your checkpoint:

Just add --data <your-username>/projects/save-and-resume/<jobs>/output:/model if you want to load a checkpoint from a previous Job.

 
