PyTorch DataLoaders

PyTorch DataLoaders are commonly used for:

Creating mini-batches
Speeding up the training process
Automatic data shuffling

In this tutorial, you will review several common examples of how to use DataLoaders and explore settings including dataset, batch_size, shuffle, num_workers, pin_memory and drop_last.

Level: Intermediate
Time: 10 minutes
Equipment: Google Chrome Browser

Introduction

PyTorch DataLoaders automatically create mini-batches of your dataset for the training process and speed up data loading. In this tutorial, you will learn when to use a DataLoader, how to create one for a vision task, and more about some of the advanced features. Let’s get started!

DataLoaders - What do they provide?

As you know, neural networks train best with batches of data. That is, instead of using the entire dataset in one training pass, we prefer mini-batch training, where we feed the training loop, say, 64 or 128 samples at a time. This is where PyTorch DataLoaders come in. A PyTorch DataLoader takes your raw dataset and automatically slices it into mini-batches. In addition, if your dataset has long runs of samples with the same label, you can use the shuffle option to have the samples shuffled as you feed them into the training loop. Finally, DataLoaders can also speed up the training and testing loops by loading data from disk in parallel worker processes, so the CPU, GPU or TPU running the model is not left waiting for the next batch.
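To make the slicing concrete, here is a minimal sketch. The toy dataset (10 samples with 3 features each) is a made-up stand-in, not part of the original tutorial; TensorDataset is used only so that the example runs without downloading anything.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Toy dataset (hypothetical): 10 samples with 3 features each, one label per sample.
features = torch.arange(30, dtype=torch.float32).reshape(10, 3)
labels = torch.arange(10)
dataset = TensorDataset(features, labels)

# Slice the dataset into mini-batches of 4 samples, in their stored order.
loader = DataLoader(dataset, batch_size=4)

# Each iteration yields one (inputs, labels) mini-batch.
batch_sizes = [x.shape[0] for x, y in loader]
print(batch_sizes)  # [4, 4, 2]
```

Note that the last batch holds only the 2 leftover samples, because 10 is not divisible by 4.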

Syntax and Common Examples

Now let’s take a look at the syntax of some of the most commonly used dataloader options. The full signature can be a bit daunting at first, but you will usually only use a small subset of the available options. The most common ones include:

dataset - the dataset itself
batch_size - number of samples we want to pass into the training loop at each iteration
shuffle - whether to reshuffle the data at the start of each epoch (optional)

The shuffle option is helpful if you have a lot of the same labels sequentially in your dataset. For example, if you have a dataset containing pictures of dogs and cats, it would be good if each mini-batch had a mix of both dogs and cats. Some datasets are stored sequentially, so when the batches are built, each batch ends up with just dogs or just cats but not a mix. Setting shuffle=True eliminates that problem during training.

As an example, a simple dataloader takes in our training dataset, creates batches of size 64 and reshuffles the samples at the start of each epoch.
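A dataloader like the one just described could look like the sketch below. The dataset is a hypothetical stand-in (random tensors with 500 "cat" labels stored before 500 "dog" labels), chosen to mimic the sequential-label situation; the names train_data and train_loader are assumptions, not taken from the original slide.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Stand-in for a real training set (hypothetical): 500 "cats" (label 0)
# stored first, followed by 500 "dogs" (label 1).
images = torch.randn(1000, 1, 28, 28)
labels = torch.cat([torch.zeros(500, dtype=torch.long),
                    torch.ones(500, dtype=torch.long)])
train_data = TensorDataset(images, labels)

# Batches of 64 samples, reshuffled at the start of every epoch.
train_loader = DataLoader(train_data, batch_size=64, shuffle=True)

x, y = next(iter(train_loader))
print(x.shape)     # torch.Size([64, 1, 28, 28])
print(y.unique())  # with shuffling, a batch will almost certainly hold both labels
```

Without shuffle=True, the first batch here would contain only label 0, since the first 500 samples are all "cats".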

Using DataLoaders During Training

After creating the dataloader, I can use it in our training loop. Here we simply iterate across the entire training dataset, grabbing a batch of inputs (x) and their associated targets (the labels, y). The rest is just vanilla PyTorch training code: I pass my input x into my model and then compare the output of my model (pred) to what it should have been (y, from my dataloader).
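A training loop along these lines might look like the following sketch. The model, loss function, optimizer and random data are all placeholder assumptions chosen to keep the example self-contained; only the names x, y and pred come from the text above.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical stand-ins: random "images" with random labels and a tiny linear classifier.
train_data = TensorDataset(torch.randn(256, 1, 28, 28), torch.randint(0, 10, (256,)))
train_loader = DataLoader(train_data, batch_size=64, shuffle=True)

model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

model.train()
num_batches = 0
for x, y in train_loader:      # one pass over the training data (one epoch)
    pred = model(x)            # forward pass
    loss = loss_fn(pred, y)    # compare predictions (pred) to targets (y)
    optimizer.zero_grad()      # clear gradients left over from the previous step
    loss.backward()            # backpropagate
    optimizer.step()           # update the weights
    num_batches += 1
```

The dataloader only decides how samples are grouped and ordered; everything inside the loop body is independent of it.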

Creating DataLoaders for Training and Testing Datasets

We typically want multiple datasets so we can check that our model generalizes. You should always have at least one training dataset and a separate testing dataset (a validation set is usually used as well). Here we grab the MNIST dataset included in torchvision. We download the training dataset by passing in train=True and then grab the testing dataset by passing in train=False.

We then create two separate dataloaders, one for training and one for testing. I usually turn on shuffling for the training dataset so that each mini-batch gets a good random mix of samples. So here we have two different loaders, one for my training loop and the other for my testing loop. Just like the training loop, I can now build my test loop using this test loader by creating a for loop that iterates across all the inputs and outputs.
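The steps above can be sketched as follows. The helper names load_mnist, make_loaders and evaluate are hypothetical, as are the root="data" directory and the ToTensor transform. The MNIST download is wrapped in a function that is not called here, so the block runs without network access.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

def load_mnist(root="data"):
    """Download MNIST via torchvision and return (train_data, test_data)."""
    from torchvision import datasets, transforms
    to_tensor = transforms.ToTensor()
    train_data = datasets.MNIST(root=root, train=True, download=True, transform=to_tensor)
    test_data = datasets.MNIST(root=root, train=False, download=True, transform=to_tensor)
    return train_data, test_data

def make_loaders(train_data, test_data, batch_size=64):
    """Return (train_loader, test_loader); only the training data is shuffled."""
    train_loader = DataLoader(train_data, batch_size=batch_size, shuffle=True)
    test_loader = DataLoader(test_data, batch_size=batch_size, shuffle=False)
    return train_loader, test_loader

def evaluate(model, loader):
    """Test loop: return the fraction of correctly classified samples."""
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():  # gradients are not needed for testing
        for x, y in loader:
            pred = model(x)
            correct += (pred.argmax(dim=1) == y).sum().item()
            total += y.size(0)
    return correct / total

# Typical usage (downloads MNIST on the first call, so it is left commented out):
# train_data, test_data = load_mnist()
# train_loader, test_loader = make_loaders(train_data, test_data)
# accuracy = evaluate(trained_model, test_loader)
```

Shuffling the test data is unnecessary, since the order of samples does not change the accuracy.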

DataLoader Advanced Options

Some of the more advanced options you can get into are the following:

num_workers - how many parallel subprocesses to use for loading data during training or validation. I recommend you start with zero (data is then loaded in the main process) until you verify your code is fully working and your model is training correctly. At that point you can increase this value to 1 or even up to 4. If your machine has many CPU cores and several GPUs to keep fed, you can go beyond that as well. Experiment and track your memory usage, since each worker is a separate process.
pin_memory - if you are using a CUDA device (GPU), setting this to True places each batch in pinned (page-locked) host memory, which speeds up the copy from host memory to the GPU
drop_last - if your total dataset size cannot be evenly divided by your batch size, you can opt to drop the last batch, which would otherwise be smaller than batch_size. This ensures that all batches have the same size.

As an example, consider a dataloader for the training dataset with a batch size of 64, shuffling enabled and the number of workers set to 4, with pin_memory set to True and drop_last set to True because the dataset size is not divisible by the batch size.
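A dataloader with these options could be built as in the sketch below. The helper make_train_loader and the 1000-sample dataset are hypothetical. As an assumption of this sketch, pin_memory is tied to CUDA availability so the code behaves the same on CPU-only machines, and the demonstration uses num_workers=0, following the advice to start with zero.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

def make_train_loader(dataset, batch_size=64, num_workers=4):
    """Build a training dataloader using the advanced options discussed above."""
    return DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=True,
        num_workers=num_workers,               # loader subprocesses; 0 loads in the main process
        pin_memory=torch.cuda.is_available(),  # only helps when batches are copied to a CUDA device
        drop_last=True,                        # discard the final, smaller batch
    )

# Hypothetical dataset: 1000 samples is not divisible by 64 (15 * 64 = 960).
dataset = TensorDataset(torch.randn(1000, 8), torch.randint(0, 2, (1000,)))

# Start with num_workers=0 while debugging, then raise it (for example to 4).
loader = make_train_loader(dataset, num_workers=0)
print(len(loader))  # 15 full batches; the 40 leftover samples are dropped
```

On platforms that start worker processes with spawn (Windows and macOS), code that iterates a loader with num_workers greater than zero should sit under an if __name__ == "__main__": guard.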

Pulling it All Together – CIFAR10 Example

CIFAR10 is a vision dataset which has ten labels. The example code is also available as a Google Colab notebook. In the example, you will download the training and testing datasets, create the dataloaders, build a simple vision model and then train it using the train_loader. Finally, I show you how to use the testing_loader to test your vision model after training.
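The original notebook is not reproduced here. As a rough sketch of what such a pipeline could look like, the block below defines a small convolutional model (the SmallCNN class and train_one_epoch helper are hypothetical, not the tutorial's own code) and trains it for one epoch on random tensors shaped like CIFAR10 images, so that it runs without a download.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

class SmallCNN(nn.Module):
    """A deliberately small CNN for 3x32x32 inputs (the shape of CIFAR10 images)."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),   # 32x32 -> 16x16
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),  # 16x16 -> 8x8
        )
        self.classifier = nn.Linear(32 * 8 * 8, num_classes)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

def train_one_epoch(model, loader, loss_fn, optimizer):
    """Run one pass over `loader` and return the average loss per sample."""
    model.train()
    total_loss = 0.0
    for x, y in loader:
        loss = loss_fn(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        total_loss += loss.item() * y.size(0)
    return total_loss / len(loader.dataset)

# Random stand-in data. For the real dataset, swap in
# torchvision.datasets.CIFAR10(root="data", train=True, download=True,
#                              transform=torchvision.transforms.ToTensor()).
fake_cifar = TensorDataset(torch.randn(128, 3, 32, 32), torch.randint(0, 10, (128,)))
train_loader = DataLoader(fake_cifar, batch_size=32, shuffle=True)

model = SmallCNN()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
avg_loss = train_one_epoch(model, train_loader, nn.CrossEntropyLoss(), optimizer)
```

A testing loader is created the same way from the train=False split, with shuffling left off.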

How to Make it Better

This example is meant mainly to illustrate how to use dataloaders. There is a lot more you would want to add if you were building a real vision pipeline, including:

turning off gradient calculations
creating a validation dataset
increasing the number of epochs
implementing early stopping
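Several of these additions can be sketched together. The block below is one possible arrangement under stated assumptions: a random toy dataset, an 80/20 split made with random_split, a linear placeholder model, and a patience of 3 epochs for early stopping. None of these choices come from the original example.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset, random_split

torch.manual_seed(0)

# Hypothetical data: carve a validation set out of the training data.
full_train = TensorDataset(torch.randn(500, 20), torch.randint(0, 2, (500,)))
train_data, val_data = random_split(full_train, [400, 100])
train_loader = DataLoader(train_data, batch_size=64, shuffle=True)
val_loader = DataLoader(val_data, batch_size=64)

model = nn.Linear(20, 2)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

def validation_loss(model, loader):
    """Average loss per sample, computed with gradient tracking turned off."""
    model.eval()
    total = 0.0
    with torch.no_grad():
        for x, y in loader:
            total += loss_fn(model(x), y).item() * y.size(0)
    return total / len(loader.dataset)

best_loss, patience, bad_epochs, epochs_run = float("inf"), 3, 0, 0
for epoch in range(50):                 # upper bound on the number of epochs
    model.train()
    for x, y in train_loader:
        optimizer.zero_grad()
        loss_fn(model(x), y).backward()
        optimizer.step()
    epochs_run += 1
    val_loss = validation_loss(model, val_loader)
    if val_loss < best_loss:
        best_loss, bad_epochs = val_loss, 0   # improvement: reset the counter
    else:
        bad_epochs += 1
        if bad_epochs >= patience:            # no improvement for `patience` epochs
            break
```

In a real pipeline you would typically also save the model weights whenever the validation loss improves, so the best checkpoint can be restored after stopping.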


Conclusion

PyTorch DataLoaders are super powerful and a critical part of any PyTorch deep learning project. They help you automate the creation of mini-batches of data for the training process and also speed up the overall training and testing process through parallelization.
