Creating Image Data Set for Classification using Tensorflow
- Akshat Mandloi
- Apr 11, 2021
- 4 min read
Updated: Apr 16, 2021

This blog will take you through various methods of creating a tensorflow dataset for your deep learning use case. The typical use cases pertaining to images are classification, detection, or segmentation.
Data loading is usually the first step when creating a machine learning or deep learning pipeline. If not optimised properly, it can become a major bottleneck that slows down training, especially when pre-processing steps such as augmentations are heavy.
The best place to get references for building a data pipeline is this exhaustive guide by Tensorflow - tf.data: Build Tensorflow Input Pipelines. The Cassava Image Dataset is used as an example to illustrate the different data loading techniques for a classification problem.
Using the tf.data Tensorflow API
Step-1: Determine the type of data source
You may have the data source in memory, in the form of image files, or in the form of tensorflow records. Apart from images, the tf.data API also provides methods to consume csv files, numpy arrays, python generators and text files.
A quick look at various classification datasets reveals that data is usually provided in one of two formats: either a csv file mapping image ids to labels, with the training/test images in a single folder, or images of different labels placed in different folders.
Step-2: Determining the constructor to use
This step determines which constructor to use when loading the data. If the dataset is small and fits in memory, then Dataset.from_tensor_slices comes in handy and is the simplest way to create a tensorflow dataset. However, this method keeps everything in memory and therefore does not scale to large datasets.
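As a minimal sketch, assuming the images and labels already fit in memory as NumPy arrays (the shapes and number of classes below are purely illustrative):

```python
import numpy as np
import tensorflow as tf

# Hypothetical in-memory data: 100 RGB images with integer labels.
images = np.random.rand(100, 224, 224, 3).astype(np.float32)
labels = np.random.randint(0, 5, size=(100,))

# Each slice along the first axis becomes one (image, label) element.
dataset = tf.data.Dataset.from_tensor_slices((images, labels))

for image, label in dataset.take(1):
    print(image.shape, label.numpy())
```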
If the dataset is in the form of image files, a better way is to use generators via the Dataset.from_generator constructor. If the images of different labels sit in different folders, this constructor can be combined with tf.keras.preprocessing.image.ImageDataGenerator to create the dataset.
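A rough sketch of this combination, assuming a hypothetical train/ folder with one sub-folder per label and five classes:

```python
import tensorflow as tf

# ImageDataGenerator yields batches of (images, one-hot labels) from a folder
# whose sub-folders correspond to the class labels.
img_gen = tf.keras.preprocessing.image.ImageDataGenerator(rescale=1.0 / 255)

dataset = tf.data.Dataset.from_generator(
    lambda: img_gen.flow_from_directory("train/", target_size=(224, 224), batch_size=32),
    output_types=(tf.float32, tf.float32),
    output_shapes=([None, 224, 224, 3], [None, 5]),  # 5 classes assumed
)
```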
To read tfrecord files, pass the filenames to tf.data.TFRecordDataset. However, many projects store serialized tf.train.Example records in their TFRecord files, which need to be parsed before further use.
The snippets described below depict my implementation of creating a tf.data dataset with the Cassava Dataset using both image files and TFRecord files; the full code is linked at the end of this post.
1) Using Image Files -
The get_dataset function of this class reads the csv file containing the image_id (image path) and corresponding label into a tensorflow dataset using Dataset.from_tensor_slices, since the csv file is small.
Then, using the map function, each entry is processed to read the image file and its label, producing a dataset with the (image, label) specification.
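The exact implementation lives in the linked repository; a simplified sketch of the same pattern, assuming columns named image_id and label and JPEG images, might look like this:

```python
import pandas as pd
import tensorflow as tf

def load_image(image_path, label):
    # Read and decode the image file, then scale pixels to [0, 1] as float32.
    image_bytes = tf.io.read_file(image_path)
    image = tf.io.decode_jpeg(image_bytes, channels=3)
    image = tf.image.convert_image_dtype(image, tf.float32)
    image = tf.image.resize(image, [224, 224])
    return image, label

def get_dataset(csv_path, image_dir):
    # The csv is small, so it is loaded fully in memory with pandas.
    df = pd.read_csv(csv_path)
    image_paths = image_dir + "/" + df["image_id"]
    dataset = tf.data.Dataset.from_tensor_slices((image_paths.values, df["label"].values))
    # Map each (path, label) pair to a decoded (image, label) pair.
    # tf.data.AUTOTUNE requires TF 2.4+; older versions use tf.data.experimental.AUTOTUNE.
    return dataset.map(load_image, num_parallel_calls=tf.data.AUTOTUNE)
```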
2) Using TFRecords -
The get_dataset function of this class reads the TFRecord files using tf.data.TFRecordDataset() and processes them with the read_tfrecord function. The decode_image function decodes the encoded image bytes and converts them to tf.float32 format.
In lines 33-34 the function takes in all the TFRecord files and maps them through the read_tfrecord function.
The format of a tfrecord is specified in lines 23-26 and passed to the tf.io.parse_single_example function in line 27, which decodes a single example according to that format. Lines 28-30 decode the image and label and return the (image, label) tuple to form the dataset.
The important point to note is that the format used for parsing must be the same as the one used when the tfrecords were created.
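A simplified sketch of this TFRecord path is shown below; the feature names image and target are assumptions here and must match whatever was written into the records:

```python
import tensorflow as tf

# Assumed feature description: JPEG bytes plus an integer class label.
TFRECORD_FORMAT = {
    "image": tf.io.FixedLenFeature([], tf.string),
    "target": tf.io.FixedLenFeature([], tf.int64),
}

def decode_image(image_bytes):
    # Decode the JPEG bytes and cast to float32 in [0, 1].
    image = tf.io.decode_jpeg(image_bytes, channels=3)
    return tf.image.convert_image_dtype(image, tf.float32)

def read_tfrecord(example):
    # Parse one serialized tf.train.Example according to the format above.
    parsed = tf.io.parse_single_example(example, TFRECORD_FORMAT)
    image = decode_image(parsed["image"])
    label = tf.cast(parsed["target"], tf.int32)
    return image, label

def get_dataset(tfrecord_files):
    dataset = tf.data.TFRecordDataset(tfrecord_files)
    return dataset.map(read_tfrecord, num_parallel_calls=tf.data.AUTOTUNE)
```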
Step-3: Improving the Data Loader
Once we have a basic data loader that returns a dataset in the desired format, the next task is to fine-tune it for the given use case. This involves shuffling the dataset, batching, adding augmentations and other add-ons to reduce the training time.
a) Shuffle -
As suggested in this tutorial - Building an image data pipeline - the best practice is to set the buffer size to the length of the dataset while shuffling. The importance of shuffling the dataset correctly has been beautifully summarized in this answer on Stackoverflow. It discusses how improper shuffling can hurt training, and also the importance of a roughly equal distribution of all classes in each batch for proper training.
To shuffle a tensorflow dataset, simply use dataset.shuffle(buffer_size=$buffer_size).
b) Batch-
The batch size is a hyperparameter chosen mainly based on memory and performance. To batch a dataset use dataset.batch($batch_size).
In some cases one of the components of the dataset may have varying length, e.g. the number of labels may differ for each image. In such cases dataset.padded_batch($batch_size, padded_shapes=(None,)) can be useful.
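A toy example of padded_batch on variable-length elements:

```python
import tensorflow as tf

# Elements of length 1, 2, 3 and 4 respectively.
dataset = tf.data.Dataset.range(1, 5)
dataset = dataset.map(lambda x: tf.fill([x], x))

# Each batch is padded (with zeros) to the length of its longest element.
dataset = dataset.padded_batch(2, padded_shapes=(None,))
for batch in dataset:
    print(batch.numpy())
# [[1 0] [2 2]] then [[3 3 3 0] [4 4 4 4]]
```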
c) Repeat -
In order to use the dataset for multiple epochs, use dataset.repeat($epochs).
However, it should be noted that for a clear separation between epochs, put dataset.shuffle() and dataset.batch() before dataset.repeat().
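Putting the three together, assuming dataset is the (image, label) dataset built earlier and using hypothetical values for the dataset size, batch size and epoch count:

```python
NUM_EXAMPLES = 1000   # hypothetical dataset size
BATCH_SIZE = 32
NUM_EPOCHS = 10

# Shuffle over the whole dataset, batch, then repeat, so that every epoch
# sees each example exactly once, in a new order.
dataset = (dataset
           .shuffle(buffer_size=NUM_EXAMPLES)
           .batch(BATCH_SIZE)
           .repeat(NUM_EPOCHS))
```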
d) Augment -
Once the images have been decoded and cast to tf.float32 format, a number of tf.image APIs can be used to perform augmentations such as random flips, random crops and almost any other preprocessing that needs to be performed.
The important point to note is that it is usually smarter to apply the preprocessing function after the batch operation, so that it works on whole batches at once.
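As a sketch, an augmentation step applied to whole batches could look like this; the random flips and brightness jitter are illustrative choices, not the only options:

```python
import tensorflow as tf

def augment(images, labels):
    # These tf.image ops accept batched 4-D inputs as well as single images.
    images = tf.image.random_flip_left_right(images)
    images = tf.image.random_flip_up_down(images)
    images = tf.image.random_brightness(images, max_delta=0.1)
    return images, labels

# Mapping after batch() lets one augment call process an entire batch.
dataset = dataset.batch(32).map(augment, num_parallel_calls=tf.data.AUTOTUNE)
```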
The best practices mentioned here can be used to optimise the data pipeline.
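One of the most common of these optimisations is to parallelise the expensive map calls and prefetch batches so that data loading overlaps with training; a minimal sketch, reusing the load_image function from the earlier example:

```python
AUTOTUNE = tf.data.AUTOTUNE

dataset = (dataset
           .map(load_image, num_parallel_calls=AUTOTUNE)  # decode images in parallel
           .batch(32)
           .prefetch(AUTOTUNE))  # prepare the next batches while the model trains
```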
The full implementation, linked at the end of this post, includes the above-mentioned methods and serves as an example of a full-fledged image data pipeline that can be effectively used for training. A further improvement to this code would be to perform the augmentations after batching.
This blog took you through the basics of the tf.data API and how to use it for building data pipelines, specifically for image datasets. Further improvements to this basic version can be found in the official documentation of the tf.data guide here.
The code included in this blog can be found at https://github.com/akki2503/the_learning_path_project/blob/main/data_loader.py