How to Create a Dataset for an Image Classification Experiment

This tutorial shows how to create a dataset for an image classification experiment in Cogniflow.

In this example, we create a two-class experiment to train a model that recognizes whether an image shows a dog or a cat. The two classes are therefore dogs and cats.

If your images will always contain a dog or a cat, these two classes are enough. If images containing other kinds of objects may also appear, we recommend creating a third class, other. This class contains assorted images without cats or dogs, so the model learns to recognize them instead of returning a false result. For example, an image of a tree should be classified as "other". It also helps if the other class has around five times as many images as the largest of the remaining classes. This way the model learns to recognize dogs, cats, and anything that is neither.
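As a quick illustration of the five-times guideline, the arithmetic can be sketched as below. The function name and the class counts are hypothetical, chosen only for this example.

```python
def suggested_other_size(class_counts, factor=5):
    """Rough target size for the "other" class: factor times the largest class."""
    return factor * max(class_counts.values())


# Hypothetical counts: 400 dog images and 350 cat images.
print(suggested_other_size({"dogs": 400, "cats": 350}))  # 2000
```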

1. Create a Folder for Each Image Class

On your computer, or wherever your images are stored, create one folder per class and name each folder after the class it represents. In this case two folders are created: dogs and cats.

After creating all the folders, move each image into its corresponding folder: cat images go to cats, and dog images go to dogs.

Images can be in JPG or PNG format.
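Before compressing the folders, it can help to check how many images each class folder holds and whether any file has an unexpected extension. The following is a minimal sketch, not part of Cogniflow: the function name is hypothetical, and the allowed extensions are limited to the two formats named above (add ".jpeg" if your files use that spelling).

```python
from pathlib import Path

# Extensions taken from the formats named in this tutorial (an assumption
# about spelling; extend the set if your files use ".jpeg").
ALLOWED_EXTENSIONS = {".jpg", ".png"}


def summarize_dataset(root):
    """Return {class_name: (image_count, [files_to_review])} per class folder."""
    summary = {}
    for class_dir in sorted(Path(root).iterdir()):
        if not class_dir.is_dir():
            continue  # ignore loose files next to the class folders
        files = [p for p in class_dir.iterdir() if p.is_file()]
        images = [p for p in files if p.suffix.lower() in ALLOWED_EXTENSIONS]
        to_review = sorted(
            p.name for p in files if p.suffix.lower() not in ALLOWED_EXTENSIONS
        )
        summary[class_dir.name] = (len(images), to_review)
    return summary
```

Calling `summarize_dataset("my_dataset")` on a folder that contains `dogs` and `cats` returns one entry per class, with the image count and the names of any files worth a second look.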

2. Create the ZIP File

Select all folders and compress them into a single ZIP file. Now the dataset is ready to be uploaded to Cogniflow.
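If you prefer to build the ZIP file from a script instead of your file manager, the step above can be sketched with Python's standard `zipfile` module. The function name and paths are hypothetical; the sketch assumes the class folders sit directly under one root folder and places them at the top level of the archive, as happens when you select the folders and compress them.

```python
import zipfile
from pathlib import Path


def zip_class_folders(dataset_root, zip_path):
    """Compress every file under dataset_root into a single ZIP archive."""
    dataset_root = Path(dataset_root)
    zip_path = Path(zip_path)
    # Collect the files first so the archive never tries to include itself.
    files = sorted(
        p for p in dataset_root.rglob("*")
        if p.is_file() and p.resolve() != zip_path.resolve()
    )
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for file_path in files:
            # Store paths relative to the root, e.g. "dogs/rex.jpg".
            archive.write(file_path, file_path.relative_to(dataset_root).as_posix())
    return zip_path
```

For example, `zip_class_folders("my_dataset", "dataset.zip")` produces an archive whose entries look like `cats/...` and `dogs/...`.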

3. Upload your Dataset

When creating your experiment, you will reach the step where a dataset has to be uploaded. Click "Browse your files" and upload the ZIP file you generated earlier. Cogniflow will automatically split your dataset into train and validation subsets, with 80% of the data for training and 20% for validation. The split is done randomly.

If you prefer, you can also upload a second ZIP file containing a validation subset you have already prepared: click "Advanced options" and then "Browse your files" again. In that case Cogniflow does not run the dataset split. Build the validation subset the same way as described in step 1.

When the upload is complete, click "Next step".
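If you want to prepare your own validation subset, one possible approach is to move a random share of each class folder into a parallel folder tree and then compress that tree separately. The sketch below is an illustration under assumptions, not Cogniflow's own procedure: the function name, the 20% default, and the per-class sampling are choices made for this example.

```python
import random
import shutil
from pathlib import Path


def split_validation(train_root, val_root, val_fraction=0.2, seed=0):
    """Move a random fraction of each class folder from train_root to val_root."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    moved = {}
    for class_dir in sorted(Path(train_root).iterdir()):
        if not class_dir.is_dir():
            continue
        images = sorted(p for p in class_dir.iterdir() if p.is_file())
        rng.shuffle(images)
        n_val = int(len(images) * val_fraction)  # rounds down
        target = Path(val_root) / class_dir.name
        target.mkdir(parents=True, exist_ok=True)
        for image in images[:n_val]:
            shutil.move(str(image), str(target / image.name))
        moved[class_dir.name] = n_val
    return moved
```

Sampling within each class keeps the class proportions the same in both subsets. After running it, the validation tree has the same folder names as the training tree, so it can be zipped in the same way.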

4. Check if Everything is OK

After you upload your dataset, the training process starts. You can double-check that the dataset was correctly created, uploaded, and split by clicking the "Dataset" tab. There you can see how much data each category and subset contains and how it is distributed. You can also download the data.

5. How Many Examples do I Need?

We recommend at least 300 to 500 images per category. In general, the more examples you provide, the more accurate the model will be.
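To compare a dataset against this recommendation, a small check like the one below can flag the classes that fall short. The function name is hypothetical, and the default threshold of 300 is the lower end of the range above.

```python
def classes_below_minimum(class_counts, minimum=300):
    """Return the classes whose image count is below the recommended minimum."""
    return {name: count for name, count in class_counts.items() if count < minimum}


# Counts taken from the "Tire or Not" example dataset in this article.
print(classes_below_minimum({"Tire": 274, "Other": 353}))  # {'Tire': 274}
```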

Example Datasets

[Coke or Pepsi](https://drive.google.com/file/d/1are-zefIhygpENpBIynG9mTzdAMCQx9W/view?usp=sharing): Train a model that recognizes whether a picture shows Coke or Pepsi. The dataset has two classes, Coke and Pepsi, with 1410 and 1295 images, respectively.

Tire or Not: Train a model that recognizes whether or not a picture shows a car tire. The dataset has two classes, Tire and Other, with 274 and 353 images, respectively.

Chest X-Ray: Train a model that recognizes whether a patient has Covid, has Pneumonia, or is healthy, based on chest X-ray scans. The dataset has three classes, Covid, Pneumonia and Normal, with 132, 90 and 90 images, respectively.

Updated on: 28/10/2022
