
How to Create a Dataset for a Text Classification Experiment

This tutorial shows how to create a dataset for a text classification experiment in Cogniflow.

In this example, a two-class experiment is created to train models that recognize whether a restaurant review is positive or negative. The two classes are therefore Positive and Negative.

1. Create the CSV File

Create a CSV file containing the reviews (texts) and their labels (classes). Put each review in a column with the header 'Text', and put its class in an adjacent column with the header 'Label'. Five example rows are shown below:
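As a rough illustration, a file in this layout could be generated with Python's standard csv module. The five reviews below are made-up placeholders (they are not taken from the Cogniflow example dataset), and the file name reviews.csv is an assumption.

```python
import csv

# Hypothetical placeholder reviews; replace them with your own data.
ROWS = [
    ("The pasta was delicious and the staff were friendly.", "Positive"),
    ("We waited an hour and the food arrived cold.", "Negative"),
    ("Great value for money, we will come back.", "Positive"),
    ("The soup was bland and overpriced.", "Negative"),
    ("Lovely terrace and excellent desserts.", "Positive"),
]


def write_dataset_csv(path, rows):
    """Write (text, label) pairs to a CSV file with 'Text' and 'Label' headers."""
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["Text", "Label"])
        writer.writerows(rows)


# Assumed output file name.
write_dataset_csv("reviews.csv", ROWS)
```

The csv writer quotes any review that contains a comma, so punctuation inside the text does not break the two-column layout.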

2. Create the ZIP File

Select the CSV file and compress it into a single ZIP file. Now the dataset is ready to be uploaded to Cogniflow.
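If you prefer to script this step, a minimal sketch using Python's standard zipfile module might look like the following. The helper name zip_csv and the file names are assumptions. Storing the CSV at the root of the archive is also an assumption, since this article does not state what internal layout Cogniflow expects.

```python
import zipfile
from pathlib import Path


def zip_csv(csv_path, zip_path):
    """Compress a single CSV file into a ZIP archive and return the archive path."""
    csv_path = Path(csv_path)
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        # arcname keeps only the file name, so the CSV sits at the archive root
        # (assumed layout, not confirmed by the article).
        zf.write(csv_path, arcname=csv_path.name)
    return zip_path
```

For example, zip_csv("reviews.csv", "reviews.zip") would produce a reviews.zip file containing only reviews.csv.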

3. Upload your Dataset

When creating your experiment, you will reach the step where a dataset has to be uploaded. Click on "Browse your files" and upload the ZIP file generated in step 2. Cogniflow will automatically and randomly split your dataset into train and validation subsets, with 80% of the data for training and 20% for validation.

If you prefer, you can also upload a ZIP file containing a validation subset that you have already generated, by clicking on "Advanced options" and then "Browse your files" again. In that case, Cogniflow does not split the dataset. The validation subset is created the same way as explained in step 1.
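If you want to build the validation subset yourself, a random 80/20 split could be sketched as follows. This is only an illustration and not Cogniflow's own splitting code; the function names, the fixed seed, and the file names in the usage note are assumptions.

```python
import csv
import random


def read_rows(csv_path):
    """Read a 'Text'/'Label' CSV file into a list of dictionaries."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


def split_rows(rows, train_fraction=0.8, seed=42):
    """Shuffle a copy of the rows and return (train, validation) lists."""
    shuffled = list(rows)
    # A seeded generator makes the random split reproducible.
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def write_rows(csv_path, rows):
    """Write rows back out with the 'Text' and 'Label' headers."""
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["Text", "Label"])
        writer.writeheader()
        writer.writerows(rows)
```

A possible usage would be to call train, val = split_rows(read_rows("reviews.csv")), write the two lists to train.csv and validation.csv with write_rows, and then compress each CSV into its own ZIP file as in step 2.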

When the upload is complete, click on "Next step".

4. Check if Everything is OK

After you upload your dataset, the training process starts. You can check that the dataset was correctly created, uploaded, and split by clicking on the "Dataset" tab. There you can see how much data there is for each category and subset, and how it is distributed. You can also download the data.

5. How Many Examples do I Need?

We recommend at least 300 to 500 text samples per category. In general, the more examples you provide, the more accurate the resulting model tends to be.
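Before uploading, you may want to check how many examples each class has. The sketch below counts the values in the 'Label' column and reports the classes that fall under a threshold. The threshold of 300 is taken from the lower end of the recommendation above, and the function names are assumptions.

```python
import csv
from collections import Counter

# Lower end of the recommended 300 to 500 samples per category.
MIN_PER_CLASS = 300


def count_labels(csv_path):
    """Count how many rows each class has in the 'Label' column."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        return Counter(row["Label"] for row in csv.DictReader(f))


def underrepresented(counts, minimum=MIN_PER_CLASS):
    """Return the classes whose example count is below the minimum."""
    return {label: n for label, n in counts.items() if n < minimum}
```

For instance, underrepresented(count_labels("reviews.csv")) would return an empty dictionary when every class has at least 300 examples.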

Example Datasets

BBC News Classification: Train a model capable of recognizing the topic of a BBC News article. The dataset has five classes, Business, Entertainment, Politics, Sport and Tech, with 510, 386, 417, 511 and 401 examples, respectively.

Customer Support Intention: Train a model capable of recognizing the intentions in customer support comments. The dataset has twenty-seven classes, such as 'cancel_order', 'get_refund' or 'complaint', with a total of 21672 examples.

Restaurant Reviews: Train a model capable of recognizing the sentiment in restaurant reviews. The dataset has two classes, Positive and Negative, with 500 examples each.

Updated on: 28/10/2022
