Hello there again!
Before we begin with this article, we would like you to take a moment and appreciate this masterpiece of architecture. Really marvelous, isn’t it?
Can’t comprehend what we are trying to show? Well, try now!
Both of the above illustrations are of the Taj Mahal. The former figure is not some fancy abstract art. It is a simple depiction of how a computer looks at the Taj Mahal - in the form of 1s and 0s. Ok, we think you have figured out where we are going with this. Let us reiterate it nonetheless: the way a computer processes and stores the information we provide is very different from the way we humans do it.
Now, ML, in its most basic form, involves teaching a computer using data. So we hardly need to stress how important it is to convert the collected data into a form suitable for building ML models, i.e. something the computer can interpret. The entire process that facilitates this is referred to as Data Pre-processing - the name makes the objective clear enough. The steps that constitute it depend on the nature of the data at hand. The previous article helped us understand a few important things about data. In this article, you will find that knowledge handy in transforming the data into something the machine can act upon and learn from. And now, without further ado, let’s get started.
We are so fond of our orchard. Let’s drive back there again. The last time we visited, you were implementing an automatic grading system for your fruits. But you were disappointed with the results. On consulting your friend, who happens to be an expert in ML, you figure out that the problem might lie in the image-capturing equipment used in the plant. You decide to give ML another chance and invest some capital in replacing the gear. All this means that you need to collect the data again, now using readings from the new equipment. Hoping that things go right this time, you start with the process.
Where did the data go?
You mimic the procedure used last time - collecting the readings from the new equipment and noting the label assigned to each fruit by the workers. The collated information is presented below. Now, the aim is to use this manually labelled data for training the algorithm.
Does anything catch your eye? Yes, there are a few rows for which there is no value in column C i.e. No. of Spots. To understand why this might be happening, let’s imagine the following scenarios:
Just like in this example, in almost all of the ML tasks you will work on in the future, you will come across missing/incomplete data. We don’t need to explain any further why such observations (rows in which data is missing) cannot be fed directly to an ML algorithm. The important question to ask is: how do you handle such cases? Well, there are a few ways –
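If you would like to see two of the common options in code, here is a minimal sketch using pandas. The values and column names below are made up for illustration; your real dataset is the one collected at the plant. We show dropping the incomplete rows and filling the gaps with the column mean (one simple imputation technique):

```python
import pandas as pd

# Toy apple readings; note the missing entries in the "no_of_spots" column.
df = pd.DataFrame({
    "weight_kg": [0.18, 0.22, 0.20, 0.25],
    "no_of_spots": [3, None, 5, None],
})

# Option 1: simply drop rows that have any missing value.
dropped = df.dropna()

# Option 2: fill the gaps with the column mean (mean imputation).
filled = df.copy()
filled["no_of_spots"] = filled["no_of_spots"].fillna(filled["no_of_spots"].mean())

print(len(dropped))                    # 2 complete rows survive
print(filled["no_of_spots"].tolist())  # [3.0, 4.0, 5.0, 4.0]
```

Dropping is the easiest option but throws away information; imputation keeps every row at the cost of inventing plausible values.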
That doesn’t seem right
Let’s say the problem in the earlier case was due to Scenario 2 - a few apples not getting detected. You employ Hot Deck Imputation and now have the dataset with all values present (as shown above). So, everything looks fine now, right? We can hear you say no. Apple 12 weighs around 7.8 kg and Apple 15 around 5 kg! Wow. Imagine the size of them. But they are tagged as medium and small! You never came across such produce in the orchard before. What’s the other possibility? The weighing machine must have malfunctioned. Nonetheless, you know such values would lead to a faulty ML algorithm if used for training. The internal mappings (or think, if-else constructs) would go wrong, right? Such abnormal values observed in the data are referred to as Outliers. Pretty cool name, huh? But it does sound like trouble!
How do you detect them? Well, a simple numerical sort on the column would highlight such unusual values. Sometimes it is not that easy to identify them, and you might need slightly more sophisticated tools and graphing methods to corner them. For example, in the histogram shown below, most of the values lie to the left, and hence the values to the extreme right must be outliers.
Another popular method is based on using Boxplots. You can find the mechanics of the same explained here and here. A few other ways which can help you find outliers are listed here.
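The rule behind a boxplot can also be applied directly in code: anything beyond 1.5 times the interquartile range (IQR) from the quartiles is flagged as an outlier. Here is a sketch with toy weights (including the two suspicious apples from our story):

```python
import pandas as pd

# Toy weights in kg; 7.8 and 5.0 are the suspicious readings.
weights = pd.Series([0.18, 0.20, 0.22, 0.19, 0.21, 7.8, 0.23, 5.0])

# The 1.5 * IQR rule, the same fences a boxplot draws as "whiskers".
q1, q3 = weights.quantile(0.25), weights.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = weights[(weights < lower) | (weights > upper)]
print(outliers.tolist())  # [7.8, 5.0]
```

The fences adapt to the spread of the data, so the same few lines work whether you are looking at weights, spot counts, or anything else numeric.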
And how do you proceed post identification of such outliers? If it is known that there was a measurement or data entry error, the error can be corrected, if possible. For example, we can re-weigh those apples. If it cannot be fixed, the easiest thing to do is to remove them, since their presence distorts the picture painted by the regular data. You can also try replacing the erroneous values using the imputation techniques discussed above. That is, treat such abnormal values as missing data and fill them using imputation. But in certain cases, outliers need to be treated separately. Read about all such cases here.
Making your algorithm understand Apple stuff
Let’s assume that you have fixed all the previous issues with the data and the same is presented below. So, what next? We know that Color is a nominal variable while Size is ordinal in nature.
Most of the ML algorithms involve mathematical transformations on the data to figure out the mapping between the input and output data. And the data needs to be numerical for such transformation to be possible. But the Color variable is text-based. What can be done about this? One way to transform the text into numerical data is by assigning each level a particular number. This method is referred to as Label Encoding. Using the same, we can do the following mapping:
Dark Red → 0
Red → 1
Yellow → 2
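The mapping above takes only a couple of lines in code. A minimal sketch with pandas (the colour values are toy data for illustration):

```python
import pandas as pd

colors = pd.Series(["Red", "Yellow", "Dark Red", "Red"])

# Label Encoding: assign each level an integer, exactly as in the mapping above.
mapping = {"Dark Red": 0, "Red": 1, "Yellow": 2}
encoded = colors.map(mapping)

print(encoded.tolist())  # [1, 2, 0, 1]
```

Libraries such as scikit-learn offer a ready-made `LabelEncoder` for the same job, but an explicit dictionary makes the assigned mapping visible and reproducible.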
While the above mapping does convert the text levels into numerical values, it should be noted that the levels in Color have no logical ordering. By assigning numerical values through label encoding, you are adding an artificial order, i.e. in this case Yellow > Red > Dark Red. This will result in a false interpretation by the ML algorithm. So, how do we transform text into numerical data without imposing any specific ordering? Let’s try to understand the transformation below:
You can see three new variables whose values are based on the Color variable. The mapping is as follows:
With this, the text data is converted into numerical form with no logical ordering. Think about it. The process adopted here is referred to as One Hot (or ‘One-of-K’ or Dummy) Encoding. Simply put, this transformation creates a binary variable for each category. The binary variables so created are commonly referred to as Dummy Variables. And we also know that you are curious to find out why it is called One Hot Encoding. It is One-Hot because each row of the binary columns so created contains exactly one column with a value of 1 and the others as 0. Makes sense?
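In pandas, this whole transformation is a one-liner via `get_dummies`. A small sketch with toy colour values:

```python
import pandas as pd

colors = pd.Series(["Red", "Yellow", "Dark Red"])

# One-Hot Encoding: one binary column per category.
dummies = pd.get_dummies(colors)

print(sorted(dummies.columns.tolist()))  # ['Dark Red', 'Red', 'Yellow']
# Each row is "one hot": exactly one column per row is 1.
print(dummies.sum(axis=1).tolist())      # [1, 1, 1]
```

The row sums confirm the "one hot" property: every row switches on exactly one of the dummy columns.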
Alright, we know how to deal with nominal categorical variables. But what about the ordinal variable here, Size? Well, you can simply use the Label Encoder and map each size level to a numerical value while keeping the order intact. This is common practice, but there is an inherent problem. For example –
But if you remember from the previous article, while there is a logical order associated with an ordinal variable, the difference between two consecutive levels is not meaningful. For example, post encoding, the difference between a “Large” and a “Medium” apple is 1. That doesn’t really mean anything, right? But a Label Encoder doesn’t address this issue. We can always use One-Hot Encoding, but then the ordering information gets lost in translation. You can read about various other types of encoding techniques here.
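For completeness, here is what an order-preserving encoding of Size looks like in code. The size values are toy data, and the caveat from the paragraph above still applies: the integers keep the order Small < Medium < Large, but the gaps between them are an artefact of the encoding, not a measured quantity.

```python
import pandas as pd

sizes = pd.Series(["Small", "Large", "Medium", "Small"])

# Ordinal encoding: an explicit mapping that respects Small < Medium < Large.
order = {"Small": 0, "Medium": 1, "Large": 2}
encoded = sizes.map(order)

print(encoded.tolist())  # [0, 2, 1, 0]
```

Unlike calling a generic label encoder (which typically assigns integers alphabetically), spelling the mapping out guarantees the levels land in the intended order.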
My algorithm believes both are the same
The data, post encoding the categorical variables, looks like this:
Based on the discussion we had in our previous article, we know that both column C and D are numerical variables. We can consider them to be continuous in nature. Recall our discussion about when discrete variables can be treated as continuous ones. Now, the algorithm doesn’t know the difference between columns C and D. It doesn’t know their units; it just sees the numbers. Hence, if this numerical data is fed directly, it might create problems in certain ML algorithms (you will get to know which algorithms in the upcoming articles).
A transformation similar to the one which we performed in the article discussing Correlation is needed for comparing two numerical variables with different units and scales. Such a transformation would indeed help the algorithm distinguish between the values represented by 1 spot on an apple and 1 kg in weight. This process of conversion of numerical data is referred to as Scaling. One of the simplest and most intuitive scaling techniques is Min-max Scaling, given by the formula: x_scaled = (x − min(x)) / (max(x) − min(x)).
Applying the above transformation to a numeric variable results in a transformed variable with values bounded between [0, 1], with 1 representing the maximum value and 0 the minimum value of that variable. The results of this transformation on our dataset are presented below:
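The formula translates directly into a few lines of code. A minimal sketch with a toy list of spot counts (scikit-learn’s `MinMaxScaler` does the same job on whole datasets):

```python
def minmax_scale(values):
    """Min-max scaling: (x - min) / (max - min), mapping values into [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

spots = [2, 4, 6, 10]  # toy "No. of Spots" values
print(minmax_scale(spots))  # [0.0, 0.25, 0.5, 1.0]
```

Notice that the minimum maps to 0 and the maximum to 1, exactly as described above; every other value lands proportionally in between.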
Other scaling techniques which can be employed and the scenarios where they produce the best results are discussed in detail here and here.
How good is your algorithm?
From our introductory articles, we know that a supervised ML algorithm requires training data for it to learn the mapping between the data and the classes (or labels). Remember? Assume that you have trained the classifier using the processed training data shown above. What next? Isn’t it a good idea to find out the performance/accuracy of the algorithm? But for calculating the accuracy of the predictions, we need to compare the output to the true values of the labels. So, does that mean we need to head back out again to collect more labelled data?
An easy way to check the accuracy is by feeding in the same training data, now without the labels. The algorithm will output its predictions, which you can compare against the known labels to calculate the accuracy. But is there a problem with relying on just this method?
Imagine doing this. We have around 16 rows for training the algorithm, right? What if we use just 12 rows and keep 4 rows aside? After the algorithm is trained on the (reduced) data, we can feed it the data from the 4 hidden rows (without the labels, obviously) and get the predictions. Since you have the labels for this data, you can compare and calculate the accuracy. How is this different from the previous method?
Remember the time when you were a kid. Did you attend any tuitions outside of school? If so, you know how it works. Let’s say you are in a Maths tuition. After each chapter in your textbook, there are a few problems, let’s say there are 16 of them. You know where to find the answers for them, right? At the end of the textbook. Now imagine the following scenarios:
Your objective of attending a tuition is to perform well in the school exams. How would you know if you are making progress? In Scenario 1, you are solving the same problems again. You will mostly get them right. In Scenario 2, you are taught the techniques and asked to solve similar problems, not the same ones. Now, both of the cases produce an accuracy measure. Which one is more reliable for understanding the value addition of tutoring? Think!
Scenario 2, right? Why? Because your aim is to do well in the school exams - the questions of which you are unaware of beforehand. Scenario 2 simulates this better than Scenario 1. If you are solving the unseen questions pretty well, you will be more confident that you will do well in the school exams. But in Scenario 1, even if you get all the questions correct when re-solving them, you will be uncertain whether you will perform well when you come across a question you have never seen before. Makes sense?
The ultimate objective of our algorithm is to grade the apples it is exposed to, in the future. Can you draw a parallel between the tutoring example and the alternatives we listed earlier?
When we split the 16-row data into two parts, what we are effectively doing is dividing the data into a ‘training’ and a ‘testing’ set. As the name suggests, the training set is used for training our algorithm. And the testing set, the unseen data, is used for estimating the accuracy of our algorithm. We can also calculate the accuracy on the training data (the first alternative), but we know which score to trust more. These accuracies are referred to as the train and the test performance of the algorithm. But how do you decide the split ratio (we used a 75:25 ratio here) and which rows to use for training and testing? Would this choice matter? All these and many other questions of yours will be addressed in the upcoming articles.
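The split itself can be sketched in a few lines with the standard library; the 16 rows below are stand-ins for our labelled apples, and the shuffle ensures the choice of which rows end up in each set is random rather than positional (scikit-learn’s `train_test_split` wraps this same idea):

```python
import random

rows = list(range(16))  # stand-ins for our 16 labelled rows

# Shuffle first, so training and testing rows are picked at random.
random.seed(42)  # fixed seed only to make the example repeatable
random.shuffle(rows)

# 75:25 split, as used in the article.
split = int(len(rows) * 0.75)
train, test = rows[:split], rows[split:]

print(len(train), len(test))  # 12 4
```

Every row lands in exactly one of the two sets, which is the whole point: the test rows stay unseen during training, just like the unsolved problems in Scenario 2.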
Now, based on these metrics, you can either tweak the algorithm further or decide to go and use it for grading. No more manual work required!
Alright. So, all the steps that we discussed, right from handling missing data to breaking the processed data into train and test sets, fall under the Data Preprocessing pipeline. Did you enjoy this real-short demo of how things roll in the world of ML? We hope that the following few things were made clear through this article -
Now for the most awaited part. From our next set of articles, we will start exploring the mechanics of ML algorithms, beginning with Supervised Classification ones. Until then, you can use the following set of resources to explore more about the nuances in the Data Preprocessing stage -