Data preprocessing in machine learning


Hello there again!

Before we begin with this article, we would like you to take a moment and appreciate this masterpiece of architecture. Really marvelous, isn’t it?

Can’t comprehend what we are trying to show? Well, try now!

Both of the above illustrations are of the Taj Mahal. The former figure is not some fancy abstract art. It is a simple depiction of how a computer looks at the Taj Mahal - in the form of 1s and 0s. OK, we think you figured out where we are going with this. Let us reiterate it, nonetheless. The way a computer processes and stores the information we provide it is very different from the way we humans do.

Now, ML, in its most basic form, involves teaching a computer using data. So we hardly need to stress the importance of converting the collected data into a form suitable for building ML models, i.e. something which the computer can interpret. The entire process which facilitates this is referred to as Data Pre-processing - the name is clear enough to convey the objective. The steps which constitute it depend on the nature of the data at hand. The previous article helped us understand a few important things about data. In this article, you will find that knowledge handy in transforming the data into something which the machine can act upon and learn from. And now, without further ado, let’s get started.

Upgrading

We are so fond of our orchard. Let’s drive back there again. The last time we visited, you were implementing an automatic grading system for your fruits, but you were disappointed with the results. On consulting your friend, who happens to be an expert in ML, you figured out that the problem might lie in the image-capturing equipment being used in the plant. You decide to give ML another chance and invest some capital in replacing the gear. All this means that you need to re-collect the data, now using readings from the new equipment. Hoping that things go right this time, you start with the process.

Where did the data go?

You mimic the procedure used last time - collecting the readings from the new equipment and noting the label assigned to each fruit by the workers. The collated information is presented below. Now, the aim is to use this manually labelled data for training the algorithm.

| S.No | Color [A] | Size [B] | No. of Spots [C] | Weight [D] | Bird Pecked [E] | Grade [Label] [F] |
|------|-----------|----------|------------------|------------|-----------------|-------------------|
| 1    | Dark Red  | Large    | 4                | 300.45 g   | No              | Best              |
| 2    | Red       | Medium   | 5                | 250.16 g   | No              | Regular           |
| 3    | Yellow    | Small    | 3                | 152.34 g   | No              | Low               |
| 4    | Red       | Large    | (missing)        | 273.27 g   | No              | Best              |
| 5    | Yellow    | Medium   | (missing)        | 198.14 g   | Yes             | Rejected          |
| 6    | Yellow    | Medium   | 2                | 241.96 g   | No              | Regular           |
| 7    | Dark Red  | Medium   | 7                | 224.52 g   | No              | Regular           |
| 8    | Red       | Large    | 8                | 311.15 g   | No              | Best              |
| 9    | Red       | Medium   | 5                | 242.55 g   | Yes             | Rejected          |
| 10   | Red       | Small    | 4                | 182.85 g   | No              | Low               |
| 11   | Dark Red  | Small    | (missing)        | 150.66 g   | No              | Regular           |
| 12   | Red       | Medium   | (missing)        | 7800.45 g  | No              | Best              |
| 13   | Red       | Large    | 6                | 260.54 g   | No              | Regular           |
| 14   | Yellow    | Small    | (missing)        | 120.45 g   | No              | Regular           |
| 15   | Red       | Small    | (missing)        | 5000.15 g  | No              | Low               |
| 16   | Dark Red  | Small    | 5                | 190.45 g   | No              | Low               |

 


Does anything catch your eye? Yes, there are a few rows for which there is no value in column C i.e. No. of Spots. To understand why this might be happening, let’s imagine the following scenarios:

  • Scenario 1: There is a faulty connection in the imaging equipment used for counting the spots. A loose wire gets disconnected every once in a while, and the data fails to get captured. This happened randomly across the apples in the training data. Now, the missing data arising from such contexts is categorized as Missing Completely at Random (MCAR). In such cases, the missing values cannot be explained by any other observed factor - it happened at random.
  • Scenario 2: Assume that there is no faulty circuit. You observe that the data is missing for a few apples. The proportion is higher for smaller apples. Now, it might be the case that the equipment fails to detect some apples moving on the conveyor belt. This happens more frequently with apples which are relatively smaller. The missing data arising from such scenarios is classified as Missing at Random (MAR). The difference with respect to the previous scenario being that the probability of an observation being missing can be explained by one of the observed factors. It is not completely due to randomness. The observed factor over here being the size of the apple.
  • Scenario 3: Assume that there is neither a faulty circuit nor undetected apples. Something else is at play which is unknown to you. The chip on the imaging equipment has limited computational power, and hence, when the number of spots on an apple is high, the CPU on the chip hangs and no data is collected. But what’s the difference between this scenario and the previous one? Well, in this case, whether an observation is missing depends on the true value of the observation itself, i.e. the no. of spots, and not on some other observed factor like size, color etc. But to figure out that this is the underlying problem, the no. of spots on the fruit has to be detected, i.e. the true value of the observation. And that is not possible because the required data is missing in the first place. Circular logic? Chicken-and-egg problem? You are right. Missing data under such scenarios is referred to as Missing Not at Random (MNAR). The probability of data missing depends on the unobserved measurement itself.

Just like in the example, in almost all of the ML tasks you will be working on in the future, you will observe missing/incomplete data. We don’t think we need to explain any further about why such observations (rows in which data is missing) cannot be fed directly to an ML algorithm. The important question to ask is how do you handle such cases? Well, there are ways –

  • Delete the Row: It is as simple as it sounds. Delete the rows where you find missing data. There are no complex computations required for this. It works well with Scenario 1, where data is missing completely at random. But think about the unintended effects of applying this technique in Scenario 2. Yes, you are right. You end up deleting a lot of rows corresponding to small apples. Hence, the balance in the data is disturbed. Let us put that in technical terms: the sample we have taken might not be representative of the actual population anymore.
  • Mean/Median Imputation: Although it sounds fancy, imputation just refers to the process of replacing missing data with some value. And as you might have figured out already, the value we want to fill in here is the mean or the median - some measure of central tendency, if you remember what that means. But the mean (or median) of what? Of the column/variable with the missing data. For example, a median-imputed dataset would look like the one shown below, with the missing values in rows 4, 5, 11, 12, 14 and 15 filled using the median (i.e. 5). While this method doesn’t involve a lot of computation, it still suffers from a problem. Can you find it? Aren’t there a lot of 5s in column C now? What if there were more rows with missing data, say 10? There would be a lot more 5s. See where we are going with this? Yes, this method artificially suppresses the variance of the column by filling it with the exact same value. Remember, the more the variance, the more the variety and hence, the more information the variable can represent.

| S.No | Color [A] | Size [B] | No. of Spots [C] | Weight [D] | Bird Pecked [E] | Grade [Label] [F] |
|------|-----------|----------|------------------|------------|-----------------|-------------------|
| 1    | Dark Red  | Large    | 4                | 300.45 g   | No              | Best              |
| 2    | Red       | Medium   | 5                | 250.16 g   | No              | Regular           |
| 3    | Yellow    | Small    | 3                | 152.34 g   | No              | Low               |
| 4    | Red       | Large    | 5                | 273.27 g   | No              | Best              |
| 5    | Yellow    | Medium   | 5                | 198.14 g   | Yes             | Rejected          |
| 6    | Yellow    | Medium   | 2                | 241.96 g   | No              | Regular           |
| 7    | Dark Red  | Medium   | 7                | 224.52 g   | No              | Regular           |
| 8    | Red       | Large    | 8                | 311.15 g   | No              | Best              |
| 9    | Red       | Medium   | 5                | 242.55 g   | Yes             | Rejected          |
| 10   | Red       | Small    | 4                | 182.85 g   | No              | Low               |
| 11   | Dark Red  | Small    | 5                | 150.66 g   | No              | Regular           |
| 12   | Red       | Medium   | 5                | 7800.45 g  | No              | Best              |
| 13   | Red       | Large    | 6                | 260.54 g   | No              | Regular           |
| 14   | Yellow    | Small    | 5                | 120.45 g   | No              | Regular           |
| 15   | Red       | Small    | 5                | 5000.15 g  | No              | Low               |
| 16   | Dark Red  | Small    | 5                | 190.45 g   | No              | Low               |
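For the curious, the median fill described above can be sketched in a few lines of Python with pandas (the variable names here are ours, purely for illustration):

```python
import pandas as pd

# No. of Spots from the table; None marks the six missing readings
spots = pd.Series([4, 5, 3, None, None, 2, 7, 8, 5, 4,
                   None, None, 6, None, None, 5])

median = spots.median()        # median of the 10 observed values
filled = spots.fillna(median)  # every gap becomes the median, i.e. 5
```

Note that the median is computed only over the observed values; the filled column then contains no gaps.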

 

  • Hot Deck Imputation: You know that, in general, the larger the apple, the more spots it has. Can this information come in handy? Yes. The technique of imputing a missing value using the data from another similar observation is referred to as Hot Deck Imputation. In our case, we can apply it by filling out all the missing values using the size-wise median. The median no. of spots for large apples is 6, while for the medium and small ones it is 5 and 4 respectively. The dataset so filled is shown below. It is considered better than a simple mean/median imputation since you are incorporating additional information while filling out the missing values - a more educated estimate, you might say. But it might involve some heavy computation, since the similarity needs to be calculated between the different rows. Also, variance is still suppressed, though not as much as in the previous case. Think about it.

 

| S.No | Color [A] | Size [B] | No. of Spots [C] | Weight [D] | Bird Pecked [E] | Grade [Label] [F] |
|------|-----------|----------|------------------|------------|-----------------|-------------------|
| 1    | Dark Red  | Large    | 4                | 300.45 g   | No              | Best              |
| 2    | Red       | Medium   | 5                | 250.16 g   | No              | Regular           |
| 3    | Yellow    | Small    | 3                | 152.34 g   | No              | Low               |
| 4    | Red       | Large    | 6                | 273.27 g   | No              | Best              |
| 5    | Yellow    | Medium   | 5                | 198.14 g   | Yes             | Rejected          |
| 6    | Yellow    | Medium   | 2                | 241.96 g   | No              | Regular           |
| 7    | Dark Red  | Medium   | 7                | 224.52 g   | No              | Regular           |
| 8    | Red       | Large    | 8                | 311.15 g   | No              | Best              |
| 9    | Red       | Medium   | 5                | 242.55 g   | Yes             | Rejected          |
| 10   | Red       | Small    | 4                | 182.85 g   | No              | Low               |
| 11   | Dark Red  | Small    | 4                | 150.66 g   | No              | Regular           |
| 12   | Red       | Medium   | 5                | 7800.45 g  | No              | Best              |
| 13   | Red       | Large    | 6                | 260.54 g   | No              | Regular           |
| 14   | Yellow    | Small    | 4                | 120.45 g   | No              | Regular           |
| 15   | Red       | Small    | 4                | 5000.15 g  | No              | Low               |
| 16   | Dark Red  | Small    | 5                | 190.45 g   | No              | Low               |
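As a rough sketch, this size-wise median fill can be done with a pandas groupby (again, the column names are ours, not from any standard):

```python
import pandas as pd

df = pd.DataFrame({
    "size":  ["Large", "Medium", "Small", "Large", "Medium", "Medium",
              "Medium", "Large", "Medium", "Small", "Small", "Medium",
              "Large", "Small", "Small", "Small"],
    "spots": [4, 5, 3, None, None, 2, 7, 8, 5, 4,
              None, None, 6, None, None, 5],
})

# Replace each missing spot count with the median of apples of the same size
df["spots"] = df.groupby("size")["spots"].transform(lambda s: s.fillna(s.median()))
```

Each group (Large, Medium, Small) is filled with its own median (6, 5 and 4 respectively), matching the table above.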

That doesn’t seem right

Let’s say the problem in the earlier case was due to Scenario 2 - a few apples not getting detected. You employ Hot Deck Imputation and now have a dataset with all values present (as shown above). So, everything looks fine now, right? We can hear you say no. Apple 12 weighs around 7.8 kg and Apple 15 around 5 kg! Wow. Imagine the size of them. But they are tagged as medium and small! You never came across such produce in the orchard before. What’s the other possibility? The weighing machine must have malfunctioned. Nonetheless, you know such values would lead to a faulty ML algorithm if used for training. The internal mappings (or think, if-else constructs) would go wrong, right? Such abnormal values observed in the data are referred to as Outliers. Pretty cool name, huh? But it does sound like trouble!

 

How do you detect them? Well, a simple numerical sort on the column would highlight such unusual values. Sometimes, though, it is not that easy to identify them, and you might need slightly more sophisticated tools and graphing methods to corner them. For example, in a histogram of the weight column, most of the values would cluster towards the left, and the few values lying to the extreme right would stand out as likely outliers.

Another popular method is based on using Boxplots. You can find the mechanics of the same explained here and here. A few other ways which can help you find outliers are enlisted here.
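The boxplot rule of thumb flags values lying more than 1.5 interquartile ranges beyond the quartiles. Here is a sketch of that rule applied to our weight column (values in grams, taken from the table above):

```python
import pandas as pd

weights = pd.Series([300.45, 250.16, 152.34, 273.27, 198.14, 241.96,
                     224.52, 311.15, 242.55, 182.85, 150.66, 7800.45,
                     260.54, 120.45, 5000.15, 190.45])

q1, q3 = weights.quantile(0.25), weights.quantile(0.75)
iqr = q3 - q1  # interquartile range

# Anything beyond 1.5 * IQR from the quartiles is flagged as an outlier
outliers = weights[(weights < q1 - 1.5 * iqr) | (weights > q3 + 1.5 * iqr)]
```

On this data, the two suspicious apples (7800.45 g and 5000.15 g) are the only values flagged.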

 

And how do you proceed post identification of such outliers? If it is known that there was a measurement or data entry error, the error can be corrected, if possible. For example, we can re-weigh those apples. If it cannot be fixed, the easiest thing to do is to remove them, since their presence distorts the picture painted by the regular data. You can also try replacing the erroneous values with those generated by the imputation techniques discussed above. That is, treat such abnormal values as missing data and fill them using the imputation techniques. But in certain cases, outliers need to be treated separately. Read about all such cases here.

 

Making your algorithm understand Apple stuff

Let’s assume that you have fixed all the previous issues with the data and the same is presented below. So, what next? We know that Color is a nominal variable while Size is ordinal in nature.

| S.No | Color [A] | Size [B] | No. of Spots [C] | Weight [D] | Bird Pecked [E] | Grade [Label] [F] |
|------|-----------|----------|------------------|------------|-----------------|-------------------|
| 1    | Dark Red  | Large    | 4                | 300.45 g   | No              | Best              |
| 2    | Red       | Medium   | 5                | 250.16 g   | No              | Regular           |
| 3    | Yellow    | Small    | 3                | 152.34 g   | No              | Low               |
| 4    | Red       | Large    | 6                | 273.27 g   | No              | Best              |
| 5    | Yellow    | Medium   | 5                | 198.14 g   | Yes             | Rejected          |
| 6    | Yellow    | Medium   | 2                | 241.96 g   | No              | Regular           |
| 7    | Dark Red  | Medium   | 7                | 224.52 g   | No              | Regular           |
| 8    | Red       | Large    | 8                | 311.15 g   | No              | Best              |
| 9    | Red       | Medium   | 5                | 242.55 g   | Yes             | Rejected          |
| 10   | Red       | Small    | 4                | 182.85 g   | No              | Low               |
| 11   | Dark Red  | Small    | 4                | 150.66 g   | No              | Regular           |
| 12   | Red       | Medium   | 5                | 210.45 g   | No              | Best              |
| 13   | Red       | Large    | 6                | 260.54 g   | No              | Regular           |
| 14   | Yellow    | Small    | 4                | 120.45 g   | No              | Regular           |
| 15   | Red       | Small    | 4                | 176.84 g   | No              | Low               |
| 16   | Dark Red  | Small    | 5                | 190.45 g   | No              | Low               |


Most ML algorithms involve mathematical transformations on the data to figure out the mapping between the input and output. And the data needs to be numerical for such transformations to be possible. But the Color variable is text-based. What can be done about this? One way to transform the text into numerical data is to assign each level a particular number. This method is referred to as Label Encoding. Using it, we can do the following mapping:


Dark Red →  0

Red →  1

Yellow →  2
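As a quick sketch, this label encoding is a simple dictionary lookup in pandas (the variable names are ours):

```python
import pandas as pd

colors = pd.Series(["Dark Red", "Red", "Yellow", "Red"])
encoded = colors.map({"Dark Red": 0, "Red": 1, "Yellow": 2})
```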

 

While the above mapping does convert the text levels into numerical values, it should be noted that the levels in Color have no logical ordering. Therefore, by assigning numerical values through label encoding, you are adding an inherent order, i.e. in this case Yellow > Red > Dark Red. This will result in false interpretation by the ML algorithm. So, how do we transform text into numerical data without imposing any specific ordering? Let’s try to understand the transformation below:

| Color    | Color_Dark Red | Color_Red | Color_Yellow |
|----------|----------------|-----------|--------------|
| Dark Red | 1              | 0         | 0            |
| Red      | 0              | 1         | 0            |
| Yellow   | 0              | 0         | 1            |
You can see three new variables whose values are based on the Color variable. The mapping is as follows:

  • Color_Dark Red is 1 if the color of apple is Dark Red and 0 otherwise.
  • Color_Red is 1 if the color of apple is Red and 0 otherwise.
  • Color_Yellow is 1 if the color of apple is Yellow and 0 otherwise.

With this, the text data is converted into numerical form with no logical ordering. Think about it. The process adopted here is referred to as One Hot (or ‘One-of-K’ or Dummy) Encoding. Simply put, this transformation creates a binary variable for each category. The binary variables so created are commonly referred to as Dummy Variables. And we also know that you are curious to find out why it is called One Hot Encoding. It is One-Hot because each row of the binary columns so created contains exactly one column with a value of 1, and the others are 0. Makes sense?
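In practice, these dummy variables can be generated in one call with pandas, whose prefix_value naming convention happens to match the column names used here:

```python
import pandas as pd

colors = pd.Series(["Dark Red", "Red", "Yellow"])

# One binary column per category; exactly one "hot" value per row
dummies = pd.get_dummies(colors, prefix="Color")
```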

 

Alright, we know how to deal with nominal categorical variables. But what about the ordinal variable here? Size? Well, you can simply use the Label Encoder and map each size level to a numerical value while keeping the order intact. This is usually practiced but there is an inherent problem. For example –

 

| Size   | Label Encoder Value Assigned |
|--------|------------------------------|
| Small  | 0                            |
| Medium | 1                            |
| Large  | 2                            |

But if you remember from the previous article, while there is a logical order associated with an ordinal variable, the difference between two consecutive levels doesn’t carry meaning. For example, post encoding, the difference between a “Large” and a “Medium” apple is 1. Does that 1 actually mean anything? No, right? But a Label Encoder doesn’t address this issue. We can always use One-Hot Encoding instead, but then the ordering information gets lost in translation. You can read about various other types of encoding techniques here.

My algorithm believes both are the same

Post encoding the categorical variables, the Color column is replaced by the three dummy variables and the Size column by its numerical labels (0, 1 and 2).

 

Based on the discussion we had in our previous article, we know that both columns C and D are numerical variables. We can consider them to be continuous in nature (recall our discussion about when a discrete variable can be treated as a continuous one). Now, the algorithm doesn’t know the difference between columns C and D. It doesn’t know their units. It just sees the numbers. Hence, if this numerical data is fed in directly, it might create problems in certain ML algorithms (you will get to know which ones in the upcoming articles).

 

A transformation similar to the one we performed in the article discussing Correlation is needed for comparing two numerical variables with different units and scales. Such a transformation would indeed help the algorithm distinguish between the values represented by 1 spot on an apple and 1 kg in weight. This process of converting numerical data is referred to as Scaling. One of the simplest and most intuitive scaling techniques is Min-max Scaling, defined as: x_scaled = (x - x_min) / (x_max - x_min), where x_min and x_max are the minimum and maximum values of the variable.

Applying the above transformation to a numeric variable results in a transformed variable with values bounded between [0, 1], with 1 representing the maximum value and 0 the minimum value of that variable.
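A sketch of min-max scaling applied to our corrected weight column (plain pandas here; libraries such as scikit-learn offer an equivalent MinMaxScaler):

```python
import pandas as pd

weights = pd.Series([300.45, 250.16, 152.34, 273.27, 198.14, 241.96,
                     224.52, 311.15, 242.55, 182.85, 150.66, 210.45,
                     260.54, 120.45, 176.84, 190.45])

# x_scaled = (x - x_min) / (x_max - x_min)
scaled = (weights - weights.min()) / (weights.max() - weights.min())
```

After scaling, the heaviest apple (311.15 g) maps to 1 and the lightest (120.45 g) to 0.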

Other scaling techniques which can be employed and the scenarios where they produce the best results are discussed in detail here and here.

How good is your algorithm?

From our introductory articles, we know that a supervised ML algorithm requires training data, for it to learn the mapping between the data and the classes (or labels). Remember? Assume that you have trained the classifier using the processed training data shown above. What next? Isn’t it a good idea to find out the performance/accuracy of the algorithm? But, for calculating the accuracy of the predictions, we need to compare the output to the true values of the labels. So, does that mean we need to get back again for collecting more labelled data?

An easy way to check the accuracy is by feeding in the same training data, now without the labels. The algorithm will throw out its predictions. You can compare and calculate the accuracy. But is there a problem with relying just on this method?

Imagine doing this. We have 16 rows for training the algorithm, right? What if we just use 12 rows and keep the remaining 4 aside? After the algorithm is trained on the (reduced) data, we can feed it the data from the 4 held-out rows (without the labels, obviously) and get the predictions. Since you have the labels for this data, you can compare and calculate the accuracy. How is this different from the previous method?

Remember the time when you were a kid. Did you attend any tuitions outside of school? If so, you know how it works. Let’s say you are in a Maths tuition. After each chapter in your textbook, there are a few problems, let’s say there are 16 of them. You know where to find the answers for them, right? At the end of the textbook. Now imagine the following scenarios:

  • Scenario 1 - The tutor teaches you to solve all the 16 problems. He then asks you to re-solve them and check if you are getting the same solutions.
  • Scenario 2 - The tutor teaches you to solve a few problems, let’s say 12 of them. He then asks you to solve the rest and check if you are correct.

Your objective of attending a tuition is to perform well in the school exams. How would you know if you are making progress? In Scenario 1, you are solving the same problem again. You should get it right mostly. In Scenario 2, you are taught the techniques and asked to solve similar problems, not the same one. Now, both of the cases produce an accuracy measure. Which one is more reliable to understand the value addition of tutoring? Think!

Scenario 2, right? Why? Because your aim is to do well in the school exams - the questions of which you are unaware of beforehand. Scenario 2 simulates this better than Scenario 1. If you are solving the unseen questions pretty well, you can be more confident that you will do well in the school exams. But in Scenario 1, even if you get all the questions correct when re-solving them, you will be uncertain whether you will perform well when you come across a question you have never seen before. Makes sense?

The ultimate objective of our algorithm is to grade the apples it is exposed to, in the future. Can you draw a parallel between the tutoring example and the alternatives we listed earlier?

When we split the 16-row data into two parts, what we are effectively doing is dividing the data into a ‘training’ and a ‘testing’ set. As the name suggests, the training set is used for training our algorithm. And the testing set, the unseen data, is used for estimating the accuracy of our algorithm. We can calculate the accuracy on the train data as well (the first alternative), but we know which score to trust more. These accuracies are referred to as the train and the test performance of the algorithm. But how do you decide the split ratio (we used a 75:25 ratio here) and which rows to use for training and testing? Would this choice matter? All these and many other questions of yours will be addressed in the upcoming articles.
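The 12/4 split can be sketched in plain Python (the fixed random seed below is ours, used only so the example is reproducible; in practice the shuffle is random, and libraries such as scikit-learn provide a ready-made train_test_split):

```python
import random

rows = list(range(16))   # indices of our 16 labelled apples
random.seed(42)          # fixed seed, only for reproducibility of this sketch
random.shuffle(rows)

train_rows = rows[:12]   # 75% of the data, used for training
test_rows = rows[12:]    # 25% held out, used for testing
```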

Now, based on these metrics, you can either tweak the algorithm further or decide to go and use it for grading. No more manual work required!

 

EndNote

Alright. So, all the steps that we discussed, right from handling missing data to breaking the processed data into train and test sets, fall under the Data Preprocessing pipeline. Did you enjoy this real short demo of how things roll in the world of ML? We hope that the following few things were made clear through this article -

  • Types of Missing Data and how to handle them
  • Finding Outliers and tackling them
  • Encoding Categorical (Nominal and Ordinal) Variables
  • Scaling Continuous Variables
  • Train and Test Split

 

Now for the most awaited part. From our next set of articles, we will start exploring the mechanics of ML algorithms, beginning with Supervised Classification ones. Until then, you can use the following set of resources to explore more about the nuances in the Data Preprocessing stage -
