What Are Missing Values?

Jyothi Panuganti
7 min read · Feb 7, 2020

How many types of missing values are there, and how can they be handled?

Suppose data is missing in the dataset… how do you know whether the data is useful or not? You only find out after checking and verifying it!

Missing values can reduce the statistical power of a study and can produce biased estimates, leading to invalid conclusions.

Let's start with a definition of missing data (missing values):

The value that is not stored for a variable in the observation of interest.

Missing values cause several problems.

  • First, the absence of data reduces statistical power, which refers to the probability that the test will reject the null hypothesis when it is false.
  • Second, the missing data can cause bias in the estimation of parameters.
  • Third, it can reduce the representativeness of the sample.
  • Fourth, it may complicate the analysis of the study.

Each of these distortions may threaten the validity of the trials and can lead to invalid conclusions.

So far we have seen the problems we face when data is missing. Now let’s look at the types of missing data.

There are three types of missing data:

  1. Missing Completely at Random (MCAR)
  2. Missing at Random (MAR)
  3. Missing Not at Random (MNAR)

Missing Completely at Random (MCAR)

The first form is missing completely at random (MCAR). This form exists when the missing values are randomly distributed across all observations. It can be checked by partitioning the data into two parts: one set containing the observations with missing values, and the other containing the complete observations. After partitioning the data, a t-test of mean difference is carried out to check whether there is any difference between the two groups.
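As a rough illustration of that check, here is a minimal sketch (not from the original post): it splits a toy dataset on whether one column is missing and runs a t-test on another column. The column names and values are made up for illustration.

import numpy as np
import pandas as pd
from scipy import stats

# toy data: "income" has some missing values, "age" is fully observed
df = pd.DataFrame({
    "age":    [23, 35, 41, 29, 50, 31, 47, 38],
    "income": [40, np.nan, 55, np.nan, 80, 52, np.nan, 60],
})

# partition the rows by whether "income" is missing, then compare the mean
# of "age" between the two groups with a t-test of mean difference
missing = df["income"].isna()
t_stat, p_value = stats.ttest_ind(df.loc[missing, "age"],
                                  df.loc[~missing, "age"],
                                  equal_var=False)
print(t_stat, p_value)  # a large p-value is consistent with MCAR for this pair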

Keep in mind that if the data are MCAR, you may choose pair-wise or list-wise deletion of the cases with missing values. If, however, the data are not MCAR, imputation is used to replace them.

Missing at Random (MAR)

Data is considered missing at random if the probability of a value being missing does not depend on the value of the variable itself, after controlling for all the other observed variables.

The key aspect of MAR is that the missing values can, to some extent, be predicted from the other variables being studied. When data is missing at random, we need either an advanced imputation method, such as multiple imputation, or analysis methods specifically designed for missing-at-random data.

Missing Not at Random (MNAR)

When the data is missing not at random, we cannot rely on the standard methods for dealing with missing data (e.g., imputation, or algorithms specifically designed for missing values): standard calculations give the wrong answer.

Now let's look at how to handle missing values.

Seven techniques to handle missing values

  1. Listwise Deletion: Delete all data from any participant with missing values. If the sample is large enough, you can likely drop the data without a substantial loss of statistical power. (A short pandas sketch follows the pros and cons below.)

For example, suppose there are missing values in the records for USER A and USER C. According to this approach, we have to delete those records.

After deleting USER A and USER C, we are left with users B, D, and E.

Pros:

  • Complete removal of records with missing values can still yield a robust and accurate model, provided enough data remains.
  • Deleting a row or column that carries no specific information is acceptable, since it does not carry much weight in the analysis.

Cons:

  • Loss of information and data.
  • Works poorly if the percentage of missing values is high relative to the whole dataset.
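As mentioned above, here is a minimal pandas sketch of listwise deletion on the USER example; the score columns are hypothetical stand-ins for the original table.

import numpy as np
import pandas as pd

# USER A and USER C each have a missing value, matching the example above
df = pd.DataFrame({
    "USER":   ["A", "B", "C", "D", "E"],
    "score1": [np.nan, 10, 8, 7, 9],
    "score2": [5, 6, np.nan, 8, 7],
})

# listwise deletion: drop every row that contains at least one missing value
df_complete = df.dropna()
print(df_complete)  # only users B, D, and E remain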

* Recover the Values: You can sometimes contact the participants and ask them to fill in the missing values. For in-person studies, we’ve found that an additional check for missing values before the participant leaves helps.

2. Replacing with Mean/Median/Mode: These summary statistics are used to infer missing values. Approaches range from the global average for the variable to averages based on groups.

* For example, if Revenue has missing values, you might fill them with its mean, median, or mode. You could also take other variables into account, such as the user's Gender and/or Device OS, and assign the group average to the missing values instead.

Though you get a quick estimate of the missing values, you are artificially reducing the variation in the dataset, since all the imputed observations receive the same value. This may affect the statistical analysis of the dataset: depending on the percentage of observations imputed, metrics such as the mean, median, and correlation may be distorted.
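A minimal sketch of this replacement, assuming hypothetical Revenue, Gender, and Device_OS columns like the ones in the example above:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Revenue":   [120.0, np.nan, 95.0, np.nan, 150.0, 110.0],
    "Gender":    ["F", "M", "F", "M", "F", "M"],
    "Device_OS": ["iOS", "Android", "iOS", "iOS", "Android", np.nan],
})

# option 1: global mean for a numeric column
# df["Revenue"] = df["Revenue"].fillna(df["Revenue"].mean())

# option 2: group-based mean (here, the mean Revenue within each Gender group)
df["Revenue"] = df.groupby("Gender")["Revenue"].transform(lambda s: s.fillna(s.mean()))

# mode for a categorical column
df["Device_OS"] = df["Device_OS"].fillna(df["Device_OS"].mode()[0])
print(df)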

Imputation methods

3. Regression Imputation: Regression imputation fits a regression model that predicts the variable with missing values from the other variables. The predictions of the model are then used to substitute the missing values in that variable.

  • Deterministic regression imputation: replaces the missing values with the exact predictions of the regression model, which understates the natural variability because every imputed point lies exactly on the regression line.
  • Stochastic regression imputation: was developed to solve this issue of deterministic regression imputation. It adds a random error term to the predicted value and is therefore able to reproduce the correlation of X and Y more appropriately. (A short sketch follows the drawbacks below.)

Drawbacks of stochastic regression imputation:

  • Stochastic regression imputation may lead to implausible values. Variables are often restricted to a certain range of values (e.g. income should always be positive). Regression imputation is not able to impute according to such restrictions.
  • Stochastic regression imputation leads to poor results when data is heteroscedastic. The imputation method assumes that the random error has on average the same size for all parts of the distribution, often resulting in too small or too large random error terms for the imputed values.
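The following is a hedged sketch (not from the original post) of deterministic versus stochastic regression imputation, using scikit-learn's LinearRegression on synthetic x and y; the variable names and the residual-based noise scale are assumptions made for illustration.

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(scale=1.0, size=200)
y[rng.random(200) < 0.3] = np.nan              # make roughly 30% of y missing

obs = ~np.isnan(y)
model = LinearRegression().fit(x[obs].reshape(-1, 1), y[obs])

# deterministic imputation: missing y replaced with the exact prediction
y_det = y.copy()
y_det[~obs] = model.predict(x[~obs].reshape(-1, 1))

# stochastic imputation: add a random error term scaled by the residual spread
resid_sd = np.std(y[obs] - model.predict(x[obs].reshape(-1, 1)))
y_sto = y.copy()
y_sto[~obs] = model.predict(x[~obs].reshape(-1, 1)) + rng.normal(scale=resid_sd,
                                                                 size=(~obs).sum())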

4. Multiple Imputation: Multiple imputation can incorporate information from all variables in a dataset to derive imputed values for those that are missing. It has been shown to be an effective tool in a variety of scenarios involving missing data, including incomplete item responses.
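As a hedged sketch, scikit-learn's IterativeImputer (a MICE-style imputer) can be run several times with sample_posterior=True to produce multiple completed datasets; the toy data below is made up and reuses the First/Second/Third column names from the k-NN example further down.

import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # enables IterativeImputer
from sklearn.impute import IterativeImputer

df = pd.DataFrame({"First":  [100, 90, np.nan, 95, 88],
                   "Second": [30, 45, 56, np.nan, 50],
                   "Third":  [np.nan, 40, 80, 98, 75]})

# draw several imputed datasets; analyses are run on each and the results pooled
imputations = []
for seed in range(5):
    imputer = IterativeImputer(sample_posterior=True, random_state=seed)
    imputations.append(pd.DataFrame(imputer.fit_transform(df), columns=df.columns))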

5. Hot-Deck: This method is similar in principle to case-based reasoning. For each missing value, the most similar instances with non-missing values are found, and the missing value is replaced with the value observed in the most similar instance (the donor).
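A hedged sketch of a very simple hot-deck scheme, where the donor is the complete record closest to the incomplete one in Euclidean distance over the other columns; the age/income columns are made up for illustration.

import numpy as np
import pandas as pd

df = pd.DataFrame({"age":    [25, 32, 40, 29, 51],
                   "income": [38, 52, np.nan, 45, 80]})

def hot_deck(df, target):
    others = [c for c in df.columns if c != target]
    donors = df[df[target].notna()]                      # complete records
    filled = df.copy()
    for idx in df.index[df[target].isna()]:
        # distance from the incomplete row to every donor on the other columns
        dist = ((donors[others] - df.loc[idx, others]) ** 2).sum(axis=1)
        # copy the target value from the most similar donor
        filled.loc[idx, target] = donors.loc[dist.idxmin(), target]
    return filled

print(hot_deck(df, "income"))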

6. k-NN: The k most similar complete instances are found by searching over the non-missing attributes (for example, 3-NN uses the three nearest neighbours), and the missing value is imputed from the attribute values of those k instances.

Example: Creating DataFrame with missing values

  1. Load KNNImputer: from sklearn.impute import KNNImputer

import numpy as np
import pandas as pd
data = {'First': [100, 90, np.nan, 95],
        'Second': [30, 45, 56, np.nan],
        'Third': [np.nan, 40, 80, 98]}
# creating a DataFrame from the dictionary
df = pd.DataFrame(data)

i. Initialize KNNImputer

imputer = KNNImputer(n_neighbors=2)

ii. Impute/Fill Missing values

df_filled = imputer.fit_transform(df)

Printing df_filled shows the data with each missing entry replaced by the average of its two nearest neighbours.

7. k-means Clustering: The dataset is partitioned into k clusters by maximizing the homogeneity within each cluster and the heterogeneity between clusters, and missing values are then imputed from the cluster a record belongs to.
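A hedged sketch of this idea with scikit-learn's KMeans: rows are clustered on a provisionally mean-filled copy (an assumption, since KMeans cannot handle NaNs directly), and each missing value is then replaced with the column mean inside its cluster.

import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

df = pd.DataFrame({"First":  [100, 90, np.nan, 95, 20, 25],
                   "Second": [30, 45, 56, np.nan, 5, 8]})

# cluster the rows into k groups using a temporarily mean-filled copy
provisional = df.fillna(df.mean())
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(provisional)

# replace each missing value with the column mean inside the row's cluster
filled = df.copy()
for col in df.columns:
    cluster_means = df.groupby(labels)[col].transform("mean")
    filled[col] = filled[col].fillna(cluster_means)
print(filled)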


These techniques are more than enough to get you started with imputation.

Have a wonderful read!

I’d love to hear your views on this blog, and please let me know where I can improve; I’m always glad to hear from you.

You can find more of my posts on Medium…

My next post is on distance measures like gradient, Euclidean, etc…

