top of page

Data Preparation 2 - Imputing missing values with the help of Linear Regression

The dataset we have currently is a clean one with no missing values. It looks like this.

For the sake of this tutorial, I have deliberately made the data into a mess by removing some values from "Import Million Riyals Columns) with keeping the backup of the original data.

#=========================================================================================

#Extracting the data

#=========================================================================================

setwd("C:/Users/Ed/Desktop/datasets") sm = read.csv("saudimessydata.csv") View(sm) colnames(sm) <- c("Year", "Exp.mRiyals", "Change.in.Exp", "Imp.mRiyals", "Change.in.Imp")

#renamed the column names

str(sm) sm$Exp.mRiyals = as.numeric(sm$Exp.mRiyals)

#=========================================================================================



The missing values can be imputed with the help of linear regression. Before performing a linear regression, we have to find out the independent variable column from the table. We can do this with the help of correlation. The most correlated column variables are taken as the independent variables.



#========================================================================================= cor(sm) cor(sm, use = "complete.obs") # ignoring the NAs

#=========================================================================================

If we have a large dataset, it becomes difficult to compare the correlation of each and every column. That is why we use a function called symnum().

#========================================================================================= symnum(cor(sm, use = "complete.obs"))

#=========================================================================================

Now we have determined the independent variable column as "Export Million Riyals". The following step is one of the most crucial steps. We are going assign 1s and 0s to present and NULL values respectively with the help of new column I.

#=========================================================================================

# Creating a new column

#=========================================================================================

Ind = function(t) { x = dim(length(t)) x[which(!is.na(t))] = 1 x[which(is.na(t))] = 0 return(x) }

sm$I = Ind(sm$Imp.mRiyals) sm

#=========================================================================================

#linear regression

#=========================================================================================

lg = lm(Imp.mRiyals ~ Exp.mRiyals, data = sm) summary(lg)

coefficients(lg) #Imp.mRiyals = 1.806105e+04 + 4.067606e-01 * Exp.mRiyals

for(i in 1:nrow(sm)) { if(sm$I[i]==0) { sm$Imp.mRiyals[i] = 1.806105e+04 + 4.067606e-01*sm$Exp.mRiyals[i] } }

sm View(sm)

l = sm[4] View(l)

#=========================================================================================

#Correlation obtained in Excel, when compared with the true data #0.991004342

#This is an excellent score and it reinforces the fact that #linear regression is highly helpful in the case of imputation during data preparation

#=========================================================================================

Sample Visualization

ggplot(data = sm, aes(x = Year, y = Imp.mRiyals)) + geom_line(stat = "identity", col = "dark green", size = 2) + theme_bw() + ggtitle("Imports over the years") + xlab("Year -->") + ylab("Imports in Million Riyals") + theme(plot.title = element_text(hjust = 0.5))


Featured Posts
Check back soon
Once posts are published, you’ll see them here.
Recent Posts
Archive
Search By Tags
No tags yet.
Follow Us
  • Facebook Basic Square
  • Twitter Basic Square
  • Google+ Basic Square
bottom of page