Understanding the Linear Regression Model

Mayank Porwal
5 min read · Jan 10, 2021

Linear regression is one of the most well-known algorithms in statistics and machine learning. In this article we are going to explore linear regression with the help of the Boston Housing dataset.

In a linear regression model, we try to learn the relationship between the input variables and a single output variable in the given dataset. If the data shows a linear tendency, that is, the output changes roughly in proportion to the inputs, we can model it with linear regression.
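In symbols, this idea is usually written as follows. This is the standard textbook notation, not something taken from the article's code: x_1, ..., x_p are the input variables, \hat{y} is the predicted output, and the coefficients \beta_j are what the model learns, typically by minimising the squared error over the n training rows.

```latex
\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p,
\qquad
\min_{\beta_0,\dots,\beta_p} \; \sum_{i=1}^{n} \bigl(y_i - \hat{y}_i\bigr)^2
```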

So let’s get started.

First we will import all the required libraries and load the dataset.

Next we will explore the dataset:

boston.keys()
## output
dict_keys(['data', 'target', 'feature_names', 'DESCR', 'filename'])
  • data: the feature values for each house
  • target: the prices of the houses
  • feature_names: the names of the features
  • DESCR: a description of the dataset

Check the features of the dataset with boston.feature_names:

array(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD','TAX', 'PTRATIO', 'B', 'LSTAT'], dtype='<U7')

Using print(boston.DESCR) we can learn more about the dataset's features. There are 13 features and 506 observations.

.. _boston_dataset:

Boston house prices dataset
---------------------------
**Data Set Characteristics:**
Number of Instances: 506
Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

Attribute Information (in order):
CRIM:per capita crime rate by town
ZN:proportion of residential land zoned for lots over 25,000 sq.ft.
INDUS:proportion of non-retail business acres per town
CHAS:Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
NOX:nitric oxides concentration (parts per 10 million)
RM:average number of rooms per dwelling
AGE:proportion of owner-occupied units built prior to 1940
DIS:weighted distances to five Boston employment centres
RAD:index of accessibility to radial highways
TAX:full-value property-tax rate per $10,000
PTRATIO:pupil-teacher ratio by town
B:1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
LSTAT:% lower status of the population
MEDV:Median value of owner-occupied homes in $1000's

Missing Attribute Values: None

Creator: Harrison, D. and Rubinfeld, D.L.

This is a copy of UCI ML housing dataset.
https://archive.ics.uci.edu/ml/machine-learning-databases/housing/


This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.

The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978. Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980. N.B. Various transformations are used in the table on
pages 244-261 of the latter.

The Boston house-price data has been used in many machine learning papers that address regression
problems.

.. topic:: References

- Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
- Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.

Now let’s create a data frame using pd.DataFrame.

boston dataset

We can see that the target value MEDV is missing from the data frame, so we add a new column for the target.
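These two steps can be sketched as below. Note that load_boston was removed from recent scikit-learn releases, so this sketch does not load the real data: the `boston` dictionary here is a hypothetical stand-in filled with random numbers, with the same keys and shapes as described above.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the loaded dataset: random numbers, NOT real Boston data.
rng = np.random.default_rng(0)
feature_names = np.array(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE',
                          'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT'])
boston = {
    'data': rng.random((506, 13)),      # 506 observations, 13 features
    'target': rng.random(506),          # one target value per observation
    'feature_names': feature_names,
}

# Build the data frame from the feature matrix, one column per feature name.
df = pd.DataFrame(boston['data'], columns=boston['feature_names'])

# The target is stored separately, so add it as a new column.
df['MEDV'] = boston['target']

print(df.shape)  # (506, 14)
```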

Now our dataset is ready for data preprocessing.

Data Preprocessing: In data preprocessing we check whether there are any NaN, null, or missing values in the dataset. To do this we count the missing values for each feature.
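A minimal sketch of that count, using a tiny made-up data frame (the values are invented for illustration, and one of them is deliberately missing):

```python
import numpy as np
import pandas as pd

# Tiny made-up frame for illustration; one RM value is deliberately missing.
df = pd.DataFrame({
    'RM': [6.5, 6.4, np.nan, 7.0],
    'LSTAT': [5.0, 9.1, 4.0, 2.9],
    'MEDV': [24.0, 21.6, 34.7, 33.4],
})

# isnull() marks each missing cell as True; sum() then counts them per column.
missing_per_column = df.isnull().sum()
print(missing_per_column)
```

A column with a count of zero has no missing values, which is what the dataset description above reports for the Boston data.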

Now we will move on to the statistical and graphical analysis of the data. In this step, with the help of visualization, we will try to learn the relationship between the target variable and the features.

Let’s first check whether our target variable is normally distributed.

Normalization
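Alongside a plot, a quick numerical check can help. The sketch below bins the values into a histogram and computes the skewness, which is close to 0 for a symmetric, bell-shaped distribution. The target here is synthetic (drawn from a normal distribution), not the real MEDV column, and the bin count of 30 is an arbitrary choice.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the target column, NOT real MEDV values.
rng = np.random.default_rng(0)
target = pd.Series(rng.normal(22.5, 9.0, 506), name='MEDV')

# Histogram: how many values fall into each of 30 equal-width bins.
counts, bin_edges = np.histogram(target, bins=30)

# Skewness near 0 suggests a roughly symmetric distribution;
# a clearly positive value would indicate a long right tail.
skewness = target.skew()
print(round(skewness, 3))
```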

We can see that the target is roughly normally distributed but has some outliers. So what is an outlier?

Wikipedia: In statistics, an outlier is an observation point that is distant from other observations.

By this definition, we can see that some data points lie apart from the crowd.
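One common way to make "distant from other observations" concrete is the 1.5 × IQR rule. This rule is a swapped-in convention for illustration, not something the article itself uses, and the values below are synthetic with three extreme points added on purpose.

```python
import numpy as np

# Synthetic stand-in for the target: 200 values around 22 plus three
# deliberately extreme ones. These are NOT real MEDV values.
rng = np.random.default_rng(0)
medv = np.concatenate([rng.normal(22.0, 3.0, 200), [60.0, 65.0, -15.0]])

# Interquartile range: the spread of the middle 50% of the data.
q1, q3 = np.percentile(medv, [25, 75])
iqr = q3 - q1

# Anything more than 1.5 * IQR outside the quartiles is flagged as an outlier.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = medv[(medv < lower) | (medv > upper)]
print(len(outliers), lower, upper)
```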

Next, we check the correlation between the variables with the help of a correlation matrix. This matrix shows the correlation between all possible pairs of variables in the table. It can be computed with the corr() function. To plot the correlation matrix we will use the heatmap() function from the seaborn library.

Correlation Matrix

The correlation coefficient lies between -1 and +1. It quantifies the strength and direction of the linear relationship between two numerical variables. If the value is close to +1, there is a strong positive correlation between the two variables; if it is close to -1, the variables have a strong negative correlation; a value near 0 indicates little or no linear relationship.
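As a small sketch of how such a matrix is computed, the code below builds a synthetic data frame in which MEDV is constructed to rise with RM and fall with LSTAT, then calls corr(). The coefficients used to generate the data are invented for illustration and are not estimates from the Boston data.

```python
import numpy as np
import pandas as pd

# Synthetic data, NOT the real Boston values: MEDV is built to increase
# with RM and decrease with LSTAT, plus some random noise.
rng = np.random.default_rng(0)
n = 506
rm = rng.normal(6.3, 0.7, n)
lstat = rng.normal(12.0, 7.0, n)
medv = 9.0 * rm - 0.9 * lstat + rng.normal(0.0, 3.0, n)
df = pd.DataFrame({'RM': rm, 'LSTAT': lstat, 'MEDV': medv})

# Pairwise Pearson correlation between every pair of columns.
corr_matrix = df.corr().round(2)
print(corr_matrix)

# With seaborn installed, sns.heatmap(corr_matrix, annot=True) would draw
# this matrix as an annotated heatmap.
```

The diagonal is always 1, because every variable is perfectly correlated with itself, and the matrix is symmetric.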

By looking at the correlation matrix, we can observe the following:

  • RAD and TAX are strongly correlated with each other, so we should not select both of these features together to train the model.
  • AGE and DIS are negatively correlated, with a value of -0.75.
  • RM has a strong positive correlation with MEDV (0.7) and LSTAT has a strong negative correlation with MEDV (-0.74).

Finally, we select two features: RM and LSTAT. Now we will draw a scatter plot with a fitted linear model.

Scatter plot with best fit line
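The best fit line in such a plot is a least-squares line through the points. The sketch below computes one with np.polyfit for a single feature. The data is synthetic (generated from a known line with slope 9 and intercept -34 plus noise, both invented for illustration), and the plot itself is not reproduced here.

```python
import numpy as np

# Synthetic points scattered around a known line; NOT real Boston data.
rng = np.random.default_rng(0)
rm = rng.uniform(4.0, 9.0, 200)
medv = 9.0 * rm - 34.0 + rng.normal(0.0, 2.0, 200)

# A degree-1 polynomial fit is a least-squares straight line.
# polyfit returns the coefficients from the highest power down.
slope, intercept = np.polyfit(rm, medv, deg=1)
print(slope, intercept)

# Points on the fitted line, e.g. for drawing it over a scatter plot.
line_x = np.array([rm.min(), rm.max()])
line_y = slope * line_x + intercept
```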

Prepare the data to train the model

Now prepare the data with two features.
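A minimal sketch of this step, assuming the data frame has RM, LSTAT, and MEDV columns. The frame below is synthetic, and the extra AGE column is only there to show that unselected features are left out.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the full data frame, NOT real Boston data.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'RM': rng.normal(6.3, 0.7, 506),
    'LSTAT': rng.normal(12.0, 7.0, 506),
    'AGE': rng.uniform(0.0, 100.0, 506),
    'MEDV': rng.normal(22.5, 9.0, 506),
})

# X holds only the two selected features; y holds the target.
X = df[['RM', 'LSTAT']]
y = df['MEDV']

print(X.shape, y.shape)  # (506, 2) (506,)
```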

Split the data into training and testing sets

Here we split the dataset into training and testing sets. We will test the model on 20% of the dataset and use the rest to train the model.

20% for test data
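The split can be sketched with scikit-learn's train_test_split. The arrays are random placeholders with the same shape as the prepared data, and random_state=5 is an arbitrary choice that just makes the split repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Random placeholders shaped like the prepared data: 506 rows, 2 features.
rng = np.random.default_rng(0)
X = rng.random((506, 2))
y = rng.random(506)

# test_size=0.2 holds out 20% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=5)

print(X_train.shape, X_test.shape)  # (404, 2) (102, 2)
```

With 506 rows, 20% is 101.2, which is rounded up to 102 test rows, leaving 404 rows for training.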

Train and test the model

We will use linear regression to train the model.
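Training can be sketched with scikit-learn's LinearRegression. The training data below is synthetic and generated from known coefficients (5.0, -0.6, and an intercept of 20.0, all invented for illustration), so we can see that the fitted model recovers them.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic training data generated from a known linear rule plus noise.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(400, 2))
y_train = (5.0 * X_train[:, 0] - 0.6 * X_train[:, 1] + 20.0
           + rng.normal(0.0, 0.5, 400))

# fit() estimates one coefficient per feature and an intercept
# by ordinary least squares.
model = LinearRegression()
model.fit(X_train, y_train)
print(model.coef_, model.intercept_)

# predict() applies the learned line to new rows.
y_pred = model.predict(X_train[:5])
```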

Model accuracy with the R2 statistic

We will check the accuracy of the model with the R2 score. It provides a measure of fit: it tells us how much of the variation in the target is explained by the model.
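To make the score concrete: R2 is 1 minus the ratio of the residual sum of squares to the total sum of squares around the mean. The sketch below computes it by hand on four made-up values and compares the result with scikit-learn's r2_score.

```python
import numpy as np
from sklearn.metrics import r2_score

# Four made-up true values and predictions, for illustration only.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 9.5])

# Residual sum of squares: the error left over after the model's predictions.
ss_res = np.sum((y_true - y_pred) ** 2)
# Total sum of squares: the error of always predicting the mean.
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2_manual = 1.0 - ss_res / ss_tot
r2_sklearn = r2_score(y_true, y_pred)
print(r2_manual, r2_sklearn)  # both are 0.9625 (up to floating-point rounding)
```

A score of 1 means the predictions match the data exactly, and a score of 0 means the model does no better than always predicting the mean.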

The model performance for training set
--------------------------------------
R2 score is 0.6300745149331701


The model performance for testing set
--------------------------------------
R2 score is 0.6628996975186954

Thanks for Reading 🙏

Happy Learning 😃
