Understanding the Linear Regression Model
Linear regression is one of the most well-known algorithms in statistics and machine learning. In this article we are going to explore linear regression with the help of the Boston Housing dataset.
In a linear regression model, we try to learn the relationship between the input variables and a single output variable in the given dataset. If the dataset shows a linear tendency, we can model it with linear regression.
So let’s get started.
First we will import all the required libraries and load the dataset.
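The setup might look like the following minimal sketch. Note that `load_boston` was removed in scikit-learn 1.2, so this assumes an older scikit-learn (< 1.2), which is what tutorials of this era used:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# load_boston was removed in scikit-learn 1.2; this assumes scikit-learn < 1.2
from sklearn.datasets import load_boston

# A Bunch object holding data, target, feature_names, DESCR, filename
boston = load_boston()
```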
Next we will explore the dataset:
```
boston.keys()
## output
dict_keys(['data', 'target', 'feature_names', 'DESCR', 'filename'])
```
- data: contains the information of the houses
- target: prices of the houses (the target variable)
- feature_names: features of the dataset
- DESCR: description about the data
Check the features of the dataset with boston.feature_names:

```
array(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD',
       'TAX', 'PTRATIO', 'B', 'LSTAT'], dtype='<U7')
```
By using print(boston.DESCR) we can learn more about the dataset’s features. There are 13 features and 506 observations.
.. _boston_dataset:
Boston house prices dataset
---------------------------
**Data Set Characteristics:**
Number of Instances: 506
Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.
Attribute Information (in order):

- CRIM: per capita crime rate by town
- ZN: proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS: proportion of non-retail business acres per town
- CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- NOX: nitric oxides concentration (parts per 10 million)
- RM: average number of rooms per dwelling
- AGE: proportion of owner-occupied units built prior to 1940
- DIS: weighted distances to five Boston employment centres
- RAD: index of accessibility to radial highways
- TAX: full-value property-tax rate per $10,000
- PTRATIO: pupil-teacher ratio by town
- B: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
- LSTAT: % lower status of the population
- MEDV: Median value of owner-occupied homes in $1000's
Missing Attribute Values: None
Creator: Harrison, D. and Rubinfeld, D.L.
This is a copy of UCI ML housing dataset.
https://archive.ics.uci.edu/ml/machine-learning-databases/housing/
This dataset was taken from the StatLib library which is maintained at Carnegie Mellon University.
The Boston house-price data of Harrison, D. and Rubinfeld, D.L. 'Hedonic
prices and the demand for clean air', J. Environ. Economics & Management,
vol.5, 81-102, 1978. Used in Belsley, Kuh & Welsch, 'Regression diagnostics
...', Wiley, 1980. N.B. Various transformations are used in the table on
pages 244-261 of the latter.
The Boston house-price data has been used in many machine learning papers that address regression
problems.
.. topic:: References
- Belsley, Kuh & Welsch, 'Regression diagnostics: Identifying Influential Data and Sources of Collinearity', Wiley, 1980. 244-261.
- Quinlan,R. (1993). Combining Instance-Based and Model-Based Learning. In Proceedings on the Tenth International Conference of Machine Learning, 236-243, University of Massachusetts, Amherst. Morgan Kaufmann.
Now let’s create a data frame by using pd.DataFrame. We can see that the target value MEDV is missing from the data frame, so we add a new column to hold the target.
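A minimal sketch of building the data frame and appending the target (again assuming scikit-learn < 1.2 for `load_boston`):

```python
import pandas as pd
from sklearn.datasets import load_boston  # removed in scikit-learn 1.2

boston = load_boston()
# Columns are named after the dataset's 13 features
df = pd.DataFrame(boston.data, columns=boston.feature_names)
# The target is stored separately in the Bunch, so we append it as MEDV
df['MEDV'] = boston.target
```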
Now our dataset is ready for data preprocessing.
Data Preprocessing: In data preprocessing we check whether there are any NaN, null, or otherwise missing values in the dataset. For this we count the missing values for each feature, like so:
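Counting missing values per feature could look like this (the Boston dataset itself happens to have none):

```python
import pandas as pd
from sklearn.datasets import load_boston  # removed in scikit-learn 1.2

boston = load_boston()
df = pd.DataFrame(boston.data, columns=boston.feature_names)
df['MEDV'] = boston.target

# isnull() marks missing entries; sum() counts them per column
missing = df.isnull().sum()
print(missing)  # every per-column count is 0 for this dataset
```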
Now we will move forward to statistical and graphical analysis of the data. In this step, with the help of visualization, we will try to learn the relationship between the target variable and the features.
Let’s first check whether our target variable is normally distributed or not.
We can see that our data is roughly normally distributed but has some outliers. Now, what is an outlier?
Wikipedia: In statistics, an outlier is an observation point that is distant from other observations.
According to this definition, we can see that there are some data points that sit apart from the crowd.
Next, we check the correlation between the variables with the help of a correlation matrix. This matrix depicts the correlation between all possible pairs of values in the table. It can be computed with the corr() function, and to plot it we will use the heatmap() function from the seaborn library.
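A sketch of computing and plotting the correlation matrix:

```python
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # non-interactive backend for this sketch
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_boston  # removed in scikit-learn 1.2

boston = load_boston()
df = pd.DataFrame(boston.data, columns=boston.feature_names)
df['MEDV'] = boston.target

# Pairwise Pearson correlations between all 14 columns
corr = df.corr()
plt.figure(figsize=(12, 10))
sns.heatmap(corr, annot=True, cmap='coolwarm')  # annotate each cell with its value
plt.savefig('correlation_matrix.png')
```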
The correlation coefficient lies between -1 and +1. It quantifies the strength and direction of the relationship between two numerical variables. A value close to +1 means a strong positive correlation between the two variables, and a value close to -1 means a strong negative correlation.
By looking at the correlation matrix, we can observe the following:

- RAD and TAX are strongly correlated with each other, so we can’t select both features together to train the model.
- AGE and DIS are negatively correlated, with a value of -0.75.
- RM has a strong positive correlation with MEDV (0.7), and LSTAT has a high negative correlation with MEDV (-0.74).

Finally, we select two features: RM and LSTAT. Now we will draw a scatter plot with a linear model fit.
Prepare the data to train the model
Now we prepare the data using the two selected features.
Split the data into training and testing sets
Here we split the dataset into training and testing sets. We will test the model with 20% of the dataset and use the rest to train it.
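Preparing the two-feature matrix and making an 80/20 split might look like this (random_state=5 is an arbitrary choice here; the article does not state its seed, and the exact scores depend on it):

```python
import pandas as pd
from sklearn.datasets import load_boston  # removed in scikit-learn 1.2
from sklearn.model_selection import train_test_split

boston = load_boston()
df = pd.DataFrame(boston.data, columns=boston.feature_names)
df['MEDV'] = boston.target

X = df[['RM', 'LSTAT']]  # the two selected features
y = df['MEDV']           # the target

# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=5)
print(X_train.shape, X_test.shape)
```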
Train and test the model
We will use linear regression to train the model.
Model accuracy with the R2 statistic
We will check the accuracy of the model with the R2 score. It provides a measure of fit: how much of the variance in the target our model explains.
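Training and scoring could be sketched as follows (again, the exact R2 values depend on the split seed, which the article does not state):

```python
import pandas as pd
from sklearn.datasets import load_boston  # removed in scikit-learn 1.2
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

boston = load_boston()
df = pd.DataFrame(boston.data, columns=boston.feature_names)
df['MEDV'] = boston.target

X = df[['RM', 'LSTAT']]
y = df['MEDV']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=5)

# Fit ordinary least squares on the training split
model = LinearRegression()
model.fit(X_train, y_train)

# R2: fraction of the variance in MEDV explained by the model
train_r2 = r2_score(y_train, model.predict(X_train))
test_r2 = r2_score(y_test, model.predict(X_test))
print(f'Train R2: {train_r2:.4f}')
print(f'Test  R2: {test_r2:.4f}')
```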
```
The model performance for training set
--------------------------------------
R2 score is 0.6300745149331701

The model performance for testing set
--------------------------------------
R2 score is 0.6628996975186954
```
Thanks for Reading 🙏
Happy Learning 😃
Feel free to leave a comment or share this post. Follow me for future posts….