In this article, you will learn the basics behind a very popular statistical model: linear regression.

## What is Linear Regression

In statistics, linear regression is a linear approach to modeling the relationship between a scalar response and one or more explanatory variables (also known as dependent and independent variables). The case of one explanatory variable is called simple linear regression; for more than one, the process is called multiple linear regression.

In linear regression, these two variables are related through an equation in which the exponent (power) of both variables is 1. Mathematically, a linear relationship represents a straight line when plotted as a graph. A non-linear relationship, where the exponent of any variable is not equal to 1, creates a curve.

The general mathematical equation for a linear regression is:

`y = ax + b`

Description of the parameters used:

• y is the response variable.
• x is the predictor variable.
• a and b are constants called the coefficients: a is the slope of the line and b is the intercept.
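
To make the roles of a and b concrete, here is a minimal sketch in R that simulates points around a known line and then estimates the line back from the data. The values a = 3 and b = 2, the noise level and the variable names are illustrative assumptions, not part of any dataset used in this article. The `lm()` function it relies on is covered later.

```r
# Illustrative only: simulate points scattered around the line y = 3x + 2
set.seed(42)                                 # make the random noise reproducible
x <- seq(1, 20)                              # predictor variable
y <- 3 * x + 2 + rnorm(length(x), sd = 1)    # response = line + small random noise

fit <- lm(y ~ x)                             # estimate b (intercept) and a (slope)
coef(fit)                                    # estimates should land near 2 and 3
```

Because the data were generated from a straight line, the fitted coefficients should be close to the values used in the simulation, differing only because of the added noise.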

## Example

For this analysis, we will use the `cars` dataset that comes with R by default. `cars` is a standard built-in dataset, which makes it convenient for demonstrating linear regression in a simple and easy-to-understand fashion. You can access this dataset simply by typing `cars` in your R console. You will find that it consists of 50 observations (rows) and 2 variables (columns): `dist` and `speed`. Let’s print out the first six observations here.

``````
# head() shows only the first 6 observations
> head(cars)
  speed dist
1     4    2
2     4   10
3     7    4
4     7   22
5     8   16
6     9   10
``````

Before we begin building the regression model, it is good practice to analyze and understand the variables. The graphical analysis and correlation study below will help with this.
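
A scatter plot is a natural first step in this graphical analysis. The following is a minimal sketch, assuming base R graphics. The axis labels and the `lowess()` smoother are illustrative choices rather than part of the original walkthrough.

```r
# Scatter plot of distance against speed for the built-in cars data
plot(cars$speed, cars$dist,
     xlab = "Speed (mph)", ylab = "Distance (ft)",
     main = "Distance vs. Speed")

# Overlay a lowess smoother to make the overall trend easier to see
trend <- lowess(cars$speed, cars$dist)
lines(trend, col = "steelblue", lwd = 2)
```

If the smoothed line looks roughly straight and rising, a linear model is a reasonable starting point.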

## Density plot – Check if the response variable is close to normality

``````
> plot(density(cars$speed), main="Density Plot: Speed", ylab="Frequency")
> polygon(density(cars$speed), col="steelblue")

> plot(density(cars$dist), main="Density Plot: Distance", ylab="Frequency")
> polygon(density(cars$dist), col="steelblue")
``````

## Correlation

Correlation is a statistical measure that suggests the level of linear dependence between two variables that occur in pairs, just like `speed` and `dist` here. Correlation can take values between -1 and +1. If, for every instance where speed increases, the distance also increases along with it, then there is a high positive correlation between them, and the correlation will be close to 1. The opposite is true for an inverse relationship, in which case the correlation between the variables will be close to -1.

A value closer to 0 suggests a weak relationship between the variables. A low correlation (-0.2 < x < 0.2) probably suggests that much of the variation in the response variable (Y) is unexplained by the predictor (X), in which case we should probably look for better explanatory variables.

``````
cor(cars$speed, cars$dist)  # calculate correlation between speed and distance
#> [1] 0.8068949
``````
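
To see where this number comes from, here is a small sketch that reproduces the Pearson correlation by hand as the covariance divided by the product of the standard deviations. The variable names `r_manual` and `r_builtin` are only for illustration.

```r
x <- cars$speed
y <- cars$dist

# Pearson correlation: covariance scaled by both standard deviations
r_manual  <- cov(x, y) / (sd(x) * sd(y))
r_builtin <- cor(x, y)

# The two values should agree up to floating-point error
all.equal(r_manual, r_builtin)
```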

## Building Linear Model

Now that we have explored the variables graphically and computed the correlation, let’s see the syntax for building the linear model. The function used for building linear models is `lm()`. It takes two main arguments: the formula and the data. The data is typically a `data.frame`, and the formula is an object of class `formula`. The most common convention, however, is to write the formula directly in place of the argument, as shown below.
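
As a small sketch of the two ways of supplying the formula, the snippet below fits the same model once from a stored formula object and once with the formula written inline. The names `f`, `fit_from_object` and `fit_inline` are illustrative.

```r
# Option 1: store the formula in a variable first
f <- dist ~ speed
class(f)                                   # "formula"
fit_from_object <- lm(f, data = cars)

# Option 2: write the formula directly in the call (the common convention)
fit_inline <- lm(dist ~ speed, data = cars)

# Both routes estimate the same coefficients
all.equal(coef(fit_from_object), coef(fit_inline))
```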

### Description

`lm` is used to fit linear models. It can be used to carry out regression, single stratum analysis of variance and analysis of covariance (although `aov` may provide a more convenient interface for these).

### Usage

```lm(formula, data, subset, weights, na.action,
method = "qr", model = TRUE, x = FALSE, y = FALSE, qr = TRUE,
singular.ok = TRUE, contrasts = NULL, offset, ...)
```

### Example

``````
# build linear regression model on full data
> linearModel <- lm(dist ~ speed, data=cars)
> linearModel

Call:
lm(formula = dist ~ speed, data = cars)

Coefficients:
(Intercept)        speed
    -17.579        3.932
``````

Now that we have built the linear model, we have also established the relationship between the predictor and the response in the form of a mathematical formula for distance (`dist`) as a function of `speed`. In the output above, the ‘Coefficients’ part has two components: Intercept: -17.579 and speed: 3.932. These are also called the beta coefficients. In other words,

`dist = Intercept + (β * speed)`

`=> dist = -17.579 + 3.932 * speed`
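
For simple linear regression, these two coefficients can also be computed directly from the ordinary least squares formulas: the slope is the covariance of the two variables divided by the variance of the predictor, and the intercept follows from the means. The sketch below checks this against `lm()`. The names `slope` and `intercept` are illustrative.

```r
fit <- lm(dist ~ speed, data = cars)

# Closed-form least squares estimates for one predictor
slope     <- cov(cars$speed, cars$dist) / var(cars$speed)
intercept <- mean(cars$dist) - slope * mean(cars$speed)

# Compare with the coefficients reported by lm()
c(intercept = intercept, slope = slope)
coef(fit)
```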

## Linear Regression Diagnostics

Now the linear model is built and we have a formula that we can use to predict the dist value if a corresponding speed is known. Let’s begin by printing the summary statistics for `linearModel`.

``````
> summary(linearModel)

Call:
lm(formula = dist ~ speed, data = cars)

Residuals:
    Min      1Q  Median      3Q     Max
-29.069  -9.525  -2.272   9.215  43.201

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -17.5791     6.7584  -2.601   0.0123 *
speed         3.9324     0.4155   9.464 1.49e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 15.38 on 48 degrees of freedom
Multiple R-squared:  0.6511,    Adjusted R-squared:  0.6438
F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

> AIC(linearModel)
[1] 419.1569

> BIC(linearModel)
[1] 424.8929
``````

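## How to know if the model is the best fit for your data?

The most common metrics to look at while selecting the model are the ones reported in the output above: the R-squared and adjusted R-squared, the standard errors, t-values and p-values of the coefficients, the F-statistic, and the AIC and BIC.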
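
When comparing several candidate models, it can help to pull these statistics out of the fitted object instead of reading them off the printed summary. The following is a minimal sketch. The `metrics` list and its element names are illustrative, while the accessors are standard components of `summary.lm` objects.

```r
fit <- lm(dist ~ speed, data = cars)
s   <- summary(fit)

# Collect the headline fit statistics in one place
metrics <- list(
  r_squared     = s$r.squared,
  adj_r_squared = s$adj.r.squared,
  residual_se   = s$sigma,
  f_statistic   = unname(s$fstatistic["value"]),
  aic           = AIC(fit),
  bic           = BIC(fit)
)
str(metrics)

# Coefficient table: estimates, standard errors, t values and p-values
coef(s)
```

As a general rule, higher R-squared and lower AIC or BIC indicate a better fit, but these are only comparable between models fitted to the same response data.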

## predict() Function

### Syntax

The basic syntax for predict() in linear regression is −

### Usage

`predict(object, newdata)`

## Let’s predict the stopping distance at speeds of 10, 100 and 1000 mph

``````
> a <- data.frame(speed = c(10, 100, 1000))
> a
  speed
1    10
2   100
3  1000

> predict(linearModel, a)
         1          2          3
  21.74499  375.66178 3914.82966
``````

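As you can see, we get the predicted distance for speeds of 10 mph, 100 mph and 1000 mph. Note that the speeds in `cars` only range from 4 to 25 mph, so the predictions for 100 mph and 1000 mph are extrapolations far outside the observed data and should not be read as realistic estimates.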
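
A point prediction says nothing about its uncertainty. As a hedged sketch, `predict()` for `lm` objects accepts an `interval` argument that returns lower and upper bounds alongside the fitted value. The speeds chosen here (10, 15 and 20 mph) are illustrative values inside the observed range.

```r
fit <- lm(dist ~ speed, data = cars)

range(cars$speed)                          # speeds actually observed in the data

# Predict at speeds inside the observed range, with 95% prediction intervals
new_speeds <- data.frame(speed = c(10, 15, 20))
pred <- predict(fit, newdata = new_speeds, interval = "prediction", level = 0.95)
pred                                       # columns: fit, lwr, upr

# The fitted values are just intercept + slope * speed
manual <- coef(fit)[["(Intercept)"]] + coef(fit)[["speed"]] * new_speeds$speed
all.equal(unname(pred[, "fit"]), manual)
```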

## Visualize the Regression Graphically

``````
# plotting distance against speed with the cars dataset
> plot(cars$speed, cars$dist, xlab="Speed in mph", ylab="Stopping distance in ft",
       main="Distance & Speed Regression")

# Creating a linear model
> linearModel <- lm(dist~speed, data=cars)
> linearModel

Call:
lm(formula = dist ~ speed, data = cars)

Coefficients:
(Intercept)        speed
-17.579        3.932

> abline(linearModel, col="red", lty=2, lwd=3)
``````
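
The regression line can also be drawn together with a confidence band for the mean response. This is a minimal sketch using base graphics. The grid of 100 speeds and the object names `grid` and `band` are illustrative choices.

```r
fit <- lm(dist ~ speed, data = cars)

# Evaluate the model on an evenly spaced grid across the observed speeds
grid <- data.frame(speed = seq(min(cars$speed), max(cars$speed), length.out = 100))
band <- predict(fit, newdata = grid, interval = "confidence")

plot(cars$speed, cars$dist, xlab = "Speed in mph", ylab = "Distance in ft",
     main = "Regression line with 95% confidence band")
lines(grid$speed, band[, "fit"], col = "red", lwd = 2)   # fitted line
lines(grid$speed, band[, "lwr"], col = "red", lty = 2)   # lower bound
lines(grid$speed, band[, "upr"], col = "red", lty = 2)   # upper bound
```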

## Conclusion

In this article, we covered what linear regression is, what correlation is, how to build your own linear model in R, and how to check whether the model is a good fit. We also showed how to predict values with the model we built, and finally we visualized the regression using the `plot()` function.

This brings us to the end of this blog. We really appreciate your time.

Hope you liked it.