tidylinreg Example

The tidylinreg package fits a linear model to a dataset, and can be used to carry out regression. tidylinreg computes and returns a list of summary statistics of the coefficients of the fitted linear model, including standard error, test statistic, confidence interval, and p-value.

To use tidylinreg in a project:

from tidylinreg.tidylinreg import LinearModel

LinearModel

tidylinreg.tidylinreg.LinearModel

Data

To demonstrate Linear Regression using tidylinreg, we can use the California housing dataset. This dataset has 8 numeric, predictive attributes and the target.

Attributes:

MedInc: median income in block group
HouseAge: median house age in block group
AveRooms: average number of rooms per household
AveBedrms: average number of bedrooms per household
Population: block group population
AveOccup: average number of household members
Latitude: block group latitude
Longitude: block group longitude

Target: The target variable is the median house value for California districts, expressed in hundreds of thousands of dollars ($100,000).

from sklearn.datasets import fetch_california_housing
import pandas as pd

housing = fetch_california_housing(as_frame=True)
X = housing["data"]
y = housing["target"]

Train/Test Split

Make sure to split the data into train and test sets to avoid any violations of the Golden Rule!

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=524)

EDA

EDA is not the focus of this example, so here’s just a quick preview of the data. A more detailed explanation of the data can be found here.

X_train

	MedInc	HouseAge	AveRooms	AveBedrms	Population	AveOccup	Latitude	Longitude
190	3.3750	52.0	4.011905	1.071429	647.0	3.851190	37.79	-122.24
19599	1.5500	17.0	5.425150	1.080838	672.0	2.011976	37.56	-120.93
20100	2.0057	52.0	5.061728	1.078189	609.0	2.506173	37.96	-120.23
9708	1.9385	28.0	3.600000	1.039252	1717.0	3.209346	36.67	-121.65
20518	1.9426	32.0	3.957935	1.072658	2046.0	3.912046	38.58	-121.56
...	...	...	...	...	...	...	...	...
19206	1.9932	29.0	4.329949	1.053299	990.0	2.512690	38.47	-122.72
15131	2.3693	11.0	4.434608	1.042254	1472.0	2.961771	32.86	-116.92
5652	2.1382	21.0	3.708333	1.058036	1699.0	2.528274	33.73	-118.29
12733	8.6572	20.0	8.130435	1.027174	1105.0	3.002717	38.58	-121.35
17402	2.2743	38.0	4.021419	1.062918	2601.0	3.481928	34.95	-120.44

14448 rows × 8 columns

Scaling

This model uses simple linear regression with no regularization, so scaling of the features is optional but not required. Scaling is useful if we are interested in comparing relative feature importance; however, we should keep the data in its original scale in this example so that the linear regression coefficient estimates are more interpretable.

# from sklearn.preprocessing import StandardScaler

# scaler = StandardScaler()
# X_train = pd.DataFrame(scaler.fit_transform(X_train),
#                        columns=X_train.columns,
#                        index=X_train.index)
# X_test = pd.DataFrame(scaler.fit_transform(X_test),
#                       columns=X_test.columns,
#                       index=X_test.index)

Fit

Let’s define our model and fit it to the training data!

Note:, we don’t need to add a column of ones to our data to get an intercept. The fit method takes care of this for you!

model = LinearModel()
model.fit(X_train, y_train)

Predict

How do our predictions look compared to the true values?

import matplotlib.pyplot as plt

preds = model.predict(X_test)

# Create a scatter plot
plt.scatter(y_test, preds, color='blue', label='Predictions')  # Scatter plot

# Add a reference line (y = x)
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], 
         color='red', linestyle='--', label='Ideal Fit (y = x)')

# Add labels and title
plt.xlabel('True values')
plt.ylabel('Predictions')
plt.title('California Housing Predictions')
plt.legend()

# Display the plot
plt.grid(True)  # Optional: Add a grid for better readability
plt.show()

_images/dcd06b1af78196d0f719d08c371aed174f45e4cfb3cdb1a6424118db3f07f036.png

Our model is capturing the general trend of the data!

Summary Statistics

This is the great part of tidylinreg! Calling summary() will calculate important statistical values for the linear model coefficient estimates, including standard error, test statistic, confidence interval, and p-values. The output of summary() is generated by calling the following internal class methods:

get_std_error()
get_test_statistic()
get_ci()
get_pvalues()

model.summary()

	Parameter	Estimate	Std. Error	T-Statistic	P-Value
0	(Intercept)	-37.004147	0.790393	-46.817373	0.000000e+00
1	MedInc	0.438382	0.005039	86.989990	0.000000e+00
2	HouseAge	0.009267	0.000535	17.314654	0.000000e+00
3	AveRooms	-0.107089	0.006973	-15.358336	0.000000e+00
4	AveBedrms	0.635964	0.033466	19.003273	0.000000e+00
5	Population	-0.000007	0.000006	-1.160798	2.457431e-01
6	AveOccup	-0.003739	0.000536	-6.981016	3.058442e-12
7	Latitude	-0.422767	0.008629	-48.992423	0.000000e+00
8	Longitude	-0.435591	0.009039	-48.191074	0.000000e+00

As stated before, the intercept estimate was included without any need for modifying the dataframe.

Interpretations

Wow, it looks like all of our coefficient estimates are statistically significant, except for Population (with a significance level alpha = 0.05). We can also make interpretations on these statistically significant coefficients; for example, we can say that a unit increase in median income in the block group (MedInc) is associated with an increase of $43,838.20 in the median house value for the housing district (holding all other factors constant).

Lets look at another less intuitive example: We can see the AveOccup parameter has a statistically significant estimate of -0.003739. We can say that increasing the average occupancy by one person is associated to a decrease of $373.90 in the median house value for the housing district, while holding other factors constant.