{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# tidylinreg Example\n", "\n", "The `tidylinreg` package fits a linear model to a dataset, and can be used to carry out regression. `tidylinreg` computes and returns a list of summary statistics of the coefficients of the fitted linear model, including standard error, test statistic, confidence interval, and p-value.\n", "\n", "To use `tidylinreg` in a project:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "tidylinreg.tidylinreg.LinearModel" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from tidylinreg.tidylinreg import LinearModel\n", "\n", "LinearModel" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Data\n", "\n", "To demonstrate Linear Regression using `tidylinreg`, we can use the [California housing dataset](https://scikit-learn.org/dev/modules/generated/sklearn.datasets.fetch_california_housing.html). This dataset has 8 numeric, predictive attributes and the target.\n", "\n", "Attributes:\n", "- MedInc: median income in block group\n", "- HouseAge: median house age in block group\n", "- AveRooms: average number of rooms per household\n", "- AveBedrms: average number of bedrooms per household\n", "- Population: block group population\n", "- AveOccup: average number of household members\n", "- Latitude: block group latitude\n", "- Longitude: block group longitude\n", "\n", "Target: The target variable is the median house value for California districts, expressed in hundreds of thousands of dollars ($100,000)." 
] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "from sklearn.datasets import fetch_california_housing\n", "import pandas as pd\n", "\n", "housing = fetch_california_housing(as_frame=True)\n", "X = housing[\"data\"]\n", "y = housing[\"target\"]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Train/Test Split\n", "\n", "Make sure to split the data into train and test sets to avoid any violations of the **Golden Rule**!" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import train_test_split\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=524)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### EDA\n", "\n", "EDA is not the focus of this example, so here's just a quick preview of the data. A more detailed explanation of the data can be found [here](https://inria.github.io/scikit-learn-mooc/python_scripts/datasets_california_housing.html)." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
|  | MedInc | HouseAge | AveRooms | AveBedrms | Population | AveOccup | Latitude | Longitude |
|---|---|---|---|---|---|---|---|---|
| 190 | 3.3750 | 52.0 | 4.011905 | 1.071429 | 647.0 | 3.851190 | 37.79 | -122.24 |
| 19599 | 1.5500 | 17.0 | 5.425150 | 1.080838 | 672.0 | 2.011976 | 37.56 | -120.93 |
| 20100 | 2.0057 | 52.0 | 5.061728 | 1.078189 | 609.0 | 2.506173 | 37.96 | -120.23 |
| 9708 | 1.9385 | 28.0 | 3.600000 | 1.039252 | 1717.0 | 3.209346 | 36.67 | -121.65 |
| 20518 | 1.9426 | 32.0 | 3.957935 | 1.072658 | 2046.0 | 3.912046 | 38.58 | -121.56 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 19206 | 1.9932 | 29.0 | 4.329949 | 1.053299 | 990.0 | 2.512690 | 38.47 | -122.72 |
| 15131 | 2.3693 | 11.0 | 4.434608 | 1.042254 | 1472.0 | 2.961771 | 32.86 | -116.92 |
| 5652 | 2.1382 | 21.0 | 3.708333 | 1.058036 | 1699.0 | 2.528274 | 33.73 | -118.29 |
| 12733 | 8.6572 | 20.0 | 8.130435 | 1.027174 | 1105.0 | 3.002717 | 38.58 | -121.35 |
| 17402 | 2.2743 | 38.0 | 4.021419 | 1.062918 | 2601.0 | 3.481928 | 34.95 | -120.44 |
14448 rows × 8 columns

|  | Parameter | Estimate | Std. Error | T-Statistic | P-Value |
|---|---|---|---|---|---|
| 0 | (Intercept) | -37.004147 | 0.790393 | -46.817373 | 0.000000e+00 |
| 1 | MedInc | 0.438382 | 0.005039 | 86.989990 | 0.000000e+00 |
| 2 | HouseAge | 0.009267 | 0.000535 | 17.314654 | 0.000000e+00 |
| 3 | AveRooms | -0.107089 | 0.006973 | -15.358336 | 0.000000e+00 |
| 4 | AveBedrms | 0.635964 | 0.033466 | 19.003273 | 0.000000e+00 |
| 5 | Population | -0.000007 | 0.000006 | -1.160798 | 2.457431e-01 |
| 6 | AveOccup | -0.003739 | 0.000536 | -6.981016 | 3.058442e-12 |
| 7 | Latitude | -0.422767 | 0.008629 | -48.992423 | 0.000000e+00 |
| 8 | Longitude | -0.435591 | 0.009039 | -48.191074 | 0.000000e+00 |
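
For readers curious where the columns of a summary table like this come from, here is a minimal sketch of the standard ordinary least squares formulas using NumPy and SciPy. It is not `tidylinreg`'s implementation: the function name `ols_summary`, its `alpha` parameter, and the returned dictionary keys are all hypothetical names chosen for illustration, and the demo data is synthetic.

```python
import numpy as np
from scipy import stats


def ols_summary(X, y, alpha=0.05):
    """Hypothetical helper: per-coefficient OLS statistics.

    X is an (n, k) array of predictors WITHOUT an intercept column;
    an intercept is added here as the first coefficient.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = X.shape[0]

    # Design matrix with a leading column of ones for the intercept.
    design = np.column_stack([np.ones(n), X])
    p = design.shape[1]

    # Least-squares coefficient estimates.
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)

    # Residual variance with n - p degrees of freedom.
    resid = y - design @ beta
    dof = n - p
    sigma2 = resid @ resid / dof

    # Standard errors from the diagonal of sigma^2 * (X'X)^-1.
    cov = sigma2 * np.linalg.inv(design.T @ design)
    std_error = np.sqrt(np.diag(cov))

    # Two-sided t-test of each coefficient against zero.
    t_stat = beta / std_error
    p_value = 2 * stats.t.sf(np.abs(t_stat), dof)

    # (1 - alpha) confidence interval for each coefficient.
    half_width = stats.t.ppf(1 - alpha / 2, dof) * std_error

    return {
        "estimate": beta,
        "std_error": std_error,
        "t_statistic": t_stat,
        "p_value": p_value,
        "ci_lower": beta - half_width,
        "ci_upper": beta + half_width,
    }


# Synthetic demo data (not the housing dataset): y = 1 + 2*x1 - 3*x2 + noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(500, 2))
y_demo = 1.0 + 2.0 * X_demo[:, 0] - 3.0 * X_demo[:, 1] + rng.normal(scale=0.1, size=500)
demo = ols_summary(X_demo, y_demo)
```

The sketch computes p-values with the survival function `stats.t.sf` rather than `1 - cdf`, which keeps precision when the t-statistic is large and the p-value is very small.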