ds1001_final/ds1001_final/notebooks/Final_Project_Notebook.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "3ba52918",
   "metadata": {},
   "source": [
    "# Final Project Notebook\n",
    "\n",
    "Use the prompts in the following cells to complete the final project for the course. Everything you need should be present in this notebook or in previous notebooks we've used in class. You can work together as needed. \n",
    "\n",
    " - You will need to name your own dataset and use that name throughout.\n",
    " - In some sections you will need to change the provided code or insert new code; this is noted in the code provided.\n",
    " - You may get frustrated along the way. This is totally normal; just remember that even small changes to the code can make a huge difference."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d45a3b8d",
   "metadata": {},
   "source": [
    "## Question: Fork the Repository\n",
    "i. Include a screenshot of the forked repo in your GitHub account\n",
    "\n",
    "To fork the repository:\n",
    "1. Go to https://github.com/NovaVolunteer/ds1001_final\n",
    "2. Click the \"Fork\" button in the top right corner\n",
    "3. The repo will be forked to your GitHub account\n",
    "4. Take a screenshot of your forked repository"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2f556f44",
   "metadata": {},
   "source": [
    "### You should now be able to open your forked repo in Google Colab by cloning it with the code below.\n",
    "\n",
    "### It's very helpful to have the variable inspector open while you go through this process. To open it, go to Tools > Command palette > Show variable inspector.\n",
    "\n",
    "### It's also helpful to open the folder tree on the left menu bar. Just click on the folder icon and then the ds1001_final folder. The data is located in the data folder, in the processed sub-folder."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fd9c6e15",
   "metadata": {},
   "outputs": [],
   "source": [
    "!git clone \"https://github.com/username/repository.git\"\n",
    "# This command clones a GitHub repository using the Git command-line tool.\n",
    "# Replace the URL above with the URL of your forked repository."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "49e14c0f",
   "metadata": {},
   "source": [
    "## Systems"
   ]
  },
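  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a2fd1d5c",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Activate the finalproj environment\n",
    "# Note: each ! command runs in its own subshell, so this activation does not carry over to later cells.\n",
    "!source ds1001_final/ds1001_final/finalproj/bin/activate"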
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "db255838",
   "metadata": {},
   "outputs": [],
   "source": [
    "### You can use this command to list all the packages in your environment\n",
    "!pip list"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install \"XX\"\n",
    "\n",
    "# You'll likely need to install the fairlearn package, if it is not already installed.\n",
    "# Are there additional packages to install? (Cross-check with the list above to\n",
    "# ensure all packages are installed.)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "aae8b41a",
   "metadata": {},
   "source": [
    "### Check !pip list again to confirm installations"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cca2a44d",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "import seaborn as sns\n",
    "import matplotlib.pyplot as plt\n",
    "from sklearn.preprocessing import MinMaxScaler\n",
    "from sklearn.neighbors import KNeighborsClassifier\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay\n",
    "import fairlearn.metrics\n",
    "from fairlearn.metrics import MetricFrame\n",
    "from fairlearn.metrics import count, true_positive_rate, false_positive_rate, selection_rate, demographic_parity_ratio\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7369da7c",
   "metadata": {},
   "source": [
    "## Design: Data prep and exploration "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "196ba293",
   "metadata": {},
   "outputs": [],
   "source": [
    "\"name your dataset\" = pd.read_csv('your_dataset.csv') # the data is in the data folder;\n",
    "# you'll need to use the correct path to the dataset.\n",
    "\n",
    "# How many rows are in the dataframe? How many columns?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "aebf9d93",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Explore the variables a bit more: create histograms for the numeric variables and bar charts for the categorical ones.\n",
    "# Histograms for numeric variables\n",
    "numeric_columns = \"xx\".select_dtypes(include=['number']).columns\n",
    "for col in numeric_columns: \n",
    "    plt.figure(figsize=(10, 6))\n",
    "    sns.histplot(\"xx\"[col], kde=True)\n",
    "    plt.title(f'Histogram of {col}')\n",
    "    plt.show()  \n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c1d7469e",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Bar charts for categorical variables\n",
    "categorical_columns = \"xx\".select_dtypes(include=['object', 'category']).columns\n",
    "for col in categorical_columns:\n",
    "    plt.figure(figsize=(10, 6))\n",
    "    sns.countplot(x=\"xx\"[col])\n",
    "    plt.title(f'Bar Chart of {col}')\n",
    "    plt.xticks(rotation=45)\n",
    "    plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9a6964a9",
   "metadata": {},
   "outputs": [],
   "source": [
    "# How many numeric columns are in the data set?\n",
    "num_numeric_columns = \"xx\".select_dtypes(include=['number']).shape[1]\n",
    "print(f'Number of numeric columns: {num_numeric_columns}')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e76a5f9e",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Normalization\n",
    "scaler = MinMaxScaler()\n",
    "\"xx\"[\"xx\".select_dtypes(include=['number']).columns] = scaler.fit_transform(\"xx\".select_dtypes(include=['number']))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d9875873",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Likely need to convert categorical columns to category dtype\n",
    "for col in \"xx\".select_dtypes(include=['object']).columns:\n",
    "    \"xx\"[col] = \"xx\"[col].astype('category')   "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "80626fad",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Creating dummy variables, make sure the variables that need to be converted to dummies are categorical, not numeric.\n",
    "# This might require you to convert some columns to categorical first using astype('category')\n",
    "\"xx\" = pd.get_dummies(\"xx\", drop_first=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f39eb5f0",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Display missing data using the isnull function, is there any missing data?\n",
    "print(\"xx\".isnull().sum())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "26ae0ed6",
   "metadata": {},
   "outputs": [],
   "source": [
    "# remove missing values if needed\n",
    "\"xx\" = \"xx\".dropna()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "acd5b69f",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Scatterplot between two variables\n",
    "plt.figure(figsize=(10, 6))\n",
    "sns.scatterplot(x='Variable1', y='Variable2', data=\"xx\")  # Replace 'Variable1' and 'Variable2' with your column names\n",
    "plt.title('Scatterplot of Variable1 vs Variable2')\n",
    "plt.savefig('scatterplot.png')  # Save the scatterplot image\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4889270f",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Density chart of a continuous variable\n",
    "plt.figure(figsize=(10, 6))\n",
    "sns.kdeplot(\"xx\"['ContinuousVariable'], fill=True)  # Replace 'ContinuousVariable' with your column name\n",
    "plt.title('Density Chart of ContinuousVariable')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9131b326",
   "metadata": {},
   "outputs": [],
   "source": [
    "#Correlation matrix, make sure to only include numeric variables\n",
    "num_values = \"xx\".select_dtypes(include=['number'])\n",
    "correlation_matrix = num_values.corr()\n",
    "plt.figure(figsize=(12, 8))\n",
    "sns.heatmap(correlation_matrix, annot=True, fmt='.2f', cmap='coolwarm', square=True)\n",
    "plt.title('Correlation Matrix')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7072676b",
   "metadata": {},
   "source": [
    "## Analytics: Build a Model and Tune It for Best Performance"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f5d16e93",
   "metadata": {},
   "outputs": [],
   "source": [
    "# What is the ‘target’ of a model and what is the prevalence of the target in your dataset? Remember prevalence \n",
    "# is the proportion of records that take on the value of interest for the target variable, usually the positive class.\n",
    "# For a target coded 0/1, the mean equals that proportion (the sum would only give the count of positives).\n",
    "target_prevalence = \"xx\"['TargetVariable'].mean()  # Replace 'TargetVariable' with your target column name\n",
    "print(f'Target Prevalence: {target_prevalence}')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d6f008fe",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Divide the dataset into features and target\n",
    "target = \"xx\"['TargetVariable']  # Replace 'TargetVariable' with your actual target column name and \"xx\" with your dataframe name\n",
    "features = \"xx\".drop(columns=['TargetVariable'])  # drop() expects the column name, not the target Series\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3e511f88",
   "metadata": {},
   "outputs": [],
   "source": [
    "X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.2, random_state=42)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "4c995e33",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Include your table for the 10 values of k you tried and the corresponding accuracies.\n",
    "\n",
    "accuracy_results = {}\n",
    "\n",
    "for k in range(x, x):  # Replace x with your desired range values, explain what is happening in this loop\n",
    "    knn_model = KNeighborsClassifier(n_neighbors=k)\n",
    "    knn_model.fit(X_train, y_train)\n",
    "    accuracy = knn_model.score(X_test, y_test)\n",
    "    accuracy_results[k] = accuracy"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "366f9e5a",
   "metadata": {},
   "outputs": [],
   "source": [
    "#graph of accuracy vs k values\n",
    "plt.figure(figsize=(10, 6))\n",
    "plt.plot(list(accuracy_results.keys()), list(accuracy_results.values()), marker='o')\n",
    "plt.title('KNN Accuracy vs K Values')\n",
    "plt.xlabel('Number of Neighbors (k)')\n",
    "plt.ylabel('Accuracy')\n",
    "plt.xticks(list(accuracy_results.keys()))\n",
    "plt.grid()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e866ae7d",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Using the hyperparameter k that gave the best accuracy, rerun the model and generate \n",
    "# predictions on the test set. Explain why you chose this k value.\n",
    "best_k = 'xx'  # Replace 'xx' with the best k value found (an integer, without quotes)\n",
    "knn_model = KNeighborsClassifier(n_neighbors=best_k)\n",
    "knn_model.fit(X_train, y_train)\n",
    "y_pred = knn_model.predict(X_test)  "
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3fa4f911",
   "metadata": {},
   "source": [
    "## Value: Evaluation and Protected Classes"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3b20d118",
   "metadata": {},
   "outputs": [],
   "source": [
    "# create a confusion matrix for your model's predictions. \n",
    "# What does the confusion matrix tell you about your model's performance?\n",
    "cm = confusion_matrix(y_test, y_pred)\n",
    "disp = ConfusionMatrixDisplay(confusion_matrix=cm)\n",
    "disp.plot()\n",
    "plt.title('Confusion Matrix')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e12b6ebe",
   "metadata": {},
   "outputs": [],
   "source": [
    "# We already have a KNN model above, so we can use its results to compute fairness metrics.\n",
    "\n",
    "# Compute fairness metrics using Fairlearn\n",
    "\n",
    "my_metrics = {\n",
    "    'true positive rate' : true_positive_rate,\n",
    "    'false positive rate' : false_positive_rate,\n",
    "    'selection rate' : selection_rate,\n",
    "    'count' : count\n",
    "}\n",
    "# Construct a MetricFrame for race\n",
    "mf_race = MetricFrame(\n",
    "    metrics=my_metrics,\n",
    "    y_true=y_test,\n",
    "    y_pred=y_pred,\n",
    "    sensitive_features=X_test[\"xx1\"]  # Replace with your first protected class\n",
    ")\n",
    "\n",
    "# Construct a MetricFrame for gender\n",
    "mf_gender = MetricFrame(\n",
    "    metrics=my_metrics,\n",
    "    y_true=y_test,\n",
    "    y_pred=y_pred,\n",
    "    sensitive_features=X_test[\"xx2\"]  # Replace with your second protected class\n",
    ")  \n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "aa47711b",
   "metadata": {},
   "outputs": [],
   "source": [
    "mf_race.by_group # What do the results show? Compare the metrics across the subgroups and report the findings."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cdcfd773",
   "metadata": {},
   "outputs": [],
   "source": [
    "mf_gender.by_group #What do the results show?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c0ad32f0",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Derived fairness metrics. Be sure you understand the scale and meaning of these. Here we are calculating the \n",
    "# two fairness ratios using the gender_m feature. What do the results show? Is the model more or less fair with this grouping?\n",
    "\n",
    "dpr_gender = fairlearn.metrics.demographic_parity_ratio(y_test, y_pred, sensitive_features=X_test['gender_m'])\n",
    "print(\"Demographic Parity ratio:\\t\", dpr_gender)\n",
    "\n",
    "eodds_gender = fairlearn.metrics.equalized_odds_ratio(y_test, y_pred, sensitive_features=X_test['gender_m'])\n",
    "print(\"Equalized Odds ratio:\\t\\t\", eodds_gender)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8d008f7a",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Derived fairness metrics. Be sure you understand the scale and meaning of these. Here we are calculating the \n",
    "# same ratios as above, only using a regex filter to pull in all the feature columns whose names\n",
    "# start with \"race\". What do the results show? Is the model more or less fair with this grouping?\n",
    "\n",
    "dpr_race = fairlearn.metrics.demographic_parity_ratio(y_test, y_pred, sensitive_features=X_test.filter(regex=\"^race\"))\n",
    "print(\"Demographic Parity ratio:\\t\", dpr_race)\n",
    "\n",
    "eodds_race = fairlearn.metrics.equalized_odds_ratio(y_test, y_pred, sensitive_features=X_test.filter(regex=\"^race\"))\n",
    "print(\"Equalized Odds ratio:\\t\\t\", eodds_race)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "finalproj",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.1"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}