Analysis of PhD application

Abstract

In this Jupyter notebook, I analyze a dataset containing information about applicants to a major university's PhD program. I use machine learning methods to predict whether the admissions committee accepts, waitlists, or rejects each application. The main goal of this project is to predict the rating score assigned to each applicant by the committee.

Table of Contents

  1. Import Libraries
  2. Introduction
    • Data Set
    • Variable Description
  3. Data Analysis
    • Data Visualization
    • Using groupby
  4. Data Preprocessing
    1. Handling Missing Values
    2. Target variable
    3. Converting categorical variables to numeric variables
  5. Overview of the Methods

    1. Gradient Descent
      a. Batch Gradient Descent
      b. Stochastic Gradient Descent
      c. Mini-Batch Gradient Descent
    2. Neural Networks
      a. Building Blocks: Neurons
      b. A Simple Example
      c. Combining Neurons into a Neural Network
      d. Feedforward
      e. Training a Neural Network
  6. Applying Machine Learning to predict DECISION using RATING

    1. Decision Tree
    2. Logistic Regression
    3. Random Forest
    4. Stochastic Gradient Descent
    5. KNN
    6. Gaussian Naive Bayes
    7. Perceptron
    8. SVM
    9. Linear SVM
    10. Adaptive Boosting
    11. XGBoost
    12. Which Model is the best? (Table 1)
  7. Models with Grid Search (to predict Decision)

    1. Decision Tree (DT)
    2. Logistic Regression (LG)
    3. Random Forest (RF)
    4. Stochastic Gradient Descent (SGD)
    5. KNN
    6. Gaussian Naive Bayes (GNB)
    7. Perceptron
    8. SVM
    9. Linear SVM
    10. Adaptive Boosting
    11. XGBoost
    12. Cat Boost
    13. Light GBM
    14. Which Model is the best? (Table 4)
    15. Stacking Approach
    16. Using h2o AutoML
  8. Estimating the Rating Variable
    1. Individual Models
    2. h2o AutoML
    3. h2o GBM
    4. h2o RF
    5. h2o RF
    6. Deep Learning Estimator
    7. Deep Water Estimator
  9. References

Introduction

The data used for this analysis was collected from a major university's Graduate Mathematics Application system for students applying to the Mathematics PhD program. The information is used by the Department of Mathematics to determine which applicants will be admitted into the graduate program. Each year, members of the department review each graduate application and give the prospective student a rating score between one and five (five being the best), with all values in between possible. This rating score determines whether an applicant is accepted, rejected, or put on a waitlist for the university's Mathematics graduate program.

The rating score (RATING) and whether an applicant is accepted, rejected, or put on a waitlist (DECISION) are the variables of interest for this project. The purpose of this research is to create both regression and classification models that can accurately predict RATING and DECISION based on the data submitted by the student. The models we use include Random Forest, Gradient Boosting, Generalized Linear Models, Stacked Ensembles, XGBoost, and Deep Learning.

Data Set

The data is collected in a spreadsheet for easy visual inspection. Each row of data represents a single applicant, identified by a unique identification number. Each application consists of the qualitative and quantitative data described in the table below; these variables make up the columns of the spreadsheet. Note that some of these fields are optional for the student to submit, so not every field has an entry for every student. This creates an issue of missing data, and later on we will discuss how this issue was dealt with.

Table 1.1.

# Variable Description Type
1 Applicant Client ID Application ID Numeric
2 Emphasis Area First choice of study area Factor
3 Emphasis Area 2 Secondary choice of study area Factor
4 Emphasis Area 3 Tertiary choice of study area Factor
5 UU_APPL_CITIZEN US Citizen (Yes or No) Factor(Binary)
6 CTZNSHP Citizenship of the Applicant
7 AGE Age of the applicant in years Numeric
8 SEX Gender of the applicant Factor
9 LOW_INCOME If the applicant is coming from low income family Factor(Binary)
10 UU_FIRSTGEN If the applicant is the first generation attending grad school Factor(Binary)
11 UU_APPL_NTV_LANG Applicant's native language Factor
12 HAS_LANGUAGE_TEST Foreign Language Exam, if applicable (TOEFL iBT, IELTS, or blank) Factor
13 TEST_READ Score on the reading part of TOEFL Numeric
14 TEST_SPEAK Score on the speaking part of TOEFL Numeric
15 TEST_WRITE Score on the writing part of TOEFL Numeric
16 TEST_LISTEN Score on the listening part of TOEFL Numeric
17 MAJOR Applicant's undergraduate major Factor
18 GPA Applicant's GPA Numeric
19 NUM_PREV_INSTS Number of previous institutions the student attended Numeric
20 HAS_GRE_GEN If applicant has taken GRE General exam Factor(Binary)
21 GRE_VERB Raw score on verbal part of the GRE Numeric
22 GRE_QUANT Raw score on quantitative part of the GRE Numeric
23 GRE_AW Raw score on analytical writing part of the GRE Numeric
24 HAS_GRE_SUBJECT If applicant has taken GRE Subject exam Factor(Binary)
25 GRE_SUB Raw score on Math subject GRE Numeric
26 NUM_RECOMMENDS Number of recommenders of the applicant Numeric
27 R_AVG_ORAL Recommenders' average rating of the applicant's oral excellence Numeric
28 R_AVG_WRITTEN Recommenders' average rating of the applicant's written excellence Numeric
29 R_AVG_ACADEMIC Recommenders' average rating of the applicant's academic excellence Numeric
30 R_AVG_KNOWLEDGE Recommenders' average rating of the applicant's knowledge of the field Numeric
31 R_AVG_EMOT Recommenders' average rating of the applicant's emotional excellence Numeric
32 R_AVG_MOT Recommenders' average rating of the applicant's motivational excellence Numeric
33 R_AVG_RES Recommenders' average rating of the applicant's research skill Numeric
34 R_AVG_RATING Recommenders' average overall rating of the applicant Numeric
35 RATING Rating score of the committee Numeric
36 DECISION Faculty application decision (Accept, Reject, or Waitlist) Factor

The data set includes 759 graduate applications submitted for admission in Fall 2016, Fall 2017, Fall 2018, and Fall 2019. There are various missing data points throughout the dataset. Figure 1.1 below shows the number of missing values for each variable across the whole data set; missing data is represented by shorter columns. The bottom of the figure lists the variable names, the top indicates how many data entries we have, the scale on the left gives the percentage of missing data, and the numbers on the right record the number of non-missing observations for each variable. For example, the columns for TEST_READ, TEST_SPEAK, TEST_WRITE, and TEST_LISTEN are noticeably shorter.

Figure 1.1.

The applicant's age (AGE) was calculated using the applicant's birthday and is accurate as of 1 January of the year in which they applied. Also, since not all universities use the same GPA scale, GPA values over four were reviewed and rescaled based on information deduced from the applicant's resume.

Data Analysis

To do some analysis, we first need to load the data into the Jupyter notebook. We can look at the head of the data, but the output is hidden for confidentiality reasons.
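A minimal loading sketch is shown below; the file name and the DataFrame name `students` are assumptions, since the actual data source is confidential.

```python
import pandas as pd

# Hypothetical file name; the real spreadsheet is not public.
students = pd.read_csv("graduate_applications.csv")

students.head()    # output hidden for confidentiality
students.info()    # column names, non-null counts, and dtypes (shown below)
```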

After loading the data into the Jupyter notebook, we can see the name of each variable, the number of non-missing observations it has, and its type.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 759 entries, 0 to 758
Data columns (total 36 columns):
Applicant_Client_ID    759 non-null int64
Emphasis Area          759 non-null object
Emphasis Area 2        759 non-null object
Emphasis Area 3        759 non-null object
UU_APPL_CITIZEN        759 non-null object
CTZNSHP                759 non-null object
AGE                    759 non-null float64
SEX                    759 non-null object
LOW_INCOME             759 non-null object
UU_FIRSTGEN            759 non-null object
UU_APPL_NTV_LANG       759 non-null object
HAS_LANGUAGE_TEST      759 non-null object
TEST_READ              272 non-null float64
TEST_SPEAK             272 non-null float64
TEST_WRITE             272 non-null float64
TEST_LISTEN            272 non-null float64
MAJOR                  759 non-null object
GPA                    759 non-null float64
NUM_PREV_INSTS         759 non-null int64
HAS_GRE_GEN            759 non-null object
GRE_VERB               647 non-null float64
GRE_QUANT              647 non-null float64
GRE_AW                 647 non-null float64
HAS_GRE_SUBJECT        759 non-null object
GRE_SUB                554 non-null float64
NUM_RECOMMENDS         759 non-null int64
R_AVG_ORAL             759 non-null float64
R_AVG_WRITTEN          759 non-null float64
R_AVG_ACADEMIC         759 non-null float64
R_AVG_KNOWLEDGE        759 non-null float64
R_AVG_EMOT             759 non-null float64
R_AVG_MOT              759 non-null float64
R_AVG_RES              759 non-null float64
R_AVG_RATING           759 non-null float64
RATING                 759 non-null float64
DECISION               759 non-null object
dtypes: float64(19), int64(3), object(14)
memory usage: 213.6+ KB

As mentioned in Table 1.1, there are 36 columns and 759 observations. Let us look at the number of missing values for each variable.
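Assuming the DataFrame is named `students`, the per-column counts shown below can be produced with:

```python
# Count missing entries in every column (output shown below)
print(students.isnull().sum())
```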

Applicant_Client_ID      0
Emphasis Area            0
Emphasis Area 2          0
Emphasis Area 3          0
UU_APPL_CITIZEN          0
CTZNSHP                  0
AGE                      0
SEX                      0
LOW_INCOME               0
UU_FIRSTGEN              0
UU_APPL_NTV_LANG         0
HAS_LANGUAGE_TEST        0
TEST_READ              487
TEST_SPEAK             487
TEST_WRITE             487
TEST_LISTEN            487
MAJOR                    0
GPA                      0
NUM_PREV_INSTS           0
HAS_GRE_GEN              0
GRE_VERB               112
GRE_QUANT              112
GRE_AW                 112
HAS_GRE_SUBJECT          0
GRE_SUB                205
NUM_RECOMMENDS           0
R_AVG_ORAL               0
R_AVG_WRITTEN            0
R_AVG_ACADEMIC           0
R_AVG_KNOWLEDGE          0
R_AVG_EMOT               0
R_AVG_MOT                0
R_AVG_RES                0
R_AVG_RATING             0
RATING                   0
DECISION                 0
dtype: int64

Data Visualization

We would like to explore the relationships between variables through visualization. Let us start by counting the number of students who were admitted, rejected, or waitlisted.
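Assuming the same `students` DataFrame, the counts below come from:

```python
# How many applicants fall into each decision category
print(students["DECISION"].value_counts())
```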

Reject      403
Waitlist    242
Admit       114
Name: DECISION, dtype: int64

The histogram of all the students looks like this:

We would like to show the relationships between applicants' GPA, the recommenders' average rating, and the committee rating.

The next figure shows the relationship between the decision variable and gender (the SEX variable).

In the following figure, we will see the relationship between decision, major and sex.

This histogram does not tell us much, except that applicants with unspecified sex are accepted, rejected, or waitlisted in roughly equal numbers.

Let us see the scatter plot of the decision variable against the AGE and GPA variables. This scatter plot shows that there are 9-10 students over the age of 40, and 2 of them were admitted.

We wonder whether the recommenders' average overall rating of the applicant has any relationship with the decision variable. We see no indication that a higher overall rating implies a higher chance of being admitted.

The next plot shows the relationship between low income and the decision. The distribution of each category (Admit, Reject, or Waitlist) looks very similar regardless of the applicant's family income.

Similarly, let us see whether there is a strong relationship between being the first generation to attend graduate school and the decision.

Let us also see whether there is a strong relationship between the number of previous institutions and the decision.

These histograms show that low income, being the first generation in your family to attend graduate school, and the number of previous institutions you studied at are relevant to being accepted.

The next plot shows the relationship between GPA and the decision. From the plot we see that waitlisted applicants have a higher average GPA than admitted applicants.

In the following plot, we see that admitted students tend to have attended a higher number of previous institutions.

Using groupby to see the relations

First, we drop the applicant's client ID. The next table shows the mean of the numerical variables grouped by gender.
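A sketch of the groupby call, again assuming the `students` DataFrame:

```python
# Drop the ID column, then average every numeric column within each gender group
students_no_id = students.drop(columns=["Applicant_Client_ID"])
print(students_no_id.groupby("SEX").mean(numeric_only=True))
```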

AGE TEST_READ TEST_SPEAK TEST_WRITE TEST_LISTEN GPA NUM_PREV_INSTS GRE_VERB GRE_QUANT GRE_AW ... NUM_RECOMMENDS R_AVG_ORAL R_AVG_WRITTEN R_AVG_ACADEMIC R_AVG_KNOWLEDGE R_AVG_EMOT R_AVG_MOT R_AVG_RES R_AVG_RATING RATING
SEX
Female 23.865241 26.148936 22.308511 23.744681 25.680851 3.493529 1.877005 155.368750 163.937500 3.868750 ... 3.197861 14.771658 14.660428 16.567914 16.156150 16.290374 17.993048 14.368984 22.209091 3.754545
Male 24.333816 27.418981 21.875000 24.476852 26.037037 3.328879 1.853526 156.038217 166.065817 3.757962 ... 3.160940 14.212477 14.263291 16.847920 16.963834 15.775769 17.888969 14.633635 22.992043 3.649747
Unspecified 24.073684 27.333333 20.166667 24.222222 22.666667 3.595789 1.789474 153.187500 166.437500 3.468750 ... 3.421053 12.584211 12.963158 16.300000 15.752632 15.005263 16.542105 13.442105 20.826316 4.089474

3 rows × 21 columns

Let us group the students according to their decision and their gender.

SEX Female Male Unspecified
DECISION
Admit 33 76 5
Reject 92 305 6
Waitlist 62 172 8

Let us group the students by decision, gender, and first-generation status.

SEX Female Male Unspecified
DECISION UU_FIRSTGEN
Admit N 12.0 11.0 NaN
Unspecified 20.0 62.0 5.0
Y 1.0 3.0 NaN
Reject N 22.0 61.0 NaN
Unspecified 62.0 223.0 5.0
Y 8.0 21.0 1.0
Waitlist N 11.0 36.0 1.0
Unspecified 46.0 123.0 7.0
Y 5.0 13.0 NaN

Data Preprocessing

Handling Missing values

The following table shows how many missing values each variable has.

Emphasis Area          0
Emphasis Area 2        0
Emphasis Area 3        0
UU_APPL_CITIZEN        0
CTZNSHP                0
AGE                    0
SEX                    0
LOW_INCOME             0
UU_FIRSTGEN            0
UU_APPL_NTV_LANG       0
HAS_LANGUAGE_TEST      0
TEST_READ            487
TEST_SPEAK           487
TEST_WRITE           487
TEST_LISTEN          487
MAJOR                  0
GPA                    0
NUM_PREV_INSTS         0
HAS_GRE_GEN            0
GRE_VERB             112
GRE_QUANT            112
GRE_AW               112
HAS_GRE_SUBJECT        0
GRE_SUB              205
NUM_RECOMMENDS         0
R_AVG_ORAL             0
R_AVG_WRITTEN          0
R_AVG_ACADEMIC         0
R_AVG_KNOWLEDGE        0
R_AVG_EMOT             0
R_AVG_MOT              0
R_AVG_RES              0
R_AVG_RATING           0
RATING                 0
DECISION               0
dtype: int64

Looking at the table, we can either drop the 8 variables that have missing values or fill them in with the mean, median, or most common value. We will go with the latter approach; in other words, we will apply a simple imputation method.

Before imputation, let us first take a look at GPA. We know GPA should not be higher than 4, so let us check whether any GPA values exceed 4.

733    4.08
105    4.05
34     4.00
36     4.00
69     4.00
114    4.00
Name: GPA, dtype: float64

This shows that two students entered a GPA higher than 4. We will cap these values at 4.00 to be consistent.
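A sketch of the capping step, assuming the `students` DataFrame:

```python
# Cap any GPA above 4.0 at exactly 4.0
students.loc[students["GPA"] > 4.0, "GPA"] = 4.0
print(students["GPA"].max())   # should now be 4.0
```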

We would now like to impute all of the missing values. Many of the applicants are from English-speaking countries such as the US, the UK, and Canada, which is why their TOEFL scores are missing. According to https://www.prepscholar.com/toefl/blog/what-is-the-average-toefl-score/, the average TOEFL scores in the United States are 21 for Reading, 23 for Speaking, 22 for Writing, and 23 for Listening, for a total of 89; the averages for the UK and Canada are very similar. Before imputing the missing values with these country averages, we first look at the average TOEFL section scores of the other students, grouped by gender.

AGE TEST_READ TEST_SPEAK TEST_WRITE TEST_LISTEN GPA NUM_PREV_INSTS GRE_VERB GRE_QUANT GRE_AW ... NUM_RECOMMENDS R_AVG_ORAL R_AVG_WRITTEN R_AVG_ACADEMIC R_AVG_KNOWLEDGE R_AVG_EMOT R_AVG_MOT R_AVG_RES R_AVG_RATING RATING
SEX
Female 23.865241 26.148936 22.308511 23.744681 25.680851 3.493102 1.877005 155.368750 163.937500 3.868750 ... 3.197861 14.771658 14.660428 16.567914 16.156150 16.290374 17.993048 14.368984 22.209091 3.754545
Male 24.333816 27.418981 21.875000 24.476852 26.037037 3.328788 1.853526 156.038217 166.065817 3.757962 ... 3.160940 14.212477 14.263291 16.847920 16.963834 15.775769 17.888969 14.633635 22.992043 3.649747
Unspecified 24.073684 27.333333 20.166667 24.222222 22.666667 3.595789 1.789474 153.187500 166.437500 3.468750 ... 3.421053 12.584211 12.963158 16.300000 15.752632 15.005263 16.542105 13.442105 20.826316 4.089474

3 rows × 21 columns

Imputing the missing Values

We will replace the missing values of each variable with the mean of the non-missing observations for that variable, grouped by gender.
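One way to do this with pandas is a group-wise transform; the column list and the `students` name are assumptions:

```python
# Columns with missing values, per the table above
cols_with_missing = ["TEST_READ", "TEST_SPEAK", "TEST_WRITE", "TEST_LISTEN",
                     "GRE_VERB", "GRE_QUANT", "GRE_AW", "GRE_SUB"]

for col in cols_with_missing:
    # Replace NaNs with the mean of the same column within the applicant's SEX group
    students[col] = students[col].fillna(
        students.groupby("SEX")[col].transform("mean"))
```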

To make sure our code works, we check whether any missing values remain.

Emphasis Area        0
Emphasis Area 2      0
Emphasis Area 3      0
UU_APPL_CITIZEN      0
CTZNSHP              0
AGE                  0
SEX                  0
LOW_INCOME           0
UU_FIRSTGEN          0
UU_APPL_NTV_LANG     0
HAS_LANGUAGE_TEST    0
TEST_READ            0
TEST_SPEAK           0
TEST_WRITE           0
TEST_LISTEN          0
MAJOR                0
GPA                  0
NUM_PREV_INSTS       0
HAS_GRE_GEN          0
GRE_VERB             0
GRE_QUANT            0
GRE_AW               0
HAS_GRE_SUBJECT      0
GRE_SUB              0
NUM_RECOMMENDS       0
R_AVG_ORAL           0
R_AVG_WRITTEN        0
R_AVG_ACADEMIC       0
R_AVG_KNOWLEDGE      0
R_AVG_EMOT           0
R_AVG_MOT            0
R_AVG_RES            0
R_AVG_RATING         0
RATING               0
DECISION             0
dtype: int64

After imputing all the variables, it is time to look at the histogram of each variable.

Target Variable

 mu = 3.69 and sigma = 0.78

The distribution is slightly left-skewed, but we will keep it this way.

Now the data is almost ready. We would like to convert the categorical variables to numeric variables.

Converting categorical variables to numeric variables.


Many machine learning algorithms can support categorical values without further manipulation but there are many more algorithms that do not. For example, machine learning models, such as regression or SVM, are algebraic. This means that their input must be numerical. To use these models, categories must be transformed into numbers first, before you can apply the learning algorithm on them. Therefore, the analyst is faced with the challenge of figuring out how to turn these text attributes into numerical values for further processing.

We will use the one-hot encoding technique to convert all the categorical variables into numeric variables.
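A sketch using `pandas.get_dummies`, keeping DECISION untouched as the target (the names are assumptions):

```python
import pandas as pd

# One-hot encode every remaining object (categorical) column except the target
categorical_cols = students.select_dtypes(include="object").columns.drop("DECISION")
students_encoded = pd.get_dummies(students, columns=categorical_cols, drop_first=True)
print(students_encoded.shape)
```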

Overview of the methods.

Gradient Descent

Gradient Descent is a very generic optimization algorithm capable of finding optimal solutions to a wide range of problems. The general idea of Gradient Descent is to tweak parameters iteratively in order to minimize a cost function.

MSE cost function for a Linear Regression model $$ MSE(X,h_\theta)=\frac{1}{m}\sum_{i=1}^m \left(\theta^T \cdot x^{(i)}-y^{(i)}\right)^2$$ where $\theta$ is the model’s parameter vector, containing the bias term $\theta_0$ and the feature weights $\theta_1$ to $\theta_n$.

  • $\theta^T$ is the transpose of $\theta$ (a row vector instead of a column vector).
  • $x$ is the instance’s feature vector, containing $x_0$ to $x_n$, with $x_0$ always equal to 1.
  • $\theta^T \cdot x$ is the dot product of $\theta^T$ and $x$.
  • $h_\theta$ is the hypothesis function, using the model parameters $\theta$.

Gradient Descent measures the local gradient of the error function with regards to the parameter vector $\theta $, and it goes in the direction of descending gradient. Once the gradient is zero, you have reached a minimum.

Concretely, you start by filling $\theta$ with random values (this is called random initialization), and then you improve it gradually, taking one baby step at a time, each step attempting to decrease the cost function (e.g., the MSE), until the algorithm converges to a minimum.

An important parameter in Gradient Descent is the size of the steps, determined by the learning rate hyperparameter. If the learning rate is too small, then the algorithm will have to go through many iterations to converge, which will take a long time.

On the other hand, if the learning rate is too high, you might jump across the valley and end up on the other side, possibly even higher up than you were before. This might make the algorithm diverge, with larger and larger values, failing to find a good solution.

Finally, not all cost functions look like nice regular bowls. There may be holes, ridges, plateaus, and all sorts of irregular terrains, making convergence to the minimum very difficult. Next figure shows the two main challenges with Gradient Descent: if the random initialization starts the algorithm on the left, then it will converge to a local minimum, which is not as good as the global minimum. If it starts on the right, then it will take a very long time to cross the plateau, and if you stop too early you will never reach the global minimum.

Fortunately, the MSE cost function for a Linear Regression model happens to be a convex function, which means that if you pick any two points on the curve, the line segment joining them never crosses the curve. This implies that there are no local minima, just one global minimum. It is also a continuous function whose derivative is Lipschitz continuous. These two facts have a great consequence: Gradient Descent is guaranteed to approach arbitrarily close to the global minimum (if you wait long enough and if the learning rate is not too high).

Batch Gradient Descent

To implement Gradient Descent, you need to compute the gradient of the cost function with regards to each model parameter $\theta_j$. In other words, you need to calculate partial derivatives. $$\frac{\partial }{\partial \theta_j}MSE(\theta) = \frac{2}{m}\sum_{i=1}^m \left(\theta^T \cdot x^{(i)}-y^{(i)}\right)x^{(i)}_j$$

Instead of computing these gradients individually, you can use $$ \nabla_\theta MSE(\theta)= \frac{2}{m}X^T\cdot(X\cdot \theta-y) $$ to compute them all in one go. The gradient vector, noted $\nabla_\theta MSE(\theta)$, contains all the partial derivatives of the cost function (one for each model parameter).

Notice that this formula involves calculations over the full training set X, at each Gradient Descent step! This is why the algorithm is called Batch Gradient Descent: it uses the whole batch of training data at every step. As a result it is terribly slow on very large training sets (but we will see much faster Gradient Descent algorithms shortly). However, Gradient Descent scales well with the number of features; training a Linear Regression model when there are hundreds of thousands of features is much faster using Gradient Descent than using the Normal Equation.

Once you have the gradient vector, which points uphill, just go in the opposite direction to go downhill. This means subtracting $\nabla_\theta MSE(\theta)$ from $\theta$. This is where the learning rate $\eta$ comes into play: multiply the gradient vector by $\eta$ to determine the size of the downhill step. $$\theta^{next \; step }=\theta-\eta \nabla_\theta MSE(\theta) $$

But what if you had used a different learning rate $\eta$? The next figure shows the first 10 steps of Gradient Descent using three different learning rates (the dashed line represents the starting point). The blue lines show how Gradient Descent starts and then slowly gets closer to the final value. The dots are points in our data; $y$ corresponds to the output variable and $x_1$ is the predictor.

On the left, the learning rate is too low: the algorithm will eventually reach the solution, but it will take a long time. In the middle, the learning rate looks pretty good: in just a few iterations, it has already converged to the solution. On the right, the learning rate is too high: the algorithm diverges, jumping all over the place and actually getting further and further away from the solution at every step. To find a good learning rate, you can use grid search. However, you may want to limit the number of iterations so that grid search can eliminate models that take too long to converge.
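As a quick illustration of the batch update rule above, here is a minimal NumPy sketch on synthetic data (the data, learning rate, and iteration count are made up for the example):

```python
import numpy as np

rng = np.random.default_rng(42)
m = 100
X = 2 * rng.random((m, 1))                      # single feature
y = 4 + 3 * X + rng.normal(size=(m, 1))         # y = 4 + 3x + noise

X_b = np.c_[np.ones((m, 1)), X]                 # add x0 = 1 for the bias term
eta = 0.1                                       # learning rate
theta = rng.normal(size=(2, 1))                 # random initialization

for _ in range(1000):
    gradients = 2 / m * X_b.T @ (X_b @ theta - y)   # gradient over the full batch
    theta -= eta * gradients

print(theta.ravel())                            # close to [4, 3]
```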

Stochastic Gradient Descent

The main problem with Batch Gradient Descent is the fact that it uses the whole training set to compute the gradients at every step, which makes it very slow when the training set is large. At the opposite extreme, Stochastic Gradient Descent just picks a random instance in the training set at every step and computes the gradients based only on that single instance. Obviously this makes the algorithm much faster since it has very little data to manipulate at every iteration. It also makes it possible to train on huge training sets, since only one instance needs to be in memory at each iteration (SGD can be implemented as an out-of-core algorithm.)

On the other hand, due to its stochastic (i.e., random) nature, this algorithm is much less regular than Batch Gradient Descent: instead of gently decreasing until it reaches the minimum, the cost function will bounce up and down, decreasing only on average. Over time it will end up very close to the minimum, but once it gets there it will continue to bounce around, never settling down. So once the algorithm stops, the final parameter values are good, but not optimal.

When the cost function is very irregular, this can actually help the algorithm jump out of local minima, so Stochastic Gradient Descent has a better chance of finding the global minimum than Batch Gradient Descent does.

Therefore randomness is good to escape from local optima, but bad because it means that the algorithm can never settle at the minimum. One solution to this dilemma is to gradually reduce the learning rate. The steps start out large (which helps make quick progress and escape local minima), then get smaller and smaller, allowing the algorithm to settle at the global minimum. This process is called simulated annealing, because it resembles the process of annealing in metallurgy where molten metal is slowly cooled down. The function that determines the learning rate at each iteration is called the learning schedule. If the learning rate is reduced too quickly, you may get stuck in a local minimum, or even end up frozen halfway to the minimum. If the learning rate is reduced too slowly, you may jump around the minimum for a long time and end up with a suboptimal solution if you halt training too early.

By convention we iterate by rounds of m iterations; each round is called an epoch.
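A matching sketch of the stochastic version, continuing with `X_b`, `y`, `m`, and `rng` from the batch example above; the learning schedule and its hyperparameters are illustrative:

```python
n_epochs = 50
t0, t1 = 5, 50                                  # learning-schedule hyperparameters

def learning_schedule(t):
    return t0 / (t + t1)

theta = rng.normal(size=(2, 1))                 # fresh random initialization

for epoch in range(n_epochs):
    for i in range(m):                          # m iterations per epoch
        idx = rng.integers(m)                   # pick one random instance
        xi, yi = X_b[idx:idx + 1], y[idx:idx + 1]
        gradients = 2 * xi.T @ (xi @ theta - yi)     # gradient from a single instance
        theta -= learning_schedule(epoch * m + i) * gradients

print(theta.ravel())
```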

Mini-batch Gradient Descent

The last Gradient Descent algorithm we will look at is called Mini-batch Gradient Descent. It is quite simple to understand once you know Batch and Stochastic Gradient Descent: at each step, instead of computing the gradients based on the full training set (as in Batch GD) or based on just one instance (as in Stochastic GD), Minibatch GD computes the gradients on small random sets of instances called minibatches.

The main advantage of Mini-batch GD over Stochastic GD is that you can get a performance boost from hardware optimization of matrix operations, especially when using GPUs.

The algorithm’s progress in parameter space is less erratic than with SGD, especially with fairly large mini-batches. As a result, Mini-batch GD will end up walking around a bit closer to the minimum than SGD. But, on the other hand, it may be harder for it to escape from local minima (in the case of problems that suffer from local minima, unlike Linear Regression as we saw earlier). The next figure shows the paths taken by the three Gradient Descent algorithms in parameter space during training. They all end up near the minimum, but Batch GD’s path actually stops at the minimum, while both Stochastic GD and Mini-batch GD continue to walk around. However, don’t forget that Batch GD takes a lot of time to take each step, and Stochastic GD and Mini-batch GD would also reach the minimum if you used a good learning schedule.

Neural Networks

Building Blocks: Neurons

First, we have to talk about neurons, the basic unit of a neural network. A neuron takes inputs, does some math with them, and produces one output. Here’s what a 2-input neuron looks like:

3 things are happening here. First, in a red square, each input is multiplied by a weight:

\begin{align} x_1 & \to x_1* w_1\\ x_2 & \to x_2* w_2\\ \end{align}

Next, in a blue square, all the weighted inputs are added together with a bias b:

$$(x_1*w_1)+(x_2*w_2)+b$$

Finally, in the orange square, the sum is passed through an activation function

$$y=f(x_1*w_1+x_2*w_2+b)$$

The activation function is used to turn an unbounded input into an output that has a nice, predictable form. A commonly used activation function is the sigmoid function: \begin{equation} {\displaystyle S(x)={\frac {1}{1+e^{-x}}}={\frac {e^{x}}{e^{x}+1}}.} \end{equation}

The sigmoid function only outputs numbers in the range (0,1). You can think of it as compressing $(-\infty, +\infty)$ to $(0,1)$ - big negative numbers become $\sim 0$, and big positive numbers become $\sim 1$

A sigmoid function is a bounded, differentiable, real function that is defined for all real input values and has a non-negative derivative at each point. A sigmoid "function" and a sigmoid "curve" refer to the same object.

A Simple Example

Assume we have a 2-input neuron that uses the sigmoid activation function and has the following parameters:

\begin{align} w &=(0,1) \\ b & = 4\\ \end{align}

where $w_1=0$ and $w_2=1$. Now, let’s give the neuron an input of $x=(2,3)$. We’ll use the dot product to write things more concisely: \begin{align} (w*x)+b= & ((w_1*x_1)+(w_2*x_2))+b \\ =& 0*2+1*3+4\\ =& 7\\ y=f(w*x+b)=&f(7)=1 / (1 + e^{-7})= 0.999 \end{align}

The neuron outputs 0.999 given the inputs $x=(2,3)$. That’s it! This process of passing inputs forward to get an output is known as feedforward.

Combining Neurons into a Neural Network

A neural network is nothing more than a bunch of neurons connected together. Here’s what a simple neural network might look like:

This network has 2 inputs, a hidden layer with 2 neurons $(h_1$ and $h_2)$, and an output layer with $1$ neuron $(o_1)$. Notice that the inputs for $o_1$ are the outputs from $h_1$ and $h_2$ - that’s what makes this a network.

A hidden layer is any layer between the input (first) layer and output (last) layer. There can be multiple hidden layers!

An Example: Feedforward

Let’s use the network pictured above and assume all neurons have the same weights $w=(0,1)$, the same bias $b = 0$, and the same sigmoid activation function. Let $h_1, h_2, o_1$ denote the outputs of the neurons they represent.

What happens if we pass in the input $x = (2, 3)$?

\begin{align} h_1=h_2&=f(w* x+b) \\ &=f((0* 2)+(1* 3)+0)\\ &=f(3)\\ &=1 / (1 + e^{-3})\\ &=0.9526 \\ o_1&=f(w* (h_1,h_2)+b)\\ &=f((0* h_1)+(1* h_2)+0)\\ &=f(0.9526)\\ &=1 / (1 + e^{-0.9526})\\ &=0.7216 \end{align}

The output of the neural network for input $x = (2, 3)$ is 0.7216. Pretty simple, right?

A neural network can have any number of layers with any number of neurons in those layers. The basic idea stays the same: feed the input(s) forward through the neurons in the network to get the output(s) at the end. For simplicity, we’ll keep using the network pictured above for the rest of this topic.
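As a check on the arithmetic above, here is a minimal NumPy sketch of the same two-input, one-hidden-layer network with $w=(0,1)$ and $b=0$:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

class Neuron:
    def __init__(self, weights, bias):
        self.weights = weights
        self.bias = bias

    def feedforward(self, inputs):
        # weighted sum plus bias, passed through the activation function
        return sigmoid(np.dot(self.weights, inputs) + self.bias)

w, b = np.array([0, 1]), 0
h1, h2, o1 = Neuron(w, b), Neuron(w, b), Neuron(w, b)

x = np.array([2, 3])
out_h = np.array([h1.feedforward(x), h2.feedforward(x)])
print(o1.feedforward(out_h))    # ~0.7216
```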

Training a Neural Network

Say we have the following measurements:

Name Weight(lb) Height(in) Gender
Alice 132 65 F
Bob 160 72 M
Charlie 152 75 M
Diana 120 60 F

Let’s train our network to predict someone’s gender given their weight and height:

We’ll represent Male with a 0 and Female with a 1, and we will also shift the data to make it easier to use:

Name Weight (minus 141) Height (minus 68 ) Gender
Alice -9 -3 1
Bob 19 4 0
Charlie 11 7 0
Diana -21 -8 1

Here, note that $(132+160+152+120)/4=141$ and $(65+72+75+60)/4=68$

Loss

Before we train our network, we first need a way to quantify how "good" it's doing so that it can try to do "better". That's what the loss is.

We'll use the mean squared error (MSE) loss:

$$ MSE = \frac{1}{n}\sum_{i=1}^{n}(y_{true}-y_{pred})^2$$

Let's break this down:

  • n is the number of samples, which is 4.
  • y represents the variable being predicted, which is Gender.
  • $y_{true}$ is the true value of the variable. For example, $y_{true}$ for Alice would be 1 (Female).
  • $y_{pred}$ is the predicted value of the variable. It’s whatever our network outputs.

$(y_{true}-y_{pred})^2$ is known as the squared error. Our loss function is simply taking the average over all squared errors (hence the name mean squared error). The better our predictions are, the lower our loss will be!

Training a network = trying to minimize its loss.

An Example Loss Calculation

Let’s say our network always outputs 0 - in other words, it's confident all humans are Male 🤔. What would our loss be?

Let diff = $(y_{true}-y_{pred})^2$

Name $y_{true}$ $y_{pred}$ diff
Alice 1 0 1
Bob 0 0 0
Charlie 0 0 0
Diana 1 0 1
$$ MSE = \frac{1}{4}(1+0+0+1)=0.5 $$
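A tiny helper reproduces this number (a sketch assuming NumPy):

```python
import numpy as np

def mse_loss(y_true, y_pred):
    # average of the squared errors
    return ((y_true - y_pred) ** 2).mean()

y_true = np.array([1, 0, 0, 1])   # Alice, Bob, Charlie, Diana
y_pred = np.array([0, 0, 0, 0])   # a network that always outputs 0
print(mse_loss(y_true, y_pred))   # 0.5
```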

We now have a clear goal: minimize the loss of the neural network. We know we can change the network's weights and biases to influence its predictions, but how do we do so in a way that decreases loss?

For simplicity, let's pretend we only have Alice in our dataset:

Name Weight (minus 141) Height (minus 68 ) Gender
Alice -9 -3 1

Then the mean squared error loss is just Alice’s squared error:

\begin{align} MSE&=\frac{1}{1}\sum_{i=1}^1(y_{true}−y_{pred})^2\\ &=(y_{true}−y_{pred})^2\\ & =(1−y_{pred})^2 \end{align}

Another way to think about loss is as a function of weights and biases. Let’s label each weight and bias in our network:

Then, we can write loss as a multivariable function: $$L(w_1,w_2,w_3,w_4,w_5,w_6,b_1,b_2,b_3)$$

Imagine we wanted to tweak $w_1$. How would loss $L$ change if we changed $w_1$? That's a question the partial derivative $\frac{\partial L}{\partial w_1}$can answer. How do we calculate it?

To start, let's rewrite the partial derivative in terms of $\frac{\partial y_{pred}}{\partial w_1}$ instead: $$\dfrac{\partial L}{\partial w_1}= \dfrac{\partial L}{\partial y_{pred}}*\dfrac{\partial y_{pred}}{\partial w_1} $$

We can calculate $\frac{\partial L}{\partial y_{pred}}$ because we computed $L = (1 - y_{pred})^2$ above:

$$\dfrac{\partial L}{\partial y_{pred}} = \dfrac{\partial (1 - y_{pred})^2}{\partial y_{pred}}= -2(1-y_{pred})$$

Now, let's figure out what to do with $\frac{\partial y_{pred}}{\partial w_1}$. Just like before, let $h_1, h_2, o_1$ be the outputs of the neurons they represent. Then

$$ y_{pred}=o_1=f(w_5*h_1+w_6*h2+b_3)$$

Since $w_1$ only affects $h_1$ (not $h_2$), we can write

$$\dfrac{\partial y_{pred}}{\partial w_1} =\dfrac{\partial y_{pred}}{\partial h_1} *\dfrac{\partial h_1}{\partial w_1} $$

Also note that by using the chain rule, $$ \dfrac{\partial y_{pred}}{\partial h_1} = w_5*f'(w_5h_1+w_6h_2+b_3)$$ Recall $h_1 = f(w_1x_1+w_2x_2+b_1)$. Thus, we can do the same thing for $\frac{\partial h_1}{\partial w_1}$: $$ \dfrac{\partial h_1}{\partial w_1} = x_1*f'(w_1x_1+w_2x_2+b_1)$$ $x_1$ here is the weight, and $x_2$ is the height. This is the second time we've seen $f'(x)$ (the derivative of the sigmoid function) now! Let’s derive it:

$$ f(x) = \dfrac{1}{1+e^{-x}}$$

By taking derivative, we get $$f'(x)= \dfrac{e^{-x}}{(1 + e^{-x})^2}=f(x) * (1 - f(x))$$

We'll use this nice form for $f'(x)$ later; it lets us compute the derivative directly from the value $f(x)$ without differentiating again.

We're done! We've managed to break down $\frac{\partial L}{\partial w_1}$ into several parts we can calculate: $$\dfrac{\partial L}{\partial w_1} = \dfrac{\partial L}{\partial y_{pred}}*\dfrac{\partial y_{pred}}{\partial h_1}*\dfrac{\partial h_1}{\partial w_1} $$

This system of calculating partial derivatives by working backwards is known as backpropagation, or "backprop".

Example: Calculating the Partial Derivative

We're going to continue pretending only Alice is in our dataset:

Name Weight (minus 141) Height (minus 68 ) Gender
Alice -9 -3 1

Let's initialize all the weights to 1 and all the biases to 0. If we do a feedforward pass through the network, we get:

$$ h_1 =f(w_1*x_1+w_2*x_2+b_1)=f(−9+−3+0)=6.16*10^{-6}$$

and similarly $$h_2 =f(w_3*x_1+w_4*x_2+b_2)=f(−9+−3+0)=6.16*10^{-6} $$ and now let us calculate $o_1$ $$o_1 =f(w_5*h_1+w_6*h_2+b_3)=f(6.16*10^{-6}+6.16*10^{-6}+0)=0.50$$

The network outputs $y_{pred} = 0.50$, which doesn't favor Male (0) or Female (1). This makes sense, because we have not done any training yet.

Let's calculate $\frac{\partial L}{\partial w_1}$:

\begin{aligned} \dfrac{\partial L}{\partial w_1} =& \dfrac{\partial L}{\partial y_{pred}}*\dfrac{\partial y_{pred}}{\partial h_1}*\dfrac{\partial h_1}{\partial w_1}\\ \end{aligned}

Now let us calculate each of the terms on the RHS one by one. \begin{aligned} \dfrac{\partial L}{\partial y_{pred}} &= -2(1 - y_{pred}) \\ &= -2(1 - 0.50) \\ &= -1 \\ \end{aligned} and \begin{aligned} \dfrac{\partial y_{pred}}{\partial h_1} &= w_5 * f'(w_5h_1 + w_6h_2 + b_3) \\ &= 1 * f'(6.16* 10^{-6} + 6.16* 10^{-6}+ 0) \\ &= f(1.23* 10^{-5}) * (1 - f(1.23* 10^{-5})) \\ &= 0.249 \\ \end{aligned} lastly \begin{aligned} \dfrac{\partial h_1}{\partial w_1} &= x_1 * f'(w_1x_1 + w_2x_2 + b_1) \\ &= -9 * f'(-9 + -3 + 0) \\ &= -9 * f(-12) * (1 - f(-12)) \\ &= -5.52* 10^{-5} \\ \end{aligned} Now, we can collect them all and write \begin{aligned} \dfrac{\partial L}{\partial w_1} &= -1 * 0.249 * -5.52* 10^{-5} \\ &= \boxed{1.37* 10^{-5}} \\ \end{aligned}

We did it! This tells us that if we were to increase $w_1$, $L$ would increase a tiny bit as a result.
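The hand calculation can be verified numerically; a small sketch (all weights 1, biases 0, Alice's shifted data):

```python
import numpy as np

def f(x):                 # sigmoid
    return 1 / (1 + np.exp(-x))

def f_deriv(x):           # f'(x) = f(x) * (1 - f(x))
    return f(x) * (1 - f(x))

x1, x2, y_true = -9, -3, 1            # Alice
w1 = w2 = w3 = w4 = w5 = w6 = 1       # all weights initialized to 1
b1 = b2 = b3 = 0                      # all biases initialized to 0

sum_h1 = w1 * x1 + w2 * x2 + b1
h1 = f(sum_h1)
h2 = f(w3 * x1 + w4 * x2 + b2)
sum_o1 = w5 * h1 + w6 * h2 + b3
y_pred = f(sum_o1)

d_L_d_ypred = -2 * (y_true - y_pred)
d_ypred_d_h1 = w5 * f_deriv(sum_o1)
d_h1_d_w1 = x1 * f_deriv(sum_h1)

print(d_L_d_ypred * d_ypred_d_h1 * d_h1_d_w1)   # ~1.4e-5, matching the hand calculation
```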

Training: Stochastic Gradient Descent

We have all the tools we need to train a neural network now! We’ll use an optimization algorithm called stochastic gradient descent (SGD) that tells us how to change our weights and biases to minimize loss. It’s basically just this update equation

$$ w_1\leftarrow w_1-\eta \dfrac{\partial L}{\partial w_1}$$

$\eta$ is a constant called the learning rate that controls how fast we train. All we're doing is subtracting $\eta \frac{\partial L}{\partial w_1}$ from $w_1$:

  • If $\frac{\partial L}{\partial w_1}$ is positive, $w_1$ will decrease, which makes $L$ decrease.
  • If $\frac{\partial L}{\partial w_1}$ is negative, $w_1$ will increase, which makes $L$ decrease.

If we do this for every weight and bias in the network, the loss will slowly decrease and our network will improve.

Our training process will look like this:

  1. Choose one sample from our dataset. This is what makes it stochastic gradient descent - we only operate on one sample at a time.
  2. Calculate all the partial derivatives of loss with respect to weights or biases (e.g. $\frac{\partial L}{\partial w_1}$,$\frac{\partial L}{\partial w_2}$, etc).
  3. Use the update equation to update each weight and bias.
  4. Go back to step 1.
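Putting the feedforward, backpropagation, and update steps together, here is a compact from-scratch sketch of training this tiny network on the four-person dataset (the hyperparameters and initialization are illustrative, not the notebook's exact code):

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def deriv_sigmoid(x):
    fx = sigmoid(x)
    return fx * (1 - fx)

# Shifted data: [weight - 141, height - 68]; targets: 1 = Female, 0 = Male
data = np.array([[-9, -3], [19, 4], [11, 7], [-21, -8]], dtype=float)
y_trues = np.array([1, 0, 0, 1], dtype=float)

rng = np.random.default_rng(0)
w = rng.normal(size=6)          # w1..w6
b = rng.normal(size=3)          # b1..b3
learn_rate = 0.1

for epoch in range(1000):
    for x, y_true in zip(data, y_trues):        # one sample at a time (stochastic)
        # --- feedforward ---
        sum_h1 = w[0] * x[0] + w[1] * x[1] + b[0]
        h1 = sigmoid(sum_h1)
        sum_h2 = w[2] * x[0] + w[3] * x[1] + b[1]
        h2 = sigmoid(sum_h2)
        sum_o1 = w[4] * h1 + w[5] * h2 + b[2]
        y_pred = sigmoid(sum_o1)

        # --- backpropagation: partial derivatives via the chain rule ---
        d_L_d_ypred = -2 * (y_true - y_pred)
        d_ypred_d_h1 = w[4] * deriv_sigmoid(sum_o1)
        d_ypred_d_h2 = w[5] * deriv_sigmoid(sum_o1)

        # --- stochastic gradient descent updates: param -= lr * dL/dparam ---
        w[0] -= learn_rate * d_L_d_ypred * d_ypred_d_h1 * x[0] * deriv_sigmoid(sum_h1)
        w[1] -= learn_rate * d_L_d_ypred * d_ypred_d_h1 * x[1] * deriv_sigmoid(sum_h1)
        b[0] -= learn_rate * d_L_d_ypred * d_ypred_d_h1 * deriv_sigmoid(sum_h1)
        w[2] -= learn_rate * d_L_d_ypred * d_ypred_d_h2 * x[0] * deriv_sigmoid(sum_h2)
        w[3] -= learn_rate * d_L_d_ypred * d_ypred_d_h2 * x[1] * deriv_sigmoid(sum_h2)
        b[1] -= learn_rate * d_L_d_ypred * d_ypred_d_h2 * deriv_sigmoid(sum_h2)
        w[4] -= learn_rate * d_L_d_ypred * h1 * deriv_sigmoid(sum_o1)
        w[5] -= learn_rate * d_L_d_ypred * h2 * deriv_sigmoid(sum_o1)
        b[2] -= learn_rate * d_L_d_ypred * deriv_sigmoid(sum_o1)

# After training, predictions should be close to 1 for Alice/Diana and 0 for Bob/Charlie
for x, y_true in zip(data, y_trues):
    h1 = sigmoid(w[0] * x[0] + w[1] * x[1] + b[0])
    h2 = sigmoid(w[2] * x[0] + w[3] * x[1] + b[1])
    print(y_true, round(sigmoid(w[4] * h1 + w[5] * h2 + b[2]), 3))
```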

Applying Machine Learning to predict DECISION using RATING

Splitting Data Set

We split the data into two parts: a training set and a test set. We will test our algorithms on the test set after training them on the training set, and we will use cross-validation on the training set.

Scaling

This is a crucial step: we rescale the input data so that all features have zero mean and unit variance.
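A sketch of both steps with scikit-learn; the frame name `students_encoded` (from the one-hot step above), the 80/20 split, and the random seed are assumptions:

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = students_encoded.drop(columns=["DECISION"])
y = students_encoded["DECISION"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)   # fit on the training data only
X_test_scaled = scaler.transform(X_test)         # reuse the training mean and variance
```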

First, we will estimate DECISION using the RATING variable. After predicting the decision, we will predict the RATING variable. When we predict RATING, we will not be using the decision variable.

Boosting

Boosting (originally called hypothesis boosting) refers to any Ensemble method that can combine several weak learners into a strong learner. The general idea of most boosting methods is to train predictors sequentially, each trying to correct its predecessor. There are many boosting methods available, but by far the most popular are AdaBoost (short for Adaptive Boosting), Gradient Boosting, and XGBoost. Let’s start with AdaBoost.

Adaptive Boosting

One way for a new predictor to correct its predecessor is to pay a bit more attention to the training instances that the predecessor underfitted. This results in new predictors focusing more and more on the hard cases. This is the technique used by AdaBoost. For example, to build an AdaBoost classifier, a first base classifier (such as a Decision Tree) is trained and used to make predictions on the training set. The relative weight of misclassified training instances is then increased. A second classifier is trained using the updated weights and again makes predictions on the training set, the weights are updated, and so on. The next figure explains the structure.
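A hedged sketch of such a classifier with scikit-learn, using shallow trees as weak learners and the scaled split from above; the hyperparameters are illustrative:

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

# Shallow trees as weak learners; misclassified instances get more weight each round
ada_clf = AdaBoostClassifier(
    DecisionTreeClassifier(max_depth=1),
    n_estimators=200,
    learning_rate=0.5,
    random_state=42,
)
ada_clf.fit(X_train_scaled, y_train)
print(ada_clf.score(X_test_scaled, y_test))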

Decision Tree

Decision Trees are also the fundamental components of Random Forests which are among the most powerful Machine Learning algorithms available today. To understand Decision Trees, let’s just visualize one and take a look at how it makes predictions.

Let’s see how the tree represented in the figure above makes predictions. Assume you are looking at the iris data set. Suppose you find an iris flower and you want to classify it. You start at the root node (depth 0, at the top): this node asks whether the flower’s petal length is smaller than 2.45 cm. If it is, then you move down to the root’s left child node (depth 1, left). In this case, it is a leaf node (i.e., it does not have any child nodes), so it does not ask any questions: you simply look at the predicted class for that node, and the Decision Tree predicts that your flower is an Iris-Setosa (class=setosa).

Now suppose you find another flower, but this time the petal length is greater than 2.45 cm. You must move down to the root’s right child node (depth 1, right), which is not a leaf node, so it asks another question: is the petal width smaller than 1.75 cm? If it is, then your flower is most likely an Iris-Versicolor (depth 2, left). If not, it is likely an Iris-Virginica (depth 2, right). It’s really that simple.
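This iris example can be reproduced with scikit-learn; a small sketch using only the petal features:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X_iris = iris.data[:, 2:]                 # petal length and petal width only
tree_clf = DecisionTreeClassifier(max_depth=2, random_state=42)
tree_clf.fit(X_iris, iris.target)

# Text rendering of the fitted tree: root split on petal length, then petal width
print(export_text(tree_clf, feature_names=["petal length (cm)", "petal width (cm)"]))
```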

The structure will be very similar in our case; however, the decision tree will be larger because of the number of variables. We show one decision tree below to give an idea.

Decision tree (text rendering; each leaf shows the class counts that reach it):

Is RATING >= 3.75?
├─ False: Is RATING >= 3.3?
│   ├─ False: Is RATING >= 3.0?
│   │   ├─ False: Is TEST_SPEAK >= 27.0?
│   │   │   ├─ False: {'Reject': 92, 'Waitlist': 1}
│   │   │   └─ True: {'Reject': 4, 'Waitlist': 1}
│   │   └─ True: Is Emphasis Area 2 == Applied Mathematics?
│   │       ├─ False: {'Reject': 57, 'Waitlist': 4, 'Admit': 2}
│   │       └─ True: {'Reject': 5, 'Waitlist': 3}
│   └─ True: Is R_AVG_KNOWLEDGE >= 12.0?
│       ├─ False: {'Admit': 1, 'Waitlist': 5, 'Reject': 2}
│       └─ True: Is GPA >= 3.95?
│           ├─ False: {'Reject': 71, 'Waitlist': 11, 'Admit': 3}
│           └─ True: {'Waitlist': 5, 'Reject': 6, 'Admit': 2}
└─ True: Is RATING >= 4.18?
    ├─ False: Is R_AVG_WRITTEN >= 16.7?
    │   ├─ False: Is GPA >= 3.53?
    │   │   ├─ False: {'Reject': 30, 'Waitlist': 7, 'Admit': 4}
    │   │   └─ True: {'Waitlist': 26, 'Admit': 2, 'Reject': 22}
    │   └─ True: Is R_AVG_RATING >= 22.5?
    │       ├─ False: {'Waitlist': 15, 'Admit': 2, 'Reject': 2}
    │       └─ True: {'Waitlist': 27, 'Reject': 22, 'Admit': 6}
    └─ True: Is RATING >= 4.69?
        ├─ False: Is MAJOR == Applied Mathematics?
        │   ├─ False: {'Waitlist': 70, 'Admit': 24, 'Reject': 9}
        │   └─ True: {'Admit': 6, 'Reject': 1}
        └─ True: Is UU_FIRSTGEN == Unspecified?
            ├─ False: {'Admit': 10}
            └─ True: {'Admit': 29, 'Waitlist': 17, 'Reject': 1}

In the tree above, the leaf reached when RATING < 3.0 and TEST_SPEAK < 27 contains 93 samples: 0 admitted, 92 rejected, and 1 waitlisted. As expected, RATING is the strongest predictor.

Gaussian Naive Bayes

In machine learning, naive Bayes classifiers are a family of simple "probabilistic classifiers" based on applying Bayes' theorem with strong (naïve) independence assumptions between the features. They are among the simplest Bayesian network models.

Naive Bayes is a simple technique for constructing classifiers: models that assign class labels to problem instances, represented as vectors of feature values, where the class labels are drawn from some finite set. There is not a single algorithm for training such classifiers, but a family of algorithms based on a common principle: all naive Bayes classifiers assume that the value of a particular feature is independent of the value of any other feature, given the class variable. For example, a fruit may be considered to be an apple if it is red, round, and about 10 cm in diameter. A naive Bayes classifier considers each of these features to contribute independently to the probability that this fruit is an apple, regardless of any possible correlations between the color, roundness, and diameter features.

K Nearest Neighbor

In pattern recognition, the k-nearest neighbors algorithm (k-NN) is a non-parametric method used for classification and regression. In both cases, the input consists of the k closest training examples in the feature space. The output depends on whether k-NN is used for classification or regression:

  • In k-NN classification, the output is a class membership. An object is classified by a plurality vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors (k is a positive integer, typically small). If k = 1, then the object is simply assigned to the class of that single nearest neighbor.

  • In k-NN regression, the output is the property value for the object. This value is the average of the values of k nearest neighbors.

Support Vector Machine

A Support Vector Machine (SVM) is a very powerful and versatile Machine Learning model, capable of performing linear or nonlinear classification, regression, and even outlier detection. It is one of the most popular models in Machine Learning, and anyone interested in Machine Learning should have it in their toolbox. SVMs are particularly well suited for classification of complex but small- or medium-sized datasets.

The fundamental idea behind SVMs is best explained with a picture. The figure below shows part of the iris dataset that was introduced before. The two classes can clearly be separated with a straight line (they are linearly separable). The left plot shows the decision boundaries of three possible linear classifiers. The model whose decision boundary is represented by the dashed line is so bad that it does not even separate the classes properly. The other two models work perfectly on this training set, but their decision boundaries come so close to the instances that these models will probably not perform as well on new instances. In contrast, the solid line in the plot on the right represents the decision boundary of an SVM classifier; this line not only separates the two classes but also stays as far away from the closest training instances as possible. You can think of an SVM classifier as fitting the widest possible street (represented by the parallel dashed lines) between the classes.

Logistic Regression

Logistic Regression (also called Logit Regression) is commonly used to estimate the probability that an instance belongs to a particular class (e.g., what is the probability that this email is spam?). If the estimated probability is greater than 50%, then the model predicts that the instance belongs to that class (called the positive class, labeled “1”), or else it predicts that it does not (i.e., it belongs to the negative class, labeled “0”). This makes it a binary classifier.

Random Forest

A Random Forest is an ensemble of Decision Trees, generally trained via the bagging method (or sometimes pasting), typically with maximum samples set to the size of the training set.

The Random Forest algorithm introduces extra randomness when growing trees; instead of searching for the very best feature when splitting a node, it searches for the best feature among a random subset of features. This results in a greater tree diversity, which (once again) trades a higher bias for a lower variance, generally yielding an overall better model.

When you are growing a tree in a Random Forest, at each node only a random subset of the features is considered for splitting. It is possible to make trees even more random by also using random thresholds for each feature rather than searching for the best possible thresholds (like regular Decision Trees do).

Perceptron

The Perceptron is one of the simplest artificial neural network architectures, invented in 1957 by Frank Rosenblatt. It is based on a slightly different artificial neuron (see figure below) called a linear threshold unit (LTU): the inputs and output are now numbers (instead of binary on/off values) and each input connection is associated with a weight. The LTU computes a weighted sum of its inputs $$(z = w_1 x_1 + w_2 x_2 + \cdots + w_n x_n = w^T \cdot x),$$ then applies a step function to that sum and outputs the result: $$h_w(x) = step(z) = step(w^T \cdot x).$$

The most common step function used in Perceptrons is the Heaviside step function.

A single LTU can be used for simple linear binary classification. It computes a linear combination of the inputs and if the result exceeds a threshold, it outputs the positive class or else outputs the negative class (just like a Logistic Regression classifier or a linear SVM).

Stochastic Gradient Descent

We have already mentioned how this works. We will apply this model to our train set.

XG Boost Classifier

XGBoost is a decision-tree-based ensemble Machine Learning algorithm that uses a gradient boosting framework. In prediction problems involving unstructured data (images, text, etc.) artificial neural networks tend to outperform all other algorithms or frameworks. However, when it comes to small-to-medium structured/tabular data, decision tree based algorithms are considered best-in-class right now.
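A sketch of how a comparison table like the one below can be assembled with scikit-learn, using the scaled split from above; the classifiers use default settings rather than the notebook's exact choices, and XGBoost/LightGBM/CatBoost can be added the same way once the target is integer-encoded:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier, Perceptron
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC, LinearSVC

models = {
    "Adaptive Boosting Classifier": AdaBoostClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
    "Stochastic Gradient Descent": SGDClassifier(random_state=42),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Perceptron": Perceptron(random_state=42),
    "Support Vector Machines": SVC(),
    "Linear SVM": LinearSVC(max_iter=5000),
    "KNN": KNeighborsClassifier(),
    "Naive Bayes": GaussianNB(),
}

# Fit every model on the training set and score it on the held-out test set
scores = {name: clf.fit(X_train_scaled, y_train).score(X_test_scaled, y_test)
          for name, clf in models.items()}
table = pd.DataFrame.from_dict(scores, orient="index", columns=["Score"])
print(table.sort_values("Score", ascending=False))
```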

Which is the best model?

Score
Model
Adaptive Boosting Classifier 0.710526
XG Boost Classifier 0.697368
Random Forest 0.631579
Stochastic Gradient Decent 0.625000
Logistic Regression 0.598684
Decision Tree 0.585526
Perceptron 0.572368
Support Vector Machines 0.565789
KNN 0.506579
Naive Bayes 0.203947

Models with Grid search

One way to find good hyperparameter values would be to fiddle with them manually until you find a great combination. This would be very tedious work, and you may not have time to explore many combinations. Instead, you should get Scikit-Learn’s GridSearchCV to search for you. All you need to do is tell it which hyperparameters you want it to experiment with and what values to try out, and it will evaluate all the possible combinations of hyperparameter values, using cross-validation.
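A sketch for one of the models; the parameter grid is hypothetical, not the grid actually used in the notebook:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# Hypothetical grid of values to try
param_grid = {
    "n_estimators": [100, 300, 500],
    "max_depth": [None, 5, 10],
    "min_samples_leaf": [1, 2, 5],
}

grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,                    # 5-fold cross-validation on the training set
    scoring="accuracy",
    n_jobs=-1,
)
grid_search.fit(X_train_scaled, y_train)
print(grid_search.best_params_)
print(grid_search.best_estimator_.score(X_test_scaled, y_test))
```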

Ada Boost

Decision Tree

KNN

Light Gradient Boosting

Linear SVM

Logistic Regression

Random Forest

SGD

Support Vector Machines

XGB Classifier

Which one is the best model with GridSearch?

Score
Model
Stochastic Gradient Decent 0.743421
Adaptive Boosting Classifier 0.710526
Decision Tree 0.703947
Random Forest 0.697368
XG Boost Classifier 0.697368
Light GBM 0.684211
Linear Support Vector Machines 0.644737
Support Vector Machines 0.625000
Logistic Regression 0.592105
KNN 0.513158

Let us combine these two tables.

Model Score Score with grid search
Adaptive Boosting Classifier 0.710526 0.710526
XG Boost Classifier 0.697368 0.697368
Random Forest 0.631579 0.697368
Stochastic Gradient Decent 0.625000 0.743421
Logistic Regression 0.598684 0.592105
Decision Tree 0.585526 0.703947
Perceptron 0.572368 NaN
Support Vector Machines 0.565789 0.625000
KNN 0.506579 0.513158
Naive Bayes 0.203947 NaN
Light GBM NaN 0.684211
Linear Support Vector Machines NaN 0.644737

Stacking Approach to predict Decision

The Ensemble method we will discuss in here is called stacking (short for stacked generalization). It is based on a simple idea: instead of using trivial functions (such as hard voting) to aggregate the predictions of all predictors in an ensemble, why don’t we train a model to perform this aggregation? The figure below shows such an ensemble performing a regression task on a new instance. Each of the bottom three predictors predicts a different value (3.1, 2.7, and 2.9), and then the final predictor (called a blender, or a meta learner) takes these predictions as inputs and makes the final prediction (3.0).

To train the blender, a common approach is to use a hold-out set. Let’s see how it works. First, the training set is split in two subsets. The first subset is used to train the predictors in the first layer in the figure below.

Next, the first layer predictors are used to make predictions on the second (held-out) set (see the figure below). This ensures that the predictions are “clean,” since the predictors never saw these instances during training. Now for each instance in the hold-out set there are three predicted values. We can create a new training set using these predicted values as input features (which makes this new training set three-dimensional), and keeping the target values. The blender is trained on this new training set, so it learns to predict the target value given the first layer’s predictions.

It is actually possible to train several different blenders this way (e.g., one using Linear Regression, another using Random Forest Regression, and so on): we get a whole layer of blenders. The trick is to split the training set into three subsets: the first one is used to train the first layer, the second one is used to create the training set used to train the second layer (using predictions made by the predictors of the first layer), and the third one is used to create the training set to train the third layer (using predictions made by the predictors of the second layer). Once this is done, we can make a prediction for a new instance by going through each layer sequentially, as shown in the figure below.

We use the models that gave the best accuracy on the test set with grid search as our base models, and we aggregate them into a new, stacked model. Below, we train these models on the training set using cross-validation.
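A minimal sketch of this setup is shown below, using scikit-learn's StackingClassifier (the notebook may have used a different stacking implementation, e.g. mlxtend); the base-model names (ada, dtree, lgbm, rf, sgd, xgb) are assumed to be the tuned estimators from the grid search above.

# Sketch of the stacking setup; base models are assumed to be the tuned
# classifiers from the grid search above.
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

base_models = [("ada", ada), ("dt", dtree), ("lgbm", lgbm),
               ("rf", rf), ("sgd", sgd), ("xgb", xgb)]

stack = StackingClassifier(estimators=base_models,
                           final_estimator=LogisticRegression(),
                           cv=5)

for name, model in base_models + [("StackingClassifier", stack)]:
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring="accuracy")
    print("Average accuracy on a train set: %.4f (+/- %.3f) [%s]"
          % (scores.mean(), scores.std(), name))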

Average accuracy on a train set: 0.6755 (+/- 0.017) [ADA Boost]
Average accuracy on a train set: 0.6888 (+/- 0.027) [Decision tree]
Average accuracy on a train set: 0.6491 (+/- 0.027) [Light GBM]
Average accuracy on a train set: 0.6639 (+/- 0.028) [Random Forest]
Average accuracy on a train set: 0.7052 (+/- 0.025) [SGD]
Average accuracy on a train set: 0.6737 (+/- 0.045) [XGB]
Average accuracy on a train set: 0.6474 (+/- 0.034) [StackingClassifier]

Now we will fit these models on the whole training set. We will build a new data frame containing each model's predictions and take the most common value across models (a majority vote) to obtain a single predicted column for the decision variable.
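A sketch of this majority-vote combination is given below; the model names are illustrative and assumed to be the tuned estimators used above.

# Sketch of the majority-vote combination described above: each fitted model
# predicts DECISION on the full training set, and the row-wise mode of those
# predictions becomes the single combined prediction.
import numpy as np
import pandas as pd

fitted_models = {"ADA": ada, "DT": dtree, "LGBM": lgbm,
                 "RF": rf, "SGD": sgd, "XGB": xgb}

preds = pd.DataFrame({name: model.fit(X_train, y_train).predict(X_train)
                      for name, model in fitted_models.items()})

# Most common prediction across models for each applicant
combined = preds.mode(axis=1)[0]
accuracy = (combined.values == np.asarray(y_train)).mean()
print("Average accuracy on a train set: %0.4f" % accuracy)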

Average accuracy on a train set: 0.8402

We will apply this approach to the test set as well.

The last method we will use to predict the decision variable is the h2o AutoML package.

Using h2o AutoML package to predict Decision

H2O’s AutoML can be used for automating the machine learning workflow, which includes automatic training and tuning of many models within a user-specified time-limit. Stacked Ensembles – one based on all previously trained models, another one on the best model of each family – will be automatically trained on collections of individual models to produce highly predictive ensemble models which, in most cases, will be the top performing models in the AutoML Leaderboard.

The H2O AutoML interface is designed to have as few parameters as possible so that all the user needs to do is point to their dataset, identify the response column and optionally specify a time constraint or limit on the number of total models trained.

We will be using the same data frame as above; however, we need to convert the pandas data frame into an h2o data frame in order to use the h2o packages.

Now we will apply the h2o AutoML package to predict the decision variable on the training set. We will look at the 10 models that give the best predictions on the training set, and then pick the best of these 10 to try on the test set.
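A rough sketch of the conversion and the AutoML run is shown below; train_df and test_df are assumed to be the pandas data frames prepared earlier, and the parameter values are illustrative.

# Sketch of the h2o AutoML workflow (illustrative settings).
import h2o
from h2o.automl import H2OAutoML

h2o.init()

train_h2o = h2o.H2OFrame(train_df)
test_h2o = h2o.H2OFrame(test_df)

# DECISION must be a factor for a classification run
train_h2o["DECISION"] = train_h2o["DECISION"].asfactor()
test_h2o["DECISION"] = test_h2o["DECISION"].asfactor()

features = [c for c in train_h2o.columns if c != "DECISION"]

aml = H2OAutoML(max_models=20, seed=1, max_runtime_secs=600)
aml.train(x=features, y="DECISION", training_frame=train_h2o)

print(aml.leaderboard.head(rows=10))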

AutoML progress: |████████████████████████████████████████████████████████| 100%
model_id                                               mean_per_class_error  logloss   rmse      mse
GBM_grid_1_AutoML_20191203_103927_model_2              0.407229              1.31798   0.530878  0.281831
XGBoost_grid_1_AutoML_20191203_103927_model_1          0.417118              0.725838  0.50669   0.256735
XGBoost_grid_1_AutoML_20191203_103927_model_4          0.418664              0.7241    0.503248  0.253259
XGBoost_grid_1_AutoML_20191203_103927_model_2          0.41918               0.707062  0.497744  0.247749
XGBoost_2_AutoML_20191203_103927                       0.427726              0.726326  0.507059  0.257109
StackedEnsemble_BestOfFamily_AutoML_20191203_103927    0.432349              0.71419   0.496187  0.246201
XGBoost_grid_1_AutoML_20191203_103927_model_7          0.433316              0.74956   0.518527  0.26887
XGBoost_grid_1_AutoML_20191203_103927_model_3          0.433455              0.745136  0.512005  0.262149
XGBoost_1_AutoML_20191203_103927                       0.436663              0.7117    0.496419  0.246431
GBM_1_AutoML_20191203_103927                           0.436918              0.779571  0.504564  0.254585

This shows that the GBM grid model is the best one. Let us check the performance of this model on the test set and see with what probabilities the model predicts the different levels of the decision variable.
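A sketch of how the leading model can be scored on the test set; aml and test_h2o are assumed from the AutoML sketch above.

# Take the leading model from the AutoML run and score the test set;
# model_performance gives the confusion matrix and hit ratios shown below.
best_model = aml.leader

pred = best_model.predict(test_h2o)      # predicted class + per-class probabilities
pred.describe()

perf = best_model.model_performance(test_h2o)
print(perf.confusion_matrix())
print(perf.hit_ratio_table())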

gbm prediction progress: |████████████████████████████████████████████████| 100%
Rows:152
Cols:4


         predict    Admit                    Reject                   Waitlist
type     enum       real                     real                     real
mins                5.209438295227014e-07    7.788663860572825e-05    4.030894069031509e-06
mean                0.0894543802504499       0.5430801441875348       0.3674654755620154
maxs                0.9989826928502821       0.9999940195863029       0.9998790114674498
sigma               0.23210418898047652      0.4541262415953114       0.41652984678224486
zeros               0                        0                        0
missing  0          0                        0                        0
0        Waitlist   0.05679192739664471      0.4480436880052786       0.4951643845980767
1        Waitlist   0.03619576290528703      0.06348535268099545      0.9003188844137174
2        Reject     0.007500718061633539     0.7485192076192061       0.2439800743191604
3        Reject     3.618287029661471e-06    0.9997149406330772       0.00028144107989308824
4        Reject     7.983239844384858e-05    0.9989953284207064       0.0009248391808497036
5        Waitlist   0.007682258284996146     0.4039500282703496       0.5883677134446543
6        Waitlist   0.0034006394350013223    0.16439528560115582      0.8322040749638427
7        Waitlist   0.3277703961710042       0.0005130472390521145    0.6717165565899437
8        Reject     0.001140537974717637     0.8326599185445787       0.16619954348070373
9        Waitlist   0.061813918816480816     0.0070155513926360834    0.9311705297908832

The overall performance and confusion matrix are shown below. Surprisingly, the overall accuracy is about 70 percent (top-1 hit ratio 0.7039), which is slightly worse than SGD with grid search.

ModelMetricsMultinomial: gbm
** Reported on test data. **

MSE: 0.27305611997075074
RMSE: 0.5225477202808857
LogLoss: 1.365883168606446
Mean Per-Class Error: 0.3831297009722987

Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
          Admit  Reject  Waitlist  Error     Rate
Admit     8      3       12        0.652174  15 / 23
Reject    2      65      12        0.177215  14 / 79
Waitlist  3      13      34        0.320000  16 / 50
Total     13     81      58        0.296053  45 / 152
Top-3 Hit Ratios: 
k hit_ratio
0 1 0.703947
1 2 0.907895
2 3 1.000000

Conclusion about predicting the decision variable

The higher the accuracy, the better. The accuracy table below shows that Stochastic Gradient Descent with grid search is the best model for predicting the decision variable using the rating variable.

Model                              Accuracy
Stochastic Gradient Descent        0.743421
Adaptive Boosting Classifier       0.710526
h2o AutoML                         0.703947
Decision Tree                      0.703947
Stacking (Mixed) Model             0.703947
Random Forest                      0.697368
XG Boost Classifier                0.697368
Light GBM                          0.684211
Linear Support Vector Machines     0.644737
Support Vector Machines            0.625000
Logistic Regression                0.592105
KNN                                0.513158

Estimating the RATING variable.

Individual Models

In this first part, we estimate the rating variable with several individual models; afterwards we will use a stacking approach. We will not use the decision variable to predict the rating variable, so we drop it.

Now we train each of these models on the training set, tuning the hyperparameters of each algorithm.

Let us see the performance of each of these individual models on the training set. Since we used 5-fold cross-validation, the first number for each model is the mean RMSE across the folds, and the number in parentheses is the standard deviation of the RMSE across the folds.
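A sketch of this evaluation is given below; rmse_cv is an illustrative helper, and the model objects (lasso, ridge, ...) are assumed to be the tuned estimators from the previous step.

# Sketch of the 5-fold RMSE evaluation applied to each regressor.
import numpy as np
from sklearn.model_selection import cross_val_score

def rmse_cv(model, X, y):
    # per-fold RMSE = sqrt of per-fold MSE
    mse = -cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
    return np.sqrt(mse)

for name, model in [("LASSO", lasso), ("Ridge Regression", ridge)]:
    scores = rmse_cv(model, X_train, y_train)
    print("%s RMSE score: %.4f (%.4f)" % (name, scores.mean(), scores.std()))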

Ridge Regression RMSE score: 0.8138 (0.0685)

LASSO RMSE score: 0.7934 (0.0695)

Elasticnet RMSE score: 0.8250 (0.0737)

TSR RMSE score: 0.8332 (0.0737)

Huber RMSE score: 0.8119 (0.0582)

Kernel Ridge Regression RMSE score: 1.0570 (0.0928)

SVR RMSE score: 0.8092 (0.0592)

Light GBM RMSE score: 0.8729 (0.0661)

SGD RMSE score: 0.8173 (0.0749)

Linear Regression RMSE score: 0.8311 (0.0756)

Decision Tree RMSE score: 1.1643 (0.1081)

Random Forest RMSE score: 0.8624 (0.0719)

Gradient Boosting RMSE score: 0.9243 (0.0659)

XGBoost RMSE score: 0.8785 (0.0583)

Since we are predicting the RATING variable, which is numeric, we use RMSE to measure how well each model performs; the lower, the better. We can put these scores into a data frame to compare them.

Model                               Score
LASSO Regression                    0.793411
Epsilon-Support Vector Regression   0.809163
Huber Regressor                     0.811865
Ridge Regression                    0.813751
SGD                                 0.817268
Elastic Net                         0.825020
Linear Regression                   0.831060
Theil-Sen Regressor                 0.833203
Random Forest Regressor             0.862422
Light GBM                           0.872895
XGBoost Regressor                   0.878517
Gradient Boosting                   0.924299
Kernel Ridge Regression             1.056997
Decision Tree Regressor             1.164290

Stacking approach (Blending)

Now let us train all of these models, as well as the stacking model, on the whole training set.

Now it is time to mix all the models. The weights in front of the models are chosen by hand, with larger weights assigned to the models that performed better.

import numpy as np

# Blend the fitted models with hand-picked weights; the *_full_data models
# and stack_gen_model are the estimators trained on the whole training set above.
def mixed_models_predict(X):
    return ((0.01 * xgb_model_full_data.predict(X)) +
            (0.01 * lgb_model_full_data.predict(X)) +
            (0.05 * rf_model_full_data.predict(X)) +
            (0.05 * tsr_model_full_data.predict(X)) +
            (0.05 * lin_model_full_data.predict(X)) +
            (0.05 * elastic_model_full_data.predict(X)) +
            (0.15 * lasso_model_full_data.predict(X)) +
            (0.05 * sgd_model_full_data.predict(X)) +
            (0.1 * ridge_model_full_data.predict(X)) +
            (0.1 * huber_model_full_data.predict(X)) +
            (0.15 * svr_model_full_data.predict(X)) +
            (0.29 * stack_gen_model.predict(np.array(X))))

Now we can try our mixed model on both the training set and the test set.
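A sketch of this scoring step, assuming X_train/X_test and y_train/y_test are the RATING features and targets from the split used above:

# Score the blended model on the training and test sets.
import numpy as np
from sklearn.metrics import mean_squared_error

print("RMSE score on train data:")
print(np.sqrt(mean_squared_error(y_train, mixed_models_predict(X_train))))
print("RMSE score on test data:")
print(np.sqrt(mean_squared_error(y_test, mixed_models_predict(X_test))))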

RMSE score on train data:
0.7914288623274237
RMSE score on test data:
0.8117989462368014

This shows that the RMSE score of the stacking approach on the test data is slightly worse than that of the best individual models.

Next, we would like to use the h2o AutoML package to predict the rating variable and compare its result with our stacking approach.

Using h2o AutoML to predict the RATING

We will see which model does better than the others at predicting the rating variable. The table below shows the first ten models.

AutoML progress: |████████████████████████████████████████████████████████| 100%
model_id                                               mean_residual_deviance  rmse      mse       mae       rmsle
GLM_grid_1_AutoML_20191203_104948_model_1              0.594093                0.770774  0.594093  0.616369  0.179042
GBM_grid_1_AutoML_20191203_104948_model_5              0.596092                0.77207   0.596092  0.617405  0.179331
GBM_grid_1_AutoML_20191203_104948_model_4              0.59661                 0.772406  0.59661   0.617228  0.179354
GBM_grid_1_AutoML_20191203_104948_model_2              0.598395                0.77356   0.598395  0.617919  0.17965
GBM_grid_1_AutoML_20191203_104948_model_1              0.598524                0.773644  0.598524  0.619354  0.17958
StackedEnsemble_BestOfFamily_AutoML_20191203_104948    0.59914                 0.774041  0.59914   0.617462  0.17973
StackedEnsemble_AllModels_AutoML_20191203_104948       0.601609                0.775634  0.601609  0.618597  0.180068
XGBoost_grid_1_AutoML_20191203_104948_model_6          0.608127                0.779825  0.608127  0.629748  0.179725
GBM_5_AutoML_20191203_104948                           0.610199                0.781153  0.610199  0.625302  0.180999
XGBoost_grid_1_AutoML_20191203_104948_model_7          0.631104                0.794421  0.631104  0.644246  0.182827

Let us see the performance of the best model on the test set and compare its RMSE score with our stacking approach.

ModelMetricsRegressionGLM: glm
** Reported on test data. **

MSE: 0.7061227120640939
RMSE: 0.8403110805315457
MAE: 0.6512500190822342
RMSLE: 0.20689927999737506
R^2: -0.05415778048534392
Mean Residual Deviance: 0.7061227120640939
Null degrees of freedom: 157
Residual degrees of freedom: 94
Null deviance: 110.6498733711292
Residual deviance: 111.56738850612685
AIC: 523.4059100243061

Surprisingly, our stacking approach gives a better RMSE score than h2o AutoML.

Predicting RATING variable using other h2o models including Deep Learning

1. Using h2o Gradient Boosting Machine (GBM) to predict the Rating

The following tables show the details of the selected GBM model (a sketch of how such a grid can be set up in h2o is given after the list):

  1. Model summary (number of trees, model size, min, max and mean depth, min, max and mean number of leaves)
  2. MSE, RMSE, MAE and RMSLE scores on the training data and the validation data.
  3. Scoring history (how the model's RMSE decreases as the number of trees grows)
  4. Variable importances (AGE, GPA, MAJOR_3 and R_AVG_WRITTEN have higher importance than the other variables.)
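Below is a rough sketch of how such a GBM grid can be set up in h2o; the hyperparameter values are illustrative, not the notebook's exact grid, and train_h2o/valid_h2o are assumed to be h2o frames with RATING as the target.

# Illustrative GBM grid search in h2o (values are assumptions).
from h2o.estimators.gbm import H2OGradientBoostingEstimator
from h2o.grid.grid_search import H2OGridSearch

x_cols = [c for c in train_h2o.columns if c not in ("RATING", "DECISION")]

hyper_params = {
    "ntrees": [50, 85, 100],
    "max_depth": [5, 7, 9],
    "learn_rate": [0.01, 0.05, 0.1],
}

gbm_grid = H2OGridSearch(model=H2OGradientBoostingEstimator,
                         grid_id="gbm_grid1",
                         hyper_params=hyper_params)
gbm_grid.train(x=x_cols, y="RATING",
               training_frame=train_h2o, validation_frame=valid_h2o)

# Best model by validation RMSE
best_gbm = gbm_grid.get_grid(sort_by="rmse", decreasing=False).models[0]
print(best_gbm)
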
Model Details
=============
H2OGradientBoostingEstimator :  Gradient Boosting Machine
Model Key:  gbm_grid1_model_32


Model Summary: 
number_of_trees number_of_internal_trees model_size_in_bytes min_depth max_depth mean_depth min_leaves max_leaves mean_leaves
0 85.0 85.0 25533.0 7.0 7.0 7.0 10.0 28.0 19.258823

ModelMetricsRegression: gbm
** Reported on train data. **

MSE: 0.4614040389362596
RMSE: 0.6792672809257484
MAE: 0.5387202712109215
RMSLE: 0.16096459681476313
Mean Residual Deviance: 0.4614040389362596

ModelMetricsRegression: gbm
** Reported on validation data. **

MSE: 0.5323020587689251
RMSE: 0.7295903362633891
MAE: 0.5880943320986378
RMSLE: 0.1652839104268758
Mean Residual Deviance: 0.5323020587689251

Scoring History: 
timestamp duration number_of_trees training_rmse training_mae training_deviance validation_rmse validation_mae validation_deviance
0 2019-12-03 11:38:41 9.264 sec 0.0 0.779924 0.622010 0.608281 0.738063 0.598185 0.544737
1 2019-12-03 11:38:41 9.290 sec 5.0 0.773312 0.616423 0.598012 0.737473 0.596020 0.543867
2 2019-12-03 11:38:41 9.318 sec 10.0 0.767159 0.611273 0.588533 0.736305 0.594254 0.542145
3 2019-12-03 11:38:41 9.346 sec 15.0 0.760299 0.605753 0.578054 0.735002 0.593028 0.540228
4 2019-12-03 11:38:41 9.369 sec 20.0 0.752868 0.599767 0.566811 0.733505 0.591476 0.538030
5 2019-12-03 11:38:41 9.387 sec 25.0 0.746640 0.594541 0.557471 0.733162 0.591362 0.537527
6 2019-12-03 11:38:41 9.406 sec 30.0 0.740051 0.588614 0.547676 0.732461 0.590778 0.536499
7 2019-12-03 11:38:41 9.434 sec 35.0 0.733718 0.583301 0.538342 0.731006 0.589139 0.534370
8 2019-12-03 11:38:41 9.458 sec 40.0 0.728413 0.579169 0.530585 0.730719 0.589091 0.533950
9 2019-12-03 11:38:41 9.478 sec 45.0 0.722364 0.573991 0.521809 0.730680 0.588364 0.533893
10 2019-12-03 11:38:41 9.497 sec 50.0 0.716578 0.569142 0.513484 0.729632 0.587803 0.532363
11 2019-12-03 11:38:41 9.522 sec 55.0 0.710589 0.564449 0.504937 0.729573 0.587286 0.532276
12 2019-12-03 11:38:41 9.543 sec 60.0 0.704630 0.559518 0.496503 0.729592 0.587236 0.532304
13 2019-12-03 11:38:41 9.561 sec 65.0 0.699397 0.554956 0.489157 0.728633 0.586233 0.530906
14 2019-12-03 11:38:41 9.580 sec 70.0 0.694234 0.550810 0.481960 0.729227 0.587304 0.531772
15 2019-12-03 11:38:41 9.601 sec 75.0 0.690120 0.547471 0.476266 0.729323 0.587905 0.531913
16 2019-12-03 11:38:41 9.625 sec 80.0 0.684260 0.542618 0.468212 0.729688 0.587905 0.532445
17 2019-12-03 11:38:41 9.657 sec 85.0 0.679267 0.538720 0.461404 0.729590 0.588094 0.532302
Variable Importances: 
variable relative_importance scaled_importance percentage
0 AGE 292.096222 1.000000 0.080257
1 GPA 287.561249 0.984474 0.079011
2 MAJOR_3 249.743011 0.855003 0.068620
3 R_AVG_WRITTEN 193.733551 0.663253 0.053230
4 GRE_SUB 176.181992 0.603164 0.048408
5 R_AVG_ORAL 163.581039 0.560024 0.044946
6 GRE_VERB 149.060486 0.510313 0.040956
7 R_AVG_ACADEMIC 142.610580 0.488232 0.039184
8 GRE_AW 140.161819 0.479848 0.038511
9 R_AVG_KNOWLEDGE 137.200775 0.469711 0.037697
10 GRE_QUANT 131.190063 0.449133 0.036046
11 R_AVG_RATING 119.910126 0.410516 0.032947
12 R_AVG_EMOT 96.030693 0.328764 0.026385
13 R_AVG_RES 92.173149 0.315557 0.025326
14 R_AVG_MOT 91.957123 0.314818 0.025266
15 Emphasis Area 2_2 76.487328 0.261857 0.021016
16 Emphasis Area 3_3 69.237770 0.237038 0.019024
17 TEST_WRITE 68.219963 0.233553 0.018744
18 MAJOR_2 60.275513 0.206355 0.016561
19 TEST_LISTEN 58.102886 0.198917 0.015964
See the whole table with table.as_data_frame()

2. Using h2o Random Forest Algorithm to predict the Rating

The following tables show the details of the selected random forest model (a sketch of the corresponding h2o setup is given after the list):

  1. Model summary (number of trees, model size, min, max and mean depth, min, max and mean number of leaves)
  2. MSE, RMSE, MAE and RMSLE scores on the validation data and the cross-validation data.
  3. Cross-validation metrics summary (MAE, MSE, RMSE, RMSLE and their means and standard deviations)
  4. Scoring history (how the model's RMSE changes with the number of trees)
  5. Variable importances (AGE, MAJOR_3, GRE_AW and R_AVG_KNOWLEDGE have higher importance than the other variables.)
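A comparable sketch for the random forest, with 5-fold cross-validation supplying the cross-validation metrics summarized below; the parameter values are illustrative.

# Illustrative random forest grid search in h2o with 5-fold CV (values are assumptions).
from h2o.estimators.random_forest import H2ORandomForestEstimator
from h2o.grid.grid_search import H2OGridSearch

x_cols = [c for c in train_h2o.columns if c not in ("RATING", "DECISION")]

rf_grid = H2OGridSearch(
    model=H2ORandomForestEstimator(nfolds=5, seed=1),
    grid_id="rf_grid",
    hyper_params={"ntrees": [50, 100], "max_depth": [3, 5, 10]},
)
rf_grid.train(x=x_cols, y="RATING",
              training_frame=train_h2o, validation_frame=valid_h2o)

best_rf = rf_grid.get_grid(sort_by="rmse", decreasing=False).models[0]
print(best_rf)
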
Model Details
=============
H2ORandomForestEstimator :  Distributed Random Forest
Model Key:  rf_grid_model_15


Model Summary: 
number_of_trees number_of_internal_trees model_size_in_bytes min_depth max_depth mean_depth min_leaves max_leaves mean_leaves
0 100.0 100.0 15466.0 3.0 3.0 2.97 1.0 8.0 7.66

ModelMetricsRegression: drf
** Reported on train data. **

MSE: NaN
RMSE: NaN
MAE: NaN
RMSLE: NaN
Mean Residual Deviance: NaN

ModelMetricsRegression: drf
** Reported on validation data. **

MSE: 0.5412043052700249
RMSE: 0.7356658924199387
MAE: 0.5920856392951237
RMSLE: 0.1666097910722875
Mean Residual Deviance: 0.5412043052700249

ModelMetricsRegression: drf
** Reported on cross-validation data. **

MSE: 0.602830179027225
RMSE: 0.7764213926903515
MAE: 0.6187751029867875
RMSLE: 0.18155186161042472
Mean Residual Deviance: 0.602830179027225

Cross-Validation Metrics Summary: 
mean sd cv_1_valid cv_2_valid cv_3_valid cv_4_valid cv_5_valid
0 mae 0.6187751 0.0768542 0.6450626 0.5497973 0.73412216 0.5490337 0.6158598
1 mean_residual_deviance 0.6028302 0.14902145 0.60454977 0.52050996 0.84908575 0.4580052 0.5820002
2 mse 0.6028302 0.14902145 0.60454977 0.52050996 0.84908575 0.4580052 0.5820002
3 r2 0.005669171 0.010130606 0.02097922 5.931315E-4 -0.006338461 0.00807203 0.0050399355
4 residual_deviance 0.6028302 0.14902145 0.60454977 0.52050996 0.84908575 0.4580052 0.5820002
5 rmse 0.7720201 0.09229819 0.777528 0.7214638 0.9214585 0.6767608 0.7628894
6 rmsle 0.17979087 0.028202832 0.17736925 0.16631396 0.22693303 0.1522602 0.17607793
Scoring History: 
timestamp duration number_of_trees training_rmse training_mae training_deviance validation_rmse validation_mae validation_deviance
0 2019-12-03 11:39:13 29.434 sec 0.0 NaN NaN NaN NaN NaN NaN
1 2019-12-03 11:39:13 29.437 sec 1.0 NaN NaN NaN 0.729395 0.595822 0.532017
2 2019-12-03 11:39:13 29.439 sec 2.0 NaN NaN NaN 0.724944 0.590629 0.525543
3 2019-12-03 11:39:13 29.441 sec 3.0 NaN NaN NaN 0.727882 0.589844 0.529812
4 2019-12-03 11:39:13 29.443 sec 4.0 NaN NaN NaN 0.728967 0.593993 0.531393
5 2019-12-03 11:39:13 29.445 sec 5.0 NaN NaN NaN 0.732129 0.594280 0.536013
6 2019-12-03 11:39:13 29.447 sec 6.0 NaN NaN NaN 0.736088 0.598430 0.541826
7 2019-12-03 11:39:13 29.449 sec 7.0 NaN NaN NaN 0.736312 0.596148 0.542156
8 2019-12-03 11:39:13 29.451 sec 8.0 NaN NaN NaN 0.735224 0.593783 0.540555
9 2019-12-03 11:39:13 29.453 sec 9.0 NaN NaN NaN 0.736114 0.594355 0.541864
10 2019-12-03 11:39:13 29.455 sec 10.0 NaN NaN NaN 0.738419 0.598085 0.545263
11 2019-12-03 11:39:13 29.457 sec 11.0 NaN NaN NaN 0.737576 0.597348 0.544019
12 2019-12-03 11:39:13 29.459 sec 12.0 NaN NaN NaN 0.735756 0.594669 0.541337
13 2019-12-03 11:39:13 29.461 sec 13.0 NaN NaN NaN 0.735404 0.594289 0.540819
14 2019-12-03 11:39:13 29.463 sec 14.0 NaN NaN NaN 0.735668 0.593548 0.541208
15 2019-12-03 11:39:13 29.465 sec 15.0 NaN NaN NaN 0.733279 0.592386 0.537698
16 2019-12-03 11:39:13 29.467 sec 16.0 NaN NaN NaN 0.733432 0.591810 0.537923
17 2019-12-03 11:39:13 29.469 sec 17.0 NaN NaN NaN 0.733928 0.593114 0.538650
18 2019-12-03 11:39:13 29.472 sec 18.0 NaN NaN NaN 0.735580 0.594424 0.541077
19 2019-12-03 11:39:13 29.474 sec 19.0 NaN NaN NaN 0.736072 0.594506 0.541802
See the whole table with table.as_data_frame()

Variable Importances: 
variable relative_importance scaled_importance percentage
0 AGE 67.672417 1.000000 0.062841
1 MAJOR_3 63.343346 0.936029 0.058821
2 GRE_AW 62.170132 0.918692 0.057731
3 R_AVG_KNOWLEDGE 52.175598 0.771002 0.048450
4 R_AVG_ACADEMIC 49.758045 0.735278 0.046205
5 MAJOR_5 42.367226 0.626063 0.039342
6 R_AVG_EMOT 36.251003 0.535684 0.033663
7 TEST_READ 36.038612 0.532545 0.033466
8 TEST_WRITE 35.691498 0.527416 0.033143
9 TEST_SPEAK 32.808357 0.484811 0.030466
10 CTZNSHP_6 31.557686 0.466330 0.029305
11 CTZNSHP_2 30.604378 0.452243 0.028419
12 GRE_VERB 29.549799 0.436659 0.027440
13 R_AVG_RATING 28.818213 0.425849 0.026761
14 R_AVG_ORAL 27.385885 0.404683 0.025431
15 GPA 27.272652 0.403010 0.025325
16 MAJOR_2 27.183018 0.401685 0.025242
17 TEST_LISTEN 24.837915 0.367032 0.023065
18 MAJOR_4 23.802145 0.351726 0.022103
19 R_AVG_MOT 20.689726 0.305734 0.019213
See the whole table with table.as_data_frame()

Using h2o Deep Learning Algorithms to predict the Rating

deeplearning Grid Build progress: |███████████████████████████████████████| 100%
16.714356331030526

We inspect the model performance and identify the best model generated by the grid search, i.e., the one with the least error.

The following tables show the details of the selected deep learning model (a sketch of a matching configuration is given after the list):

  1. Model summary (number of layers, number of units, activation function, dropout, learning-rate metrics, biases and weights)
  2. MSE, RMSE, MAE and RMSLE scores on the training data and the validation data.
  3. Scoring history (how the RMSE changes as training progresses)
  4. Variable importances (hard to rank, because all of the variables have roughly the same importance, with only small differences.)
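A sketch of a deep learning configuration consistent with the layer summary below (three hidden layers of 128 rectifier-with-dropout units, quantile distribution); the exact grid that was searched is not shown, so these values are illustrative.

# Illustrative h2o deep learning setup (values inferred from the layer summary below).
from h2o.estimators.deeplearning import H2ODeepLearningEstimator

x_cols = [c for c in train_h2o.columns if c not in ("RATING", "DECISION")]

dl_model = H2ODeepLearningEstimator(
    hidden=[128, 128, 128],                 # three hidden layers of 128 units
    activation="RectifierWithDropout",
    input_dropout_ratio=0.05,
    hidden_dropout_ratios=[0.1, 0.5, 0.5],
    l1=1e-3,
    distribution="quantile",
    epochs=1000,
    seed=1,
)
dl_model.train(x=x_cols, y="RATING",
               training_frame=train_h2o, validation_frame=valid_h2o)
print(dl_model.rmse(valid=True))
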
Model Details
=============
H2ODeepLearningEstimator :  Deep Learning
Model Key:  dl_grid_model_3


Status of Neuron Layers: predicting RATING, regression, quantile distribution, Quantile loss, 41,345 weights/biases, 504.7 KB, 475,000 training samples, mini-batch size 1
layer units type dropout l1 l2 mean_rate rate_rms momentum mean_weight weight_rms mean_bias bias_rms
0 1 63 Input 5
1 2 128 RectifierDropout 10 0.001 0 0.451522 0.129564 0 0.00017083 0.0926476 -0.174969 0.645087
2 3 128 RectifierDropout 50 0.001 0 0.434724 0.12432 0 3.47801e-05 0.0468903 -0.00723855 0.0934175
3 4 128 RectifierDropout 50 0.001 0 0.152296 0.0926554 0 -0.00304255 0.0332958 0.0490408 0.121873
4 5 1 Linear 0.001 0 0.0323358 0.0487501 0 -0.0336463 0.121722 0.458348 1.09713e-154

ModelMetricsRegression: deeplearning
** Reported on train data. **

MSE: 0.32933992028547765
RMSE: 0.573881451421352
MAE: 0.42701955667175023
RMSLE: 0.14012111489372026
Mean Residual Deviance: 0.21350977833587512

ModelMetricsRegression: deeplearning
** Reported on validation data. **

MSE: 0.5320507992904316
RMSE: 0.7294181237743079
MAE: 0.59293519385762
RMSLE: 0.1642795236592628
Mean Residual Deviance: 0.29646759692881

Scoring History: 
timestamp duration training_speed epochs iterations samples training_rmse training_deviance training_mae training_r2 validation_rmse validation_deviance validation_mae validation_r2
0 2019-12-03 11:40:04 0.000 sec None 0.0 0 0.0 NaN NaN NaN NaN NaN NaN NaN NaN
1 2019-12-03 11:40:04 42.538 sec 9205 obs/sec 10.0 1 4750.0 0.763449 0.296639 0.593277 0.041800 0.739568 0.297285 0.594570 -0.004096
2 2019-12-03 11:40:09 47.628 sec 12761 obs/sec 150.0 15 71250.0 0.576603 0.213132 0.426263 0.453426 0.854709 0.346338 0.692675 -0.341084
3 2019-12-03 11:40:15 52.836 sec 15436 obs/sec 350.0 35 166250.0 0.481838 0.182735 0.365471 0.618321 0.828306 0.337424 0.674848 -0.259508
4 2019-12-03 11:40:20 57.870 sec 16255 obs/sec 540.0 54 256500.0 0.573881 0.213510 0.427020 0.458573 0.729418 0.296468 0.592935 0.023275
5 2019-12-03 11:40:25 1 min 3.018 sec 16590 obs/sec 730.0 73 346750.0 0.558292 0.204478 0.408957 0.487588 0.745630 0.310570 0.621140 -0.020625
6 2019-12-03 11:40:30 1 min 8.264 sec 17094 obs/sec 940.0 94 446500.0 0.513069 0.189142 0.378284 0.567241 0.781305 0.319860 0.639719 -0.120624
7 2019-12-03 11:40:33 1 min 11.022 sec 16461 obs/sec 1000.0 100 475000.0 0.544100 0.200353 0.400706 0.513310 0.750229 0.304951 0.609902 -0.033254
8 2019-12-03 11:40:33 1 min 11.044 sec 16459 obs/sec 1000.0 100 475000.0 0.573881 0.213510 0.427020 0.458573 0.729418 0.296468 0.592935 0.023275
Variable Importances: 
variable relative_importance scaled_importance percentage
0 R_AVG_KNOWLEDGE 1.000000 1.000000 0.025921
1 Emphasis Area_4 0.987027 0.987027 0.025585
2 R_AVG_ACADEMIC 0.962293 0.962293 0.024944
3 Emphasis Area 2_2 0.934657 0.934657 0.024227
4 R_AVG_WRITTEN 0.915670 0.915670 0.023735
5 R_AVG_ORAL 0.874860 0.874860 0.022677
6 Emphasis Area_3 0.868058 0.868058 0.022501
7 R_AVG_RATING 0.833698 0.833698 0.021610
8 Emphasis Area 2_4 0.816010 0.816010 0.021152
9 R_AVG_EMOT 0.795749 0.795749 0.020627
10 Emphasis Area 2_3 0.790972 0.790972 0.020503
11 R_AVG_RES 0.789030 0.789030 0.020452
12 R_AVG_MOT 0.786090 0.786090 0.020376
13 LOW_INCOME_1 0.785315 0.785315 0.020356
14 Emphasis Area 3_4 0.763796 0.763796 0.019798
15 GRE_QUANT 0.761763 0.761763 0.019746
16 Emphasis Area 3_1 0.750920 0.750920 0.019465
17 LOW_INCOME_2 0.743031 0.743031 0.019260
18 GRE_SUB 0.721019 0.721019 0.018690
19 GRE_VERB 0.702078 0.702078 0.018199
See the whole table with table.as_data_frame()

Compare Model Performances

We will compare these three models (GBM, RF and Deep Learning) on a test set.

ModelMetricsRegression: gbm
** Reported on test data. **

MSE: 0.7035010023505746
RMSE: 0.838749666080753
MAE: 0.6524428577331286
RMSLE: 0.2063354428619273
Mean Residual Deviance: 0.7035010023505746


ModelMetricsRegression: drf
** Reported on test data. **

MSE: 0.7023122333958572
RMSE: 0.8380407110611376
MAE: 0.6502078553770162
RMSLE: 0.20637562470639753
Mean Residual Deviance: 0.7023122333958572


ModelMetricsRegression: deeplearning
** Reported on test data. **

MSE: 0.8020512064440112
RMSE: 0.8955731161909737
MAE: 0.6872879250485764
RMSLE: 0.2177330540119891
Mean Residual Deviance: 0.3436439625242882

Combine all the results for predicting the rating variable

Name of the model                   RMSE
LASSO Regression                    0.793411
Epsilon-Support Vector Regression   0.809163
Huber Regressor                     0.811865
Ridge Regression                    0.813751
Stochastic Gradient Descent         0.817268
Stacking Approach (Blending)        0.817234
H2O GBM                             0.838101
H2O RF                              0.838119
H2O AutoML                          0.840311
H2O DL                              0.895573

Conclusion

The data contains both categorical and numerical variables. The target variables are the DECISION variable (categorical) and the RATING variable (numerical).

We started this Jupyter notebook with two goals. The first was to predict the decision variable, which determines whether a student is admitted, rejected or waitlisted; recall that we used the rating variable to predict the decision variable. The second goal was to predict the rating variable itself. The first problem is a classification problem, whereas the second is a regression problem.

The methods we used include Decision Tree, Logistic Regression, Random Forest, Stochastic Gradient Descent, k-nearest neighbors, Gaussian Naive Bayes, Perceptron, Support Vector Machines, Adaptive Boosting, XGBoost, a Stacking (Blending) Ensemble, and h2o functions such as AutoML, GBM and Deep Learning.

We found that Stochastic Gradient Descent with grid search is the best model for predicting the decision variable. Surprisingly, it even gives a better result than the h2o AutoML function.

To predict the RATING variable, we used fairly complicated models and blended a number of models in order to minimize the error. However, we found that simple models such as LASSO regression and epsilon-support vector regression gave lower RMSE scores than the more complicated models, including the neural network.

The number of predictor variables is quite large, and it is not initially clear which will be the most significant. The results of these analyses point to a common choice of the most relevant predictor variables:

  • Applicant age
  • Applicant GPA
  • Applicant’s choice of emphasis area for study in graduate school
  • Applicant’s GRE scores
  • Applicant’s undergraduate major
  • Applicant's average ratings of knowledge given by applicant's recommenders

Even though one might expect these variables to have high predictive ability, it is not clear which should be most predictive. A surprising output of our analysis is that age is found to be highly predictive, at least for certain models. This was somewhat unexpected, as most applicants tend to be of a similar age in their early twenties. There are some outliers in their early to late thirties, and perhaps the presence of these applicants naturally splits the dataset, making the age variable an easy predictor to split upon to reduce the classification or regression error. Also of some interest is the fact that, among the GRE scores, the verbal score tends to have slightly more predictive ability than the quantitative one. This is not entirely surprising, as the quantitative scores among Math PhD students tend to be fairly homogeneous. There is greater variability within the verbal scores, and apparently there is some positive correlation between verbal ability and a high rating being given to the application. Whether this is deliberate on the part of the reviewers is unknown.

In both cases, predicting the decision variable and predicting the rating variable, we see that some simpler models gave better results than more complicated ones. One possible explanation is that the relationship between our variables and the response is close to linear. Another is that the number of samples is not very large.

For future work, one might try a multiple imputation technique instead of the basic method used here to fill in the missing values.

Extra variables could also be added to represent the number of publications, where they were published, and the impact factors of those journals. Since publications may boost a student's chance of being admitted, adding these variables might improve the accuracy of the model.

Having the GPA distributions of the universities the students come from might also lead to better predictions of the decision. It would likewise be useful to have a quantitative measure of the strength of each recommender.

References