Created
Sep 14, 2021 04:39 PM
❓ Why Not Linear Regression
Linear regression performs poorly on classification problems:
- Need to set an arbitrary threshold on the output
- Outliers significantly reduce the power of the model
- The hypothesis's output can be > 1 or < 0, which is meaningless as a probability
📈 Logistic Regression Model
Usage: Classification
- Predict the probability that an observation belongs to one of two possible classes
Similar to linear regression, but outputs a probability (or a True/False label)
Properties
Classification algorithm: $y \in \{0, 1\}$
Hypothesis Representation
Maps the linear regression output into $(0, 1)$ by nesting it inside a sigmoid function:
$$h_\theta(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}$$
Logistic (Sigmoid) Function
$$g(z) = \frac{1}{1 + e^{-z}}$$
Soft threshold: converts the raw signal $s = \theta^T x$ into a probability
Why Sigmoid
- Smooth
- Easy to compute the derivative / gradient: $g'(z) = g(z)\,(1 - g(z))$ (see the sketch below)
- Non-linear: can model more complex relations
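As a quick illustration, a minimal NumPy sketch (names illustrative) of the sigmoid and the derivative identity above:

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function: squashes any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    """Derivative via the identity g'(z) = g(z) * (1 - g(z))."""
    g = sigmoid(z)
    return g * (1.0 - g)

z = np.linspace(-6, 6, 5)
print(sigmoid(z))       # approaches 0 on the left, 1 on the right
print(sigmoid_grad(z))  # peaks at z = 0, vanishes in the tails
```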
Interpretation of Hypothesis Output
$h_\theta(x)$ = estimated probability that $y = 1$ on input $x$
Target Function
$h_\theta(x) = P(y = 1 \mid x;\ \theta)$: probability that $y = 1$ given $x$, parametrized by $\theta$
- The data does not give us explicit probabilities
- It only provides samples $(x, y)$ generated from the underlying distribution
- Pick $\theta$ so that the modeled probabilities best explain the observed samples
As a binary classification problem (probabilities sum to 1):
$$P(y = 0 \mid x;\ \theta) = 1 - P(y = 1 \mid x;\ \theta)$$
Decision Boundary
A property of the hypothesis function & its parameters
- Predict $y = 1$ if $h_\theta(x) \geq 0.5$
- Predict $y = 0$ if $h_\theta(x) < 0.5$
Logistic function: input $\geq 0$ → output $\geq 0.5$
Therefore, predict $y = 1$ exactly when $\theta^T x \geq 0$; the decision boundary is the set of points where $\theta^T x = 0$
Non-Linear Decision Boundaries
Use higher-order polynomials to classify data with complex geometric shapes
Example from Intro2ML
Apply a non-linear feature transformation $\phi(x)$, e.g. $\phi(x_1, x_2) = (x_1^2, x_2^2)$
Then we can create a hyperplane in the transformed space to separate the data for classification
Example from Andrew Ng
$h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1^2 + \theta_4 x_2^2)$; with $\theta = (-1, 0, 0, 1, 1)$ the boundary is the circle $x_1^2 + x_2^2 = 1$
Difficulty: need to come up w/ the transformation before inspecting the data
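As a rough sketch of this idea (scikit-learn usage is my assumption; the dataset is synthetic), degree-2 features make circular data separable by a hyperplane:

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Concentric circles: not linearly separable in the original feature space
X, y = make_circles(n_samples=200, noise=0.05, factor=0.5, random_state=0)

# Degree-2 terms (x1^2, x1*x2, x2^2) let a hyperplane in the transformed
# space act as a circular boundary in the original space
model = make_pipeline(PolynomialFeatures(degree=2, include_bias=False),
                      LogisticRegression())
model.fit(X, y)
print(f"training accuracy: {model.score(X, y):.2f}")
```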
Method to Find Best-Fit Line
Calculates maximum likelihood
- Pick a candidate line; it assigns each sample a probability $h_\theta(x)$ of being in class 1
- Calculate the likelihood of the observed labels under that candidate, and keep the line that maximizes it
Loss Function
Probability that the predicted value is correct:
$$P(y \mid x;\ \theta) = h_\theta(x)^{y}\,\big(1 - h_\theta(x)\big)^{1 - y}$$
Compute the likelihood of the IID training data:
$$L(\theta) = \prod_{i=1}^{m} P\big(y^{(i)} \mid x^{(i)};\ \theta\big)$$
Goal: Maximum Likelihood Estimation
Adjust the parameter $\theta$ to maximize the likelihood: $\hat{\theta} = \arg\max_\theta L(\theta)$
In Practice
Take the negative log of the likelihood and minimize it
- Equivalent, since $-\log(\cdot)$ is monotonically decreasing
Cross-Entropy Loss
$$E(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \Big[ y^{(i)} \log h_\theta(x^{(i)}) + \big(1 - y^{(i)}\big) \log\big(1 - h_\theta(x^{(i)})\big) \Big]$$
This error measure is small when the signal $\theta^T x^{(i)}$ is large and of the same sign as the label
- Pushes $\theta$ to classify each sample correctly
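A minimal NumPy sketch of this loss (shapes and the epsilon guard are illustrative assumptions):

```python
import numpy as np

def cross_entropy(theta, X, y):
    """Average cross-entropy loss for logistic regression.

    X: (m, n) design matrix; y: (m,) labels in {0, 1}; theta: (n,) weights.
    """
    h = 1.0 / (1.0 + np.exp(-X @ theta))  # h_theta(x) for every sample
    eps = 1e-12                           # guard against log(0)
    return -np.mean(y * np.log(h + eps) + (1 - y) * np.log(1 - h + eps))
```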
Logistic Regression Algorithm w/ Gradient Descent
⚙ Iterative Optimization
Minimize the loss function → make the model more useful
📦 Batch Gradient Descent
An optimization technique used to train logistic regression & other learning algorithms
Sliding downhill: progressively modifies the parameters in a way that decreases the error
- Computing the minimum analytically (e.g. via second derivatives) is almost impossible in most cases
- Gradient descent provides a close estimate iteratively
Procedure
- Start with an initial value of the parameters $\theta^{(0)}$
- Compute the direction in which the error decreases (the negative gradient $-\nabla E(\theta)$)
- Update the parameters in that direction
- $\eta$: learning rate (how big a step we take)
Formula
$$\theta \leftarrow \theta - \eta\,\nabla E(\theta)$$
For logistic regression with cross-entropy loss, each component is
$$\theta_j \leftarrow \theta_j - \eta\,\frac{1}{m} \sum_{i=1}^{m} \big(h_\theta(x^{(i)}) - y^{(i)}\big)\,x_j^{(i)}$$
Insights
- Take a step in the direction of steepest descent to gain the biggest decrease of $E$
Using a first-order Taylor expansion, we compute the change in $E$ for a unit step $\hat{v}$:
$$\Delta E = E(\theta + \eta\,\hat{v}) - E(\theta) \approx \eta\,\nabla E(\theta)^T \hat{v} \geq -\eta\,\|\nabla E(\theta)\|$$
Since $\hat{v}$ is a unit vector, this equality holds
iff
$$\hat{v} = -\frac{\nabla E(\theta)}{\|\nabla E(\theta)\|}$$
Choosing Step Size
Large step size when far away from a local minimum
Small step size when close to the local minimum
Simple heuristic: make the step size proportional to the gradient norm, $\eta_t = \eta\,\|\nabla E(\theta)\|$ (equivalently, take a fixed-rate step along the un-normalized gradient)
→ Learning rate algorithms
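Putting the procedure and formula together, a minimal batch gradient descent sketch for logistic regression (learning rate and step count are illustrative):

```python
import numpy as np

def batch_gradient_descent(X, y, eta=0.1, n_steps=1000):
    """Batch GD: every update uses ALL m training samples."""
    m, n = X.shape
    theta = np.zeros(n)                       # start from an initial value
    for _ in range(n_steps):
        h = 1.0 / (1.0 + np.exp(-X @ theta))  # predictions on the full set
        grad = X.T @ (h - y) / m              # gradient of the cross-entropy
        theta -= eta * grad                   # step against the gradient
    return theta
```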
🔥 Stochastic Gradient Descent
Perform a gradient descent update per single sample, not per pass over the entire training set (see the mini-batch sketch below with batch size 1)
- Randomly pick ONE training sample
- Compute the gradient of the loss function w/ this sample
- Update the weights
🧺 Mini-Batch Gradient Descent
In between batch and stochastic gradient descent
- Randomly pick a mini-batch of $B$ training samples
- Compute the gradient of the loss associated with this mini-batch
- Update the weights
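A minimal mini-batch SGD sketch (hyperparameters illustrative); batch_size = 1 recovers plain SGD, and batch_size = m recovers batch gradient descent:

```python
import numpy as np

def minibatch_sgd(X, y, eta=0.1, batch_size=32, n_epochs=10, seed=0):
    """Mini-batch SGD: every update uses a random batch of B samples."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_epochs):
        # Shuffle once per epoch, then walk through it batch by batch
        splits = np.arange(batch_size, m, batch_size)
        for idx in np.split(rng.permutation(m), splits):
            h = 1.0 / (1.0 + np.exp(-X[idx] @ theta))
            theta -= eta * X[idx].T @ (h - y[idx]) / len(idx)
    return theta
```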
Comparing Gradient Descent Variants

| Name | Input at Each Step | Pro | Con |
| --- | --- | --- | --- |
| Batch | Entire training set | Exact, stable gradient | Expensive per step on large datasets |
| Stochastic | One random sample | Very cheap updates; noise can help escape shallow minima | Noisy path; may never settle exactly at the minimum |
| Mini-Batch | Random batch of $B$ samples | Balances cost & stability; vectorizes well | Adds a batch-size hyperparameter |
🤔 Questions
- How does gradient descent deal with local minima / saddle points?