Deep Learning Terms

Basic Terms

Epoch

One complete pass of the training algorithm over the whole training dataset; the number of epochs is how many such passes are made during training.

Batch

The number of training samples processed before the model parameters are updated (the batch size).

Learning rate

A hyperparameter that scales how much the model weights are adjusted at each update step.
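
To see how these three terms fit together, here is a minimal NumPy sketch of a gradient-descent training loop for linear regression; the data and values such as `lr = 0.1`, `batch_size = 32`, and `epochs = 5` are purely illustrative.

```python
import numpy as np

# Toy data: 1000 samples, 3 features, linear target with a little noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=1000)

w = np.zeros(3)          # learnable weights
lr = 0.1                 # learning rate: scales each update
batch_size = 32          # batch: samples used per parameter update
epochs = 5               # epoch: one full pass over the training set

for epoch in range(epochs):
    perm = rng.permutation(len(X))                  # shuffle once per epoch
    for start in range(0, len(X), batch_size):
        idx = perm[start:start + batch_size]
        xb, yb = X[idx], y[idx]
        grad = 2 * xb.T @ (xb @ w - yb) / len(xb)   # MSE gradient on the batch
        w -= lr * grad                              # update scaled by the learning rate
    print(f"epoch {epoch}: w = {w}")
```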

Cost Function/Loss Function

A cost (or loss) function measures the error between the predicted value and the actual value; training aims to minimize this value.
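
A common concrete example is the mean squared error; the toy arrays below are illustrative only.

```python
import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])   # actual values
y_pred = np.array([2.5,  0.0, 2.0, 8.0])   # predicted values

mse = np.mean((y_pred - y_true) ** 2)      # mean squared error
mae = np.mean(np.abs(y_pred - y_true))     # mean absolute error
print(mse, mae)                            # 0.375 0.5
```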

Weights/Bias

The learnable parameters of a model: weights control the strength of the signal between two neurons, and biases shift a neuron's output.
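
As a sketch, a single neuron combines its inputs with learnable weights and a bias before any activation function is applied; the numbers below are made up for illustration.

```python
import numpy as np

x = np.array([0.2, -1.0, 0.5])   # inputs from three upstream neurons
w = np.array([0.8,  0.3, -0.6])  # learnable weights, one per connection
b = 0.1                          # learnable bias, shifts the weighted sum

z = w @ x + b                    # value passed on to the activation function
print(z)                         # 0.8*0.2 + 0.3*(-1.0) + (-0.6)*0.5 + 0.1 = -0.34
```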

Independently and identically distributed (i.i.d.)

  1. Independent Events:
    • When we talk about independent events, we’re referring to the relationship between observations in a sample.
    • Independence means that observing the current item does not influence or provide insight about the value of the next item you measure, or any other items in the sample.
    • For instance, when flipping a coin, each toss is independent. The outcome of one toss doesn’t predict the next outcome.
    • Independence relates to how you define your population and the process of obtaining your sample. Random sampling is essential to ensure independence.
  2. Identically Distributed:
    • Identically distributed refers to the probability distribution that describes the characteristic you’re measuring.
    • Specifically, one probability distribution should adequately model all the values you observe in a sample.
    • A dataset should not exhibit trends because trends indicate that one probability distribution does not describe all the data.
    • In simpler terms, the same underlying distribution governs all the data points.
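
A small NumPy illustration of the two properties (the seed, sample size, and distributions are arbitrary): independent flips of a fair coin are i.i.d., while a series with a trend is not identically distributed.

```python
import numpy as np

rng = np.random.default_rng(42)

# i.i.d.: ten fair coin flips. Each draw is independent of the others and
# all of them come from the same Bernoulli(0.5) distribution.
flips = rng.binomial(n=1, p=0.5, size=10)
print(flips)

# Not identically distributed: a series with an upward trend, so early and
# late observations follow visibly different distributions.
trend = np.arange(10) + rng.normal(size=10)
print(trend)
```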

Overfitting vs. Underfitting

Overfitting

  • Definition: Overfitting occurs when a model becomes too specialized in learning from the training data to the point where it loses its ability to generalize well to new, unseen data.
  • Cause: The model becomes overly complex, capturing not only the underlying patterns but also the noise or random fluctuations present in the training data.
  • Result: While the model performs exceptionally well on the training data, it struggles when faced with new examples.
  • Solution: Regularization techniques (such as dropout, weight decay, or early stopping) can help mitigate overfitting by simplifying the model.
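
As an illustrative PyTorch sketch (layer sizes and hyperparameter values are arbitrary, not recommendations), dropout can be added inside the model and weight decay configured on the optimizer; early stopping is implemented in the training loop rather than shown here.

```python
import torch
import torch.nn as nn

# A small classifier with two common regularizers: a dropout layer inside the
# model, and weight decay (an L2 penalty) configured on the optimizer.
model = nn.Sequential(
    nn.Linear(20, 64),
    nn.ReLU(),
    nn.Dropout(p=0.5),      # randomly zeroes activations during training
    nn.Linear(64, 2),
)

optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)

# Early stopping is usually handled in the training loop: keep the checkpoint
# with the best validation loss and stop once it stops improving.
```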

Underfitting

  • Definition: Underfitting occurs when a model is too simplistic and fails to capture the essential patterns in the data.
  • Cause: The model lacks complexity and doesn’t adapt well to the training data.
  • Result: Both the training and test performance are subpar.
  • Solution: Increase model complexity (e.g., use a more expressive architecture) or improve feature engineering.
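
A sketch of the first remedy in PyTorch, with arbitrary layer sizes: replacing a model that is too simple for the task with a more expressive one.

```python
import torch.nn as nn

# Too simple for a nonlinear task: a single linear layer.
too_simple = nn.Linear(20, 2)

# More expressive: a small MLP with a hidden layer and a nonlinearity.
more_expressive = nn.Sequential(
    nn.Linear(20, 128),
    nn.ReLU(),
    nn.Linear(128, 2),
)
```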

Optimization Algorithms

Optimization algorithms, or optimizers, play a pivotal role in adjusting neural network parameters throughout the training process, aiming to minimize a predefined loss function.

Stochastic Gradient Descent (SGD)

  • Description: SGD is a fundamental optimization algorithm. It updates model parameters by computing gradients on a small batch of training data (hence “stochastic”) and adjusting weights accordingly.
  • Pros: Simplicity, computationally efficient, and widely used.
  • Cons: Prone to oscillations and slow convergence.
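
The core update is a step against the gradient, scaled by the learning rate; a minimal NumPy sketch with made-up numbers follows.

```python
import numpy as np

def sgd_step(w, grad, lr=0.01):
    """One plain SGD update: move the weights against the gradient, scaled by lr."""
    return w - lr * grad

# Illustrative usage with a made-up gradient.
w = np.array([0.5, -0.3])
g = np.array([0.2, 0.1])
w = sgd_step(w, g)
print(w)
```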

Mini-Batch Gradient Descent

  • Description: Mini-Batch Gradient Descent splits the training dataset into smaller batches (mini-batches) during model parameter updates. It strikes a balance between the efficiency of batch gradient descent and the noise of stochastic gradient descent.
  • Pros: Efficiency, good generalization, parallelization, and adaptability.
  • Cons: Sensitivity to hyperparameters (notably batch size) and noisier updates than full-batch gradient descent.
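
In PyTorch this is the usual DataLoader pattern; the tensors, model, batch size, and learning rate below are illustrative.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Illustrative tensors; in practice these come from your dataset.
X = torch.randn(1000, 20)
y = torch.randint(0, 2, (1000,))

loader = DataLoader(TensorDataset(X, y), batch_size=64, shuffle=True)

model = torch.nn.Linear(20, 2)
loss_fn = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for xb, yb in loader:            # one mini-batch per iteration
    optimizer.zero_grad()
    loss = loss_fn(model(xb), yb)
    loss.backward()
    optimizer.step()             # parameters updated once per mini-batch
```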

Adaptive Gradient Descent (Adagrad)

  • Description: Adagrad adapts the learning rate for each parameter based on the historical sum of squared gradients.
  • Pros: Effective for sparse data, automatic learning rate adjustment.
  • Cons: Learning rate decreases too aggressively over time.
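
A minimal NumPy sketch of the Adagrad update rule (the learning rate, epsilon, and gradient values are illustrative).

```python
import numpy as np

def adagrad_step(w, grad, cache, lr=0.01, eps=1e-8):
    """One Adagrad update: each parameter's effective learning rate shrinks
    as its running sum of squared gradients ('cache') grows."""
    cache = cache + grad ** 2
    w = w - lr * grad / (np.sqrt(cache) + eps)
    return w, cache

# Illustrative usage with a made-up gradient.
w, cache = np.zeros(3), np.zeros(3)
w, cache = adagrad_step(w, np.array([0.1, -0.2, 0.05]), cache)
```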

AdaDelta

  • Description: AdaDelta is an extension of Adagrad that addresses its aggressive learning rate decay. It uses a moving average of gradients and squared gradients.
  • Pros: No need for manual learning rate tuning.
  • Cons: Hyperparameters still need tuning.
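
A minimal NumPy sketch of the AdaDelta update rule, following Zeiler's 2012 formulation; the `rho` and `eps` defaults are illustrative.

```python
import numpy as np

def adadelta_step(w, grad, avg_sq_grad, avg_sq_update, rho=0.95, eps=1e-6):
    """One AdaDelta update: moving averages of squared gradients and squared
    updates replace a hand-tuned global learning rate."""
    avg_sq_grad = rho * avg_sq_grad + (1 - rho) * grad ** 2
    update = -np.sqrt(avg_sq_update + eps) / np.sqrt(avg_sq_grad + eps) * grad
    avg_sq_update = rho * avg_sq_update + (1 - rho) * update ** 2
    return w + update, avg_sq_grad, avg_sq_update
```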

RMSprop (Root Mean Square Propagation)

  • Description: RMSprop adjusts the learning rate for each parameter by dividing the gradient by the moving average of squared gradients.
  • Pros: Robust against noisy gradients, faster convergence.
  • Cons: Sensitive to hyperparameters.
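
A minimal NumPy sketch of the RMSprop update rule (constants are illustrative).

```python
import numpy as np

def rmsprop_step(w, grad, avg_sq_grad, lr=0.001, alpha=0.9, eps=1e-8):
    """One RMSprop update: divide the gradient by the root of a moving
    average of its squares."""
    avg_sq_grad = alpha * avg_sq_grad + (1 - alpha) * grad ** 2
    return w - lr * grad / (np.sqrt(avg_sq_grad) + eps), avg_sq_grad
```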

Adam (Adaptive Moment Estimation)

  • Description: Adam combines the benefits of both Adagrad and RMSprop. It adapts the learning rate for each parameter based on past gradients and squared gradients.
  • Pros: Efficient, handles sparse gradients, and converges faster.
  • Cons: Requires tuning of hyperparameters.
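
A minimal NumPy sketch of the Adam update rule with bias correction; the defaults shown (`lr=0.001`, `beta1=0.9`, `beta2=0.999`) follow the values commonly cited from the original paper.

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: exponential moving averages of the gradient (m) and
    its square (v), with bias correction; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v
```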
