Gradient Descent at a glance

- How exactly does the closed-form (normal-equation) solution have a runtime of $O(d^2 n)$ or $O(d^3)$, depending on the input size? (A cost breakdown follows the update equations below.)
- Why use Gradient Descent instead of setting the partial derivatives to zero and solving for each weight directly?
- Explain how gradient descent works in great detail.
$$
\begin{equation}
\mathbf{w}^{t+1}=\mathbf{w}^t-\alpha \nabla L\left(\mathbf{w}^t\right)
\end{equation}
$$

$$
\begin{equation}
\mathbf{w}^{t+1}=\mathbf{w}^t-\frac{\alpha}{n} \sum_{i=1}^n \nabla L_i\left(\mathbf{w}^t, \mathbf{x}_i, y_i\right)
\end{equation}
$$
$$
\begin{equation}
\mathbf{w}^{t+1}=\mathbf{w}^t+\frac{\alpha}{n} \mathbf{X}^T\left(\mathbf{y}-\mathbf{X} \mathbf{w}^t\right)
\end{equation}
$$
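On the runtime question above: a rough cost breakdown of the closed-form least-squares solution, assuming $\mathbf{X} \in \mathbb{R}^{n \times d}$, with the per-step cost of gradient descent shown for contrast.

$$
\begin{equation}
\mathbf{w}^{*}=\left(\mathbf{X}^{T} \mathbf{X}\right)^{-1} \mathbf{X}^{T} \mathbf{y}: \quad \underbrace{O\left(n d^{2}\right)}_{\text{form } \mathbf{X}^{T} \mathbf{X}}+\underbrace{O\left(d^{3}\right)}_{\text{solve the } d \times d \text{ system}} \quad \text{vs. one gradient step at } O(n d)
\end{equation}
$$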
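The three updates above are, in order, the generic gradient step, the same step written as an average of per-example gradients, and its specialization to least squares. Below is a minimal runnable sketch of that last update, assuming the loss $L(\mathbf{w})=\frac{1}{2n}\|\mathbf{y}-\mathbf{X}\mathbf{w}\|^{2}$; the function name, `alpha`, and `n_steps` are illustrative, not from the notes.

```python
import numpy as np

def gradient_descent_ls(X, y, alpha=0.1, n_steps=1000):
    """Full-batch gradient descent for least squares:
    w_{t+1} = w_t + (alpha / n) * X^T (y - X w_t)."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_steps):
        residual = y - X @ w                    # shape (n,)
        w = w + (alpha / n) * (X.T @ residual)  # step against the gradient
    return w

# Toy check: recover known weights from noiseless data.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true
print(gradient_descent_ls(X, y))                # close to w_true
```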
- How does stochastic gradient descent differ from the exhaustive (full-batch) approach? How does mini-batch gradient descent differ from it? (A mini-batch/SGD sketch follows this list.)
- What are the exponential-decay and 1/t learning-rate schedules? What is the use case for each?
- How does Momentum solve the ravine problem for classic gradient descent with an exponentially decayed (annealed) learning rate? (The schedules and the momentum update are sketched after this list.)
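A sketch of the stochastic/mini-batch variants mentioned above, under the same least-squares setup as before: `batch_size=1` reduces to plain SGD, while `batch_size=n` recovers the exhaustive (full-batch) update. Names and defaults are illustrative.

```python
import numpy as np

def minibatch_sgd_ls(X, y, alpha=0.1, batch_size=8, n_epochs=50, seed=0):
    """Mini-batch SGD for least squares: each update uses the gradient of a
    small random subset of examples rather than the whole dataset, trading a
    noisier step for a much cheaper one. batch_size=1 is plain SGD."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_epochs):
        order = rng.permutation(n)              # reshuffle each epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            residual = y[idx] - X[idx] @ w
            w = w + (alpha / len(idx)) * (X[idx].T @ residual)
    return w
```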
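And a sketch of the two annealing schedules plus the classical momentum update on a toy "ravine" (a quadratic that is much steeper in one direction). The constants `alpha0`, `k`, and `beta` are illustrative; the point is that the accumulated velocity damps the side-to-side oscillation across the steep walls while building speed along the shallow floor.

```python
import numpy as np

def exp_decay_lr(alpha0, k, t):
    """Exponential decay: alpha_t = alpha0 * exp(-k * t)."""
    return alpha0 * np.exp(-k * t)

def inv_t_lr(alpha0, k, t):
    """1/t decay: alpha_t = alpha0 / (1 + k * t)."""
    return alpha0 / (1.0 + k * t)

def momentum_gd(grad, w0, alpha0=0.1, k=0.01, beta=0.9, n_steps=500):
    """Gradient descent with classical momentum and an annealed learning rate:
    v_{t+1} = beta * v_t - alpha_t * grad(w_t),  w_{t+1} = w_t + v_{t+1}."""
    w = np.asarray(w0, dtype=float)
    v = np.zeros_like(w)
    for t in range(n_steps):
        alpha_t = exp_decay_lr(alpha0, k, t)    # swap in inv_t_lr to compare
        v = beta * v - alpha_t * grad(w)
        w = w + v
    return w

# Toy ravine: steep in the first coordinate, shallow in the second.
grad = lambda w: np.array([10.0 * w[0], 0.1 * w[1]])
print(momentum_gd(grad, [1.0, 1.0]))            # both coordinates head toward 0
```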
Error Decomposition and Bias-Variance
- You just trained a model, used Gradient Descent, and annealed the learning rate and the momentum coefficient, yet you still got bad performance. What could be the problem?
- What are the main sources of prediction error? (The standard decomposition is written out at the end of these questions.)

- Explain this portion of the derivation, line-by-line
- Okay, now why do we take the expectation over the distribution of the data?

- Explain this portion of the derivation, line-by-line
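For reference against whatever derivation the excerpts above walk through, this is the standard squared-error decomposition such derivations typically arrive at, assuming $y=f(\mathbf{x})+\epsilon$ with $\mathbb{E}[\epsilon]=0$ and $\operatorname{Var}(\epsilon)=\sigma^{2}$, and a predictor $\hat{f}_{D}$ fit on a random training set $D$ (which is exactly why the expectation over the data enters).

$$
\begin{equation}
\mathbb{E}\left[\left(y-\hat{f}_{D}(\mathbf{x})\right)^{2}\right]=\underbrace{\left(f(\mathbf{x})-\mathbb{E}_{D}\left[\hat{f}_{D}(\mathbf{x})\right]\right)^{2}}_{\text{bias}^{2}}+\underbrace{\mathbb{E}_{D}\left[\left(\hat{f}_{D}(\mathbf{x})-\mathbb{E}_{D}\left[\hat{f}_{D}(\mathbf{x})\right]\right)^{2}\right]}_{\text{variance}}+\underbrace{\sigma^{2}}_{\text{irreducible noise}}
\end{equation}
$$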