Gradient Descent at a glance

- How exactly does the closed-form (normal-equation) solution have a runtime of $O(d^2 n)$ or $O(d^3)$, depending on the input size? (A cost breakdown follows the update equations below.)
- Why use Gradient Descent instead of setting the partial derivatives to zero and solving for each weight directly?
- Explain how gradient descent works in great detail.
$$
\begin{equation}
\mathbf{w}^{t+1}=\mathbf{w}^t-\alpha \nabla L\left(\mathbf{w}^t\right)
\end{equation}
$$

$$
\begin{equation}
\mathbf{w}^{t+1}=\mathbf{w}^t-\frac{\alpha}{n} \sum_{i=1}^n \nabla L_i\left(\mathbf{w}^t, \mathbf{x}_i, y_i\right)
\end{equation}
$$
$$
\begin{equation}
\mathbf{w}^{t+1}=\mathbf{w}^t+\frac{\alpha}{n} \mathbf{X}^T\left(\mathbf{y}-\mathbf{X} \mathbf{w}^t\right)
\end{equation}
$$
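On the runtime question above: a rough cost breakdown of the closed-form least-squares solution, assuming $\mathbf{X} \in \mathbb{R}^{n \times d}$, with the per-step cost of gradient descent shown for contrast.

$$
\begin{equation}
\mathbf{w}^{*}=\left(\mathbf{X}^{T} \mathbf{X}\right)^{-1} \mathbf{X}^{T} \mathbf{y}: \quad \underbrace{O\left(n d^{2}\right)}_{\text{form } \mathbf{X}^{T} \mathbf{X}}+\underbrace{O\left(d^{3}\right)}_{\text{solve the } d \times d \text{ system}} \quad \text{vs. one gradient step at } O(n d)
\end{equation}
$$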
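The three updates above are, in order, the generic gradient step, the same step written as an average of per-example gradients, and its specialization to least squares. Below is a minimal runnable sketch of that last update, assuming the loss $L(\mathbf{w})=\frac{1}{2n}\|\mathbf{y}-\mathbf{X}\mathbf{w}\|^{2}$; the function name, `alpha`, and `n_steps` are illustrative, not from the notes.

```python
import numpy as np

def gradient_descent_ls(X, y, alpha=0.1, n_steps=1000):
    """Full-batch gradient descent for least squares:
    w_{t+1} = w_t + (alpha / n) * X^T (y - X w_t)."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_steps):
        residual = y - X @ w                    # shape (n,)
        w = w + (alpha / n) * (X.T @ residual)  # step against the gradient
    return w

# Toy check: recover known weights from noiseless data.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true
print(gradient_descent_ls(X, y))                # close to w_true
```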
- How does stochastic gradient descent differ from the exhaustive (full-batch) approach? How does mini-batch gradient descent differ from it? (A mini-batch/SGD sketch follows this list.)
- What are the exponential-decay and 1/t learning-rate schedules? What is the use case for each?
- How does Momentum solve the ravine problem for classic gradient descent with an exponentially decayed (annealed) learning rate? (The schedules and the momentum update are sketched after this list.)
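A sketch of the stochastic/mini-batch variants mentioned above, under the same least-squares setup as before: `batch_size=1` reduces to plain SGD, while `batch_size=n` recovers the exhaustive (full-batch) update. Names and defaults are illustrative.

```python
import numpy as np

def minibatch_sgd_ls(X, y, alpha=0.1, batch_size=8, n_epochs=50, seed=0):
    """Mini-batch SGD for least squares: each update uses the gradient of a
    small random subset of examples rather than the whole dataset, trading a
    noisier step for a much cheaper one. batch_size=1 is plain SGD."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(n_epochs):
        order = rng.permutation(n)              # reshuffle each epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            residual = y[idx] - X[idx] @ w
            w = w + (alpha / len(idx)) * (X[idx].T @ residual)
    return w
```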
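And a sketch of the two annealing schedules plus the classical momentum update on a toy "ravine" (a quadratic that is much steeper in one direction). The constants `alpha0`, `k`, and `beta` are illustrative; the point is that the accumulated velocity damps the side-to-side oscillation across the steep walls while building speed along the shallow floor.

```python
import numpy as np

def exp_decay_lr(alpha0, k, t):
    """Exponential decay: alpha_t = alpha0 * exp(-k * t)."""
    return alpha0 * np.exp(-k * t)

def inv_t_lr(alpha0, k, t):
    """1/t decay: alpha_t = alpha0 / (1 + k * t)."""
    return alpha0 / (1.0 + k * t)

def momentum_gd(grad, w0, alpha0=0.1, k=0.01, beta=0.9, n_steps=500):
    """Gradient descent with classical momentum and an annealed learning rate:
    v_{t+1} = beta * v_t - alpha_t * grad(w_t),  w_{t+1} = w_t + v_{t+1}."""
    w = np.asarray(w0, dtype=float)
    v = np.zeros_like(w)
    for t in range(n_steps):
        alpha_t = exp_decay_lr(alpha0, k, t)    # swap in inv_t_lr to compare
        v = beta * v - alpha_t * grad(w)
        w = w + v
    return w

# Toy ravine: steep in the first coordinate, shallow in the second.
grad = lambda w: np.array([10.0 * w[0], 0.1 * w[1]])
print(momentum_gd(grad, [1.0, 1.0]))            # both coordinates head toward 0
```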
Error Decomposition and Bias-Variance
- You just trained a model, used Gradient Descent, and annealed the learning rate and the momentum coefficient, yet you still got bad performance. What could be the problem?
- What are the main sources of prediction error? (The standard decomposition is written out at the end of these questions.)

- Explain this portion of the derivation, line-by-line
- Okay, now why do we take the expectation over the distribution of the data?

- Explain this portion of the derivation, line-by-line
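For reference against whatever derivation the excerpts above walk through, this is the standard squared-error decomposition such derivations typically arrive at, assuming $y=f(\mathbf{x})+\epsilon$ with $\mathbb{E}[\epsilon]=0$ and $\operatorname{Var}(\epsilon)=\sigma^{2}$, and a predictor $\hat{f}_{D}$ fit on a random training set $D$ (which is exactly why the expectation over the data enters).

$$
\begin{equation}
\mathbb{E}\left[\left(y-\hat{f}_{D}(\mathbf{x})\right)^{2}\right]=\underbrace{\left(f(\mathbf{x})-\mathbb{E}_{D}\left[\hat{f}_{D}(\mathbf{x})\right]\right)^{2}}_{\text{bias}^{2}}+\underbrace{\mathbb{E}_{D}\left[\left(\hat{f}_{D}(\mathbf{x})-\mathbb{E}_{D}\left[\hat{f}_{D}(\mathbf{x})\right]\right)^{2}\right]}_{\text{variance}}+\underbrace{\sigma^{2}}_{\text{irreducible noise}}
\end{equation}
$$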