Write your solutions to the following problems either by writing them on a piece of paper or on a tablet and scanning your answers as a PDF. Note that you are not allowed to use LaTeX, Google Docs, or any other digital document creation software to type your answers. Homeworks are due to Gradescope by 11:59PM on the due date. See the syllabus for details on the slip day policy.
Homework will be evaluated not only on the correctness of your answers, but on your ability to present your ideas clearly and logically. You should always explain and justify your conclusions, using sound reasoning. Your goal should be to convince the reader of your assertions. If a question does not require explanation, it will be explicitly stated.
Review the solutions to Homework 7 and pick two problem parts (for example, Problem 3c and Problem 5b) from Homework 7 in which your solutions have the most room for improvement, i.e., where they have unsound reasoning, could be significantly more efficient or clearer, etc. Include a screenshot of your solution to each problem part, and in a few sentences, explain what was deficient and how it could be fixed.
Alternatively, if you think one of your solutions is significantly better than the posted one, copy it here and explain why you think it is better. If you didn’t do Homework 7, choose two problem parts from it that look challenging to you, and in a few sentences, explain the key ideas behind their solutions in your own words.
Solution
Problem 2: The Sum of Errors (8 pts)
Consider a set of \(n\) points, \((\vec x_1, y_1), (\vec x_2, y_2), …, (\vec x_n, y_n)\), where each \(\vec x_i\) is a feature vector in \(\mathbb{R}^d\) and each \(y_i\) is a scalar.
which has no intercept term, are the components of the error vector \(\vec e = \vec y - X \vec w^{\ast}\) still guaranteed to sum to 0? If they are, explain why. If they are not, explain why not, but give at least one example dataset where they still do sum to 0.
Solution
Without an intercept term, the first column of \(X\) is no longer all 1s. The normal equations \(X^T\vec e = 0\) still ensure \(\vec e\) is orthogonal to each column of \(X\), but not necessarily to the all-ones vector. Therefore, there is no guarantee that the components of \(\vec e\) sum to 0.
However, they still can sum to 0. For instance, if \(\vec 1\) lies in the column space of \(X\), the errors will still sum to 0 — in other words, if you can make a vector of all ones using linear combinations of the other columns of \(X\), \(\vec e\) will be orthogonal to that vector, and therefore sum to 0.
Even if \(\vec 1\) isn’t in the column space of \(X\), if \(\vec y\) is in the column space of \(X\), the errors will sum to 0 because they’ll all be 0 exactly. For example, if
then since \(\vec y = X \begin{bmatrix} 5 \\ 6 \end{bmatrix}\) exactly, the error vector \(\vec e\) is just \(\begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}\), and therefore sums to 0.
Problem 3: Moving Things Around (10 pts)
Let \(X\) be an \(n \times 4\) design matrix whose first column is all 1s, let \(\vec y\) be an observation vector, and let \(\vec w^{\ast} = (X^TX)^{-1}X^T \vec y\).
In this problem, you’ll reason about modifications to the design matrix and see how they affect the components of \(\vec w^{\ast}\).
a)
(3 pts) Let \(X_a\) be the design matrix that results from swapping the first two columns of \(X\). Let
\(\vec v^{\ast} = (X_a^TX_a)^{-1}X_a^T \vec y\). Express the components of \(\vec v^{\ast}\) in terms of \(w_0^{\ast}, w_1^{\ast}, w_2^{\ast}, w_3^{\ast}\).
Because the column space of the resulting design matrix has not changed, the optimal predictions themselves will not change, because the optimal predictions come from projecting \(\vec y\) onto the same \(\text{colsp}(X)\). So, the problem boils down to figuring out how to choose the coefficients in \(\vec{v}^{\ast}\) so that the predictions of the resulting model are the same as those in the original model. This logic holds for the other parts of the problem, too.
Swapping the first two columns of \(X\) interchanges the constant (intercept) column and the \(x_i^{(1)}\) column. The modified model is then
Intuitively, when we interchange two columns of our design matrix, all that does is interchange the terms in the model, which interchanges those weights in the parameter vector.
b)
(3 pts) Let \(X_b\) be the design matrix that results from adding 3 to each entry in the first column of \(X\). Let \(\vec v^{\ast} = (X_b^TX_b)^{-1}X_b^T \vec y\). Express the components of \(\vec v^{\ast}\) in terms of \(w_0^{\ast}, w_1^{\ast}, w_2^{\ast}, w_3^{\ast}\).
Adding \(3\) to each entry of the first column of \(X\) means the intercept column (previously all ones) becomes a column of all fours. The new model is therefore
In order to compensate for these changes to our coefficients, we need to “offset” any alterations made to our coefficients. To keep the model predictions identical to those produced by \(\vec w^{\ast}\), the term multiplying the constant column must remain the same:
For example, imagine fitting a line to data in \(\mathbb{R}^2\) and finding that the best-fitting line is \(y = 12 + 3x\). If we had to write this in the form \(y = v_0 \cdot 4 + v_1 x\), then the best choice for \(v_0\) would be \(3\), since \(4v_0 = 12\), and the best choice for \(v_1\) would be \(3\).
c)
(4 pts) Let \(X_c\) be the design matrix that results from adding 3 to each entry in the second column of \(X\). Let \(\vec v^{\ast} = (X_c^TX_c)^{-1}X_c^T \vec y\). Express the components of \(\vec v^{\ast}\) in terms of \(w_0^{\ast}, w_1^{\ast}, w_2^{\ast}, w_3^{\ast}\).
In order to compensate for these changes to our coefficients, we need to “offset” any alterations made to our coefficients. For the model to produce identical predictions as before, each coefficient multiplying a feature must match its original:
One way to think about this is that if we shift the feature \(x_i^{(1)}\) by a constant value, all predictions increase by that feature’s coefficient times the constant (here \(3w_1^{\ast}\)). To preserve the same overall outputs, the intercept term must decrease by that same amount.
This coould be further simplified, but there’s no need.
b)
(4 pts) To minimize \(f(\vec x)\), we’ll use gradient descent. Perform one iteration of gradient descent by hand, using the initial guess \(\vec x^{(0)} = \begin{bmatrix} 0 \\ 1 \end{bmatrix}\) and learning rate \(\alpha = \frac{1}{2}\). What is \(\vec x^{(1)}\)?
We’ve already computed \(\nabla f(\vec x^{(0)}) = \nabla f(\begin{bmatrix} 0 \\ 1 \end{bmatrix}) = \begin{bmatrix} -10 \\ 2 \end{bmatrix}\) from the previous part, so we can plug in everything we know:
This means that after one gradient descent step with \(\alpha = \frac{1}{2}\), the algorithm moves the guess for \(\vec x^{\ast}\) from \(\begin{bmatrix} 0 \\ 1 \end{bmatrix}\) to \(\begin{bmatrix} 5 \\ 0 \end{bmatrix}\).
Problem 5: Product and Chain Rules (13 pts)
Our goal in this problem is to study the behavior of the function
where \(x \in \mathbb{R}^n\) and \(A\) is a symmetric \(n \times n\) matrix (meaning \(A = A^T\)). This function, called the Rayleigh quotient, will play an important role in Chapter 5 of the course, when we eventually study the dimensionality reduction problem first introduced in Chapter 1.1.
But first, we have to get a handle on a few gradient rules.
Note that we need to pay close attention to the types of functions we’re working with. \(h(g(\vec x))\) is well-defined, but \(g(h(\vec x))\) is not, since \(h\) doesn’t take in vectors (it takes in scalars).
Find the gradients of each of the following functions.
\(f_1(\vec x) = \log(\vec x^T A \vec x)\), where \(\vec x \in \mathbb{R}^n\) and \(A\) is a symmetric \(n \times n\) matrix
\(f_2(\vec x) = e^{-\sin(\vec a^T\vec x)}\), where \(\vec x, \vec a \in \mathbb{R}^n\)
Here, \(\log(x)\) denotes the base-\(e\) logarithm, i.e. \(\ln(x)\).
where \(\vec x \in \mathbb{R}^n\) and \(A\) is a symmetric \(n \times n\) matrix, then
$$ \nabla f(\vec x) = \frac{2}{\vec x^T \vec x} \left( A \vec x - f(\vec x) \vec x \right) $$
Solution
Let
$$ g(\vec x) = \vec x^T A \vec x \quad h(\vec x) = \frac{1}{\vec x^T \vec x} $$
Then,
$$ f(\vec x) = g(\vec x) h(\vec x) $$
Notice that we intentionally didn’t introduce a quotient rule! Instead, we gave you the tools to find \(\nabla h(\vec x)\), which allows you to then use the product rule.
So first, since \(\frac{\text{d}}{\text{d}x} \left(\frac{1}{x}\right) = -\frac{1}{x^2}\), we have
There were several ways to simplify the expression, and any correct answer will receive full credit. But, by using the fact that \(f(\vec x) = \frac{\vec x^T A \vec x}{\vec x^T \vec x}\), the expression simplifies rather nicely, and we will see this specific gradient again in Chapter 10, when studying PCA.
Problem 6: Convexity (12 pts)
In this video, we introduce the formal definition of convexity for vector-to-scalar functions. Intuitively, a function \(f: \mathbb{R}^d \to \mathbb{R}\) is convex if its graph is a bowl-shaped surface. Formally, \(f\) is convex if for all \(\vec x, \vec y \in \mathbb{R}^d\) and all \(t \in [0,1]\),
$$ f((1-t)\vec x + t \vec y) \le (1-t) f(\vec x) + t f(\vec y) $$
This is a formal way of saying that when you connect any two points on the graph of \(f\) with a line segment, the line segment lies on or above the graph of \(f\), never below.
The second derivative test for convexity is more convenient, but it doesn’t apply to non-differentiable functions, e.g. \(f(x) = |x|\) is convex, but it isn’t differentiable.
For each statement below, prove that the statement is true using the formal definition above, or give a counterexample.
a)
(4 pts) The sum of two convex functions must also be convex.
Solution
Let \(f\) and \(g\) be convex functions. We want to show that their sum \(h(x) = f(x) + g(x)\) is also convex.
Let’s start with the definition of convexity. For any \(x, y\) in \(f\)’s domain and \(t \in [0,1]\), since \(f\) and \(g\) are convex, we have:
$$ f((1-t)x + ty) \leq (1-t) f(x) + t f(y) $$
$$ g((1-t)x + ty) \leq (1-t) g(x) + t g(y) $$
Note that the above two inequalities are individually true for any valid \(t\), but to combine them we can pick the same \(t\). Adding the two inequalities gives
We can recognize that the left-hand side is \(h((1-t)x + ty)\), and the right-hand side is \((1-t) h(x) + t h(y)\).
$$ h((1-t)x + ty) \leq (1-t) h(x) + t h(y) $$
And we can conclude that \(h(x) = f(x) + g(x)\) satisfies the convexity definition.
$$ \boxed{\text{Therefore, the sum of convex functions is convex.}} $$
b)
(4 pts) The difference of two convex functions must also be convex.
Solution
This statement is not true in general. As a counterexample, let’s consider \(f(x) = x^2\) and \(g(x) = 2x^2\). Then both \(f\) and \(g\) are convex, but
$$ h(x) = f(x) - g(x) = x^2 - 2x^2 = -x^2 $$
which is concave, not convex (since its second derivative is negative).
$$ \boxed{\text{The difference of two convex functions is not necessarily convex.}} $$
c)
(4 pts) Suppose \(f(x)\) and \(g(x)\) are both scalar-to-scalar convex functions and that, for some scalar \(a\), \(f(a) = g(a)\). Then, \(h(x)\) is also convex, where
$$ h(x) = \begin{cases} f(x) & x \leq a \\\\ g(x) & x > a \end{cases} $$
Hint: The statement is false, so focus your energy on finding a counterexample.
Solution
We will show that this statement is false by constructing convex \(f\) and \(g\) for which \(h(x)\) is not convex: