Write your solutions to the following problems either by writing them on a piece of paper or on a tablet and scanning your answers as a PDF. Note that you are not allowed to use LaTeX, Google Docs, or any other digital document creation software to type your answers. Homeworks are due to Gradescope by 11:59PM on the due date. See the syllabus for details on the slip day policy.
Homework will be evaluated not only on the correctness of your answers, but on your ability to present your ideas clearly and logically. You should always explain and justify your conclusions, using sound reasoning. Your goal should be to convince the reader of your assertions. If a question does not require explanation, it will be explicitly stated.
Review the solutions to Homework 6. Pick two problem parts (for example, Problem 2a and Problem 5b) from Homework 6 in which your solutions have the most room for improvement, i.e., where they have unsound reasoning, could be significantly more efficient or clearer, etc. Include a screenshot of your solution to each problem part, and in a few sentences, explain what was deficient and how it could be fixed.
Alternatively, if you think one of your solutions is significantly better than the posted one, copy it here and explain why you think it is better. If you didn’t do Homework 6, choose two problem parts from it that look challenging to you, and in a few sentences, explain the key ideas behind their solutions in your own words.
Solution
Problem 2: Anonymous Feedback (6 pts)
We’d like to get your feedback on how the course has been going so far, now that we’re past the halfway point and Midterm 2 is fast approaching.
You can find the survey at this link, which you should complete after you’ve finished the rest of Homework 7. Unlike the Homework 3 survey, it is anonymous, so feel free to provide candid feedback.
In order to earn the 6 points for Homework 7, Problem 2, include a screenshot of the confirmation message you see after submitting the form. (We consider it an honor code violation to include a screenshot if you didn’t actually submit the form!)
Thank you for your feedback once again!
Problem 3: The Complete Solution (18 pts)
Before beginning this problem, make sure you’ve read both Chapter 6.3 and Chapter 6.4.
a)
(4 pts) Consider the matrix \(A\) and vector \(\vec b\) defined below.
Find the vector \(\vec x^{\ast}\) that minimizes \(\lVert \vec b - A \vec x \rVert^2\). Show your work, but feel free to use numpy to compute some of the matrix operations; here is a video that walks through how to do this. (We’re intentionally using different variables here than in the notes to have you think about the problem in general terms.)
and \(\vec{x} \in \mathbb{R}^2\), we want the vector \(\vec{x}^{\ast}\) that makes the squared error \(\lVert \vec{b} - A\vec{x} \rVert^2\) as small as possible.
Because \(A\) has two linearly independent columns, its columns form a 2D subspace of \(\mathbb{R}^4\), and there is a unique least-squares solution. The standard way to get it is to use the normal equations
Geometrically, \(A\vec{x}^{\ast}\) is the projection of \(\vec{b}\) onto the column space of \(A\), and \(\vec{x}^{\ast}\) are the coordinates of that projection in the basis given by the two columns of \(A\).
For the remainder of the problem, we will use the same vector \(\vec b\), but instead use the following matrix \(A\).
(2 pts) Now, find one vector \(\vec x^{\ast}\) that minimizes \(\lVert \vec b - A \vec x \rVert^2\) for this new matrix \(A\). Again, show your work. If you’ve read Chapter 6.4 closely, this should not require much calculation.
Solution
Since \(A\) has rank 2 but 5 columns, the normal equation
$$ A^T A \vec{x} = A^T \vec{b} $$
do not have a unique solution, and there are infinitely many vectors \(\vec{x}^{\ast}\) that minimize \(\lVert \vec{b} - A\vec{x}\rVert^2\).
To find one such vector, we can keep only two linearly independent columns of \(A\) — here, we use columns 3 and 1, in that order — since removing dependent columns does not change \(\text{colsp}(A)\).
This matrix has the same two columns as in part a), but in the opposite order, so the least-squares coefficients are also swapped relative to part a). Therefore,
Since \(\text{rank}(A) = 2\) and \(A\) has 5 columns, the rank-nullity theorem says
$$ \dim(\text{nullsp}(A)) = 5 - 2 = 3 $$
So, a basis for \(\text{nullsp}(A)\) will have 3 vectors. The easy way to find those three vectors is to determine how to write \(A\)’s linearly dependent columns as linear combinations of its linearly independent columns. Columns 1 and 3 are linearly independent (they are the same two columns as in part a)). By inspection — that is, by guessing and checking — we can find that:
(2 pts) Show that if \(\vec x’\) satisfies the normal equation, \(A^TA \vec x’ = A^T \vec b\), and \(\vec x_0 \in \text{nullsp}(A)\), then \(\vec x’ + \vec x_0\) also satisfies the normal equation. Hint: This is two-line solution; we’re mostly asking it so that you interalize what this means and why it’s true.
Solution
If \(\vec x’\) satisfies the normal equation, then
$$ A^T A \vec x' = A^T \vec b $$
By the definition of the null space, if \(\vec x_0 \in \text{nullsp}(A)\), then \(A\vec x_0 = \vec 0\). Then:
Thus, \(\vec x’ + \vec x_0\) also satisfies the normal equation. This means adding any vector in \(\text{nullsp}(A)\) to a solution \(\vec x’\) of the normal equation gives another valid solution.
e)
(3 pts) Describe, using set notation, the complete set of vectors \(\vec x^{\ast}\) that minimize \(\lVert \vec b - A \vec x \rVert^2\). Is this set a subspace?
Solution
The complete set of vectors that minimize \(\lVert \vec{b} - A\vec{x} \rVert^2\) is all vectors that result from starting with a particular solution \(\vec x’\) and adding any vector in the null space of \(A\).
This set is not a subspace, because it does not necessarily pass through the origin (for instance, \(\vec{x}^{\ast}\) itself may not be \(\vec{0}\)).
f)
(3 pts) There are infinitely many vectors \(\vec x^{\ast}\) that minimize \(\lVert \vec b - A \vec x \rVert^2\). If we try and use code to find a solution, it can’t return all of them — it’ll pick a particular one.
In Python, use np.linalg.lstsq to find a vector \(\vec x^{\ast}\) that minimizes \(\lVert \vec b - A \vec x \rVert^2\). Include a screenshot of your code and the vector \(\vec x^{\ast}\) it returns, and in your PDF, write out the coefficients of \(\vec x^{\ast}\) as a vector (in addition to the screenshot). Then, provide an educated guess of why you think it picked the \(\vec x^{\ast}\) that it did.
Solution
When fitting a linear model, the goal of np.linalg.lstsq is to find a weight vector \(\vec{w}^{\ast}\) that minimizes
$$ \lVert \vec{b} - A\vec{w} \rVert^2 $$
If \(A^TA\) is invertible (that is, if the columns of \(A\) are linearly independent), there is exactly one solution:
$$ \vec{w}^* = (A^TA)^{-1}A^T\vec{b} $$
However, if \(A\) does not have full rank, then \(A^TA\) cannot be inverted, and there are infinitely many \(\vec{w}\) that give the same minimum error.
In this case, np.linalg.lstsq uses the singular value decomposition (SVD) of \(A\) to find a stable and well-defined \(\vec{w}^{\ast}\). Without getting into the details, know that among all possible minimizers, this particular \(\vec{w}^{\ast}\) also has the smallest possible length \(\lVert \vec{w} \rVert\). In other words, it fits the data as well as possible while keeping the parameter vector as small as possible. It is called the “min-norm” solution.
Problem 4: Orthogonalization (27 pts)
Before starting, refer to Chapter 6.5, written just for this problem. It won’t be possible to do this problem without referencing it.
In parts a) through d), we’ll refer to the vectors \(\vec v_1 = \begin{bmatrix} 1 \\ 1 \\ 0 \\ 0 \end{bmatrix}\), \(\vec v_2 = \begin{bmatrix} 0 \\ 1 \\ 1 \\ 0 \end{bmatrix}\), and \(\vec v_3 = \begin{bmatrix} 0 \\ 0 \\ 1 \\ 1 \end{bmatrix}\).
a)
(6 pts) By hand, apply the Gram-Schmidt process to the vectors \(\vec v_1, \vec v_2, \vec v_3\) to find an orthonormal set of vectors \(\vec q_1, \vec q_2, \vec q_3\). Show your work; you cannot use numpy for this.
Then, create the matrix \(Q = \begin{bmatrix} | & | & | \\ \vec q_1 & \vec q_2 & \vec q_3 \\ | & | & | \end{bmatrix}\) and confirm that \(Q^TQ = I\), but that \(QQ^T \neq I\).
(Tip: Instead of converting \(\vec Q_3 = \begin{bmatrix} 1/3 \\ -1/3 \\ 1/3 \\ 1 \end{bmatrix}\) to a unit vector, it’s easier to instead convert \(\begin{bmatrix} 1 \\ -1 \\ 1 \\ 3 \end{bmatrix}\), which points in the same direction as \(\vec Q_3\), to a unit vector. The result will be the same in both cases, but by multiplying through by 3 you get to avoid messier fractions.)
Finally, we need to construct \(Q = \begin{bmatrix} 1/\sqrt{2} & -1/\sqrt{6} & \sqrt{3}/6 \\ 1/\sqrt{2} & 1/\sqrt{6} & -\sqrt{3}/6 \\ 0 & 2/\sqrt{6} & \sqrt{3}/6 \\ 0 & 0 & \sqrt{3}/2 \end{bmatrix}\). Indeed, \(Q^TQ = I\) because all pairs of columns are orthogonal and have length 1. However, \(QQ^T \neq I\) because, for instance, row 3 and row 4 are not orthogonal, so \((QQ^T)_{3, 4} = 3/12 = 1/4 \neq 0\).
b)
(3 pts) Suppose \(\vec v_4 = \begin{bmatrix} 2 \\ 2 \\ 3 \\ 3 \end{bmatrix}\). If you were to apply the Gram-Schmidt process to the vectors \(\vec v_1, \vec v_2, \vec v_3, \vec v_4\), what would the vector \(\vec Q_4\) be? Why?
In the previous part, we already applied Gram-Schmidt to the vectors \(\vec v_1, \vec v_2, \vec v_3\), giving us the orthonormal vectors \(\vec q_1, \vec q_2, \vec q_3\). If we were to continue the Gram-Schmidt process for a fourth iteration, we’d have
Gram-Schmidt already gave us \(Q = \begin{bmatrix} 1/\sqrt{2} & -1/\sqrt{6} & \sqrt{3}/6 \\ 1/\sqrt{2} & 1/\sqrt{6} & -\sqrt{3}/6 \\ 0 & 2/\sqrt{6} & \sqrt{3}/6 \\ 0 & 0 & \sqrt{3}/2 \end{bmatrix}\), which has a column space that is equal to the span of \(\vec v_1, \vec v_2, \vec v_3\).
Find scalars \(a\), \(b\), and \(c\) such that \(a\vec q_1 + b\vec q_2 + c\vec q_3 = \vec y\), without solving a system of 3 equations and 3 unknowns. Instead, use the fact that \(\vec q_1, \vec q_2, \vec q_3\) are orthonormal. Hint: There’s a relevant problem from Lab 5.
Solution
The main idea here is that to write \(\vec y\) as a linear combination of \(\vec q_1\), \(\vec q_2\), and \(\vec q_3\), we can project \(\vec y\) onto each of these vectors individually and then use the corresponding coefficients to write \(\vec y\) as a linear combination of all three.
Remember, \(\text{proj}_{\vec q_i}(\vec y) = \frac{\vec y \cdot \vec q_i}{\vec q_i \cdot \vec q_i} \vec q_i\), but since each \(\vec q_i\) is a unit vector, we have \(\text{proj}_{\vec q_i}(\vec y) = (\vec y \cdot \vec q_i) \vec q_i\). So,
Notice that \(Q^T \vec y\) is a vector containing \(a\), \(b\), and \(c\), since \(Q^T \vec y\) contains the dot product of each of \(Q^T\)’s rows (which are \(Q\)’s columns) with \(\vec y\).
Let’s test out our logic in Python.
d)
(4 pts) Consider \(\vec y = \begin{bmatrix} 1 \\ 2 \\ 3 \\ 4 \end{bmatrix}\). Unlike in c), \(\vec y\)is not in \(\text{span}(\lbrace \vec v_1, \vec v_2, \vec v_3 \rbrace)\).
Find the vector in \(\text{span}(\lbrace \vec v_1, \vec v_2, \vec v_3 \rbrace)\) that is closest to \(\vec y\). Do not stack the \(\vec v_i\)’s into a matrix \(X\) and then use \(X(X^TX)^{-1}X^T\vec y\). Instead, use the fact that \(\vec q_1, \vec q_2, \vec q_3\) are orthonormal and have the same span as \(\vec v_1, \vec v_2, \vec v_3\). How does this simplify the problem?
Solution
Recall, the projection of \(\vec y\) onto \(\text{span}(\lbrace \vec q_1, \vec q_2, \vec q_3 \rbrace)\) is given by
$$ \vec p = \underbrace{Q(Q^TQ)^{-1}Q^T}_P\vec y $$
Since \(Q^TQ = I\), we have that
$$ \vec p = QQ^T \vec y $$
If \(Q\)’s rows were orthonormal, then \(QQ^T = I\), but that’s not the case here. Instead, \(\vec p = QQ^T \vec y\).
So, the vector in \(\text{span}(\lbrace \vec v_1, \vec v_2, \vec v_3 \rbrace)\) that is closest to \(\vec y\) is \(\boxed{\begin{bmatrix} 3/2 \\ 3/2 \\ 7/2 \\ 7/2 \end{bmatrix}}\).
e)
(5 pts) Open the the supplemental Jupyter Notebook we’ve created for Homework 7, which can either be found here in the course GitHub repository, or here on DataHub.
There, you’re asked to implement the function orthogonalize, which takes in an \(n \times d\) matrix \(V\) whose columns are linearly independent, and returns a matrix \(Q\) whose columns are orthonormal and have the same span as \(V\). This problem is not autograded. Rather, in your submission to this part, include a screenshot of your implementation and sample output in your PDF for Homework 7.
Solution
f)
(5 pts) A QR decomposition of a matrix \(A\) is a factorization of the form
$$ A = QR $$
where \(Q\) is an \(n \times d\) matrix with orthonormal columns and \(R\) is an \(d \times d\)upper triangular matrix (a matrix that has 0s below the diagonal).
For example, if \(A = \begin{bmatrix} 1 & 1 & 1 \\ -1 & 0 & 1 \\ 1 & 1 & 2 \end{bmatrix}\), a \(QR\) decomposition of \(A\) is
Finding the \(Q\) in a \(QR\) decomposition is straightforward: apply Gram-Schmidt to the columns of \(A\), assuming \(A\)’s columns are linearly independent. The question is how to find \(R\).
In the supplemental Jupyter Notebook, we’ve defined an arbitrary matrix \(A\) and call your orthogonalize function on it, and give you hints as to how to find \(R\). Using the experimentation there, and what you know about \(Q\), explain how to find \(R\).
Find a \(QR\) decomposition of \(A = \begin{bmatrix} 1 & 0 & 0 \\ 1 & 1 & 0 \\ 0 & 1 & 1 \\ 0 & 0 & 1 \end{bmatrix}\). Note that the columns of \(A\) are made up of the same three vectors you worked with in parts a) through d) of this problem.
Given a\(QR\) decomposition of \(A\), explain how to find another\(QR\) decomposition of \(A\) with a (slightly) different \(Q\) and/or \(R\).
Solution
The big idea is that if \(Q\) is a matrix with orthonormal columns, then \(Q^TQ = I\). So, if we’re trying to find an \(R\) such that \(A = QR\), then multiplying both sides by \(Q^T\) on the left gives us
$$ A = QR \implies Q^TA = Q^TQR \implies Q^TA = R $$
meaning that \(R = Q^TA\).
All that’s left to explain is why \(R\) is upper triangular. The product \(Q^TA\) contains the dot products of the rows of \(Q^T\) with the columns of \(A\). But, the rows of \(Q^T\) are the columns of \(Q\), so \(R = Q^TA\) contains the dot products of columns of \(Q\) with columns of \(A\). Specifically,
$$ R_{i, j} = \vec q_i \cdot \vec v_j $$
where \(\vec q_i\) is the \(i\)-th column of \(Q\) and \(\vec v_j\) is the \(j\)-th column of \(A\).
Remember, we constructed each \(\vec q_i\) to be orthogonal to all previously constructed \(\vec q_j\)’s for \(j < i\). Put in English, \(\vec q_2\) is orthogonal to \(\vec q_1\), \(\vec q_3\) is orthogonal to \(\vec q_1\) and \(\vec q_2\), and so on.
Each \(\vec q_i\) was found by taking \(\vec v_i\) and subtracting off a linear combination of\(\vec q_{i - 1}, \vec q_{i - 2}, \ldots, \vec q_1\). (More precisely, we built the \(\vec Q_i\)’s this way and then normalized them to get the \(\vec q_i\)’s, but the directions of the \(\vec Q_i\)’s are the same as the directions of the \(\vec q_i\)’s, so this reasoning still holds.) As an example, consider how we would construct \(\vec q_4\) if we were to apply Gram-Schmidt to the vectors \(\vec v_1, \vec v_2, \vec v_3, \vec v_4\):
But, since all the \(\vec q_i\)’s are orthogonal to each other, if we were to take the dot product of both sides of the equation with \(\vec q_5\), or \(\vec q_6\), or any other \(\vec q_i\) for \(i > 4\), we would get 0.
This illustrates that \(\vec q_i \cdot \vec v_j = 0\) when \(i > j\). But, a matrix with 0’s where \(i > j\) is a matrix with 0’s everywhere below the diagonal of \(i = j\), which is precisely an upper triangular matrix.
$$ \underbrace{\begin{bmatrix} \cdot & \cdot & \cdot & \cdot \\\\ 0 & \cdot & \cdot & \cdot \\\\ 0 & 0 & \cdot & \cdot \\\\ 0 & 0 & 0 & \cdot \end{bmatrix}}_{\text{in all the 0's, the row index (i) is greater than the column index (j)}} $$
So, since \(\vec q_i \cdot \vec v_j = 0\) when \(i > j\), \(R = Q^TA\) — which is made up of these dot products — is upper triangular.
We have already found the \(Q\) in a \(QR\) decomposition of \(A = \begin{bmatrix} 1 & 0 & 0 \\ 1 & 1 & 0 \\ 0 & 1 & 1 \\ 0 & 0 & 1 \end{bmatrix}\): these are the vectors we found in part a) of the problem.
An easy solution is to negate one or more of the columns of \(Q\); the resulting columns of \(Q\) will still be orthonormal with the same column space. This will lead to a slightly different \(R\). (There are other ways of finding a \(QR\) decomposition as well.)
Problem 5: Same, but Different (13 pts)
In Chapter 2.4, we were introduced to one of many formulas for the optimal slope, \(w_1^{\ast}\), and optimal intercept, \(w_0^{\ast}\), for the simple linear regression model \(h(x_i) = w_0 + w_1 x_i\) when using squared loss:
$$ w_1^* = r \frac{\sigma_{y}}{\sigma_{x}} \qquad w_0^* = \bar y - w_1^* \bar x $$
The end goal of Chapters 3 through 6 has been to give us the tools to revisit the simple linear regression model in terms of linear algebra, so that we can extend our model to allow for multiple input variables. As we see in Chapter 7.1, the solution is to define the \(n \times 2\) “design matrix” \(X\) and observation vector \(\vec y \in \mathbb{R}^n\) as follows:
It’s not immediately obvious why the components of \(\vec w^{\ast}\) should have anything to do with the correlation, means of \(x\) and \(y\), and standard deviations of \(x\) and \(y\). In this problem, we will prove that both of these formulations are equivalent, for any dataset \((x_1, y_1)\), \((x_2, y_2)\), ..., \((x_n, y_n)\).
a)
(5 pts) Express the matrix \((X^TX)^{-1}\) using constants and/or summations involving \(x_i\) and/or \(y_i\).
To simplify, we’ll start by proving the statement in the hint, that \(\sum_{i = 1}^n x_i^2 = n \sigma_x^2 + n \bar{x}^2\). The key “trick” is writing \(x_i^2\) as \((x_i - \bar{x} + \bar{x})^2\), and then expand by grouping \(((x_i - \bar{x}) + \bar{x})^2\).
We’ll use the fact that \(\sum_{i = 1}^n (x_i - \bar{x}) = 0\); see Chapter 0.1.
This is where the hint comes in. Let’s show that \(\sum_{i=1}^n x_i y_i = nr \sigma_x \sigma_y + n \bar{x}\bar{y}\). It’s not immediately clear how to approach this, since starting with just \(\sum_{i=1}^n x_i y_i\) doesn’t seem to help us. Instead, let’s start by expanding the definition of \(r\).
Note that the second component of the vector above is \(w_1^{\ast} = r \frac{\sigma_y}{\sigma_x}\) and the first component of the vector above is \(w_0^{\ast} = \bar{y} - r \frac{\sigma_y}{\sigma_x} \bar{x} = \bar{y} - w_1^{\ast} \bar{x}\), as we first saw in Chapter 2.4! I think this is beautiful.
Problem 6: Putting it into Practice (8 pts)
This problem asks you to apply the concepts in Chapter 7.2, and follows the last problem.
Suppose we’d like to fit a hypothesis function of the form \(h(x_i) = w_0 + w_1 x_i^2\). Notice the squared term; this is not a simple linear regression line.
To do so, we’ll find the optimal parameter vector \(\vec w^{\ast}\) that satisfies the normal equations. The first 5 rows of our dataset are as follows, though note that our dataset has \(n\) rows in total.
Suppose \(x_1, x_2, …, x_n\) have a mean of \(\bar{x} = 5\) and a variance of \(\sigma_x^2 = 8\).
a)
(2 pts) Write the first five rows of the design matrix, \(X\).
Solution
Since our hypothesis function is \(h(x_i) = w_0 + w_1 x_i^2\), the design matrix must include a column of ones (for the intercept term) and a column containing \(x_i^2\) for each \(x_i\) value.
Here, each entry in the second column corresponds to \(x_i^2\) for the given \(x_i\) values in the dataset.
b)
(3 pts) Suppose that after solving the normal equations, we find \(\vec w^{\ast} = \begin{bmatrix} 5 \\ -2 \end{bmatrix}\). Find the augmented feature vector \(\text{Aug}(\vec x_3)\) and squared loss for row 3 of the dataset.
Solution
The augmented feature vector adds a constant \(1\) for the intercept and the squared value of \(x_3\):
(3 pts) Let \(X_\text{tri} = 3X\), where \(X\) is the full design matrix for our dataset with \(n\) rows. Determine the bottom-left value in the matrix \(X_\text{tri}^TX_\text{tri}\), i.e. the value in the second row and first column. Your answer should be an expression involving \(n\). Hint: You can use any of the hints or results from Problem 2 without needing to re-prove them.
This problem involves writing code and submitting it to the Gradescope autograder. The goal of this problem is to give you a taste of how linear algebra can be used to implement linear regression in code, and show you how to build models that involve multiple features (including categorical variables).
Open the the supplemental Jupyter Notebook we’ve created for Homework 7, which can either be found here in the course GitHub repository, or here on DataHub.
This problem is entirely autograded; to receive credit for Problem 7 of this homework, you’ll need to submit your completed notebook to the autograder on Gradescope. Your submission time for Homework 7 is the latter of your PDF and code submission times.