Low-Rank and Structured Low-Rank Reconstruction Approaches
Mark: Thanks for coming to "Low-Rank and Structured Low-Rank Reconstruction Approaches." I have nothing to declare. For this talk, we're going to focus on providing some intuition for these methods and a rough outline of how the mechanics of low-rank reconstructions work. So this won't be an exhaustive review, and it's not going to go too deep into the theory, but I will provide some links to other resources at the end if you're interested in learning more.
[00:00:23] I'm going to start today with a motivating example. You may have heard of the Netflix problem, which was obviously more relevant when Netflix actually used numerical ratings. But anyways, it's a collaborative filtering problem where you have a matrix of movies by users.
[00:00:39] And the problem is basically to predict ratings for movies that have not yet been viewed by a given user. In the absence of other information, this is obviously an impossible task: if every person were completely unique and unpredictable, there wouldn't be any good way to guess what the missing values should be.
[00:00:56] However, it turns out your movie preferences aren't as unique or quirky as you may have imagined, and user ratings are actually highly correlated. And so with this assumption, that movie preferences are correlated across users, we can try and leverage ratings from other users to fill in the missing entries.
[00:01:15] Another way of thinking about it is that movies aren't completely independent entities either; they can be grouped into genres with strong rating correlations within each genre. As it turns out, these properties of the ratings matrix mean that it has low rank, which ultimately means that it can be possible to predict unseen entries from other entries in the matrix.
[00:01:38] Now, from an imaging perspective, we can directly map this idea to something like voxels and time points, where we can see that the voxels in this cardiac cine movie are not all moving independently, but that there are a lot of voxels that exhibit the same motion or contrast changes, which means that they're correlated.
[00:01:55] And so we can break down a dataset into a small number of modes or components that have coherent dynamics. And that is exactly what a low-rank model does. We can see here a small number of components that are used to describe the cardiac cine data: it's comprised of a largely static background and a few motion dynamics, which we can see animated here.
[00:02:20] And if we add these four modes back together into a single composite dataset, we can see that most of the variation and dynamics are captured, despite only having four distinct components or modes of variation. This means that the data is well described by a low-rank linear model. And it's this low-rank property that we'll be using to perform our constrained image reconstruction.
[00:02:44] Here, we'll try and go through some relevant background and concepts to give you some intuition for what properties a low-rank matrix has and how we can interpret or think about those properties. First, we'll start with the singular value decomposition, which breaks down any matrix into left singular vectors, right singular vectors, and scaling factors called singular values. It's an incredibly useful tool because it gives us direct access to not only the rank of a matrix, which is the number of non-zero singular values, but also those left and right singular vectors, which form bases for the column and row vectors in our matrix. The singular vectors are paired up in the way that you see here, with components illustrated in color. The first left singular vector only sees the first right singular vector and the first singular value, and you can pause the video and convince yourself that this is true with the rules of matrix multiplication. This is also true for the second components, the third components, and the fourth components. Ultimately, the left singular vectors form a basis for the column vectors in our matrix, which means that any column in the matrix can be represented as a linear combination of these left singular vectors.
[00:03:53] Similarly, the right singular vectors form a basis for the row vectors in our matrix, which means that any row in the matrix can be represented as a linear combination of these right singular vectors. The entries in the matrix of left singular vectors, U, along with the singular values, help define the linear combination weights.
[00:04:14] This is why low-rank methods can also be referred to as subspace learning methods, because ultimately the singular vectors define these linear subspaces that the matrix information resides in. And if the matrix is low rank, this is equivalent to having your column and row vectors all living in these low dimensional subspaces.
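As a concrete illustration of the subspace idea, here's a minimal NumPy sketch (with made-up toy data, not MRI data) showing that every row of a rank-2 matrix lives in the span of its first two right singular vectors:

```python
import numpy as np

# Toy rank-2 "space-by-time" matrix: every column is a linear
# combination of two basis time courses (hypothetical example data).
rng = np.random.default_rng(0)
basis = rng.standard_normal((2, 10))    # two temporal modes
weights = rng.standard_normal((6, 2))   # per-voxel mixing weights
A = weights @ basis                     # 6 x 10, rank 2 by construction

U, s, Vh = np.linalg.svd(A, full_matrices=False)
rank = int(np.sum(s > 1e-10 * s[0]))    # count the non-zero singular values
print(rank)  # -> 2

# Projecting a row onto the span of the first two right singular
# vectors leaves it unchanged: the row already lives in that subspace.
row_in_subspace = A[0] @ Vh[:rank].T @ Vh[:rank]
print(np.allclose(row_in_subspace, A[0]))  # -> True
```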
[00:04:31] You can visualize that here as signals in a 2D subspace of a three-dimensional vector space. And it also becomes visually apparent why we sometimes refer to our low-rank matrix vectors as being correlated.
[00:04:47] We could have also written our SVD in a slightly different way, as a sum of rank-one matrices, where each one is comprised of a column vector multiplied by a row vector in an outer product, along with an associated scale factor coming from the singular values. A rank-one matrix is just a matrix where every row is basically exactly the same aside from a scaling factor, or equivalently, every column is exactly the same aside from a scaling factor. In this row-wise perspective, the left singular column vector provides the scaling weights for the right singular row vectors.
[00:05:24] If we look at a more realistic rank one space time matrix, like from the cardiac cine data I showed earlier, we can clearly see that every row is a scaled version of the basis row and simultaneously every column is a scaled version of this basis column. So through the SVD, we can then interpret our low rank matrix as a sum or superposition of rank-1 components or linearly independent features.
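This sum-of-rank-one-layers view can be sketched in a few lines of NumPy; the matrix here is a random toy example, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 3)) @ rng.standard_normal((3, 8))  # rank 3

U, s, Vh = np.linalg.svd(A, full_matrices=False)

# Rebuild A as a superposition of rank-one "layers": s_i * outer(u_i, v_i).
layers = [s[i] * np.outer(U[:, i], Vh[i]) for i in range(len(s))]
print(np.allclose(sum(layers), A))  # -> True

# Each layer is itself rank one: every row is a scaled copy of one basis row.
print(np.linalg.matrix_rank(layers[0]))  # -> 1
```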
[00:05:52] I kind of like to think of these features like different layers of the data, kind of like the different layers of visual features in a Photoshop or visual effects composition. So each of these features capture some amount of information in the data set. And as we've seen each one is very strongly coherent.
[00:06:10] So it's not too hard to imagine that in this decomposition information about missing entries can be fairly easy to recover within each component because each of them has a very predictable internal structure.
[00:06:25] So the rank of a matrix can be defined in a number of ways. We can refer to it as the number of non-zero singular values. We can say that it's the number of linearly independent rows and columns in the data. It's also the dimensionality of the row and column spaces. And the consequence is that when this matrix is low rank, we can say certain things about the properties of the matrix: that there's information shared across the entries of the matrix, that the rows and columns of the matrix are not linearly independent, that the rows and columns of the matrix are correlated, or that the row and column vectors lie in low-dimensional subspaces.
[00:07:03] And ultimately this, you know, boils down to the fact that the matrix is redundant. So to better illustrate this redundancy, we can examine the degrees of freedom or the independent values needed to characterize a matrix. For an eight by eight matrix that's full rank, we need 64 degrees of freedom.
[00:07:19] That's one value for every entry in the matrix, and there's no redundancy here in a full-rank matrix. In a low-rank matrix, however, we can visualize the degrees of freedom through the SVD. First, if we look at the left singular vectors, the first left singular vector has seven degrees of freedom, because there are eight entries but it's normalized.
[00:07:40] So you lose one degree of freedom through that normalization, so we have seven. The next singular vector is also normalized, so you lose one degree of freedom, and you lose an additional degree of freedom because it's defined to be orthogonal to the first singular vector. So then we have six. And similarly, the third singular vector is normalized and orthogonal to the first two.
[00:07:59] So you have five degrees of freedom. So we have 18 degrees of freedom for the singular vectors. Then we have three singular values, so we have three degrees of freedom there. And then we can go through a similar accounting procedure for the right singular vectors, which is to say that we have seven degrees of freedom, then six, then five.
[00:08:17] So in total we have 18 plus 3 plus 18 degrees of freedom, which is 39 degrees of freedom. We need 39 independent values to characterize the 64 entries in this matrix. So that's where this redundancy comes from: the matrix nominally has 64 entries, but there are actually only 39 degrees of freedom.
[00:08:36] And we can write an expression for the degrees of freedom, as you see here, which is r times (m + n minus r), but it's more easily visualized through this SVD counting procedure. So you may think that low-rank models might just be a mathematical curiosity, and that rich real-world signals would never really be able to fit into such a constrained representation.
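The counting argument above is easy to check in code; this small sketch re-implements it and compares it against the closed form r(m + n − r):

```python
def low_rank_dof(m, n, r):
    # Degrees of freedom via the SVD counting argument:
    # left singular vectors contribute (m-1) + (m-2) + ... + (m-r),
    # the singular values contribute r,
    # right singular vectors contribute (n-1) + ... + (n-r).
    left = sum(m - k for k in range(1, r + 1))
    right = sum(n - k for k in range(1, r + 1))
    return left + r + right

print(low_rank_dof(8, 8, 3))  # -> 39, the example from the talk
print(3 * (8 + 8 - 3))        # -> 39, the closed form r * (m + n - r)
```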
[00:09:00] However, just to highlight the representational power of low-rank models, here I'm showing you a set of sinusoidal signals. If I asked you to guess how many components were used to form this set of signals, you might guess some fairly large number. But actually, I generated these signals using only two components, right?
[00:09:19] So, kind of unexpectedly, linear combinations of just two basis components can actually form quite a variety of output signals. Similarly, on the right, we can do this with more random-looking signals, and not just sinusoidal basis vectors. There's a certain degree of richness that you can actually express with linear combinations of just a small number of components. The last thing I wanted to touch on in this background section was the connection to sparsity and compressed sensing, which is maybe a bit more apparent if I use a diagram with a bit larger dimensionality.
[00:09:58] Okay. We can think of the diagonal vector of singular values now as actually a sparse vector, because a low-rank matrix only has a small number of non-zero singular values, which is the definition of sparse. So, if we do some matrix multiplications, we can actually see that the U and the V matrices are kind of like matrix sparsifying transforms.
[00:10:17] And when we multiply the data matrix with them, we're actually just left with a sparse diagonal matrix. However, in contrast to conventional compressed sensing, where sparsifying transforms need to be known ahead of time, no specific knowledge of the left and right singular vectors in U and V needs to be known at all.
[00:10:36] We actually only need to know that a low-rank representation exists and that we can use our computational machinery to find these sparsifying transforms in U and V.
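A small toy example of this "sparsifying transform" view, using a random rank-2 matrix: multiplying by the singular vector matrices leaves only a sparse diagonal.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((8, 2)) @ rng.standard_normal((2, 8))  # rank-2 data

U, s, Vh = np.linalg.svd(A)
# The singular vectors act like learned sparsifying transforms:
# U^H A V is a diagonal matrix holding the singular values.
S = U.conj().T @ A @ Vh.conj().T
off_diag = S - np.diag(np.diag(S))
print(np.allclose(off_diag, 0))         # -> True: only the diagonal survives
print(int(np.sum(np.diag(S) > 1e-10)))  # -> 2: a sparse set of non-zeros
```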
[00:10:50] Now we're going to talk about low-rank methods. The basic premise is that you want to reconstruct your data in the presence of undersampling, noise, or some other corruption or perturbation. In conventional reconstruction models, you want to find the image that best fits the measured k-space in, you know, a least-squares sense.
[00:11:08] However, this may not produce the best reconstruction in terms of metrics such as mean squared error, or the problem might be under determined. So if you know that your data can be represented as a low rank matrix, you can use that as prior knowledge to additionally constrain or regularize the image reconstruction problem.
[00:11:27] I'll give you a basic recipe for how to do image reconstruction with low-rank methods. First, we have to identify the low-rank matrix data, for example a space-by-time matrix. Then we need to construct a forward measurement model; this includes information about sampling, coil sensitivities, et cetera. Then we need to choose a suitable cost function with a low-rank constraint.
[00:11:50] That could be a convex or non-convex formulation, with a local or global rank constraint. And then we need to solve the cost function using an appropriate algorithm, typically a first-order iterative algorithm. First, I'll give you some examples of how to identify appropriate matrix data. Space-time matrices are probably the most intuitive: you can think of them as having signals with time courses that are correlated across space, or equivalently, that many voxels exhibit the same contrast dynamics.
[00:12:26] We could also think of matrix data that is space by spectra. In this case, we're assuming that spectral content is similar across voxels. Or we can formulate a matrix that's space by coils, and in this case, we're assuming that local spatial information is shared across coils.
[00:12:48] We can leverage matrices that are space by TE, where voxels or spatial features are correlated across the echo-time dimension. Or we can form space-by-diffusion-encoding matrices, where we can assume that voxel contrasts are correlated across diffusion encodings. So ultimately, what we're going to do is try and find some way to represent our data as a matrix.
[00:13:18] Sometimes in the literature, you'll find references to the Casorati matrix. This is just a fancy way of referring to the data matrix of interest, formed by concatenating signal vectors together. For example, our data are almost always space by some other encoding. So in this example, we have sets of 2D images across time, which we would normally organize into a three-dimensional array.
[00:13:40] To form the Casorati matrix, we unravel all of the spatial dimensions into a single vector, and then we concatenate those vectors across time, or whatever other encoding is being used, to form the matrix.
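In code, forming the Casorati matrix is essentially just a reshape; here's a toy sketch with random stand-in cine data:

```python
import numpy as np

# Forming the Casorati matrix: unravel the spatial dimensions of each
# time frame into a vector, then stack those vectors as columns.
rng = np.random.default_rng(8)
movie = rng.standard_normal((4, 4, 10))        # (nx, ny, nt) toy cine data
casorati = movie.reshape(-1, movie.shape[-1])  # (nx*ny) x nt
print(casorati.shape)  # -> (16, 10)

# Column t of the Casorati matrix is frame t, unraveled.
print(np.allclose(casorati[:, 0], movie[:, :, 0].ravel()))  # -> True
```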
[00:13:55] As with most reconstruction algorithms, you'll also need to define your forward measurement operator, or encoding operator, which maps your image to your k-space measurements. More concretely, this often includes multiple parts: multiplication by coil sensitivities, a Fourier transform, and a sampling mask.
[00:14:12] The k-space sampling mask can also be replaced by non-Cartesian Fourier encoding. And although these operators aren't explicitly constructed, it will also be useful to be able to perform the adjoint, or conjugate transpose, of this forward operator, which often shows up in the gradient calculations of the data consistency term.
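Here's a minimal sketch of such a forward operator and its adjoint, using random stand-in coil sensitivities and a random Cartesian mask (hypothetical data, not any particular toolbox's API). The final check verifies the defining adjoint identity.

```python
import numpy as np

rng = np.random.default_rng(3)
nx, ny, nc = 16, 16, 4
# Random stand-ins for coil sensitivity maps and an undersampling mask.
coils = rng.standard_normal((nc, nx, ny)) + 1j * rng.standard_normal((nc, nx, ny))
mask = rng.random((nx, ny)) < 0.4

def forward(x):
    # E: multiply by coil sensitivities, Fourier transform, then sample.
    return mask * np.fft.fft2(coils * x, norm="ortho")

def adjoint(y):
    # E^H: apply the mask, inverse Fourier, conjugate coil combination.
    return np.sum(np.conj(coils) * np.fft.ifft2(mask * y, norm="ortho"), axis=0)

# Adjoint check: <E x, y> must equal <x, E^H y> for any x and y.
x = rng.standard_normal((nx, ny)) + 1j * rng.standard_normal((nx, ny))
y = rng.standard_normal((nc, nx, ny)) + 1j * rng.standard_normal((nc, nx, ny))
lhs = np.vdot(forward(x), y)
rhs = np.vdot(x, adjoint(y))
print(np.allclose(lhs, rhs))  # -> True
```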
[00:14:33] So next, you need to choose a way to enforce the low-rank constraint. And so there are a number of choices you can make here. You can choose a convex approach using the nuclear norm. We can choose a non-convex approach using a strict rank constraint. We can also enforce a low-rank property globally over the entire matrix, or we can enforce the low rank property sort of patch wise or locally.
[00:15:00] The nuclear norm is the convex envelope of the rank functional, and it's the matrix equivalent of using the L1 norm as a convex surrogate for sparsity. It's defined as the sum of the singular values of the matrix. So we can then construct a cost function that directly penalizes the nuclear norm of the estimated matrix.
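As a quick sanity check of that definition, a toy example (NumPy also exposes the nuclear norm directly via `ord='nuc'`):

```python
import numpy as np

rng = np.random.default_rng(6)
A = rng.standard_normal((4, 2)) @ rng.standard_normal((2, 6))  # rank 2

s = np.linalg.svd(A, compute_uv=False)
nuclear = s.sum()  # nuclear norm: the sum of the singular values
print(np.isclose(nuclear, np.linalg.norm(A, ord='nuc')))  # -> True
print(int(np.sum(s > 1e-10)))                             # -> 2
```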
[00:15:22] We can also try to directly enforce the rank of the estimated matrix by using a set constraint that the matrix must have a specified rank r or lower. The tricky thing is that the set of rank-r matrices is not convex. We've actually already seen, when we discussed the SVD, that adding multiple rank-one matrices can result in a higher-rank matrix, which obviously violates the convexity condition.
[00:15:47] However, we can formulate a non-convex constrained cost function that minimizes data consistency over the set of rank-r matrices. This approach, however, will be sensitive to initialization.
[00:16:01] In addition to choosing the type of low-rank penalty or constraint, we often have the choice over how the low rank model is defined over the data. And we can choose a global rank model in which the entire data matrix is assumed to fit into this low rank framework. And you might want to choose a model like this when you know that the entire data set shares the same singular vectors and that there isn't really any difference between voxels that are close or distant in space.
[00:16:26] We can also choose to enforce the low-rank constraints on a local, patch-wise basis. That is, each image patch or spatial neighborhood can have its own low-rank representation, and there is no global sharing of information about the singular vectors with other patches. One way of formulating the cost function using the nuclear norm penalty is similar to the globally low-rank case, except we have an operator R_p, which selects image patches, and then we sum the nuclear norm penalty over all of these patches.
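One way to sketch this locally low-rank penalty is below, with the patch operator taken as non-overlapping patch extraction purely for simplicity (real implementations often use overlapping or randomly shifted patches; this is an illustrative assumption, not a specific method's definition):

```python
import numpy as np

def llr_penalty(X, p):
    # Sum of nuclear norms over non-overlapping p x p spatial patches of a
    # (nx, ny, nt) space-time array: each patch contributes the nuclear
    # norm of its own Casorati matrix.
    nx, ny, nt = X.shape
    total = 0.0
    for i in range(0, nx, p):
        for j in range(0, ny, p):
            patch = X[i:i + p, j:j + p].reshape(-1, nt)
            total += np.linalg.svd(patch, compute_uv=False).sum()
    return total

rng = np.random.default_rng(7)
X = rng.standard_normal((8, 8, 5))
print(llr_penalty(X, 4) > 0)  # -> True; it is zero for an all-zero array
```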
[00:16:57] And so we might want to use a model like this when we want more flexibility and representational power in our data, as the global matrix is no longer specifically constrained to have low rank. This is also useful when we know that certain factors, such as smoothly varying off-resonance, will reduce long-range signal correlations across space but will still leave a locally low-rank property.
[00:17:18] There are other ways to formulate low-rank penalties for image reconstruction, including combinations with other penalties or formulations, such as low-rank plus sparse decompositions (sometimes referred to as robust PCA), combined low-rank and sparse penalties, and exploiting low-rank tensor structure. Although I won't discuss these any further in this talk, I'll refer you to the references at the bottom for further reading.
[00:17:47] Finally, after framing the optimization problem with a suitable cost function, we need a way to solve it. Usually this involves an iterative first-order method, which means that at most we need to compute a gradient. So I'll discuss briefly two relatively straightforward approaches, although there are many different algorithms that can solve rank-constrained minimization problems.
[00:18:07] The first is an SVD free method using a matrix factorization approach. And the second is an SVD based approach, a proximal gradient method involving singular value thresholding for example.
[00:18:22] One simple approach to solving the non-convex rank-constrained problem is by actually splitting the minimization problem into two variables, U and V, which each have at most rank r by construction, and therefore their product has at most rank r, satisfying the constraint. We can solve this using a simple alternating minimization approach. One of the advantages of this approach is that it doesn't use the SVD at all, which can be computationally expensive for large matrices. Instead, each subproblem in the alternating scheme is simply a linear least-squares problem, which can be solved fairly easily. A slightly more sophisticated factorization approach can be found in the 2009 Haldar and Hernando paper that you can see at the bottom.
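To illustrate the idea, here's a toy alternating least-squares sketch for low-rank matrix completion with a sampling mask. It deliberately omits the Fourier encoding and coil model of a real MRI reconstruction, and the data and sampling pattern are made up; it only demonstrates the alternating, SVD-free structure.

```python
import numpy as np

rng = np.random.default_rng(5)
m, n, r = 20, 15, 2
A_true = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))
mask = rng.random((m, n)) < 0.6  # hypothetical sampling pattern
Y = mask * A_true                # observed (zero-filled) data

# Alternating least squares: fix V and solve for U row by row,
# then fix U and solve for V column by column; U @ V has rank <= r
# by construction, so the rank constraint is satisfied automatically.
U = rng.standard_normal((m, r))
V = rng.standard_normal((r, n))
for _ in range(200):
    for i in range(m):
        obs = mask[i]
        U[i] = np.linalg.lstsq(V[:, obs].T, Y[i, obs], rcond=None)[0]
    for j in range(n):
        obs = mask[:, j]
        V[:, j] = np.linalg.lstsq(U[obs], Y[obs, j], rcond=None)[0]

err = np.linalg.norm(U @ V - A_true) / np.linalg.norm(A_true)
print(err)  # small relative error: the missing entries are recovered
```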
[00:19:06] One thing I also want to point out here is that in this framework, sometimes one of the variables say V is actually predetermined using training data. And this can take the form of say estimating temporal basis vectors from fully sampled, low spatial resolution training data. And this is actually the approach used in the seminal work by Liang in his partially separable functions paper from 2007.
[00:19:30] The other reconstruction approach using the SVD employs a soft thresholding of singular values, somewhat analogous to soft thresholding algorithms used in sparse reconstruction problems. This approach essentially alternates between taking a gradient step with respect to the data consistency, and then computing the SVD, shrinking the singular values by some factor towards zero with a soft threshold, and then iterating.
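The singular value soft-thresholding step is the proximal operator of the nuclear norm and is easy to sketch; in the full algorithm it would alternate with a data-consistency gradient step, which is omitted here for brevity:

```python
import numpy as np

def svt(X, tau):
    # Proximal operator of tau * (nuclear norm): soft-threshold the
    # singular values toward zero, keeping the singular vectors.
    U, s, Vh = np.linalg.svd(X, full_matrices=False)
    return (U * np.maximum(s - tau, 0)) @ Vh

rng = np.random.default_rng(4)
X = rng.standard_normal((6, 5))
print(np.allclose(svt(X, 0.0), X))  # -> True: zero threshold changes nothing

# Thresholding shrinks the nuclear norm (the sum of singular values).
Y = svt(X, 0.5)
print(np.linalg.svd(Y, compute_uv=False).sum()
      < np.linalg.svd(X, compute_uv=False).sum())  # -> True
```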
[00:19:53] And you can read more about these types of approaches in the references at the bottom as well.
[00:20:01] I just wanted to touch on some other considerations, including sampling and parameter selection. For sampling: if you have training data, for example a low-resolution, fully sampled reference portion of k-space over which you're going to train or pre-train one of your subspaces, then your undersampling for the general recovery problem can be regular, because you're just solving a linear least-squares fit.
[00:20:26] But otherwise, if you're actually solving the rank constrained problem more generally without training data, you typically need more random incoherent sampling, similar to the compressed sensing requirements.
[00:20:38] And parameter selection is again similar to compressed sensing. If you have training data or other reference data available, you can try and derive parameters from that.
[00:20:48] Otherwise, they're typically empirically chosen, although you can, you know, choose to solve cost functions with model order penalties as well.
[00:20:58] Now we're going to talk about structured low-rank methods. These methods are similar to general low-rank methods, but now it's not the data itself that is low-rank, but rather a transformation of this input data. The advantage of this is that it doesn't require multi-dimensional datasets; you can generate a structured low-rank matrix from a single image.
[00:21:16] So you don't need images by time or images by some other encoding. We can leverage low-rank properties that come from other types of constraints. And typically this involves collecting points of k-space using a kernel and concatenating them to form a Hankel structured matrix.
[00:21:34] So how do we construct this Hankel structured matrix?
[00:21:36] Well, we take small windows or kernels of our k-space and unravel them into vectors. You can see that illustrated here with a three-by-three kernel over the k-space on the left, unraveled into a row vector on the right. And then we slide or shift this kernel around and stack the resulting vectors.
[00:22:01] This procedure, if you follow it through for the entire k-space, generates a block Hankel structured matrix. Within each colored block, we can see constant anti-diagonals, and we can also see that it has another type of anti-diagonal structure at the block level.
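The kernel-sliding construction can be sketched directly; the "k-space" here is just a toy array of integers so that the repeated (anti-diagonal) entries are easy to see:

```python
import numpy as np

def hankel_from_kspace(k, kx, ky):
    # Slide a (kx, ky) kernel over the 2-D k-space and stack each
    # unraveled patch as a row of the structured matrix.
    nx, ny = k.shape
    rows = [k[i:i + kx, j:j + ky].ravel()
            for i in range(nx - kx + 1)
            for j in range(ny - ky + 1)]
    return np.array(rows)

k = np.arange(25, dtype=float).reshape(5, 5)  # toy 5x5 "k-space"
H = hankel_from_kspace(k, 3, 3)
print(H.shape)             # -> (9, 9): nine kernel positions, nine entries each
print(H[0, 1] == H[1, 0])  # -> True: entries repeat along anti-diagonals
```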
[00:22:20] Intuitively, we can see that these Hankel matrices naturally arise by considering convolution operations across the k-space. Essentially, the Hankel matrix transformation, sometimes referred to as a lifting of our signal x, is setting up the convolution operation with some sort of vectorized y.
[00:22:40] Then it's not too hard to imagine that the low-rank properties of the Hankel matrix arise from signal convolution arguments.
[00:22:51] The low-rank nature of these Hankel structured matrices is definitely less intuitive than the low-rank models of the data directly, like we talked about in the previous section. So I'll try and give you some intuition here based on an example, from the limited support constraint arguments in the LORAKS paper.
[00:23:10] The argument is actually very simple. Let's say we have an image. Typically it's not going to be zero everywhere. And you know, that that's ultimately what limited support means. It's only non-zero on a sort of subset of voxel locations. So then typically we can identify some sort of function that when it's multiplied with the image, it doesn't change it at all.
[00:23:29] And we can see that here in image A, which is the Shepp-Logan phantom; there are a lot of background regions that are assumed to be zero. So if we have a function that is non-zero in the regions where the signal is zero, which we denote in red in the center, then what we're able to do is multiply our image with this function and leave the image unchanged, right? Because we're obviously multiplying by zero.
[00:24:00] So if A times w here equals A, we can rearrange this equation into something that equals zero, like you see here, and then we can group the terms together.
[00:24:09] And then for simplicity, we can redefine w, right? So now we have an expression where A times something equals zero. And then we can take the Fourier transform of this expression, which turns it into a convolution equation. And as we mentioned earlier, we can represent convolutions as multiplications with a Hankel matrix.
[00:24:28] Right? So we can do that here. We can rewrite the convolution equation as a matrix multiplication with a Hankel structured matrix on the left. But now, with the existence of this linear matrix equation, we can realize that what this is expressing is that this Hankel matrix has a non-trivial null space, because we know that there's a non-trivial vector that, when we multiply by it, equals zero.
[00:24:55] And so in turn we know from this rank-nullity theorem, that the existence of a nullspace means that the matrix has reduced rank. And what that means again, is that the column and the row space have reduced dimensionality because some of those dimensions belong to the null space now. And ultimately if the null space is sufficiently large, we end up with the low rank matrix.
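Here's a toy 1-D demonstration of this support-implies-low-rank idea (a simplified stand-in for the 2-D LORAKS argument): an image supported on just a few pixels yields a k-space signal that is a short sum of complex exponentials, and the Hankel matrix built from that signal is rank-deficient, i.e. it has a non-trivial null space.

```python
import numpy as np

# A 1-D "image" supported on only three pixels. Its k-space signal is a
# sum of three complex exponentials, so a Hankel-style matrix built
# from sliding windows of it has rank at most three.
n = 32
x = np.zeros(n)
x[[2, 5, 9]] = [1.0, -2.0, 0.5]
X = np.fft.fft(x)

L = 8  # sliding window length
H = np.array([X[i:i + L] for i in range(n - L + 1)])
print(np.linalg.matrix_rank(H))  # -> 3, far below min(25, 8)
```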
[00:25:14] In addition to limited support arguments, we can also identify other constraints that lead to structured low-rank matrices. These include phase constraints, such as those used in the LORAKS method with partial Fourier type reconstructions, the ALOHA method for correcting EPI Nyquist ghosts, or the MUSSELS method for correcting multi-shot phase inconsistencies.
[00:25:39] We can also look at coil sensitivity constraints, which are included in methods such as PRUNO, ESPIRiT, SAKE, and P-LORAKS, as well as transform sparsity constraints, which unfortunately don't have nice acronyms associated with them.
[00:25:55] Despite the considerable conceptual differences in how the data is constrained, mechanically the low-rank optimization problems are quite similar to one another, and similar to the problems in the previous section, the obvious difference being that there's an extra linear operator that maps the data into its Hankel matrix representation.
[00:26:13] Otherwise, however, aside from working out the details on the adjoint or pseudo-inverse of the Hankel operator, the procedure is very similar. There are also reconstruction algorithms that have been specially developed for structured low-rank problems that explicitly try and exploit the convolutional structure of this Hankel matrix, and one example, is the GIRAF method, which you can read about in the reference at the bottom.
[00:26:38] Finally, I also wanted to touch on sampling considerations, about which there actually isn't that much known in terms of what is optimal in these structured low-rank contexts. If we examine an unconventional sampling strategy with random contiguous missing chunks, something we would not normally use in a conventional low-rank context, then after the Hankel transformation we can see that the Hankel matrix actually looks more randomly sampled, and every row and every column has samples in it, which would permit low-rank recovery.
[00:27:16] And we can see here an example from the LORAKS approach that's highlighting the performance of structured low-rank methods on, again, very unconventional k-space sampling patterns, like checkerboard or partial Fourier, or just a random sort of logo sampling, where sparsity or total variation type approaches don't work uniformly well. The structured low-rank approach works very well, as you can see in the far right-hand column.
[00:27:50] The last thing I wanted to leave you with were some other resources for further reading on this topic. There are obviously a lot of topics that I skimmed over or completely omitted in this talk. But recently, IEEE Signal Processing Magazine came out with an issue on computational MRI, and I've highlighted three excellent review articles related to the content of this presentation.
[00:28:12] The first one relates to dynamic MRI using low-rank methods, the second is on linear predictability and shift-invariant structures in MRI, and the last one is explicitly about structured low-rank approaches to reconstruction. With that, thank you very much for your attention.