
Linear Algebra


Linear algebra is the backbone of deep learning. As Andrew Ng famously said,

“Linear algebra is the language of deep learning”.

Vectors and Matrices

Several types of mathematical objects are important in the study of linear algebra:

  • Scalar: a single number, in contrast with arrays, matrices, etc., which are collections of numbers.
  • Vector: An ordered list of numbers (scalars) often represented as a column.
  • e.g., a vector in $\mathbb{R}^n$ can be written as $\mathbf{v} = \begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{bmatrix}$

  • Matrix (Matrices): A rectangular array of numbers of size $m \times n$ ($m$ rows, $n$ columns), i.e. a 2-D array. We often denote a matrix as $A = [a_{ij}]$, where $a_{ij}$ is the entry in the $i$-th row and $j$-th column.
  • e.g. a $2\times 3$ matrix: $A = \begin{bmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \end{bmatrix}$. A column vector is a special case of a matrix (size $n \times 1$).

  • Tensors are generalizations of matrices to higher dimensions (used to store multi-dimensional data in deep learning).
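
In NumPy terms, all of these objects are `ndarray`s that differ only in their number of dimensions. A minimal sketch (variable names and values are just for illustration):

```python
import numpy as np

s = 3.5                          # scalar: a single number
v = np.array([1.0, 2.0, 3.0])    # vector in R^3  -> ndim 1
M = np.array([[1, 2, 3],
              [4, 5, 6]])        # 2x3 matrix     -> ndim 2
T = np.zeros((2, 3, 4))          # 3-D tensor     -> ndim 3

print(v.ndim, M.ndim, T.ndim)    # 1 2 3
print(M.shape)                   # (2, 3): m rows, n columns
```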

  • Matrix Operations

  • Transpose: the mirror image of the matrix across the main diagonal, $(A^T)_{i,j} = A_{j,i}$. e.g., $A = \begin{bmatrix} \boldsymbol{a_{11}} & a_{12} \\ a_{21} & \boldsymbol{a_{22}} \\ a_{31} & a_{32} \end{bmatrix} \Rightarrow A^T = \begin{bmatrix} \boldsymbol{a_{11}} & a_{21} & a_{31} \\ a_{12} & \boldsymbol{a_{22}} & a_{32} \end{bmatrix}$
  • Addition: for matrices $A$ and $B$ of the same size, e.g. $m\times n$, $(A + B)_{ij} = A_{ij} + B_{ij}$ (add corresponding entries).
  • Scalar Multiplication: for a scalar $c$, $(cA)_{ij} = c \cdot A_{ij}$.
  • Matrix Multiplication: if $A$ is $m\times n$ and $B$ is $n\times p$, their product $C = AB$ is an $m\times p$ matrix with entries $C_{ij} = \sum_{k=1}^{n} A_{ik}\, B_{kj}$ (the dot product of row $i$ of $A$ with column $j$ of $B$).
  • Matrix multiplication is distributive, $A(B+C) = AB + AC$, and associative, $A(BC) = (AB)C$, but it is NOT commutative in general: $AB \neq BA$.

    BUT, the dot product of vectors is commutative, $x^Ty = y^Tx$:

    Since $(AB)^T = B^T A^T$ and $(A^T)^T = A$, and $x^Ty$ is a scalar (a $1\times 1$ matrix equal to its own transpose), we have $x^Ty = (x^Ty)^T = y^T(x^T)^T = y^Tx$.

    Now we can introduce the system of linear equations, the main problem notation we are going to solve: $Ax = b$, where $\boldsymbol{A}\in\mathbb{R}^{m\times n}$ is a matrix, $\boldsymbol{b}\in\mathbb{R}^{m}$ is a known vector, and $\boldsymbol{x}\in\mathbb{R}^{n}$ is a vector of unknowns.
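
As a quick sanity check of these operations and of the $Ax=b$ notation, here is a minimal NumPy sketch (the matrices and vectors are arbitrary values chosen for illustration):

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
B = np.array([[0.0, 1.0],
              [1.0, 0.0]])
b = np.array([5.0, 10.0])

print(A.T)                          # transpose
print(A + B)                        # element-wise addition
print(3 * A)                        # scalar multiplication
print(np.allclose(A @ B, B @ A))    # False: matrix product is not commutative

# The dot product of vectors IS commutative
x, y = np.array([1.0, 2.0]), np.array([3.0, 4.0])
print(x @ y == y @ x)               # True

# Solve the linear system Ax = b
x_sol = np.linalg.solve(A, b)
print(np.allclose(A @ x_sol, b))    # True
```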


    Additional Concepts

  • Identity Matrix $I$: Mathematically, $I_n \in \mathbb{R}^{n\times n}$ and $\forall x \in \mathbb{R}^{n},\ I_{n}x = x$. It acts as the multiplicative identity: $AI = A$ and $IA = A$. Simply, $I = \operatorname{diag}(1,1,\dots,1)$, i.e. $I = \begin{bmatrix} 1 & 0 & 0 & \cdots & 0 \\ 0 & 1 & 0 & \cdots & 0 \\ 0 & 0 & \ddots & \ddots & \vdots \\ \vdots & \vdots & \ddots & \ddots & 0 \\ 0 & 0 & \cdots & 0 & 1 \end{bmatrix}$.
  • Inverse: For a square matrix $A$, the inverse $A^{-1}$ satisfies $AA^{-1} = A^{-1}A = I$. Only invertible (non-singular) matrices have inverses, which requires full rank (and no zero eigenvalues; see rank below). e.g. For a $2\times2$ matrix $A=\begin{pmatrix}a & b\\ c & d\end{pmatrix}$, $A^{-1} = \frac{1}{ad - bc}\begin{pmatrix} d & -b\\ -c & a \end{pmatrix}$ (provided $ad - bc \neq 0$). With the inverse we can solve $x$ from $Ax=b$:
  • $Ax = b \;\Rightarrow\; A^{-1}Ax = A^{-1}b \;\Rightarrow\; I_nx = A^{-1}b \;\Rightarrow\; x = A^{-1}b$
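
A minimal NumPy check of the $2\times2$ inverse formula above (the matrix entries are arbitrary, chosen so that $ad - bc \neq 0$):

```python
import numpy as np

A = np.array([[4.0, 7.0],
              [2.0, 6.0]])
a, b, c, d = A.ravel()

# Closed-form 2x2 inverse: (1 / (ad - bc)) * [[d, -b], [-c, a]]
A_inv = (1.0 / (a * d - b * c)) * np.array([[d, -b],
                                            [-c, a]])

print(np.allclose(A_inv, np.linalg.inv(A)))   # True
print(np.allclose(A @ A_inv, np.eye(2)))      # True: A A^{-1} = I
```

(In practice, `np.linalg.solve(A, b)` is preferred over explicitly forming $A^{-1}$ when the goal is only to solve $Ax=b$.)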
  • Determinant ($\det A$ or $|A|$): A scalar value defined for square matrices that encodes the volume scaling factor of the linear transformation and whether it flips orientation.
  • e.g. For the $2\times2$ matrix above, $\det(A) = ad - bc$. For larger matrices, it is computed via a recursive formula or row reduction.

    $\det(AB) = \det(A)\det(B)$,

    $\det(A)\neq 0 \iff A\ \text{is invertible}$.
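
A short NumPy illustration of these determinant properties (the matrices are random or contrived, only for demonstration):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
B = rng.standard_normal((3, 3))

# det(AB) = det(A) det(B)
print(np.isclose(np.linalg.det(A @ B),
                 np.linalg.det(A) * np.linalg.det(B)))   # True

# det(A) != 0  <=>  A is invertible
S = np.array([[1.0, 2.0],
              [2.0, 4.0]])       # second row = 2 * first row
print(np.linalg.det(S))          # ~0.0, so S is singular (not invertible)
```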

  • Trace: The trace of a square matrix is the sum of its diagonal entries: $\mathrm{tr}(A) = \sum_{i=1}^n a_{ii}$. Notable properties: the trace equals the sum of the eigenvalues of $A$ (counting multiplicity), and it is invariant under change of basis.
  • Rank: The rank of a matrix is the number of linearly independent rows (which equals the number of linearly independent columns). It is the dimension of the subspace spanned by the columns (the column space). Full rank means rank $= \min(m,n)$; for a square $n\times n$ matrix, full rank (rank $= n$) means the matrix is invertible. Low rank means the matrix's rows/columns are linearly dependent (some redundancy in the information).
  • e.g., $\begin{bmatrix} 1 & 0 & 1\\ 0 & 1 & 1\\ 0 & 1 & 1 \end{bmatrix}$ has rank 2: the first two columns are linearly independent, so the rank is at least 2, but since the third column is a linear combination of the first two, the three columns are linearly dependent and the rank must be less than 3.

    How to compute the rank of a matrix:

    Given a matrix $A = \begin{bmatrix} 1 & 2 & 1 & 1 \\ -2 & -3 & 1 & 1 \\ 3 & 5 & 0 & 0 \end{bmatrix} \xrightarrow{2R_1 + R_2 \to R_2} \begin{bmatrix} 1 & 2 & 1 & 1 \\ 0 & 1 & 3 & 3 \\ 3 & 5 & 0 & 0 \end{bmatrix} \xrightarrow{-3R_1 + R_3 \to R_3} \begin{bmatrix} 1 & 2 & 1 & 1 \\ 0 & 1 & 3 & 3 \\ 0 & -1 & -3 & -3 \end{bmatrix} \xrightarrow{R_2 + R_3 \to R_3} \begin{bmatrix} 1 & 2 & 1 & 1 \\ 0 & 1 & 3 & 3 \\ 0 & 0 & 0 & 0 \end{bmatrix} \xrightarrow{-2R_2 + R_1 \to R_1} \begin{bmatrix} 1 & 0 & -5 & -5 \\ 0 & 1 & 3 & 3 \\ 0 & 0 & 0 & 0 \end{bmatrix}$

    The final matrix (in reduced row echelon form) has two non-zero rows, and thus the rank of matrix $A$ is 2.
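
The same rank can be confirmed numerically; a minimal NumPy check on the matrix above:

```python
import numpy as np

A = np.array([[ 1,  2, 1, 1],
              [-2, -3, 1, 1],
              [ 3,  5, 0, 0]])

print(np.linalg.matrix_rank(A))   # 2
```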

  • Norm: used to measure the size of a vector (or matrix). The $L^p$ norm is $\left\|x\right\|_p = \left(\sum_i |x_i|^p\right)^{\frac{1}{p}}$, for $p \in \mathbb{R},\ p\geq 1$. Some commonly used norms:
  • Euclidean Norm ($L^2$): $\left\|x\right\|_2$, often denoted simply $\left\|x\right\|$; generally known as the default 'distance' in geometry.
  • $L^1$ norm: $\left\|x\right\|_1 = \sum_i |x_i|$, commonly used in machine learning.
  • Max Norm ($L^\infty$): $\left\|x\right\|_\infty = \max_i |x_i|$.
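
A quick NumPy illustration of the three norms on an example vector (values are arbitrary):

```python
import numpy as np

x = np.array([3.0, -4.0, 1.0])

print(np.linalg.norm(x, 1))        # L1 norm:  |3| + |-4| + |1| = 8
print(np.linalg.norm(x, 2))        # L2 norm:  sqrt(9 + 16 + 1)
print(np.linalg.norm(x, np.inf))   # max norm: 4
```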
  • Special matrices

  • Symmetric matrix: $A = A^T$.
  • Unit Vector: a vector with unit norm, $\left\|x\right\|_2 = 1$.
  • Orthogonal: if $x^Ty = 0$, then $x$ and $y$ are orthogonal vectors; if $AA^T = A^TA = I$, then $A$ is an orthogonal matrix, which implies $A^{-1} = A^T$. The inverse of an orthogonal matrix is therefore very cheap and easy to compute.
  • e.g., $A = \frac{1}{7}\begin{bmatrix} 3 & 2 & 6 \\ -6 & 3 & 2 \\ 2 & 6 & -3 \end{bmatrix} \Rightarrow A^T = \frac{1}{7}\begin{bmatrix} 3 & -6 & 2 \\ 2 & 3 & 6 \\ 6 & 2 & -3 \end{bmatrix} \Rightarrow AA^T = \frac{1}{49}\begin{bmatrix} 49 & 0 & 0 \\ 0 & 49 & 0 \\ 0 & 0 & 49 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} = I \Rightarrow A \text{ is an orthogonal matrix.}$
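
The same check in NumPy, using the matrix from the example above:

```python
import numpy as np

A = np.array([[ 3.0, 2.0,  6.0],
              [-6.0, 3.0,  2.0],
              [ 2.0, 6.0, -3.0]]) / 7.0

print(np.allclose(A @ A.T, np.eye(3)))       # True: A is orthogonal
print(np.allclose(np.linalg.inv(A), A.T))    # True: the inverse is just the transpose
```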


    Decompositions

    Both eigendecomposition and singular value decomposition are used in the Principal Component Analysis (PCA) algorithm.

    Eigenvalues and Eigenvectors

  • For a square matrix $A \in \mathbb{R}^{n\times n}$, a non-zero vector $v$ is an eigenvector if $Av = \lambda v$ for some scalar $\lambda$. Here $\lambda$ is called the eigenvalue corresponding to $v$.
  • To determine the eigenvalues of $A$, solve the characteristic equation $\det(A - \lambda I) = 0$; then find each eigenvector by solving $(A - \lambda I)v = 0$.

    Eigendecomposition

    Assuming $A$ has $n$ linearly independent eigenvectors $\left\{v^{(1)}, \dots, v^{(n)} \right\}$ with corresponding eigenvalues $\left\{\lambda^{(1)}, \dots, \lambda^{(n)} \right\}$, let $V = [v^{(1)}, \dots, v^{(n)}]$ and $\lambda = [\lambda^{(1)}, \dots, \lambda^{(n)}]^T$. The eigendecomposition of $A$ is then:

    $A = V \operatorname{diag}(\lambda) V^{-1}$

    For Symmetric Matrices,

    $A = Q\Lambda Q^T$

    where $Q$ is an orthogonal matrix whose columns are the eigenvectors of $A$, and $\Lambda$ is a diagonal matrix of the eigenvalues. Geometrically, $A$ scales space by a factor $\lambda_i \in \Lambda$ (length) along each direction $v^{(i)} \in Q$.

    🤖: This technique is used in 🧠 PCA via eigendecomposition of the correlation (or covariance) matrix
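
A minimal NumPy sketch (with a small symmetric matrix chosen just for illustration) that reconstructs $A$ from its eigendecomposition and checks the trace property mentioned earlier:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])                   # symmetric, so A = Q Lambda Q^T

eigvals, Q = np.linalg.eigh(A)               # eigh: for symmetric/Hermitian matrices

print(np.allclose(Q @ np.diag(eigvals) @ Q.T, A))   # True: reconstruction
print(np.allclose(Q @ Q.T, np.eye(2)))              # True: Q is orthogonal
print(np.isclose(np.trace(A), eigvals.sum()))       # True: tr(A) = sum of eigenvalues
```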

    Singular Value Decomposition (SVD)

    Singular value decomposition is a fundamental matrix factorization. It states that any real $m\times n$ matrix $A$ can be factored as:

    $A = U\,\Sigma\,V^T$
  • $V$ is an $n\times n$ orthogonal matrix (a rotation or reflection);
  • $\Sigma$ is an $m\times n$ diagonal matrix (with nonnegative entries) that scales the axes by the singular values;
  • $U$ is an $m\times m$ orthogonal matrix (a second rotation or reflection).
  • The diagonal entries $\sigma_1,\sigma_2,\dots$ of $\Sigma$ are the singular values of $A$. They are conventionally ordered $\sigma_1 \ge \sigma_2 \ge \cdots \ge 0$.

    SVD says that $A$ = (orthogonal rotation/reflection) $\times$ (scaling) $\times$ (orthogonal rotation/reflection).
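
In code the factorization is usually obtained from a library routine; a minimal NumPy sketch (the matrix is random, just for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))              # m = 4, n = 3

U, s, Vt = np.linalg.svd(A, full_matrices=True)   # U: 4x4, s: (3,), Vt: 3x3

# Rebuild the m x n matrix Sigma and reconstruct A
Sigma = np.zeros(A.shape)
np.fill_diagonal(Sigma, s)

print(np.allclose(U @ Sigma @ Vt, A))        # True: A = U Sigma V^T
print(np.all(s[:-1] >= s[1:]))               # True: singular values sorted descending
```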

    🧠 How to compute SVD:

    The following algorithm is mathematically valid, but in practice this simple method can lead to numerical instability and increased computational complexity.

    1. Compute the eigenvalues and eigenvectors of $W = A^TA$:

    $$W = A^TA \\ (W - \lambda I)x = 0 \\ \text{ordered so that } \lambda_1\geq\lambda_2\geq\cdots\geq\lambda_n\geq 0$$

    Substituting each $\lambda_i$ back into the eigen-equation $(W - \lambda I)x = 0$ yields the eigenvectors $x$.

    2. Normalize the eigenvectors $x$ into unit vectors $v_1, v_2, \cdots, v_n$, giving $V = \begin{bmatrix} v_1 & v_2 & \cdots & v_n \end{bmatrix}$.

    3. The singular values are $\sigma_i = \sqrt{\lambda_i},\ i = 1,2,\cdots,n$; therefore the square diagonal matrix is $\Sigma = \operatorname{diag}(\sigma_1, \sigma_2, \cdots, \sigma_n)$.

    4. For the first $r$ positive singular values of $A$:

    $$u_j = \frac{1}{\sigma_j}Av_j,\quad j=1,2,\cdots,r \\ U_1 = \begin{bmatrix} u_1 & u_2 & \cdots & u_r \end{bmatrix}$$

    Then find a set of orthonormal bases $\{u_{r+1}, u_{r+2}, \cdots, u_{m}\}$ in the null space of $A^T$:

    $$U_2 = \begin{bmatrix} u_{r+1} & u_{r+2} & \cdots & u_{m} \end{bmatrix},\qquad U = \begin{bmatrix} U_1 & U_2 \end{bmatrix}$$
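
A minimal NumPy/SciPy sketch of this procedure, on a small full-column-rank matrix chosen for illustration (`scipy.linalg.null_space` supplies the orthonormal basis of the null space of $A^T$), verifying that the computed factors reconstruct $A$:

```python
import numpy as np
from scipy.linalg import null_space

A = np.array([[3.0, 1.0],
              [1.0, 3.0],
              [1.0, 1.0]])                   # m = 3, n = 2, rank r = 2

# Steps 1-2: eigen-decompose W = A^T A and sort eigenpairs in descending order
lam, V = np.linalg.eigh(A.T @ A)
order = np.argsort(lam)[::-1]
lam, V = lam[order], V[:, order]

# Step 3: singular values
sigma = np.sqrt(lam)

# Step 4: U1 from u_j = (1/sigma_j) A v_j, U2 from the null space of A^T
U1 = (A @ V) / sigma                         # divides column j by sigma_j
U2 = null_space(A.T)                         # here m - r = 1 extra column
U = np.hstack([U1, U2])

# Sigma padded to m x n so that U (m x m) @ Sigma @ V^T has the right shape
Sigma = np.zeros(A.shape)
np.fill_diagonal(Sigma, sigma)

print(np.allclose(U @ Sigma @ V.T, A))       # True: A = U Sigma V^T
```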


    🤖: This technique is used in 🧠 PCA via the SVD method