1 Review of Convex Optimization
This chapter consolidates the prerequisite material from S&DS 431/631 that subsequent chapters build upon: convex sets, convex functions, operations preserving convexity, optimality conditions, and the notation for norms and dual norms used throughout the course.
1.1 Convex Sets
A convex set is the most fundamental geometric object in optimization. Many algorithmic guarantees — from the convergence of gradient descent to the existence of supporting hyperplanes — rely on convexity of the underlying domain. We begin by recalling the definition and the key separation theorems.
Definition 1.1 (Convex Set) A set C \subseteq \mathbb{R}^d is convex if and only if for all x, y \in C and all \lambda \in [0,1],
\lambda x + (1-\lambda)y \in C.
Geometrically, the line segment connecting any two points in C lies entirely within C.
1.1.1 Separating Hyperplane Theorems
When two convex sets do not overlap, we can always find a hyperplane that separates them. This result is the geometric backbone of duality theory and will be used extensively in the next chapter.
Theorem 1.1 (Separating Hyperplane Theorems)
- If C, D are convex and C \cap D = \emptyset, then there exists a separating hyperplane that weakly separates C and D.
- If, in addition, C is compact and D is closed, then C and D can be strictly separated.
Recall that a compact set is a set that is both closed and bounded.
1.2 Convex Functions
Convex functions are the natural class of objectives for which local information (gradients) yields global conclusions. Understanding their properties — especially the first-order and second-order characterizations — is essential for both algorithm design and convergence analysis.
Definition 1.2 (Convex Function) A function f : D \to \mathbb{R} is convex if:
- D \subseteq \mathbb{R}^d is a convex set, and
- For all x, y \in D and all \lambda \in [0,1], f\!\left(\lambda x + (1-\lambda)y\right) \leq \lambda \cdot f(x) + (1-\lambda) \cdot f(y).
1.2.1 Properties of Convex Functions
Convex functions enjoy several characterizations that connect their geometry (epigraphs, sublevel sets) to their calculus (gradients, Hessians). We collect the most important ones here.
Theorem 1.2 (Gradient Monotonicity) For a univariate convex function f, the derivative f' is nondecreasing. More generally, for differentiable convex f,
\langle \nabla f(x) - \nabla f(y),\, x - y \rangle \geq 0 \quad \forall\, x, y \in D.
Theorem 1.3 (Epigraph Characterization) f is convex if and only if its epigraph is convex:
\text{epi}(f) = \{(x, t) \in \mathbb{R}^{d+1} : f(x) \leq t\}.
Theorem 1.4 (Sublevel Sets) f convex \implies all sublevel sets of f are convex.
The converse does not hold. For example, f(x) = \sqrt{|x|} has convex sublevel sets but is not convex.
Theorem 1.5 (First-Order Condition (Global Underestimator)) If f is convex and differentiable, then f is above all tangent lines:
f(y) \geq f(x) + \langle \nabla f(x),\, y - x \rangle \quad \forall\, x, y.
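The global underestimator property is easy to check numerically. Below is a small sketch (the choice of function and sample points is illustrative): we test the inequality for the softplus function f(x) = \log(1 + e^x), whose derivative is the sigmoid.

```python
import numpy as np

# Illustrative convex function: softplus f(x) = log(1 + exp(x)),
# with derivative f'(x) = sigmoid(x).
f = lambda x: np.log1p(np.exp(x))
grad = lambda x: 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
for _ in range(1000):
    x, y = rng.normal(scale=3.0, size=2)
    # First-order condition: f(y) >= f(x) + f'(x)(y - x)
    assert f(y) >= f(x) + grad(x) * (y - x) - 1e-12
print("first-order condition holds on all random pairs")
```

Every tangent line lies below the graph, exactly as Theorem 1.5 predicts.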
Beyond basic convexity, two stronger notions — strict convexity and strong convexity — play critical roles in ensuring uniqueness of minimizers and fast convergence rates.
Definition 1.3 (Strictly Convex Function) A function f is strictly convex if for all x \neq y and \lambda \in (0,1),
f(\lambda x + (1-\lambda)y) < \lambda \cdot f(x) + (1-\lambda) \cdot f(y).
Definition 1.4 (Strongly Convex Function) f is m-strongly convex (m > 0) if and only if
x \mapsto f(x) - \frac{m}{2}\|x\|_2^2 \quad \text{is convex.}
1.2.2 Second-Order Characterization
When f is twice differentiable, convexity and its refinements can be checked via the Hessian. This characterization is the workhorse of verifying convexity in practice and will be central to the analysis of Newton’s method.
Theorem 1.6 (Second-Order Characterization) Assume f is twice differentiable and \text{dom}(f) is convex. Then:
- f is convex \iff \nabla^2 f(x) \succeq 0 for all x \in \text{dom}(f).
- f is strictly convex \impliedby \nabla^2 f(x) \succ 0 for all x \in \text{dom}(f). (The converse fails: f(x) = x^4 is strictly convex, yet f''(0) = 0.)
- f is m-strongly convex \iff \nabla^2 f(x) \succeq m \cdot I_d for all x \in \text{dom}(f).
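The second-order characterization can be verified mechanically for a quadratic. A minimal sketch (the matrix below is an illustrative choice): for f(x) = \frac{1}{2} x^\top A x with A symmetric, the Hessian is the constant matrix A, so its smallest eigenvalue certifies both convexity and the strong convexity parameter m.

```python
import numpy as np

# f(x) = 0.5 x^T A x has constant Hessian A; check its eigenvalues.
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
eigs = np.linalg.eigvalsh(A)   # ascending eigenvalues of the Hessian
m = eigs.min()
print("Hessian eigenvalues:", eigs)
print("convex:", m >= 0, "| m-strongly convex with m =", round(m, 4))
```

Here the smallest eigenvalue is (5 - \sqrt{5})/2 \approx 1.382, so f is both convex and strongly convex.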
1.2.3 How to Check Convexity of a Function
The definition and characterizations above give us three practical approaches for verifying convexity:
- Restriction to a line. Given f : D \to \mathbb{R}, if the univariate function g_{x_0, v}(t) = f(x_0 + t \cdot v) is convex in t for all x_0 \in D and all v \in \mathbb{R}^d, then f is convex. Note: \text{dom}(g_{x_0,v}) = \{t \in \mathbb{R} : x_0 + tv \in D\}.
- If f is twice differentiable, check \nabla^2 f(x) \succeq 0 for all x \in D.
- f is obtained from operations that preserve convexity.
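The restriction-to-a-line check can be carried out numerically as a sanity test. A sketch (the base point x_0 and direction v are arbitrary illustrative choices): we restrict the log-sum-exp function, a standard convex function, to a line and run a midpoint-convexity test on a grid.

```python
import numpy as np

# Restrict f = log-sum-exp (known convex) to the line x0 + t v and
# check midpoint convexity of g(t) = f(x0 + t v) on a grid.
def f(x):
    return np.log(np.sum(np.exp(x)))

x0 = np.array([1.0, -2.0, 0.5])   # illustrative base point
v = np.array([0.3, 1.0, -0.7])    # illustrative direction
g = lambda t: f(x0 + t * v)

ts = np.linspace(-5, 5, 21)
for t1 in ts:
    for t2 in ts:
        mid = 0.5 * (t1 + t2)
        assert g(mid) <= 0.5 * g(t1) + 0.5 * g(t2) + 1e-12
print("g passes the midpoint convexity test on the grid")
```

For continuous functions, midpoint convexity on a dense grid is strong numerical evidence of convexity along that line; repeating over many (x_0, v) pairs probes convexity of f itself.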
1.2.4 Example: Log-Determinant is Concave
Consider f(X) = \log\det(X) = \sum_{i=1}^d \log \lambda_i(X), where
D = \{X \in \mathbb{R}^{d \times d} : X \succ 0\} \quad (\text{convex domain}).
Reference: Page 74 of [BV].
We establish concavity via restriction to a line: we show that g(t) = \log\det(X_0 + tV) is concave in t for every direction V.
Proof. We show that \log\det(X) is concave.
Pick any X_0 \succ 0 and any V \in \mathbb{R}^{d \times d} symmetric. Define
g(t) = \log\det(X_0 + tV).
We compute:
\begin{aligned} g(t) &= \log\det\!\left(X_0^{1/2}\!\left(I + t\,X_0^{-1/2}V X_0^{-1/2}\right)\!X_0^{1/2}\right) \\ &= 2\log\det(X_0^{1/2}) + \log\det(I + t\,Y), \end{aligned}
where Y = X_0^{-1/2}V X_0^{-1/2}. Using \det(AB) = \det(A)\cdot\det(B), this becomes:
g(t) = \log\det(X_0) + \sum_{i=1}^d \log(1 + t \cdot \lambda_i(Y)).
Now check that h(t) = \log(1 + t\lambda) is concave in t for fixed \lambda:
h''(t) = \frac{-\lambda^2}{(1 + t\lambda)^2} \leq 0.
Since summation preserves convexity (and concavity), g(t) is concave. By restriction to a line, f(X) = \log\det(X) is concave.
This completes the proof. \blacksquare
For any X \succeq 0, the matrix square root X^{1/2} is defined as the unique PSD matrix such that X^{1/2}X^{1/2} = X.
Construction: Via eigendecomposition X = U\Lambda U^\top where \Lambda = \text{diag}(\lambda_1,\ldots,\lambda_d) with \lambda_i \geq 0, define
X^{1/2} = U\,\text{diag}\!\left(\sqrt{\lambda_1},\ldots,\sqrt{\lambda_d}\right) U^\top.
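The eigendecomposition construction above translates directly into code. A minimal sketch (the matrix X is an illustrative choice):

```python
import numpy as np

# PSD square root via eigendecomposition, following the construction above.
def sqrtm_psd(X):
    lam, U = np.linalg.eigh(X)   # X = U diag(lam) U^T, lam >= 0 for PSD X
    # clip guards against tiny negative eigenvalues from roundoff
    return U @ np.diag(np.sqrt(np.clip(lam, 0, None))) @ U.T

X = np.array([[4.0, 1.0],
              [1.0, 3.0]])
R = sqrtm_psd(X)
print(np.allclose(R @ R, X))   # R is symmetric PSD with R R = X
```

The returned R is symmetric with nonnegative eigenvalues, and R @ R recovers X up to floating-point error.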
Two further classical examples (left as exercises):
- Log-sum-exp is convex: f(x) = \log\!\left(\sum_{i=1}^d \exp(x_i)\right).
- Geometric mean is concave on \mathbb{R}^d_{++}: f(x) = \left(\prod_{i=1}^d x_i\right)^{1/d}.
Hint: Check \nabla^2 f(x) \succeq 0 (or \preceq 0).
1.3 Operations Preserving Convexity
Rather than verifying convexity from scratch each time, we can build complex convex functions from simpler ones using a small set of rules. These operations form a “convexity calculus” that is indispensable in modeling.
1.3.1 Positive Weighted Sum
Theorem 1.7 (Positive Weighted Sum) If \{f_i\}_{i \in [m]} are convex functions and \{w_i\}_{i \in [m]} are nonneg. weights (w_i \geq 0), then
f = \sum_{i=1}^m w_i \cdot f_i \quad \text{is convex.}
Proof. Since each f_i is convex, for any \lambda \in [0,1] and any x, y we have:
f_i(\lambda x + (1-\lambda)y) \leq \lambda \cdot f_i(x) + (1-\lambda) \cdot f_i(y).
Multiplying both sides by w_i \geq 0 preserves the inequality. Summing over i = 1, \ldots, m yields:
f(\lambda x + (1-\lambda)y) \leq \lambda \cdot f(x) + (1-\lambda) \cdot f(y).
This completes the proof. \blacksquare
1.3.2 Composition with Affine Function
Theorem 1.8 (Composition with Affine) If f is convex, then g(x) = f(Ax + b) is convex.
Example: f(x) = \|Ax + b\| is convex whenever \|\cdot\| is a norm.
1.3.3 Pointwise Supremum
Theorem 1.9 (Pointwise Supremum) If f_1,\ldots,f_m are convex, then
g(x) = \max\{f_1(x), \ldots, f_m(x)\} \quad \text{is convex.}
Example: Piecewise linear function \max_{i \in [m]} \{a_i^\top x + b_i\}.
More generally, we can take pointwise supremum over a continuous set:
- If f(x,y) is convex in x for all y \in \mathcal{Y}, then g(x) = \sup_{y \in \mathcal{Y}} f(x,y) is convex.
Proof (via epigraphs). Observe that t \geq g(x) if and only if t \geq f(x,y) for all y \in \mathcal{Y}. Thus:
\text{epi}(g) = \bigcap_{y \in \mathcal{Y}} \text{epi}(f(\cdot, y)).
Since each \text{epi}(f(\cdot,y)) is convex and the intersection of convex sets is convex, \text{epi}(g) is convex, hence g is convex. \blacksquare
Further examples:
- Maximum eigenvalue: \lambda_{\max}(X) = \sup_{v:\|v\|=1} v^\top X v. This is convex in X as a supremum of linear (hence convex) functions.
- Support function: For any C \subseteq \mathbb{R}^d (even nonconvex), S_C(x) = \sup_{y \in C} x^\top y is convex.
- Conjugate function: f^*(x) = \sup_{y \in \text{dom}(f)} \{y^\top x - f(y)\} is convex.
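The maximum-eigenvalue example lends itself to a direct numerical check. A sketch (random symmetric matrices, illustrative dimension): since \lambda_{\max} is a supremum of linear functions of X, the convexity inequality should hold for every convex combination.

```python
import numpy as np

# Verify the convexity inequality for lambda_max on random symmetric matrices.
rng = np.random.default_rng(1)

def lam_max(X):
    return np.linalg.eigvalsh(X)[-1]   # eigvalsh returns ascending eigenvalues

for _ in range(200):
    A = rng.normal(size=(4, 4)); A = (A + A.T) / 2
    B = rng.normal(size=(4, 4)); B = (B + B.T) / 2
    th = rng.uniform()
    lhs = lam_max(th * A + (1 - th) * B)
    rhs = th * lam_max(A) + (1 - th) * lam_max(B)
    assert lhs <= rhs + 1e-10
print("lambda_max convexity verified on random pairs")
```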
1.3.4 Minimization over a Convex Set
Theorem 1.10 (Minimization Preserves Convexity) If \mathcal{Y} \subseteq \mathbb{R}^d is convex and f is convex in (x, y) jointly, then
g(x) = \inf_{y \in \mathcal{Y}} f(x, y) \quad \text{is convex.}
(Reference: [BV], page 88.)
Example: Let S \subseteq \mathbb{R}^d be convex and \|\cdot\| a norm. Then
d(x, S) = \inf_{y \in S} \|x - y\| \quad \text{is convex,}
since \|x - y\| is jointly convex in (x, y).
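The distance function can be tested numerically. A sketch under an illustrative choice of S: for the unit Euclidean ball, the infimum has the closed form d(x, S) = \max(\|x\|_2 - 1, 0), and we run a midpoint-convexity check on random points.

```python
import numpy as np

# Distance to the unit Euclidean ball S = {y : ||y||_2 <= 1}; the infimum
# has the closed form max(||x||_2 - 1, 0).
def dist_to_ball(x):
    return max(np.linalg.norm(x) - 1.0, 0.0)

rng = np.random.default_rng(2)
for _ in range(500):
    x, y = rng.normal(scale=3.0, size=(2, 3))
    mid = 0.5 * (x + y)
    assert dist_to_ball(mid) <= 0.5 * dist_to_ball(x) + 0.5 * dist_to_ball(y) + 1e-12
print("d(., S) passes the midpoint convexity test")
```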
1.3.5 Composition Rule
Theorem 1.11 (Scalar Composition Rule) Let f(x) = h(g(x)). For the univariate case,
f''(x) = h''(g(x)) \cdot \left(g'(x)\right)^2 + h'(g(x)) \cdot g''(x).
Then h'' \geq 0 (i.e., h convex) together with either:
- h' \geq 0 and g'' \geq 0 (i.e., h nondecreasing and g convex), or
- h' \leq 0 and g'' \leq 0 (i.e., h nonincreasing and g concave)
implies f'' \geq 0, i.e., f is convex.
Summary:
- h convex, h nondecreasing, g convex \implies f convex.
- h convex, h nonincreasing, g concave \implies f convex.
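As a sanity check of the first case (a small sketch; the functions are illustrative): take h(u) = e^u, which is convex and nondecreasing, and g(x) = x^2, which is convex. The composition f(x) = e^{x^2} should then be convex, and indeed f''(x) = (2 + 4x^2)e^{x^2} > 0.

```python
import numpy as np

# Composition rule, case 1: h(u) = exp(u) convex nondecreasing, g(x) = x^2
# convex, so f(x) = exp(x^2) is convex. Verify f''(x) = (2 + 4x^2) exp(x^2) > 0.
xs = np.linspace(-2, 2, 101)
f_pp = (2 + 4 * xs**2) * np.exp(xs**2)
print(np.all(f_pp > 0))   # True: second derivative positive on the grid
```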
1.4 Convex Optimization
With the tools for recognizing convex sets (Section 1.1) and convex functions (Section 1.2, Section 1.3) in hand, we can now define the class of optimization problems that enjoys the strongest theoretical guarantees: convex optimization. The key property is that every local minimum is a global minimum, which makes these problems tractable both in theory and in practice.
Definition 1.5 (Convex Optimization Problem) Abstract form:
\min_{x \in \Omega} f(x),
where \Omega \subseteq \mathbb{R}^d is a convex feasible set and f : \Omega \to \mathbb{R} is a convex function.
Concrete form:
\begin{aligned} \min \quad & f(x) \\ \text{s.t.} \quad & f_i(x) \leq 0, \quad i = 1, 2, \ldots, m, \\ & Ax = b. \end{aligned}
Note: each inequality-constraint function f_i must be convex, and the equality constraints must be affine.
1.4.1 Properties of Convex Optimization
Theorem 1.12 (Uniqueness under Strict Convexity) If f is strictly convex and a minimizer exists, then x^* = \arg\min_{x \in \Omega} f(x) is unique.
Theorem 1.13 (Local Optimality Implies Global Optimality) Any locally optimal point is globally optimal.
Recall: \bar{x} is a locally optimal solution if there exists R > 0 such that for all x \in \Omega \cap \{x : \|x - \bar{x}\| \leq R\}, we have f(x) \geq f(\bar{x}). A globally optimal solution x^* satisfies f(x) \geq f(x^*) for all x \in \Omega.
1.4.2 Optimality Condition for Differentiable f
Theorem 1.14 (First-Order Optimality Condition) Consider \min_{x \in \Omega} f(x) where f is convex and differentiable. Then x^* is an optimal solution if and only if
\nabla f(x^*)^\top (x - x^*) \geq 0 \quad \forall\, x \in \Omega.
Equivalently, \min_{x \in \Omega} \nabla f(x^*)^\top (x - x^*) \geq 0.
When \Omega = \mathbb{R}^d, the optimality condition reduces to \nabla f(x^*) = 0, since the condition must hold for all directions.
1.5 Norms and Dual Norms
Norms measure the “size” of vectors and matrices and appear throughout optimization — in convergence rates, constraint sets, regularizers, and duality theory. We collect the key definitions and examples here.
Definition 1.6 (Norm) A function \|\cdot\| : \mathbb{R}^d \to \mathbb{R} is a norm if it satisfies:
- Nonnegativity: \|x\| \geq 0 for all x, with \|x\| = 0 if and only if x = 0.
- Homogeneity: \|\alpha x\| = |\alpha| \cdot \|x\| for all \alpha \in \mathbb{R}.
- Triangle inequality: \|x + y\| \leq \|x\| + \|y\|.
1.5.1 Vector Norms
\ell_p-norms (p \geq 1): \|x\|_p = \left(\sum_{i=1}^{d} |x_i|^p\right)^{1/p}. Important special cases:
- \|x\|_1 = \sum_{i=1}^d |x_i| (promotes sparsity in optimization),
- \|x\|_2 = \sqrt{\sum_{i=1}^d x_i^2} (Euclidean norm),
- \|x\|_\infty = \max_{i \in [d]} |x_i|.
1.5.2 Matrix Norms
There are two natural ways to define norms on matrices X \in \mathbb{R}^{m \times n}.
Induced (operator) norms. Given vector norms \|\cdot\|_\alpha on \mathbb{R}^n and \|\cdot\|_\beta on \mathbb{R}^m, the induced matrix norm is
\|X\|_{\alpha \to \beta} = \sup_{v \neq 0} \frac{\|Xv\|_\beta}{\|v\|_\alpha} = \sup_{\|v\|_\alpha \leq 1} \|Xv\|_\beta.
An induced norm measures the maximum “amplification” of the matrix as a linear map. Key examples:
- Spectral norm (\ell_2 \to \ell_2): \|X\|_{\text{op}} = \|X\|_{2 \to 2} = \sigma_1(X), the largest singular value.
- \ell_1 \to \ell_1: \|X\|_{1 \to 1} = \max_{j} \sum_{i=1}^m |X_{ij}| (maximum absolute column sum).
- \ell_\infty \to \ell_\infty: \|X\|_{\infty \to \infty} = \max_{i} \sum_{j=1}^n |X_{ij}| (maximum absolute row sum).
Induced norms are submultiplicative: \|XY\|_{\alpha \to \gamma} \leq \|X\|_{\beta \to \gamma} \cdot \|Y\|_{\alpha \to \beta}.
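The closed-form characterizations above are straightforward to compute. A sketch (the matrix X is an illustrative choice): we evaluate the spectral, \ell_1 \to \ell_1, and \ell_\infty \to \ell_\infty norms and cross-check the spectral norm against the largest singular value.

```python
import numpy as np

# Induced norms of a small matrix via their closed-form characterizations.
X = np.array([[1.0, -2.0,  3.0],
              [0.0,  4.0, -1.0]])

spectral = np.linalg.norm(X, 2)            # sigma_1(X)
one_to_one = np.abs(X).sum(axis=0).max()   # max absolute column sum
inf_to_inf = np.abs(X).sum(axis=1).max()   # max absolute row sum

print(np.isclose(spectral, np.linalg.svd(X, compute_uv=False)[0]))
print("l1->l1:", one_to_one, "| linf->linf:", inf_to_inf)
```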
Entrywise (vectorized) norms. We can treat X \in \mathbb{R}^{m \times n} as a vector in \mathbb{R}^{mn} by stacking its entries, and apply any vector norm. The general form is
\|X\|_{p,\text{vec}} = \left(\sum_{i=1}^m \sum_{j=1}^n |X_{ij}|^p\right)^{1/p}.
Key examples:
- Frobenius norm (p = 2): \|X\|_F = \sqrt{\sum_{i,j} X_{ij}^2} = \sqrt{\text{tr}(X^\top X)} = \sqrt{\sum_{i=1}^r \sigma_i(X)^2}.
- Entrywise \ell_1-norm (p = 1): \|X\|_{1,\text{vec}} = \sum_{i,j} |X_{ij}|.
- Entrywise \ell_\infty-norm: \|X\|_{\infty,\text{vec}} = \max_{i,j} |X_{ij}|.
Schatten norms. A third family applies vector \ell_p-norms to the singular value vector \sigma(X) = (\sigma_1, \ldots, \sigma_r):
\|X\|_{S_p} = \|\sigma(X)\|_p = \left(\sum_{i=1}^r \sigma_i(X)^p\right)^{1/p}.
- S_1: Nuclear norm \|X\|_{\text{nuc}} = \sum_{i=1}^r \sigma_i(X) (promotes low rank in optimization).
- S_2: Frobenius norm \|X\|_F (same as the entrywise \ell_2-norm).
- S_\infty: Spectral norm \|X\|_{\text{op}} = \sigma_1(X) (same as the induced \ell_2 \to \ell_2 norm).
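All three Schatten norms follow from a single SVD. A sketch (the matrix is illustrative), cross-checked against NumPy's built-in matrix norms:

```python
import numpy as np

# Schatten norms from the singular values: nuclear (S_1), Frobenius (S_2),
# spectral (S_inf), compared with NumPy's built-in norms.
X = np.array([[3.0, 0.0],
              [4.0, 5.0]])
sigma = np.linalg.svd(X, compute_uv=False)

nuclear  = sigma.sum()
frob     = np.sqrt((sigma**2).sum())
spectral = sigma.max()

print(np.isclose(nuclear,  np.linalg.norm(X, 'nuc')))
print(np.isclose(frob,     np.linalg.norm(X, 'fro')))
print(np.isclose(spectral, np.linalg.norm(X, 2)))
```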
For matrices X, Y \in \mathbb{R}^{m \times n}, the standard inner product is \langle X, Y \rangle = \text{tr}(X^\top Y) = \sum_{i,j} X_{ij} Y_{ij}. The Frobenius norm is the norm induced by this inner product: \|X\|_F = \sqrt{\langle X, X \rangle}.
1.5.3 Norm Inequalities
Several inequalities relating different norms are used routinely in convergence analysis and algorithm design.
Hölder’s inequality. For p, q \geq 1 with \frac{1}{p} + \frac{1}{q} = 1,
|\langle x, y \rangle| \leq \|x\|_p \cdot \|y\|_q.
Equivalence of \ell_p-norms. For any x \in \mathbb{R}^d and 1 \leq p \leq q \leq \infty,
\|x\|_q \leq \|x\|_p \leq d^{1/p - 1/q} \|x\|_q.
The most frequently used special cases are:
- \|x\|_2 \leq \|x\|_1 \leq \sqrt{d}\, \|x\|_2,
- \|x\|_\infty \leq \|x\|_2 \leq \sqrt{d}\, \|x\|_\infty,
- \|x\|_\infty \leq \|x\|_1 \leq d\, \|x\|_\infty.
Both bounds are tight. The left inequality \|x\|_q \leq \|x\|_p is achieved by any standard basis vector e_i. The right inequality \|x\|_p \leq d^{1/p-1/q}\|x\|_q is achieved by x = (1,1,\ldots,1).
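The tightness claim is easy to confirm for the p = 1, q = 2 case. A minimal sketch in \mathbb{R}^5 (the dimension is an illustrative choice):

```python
import numpy as np

# Tightness of ||x||_2 <= ||x||_1 <= sqrt(d) ||x||_2 in R^d:
# e_1 attains the left inequality, the all-ones vector the right one.
d = 5
e1 = np.zeros(d); e1[0] = 1.0
ones = np.ones(d)

assert np.isclose(np.linalg.norm(e1, 2), np.linalg.norm(e1, 1))
assert np.isclose(np.linalg.norm(ones, 1), np.sqrt(d) * np.linalg.norm(ones, 2))
print("both bounds are attained")
```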
Matrix norm inequalities. For X \in \mathbb{R}^{m \times n} with r = \text{rank}(X),
- Spectral vs. Frobenius vs. nuclear: \|X\|_{\text{op}} \leq \|X\|_F \leq \|X\|_{\text{nuc}}.
- Frobenius vs. spectral: \|X\|_F \leq \sqrt{r}\, \|X\|_{\text{op}}.
- Nuclear vs. Frobenius: \|X\|_{\text{nuc}} \leq \sqrt{r}\, \|X\|_F.
- Frobenius vs. entrywise \ell_1: \|X\|_F \leq \|X\|_{1,\text{vec}} \leq \sqrt{mn}\, \|X\|_F.
- Spectral vs. entrywise \ell_\infty: \|X\|_{\infty,\text{vec}} \leq \|X\|_{\text{op}} \leq \sqrt{mn}\, \|X\|_{\infty,\text{vec}}.
More generally, the Schatten norms satisfy the same ordering as vector \ell_p-norms: for 1 \leq p \leq q \leq \infty,
\|X\|_{S_q} \leq \|X\|_{S_p} \leq r^{1/p - 1/q} \|X\|_{S_q}.
1.5.4 Dual Norms
Definition 1.7 (Dual Norm) Given a norm \|\cdot\| on a vector space V with inner product \langle \cdot, \cdot \rangle, the dual norm is defined as
\|z\|_* = \sup_{\|x\| \leq 1} \langle z, x \rangle.
The dual norm arises naturally in bounding inner products and appears throughout duality theory and convergence analysis.
Generalized Cauchy–Schwarz inequality: For any norm \|\cdot\| and its dual,
|\langle x, y \rangle| \leq \|x\| \cdot \|y\|_*.
Key dual norm pairs:
| Norm \|\cdot\| | Dual norm \|\cdot\|_* | Setting |
|---|---|---|
| \ell_p | \ell_q | Vectors: \frac{1}{p} + \frac{1}{q} = 1 |
| \ell_1 | \ell_\infty | Vectors |
| \ell_2 | \ell_2 | Vectors (self-dual) |
| Spectral norm \|X\|_{\text{op}} | Nuclear norm \|X\|_{\text{nuc}} | Matrices |
| Frobenius norm \|X\|_F | Frobenius norm \|X\|_F | Matrices (self-dual) |
| Schatten S_p | Schatten S_q | Matrices: \frac{1}{p} + \frac{1}{q} = 1 |
The dual norm satisfies (\|\cdot\|_*)_* = \|\cdot\| — a fact we will prove using Lagrange duality in Section 3.3.2.
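The \ell_1/\ell_\infty pairing in the table can be checked directly. A sketch (the vector z is an illustrative choice): over the unit \ell_1 ball, the supremum of \langle z, x \rangle is attained at a signed standard basis vector, so it equals \|z\|_\infty.

```python
import numpy as np

# Dual of the l_1 norm: sup_{||x||_1 <= 1} <z, x> = max_i |z_i| = ||z||_inf,
# attained at x = sign(z_i) e_i for the largest-magnitude coordinate i.
z = np.array([2.0, -7.0, 1.5])

i = np.argmax(np.abs(z))
x_opt = np.zeros_like(z); x_opt[i] = np.sign(z[i])

print(np.isclose(z @ x_opt, np.linalg.norm(z, np.inf)))   # True
```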
1.6 Interactive Demo
The following code demonstrates the convergence of gradient descent on a smooth, strongly convex function and verifies the first-order optimality condition.
```python
import numpy as np

# Strongly convex quadratic: f(x) = x^T Q x + b^T x
Q = np.array([[3.0, 0.5],
              [0.5, 1.0]])
b_vec = np.array([-2.0, 1.0])

def f(x):
    return x @ Q @ x + b_vec @ x

def grad_f(x):
    return 2 * Q @ x + b_vec

# Optimal solution: grad f(x*) = 0 => x* = -0.5 * Q^{-1} b
x_star = -0.5 * np.linalg.solve(Q, b_vec)
f_star = f(x_star)
print(f"Optimal x* = {x_star}")
print(f"Optimal f* = {f_star:.6f}")
print(f"||grad f(x*)|| = {np.linalg.norm(grad_f(x_star)):.2e}\n")

# Condition number
eigvals = np.linalg.eigvalsh(2 * Q)
kappa = max(eigvals) / min(eigvals)
print(f"Eigenvalues of Hessian: {eigvals}")
print(f"Condition number kappa = {kappa:.2f}\n")

# Gradient descent
alpha = 1.0 / max(eigvals)   # step size = 1/L
x = np.array([5.0, -3.0])    # initial point
history = []
for t in range(50):
    gap = f(x) - f_star
    dist = np.linalg.norm(x - x_star)
    history.append((t, gap, dist))
    x = x - alpha * grad_f(x)

print("Gradient descent convergence:")
print(f"{'Iter':>5} {'f(x)-f*':>14} {'||x-x*||':>14}")
for t, gap, dist in history[::5]:
    print(f"{t:5d} {gap:14.6e} {dist:14.6e}")

# Verify first-order optimality: grad f(x*)^T (x - x*) >= 0
print("\nVerifying optimality condition at x*:")
np.random.seed(42)
for _ in range(5):
    x_test = np.random.randn(2) * 3
    val = grad_f(x_star) @ (x_test - x_star)
    print(f"  grad f(x*)^T (x - x*) = {val:.2e}  (should be ~0 for unconstrained)")
```

Summary
- Convex sets are closed under intersection, affine images, and perspective maps; the separating hyperplane theorem guarantees that any point outside a closed convex set can be separated from it by a hyperplane.
- A function f is convex if f(\theta x + (1-\theta)y) \leq \theta f(x) + (1-\theta)f(y) for all \theta \in [0,1]; it is \mu-strongly convex if f - \frac{\mu}{2}\|\cdot\|^2 is convex, guaranteeing a unique minimizer and quadratic growth around it.
- Convexity is preserved by nonnegative weighted sums, pointwise suprema, and composition rules (e.g., g \circ f is convex when g is convex nondecreasing and f is convex), providing a calculus for building complex convex models from simple pieces.
- For unconstrained smooth convex problems, the first-order optimality condition \nabla f(x^*) = 0 is both necessary and sufficient; for constrained problems, the condition becomes \nabla f(x^*)^\top (y - x^*) \geq 0 for all feasible y.
- The dual norm \|z\|_* = \sup_{\|x\| \leq 1} \langle z, x \rangle measures the “size” of a vector relative to the original norm and satisfies the generalized Cauchy–Schwarz inequality |\langle x, y \rangle| \leq \|x\| \cdot \|y\|_*.
In the next chapter, we develop Lagrange duality — one of the most powerful frameworks in optimization. Duality provides a systematic way to construct lower bounds on optimal values, leading to certificates of optimality, the celebrated KKT conditions, and the foundation for interior point methods.