Commit 9e63decd by Florent Chatelain

up nb

parent 8496bd75
%% Cell type:markdown id: tags:

# *"Two-way kernel matrix puncturing: towards resource-efficient PCA and spectral clustering"*

## -- Supplementary Material --

## -- Python code for the main article figures --
## Preamble: useful packages and functions

%% Cell type:code id: tags:

``` python
import numpy as np
import scipy as sp
import matplotlib.pyplot as plt
import scipy.linalg as lin
import scipy.stats as stats
import scipy.sparse.linalg
import scipy.special
import pandas as pd
import seaborn as sns
from numpy.random import default_rng

rng = default_rng(0)

%load_ext autoreload
%autoreload 2

# user functions from the local .py file
from punctutils import *

%matplotlib inline
plt.rcParams.update({"font.size": 12})
```

%% Cell type:markdown id: tags:

## Two-way puncturing of the kernel matrix

The data matrix is $X\in \mathbb{C}^{p \times n}$, where $p$ and $n$ are the feature and sample sizes, respectively. Then

$$K = \left[ \frac 1p (X \circ S)' (X \circ S) \right] \circ B$$

is the $n\times n$ *two-way punctured* kernel matrix, where $\circ$ is the Hadamard (elementwise) product and

- $S$ is a $p \times n$ i.i.d. Bernoulli random matrix that selects the **data** entries with rate $\varepsilon_S$ (`eS`);
- $B$ is an $n \times n$ i.i.d. Bernoulli random matrix that selects the **kernel** entries with rate $\varepsilon_B$ (`eB`).

%% Cell type:markdown id: tags:

## Simulations

#### Figure 1.
Eigenvalue distribution $\nu_n$ of $K$ versus the limit measure $\nu$, for $p=200$, $n=4\,000$, $x_i\sim .4 \mathcal N(\mu_1,I_p)+.6\mathcal N(\mu_2,I_p)$ with
$[\mu_1^T,\mu_2^T]^T \sim\mathcal{N}(0, \frac1p {\tiny \left[\begin{matrix} 10 & 5.5 \\ 5.5 & 15\end{matrix}\right] } \otimes I_p)$;
$\varepsilon_S=.2$, $\varepsilon_B=.4$. Sample vs theoretical spikes shown as blue vs red circles. The two "humps" are reminiscent of the semicircular and Marchenko-Pastur laws.
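%% Cell type:markdown id: tags:

A minimal self-contained NumPy sketch of the two-way puncturing defined above, with toy sizes chosen for illustration only (the `punctutils` helpers used below presumably implement a sparse version of this):

%% Cell type:code id: tags:

``` python
import numpy as np

# Toy sizes, assumed for illustration only
p, n = 200, 400
eS, eB, b = 0.2, 0.4, 1.0

rng = np.random.default_rng(0)
X = rng.normal(size=(p, n))

# S selects data entries, B selects kernel entries (Bernoulli masks);
# B is symmetrized and its diagonal set to b, as in the notebook
S = (rng.random((p, n)) < eS).astype(float)
B_upper = np.triu(rng.random((n, n)) < eB, k=1)
B = (B_upper + B_upper.T).astype(float)
np.fill_diagonal(B, b)

XS = X * S               # X ∘ S (Hadamard product)
K = (XS.T @ XS / p) * B  # K = [ (1/p) (X∘S)'(X∘S) ] ∘ B
```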
%% Cell type:code id: tags:

``` python
# Generation of the mean vectors for each sample
p = 200  # dimension size
n = 4000  # sample size
c0 = p / n  # dimension/sample size ratio

# Set the covariance for the two mean vectors
cov_mu = np.array([[10, 5.5], [5.5, 15]]) / p
mus = gen_synth_mus(p=p, n=n, cov_mu=cov_mu)

# Set the proportion for each of the two classes
cs = [0.4, 0.6]

# Generate the noisy data matrix and the spikes matrices
X, ells, vM = gen_synth_X(p, n, mus, cs)

# Puncturing settings
eS = 0.2  # data puncturing ratio
eB = 0.4  # kernel puncturing ratio
b = 1  # kernel matrix diagonal entry

# Empirical spectrum
lambdas = puncture_eigs(X, eB, eS, b)[0]
xmin = min(np.min(lambdas) * 0.8, np.min(lambdas) * 1.2)  # account for a possibly negative minimum
xmax = np.max(lambdas) * 1.2
xs, density = lsd(eB, eS, c0, b, xmin=xmin, xmax=xmax, nsamples=200)
isolated_eig_0 = spike(eB, eS, c0, ells[0], b=1)[0]
isolated_eig_1 = spike(eB, eS, c0, ells[1], b=1)[0]

f, ax = plt.subplots(1, 1)
sns.histplot(
    lambdas.flatten(),
    # color="blue",
    cbar_kws={'edgecolor': 'darkblue'},
    stat="density",
    ax=ax,
)
plt.plot(xs, density, "r", label="Limiting density")
plt.plot(
    isolated_eig_0, 0.002, "og", fillstyle="none", label=r"limiting spikes $\rho_{1,2}$"
)
plt.plot(isolated_eig_1, 0.002, "og", fillstyle="none")
plt.plot(lambdas[-1], 0.002, "ob", label=r"Largest eigvals")
plt.plot(lambdas[-2], 0.002, "ob")
plt.legend()
plt.xlim([-2, 5])
_ = plt.show()
```

%%%% Output: display_data

[Hidden Image Output]

%% Cell type:markdown id: tags:

#### Figure 2.
Illustration of Theorem 2: asymptotic sample-population eigenvector alignment for $\mathcal L=\ell \in\mathbb R$, as a function of the "information strength" $\ell$. Various values of $(\varepsilon_S,\varepsilon_B,c_0)$ are indicated in the legend. Black dashed lines indicate the limiting (small $\varepsilon_S,\varepsilon_B$) phase transition threshold $\Gamma=(\varepsilon_S^2\varepsilon_Bc_0^{-1})^{-\frac12}$.
As $\varepsilon_S,\varepsilon_B\to 0$, performance curves coincide when $\varepsilon_B\varepsilon_S^2c_0^{-1}$ is constant (plain versus dashed set of curves).

%% Cell type:code id: tags:

``` python
eSeBc0s = [
    (0.1, 0.1, 1),
    (0.2, 0.025, 1),
    (0.1, 0.2, 2),
    (0.05, 0.05, 1),
    (0.1, 0.0125, 1),
    (0.05, 0.1, 2),
]
ells = np.linspace(1, 400, 100)

plt.figure(figsize=(8, 4))
for eS, eB, c0 in eSeBc0s:
    plt.plot(
        ells,
        [spike(eB, eS, c0, ell)[1] for ell in ells],
        label="({},{},{})".format(eS, eB, c0),
    )
plt.xlabel(r"$\ell$")
plt.ylabel(r"$\zeta$")
plt.grid("On")
_ = plt.legend()
```

%%%% Output: display_data

[Hidden Image Output]

%% Cell type:markdown id: tags:

#### Figure 3.
Phase transition curves $F(\ell)=0$ for $\mathcal L=\ell\in\mathbb R$ and varying values of $\ell$, for $c_0=.05$. Above each phase transition curve, a spike eigenvalue is found away from the support of $\nu$. For large $\ell$, a wide range of $\varepsilon_B$'s (resp., $\varepsilon_S$'s) is admissible at virtually no performance loss. Here also, sparser $B$ matrices are more effective than sparser $S$ matrices.

%% Cell type:code id: tags:

``` python
ells = [1, 2, 5]  # spike powers
c0 = 0.05

f, ax = plt.subplots(1, 2, figsize=(12, 4), sharey=True)
for ell in ells:
    eBs, eSs = phase_transition(c0, ell, res=1e-3)
    ax[0].axes.plot(eBs, eSs, label=r"$\ell$={}".format(ell))
    ax[1].axes.plot(eSs, eBs, label=r"$\ell$={}".format(ell))
ax[0].axes.set_xlabel(r"$\varepsilon_B$")
ax[0].axes.set_ylabel(r"$\varepsilon_S$")
ax[0].axes.grid("On")
ax[1].axes.set_xlabel(r"$\varepsilon_S$")
ax[1].axes.set_ylabel(r"$\varepsilon_B$")
ax[1].axes.set_ylim([0, 0.3])
ax[1].axes.grid("On")
ax[1].axes.legend()
_ = plt.show()
```

%%%% Output: display_data

[Hidden Image Output]

%% Cell type:markdown id: tags:

#### Figure 4.
Two-way punctured matrices $K$ for **(left)** $(\varepsilon_S,\varepsilon_B)=(.2,1)$ or **(right)** $(\varepsilon_S,\varepsilon_B)=(1,.04)$, with $c_0=\frac12$, $n=4\,000$, $p=2\,000$, $b=0$.
Clustering setting with $x_i\sim .4\mathcal N(\mu_1,I_p)+.6\mathcal N(\mu_2,I_p)$ for $[\mu_1^T,\mu_2^T]^T\sim \mathcal N(0,\frac1p[\begin{smallmatrix} 20 & 12 \\ 12 & 30\end{smallmatrix}]\otimes I_p)$. **(Top)** first $100\times 100$ absolute entries of $K$ (white for zero); **(Middle)** spectrum of $K$, theoretical limit, and isolated eigenvalues; **(Bottom)** second dominant eigenvector $\hat v_2$ of $K$ against its theoretical average in red. **As confirmed by theory, although (top) $K$ is dense for $\varepsilon_B=1$ and sparse for $\varepsilon_B=.04$ ($96\%$ empty) and (middle) the spectra strikingly differ, (bottom) since $\varepsilon_S^2\varepsilon_Bc_0^{-1}$ is constant, the eigenvector alignment $|\hat v_2^T v_2|^2$ is the same in both cases.**

%% Cell type:code id: tags:

``` python
# Generation of the mean vectors for each sample
p = 2000  # dimension size
n = 4000  # sample size
c0 = p / n  # dimension/sample size ratio

# Set the covariance for the two mean vectors
cov_mu = np.array([[20, 12], [12, 30]]) / p
mus = gen_synth_mus(p=p, n=n, cov_mu=cov_mu)

# Set the proportion for each of the two classes
cs = [0.4, 0.6]
n0 = int(n * cs[0])

# Generate the noisy data matrix and the spikes matrices
X, ells, vM = gen_synth_X(p, n, mus, cs)

# Puncturing settings
eS = 0.2  # data puncturing ratio
eB = 0.04  # kernel puncturing ratio
b = 0  # kernel matrix diagonal entry

f, ax = plt.subplots(3, 2, figsize=(12, 8))

# S-punctured spectrum
B, _, _ = mask_B(n, 1, is_diag=b)  # mask for diag entries
X_S = puncture_X(X, eS)  # punctured data matrix
K_S = puncture_K_vanilla(X_S, B)  # remove diag entries
l, u_S = sp.sparse.linalg.eigsh(K_S, k=2, which="LA", tol=0, return_eigenvectors=True)
u_S = u_S[:, 1].ravel()  # second principal eigenvector
u_S = u_S * np.sign(u_S[:n0].mean() - u_S[n0:].mean())
lambdas_S = np.linalg.eigvalsh(K_S.todense())

# B-punctured spectrum
B, _, _ = mask_B(n, eB, is_diag=b)
K_B = puncture_K_vanilla(X, B)  # punctured kernel matrix
l, u_B = sp.sparse.linalg.eigsh(K_B, k=2, which="LA", tol=0, return_eigenvectors=True)
u_B = u_B[:, 1].ravel()  # second principal eigenvector
u_B = u_B * np.sign(u_B[:n0].mean() - u_B[n0:].mean())
lambdas_B = np.linalg.eigvalsh(K_B.todense())

# S-punctured figs
ax[0, 0].imshow(
    np.abs(K_S[:100, :100].todense()), interpolation="nearest", cmap="gray_r"
)
ax[0, 0].axis("off")
disp_eigs(ax[1:, 0], lambdas_S, u_S, 1, eS, c0, n0, ells, b, vM)
ax[1, 0].axes.set_ylim([0, 2])

# B-punctured figs
ax[0, 1].imshow(
    np.abs(K_B[:100, :100].todense()), interpolation="nearest", cmap="gray_r"
)
ax[0, 1].axis("off")
disp_eigs(ax[1:, 1], lambdas_B, u_B, eB, 1, c0, n0, ells, b, vM)
```

%%%% Output: display_data

[Hidden Image Output]

%% Cell type:markdown id: tags:

#### Figure 5.
Limiting probability of error of spectral clustering of $\mathcal N(\pm\mu,I_p)$ with equal class sizes on $K$: as a function of $\varepsilon_B$ for fixed $\ell=\|\mu\|^2=50$ **(top)**, and of $\varepsilon_S$ for fixed $\ell=50$ **(bottom)**. Simulations (single realization) shown as markers for $p=n=4\,000$ ($\color{blue}\times$) and $p=n=8\,000$ ($\color{blue}+$). Very good fit between theory and practice for not too small $\varepsilon_S,\varepsilon_B$.
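%% Cell type:markdown id: tags:

As a quick sanity check of the Figure 2 claim, the small-$\varepsilon$ threshold $\Gamma=(\varepsilon_S^2\varepsilon_Bc_0^{-1})^{-\frac12}$ is indeed the same within each set of parameter triples used there. A hedged sketch (the formula is taken from the Figure 2 caption, not from `punctutils`):

%% Cell type:code id: tags:

``` python
import numpy as np

def gamma_threshold(eS, eB, c0):
    # Limiting (small eS, eB) phase-transition threshold from the
    # Figure 2 caption: Gamma = (eS^2 * eB / c0)^(-1/2)
    return (eS**2 * eB / c0) ** -0.5

# The plain and dashed curve sets of Figure 2 each coincide because
# eS^2 * eB / c0 is constant within each set
triples_plain = [(0.1, 0.1, 1), (0.2, 0.025, 1), (0.1, 0.2, 2)]
triples_dashed = [(0.05, 0.05, 1), (0.1, 0.0125, 1), (0.05, 0.1, 2)]
for group in (triples_plain, triples_dashed):
    gammas = [gamma_threshold(*t) for t in group]
    assert np.allclose(gammas, gammas[0])
```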
%% Cell type:code id: tags:

``` python
p = 4000
ns = np.array([2000, 2000])
n0 = ns[0]
n = np.sum(ns)
cs = ns / n
c0 = p / n

power = 50
mu = rng.normal(size=(p,))
mu = mu / np.linalg.norm(mu) * np.sqrt(power)
mus = np.array([mu, -mu]).T
X, ells, _ = gen_synth_X(p, n, mus, cs)
ell = np.max(ells)

f, ax = plt.subplots(2, 1, figsize=(8, 8), sharex=True)

## Perf vs epsilon_B (fixed epsilon_S)
res = 1e-1
eSs = [1 / 10, 1 / 20]
eBs = np.linspace(res, 1, int(1 / res))
res_thin = 1e-2
eSs_thin = [1 / 10, 1 / 20]
eBs_thin = np.linspace(res_thin, 1, int(1 / res_thin))
perf_sim = np.zeros((len(eBs), len(eSs)))
perf_th = np.zeros((len(eBs_thin), len(eSs_thin)))
for iS, eS in enumerate(eSs):
    for iB, eB in enumerate(eBs):
        u = puncture_eigs(X, eB, eS, b=1, sparsity=1)[1]
        overlap = 1 / n * (np.sum(u[:n0] > 0) + np.sum(u[n0:] < 0))
        perf_sim[iB, iS] = np.min([overlap, 1 - overlap])
    ax[0].axes.plot(
        eBs, perf_sim[:, iS], "xb", label=r"Simulation, $\varepsilon_S=${}".format(eS)
    )
for iS, eS in enumerate(eSs_thin):
    for iB, eB in enumerate(eBs_thin):
        zeta = spike(eB, eS, c0, ell, b=1)[1]
        perf_th[iB, iS] = qfunc(np.sqrt(zeta / (1 - zeta)))
    ax[0].axes.plot(
        eBs_thin, perf_th[:, iS], "r", label=r"Theory, $\varepsilon_S=${}".format(eS)
    )
ax[0].axes.set_xlabel(r"$\varepsilon_B$")
ax[0].axes.set_ylabel(r"$P_e$")
ax[0].axes.grid("On")
ax[0].axes.legend()

## Perf vs epsilon_S (fixed epsilon_B)
eBs = [1 / 100, 1 / 400]
eSs = np.linspace(res, 1, int(1 / res))
res_thin = 1e-2
eBs_thin = [1 / 100, 1 / 400]
eSs_thin = np.linspace(res_thin, 1, int(1 / res_thin))
perf_sim = np.zeros((len(eSs), len(eBs)))
perf_th = np.zeros((len(eSs_thin), len(eBs_thin)))
for iB, eB in enumerate(eBs):
    for iS, eS in enumerate(eSs):
        u = puncture_eigs(X, eB, eS, b=1, sparsity=1)[1]
        overlap = 1 / n * (np.sum(u[:n0] > 0) + np.sum(u[n0:] < 0))
        perf_sim[iS, iB] = np.min([overlap, 1 - overlap])
    ax[1].axes.plot(
        eSs, perf_sim[:, iB], "xb", label=r"Simulation, $\varepsilon_B=${}".format(eB)
    )
for iB, eB in enumerate(eBs_thin):
    for iS, eS in enumerate(eSs_thin):
        zeta = spike(eB, eS, c0, ell, b=1)[1]
        perf_th[iS, iB] = qfunc(np.sqrt(zeta / (1 - zeta)))
    ax[1].axes.plot(
        eSs_thin, perf_th[:, iB], "r", label=r"Theory, $\varepsilon_B=${}".format(eB)
    )
ax[1].axes.set_xlabel(r"$\varepsilon_S$")
ax[1].axes.set_ylabel(r"$P_e$")
ax[1].axes.grid("On")
ax[1].axes.legend()
```

%%%% Output: execute_result

%%%% Output: display_data

[Hidden Image Output]

%% Cell type:markdown id: tags:

#### Figure 6.
Examples of **MNIST-fashion images**: trouser instances **(top row)**, pullover instances **(bottom row)**.

**Note:** the GAN data we generated and used in the submitted paper are too voluminous to be included in the supplementary material and are not publicly and anonymously available. For this reason, we use below the smaller, publicly available MNIST-fashion real-world data set. However, **the conclusions obtained for this dataset are very similar to those drawn in the paper for the GAN data.**

%% Cell type:code id: tags:

``` python
from tensorflow.keras.datasets import fashion_mnist

(X, y), _ = fashion_mnist.load_data()
selected_labels = [1, 2]
nb_im = 5  # number of images to display for each class
im0 = X[y == selected_labels[0]]
im1 = X[y == selected_labels[1]]

f, ax = plt.subplots(2, 5, figsize=(10, 4), sharex=True, sharey=True)
for i in range(nb_im):
    ax[0, i].axes.imshow(im0[i, :, :].squeeze(), interpolation="nearest", cmap="gray_r")
    ax[0, i].axes.axis("off")
    ax[1, i].axes.imshow(im1[i, :, :].squeeze(), interpolation="nearest", cmap="gray_r")
    ax[1, i].axes.axis("off")
```

%%%% Output: display_data

[Hidden Image Output]

%% Cell type:markdown id: tags:

#### Figure 7.
Empirical classification errors for $2$-class (balanced) MNIST-fashion images (trouser vs pullover), with $n=512$ **(top)** and $n=2048$ **(bottom)**. **The theoretically predicted "plateau" behavior is observed for all $\varepsilon_B$ not too small.**
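%% Cell type:markdown id: tags:

The theory curves of Figure 5 map an eigenvector alignment $\zeta$ to a misclassification probability via a Gaussian tail function. A minimal sketch of that mapping, assuming `qfunc` in `punctutils` is the standard Q-function (as the usage $P_e=Q(\sqrt{\zeta/(1-\zeta)})$ above suggests):

%% Cell type:code id: tags:

``` python
import math

def qfunc(x):
    # Gaussian tail probability Q(x) = P(N(0,1) > x)
    return 0.5 * math.erfc(x / math.sqrt(2))

def limiting_error(zeta):
    # Limiting clustering error from an eigenvector alignment zeta in [0, 1),
    # as used for the theory curves of Figure 5
    return qfunc(math.sqrt(zeta / (1 - zeta)))

# Sanity checks: no alignment gives chance level, strong alignment almost no error
assert math.isclose(limiting_error(0.0), 0.5)
assert limiting_error(0.99) < 1e-20
```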
%% Cell type:code id: tags:

``` python
nbMC = 40
df_1, _ = get_perf_clustering(n0=256, nbMC=nbMC)
df_2, _ = get_perf_clustering(n0=1024, nbMC=nbMC)
```

%%%% Output: stream

[progression 40/40]

%% Cell type:code id: tags:

``` python
f, ax = plt.subplots(2, 1, figsize=(8, 8), sharex=True)
plot_perf_clustering(df_1, n0, ax[0])
plot_perf_clustering(df_2, n0, ax[1])
```

%%%% Output: display_data

[Hidden Image Output]

%% Cell type:markdown id: tags:

#### Figure 8.
Sample vs limiting spectra and dominant eigenvector of $K$ for 2-class MNIST-fashion images (trouser vs pullover); **(left)** $\varepsilon_S=\varepsilon_B=1$ (error rate: $\mathbb P_e=.09$); **(right)** $\varepsilon_S=0.02$, $\varepsilon_B=0.2$ ($\mathbb P_e=.12$). **Surprisingly good fit between sample and predicted isolated eigenvalue/eigenvector in all cases; as for the spectral measure, significant prediction improvement as $\varepsilon_S,\varepsilon_B\to 0$.**

%% Cell type:code id: tags:

``` python
eB = 0.2
eS = 0.02
X, n0, n1, ell = gen_MNIST_vectors(n0=2048)
(p, n) = X.shape
c0 = p / n

eigvals_punct = puncture_eigs(X, eB, eS, b=1, sparsity=0)[0]
u_punct = puncture_eigs(X, eB, eS, b=1, sparsity=1)[1]
u_punct = np.sign(u_punct[:n0].mean() - u_punct[n0:].mean()) * u_punct
overlap = 1 / n * (np.sum(u_punct[:n0] > 0) + np.sum(u_punct[n0:] < 0))
print("Punctured (right) error rate == {:.2f}".format(1 - overlap))

eigvals_full = puncture_eigs(X, eB=1, eS=1, b=1, sparsity=0)[0]
u_full = puncture_eigs(X, eB=1, eS=1, b=1, sparsity=1)[1]
u_full = np.sign(u_full[:n0].mean() - u_full[n0:].mean()) * u_full
overlap = 1 / n * (np.sum(u_full[:n0] > 0) + np.sum(u_full[n0:] < 0))
print("Full (left) error rate == {:.2f}".format(1 - overlap))

# Truncate the eigenvalue distribution to show the histogram
lmax = 15
print(
    "Full (left): ratio of eigenvalues < {:d} == {:d}/{:d}".format(
        lmax, np.sum(eigvals_full < lmax), n
    )
)

f, axes = plt.subplots(2, 2, figsize=(12, 10))
# Punctured
disp_eigs(axes[:, 1], eigvals_punct, u_punct, eB, eS, c0, n0, ell, b=1, vM=None)
# Full
disp_eigs_full(axes[:, 0], eigvals_full, u_full, c0, ell, lmax=lmax, b=1)
plt.show()
```

%%%% Output: stream

Punctured (right) error rate == 0.12
Full (left) error rate == 0.10
Full (left): ratio of eigenvalues < 15 == 4083/4096
Full case: largest sample eigval == 981.74, largest limiting spike == 783.62

%%%% Output: display_data

[Hidden Image Output]