Eigenvalue distribution $\nu_n$ of $K$ versus the limiting measure $\nu$, for $p=200$, $n=4\,000$, $x_i\sim .4 \mathcal N(\mu_1,I_p)+.6\mathcal N(\mu_2,I_p)$, with
$\varepsilon_S=.2$, $\varepsilon_B=.4$. Sample spikes in blue circles, theoretical spikes in red circles. <b>The two "humps" are reminiscent of the semicircle and Marcenko-Pastur laws.</b>
%% Cell type:code id: tags:
``` python
# Generation of the mean vectors for each sample
p = 200  # dimension size
n = 4000  # sample size
c0 = p / n  # dimension/sample-size ratio
# Set the covariance for the two mean vectors
cov_mu = np.array([[10, 5.5], [5.5, 15]]) / p
mus = gen_synth_mus(p=p, n=n, cov_mu=cov_mu)
# Set the proportion for each of the two classes
cs = [0.4, 0.6]
# Generate the noisy data matrix and the spikes matrices
X, ells, vM = gen_synth_X(p, n, mus, cs)
# Puncturing settings
eS = 0.2  # data puncturing ratio
eB = 0.4  # kernel puncturing ratio
b = 1  # kernel matrix diagonal entry
# Empirical spectrum
lambdas = puncture_eigs(X, eB, eS, b)[0]
xmin = min(np.min(lambdas) * 0.8, np.min(lambdas) * 1.2)  # handles a possibly negative minimum
```
%% Cell type:markdown id: tags:
Illustration of Theorem 2: asymptotic sample-population eigenvector alignment for $\mathcal L=\ell \in\mathbb R$, as a function of the "information strength" $\ell$, for various values of $(\varepsilon_S,\varepsilon_B,c_0)$ indicated in the legend. Black dashed lines indicate the limiting (small $\varepsilon_S,\varepsilon_B$) phase transition threshold $\Gamma=(\varepsilon_S^2\varepsilon_Bc_0^{-1})^{-\frac12}$. <b>As $\varepsilon_S,\varepsilon_B\to 0$, the performance curves coincide whenever $\varepsilon_B\varepsilon_S^2c_0^{-1}$ is kept constant (plain versus dashed set of curves).</b>
%% Cell type:code id: tags:
``` python
eSeBc0s = [
(0.1, 0.1, 1),
(0.2, 0.025, 1),
(0.1, 0.2, 2),
(0.05, 0.05, 1),
(0.1, 0.0125, 1),
(0.05, 0.1, 2),
]
ells = np.linspace(1, 400, 100)
plt.figure(figsize=(8, 4))
for eS, eB, c0 in eSeBc0s:
plt.plot(
ells,
[spike(eB, eS, c0, ell)[1] for ell in ells],
label="({},{},{})".format(eS, eB, c0),
)
plt.xlabel(r"$\ell$")
plt.ylabel(r"$\zeta$")
plt.grid(True)
_ = plt.legend()
```
%%%% Output: display_data
[Hidden Image Output]
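The conservation claim illustrated in the plot above can be checked numerically: the threshold $\Gamma=(\varepsilon_S^2\varepsilon_Bc_0^{-1})^{-\frac12}$ takes only two values over the six parameter triples, one per family of curves. A standalone sketch using only NumPy (independent of the notebook's helpers):

``` python
import numpy as np

# Parameter triples (eS, eB, c0) used in the plot above
eSeBc0s = [(0.1, 0.1, 1), (0.2, 0.025, 1), (0.1, 0.2, 2),
           (0.05, 0.05, 1), (0.1, 0.0125, 1), (0.05, 0.1, 2)]

# Conserved quantity eS^2 * eB / c0 and threshold Gamma = (eS^2 * eB / c0)^(-1/2)
conserved = np.array([eS**2 * eB / c0 for eS, eB, c0 in eSeBc0s])
Gammas = conserved ** -0.5
for (eS, eB, c0), q, G in zip(eSeBc0s, conserved, Gammas):
    print(f"({eS}, {eB}, {c0}): eS^2*eB/c0 = {q:.6f}, Gamma = {G:.2f}")
```

The first three triples share $\varepsilon_S^2\varepsilon_B/c_0=10^{-3}$ and the last three share $1.25\times 10^{-4}$, which is why the curves pair up in the figure.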
%% Cell type:markdown id: tags:
#### Figure 3.
Phase transition curves $F(\ell)=0$ for $\mathcal L=\ell\in\mathbb R$ and varying values of $\ell$, for $c_0=.05$. Above each phase transition curve, a spike eigenvalue is found away from the support of $\nu$. <b>For large $\ell$, a wide range of $\varepsilon_B$ (resp., $\varepsilon_S$) values is admissible at virtually no performance loss. Here too, sparser $B$ matrices are more effective than sparser $S$ matrices.</b>
Two-way punctured matrices $K$ for **(left)** $(\varepsilon_S,\varepsilon_B)=(.2,1)$ or **(right)** $(\varepsilon_S,\varepsilon_B)=(1,.04)$, with $c_0=\frac12$, $n=4\,000$, $p=2\,000$, $b=0$. Clustering setting with $x_i\sim .4\mathcal N(\mu_1,I_p)+.6\mathcal N(\mu_2,I_p)$ for $[\mu_1^T,\mu_2^T]^T\sim \mathcal N(0,\frac1p[\begin{smallmatrix} 20 & 12 \\ 12 & 30\end{smallmatrix}]\otimes I_p)$. **(Top)** first $100\times 100$ absolute entries of $K$ (white for zero); **(Middle)** spectrum of $K$, theoretical limit, and isolated eigenvalues; **(Bottom)** second dominant eigenvector $\hat v_2$ of $K$ against theoretical average in red. **As confirmed by theory, although (top) $K$ is dense for $\varepsilon_B=1$ and sparse for $\varepsilon_B=.04$ ($96\%$ empty) and (middle) the spectra strikingly differ, (bottom) since $\varepsilon_S^2\varepsilon_Bc_0^{-1}$ is constant, the eigenvector alignment $|\hat v_2^T v_2|^2$ is the same in both cases.**
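For reference, a two-way punctured matrix as described above can be formed directly: a Bernoulli($\varepsilon_S$) mask $S$ punctures the data entrywise, a symmetric Bernoulli($\varepsilon_B$) mask $B$ punctures the Gram matrix, and the diagonal is set to $b$. A minimal NumPy sketch on pure-noise data (sizes and seed are illustrative choices, not the notebook's):

``` python
import numpy as np

rng = np.random.default_rng(0)
p, n = 200, 400
eS, eB, b = 0.2, 0.4, 1.0

X = rng.standard_normal((p, n))              # pure-noise data, for illustration
S = (rng.random((p, n)) < eS).astype(float)  # data-puncturing mask (Bernoulli eS)
B = np.triu(rng.random((n, n)) < eB, 1)      # kernel-puncturing mask, upper triangle
B = (B + B.T).astype(float)                  # symmetrize (zero diagonal)
XS = S * X                                   # punctured data S o X
K = B * (XS.T @ XS) / p                      # two-way punctured Gram matrix
np.fill_diagonal(K, b)                       # diagonal entries set to b
```

By construction $K$ is symmetric, has diagonal $b$, and a fraction $\approx 1-\varepsilon_B$ of its off-diagonal entries are exactly zero.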
%% Cell type:code id: tags:
``` python
# Generation of the mean vectors for each sample
p = 2000  # dimension size
n = 4000  # sample size
c0 = p / n  # dimension/sample-size ratio
# Set the covariance for the two mean vectors
cov_mu = np.array([[20, 12], [12, 30]]) / p
mus = gen_synth_mus(p=p, n=n, cov_mu=cov_mu)
# Set the proportion for each of the two classes
cs = [0.4, 0.6]
n0 = int(n * cs[0])
# Generate the noisy data matrix and the spikes matrices
X, ells, vM = gen_synth_X(p, n, mus, cs)
```
%% Cell type:markdown id: tags:
Limiting probability of error of spectral clustering of $\mathcal N(\pm\mu,I_p)$ with equal class sizes on $K$: as a function of $\varepsilon_B$ for fixed $\ell=\|\mu\|^2=50$ <b>(top)</b>, and of $\varepsilon_S$ for fixed $\ell=50$ <b>(bottom)</b>. Simulations (single realization) shown as markers for $p=n=4\,000$ ($\color{blue}\times$) and $p=n=8\,000$ ($\color{blue}+$). <b>Very good fit between theory and practice for not-too-small $\varepsilon_S,\varepsilon_B$.</b>
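The clustering mechanism behind these error curves can be illustrated with a small self-contained simulation (a sketch, not the notebook's helpers): classify by the sign of the dominant eigenvector of the $B$-punctured kernel, for a few values of $\varepsilon_B$. With $\ell=50$ far above the detection threshold, the empirical error should stay near zero over a wide range of $\varepsilon_B$:

``` python
import numpy as np

rng = np.random.default_rng(0)
p, n = 200, 400
ell = 50.0
mu = np.sqrt(ell / p) * np.ones(p)           # ||mu||^2 = ell
y = np.repeat([-1.0, 1.0], n // 2)           # balanced +/- classes
X = np.outer(mu, y) + rng.standard_normal((p, n))

errs = []
for eB in [0.1, 0.3, 1.0]:
    B = np.triu(rng.random((n, n)) < eB, 1)
    B = (B + B.T).astype(float)
    K = B * (X.T @ X) / p                    # B-punctured kernel, zero diagonal (b = 0)
    _, V = np.linalg.eigh(K)
    pred = np.sign(V[:, -1])                 # cluster by sign of the top eigenvector
    err = min(np.mean(pred != y), np.mean(pred == y))  # up to a global sign flip
    errs.append(err)
    print(f"eps_B = {eB}: empirical error = {err:.3f}")
```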
**Note:** the GAN data generated and used in the submitted paper is too voluminous to be included in the supplementary material and is not publicly and anonymously available. For this reason, we use below the smaller, publicly available MNIST-fashion real-world dataset. However, **the conclusions obtained for this dataset are very similar to those drawn in the paper for the GAN data.**
%% Cell type:code id: tags:
``` python
from tensorflow.keras.datasets import fashion_mnist
(X, y), _ = fashion_mnist.load_data()
selected_labels = [1, 2]
nb_im = 5  # number of images to display for each class
```
%% Cell type:markdown id: tags:
Empirical classification errors for $2$-class (balanced) MNIST-fashion images (`trouser` vs `pullover`), with $n=512$ (**top**) and $n=2048$ (**bottom**). **The theoretically predicted "plateau" behavior is observed for all $\varepsilon_B$ not too small.**
%% Cell type:code id: tags:
``` python
nbMC = 40  # number of Monte Carlo runs
df_1, _ = get_perf_clustering(n0=256, nbMC=nbMC)
df_2, _ = get_perf_clustering(n0=1024, nbMC=nbMC)
```
%%%% Output: stream
[progression 40/40]
%% Cell type:code id: tags:
``` python
f, ax = plt.subplots(2, 1, figsize=(8, 8), sharex=True)
plot_perf_clustering(df_1, 256, ax[0])   # n0 = 256 experiment
plot_perf_clustering(df_2, 1024, ax[1])  # n0 = 1024 experiment
```
%%%% Output: display_data
[Hidden Image Output]
%% Cell type:markdown id: tags:
#### Figure 8.
Sample vs limiting spectra and dominant eigenvector of $K$ for 2-class MNIST-fashion images (`trouser` vs `pullover`); **(left)** $\varepsilon_S=\varepsilon_B=1$ (error rate $\mathbb P_e=.09$); **(right)** $\varepsilon_S=0.02$, $\varepsilon_B=0.2$ ($\mathbb P_e=.12$). **Surprisingly good fit between the sample and predicted isolated eigenvalue/eigenvector in all cases; as for the spectral measure, the prediction improves significantly as $\varepsilon_S,\varepsilon_B\to 0$.**
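The invariance claimed throughout (identical eigenvector alignment whenever $\varepsilon_S^2\varepsilon_Bc_0^{-1}$ is kept constant) can also be probed on synthetic data with a short self-contained experiment; the sizes, seed, and helper below are illustrative choices, not the notebook's. At these moderate dimensions the two alignments should be close, though finite-size fluctuations remain:

``` python
import numpy as np

def punctured_K(X, eS, eB, rng, b=0.0):
    """Two-way punctured kernel B o ((S o X)^T (S o X))/p with diagonal b."""
    p, n = X.shape
    S = (rng.random((p, n)) < eS).astype(float)
    B = np.triu(rng.random((n, n)) < eB, 1)
    B = (B + B.T).astype(float)
    XS = S * X
    K = B * (XS.T @ XS) / p
    np.fill_diagonal(K, b)
    return K

rng = np.random.default_rng(0)
p, n = 500, 1000                                 # c0 = 1/2
y = np.repeat([-1.0, 1.0], n // 2)               # balanced +/- classes
mu = np.sqrt(40.0 / p) * np.ones(p)              # ||mu||^2 = 40, above threshold
X = np.outer(mu, y) + rng.standard_normal((p, n))

aligns = []
for eS, eB in [(0.2, 1.0), (1.0, 0.04)]:         # eS^2 * eB / c0 = 0.08 in both cases
    K = punctured_K(X, eS, eB, rng)
    _, V = np.linalg.eigh(K)
    align = (V[:, -1] @ y) ** 2 / n              # squared alignment with the class vector
    aligns.append(align)
    print(f"(eS, eB) = ({eS}, {eB}): alignment = {align:.3f}")
```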