Thresholds for the currently known generators have been set empirically according to the tests presented in this section. These tests involve the computation of the intertextual distance presented in~\cite{FakeDuplication}.

For each generator (Scigen, scigen-physics, Mathgen and propgen), a set of 400 texts is used (i.e., 1600 texts in total). For each text, the distance to its nearest neighbour in the sample set is computed. The sample set is composed of an extra 100 texts per generator (i.e., 400 additional texts). The nearest neighbour is always produced by the same generator as the tested text, and columns 1--4 of Table~\ref{table1} show statistical information about the observed distances.

A set of 8200 genuine papers is also used. For each genuine text, the distance to its nearest fake in the sample set is computed. The sample set is still composed of the same 400 texts (100 per generator). For each of the 8200 genuine papers, the nearest fake neighbour belongs to one of the generated sample groups.

The first two rows of Table~\ref{table1} show that, for a genuine paper, the minimal distance to the nearest fake is always greater than the maximal distance between a fake and its nearest neighbour.
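
This observation can be restated more formally. The notation below ($d$, $S$, $F$, $G$, $\delta$ and $\theta$) is introduced here for illustration only and is not taken from~\cite{FakeDuplication}. Let $d$ denote the intertextual distance, $S$ the sample set of 400 generated texts, $F$ the set of 1600 tested generated texts and $G$ the set of 8200 genuine papers. The nearest-neighbour distance of a text $t$ is
\[
  \delta(t) = \min_{s \in S} d(t, s).
\]
The separation described above then reads
\[
  \max_{f \in F} \delta(f) < \min_{g \in G} \delta(g),
\]
so that, on these data, any threshold $\theta$ chosen strictly between the two sides classifies every tested text correctly when $t$ is flagged as generated whenever $\delta(t) < \theta$. Since the nearest neighbour of a generated text always comes from the same generator, the same reading can be applied generator by generator, assuming $S$ and $F$ are restricted accordingly.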

\begin{table}[ht]

\caption{Mean and min-max distances between papers and their nearest neighbour, along with standard deviation and median.}

\url{http://pdos.csail.mit.edu/scigen/} (dir {\tt data/samples/SCIgen}). Figure~\ref{scigen} shows the observed distribution for texts having a Scigen text as nearest fake neighbour.

\caption{Distribution of distances to the nearest \emph{Scigen} neighbour, in blue for a set of \emph{non-scigen} papers and in red for a set of \emph{scigen} papers.}

\label{scigen}

\end{center}

\end{figure}

\paragraph{scigen-physics}

\url{https://bitbucket.org/birkenfeld/scigen-physics} (dir {\tt data/samples/Physgen}). Figure~\ref{phygen} shows the observed distribution for texts having a scigen-physics text as nearest fake neighbour.

\caption{Distribution of distances to the nearest \emph{scigen-physics} neighbour, in blue for a set of \emph{non-scigen-physics} papers and in red for a set of \emph{scigen-physics} papers.}

\label{phygen}

\end{center}

\end{figure}

\paragraph{Mathgen}

\url{http://thatsmathematics.com/mathgen/} (dir {\tt data/samples/Mathgen}). Figure~\ref{mathgen} shows the observed distribution for texts having a mathgen text as nearest fake neighbour.

\caption{Distribution of distances to the nearest \emph{mathgen} neighbour, in blue for a set of \emph{non-mathgen} papers and in red for a set of \emph{mathgen} papers.}

\url{http://www.nadovich.com/chris/randprop/} (dir {\tt data/samples/Propgen}). Figure~\ref{propgen} shows the observed distribution for texts having a randprop text as nearest fake neighbour.

\begin{thebibliography}{1}

\bibitem[1]{FakeDuplication} Cyril Labbé and Dominique Labbé. \emph{Duplicate and fake publications in the scientific literature: how many SCIgen papers in computer science?} Scientometrics 94, no. 1 (2013): 379--396. \url{http://hal.archives-ouvertes.fr/hal-00641906v2/document}.

%\bibitem[label2]{cite_key2} bibliographic information