\item The samples directory ({\tt data})
\item The directories for log files ({\tt logs} and {\tt detaillogs}), which are required by the stand-alone Java program.
\end{itemize}
\section{Usage}
...
...
INDEX-011.txt is Genuine 0.60918242 data/samples/Scigen/INDEX-scigen41.txt
INDEX-013.txt is Genuine 0.61375975 data/samples/Scigen/INDEX-scigen25.txt
\end{lstlisting}
\subsection{Max-Min text length}
\begin{lstlisting}[language=bash]
# Max_length is the maximum size of a text (in characters)
Max_length 30000
# Min_length is the minimum size of a text (in characters)
Min_length 10000
\end{lstlisting}
These parameters set the maximum and minimum length, in characters (including whitespace), for a text to be eligible for classification. They are used to avoid misclassification: when an article is too long, its characteristics become too generic, and very long papers may be misclassified (without splitting, the misclassification rate is 0.13\%, or 42 misclassifications out of 31577 samples). When an article is shorter than {\tt Min\_length}, it is marked as not classifiable.
The default value for {\tt Max\_length} is 30000 characters (about 10 pages); a longer text is split into several parts, which are tested individually. The default value for {\tt Min\_length} is 10000 characters.
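
As a compact summary of the rule described above, a text of $n$ characters is handled as follows. This is only a sketch: it assumes that both bounds are inclusive, since the text does not say how a length exactly equal to {\tt Min\_length} or {\tt Max\_length} is treated, and it does not say how many parts a long text is split into.
\[
\mbox{handling}(n) =
\left\{
\begin{array}{ll}
\mbox{not classifiable} & \mbox{if } n < \mbox{\tt Min\_length} \\
\mbox{classified as a whole} & \mbox{if } \mbox{\tt Min\_length} \leq n \leq \mbox{\tt Max\_length} \\
\mbox{split, each part tested individually} & \mbox{if } n > \mbox{\tt Max\_length}
\end{array}
\right.
\]
With the default values, a text of 8000 characters would therefore not be classified, a text of 20000 characters would be classified as a whole, and a text of 45000 characters would be split. The misclassification rate quoted above follows directly from the counts: $42 / 31577 \approx 0.00133$, i.e. about $0.13\%$.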