@@ -80,32 +139,35 @@ Version & Date & Author & Comment \\ [0.5ex]
\section{Installation Requirements}
A runnable program for the SciDetect software is implemented in:
\begin{lstlisting}[language=bash]
A runnable program for the SciDetect software is available in:
\begin{lstlisting}
ScigenChecker_Local.jar
\end{lstlisting}
It can be used as a stand-alone Java program. The program component requires {\tt Java SE 6} or higher, plus additional libraries for the PDF converter (included in {\tt lib/}); furthermore, the configuration file ({\tt config.txt}) and the directories for log files ({\tt logs} and {\tt detaillogs}) are required by the client.
It can be used as a stand-alone Java program. The program component requires {\tt Java SE 6} or higher, plus additional libraries for the PDF converter (included in {\tt lib/}).
Furthermore, the configuration file ({\tt config.txt}) and the directories for log files ({\tt logs} and {\tt detaillogs}) are required by the stand-alone Java program.
\section{Usage}
\subsection{Command line client-side}
The SciDetect program is packaged as a runnable JAR file. It is started by invoking:
\begin{lstlisting}[language=bash]
\begin{lstlisting}
$ java -jar ScigenChecker_Local.jar <parameters>
\end{lstlisting}
where {\tt <parameters>} stands for a combination of one or more of
the following command line options:
\begin{itemize}
\item{\tt-c <path\_to\_check>} gives the path to the directory (or file) that needs to be checked;
\item{\tt-l <log\_filename>} gives the path and name of the log file (defaults to {\tt/logs/start\_time.xls});
\item{\tt-d} saves the detail log (optional, default: false).
\item[]{\tt-c <path\_to\_check>} gives the path to the directory (or file) that needs to be checked;
\item[]{\tt-l <log\_filename>} gives the path and name of the log file (defaults to {\tt/logs/start\_time.xls});
\item[]{\tt-d} saves the detail log (optional, default: false).
@@ -118,6 +180,7 @@ A configuration file ({\tt config.txt}) should be accessible by the program. It
\subsection{Path to sample folder}
\begin{lstlisting}[language=bash]
# Where samples can be found
samples data/samples
\end{lstlisting}
This sets the directory where sample texts produced by known generators can be found. This directory contains one subdirectory per \emph{class}, and each subdirectory contains examples that are representative of its class. In a standard release, the {\tt data/samples} directory contains four subdirectories with texts generated by the following generators:
...
...
@@ -136,6 +199,7 @@ New subdirectories can be added. This can be done for two purpose:
\subsection{Threshold configuration}
\begin{lstlisting}[language=bash]
# Defining Thresholds for Scigen
Threshold_Scigen 0.48 0.56
\end{lstlisting}
...
...
@@ -152,15 +216,24 @@ If new samples are added to the sample folder, the threshold configuration shoul
\subsection{Path for log files}
\begin{lstlisting}[language=bash]
# Set the default path for log files
Default_log_folder logs/
Default_detail_log_folder detaillogs/
\end{lstlisting}
These lines are used to set the default log folder and the default detail log folder (see section ?? for more information). If the path to a log file is not set (no {\tt -l} parameter), the log file is saved in the default log folder under the name {\tt time\_date.xls} (e.g. {\tt 09:46 25.02.2015.xls} means the check was started at 9:46 on 25/2/2015).
These lines are used to set the default log folder and the default detail log folder (see section~\ref{detaillog} for more information). If the path to a log file is not set (no {\tt -l} parameter), the log file is saved in the default log folder under the name {\tt time\_date.xls} (e.g. {\tt 09:46 25.02.2015.xls} means the check was started at 9:46 on 25/2/2015).
INDEX-53.txt is a Scigen 0.34236384 data/samples/Scigen/INDEX-scigen25.txt
INDEX-53.txt is a Physgen 0.47908222 data/samples/Physgen/INDEX-physgen7.txt
INDEX-011.txt is Genuine 0.60918242 data/samples/Scigen/INDEX-scigen41.txt
INDEX-013.txt is Genuine 0.61375975 data/samples/Scigen/INDEX-scigen25.txt
\end{lstlisting}
\subsection{Max text length}
\begin{lstlisting}[language=bash]
# Max_length is the maximum size of a text
Max_length 30000
\end{lstlisting}
This sets the maximum length in characters (including whitespace characters) for a text to be eligible for classification. This parameter is used to avoid misclassification: when an article is too long, its characteristics become too generic, and very long papers may be misclassified (misclassification rate without splitting: ??).
...
...
@@ -168,12 +241,43 @@ This set the max length in character (including white space char) for a text to
The default value is 30000 characters (about 15 pages). A longer text is split into several parts, which are tested individually.
\section{Make use of detail logging}
\label{detaillog}
The detail log (parameter {\tt -d}) stores all the distances from the text under test to all other samples in the sample set (i.e. all texts in all directories found at {\tt data/samples}).
This can be used to get a more detailed look at the results.
For example, an article may be returned with a distance to its nearest neighbour that barely passes the threshold. Turning on the detail log for that article and checking the results may help with the decision.
%if it is just a rare incident or the distances to other samples are also suspicious and have a better estimation.
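
As a minimal sketch of this workflow, assume that the borderline article is stored at the hypothetical path {\tt papers/suspect.pdf} (this path is only an illustration and is not part of the release). The check can then be rerun on that single file with the detail log enabled, using only the {\tt -c} and {\tt -d} options described in the usage section:
\begin{lstlisting}[language=bash]
# Hypothetical example: papers/suspect.pdf is a placeholder path.
# -c selects the file to check, -d turns on the detail log.
$ java -jar ScigenChecker_Local.jar -c papers/suspect.pdf -d
\end{lstlisting}
With the configuration shown earlier, the detail log should end up in the default detail log folder ({\tt detaillogs/}), where the distances to the other samples can be inspected.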