

%%%%% Packages

%%%%% Customizing Caption
%%%%% Customizing itemize
%%%%% Customizing Title Page
\vskip 0.5em

\vskip 20em

\title{SciDetect Documentation}

\author{Nguyen Minh Tien \\
Cyril Labb\'e}

\date{March 2015}

%\title{\color{red}Practical Typesetting}
%\author{\color{blue}Name\\ Work}
%\date{\color{green}December 2005}

\begin{table}[ht]\caption*{Revision History}\begin{tabular}{|c|c|c|c|}\hline
Version & Date & Author & Comment \\ [0.5ex] \hline
1.4 &  13-02-2015 & MT & Initial deployment  \\
1.41 &  17-02-2015 & MT & Added support for XML and XTX  \\
2.0 &  25-02-2015 & MT & Added multiple configurable parameters \\\hline\end{tabular}\label{table:nonlin}\end{table}



\section{Installation Requirements}

A runnable program for the SciDetect software is implemented in:
It can be used as a stand-alone Java program. The program requires {\tt Java SE 6} or higher, together with additional libraries for the PDF converter (included in {\tt lib/}). The client also requires the configuration file ({\tt config.txt}) and the directories for log files ({\tt logs} and {\tt detaillogs}).

\subsection{Command line client-side}

The SciDetect program is packaged as a runnable JAR file. It is started by invoking:
\begin{verbatim}
$ java -jar ScigenChecker_Local.jar <parameters>
\end{verbatim}

where {\tt <parameters>} stands for a combination of one or more of
the following command line options:
\begin{itemize}
\item {\tt -c <path\_to\_check>} gives the path to the directory (or file) that needs to be checked;
\item {\tt -l <log\_filename>} gives the path and name of the log file (defaults to {\tt /logs/start\_time.xls});
\item {\tt -d} saves the detail log (optional, disabled by default).
\end{itemize}
Typical use:
\begin{verbatim}
$ java -jar ScigenChecker_Local.jar -c /tien/Test_demo -l /tien/Test_log.xls -d
\end{verbatim}
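When the checker is driven from a script, the options above can be assembled programmatically. The following is a minimal Python sketch, not part of SciDetect: the helper name {\tt build\_command} and its arguments are hypothetical, and it assumes that {\tt -c} is always supplied.

```python
from typing import List, Optional


def build_command(path_to_check: str,
                  log_filename: Optional[str] = None,
                  detail: bool = False,
                  jar: str = "ScigenChecker_Local.jar") -> List[str]:
    """Assemble an argument list using the documented options -c, -l and -d.

    Hypothetical helper: only the option names come from the documentation.
    """
    command = ["java", "-jar", jar, "-c", path_to_check]
    if log_filename is not None:
        # Without -l the program falls back to its default log location.
        command += ["-l", log_filename]
    if detail:
        # -d takes no value; its presence turns the detail log on.
        command.append("-d")
    return command


if __name__ == "__main__":
    # Reproduces the "typical use" command line shown above.
    print(" ".join(build_command("/tien/Test_demo", "/tien/Test_log.xls", detail=True)))
```

The returned list can be handed to {\tt subprocess.run} on a machine where Java and the JAR file are available.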

\subsection{Supported file types}
As of version 2.0, {\tt ScigenChecker\_Local} supports {\tt .PDF} files and two specific Springer XML formats, namely:
\begin{itemize}
\item {\tt .XML} for the {\tt A++} format;
\item {\tt .XTX} for the extraction of PDF files.
\end{itemize}

A configuration file ({\tt config.txt}) must be accessible to the program. It should be located in the same directory as {\tt ScigenChecker\_Local.jar}. The configuration file contains the following information:

\subsection{Path to sample folder}
\begin{verbatim}
samples	data/samples
\end{verbatim}
This sets the directory containing samples of texts produced by known generators. It holds one subdirectory per \emph{class}, and each subdirectory contains examples that are representative of its class. In a standard release, the {\tt data/samples} directory contains four subdirectories with texts generated by the following generators:
\begin{itemize}
\item \url{} (dir {\tt data/samples/Mathgen});
\item \url{} (dir {\tt data/samples/Physgen});
\item \url{} (dir {\tt data/samples/Propgen});
\item \url{} (dir {\tt data/samples/SCIgen}).
\end{itemize}

New subdirectories can be added. This can be done for two purposes:
\begin{itemize}
\item To add a corpus that fairly represents a particular field. With an appropriate threshold, this flags papers that appear to be too far from that field.
\item To cover a new generator: if one appears, new samples (PDF) can be added in a new subdirectory (in {\tt data/samples}) containing a representative corpus of the new class.
\end{itemize}
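Since each subdirectory of the sample folder stands for one class, the set of classes can be read straight from the file system. The Python sketch below illustrates that layout only. The function name {\tt list\_classes} is hypothetical, and how SciDetect itself enumerates classes is not stated here.

```python
from pathlib import Path
from typing import List


def list_classes(samples_dir: str = "data/samples") -> List[str]:
    """Return the class names, i.e. the subdirectories of the sample folder.

    Assumption: one directory per class, as described above. Plain files at
    the top level of the sample folder are ignored.
    """
    root = Path(samples_dir)
    return sorted(entry.name for entry in root.iterdir() if entry.is_dir())
```

After adding a subdirectory for a new class, it should show up in the returned list.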

\subsection{Threshold configuration}
\begin{verbatim}
Threshold_Scigen	0.48	0.56
\end{verbatim}

A line starting with {\tt Threshold\_Dirname} defines the thresholds used to decide whether a tested text is assigned to the class whose examples are found in the directory {\tt Dirname}. There should be one line (i.e. two thresholds) per class. The values are two real numbers between 0 and 1. The smaller one is the threshold below which the tested paper is (almost certainly) assigned to the class. The larger one is the threshold below which the paper is suspected of containing parts of generated text.

The previous example (concerning the Scigen class) has the following meaning. Given the distance from the tested text to its nearest neighbor in the set of samples (i.e. the texts found in the Scigen directory):
\begin{itemize}
\item If the distance is greater than 0.56, it is reasonable to believe that this is a genuine article.
\item If the distance is between 0.48 and 0.56, there is a chance that this article, or part of it, is Scigen generated.
\item If the distance is less than 0.48, there is a very high chance that this is an automatically generated Scigen article.
\end{itemize}
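The three cases can be written as a small decision function. This is a Python sketch of the rule as described, not SciDetect's actual code: the function name and the returned labels are hypothetical, and assigning distances exactly equal to a threshold to the middle band is an assumption.

```python
def interpret_distance(distance: float, low: float = 0.48, high: float = 0.56) -> str:
    """Map a nearest-neighbor distance to one of three verdicts.

    low and high are the two values of a Threshold_<Dirname> line.
    Assumption: a distance equal to either threshold counts as "suspicious".
    """
    if distance < low:
        return "generated"   # very high chance of an automatically generated text
    if distance <= high:
        return "suspicious"  # the text, or part of it, may be generated
    return "genuine"         # far enough from every sample of the class
```

For instance, with the Scigen thresholds a distance of 0.50 falls in the middle band and is reported as suspicious.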

If new samples are added to the sample folder, a matching threshold line should also be added to the configuration. Otherwise, the default threshold values are used (?? and ??).

\subsection{Path for log files}
\begin{verbatim}
Default_log_folder	logs/
Default_detail_log_folder	detaillogs/
\end{verbatim}

These lines set the default log folder and the default detail log folder (see section ?? for more information). If the path to a log file is not set (no {\tt -l} parameter), the log file is saved in the default log folder under the name {\tt time\_date.xls} (e.g. {\tt 09:46 25.02.2015.xls} means the check was started at 9:46 on 25/2/2015).
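The configuration lines share a simple shape: a key followed by one or more whitespace-separated values. A minimal Python sketch of a reader for that shape is given below. It is an illustration only: the name {\tt parse\_config} is hypothetical, SciDetect's own parser may behave differently, and paths containing spaces are not handled.

```python
from typing import Dict, Tuple, Union

ConfigValue = Union[str, Tuple[float, ...]]


def parse_config(text: str) -> Dict[str, ConfigValue]:
    """Parse key/value lines such as those shown for config.txt.

    Assumptions: fields are separated by tabs or spaces, and every key that
    starts with "Threshold_" carries numeric values. Other values stay strings.
    """
    settings: Dict[str, ConfigValue] = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        key, values = fields[0], fields[1:]
        if key.startswith("Threshold_"):
            settings[key] = tuple(float(value) for value in values)
        else:
            settings[key] = values[0] if values else ""
    return settings
```

Applied to the lines shown in this document, {\tt Threshold\_Scigen} maps to the pair of its two thresholds and {\tt Default\_log\_folder} to the string {\tt logs/}.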

\subsection{Max text length}

\begin{verbatim}
Max_length	30000
\end{verbatim}
This sets the maximum length in characters (including whitespace) for a text to be eligible for classification. This parameter is used to avoid misclassification: when an article is too long, its characteristics become too generic and it may be misclassified (without splitting misclassification rate: ??).

The default value is 30000 characters (about 15 pages). A longer text is split into several parts, which are tested individually.
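The splitting step can be pictured as cutting the text into consecutive pieces of at most {\tt Max\_length} characters. The Python sketch below shows this simplest possible strategy. It is an assumption: the text above does not say where SciDetect places the cuts (for example at sentence or paragraph boundaries), and the function name is hypothetical.

```python
from typing import List


def split_text(text: str, max_length: int = 30000) -> List[str]:
    """Cut text into consecutive parts of at most max_length characters.

    Whitespace counts towards the length, as stated for Max_length.
    Assumption: fixed-size cuts, with no regard for sentence boundaries.
    """
    if max_length <= 0:
        raise ValueError("max_length must be positive")
    return [text[start:start + max_length]
            for start in range(0, len(text), max_length)]
```

Each returned part would then be classified on its own, and a text no longer than the limit comes back as a single part.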

\section{Make use of detail logging}

The detail log (parameter {\tt -d}) stores the distances from the text under test to all samples in the sample set (i.e. all texts in all directories found in {\tt data/samples}).
This can be used to take a more detailed look at the results.

For example, suppose an article is returned with a distance to its nearest neighbor that barely passes the threshold. Turning on the detail log for that article and checking the results may help with the decision.
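One way to use the detail log in such a borderline case is to look at the nearest sample in each class, not only the overall nearest neighbor. The Python sketch below assumes the distances have already been loaded into a dictionary keyed by class and sample name. That in-memory layout and the function name are hypothetical, since the file format of the detail log is not described here.

```python
from typing import Dict, Tuple


def nearest_per_class(
    distances: Dict[Tuple[str, str], float]
) -> Dict[str, Tuple[str, float]]:
    """Return, for each class, the closest sample and its distance.

    distances maps (class name, sample name) to the distance from the text
    under test. This layout is an assumption made for illustration.
    """
    best: Dict[str, Tuple[str, float]] = {}
    for (class_name, sample_name), distance in distances.items():
        if class_name not in best or distance < best[class_name][1]:
            best[class_name] = (sample_name, distance)
    return best
```

If several samples of one class lie close to the threshold, the case is more suspicious than a single isolated near match.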
%if it is just a rare incident or the distances to other samples  are also suspicious and have a better estimation.