

\documentclass[10pt,a4paper,titlepage]{article}

%%%%% Packages
\usepackage{fullpage}
\usepackage{url}
\usepackage[T1]{fontenc}
\usepackage{listings}
\usepackage{caption}
\usepackage{titling}

%%%%% Customizing Caption
\captionsetup{labelsep=space,justification=justified,singlelinecheck=off}
%%%%% Customizing itemize
\def\labelitemi{}
%%%%% Customizing Title Page
\pretitle{
\begin{flushleft}\Huge
\rule{\linewidth}{0.5mm}
\vskip 0.5em
}

\posttitle{
\rule{\linewidth}{0.5mm}
\par\end{flushleft}
\vskip 20em
}
\preauthor{\begin{flushright}}
\postauthor{\end{flushright}}
\predate{
\begin{flushleft}
\rule{\linewidth}{0.5mm}
\end{flushleft}
\begin{flushright}\large\scshape}
\postdate{\par\end{flushright}}
%%%%%

\begin{document}
\title{SciDetect Documentation}



\author{
Nguyen Minh Tien \\ 
\texttt{Minh-Tien.nguyen@imag.fr}\\
%\and
Cyril Labb\'e \\
\texttt{first.last@imag.fr}\\
}

\date{March 2015}


%\title{\color{red}Practical Typesetting}
%\author{\color{blue}Name\\ Work}
%\date{\color{green}December 2005}
\maketitle



\begin{table}[ht]
\caption*{Revision History}
\begin{tabular}{|c|c|c|c|}
\hline
\hline
Version & Date & Author & Comment \\ [0.5ex] 
\hline
1.4 &  13-02-2015 & MT & Initial deployment  \\
1.41 &  17-02-2015 & MT & Added support for XML and XTX  \\
2.0 &  25-02-2015 & MT & Added multiple configurable parameters \\
\hline
\end{tabular}
\label{table:nonlin}
\end{table}

\tableofcontents

\newpage

\section{Installation Requirements}

A runnable program for the SciDetect software is implemented in:
\begin{lstlisting}[language=bash]
ScigenChecker_Local.jar
\end{lstlisting} 
 
It can be used as a stand-alone Java program. The program requires {\tt Java SE 6} or higher, together with additional libraries for the PDF converter (included in {\tt lib/}). Furthermore, the configuration file ({\tt config.txt}) and the directories for log files ({\tt logs} and {\tt detaillogs}) are required by the client.


\section{Usage}
\subsection{Command line client-side}

The SciDetect program is packaged as a runnable JAR file. It is started by invoking:
\begin{lstlisting}[language=bash]
$ java -jar ScigenChecker_Local.jar <parameters>
\end{lstlisting} 

Where {\tt <parameters>} stands for a combination of one or more of
the following command line options: 
\begin{itemize}
\item {\tt -c <path\_to\_check>} gives the path to the directory (or file) that needs to be checked;
\item {\tt -l <log\_filename>} gives the path and name of the log file (defaults to {\tt /logs/start\_time.xls});
\item {\tt -d} saves the detail log (optional, off by default).
\end{itemize}
Typical use:
\begin{lstlisting}[language=bash]
$ java -jar ScigenChecker_Local.jar -c /tien/Test_demo -l /tien/Test_log.xls -d
\end{lstlisting} 
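
As a further illustration, a single file can be checked while relying on the defaults for logging. The sketch below uses only the options listed above; the file name {\tt paper.pdf} is hypothetical:
\begin{lstlisting}[language=bash]
$ java -jar ScigenChecker_Local.jar -c /tien/Test_demo/paper.pdf
\end{lstlisting}
Since neither {\tt -l} nor {\tt -d} is given, the log should be written to the default location and no detail log should be saved.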

\subsection{Supported file types}
As of version 2.0, {\tt ScigenChecker\_Local} supports {\tt .PDF} files and two Springer-specific XML formats:
\begin{itemize}
\item {\tt .XML} for the {\tt A++} format;
\item {\tt .XTX} for text extracted from PDF files.
\end{itemize}

\section{Configuration}
A configuration file ({\tt config.txt}) must be accessible to the program. It should be located in the same directory as {\tt ScigenChecker\_Local.jar}. The configuration file contains the following information:

\subsection{Path to sample folder}
\begin{lstlisting}[language=bash]
samples	data/samples
\end{lstlisting}
This sets the directory where samples of texts produced by known generators can be found. This directory contains one subdirectory per \emph{class}, and each subdirectory contains examples that are representative of its class. In a standard release, the {\tt data/samples} directory contains four subdirectories with texts generated by the following generators:
\begin{itemize}
\item \url{http://thatsmathematics.com/mathgen/} (dir {\tt data/samples/Mathgen});
\item \url{https://bitbucket.org/birkenfeld/scigen-physics} (dir {\tt data/samples/Physgen});
\item \url{http://www.nadovich.com/chris/randprop/} (dir {\tt data/samples/Propgen});
\item \url{http://pdos.csail.mit.edu/scigen/} (dir {\tt data/samples/SCIgen}).
\end{itemize}

New subdirectories can be added. This can be done for two purposes:
\begin{enumerate}
\item To add a corpus that represents a particular field fairly well. With appropriate thresholds, this will flag papers that appear to be too far from that field.
\item To cover a new generator: when one appears, new samples (PDF) can be added in a new subdirectory (in {\tt data/samples}) containing a representative corpus of the new class.
\end{enumerate}

\subsection{Threshold configuration}
\begin{lstlisting}[language=bash]
Threshold_Scigen	0.48	0.56
\end{lstlisting}

A line starting with {\tt Threshold\_Dirname} defines the two thresholds used to decide whether a tested text belongs to the class whose examples are found in the directory {\tt Dirname}. There should be one such line (i.e. two thresholds) per class. Both values are real numbers between 0 and 1. The smaller one is the threshold below which the tested paper is assigned (almost certainly) to the class. The larger one is the threshold below which the paper is suspected of containing parts of generated text.

The previous example (for the Scigen class) has the following meaning. Given the distance from the tested text to its nearest neighbour in the set of samples (i.e. the texts found in the Scigen directory):
\begin{itemize}
\item If the distance is greater than 0.56, the text is reasonably likely to be a genuine article.
\item If the distance is between 0.48 and 0.56, there is a chance that the article, or part of it, was generated by Scigen.
\item If the distance is less than 0.48, there is a very high chance that the article was automatically generated by Scigen.
\end{itemize}
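
The three cases above can be summarised as a decision rule. As a sketch, write $d$ for the distance to the nearest neighbour and $t_1 = 0.48$, $t_2 = 0.56$ for the two thresholds (this notation is introduced here for illustration only, and the treatment of the exact boundary values is an assumption):
\[
\mbox{decision}(d) = \left\{
\begin{array}{ll}
\mbox{almost certainly generated} & \mbox{if } d < t_1,\\
\mbox{suspicious} & \mbox{if } t_1 \le d \le t_2,\\
\mbox{probably genuine} & \mbox{if } d > t_2.
\end{array}
\right.
\]
For instance, $d = 0.52$ lies between the two thresholds, so the article would be reported as suspicious, whereas $d = 0.30$ would be assigned to the Scigen class and $d = 0.70$ would be considered genuine.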

If new samples are added to the sample folder, a matching threshold line should also be added to the configuration; otherwise the default threshold values (0.48 and 0.56) are used.
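
As an illustration, assume a new subdirectory named {\tt data/samples/Newgen} has been created. The name {\tt Newgen} is hypothetical, and the values below simply repeat the defaults, so they would probably need tuning for a real corpus. The corresponding configuration line could then read:
\begin{lstlisting}[language=bash]
Threshold_Newgen	0.48	0.56
\end{lstlisting}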

\subsection{Path for log files}
\begin{lstlisting}[language=bash]
Default_log_folder	logs/
Default_detail_log_folder	detaillogs/
\end{lstlisting}

These lines set the default log folder and the default detail log folder (see the section on detail logging below for more information). If the path to a log file is not set (no {\tt -l} parameter), the log file is saved in the default log folder under the name {\tt time\_date.xls} (e.g. {\tt 09:46 25.02.2015.xls} means that the check was started at 9:46 on 25/2/2015).

\subsection{Max text length}

\begin{lstlisting}[language=bash]
Max_length	30000
\end{lstlisting}
This sets the maximum length, in characters (including white space), for a text to be classified as a whole. The parameter is used to avoid misclassification: when an article is too long, its characteristics become too generic and it may be misclassified (without splitting, misclassification rate: ??).

The default value is 30000 characters (about 15 pages). A longer text will be split into several parts, which are tested individually.
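
As a rough worked example, write $n$ for the number of characters in a text and $L$ for the value of {\tt Max\_length} (both symbols are introduced here for illustration). Assume the text is cut into consecutive parts of at most $L$ characters; this is only an assumption, since the exact cut points are not specified here. The number of parts $k$ is then
\[
k = \left\lceil \frac{n}{L} \right\rceil ,
\qquad \mbox{e.g. } n = 70000,\ L = 30000 \ \Rightarrow\ k = \lceil 2.33\ldots \rceil = 3 .
\]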

\section{Make use of detail logging}

The detail log (parameter {\tt -d}) stores the distances from the text under test to all samples in the sample set (i.e. all texts in all directories found in {\tt data/samples}).
This can be used to take a more detailed look at the results.

For example, an article may be returned with a distance to its nearest neighbour that barely passes the threshold. Turning on the detail log for that article and checking the results may help the decision.
%if it is just a rare incident or the distances to other samples  are also suspicious and have a better estimation. 

\end{document}