Commit e3324912 authored by Jonathan Schaeffer

remote cleanup tool

parents 9a569002 a1e1b442
......@@ -19,33 +19,22 @@ Configuration is done through environment variables:
* `RESIFDD_KEYFILE`: if this variable points to a valid file, that file is used to transfer the data corresponding to the keys it lists.
## Examples
### Run the full dump
### Invoke the tool
Start the transfer of all data from 2009 onward
``` shell
RESIFDD_WORKDIR=/osug-dc/resif RESIFDD_DATADIR=/scratch/resifdumper RESIFDD_START_AT=2009 src/resifdatadump-parallel
RESIFDD_WORKDIR=/osug-dc/resif RESIFDD_DATADIR=/scratch/resifdumper resifdatadump
```
Transfer the data listed in the `RESIFDD_KEYFILE` file:
### Tool-specific options
Back up a list of stations:
``` shell
RESIFDD_WORKDIR=/osug-dc/resif RESIFDD_DATADIR=/scratch/resifdumper RESIFDD_KEYFILE=/scratch/resifdumper/keys.txt src/resifdatadump-parallel
RESIFDD_DATADIR=/osug-dc/resif RESIFDD_WORKDIR=/scratch/resif_datadump src/resifdatadump 2011/RA/NCAD 2012/MT/THE
```
The file must contain one key per line, as reported in the logs:
``` textfile
2018_RA_CGBP
2018_FR_RUSF
2018_RA_NCAD
2018_RA_PYTO
2016_FR_RUSF
2017_RA_PYTO
2017_RA_CGBP
2017_FR_RUSF
2016_MT_CLP2
2016_RA_NCAD
```
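As an illustration of the key format, here is a minimal Python sketch that reads such a key file and maps each `YEAR_NET_STATION` key to the `YEAR/NET/STATION` relative path used in the archive. The function names `read_keys` and `key_to_path` are hypothetical and not part of the repository; the underscore-to-slash mapping mirrors the `tr '_' '/'` call in the shell scripts.

```python
from pathlib import PurePosixPath
from typing import List


def read_keys(text: str) -> List[str]:
    """Return the non-empty lines of a key file, one key per line."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def key_to_path(key: str) -> PurePosixPath:
    """Turn a key such as '2018_RA_CGBP' into the relative path 2018/RA/CGBP."""
    year, network, station = key.split("_")
    return PurePosixPath(year, network, station)


# Sample content, taken from the key list above (blank lines are ignored)
sample = "2018_RA_CGBP\n2018_FR_RUSF\n\n2016_MT_CLP2\n"
paths = [str(key_to_path(key)) for key in read_keys(sample)]
print(paths)
```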
Back up the metadata
``` shell
RESIFDD_DATADIR=/osug-dc/resif RESIFDD_WORKDIR=/scratch/resif_datadump src/resifdatadump validated_seismic_metadata
```
A key file can be generated with the Python script `src/scan_dupms.py`.
A key file can be generated with the Python script `src/scan_dupms.py`, which
# This script scans the remote iRODS archive.
# For each element returned (which is data):
# - if it exists in validated_seismic_data, keep it
# - otherwise, look for it in cold_validated_seismic_data
# - otherwise, remove the directory from iRODS
RESIF_DATA_DIR=/osug-dc/resif
ARCHIVE_DIRS="$RESIF_DATA_DIR/validated_seismic_data $RESIF_DATA_DIR/cold_validated_seismic_data"
IRODS_DATA=$(ils | awk -F'/' '/\/[0-9A-Z]+_[0-9A-Z]+_[0-9A-Z]+$/ {print $NF}')
for key in $IRODS_DATA; do
    data_path=$(echo "$key" | tr '_' '/')
    data_path_exists=0
    for dir in $ARCHIVE_DIRS; do
        if [[ -d $dir/$data_path ]]; then
            data_path_exists=1
            break
        fi
    done
    if [[ $data_path_exists -eq 0 ]]; then
        # Quoted so that the trailing "?" is not expanded as a glob pattern
        echo "$data_path not found on local archive. What now?"
        echo "irm $key ?"
    fi
done
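The decision made by the loop above can be sketched in Python as follows. This is only an illustration under stated assumptions: `find_orphan_keys` is a hypothetical name, the list of remote keys is passed in directly instead of being read from `ils`, and the demo runs on a throwaway directory tree, not on `/osug-dc/resif`. Like the shell script, it only reports candidates and deletes nothing.

```python
import os
import tempfile


def find_orphan_keys(remote_keys, archive_dirs):
    """Return the keys whose YEAR/NET/STATION directory exists in none of archive_dirs."""
    orphans = []
    for key in remote_keys:
        rel_path = key.replace("_", "/")
        found = any(os.path.isdir(os.path.join(d, rel_path)) for d in archive_dirs)
        if not found:
            orphans.append(key)
    return orphans


# Demo on a temporary tree standing in for the two local archives
with tempfile.TemporaryDirectory() as root:
    hot = os.path.join(root, "validated_seismic_data")
    cold = os.path.join(root, "cold_validated_seismic_data")
    os.makedirs(os.path.join(hot, "2018", "RA", "CGBP"))
    os.makedirs(os.path.join(cold, "2016", "MT", "CLP2"))
    orphans = find_orphan_keys(
        ["2018_RA_CGBP", "2016_MT_CLP2", "2017_FR_RUSF"], [hot, cold]
    )

for key in orphans:
    print(f"{key} not found on local archive")
```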
irm reports/2019-07-16.csv
irm reports/2019-07-17.csv
irm reports/20190718-0507.csv
irm reports/20190718-0509.csv
irm reports/20190719-0630.csv
irm reports/20190719-0641.csv
irm reports/20190719-0642.csv
irm reports/20190719-0643.csv
irm reports/20190719-0711.csv
irm reports/20190719-0712.csv
irm reports/20190719-1216.csv
irm reports/20190719-1235.csv
irm reports/20190719-1444.csv
irm reports/20190721-1245.csv
irm reports/20190721-1944.csv
irm reports/20190722-0422.csv
irm reports/20190722-0743.csv
irm reports/20190722-0821.csv
irm reports/20190722-0904.csv
irm reports/20190722-0930.csv
irm reports/20190803-2302.csv
irm reports/20190804-0002.csv
irm reports/20190804-0102.csv
irm reports/20190804-0202.csv
irm reports/20190804-0302.csv
irm reports/20190804-0402.csv
irm reports/20190804-0502.csv
irm reports/20190804-0602.csv
irm reports/20190804-0802.csv
irm reports/20190804-0902.csv
irm reports/20190804-1002.csv
irm reports/20190804-1102.csv
irm reports/20190804-1202.csv
irm reports/20190804-1302.csv
irm reports/20190804-1403.csv
irm reports/20190804-1503.csv
irm reports/20190804-1603.csv
irm reports/20190804-1703.csv
irm reports/20190804-1803.csv
irm reports/20190804-1903.csv
irm reports/20190804-2003.csv
irm reports/20190804-2103.csv
irm reports/20190804-2203.csv
irm reports/20190804-2303.csv
irm reports/20190805-0003.csv
irm reports/20190805-0103.csv
irm reports/20190805-0203.csv
irm reports/20190805-0303.csv
irm reports/20190805-0403.csv
irm reports/20190805-0503.csv
irm reports/20190805-0603.csv
irm reports/20190805-0703.csv
irm reports/20190805-0803.csv
irm reports/20190805-0903.csv
irm reports/20190805-1003.csv
irm reports/20190805-1103.csv
irm reports/20190805-1203.csv
irm reports/20190805-1303.csv
irm reports/20190805-1403.csv
irm reports/20190805-1503.csv
irm reports/20190805-1603.csv
irm reports/20190805-1703.csv
irm reports/20190805-1803.csv
irm reports/20190805-1903.csv
irm reports/20190805-2003.csv
irm reports/20190805-2104.csv
irm reports/20190805-2204.csv
irm reports/20190805-2304.csv
irm reports/20190806-0004.csv
irm reports/20190806-0104.csv
irm reports/20190806-0204.csv
irm reports/20190806-0304.csv
irm reports/20190806-0404.csv
irm reports/20190806-0504.csv
irm reports/20190806-0604.csv
irm reports/20190806-0704.csv
irm reports/20190806-0804.csv
irm reports/20190806-0904.csv
irm reports/20190806-1004.csv
irm reports/20190806-1104.csv
irm reports/20190806-1204.csv
irm reports/20190806-1304.csv
irm reports/20190806-1404.csv
irm reports/20190806-1504.csv
irm reports/20190806-1604.csv
irm reports/20190806-1704.csv
irm reports/20190806-1804.csv
irm reports/20190806-1904.csv
irm reports/20190806-2004.csv
irm reports/20190806-2104.csv
irm reports/20190806-2204.csv
irm reports/20190806-2304.csv
irm reports/20190807-0004.csv
irm reports/20190807-0104.csv
irm reports/20190807-0204.csv
irm reports/20190807-0304.csv
irm reports/20190807-0404.csv
irm reports/20190807-0504.csv
irm reports/20190807-0604.csv
irm reports/20190807-0704.csv
irm reports/20190807-0804.csv
irm reports/20190807-0905.csv
irm reports/20190807-1005.csv
irm reports/20190807-1105.csv
irm reports/20190807-1205.csv
irm reports/20190807-1305.csv
irm reports/20190807-1405.csv
irm reports/20190807-1505.csv
irm reports/20190807-1605.csv
irm reports/20190807-1705.csv
irm reports/20190807-1805.csv
irm reports/20190807-1905.csv
irm reports/20190807-2005.csv
irm reports/20190807-2105.csv
irm reports/20190807-2205.csv
irm reports/20190807-2307.csv
irm reports/20190808-0007.csv
irm reports/20190808-0107.csv
irm reports/20190808-0207.csv
irm reports/20190808-0307.csv
irm reports/20190808-0407.csv
irm reports/20190808-0507.csv
irm reports/20190808-0607.csv
irm reports/20190808-0707.csv
irm reports/20190808-0807.csv
irm reports/20190808-0907.csv
irm reports/20190808-1007.csv
irm reports/20190808-1107.csv
irm reports/20190808-1207.csv
irm reports/20190808-1307.csv
irm reports/20190808-1407.csv
irm reports/20190808-1507.csv
irm reports/20190808-1607.csv
irm reports/20190808-1707.csv
irm reports/20190808-1807.csv
irm reports/20190808-1907.csv
irm reports/20190808-2007.csv
irm reports/20190808-2107.csv
irm reports/20190808-2207.csv
irm reports/20190808-2307.csv
irm reports/20190809-0007.csv
irm reports/20190809-0107.csv
irm reports/20190809-0208.csv
irm reports/20190809-0308.csv
irm reports/20190809-0408.csv
irm reports/20190809-0508.csv
irm reports/20190809-0608.csv
irm reports/20190809-0708.csv
irm reports/20190809-0808.csv
irm reports/20190809-0908.csv
irm reports/20190809-1008.csv
irm reports/20190809-1108.csv
irm reports/20190809-1208.csv
irm reports/20190809-1308.csv
irm reports/20190809-1408.csv
irm reports/20190809-1508.csv
irm reports/20190809-1608.csv
irm reports/20190809-1708.csv
irm reports/20190809-1808.csv
irm reports/20190809-1908.csv
irm reports/20190809-2008.csv
irm reports/20190809-2108.csv
irm reports/20190809-2208.csv
irm reports/20190809-2308.csv
irm reports/20190810-0008.csv
irm reports/20190810-0108.csv
irm reports/20190810-0208.csv
irm reports/20190810-0308.csv
irm reports/20190810-0408.csv
irm reports/20190810-0508.csv
irm reports/20190810-0608.csv
irm reports/20190810-0708.csv
irm reports/20190810-0808.csv
irm reports/20190810-0908.csv
irm reports/20190810-1008.csv
irm reports/20190810-1108.csv
irm reports/20190810-1209.csv
irm reports/20190810-1309.csv
irm reports/20190810-1409.csv
irm reports/20190810-1509.csv
irm reports/20190810-1609.csv
irm reports/20190810-1709.csv
irm reports/20190810-1809.csv
irm reports/20190810-1909.csv
irm reports/20190810-2009.csv
irm reports/20190810-2109.csv
irm reports/20190810-2209.csv
irm reports/20190810-2309.csv
irm reports/20190811-0009.csv
irm reports/20190811-0109.csv
irm reports/20190811-0209.csv
irm reports/20190811-0309.csv
irm reports/20190811-0409.csv
irm reports/20190811-0509.csv
irm reports/20190811-0609.csv
irm reports/20190811-0709.csv
irm reports/20190811-0809.csv
irm reports/20190811-0909.csv
irm reports/20190811-1009.csv
irm reports/20190811-1109.csv
irm reports/20190811-1209.csv
irm reports/20190811-1309.csv
irm reports/20190811-1409.csv
irm reports/20190811-1509.csv
irm reports/20190811-1609.csv
irm reports/20190811-1709.csv
irm reports/20190811-1809.csv
irm reports/20190811-1909.csv
irm reports/20190811-2009.csv
irm reports/20190811-2109.csv
irm reports/20190811-2209.csv
irm reports/20190811-2309.csv
irm reports/20190812-0009.csv
irm reports/20190812-0109.csv
irm reports/20190812-0209.csv
irm reports/20190812-0309.csv
irm reports/20190812-0409.csv
irm reports/20190812-0509.csv
irm reports/20190812-0609.csv
irm reports/20190812-0709.csv
irm reports/20190812-0809.csv
irm reports/20190812-0909.csv
irm reports/20190812-1010.csv
irm reports/20190812-1110.csv
irm reports/20190812-1210.csv
irm reports/20190812-1310.csv
irm reports/20190812-1410.csv
irm reports/20190812-1510.csv
irm reports/20190812-1610.csv
irm reports/20190812-1710.csv
irm reports/20190812-1810.csv
irm reports/20190812-1910.csv
irm reports/20190812-2010.csv
irm reports/20190812-2110.csv
irm reports/20190812-2210.csv
irm reports/20190812-2310.csv
irm reports/20190813-0010.csv
irm reports/20190813-0110.csv
irm reports/20190813-0210.csv
irm reports/20190813-0310.csv
irm reports/20190813-0410.csv
irm reports/20190813-0510.csv
irm reports/20190813-0610.csv
irm reports/20190813-0710.csv
irm reports/20190813-0810.csv
irm reports/20190813-0910.csv
irm reports/20190813-1010.csv
irm reports/20190813-1110.csv
irm reports/20190813-1210.csv
irm reports/20190813-1310.csv
irm reports/20190813-1410.csv
irm reports/20190813-1510.csv
irm reports/20190813-1610.csv
irm reports/20190813-1710.csv
irm reports/20190813-1810.csv
irm reports/20190813-1910.csv
irm reports/20190813-2010.csv
irm reports/20190813-2110.csv
irm reports/20190813-2210.csv
irm reports/20190813-2310.csv
irm reports/20190814-0011.csv
irm reports/20190814-0111.csv
irm reports/20190814-0211.csv
irm reports/20190814-0311.csv
irm reports/20190814-0411.csv
irm reports/20190814-0511.csv
irm reports/20190814-0611.csv
irm reports/20190814-0711.csv
irm reports/20190814-0811.csv
irm reports/20190814-0911.csv
irm reports/20190814-1011.csv
irm reports/20190814-1111.csv
irm reports/20190814-1211.csv
irm reports/20190814-1311.csv
irm reports/20190814-1411.csv
irm reports/20190814-1511.csv
irm reports/20190814-1612.csv
irm reports/20190814-1712.csv
irm reports/20190814-1812.csv
irm reports/20190814-1912.csv
irm reports/20190814-2012.csv
irm reports/20190814-2112.csv
irm reports/20190814-2212.csv
irm reports/20190814-2312.csv
irm reports/20190815-0012.csv
irm reports/20190815-0112.csv
irm reports/20190815-0212.csv
irm reports/20190815-0312.csv
irm reports/20190815-0412.csv
irm reports/20190815-0513.csv
irm reports/20190815-0613.csv
irm reports/20190815-0713.csv
irm reports/20190815-0813.csv
irm reports/20190815-0913.csv
irm reports/20190815-1013.csv
irm reports/20190815-1113.csv
irm reports/20190815-1213.csv
irm reports/20190815-1313.csv
irm reports/20190815-1413.csv
irm reports/20190815-1513.csv
irm reports/20190815-1613.csv
irm reports/20190815-1714.csv
irm reports/20190815-1814.csv
irm reports/20190815-1914.csv
irm reports/20190815-2014.csv
irm reports/20190815-2114.csv
irm reports/20190815-2214.csv
irm reports/20190815-2314.csv
irm reports/20190816-0014.csv
irm reports/20190816-0114.csv
irm reports/20190816-0214.csv
irm reports/20190816-0314.csv
irm reports/20190816-0414.csv
irm reports/20190816-0514.csv
irm reports/20190816-0614.csv
irm reports/20190816-0714.csv
irm reports/20190816-0814.csv
irm reports/20190816-0914.csv
irm reports/20190816-1014.csv
irm reports/20190816-1115.csv
irm reports/20190816-1215.csv
irm reports/20190816-1315.csv
irm reports/20190816-1415.csv
irm reports/20190816-1515.csv
irm reports/20190816-1615.csv
irm reports/20190816-1715.csv
irm reports/20190816-1815.csv
irm reports/20190816-1915.csv
irm reports/20190816-2015.csv
irm reports/20190816-2115.csv
irm reports/20190816-2215.csv
irm reports/20190816-2315.csv
irm reports/20190817-0015.csv
irm reports/20190817-0115.csv
irm reports/20190817-0215.csv
irm reports/20190817-0315.csv
irm reports/20190817-0416.csv
irm reports/20190817-0516.csv
irm reports/20190817-0616.csv
irm reports/20190817-0716.csv
irm reports/20190817-0816.csv
irm reports/20190817-0916.csv
irm reports/20190817-1016.csv
irm reports/20190817-1116.csv
irm reports/20190817-1216.csv
irm reports/20190817-1316.csv
irm reports/20190817-1416.csv
irm reports/20190817-1516.csv
irm reports/20190817-1616.csv
irm reports/20190817-1716.csv
irm reports/20190817-1816.csv
irm reports/20190817-1916.csv
irm reports/20190817-2016.csv
irm reports/20190817-2116.csv
irm reports/20190817-2216.csv
irm reports/20190817-2316.csv
irm reports/20190818-0016.csv
irm reports/20190818-0116.csv
irm reports/20190818-0216.csv
irm reports/20190818-0316.csv
irm reports/20190818-0416.csv
irm reports/20190818-0516.csv
irm reports/20190818-0616.csv
irm reports/20190818-0716.csv
irm reports/20190818-0816.csv
irm reports/20190818-0916.csv
irm reports/20190818-1016.csv
irm reports/20190818-1116.csv
irm reports/20190818-1216.csv
irm reports/20190818-1316.csv
irm reports/20190818-1416.csv
irm reports/20190818-1516.csv
irm reports/20190818-1616.csv
irm reports/20190818-1716.csv
irm reports/20190818-1817.csv
irm reports/20190818-1917.csv
irm reports/20190818-2017.csv
irm reports/20190818-2117.csv
irm reports/20190818-2217.csv
irm reports/20190818-2317.csv
irm reports/20190819-0017.csv
irm reports/20190819-0117.csv
irm reports/20190819-0217.csv
irm reports/20190819-0317.csv
irm reports/20190819-0417.csv
irm reports/20190819-0517.csv
irm reports/20190819-0617.csv
irm reports/20190819-0717.csv
......@@ -256,7 +256,7 @@ if [[ -r ${RESIFDD_CONTINUE_FROM_FILE} ]]; then
RECOVERY_FILE=$RESIFDD_WORKDIR/recovery.$$
echo "Now using $RESIFDD_WORKDIR/recovery.$$ as recovery file"
else
echo "No recovery file present. Dumping everything now"
echo "No recovery file set"
fi
# Header for the report :
......
#!/usr/bin/python3
"""
This script compares the local archive with what is on iRODS.
It only compares directory names.
If a directory exists locally but not on iRODS, the script adds it
to the list of missing directories.
"""
import subprocess
import re
......@@ -8,16 +15,16 @@ from collections import defaultdict
def get_irods_content():
remote_data = defaultdict(lambda: defaultdict(dict))
result = subprocess.run(['ils','-r', '-L'], stdout=subprocess.PIPE)
# Prepare a data structure (dict):
# remote_data = [
# key => [
# latest.tar => [sha2 => '', size => ''] ,
# previous.tar => [sha2 => '', size => '']
# key => [
# latest.tar => [sha2 => '', size => ''] ,
# previous.tar => [sha2 => '', size => '']
# ]
# ], ...
# ]
# Test
# root_dir='/tempZone/home/jschaeffer/'
# Prod :
......@@ -37,10 +44,10 @@ def get_irods_content():
words = re.split(' +', line)
remote_data[current_key][words[-1]]['size'] = words[4]
total_size = total_size + int(words[4])
# Bytes to GB: ** is exponentiation in Python (^ would be a bitwise XOR)
print("Total iRODS storage used : "+str(total_size/(1024**3))+"GB")
return remote_data
# We now have the state of the iRODS server. Compare it with our repository:
# Find the YYYY/NET/STATION triples in /osug-dc/resif/validated_seismic_data
# For each one, check that it exists in remote_data
......@@ -49,8 +56,8 @@ def get_irods_content():
def browse_local_data():
filesDepth3 = glob.glob('/osug-dc/resif/validated_seismic_data/*/*/*')
dirsDepth3 = filter(lambda f: os.path.isdir(f), filesDepth3)
return list(dirsDepth3)
return list(dirsDepth3)
if __name__ == "__main__":
dirs = browse_local_data()
......@@ -66,4 +73,3 @@ if __name__ == "__main__":
print("List of missing keys (usable with RESIFDD_KEYFILE) : ")
print('\n'.join(missing_keys))
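The comparison described in the docstring (present locally, absent on iRODS) can be sketched as below. This is a hedged illustration, not the script's actual code: `compute_missing_keys` is a hypothetical helper, and the remote side is reduced to a plain set of keys where the script uses the nested `remote_data` dict.

```python
def compute_missing_keys(local_dirs, remote_keys):
    """Return the sorted YEAR_NET_STATION keys present locally but absent remotely."""
    missing = []
    for path in local_dirs:
        # The last three path components are YEAR/NET/STATION
        year, net, station = path.rstrip("/").split("/")[-3:]
        key = "_".join((year, net, station))
        if key not in remote_keys:
            missing.append(key)
    return sorted(missing)


# Example input: two local directories, one of which is already on the remote side
local_dirs = [
    "/osug-dc/resif/validated_seismic_data/2018/RA/CGBP",
    "/osug-dc/resif/validated_seismic_data/2017/FR/RUSF",
]
remote_keys = {"2018_RA_CGBP"}
result = compute_missing_keys(local_dirs, remote_keys)
print("\n".join(result))
```

The resulting list has the one-key-per-line shape expected by `RESIFDD_KEYFILE`.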
# This script must test the archives stored on the CCIN2P3 iRODS server
# Prerequisites:
# 1. Access to the iRODS server
# 2. Access to the SUMMER archive (TODO: automount bynet?)
# 3. msi executable available
# Test procedure
# - either a key is given as a script parameter (e.g. FR_FILF_2015)
# - or a key is generated at random from the metadata published by the station web service
# - check whether we have local data for this key
# if not, the archive was destroyed or moved, so the following tests cannot be run and the script exits
# - ask iRODS for information about the data matching the key (ils -L)
# if there is none, the dumps were not made => ERROR
# - fetch the remote dump and extract one file from it at random
# - compare the MD5 hashes of the extracted file and of the current version
# - compare the MD5 hashes of the output of the msi command on both files.
workdir=/scratch/resif_datadump/test_restore
archivedir=/osug-dc/resif/validated_seismic_data
function test_random_file {
echo === Extracting one random file ===
target_line=$(tar tvf latest.tar | grep -e '[0-9]$' | shuf -n 1)
target_file=$(echo "$target_line"| awk '{print $NF}')
echo $target_file
tar -xf latest.tar $target_file
target_file_date=$(date +%s -d $(stat $target_file --print %z | awk '{print $1}'))
fname=$(echo $target_file | awk -F'/' '{print $NF}')
echo $fname
regex="([A-Z0-9]+)\.([A-Z]+)\.[A-Z0-9]*\.([A-Z0-9]{3}\.D)\.([12][0-9]{3})\.[0-3][0-9]{2}"
archive_file=""
if [[ $fname =~ $regex ]]; then
archive_file="$archivedir/${BASH_REMATCH[4]}/${BASH_REMATCH[1]}/${BASH_REMATCH[2]}/${BASH_REMATCH[3]}/$fname";
if [[ ! -r $archive_file ]]; then
printf "\033[1;33mWARNING\033[0m $archive_file not present\n"
exit 1
fi
fi
echo $archive_file
archive_file_date=$(date +%s -d $(stat $archive_file --print %z | awk '{print $1}'))
if [ $archive_file_date -gt $target_file_date ]; then
echo "File in archive is newer than remote file. Skipping md5 test"
else
echo === Checking md5 ===
dumped_md5=$(md5sum $target_file | awk '{print $1}')
data_md5=$(md5sum $archive_file | awk '{print $1}')
echo $dumped_md5 $target_file
echo $data_md5 $archive_file
if [[ $dumped_md5 == $data_md5 ]]; then
printf "\033[0;32mOK\033[0m $target_file md5 sum is correct\n"
else
printf "\033[0;31mError\033[0m Files $target_file and $archive_file mismatch\n"
exit 1
fi
fi
echo === Compare msi traces ===
dumped_msi=$(msi $target_file)
data_msi=$(msi $archive_file)
if [[ $(md5sum <<< $dumped_msi | awk '{print $1}') == $(md5sum <<< $data_msi | awk '{print $1}') ]]; then
printf "\033[0;32mOK\033[0m Traces match\n"
else
echo "Traces of remote file $target_file"
echo "$dumped_msi"
echo "Traces of local file $archive_file"
echo "$data_msi"
printf "\033[1;33mWARNING\033[0m Traces mismatch\n"
fi
}
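The MD5 check at the heart of `test_random_file` can be sketched in Python with the standard `hashlib` module. This is an illustrative sketch only: `md5_of_file` and `same_content` are hypothetical names, and the demo compares small temporary files in place of a file extracted from `latest.tar` and its counterpart in the archive.

```python
import hashlib
import os
import tempfile


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(path_a, path_b):
    """True when both files have the same MD5 digest."""
    return md5_of_file(path_a) == md5_of_file(path_b)


# Demo: one identical copy and one altered copy of the same payload
with tempfile.TemporaryDirectory() as tmp:
    dumped = os.path.join(tmp, "dumped")
    archived = os.path.join(tmp, "archived")
    altered = os.path.join(tmp, "altered")
    for path, payload in ((dumped, b"abc"), (archived, b"abc"), (altered, b"abd")):
        with open(path, "wb") as handle:
            handle.write(payload)
    identical = same_content(dumped, archived)
    mismatch = not same_content(dumped, altered)

print("OK" if identical else "Error", "mismatch detected" if mismatch else "")
```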
if [[ $# -eq 0 ]]; then
# No key given as parameter: pick a random channel from the metadata
line=$(wget -q -O - "http://ws.resif.fr/fdsnws/station/1/query?level=channel&format=text" | shuf -n 1)
echo $line
[[ $line =~ ^(.*)\|(.*)\|(.*)\|(.*)\|(.*)\|(.*)\|(.*)\|(.*)\|(.*)\|(.*)\|(.*)\|(.*)\|(.*)\|(.*)\|(.*)\|([1-2][0-9][0-9][0-9]).*\|([1-2][0-9][0-9][0-9]).*$ ]]
net=${BASH_REMATCH[1]}
sta=${BASH_REMATCH[2]}
start=${BASH_REMATCH[16]}
end=${BASH_REMATCH[17]}
echo "$start $end"
if [[ $start -eq $end ]]; then
year=$start
else
if [[ "$end" = "2500" ]]; then
# Permanent channels end in 2500; fall back to the year six months ago (the last dump)
end=$(date +%Y -d '6 months ago')
fi
year=$(( ( RANDOM % (( $end - $start )) ) + $start ))
fi
key="${year}_${net}_${sta}"
else
key=$1
fi
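The year selection above can be sketched in Python as follows. The names `year_six_months_ago` and `pick_year` are hypothetical. The sketch keeps the shell behaviour of drawing a year in the half-open range from `start` up to, but not including, `end`, with `2500` treated as the end marker of permanent channels. One deliberate difference, stated as an assumption: when the capped `end` is not greater than `start`, it returns `start`, a case the shell arithmetic does not handle.

```python
import datetime
import random


def year_six_months_ago(today):
    """Year of the date six calendar months before today."""
    return today.year if today.month > 6 else today.year - 1


def pick_year(start, end, today=None, rng=random):
    """Pick a year in [start, end), capping the 2500 marker of permanent channels."""
    today = today or datetime.date.today()
    if end == 2500:
        end = year_six_months_ago(today)
    if end <= start:
        # Assumption of this sketch: fall back to the start year
        return start
    return rng.randrange(start, end)


# Example: a permanent channel opened in 2010, tested on 2019-08-19
year = pick_year(2010, 2500, today=datetime.date(2019, 8, 19))
key = f"{year}_RA_NCAD"
print(key)
```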
# Check that this data actually exists (it is not guaranteed)
dir_to_test=$(echo $key | tr '_' '/')
stat $archivedir/$dir_to_test
if [[ $? -ne 0 ]]; then
printf "\033[1;33mWARNING\033[0m no data for this channel on archive\n"
exit 0
fi
workdir=$workdir/$key