Compare revisions

Elias Chetouane
--- a/.gitlab-ci.yml
+++ b/.gitlab-ci.yml
@@ -25,7 +25,7 @@ actualisation_dois:
    - git config user.name "${GITLAB_USER_NAME}"
    - git config user.email "${GITLAB_USER_EMAIL}"
    - git remote set-url --push origin "https://PUSH_TOKEN:${ACCESS_TOKEN}@gricad-gitlab.univ-grenoble-alpes.fr/${CI_PROJECT_PATH}.git"
-    - git add -f dois-uga.csv 2-produce-graph/hist-evol-datasets-per-repo.png 2-produce-graph/hist-quantity-year-type.png 2-produce-graph/pie--datacite-client.png 2-produce-graph/pie--datacite-type.png 2-produce-graph/hist-last-datasets-by-client.png 1-enrich-with-datacite/all_datacite_clients_for_uga.csv 1-enrich-with-datacite/nb-dois.txt
+    - git add -f dois-uga.csv dois-uga--last-500.csv 2-produce-graph/hist-evol-datasets-per-repo.png 2-produce-graph/hist-quantity-year-type.png 2-produce-graph/pie--datacite-client.png 2-produce-graph/pie--datacite-type.png 2-produce-graph/hist-last-datasets-by-client.png 1-enrich-with-datacite/all_datacite_clients_for_uga.csv 1-enrich-with-datacite/nb-dois.txt
    - git commit -m "Execution du pipeline. Actualisation des dois et des graphes."
    - git push origin HEAD:${CI_COMMIT_REF_NAME}

@@ -49,6 +49,7 @@ actualisation_dois:
    # add the repository files that were modified, in case a problem occurred in "after_script"
    paths:
      - dois-uga.csv
+      - dois-uga--last-500.csv
      - 2-produce-graph/hist-evol-datasets-per-repo.png
      - 2-produce-graph/hist-quantity-year-type.png
      - 2-produce-graph/pie--datacite-client.png

--- a/0-collect-data/nakala-uga-users.txt
+++ b/0-collect-data/nakala-uga-users.txt
@@ -17,4 +17,18 @@ mbeligne
 acarbonnelle
 annegf
 tleduc
-abey
\ No newline at end of file
+abey
+mbarletta
+lmaritaud
+jbeaureder
+kboczon
+llacoste
+fcorsi
+ecarlier
+lvanbogaert
+nrousselot
+jlevy1
+mflecheux
+pbai
+ymonnier
+slecuyerchardevel
\ No newline at end of file
--- a/0-collect-data/rdg.py
+++ b/0-collect-data/rdg.py
@@ -22,15 +22,15 @@ urls = [
    'https://entrepot.recherche.data.gouv.fr/api/search?q=*&fq=producerAffiliation%3AUGA',
    'https://entrepot.recherche.data.gouv.fr/api/search?q=*&fq=contributorAffiliation%3AUGA',
    'https://entrepot.recherche.data.gouv.fr/api/search?q=*&fq=datasetContactAffiliation%3A(Grenoble AND Alpes)',
-    'https://entrepot.recherche.data.gouv.fr/api/search?q=*&fq=authorAffiliation%3AGrenoble(Grenoble AND Alpes)',
-    'https://entrepot.recherche.data.gouv.fr/api/search?q=*&fq=producerAffiliation%3AGrenoble(Grenoble AND Alpes)',
-    'https://entrepot.recherche.data.gouv.fr/api/search?q=*&fq=contributorAffiliation%3AGrenoble(Grenoble AND Alpes)'
+    'https://entrepot.recherche.data.gouv.fr/api/search?q=*&fq=authorAffiliation%3A(Grenoble AND Alpes)',
+    'https://entrepot.recherche.data.gouv.fr/api/search?q=*&fq=producerAffiliation%3A(Grenoble AND Alpes)',
+    'https://entrepot.recherche.data.gouv.fr/api/search?q=*&fq=contributorAffiliation%3A(Grenoble AND Alpes)'
    # other queries can be added here
 ]

 # define a function that runs the request for each URL, one per affiliation
 def get_results(url):
-    req = requests.get(url)
+    req = requests.get(url+"&type=dataset")
    #print(req.url)
    results = [req.json()]
    
@@ -39,7 +39,7 @@ def get_results(url):
    count = nb_res
    page = 1
    while(nb_res > 0):
-        newurl = url+"&start="+str(count)
+        newurl = url+"&type=dataset"+"&start="+str(count)
        req = requests.get(newurl)
        results.append(req.json())
        nb_res = results[page]["data"]["count_in_response"]
@@ -59,7 +59,7 @@ def get_dois(results):
        nb_dois += len(num_dois)

        for item in num_dois :
-            dois.append(item["global_id"])
+            dois.append(item.get("global_id"))
        
    print("\tnb DOIs\t\t" + str(nb_dois))
    return dois

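The paging logic touched by this hunk can be sketched as follows. This is a minimal illustration, not the repository's code: `build_url`, `get_all_pages`, and the injected `fetch` callable are hypothetical names, and the `items` key is assumed from the Dataverse Search API response layout, which the diff does not show. The sketch puts `type=dataset` in a single place so the first request and the follow-up pages cannot drift apart, and it skips items without a `global_id` instead of storing `None`.

```python
from typing import Callable, Dict, List
from urllib.parse import urlencode

BASE = "https://entrepot.recherche.data.gouv.fr/api/search"


def build_url(fq: str, start: int = 0) -> str:
    """Build a search URL restricted to datasets, starting at offset `start`."""
    params = {"q": "*", "fq": fq, "type": "dataset", "start": start}
    return BASE + "?" + urlencode(params)


def get_all_pages(fq: str, fetch: Callable[[str], Dict]) -> List[Dict]:
    """Collect every result page for one filter query.

    `fetch` is a hypothetical callable that takes a URL and returns the
    decoded JSON body, so the paging logic can be exercised offline.
    """
    pages = []
    start = 0
    while True:
        page = fetch(build_url(fq, start))
        count = page["data"]["count_in_response"]
        if count == 0:
            break
        pages.append(page)
        start += count  # next offset = number of results already read
    return pages


def get_dois(pages: List[Dict]) -> List[str]:
    """Extract DOIs, skipping items that carry no `global_id`."""
    dois = []
    for page in pages:
        for item in page["data"].get("items", []):  # "items" key is an assumption
            doi = item.get("global_id")
            if doi:
                dois.append(doi)
    return dois
```

With `requests` installed, `fetch` could be `lambda url: requests.get(url).json()`.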
--- a/1-enrich-with-datacite/all_datacite_clients_for_uga.csv
+++ b/1-enrich-with-datacite/all_datacite_clients_for_uga.csv
 client,count,name,year,url
-cern.zenodo,730,Zenodo,2013,https://zenodo.org/
-inist.sshade,471,Solid Spectroscopy Hosting Architecture of Databases and Expertise,2019,https://www.sshade.eu/
-inist.osug,238,Observatoire des Sciences de l'Univers de Grenoble,2014,http://doi.osug.fr
-figshare.ars,232,figshare Academic Research System,2016,http://figshare.com/
-dryad.dryad,156,DRYAD,2018,https://datadryad.org
-inist.resif,79,Réseau sismologique et géodésique français,2014,https://www.resif.fr/
-inist.persyval,55,PERSYVAL-Lab : Pervasive Systems and Algorithms Lab,2016,
-rdg.prod,43,Recherche Data Gouv France,2022,https://recherche.data.gouv.fr/en
+cern.zenodo,885,Zenodo,2013,https://zenodo.org/
+inist.sshade,522,Solid Spectroscopy Hosting Architecture of Databases and Expertise,2019,https://www.sshade.eu/
+figshare.ars,380,figshare Academic Research System,2016,http://figshare.com/
+inist.osug,275,Observatoire des Sciences de l'Univers de Grenoble,2014,http://doi.osug.fr
+dryad.dryad,168,DRYAD,2018,https://datadryad.org
+inist.resif,99,Réseau sismologique et géodésique français,2014,https://www.resif.fr/
+rdg.prod,81,Recherche Data Gouv France,2022,https://recherche.data.gouv.fr/en
+inist.humanum,75,NAKALA,2020,https://nakala.fr
+inist.persyval,64,PERSYVAL-Lab : Pervasive Systems and Algorithms Lab,2016,
 fmsh.prod,28,Fondation Maison des sciences de l'homme,2023,
-inist.humanum,28,Huma-Num,2020,https://nakala.fr
-figshare.sage,14,figshare SAGE Publications,2018,
-mcdy.dohrmi,12,dggv-e-publications,2020,https://www.dggv.de/publikationen/dggv-e-publikationen.html
-tib.gfzbib,3,GFZpublic,2011,https://gfzpublic.gfz-potsdam.de
+inist.ccj,22,Centre Camille Jullian – UMR 7299,2020,
+pangaea.repository,18,PANGAEA,2020,https://www.pangaea.de/
+mcdy.dohrmi,14,dggv-e-publications,2020,https://www.dggv.de/publikationen/dggv-e-publikationen.html
+inist.cirm,7,Centre International de Rencontres Mathématiques,2017,
+figshare.sage,6,figshare SAGE Publications,2018,
+iris.iris,5,NSF Seismological Facility for the Advancement of Geoscience (SAGE),2018,http://www.iris.edu/hq/
 vqpf.dris,3,Direction des ressources et de l'information scientifique,2021,
-tib.repod,3,RepOD,2015,
-iris.iris,3,Incorporated Research Institutions for Seismology,2018,http://www.iris.edu/hq/
-ugraz.unipub,2,unipub,2019,http://unipub.uni-graz.at
+tib.gfzbib,3,GFZpublic,2011,https://gfzpublic.gfz-potsdam.de
+tib.repod,3,RepOD,2015,https://repod.icm.edu.pl/
+cnic.sciencedb,3,ScienceDB,2022,https://www.scidb.cn/en
+inist.eost,2,Ecole et Observatoire des Sciences de la Terre,2017,https://eost.unistra.fr/en/
+tib.gfz,2,GFZ Data Services,2011,https://dataservices.gfz-potsdam.de/portal/
+bl.mendeley,2,Mendeley Data,2015,https://data.mendeley.com/
 bl.nerc,2,NERC Environmental Data Service,2011,https://eds.ukri.org
-crui.ingv,1,Istituto Nazionale di Geofisica e Vulcanologia (INGV),2013,http://data.ingv.it/
-estdoi.ttu,1,TalTech,2019,https://digikogu.taltech.ee
-ardcx.nci,1,National Computational Infrastructure,2020,
-umass.uma,1,University of Massachusetts (UMass) Amherst,2018,https://scholarworks.umass.edu/
-bl.mendeley,1,Mendeley Data,2015,https://data.mendeley.com/
-inist.eost,1,Ecole et Observatoire des Sciences de la Terre,2017,https://eost.unistra.fr/en/
-bl.iita,1,International Institute of Tropical Agriculture datasets,2017,http://data.iita.org/
+tug.openlib,2,TU Graz OPEN Library,2020,https://openlib.tugraz.at/
+crui.ingv,2,Istituto Nazionale di Geofisica e Vulcanologia (INGV),2013,http://data.ingv.it/
+ugraz.unipub,2,unipub,2019,http://unipub.uni-graz.at
+ethz.sed,2,"Swiss Seismological Service, national earthquake monitoring and hazard center",2013,http://www.seismo.ethz.ch
 inist.opgc,1,Observatoire de Physique du Globe de Clermont-Ferrand,2017,
+ethz.da-rd,1,ETHZ Data Archive - Research Data,2013,http://data-archive.ethz.ch
+ethz.zora,1,"Universität Zürich, ZORA",2013,https://www.zora.uzh.ch/
+estdoi.ttu,1,TalTech,2019,https://digikogu.taltech.ee
+repod.dbuw,1,University of Warsaw Research Data Repository,2023,https://danebadawcze.uw.edu.pl/
+inist.ird,1,IRD,2016,
 inist.omp,1,Observatoire Midi-Pyrénées,2011,
-tug.openlib,1,TU Graz OPEN Library,2020,https://openlib.tugraz.at/
-tib.gfz,1,GFZ Data Services,2011,https://dataservices.gfz-potsdam.de/portal/
+umass.uma,1,University of Massachusetts (UMass) Amherst,2018,https://scholarworks.umass.edu/
 edi.edi,1,Environmental Data Initiative,2017,https://portal.edirepository.org/nis/home.jsp
+bl.iita,1,International Institute of Tropical Agriculture datasets,2017,http://data.iita.org/
+ardcx.nci,1,National Computational Infrastructure,2020,
 ihumi.pub,1,IHU Méditerranée Infection,2020,
-ethz.zora,1,"Universität Zürich, ZORA",2013,https://www.zora.uzh.ch/
-inist.ird,1,IRD,2016,
+inist.inrap,1,Institut national de recherches archéologiques préventives,2019,
+tib.mpdl,1,Max Planck Digital Library,2015,
+tudublin.arrow,1,ARROW@TU Dublin,2020,https://arrow.dit.ie/
--- a/1-enrich-with-datacite/concatenate-enrich-dois.py
+++ b/1-enrich-with-datacite/concatenate-enrich-dois.py
@@ -56,6 +56,25 @@ for doi in dois : #[:300]
 ## if new datasets have been found
 if temp_rows :
 	df_fresh = pd.DataFrame(temp_rows)
+	dois_added = list(df_old["doi"])
+	to_del = []
+	for i in range(0, len(df_fresh)):
+		result = my_functions.get_origin_version(df_fresh.loc[i, "doi"])
+		if result[0] not in dois_added: 
+			dois_added.append(result[0])
+			df_fresh.loc[i, "doi"] = result[0]
+			if str(result[1]) != "[]": df_fresh.loc[i, "traveled_dois"] = str(result[1])
+			else: df_fresh.loc[i, "traveled_dois"] = ""	
+			if str(result[2]) != "[]": df_fresh.loc[i, "all_relations"] = str(result[2])
+			else: df_fresh.loc[i, "all_relations"] = ""
+		else:
+			to_del.append(i)
+			
+	df_fresh.drop(to_del, inplace=True)
+	print("Nombre de dois supprimés : " + str(len(to_del)))
+	
+	print("Nb dois a garder : " + str(len(dois_added)))
+
 	df_concat = pd.concat([df_old, df_fresh], ignore_index=True)

 	## remove not wanted datacite type & clients
@@ -68,10 +87,18 @@ if temp_rows :
 	df_out.to_csv("../dois-uga.csv", index = False)
 	print(f"\n\nnb of doi exported \t{len(df_out)}")

+
 	# write the number of dois found in a file to display on the website
 	with open("nb-dois.txt", 'w') as outf :
 		outf.write(str(len(df_out)))

+
+	## output last 500 DOIs to make it easier to open in web tools
+	df_last_dois = df_out.sort_values(by = "created", ascending = False, inplace = False)[:500]
+	df_last_dois["created"] = df_last_dois["created"].str[:10]
+	df_last_dois[["doi", "client", "resourceTypeGeneral", "created", "publisher", "rights", "sizes"]].to_csv("../dois-uga--last-500.csv", index = False)
+
+
 	## for the website : output another csv with datacite client and number of datasets
 	df_client_raw = df_out["client"].value_counts().to_frame()


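The deduplication step added in this hunk can be read as the following pure-Python sketch. It assumes rows are plain dicts and that `resolve` behaves like `get_origin_version`, returning a `(doi, traveled, relations)` tuple; `dedupe_fresh_rows` is a hypothetical name, not a function of the repository. Using a set for the known DOIs avoids the repeated list scan and sidesteps any dependence on the DataFrame index.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def dedupe_fresh_rows(
    known_dois: Iterable[str],
    fresh_rows: Iterable[Dict],
    resolve: Callable[[str], Tuple[str, list, list]],
) -> Tuple[List[Dict], int]:
    """Keep one row per resolved DOI and count the rows that were dropped.

    `resolve` is assumed to return (kept DOI, traveled DOIs, all relations).
    """
    seen = set(known_dois)
    kept = []
    dropped = 0
    for row in fresh_rows:
        doi, traveled, relations = resolve(row["doi"])
        if doi in seen:
            dropped += 1  # parent or identical DOI already recorded
            continue
        seen.add(doi)
        new_row = dict(row)  # copy, so the caller's row is left untouched
        new_row["doi"] = doi
        # empty lists become empty strings, as in the hunk above
        new_row["traveled_dois"] = str(traveled) if traveled else ""
        new_row["all_relations"] = str(relations) if relations else ""
        kept.append(new_row)
    return kept, dropped
```

The kept rows can then be turned back into a DataFrame with `pd.DataFrame(kept)` before the concatenation.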
--- a/1-enrich-with-datacite/nb-dois.txt
+++ b/1-enrich-with-datacite/nb-dois.txt
-2117
\ No newline at end of file
+2692
\ No newline at end of file
--- a/1-enrich-with-datacite/z_personal_functions.py
+++ b/1-enrich-with-datacite/z_personal_functions.py
 import requests, json

-def get_origin_version(doi, count=0, cited=0):
-    cited = 0
+# Function to avoid redundancy between data associated with different DOIs that point to the same files:
+# In Zenodo, for example, each version of a deposit has its own DOI, so we need to walk up to the parent ("concept") DOI
+# If the parent DOI obtained, or an "is_identical_to" target, refers to a DOI already present in the csv, it must be ignored.
+def get_origin_version(doi, history=[], first=True):
+    if first: history=[] # line added because the default list is shared between calls, so history could be non-empty on the first call of the function
    req = requests.get( f"https://api.datacite.org/dois/{doi}" )
    res = req.json()
-    result = (doi, count, cited)
+    final = []
+    result = (doi, history, final) # doi is the DOI that will be added to the csv, history traces the DOIs and relations followed during the search, and final records the relations of the final DOI added to the csv
    try:
-        related = res["data"]["attributes"]["relatedIdentifiers"]
+        related = res["data"]["attributes"]["relatedIdentifiers"] # check whether relations exist for the current DOI
    except:
-        pass
+        pass # no relations: return the current DOI
    else:
-        ignore = False
-        duplicate = False
+        ignore = False # ignore marks a DOI that has a parent ("concept") version to be found; the current DOI must then be ignored
+        duplicate = False # duplicate marks a DOI that is identical to another one
        for i in related:
-            if i["relationType"] == "IsVersionOf" and i.get("relatedIdentifierType") == "DOI": 
+            final.append(i.get("relationType"))
+            if i.get("relationType") == "IsVersionOf" and i.get("relatedIdentifierType") == "DOI": 
                ignore = True
-                elem_to_save_i = i["relatedIdentifier"]
-                # supprimer le doi courant s'il apparait dans la liste
-            if i["relationType"] == "isCitedBy" and i.get("relatedIdentifierType") == "DOI": cited += 1
-            if i["relationType"] == "IsIdenticalTo" and i.get("relatedIdentifierType") == "DOI":
+                elem_to_save_i = i.get("relatedIdentifier")
+                history.append([i.get("relationType"), i.get("relatedIdentifier")])
+            if i.get("relationType") == "IsIdenticalTo" and i.get("relatedIdentifierType") == "DOI":
                duplicate = True
-                elem_to_save_d = i["relatedIdentifier"]
+                elem_to_save_d = i.get("relatedIdentifier") # is_identical_to is not symmetric, so taking the other DOI (not the current one) is enough to avoid duplicates
+                history.append([i.get("relationType"), i.get("relatedIdentifier")])
        if duplicate and not(ignore):
-            result = (elem_to_save_d, count, cited)
-        if ignore: result = get_origin_version(elem_to_save_i, count+1, cited)
+            result = (elem_to_save_d, history, final) # identical but no parent version: we can stop here
+        if ignore: result = get_origin_version(elem_to_save_i, history, False) # parent version exists: move up without looking at identical DOIs
    return result

 def get_md_from_datacite( doi ) : 

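One way to avoid the shared-default workaround (`first=True`) is a `None` default for `history`. The sketch below is an assumption-laden variant, not the code merged here: the `fetch` parameter (a callable returning the decoded DataCite JSON for a DOI) and the `max_depth` guard are hypothetical additions, introduced so that the walk can be tested offline and cannot loop forever on circular relations.

```python
from typing import Callable, Dict, List, Optional, Tuple


def get_origin_version(
    doi: str,
    fetch: Callable[[str], Dict],
    history: Optional[List[list]] = None,
    max_depth: int = 20,
) -> Tuple[str, List[list], List[str]]:
    """Walk up IsVersionOf links to the parent DOI, or keep an IsIdenticalTo target.

    Returns (kept DOI, relations followed, relation types of the kept DOI).
    """
    if history is None:  # a fresh list per top-level call
        history = []
    relations = []
    attributes = fetch(doi).get("data", {}).get("attributes", {})
    related = attributes.get("relatedIdentifiers") or []

    parent = None
    identical = None
    for rel in related:
        rel_type = rel.get("relationType")
        relations.append(rel_type)
        if rel.get("relatedIdentifierType") != "DOI":
            continue
        if rel_type == "IsVersionOf":
            parent = rel.get("relatedIdentifier")
            history.append([rel_type, parent])
        elif rel_type == "IsIdenticalTo":
            identical = rel.get("relatedIdentifier")
            history.append([rel_type, identical])

    if parent is not None and max_depth > 0:
        # a parent version exists: move up without looking at identical DOIs
        return get_origin_version(parent, fetch, history, max_depth - 1)
    if parent is None and identical is not None:
        # identical to another DOI and no parent version: stop here
        return identical, history, relations
    return doi, history, relations
```

With `requests` installed, `fetch` could be `lambda d: requests.get(f"https://api.datacite.org/dois/{d}").json()`.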
--- a/2-produce-graph/hist-evol-datasets-per-repo.png
+++ b/2-produce-graph/hist-evol-datasets-per-repo.png
--- a/2-produce-graph/hist-last-datasets-by-client.png
+++ b/2-produce-graph/hist-last-datasets-by-client.png
--- a/2-produce-graph/hist-quantity-year-type.png
+++ b/2-produce-graph/hist-quantity-year-type.png
--- a/2-produce-graph/pie--datacite-client.png
+++ b/2-produce-graph/pie--datacite-client.png
--- a/2-produce-graph/pie--datacite-type.png
+++ b/2-produce-graph/pie--datacite-type.png
--- a/README.md
+++ b/README.md
-# Codes for the UGA Open research data monitor
+# Scripts & codes for the UGA Open research data monitor

-View contextualized results on the website : [mlarrieu.gricad-pages.univ-grenoble-alpes.fr/open-research-data-monitor](https://mlarrieu.gricad-pages.univ-grenoble-alpes.fr/open-research-data-monitor)
+See contextualized results on the website : [mlarrieu.gricad-pages.univ-grenoble-alpes.fr/open-research-data-monitor](https://mlarrieu.gricad-pages.univ-grenoble-alpes.fr/open-research-data-monitor)

 <br />
 <br />
@@ -11,8 +11,6 @@ View contextualized results on the website : [mlarrieu.gricad-pages.univ-grenobl

 - Text search for `UGA` and `grenoble AND alpes` in the following fields: `author`, `contributor`, `datasetContactAffiliation`, `producerAffiliation`

-
-
 ### DataCite

 - search with UGA's DataCite clients: `inist.osug`, `client.uid:inist.sshade`, `client.uid:inist.resif`, `client_id:inist.persyval`
@@ -42,18 +40,35 @@ View contextualized results on the website : [mlarrieu.gricad-pages.univ-grenobl
 - retrieve the list of publications and filter on those for which datasets were produced
 - go through HAL to find the DOIs of these datasets (`researchData_s` field)

+## Filters
+- we remove the following datacite types `["Book", "ConferencePaper", "ConferenceProceeding", "JournalArticle", "BookChapter", "Service", "Preprint"]`
+- we remove the following datacite clients `["rg.rg", "inist.epure"]`
+
+## How is research data counted?
+
+The monitor takes into account data that has a DOI from the DataCite agency, which means it is Findable. A data deposit comprises metadata conforming to the DataCite metadata schema and one or more files, which may be organized in a directory tree. Deposits are counted, not the files they contain: one DOI therefore counts as one research dataset.
+The DataCite schema allows relations to be declared between DOIs, which we use to handle versions and duplicates of data.
+To avoid counting the same deposit twice, or counting an updated deposit again, the monitor has a function that navigates between DOIs related by `IsVersionOf` or `IsIdenticalTo`. In the first case, the function walks up the versions to the parent version, that is, a stable DOI that redirects to the most recent version.
+In the second case, the function simply keeps the version reported as identical. Since this relation is not symmetric, the retained DOI will not have an `IsIdenticalTo` relation, and redundancy is avoided.
+
+
 <br />
 <br />

-## Filters
- we removethe following datacite types `["Book", "ConferencePaper", "ConferenceProceeding", "JournalArticle", "BookChapter", "Service", "Preprint"]`
- we remove the following datacite clients `["rg.rg", "inist.epure"]`

+## Data schema
+The fields of the output table follow the DataCite metadata schema (see https://datacite-metadata-schema.readthedocs.io/en/4.5/), to which two fields are added:
+- `all_relations`
+all relations attached to the identified DOI.
+
+- `traveled_dois`
+list of DOIs traversed by the script to reach the concept DOI


 <br />
 <br />

+
 ## Credits

 * Élias Chetouane: collecting data, program automation

--- a/dois-uga--last-500.csv
+++ b/dois-uga--last-500.csv
--- a/dois-uga.csv
+++ b/dois-uga.csv
--- a/notes-feedback.md
+++ b/notes-feedback.md
@@ -13,5 +13,3 @@ https://indico.math.cnrs.fr/event/10998/page/779-journees-mathrice-a-grenoble-le
 - nous faut il répliquer nos DOI dans Recherche Data Gouv pour plus de visibilité ? 


-2024-03-xx Software Heritage
-=================
\ No newline at end of file