Parsing GEPRIS for the list of funded NFDI projects with GEPRIS IDs and descriptions
Check out the GEPRIS user interface for advanced search: https://gepris.dfg.de/gepris/OCTOPUS?task=doSearchExtended&context=projekt&keywords_criterion=NFDI&nurProjekteMitAB=false&findButton=Finden&person=&location=&fachlicheZuordnung=#&pemu=32&peu=#&zk_transferprojekt=false&teilprojekte=false&teilprojekte=true&bewilligungsStatus=&beginOfFunding=&gefoerdertIn=&oldContinentId=#&continentId=#&oldSubContinentId=##&subContinentId=##&oldCountryId=###&countryKey=###&einrichtungsart=-1
Getting HTML via requests
We use the requests library to fetch the HTML of that page into the text
variable and print its first 36 characters.
import requests
GEPRIS_URL = "https://gepris.dfg.de/gepris/OCTOPUS"
params = {'keywords_criterion': '',
'nurProjekteMitAB': 'false',
'findButton': 'Finden',
'task': 'doSearchExtended',
'pemu': 32,
'context': 'projekt',
'language': 'en',
'hitsPerPage': 50,
'index': 0}
r = requests.get(GEPRIS_URL, params=params)
text = r.text
print(text[0:36])
<?xml version="1.0" encoding="utf-8"
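To check which URL such a params dictionary corresponds to without sending a request, the query string can be built with the standard library. This is only a minimal sketch: build_search_url is a hypothetical helper that is not part of the notebook, its parameter values simply mirror the cell above, and the URL that requests actually sends may differ in encoding details.

```python
from urllib.parse import urlencode

GEPRIS_URL = "https://gepris.dfg.de/gepris/OCTOPUS"


def build_search_url(index=0, hits_per_page=50, keyword=""):
    """Return the full search URL for one result page (hypothetical helper)."""
    params = {
        "keywords_criterion": keyword,
        "nurProjekteMitAB": "false",
        "findButton": "Finden",
        "task": "doSearchExtended",
        "pemu": 32,
        "context": "projekt",
        "language": "en",
        "hitsPerPage": hits_per_page,
        "index": index,
    }
    # urlencode percent-encodes the values and joins the pairs with '&'
    return GEPRIS_URL + "?" + urlencode(params)


url = build_search_url()
print(url)
```

Printing the URL this way makes it easy to paste into a browser and compare the page with what the script receives.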
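Parsing search results from HTML via BeautifulSoup
We use the BeautifulSoup library to parse the page. The number of result pages is read from the result-info element; if that element is missing or cannot be parsed, we fall back to a single page:
from bs4 import BeautifulSoup
# Parse the HTML first so that soup exists before it is queried
soup = BeautifulSoup(text, 'html.parser')
try:
    pages = int(soup.find('span', id="result-info").find('strong').text.split()[0])
except (AttributeError, IndexError, ValueError):
    # Element not found or its text is not a number
    pages = 1
print(pages)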
1
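If a search ever returned more hits than fit on one page, the request would have to be repeated with a different index value. The sketch below assumes that index is the zero-based offset of the first hit on a page, which the notebook does not confirm; page_offsets is a hypothetical helper.

```python
def page_offsets(total_hits, hits_per_page=50):
    """Return the assumed `index` value for every result page."""
    if hits_per_page <= 0:
        raise ValueError("hits_per_page must be positive")
    # One offset per page: 0, hits_per_page, 2 * hits_per_page, ...
    return list(range(0, total_hits, hits_per_page))


print(page_offsets(19))   # [0]
print(page_offsets(120))  # [0, 50, 100]
```

Each offset could then be passed as the index entry of params in a loop over requests.get calls.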
All search results are found via
soup = BeautifulSoup(text, 'html.parser')
results = soup.find_all("div", class_="results")
print(results)
[<div class="results">
<h2><a href="/gepris/projekt/441914366">GHGA – German Human Genome-Phenome Archive</a></h2> <span id="icons"><a href="/gepris/projekt/441914366?displayMode=print&findButton=Finden&hitsPerPage=50&index=0&keywords_criterion=&language=en&nurProjekteMitAB=false&pemu=32" rel="nofollow" target="_blank" title="Open print view"><img alt="Print View" src="/gepris/images/iconPrint.gif"/></a></span></div>, <div class="results">
<h2><a href="/gepris/projekt/441926934">NFDI4Cat – NFDI for Catalysis-Related Sciences</a></h2> <span id="icons"><a href="/gepris/projekt/441926934?displayMode=print&findButton=Finden&hitsPerPage=50&index=0&keywords_criterion=&language=en&nurProjekteMitAB=false&pemu=32" rel="nofollow" target="_blank" title="Open print view"><img alt="Print View" src="/gepris/images/iconPrint.gif"/></a></span></div>, <div class="results">
<h2><a href="/gepris/projekt/441958017">NFDI4Culture – Consortium for research data on material and immaterial cultural heritage</a></h2> <span id="icons"><a href="/gepris/projekt/441958017?displayMode=print&findButton=Finden&hitsPerPage=50&index=0&keywords_criterion=&language=en&nurProjekteMitAB=false&pemu=32" rel="nofollow" target="_blank" title="Open print view"><img alt="Print View" src="/gepris/images/iconPrint.gif"/></a></span></div>, <div class="results">
<h2><a href="/gepris/projekt/441958208">NFDI4Chem – Chemistry Consortium in the NFDI</a></h2> <span id="icons"><a href="/gepris/projekt/441958208?displayMode=print&findButton=Finden&hitsPerPage=50&index=0&keywords_criterion=&language=en&nurProjekteMitAB=false&pemu=32" rel="nofollow" target="_blank" title="Open print view"><img alt="Print View" src="/gepris/images/iconPrint.gif"/></a></span></div>, <div class="results">
<h2><a href="/gepris/projekt/442032008">NFDI4BioDiversity – Biodiversity, Ecology & Environmental Data</a></h2> <span id="icons"><a href="/gepris/projekt/442032008?displayMode=print&findButton=Finden&hitsPerPage=50&index=0&keywords_criterion=&language=en&nurProjekteMitAB=false&pemu=32" rel="nofollow" target="_blank" title="Open print view"><img alt="Print View" src="/gepris/images/iconPrint.gif"/></a></span></div>, <div class="results">
<h2><a href="/gepris/projekt/442077441">DataPLANT – Data in PLANT research</a></h2> <span id="icons"><a href="/gepris/projekt/442077441?displayMode=print&findButton=Finden&hitsPerPage=50&index=0&keywords_criterion=&language=en&nurProjekteMitAB=false&pemu=32" rel="nofollow" target="_blank" title="Open print view"><img alt="Print View" src="/gepris/images/iconPrint.gif"/></a></span></div>, <div class="results">
<h2><a href="/gepris/projekt/442146713">NFDI4Ing – National Research Data Infrastructure for Engineering Services</a></h2> <span id="icons"><a href="/gepris/projekt/442146713?displayMode=print&findButton=Finden&hitsPerPage=50&index=0&keywords_criterion=&language=en&nurProjekteMitAB=false&pemu=32" rel="nofollow" target="_blank" title="Open print view"><img alt="Print View" src="/gepris/images/iconPrint.gif"/></a></span></div>, <div class="results">
<h2><a href="/gepris/projekt/442326535">NFDI4Health – National Research Data Infrastructure for Personal Health Data</a></h2> <span id="icons"><a href="/gepris/projekt/442326535?displayMode=print&findButton=Finden&hitsPerPage=50&index=0&keywords_criterion=&language=en&nurProjekteMitAB=false&pemu=32" rel="nofollow" target="_blank" title="Open print view"><img alt="Print View" src="/gepris/images/iconPrint.gif"/></a></span></div>, <div class="results">
<h2><a href="/gepris/projekt/442494171">KonsortSWD – Consortium for the Social, Behavioural, Educational, and Economic Sciences</a></h2> <span id="icons"><a href="/gepris/projekt/442494171?displayMode=print&findButton=Finden&hitsPerPage=50&index=0&keywords_criterion=&language=en&nurProjekteMitAB=false&pemu=32" rel="nofollow" target="_blank" title="Open print view"><img alt="Print View" src="/gepris/images/iconPrint.gif"/></a></span></div>, <div class="results">
<h2><a href="/gepris/projekt/460033370">Text+</a></h2> <span id="icons"><a href="/gepris/projekt/460033370?displayMode=print&findButton=Finden&hitsPerPage=50&index=0&keywords_criterion=&language=en&nurProjekteMitAB=false&pemu=32" rel="nofollow" target="_blank" title="Open print view"><img alt="Print View" src="/gepris/images/iconPrint.gif"/></a></span></div>, <div class="results">
<h2><a href="/gepris/projekt/460036893">NFDI4Earth - NFDI Consortium Earth System Sciences</a></h2> <span id="icons"><a href="/gepris/projekt/460036893?displayMode=print&findButton=Finden&hitsPerPage=50&index=0&keywords_criterion=&language=en&nurProjekteMitAB=false&pemu=32" rel="nofollow" target="_blank" title="Open print view"><img alt="Print View" src="/gepris/images/iconPrint.gif"/></a></span></div>, <div class="results">
<h2><a href="/gepris/projekt/460037581">BERD@NFDI - NFDI for Business, Economic and Related Data</a></h2> <span id="icons"><a href="/gepris/projekt/460037581?displayMode=print&findButton=Finden&hitsPerPage=50&index=0&keywords_criterion=&language=en&nurProjekteMitAB=false&pemu=32" rel="nofollow" target="_blank" title="Open print view"><img alt="Print View" src="/gepris/images/iconPrint.gif"/></a></span></div>, <div class="results">
<h2><a href="/gepris/projekt/460129525">NFDI4Microbiota - National Research Data Infrastructure for Microbiota Research</a></h2> <span id="icons"><a href="/gepris/projekt/460129525?displayMode=print&findButton=Finden&hitsPerPage=50&index=0&keywords_criterion=&language=en&nurProjekteMitAB=false&pemu=32" rel="nofollow" target="_blank" title="Open print view"><img alt="Print View" src="/gepris/images/iconPrint.gif"/></a></span></div>, <div class="results">
<h2><a href="/gepris/projekt/460135501">MaRDI - Mathematical Research Data Initiative</a></h2> <span id="icons"><a href="/gepris/projekt/460135501?displayMode=print&findButton=Finden&hitsPerPage=50&index=0&keywords_criterion=&language=en&nurProjekteMitAB=false&pemu=32" rel="nofollow" target="_blank" title="Open print view"><img alt="Print View" src="/gepris/images/iconPrint.gif"/></a></span></div>, <div class="results">
<h2><a href="/gepris/projekt/460197019">FAIRmat – FAIR Data Infrastructure for Condensed-Matter Physics and the Chemical Physics of Solids</a></h2> <span id="icons"><a href="/gepris/projekt/460197019?displayMode=print&findButton=Finden&hitsPerPage=50&index=0&keywords_criterion=&language=en&nurProjekteMitAB=false&pemu=32" rel="nofollow" target="_blank" title="Open print view"><img alt="Print View" src="/gepris/images/iconPrint.gif"/></a></span></div>, <div class="results">
<h2><a href="/gepris/projekt/460234259">NFDI4DS - NFDI for Data Science and Artificial Intelligence</a></h2> <span id="icons"><a href="/gepris/projekt/460234259?displayMode=print&findButton=Finden&hitsPerPage=50&index=0&keywords_criterion=&language=en&nurProjekteMitAB=false&pemu=32" rel="nofollow" target="_blank" title="Open print view"><img alt="Print View" src="/gepris/images/iconPrint.gif"/></a></span></div>, <div class="results">
<h2><a href="/gepris/projekt/460247524">NFDI-MatWerk - National Research Data Infrastructure for Materials Science & Engineering</a></h2> <span id="icons"><a href="/gepris/projekt/460247524?displayMode=print&findButton=Finden&hitsPerPage=50&index=0&keywords_criterion=&language=en&nurProjekteMitAB=false&pemu=32" rel="nofollow" target="_blank" title="Open print view"><img alt="Print View" src="/gepris/images/iconPrint.gif"/></a></span></div>, <div class="results">
<h2><a href="/gepris/projekt/460248186">PUNCH4NFDI - Particles, Universe, NuClei and Hadrons for the NFDI</a></h2> <span id="icons"><a href="/gepris/projekt/460248186?displayMode=print&findButton=Finden&hitsPerPage=50&index=0&keywords_criterion=&language=en&nurProjekteMitAB=false&pemu=32" rel="nofollow" target="_blank" title="Open print view"><img alt="Print View" src="/gepris/images/iconPrint.gif"/></a></span></div>, <div class="results">
<h2><a href="/gepris/projekt/460248799">DAPHNE4NFDI - DAta from PHoton and Neutron Experiments for NFDI</a></h2> <span id="icons"><a href="/gepris/projekt/460248799?displayMode=print&findButton=Finden&hitsPerPage=50&index=0&keywords_criterion=&language=en&nurProjekteMitAB=false&pemu=32" rel="nofollow" target="_blank" title="Open print view"><img alt="Print View" src="/gepris/images/iconPrint.gif"/></a></span></div>]
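The title links can also be selected directly with a CSS selector, which skips the intermediate list of div elements. This is a small sketch on a hand-written HTML fragment modelled on the output above; the fragment and the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# Shortened sample modelled on the search result markup
sample_html = """
<div class="results">
<h2><a href="/gepris/projekt/441914366">GHGA – German Human Genome-Phenome Archive</a></h2>
</div>
<div class="results">
<h2><a href="/gepris/projekt/460033370">Text+</a></h2>
</div>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")

# Select every title link inside a result block
links = sample_soup.select("div.results h2 a")
hrefs = [a["href"] for a in links]
titles = [a.get_text(strip=True) for a in links]

print(hrefs)
print(titles)
```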
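Let’s process those results a bit: for each hit we take the link, split its text into a short title and a description, and build the full GEPRIS URL.
consortia = []
for result in results:
    a = result.find('a')
    # Normalise the en dash separator to a plain hyphen
    t = a.get_text().replace(' – ', ' - ')
    try:
        [title, description] = t.split(' - ')
    except ValueError:
        # No separator (or more than one): keep the whole text as the title
        [title, description] = [t, '']
    consortia.append(["https://gepris.dfg.de" + a.get('href'), title, description])
print(consortia)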
[['https://gepris.dfg.de/gepris/projekt/441914366', 'GHGA', 'German Human Genome-Phenome Archive'], ['https://gepris.dfg.de/gepris/projekt/441926934', 'NFDI4Cat', 'NFDI for Catalysis-Related Sciences'], ['https://gepris.dfg.de/gepris/projekt/441958017', 'NFDI4Culture', 'Consortium for research data on material and immaterial cultural heritage'], ['https://gepris.dfg.de/gepris/projekt/441958208', 'NFDI4Chem', 'Chemistry Consortium in the NFDI'], ['https://gepris.dfg.de/gepris/projekt/442032008', 'NFDI4BioDiversity', 'Biodiversity, Ecology & Environmental Data'], ['https://gepris.dfg.de/gepris/projekt/442077441', 'DataPLANT', 'Data in PLANT research'], ['https://gepris.dfg.de/gepris/projekt/442146713', 'NFDI4Ing', 'National Research Data Infrastructure for Engineering Services'], ['https://gepris.dfg.de/gepris/projekt/442326535', 'NFDI4Health', 'National Research Data Infrastructure for Personal Health Data'], ['https://gepris.dfg.de/gepris/projekt/442494171', 'KonsortSWD', 'Consortium for the Social, Behavioural, Educational, and Economic Sciences'], ['https://gepris.dfg.de/gepris/projekt/460033370', 'Text+', ''], ['https://gepris.dfg.de/gepris/projekt/460036893', 'NFDI4Earth', 'NFDI Consortium Earth System Sciences'], ['https://gepris.dfg.de/gepris/projekt/460037581', 'BERD@NFDI', 'NFDI for Business, Economic and Related Data'], ['https://gepris.dfg.de/gepris/projekt/460129525', 'NFDI4Microbiota', 'National Research Data Infrastructure for Microbiota Research'], ['https://gepris.dfg.de/gepris/projekt/460135501', 'MaRDI', 'Mathematical Research Data Initiative'], ['https://gepris.dfg.de/gepris/projekt/460197019', 'FAIRmat', 'FAIR Data Infrastructure for Condensed-Matter Physics and the Chemical Physics of Solids'], ['https://gepris.dfg.de/gepris/projekt/460234259', 'NFDI4DS', 'NFDI for Data Science and Artificial Intelligence'], ['https://gepris.dfg.de/gepris/projekt/460247524', 'NFDI-MatWerk', 'National Research Data Infrastructure for Materials Science & Engineering'], ['https://gepris.dfg.de/gepris/projekt/460248186', 'PUNCH4NFDI', 'Particles, Universe, NuClei and Hadrons for the NFDI'], ['https://gepris.dfg.de/gepris/projekt/460248799', 'DAPHNE4NFDI', 'DAta from PHoton and Neutron Experiments for NFDI']]
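Since the goal is a list with GEPRIS IDs, the numeric ID can be taken from the link as well. The sketch below is one possible variant, not the notebook's own method: it uses str.partition, which splits only at the first separator and therefore also copes with descriptions that themselves contain " - ". parse_result is a hypothetical helper that works on plain strings.

```python
import re


def parse_result(href, text):
    """Split link text into title/description and extract the project ID."""
    # Normalise the en dash separator to a plain hyphen
    text = text.replace(" – ", " - ")
    # partition returns (head, sep, tail); tail is '' if no separator exists
    title, _, description = text.partition(" - ")
    match = re.search(r"/projekt/(\d+)", href)
    gepris_id = match.group(1) if match else None
    return gepris_id, title, description


print(parse_result("/gepris/projekt/441914366",
                   "GHGA – German Human Genome-Phenome Archive"))
print(parse_result("/gepris/projekt/460033370", "Text+"))
```

The ID is kept as a string here so that it can be concatenated into URLs without conversion.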
Finally, we create a pandas DataFrame with one row per consortium:
import pandas as pd
nfdi = pd.DataFrame(consortia, columns=['GEPRIS', 'Title', 'Description'])
nfdi
 | GEPRIS | Title | Description |
---|---|---|---|
0 | https://gepris.dfg.de/gepris/projekt/441914366 | GHGA | German Human Genome-Phenome Archive |
1 | https://gepris.dfg.de/gepris/projekt/441926934 | NFDI4Cat | NFDI for Catalysis-Related Sciences |
2 | https://gepris.dfg.de/gepris/projekt/441958017 | NFDI4Culture | Consortium for research data on material and i... |
3 | https://gepris.dfg.de/gepris/projekt/441958208 | NFDI4Chem | Chemistry Consortium in the NFDI |
4 | https://gepris.dfg.de/gepris/projekt/442032008 | NFDI4BioDiversity | Biodiversity, Ecology & Environmental Data |
5 | https://gepris.dfg.de/gepris/projekt/442077441 | DataPLANT | Data in PLANT research |
6 | https://gepris.dfg.de/gepris/projekt/442146713 | NFDI4Ing | National Research Data Infrastructure for Engi... |
7 | https://gepris.dfg.de/gepris/projekt/442326535 | NFDI4Health | National Research Data Infrastructure for Pers... |
8 | https://gepris.dfg.de/gepris/projekt/442494171 | KonsortSWD | Consortium for the Social, Behavioural, Educat... |
9 | https://gepris.dfg.de/gepris/projekt/460033370 | Text+ | |
10 | https://gepris.dfg.de/gepris/projekt/460036893 | NFDI4Earth | NFDI Consortium Earth System Sciences |
11 | https://gepris.dfg.de/gepris/projekt/460037581 | BERD@NFDI | NFDI for Business, Economic and Related Data |
12 | https://gepris.dfg.de/gepris/projekt/460129525 | NFDI4Microbiota | National Research Data Infrastructure for Micr... |
13 | https://gepris.dfg.de/gepris/projekt/460135501 | MaRDI | Mathematical Research Data Initiative |
14 | https://gepris.dfg.de/gepris/projekt/460197019 | FAIRmat | FAIR Data Infrastructure for Condensed-Matter ... |
15 | https://gepris.dfg.de/gepris/projekt/460234259 | NFDI4DS | NFDI for Data Science and Artificial Intelligence |
16 | https://gepris.dfg.de/gepris/projekt/460247524 | NFDI-MatWerk | National Research Data Infrastructure for Mate... |
17 | https://gepris.dfg.de/gepris/projekt/460248186 | PUNCH4NFDI | Particles, Universe, NuClei and Hadrons for th... |
18 | https://gepris.dfg.de/gepris/projekt/460248799 | DAPHNE4NFDI | DAta from PHoton and Neutron Experiments for NFDI |
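A separate ID column can be derived from the URL column with pandas string methods. The sketch below runs on a two-row frame; the frame name demo and the column name GEPRIS_ID are assumptions and are not used in the notebook.

```python
import pandas as pd

demo = pd.DataFrame(
    [
        ["https://gepris.dfg.de/gepris/projekt/441914366", "GHGA"],
        ["https://gepris.dfg.de/gepris/projekt/460033370", "Text+"],
    ],
    columns=["GEPRIS", "Title"],
)

# Capture the trailing digits of each project URL as a string column
demo["GEPRIS_ID"] = demo["GEPRIS"].str.extract(r"/projekt/(\d+)$", expand=False)

print(demo[["Title", "GEPRIS_ID"]])
```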
Let’s save the DataFrame to a CSV file.
nfdi.to_csv("../../../data/GEPRIS_NFDI_all.csv", index=False, encoding='utf-8')