Parsing GEPRIS pages of NFDI consortia for project applicants and participants
The list of funded NFDI projects with their GEPRIS IDs was saved to a CSV file in notebook 02. Let’s read that file into a dataframe.
import pandas as pd
df = pd.read_csv("../../../data/GEPRIS_NFDI_all.csv").fillna('')
df
| | GEPRIS | Title | Description |
---|---|---|---|
0 | https://gepris.dfg.de/gepris/projekt/441914366 | GHGA | German Human Genome-Phenome Archive |
1 | https://gepris.dfg.de/gepris/projekt/441926934 | NFDI4Cat | NFDI for Catalysis-Related Sciences |
2 | https://gepris.dfg.de/gepris/projekt/441958017 | NFDI4Culture | Consortium for research data on material and i... |
3 | https://gepris.dfg.de/gepris/projekt/441958208 | NFDI4Chem | Chemistry Consortium in the NFDI |
4 | https://gepris.dfg.de/gepris/projekt/442032008 | NFDI4BioDiversity | Biodiversity, Ecology & Environmental Data |
5 | https://gepris.dfg.de/gepris/projekt/442077441 | DataPLANT | Data in PLANT research |
6 | https://gepris.dfg.de/gepris/projekt/442146713 | NFDI4Ing | National Research Data Infrastructure for Engi... |
7 | https://gepris.dfg.de/gepris/projekt/442326535 | NFDI4Health | National Research Data Infrastructure for Pers... |
8 | https://gepris.dfg.de/gepris/projekt/442494171 | KonsortSWD | Consortium for the Social, Behavioural, Educat... |
9 | https://gepris.dfg.de/gepris/projekt/460033370 | Text+ | |
10 | https://gepris.dfg.de/gepris/projekt/460036893 | NFDI4Earth | NFDI Consortium Earth System Sciences |
11 | https://gepris.dfg.de/gepris/projekt/460037581 | BERD@NFDI | NFDI for Business, Economic and Related Data |
12 | https://gepris.dfg.de/gepris/projekt/460129525 | NFDI4Microbiota | National Research Data Infrastructure for Micr... |
13 | https://gepris.dfg.de/gepris/projekt/460135501 | MaRDI | Mathematical Research Data Initiative |
14 | https://gepris.dfg.de/gepris/projekt/460197019 | FAIRmat | FAIR Data Infrastructure for Condensed-Matter ... |
15 | https://gepris.dfg.de/gepris/projekt/460234259 | NFDI4DS | NFDI for Data Science and Artificial Intelligence |
16 | https://gepris.dfg.de/gepris/projekt/460247524 | NFDI-MatWerk | National Research Data Infrastructure for Mate... |
17 | https://gepris.dfg.de/gepris/projekt/460248186 | PUNCH4NFDI | Particles, Universe, NuClei and Hadrons for th... |
18 | https://gepris.dfg.de/gepris/projekt/460248799 | DAPHNE4NFDI | DAta from PHoton and Neutron Experiments for NFDI |
For testing, let’s start with the BERD@NFDI consortium only.
import requests
GEPRIS = "https://gepris.dfg.de/gepris/projekt/460037581"
params = {'language': 'en'}
r = requests.get(GEPRIS, params=params)
text = r.text.encode(r.encoding).decode('utf8')
print(text[0:100].strip())
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www
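The `encode`/`decode` round trip in the request cell can look odd at first. Here is a minimal sketch of what it presumably guards against, assuming the server sends UTF-8 bytes while `requests` falls back to ISO-8859-1 when building `r.text` (an assumption about the response headers, not something verified here):

```python
# Sketch: why re-encoding with the guessed codec and decoding as UTF-8 repairs the text.
# Assumption: the page bytes are UTF-8 but were decoded as ISO-8859-1.
raw_bytes = "Universität Mannheim".encode("utf-8")   # what the server would send

guessed = "ISO-8859-1"                                # hypothetical value of r.encoding
mojibake = raw_bytes.decode(guessed)                  # what r.text would then contain
repaired = mojibake.encode(guessed).decode("utf-8")   # the notebook's round trip

print(mojibake)   # UniversitÃ¤t Mannheim
print(repaired)   # Universität Mannheim
```

Under the same assumption, decoding the raw bytes directly with `r.content.decode('utf8')` gives the same result and does not depend on `r.encoding` being set.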
Then the subject area is parsed:
from bs4 import BeautifulSoup
soup = BeautifulSoup(text, 'html.parser')
results = soup.find_all("div", class_="firstUnderAntragsbeteiligte")
subject_area = results[0].find_all('span')[-1].text.strip().replace(' ', '').replace('\n', ', ')
print(subject_area)
Social and Behavioural Sciences
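The chained `replace` calls depend on the exact whitespace GEPRIS puts around each subject. As a hedged alternative, the sketch below splits the text into lines and strips each one. `normalize_subject_area` is a hypothetical helper, and the sample string is made up to resemble a multi-line entry; it is not copied from a GEPRIS page.

```python
def normalize_subject_area(raw: str) -> str:
    """Join the non-empty, stripped lines of `raw` with ', '."""
    parts = (line.strip() for line in raw.splitlines())
    return ", ".join(part for part in parts if part)

# Made-up sample: two subjects on separate, space-padded lines.
sample = "\n      Medicine\n      Biology\n   "
print(normalize_subject_area(sample))  # Medicine, Biology
```

Spaces inside a subject name are left untouched, so a single-line value such as the one printed above passes through unchanged.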
The homepage is parsed:
try:
    homepage = soup.find('a', class_="extern").get('href')
except AttributeError:  # no external link on the page: find() returned None
    homepage = ''
print(homepage)
The project description is parsed:
description = soup.find('div', id="projekttext").text.strip()
print(description)
The research domains of business, economics, and other social sciences are concerned with the relationships among individuals and organizations within a society. To understand these complex systems, social science disciplines have been using empirical methods for a long time. However, unstructured and non-standard data, i.e., information that either does not have a previously defined data model or is not organized in a predefined manner (e.g., images or videos from social media), are gaining in relevance. The generation of continuous data streams in society and economy (datafication) strengthens this trend: It is estimated that by 2025 80% of the data processed in economic applications will be available in unstructured form. Because of the sheer size and, more importantly, the lack of structure and the heterogeneity of raw digital data, the BERD@NFDI community calls for innovative and reusable methods, mostly from artificial intelligence and machine learning, as well as a suitable storage and computing environment to process the data in a way that it can be used for further scientific analyses. Consequently, algorithms become an integral part of the research data life cycle and thus have to be managed in the same way as the data itself.Within this context, the mission of BERD@NFDI is to develop, provide, and maintain a future-oriented, powerful research data infrastructure for the integrated management of unstructured data and related scientific software. With this focus and a clear commitment to openness (e.g., open software, open standards) BERD@NFDI can be an important and unique mosaic stone to build up the NFDI.BERD@NFDI will coordinate and enable sustainable research processes through the ongoing design of suitable services. It will not provide a mere storage location for data and scientific software, but a service portfolio that will be strictly aligned to the actual needs of the scientific user community. 
User feedback will be ensured through the BERD@NFDI training program, technically integrated interaction features, and institutionalized user participation in the consortium's committee structure.The consortium is headed by the University of Mannheim and is composed of seven co-applicant institutions including infrastructure providers and research institutions with a focus on business, economics, and other social sciences. Many of them have a long-lasting tradition of prosperous collaboration, which leverages the consortium to a high level of efficiency and effectiveness to successfully conduct the ambitious work program. Finally, the transparent governance structure with clear responsibilities together with a user-centric, multi-stakeholder approach will ensure the success of BERD@NFDI and will make a structural contribution to the success of the NFDI as a whole.
Let’s parse the rest of the useful data, put it into the berd dictionary, and pretty-print it:
s = soup.find_all('div', class_="content_frame")
berd = {}
# Each labelled <div> in the second content frame holds the links for one role.
for k in s[1].find_all('div'):
    try:
        t = k.find('span', class_="name")
        if t.get_text().strip()!="DFG Programme" and k.find_all('a', class_="intern"):
            berd[t.get_text().strip()] = [ (g.get_text(), "https://gepris.dfg.de" + g.get('href')) for g in k.find_all('a', class_="intern")]
    except AttributeError:  # <div> without a label: find() returned None
        pass
from pprint import pprint
pprint(berd)
{'Applying institution': [('Universität Mannheim',
'https://gepris.dfg.de/gepris/institution/10045')],
'Co-Spokespersons': [('Professor Dr. Bernd Bischl',
'https://gepris.dfg.de/gepris/person/289708818'),
('Professor Dr. Stefan Dietze',
'https://gepris.dfg.de/gepris/person/218646654'),
('Professor Dr. Marc Fischer',
'https://gepris.dfg.de/gepris/person/1804971'),
('Dr. Sabine Gehrlein',
'https://gepris.dfg.de/gepris/person/398007310'),
('Professor Dr. Mark Heitmann',
'https://gepris.dfg.de/gepris/person/216564778'),
('Professor Dr. Hartmut Höhle',
'https://gepris.dfg.de/gepris/person/442455860'),
('Professor Dr. Göran Kauermann',
'https://gepris.dfg.de/gepris/person/1375171'),
('Professorin Dr. Frauke Kreuter',
'https://gepris.dfg.de/gepris/person/1755279'),
('Professor Dr. Klaus Tochtermann',
'https://gepris.dfg.de/gepris/person/1552555'),
('Professor Dr. Christof Wolf',
'https://gepris.dfg.de/gepris/person/1610376')],
'Co-applicant institution': [('GESIS - Leibniz-Institut für '
'Sozialwissenschaften',
'https://gepris.dfg.de/gepris/institution/145000415'),
('Institut für Arbeitsmarkt- und Berufsforschung '
'(IAB)der Bundesagentur für Arbeit (BA)',
'https://gepris.dfg.de/gepris/institution/12925'),
('Ludwig-Maximilians-Universität München',
'https://gepris.dfg.de/gepris/institution/10108'),
('Universität Hamburg',
'https://gepris.dfg.de/gepris/institution/10192'),
('Universität zu Köln',
'https://gepris.dfg.de/gepris/institution/10282'),
('ZBW - Leibniz-Informationszentrum Wirtschaft',
'https://gepris.dfg.de/gepris/institution/17916')],
'Cooperation partners': [('Professor Rayid Ghani',
'https://gepris.dfg.de/gepris/person/406699457'),
('Professorin Dr. Julia Lane',
'https://gepris.dfg.de/gepris/person/442458155')],
'Participating Institution': [('Bayerische Akademie der Wissenschaften',
'https://gepris.dfg.de/gepris/institution/10138'),
('Gesellschaft für Sozial- und '
'Wirtschaftsgeschichte e. V.c/o Universität '
'RegensburgLehrstuhl für Wirtschafts- und '
'Sozialgeschichte',
'https://gepris.dfg.de/gepris/institution/442499111'),
('Gesellschaft für Unternehmensgeschichte e.V. '
'(GUG)',
'https://gepris.dfg.de/gepris/institution/442499393'),
('Institut für Bank- und Finanzgeschichte e.V. '
'(IBF)',
'https://gepris.dfg.de/gepris/institution/442499747'),
('Leibniz-Institut für Finanzmarktforschung '
'SAFE',
'https://gepris.dfg.de/gepris/institution/263945553'),
('Leibniz-Institut für ökologische '
'Raumentwicklung (IÖR) e.V.',
'https://gepris.dfg.de/gepris/institution/10398'),
('Rat für Sozial- und Wirtschaftsdaten '
'(RatSWD)c/o Wissenschaftszentrum Berlin (WZB)',
'https://gepris.dfg.de/gepris/institution/310279050'),
('Verband der Hochschullehrer für '
'Betriebswirtschaft (VHB) e.V.',
'https://gepris.dfg.de/gepris/institution/233943626'),
('Verein für Socialpolitik',
'https://gepris.dfg.de/gepris/institution/428818528'),
('ZEW - Leibniz-Zentrum für Europäische '
'Wirtschaftsforschung GmbH',
'https://gepris.dfg.de/gepris/institution/10543')],
'Participating Persons': [('Dr. Marianne Dörr',
'https://gepris.dfg.de/gepris/person/60331789'),
('Dr. Fabian Franke',
'https://gepris.dfg.de/gepris/person/96294761'),
('Dr. Katrin Moeller',
'https://gepris.dfg.de/gepris/person/236205840'),
('Dana Müller',
'https://gepris.dfg.de/gepris/person/350461336'),
('Professorin Dr. Isabella Peters',
'https://gepris.dfg.de/gepris/person/59860187'),
('Alexander Pfister',
'https://gepris.dfg.de/gepris/person/442456395'),
('Professor Dr. Mark Spoerer',
'https://gepris.dfg.de/gepris/person/1857714'),
('Professor Dr. Jochen Streb',
'https://gepris.dfg.de/gepris/person/1631070'),
('Professor Dr. Heiner Stuckenschmidt',
'https://gepris.dfg.de/gepris/person/1670080'),
('Dr. Peter Wittenburg',
'https://gepris.dfg.de/gepris/person/219421514')],
'Spokesperson': [('Professor Dr. Florian Stahl',
'https://gepris.dfg.de/gepris/person/274347517')]}
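For later analysis it can be handier to have one row per (role, name, link) than a dictionary of lists. The following is a small sketch that assumes the structure printed above; `flatten_roles` is a hypothetical helper, and the sample dictionary contains only two entries taken from that output.

```python
from typing import Dict, List, Tuple

Links = List[Tuple[str, str]]

def flatten_roles(roles: Dict[str, Links]) -> List[Tuple[str, str, str]]:
    """Turn {role: [(name, url), ...]} into a flat list of (role, name, url) rows."""
    return [(role, name, url) for role, links in roles.items() for name, url in links]

# Two entries copied from the pretty-printed dictionary.
sample = {
    "Spokesperson": [("Professor Dr. Florian Stahl",
                      "https://gepris.dfg.de/gepris/person/274347517")],
    "Applying institution": [("Universität Mannheim",
                              "https://gepris.dfg.de/gepris/institution/10045")],
}
rows = flatten_roles(sample)
for row in rows:
    print(row)
```

The resulting rows can be handed to `pd.DataFrame(rows, columns=['role', 'name', 'url'])` if a tabular view is preferred.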
Now we can loop over all consortia and collect the same data for each of them automatically:
nfdi = {}
for GEPRIS, title in zip(df['GEPRIS'],df['Title']):
    nfdi[GEPRIS] = {}
    nfdi[GEPRIS]['title'] = title
    # Download the English version of the project page
    params = {'language': 'en'}
    r = requests.get(GEPRIS, params=params)
    text = r.text.encode(r.encoding).decode('utf8')
    soup = BeautifulSoup(text, 'html.parser')
    # Subject area
    results = soup.find_all("div", class_="firstUnderAntragsbeteiligte")
    subject_area = results[0].find_all('span')[-1].text.strip().replace(' ', '').replace('\n', ', ')
    nfdi[GEPRIS]['subject_area'] = subject_area
    # Homepage (empty string if the page has no external link)
    try:
        homepage = soup.find('a', class_="extern").get('href')
    except AttributeError:
        homepage = ''
    nfdi[GEPRIS]['homepage'] = homepage
    # Project description
    description = soup.find('div', id="projekttext").text.strip()
    nfdi[GEPRIS]['description'] = description
    # Applicants and participants, one entry per role
    s = soup.find_all('div', class_="content_frame")
    for k in s[1].find_all('div'):
        try:
            t = k.find('span', class_="name")
            if t.get_text().strip()!="DFG Programme" and k.find_all('a', class_="intern"):
                nfdi[GEPRIS][t.get_text().strip()] = [ (g.get_text(), "https://gepris.dfg.de" + g.get('href')) for g in k.find_all('a', class_="intern")]
        except AttributeError:  # <div> without a label
            pass
import pandas as pd
df_nfdi = pd.DataFrame(nfdi).fillna('')
df_nfdi.T
| | title | subject_area | homepage | description | Applying institution | Co-applicant institution | Participating Institution | Spokesperson | Participating Persons | Co-Spokespersons | Cooperation partners |
---|---|---|---|---|---|---|---|---|---|---|---|
https://gepris.dfg.de/gepris/projekt/441914366 | GHGA | Medicine, Biology | https://ghga.dkfz.de/ | Human genome sequencing and other omics data m... | [(Deutsches Krebsforschungszentrum (DKFZ), htt... | [(Charité - Universitätsmedizin Berlin, https:... | [(CISPA - Helmholtz-Zentrum für Informationssi... | [(Professor Dr. Oliver Stegle, Ph.D., https:/... | [(Viktor Achter, https://gepris.dfg.de/gepris... | [(Privatdozent Dr. Peer Bork, https://gepris.... | |
https://gepris.dfg.de/gepris/projekt/441926934 | NFDI4Cat | Biology, Chemistry, Mathematics, Thermal Engin... | http://gecats.org/NFDI4Cat.html | The overall strategy for the transformation of... | [(DECHEMA Gesellschaft für Chemische Technik u... | [(Fraunhofer-Institut für Offene Kommunikation... | [(Technische Universität Darmstadt, https://ge... | [(Dr. Andreas Förster, https://gepris.dfg.de/... | [(Professor Dr.-Ing. Bastian Etzold, https://... | [(Professor Dr. Matthias Beller, https://gepr... | |
https://gepris.dfg.de/gepris/projekt/441958017 | NFDI4Culture | Humanities, Construction Engineering and Archi... | https://nfdi4culture.de | Digital data on tangible and intangible cultur... | [(Akademie der Wissenschaften und der Literatu... | [(FIZ KarlsruheLeibniz-Institut für Informatio... | [(Arbeitsgemeinschaft der kunsthistorischen Bi... | [(Professor Torsten Schrade, https://gepris.d... | [(Professorin Dr. Stefanie Acquavella-Rauch, ... | [(Reinhard Altenhöner, https://gepris.dfg.de/... | |
https://gepris.dfg.de/gepris/projekt/441958208 | NFDI4Chem | Chemistry | https://nfdi4chem.de | The vision of NFDI4Chem is the digitalisation ... | [(Friedrich-Schiller-Universität Jena, https:/... | [(FIZ KarlsruheLeibniz-Institut für Informatio... | [(Beilstein-Institut zur Förderung der chemisc... | [(Professor Dr. Christoph Steinbeck, https://... | [(Privatdozent Dr. Carsten Baldauf, https://g... | [(Dr. Felix Bach, https://gepris.dfg.de/gepri... | |
https://gepris.dfg.de/gepris/projekt/442032008 | NFDI4BioDiversity | Biology, Medicine, Agriculture, Forestry and V... | https://www.nfdi4biodiversity.org | Biodiversity is more than just the diversity o... | [(Universität Bremen, https://gepris.dfg.de/ge... | [(Forschungsinstitut für Nutztierbiologie (FBN... | [(Alfred-Wegener-InstitutHelmholtz-Zentrum für... | [(Professor Dr. Frank Oliver Glöckner, https:... | [(Professor Dr. Christian Ammer, https://gepr... | [(Professorin Dr. Aletta Bonn, https://gepris... | |
https://gepris.dfg.de/gepris/projekt/442077441 | DataPLANT | Biology | https://nfdi4plants.de | In modern hypothesis-driven research, scientis... | [(Albert-Ludwigs-Universität Freiburg, https:/... | [(Eberhard Karls Universität Tübingen, https:/... | [(Helmholtz Zentrum München - Deutsches Forsch... | [(Dr. Dirk von Suchodoletz, https://gepris.dfg... | [(Professor Dr. Rolf Backofen, https://gepris... | [(Dr. Jens Krüger, https://gepris.dfg.de/gepr... | |
https://gepris.dfg.de/gepris/projekt/442146713 | NFDI4Ing | Mechanical and Industrial Engineering, Thermal... | https://nfdi4ing.de | NFDI4Ing brings together the engineering commu... | [(Rheinisch-Westfälische Technische Hochschule... | [(Deutsches Zentrum für Luft- und Raumfahrt e.... | [(Bundesanstalt für Materialforschung und -prü... | [(Professor Dr.-Ing. Robert Schmitt, https://... | [(Professorin Dr. Jasmin Aghassi-Hagmann, htt... | [(Verena Anthofer, https://gepris.dfg.de/gepr... | |
https://gepris.dfg.de/gepris/projekt/442326535 | NFDI4Health | Medicine | https://www.nfdi4health.de | Germany has accumulated a wealth of health-rel... | [(Deutsche Zentralbibliothek für Medizin (ZB M... | [(Charité - Universitätsmedizin Berlin, https:... | [(Behörde für Gesundheit und Verbraucherschutz... | [(Professorin Dr. Juliane Fluck, https://gepr... | [(Professor Dr. Thomas Behrens, https://gepri... | [(Professor Dr. Wolfgang Ahrens, https://gepr... | |
https://gepris.dfg.de/gepris/projekt/442494171 | KonsortSWD | Social and Behavioural Sciences, Humanities | https://www.konsortswd.de | The social, behavioural, educational, and econ... | [(GESIS - Leibniz-Institut für Sozialwissensch... | [(Deutsches Institut für Wirtschaftsforschung ... | [(Bundesamt für Migration und Flüchtlinge, htt... | [(Professor Dr. Christof Wolf, https://gepris... | [(Dr. Maja Adena, https://gepris.dfg.de/gepri... | ||
https://gepris.dfg.de/gepris/projekt/460033370 | Text+ | Humanities | Text+ aims to develop a research data infrastr... | [(Leibniz-Institut für Deutsche Sprache (IDS),... | [(Berlin-Brandenburgische Akademie der Wissens... | [(Akademie der Wissenschaften in Hamburg, http... | [(Professor Dr. Erhard Hinrichs, https://gepr... | [(Professor Dr. Andreas Henrich, https://gepr... | [(Privatdozent Dr. Alexander Geyken, https://... | ||
https://gepris.dfg.de/gepris/projekt/460036893 | NFDI4Earth | Geosciences, Agriculture, Forestry and Veterin... | NFDI4Earth addresses digital needs of Earth Sy... | [(Technische Universität Dresden, https://gepr... | [(Alfred-Wegener-InstitutHelmholtz-Zentrum für... | [(Bayerische Akademie der Wissenschaften, http... | [(Professor Dr. Lars Bernard, https://gepris.... | [(Roland Bertelmann, https://gepris.dfg.de/ge... | |||
https://gepris.dfg.de/gepris/projekt/460037581 | BERD@NFDI | Social and Behavioural Sciences | The research domains of business, economics, a... | [(Universität Mannheim, https://gepris.dfg.de/... | [(GESIS - Leibniz-Institut für Sozialwissensch... | [(Bayerische Akademie der Wissenschaften, http... | [(Professor Dr. Florian Stahl, https://gepris... | [(Dr. Marianne Dörr, https://gepris.dfg.de/ge... | [(Professor Dr. Bernd Bischl, https://gepris.... | [(Professor Rayid Ghani, https://gepris.dfg.d... | |
https://gepris.dfg.de/gepris/projekt/460129525 | NFDI4Microbiota | Biology | Microbes – bacteria, archaea, unicellular euka... | [(Deutsche Zentralbibliothek für Medizin (ZB M... | [(European Molecular Biology Laboratory (EMBL)... | [(Christian-Albrechts-Universität zu Kiel, htt... | [(Professor Dr. Konrad Förstner, https://gepr... | [(Professorin Dr. Anke Becker, https://gepris... | |||
https://gepris.dfg.de/gepris/projekt/460135501 | MaRDI | Mathematics | Mathematical research data (MRD) has become va... | [(Weierstraß-Institut für Angewandte Analysis ... | [(Deutsche Mathematiker-Vereinigung e.V.c/o WI... | [(European Mathematical Society, https://gepri... | [(Professor Dr. Michael Hintermüller, https:/... | [(Professor Dr. Peter Bastian, https://gepris... | [(Professorin Dr. Ilka Agricola, https://gepr... | ||
https://gepris.dfg.de/gepris/projekt/460197019 | FAIRmat | Physics | Scientific data are a significant raw material... | [(Humboldt-Universität zu Berlin, https://gepr... | [(FAIR-DI e.V.Humboldt-Universität zu Berlin, ... | [(Deutsche Physikalische Gesellschaft e. V., h... | [(Professorin Dr. Claudia Draxl, https://gepr... | [(Professor Dr. Martin Aeschlimann, https://g... | [(Dr. Martin Albrecht, https://gepris.dfg.de/... | ||
https://gepris.dfg.de/gepris/projekt/460234259 | NFDI4DS | Computer Science, Systems andElectrical Engine... | The vision of NFDI4DataScience (NFDI4DS) is to... | [(Fraunhofer-Gesellschaft zur Förderung der an... | [(Deutsche Zentralbibliothek für Medizin (ZB M... | [(Alfred-Wegener-InstitutHelmholtz-Zentrum für... | [(Dr. Sonja Schimmler, https://gepris.dfg.de/... | [(Privatdozent Dr. Carsten Baldauf, https://g... | [(Professor Dr. Ziawasch Abedjan, https://gep... | ||
https://gepris.dfg.de/gepris/projekt/460247524 | NFDI-MatWerk | Materials Science and Engineering | Since the Stone Age the mastery of materials h... | [(Fraunhofer-Gesellschaft zur Förderung der an... | [(Deutsches Forschungszentrum für Künstliche I... | [(Albert-Ludwigs-Universität Freiburg, https:/... | [(Professor Dr. Christoph Eberl, https://gepr... | [(Professor Dr.-Ing. Tilmann Beck, https://ge... | [(Professor Dr.-Ing. Erik Bitzek, https://gep... | ||
https://gepris.dfg.de/gepris/projekt/460248186 | PUNCH4NFDI | Physics | PUNCH4NFDI, the consortium of particle, astrop... | [(Deutsches Elektronen-Synchrotron (DESY), htt... | [(Forschungszentrum Jülich, https://gepris.dfg... | [(Albert-Ludwigs-Universität Freiburg, https:/... | [(Privatdozent Dr. Thomas Schörner-Sadenius, ... | [(Privatdozent Dr. Philip Bechtle, https://ge... | |||
https://gepris.dfg.de/gepris/projekt/460248799 | DAPHNE4NFDI | Physics, Biology, Chemistry, Materials Science... | The photon and neutron science community encom... | [(Deutsches Elektronen-Synchrotron (DESY), htt... | [(Bergische Universität Wuppertal, https://gep... | [(Bruker AXS GmbH, https://gepris.dfg.de/gepri... | [(Dr. Anton Barty, https://gepris.dfg.de/gepr... | [(Dr. Sebastian Busch, https://gepris.dfg.de/... |
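The loop over all consortia sends one request per page with no pause, and an uncaught error on any page aborts the whole run. Below is a minimal sketch of a more forgiving driver, under the assumption that the per-page work is wrapped in a function. `collect_pages`, `parse_page`, and the `delay` value are hypothetical names and settings; the stand-in parser and the second URL exist only to make the example runnable offline.

```python
import time
from typing import Callable, Dict, Iterable, Tuple

def collect_pages(urls: Iterable[str],
                  parse_page: Callable[[str], dict],
                  delay: float = 1.0) -> Tuple[Dict[str, dict], Dict[str, str]]:
    """Apply parse_page to every URL, pausing between calls and recording failures."""
    parsed: Dict[str, dict] = {}
    failed: Dict[str, str] = {}
    for url in urls:
        try:
            parsed[url] = parse_page(url)
        except Exception as exc:  # keep going, but remember what went wrong
            failed[url] = repr(exc)
        time.sleep(delay)  # pause between requests to go easy on the server
    return parsed, failed

# Offline stand-in for the real download-and-parse step (hypothetical).
def fake_parse(url: str) -> dict:
    if url.endswith("/0"):
        raise ValueError("page could not be parsed")
    return {"id": url.rsplit("/", 1)[-1]}

parsed, failed = collect_pages(
    ["https://gepris.dfg.de/gepris/projekt/460037581",
     "https://gepris.dfg.de/gepris/projekt/0"],  # made-up ID to trigger the error path
    fake_parse,
    delay=0.0,
)
print(parsed)
print(failed)
```

Keeping the failures in a separate dictionary makes it easy to retry only the pages that went wrong.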
Let’s save the parsed data to a CSV file:
df_nfdi.T.to_csv("../../../data/GEPRIS_NFDI_project_pages.csv", index=True, index_label='gepris', encoding='utf-8')
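One caveat of the CSV format: the list-of-tuples cells are written as their string representation, so they come back as plain strings when the file is read again. Here is a sketch of how they could be restored with `ast.literal_eval`, shown with the standard `csv` module and an in-memory file so that it runs without the real data; `parse_links` is a hypothetical helper.

```python
import ast
import csv
import io

def parse_links(cell: str) -> list:
    """Convert a stringified list of (name, url) tuples back into a list."""
    return ast.literal_eval(cell) if cell else []

links = [("Universität Mannheim", "https://gepris.dfg.de/gepris/institution/10045")]

# Write one row the way a list cell ends up in the CSV: as str(list).
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["gepris", "Applying institution", "Cooperation partners"])
writer.writerow(["https://gepris.dfg.de/gepris/projekt/460037581", str(links), ""])

# Read it back and restore the Python objects.
buffer.seek(0)
row = next(csv.DictReader(buffer))
restored = parse_links(row["Applying institution"])
empty = parse_links(row["Cooperation partners"])
print(restored == links, empty)
```

With pandas, the same function could be supplied per column through the `converters` argument of `pd.read_csv`.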