Parsing GEPRIS pages of NFDI consortia for project applicants and participants

The list of funded NFDI projects with their GEPRIS IDs was saved in notebook 02. Let’s read that file into a dataframe.

import pandas as pd
df = pd.read_csv("../../../data/GEPRIS_NFDI_all.csv").fillna('')
df
    GEPRIS                                          Title              Description
 0  https://gepris.dfg.de/gepris/projekt/441914366  GHGA               German Human Genome-Phenome Archive
 1  https://gepris.dfg.de/gepris/projekt/441926934  NFDI4Cat           NFDI for Catalysis-Related Sciences
 2  https://gepris.dfg.de/gepris/projekt/441958017  NFDI4Culture       Consortium for research data on material and i...
 3  https://gepris.dfg.de/gepris/projekt/441958208  NFDI4Chem          Chemistry Consortium in the NFDI
 4  https://gepris.dfg.de/gepris/projekt/442032008  NFDI4BioDiversity  Biodiversity, Ecology & Environmental Data
 5  https://gepris.dfg.de/gepris/projekt/442077441  DataPLANT          Data in PLANT research
 6  https://gepris.dfg.de/gepris/projekt/442146713  NFDI4Ing           National Research Data Infrastructure for Engi...
 7  https://gepris.dfg.de/gepris/projekt/442326535  NFDI4Health        National Research Data Infrastructure for Pers...
 8  https://gepris.dfg.de/gepris/projekt/442494171  KonsortSWD         Consortium for the Social, Behavioural, Educat...
 9  https://gepris.dfg.de/gepris/projekt/460033370  Text+
10  https://gepris.dfg.de/gepris/projekt/460036893  NFDI4Earth         NFDI Consortium Earth System Sciences
11  https://gepris.dfg.de/gepris/projekt/460037581  BERD@NFDI          NFDI for Business, Economic and Related Data
12  https://gepris.dfg.de/gepris/projekt/460129525  NFDI4Microbiota    National Research Data Infrastructure for Micr...
13  https://gepris.dfg.de/gepris/projekt/460135501  MaRDI              Mathematical Research Data Initiative
14  https://gepris.dfg.de/gepris/projekt/460197019  FAIRmat            FAIR Data Infrastructure for Condensed-Matter ...
15  https://gepris.dfg.de/gepris/projekt/460234259  NFDI4DS            NFDI for Data Science and Artificial Intelligence
16  https://gepris.dfg.de/gepris/projekt/460247524  NFDI-MatWerk       National Research Data Infrastructure for Mate...
17  https://gepris.dfg.de/gepris/projekt/460248186  PUNCH4NFDI         Particles, Universe, NuClei and Hadrons for th...
18  https://gepris.dfg.de/gepris/projekt/460248799  DAPHNE4NFDI        DAta from PHoton and Neutron Experiments for NFDI

For testing, let’s start with the BERD@NFDI consortium only.

import requests
GEPRIS = "https://gepris.dfg.de/gepris/projekt/460037581"
params = {'language': 'en'}
r = requests.get(GEPRIS, params=params)
text = r.text.encode(r.encoding).decode('utf8')
print(text[0:100].strip())
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www

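The `encode(r.encoding).decode('utf8')` step above presumably works around pages that are served as UTF-8 but decoded with a different guessed charset. Here is a minimal sketch of that round trip using plain strings and no network access; the guessed encoding `iso-8859-1` is an assumption for illustration, not something read from the GEPRIS response.

```python
# Simulate a UTF-8 page whose bytes were decoded with the wrong charset.
original = "Universität Mannheim"
raw_bytes = original.encode("utf-8")           # bytes as sent by the server
guessed_encoding = "iso-8859-1"                # assumed wrong guess
garbled = raw_bytes.decode(guessed_encoding)   # what r.text would then contain

# The notebook's repair: undo the wrong decoding, then decode as UTF-8.
repaired = garbled.encode(guessed_encoding).decode("utf-8")

print(garbled)    # UniversitÃ¤t Mannheim
print(repaired)   # Universität Mannheim
```

Note that the repair only works if `r.encoding` is set; if it were `None`, the `encode` call would raise a `TypeError`. Setting `r.encoding = 'utf-8'` before reading `r.text`, or decoding `r.content` directly, might be a simpler alternative.
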
Then the subject area is parsed:

from bs4 import BeautifulSoup
soup = BeautifulSoup(text, 'html.parser')
results = soup.find_all("div", class_="firstUnderAntragsbeteiligte")
subject_area = results[0].find_all('span')[-1].text.strip().replace('  ', '').replace('\n', ', ')
print(subject_area)
Social and Behavioural Sciences
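
The chained `replace` calls deserve a closer look: `replace('  ', '')` removes every pair of spaces, so a double space between two words glues them together (the "Systems andElectrical" in the NFDI4DS row further down may be such a case). The sketch below compares the notebook's chain with a per-line alternative. The function names and the sample string are hypothetical; the real GEPRIS markup may differ.

```python
def clean_chained(raw: str) -> str:
    """The notebook's approach: drop double spaces, turn newlines into commas."""
    return raw.strip().replace("  ", "").replace("\n", ", ")


def clean_by_lines(raw: str) -> str:
    """Alternative: collapse whitespace within each line, then join non-empty lines."""
    lines = (" ".join(line.split()) for line in raw.splitlines())
    return ", ".join(line for line in lines if line)


# Hypothetical span text with indentation and one accidental double space.
raw = "\n    Computer Science, Systems and  Electrical Engineering\n    Mathematics\n"

print(clean_chained(raw))
# Computer Science, Systems andElectrical Engineering, Mathematics
print(clean_by_lines(raw))
# Computer Science, Systems and Electrical Engineering, Mathematics
```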

The homepage is parsed:

link = soup.find('a', class_="extern")  # None if the page has no external link
homepage = link.get('href') if link is not None else ''
print(homepage)

The project description is parsed:

description = soup.find('div', id="projekttext").text.strip()
print(description)
The research domains of business, economics, and other social sciences are concerned with the relationships among individuals and organizations within a society. To understand these complex systems, social science disciplines have been using empirical methods for a long time. However, unstructured and non-standard data, i.e., information that either does not have a previously defined data model or is not organized in a predefined manner (e.g., images or videos from social media), are gaining in relevance. The generation of continuous data streams in society and economy (datafication) strengthens this trend: It is estimated that by 2025 80% of the data processed in economic applications will be available in unstructured form. Because of the sheer size and, more importantly, the lack of structure and the heterogeneity of raw digital data, the BERD@NFDI community calls for innovative and reusable methods, mostly from artificial intelligence and machine learning, as well as a suitable storage and computing environment to process the data in a way that it can be used for further scientific analyses. Consequently, algorithms become an integral part of the research data life cycle and thus have to be managed in the same way as the data itself.Within this context, the mission of BERD@NFDI is to develop, provide, and maintain a future-oriented, powerful research data infrastructure for the integrated management of unstructured data and related scientific software. With this focus and a clear commitment to openness (e.g., open software, open standards) BERD@NFDI can be an important and unique mosaic stone to build up the NFDI.BERD@NFDI will coordinate and enable sustainable research processes through the ongoing design of suitable services. It will not provide a mere storage location for data and scientific software, but a service portfolio that will be strictly aligned to the actual needs of the scientific user community. 
User feedback will be ensured through the BERD@NFDI training program, technically integrated interaction features, and institutionalized user participation in the consortium's committee structure.The consortium is headed by the University of Mannheim and is composed of seven co-applicant institutions including infrastructure providers and research institutions with a focus on business, economics, and other social sciences. Many of them have a long-lasting tradition of prosperous collaboration, which leverages the consortium to a high level of efficiency and effectiveness to successfully conduct the ambitious work program. Finally, the transparent governance structure with clear responsibilities together with a user-centric, multi-stakeholder approach will ensure the success of BERD@NFDI and will make a structural contribution to the success of the NFDI as a whole.

Let’s parse the rest of the useful data, put it into the berd dictionary, and pretty-print it:

from pprint import pprint

s = soup.find_all('div', class_="content_frame")
berd = {}
# Go through the rows of the second content frame.
for k in s[1].find_all('div'):
    t = k.find('span', class_="name")
    links = k.find_all('a', class_="intern")
    # Skip rows without a label, without internal links, or describing the DFG programme.
    if t is None or not links or t.get_text().strip() == "DFG Programme":
        continue
    berd[t.get_text().strip()] = [(g.get_text(), "https://gepris.dfg.de" + g.get('href')) for g in links]
pprint(berd)
{'Applying institution': [('Universität Mannheim',
                           'https://gepris.dfg.de/gepris/institution/10045')],
 'Co-Spokespersons': [('Professor Dr. Bernd  Bischl',
                       'https://gepris.dfg.de/gepris/person/289708818'),
                      ('Professor Dr. Stefan  Dietze',
                       'https://gepris.dfg.de/gepris/person/218646654'),
                      ('Professor Dr. Marc  Fischer',
                       'https://gepris.dfg.de/gepris/person/1804971'),
                      ('Dr. Sabine  Gehrlein',
                       'https://gepris.dfg.de/gepris/person/398007310'),
                      ('Professor Dr. Mark  Heitmann',
                       'https://gepris.dfg.de/gepris/person/216564778'),
                      ('Professor Dr. Hartmut  Höhle',
                       'https://gepris.dfg.de/gepris/person/442455860'),
                      ('Professor Dr. Göran  Kauermann',
                       'https://gepris.dfg.de/gepris/person/1375171'),
                      ('Professorin Dr. Frauke  Kreuter',
                       'https://gepris.dfg.de/gepris/person/1755279'),
                      ('Professor Dr. Klaus  Tochtermann',
                       'https://gepris.dfg.de/gepris/person/1552555'),
                      ('Professor Dr. Christof  Wolf',
                       'https://gepris.dfg.de/gepris/person/1610376')],
 'Co-applicant institution': [('GESIS - Leibniz-Institut für '
                               'Sozialwissenschaften',
                               'https://gepris.dfg.de/gepris/institution/145000415'),
                              ('Institut für Arbeitsmarkt- und Berufsforschung '
                               '(IAB)der Bundesagentur für Arbeit (BA)',
                               'https://gepris.dfg.de/gepris/institution/12925'),
                              ('Ludwig-Maximilians-Universität München',
                               'https://gepris.dfg.de/gepris/institution/10108'),
                              ('Universität Hamburg',
                               'https://gepris.dfg.de/gepris/institution/10192'),
                              ('Universität zu Köln',
                               'https://gepris.dfg.de/gepris/institution/10282'),
                              ('ZBW - Leibniz-Informationszentrum Wirtschaft',
                               'https://gepris.dfg.de/gepris/institution/17916')],
 'Cooperation partners': [('Professor Rayid  Ghani',
                           'https://gepris.dfg.de/gepris/person/406699457'),
                          ('Professorin Dr. Julia  Lane',
                           'https://gepris.dfg.de/gepris/person/442458155')],
 'Participating Institution': [('Bayerische Akademie der Wissenschaften',
                                'https://gepris.dfg.de/gepris/institution/10138'),
                               ('Gesellschaft für Sozial- und '
                                'Wirtschaftsgeschichte e. V.c/o Universität '
                                'RegensburgLehrstuhl für Wirtschafts- und '
                                'Sozialgeschichte',
                                'https://gepris.dfg.de/gepris/institution/442499111'),
                               ('Gesellschaft für Unternehmensgeschichte e.V. '
                                '(GUG)',
                                'https://gepris.dfg.de/gepris/institution/442499393'),
                               ('Institut für Bank- und Finanzgeschichte e.V. '
                                '(IBF)',
                                'https://gepris.dfg.de/gepris/institution/442499747'),
                               ('Leibniz-Institut für Finanzmarktforschung '
                                'SAFE',
                                'https://gepris.dfg.de/gepris/institution/263945553'),
                               ('Leibniz-Institut für ökologische '
                                'Raumentwicklung (IÖR) e.V.',
                                'https://gepris.dfg.de/gepris/institution/10398'),
                               ('Rat für Sozial- und Wirtschaftsdaten '
                                '(RatSWD)c/o Wissenschaftszentrum Berlin (WZB)',
                                'https://gepris.dfg.de/gepris/institution/310279050'),
                               ('Verband der Hochschullehrer für '
                                'Betriebswirtschaft (VHB) e.V.',
                                'https://gepris.dfg.de/gepris/institution/233943626'),
                               ('Verein für Socialpolitik',
                                'https://gepris.dfg.de/gepris/institution/428818528'),
                               ('ZEW - Leibniz-Zentrum für Europäische '
                                'Wirtschaftsforschung GmbH',
                                'https://gepris.dfg.de/gepris/institution/10543')],
 'Participating Persons': [('Dr. Marianne  Dörr',
                            'https://gepris.dfg.de/gepris/person/60331789'),
                           ('Dr. Fabian  Franke',
                            'https://gepris.dfg.de/gepris/person/96294761'),
                           ('Dr. Katrin  Moeller',
                            'https://gepris.dfg.de/gepris/person/236205840'),
                           ('Dana  Müller',
                            'https://gepris.dfg.de/gepris/person/350461336'),
                           ('Professorin Dr. Isabella  Peters',
                            'https://gepris.dfg.de/gepris/person/59860187'),
                           ('Alexander  Pfister',
                            'https://gepris.dfg.de/gepris/person/442456395'),
                           ('Professor Dr. Mark  Spoerer',
                            'https://gepris.dfg.de/gepris/person/1857714'),
                           ('Professor Dr. Jochen  Streb',
                            'https://gepris.dfg.de/gepris/person/1631070'),
                           ('Professor Dr. Heiner  Stuckenschmidt',
                            'https://gepris.dfg.de/gepris/person/1670080'),
                           ('Dr. Peter  Wittenburg',
                            'https://gepris.dfg.de/gepris/person/219421514')],
 'Spokesperson': [('Professor Dr. Florian  Stahl',
                   'https://gepris.dfg.de/gepris/person/274347517')]}
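
The output above shows two small blemishes: names contain double spaces ('Professor Dr. Bernd  Bischl'), and the absolute URL is built by string concatenation. A possible clean-up step is sketched below; `normalize_entry` and `BASE_URL` are hypothetical names, and the relative `href` format is inferred from the URLs printed above.

```python
from typing import Tuple
from urllib.parse import urljoin

BASE_URL = "https://gepris.dfg.de"  # same prefix the notebook concatenates


def normalize_entry(name: str, href: str) -> Tuple[str, str]:
    """Collapse repeated whitespace in the name and resolve the link against BASE_URL."""
    clean_name = " ".join(name.split())
    return clean_name, urljoin(BASE_URL, href)


print(normalize_entry("Professor Dr. Bernd  Bischl", "/gepris/person/289708818"))
# ('Professor Dr. Bernd Bischl', 'https://gepris.dfg.de/gepris/person/289708818')
```

This does not repair names where line breaks were lost entirely (e.g. 'e. V.c/o'); for those, passing a separator to BeautifulSoup's `get_text` might help.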

Now we can loop over all consortia and collect the same data for each of them automatically:

nfdi = {}
params = {'language': 'en'}
for GEPRIS, title in zip(df['GEPRIS'], df['Title']):
    # Fetch and parse the English version of the project page.
    r = requests.get(GEPRIS, params=params)
    text = r.text.encode(r.encoding).decode('utf8')
    soup = BeautifulSoup(text, 'html.parser')
    record = {'title': title}
    # Subject area: last <span> of the first matching block.
    results = soup.find_all("div", class_="firstUnderAntragsbeteiligte")
    record['subject_area'] = results[0].find_all('span')[-1].text.strip().replace('  ', '').replace('\n', ', ')
    # Homepage: first external link, if there is one.
    link = soup.find('a', class_="extern")
    record['homepage'] = link.get('href') if link is not None else ''
    record['description'] = soup.find('div', id="projekttext").text.strip()
    # Applicants, participants, etc.: every labelled row with internal links.
    for k in soup.find_all('div', class_="content_frame")[1].find_all('div'):
        t = k.find('span', class_="name")
        links = k.find_all('a', class_="intern")
        if t is None or not links or t.get_text().strip() == "DFG Programme":
            continue
        record[t.get_text().strip()] = [(g.get_text(), "https://gepris.dfg.de" + g.get('href')) for g in links]
    nfdi[GEPRIS] = record
df_nfdi = pd.DataFrame(nfdi).fillna('')
df_nfdi.T
title subject_area homepage description Applying institution Co-applicant institution Participating Institution Spokesperson Participating Persons Co-Spokespersons Cooperation partners
https://gepris.dfg.de/gepris/projekt/441914366 GHGA Medicine, Biology https://ghga.dkfz.de/ Human genome sequencing and other omics data m... [(Deutsches Krebsforschungszentrum (DKFZ), htt... [(Charité - Universitätsmedizin Berlin, https:... [(CISPA - Helmholtz-Zentrum für Informationssi... [(Professor Dr. Oliver Stegle, Ph.D., https:/... [(Viktor Achter, https://gepris.dfg.de/gepris... [(Privatdozent Dr. Peer Bork, https://gepris....
https://gepris.dfg.de/gepris/projekt/441926934 NFDI4Cat Biology, Chemistry, Mathematics, Thermal Engin... http://gecats.org/NFDI4Cat.html The overall strategy for the transformation of... [(DECHEMA Gesellschaft für Chemische Technik u... [(Fraunhofer-Institut für Offene Kommunikation... [(Technische Universität Darmstadt, https://ge... [(Dr. Andreas Förster, https://gepris.dfg.de/... [(Professor Dr.-Ing. Bastian Etzold, https://... [(Professor Dr. Matthias Beller, https://gepr...
https://gepris.dfg.de/gepris/projekt/441958017 NFDI4Culture Humanities, Construction Engineering and Archi... https://nfdi4culture.de Digital data on tangible and intangible cultur... [(Akademie der Wissenschaften und der Literatu... [(FIZ KarlsruheLeibniz-Institut für Informatio... [(Arbeitsgemeinschaft der kunsthistorischen Bi... [(Professor Torsten Schrade, https://gepris.d... [(Professorin Dr. Stefanie Acquavella-Rauch, ... [(Reinhard Altenhöner, https://gepris.dfg.de/...
https://gepris.dfg.de/gepris/projekt/441958208 NFDI4Chem Chemistry https://nfdi4chem.de The vision of NFDI4Chem is the digitalisation ... [(Friedrich-Schiller-Universität Jena, https:/... [(FIZ KarlsruheLeibniz-Institut für Informatio... [(Beilstein-Institut zur Förderung der chemisc... [(Professor Dr. Christoph Steinbeck, https://... [(Privatdozent Dr. Carsten Baldauf, https://g... [(Dr. Felix Bach, https://gepris.dfg.de/gepri...
https://gepris.dfg.de/gepris/projekt/442032008 NFDI4BioDiversity Biology, Medicine, Agriculture, Forestry and V... https://www.nfdi4biodiversity.org Biodiversity is more than just the diversity o... [(Universität Bremen, https://gepris.dfg.de/ge... [(Forschungsinstitut für Nutztierbiologie (FBN... [(Alfred-Wegener-InstitutHelmholtz-Zentrum für... [(Professor Dr. Frank Oliver Glöckner, https:... [(Professor Dr. Christian Ammer, https://gepr... [(Professorin Dr. Aletta Bonn, https://gepris...
https://gepris.dfg.de/gepris/projekt/442077441 DataPLANT Biology https://nfdi4plants.de In modern hypothesis-driven research, scientis... [(Albert-Ludwigs-Universität Freiburg, https:/... [(Eberhard Karls Universität Tübingen, https:/... [(Helmholtz Zentrum München - Deutsches Forsch... [(Dr. Dirk von Suchodoletz, https://gepris.dfg... [(Professor Dr. Rolf Backofen, https://gepris... [(Dr. Jens Krüger, https://gepris.dfg.de/gepr...
https://gepris.dfg.de/gepris/projekt/442146713 NFDI4Ing Mechanical and Industrial Engineering, Thermal... https://nfdi4ing.de NFDI4Ing brings together the engineering commu... [(Rheinisch-Westfälische Technische Hochschule... [(Deutsches Zentrum für Luft- und Raumfahrt e.... [(Bundesanstalt für Materialforschung und -prü... [(Professor Dr.-Ing. Robert Schmitt, https://... [(Professorin Dr. Jasmin Aghassi-Hagmann, htt... [(Verena Anthofer, https://gepris.dfg.de/gepr...
https://gepris.dfg.de/gepris/projekt/442326535 NFDI4Health Medicine https://www.nfdi4health.de Germany has accumulated a wealth of health-rel... [(Deutsche Zentralbibliothek für Medizin (ZB M... [(Charité - Universitätsmedizin Berlin, https:... [(Behörde für Gesundheit und Verbraucherschutz... [(Professorin Dr. Juliane Fluck, https://gepr... [(Professor Dr. Thomas Behrens, https://gepri... [(Professor Dr. Wolfgang Ahrens, https://gepr...
https://gepris.dfg.de/gepris/projekt/442494171 KonsortSWD Social and Behavioural Sciences, Humanities https://www.konsortswd.de The social, behavioural, educational, and econ... [(GESIS - Leibniz-Institut für Sozialwissensch... [(Deutsches Institut für Wirtschaftsforschung ... [(Bundesamt für Migration und Flüchtlinge, htt... [(Professor Dr. Christof Wolf, https://gepris... [(Dr. Maja Adena, https://gepris.dfg.de/gepri...
https://gepris.dfg.de/gepris/projekt/460033370 Text+ Humanities Text+ aims to develop a research data infrastr... [(Leibniz-Institut für Deutsche Sprache (IDS),... [(Berlin-Brandenburgische Akademie der Wissens... [(Akademie der Wissenschaften in Hamburg, http... [(Professor Dr. Erhard Hinrichs, https://gepr... [(Professor Dr. Andreas Henrich, https://gepr... [(Privatdozent Dr. Alexander Geyken, https://...
https://gepris.dfg.de/gepris/projekt/460036893 NFDI4Earth Geosciences, Agriculture, Forestry and Veterin... NFDI4Earth addresses digital needs of Earth Sy... [(Technische Universität Dresden, https://gepr... [(Alfred-Wegener-InstitutHelmholtz-Zentrum für... [(Bayerische Akademie der Wissenschaften, http... [(Professor Dr. Lars Bernard, https://gepris.... [(Roland Bertelmann, https://gepris.dfg.de/ge...
https://gepris.dfg.de/gepris/projekt/460037581 BERD@NFDI Social and Behavioural Sciences The research domains of business, economics, a... [(Universität Mannheim, https://gepris.dfg.de/... [(GESIS - Leibniz-Institut für Sozialwissensch... [(Bayerische Akademie der Wissenschaften, http... [(Professor Dr. Florian Stahl, https://gepris... [(Dr. Marianne Dörr, https://gepris.dfg.de/ge... [(Professor Dr. Bernd Bischl, https://gepris.... [(Professor Rayid Ghani, https://gepris.dfg.d...
https://gepris.dfg.de/gepris/projekt/460129525 NFDI4Microbiota Biology Microbes – bacteria, archaea, unicellular euka... [(Deutsche Zentralbibliothek für Medizin (ZB M... [(European Molecular Biology Laboratory (EMBL)... [(Christian-Albrechts-Universität zu Kiel, htt... [(Professor Dr. Konrad Förstner, https://gepr... [(Professorin Dr. Anke Becker, https://gepris...
https://gepris.dfg.de/gepris/projekt/460135501 MaRDI Mathematics Mathematical research data (MRD) has become va... [(Weierstraß-Institut für Angewandte Analysis ... [(Deutsche Mathematiker-Vereinigung e.V.c/o WI... [(European Mathematical Society, https://gepri... [(Professor Dr. Michael Hintermüller, https:/... [(Professor Dr. Peter Bastian, https://gepris... [(Professorin Dr. Ilka Agricola, https://gepr...
https://gepris.dfg.de/gepris/projekt/460197019 FAIRmat Physics Scientific data are a significant raw material... [(Humboldt-Universität zu Berlin, https://gepr... [(FAIR-DI e.V.Humboldt-Universität zu Berlin, ... [(Deutsche Physikalische Gesellschaft e. V., h... [(Professorin Dr. Claudia Draxl, https://gepr... [(Professor Dr. Martin Aeschlimann, https://g... [(Dr. Martin Albrecht, https://gepris.dfg.de/...
https://gepris.dfg.de/gepris/projekt/460234259 NFDI4DS Computer Science, Systems andElectrical Engine... The vision of NFDI4DataScience (NFDI4DS) is to... [(Fraunhofer-Gesellschaft zur Förderung der an... [(Deutsche Zentralbibliothek für Medizin (ZB M... [(Alfred-Wegener-InstitutHelmholtz-Zentrum für... [(Dr. Sonja Schimmler, https://gepris.dfg.de/... [(Privatdozent Dr. Carsten Baldauf, https://g... [(Professor Dr. Ziawasch Abedjan, https://gep...
https://gepris.dfg.de/gepris/projekt/460247524 NFDI-MatWerk Materials Science and Engineering Since the Stone Age the mastery of materials h... [(Fraunhofer-Gesellschaft zur Förderung der an... [(Deutsches Forschungszentrum für Künstliche I... [(Albert-Ludwigs-Universität Freiburg, https:/... [(Professor Dr. Christoph Eberl, https://gepr... [(Professor Dr.-Ing. Tilmann Beck, https://ge... [(Professor Dr.-Ing. Erik Bitzek, https://gep...
https://gepris.dfg.de/gepris/projekt/460248186 PUNCH4NFDI Physics PUNCH4NFDI, the consortium of particle, astrop... [(Deutsches Elektronen-Synchrotron (DESY), htt... [(Forschungszentrum Jülich, https://gepris.dfg... [(Albert-Ludwigs-Universität Freiburg, https:/... [(Privatdozent Dr. Thomas Schörner-Sadenius, ... [(Privatdozent Dr. Philip Bechtle, https://ge...
https://gepris.dfg.de/gepris/projekt/460248799 DAPHNE4NFDI Physics, Biology, Chemistry, Materials Science... The photon and neutron science community encom... [(Deutsches Elektronen-Synchrotron (DESY), htt... [(Bergische Universität Wuppertal, https://gep... [(Bruker AXS GmbH, https://gepris.dfg.de/gepri... [(Dr. Anton Barty, https://gepris.dfg.de/gepr... [(Dr. Sebastian Busch, https://gepris.dfg.de/...
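
A loop like this stops at the first page that fails to download or parse. One way to make it more forgiving is sketched below: the fetching and parsing steps are passed in as functions, failures are collected instead of aborting, and a short pause is inserted between requests. `scrape_all`, `fake_fetch` and `fake_parse` are hypothetical names; the two stand-ins only simulate the network call and the BeautifulSoup parsing so that the sketch runs offline.

```python
import time
from typing import Callable, Dict, List, Tuple


def scrape_all(
    urls: List[str],
    fetch: Callable[[str], str],
    parse: Callable[[str], dict],
    delay: float = 1.0,
) -> Tuple[Dict[str, dict], Dict[str, str]]:
    """Fetch and parse each URL, collecting failures instead of aborting."""
    records: Dict[str, dict] = {}
    errors: Dict[str, str] = {}
    for i, url in enumerate(urls):
        if i:
            time.sleep(delay)  # pause between requests to go easy on the server
        try:
            records[url] = parse(fetch(url))
        except Exception as exc:  # broad on purpose: note the failure and continue
            errors[url] = f"{type(exc).__name__}: {exc}"
    return records, errors


# Stand-ins for the real download and parsing steps (both hypothetical).
def fake_fetch(url: str) -> str:
    if url.endswith("/0"):
        raise ConnectionError("simulated network failure")
    return "<html>page for " + url + "</html>"


def fake_parse(html: str) -> dict:
    return {"length": len(html)}


records, errors = scrape_all(
    ["https://example.org/1", "https://example.org/0"], fake_fetch, fake_parse, delay=0
)
print(records)
print(errors)
```

In the notebook, `fetch` could wrap `requests.get(url, params={'language': 'en'}, timeout=30)` and `parse` could wrap the BeautifulSoup steps shown above; the timeout value is only an example.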

Let’s save the parsed data to a CSV file:

df_nfdi.T.to_csv("../../../data/GEPRIS_NFDI_project_pages.csv", index=True, index_label='gepris', encoding='utf-8')
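
One thing to keep in mind when reading this file back: the participant columns hold lists of tuples, and a CSV file can only store them as text. The sketch below reproduces the effect with the standard `csv` module and shows one way to restore the lists with `ast.literal_eval`; `parse_cell` is a hypothetical helper, and the empty-string case mirrors the `fillna('')` used above.

```python
import ast
import csv
import io


def parse_cell(cell: str) -> list:
    """Turn a stringified list of tuples back into Python objects ('' becomes [])."""
    return ast.literal_eval(cell) if cell else []


# Write one row; the csv writer stores the list as its string representation.
people = [("Professor Dr. Florian  Stahl", "https://gepris.dfg.de/gepris/person/274347517")]
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["gepris", "Spokesperson", "Co-Spokespersons"])
writer.writerow(["https://gepris.dfg.de/gepris/projekt/460037581", people, ""])

# Read it back: every cell is now a plain string.
buffer.seek(0)
row = next(csv.DictReader(buffer))
print(type(row["Spokesperson"]).__name__)           # str
print(parse_cell(row["Spokesperson"]) == people)    # True
print(parse_cell(row["Co-Spokespersons"]))          # []
```

With pandas, the same helper could probably be passed via the `converters` argument of `pd.read_csv` for the columns that contain lists.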