Parsing the DFG website for the list of funded NFDI consortia with homepages and descriptions

Searching the web for “accepted NFDI consortia”, we find the DFG page that lists the funded consortia: https://www.dfg.de/en/research_funding/programmes/nfdi/funded_consortia/index.html.

Getting HTML via requests

We use the requests library to fetch the HTML of that page into the variable text and print its first 33 characters.

import requests
NFDI_URL = "https://www.dfg.de/en/research_funding/programmes/nfdi/funded_consortia/index.html"
r = requests.get(NFDI_URL)
text = r.text
print(text[0:33])
<!DOCTYPE html>
<html lang="en">

Parsing all tables from HTML via pandas

The text variable does indeed contain HTML. We can parse all tables from it using the pandas library.

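from io import StringIO
import pandas as pd
# Widen the console output so that each table row fits on one line.
pd.set_option('display.width', 1000)
# The encode/decode round trip repairs umlauts; it suggests r.text was decoded as Latin-1 instead of UTF-8.
# StringIO is used because recent pandas versions deprecate passing literal HTML strings.
df_list = pd.read_html(StringIO(text.encode('latin1').decode('utf8')))
for df in df_list:
    print(df)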
                                               Titel                                             Link
0  DataPLANT - Data in Plant research (Biology)Ex...              Externer Linkhttp://nfdi4plants.de/
1  GHGA - German Human Genome Archive (Medicine)E...               Externer Linkhttps://ghga.dkfz.de/
2  KonsortSWD - Consortium for the Social, Behavi...          Externer Linkhttps://www.konsortswd.de/
3  NFDI4BioDiversität - Biodiversity, Ecology & E...  Externer Linkhttps://www.nfdi4biodiversity.org/
4  NFDI4Cat - NFDI for Catalysis-Related Sciences...     Externer Linkhttp://gecats.org/NFDI4Cat.html
5  NFDI4Chem - Chemistry Consortium in the NFDI (...           Externer Linkhttps://www.nfdi4chem.de/
6  NFDI4Culture - Consortium for research data on...            Externer Linkhttps://nfdi4culture.de/
7  NFDI4Health - National Research Data Infrastru...         Externer Linkhttps://www.nfdi4health.de/
8  NFDI4Ing - National Research Data Infrastructu...                Externer Linkhttps://nfdi4ing.de/
                                               Titel                                               Link
0  BERD@NFDI - NFDI for Business, Economic and Re...             Externer Linkhttps://www.berd-nfdi.de/
1  DAPHNE4NFDI - DAta from PHoton and Neutron Exp...  Externer Linkhttps://www.sni-portal.de/de/daph...
2  FAIRmat - FAIR Data Infrastructure for Condens...  Externer Linkhttps://www.fair-di.eu/fairmat/fa...
3  MaRDI - Mathematical Research Data Initiative ...            Externer Linkhttps://www.mardi4nfdi.de/
4  NFDI4DataScience - NFDI for Data Science and A...                                     No website yet
5  NFDI4Earth - NFDI Consortium Earth System Scie...            Externer Linkhttps://www.nfdi4earth.de/
6  NFDI4Microbiota - National Research Data Infra...           Externer Linkhttps://nfdi4microbiota.de/
7  NFDI-MatWerk - National Research Data Infrastr...              Externer Linkhttps://nfdi-matwerk.de/
8  PUNCH4NFDI - Particles, Universe, NuClei and H...            Externer Linkhttps://www.punch4nfdi.de/
9  Text+ - Language and Text Based Research Data ...            Externer Linkhttps://www.text-plus.org/
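
The encode('latin1').decode('utf8') round trip in the read_html call above is easy to misread. Here is a minimal sketch of what it appears to do, assuming the page is served as UTF-8 but was decoded as Latin-1 (the need for the round trip suggests this, though it was not verified here). The variable names are illustrative only.

```python
# Assumption: the page bytes are UTF-8 but were decoded as Latin-1.
original = "NFDI4BioDiversität"

# Simulate the wrong decoding: UTF-8 bytes read as Latin-1 characters.
garbled = original.encode("utf8").decode("latin1")
print(garbled)

# Undo it: map the characters back to the original bytes, then decode as UTF-8.
repaired = garbled.encode("latin1").decode("utf8")
print(repaired)
```

With requests, setting r.encoding = 'utf-8' before reading r.text should avoid the garbling in the first place, since r.text is decoded using r.encoding.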

The column Titel (German for “title”) contains both the titles and the descriptions. The column Link contains the string “Externer Link” (“external link”) followed by the link to the homepage of each NFDI consortium.

Processing tables

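Let’s clean those tables: we split Titel into a short title and a description, and strip the “Externer Link” residue from both columns. We also replace “No website yet” with an empty string and “NFDI4BioDiversität” with “NFDI4BioDiversity”.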

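for df in df_list:
    # The text after the last ' - ' is the description; drop the trailing link residue.
    df['Description'] = df['Titel'].apply(lambda x: x.split(' - ')[-1].replace('Externer Link- Project in GEPRIS', ''))
    # The text before the first ' - ' is the short title; use the English spelling.
    df['Titel'] = df['Titel'].apply(lambda x: x.split(' - ')[0].replace('NFDI4BioDiversität', 'NFDI4BioDiversity'))
    # Keep only the URL; a missing website becomes an empty string.
    df['Link'] = df['Link'].apply(lambda x: x.replace('Externer Link', '').replace('No website yet', ''))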
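
The split on ' - ' above keeps only the text after the last separator, so a description that itself contains ' - ' would be cut short. One possible alternative is sketched here with str.partition, which splits at the first separator only. The helper name split_title and the sample cell are assumptions: the sample is pieced together from the truncated output and the replace call, not copied from the page.

```python
SUFFIX = "Externer Link- Project in GEPRIS"  # residue removed in the loop above


def split_title(cell):
    """Split a raw Titel cell into (title, description) at the first ' - '."""
    title, _, description = cell.partition(" - ")
    return title.strip(), description.replace(SUFFIX, "").strip()


# Hypothetical sample cell, reconstructed for illustration only.
sample = "DataPLANT - Data in Plant research (Biology)Externer Link- Project in GEPRIS"
print(split_title(sample))
```

Unlike the split-based version, a cell without any ' - ' yields an empty description here instead of repeating the title.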

The NFDI consortia funded in 2020:

df_list[0]
   Titel              Link                                Description
0  DataPLANT          http://nfdi4plants.de/              Data in Plant research (Biology)
1  GHGA               https://ghga.dkfz.de/               German Human Genome Archive (Medicine)
2  KonsortSWD         https://www.konsortswd.de/          Consortium for the Social, Behavioural, Educat...
3  NFDI4BioDiversity  https://www.nfdi4biodiversity.org/  Biodiversity, Ecology & Environmental Data (B...
4  NFDI4Cat           http://gecats.org/NFDI4Cat.html     NFDI for Catalysis-Related Sciences (Chemistry)
5  NFDI4Chem          https://www.nfdi4chem.de/           Chemistry Consortium in the NFDI (Chemistry)
6  NFDI4Culture       https://nfdi4culture.de/            Consortium for research data on material and ...
7  NFDI4Health        https://www.nfdi4health.de/         National Research Data Infrastructure for Pers...
8  NFDI4Ing           https://nfdi4ing.de/                National Research Data Infrastructure for Engi...

The NFDI consortia funded in 2021:

df_list[1]
   Titel             Link                                               Description
0  BERD@NFDI         https://www.berd-nfdi.de/                          NFDI for Business, Economic and Related Data (...
1  DAPHNE4NFDI       https://www.sni-portal.de/de/daphne-nfdi/daphn...  DAta from PHoton and Neutron Experiments for N...
2  FAIRmat           https://www.fair-di.eu/fairmat/fairmat_/consor...  FAIR Data Infrastructure for Condensed-Matter ...
3  MaRDI             https://www.mardi4nfdi.de/                         Mathematical Research Data Initiative (Mathema...
4  NFDI4DataScience                                                     NFDI for Data Science and Artificial Intellige...
5  NFDI4Earth        https://www.nfdi4earth.de/                         NFDI Consortium Earth System Sciences (Geoscie...
6  NFDI4Microbiota   https://nfdi4microbiota.de/                        National Research Data Infrastructure for Micr...
7  NFDI-MatWerk      https://nfdi-matwerk.de/                           National Research Data Infrastructure for Mate...
8  PUNCH4NFDI        https://www.punch4nfdi.de/                         Particles, Universe, NuClei and Hadrons for th...
9  Text+             https://www.text-plus.org/                         Language and Text Based Research Data Infrastr...

Let’s save the dataframes to CSV files.

df_list[0].to_csv("../../../data/DFG_NFDI_2020.csv", index=False, encoding='utf-8')
df_list[1].to_csv("../../../data/DFG_NFDI_2021.csv", index=False, encoding='utf-8')
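
A quick check of the saved files can flag rows without a homepage, such as NFDI4DataScience above. This is a minimal sketch using the standard csv module. The helper name titles_without_link is hypothetical, and the inline sample (with the description of the second row left out) stands in for a saved file so that the snippet runs on its own.

```python
import csv
import io


def titles_without_link(fileobj):
    """Return the Titel values of all rows whose Link field is empty."""
    reader = csv.DictReader(fileobj)
    return [row["Titel"] for row in reader if not row["Link"].strip()]


# Inline sample with the same header as the saved files (illustration only).
sample = io.StringIO(
    "Titel,Link,Description\n"
    "DataPLANT,http://nfdi4plants.de/,Data in Plant research (Biology)\n"
    "NFDI4DataScience,,\n"
)
print(titles_without_link(sample))
```

For a real file, open it with open(path, newline='', encoding='utf-8') and pass the handle to the function.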