
Processing guidelines


Sentence splitting

First, we take the plain text with line breaks and hyphenation removed, split it into sentences using the SegTok library, and save each sentence on its own line of a TSV file [via blatt to_tsv]; see data/sentences_raw/. Then, we manually correct the sentence splitting and remove “noisy data” (e.g., tables); see data/sentences_corrected/.
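The automatic splitting step might look roughly like the following sketch. It is not the implementation behind blatt to_tsv: the function names `sentences_to_tsv` and `naive_split`, and the one-column TSV layout, are assumptions made for illustration. It uses SegTok's `split_single` when the library is installed and otherwise falls back to a crude regex splitter so the example still runs.

```python
import csv
import io
import re

try:
    # SegTok's splitter for text without meaningful line breaks.
    from segtok.segmenter import split_single
except ImportError:  # SegTok not installed; use the fallback below.
    split_single = None


def naive_split(text):
    """Hypothetical fallback: split after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def sentences_to_tsv(text, out, splitter=None):
    """Write one sentence per TSV row to `out` and return the row count.

    The single-column layout is an assumption; the real files may
    carry additional columns.
    """
    if splitter is None:
        splitter = split_single or naive_split
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    count = 0
    for sentence in splitter(text):
        sentence = sentence.strip()
        if sentence:  # skip empty segments
            writer.writerow([sentence])
            count += 1
    return count


if __name__ == "__main__":
    buffer = io.StringIO()
    sentences_to_tsv("This is one. This is two!", buffer)
    print(buffer.getvalue(), end="")
```

The output of such a step is only a starting point: abbreviations, headings, and table fragments tend to be split incorrectly, which is why the manual correction pass follows.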