Derived Text Format: Bag-of-words

These tab-separated files contain up-to-date frequency counts for all corpus words (tokens and lemmas) per year and can be processed further with statistical tools:
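As a minimal sketch of such further processing, the following Python snippet reads a tab-separated frequency table and computes a relative frequency per year. The column names `token`, `year`, and `frequency`, as well as the sample rows, are assumptions made for illustration; the actual files may use a different layout.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample in an ASSUMED layout: token <TAB> year <TAB> frequency.
# The real files may name or order their columns differently.
SAMPLE_TSV = (
    "token\tyear\tfrequency\n"
    "liebe\t1990\t12\n"
    "liebe\t1991\t15\n"
    "herz\t1990\t7\n"
)


def load_frequencies(handle):
    """Return {token: {year: frequency}} from a tab-separated file object."""
    table = defaultdict(dict)
    reader = csv.DictReader(handle, delimiter="\t")
    for row in reader:
        table[row["token"]][int(row["year"])] = int(row["frequency"])
    return dict(table)


def relative_frequency(table, token, year):
    """Share of `token` among all counted tokens in the given year."""
    total = sum(years.get(year, 0) for years in table.values())
    if total == 0:
        return 0.0
    return table.get(token, {}).get(year, 0) / total


freqs = load_frequencies(io.StringIO(SAMPLE_TSV))
share_1990 = relative_frequency(freqs, "liebe", 1990)
```

To work with a downloaded file instead of the sample, pass an open file handle, for example `open(path, encoding="utf-8", newline="")`, to `load_frequencies`.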

Derived Text Format: N-Grams

All token bi-, tri-, tetra-, penta-, and hexagrams in the corpus (more than 2 million in total), with corpus-based association measures and context measures:
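To illustrate what an association measure expresses, here is a small sketch of pointwise mutual information (PMI) for a bigram. PMI is only one common measure and is used here as an assumption; the files may contain other or additional measures, and the counts in the example are invented.

```python
import math


def pmi(bigram_count, first_count, second_count, total):
    """Pointwise mutual information (base 2) of a bigram.

    `total` is the corpus size used to turn counts into probabilities.
    Using one size for unigrams and bigrams is a common simplification.
    """
    p_xy = bigram_count / total
    p_x = first_count / total
    p_y = second_count / total
    return math.log2(p_xy / (p_x * p_y))


# Invented counts: the bigram occurs 8 times, its parts 16 and 32 times,
# in a corpus of 1024 tokens.
score = pmi(8, 16, 32, 1024)
```

A positive score means the two words co-occur more often than expected if they were independent; a score near zero means the pair behaves like a chance combination.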

Derived Text Format: Word Vectors

GloVe (Global Vectors for Word Representation) embeddings computed for all corpus words:
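A typical use of such vectors is comparing words by cosine similarity. The sketch below assumes the plain-text layout commonly produced by GloVe (one word per line, followed by space-separated vector components); the file format offered here may differ, and the three sample vectors are invented.

```python
import io
import math

# Invented sample in the ASSUMED text layout: word followed by its components.
SAMPLE_VECTORS = (
    "liebe 0.5 0.5 0.0\n"
    "herz 0.5 0.5 0.0\n"
    "auto 0.0 0.0 1.0\n"
)


def load_vectors(handle):
    """Return {word: [float, ...]} from a space-separated vector file object."""
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip empty or malformed lines
        vectors[parts[0]] = [float(value) for value in parts[1:]]
    return vectors


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


vectors = load_vectors(io.StringIO(SAMPLE_VECTORS))
similar = cosine(vectors["liebe"], vectors["herz"])
unrelated = cosine(vectors["liebe"], vectors["auto"])
```

Values close to 1 indicate that two words occur in similar contexts, while values near 0 indicate little distributional overlap.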

Dataset and Source Code: Idiomatic Language

Data and source code of the analysis published in: Amin, Miriam / Fankhauser, Peter / Kupietz, Marc / Schneider, Roman (2021): Data-driven Identification of Idioms in Song Lyrics. In: Proceedings of the 17th Workshop on Multiword Expressions (MWE 2021), Special Interest Group on the Lexicon (SIGLEX) of the Association for Computational Linguistics (ACL). [PDF]

Dataset and Source Code: Rhyme Detection

Method and data model described in: Stengel, Samuel (2022): Automatische Identifizierung von Reimen in deutschen Pop-Lyrics. Term paper, University of Leipzig.

Dataset and Source Code: Sociopolitical Vocabulary

Dataset as described in: Schneider, Roman / Hansen, Sandra / Lang, Christian (2022): Das Vokabular von Songtexten im gesellschaftlichen Kontext – ein diachron-empirischer Beitrag. In: Kämper, Heidrun / Plewnia, Albrecht (eds.): Sprache in Politik und Gesellschaft: Perspektiven und Zugänge. Berlin, Boston: De Gruyter. 295-304. [PDF]

Dataset and Source Code: Spoken vs. Written Language

Language data models described in: Broll, Sarah / Schneider, Roman (2023): Empirische Verortung konzeptioneller Nähe/Mündlichkeit inner- und außerhalb schriftsprachlicher Korpora. In: Special Issue of the Journal for Language Technology and Computational Linguistics (JLCL). Vol. 36(1). [PDF]