Research Interests

My computational corpus linguistics group at FAU Erlangen-Nürnberg carries out foundational methodological research on the quantitative analysis of large text corpora. The algorithms and software tools developed by the group support innovative studies in the digital humanities and social sciences as well as practical applications in language technology. A particular focus lies on understanding cooccurrence phenomena and their application in corpus-based discourse analysis.

Methodological foundations – Corpus tools – Cooccurrence phenomena

Methodological foundations of corpus research and digital humanities

Corpus research in linguistics as well as in the digital humanities and social sciences relies on a wide range of statistical techniques and visualizations. A central goal of my research is to develop sound methodological foundations for corpus linguistics that address key problems and thus ensure that quantitative analyses are both reliable and meaningful.

Projects

2014–2019: Kallimachos (BMBF e-Humanities-Zentrum led by U Würzburg)
The FAU sub-project was concerned with methodological issues and the interpretation of quantitative measures in literary stylometry, focussing on authorship attribution (phase 1) and lexical/syntactic complexity (phase 2).

Software

zipfR: R package for LNRE modelling of type-token distributions – https://zipfR.r-forge.r-project.org/
SIGIL: online course on statistical analysis of corpus data, with the associated R package corpora – https://SIGIL.r-forge.r-project.org/

Key publications

Evert et al. (2017). Understanding and explaining Delta measures for authorship attribution. Digital Scholarship in the Humanities 32(suppl_2), ii4–ii16. [free access (PDF), reference corpus]
Evert & Neumann (2017). The impact of translation direction on characteristics of translated texts. A multivariate analysis for English and German. In: Empirical Translation Studies. New Theoretical and Methodological Traditions, TiLSM number 300. [online supplement]
Schäfer et al. (2017). Japan's 2014 general election: Political bots, right-wing internet activism and PM Abe Shinzō's hidden nationalist agenda. Big Data, 5(4), 294–309. [open access (PDF)]
Evert et al. (2017). Reliable measures of syntactic and lexical complexity: The case of Iris Murdoch. In: Proceedings of Corpus Linguistics 2017. [PDF]
Evert & Arppe (2015). Some theoretical and experimental observations on naïve discriminative learning. In: Proceedings of QITL-6. [PDF]
Proisl et al. (2018). Delta vs. n-gram tracing: Evaluating the robustness of authorship attribution methods. In: Proceedings of LREC 2018. [PDF, slides]
Baroni & Evert (2007). Words and echoes: Assessing and mitigating the non-randomness problem in word frequency distribution modeling. In: Proceedings of ACL 2007. [PDF]
Evert (2006). How random is a corpus? The library metaphor. Zeitschrift für Anglistik und Amerikanistik 54(2). [manuscript (PDF)]

Corpus tools and language technology

My group develops algorithms and software tools for the automatic linguistic annotation, efficient indexing, flexible query and quantitative analysis of large text corpora. These tools form the basis of innovative research in the digital humanities as well as practical and commercial applications in language technology.

Projects

2021–2023: RAND – Reconstructing Arguments from Newsworthy Debates (DFG SPP 1999: RATIO)
2018–2020: RANT – Reconstructing Arguments from Noisy Text (DFG SPP 1999: RATIO)
A corpus-linguistic approach to argumentation mining in social media, combined with representation and inference in a powerful logical framework.
2020–2022: LeAK – Automatic anonymisation of German court decisions (research contract from BayStMJ)
This project explores the feasibility of fully automatic anonymisation of German court decisions. Its key contributions are the creation of a high-quality manually annotated gold standard and the thorough evaluation of automatic algorithms.

Software & resources

CWB, the IMS Open Corpus Workbench for indexing & querying large text corpora – http://cwb.sf.net/
EmpiriST corpus, a gold standard for linguistic annotation of German Web & CMC texts – https://github.com/fau-klue/empirist-corpus/
Web1T5-Easy indexes Google Web n-grams with SQLite – http://webascorpus.sf.net/

Key publications

Evert & Hardie (2015). Ziggurat: A new data model and indexing format for large annotated text corpora. In: Proceedings of CMLC-3. [PDF]
Evert et al. (2020). Corpus Query Lingua Franca part II: Ontology. In: Proceedings of LREC 2020. [PDF, GitHub]
Evert et al. (2016). A distributional approach to open questions in market research. Computers in Industry 78. [manuscript (PDF)]
Proisl et al. (2020). EmpiriST corpus 2.0: Adding manual normalization, lemmatization and semantic tagging to a German Web and CMC corpus. In: Proceedings of LREC 2020. [PDF, corpus & resources]
Evert et al. (2014). SentiKLUE: Updating a polarity classifier in 48 hours. In: Proceedings of SemEval 2014. [PDF]
Evert & Hardie (2011). Twenty-first century corpus workbench: Updating a query architecture for the new millennium. In: Proceedings of Corpus Linguistics 2011. [PDF]
Evert (2010). Google Web 1T5 n-grams made easy (but not for the computer). In: Proceedings of WAC-6. [PDF, Web1T5-Easy]
Giesbrecht & Evert (2009). Part-of-speech tagging - a solved task? An evaluation of POS taggers for the Web as corpus. In: Proceedings of WAC-5. [PDF]

Collocations, multiword expressions and corpus-based discourse analysis

Cooccurrence patterns – such as collocations, multiword expressions, valency and distributional semantics – play a central role not only in corpus linguistics but also in the study of public discourses and political propaganda. My research in this area focuses on improving and refining the underlying analytical techniques as well as on developing new interactive methods for multi-modal corpus-based discourse analysis.

Projects

2022–2024: NormRechts – The Normalization of Right-wing Populist and New Right Discourses in Japan and Germany (DFG)
This project will further develop the MMDA methodology and apply it to the comparative analysis of right-wing populist discourses in Japan and Germany.
2021–2022: Tracking the Infodemic: Conspiracy theories in the corona crisis (VolkswagenStiftung)
This research project applies innovative corpus-linguistic methods to analyse the use and distribution of linguistic patterns typical of conspiracy theories and to study the discursive strategies they share with right-wing populist and extremist discourses.
2017–2019: Exploring the Fukushima Effect (FAU Emerging Fields Initiative)
“Attitudes and Opinions towards Nuclear Power and Renewable Energy and the Emergence of a Transnational Algorithmic Public Sphere.” A key contribution of this project is the development of the innovative MMDA methodology and software toolkit for corpus-assisted discourse analysis.

Software

MMDA: an interactive software tool for corpus-assisted discourse analysis
The UCS Toolkit for collocation research – http://www.collocations.de/software.html
wordspace: an R package for distributional semantics – http://wordspace.r-forge.r-project.org/

Key publications

Evert (2008). Corpora and collocations. In: Corpus Linguistics. An International Handbook. [extended manuscript (PDF)]
Lapesa & Evert (2014). A large scale evaluation of distributional semantic models: Parameters, interactions and model selection. Transactions of the Association for Computational Linguistics 2. [PDF, supplementary material]
Heinrich et al. (2018). A transnational analysis of news and tweets about “nuclear phase-out” in the aftermath of the Fukushima incident. In: Proceedings of the CIDTD 2018 Workshop. [PDF, slides]
Evert (2014). Distributional semantics in R with the wordspace package. In: Proceedings of COLING 2014. [PDF, wordspace homepage]
Evert et al. (2017). E-VIEW-alation – a large-scale evaluation study of association measures for collocation identification. In: Proceedings of eLex 2017. [PDF, slides, video, E-VIEW-alation]
Evert & Krenn (2005). Using small random samples for the manual evaluation of statistical association measures. Computer Speech & Language 19(4). [manuscript (PDF)]