Text Mining for Digital Humanities:

Measuring Complexity of Literary Texts

Updated: 3:46pm, 30 Nov, 2022
28 March 2018 (Wed)
Room 101, 1/F., Runme Shaw Bldg., HKU

Chair: Dr. Thomas Chiu, Lecturer, Faculty of Education, The University of Hong Kong

In recent years, the Digital Humanities have gained in visibility, broadened their domains of application and diversified their methods. In this process, the relationship between Digital Literary Studies on the one hand, and Text Mining, Machine Learning and Computational Linguistics on the other, has intensified considerably. The research presented here is part of these recent developments in the Digital Humanities.

This talk focuses on the question of textual complexity, a quality of texts that can be measured on a variety of levels, e.g. the lexical or the syntactic level. We will concentrate on the complexity of the lexicon, known as ‘lexical diversity’ or ‘vocabulary richness’, using a measure that, unlike most others, is not sensitive to text length. A greater degree of lexical diversity has often been associated with high-brow literary texts of high prestige and canonicity, whereas a lower degree has been associated with low-brow, short-lived popular fiction (even though a few high-brow authors, such as Hemingway, are well known for their stylistic simplicity). It turns out, however, that this correlation does not always hold in empirical tests on novels.
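To illustrate why text length matters here, the following minimal Python sketch contrasts the plain type-token ratio (TTR) with a moving-average TTR (MATTR). The abstract does not name the measure used in the talk; MATTR is simply one commonly used length-robust option and is chosen here as an assumption. The function names, the simple tokenizer and the window size are likewise illustrative.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens (deliberately simple)."""
    return re.findall(r"\w+", text.lower())

def type_token_ratio(tokens):
    """Distinct word forms divided by total tokens; tends to fall as texts get longer."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def mattr(tokens, window=100):
    """Moving-average TTR: the mean TTR over every window of a fixed size.

    Each window has the same length, so the result does not shrink
    merely because the text is longer. The window size is an
    illustrative default, not a value taken from the talk.
    """
    if not tokens:
        return 0.0
    if len(tokens) <= window:
        return type_token_ratio(tokens)
    ratios = [
        type_token_ratio(tokens[i:i + window])
        for i in range(len(tokens) - window + 1)
    ]
    return sum(ratios) / len(ratios)
```

Repeating the same passage twice halves its plain TTR, because the number of tokens doubles while the number of distinct words stays the same, whereas its MATTR remains essentially unchanged.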

An important reason for these counter-intuitive empirical results may be that measuring the lexical diversity of complete literary texts is too simple an approach. Over the last 200 years, there has been an aesthetic imperative to model the speech of characters on their characteristic attributes. Writers therefore adapt the language of their characters; simple people, for example, are depicted using simple language. We can thus expect a novel with a lot of direct speech by, for example, children to have a lower lexical diversity. In other words, we have to take into account the difference between the narrator’s voice and the characters’ voices. Separating the two seems to be a straightforward task, but literary texts follow very different conventions in different languages and at different moments in time, which can make it difficult to rely on typographical markers alone. Therefore, we also present research that makes it possible to identify narrator and character speech automatically using Machine Learning, even in the absence of these markers.
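As a point of comparison, the marker-based baseline mentioned above can be sketched in a few lines of Python. This is not the Machine Learning approach presented in the talk; it is a hypothetical illustration that assumes direct speech is enclosed in straight or curly double quotation marks, which is exactly the assumption that fails for many languages and periods. The name `split_voices` is invented for this example.

```python
import re

# Text enclosed in straight ("...") or curly (“...”) double quotation marks.
# Assumption: quotes are balanced and never nested.
QUOTED = re.compile(r'"[^"]*"|“[^”]*”')

def split_voices(text):
    """Return (narration, direct_speech) based on quotation marks only."""
    # Collect the quoted passages without their surrounding quote characters.
    speech = [match.group(0)[1:-1] for match in QUOTED.finditer(text)]
    # Replace each quoted passage with a space to obtain the narrator's text.
    narration = QUOTED.sub(" ", text)
    return narration, " ".join(speech)

narration, speech = split_voices('She smiled. "Come in," she said. “Sit down.”')
```

Any lexical diversity measure can then be computed separately for the complete text, the narration and the direct speech.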

Now that we can distinguish between the voices of the narrator and the characters, we can reexamine the problem of lexical diversity and describe it for the complete text, the narrative passages and the direct speech.

We conclude by reflecting on how research in Digital Literary Studies has changed as it has taken up methodological cues from Machine Learning and Computational Linguistics. We argue not only that it has become customary to think of research in Digital Literary Studies in terms of processing pipelines, but also that this kind of research now often takes place within an empirical framework: formal hypotheses are formulated and then tested on data using statistical methods.

About the speaker(s):

Please refer to for the details.

