Journal of Dutch Literature, volume 2, number 2, December 2011
Mike Kestemont: What Can Stylometry Learn From Its Application to Middle Dutch Literature?

To refer to this article use this url: http://journalofdutchliterature.org/jdl/vol02/nr02/art03

Seminal Work

A milestone in twentieth-century authorship attribution research was the pioneering 1964 study by the statisticians Mosteller and Wallace of the Federalist Papers, a collection of pamphlets advocating the ratification of the American Constitution, published under the pseudonym Publius (1787-1788).[10] Although authorship attribution had of course been widely practiced before, their study introduced a groundbreaking novelty to the field. Previously, authorship attribution had often remained a largely impressionistic affair, in which attributions were only vaguely argued, based on the ‘manual’ and subjective reading experience of scholarly readers.[11] Scholars often turned to hand-selected ‘checklists’ of stylistic features to characterize an author’s style and used these lists to accept or reject someone’s authorship of a given text. These lists usually contained a limited set of stylistic peculiarities that had struck the expert reader’s eye, such as the use of conspicuous nouns or unusual syntactic constructions.


Figure 1: The last leaf of the Lancelot Compilation (The Hague, Koninklijke Bibliotheek, MS 129 A 10, fol. 238r). The mysterious colophon (in red) reads: ‘Hier indet boec van lanselote dat heren lodewijcs es van velthem’ (‘Here ends the Book of Lancelot, which belongs to Sir Lodewijc van Velthem’). However, ‘van’ could also be understood as ‘written by’ or ‘compiled by’.


The use of such ‘lists’ often turned out to be problematic.[12] They tended, for instance, to be quite short, and it usually remained unclear why this specific subset of features was included and others were not. The main problem, however, was the low frequency of many of the items on the lists. Because stylistic peculiarities are usually rare even within an author’s own œuvre, and are often tied to a specific topic or genre, these features often did not scale well to other (possibly shorter) texts. Moreover, precisely because of their infrequent appearance, these features catch the human eye, which makes them extremely vulnerable to stylistic imitation and forgery. Mosteller and Wallace proposed moving radically away from these conspicuous, low-frequency elements and focusing on the exact opposite: a text’s most frequent words, its function words.

Function words such as articles (the), prepositions (under) or pronouns (she) form a small and closed class of (typically short) common words, conveying only a very generic meaning.[13] Because these words carry so little meaning of their own, the same set of function words is extremely frequent in all texts. This makes them very attractive for use in authorship attribution, not least because they are used by all authors writing in the same language and period and thus provide a statistically reliable basis for comparison. Even more interesting is that these words are generally not related to a text’s content and can therefore be used for attribution across different topics and genres. No matter what an author writes about (from music to politics to science), he or she will always need to use definite articles. A final advantage is that they are not usually under an author’s conscious control and are thus fairly robust against imitation.[14]
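To make the idea concrete, the following minimal sketch in Python computes the relative frequency of a handful of function words in two short texts on different topics. The word list and both sample sentences are invented for illustration only; they are not the features or data used by Mosteller and Wallace, and the names `FUNCTION_WORDS` and `function_word_profile` are hypothetical.

```python
import re
from collections import Counter

# Hypothetical, illustrative word list (an assumption, not a published feature set).
FUNCTION_WORDS = ["the", "of", "and", "to", "in", "a", "she", "under"]


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[^\W\d_]+", text.lower())


def function_word_profile(text, vocabulary=FUNCTION_WORDS):
    """Return the relative frequency of each vocabulary word in the text."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    total = len(tokens)
    if total == 0:
        # An empty text has no measurable profile; report zeros.
        return {word: 0.0 for word in vocabulary}
    # Dividing by text length makes texts of different sizes comparable.
    return {word: counts[word] / total for word in vocabulary}


# Two invented sentences on unrelated topics (music and politics).
sample_music = "The sound of the organ filled the nave, and she listened under the vault."
sample_politics = "The vote of the council went to the mayor, and she spoke in a calm voice."

profile_music = function_word_profile(sample_music)
profile_politics = function_word_profile(sample_politics)

for word in FUNCTION_WORDS:
    print(f"{word:>6}  {profile_music[word]:.3f}  {profile_politics[word]:.3f}")
```

Even in such tiny samples the definite article appears in both texts regardless of subject matter, whereas content words such as ‘organ’ or ‘council’ do not carry over. Real applications would of course require far longer texts before such frequencies become statistically meaningful.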

The earliest application of this revolutionary insight in Middle Dutch studies is to be found in an exceptionally early but little-noticed paper in Dutch Crossing (1988) by Saskia Murk Jansen.[15] In this study the author described an experiment concerning the authorship of the so-called mystic Mengeldichten (‘Mixed poems’). Since not all these poems appear in the surviving manuscripts of Hadewijch’s work (thirteenth century), a number of them are usually not attributed to the famous Brabantian beguine. In order to assess whether stylometry could add support to this thesis, Murk Jansen performed an experiment using principal components analysis (PCA), a technique from multivariate statistics that is nowadays commonly used for text clustering in authorship attribution. Explicitly referring to the work on the Federalist Papers by Mosteller and Wallace, Murk Jansen restricted her analysis to a small set of thirteen highly frequent Middle Dutch function words. Although her results offered neither a clear-cut confirmation nor a firm rejection of existing hypotheses, the author seemed right in concluding ‘that it would be interesting to pursue this line of analysis’.[16]
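For readers unfamiliar with PCA, the sketch below illustrates the general principle on a toy table of function-word frequencies. It does not reproduce Murk Jansen’s experiment: the four ‘texts’, the three features and all numbers are invented, and her thirteen Middle Dutch function words are not used. To keep the example free of external dependencies, the principal components are approximated here by power iteration with deflation, which is an assumption of this sketch; in practice one would normally rely on a routine from a statistical library.

```python
import math
import random


def center_columns(rows):
    """Subtract each column's mean so that every feature has mean zero."""
    n = len(rows)
    means = [sum(column) / n for column in zip(*rows)]
    return [[value - mean for value, mean in zip(row, means)] for row in rows]


def covariance_matrix(centered):
    """Sample covariance matrix of already-centred data (rows are texts)."""
    n, d = len(centered), len(centered[0])
    return [[sum(row[i] * row[j] for row in centered) / (n - 1)
             for j in range(d)] for i in range(d)]


def mat_vec(matrix, vector):
    """Multiply a square matrix by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(v * v for v in vector))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [v / norm for v in vector]


def leading_eigenpair(matrix, iterations=500, seed=0):
    """Approximate the dominant eigenvalue and eigenvector by power iteration."""
    rng = random.Random(seed)
    vector = normalize([rng.uniform(0.5, 1.5) for _ in matrix])
    for _ in range(iterations):
        vector = normalize(mat_vec(matrix, vector))
    # Rayleigh quotient: the variance captured along this direction.
    eigenvalue = sum(v * w for v, w in zip(vector, mat_vec(matrix, vector)))
    return eigenvalue, vector


def pca(rows, n_components=2):
    """Return the principal axes and each row's coordinates on those axes."""
    centered = center_columns(rows)
    matrix = covariance_matrix(centered)
    components = []
    for _ in range(n_components):
        eigenvalue, vector = leading_eigenpair(matrix)
        components.append(vector)
        # Deflation: remove the variance already explained by this axis.
        d = len(vector)
        matrix = [[matrix[i][j] - eigenvalue * vector[i] * vector[j]
                   for j in range(d)] for i in range(d)]
    scores = [[sum(x * c for x, c in zip(row, axis)) for axis in components]
              for row in centered]
    return components, scores


# Invented frequencies per 1,000 words for three function words.
# Rows 0-1 imitate one hypothetical author, rows 2-3 another.
texts = [
    [60.0, 30.0, 20.0],
    [62.0, 29.0, 21.0],
    [45.0, 40.0, 25.0],
    [44.0, 41.0, 24.0],
]

components, scores = pca(texts, n_components=2)
for label, (pc1, pc2) in zip(["A1", "A2", "B1", "B2"], scores):
    print(f"{label}: PC1 = {pc1:7.2f}, PC2 = {pc2:6.2f}")
```

In this artificial example the two groups of texts fall on opposite sides of the first principal component, which is the kind of clustering an attribution study hopes to find. The sign of a component is arbitrary, so only the relative positions of the texts are meaningful, and with real data the separation is rarely this clean.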