Analyzing Rank-Frequency and Type-Token Relationships in Literary Texts

This article explores the rank-frequency and type-token relationships within an author-based corpus composed of popular novels by renowned writers. We investigate whether these corpora exhibit Zipfian behavior in their frequency distributions, analyze rank-frequency relationships through line fitting, and assess lexical richness using Heaps' law.

Introduction

This article explores some of the foundational principles underlying statistical natural language processing related to word frequencies. Languages exhibit a highly organized pattern of frequency distribution: a small number of words with exceedingly high frequency account for the majority of tokens in a given text, while a large number of words occur only rarely. The remarkable aspect of this distribution is its mathematical simplicity, as it conforms closely to Zipf’s law. Zipf’s law states the following relation, where $r$ is the frequency rank, $f(r)$ is the frequency of the word of rank $r$ in a given corpus, and the exponent $\alpha$ is close to 1 for natural language corpora.

\begin{equation} f(r) \propto 1/r^\alpha \end{equation}

Zipf’s law has roots in the principle of least effort, which states that frequently executed actions tend to become faster and more effortless over time, leading to the adoption of efficient behavioral patterns that minimize exertion. Substantiating this, the phenomenon has been shown to hold not only in natural language but also in other domains such as firm bankruptcies, city-size distributions, and the income distribution of companies.

This article examines the validity of Zipf’s law over a corpus constructed from popular novels in English. Zipfian behavior is investigated through two analyses: (i) the word rank-frequency relation and (ii) the type-token relation, known as the type-token ratio (TTR).


Corpus properties

There are a total of nine books selected from three prominent authors: Alexandre Dumas, George Eliot (Mary Ann Evans), and Leo Tolstoy. Table 1 provides an overview of these books, all of which were sourced in plain text format from Project Gutenberg.

To construct the author-based corpus, we have chosen three authors from different nationalities: the French novelist Alexandre Dumas, the British novelist George Eliot (Mary Ann Evans) from the Victorian era, and the Russian writer Leo Tolstoy. It is worth noting that, as the works by Tolstoy and Dumas were originally written in languages other than English, the word frequency analyses are influenced by the choices made by the translators.

Book name Author Genre Tokens Tokens w/o sw Types
The Count of Monte Cristo Dumas HF 441160 220206 18370
The Man in the Iron Mask Dumas Action 166018 83367 11195
Twenty Years After Dumas HF 233164 118162 11881
Adam Bede Eliot HF 205679 101348 13416
Daniel Deronda Eliot Fiction 297019 143170 17791
Middlemarch Eliot HF 304630 149068 17990
Anna Karenina Tolstoy Fiction 339508 158827 14120
Resurrection Tolstoy PF 167715 79599 9949
The Kingdom of God is within You Tolstoy HF 120727 57014 8858

Table 1: Specifications of corpus based on 3 authors (HF: historical fiction, PF: psychological fiction). Tokens w/o sw indicates the number of tokens after stop word removal. The number of types is based on the unique tokens.


Tokenization and stopword removal

After downloading and manually removing irrelevant texts at the beginning and end of the books, the two main preprocessing steps involve tokenization and stopword removal. In tokenization, the text is initially split by the whitespace character, and then, using regular expressions, only alphabetic characters (A-Z, a-z) are preserved, while others, such as numeric characters and punctuation marks, are removed from the words. For example, ‘isn’t’ is transformed into ‘isnt’. Additionally, single-character words are eliminated, and all uppercase characters are converted to lowercase. In stopword removal, a list of English stopwords is first obtained. Then, while iterating through the tokens generated by the tokenization process, stopwords are removed.

The total number of tokens obtained via the tokenization process, the number of tokens remaining after stopword removal, and the total number of types, i.e., the vocabulary size, are shown in Table 1. The following code snippet is used for tokenization and stopword removal.
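A minimal sketch of these preprocessing steps. The original presumably uses a full English stopword list (e.g., NLTK's); the short inline list here is only a stand-in, and the function names are illustrative:

```python
import re

# Stand-in stopword list; the actual analysis uses a full English list.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "that"}

def tokenize(text):
    """Split on whitespace, keep only alphabetic characters (A-Z, a-z),
    drop single-character words, and lowercase everything."""
    tokens = []
    for raw in text.split():
        word = re.sub(r"[^A-Za-z]", "", raw).lower()
        if len(word) > 1:  # eliminate single-character (and empty) words
            tokens.append(word)
    return tokens

def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Filter out stopwords while iterating through the tokens."""
    return [t for t in tokens if t not in stopwords]

text = "Isn't it remarkable? The Count of Monte-Cristo, 1844."
tokens = tokenize(text)
# ['isnt', 'it', 'remarkable', 'the', 'count', 'of', 'montecristo']
tokens_wosw = remove_stopwords(tokens)
# ['isnt', 'remarkable', 'count', 'montecristo']
```

Note how `isn't` becomes `isnt` and the purely numeric token `1844.` is dropped entirely, matching the behavior described above.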


Rank-frequency relationship

To obtain the rank-frequency relationship, tokens are ordered by frequency so that rank 1 corresponds to the most frequent word. For example, Table 2 shows the 10 most frequent words in Anna Karenina by Tolstoy.

Word Rank Frequency
said 1 2725
levin 2 1512
one 3 1156
now 4 867
vronsky 5 769
anna 6 738
go 7 682
come 8 671
know 9 670
went 10 661

Table 2: Top 10 most frequent words in Anna Karenina by Tolstoy.

By combining the 3 books from each author, a larger author-based corpus is obtained; the code snippet below constructs this author corpus. Subsequently, using the tokens after stopword removal, word frequencies are obtained via Counter(author_corpus[author]['tokens_wosw']) and then sorted to assign ranks, i.e., rank 1 is the most frequent word in the corpus. The figures below show the rank-frequency relationship on linear and log-log scales, respectively. While the linear-scale plot shows the rank-frequency relationship for the combined author corpus, the log-log plots show this relationship for each book of each author. We observe that the rank-frequency relationship is Zipfian.
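A minimal sketch of this step. Apart from `author_corpus[author]['tokens_wosw']`, which the text names explicitly, the data layout, toy token lists, and variable names are illustrative assumptions, not the original code:

```python
from collections import Counter

# Per-author, per-book token lists (after stopword removal); toy data here.
books = {
    "Tolstoy": [
        ["said", "levin", "said", "one", "said"],
        ["one", "said", "anna", "vronsky"],
    ],
}

# Merge each author's books into one combined token list.
author_corpus = {}
for author, book_tokens in books.items():
    combined = [t for tokens in book_tokens for t in tokens]
    author_corpus[author] = {"tokens_wosw": combined}

# Count word frequencies and sort descending: rank 1 = most frequent word.
freqs = Counter(author_corpus["Tolstoy"]["tokens_wosw"])
rank_freq = freqs.most_common()
# rank_freq[0] == ('said', 4)
```

The linear and log-log figures can then be produced from the ranks and frequencies in `rank_freq` with matplotlib's `plot` and `loglog` functions.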


Type-token relationship (TTR)

The type-token relationship follows Heaps’ law, which states:

\begin{equation} V_R(n)=K n^\beta \label{heaps} \end{equation}

where $n$ is the number of tokens, $V_R$ is the number of unique words (types), and $K$ and $\beta$ are the parameters of the relationship. Typical values of $K$ lie between 10 and 100, and typical values of $\beta$ between 0.4 and 0.6. This relation underlies the TTR (type-token ratio), a measure of the lexical diversity of a text. As the parameter $\beta$ increases, lexical diversity increases, since a larger $\beta$ means more distinct words for a given number of tokens.

In the following code snippets, the number of unique tokens is counted for the author-based corpus and for each book separately. We observe that the relationships comply with Heaps’ law.
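A sketch of the counting step, tracking the vocabulary size $V(n)$ as tokens are consumed in document order; the function name and the sampling step are hypothetical:

```python
def type_token_curve(tokens, step=1):
    """Return (n, V(n)) pairs: after every `step` tokens, record the
    number of tokens seen so far and the number of unique types."""
    seen = set()
    curve = []
    for i, tok in enumerate(tokens, start=1):
        seen.add(tok)
        if i % step == 0:
            curve.append((i, len(seen)))
    return curve

tokens = ["said", "one", "said", "anna", "one", "go"]
print(type_token_curve(tokens))
# [(1, 1), (2, 2), (3, 2), (4, 3), (5, 3), (6, 4)]
```

Plotting these pairs on a log-log scale should yield the roughly straight line predicted by Heaps' law $V_R(n) = K n^\beta$.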


Line-fitting to TTR

The following code output shows linear least-squares line fits to the type-token relations for one book from each author. Since the density of data points increases as the token size grows on a logarithmic scale, the line fits may not appear to be a ‘good fit’.
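A sketch of the fit, assuming it is performed in log-log coordinates: taking logarithms of Heaps' law $V_R(n) = K n^\beta$ gives $\log V = \beta \log n + \log K$, a line whose slope $m$ estimates $\beta$ and whose intercept $c$ estimates $\log_{10} K$. The function name and synthetic data are illustrative:

```python
import numpy as np

def fit_heaps(ns, vs):
    """Fit a line m*x + c to the type-token curve in log-log coordinates
    using linear least squares; m estimates beta, c estimates log10(K)."""
    log_n = np.log10(ns)
    log_v = np.log10(vs)
    m, c = np.polyfit(log_n, log_v, 1)
    return m, c

# Synthetic check with K = 30, beta = 0.6: the fit should recover both.
ns = np.arange(100, 10001, 100)
vs = 30 * ns ** 0.6
m, c = fit_heaps(ns, vs)
# m ~= 0.6 and 10**c ~= 30
```

On exact power-law data the recovery is perfect; on real type-token curves the fitted slopes land in the typical Heaps range, as Table 3 shows.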

Table 3 shows the parameters $m$ and $c$ of the line fits, $mx + c$, for each book in the author-based corpus, for both the stopwords-included and stopwords-excluded versions.

The line fit parameter $m$, which gives the slope and thus indicates the lexical diversity of the text, is highest for The Kingdom of God is within You by Tolstoy and lowest for Twenty Years After by Dumas. The variance of the slope across Tolstoy’s 3 books is the highest, while the variance across Eliot’s books is lower. It can be observed that Eliot’s language is lexically more diverse than that of Dumas. However, this may partly reflect the translators’ word choices, since Dumas’ books were originally written in French.

Book name Author m (w/o sw) c (w/o sw) m (w/ sw) c (w/ sw)
The Count of Monte Cristo Dumas 0.59 2.64 0.57 2.52
The Man in the Iron Mask Dumas 0.63 2.3 0.59 2.28
Twenty Years After Dumas 0.56 2.87 0.53 2.84
Adam Bede Eliot 0.63 2.31 0.59 2.41
Daniel Deronda Eliot 0.61 2.55 0.59 2.42
Middlemarch Eliot 0.62 2.47 0.59 2.44
Anna Karenina Tolstoy 0.59 2.55 0.56 2.42
Resurrection Tolstoy 0.6 2.52 0.56 2.48
The Kingdom of God is within You Tolstoy 0.66 1.94 0.62 1.86

Table 3: Line fit (mx+c) parameters obtained using linear least-squares for the type-token relation of each book in the author-based corpus, for both stopwords-included (w/ sw) and stopwords-excluded (w/o sw) versions.


Conclusion

In this article, we examined the frequency distribution and type-token relationship of an author-based corpus. The corpus exhibits Zipfian behavior in its frequency distribution and conforms to Heaps’ law, which states a parametric relationship between type and token counts that reflects the lexical richness of a text. For a more in-depth analysis of lexical diversity, line fitting to the type-token relationships was performed using linear least-squares.

The code is available on GitHub: https://github.com/bilalkabas/literary-text-analysis