This article explores the rank-frequency and type-token relationships within an author-based corpus composed of popular novels by renowned writers. We investigate whether these corpora exhibit Zipfian behavior in their frequency distributions, analyze rank-frequency relationships through line fitting, and assess lexical richness using Heaps' law.
Word frequencies underlie some of the foundational principles of statistical natural language processing. Languages exhibit a highly organized pattern of frequency distribution: a small number of words with exceedingly high frequency account for the majority of tokens in a given text, while a large number of words occur with very low frequency. Zipf’s law formalizes this pattern, stating that the frequency $f$ of a word is inversely proportional to a power of its rank $r$:
\begin{equation} f(r) \propto 1/r^\alpha \end{equation}
Zipf’s law has roots in the principle of least effort, elucidated by George Kingsley Zipf in his 1949 book Human Behavior and the Principle of Least Effort.
This article examines the validity of Zipf’s law over a corpus constructed from popular novels in English. Zipfian behavior is investigated through two analyses: (i) the word rank-frequency relation and (ii) the type-token relation, known as the type-token ratio (TTR).
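Analysis (i) amounts to estimating the exponent $\alpha$ by a linear least-squares fit in log-log space. A minimal sketch of such an estimator (the function name and the synthetic data in the usage note are illustrative, not part of the original code):

```python
import numpy as np

def fit_zipf_alpha(frequencies):
    """Estimate alpha in f(r) ~ 1/r**alpha via linear least squares
    in log-log space. `frequencies` must be sorted in descending
    order, so that the first entry corresponds to rank 1."""
    ranks = np.arange(1, len(frequencies) + 1)
    # Fit log f = -alpha * log r + const; the slope gives -alpha.
    slope, _intercept = np.polyfit(np.log(ranks), np.log(frequencies), 1)
    return -slope
```

On synthetic frequencies generated as `1000 / r**1.2`, the fit recovers an exponent of about 1.2.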
There are a total of nine books selected from three prominent authors: Alexandre Dumas, George Eliot (Mary Ann Evans), and Leo Tolstoy. Table 1 provides an overview of these books, all of which were sourced in plain text format from Project Gutenberg.
To construct the author-based corpus, we have chosen three authors from different nationalities: the French novelist Alexandre Dumas, the British novelist George Eliot (Mary Ann Evans) from the Victorian era, and the Russian writer Leo Tolstoy. It is worth noting that, as the works by Tolstoy and Dumas were originally written in languages other than English, the word frequency analyses are influenced by the choices made by the translators.
Book name | Author | Genre | Tokens | Tokens w/o sw | Types |
---|---|---|---|---|---|
The Count of Monte Cristo | Dumas | HF | 441160 | 220206 | 18370 |
The Man in the Iron Mask | Dumas | Action | 166018 | 83367 | 11195 |
Twenty Years After | Dumas | HF | 233164 | 118162 | 11881 |
Adam Bede | Eliot | HF | 205679 | 101348 | 13416 |
Daniel Deronda | Eliot | Fiction | 297019 | 143170 | 17791 |
Middlemarch | Eliot | HF | 304630 | 149068 | 17990 |
Anna Karenina | Tolstoy | Fiction | 339508 | 158827 | 14120 |
Resurrection | Tolstoy | PF | 167715 | 79599 | 9949 |
The Kingdom of God is within You | Tolstoy | HF | 120727 | 57014 | 8858 |
Table 1: Specifications of the corpus based on three authors (HF: historical fiction, PF: psychological fiction). Tokens w/o sw indicates the number of tokens after stopword removal. The number of types is the number of unique tokens.
After downloading and manually removing irrelevant texts at the beginning and end of the books, the two main preprocessing steps involve tokenization and stopword removal. In tokenization, the text is initially split by the whitespace character, and then, using regular expressions, only alphabetic characters (A-Z, a-z) are preserved, while others, such as numeric characters and punctuation marks, are removed from the words. For example, ‘isn’t’ is transformed into ‘isnt’. Additionally, single-character words are eliminated, and all uppercase characters are converted to lowercase. In stopword removal, a list of English stopwords is first obtained. Then, while iterating through the tokens generated by the tokenization process, stopwords are removed.
The total number of tokens obtained via tokenization, the total number of tokens after stopword removal, and the total number of types, i.e., the vocabulary size, are shown in Table 1. The following code snippet is used for tokenization and stopword removal.
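A minimal sketch of these two preprocessing steps, assuming the behavior described above (the short stopword list here is illustrative; in practice a full English stopword list, e.g. NLTK's, would be used):

```python
import re
from typing import List, Set

# Illustrative stopword list; a complete English list would be used in practice.
STOPWORDS: Set[str] = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "that"}

def tokenize(text: str) -> List[str]:
    """Split on whitespace, keep only alphabetic characters (A-Z, a-z),
    drop single-character words, and lowercase everything."""
    tokens = []
    for word in text.split():
        # e.g. "isn't" -> "isnt", "1867" -> "" (dropped below)
        word = re.sub(r"[^A-Za-z]", "", word).lower()
        if len(word) > 1:
            tokens.append(word)
    return tokens

def remove_stopwords(tokens: List[str], stopwords: Set[str] = STOPWORDS) -> List[str]:
    """Filter out stopwords from the token stream."""
    return [t for t in tokens if t not in stopwords]
```

For example, `tokenize("Isn't it a Test?")` yields `['isnt', 'it', 'test']`, and stopword removal then drops `'it'`.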
To obtain the rank-frequency relationship, tokens are ordered so that rank 1 corresponds to the most frequent word. For example, Table 2 shows the 10 most frequent words in Anna Karenina by Tolstoy.
Word | Rank | Frequency |
---|---|---|
said | 1 | 2725 |
levin | 2 | 1512 |
one | 3 | 1156 |
now | 4 | 867 |
vronsky | 5 | 769 |
anna | 6 | 738 |
go | 7 | 682 |
come | 8 | 671 |
know | 9 | 670 |
went | 10 | 661 |
Table 2: Top 10 most frequent words in Anna Karenina by Tolstoy.
By combining the three books from each author, a larger author-based corpus is obtained. The code snippet below constructs the author corpus. Subsequently, using the tokens with stopwords removed, word frequencies are obtained via Counter(author_corpus[author]['tokens_wosw'])
and then sorted to assign ranks, i.e., rank 1 is the most frequent word in the corpus. The figures below show the rank-frequency relationship on linear and log-log scales, respectively. While the linear-scale plot shows the rank-frequency relationship for the combined author corpus, the log-log plots show this relationship for each book of each author. We observe that the rank-frequency relationship is Zipfian.
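A minimal sketch of the corpus construction and ranking steps, mirroring the `author_corpus[author]['tokens_wosw']` structure referenced above (the helper function names are assumptions, not the article's original code):

```python
from collections import Counter

def build_author_corpus(books_by_author):
    """books_by_author maps an author name to a list of per-book token
    lists; the result concatenates each author's books into one stream."""
    return {
        author: {"tokens_wosw": [t for book in books for t in book]}
        for author, books in books_by_author.items()
    }

def rank_frequency(tokens):
    """Return (rank, word, frequency) triples with rank 1 = most frequent."""
    return [(r, w, f)
            for r, (w, f) in enumerate(Counter(tokens).most_common(), start=1)]
```

For instance, building a corpus from two small token lists and ranking it places the most frequent word at rank 1, as in Table 2.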
The type-token relationship follows Heaps’ law which states the following:
\begin{equation} V_R(n)=K n^\beta \label{heaps} \end{equation}
where $n$ is the number of tokens, $V_R$ is the number of unique words (types), and $K$ and $\beta$ are parameters of the relationship. Typical values of $K$ and $\beta$ lie between 10-100 and 0.4-0.6, respectively. This relation is also discussed in terms of the type-token ratio (TTR), a measure of the lexical diversity of a text. As the parameter $\beta$ increases, lexical diversity increases, since a given number of tokens yields more distinct words.
In the following code snippets, the number of unique tokens is counted for the author-based corpus and for each book separately. We observe that the relationships comply with Heaps’ law.
The following code output shows linear least-squares line fits to the type-token relations, in log-log space, for one book from each author. Because the data points become increasingly dense at larger token counts on a logarithmic axis, the line fits may not look like a ‘good fit’ visually.
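A minimal sketch of such a fit (the function name is an assumption): since taking logarithms of Heaps' law gives a line in log-log space, an ordinary least-squares polynomial fit of degree 1 suffices.

```python
import numpy as np

def fit_loglog_line(ns, vs):
    """Linear least-squares fit of log10(V) = m * log10(n) + c.
    Under Heaps' law, the slope m estimates beta and 10**c estimates K."""
    m, c = np.polyfit(np.log10(ns), np.log10(vs), 1)
    return m, c
```

On synthetic data generated exactly as $V = 20\, n^{0.55}$, the fit recovers $m \approx 0.55$ and $10^c \approx 20$.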
Table 3 shows the parameters, $m$ and $c$, of the line fits $mx+c$ for each book in the author-based corpora, for both the stopwords-included and stopwords-excluded versions. Note that taking logarithms of Heaps’ law gives $\log V_R(n) = \beta \log n + \log K$, so the slope $m$ estimates $\beta$ and the intercept $c$ estimates $\log K$.
The line-fit parameter $m$, which gives the slope and indicates the lexical diversity of the text, is highest for The Kingdom of God is within You by Tolstoy and lowest for Twenty Years After by Dumas. The variance of the slope is highest across Tolstoy’s three books and lower across Eliot’s. It can be observed that Eliot’s language is lexically more diverse than Dumas’. However, this may partly reflect the translators’ word choices, since Dumas’ books were originally written in French.
Book name | Author | m (w/o sw) | c (w/o sw) | m (w/ sw) | c (w/ sw) |
---|---|---|---|---|---|
The Count of Monte Cristo | Dumas | 0.59 | 2.64 | 0.57 | 2.52 |
The Man in the Iron Mask | Dumas | 0.63 | 2.3 | 0.59 | 2.28 |
Twenty Years After | Dumas | 0.56 | 2.87 | 0.53 | 2.84 |
Adam Bede | Eliot | 0.63 | 2.31 | 0.59 | 2.41 |
Daniel Deronda | Eliot | 0.61 | 2.55 | 0.59 | 2.42 |
Middlemarch | Eliot | 0.62 | 2.47 | 0.59 | 2.44 |
Anna Karenina | Tolstoy | 0.59 | 2.55 | 0.56 | 2.42 |
Resurrection | Tolstoy | 0.6 | 2.52 | 0.56 | 2.48 |
The Kingdom of God is within You | Tolstoy | 0.66 | 1.94 | 0.62 | 1.86 |
Table 3: Line-fit ($mx+c$) parameters obtained via linear least squares for the type-token relation of each book in the author-based corpora (three books per author), for both stopwords-included and stopwords-excluded versions.
In this article, we examined the frequency distribution and type-token relationship of an author-based corpus. The corpus exhibits Zipfian behavior in its frequency distribution. Furthermore, it conforms to Heaps’ law, which states a parametric relationship between type and token counts that reflects the lexical richness of a text. To analyze lexical diversity in more depth, we performed line fitting on the type-token relationships using linear least squares.
The code is available on GitHub: https://github.com/bilalkabas/literary-text-analysis