ENGL 340 Group Project: A Quantitative Analysis of Thoreau’s Walden

Fall 2014 Update: While the course project’s preliminary results appear below, the project has since evolved into a more sophisticated effort to co-author an update to Harding’s 1962 article using digital tools: a team of authors is using Harding’s claims to guide a distant reading of Thoreau’s novel using Harvard’s General Inquirer categories and is, in turn, employing distant reading to assess those claims.  Led by Gregory Palermo, this team includes ENGL 340 veterans Michael Gole and Jonathan Pepperman alongside English department students Rebecca Miller, Jenna Cecchini, and Thomas McCarthy, supported by Drs. Paul Schacht and Kirk Anne.

Our group (Christine O’Neill, Angie Carson, Dan Flad, Michael Gole, and Greg Palermo) used data analysis to explore and verify Walter Harding’s claims in his essay “Five Ways of Looking at Walden.” This was a deductive process in which we isolated particular arguments from Harding’s text and then asked Dr. Kirk Anne (from Geneseo CIT) to extract raw data from the text of Walden using the Python and R programming languages. We interpreted this data, which we complemented with the usual close reading of Walden, to see if it supported or disproved Harding’s arguments. While data analysis can be useful for a number of reasons, establishing credibility (ethos) is among the most relevant to the future of the humanities.

Our process looked a lot like what is described in Stephen Ramsay’s article “The Hermeneutics of Screwing Around” and the Culturomics approach. First, we determined what specific type of data would be most useful to answer the questions we had generated around Harding’s claims.  Kirk then coded and ran the scripts to produce representations of the relevant data, which included spreadsheets, histograms, line graphs, and dispersion plots of the linguistic features of Walden and its versions. We were fortunate that he was able to produce such an abundance of data that we had the luxury of browsing through, selecting whatever seemed useful to us (and often asking for more detail about a certain aspect).

Each of the members of the group brought different questions to the data.  Here is how we, individually, dealt with turning the numbers into conclusions:


I found Harding’s claim that “the unifying device of the book is the year” to be interesting, so I tried to determine how I could quantify this observation. I asked a few questions, like “do the deletions show Thoreau getting rid of stuff that doesn’t have to do with the year, and the additions show him developing the theme of the year?” Eventually, what I really focused on was Harding’s specific delineation of the book’s arrangement: he said that Thoreau talked about his cabin and pine trees in the spring, bean fields in the summer, etc. Data analysis could give us a bird’s eye view of whether or not Harding was right about this “unifying theme” through examining the key terms he identified.

A lexical dispersion plot of Walden

Next, I had Kirk Anne run some numbers. I was able to access a chart of additions and deletions spanning the versions. Then I asked Kirk to make a lexical dispersion plot for the key terms from Harding’s claim (and a few others). In a nutshell, the plot was a graphic showing the concentration of these words across the chapters. So, if the word “spring” was heavy at the beginning and end of the book, or if a discussion of “ice” was heavy near the middle, that would indicate a special focus on the seasons. To take the opposite approach, I decided to look at the chapters first and see if the additions, deletions, and word concentrations made sense according to the chapters. Sure enough, chapters with names like “The Bean Field” or “Winter Animals” not only had high concentrations of season-related words, but also showed the most overall additions and deletions.
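The idea behind the plot can be sketched in a few lines of Python. This is only a minimal illustration, not the script Kirk actually ran: the function names are hypothetical, the "chapters" are invented placeholder sentences rather than text from Walden, and instead of drawing a plot it reports how often each term occurs per 1,000 words in each chapter.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z]+(?:['’][a-z]+)*", text.lower())


def dispersion_by_chapter(chapters, terms):
    """Return {term: [occurrences per 1,000 words, one value per chapter]}."""
    rates = {term: [] for term in terms}
    for text in chapters:
        tokens = tokenize(text)
        counts = Counter(tokens)
        total = len(tokens) or 1  # avoid dividing by zero on an empty chapter
        for term in terms:
            rates[term].append(1000 * counts[term] / total)
    return rates


# Invented placeholder "chapters" (NOT quotations from Walden).
chapters = [
    "In the spring I borrowed an axe and cut down some pine trees.",
    "All summer I hoed beans in the bean field.",
    "The ice on the pond was thick that winter, and the ice cracked.",
    "Then spring came again and the ice melted.",
]

rates = dispersion_by_chapter(chapters, ["spring", "ice"])
for term, values in rates.items():
    print(term, [round(v, 1) for v in values])
```

Dividing by chapter length matters here: a raw count would make long chapters look heavy in every term simply because they contain more words.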

What did that all mean? I interpret this data analysis as confirmation of Harding’s claim. Thoreau’s editorial focus seems to have been on season-related chapters, and clusters of season-related words appear in the appropriate spots of the text.

In a word, my strategy was: isolate a claim, ask some questions, use numbers to respond to the questions, and interpret the results.


The claim of Harding’s that most struck me was related to the readability of Walden: despite the size of Thoreau’s vocabulary, Harding says, his writing “cannot be termed ostentatious.” What I wanted to do, in order to substantiate Harding’s claim, was to quantify the lexical sophistication of different passages throughout the novel and compare that to the passages’ readabilities, which could be represented by readability indices like the Gunning Fog and Coleman-Liau.

Readability of different chapters of Walden, by version, represented in a box-whisker plot of Coleman-Liau indices.
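For readers unfamiliar with the Coleman-Liau index, here is a rough sketch of how it can be computed. The formula itself (0.0588 L - 0.296 S - 15.8, where L is letters per 100 words and S is sentences per 100 words) is the standard published one. Everything else is an assumption made for illustration: the word and sentence splitting is deliberately naive, the sample passages are invented, and this is not the code used to produce our figures.

```python
import re


def coleman_liau(text):
    """Coleman-Liau index: 0.0588 * L - 0.296 * S - 15.8.

    L = average letters per 100 words, S = average sentences per 100 words.
    """
    words = re.findall(r"[A-Za-z]+(?:['’][A-Za-z]+)*", text)
    # Naive sentence split: abbreviations such as "Dr." will be miscounted.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        raise ValueError("text must contain at least one word and one sentence")
    letters = sum(ch.isalpha() for ch in text)
    letters_per_100 = 100 * letters / len(words)
    sentences_per_100 = 100 * len(sentences) / len(words)
    return 0.0588 * letters_per_100 - 0.296 * sentences_per_100 - 15.8


# Invented sample passages (NOT from Walden).
plain = "I went to the pond. The ice was thin. I sat down."
ornate = ("Circumlocutory abstractions characteristically obfuscate "
          "otherwise comprehensible philosophical observations.")

print(round(coleman_liau(plain), 2), round(coleman_liau(ornate), 2))
```

Because the index only counts letters, words, and sentences, it rewards short words and short sentences and cannot see anything about argument or structure.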

This, however, was far too large a project for the scope of our course. If we wanted to quantify the extent of Thoreau’s vocabulary, we’d have to compare his writing to works by contemporary authors. In addition, Harding isn’t specific about the audience for whom Thoreau’s text is readable: is he claiming that Walden was readable in Thoreau’s time? In Harding’s own time? Text is perceived as readable, in part, because of the norms of its age of reception; likewise, the indices, which were made in the latter part of the twentieth century, make assumptions about a text’s audience, an audience that may differ from a contemporary readership of a certain demographic.

So, I instead set out to see how the quantitative readability of different versions of a passage in Walden would compare with Geneseo students’ opinions of readability. I sent out a survey that asked them to read two versions of the same passage from Chapter 3, each of which scored quite differently when subjected to the algorithmic reading tests. Not knowing which one was supposed to be more readable, the students were to indicate which passage they found easier to read and to briefly explain why.

The results were exactly the opposite of what I’d initially hoped they’d be. I won’t discuss their full implications, but a small majority (59%) of students indicated that the easier-to-read passage was the one that, according to the indices, required a higher level of formal education. Admittedly, there were some problems with my survey. First, I used two versions of the same passage, so the second passage was more likely to be perceived as readable because it was already somewhat familiar (this is something I anticipated and that quite a few respondents also noted). In addition, most of the people who took my survey were English majors, who are quite used to finding their way through and comprehending intricate texts.

But we can learn something from this iteration of the study. Those who chose the passage that was supposed to be easier to read pointed out attributes like punctuation and clause length, the same attributes on which the computational tools based their measurements. Those who chose the other passage cited aspects of it that could not easily be measured quantitatively, for example, the rhetorical structure of Thoreau’s argument. Does this suggest that there are certain features of a text that cannot be quantified? That we need to be more attentive to what we apply certain algorithms to? Or do we just need more sophisticated ones?

I think that these questions lead well into Angie’s portion of the project.


The data I was assigned to analyze related to Walden’s polarity and subjectivity. These two aspects, primarily subjectivity, worked in conjunction with Harding’s fifth way of reading Walden, as a spiritual guidebook; a guidebook is inherently subjective in having an opinion on how one is supposed to live one’s life. Harding had mentioned that there were four key chapters for reading Walden spiritually. Of course, it would make this project too easy if Harding’s key chapters matched up with the data received. Instead, I had the following to work with:

Harding’s Key Chapters:
Where I Lived, and What I Lived For
Higher Laws

Data Received from Kirk Anne:
Baker Farm (most subjective)
The Ponds (least subjective)
Reading (most positive)
The Village (most negative)
Former Inhabitants; and Winter Visitors (second most negative)

I’ll start with the polarity data, as it’s the easiest to explain. Thoreau came off as relatively negative throughout Walden. Though he was ranting during “Reading,” he did sound relatively positive compared to the rest of the book when talking about the benefits of reading. After reviewing the data and re-reading the chapter, it was easy to see why this was picked as the most positive. The negative end of the spectrum was a bit more complicated. “The Village” was, according to the data, the most negative chapter of the novel. But when looking at the numbers, you can see that while it is ranked the most negative, it is also the shortest chapter of Walden. Kirk and I discussed the possibility that in longer chapters such as “Former Inhabitants; and Winter Visitors,” the negativity is diluted by the many additional words that carry no negative connotation. Because “Former Inhabitants; and Winter Visitors” was the second most negative chapter, I took another look at both chapters to see if this was actually the case. I found that Kirk was correct and that “Former Inhabitants; and Winter Visitors” was, in my opinion, more negative than “The Village.” My reasoning was that in “Former Inhabitants; and Winter Visitors” Thoreau discusses the house fire that burns a man’s entire life away. In comparison, “The Village” discusses Thoreau’s arrest for not paying the poll tax, but his emotions toward the matter are far more indifferent than those displayed in “Former Inhabitants; and Winter Visitors.” Of course, this is just my opinion, which leaves room for error, and it is definitely something to continue studying.
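The dilution effect Kirk and I talked about can be shown with a toy example. To be clear, this is not the method Kirk used: the tiny word lists, the function names, and the sample sentences are all invented for illustration. It only shows how a per-word average shrinks toward zero as neutral words are added, even when the number of negative words stays the same.

```python
# Hypothetical toy lexicons, invented for this illustration only.
NEGATIVE = {"fire", "burned", "lost", "ruin"}
POSITIVE = {"good", "noble", "pleasant"}


def raw_polarity(tokens):
    """Count of positive words minus count of negative words."""
    return sum((t in POSITIVE) - (t in NEGATIVE) for t in tokens)


def polarity_per_word(tokens):
    """Raw polarity divided by the number of tokens (0.0 for an empty list)."""
    return raw_polarity(tokens) / len(tokens) if tokens else 0.0


# Invented sample text: the same negative words, with and without padding.
short_text = "the fire burned".split()
long_text = short_text + "he walked along the road past the pond and the woods".split()

print(raw_polarity(short_text), raw_polarity(long_text))
print(round(polarity_per_word(short_text), 3), round(polarity_per_word(long_text), 3))
```

Both texts contain exactly the same negative words, but the longer one scores as much less negative per word, which is one way a short chapter could end up ranked above a longer, darker one.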

The subjectivity data became my main focus for this project due to its relevance to the Spiritual Reading given by Harding. I took a look at four chapters: “Baker Farm” and “The Ponds” as they are the two our data said were respectively the most subjective and least subjective, and “Higher Laws” and “Economy” since they were the two that I picked to be the most subjective out of Harding’s four Key Chapters. After reviewing all four, I found that I agreed with Harding in that “Higher Laws” and “Economy” were more subjective than “Baker Farm.”

There are a few explanations I came up with for this occurrence. The first is similar to the one for polarity: chapters like “Economy” are extensive in length, meaning the subjectivity is diluted by a higher word count. Another possibility I came up with is that the computer only picks up on direct opinions or keywords given by a person. I suggested this idea to Kirk, and he believes it is a possibility because the code he used relies on movie reviews as a training set for determining subjectivity and polarity. This indicates to him that there is probably a mismatch between the training set and Walden. This raises a new concern for collecting data from literature. So much of what is written is stated as a fact. When Thoreau is ranting, is he going to say “well, in my humble opinion, I think that the world is corrupted”? No! He states everything he believes as a fact and tells people that his way is the set right way (like any good spiritual guidebook would). I feel that this could make it difficult to create a training set for Walden, along with many other literary works, as they lack “opinion words/phrases” such as “I think,” “you should,” and so on. To me it only makes sense that there is at least one aspect of literature that requires a human mind to analyze it. After all, literature is created for humans, by humans, and isn’t made to be analyzed by a computer. I’m certainly not denouncing this project; I feel that using the technology available to us can only enhance the understanding we already have of literature. However, I don’t see the possibility of us ever fully replacing old-fashioned reading with computer analysis. Books will always need a human mind and eye to understand the human mind and hand that wrote them.


I addressed Walter Harding’s claim that Walden can be approached as a purely belletristic or aesthetic book, one of his “Five Ways of Looking at Walden.” Harding believes that Walden is an example of “good writing,” and that Thoreau’s generally straightforward writing style separates him from his contemporaries, who often used abstractions, euphemisms, circumlocutory logic, and figurative language (156). I figured that it would be a good idea to compare Walden to the work of some of the contemporaries that Harding referenced. Aside from Walden, I looked at essays and other writings by Ralph Waldo Emerson, Nathaniel Hawthorne, Oliver Wendell Holmes, Walt Whitman, Edgar Allan Poe, Washington Irving, Francis Hopkinson, and Richard Henry Dana Jr. Unfortunately, data could not be produced in time for most of these sources. However, some very basic data is still worth noting. One of the things I was able to do was compare the length of the average sentence across these texts. If Harding were to be believed in his assertion that some of Thoreau’s contemporaries were overly abstract and circumlocutory, it stands to reason that their sentences would be generally longer and wordier. This is not entirely the case, however. In Walden, the average sentence length is 27.5 words. Among his contemporaries, there is a relatively even spread in terms of sentence length. This would pose an issue for Harding’s assertion if it were not for the fact that Harding also recognized that Thoreau’s sentences were unusually long (158). Unfortunately, without more in-depth data, it is difficult to further address the above claim regarding other texts.
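An average sentence length like the one above can be computed along the following lines. This is a minimal sketch under stated assumptions, not the script that produced the 27.5 figure: the function name is hypothetical, the sample text is invented, and the sentence splitting is naive (it breaks on any period, question mark, or exclamation point followed by whitespace, so abbreviations will throw it off).

```python
import re


def average_sentence_length(text):
    """Mean number of words per sentence, using a naive sentence split."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    word_counts = [
        len(re.findall(r"[A-Za-z]+(?:['’-][A-Za-z]+)*", sentence))
        for sentence in sentences
    ]
    word_counts = [n for n in word_counts if n > 0]  # skip wordless fragments
    if not word_counts:
        raise ValueError("text contains no words")
    return sum(word_counts) / len(word_counts)


# Invented sample text (NOT from Walden): sentences of 4 and 10 words.
sample = "The pond froze early. We walked across the ice and watched the fish below."
print(average_sentence_length(sample))
```

Running the same function over each author's text would give comparable numbers, provided every text is split into sentences by the same rules.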

To further address Harding’s claim, I used the data regarding Walden that Kirk Anne produced for our group. I first looked at Thoreau’s use of symbols. I reasoned that, if Thoreau was truly a less abstract writer than his contemporaries, the number of symbols in Walden would be relatively low. Thoreau used 11 symbols in Walden, which seems to be a relatively small number. This potentially supports Harding’s claim that Walden was a much more straightforward work than those of his contemporaries, although I unfortunately do not have sufficient data regarding Thoreau’s contemporaries to draw any concrete conclusions from this number.

I next wanted to address the allusiveness of Thoreau in Walden. If Harding’s claim that Walden is simply good, relatively straightforward prose is accurate, it stands to reason that Thoreau would have a limited number of references to historical events, classic literature, and other similar things in Walden. While it is difficult to determine this from the data that I was provided with, I think a very general picture can be gleaned from the frequency of proper nouns in the book. In looking at allusiveness, we ideally want to remove locations from the data, as well as non-historic and non-fictional individuals. There are 386 proper nouns within Walden. Although the data could not account for this, it is reasonable to assume that this number would be considerably smaller if places and certain individuals were removed from the list. In general, I would make the argument that Walden is actually not particularly allusive, although there is unfortunately no data from contemporary authors to compare it to.


I was working on verifying Harding’s claim that within Walden there is a “careful alternation of the spiritual and the mundane, the practical and the philosophical, the human and the animal.” I was also tasked with using the program Voyant Tools to evaluate Harding’s claim. I began by uploading each chapter into Voyant Tools, taking notes on such things as the number of words and unique words in each chapter, and then combing through the chapters for the most common words. Voyant Tools made this much easier than it would otherwise have been. I was able to pick out common words in chapters that I would qualify as spiritual, mundane, practical, philosophical, animal, or human. I spent more time on the chapters Harding specifically mentions in his article as examples, so that I could draw upon his observations. However, I did look at every chapter to make sure the alternations continued throughout the novel rather than simply in those chapters. Having read the chapters, I sometimes found it frustrating that I knew these themes existed within a chapter but the words being used didn’t always match up. For example, it was difficult to find words that fit into the category of philosophical, because Thoreau often uses metaphor to convey these philosophical ideas. In other words, the words may be seemingly mundane but actually have a deeper meaning in context. Even with these struggles, it was clear that there is truth to Harding’s claim, although I would say that all of these themes exist throughout the novel, even as the main theme being discussed alternates. Even in a very mundane chapter like “Brute Neighbors,” there are still spiritual elements mixed in. In the end, it was up to me to analyze the data given by Voyant Tools to see if I was able to come to the same conclusions as Harding.
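The per-chapter notes I took (word count, unique words, most common words) are the kind of summary that can be approximated in a few lines of Python. The sketch below is not Voyant Tools and does not use its code; the function name, the very short stopword list, and the sample text are all assumptions made for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real analysis would use a fuller one.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "i", "is", "it", "was"}


def chapter_summary(text, top_n=5):
    """Return total words, unique words, and the most common non-stopwords."""
    tokens = re.findall(r"[a-z]+(?:['’][a-z]+)*", text.lower())
    content_words = [t for t in tokens if t not in STOPWORDS]
    return {
        "words": len(tokens),
        "unique_words": len(set(tokens)),
        "most_common": Counter(content_words).most_common(top_n),
    }


# Invented sample text (NOT from Walden).
sample = "The loon laughed and the loon dived. I watched the loon in the pond."
summary = chapter_summary(sample)
print(summary)
```

A list like this only surfaces literal words, which is exactly the limitation described above: a chapter can be philosophical through metaphor without ever using a word that looks philosophical.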
