ENGL 340 Group Project: A Quantitative Analysis of Thoreau’s Walden

Fall 2014 Update: While the course project’s preliminary results appear below, the project has since evolved into a more sophisticated effort to co-author an update to Harding’s 1962 article using digital tools: a team of authors is using Harding’s claims to guide a distant reading of Thoreau’s novel using Harvard’s General Inquirer categories and is, in turn, employing distant reading to assess those claims.  Led by Gregory Palermo, this team includes ENGL 340 veterans Michael Gole and Jonathan Pepperman alongside English department students Rebecca Miller, Jenna Cecchini, and Thomas McCarthy, supported by Drs. Paul Schacht and Kirk Anne.

Our group (Christine O’Neill, Angie Carson, Dan Flad, Michael Gole, and Greg Palermo) used data analysis to explore and verify Walter Harding’s claims in his essay “Five Ways of Looking at Walden.” This was a deductive process in which we isolated particular arguments from Harding’s text and then asked Dr. Kirk Anne (from Geneseo CIT) to extract raw data from the text of Walden using the Python and R programming languages. We interpreted this data, which we complemented with the usual close reading of Walden, to see if it supported or disproved Harding’s arguments. While data analysis can be useful for a number of reasons, establishing credibility (ethos) is among the most relevant to the future of the humanities.

Our process looked a lot like what is described in Stephen Ramsay’s article “The Hermeneutics of Screwing Around” and the Culturomics approach. First, we determined what specific type of data would be most useful to answer the questions we had generated around Harding’s claims.  Kirk then coded and ran the scripts to produce representations of the relevant data, which included spreadsheets, histograms, line graphs, and dispersion plots of the linguistic features of Walden and its versions. We were fortunate that he was able to produce such an abundance of data that we had the luxury of browsing through, selecting whatever seemed useful to us (and often asking for more detail about a certain aspect).

Each of the members of the group brought different questions to the data.  Here is how we, individually, dealt with turning the numbers into conclusions:


I found Harding’s claim that “the unifying device of the book is the year” to be interesting, so I tried to determine how I could quantify this observation. I asked a few questions, like “do the deletions show Thoreau getting rid of stuff that doesn’t have to do with the year, and the additions show him developing the theme of the year?” Eventually, what I really focused on was Harding’s specific delineation of the book’s arrangement: he said that Thoreau talked about his cabin and pine trees in the spring, bean fields in the summer, etc. Data analysis could give us a bird’s eye view of whether or not Harding was right about this “unifying theme” through examining the key terms he identified.

A lexical dispersion plot of Walden

Next, I had Kirk Anne run some numbers. I was able to access a chart of additions and deletions spanning the versions. Then I asked Kirk to make a lexical dispersion plot for the key terms from Harding’s claim (and a few others) – in a nutshell, the plot was a graphic showing the concentration of these words across the chapters. So, if the word “spring” was heavy at the beginning and end of the book, or if a discussion of “ice” was heavy near the middle, that would indicate a special focus on the seasons. To take the opposite approach, I decided to look at the chapters first and see if additions/deletions/word-concentrations made sense according to the chapters. Sure enough, chapters with names like “The Bean Field,” or “Winter Animals” not only had high concentrations of season-related words, but showed the most overall additions and deletions.
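The counting behind a dispersion plot can be sketched in a few lines of Python. This is only an illustration, not the script Kirk actually ran: the function name `dispersion_by_chapter`, the toy chapter texts, and the list of key terms are all assumptions made for the example.

```python
import re
from collections import Counter

def dispersion_by_chapter(chapters, terms):
    """Count how often each key term occurs in each chapter.

    chapters: dict mapping chapter title -> chapter text (hypothetical input)
    terms: iterable of lowercase key terms to track
    Returns a dict mapping term -> list of per-chapter counts, in chapter order.
    """
    terms = list(terms)
    table = {term: [] for term in terms}
    for text in chapters.values():
        # Lowercase the text and split it into simple word tokens.
        counts = Counter(re.findall(r"[a-z']+", text.lower()))
        for term in terms:
            table[term].append(counts[term])
    return table

# Toy stand-ins for chapter texts; these are NOT quotations from Walden.
toy_chapters = {
    "Chapter A": "The beans grew all summer and the summer sun was warm.",
    "Chapter B": "Ice covered the pond in winter. The ice was thick.",
    "Chapter C": "In spring the ice melted, and spring birds returned.",
}

table = dispersion_by_chapter(toy_chapters, ["summer", "ice", "spring"])
for term, row in table.items():
    print(f"{term:>8}: {row}")
```

Each row of the resulting table is one horizontal band of a dispersion plot: a term whose counts cluster in a few adjacent chapters is concentrated there, while a term with even counts runs through the whole book.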

What did that all mean? I interpret this data analysis to be confirmation of Harding’s claim. Thoreau’s editorial focus seems to have been on season-related chapters, and clusters of season-related words appear in the appropriate spots of the text.

In a word, my strategy was: isolate a claim, ask some questions, use numbers to respond to the questions, and interpret the results.


The claim of Harding’s that most struck me was related to the readability of Walden: despite the size of Thoreau’s vocabulary, Harding says, his writing “cannot be termed ostentatious.” What I wanted to do, in order to substantiate Harding’s claim, was to quantify the lexical sophistication of different passages throughout the novel and compare that to the passages’ readabilities, which could be represented by readability indices like the Gunning Fog and Coleman-Liau.

Readability of different chapters of Walden, by version, represented in a box-whisker plot of Coleman-Liau indices.
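For reference, the Coleman-Liau index named in the caption can be computed from letter, word, and sentence counts alone: CLI = 0.0588 L − 0.296 S − 15.8, where L is the average number of letters per 100 words and S is the average number of sentences per 100 words. The sketch below is a minimal illustration, not the code used for the plot; the function name `coleman_liau` and the naive sentence splitter (which breaks on every period, including abbreviations) are assumptions.

```python
import re

def coleman_liau(text):
    """Return the Coleman-Liau index of a text (an approximate U.S. grade level)."""
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    letters = sum(ch.isalpha() for ch in text)
    # Naive sentence split: any run of ., ! or ? ends a sentence.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        raise ValueError("text must contain at least one word and one sentence")
    letters_per_100 = letters / len(words) * 100
    sentences_per_100 = len(sentences) / len(words) * 100
    return 0.0588 * letters_per_100 - 0.296 * sentences_per_100 - 15.8

# Toy sentences, not passages from Walden.
plain = "The cat sat. The dog ran."
ornate = "Extraordinary vocabulary complicates comprehension."
print(coleman_liau(plain) < coleman_liau(ornate))
```

Because the index sees only word length and sentence length, it has no access to features like rhetorical structure or the familiarity of a passage.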

This, however, was far too large a project for the scope of our course. If we wanted to quantify the extent of Thoreau’s vocabulary, we’d have to compare his writing to works by contemporary authors. In addition, Harding isn’t specific about the audience for whom Thoreau’s text is readable: is he claiming that Walden was readable in Thoreau’s time? In Harding’s own time? Text is perceived as readable, in part, because of the norms of its age of reception; likewise, the indices, which were made in the latter part of the twentieth century, make assumptions about a text’s audience, an audience that may differ from a contemporary readership of a certain demographic.

So, I instead set out to see how the quantitative readability of different versions of a passage in Walden would compare with Geneseo students’ opinions of readability. I sent out a survey that asked them to read two versions of the same passage from Chapter 3, each of which scored quite differently when subjected to the algorithmic reading tests. Not knowing which one was supposed to be more readable, the students were to indicate which passage they found easier to read and to briefly explain why.

The results were exactly the opposite of what I’d initially hoped they’d be.  I won’t discuss their full implications, but a small majority (59%) of students indicated that the easier to read passage was the one that the indices indicated required a higher level of formal education. Admittedly, there were some problems with my survey. First, I used two versions of the same passage, so the second passage was more likely to be perceived as readable because it was already somewhat familiar (this is something I anticipated and also that quite a few respondents noted). In addition, most of the people who took my survey were English majors, who are quite used to finding their way through and comprehending intricate texts.

But we can learn something from this iteration of the study. Those who chose the passage that was supposed to be easier to read pointed out the attributes like punctuation and clause length on which the computational tools made their measurements. Those who chose the other passage cited aspects of it that could not easily be quantitatively measured, for example the rhetorical structure of Thoreau’s argument. Does this suggest that there are certain features of a text that cannot be quantified? That we need to be more attentive to what we apply certain algorithms to? Or do we just need more sophisticated ones?

I think that these questions lead well into Angie’s portion of the project.


My assigned data to analyze related to Walden’s polarity and subjectivity. These two aspects, primarily subjectivity, worked in conjunction with Harding’s fifth style of reading Walden as a spiritual guidebook; a guidebook is inherently subjective in its having an opinion on how one is supposed to live their life. He had mentioned that there were four key chapters to reading Walden spiritually. Of course, it would make this project too easy if Harding’s key chapters matched up with the data received. Instead, I had the following to work with:

Harding’s Key Chapters:
Where I Lived, and What I Lived For
Higher Laws

Data Received from Kirk Anne:
Baker Farm (most subjective)
The Ponds (least subjective)
Reading (most positive)
The Village (most negative)***
Former Inhabitants; and Winter Visitors (most negative)

I’ll start with the polarity data, as it’s the easiest to explain. Thoreau came off as relatively negative throughout Walden. Though he was ranting during “Reading,” he did sound relatively positive compared to the rest of the book when talking about the benefits of reading. After reviewing the data and re-reading the chapter, it was easy to see why this was picked to be the most positive. The negative end of the spectrum was a bit more complicated. “The Village” was, according to the data, the most negative chapter of the novel. But when looking at the numbers, you can see that while this is said to be the most negative, it is also the shortest chapter of Walden. Kirk and I discussed that longer chapters such as “Former Inhabitants; and Winter Visitors” can have their negativity diluted by the amount of excess words in the chapter that don’t coincide with the negative connotations. I took another look at both these chapters, as “Former Inhabitants; and Winter Visitors” was the second most negative chapter, to see if this was actually the case. I found that Kirk was correct and that “Former Inhabitants; and Winter Visitors” was, in my opinion, more negative than “The Village.” My reasoning was that in “Former Inhabitants; and Winter Visitors” Thoreau discusses the house fire that burns the man’s entire life away. In comparison, “The Village” discussed Thoreau’s arrest for not paying the Poll Tax, but his emotions towards the matter were far more indifferent than those displayed in “Former Inhabitants; and Winter Visitors.” Of course, this being just my opinion leaves room for error and is definitely something to continue studying.
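The dilution effect Kirk and I discussed can be shown with a toy example. The sketch below is not the classifier Kirk used; it assumes a tiny hand-made lexicon (`TOY_LEXICON`, entirely hypothetical) and compares two ways of scoring: averaging over every word in a text, which shrinks as neutral words pile up, and averaging only over the words the lexicon recognizes.

```python
import re

# Hypothetical mini-lexicon for illustration only.
TOY_LEXICON = {"good": 1.0, "noble": 1.0, "bad": -1.0, "ruin": -1.0, "burned": -1.0}

def polarity_scores(text, lexicon=TOY_LEXICON):
    """Score a text two ways: per word overall and per lexicon hit."""
    tokens = re.findall(r"[a-z']+", text.lower())
    total = sum(lexicon.get(tok, 0.0) for tok in tokens)
    hits = sum(1 for tok in tokens if tok in lexicon)
    return {
        "tokens": len(tokens),
        # Dividing by every token lets neutral words dilute the score.
        "per_word": total / len(tokens) if tokens else 0.0,
        # Dividing by lexicon hits ignores the text's length.
        "per_hit": total / hits if hits else 0.0,
    }

# Toy texts, not passages from Walden: same negative words, different lengths.
short_text = "The ruin burned."
long_text = short_text + " " + "The path went on past the pond and the field. " * 5

short_scores = polarity_scores(short_text)
long_scores = polarity_scores(long_text)
print(short_scores)
print(long_scores)
```

Both texts contain exactly the same negative words, yet the longer one scores much closer to neutral on the per-word measure. A short chapter can therefore rank as “most negative” partly because it is short.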

The subjectivity data became my main focus for this project due to its relevance to the Spiritual Reading given by Harding. I took a look at four chapters: “Baker Farm” and “The Ponds” as they are the two our data said were respectively the most subjective and least subjective, and “Higher Laws” and “Economy” since they were the two that I picked to be the most subjective out of Harding’s four Key Chapters. After reviewing all four, I found that I agreed with Harding in that “Higher Laws” and “Economy” were more subjective than “Baker Farm.”

There are a few explanations I came up with for this occurrence. The first is similar to the polarity, where chapters like “Economy” are extensive in length, meaning the subjectivity is diluted by a higher word count. Another possibility I came up with is that the computer only picks up on direct opinions or keywords given by a person. I suggested this idea to Kirk and he believes that it could be a possibility due to his use of a code based on movie reviews as a training set for determining subjectivity and polarity. This indicates to him that there is probably a mismatch between the training set and Walden. This raises a new concern for collecting data from literature. So much of what is written is stated as a fact. When Thoreau is ranting, is he going to say “well, in my humble opinion, I think that the world is corrupted?” No! He states everything he believes as a fact and tells people that his way is the set right way (like any good spiritual guidebook would). I feel that this could make it difficult to create a training set for Walden, along with many other literary works, as they lack “opinion words/phrases” such as “I think,” “you should,” and so on. To me it only makes sense that there is at least one aspect of literature that requires a human mind to analyze it. After all, literature is created for humans, by humans, and isn’t made to be analyzed by a computer. I’m certainly not denouncing this project—I feel that using the technology available to us only can enhance the understanding we already have of literature. However, I don’t see the possibility of us ever fully replacing old-fashioned reading with computer analysis. Books will always need a human mind and eye to understand the human mind and hand that wrote them.


I addressed Walter Harding’s claim that Walden can be approached as a purely belletristic or aesthetic book, one of his “Five Ways of Looking at Walden.” Harding believes that Walden is an example of “good writing,” and that Thoreau’s generally straightforward writing style separates him from his contemporaries, who often used abstractions, euphemisms, circumlocutory logic and figurative language. (156) I figured that it would be a good idea to compare Walden to some of Thoreau’s contemporaries that Harding referenced. Aside from Walden, I looked at essays and other writings by Ralph Waldo Emerson, Nathaniel Hawthorne, Oliver Wendell Holmes, Walt Whitman, Edgar Allan Poe, Washington Irving, Francis Hopkinson, and Richard Henry Dana Jr. Unfortunately, the majority of the data could not be produced in time for most of these sources. However, some very basic data is still worth noting. One of the things I was able to do was compare the length of the average sentence across these texts. If Harding was to be believed in his assertion that some of Thoreau’s contemporaries were overly abstract and circumlocutory, it stands to reason that their sentences would be generally longer and more wordy. This is not entirely the case, however. In Walden, the average sentence length is 27.5 words per sentence. In looking at his contemporaries, there is a relatively even spread in terms of sentence length. This would pose an issue to Harding’s assertion if it were not for the fact that Harding also recognized that Thoreau’s sentences were unusually long. (158) Unfortunately, without more in-depth data, it is difficult to further address the above claim regarding other texts.
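An average sentence length like the one above can be computed with a short script. The following is a minimal sketch rather than the code that produced the 27.5 figure; the function name `average_sentence_length` and the simple rule for ending a sentence (a period, exclamation point, or question mark followed by whitespace) are assumptions, and a different rule would give somewhat different numbers.

```python
import re

def average_sentence_length(text):
    """Return the mean number of words per sentence in a text."""
    # Split after sentence-ending punctuation that is followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    word_counts = [len(re.findall(r"[A-Za-z0-9']+", s)) for s in sentences]
    word_counts = [n for n in word_counts if n > 0]
    if not word_counts:
        raise ValueError("text contains no sentences with words")
    return sum(word_counts) / len(word_counts)

# Toy text, not a passage from any of the authors discussed.
sample = "One two three. Four five!"
print(average_sentence_length(sample))
```

Running the same function over each author's text would give directly comparable averages, provided every text is split by the same rule.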

To further address Harding’s claim, I used the data regarding Walden that Kirk Anne produced for our group. I first looked at Thoreau’s use of symbols. I reasoned that, if Thoreau was truly a less abstract writer than his contemporaries, the number of symbols in Walden would be relatively low. Thoreau used 11 symbols in Walden, which seems to be a relatively small amount. This potentially supports Harding’s claim that Walden was a much more straightforward work than those of his contemporaries, although I unfortunately do not have sufficient data regarding Thoreau’s contemporaries to draw any concrete conclusions from this number.

I next wanted to address the allusiveness of Thoreau in Walden. If Harding’s claim that Walden is simply good, relatively straightforward prose is accurate, it stands to reason that Thoreau would have a limited number of references to historical events, classic literature, and other similar things in Walden. While it is difficult to determine this from the data that I was provided with, I think a very general picture can be gleaned from the frequency of proper nouns in the book. In looking at allusiveness, we ideally want to remove locations from the data, as well as non-historic and non-fictional individuals. There are 386 proper nouns within Walden. Although the data could not account for this, it is reasonable to assume that this number would be considerably smaller if places and certain individuals were removed from the list. In general, I would make the argument that Walden is actually not particularly allusive, although there is unfortunately no data from contemporary authors to compare to.


I was working on verifying Harding’s claim that within Walden there is a “careful alternation of the spiritual and the mundane, the practical and the philosophical, the human and the animal.” I was also tasked with using Voyant Tools to evaluate Harding’s claim. I began by uploading each chapter into Voyant Tools, taking down notes on such things as the number of words and unique words in each chapter, and then combing through the chapters for the most common words. Voyant Tools made this much easier than it would otherwise have been. I was able to pick out common words in chapters that I would qualify as spiritual, mundane, practical, philosophical, animal, or human. I spent more time on chapters Harding specifically mentions within his article as examples, so that I could draw upon his observations. However, I did look at every chapter to make sure the alternations continued throughout the novel rather than simply in those chapters. Having read the chapters, I sometimes found it frustrating that I knew these themes existed within a chapter but the words being used didn’t always match up. For example, it was difficult to find words that fit into the category of philosophical because Thoreau often uses metaphor to convey these philosophical ideas. In other words, the words may be seemingly mundane but actually have a deeper meaning in context. Even with these struggles, it was clear that there is truth to Harding’s claim, although I would say that all of these themes occur throughout the novel, even as the main theme being discussed alternates. Even in a very mundane chapter like “Brute Neighbors,” there are still spiritual elements mixed in. In the end it was up to me to analyze the data given by Voyant Tools to see if I was able to come to the same conclusions as Harding.
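The kind of per-chapter summary I took from Voyant Tools (word count, unique words, most common words) can be approximated in Python. This sketch does not use or reproduce Voyant itself: the function names, the very short stopword list, and the theme word lists are all hypothetical choices made for illustration.

```python
import re
from collections import Counter

# Hypothetical minimal stopword list; real tools use much longer ones.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "that", "i"}

def chapter_summary(text, top_n=5, stopwords=STOPWORDS):
    """Return the word count, unique-word count, and most common content words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    content = [t for t in tokens if t not in stopwords]
    return {
        "words": len(tokens),
        "unique_words": len(set(tokens)),
        "most_common": Counter(content).most_common(top_n),
    }

def theme_counts(text, themes):
    """Count occurrences of each theme's words (themes: name -> list of words)."""
    counts = Counter(re.findall(r"[a-z']+", text.lower()))
    return {name: sum(counts[w] for w in words) for name, words in themes.items()}

# Toy sentence and toy theme lists, not data from the project.
toy_chapter = "The loon dived in the pond and the loon laughed."
toy_themes = {"animal": ["loon", "ant"], "spiritual": ["soul", "heaven"]}

summary = chapter_summary(toy_chapter, top_n=1)
themes = theme_counts(toy_chapter, toy_themes)
print(summary)
print(themes)
```

A word list like this only catches literal vocabulary. A metaphor built from mundane words scores as mundane, which is exactly the frustration described above.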

The Thorough Thoreau: the Annotated Fluid-Text Edition

The ENGL340 Coder’s team presents the Improved Fluid-Text edition of Walden…

Working with Beth Witherell’s “The Writings of Henry D. Thoreau,” and the Princeton edition of Walden, our team incorporated annotations from eight volumes of journals into the Fluid-Text edition of Walden, making the project more massive, authentic, and penetrating.

“We commonly do not remember that it is, after all, always the first person that is speaking. I should not talk so much about myself if there were any body else whom I knew as well. Unfortunately, I am confined to this theme by the narrowness of my experience.” Walden page one.

Project: Success! With some foundering… Our ultimate goal this semester was to add a feature to the Fluid-Text Walden that grants users a behind-the-scenes study of Thoreau’s writing process, from the [seemingly] random ramblings of his journals to the finished product, the transcendentalist masterpiece of life in the woods. Given that our group as a whole had little prior experience with encoding, our journey was not directly the exploration of growing authorship, but instead the exploration of this idea in a digital and very meta way: going behind the behind-the-scenes to revamp the digital edition of the work.

Our task involved expanding our collective knowledge of coding, primarily utilizing the TEI standard and the XML format, the results of which we finally plugged into the Versioning Machine of the Fluid-Text.

However, what can’t clearly be seen in the digital manifestation of our effort are the organizational bumps we plowed into along the way– the rerouting and clarifying and focusing and… Here we will historicize the endeavor (much as our project scaffolds Thoreau to a greater degree) and reflect on the careful balance digital humanists must establish between planning and actual implementation.

For all the struggle, our project was completed. The Fluid-Text edition is ever-enhancing itself, and Thoreau isn’t done with us yet, either.

Table of Contents:

• The Process
• The Product
• Challenges
• The Future
• Digital Humanities
• Project Members
• Resources


 “No man ever stood the lower in my estimation for having a patch in his clothes…”

For some time, the project was undefined. We rolled around in the dirt of ideas and potential projects, and had a few false starts and dead ends along the way. ENGL340’s Data Analysis group discussed feasible ideas with our team at the onset, as both groups were working with types of coding and analysis as the basic purpose of our projects, but eventually the path forked. Finally, it was decided, given the availability of materials, that we would create a coding project that brought passages from Thoreau’s journals that were later referenced in the published edition of Walden into the digital edition, the Fluid-Text. These cross-references were neatly cited in a compendium in the back of the volumes (#1-8) of “The Writings of Henry D. Thoreau.”

Thoreau kept extensive journals, in which he recorded ideas, notes, and thoughts he had during his days. Some entries were lengthy; others were jotted-down half-ideas, which seem to make sense only to Thoreau himself (or only once Thoreau later finished the thought by including it in Walden).

Simply obtaining the journals proved to cost a pretty chunk of time. Due to their rarity, cost, etc., it was some time before a complete set could be shipped to Geneseo. It would also be useful for everyone in the group to possess a physical copy of the Princeton Walden to use in tandem with the journals, and this was another unforeseen snag that briefly stalled us.


But– again!– more unanticipated hindrances. Though Thoreau’s writings are in the public domain, the works we used are edited versions, and thus are under copyright. We couldn’t photocopy just anything we wanted due to these legal limitations, and this involved yet more obtaining of journal volumes– we had to use all volumes in multiple steps of the project, often with multiple people needing to use the same one. We could only photocopy the index for the sake of reference, shown here.

We were the “coding” group, yet, oddly, the bulk of our work (after the planning) was the tedious hand-typing of every single journal entry into our Google spreadsheet. For each entry we needed the entire passage itself, its citation (e.g. 6.10-15; that is, page 10, lines 10-15), and the date; also needed were its counterparts in Walden: the page number and the relevant passage that most resembled the keywords from the journal entry:

[Screenshots: the Google spreadsheet of journal entries]

Some 500+ entries were filled out with this data.

All right– the data was assembled all in one place. We had the journal text itself, and all the data to link it to Walden itself, as well as things like dates that we could both include and perform analyses on [more on that later]. What next?

The point of using Google Spreadsheet was twofold. One, keep the group in sync and connected; two, utilize the function and mass-apply capabilities of the program. Our next medium, TEI (using the program Oxygen), needed transformed data. By writing a function and applying it to all 600 items in our spreadsheet, we could quickly and easily slide forward in our agenda. The beauty of coding and things like Google Spreadsheets is that it eliminates, most of the time, sheer labor. We need not write out every TEI data string that would be plugged into the Fluid-Text versioning machine for Walden– instead, we could write a program and apply it to that which matched the pattern (in our case, all items).
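The same “write it once, apply it to every row” idea can be sketched in Python. This is not the spreadsheet function we actually wrote: the column names, the `row_to_tei_note` helper, and the particular markup (`note` with `type` and `n` attributes, `date` with a `when` attribute) are assumptions chosen for illustration, and the sample row is a made-up placeholder.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

def row_to_tei_note(row):
    """Turn one spreadsheet row into a TEI-style note string.

    row: dict with the hypothetical keys 'citation', 'date', and 'passage'.
    """
    return (
        f'<note type="journal" n={quoteattr(row["citation"])}>'
        f'<date when={quoteattr(row["date"])}/>'
        # escape() converts &, <, and > so the passage is safe inside XML.
        f'{escape(row["passage"])}'
        "</note>"
    )

# Placeholder rows; the passage text and date are invented for the example.
rows = [
    {"citation": "6.10-15", "date": "1852-01-01", "passage": "Sample journal text & more."},
]

tei_lines = [row_to_tei_note(row) for row in rows]
for line in tei_lines:
    ET.fromstring(line)  # raises ParseError if the string is not well-formed XML
    print(line)
```

The list comprehension plays the role of dragging a formula down a spreadsheet column: one pattern, applied to every entry that fits it.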

TEI (the Text Encoding Initiative) is a standard for encoding texts, used most often in digital humanities. With TEI providing the guidelines, the markup language XML (Extensible Markup Language) was used to encode our data in the editor Oxygen, which allows for advanced features. The string of tags in the image above (generated in the Google Spreadsheet) was transplanted to Oxygen for finalization:

[Screenshot: the TEI markup in Oxygen]

Examples include the “resp” tag, which identifies the responsibility (i.e., editor). All of these tags are read by the machine and can be manipulated by the coding engine to perform different tasks, if need be.

The majority of the work done, the rest was detail-work and error-checking. The TEI code itself had to be checked in Oxygen for technical errors. For instance, Thoreau often used the ampersand (“&”) in his writing, but in XML the ampersand is not read as text but as a command of sorts (it begins an entity reference), so a bare ampersand registered as an error. This demanded a work-around, as did other small errors.
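One common work-around for the ampersand problem is to replace each bare “&” with the entity `&amp;` before the text goes into the XML. The sketch below uses Python's standard library to show both the error and the fix; it is an illustration, not a record of the exact work-around we used in Oxygen, and the helper name `is_well_formed` and the sample phrase are assumptions.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

def is_well_formed(xml_string):
    """Return True if the string parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_string)
        return True
    except ET.ParseError:
        return False

raw = "bread & butter"                  # placeholder text, not a journal quotation
broken = f"<note>{raw}</note>"          # bare ampersand: a parser rejects this
fixed = f"<note>{escape(raw)}</note>"   # escape() writes the ampersand as &amp;

print(is_well_formed(broken))
print(is_well_formed(fixed))
print(ET.fromstring(fixed).text)        # the reader still sees a plain "&"
```

A check like `is_well_formed` only catches technical errors of this kind; it cannot tell whether an annotation is attached to the right passage or transcribed correctly.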

Additionally, we had to proofread the annotations in the final form as well, as Oxygen can only pick up on technical errors (things not strictly allowed), while we wanted to check for errors beyond that, such as formatting, etc.



Our part done, everything we worked on over the semester has been uploaded to a site directory, where you can see all our files and data, and download it to see for yourself. The product is now available in the Fluid-Text edition of Walden.



Now anyone can access the digital edition of Walden, which contains not just the various editions in Thoreau’s writing process of the Work itself, but also his journaled annotations.

Here is an example of what the journal annotations look like. In the text itself, a c marks a note at the beginning of a paragraph. A simple mouse-over reveals the forerunner thought Thoreau had in his journal writings.

Sometimes Thoreau copied himself exactly. At other times, he radically changed the sentence structure, retaining only the very kernel of the statement. It is possible he actually copied lines from an older version of Walden into the journal, and then back again, to a version closer to the final product…


Copyright regulations, an ethical adherence to avoid plagiarism, etc., all slowed our work down. Unable to scan the journals or Princeton Walden, we were stuck with hard labor when a technical solution was right at our fingertips.

At the beginning of the course (and project), we had only a plain, basic understanding of markup systems. As a result, a not-insignificant portion of our group project time was spent learning various coding languages; the attributes and values and tags; etc., etc. Ultimately, though, Joe Easterly’s expertise and work in encoding and script-writing helped the project glide along when it came to the markup stages.

Due to the digital/coding aspect of the course, we focused more on doing something with the text rather than saying something about it. However, despite not focusing on critical analysis of Thoreau’s authorship through time, our own journey of learning tools, how to build systems, etc., resembled not only what you can do with our project’s outcome, but Walden itself. While some may question whether “digital humanities” is an oxymoron, or two incompatible things juxtaposed forcefully, let this be a lesson that that is not the case. Our digital humanities project plan was not literary analysis using digital tools, but it ended up resembling such a thing in the end after all, upon reflection.

Planning, planning, planning! While it may be said, especially, of humanities projects that too much planning can land one in developmental hell, it is equally true that too little planning can lead to dead ends faster than Henry David Thoreau would eat a woodchuck if he could catch it (answer: instantly devoured raw, of course). But one must be patient and precise, and, again, know that humanities projects are some 75% planning, and just 25% action. If you are stuck like we were, or even outright failed, learn from others and yourself.


Still much work is to be done for Digital Thoreau and the Fluid-Text edition, specifically.

For instance, in assigning tags to the dating of journal passages, we have discovered that some of the dating of the various editions (a-g) may in fact be wrong. Of course, the assignment of writings to “editions” (rough drafts) of Walden was an “imagined” discipline, as Thoreau did not leave separate notebooks with complete versions in them, but instead wrote over older editions in different inks, etc. However, this project will lead into that one, leading to a more accurate chronology of Walden’s writing process.

As for the Fluid-Text edition itself, the project is still ongoing. Not quite all of our cross-references were finished in time to be added, and formatting choices are still to be updated for the additional section of the journal notes.

Finally, as for you, and us, after a long journey, it’s time to relax and study the outcome of our project, comparing the annotations and cross-referencing Walden passages.


So… what’s the point of all this? Why do all this work when we students have access to the journals in college libraries? The goal, of course, is to harness the awesomely democratic power of the Internet, powered by our digital tools and digital thinking. The TEI standard, XML, Oxygen– all these and more were utilized in this single project to make available the pre-Walden musings of our favorite transcendentalist, Henry David Thoreau. And of course, “transcendentalism” is just another pre-digital handle for “digital humanities,” the global project of freeing knowledge from the dusty recesses of collegiate libraries for everyone, everywhere.

“To my astonishment I was informed on leaving college that I had studied navigation!–why, if I had taken one turn down the harbor I should have known more about it…”

Thoreau was very critical of the times he lived in– their politics, society, and education. He advocated for individualism, but, contrary to popular belief, he didn’t spend two years in solipsistic seclusion. He not only remarked on culture and life, but actively observed it, often passing into the town of Concord. “Life in the Woods,” then, is not a literal demand to be hermitic, but a way of thinking.

“What sort of space is that which separates a man from his fellows and makes him solitary? I have found that no exertion of the legs can bring two minds much nearer to one another…”

The progress in the spheres of digital humanities is the new Wood to gather in. Thoreau’s writings are packed with thoughts on contemporary matters. Transcendentalism is a good place to start for explaining the purpose of digital humanities, but it is really only the very beginning of its enormous potential. And the only way to explore this new world is to dive right in. You can’t just walk in the woods a few times, you have to live there, to get your hands dirty and understand how, and not just what.

“But lo! men have become the tools of their tools…”

So join us! Code! Not only have we given readers the digital version of Thoreau’s pre-Walden thoughts, but we have attempted to show how we built this. As digital humanists, we must not be intimidated by, aloof from, ignorant of, or even deprived of the tools, how to use them, and how to build them ourselves. Go a bit past merely using digital tools and the Internet (to view cat videos, no doubt), to see how such things are constructed. Do not become a one-shot, funny-cat-photo-poster on the web– or, as Thoreau thought about the Collins’ cat, if you don’t adapt to changing society (improved by you, of course), you will turn completely wild, without a fine balance between thinking and doing, or solitude and society, literary and digital tools– and “so become a dead cat at last.” Don’t be a dead cat. Be a digital humanist!

“If you have built castles in the air, your work need not be lost; that is where they should be. Now put the foundation under them.”

*All Walden quotations from the Fluid-Text edition

Project Members:
Andrew Nauffts, Matthew Spitzer, Victoria Salazar,
Kyle Parnell, CJ Ferraro, Cob O’Brien, Joe Easterly

SUNY Geneseo ENGL340: “Literature and Literary Studies in a Digital Age”
Spring 2014

Special thanks to:
• Dr. Paul Schacht, our Professor
• Beth Witherell, Ron Clapper, and others for their work on Thoreau

Courtesy of digital humanist and Thoreauvian fanboy, Dr. Schacht


• Introducing the TEI Guidelines

•  A Gentle Introduction to XML

• Free coding lessons online

• Free 30-day trial of Oxygen XML Editor

• The Social Reader’s Text edition of Walden

• This blog post by Dr. Paul Schacht with a video explaining the Fluid-Text in general

• Various blogs on coding, digital humanities, etc.: including this one, Coding Horror


The Gettysburg Address as Fluid Text

Digital Thoreau’s “fluid text edition” of Henry D. Thoreau’s Walden is so named in reference to John Bryant’s 2002 book The Fluid Text: A Theory of Revision and Editing for Book and Screen. Every text is fluid, Bryant suggests, insofar as it represents not the definitive articulation of a fixed intention but rather one entry in the record of an author’s evolving and shifting intentions. The full record of those intentions would involve, at a minimum, all of the author’s drafts, and perhaps even information about authorial decisions in flux between the moment a pen is raised and the moment it touches paper.

Some texts are more obviously fluid than others because we have more information about their genesis. Such is the case with Walden. And some are fluid not only because of their pre-publication but also their post-publication history. An example of the latter is Lincoln’s “Gettysburg Address.”

The fluidity of this famous speech briefly became a matter of lively public discussion in 2013, its sesquicentennial year, when conservative media outlets expressed outrage over a recording of it made by President Obama. The reason for the outrage? Obama’s omission of the words “under God” from the final sentence.

As it turned out, Obama had given an historically faithful reading of one of the address’ five versions: the so-called “Nicolay copy,” sometimes referred to as the “first draft” of the address because it’s the earliest surviving manuscript copy and may have been the copy from which Lincoln read at the cemetery’s dedication on November 19, 1863. At the request of Ken Burns, Obama recorded the Nicolay copy as part of Burns’ Learn the Address project, which encourages “everyone in America to video record themselves reading or reciting the speech” — and which might remind us that the post-publication fluidity of some texts (most obviously, perhaps, speeches and plays) is partly a consequence of their having been intended for performance.

Google Cultural Institute has a nice timeline of the address’s interesting textual history. It draws largely from the House Divided Project, a digital humanities Civil War project at Dickinson College to which Dickinson undergraduates have contributed, and where you can read all five drafts.

The Gettysburg Foundation also provides transcriptions of the five versions, highlighting the differences between them in boldface.

In ENGL 340 tomorrow, we’ll take these five versions and encode the differences between them in XML, using the critical apparatus tagset of TEI. Then we’ll display them side by side using the Versioning Machine and — if time allows, and if all goes well — Juxta in order to see how visualization tools can help us understand the fluid nature of one of our nation’s most important texts.
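As a preview of what that encoding might look like, here is a minimal Python sketch that compares two versions word by word and wraps each difference in the `app` and `rdg` elements of TEI's critical apparatus, with a `wit` attribute pointing at each witness. Several things here are assumptions made for illustration: the function name `encode_variants`, the witness labels `nicolay` and `later`, and the two short readings, which are simplified stand-ins rather than verified transcriptions of any copy.

```python
from difflib import SequenceMatcher
from xml.sax.saxutils import escape

def encode_variants(base_id, base_text, other_id, other_text):
    """Return inline TEI-style markup recording where two versions differ."""
    a, b = base_text.split(), other_text.split()
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    parts = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Shared text is written once, outside any apparatus entry.
            parts.append(escape(" ".join(a[i1:i2])))
        else:
            # Differing text becomes one <app> holding a reading per witness.
            rdg_a = escape(" ".join(a[i1:i2]))
            rdg_b = escape(" ".join(b[j1:j2]))
            parts.append(
                f'<app><rdg wit="#{base_id}">{rdg_a}</rdg>'
                f'<rdg wit="#{other_id}">{rdg_b}</rdg></app>'
            )
    return " ".join(parts)

# Simplified illustrative readings, NOT verified transcriptions.
nicolay = "that the nation shall have a new birth of freedom"
later = "that this nation, under God, shall have a new birth of freedom"

encoded = encode_variants("nicolay", nicolay, "later", later)
print(encoded)
```

Tools like the Versioning Machine read markup of this general shape to line the versions up side by side. With five witnesses instead of two, each `app` would simply carry more readings.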