The Thorough Thoreau: the Annotated Fluid-Text Edition

The ENGL340 Coder’s team presents the Improved Fluid-Text edition of Walden…

Working with Beth Witherell’s “The Writings of Henry D. Thoreau,” and the Princeton edition of Walden, our team incorporated annotations from eight volumes of journals into the Fluid-Text edition of Walden, making the project more massive, authentic, and penetrating.

“We commonly do not remember that it is, after all, always the first person that is speaking. I should not talk so much about myself if there were any body else whom I knew as well. Unfortunately, I am confined to this theme by the narrowness of my experience.” Walden page one.

Project: Success! with some foundering… Our ultimate goal this semester was to add a feature to the Fluid-Text Walden that grants users a behind-the-scenes study of Thoreau’s writing process, from the [seemingly] random ramblings of his journals to the finished product– the transcendentalist masterpiece of life in the woods. Given that our group as a whole had little prior experience with encoding, our journey was not directly an exploration of growing authorship, but an exploration of that idea in a digital, very meta way: going behind the behind-the-scenes to revamp the digital edition of the work.

Our task involved expanding our collective knowledge of coding, primarily using the TEI standard and the XML format, the results of which we finally plugged into the versioning machine of the Fluid-Text.

However, what can’t clearly be seen in the digital manifestation of our effort are the organizational bumps we plowed into along the way– the rerouting and clarifying and focusing and… Here we will historicize the endeavor (much as our project scaffolds Thoreau to a greater degree) and the careful balance digital humanists must establish between planning and actual implementation.

For all the struggle, our project was completed. The Fluid-Text edition is ever-enhancing itself, and Thoreau isn’t done with us yet, either.

 “No man ever stood the lower in my estimation for having a patch in his clothes…”

For some time, the project was undefined. We rolled around in the dirt of ideas and potential projects, and had a few false starts and dead ends along the way. ENGL340’s Data Analysis group discussed feasible ideas with our team at the outset, as we were both working with types of coding and analysis as the basic purpose of our projects, but eventually the path forked. Finally, it was decided, given the availability of materials, that we would create a coding project that gathered the passages in Thoreau’s journals that were later drawn on in the published edition of Walden and brought them into the digital edition, the Fluid-Text. These cross-references were neatly cited in a compendium in the back of the volumes (#1-8) of “The Writings of Henry D. Thoreau.”

Thoreau kept extensive journals, in which he recorded ideas, notes, and thoughts he had during his days. Some were lengthy– others, jotted-down half-ideas, which seem only to make sense to Thoreau himself (or when Thoreau later finished the thought by including it in Walden).

Simply obtaining the journals proved to cost a pretty chunk of time. Due to their rarity, cost, etc., it was some time before a complete set could be shipped to Geneseo. It was also useful for everyone in the group to possess a physical copy of the Princeton Walden to use in tandem with the journals, and this was another unforeseen snag that briefly stalled us.

But– again!– more unanticipated hindrances. Though Thoreau’s writings are in the public domain, the works we used are edited versions, and thus are under copyright. We couldn’t photocopy just anything we wanted due to these legal limitations, and this involved yet more obtaining of journal volumes– we had to use all volumes in multiple steps of the project, often with multiple people needing to use the same one. We could only photocopy the index for the sake of reference, shown here.

We were the “coding” group, yet, oddly, the bulk of our work (after the planning) was tedious, hand-typing of every single journal entry into our Google spreadsheet. We needed the entire passage itself, its citation (e.g. 6.10-15; that is, page 10, lines 10-15), the date; also needed were its counterparts, the page number in Walden itself, and the relevant passage that most resembled the keywords from the journal entry:

Some 500+ entries were filled out with this data.

All right– the data was assembled all in one place. We had the journal text itself, and all the data to link it to Walden itself, as well as things like dates that we could both include and perform analyses on [more on that later]. What next?

The point of using Google Spreadsheet was twofold. One, keep the group in sync and connected; two, utilize the function and mass-apply capabilities of the program. Our next medium, TEI (using the program Oxygen), needed transformed data. By writing a function and applying it to all 600 items in our spreadsheet, we could quickly and easily slide forward in our agenda. The beauty of coding and things like Google Spreadsheets is that it eliminates, most of the time, sheer labor. We need not write out every TEI data string that would be plugged into the Fluid-Text versioning machine for Walden– instead, we could write a program and apply it to that which matched the pattern (in our case, all items).
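The "write one function, apply it to every row" idea can be sketched outside the spreadsheet as well. What follows is only an illustration in Python, not the formula our team actually used: the column names, the placeholder date and passage, and the choice of <note>, <date>, and <ref> elements are all assumptions made for the sake of the example.

```python
from xml.sax.saxutils import escape, quoteattr

# Hypothetical rows standing in for the spreadsheet. The column names,
# the date, and the passage text are placeholders, not real project data.
rows = [
    {
        "citation": "6.10-15",
        "date": "1852-01-01",
        "walden_page": "3",
        "text": "Placeholder journal passage.",
    },
]


def row_to_tei(row):
    """Turn one spreadsheet row into a TEI-style <note> string.

    The element and attribute choices here are assumptions for
    illustration, not the markup the project settled on.
    """
    n = quoteattr(row["citation"])   # quoted and escaped attribute value
    when = quoteattr(row["date"])
    page = escape(row["walden_page"])
    body = escape(row["text"])       # escape &, <, > in the passage text
    return (
        f'<note type="journal" n={n}>'
        f"<date when={when}/>"
        f"<ref>Walden, p. {page}</ref> "
        f"{body}</note>"
    )


# Apply the same pattern to every row, as a spreadsheet fill-down would.
tei_notes = [row_to_tei(row) for row in rows]

for note in tei_notes:
    print(note)
```

The point of the design is the same as in the spreadsheet: the pattern is written once, so a correction to the pattern fixes every entry at the same time.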

TEI (the Text Encoding Initiative) is a set of guidelines for encoding texts, used most often in digital humanities. With TEI providing the guidelines, the markup language XML (Extensible Markup Language) was used to encode our data in the editor Oxygen, which allows for advanced features. The string of tags in the image above (generated in the Google Spreadsheet) was transplanted to Oxygen for finalization:

Examples include the “resp” tag, which identifies the responsibility (i.e., editor). All of these tags are read by the machine and can be manipulated by the coding engine to perform different tasks, if need be.

The majority of the work done, the rest was detail-work and error-checking. The TEI code itself had to be checked in Oxygen for technical errors. For instance, Thoreau often used the ampersand (“&”) in his writing, but in XML the ampersand is not read as text; it marks the start of an entity reference. This demanded a work-around, as did other small errors.
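One common work-around for the ampersand problem is to replace each bare “&” with the entity “&amp;” before the text goes into the XML. The sketch below shows that approach in Python using only the standard library; it is an assumption about how such a fix can be done, not a record of what our team did in Oxygen, and the sample phrase is invented.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Invented sample text containing a bare ampersand (not a Thoreau quotation).
raw = "bread & butter"

# escape() rewrites &, <, and > as XML entities.
escaped = escape(raw)


def is_well_formed(fragment):
    """Return True if the fragment parses as XML, False otherwise."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False


bad = f"<note>{raw}</note>"       # bare ampersand: the parser rejects this
good = f"<note>{escaped}</note>"  # escaped ampersand: parses cleanly

print(is_well_formed(bad), is_well_formed(good))
```

When the escaped version is parsed, the reader still gets the original character back, so nothing of Thoreau’s habit is lost in the display.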

Additionally, we had to proofread the annotations in the final form as well, as Oxygen can only pick up on technical errors (things not strictly allowed), while we wanted to check for errors beyond that, such as formatting, etc.

Our part done, everything we worked on over the semester has been uploaded to a site directory, where you can see all our files and data, and download it to see for yourself. The product is now available in the Fluid-Text edition of Walden.



Now anyone can access the digital edition of Walden, which contains not just the various editions in Thoreau’s writing process of the Work itself, but also his journaled annotations.

Here is an example of what the journal annotations look like. In the text itself, a c marks a note at the beginning of a paragraph. A simple mouse-over reveals the forerunner thought Thoreau had in his journal writings.

Sometimes Thoreau copied himself exactly. At others, he radically changed the sentence structure, retaining only the very kernel of the statement. It is possible he actually copied lines from an older version of Walden into the journal, and then back again, to a version closer to the final product…


Copyright regulations, an ethical adherence to avoid plagiarism, etc., all slowed our work down. Unable to scan the journals or Princeton Walden, we were stuck with hard labor when a technical solution was right at our fingertips.

At the beginning of the course (and project), we only had a plain, basic understanding of markup systems. As a result, a not-insignificant portion of our group project time was spent learning various coding languages; the attributes and values and tags; etc., etc. Ultimately, though, Joe Easterly’s expertise and work in encoding and script-writing helped the project glide along when it came to the markup stages.

Due to the digital/coding aspect of the course, we focused more on doing something with the text rather than saying something about it. However, despite not focusing on critical analysis of Thoreau’s authorship through time, our own journey of learning tools, how to build systems, etc., resembled not only what you can do with our project’s outcome, but Walden itself. While some may question whether “digital humanities” is an oxymoron, or two incompatible things juxtaposed forcefully, let this be a lesson that that is not the case. Our digital humanities project plan was not literary analysis using digital tools, but it ended up resembling such a thing in the end after all, upon reflection.

Planning, planning, planning! While it may be said, especially, of humanities projects that too much planning can land one in developmental hell, it is equally true that too little planning can lead to dead ends faster than Henry David Thoreau would eat a woodchuck if he could catch it (answer: instantly devoured raw, of course). But one must be patient and precise, and, again, know that humanities projects are some 75% planning, and just 25% action. If you are stuck like we were, or have even outright failed, learn from others and yourself.


Still much work is to be done for Digital Thoreau and the Fluid-Text edition, specifically.

For instance, in assigning tags to the dating of journal passages, we have discovered that some of the dating of the various editions (a-g) may in fact be wrong. Of course, the assignment of writings to “editions” (rough drafts) of Walden was an “imagined” discipline, as Thoreau did not leave separate notebooks with complete versions in them, but instead wrote over older editions in different inks, etc. However, this project will lead into that one, leading to a more accurate chronology of Walden’s writing process.

As for the Fluid-Text edition itself, the project is still ongoing. Not quite all of our cross-references were finished in time to be added, and formatting choices are still to be updated for the additional section of the journal notes.

Finally, as for you, and us, after a long journey, it’s time we can actually relax and study the outcome of our project, comparing the annotations and cross-referencing Walden passages.


So… what’s the point of all this? Why do all this work when we students have access to the journals in college libraries? The goal, of course, is to harness the awesomely democratic power of the Internet, powered by our digital tools and digital thinking. The TEI standard, XML, Oxygen– all these and more were utilized in this single project to make available the pre-Walden musings of our favorite transcendentalist, Henry David Thoreau. And of course, “transcendentalism” is just another pre-digital handle for “digital humanities,” the global project of freeing knowledge from the dusty recesses of collegiate libraries for everyone, everywhere.

“To my astonishment I was informed on leaving college that I had studied navigation!–why, if I had taken one turn down the harbor I should have known more about it…”

Thoreau was very critical of the times he lived in– their politics, society, and education. He advocated for individualism, but, contrary to popular belief, he didn’t spend two years in solipsistic seclusion. He not only remarked on culture and life, but actively observed it, often passing into the town of Concord. “Life in the Woods,” then, is not a literal demand to be hermitic, but a way of thinking.

“What sort of space is that which separates a man from his fellows and makes him solitary? I have found that no exertion of the legs can bring two minds much nearer to one another…”

The progress in the spheres of digital humanities is the new Wood to gather in. Thoreau’s writings are packed with thoughts on contemporary matters. Transcendentalism is a good place to start for explaining the purpose of digital humanities, but it is really only the very beginning of its enormous potential. And the only way to explore this new world is to dive right in. You can’t just walk in the woods a few times, you have to live there, to get your hands dirty and understand how, and not just what.

“But lo! men have become the tools of their tools…”

So join us! Code! Not only have we given readers the digital version of Thoreau’s pre-Walden thoughts, but we have attempted to show how we built this. As digital humanists, we must not be intimidated by, aloof from, ignorant of, or even deprived of the tools, how to use them, and how to build them ourselves. Go a bit past merely using digital tools and the Internet (to view cat videos, no doubt), to see how such things are constructed. Do not become a one-shot, funny-cat-photo-poster on the web– or, as Thoreau thought about the Collins cat, if you don’t adapt to changing society (improved by you, of course), you will turn completely wild, without a fine balance between thinking and doing, or solitude and society, literary and digital tools– and “so become a dead cat at last.” Don’t be a dead cat. Be a digital humanist!

“If you have built castles in the air, your work need not be lost; that is where they should be. Now put the foundation under them.”

*All Walden quotations from the Fluid-Text edition

Project Members:
Andrew Nauffts, Matthew Spitzer, Victoria Salazar,
Kyle Parnell, CJ Ferraro, Cob O’Brien, Joe Easterly

SUNY Geneseo ENGL340: “Literature and Literary Studies in a Digital Age”
Spring 2014

Special thanks to:
• Dr. Paul Schacht, our Professor
• Beth Witherell, Ron Clapper, and others for their work on Thoreau

Courtesy of digital humanist and Thoreauvian fanboy, Dr. Schacht


• Introducing the TEI Guidelines

• A Gentle Introduction to XML

• Free coding lessons online

• Free 30-day trial of Oxygen XML Editor

• The Social Reader’s Text edition of Walden

• This blog post by Dr. Paul Schacht with a video explaining the Fluid-Text in general

• Various blogs on coding, digital humanities, etc.: including this one, Coding Horror


A Smattering of Stupid (?) Studies

We can’t always spend a whole class on a single book or author, can we? As an English literature student, obviously I read outside of class (“for fun,” as some would put it) and surf for literature-related material online. Though I think these “supplemental study tools” can only be judged on an individual basis, here are some I think are interesting, weird, scarily cultish… or just plain stupid.

So adorbs

For example, there is the ever-adorable “Writers and Kitties” tumblr, which takes a step back from literature itself and merely captures our empathy for the authors themselves. Apparently Mark Twain liked to relax while playing pool with a kitty in the corner pocket.

However, if one chooses to follow the trail of Twain’s digital afterlife, it derails quite quickly.

Kitties > famous authors ?…

Here is a 1940s ad from Royal Crown Cola featuring not just Twain, but containing (and focusing on!) his pool-playing kitten. Somewhat reassuring– it seems the digital age’s “nothing is sacred” mentality is nothing new (just as we have discussed that new technology scares are nothing new– paper?! OMG.)

Pure sex appeal.

However, if we continue down the electric pathway of the ghost of Twain, we rapidly descend into madness. The twitter handle Shirtless Mark Twain takes a simple picture of said phenomenon (the explanation lost in time, it seems– can’t find the reason for it) and creates a new life for the author; over a hundred tweets capitalizing on the photograph. He usually signs every third tweet or so with a call for “shirts off,” as well as often dissing the physique of fellow authors. Fortunately(?), the account has only a paltry 45 followers. Please do not add to the following– I think this one serious students may skip. However, there may be something here– a short, comic appeal focused on the author himself (or herself).

Moving on to more positive examples, there is the Henry David Thoreau twitter account, which seems to be mostly tweets of T quotations. Here’s one that demonstrates T’s quotability:

“As if you could kill time without injuring eternity.”

It’s interesting how powerful a quotation seems to become when it’s set off by itself like this. Reading Walden twice, I never specifically noticed this particular line before (nor, shamefully(?), can I even remember it at all). Are things like twitter literature accounts helping literature students and general students alike? I can’t help but notice the similarities between something like this and our Social Text edition of Walden. This particular tweet had a fine 39 retweets, and it was the favorite of 41. The ThoreauPage has a staggering 25.1k followers. However, missing from many of the tweeted quotations (despite the exposure) was discussion, such as can be found in a more academic setting like the Reader’s Walden. And even if there was a long discussion from different accounts, how much can be conveyed in 140 characters for each individual comment?

As much as one can speculate on the meaning of this quotation by itself, is there missing context to it that can only be found embedded within its proper passage? Is this additional exposure– twitter pages, etc.– to literature really beneficial, or even detrimental, if the content is all flash and no substance? Will non-literature students ever willingly follow the forbidding Thoreau, author of the much-hated Walden (at least so seems to be the opinion of most people in Geneseo whenever I mention the book to others)? (My friend has scathingly accused Geneseo of having a “Thoreau fetish”– we literature students seem to always be on the defense for our field of academic criticism [see my other blog post on Annotations]. Has anyone else been on the end of a quick explosion of anger toward Thoreau in Geneseo?)

I’m not aware of how one could gauge the academic usefulness of things like the ThoreauPage on twitter (forget ShirtlessTwain), it probably being too complex to control for a scientific polling, but here’s something interesting that perhaps could be related, if other studies could be found or done: “Texting can help improve your kid’s writing skills.” This article cites British research done that correlates (positively) “textisms” and a student’s writing skills. It seems that students’ texting is language and writing use (while before texting, the students would not be writing at all)– even those “textisms” are beneficial in a way, and better yet, students, fortunately, know when to not use them in schoolwork.

Hawt? Nawt.

Maybe those Thoreau quotations, even taken out of context (and only those pithy, short statements) are getting students to read while otherwise they would not be at all? Perhaps if Thoreau had posed for a shirtless daguerreotype, and if this comic aspect of Thoreau’s personal life was combined with serious quotations, the appeal to ThoreauPage would be even greater and more influential (unfortunately, I think Thoreau’s neckbeard disqualifies him for the label of “pure sex appeal,” as with our friend Mark Twain). But still, putting T on twitter must be more useful than reserving him to college courses… right?

Here’s a link to the 100 Best Opening Lines from Novels. Here’s a link to the complementary 100 Best Closing Lines, and here’s an entire tumblr page dedicated to that all-spoiling topic: the final sentence. Is digital media focusing on the easy quotables of literature? Or is that a false stereotype?

Speaking of pithiness, did you know the Bible has a Sparknotes page? See here for an interesting survey on who knows (aka reads) the Bible best among various religious groups. Religious (and Christian) or not, it would be a fair opinion that the Bible is the most well-read book in America and world-wide (especially if you’re one who thinks it ought to be read). Here’s another article discussing just how many read the Bible versus how many think it should be read. Hmm… will the biblical god be happy with those who have only sparknotes’d him? After all, 1 Peter 3:15 makes a call for Christians to know their book and faith, and yet it seems like they need a little help. The Bible, after all, is first and foremost a piece of literature with a profound effect on history, and huge numbers of classical works allude to it.

Sparknotes, of course, is the original student-slacking site, and the most well-known. But as much as teachers [seem to] hate it (input, Dr. Schacht?), it seems to be the general consensus that it helps students get a general grasp on basic plot and theme when reading, providing a shallow but supportive scaffolding to literary endeavors– however, the question is, as has been the entire post, does this tool substitute real thought and criticism, or add to its richness, and, in its shallowness itself, easily allow for otherwise difficult works to be entered into by hesitant students? I think, as long as students actually use these as supplemental tools as such, there can’t really be any net harm, only tiny, or even great, gains. Should students of literature loosen up a little in a ShirtlessTwain sort-of-way, in order to make ourselves more accessible and appealing? Is it time for Thoreauvians to call “shirts off, bro!”, and lighten the mood a little?

Pessimism floods us again– here is Twitterature, “The World’s Greatest Books in Twenty Tweets or Less.” Poorly reviewed on Amazon, it suggests that good “scaffolding” for works themselves might be beneficial, but oftentimes, as here, trying to be cool and compact leads one to laziness. The Guardian reviews, “The classics are so last century.” Ultimately, I think short, compact “guides” to literature may be helpful in (basic) understanding and in motivating one to push past dense wordy slogs, but we might be careful to gently avoid the bad ones.

Finally, here is Elliott Holt’s short detective story told entirely in Twitter itself. Twitterfiction tweets out 140-character stories. Good quality or not– most authors, when giving advice, say simply to start writing, no matter how bad or little. Practice for English students is analogous. How many literary-related tweets or Twitterfictions will it take to build up skill for a 10 page paper?…

I was going to end the post with a few cool, literature-related twitter accounts, but rather than quickly find some (having just made my twitter account for class and followed a few pages) without checking them over for quality, does anyone know of any useful or just interesting twitter accounts for literature-lovers?

“Iceland: Where one in ten will publish a book”

Speaking of our class discussion today about an inundation of books bypassing the traditionally limited gatekeeping of publishing, and a new era of democratizing authorship, here’s an article I saw a while ago about how…

“Iceland is experiencing a book boom. This island nation of just over 300,000 people has more writers, more books published and more books read, per head, than anywhere else in the world.”

I couldn’t find anything else through an easy search online (everything else was simply referencing the BBC article), but it’s certainly interesting, and if anyone finds out anything more just comment about it! The article discusses literary culture in Iceland and possible causes of the boom…

“We are a nation of storytellers. When it was dark and cold we had nothing else to do. Thanks to the poetic eddas and medieval sagas, we have always been surrounded by stories. After independence from Denmark in 1944, literature helped define our identity.”

On an unrelated note, here’s the article Dr. Schacht mentioned in class– Susan Sontag’s Against Interpretation (~only eight pages, worth a read):

“…Today… the project of interpretation is largely reactionary, stifling. Like the fumes of the automobile and of heavy industry which befoul the urban atmosphere, the effusion of interpretations of art today poisons our sensibilities. In a culture whose already classical dilemma is the hypertrophy of the intellect at the expense of energy and sensual capacity, interpretation is the revenge of the intellect upon art…

…In place of a hermeneutics we need an erotics of art.”



A Note on Annotations


In class today, the lion’s share of our time went to a discussion of annotations in works, and their pros and cons in both traditional manuscript format and digital editions of works. In this blog post, I simply want to share two works dealing with these themes and review them to a limited extent in case they meet anyone’s further fancy.

The first is Mark Z. Danielewski’s novel, House of Leaves. If anyone has read it, or simply opened it to skim a few pages (you should do this, the hyperlink is to it in the Milne catalogue), you will see why it’s so difficult to explain.

To begin, it should be explained that, as a whole, the book is composed as ergodic literature, defined superbly by Espen J. Aarseth, the coiner of the term:

“In ergodic literature, nontrivial effort is required to allow the reader to traverse the text. If ergodic literature is to make sense as a concept, there must also be nonergodic literature, where the effort to traverse the text is trivial, with no extranoematic responsibilities placed on the reader except (for example) eye movement and the periodic or arbitrary turning of pages.”

The concept is best shown through a visual:


Danielewski’s novel contains colored text (for example, the word “house” always appears in blue ink, most probably alluding to the blue-hued Internet hyperlink), sideways text, mirror writing, writing in different shapes, etc. More notable, and more annoying, are the thousands and thousands of increasingly abstract and erudite footnotes. Danielewski uses this purposefully frustrating technique to create a satire of criticism in academia, notably literature studies, reflecting how he views the opinion of the “general public” on literary studies: that of a society looking scathingly upon a seemingly, purposefully difficult field of study– and possibly a worthless one, in the end.

Ironically, Danielewski fails to fully portray the frustration factor of this satire, because for all this mess, his narration carries us on, trudging and complaining, through the hundreds of pages. The story of House of Leaves is really three. The first is Zampanò’s (1) notes on the Navidson Record (2), a set of documents about an [ex-]adventurer who discovers in his newly purchased, seemingly domestic and plain house an impossibly huge labyrinth. First seen as just a discrepancy between the dimensions of the house (the inner dimensions are larger than the outer dimensions by a quarter of an inch), the interior of the house soon grows astonishingly large, to utterly fantastic dimensions, of huge, hulking hallways of eerie grey walls, all of which possess… nothing. As the house grows, nothing is found within the huge abyss as they explore except an eerie growl (the source of which, be it the house or the adventurer, Will Navidson, is never explicitly confirmed). The final story is that of Johnny Truant (3), who finds the notes of the just-murdered (?) Zampanò, and then rambles in his footnotes about his personal life. Ultimately, the story questions what we find, if anything, as we dive deeper and deeper into literature– are the meanings we find our own, or the author’s intended? Etc. etc.

Though we who hate footnotes may find this complicated (again, purposefully) book unappetizing at first glance, it really is a good book. It is a horror story (the unending, ever-expanding labyrinth…) and love story (Will and his wife) in addition to a satire. Just for kicks, I would suggest going to see it in Milne and flicking through it just to see how crazy the book’s style is (and how Danielewski portrays the “general public’s” perception of erudite literary studies– it is so off the mark, even to English students, that most of us, I think, will be astonished, and lose our confidence and interest, as we stare at this immense, labyrinthine book).

The other book I will mention more briefly (having not yet finished it) is the novel S., by J. J. Abrams and Doug Dorst. It, too, is a story within a story. The “real story” is called The Ship of Theseus, and is about a man (S) who has lost his memory. It is written by the fictional author V. M. Straka, whom the other story revolves around. Notes in the margins are from Jennifer and Eric, two college students trying to uncover the mysteries of this mysterious Straka figure. There are notes from both Eric and Jennifer on their first read, their second read, and so on, leading to a rainbow of colors in the margins denoting the chronology of the notes.




In addition to the notes in the margins, there are further items stuffed within the pages, such as government documents, maps drawn on napkins, etc., which Jennifer and Eric put in, lending itself ideally to our discussion of the fluid-text edition of Walden, wherein Thoreau had multiple versions of his work which scholars had to piece together in a roughly chronological and version sequence, as well as any writings in his journals which were drawn on in Walden itself. It is interesting to see, rather than older and younger versions of a literary work juxtaposed together to compare and contrast, older and younger memories of a romantic relationship bound in the margin, side by side.

I think anyone who is a bit sketchy on the success of things such as public annotations and comments should check this book out. While things like Kindle seem to be a bit impersonal and distracting (“1,233,939 people highlighted this sentence”), S. creates a very personal story in the margin itself between just two people.

Anyway, just thought I’d suggest these two books for anyone interested in the topic. The first one is in Milne, as I’ve linked to, and S. you’ll have to buy, unfortunately, if you want to read it, most likely (though I own a copy if you want to look at it to get a better idea of what it looks like). Sorry if the descriptions seemed a bit confusing– as I mentioned, Danielewski’s novel is purposefully confusing, and S. I haven’t finished yet! If anyone else has read these and wants to expand on my brief descriptions, feel free!

1 Some footnotes are outright invented sources and authors, while others are from real sources. The number of footnotes is in the hundreds, if not thousands. (2)

2 Including footnotes of footnotes

3 The online forum for readers of the work, who try to solve what is probably a purposefully unsolvable puzzle.