The Thorough Thoreau: the Annotated Fluid-Text Edition

The ENGL340 Coder’s team presents the Improved Fluid-Text edition of Walden…

Working with Beth Witherell’s “The Writings of Henry D. Thoreau,” and the Princeton edition of Walden, our team incorporated annotations from eight volumes of journals into the Fluid-Text edition of Walden, making the project more massive, authentic, and penetrating.

“We commonly do not remember that it is, after all, always the first person that is speaking. I should not talk so much about myself if there were any body else whom I knew as well. Unfortunately, I am confined to this theme by the narrowness of my experience.” Walden page one.

Project: Success! With some foundering… Our ultimate goal this semester was to add a feature to the Fluid-Text Walden that grants users a behind-the-scenes study of Thoreau’s writing process, from the [seemingly] random ramblings of his journals to the finished product, the transcendentalist masterpiece of life in the woods. Because our group as a whole had little prior experience with encoding, our journey was less a direct exploration of Thoreau’s growing authorship than a digital, rather meta one: we went behind the behind-the-scenes to revamp the digital edition of the work.

Our task involved expanding our collective knowledge of coding, primarily through the TEI standard and the XML format, and finally plugging our encoded data into the Versioning Machine of the Fluid-Text.

However, what can’t clearly be seen in the digital manifestation of our effort are the organizational bumps we plowed into along the way: the rerouting and clarifying and focusing and… Here we will historicize the endeavor (much as our project scaffolds Thoreau to a greater degree) and describe the careful balance digital humanists must establish between planning and actual implementation.

For all the struggle, our project was completed. The Fluid-Text edition is ever-enhancing itself, and Thoreau isn’t done with us yet, either.



Table of Contents:

• The Process
• The Product
• Challenges
• The Future
• Digital Humanities
• Project Members
• Resources



THE PROCESS

 “No man ever stood the lower in my estimation for having a patch in his clothes…”

For some time, the project was undefined. We rolled around in the dirt of ideas and potential projects, and had a few false starts and dead ends along the way. ENGL340’s Data Analysis group discussed feasible ideas with our team at the outset, as both groups had coding and analysis as the basic purpose of their projects, but eventually our paths forked. Finally, given the availability of materials, we decided to create a coding project that brought the passages in Thoreau’s journals that were later referenced in the published edition of Walden into the digital edition, the Fluid-Text. These cross-references were neatly cited in a compendium in the back of the volumes (#1-8) of “The Writings of Henry D. Thoreau.”

Thoreau kept extensive journals, in which he recorded ideas, notes, and thoughts he had during his days. Some entries were lengthy; others were jotted-down half-ideas that seem to make sense only to Thoreau himself (or only once Thoreau later finished the thought by including it in Walden).

Simply obtaining the journals cost a pretty chunk of time. Due to their rarity, cost, etc., it was some time before a complete set could be shipped to Geneseo. Everyone in the group also needed a physical copy of the Princeton Walden to use in tandem with the journals, and this was another unforeseen snag that briefly stalled us.

Screen Shot 2014-05-13 at 5.37.49 PM

But, again, more unanticipated hindrances! Though Thoreau’s writings are in the public domain, the works we used are edited versions, and thus are under copyright. Because of these legal limitations we couldn’t photocopy just anything we wanted, which meant obtaining yet more journal volumes: we had to use all volumes in multiple steps of the project, often with multiple people needing the same one. We could only photocopy the index for the sake of reference, shown here.

We were the “coding” group, yet, oddly, the bulk of our work (after the planning) was the tedious hand-typing of every single journal entry into our Google spreadsheet. For each entry we needed the entire passage itself, its citation (e.g. 6.10-15; that is, page 6, lines 10-15), and its date; we also needed its counterparts in Walden: the page number and the relevant passage that most resembled the keywords from the journal entry:

Screen Shot 2014-05-12 at 9.19.11 PM Screen Shot 2014-05-12 at 9.19.27 PM


Some 500+ entries were filled out with this data.
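
To make the shape of one spreadsheet row concrete, here is a minimal Python sketch. Everything in it is our own illustration rather than the spreadsheet’s actual layout: the JournalEntry class, its field names, and the parse_citation helper are hypothetical, and the sketch assumes that a citation such as 6.10-15 reads as a page number followed by a line range.

```python
import re
from dataclasses import dataclass

# Assumed citation shape: "page.firstline-lastline", e.g. "6.10-15".
CITATION = re.compile(r"^(\d+)\.(\d+)(?:-(\d+))?$")


def parse_citation(citation: str) -> tuple[int, int, int]:
    """Split a citation into (page, first_line, last_line)."""
    match = CITATION.match(citation.strip())
    if match is None:
        raise ValueError(f"unrecognized citation: {citation!r}")
    page, first, last = match.groups()
    # A single-line citation such as "6.10" has no range end,
    # so the last line falls back to the first line.
    return int(page), int(first), int(last or first)


@dataclass
class JournalEntry:
    """One row of the (hypothetical) spreadsheet layout."""
    journal_text: str    # the hand-typed journal passage
    citation: str        # e.g. "6.10-15"
    date: str            # date of the journal entry
    walden_page: int     # page of the counterpart in Walden
    walden_passage: str  # the Walden passage it resembles


# Placeholder values only; these are not real journal data.
entry = JournalEntry(
    journal_text="(journal passage typed by hand)",
    citation="6.10-15",
    date="(journal date)",
    walden_page=1,
    walden_passage="(matching Walden passage)",
)
page, first_line, last_line = parse_citation(entry.citation)
print(page, first_line, last_line)
```

Keeping the citation as a string and parsing it only when needed means a mistyped citation surfaces as an error rather than slipping silently into the data.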

All right– the data was assembled all in one place. We had the journal text itself, and all the data to link it to Walden itself, as well as things like dates that we could both include and perform analyses on [more on that later]. What next?

The point of using Google Spreadsheet was twofold: one, to keep the group in sync and connected; two, to use the program’s functions and mass-apply capabilities. Our next medium, TEI (edited in the program Oxygen), needed transformed data. By writing a function and applying it to every one of the 500+ items in our spreadsheet, we could move quickly and easily through our agenda. The beauty of coding, and of tools like Google Spreadsheets, is that it eliminates, most of the time, sheer labor. We did not need to write out by hand every TEI data string that would be plugged into the Fluid-Text Versioning Machine for Walden; instead, we could write a program and apply it to everything that matched the pattern (in our case, all items).

TEI (the Text Encoding Initiative) is a standard for encoding texts, used most often in digital humanities. With TEI providing the guidelines, we used the markup language XML (Extensible Markup Language) to encode our data in the editor Oxygen, which allows for advanced features. The string of tags in the image above (generated in the Google Spreadsheet) was transplanted to Oxygen for finalization:
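
The same mass-apply idea can be sketched outside a spreadsheet. The team worked with a spreadsheet function; the Python below is only an analogous illustration, and the row keys, the attribute values "journal" and "#engl340", and the row_to_note helper are all hypothetical rather than the project’s actual encoding.

```python
import xml.etree.ElementTree as ET


def row_to_note(row: dict[str, str]) -> str:
    """Build one TEI-style <note> element from a spreadsheet row."""
    note = ET.Element(
        "note",
        {
            "type": "journal",   # hypothetical value
            "resp": "#engl340",  # hypothetical pointer to the responsible editor
            "n": row["citation"],
        },
    )
    date = ET.SubElement(note, "date")
    date.text = row["date"]
    # The passage follows the <date> element inside the note.
    date.tail = " " + row["journal_text"]
    # Serializing through ElementTree escapes characters such as "&".
    return ET.tostring(note, encoding="unicode")


# Placeholder rows; the text is not a Thoreau quotation.
rows = [
    {
        "citation": "6.10-15",
        "date": "(journal date)",
        "journal_text": "placeholder text & more",
    },
]

# One function, applied to every row that matches the pattern.
for row in rows:
    print(row_to_note(row))
```

Building the element with a library instead of gluing strings together costs a few more lines, but it guarantees that every generated note has matching open and close tags.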

Screen Shot TEI

Examples include the “resp” tag, which identifies the responsibility (i.e., editor). All of these tags are read by the machine and can be manipulated by the coding engine to perform different tasks, if need be.

With the majority of the work done, the rest was detail work and error checking. The TEI code itself had to be checked in Oxygen for technical errors. For instance, Thoreau often used the ampersand (“&”) in his writing, but in XML the ampersand is not read as plain text: it marks the start of an entity reference, so a bare one registers as an error. This demanded a work-around, as did other small errors.

Additionally, we had to proofread the annotations in the final form as well, as Oxygen can only pick up on technical errors (things not strictly allowed), while we wanted to check for errors beyond that, such as formatting, etc.
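
The ampersand problem and the kind of technical check Oxygen performs can be imitated in a few lines of Python. This is a sketch of the general XML behavior, not of the team’s actual work-around; the is_well_formed helper and the sample string are our own.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def is_well_formed(xml_text: str) -> bool:
    """Return True if the text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


raw = "bread & butter"  # placeholder, not a Thoreau quotation

broken = f"<note>{raw}</note>"         # bare ampersand left in place
fixed = f"<note>{escape(raw)}</note>"  # ampersand written as &amp;

print(is_well_formed(broken))  # False
print(is_well_formed(fixed))   # True
print(fixed)                   # <note>bread &amp; butter</note>
```

A check like this only catches what the parser rejects. A misspelled word or an annotation attached to the wrong paragraph is perfectly well-formed XML, which is why the proofreading pass was still necessary.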

Screen Shot 2014-05-12 at 6.57.45 PM

 

Our part done, everything we worked on over the semester has been uploaded to a site directory, where you can see all our files and data and download them to see for yourself. The product is now available in the Fluid-Text edition of Walden.

 


THE PRODUCT

Now anyone can access the digital edition of Walden, which contains not just the various versions of the work from Thoreau’s writing process, but also his journal annotations.

Screen Shot 2014-05-12 at 11.26.14 PM

Here is an example of what the journal annotations look like. In the text itself, a c marks a note at the beginning of a paragraph. A simple mouse-over reveals the forerunner thought Thoreau had in his journal writings.

Sometimes Thoreau copied himself exactly. At other times, he radically changed the sentence structure, retaining only the very kernel of the statement. It is possible he actually copied lines from an older version of Walden into the journal, and then back again, into a version closer to the final product…


THE CHALLENGES

Copyright regulations, an ethical commitment to avoiding plagiarism, etc., all slowed our work down. Unable to scan the journals or the Princeton Walden, we were stuck with hard labor when a technical solution was right at our fingertips.

At the beginning of the course (and project), we had only a plain, basic understanding of markup systems. As a result, a not insignificant portion of our group project time was spent learning various coding languages; the attributes and values and tags; etc., etc. Ultimately, though, Joe Easterly’s expertise and work in encoding and script-writing helped the project glide along when it came to the markup stages.

Due to the digital/coding aspect of the course, we focused more on doing something with the text than on saying something about it. Yet although we did not focus on critical analysis of Thoreau’s authorship through time, our own journey of learning tools and building systems resembled not only what you can do with our project’s outcome, but Walden itself. Some may question whether “digital humanities” is an oxymoron, two incompatible things forcefully juxtaposed; let our experience show that it is not. Our plan was not literary analysis using digital tools, but upon reflection it ended up resembling just that.

Planning, planning, planning! While it may be said, especially of humanities projects, that too much planning can land one in developmental hell, it is equally true that too little planning can lead to dead ends faster than Henry David Thoreau would eat a woodchuck if he could catch it (answer: instantly devoured raw, of course). But one must be patient and precise, and, again, know that humanities projects are some 75% planning, and just 25% action. If you are stuck as we were, or even fail outright, learn from others and from yourself.


THE FUTURE

Much work remains to be done for Digital Thoreau, and for the Fluid-Text edition specifically.

For instance, in assigning tags to the dating of journal passages, we discovered that some of the dating of the various editions (a-g) may in fact be wrong. Of course, the assignment of writings to “editions” (rough drafts) of Walden was always an “imagined” discipline, as Thoreau did not leave separate notebooks with complete versions in them, but instead wrote over older versions in different inks, etc. However, this project will lead into that one, toward a more accurate chronology of Walden’s writing process.

As for the Fluid-Text edition itself, the project is still ongoing. Not quite all of our cross-references were finished in time to be added, and formatting choices are still to be updated for the additional section of the journal notes.

Finally, as for you, and us, after a long journey, it’s time to actually relax and study the outcome of our project, comparing the annotations and cross-referencing Walden passages.


DIGITAL HUMANITIES 

So… what’s the point of all this? Why do all this work when we students have access to the journals in college libraries? The goal, of course, is to harness the awesomely democratic power of the Internet, powered by our digital tools and digital thinking. The TEI standard, XML, Oxygen– all these and more were utilized in this single project to make available the pre-Walden musings of our favorite transcendentalist, Henry David Thoreau. And of course, “transcendentalism” is just another pre-digital handle for “digital humanities,” the global project of freeing knowledge from the dusty recesses of collegiate libraries for everyone, everywhere.

“To my astonishment I was informed on leaving college that I had studied navigation!–why, if I had taken one turn down the harbor I should have known more about it…”

Thoreau was very critical of the times he lived in: their politics, society, and education. He advocated for individualism, but, contrary to popular belief, he didn’t spend two years in solipsistic seclusion. He not only remarked on culture and life, but actively observed it, often passing into the town of Concord. “Life in the Woods,” then, is not a literal demand to be hermitic, but a way of thinking.

“What sort of space is that which separates a man from his fellows and makes him solitary? I have found that no exertion of the legs can bring two minds much nearer to one another…”

The progress in the spheres of digital humanities is the new Wood to gather in. Thoreau’s writings are packed with thoughts on contemporary matters. Transcendentalism is a good place to start for explaining the purpose of digital humanities, but it is really only the very beginning of its enormous potential. And the only way to explore this new world is to dive right in. You can’t just walk in the woods a few times; you have to live there, to get your hands dirty and understand how, and not just what.

“But lo! men have become the tools of their tools…”

So join us! Code! Not only have we given readers the digital version of Thoreau’s pre-Walden thoughts, but we have attempted to show how we built it. Digital humanists must not be intimidated by, aloof from, ignorant of, or even deprived of the tools: how to use them, and how to build them ourselves. Go a bit past merely using digital tools and the Internet (to view cat videos, no doubt) and see how such things are constructed. Do not become a one-shot, funny-cat-photo-poster on the web. Or, as Thoreau wrote of the Collinses’ cat: if you don’t adapt to changing society (improved by you, of course), you will turn completely wild, without a fine balance between thinking and doing, solitude and society, literary and digital tools, and “so become a dead cat at last.” Don’t be a dead cat. Be a digital humanist!

“If you have built castles in the air, your work need not be lost; that is where they should be. Now put the foundation under them.”

*All Walden quotations from the Fluid-Text edition


Project Members:
Andrew Nauffts, Matthew Spitzer, Victoria Salazar,
Kyle Parnell, CJ Ferraro, Cob O’Brien, Joe Easterly

SUNY Geneseo ENGL340: “Literature and Literary Study in the Digital Age”
Spring 2014

Special thanks to:
• Dr. Paul Schacht, our Professor
• Beth Witherell, Ron Clapper, and others for their work on Thoreau

NOT_DC_DH
Courtesy of digital humanist and Thoreauvian fanboy, Dr. Schacht

Resources:

• Introducing the TEI Guidelines

• A Gentle Introduction to XML

• Free coding lessons online

• Free 30-day trial of Oxygen XML Editor

• The Social Reader’s Text edition of Walden

• This blog post by Dr. Paul Schacht with a video explaining the Fluid-Text in general

• Various blogs on coding, digital humanities, etc.: including this one, Coding Horror



 

The Gettysburg Address as Fluid Text

Digital Thoreau’s “fluid text edition” of Henry D. Thoreau’s Walden is so named in reference to John Bryant’s 2002 book The Fluid Text: A Theory of Revision and Editing for Book and Screen. Every text is fluid, Bryant suggests, insofar as it represents not the definitive articulation of a fixed intention but rather one entry in the record of an author’s evolving and shifting intentions. The full record of those intentions would involve, at a minimum, all of the author’s drafts, and perhaps even information about authorial decisions in flux between the moment a pen is raised and the moment it touches paper.

Some texts are more obviously fluid than others because we have more information about their genesis. Such is the case with Walden. And some are fluid not only because of their pre-publication but also their post-publication history. An example of the latter is Lincoln’s “Gettysburg Address.”

The fluidity of this famous speech briefly became a matter of lively public discussion in 2013, its sesquicentennial year, when conservative media outlets expressed outrage over a recording of it made by President Obama. The reason for the outrage? Obama’s omission of the words “under God” from the final sentence.

As it turned out, Obama had given an historically faithful reading of one of the address’ five versions: the so-called “Nicolay copy,” sometimes referred to as the “first draft” of the address because it’s the earliest surviving manuscript copy and may have been the copy from which Lincoln read at the cemetery’s dedication on November 19, 1863. At the request of Ken Burns, Obama recorded the Nicolay copy as part of Burns’ Learn the Address project, which encourages “everyone in America to video record themselves reading or reciting the speech” — and which might remind us that the post-publication fluidity of some texts (most obviously, perhaps, speeches and plays) is partly a consequence of their having been intended for performance.

Google Cultural Institute has a nice timeline of the address’s interesting textual history. It draws largely from the House Divided Project, a digital humanities Civil War project at Dickinson College to which Dickinson undergraduates have contributed, and where you can read all five drafts.

The Gettysburg Foundation also provides transcriptions of the five versions, highlighting the differences between them in boldface.

In ENGL 340 tomorrow, we’ll take these five versions and encode the differences between them in XML, using the critical apparatus tagset of TEI. Then we’ll display them side by side using the Versioning Machine and — if time allows, and if all goes well — Juxta in order to see how visualization tools can help us understand the fluid nature of one of our nation’s most important texts.
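
For readers who want a preview of what that encoding looks like, here is a minimal Python sketch that wraps variant readings in the TEI critical apparatus elements app and rdg. The make_app helper and the witness labels "nicolay" and "bliss" are illustrative assumptions on our part, and only the single “under God” variant discussed above is shown.

```python
import xml.etree.ElementTree as ET


def make_app(readings: dict[str, str]) -> ET.Element:
    """Wrap variant readings in one <app>, with one <rdg> per distinct reading."""
    # Group witnesses that share the same wording.
    grouped: dict[str, list[str]] = {}
    for siglum, text in readings.items():
        grouped.setdefault(text, []).append(siglum)

    app = ET.Element("app")
    for text, sigla in grouped.items():
        # The wit attribute holds space-separated pointers to the witnesses.
        wit = " ".join(f"#{siglum}" for siglum in sigla)
        rdg = ET.SubElement(app, "rdg", {"wit": wit})
        rdg.text = text
    return app


# Hypothetical witness labels; an empty string marks an omission.
readings = {"nicolay": "", "bliss": "under God"}
app = make_app(readings)
print(ET.tostring(app, encoding="unicode"))
```

Each point of variation gets its own app element, and a display tool such as the Versioning Machine can then follow the wit pointers to line the versions up side by side.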

 

Literature and Literary Study in the Digital Age, Spring 2014 Edition

“Literature and Literary Study in the Digital Age” is in its third incarnation at SUNY Geneseo as ENGL 340 – a brand new course with its own place in the line-up of offerings under Geneseo’s new English major. The course began as HONR 206, Digital Humanities in Spring 2011, and was offered twice as ENGL 390, first as Studies in Literature: Literature in the Digital Age and then as Literature and Literary Study in the Digital Age.

The latest iteration of the course has its home in this space, a group blog for all students and faculty at Geneseo interested in digital humanities. The blog is part of a larger community organized as English @ SUNY Geneseo, a community powered by the open-source blogging platform WordPress and the open-source plugin Commons In A Box.

If you’re a student in the course, this is where you’ll be blogging this semester, following guidelines you’ll find on the page How to Blog Here. If you’re not a student in the course but you’ve joined the group, please join the conversation and follow the same guidelines.

This year the course coincides with the rollout of two projects at Digital Thoreau, a collaboration among SUNY Geneseo, the Thoreau Society, and The Thoreau Institute at the Walden Woods Project. The two projects are Walden: A Fluid Text Edition and The Readers’ Thoreau. Students in previous iterations of the course have contributed to both projects, and this year’s group will use the projects as resources and carry them further, while also continuing work at Digital Thoreau’s third project, The Days of Walter Harding, Thoreau Scholar.

I’m posting here from San Marino, California, where I’ve just spent the past two days in the Huntington Library with two Thoreau scholars who’ve been instrumental to Digital Thoreau and to the development of this course: Ron Clapper and Beth Witherell. We’ve been looking together at HM 924, the manuscript of Walden. More on that to come.