Data-Mining Walden: Tools for Literary Analysis

Henry David Thoreau had a fraught relationship with technology. As we discussed in our presentation, it is difficult to tell whether he would be on board with our digital projects regarding his work. What we can say for sure is that the technology we have engaged with this semester has allowed us to read his book, Walden, as deliberately and as reservedly as it was written. By apprehending his text in the digital dimension, we achieved new insights into the way Thoreau thought about place and how he crafted his thoughts into writing.

Melissa, Sean, Cal, and Emma each took a chapter to mine in order to track the language of place and its development throughout the text. This required downloading and installing some software with the help of Kirk Anne and Dr. Schacht. Brianne worked on answering the “so what?” question by analyzing the data collected by the other group members. We worked with the Natural Language Toolkit (NLTK) and spaCy, both of which allowed us to mine for certain words and types of words. However, each proved to have its own limitations within each chapter. We found that spaCy was better suited to Cal’s mining of “The Ponds,” whereas NLTK was more helpful for Melissa, Sean, and Emma.
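To give a sense of what mining a chapter for certain words can look like, here is a minimal sketch. It is not our actual notebook code: it swaps NLTK and spaCy for Python's standard library so that it runs without any installation, and both the PLACE_WORDS list and the sample sentence are invented for illustration.

```python
import re
from collections import Counter

# Hypothetical, hand-picked vocabulary (an assumption for this sketch,
# not the word list the group actually used).
PLACE_WORDS = {"pond", "shore", "woods", "village", "house", "concord", "walden"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def count_place_words(text, vocabulary=PLACE_WORDS):
    """Count how often each vocabulary word appears in the text."""
    return Counter(tok for tok in tokenize(text) if tok in vocabulary)


# Invented sample sentence standing in for a chapter of the book.
sample = "The pond lay beyond the woods. From the shore of the pond I could see the village."
counts = count_place_words(sample)

# Print the place words from most to least frequent.
for word, n in counts.most_common():
    print(word, n)
```

A fixed word list like this only catches exact matches, so plurals such as "ponds" would be missed. Handling that sort of variation is part of what a toolkit like NLTK or spaCy is for.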

Zooming out, data mining a text such as Walden did not come without challenges. Whether we worked on the virtual machine or the local server, Python proved to be a very demanding language, one with a steep learning curve that kept us guessing much of the time. Similarly, NLTK and spaCy had to be downloaded directly to our devices in order to accomplish the task at hand. It became clear that while digital tools can often make reading easier, learning the tools necessary to do so is anything but simple. Still, in grappling with the limitations of all of our tools, we seemed to be simultaneously addressing larger questions about the utility of technology, just as Thoreau does in Walden.

Nevertheless, the technology proved indispensable for our project because it helped us to expedite the mining/reading process. Python, the language we used to learn more about Walden, allowed us to operate on the text, while spaCy and NLTK provided a bank of resources that we could apply to the chapters we each chose. Each tool gave us a general sense of place, which we followed up with closer readings. We were able to distinguish clearly between the broadly spatial chapters (“The Village” and “House-Warming”) and the specifically geographic ones (“The Ponds” and “Conclusion”). Whether he was talking about physical places or metaphorical spaces, as in headspace, Thoreau constantly framed his thinking through place-specific language. This sort of “mapping” truly makes Thoreau into the “Surveyor of the Soul” that Huey Coleman claims him to be. His attention to the local and the distant, from Concord to Siberia, demonstrates both the interconnectedness that technology in the 19th century was making possible and the expansive reach of an inner geography, a soul whose territory outran the map.
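The distinction between broadly spatial and specifically geographic language can be sketched as a comparison of rates. The example below is an illustration under stated assumptions rather than a record of our method: the two term lists, the chapter labels, and the sample sentences are all invented, and it uses only Python's standard library in place of spaCy or NLTK. It reports how many words per thousand fall into each category, so that passages of different lengths can be compared.

```python
import re

# Hypothetical term lists, chosen only for illustration.
SPATIAL_TERMS = {"house", "room", "door", "street", "village", "road"}
GEOGRAPHIC_NAMES = {"concord", "walden", "siberia"}


def rate_per_thousand(text, terms):
    """Return how many tokens per 1,000 belong to the given set of terms."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for tok in tokens if tok in terms)
    return 1000 * hits / len(tokens)


def place_profile(chapters):
    """Map each chapter label to its spatial and geographic rates."""
    return {
        label: {
            "spatial": rate_per_thousand(text, SPATIAL_TERMS),
            "geographic": rate_per_thousand(text, GEOGRAPHIC_NAMES),
        }
        for label, text in chapters.items()
    }


# Invented sentences standing in for two chapters.
chapters = {
    "sample A": "I walked down the road to the village",
    "sample B": "Walden is far from Siberia",
}
profile = place_profile(chapters)

for label, rates in profile.items():
    print(label, rates)
```

Single-word lists cannot capture two-word names such as "White Pond," which is one reason a named-entity tool is the better fit for a chapter dense with proper place names.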

Just as some of Thoreau’s themes exceed the scope of a geographically specific reading, so too did our task at hand exceed the capabilities of some of our tools. One thing our group really wanted to stress in our presentation is the importance of validating failure in digital projects. All of the setbacks, miscues, and limitations we faced in engaging with Jupyter Notebook, Atom, Python, Anaconda, spaCy, NLTK, and beyond were as useful to thinking about the digital humanities as our successes with each of these tools. When we encountered errors in our work, we were forced to ask why. This moment of self-reflection was critical for doing digital work because of the knowledge that stood to be gained by asking questions about the tools. Because we came to this class with a variety of digital backgrounds, it was very important that we moved as a unit. Fortunately, the tools we used lent themselves well to collaboration and, ultimately, this project became about creating our own community space around Walden.

From his comparative measures of White and Walden Ponds, to his rambles through Concord, to his building of a house in the woods, and his reflections on place inward and outward, Thoreau was constantly attuned to the language of place. We too were attuned to language, constantly seeking the instances of geography in his text by moving through it digitally. Just as Thoreau spatializes his world in Walden, so too do we attend to space by tracking its relative importance throughout the book. By using digital tools we were able to read Walden collectively, collaboratively, effectively, and deliberately.

Democracy and Digitization

Like the human brain or the deepest parts of the ocean, the potential for discovery in the digital age seems boundless, especially to someone new to computing like me. Literature and Literary Study in the Digital Age has provided me with keys to locks on doors that I never even knew existed. The technical tools and languages fascinate me, how they command my computer to do things I never thought possible. However, I want to focus on how these technical things build a sort of digital democracy and how this might act as a model for other social environments. We have learned that most of what makes the internet work is open source and free to use/observe. Granted, editing the web can be limited by administrative privileges, but if I learned anything, it is that I am more in control than I thought when it comes to shaping my computing experience.

Applying these technical tools and concepts, The Reader’s Thoreau is the best example of the sort of democracy I am talking about. This community, in which Thoreauvians can exchange questions and ideas about his works, is a microcosmic formation of democracy made possible by the computer. Apprehending a plain text version of Walden, raw and unbound from the material book, allows readers access to the words at a level beyond that of the book. Plain text and plain-text editing with XML or HTML make things like CommentPress possible. Digitizing Walden has not only brought the text to more readers, it has engaged them in conversations with other readers. Here, then, is an example of how the technical can perform the conceptual, how digitization can democratize. After working with XML and HTML in the fall to digitize Yeats, I ultimately wanted all of my digital humanist work to surround this core issue, the democratization of information. Little did I know that the internet is set up perfectly for this type of work.

In my investigations of Lessig and Free Culture it became clear to me that computers are the backbone of what Lessig calls “remix culture.” Markup languages like XML and HTML are instructive and thus can produce and reproduce texts that shed new light on old words. Similar to riffing in music or stigmergy in organizational theory, these languages allow developers (citizens of the web) to repeat and revise content in new and interesting ways. Lessig writes, “democratic tools gave ordinary people a way to express themselves more easily than any tools could before” (33). Just like a camera, the computer allows people to take control of their reality, to revise and remix it to their liking. This makes the internet rich in texture and vibrant in culture. It reflects what is so good about democracy and it relies on technical copying and revision. This copying and revision happens, for us, at the command line, where we have been spending some time this semester. We can participate actively in the process of making and remaking by directly accessing our computers’ internal structure. Knowing the technical hierarchy gives each of us the chance to govern ourselves, which is both fundamental to democracy and vital to self-preservation in the hyper-surveillance culture we live in today.

True, the accessibility computers provide people can be used for harm. We are living in an era of “memetic warfare,” where hate can be propagated through the exact same methods of copying and revision. Open sourcing the internet is always at risk of this. Trolls on YouTube and Wikipedia will constantly disrupt the ideal digital democracy, just as corruption and scandal will plague our own democracy. However, the moment we attempt to purify this democracy by placing tight restrictions on spaces like Wikipedia and YouTube we sacrifice that very same democracy. In my directed study with Dr. Doggett, we are talking about this precise issue. The theorist we are reading, Slavoj Zizek, would say that to purify democracy is actually a totalitarian move. Thus, we must preserve the aberrations and deal with hate quickly and effectively. Wikipedia does this by running a “Talk” page alongside each entry, a separate HTML file for people to discuss and suggest changes to each page. It relies on a democratic schema to self-organize and create good.

Similarly, we have seen both sides of computer-as-society with The Reader’s Thoreau. We have engaged in a rich conversation about Walden all semester with each other and readers around the world. Blogging and commenting have fostered a community that exemplifies what we should strive for on and off the internet. We have also seen individuals penetrate the community looking to cause harm (I am referring to the woman asking for money). However, thanks to the self-organizing principles of the internet and some quick action from the site’s administrator, the community was able to move past this and get back to reading deliberately.

All of this has been made possible by a hyperlinked internet that allows users to move freely between data points and information. As Jeffrey Pomerantz points out, the potential of an HTML file is the precise reason why we have the internet. This is the underlying technical structure that makes the computer a democratic tool. Texts connect to other texts, which connect people to texts and people to people. This is probably the most important thing I will take from this class. The computer’s ability to convene more and different people around a text, inviting new perspectives always, intrigues me as a student and excites me as a person. I want to take the digital humanities into my education going forward, as it has proved so helpful in considering the ethics of writing, something I think about constantly. In short, the technicalities of digitization have prompted me to think in new ways about things that have always been important to me. By continuing in the pursuit of discovery, I will continue in my pursuit of democracy.

Thinking and Living with Computers: Making a Digital Humanist

I can remember a time when I believed computer science and the humanities represented what Stephen Jay Gould would call non-overlapping magisteria. In other words, the two fields emerged from completely different epistemic origins; they had little (if anything at all) to do with each other. This had to be true. I hated working with computers, I became easily frustrated doing so, and I felt inherently different from those of my peers who found computing so natural. The TI-84 on my trig class desk would taunt me for 40 minutes a day throughout all of 10th grade. Meanwhile, I felt at home in my literature and history classes. I loved books, both for their readability and their materiality. I enjoyed my copy of Grapes of Wrath for both the story and the pulpy pages themselves. Hence, I began to develop a sense that computers had simply no place in my humanist education and, likewise, it made sense that my STEM-focused peers would have such a distaste for reading books. I can remember this time because it was not too long ago. In fact, it wasn’t until last semester that I uncovered the deeply human nature of the device on which I type right now.

Working with Dr. Schacht last fall on a versioning project about W.B. Yeats’s later poetry not only made me more familiar with my computer; it granted me access to a whole new plane of thinking about language. Writing XML documents for this project in Atom and Oxygen created a discussion between my computer and Yeats’s manuscripts. In this way, computers can be Rosetta Stones, engaging different languages simultaneously to present new ways of expressing similar ideas. While I was never one for computer-based assignments, this kind of work reminded me of the fun I would have translating Virgil and Catullus in high school Latin. Both demanded delicacy and respect for the texts. Perhaps the most exciting prospect of this work, though, was the potential of expanding the accessibility of the humanist education.

There is a momentum to digital communication. Too often, books remain on shelves or in the backpacks of uninterested students. By bringing humanist work to the computer, the probability of it reaching more people skyrockets. With social platforms abounding, people will run into more and more content that (hopefully) reflects their interests, and the sharing can go on ad infinitum. The self-organizing aspect of some internet tools can admittedly be quite scary, and I am not even remotely close to grasping the behind-the-scenes activity of this kind of communication. However, I see a very democratic potential in all of this. One of my main focuses in creating a digital version of Yeats’s poetry was bringing the text to those who couldn’t access the pricey and rare Cornell Manuscript Series. This semester of work got me excited to do more investigating with my computer and ultimately prompted me to take English 340 this spring.

After a few short months of learning more than I had in the previous 20 years, I feel much more comfortable with my computer. However, I recognize the limitlessness of such an endeavor and realize that I may never master these skills, which, in a way, is why computing is so similar to the humanities. We don’t seek mastery of literature; rather, we read in order to read more; there is no endpoint. Similarly, the reading we’ve done until now will help us in the reading we look forward to doing. In learning XML I didn’t learn all coding and all codes, but I did come to understand and appreciate the symbolic nature of such languages, and learning one has certainly made learning the next easier. There is a logic to this. It is no mistake that themes in both my English classes and my STEM classes here at Geneseo can find their beginnings in that one philosophy class I took freshman year: Introduction to Logic.

It would be a fallacy to say that I am much more comfortable with computing only as a result of my humanities classes. Sure, literature helped me step into the cold water of this new way of thinking, but thinking of the two as overlapping has given me the confidence to dive deeper. Thus, while I may not always understand my computer, I am now all the more excited to try and figure it out. What I once saw as a walled-off territory of inaccessible knowledge I now see as a horizon that beckons for further exploration.