Understanding text as “data” and accepting its fluidity

The survival of the humanities in the digital age relies upon the understanding that text is data: specific information that is carefully packaged for our analysis. In my experience as an English major, analysis is typically undertaken as a quest for interpretive meaning, bringing to the fore questions of symbolism and literary devices, like: “What does the green light symbolize in The Great Gatsby?” and “How does Fitzgerald’s diction convey this particular meaning, and not another?”

After I had focused primarily on qualitative approaches to literary analysis, ENGL-340 introduced me to the quantitative analysis of text as data. With quantitative analysis, each bit of information is examined without projecting more meaning than what's provided. An example of quantitative analysis applied to The Great Gatsby might be: "How many times are the eyes of T. J. Eckleburg mentioned? Where are they mentioned most often?"

ENGL-340 also introduced me to text as a corpus of data, or corpora if there's more than one distinct corpus. We apply quantitative analysis to one body of text, or a collection of related texts, in search of patterns. Patterns are the keystone of meaning according to quantitative analysis, especially when examining a large corpus. For instance, you might wonder: "What patterns can we discern across classic novels written by the modernist 'Lost Generation'?" These patterns might materialize as the repeated use of a word, theme, or structural organization. From there, you might imagine why the great writers of the Lost Generation made similar, or different, stylistic choices.

Incorporating this mode of analysis into our literary approach is crucial because it bridges the perceived gap between the humanities and the sciences. Human thoughts crafted into words crafted into sentences crafted into coherent bodies of text are historical objects of information worthy of scientific analysis. They are markers of humanity’s achievements: some of the greatest, most timeless self-knowledge we’ve touched upon as a species is found in literature. Why not examine its concordance, then? Some tips for quantitative analysis that I picked up in ENGL-340, and which I hope to bring to future classes, are:

  • Searching a corpus for patterns at the command line: using a digital copy of the text you're analyzing and the command line, you can search the text for the occurrence rate of specific words, for a total word count, for the longest and shortest words in the text, for the word that occurs most often, and so on (see the sketch after this list).
  • Generating graphs and other visuals for your data on Voyant: rather than using the command line, you can upload textual data to VoyantTools.org and view its quantitative analysis there. The cool thing about Voyant is that it generates different charts, graphs, and visuals of the data you're focusing on.
  • Comparing data across versions using a Fluid Text Edition: here, we can analyze how data in a corpus has changed or stayed the same over time.
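
A minimal sketch of what those command-line searches might look like, assuming a plain-text copy of the novel saved as gatsby.txt (the filenames here are hypothetical, not files from the course):

    # Count how many times "Eckleburg" appears, ignoring case
    grep -io "eckleburg" gatsby.txt | wc -l

    # Total word count of the text
    wc -w gatsby.txt

    # The ten most frequent words: split into one word per line,
    # lowercase everything, then sort, count, and rank
    tr -cs '[:alpha:]' '\n' < gatsby.txt | tr '[:upper:]' '[:lower:]' | sort | uniq -c | sort -rn | head -n 10

    # The five longest words, found by prefixing each word with its length
    tr -cs '[:alpha:]' '\n' < gatsby.txt | awk '{ print length, $0 }' | sort -rn | head -n 5

    # And for the fluid-text comparison in the last bullet: a line-by-line
    # diff between two (again hypothetical) files holding different drafts
    diff walden_draft_a.txt walden_draft_g.txt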

The second necessary development in my perception of literature as a result of ENGL-340 is tied to this final bullet point. First, there’s understanding text as data. Then, there’s understanding that text is fluid: not a fixed and stable object, but rather, an ongoing project.

Our favorite books, poems, and plays were formed gradually and with great care, not suddenly and miraculously crystallized. If we have access to earlier manuscripts or revised copies, we can compare versions of a text to better understand its development. Referring to a fluid text edition also brings awareness to the humanity of the author: reading Thoreau's first draft of Walden, we are reminded of the flaws inherent in everyone's first draft. It also serves to remind us, quite fittingly for Thoreau, that everything in the literary world is done with some degree of deliberation. Individual words are chosen with extreme care: across various versions, Thoreau alternated among the words "book", "work", and "lecture" when describing Walden before finally settling on "book". The attention paid to such minute matters reveals Thoreau's dedication to "getting it right" as a writer, a quality not all writers possess.

Thanks to ENGL-340, I am able to see the oft-overlooked juncture between qualitative and quantitative analysis in literary studies. With attention to the nature of text as "fluid data," we can study the information before us both objectively and subjectively. Subjectivity particularly applies to imagining plausible revision narratives that explain changes made to a text. Of course, quantitative data is also fodder for further interpretation. Information, writing, art, life: I think James Gleick once called it a "moving target," and I find that more than adequate. These things are constantly in flux as a result of being alive; final stability is like death, or whatever name you give to completion. Just as we learn from one another, we learn from art as we observe and analyze its change over time, its variations and constancies. In this way, ENGL-340 enhanced my understanding of literature and life.

The word stripped to essential bits

Our thoughts shape the way we perceive the world around us. Communication as a verbal transaction of information was a necessary instrument for survival long before anyone dreamed of communicating long-distance. But the birth of written language changed the way humans thought: suddenly there was a logical order to things, with rules to follow; things had proper names and spellings, for instance. Abstract concepts became available to literate minds, laying figurative fertilizer on the ground from which knowledge grows. The written word is the starting point for information technologies, but it is certainly not the conclusion, if there is to be one.

Written language introduced an unprecedented permanence in information by rendering the spoken word tangible. This ability for a message to transcend its historical situation is the fascination of the humanities, where we read the words of humans from all ages, thereby existing briefly outside time ourselves. The written word allowed us to process information in a new and different way. For the first time, we could think about the language and organization of our thoughts through the reflective lens of the written word: using symbols to signify meaning, preserving information that might otherwise escape us if it existed in thought alone. It's no wonder that after a certain point, even the written word had to evolve to keep up with human needs.

Early on in The Information, James Gleick writes: "Every new medium transforms the nature of human thought. In the long run, history is the story of information becoming aware of itself" (12). As humans engaged in the collaborative, continual pursuit of better ways to encode, share, and store information, we are tasked with re-inventing language to suit our changing informational needs. As an English major, it's hard for me to part with the alphabet, or even to imagine a world in which the alphabet becomes obsolete. Even though I only speak one language, letters (signifiers) that form words (symbols) are familiar and recognizable across most Romance languages. But words often fail us in the transmission of information: there are paradoxes and redundancies, issues of brevity and clarity. As much as I love the occasionally frustrating intricacies of language, I now realize the historical and humanitarian necessity for ways of encoding information so that it is universally accessible.

Information can be stripped down even further than words and letters to symbols and signs, the domain of math. Part of me desperately wanted to keep my romantic notion of English language and the (seemingly) institutionalized jargon of math language separate, but my wiser self recognizes that that mental separation is contrived. I'm making an effort to wean myself off my inherent bias against mathematical language. As ironic as it seems, the simplicity of numbers makes them the more effective means of spreading information. Numbers, like letters, are signs in a system of patterns. Patterns don't necessitate specific symbols, either. In its purest form, information is no greater than symbols signifying meaning.

Our common currency for information is letters and numbers, but it can be broken down even further. In binary code, information is reduced to just two states, such as the presence or absence of an electrical current in a telegraph wire. Considered as an informational language, binary code seems drastically over-simplified for expressing complex theory and nuance. But as an instrument for spreading information, it has almost unquantifiable power: "Signs and symbols were not just placeholders; they were operators, like the gears and levers in a machine. Language, after all, is an instrument" (165). Unlike any language before it, binary code made information effectively unlimited: any message could be efficiently encoded and transmitted throughout networks. As Gleick reiterates, encoding between languages made it possible: "The Morse scheme took the alphabet as a starting point and leveraged it, by substitution, replacing signs with new signs. It was a meta-alphabet, an alphabet once removed. This process—the transferring of meaning from one symbolic level to another—already had a place in mathematics" (152). 'An alphabet once removed' sounds particularly apt, because that's at the heart of what we're doing: encoding is moving between different levels of symbolic meaning.
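
To make "replacing signs with new signs" concrete, here is a toy illustration in the same command-line spirit as before (a sketch only: the mapping covers just two letters, and the timing and spacing conventions of real Morse code are simplified away):

    # Substitute Morse signs for letters: an "alphabet once removed"
    echo "sos" | sed -e 's/s/... /g' -e 's/o/--- /g'
    # prints: ... --- ...

    # The new signs can themselves be re-encoded, one more level removed,
    # mapping dots to 1s and dashes to 0s
    echo "sos" | sed -e 's/s/... /g' -e 's/o/--- /g' | tr '.-' '10'
    # prints: 111 000 111

Each pipe transfers meaning from one symbolic level to another, which is exactly the process Gleick describes.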

As information becomes increasingly self-aware, the potential for levels of meaning expands infinitely. Computing and the humanities overlap because the 'transferring of meaning', or encoding, between one symbolic level (e.g., the alphabet) and another (e.g., binary code) occurs in conjunction with human progress, and helps us meet a growing need to more successfully communicate, store, and preserve information for future humanity.