Altmetric, altmetrics, data visualisation, datasets, Digital Humanities, DITA, Geoffrey Rockwell, infographics, information architecture, information overload, Jacob Harris, Julie Meloni, Stefan Sinclair, Stephen Ramsay, TAGS, text analysis, Twitter, Voyant Tools, Wordle
The title of this blog post is not flippant, but is in fact taken from a chapter of a recently published book, entitled The Hermeneutics of Screwing Around; or What You Do with a Million Books, by Stephen Ramsay. Ramsay discusses how the concept of epistemological order (in his words, the “coherent, authoritative path through what is known”) has evolved over time, and has recently been transformed entirely by the emergence and insatiable growth of the Internet. The essential premise of our current situation is that there is now far too much information, even on the most specialised subjects, for any one person to take in during his or her lifetime; however, new digital technologies allow us to explore that information in alternative ways.
One of these ways is text analysis. This involves feeding a corpus, or body, of text(s) through a computer programme in order to surface patterns that would not be apparent at first glance, or that would otherwise be prohibitively expensive and time-consuming to find by human effort alone. For instance, a basic method of text analysis is counting the frequency of the words that appear in a text; this information, which takes only a few seconds to compute, even for a corpus of thousands of words, can be displayed visually in the form of a word cloud. The word cloud currently produced by the RSS feed of this blog (which excludes the older posts) looks like this:
Wordle, however, is limited in other ways. I can demonstrate this by using the archive of data that I began exporting from Twitter a few weeks ago using TAGS (which has since been growing, as additional Tweets matching the criteria are automatically added every hour). Feeding the corpus consisting of the Tweets (all of which feature the hashtag #citylis, remember) into Wordle produces this word cloud (using exactly the same aesthetics as before):
This unwelcome result has occurred because of Wordle’s customisation limitations. Text analysis normally makes use of stopwords (extremely common words such as “the”, “a”, “and” and so forth), which are not included in the computation. Wordle’s default stoplist does omit these words, but it cannot be edited to suit the needs of each individual word cloud; hence, this one is (as you would expect) dominated by the hashtag “citylis” that links the dataset together, whilst other Twitter jargon such as “RT” (retweet) and “MT” (modified tweet) is also visible.
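To make the role of an editable stoplist concrete, here is a minimal Python sketch of word-frequency counting. It is not how Wordle or any other tool works internally; the function name `word_frequencies`, the very short default stoplist, and the three sample Tweets (including the handles `@alice` and `@bob`) are all invented for illustration.

```python
import re
from collections import Counter

# A deliberately tiny default stoplist (an assumption for this sketch);
# real tools ship much longer lists.
DEFAULT_STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "at", "for", "on"}


def word_frequencies(texts, extra_stopwords=()):
    """Count word frequencies across texts, skipping stopwords.

    extra_stopwords lets the caller extend the stoplist, which is the
    customisation step discussed above.
    """
    stopwords = DEFAULT_STOPWORDS | {w.lower() for w in extra_stopwords}
    counts = Counter()
    for text in texts:
        # Lower-case the text, then keep runs of letters, digits and apostrophes.
        # This drops punctuation, including the leading "#" and "@".
        tokens = re.findall(r"[a-z0-9']+", text.lower())
        counts.update(t for t in tokens if t not in stopwords)
    return counts


# Invented sample data standing in for the exported Tweets.
tweets = [
    "RT @alice: Great session on text analysis at #citylis",
    "MT @bob: Trying Voyant Tools for text analysis #citylis #dita",
    "Text analysis lab at the #bl today #citylis",
]

# With only the default stoplist, the shared hashtag is among the top words.
print(word_frequencies(tweets).most_common(3))

# Adding the hashtag and the Twitter jargon to the stoplist removes them.
filtered = word_frequencies(tweets, extra_stopwords=["citylis", "rt", "mt"])
print(filtered.most_common(3))
```

The second call is the step that a fixed stoplist rules out: once “citylis”, “RT” and “MT” are excluded, the remaining counts reflect what the Tweets are actually about.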
Fortunately, other, more advanced, tools exist for text analysis. The best of those that I have come across so far is Voyant Tools (initially given the unfortunate name of Voyeur Tools), which is still under development by its creators, Geoffrey Rockwell and Stefan Sinclair, but offers significantly more options. Using the same citylis corpus and removing the unwanted stopwords produces a more meaningful visualisation, which clearly shows the most prolific Tweeters and the most common other hashtags used:
Voyant’s tools also extend beyond the word cloud, including a full text reader, a word trends graph, and the ability to locate individual words and see them in context (a keyword-in-context view, or concordance), as can be seen from the screenshot of the entire interface below. This has the potential for more extensive and meaningful analysis: for example, the word trends graph clearly shows a spike in activity for the words “lab” and “#bl”; my classmate, Daniel van Strien, explains why here. Another of my classmates, Rachel Giles, has also used Voyant’s ability to customise stopwords to produce more meaningful information about her dataset of cancer-related health journal articles from Altmetric.
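The idea of locating a word and seeing each occurrence alongside its neighbours can be sketched in a few lines of Python. This is only an illustration of the general technique, not Voyant’s own implementation; the function name `keyword_in_context`, the `width` parameter and the sample sentence are my own assumptions.

```python
import re


def keyword_in_context(text, keyword, width=3):
    """Return each occurrence of keyword with up to `width` words either side."""
    # Keep a leading "#" or "@" so that hashtags and handles stay intact.
    tokens = re.findall(r"[#@]?\w+", text.lower())
    keyword = keyword.lower()
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            # Mark the keyword with square brackets so it stands out.
            lines.append(f"{left} [{token}] {right}".strip())
    return lines


# Invented sample text for illustration.
sample = "The lab session at the #bl was busy. Another lab is planned for next week."
for line in keyword_in_context(sample, "lab"):
    print(line)
```

Reading the occurrences side by side like this restores some of the context that a bare frequency count throws away.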
This is only scratching the surface of text analysis, and I will post more on the subject next week as we explore the closely related technique of text mining. What is particularly exciting about text analysis is that it lends itself best to humanities research due to its focus on the written word, but is the complete opposite of the traditional “close-reading” approach due to its use of large datasets. The humanities have traditionally been exclusively qualitative in nature, and so using a quantitative method to analyse the source material has the potential to open up a new wealth of opportunities for research and data visualisation, collectively known as the Digital Humanities. One interesting and freely accessible example is this project on The Lord of the Rings, which has analysed the complete text of the novel (plus J.R.R. Tolkien’s other works in some cases) to produce a range of relevant infographics.
I will nevertheless end on a note of caution. Analysing information in this way risks losing the context that underlies it, and visually appealing infographics such as word clouds can be used when they are not necessary, at the expense of a more informative presentation, as Jacob Harris demonstrates with a comparison using the same dataset. As with several other aspects of this module, therefore, I would advise that this technique be used to supplement more traditional methods of achieving the same end, not to replace them.