Researching tweets (the Twitter kind)

In April 2010, the US Library of Congress made a deal with Twitter (a microblogging service where people chat, or tweet, in 140 characters or fewer) to acquire all public tweets from 2006 to April 2010 and, from April 2010 onward, all subsequent public tweets (the deal is mentioned in my April 15, 2010 posting [scroll down about 60% of the way]). Reading between the lines of the Library of Congress’ Jan. 2013 update/white paper, the job has proved even harder than they originally anticipated,

In April, 2010, the Library of Congress and Twitter signed an agreement providing the Library the public tweets from the company’s inception through the date of the agreement, an archive of tweets from 2006 through April, 2010. Additionally, the Library and Twitter agreed that Twitter would provide all public tweets on an ongoing basis under the same terms. The Library’s first objectives were to acquire and preserve the 2006-10 archive; to establish a secure, sustainable process for receiving and preserving a daily, ongoing stream of tweets through the present day; and to create a structure for organizing the entire archive by date. This month, all those objectives will be completed. To date, the Library has an archive of approximately 170 billion tweets.

The Library’s focus now is on confronting and working around the technology challenges to making the archive accessible to researchers and policymakers in a comprehensive, useful way. It is clear that technology to allow for scholarship access to large data sets is lagging behind technology for creating and distributing such data. Even the private sector has not yet implemented cost-effective commercial solutions because of the complexity and resource requirements of such a task. The Library is now pursuing partnerships with the private sector to allow some limited access capability in our reading rooms. These efforts are ongoing and a priority for the Library. (p. 1)

David Bruggeman, in his Jan. 15, 2013 posting about this Library of Congress ‘Twitter project’, provides some mind-boggling numbers,

… That [170 billion tweets] represents the archive Twitter had at the time of the agreement (covering 2006-early 2010) and 150 billion tweets in the subsequent months (the Library receives over half a million [the white paper says nearly half a billion] new tweets each day, and that number continues to rise). The archive includes the tweets and relevant metadata for a total of nearly 67 terabytes of data.

Gayle Osterberg, Director of Communications for the Library of Congress, writes in a Jan. 4, 2013 posting on the Library of Congress blog about ‘tweet’ archive research issues,

Although the Library has been building and stabilizing the archive and has not yet offered researchers access, we have nevertheless received approximately 400 inquiries from researchers all over the world. Some broad topics of interest expressed by researchers run from patterns in the rise of citizen journalism and elected officials’ communications to tracking vaccination rates and predicting stock market activity.

The white paper/update offers a couple of specific examples of requests,

Some examples of the types of requests the Library has received indicate how researchers might use this archive to inform future scholarship:

* A master’s student is interested in understanding the role of citizens in disruptive events. The student is focusing on real-time micro-blogging of terrorist attacks. The questions focus on the timeliness and accuracy of tweets during specified events.

* A post-doctoral researcher is looking at the language used to spread information about charities’ activities and solicitations via social media during and immediately following natural disasters. The questions focus on audience targets and effectiveness. (p. 4)

At least one of the reasons no one has received access to the tweets is that a single search of the archived (2006-2010) tweets alone would take 24 hours,

The Library has assessed existing software and hardware solutions that divide and simultaneously search large data sets to reduce search time, so-called “distributed and parallel computing”. To achieve a significant reduction of search time, however, would require an extensive infrastructure of hundreds if not thousands of servers. This is cost prohibitive and impractical for a public institution.

Some private companies offer access to historic tweets but they are not the free, indexed and searchable access that would be of most value to legislative researchers and scholars.

It is clear that technology to allow for scholarship access to large data sets is not nearly as advanced as the technology for creating and distributing that data. Even the private sector has not yet implemented cost-effective commercial solutions because of the complexity and resource requirements of such a task. (p. 4)

David Bruggeman goes on to suggest that, in an attempt to make the tweets searchable and more easily accessible, all this information could end up behind a paywall (Note: A link has been removed),

I’m reminded of how Ancestry.com managed to get exclusive access to Census records.  While the Bureau benefitted from getting the records digitized, having this taxpayer-paid information controlled by a private company is problematic.

As a Canuck and someone who tweets (@frogheart), I’m not sure how I feel about having my tweets archived by the US Library of Congress in the first place, let alone the possibility I might have to pay for access to my old tweets.