
Researching tweets (the Twitter kind)

In April 2010, the US Library of Congress made a deal with Twitter (the microblogging service where people chat, or tweet, in 140 characters) to acquire all the tweets from 2006 to April 2010 and, from April 2010 onward, all subsequent tweets (the deal is mentioned in my April 15, 2010 posting [scroll down about 60% of the way]). Reading between the lines of the Library of Congress’ Jan. 2013 update/white paper, it seems the job has proved even harder than they originally anticipated,

In April, 2010, the Library of Congress and Twitter signed an agreement providing the Library the public tweets from the company’s inception through the date of the agreement, an archive of tweets from 2006 through April, 2010. Additionally, the Library and Twitter agreed that Twitter would provide all public tweets on an ongoing basis under the same terms. The Library’s first objectives were to acquire and preserve the 2006-10 archive; to establish a secure, sustainable process for receiving and preserving a daily, ongoing stream of tweets through the present day; and to create a structure for organizing the entire archive by date. This month, all those objectives will be completed. To date, the Library has an archive of approximately 170 billion tweets.

The Library’s focus now is on confronting and working around the technology challenges to making the archive accessible to researchers and policymakers in a comprehensive, useful way. It is clear that technology to allow for scholarship access to large data sets is lagging behind technology for creating and distributing such data. Even the private sector has not yet implemented cost-effective commercial solutions because of the complexity and resource requirements of such a task. The Library is now pursuing partnerships with the private sector to allow some limited access capability in our reading rooms. These efforts are ongoing and a priority for the Library. (p. 1)

David Bruggeman in his Jan. 15, 2013 posting about this Library of Congress ‘Twitter project’ provides some mind-boggling numbers,

… That [170 billion tweets] represents the archive Twitter had at the time of the agreement (covering 2006-early 2010) and 150 billion tweets in the subsequent months (the Library receives over half a billion new tweets each day, and that number continues to rise). The archive includes the tweets and relevant metadata for a total of nearly 67 terabytes of data.

Gayle Osterberg, Director of Communications for the Library of Congress writes in a Jan. 4, 2013 posting on the Library of Congress blog about ‘tweet’ archive research issues,

Although the Library has been building and stabilizing the archive and has not yet offered researchers access, we have nevertheless received approximately 400 inquiries from researchers all over the world. Some broad topics of interest expressed by researchers run from patterns in the rise of citizen journalism and elected officials’ communications to tracking vaccination rates and predicting stock market activity.

The white paper/update offers a couple of specific examples of requests,

Some examples of the types of requests the Library has received indicate how researchers might use this archive to inform future scholarship:

* A master’s student is interested in understanding the role of citizens in disruptive events. The student is focusing on real-time micro-blogging of terrorist attacks. The questions focus on the timeliness and accuracy of tweets during specified events.

* A post-doctoral researcher is looking at the language used to spread information about charities’ activities and solicitations via social media during and immediately following natural disasters. The questions focus on audience targets and effectiveness. (p. 4)

At least one of the reasons no one has received access to the tweets is that a single search of the archived (2006-2010) tweets alone would take 24 hours,

The Library has assessed existing software and hardware solutions that divide and simultaneously search large data sets to reduce search time, so-called “distributed and parallel computing”. To achieve a significant reduction of search time, however, would require an extensive infrastructure of hundreds if not thousands of servers. This is cost prohibitive and impractical for a public institution.

Some private companies offer access to historic tweets but they are not the free, indexed and searchable access that would be of most value to legislative researchers and scholars.

It is clear that technology to allow for scholarship access to large data sets is not nearly as advanced as the technology for creating and distributing that data. Even the private sector has not yet implemented cost-effective commercial solutions because of the complexity and resource requirements of such a task. (p. 4)
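To make the white paper’s “distributed and parallel computing” idea a little more concrete, here is a minimal sketch in Python (my own illustration, emphatically not the Library’s actual system): the archive is divided into shards and every shard is searched simultaneously, so the total search time shrinks roughly in proportion to the number of workers. The file layout and the JSON-lines tweet format are assumptions made purely for the example.

```python
# Minimal sketch of a sharded, parallel keyword search (illustration only).
# Assumes the archive has been split into files like tweets/shard-0001.jsonl,
# each holding one JSON tweet object per line -- an invented layout.
import glob
import json
from multiprocessing import Pool

def search_shard(args):
    """Scan one shard file and return the IDs of tweets containing the term."""
    path, term = args
    hits = []
    with open(path, encoding="utf-8") as shard:
        for line in shard:
            tweet = json.loads(line)
            if term.lower() in tweet.get("text", "").lower():
                hits.append(tweet.get("id"))
    return hits

def parallel_search(term, shard_glob="tweets/shard-*.jsonl", workers=8):
    """Fan the same search out over every shard and merge the results."""
    shards = glob.glob(shard_glob)
    with Pool(workers) as pool:
        per_shard = pool.map(search_shard, [(path, term) for path in shards])
    return [tweet_id for hits in per_shard for tweet_id in hits]

if __name__ == "__main__":
    print(len(parallel_search("vaccination")))
```

The same principle scales out from a handful of processes on one machine to the “hundreds if not thousands of servers” the white paper mentions, which is exactly where the cost problem comes from.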

David Bruggeman goes on to suggest that, in an attempt to make the tweets searchable and more easily accessible, all this information could end up behind a paywall (Note: A link has been removed),

I’m reminded of how Ancestry.com managed to get exclusive access to Census records.  While the Bureau benefitted from getting the records digitized, having this taxpayer-paid information controlled by a private company is problematic.

As a Canuck and someone who tweets (@frogheart), I’m not sure how I feel about having my tweets archived by the US Library of Congress in the first place, let alone the possibility I might have to pay for access to my old tweets.

Dr. Wei Lu, the memristor, and the cat brain; military surveillance takes a Star Trek: The Next Generation turn with a medieval twist; archiving tweets; patents and innovation

Last week I featured the ‘memristor’ story, mentioning that much of the latest excitement was set off by Dr. Wei Lu’s work at the University of Michigan (U-M). While HP Labs has been the center of much of the attention, it was Dr. Lu’s work (published in Nano Letters, which is behind a paywall) that provoked the renewed interest. Thanks to this news item on Nanowerk, I’ve now found more details about Dr. Lu and his team’s work,

U-M computer engineer Wei Lu has taken a step toward developing this revolutionary type of machine that could be capable of learning and recognizing, as well as making more complex decisions and performing more tasks simultaneously than conventional computers can.

Lu previously built a “memristor,” a device that replaces a traditional transistor and acts like a biological synapse, remembering past voltages it was subjected to. Now, he has demonstrated that this memristor can connect conventional circuits and support a process that is the basis for memory and learning in biological systems.

Here’s where it gets interesting,

In a conventional computer, logic and memory functions are located at different parts of the circuit and each computing unit is only connected to a handful of neighbors in the circuit. As a result, conventional computers execute code in a linear fashion, line by line, Lu said. They are excellent at performing relatively simple tasks with limited variables.

But a brain can perform many operations simultaneously, or in parallel. That’s how we can recognize a face in an instant, but even a supercomputer would take much, much longer and consume much more energy in doing so.

So far, Lu has connected two electronic circuits with one memristor. He has demonstrated that this system is capable of a memory and learning process called “spike timing dependent plasticity.” This type of plasticity refers to the ability of connections between neurons to become stronger based on when they are stimulated in relation to each other. Spike timing dependent plasticity is thought to be the basis for memory and learning in mammalian brains.

“We show that we can use voltage timing to gradually increase or decrease the electrical conductance in this memristor-based system. In our brains, similar changes in synapse conductance essentially give rise to long term memory,” Lu said.
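For readers who want to see the idea rather than just read about it, here is a toy sketch in Python (my own illustration, not anything from Dr. Lu’s paper) of a spike timing dependent plasticity rule: the ‘synaptic’ conductance is nudged up when the pre-synaptic spike arrives just before the post-synaptic one, nudged down when the order is reversed, and the size of the change falls off the further apart in time the two spikes are. All the constants are invented for the example.

```python
# Toy model of spike timing dependent plasticity (illustration only).
import math

A_PLUS, A_MINUS = 0.05, 0.05   # maximum fractional change per spike pair (assumed)
TAU = 20.0                     # decay time constant in milliseconds (assumed)

def stdp_update(conductance, t_pre, t_post):
    """Return the new conductance after one pre/post spike pair."""
    dt = t_post - t_pre
    if dt > 0:   # pre-synaptic spike first: strengthen the connection
        change = A_PLUS * math.exp(-dt / TAU)
    else:        # post-synaptic spike first: weaken the connection
        change = -A_MINUS * math.exp(dt / TAU)
    return max(0.0, conductance * (1.0 + change))

g = 1.0
for t_pre, t_post in [(0, 5), (40, 45), (100, 90)]:
    g = stdp_update(g, t_pre, t_post)
    print(f"dt = {t_post - t_pre:+d} ms -> conductance = {g:.3f}")
```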

Do visit Nanowerk for the full explanation provided by Dr. Lu, if you’re so inclined. In one of my earlier posts about this, I speculated that this work was being funded by DARPA (Defense Advanced Research Projects Agency), which is part of the US Dept. of Defense. Happily, I found this at the end of today’s news item,

Lu said an electronic analog of a cat brain would be able to think intelligently at the cat level. For example, if the task were to find the shortest route from the front door to the sofa in a house full of furniture, and the computer knows only the shape of the sofa, a conventional machine could accomplish this. But if you moved the sofa, it wouldn’t realize the adjustment and find a new path. That’s what engineers hope the cat brain computer would be capable of. The project’s major funder, the Defense Advanced Research Projects Agency [emphasis mine], isn’t interested in sofas. But this illustrates the type of learning the machine is being designed for.

I previously mentioned the story here on April 8, 2010, providing links to other aspects of the story as I and others have covered it.

Military surveillance

A news item on Azonano announces two new applications for ‘Panoptes’, a miniature camera platform named after a figure in Greek mythology, Argos Panoptes (the sentry with 100 eyes),

Researchers are expanding new miniature camera technology for military and security uses so soldiers can track combatants in dark caves or urban alleys, and security officials can unobtrusively identify a subject from an iris scan.

The two new surveillance applications both build on “Panoptes,” a platform technology developed under a project led by Marc Christensen at Southern Methodist University in Dallas. The Department of Defense is funding development of the technology’s first two extension applications with a $1.6 million grant.

The following image, which accompanies the article at the Southern Methodist University (SMU) website, features an individual who suggests a combination of the Geordi character from Star Trek: The Next Generation, with his ‘sensing visor’, and a medieval knight in full armour with the visor of his helmet down.

Soldier wearing helmet with hi-res "eyes" courtesy of Southern Methodist University Research

From the article on the SMU site,

“The Panoptes technology is sufficiently mature that it can now leave our lab, and we’re finding lots of applications for it,” said ‘Marc’ Christensen [project leader], an expert in computational imaging and optical interconnections. “This new money will allow us to explore Panoptes’ use for non-cooperative iris recognition systems for Homeland Security and other defense applications. And it will allow us to enhance the camera system to make it capable of active illumination so it can travel into dark places — like caves and urban areas.”

Well, there’s nothing like some non-cooperative iris scanning. In fact, you won’t know that the scanning is taking place if they’re successful with their newest research, which calls to mind the panopticon, a concept from Jeremy Bentham in the 18th century for prison surveillance that takes place without the prisoners being aware of it (Wikipedia essay here).

Archiving tweets

The US Library of Congress has just announced that it will be saving (archiving) all the ‘tweets’ that have been sent since Twitter launched four years ago. From the news item on physorg.com,

“Library to acquire ENTIRE Twitter archive — ALL public tweets, ever, since March 2006!” the Washington-based library, the world’s largest, announced in a message on its Twitter account at Twitter.com/librarycongress.

“That’s a LOT of tweets, by the way: Twitter processes more than 50 million tweets every day, with the total numbering in the billions,” Matt Raymond of the Library of Congress added in a blog post.

Raymond highlighted the “scholarly and research implications” of acquiring the micro-blogging service’s archive.

He said the messages being archived include the first-ever “tweet,” sent by Twitter co-founder Jack Dorsey, and the one that ran on Barack Obama’s Twitter feed when he was elected president.

Meanwhile, Google made an announcement about another Twitter-related development, Google Replay, its real-time search function, which will give you data about the specific tweets made on a particular date. Dave Bruggeman at the Pasco Phronesis blog offers more information and a link to the beta version of Google Replay.

Patents and innovation

I find it interesting that countries and international organizations use the number of patents filed as an indicator of scientific progress while studies suggest that patents may actually hinder innovation. This news item on Science Daily strongly suggests that there are some significant problems with the current system. From the news item,

As single-gene tests give way to multi-gene or even whole-genome scans, exclusive patent rights could slow promising new technologies and business models for genetic testing even further, the Duke [Institute for Genome Sciences and Policy] researchers say.

The findings emerge from a series of case studies that examined genetic risk testing for 10 clinical conditions, including breast and colon cancer, cystic fibrosis and hearing loss. …

In seven of the conditions, exclusive licenses have been a source of controversy. But in no case was the holder of exclusive patent rights the first to market with a test.

“That finding suggests that while exclusive licenses have proven valuable for developing drugs and biologics that might not otherwise be developed, in the world of gene testing they are mainly a tool for clearing the field of competition [emphasis mine], and that is a sure-fire way to irritate your customers, both doctors and patients,” said Robert Cook-Deegan, director of the IGSP Center for Genome Ethics, Law & Policy.

This isn’t an argument against the entire patenting system but rather the use of exclusive licenses.