Tag Archives: US Library of Congress

Sifting through Twitter with your computer cluster of more than 600 nodes named Olympus—one of the Top 500 fastest supercomputers in the world.

Here are two (seemingly) contradictory pieces of information: (1) the US Library of Congress takes over 24 hours to complete a single search of tweets archived from 2006 – 2010, according to my Jan. 16, 2013 posting, and (2) Court (Courtney) Corley, a data scientist at the US Dept. of Energy’s Pacific Northwest National Laboratory (PNNL), has a system (SALSA; SociAL Sensor Analytics) that analyzes billions of tweets in seconds. It’s a little hard to make sense of these two very different perspectives on accessing data from tweets.

The news from Corley and the PNNL is more recent and, before I speculate further, here’s a bit more about Corley’s work, from the June 6, 2013 PNNL news release (also on EurekAlert),

If you think keeping up with what’s happening via Twitter, Facebook and other social media is like drinking from a fire hose, multiply that by 7 billion – and you’ll have a sense of what Court Corley wakes up to every morning.

Corley, a data scientist at the Department of Energy’s Pacific Northwest National Laboratory, has created a powerful digital system capable of analyzing billions of tweets and other social media messages in just seconds, in an effort to discover patterns and make sense of all the information. His social media analysis tool, dubbed “SALSA” (SociAL Sensor Analytics), combined with extensive know-how – and a fair degree of chutzpah – allows someone like Corley to try to grasp it all.

“The world is equipped with human sensors – more than 7 billion and counting. It’s by far the most extensive sensor network on the planet. What can we learn by paying attention?” Corley said.

Among the payoffs Corley envisions are emergency responders who receive crucial early information about natural disasters such as tornadoes; a tool that public health advocates can use to better protect people’s health; and information about social unrest that could help nations protect their citizens. But finding those jewels amidst the effluent of digital minutia is a challenge.

“The task we all face is separating out the trivia, the useless information we all are blasted with every day, from the really good stuff that helps us live better lives. There’s a lot of noise, but there’s some very valuable information too.”

I was getting a little worried when I saw the bit about separating useless information from the good stuff since that can be a very personal choice. Thankfully, this followed,

One person’s digital trash is another’s digital treasure. For example, people known in social media circles as “Beliebers,” named after entertainer Justin Bieber, covet inconsequential tidbits about Justin Bieber, while “non-Beliebers” send that data straight to the recycle bin.

The amount of data is mind-bending. In social media posted just in the single year ending Aug. 31, 2012, each hour on average witnessed:

  • 30 million comments
  • 25 million search queries
  • 98,000 new tweets
  • 3.8 million blog views
  • 4.5 million event invites
  • 7.1 million photos uploaded
  • 5.5 million status updates
  • The equivalent of 453 years of video watched

Several firms routinely sift posts on LinkedIn, Facebook, Twitter, YouTube and other social media, then analyze the data to see what’s trending. These efforts usually require a great deal of software and a lot of person-hours devoted specifically to using that application. It’s what Corley terms a manual approach.

Corley is out to change that, by creating a systematic, science-based, and automated approach for understanding patterns around events found in social media.

It’s not so simple as scanning tweets. Indeed, if Corley were to sit down and read each of the more than 20 billion entries in his data set from just a two-year period, it would take him more than 3,500 years if he spent just 5 seconds on each entry. If he hired 1 million helpers, it would take more than a day.

But it takes less than 10 seconds when he relies on PNNL’s Institutional Computing resource, drawing on a computer cluster with more than 600 nodes named Olympus, which is among the Top 500 fastest supercomputers in the world.

“We are using the institutional computing horsepower of PNNL to analyze one of the richest data sets ever available to researchers,” Corley said.

At the same time that his team is creating the computing resources to undertake the task, Corley is constructing a theory for how to analyze the data. He and his colleagues are determining baseline activity, culling the data to find routine patterns, and looking for patterns that indicate something out of the ordinary. Data might include how often a topic is the subject of social media, who is putting out the messages, and how often.

Corley notes additional challenges posed by social media. His programs analyze data in more than 60 languages, for instance. And social media users have developed a lexicon of their own and often don’t use traditional language. A post such as “aw my avalanna wristband @Avalanna @justinbieber rip angel pic.twitter.com/yldGVV7GHk” poses a challenge to people and computers alike.

Nevertheless, Corley’s program is accurate much more often than not, catching the spirit of a social media comment accurately more than three out of every four instances, and accurately detecting patterns in social media more than 90 percent of the time.
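The release doesn’t say how SALSA determines baseline activity, so what follows is only a minimal sketch of the general idea, not PNNL’s method. It assumes hourly counts of posts on a single topic and flags any hour that strays far from the average of the hours just before it. The function name `flag_unusual_hours`, the window size, the threshold, and the sample counts are all hypothetical choices made for illustration.

```python
from statistics import mean, stdev

def flag_unusual_hours(counts, window=6, threshold=3.0):
    """Return indices of hours whose count deviates strongly from the trailing baseline.

    counts: posts per hour on one topic (illustrative data, not real tweets).
    window: number of preceding hours used as the baseline (assumed value).
    threshold: number of standard deviations treated as out of the ordinary (assumed).
    """
    flagged = []
    for i in range(window, len(counts)):
        baseline = counts[i - window:i]
        mu = mean(baseline)
        sigma = stdev(baseline)
        if sigma == 0:
            # A perfectly flat baseline: any change at all counts as unusual.
            if counts[i] != mu:
                flagged.append(i)
            continue
        if abs(counts[i] - mu) / sigma > threshold:
            flagged.append(i)
    return flagged

# Made-up hourly counts with one obvious spike at index 8.
hourly_counts = [100, 104, 98, 101, 99, 102, 100, 103, 450, 105]
print(flag_unusual_hours(hourly_counts))
```

With these made-up numbers only the spike at index 8 is flagged. A real system would also have to cope with daily and weekly rhythms, many topics at once, and the 60-plus languages the release mentions.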

Corley’s educational background may explain the interest in emergency responders and health crises mentioned in the early part of the news release (from Corley’s PNNL webpage),

B.S. Computer Science from University of North Texas; M.S. Computer Science from University of North Texas; Ph.D. Computer Science and Engineering from University of North Texas; M.P.H. (expected 2013) Public Health from University of Washington.

The reference to public health and emergency response is further developed, from the news release,

Much of the work so far has been around public health. According to media reports in China, the current H7N9 flu situation in China was highlighted on Sina Weibo, a China-based social media platform, weeks before it was recognized by government officials. And Corley’s work with the social media working group of the International Society for Disease Surveillance focuses on the use of social media for effective public health interventions.

In collaboration with the Infectious Disease Society of America and Immunizations 4 Public Health, he has focused on the early identification of emerging immunization safety concerns.

“If you want to understand the concerns of parents about vaccines, you’re never going to have the time to go out there and read hundreds of thousands, perhaps millions of tweets about those questions or concerns,” Corley said. “By creating a system that can capture trends in just a few minutes, and observe shifts in opinion minute to minute, you can stay in front of the issue, for instance, by letting physicians in certain areas know how to customize the educational materials they provide to parents of young children.”

Corley has looked closely at reaction to the vaccine that protects against HPV, which causes cervical cancer. The first vaccine was approved in 2006, when he was a graduate student, and his doctoral thesis focused on an analysis of social media messages connected to HPV. He found that creators of messages that named a specific drug company were less likely to be positive about the vaccine than others who did not mention any company by name.

Other potential applications include helping emergency responders react more efficiently to disasters like tornadoes, or identifying patterns that might indicate coming social unrest or even something as specific as a riot after a soccer game. More than a dozen college students or recent graduates are working with Corley to look at questions like these and others.

As to why the US Library of Congress requires 24 hours to search one term in its archived tweets while Corley and the PNNL require seconds to sift through two years of tweets, only two possibilities come to mind: (1) Corley is doing a stripped-down version of an archival search, so his searches are not comparable to the Library of Congress searches, or (2) Corley and the PNNL have far superior technology.
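For anyone who wants to check the reading-time figures in the news release, here is a minimal sketch of the arithmetic. It assumes exactly 20 billion entries (the release says “more than 20 billion”), a 365.25-day year, and treats “less than 10 seconds” as an upper bound; the variable names are mine.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600
SECONDS_PER_DAY = 24 * 3600

entries = 20_000_000_000      # "more than 20 billion entries" (lower bound from the release)
seconds_per_entry = 5         # reading time per entry, from the release
helpers = 1_000_000           # "1 million helpers", from the release

total_seconds = entries * seconds_per_entry
years_alone = total_seconds / SECONDS_PER_YEAR
days_with_helpers = total_seconds / helpers / SECONDS_PER_DAY

# Rate implied by "less than 10 seconds" on Olympus, treating 10 s as an upper bound.
entries_per_second = entries / 10

print(f"one reader: about {years_alone:,.0f} years")
print(f"one million readers: about {days_with_helpers:.2f} days")
print(f"implied cluster rate: at least {entries_per_second:,.0f} entries per second")
```

At exactly 20 billion entries this works out to roughly 3,170 years for one reader and a little over a day for a million readers. The release’s “more than 3,500 years” would correspond to about 22 billion entries, which still fits “more than 20 billion.”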

Researching tweets (the Twitter kind)

The US Library of Congress, in April 2010, made a deal with Twitter (a microblogging service where people chat or tweet in 140 characters) to acquire all the tweets from 2006 to April 2010 and, starting from April 2010, all subsequent tweets (the deal is mentioned in my April 15, 2010 posting [scroll down about 60% of the way]). Reading between the lines of the Library of Congress’ Jan. 2013 update/white paper, the job has proved even harder than they originally anticipated,

In April, 2010, the Library of Congress and Twitter signed an agreement providing the Library the public tweets from the company’s inception through the date of the agreement, an archive of tweets from 2006 through April, 2010. Additionally, the Library and Twitter agreed that Twitter would provide all public tweets on an ongoing basis under the same terms. The Library’s first objectives were to acquire and preserve the 2006-10 archive; to establish a secure, sustainable process for receiving and preserving a daily, ongoing stream of tweets through the present day; and to create a structure for organizing the entire archive by date. This month, all those objectives will be completed. To date, the Library has an archive of approximately 170 billion tweets.

The Library’s focus now is on confronting and working around the technology challenges to making the archive accessible to researchers and policymakers in a comprehensive, useful way. It is clear that technology to allow for scholarship access to large data sets is lagging behind technology for creating and distributing such data. Even the private sector has not yet implemented cost-effective commercial solutions because of the complexity and resource requirements of such a task. The Library is now pursuing partnerships with the private sector to allow some limited access capability in our reading rooms. These efforts are ongoing and a priority for the Library. (p. 1)

David Bruggeman in his Jan. 15, 2013 posting about this Library of Congress ‘Twitter project’ provides some mind-boggling numbers,

… That [170 billion tweets] represents the archive Twitter had at the time of the agreement (covering 2006-early 2010) and 150 billion tweets in the subsequent months (the Library receives over half a million [the Library’s white paper puts this at nearly half a billion] new tweets each day, and that number continues to rise). The archive includes the tweets and relevant metadata for a total of nearly 67 terabytes of data.

Gayle Osterberg, Director of Communications for the Library of Congress, writes in a Jan. 4, 2013 posting on the Library of Congress blog about ‘tweet’ archive research issues,

Although the Library has been building and stabilizing the archive and has not yet offered researchers access, we have nevertheless received approximately 400 inquiries from researchers all over the world. Some broad topics of interest expressed by researchers run from patterns in the rise of citizen journalism and elected officials’ communications to tracking vaccination rates and predicting stock market activity.

The white paper/update offers a couple of specific examples of requests,

Some examples of the types of requests the Library has received indicate how researchers might use this archive to inform future scholarship:

* A master’s student is interested in understanding the role of citizens in disruptive events. The student is focusing on real-time micro-blogging of terrorist attacks. The questions focus on the timeliness and accuracy of tweets during specified events.

* A post-doctoral researcher is looking at the language used to spread information about charities’ activities and solicitations via social media during and immediately following natural disasters. The questions focus on audience targets and effectiveness. (p. 4)

At least one of the reasons no one has received access to the tweets is that a single search of the archived (2006 – 2010) tweets alone would take 24 hours,

The Library has assessed existing software and hardware solutions that divide and simultaneously search large data sets to reduce search time, so-called “distributed and parallel computing”. To achieve a significant reduction of search time, however, would require an extensive infrastructure of hundreds if not thousands of servers. This is cost prohibitive and impractical for a public institution.

Some private companies offer access to historic tweets but they are not the free, indexed and searchable access that would be of most value to legislative researchers and scholars.

It is clear that technology to allow for scholarship access to large data sets is not nearly as advanced as the technology for creating and distributing that data. Even the private sector has not yet implemented cost-effective commercial solutions because of the complexity and resource requirements of such a task. (p. 4)
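The “divide and simultaneously search” idea in the white paper can be illustrated in miniature. This is a minimal sketch, not the Library’s system: it assumes the archive has already been split into partitions (here, small in-memory lists of made-up tweets) and uses the thread pool from Python’s standard `concurrent.futures` module to search them at the same time. The names `search_partition` and `parallel_search` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def search_partition(partition, term):
    """Scan one partition and return the tweets containing the term (case-insensitive)."""
    needle = term.lower()
    return [tweet for tweet in partition if needle in tweet.lower()]

def parallel_search(partitions, term, max_workers=4):
    """Search every partition concurrently and merge the matches in partition order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves the order of the input partitions in its results.
        results = pool.map(lambda part: search_partition(part, term), partitions)
        return [tweet for matches in results for tweet in matches]

# Made-up data standing in for date-ordered slices of an archive.
partitions = [
    ["flu shot today", "lunch was great"],
    ["Vaccine clinic opens Monday", "traffic again"],
    ["no news", "asking about vaccine safety"],
]
print(parallel_search(partitions, "vaccine"))
```

Threads in standard Python won’t actually speed up a scan like this one; the sketch only shows the shape of the approach. Getting a real reduction in search time means spreading the partitions across many processes or machines, which is where the “hundreds if not thousands of servers” in the white paper come in.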

David Bruggeman goes on to suggest that, in an attempt to make the tweets searchable and more easily accessible, all this information could end up behind a paywall (Note: A link has been removed),

I’m reminded of how Ancestry.com managed to get exclusive access to Census records. While the Bureau benefitted from getting the records digitized, having this taxpayer-paid information controlled by a private company is problematic.

As a Canuck and someone who tweets (@frogheart), I’m not sure how I feel about having my tweets archived by the US Library of Congress in the first place, let alone the possibility I might have to pay for access to my old tweets.

Dr. Wei Lu, the memristor, and the cat brain; military surveillance takes a Star Trek: Next Generation turn with a medieval twist; archiving tweets; patents and innovation

Last week I featured the ‘memristor’ story mentioning that much of the latest excitement was set off by Dr. Wei Lu’s work at the University of Michigan (U-M). While HP Labs was the center for much of the interest, it was Dr. Lu’s work (published in Nano Letters which is available behind a paywall) that provoked the renewed interest. Thanks to this news item on Nanowerk, I’ve now found more details about Dr. Lu and his team’s work,

U-M computer engineer Wei Lu has taken a step toward developing this revolutionary type of machine that could be capable of learning and recognizing, as well as making more complex decisions and performing more tasks simultaneously than conventional computers can.

Lu previously built a “memristor,” a device that replaces a traditional transistor and acts like a biological synapse, remembering past voltages it was subjected to. Now, he has demonstrated that this memristor can connect conventional circuits and support a process that is the basis for memory and learning in biological systems.

Here’s where it gets interesting,

In a conventional computer, logic and memory functions are located at different parts of the circuit and each computing unit is only connected to a handful of neighbors in the circuit. As a result, conventional computers execute code in a linear fashion, line by line, Lu said. They are excellent at performing relatively simple tasks with limited variables.

But a brain can perform many operations simultaneously, or in parallel. That’s how we can recognize a face in an instant, but even a supercomputer would take much, much longer and consume much more energy in doing so.

So far, Lu has connected two electronic circuits with one memristor. He has demonstrated that this system is capable of a memory and learning process called “spike timing dependent plasticity.” This type of plasticity refers to the ability of connections between neurons to become stronger based on when they are stimulated in relation to each other. Spike timing dependent plasticity is thought to be the basis for memory and learning in mammalian brains.

“We show that we can use voltage timing to gradually increase or decrease the electrical conductance in this memristor-based system. In our brains, similar changes in synapse conductance essentially give rise to long term memory,” Lu said.
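The news item doesn’t give the equations behind Lu’s results, so the following is a minimal sketch of one common textbook form of spike timing dependent plasticity, an exponential pair-based rule, and not a model of his device. The amplitudes, the time constant, the normalized 0-to-1 conductance scale, and the name `stdp_delta` are all assumptions made for illustration.

```python
import math

def stdp_delta(dt_ms, a_plus=0.01, a_minus=0.012, tau_ms=20.0):
    """Conductance change for one pre/post spike pair under a pair-based STDP rule.

    dt_ms = t_post - t_pre. Positive means the input spike came first.
    All parameter values are illustrative assumptions, not measured device values.
    """
    if dt_ms > 0:
        # Pre before post: strengthen, more so when the spikes are close together.
        return a_plus * math.exp(-dt_ms / tau_ms)
    if dt_ms < 0:
        # Post before pre: weaken, again decaying with the time gap.
        return -a_minus * math.exp(dt_ms / tau_ms)
    return 0.0

weight = 0.5  # normalized conductance between 0 and 1 (assumed scale)
for dt in [5.0, 5.0, 5.0, -5.0]:
    # Apply each spike pair in turn, keeping the conductance within its bounds.
    weight = min(1.0, max(0.0, weight + stdp_delta(dt)))
print(round(weight, 4))
```

The point of the sketch is the timing dependence: pairs where the input spike arrives shortly before the output spike nudge the conductance up, the reverse order nudges it down, and the effect fades as the gap between spikes grows.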

Do visit Nanowerk for the full explanation provided by Dr. Lu, if you’re so inclined. In one of my earlier posts about this I speculated that this work was being funded by DARPA (Defense Advanced Research Projects Agency) which is part of the US Dept. of Defense. Happily, I found this at the end of today’s news item,

Lu said an electronic analog of a cat brain would be able to think intelligently at the cat level. For example, if the task were to find the shortest route from the front door to the sofa in a house full of furniture, and the computer knows only the shape of the sofa, a conventional machine could accomplish this. But if you moved the sofa, it wouldn’t realize the adjustment and find a new path. That’s what engineers hope the cat brain computer would be capable of. The project’s major funder, the Defense Advanced Research Projects Agency [emphasis mine], isn’t interested in sofas. But this illustrates the type of learning the machine is being designed for.
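The conventional half of that sofa example, finding the shortest route through a room with known furniture, can be sketched with a breadth-first search over a grid. This is a minimal illustration under my own assumptions (a made-up floor plan, four-directional movement, and the hypothetical name `shortest_path_length`), and it has nothing to do with how the cat brain computer itself would work.

```python
from collections import deque

def shortest_path_length(grid, start, goal):
    """Breadth-first search on a grid; returns the number of steps, or None if unreachable.

    grid: list of strings, '#' marks furniture, '.' is open floor.
    """
    rows, cols = len(grid), len(grid[0])
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        (r, c), steps = queue.popleft()
        if (r, c) == goal:
            return steps
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and grid[nr][nc] != '#' and (nr, nc) not in seen):
                seen.add((nr, nc))
                queue.append(((nr, nc), steps + 1))
    return None

# Made-up floor plan: a block of furniture in the middle of the room.
room = [
    "....",
    ".##.",
    "....",
]
door = (0, 0)
print(shortest_path_length(room, door, (2, 3)))  # sofa in the far corner
print(shortest_path_length(room, door, (0, 3)))  # sofa moved: search again with the new position
```

Notice that the program only finds the new route because it is handed the sofa’s new position. Realizing on its own that the sofa has moved is the part the article says conventional machines lack.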

I previously mentioned the story here on April 8, 2010 and provided links that led to other aspects of the story as I and others have covered it.

Military surveillance

Two new applications of a technology named after Argos Panoptes, a figure in Greek mythology (the sentry with 100 eyes), are being announced by researchers in a news item on Azonano,

Researchers are expanding new miniature camera technology for military and security uses so soldiers can track combatants in dark caves or urban alleys, and security officials can unobtrusively identify a subject from an iris scan.

The two new surveillance applications both build on “Panoptes,” a platform technology developed under a project led by Marc Christensen at Southern Methodist University in Dallas. The Department of Defense is funding development of the technology’s first two extension applications with a $1.6 million grant.

The following image, which accompanies the article at the Southern Methodist University (SMU) website, features an individual who suggests a combination of the Geordi character in Star Trek: The Next Generation with his ‘sensing visor’ and a medieval knight in full armour wearing his helmet with the visor down.

Soldier wearing helmet with hi-res "eyes" courtesy of Southern Methodist University Research

From the article on the SMU site,

“The Panoptes technology is sufficiently mature that it can now leave our lab, and we’re finding lots of applications for it,” said ‘Marc’ Christensen [project leader], an expert in computational imaging and optical interconnections. “This new money will allow us to explore Panoptes’ use for non-cooperative iris recognition systems for Homeland Security and other defense applications. And it will allow us to enhance the camera system to make it capable of active illumination so it can travel into dark places — like caves and urban areas.”

Well, there’s nothing like some non-cooperative iris scanning. In fact, if they’re successful with their newest research, you won’t know that the scanning is taking place, which suggests the panopticon, a concept from Jeremy Bentham in the 18th century about prison surveillance that takes place without the prisoners being aware of it (Wikipedia essay here).

Archiving tweets

The US Library of Congress has just announced that it will be saving (archiving) all the ‘tweets’ that have been sent since Twitter launched four years ago. From the news item on physorg.com,

“Library to acquire ENTIRE Twitter archive — ALL public tweets, ever, since March 2006!” the Washington-based library, the world’s largest, announced in a message on its Twitter account at Twitter.com/librarycongress.

“That’s a LOT of tweets, by the way: Twitter processes more than 50 million tweets every day, with the total numbering in the billions,” Matt Raymond of the Library of Congress added in a blog post.

Raymond highlighted the “scholarly and research implications” of acquiring the micro-blogging service’s archive.

He said the messages being archived include the first-ever “tweet,” sent by Twitter co-founder Jack Dorsey, and the one that ran on Barack Obama’s Twitter feed when he was elected president.

Meanwhile, Google made an announcement about another Twitter-related development, Google Replay, their real-time search function which will give you data about the specific tweets made on a particular date. Dave Bruggeman at the Pasco Phronesis blog offers more information and a link to the beta version of Google Replay.

Patents and innovation

I find it interesting that countries and international organizations use the number of patents filed as one indicator of scientific progress while studies indicate that patents may, in some cases, hold progress back. This news item on Science Daily strongly suggests that there are some significant problems with the current system. From the news item,

As single-gene tests give way to multi-gene or even whole-genome scans, exclusive patent rights could slow promising new technologies and business models for genetic testing even further, the Duke [Institute for Genome Sciences and Policy] researchers say.

The findings emerge from a series of case studies that examined genetic risk testing for 10 clinical conditions, including breast and colon cancer, cystic fibrosis and hearing loss. …

In seven of the conditions, exclusive licenses have been a source of controversy. But in no case was the holder of exclusive patent rights the first to market with a test.

“That finding suggests that while exclusive licenses have proven valuable for developing drugs and biologics that might not otherwise be developed, in the world of gene testing they are mainly a tool for clearing the field of competition [emphasis mine], and that is a sure-fire way to irritate your customers, both doctors and patients,” said Robert Cook-Deegan, director of the IGSP Center for Genome Ethics, Law & Policy.

This isn’t an argument against the entire patenting system but rather against the use of exclusive licenses.