Saturday, March 31, 2012

Blue Gene Supercomputer Lands at Rice

Rice University and IBM yesterday (2012.03.30) announced a partnership to build the first award-winning IBM Blue Gene supercomputer in Texas. This new computing capability will speed the search for new sources of energy, new ways of maximizing current energy sources, new cancer drugs and new routes to personalized medicine.


Rice faculty will use the Blue Gene to further their own research and to collaborate with academic and industry partners on a broad range of science and engineering fields related to energy, geophysics, basic life sciences, cancer research, personalized medicine and more.

Rice also announced a related collaboration agreement with the University of Sao Paulo (USP) in Brazil to initiate the shared administration and use of the Blue Gene supercomputer, which allows both institutions to share the benefits of the new computing resource. USP is Brazil's largest institution of higher education and research, and the agreement represents an important bond between Rice and USP.

Rice's new Blue Gene supercomputer, which has yet to be named, is slated to become operational in May. It is based on IBM's POWER processor technology, which was developed in part at the company's Austin, Texas labs. Rice and IBM shared the cost of the system.

The Blue Gene/P is the third new supercomputer Rice has launched with IBM during the past two years; together they have more than quadrupled Rice's high-performance computing capabilities. The addition of the Blue Gene/P doubles the number of supercomputing CPU hours that Rice can offer. The six-rack system contains nearly 25,000 processor cores that are capable of performing about 84 trillion mathematical computations each second. When fully operational, the system is expected to rank among the world's 300 fastest supercomputers as measured by the TOP500 supercomputer rankings.
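A quick back-of-the-envelope check of those figures, using Python as a calculator (the comment about the design philosophy is a general Blue Gene/P characteristic, not taken from the announcement):

    # Back-of-the-envelope check of the Blue Gene/P figures quoted above.
    total_flops = 84e12      # ~84 trillion computations per second
    cores = 25_000           # "nearly 25,000 processor cores"

    per_core = total_flops / cores
    print(f"~{per_core / 1e9:.1f} billion operations per second per core")
    # -> roughly 3.4 GFLOPS per core, in line with Blue Gene/P's design of
    #    many modest, power-efficient cores rather than a few fast ones.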

A Word on Supercomputers & Blue Gene

IBM Blue Gene/P Supercomputer
Unlike a typical desktop or laptop computer, which has a single microprocessor, supercomputers contain thousands or even tens of thousands of processors. This makes them ideal for scientists who study large problems or engineers who model complex systems, because jobs can be divided among all the processors and run in a matter of seconds or minutes rather than weeks or months. Supercomputers are used to simulate things that cannot be reproduced in a laboratory -- like Earth's climate, nuclear weapons, or the collision of galaxies -- and to examine vast databases like those used to map underground oil reservoirs or to develop personalized medical treatments.
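To make the "divide the job among processors" idea concrete, here is a minimal sketch in Python; the toy workload and worker count are my own illustration, not anything Rice or IBM runs, and real supercomputer codes do the same kind of split with MPI across thousands of nodes:

    from multiprocessing import Pool

    def simulate_chunk(chunk):
        """Stand-in for one slice of a large simulation (toy workload)."""
        return sum(x * x for x in chunk)

    if __name__ == "__main__":
        n = 10_000_000
        n_workers = 8        # a laptop has a handful of cores; a Blue Gene/P
                             # spreads the same split over tens of thousands
        size = n // n_workers
        chunks = [range(i * size, (i + 1) * size) for i in range(n_workers)]

        with Pool(n_workers) as pool:
            partial_results = pool.map(simulate_chunk, chunks)  # slices run in parallel

        print(sum(partial_results))  # combine the partial results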

High-performance computers like the IBM Blue Gene/P are critical in virtually every discipline of science and engineering. In 2009, President Obama recognized IBM and its Blue Gene family of supercomputers with the National Medal of Technology and Innovation, the most prestigious award in the United States given to leading innovators for technological achievement.


Links:

jPage: Analytics Overview


After reviewing and tracking Big Data (blog), in this parallel post I will review traditional and standard analytics tools such as databases, business intelligence, and statistical/visualization packages. Some of these tools are integrated or used in tandem with Big Data platforms to form solutions for Big Data analytics.

Business Intelligence (BI)

SAS
SAS is the leader in business analytics software and services, and the largest independent vendor in the business intelligence market.

Cognos
IBM Cognos is a leader in business intelligence and financial performance management.

SAP BusinessObjects
SAP BusinessObjects is a business intelligence (BI) solution that provides: 1) reporting and analysis, 2) dashboards, 3) data exploration, 4) mobile BI and 5) a BI platform.


Statistical Analysis
R
R is a language and environment for statistical computing and graphics. R provides a wide variety of statistical (linear and nonlinear modelling, classical statistical tests, time-series analysis, classification, clustering, ...) and graphical techniques, and is highly extensible. The S language is often the vehicle of choice for research in statistical methodology, and R provides an Open Source route to participation in that activity.  One of R's strengths is the ease with which well-designed publication-quality plots can be produced, including mathematical symbols and formulae where needed. Great care has been taken over the defaults for the minor design choices in graphics, but the user retains full control.  R is available as Free Software under the terms of the Free Software Foundation's GNU General Public License in source code form. It compiles and runs on a wide variety of UNIX platforms and similar systems (including FreeBSD and Linux), Windows and MacOS.


Predictive Analytics

SPSS
Predictive analytics helps organizations anticipate change so that decision-makers can plan and carry out strategies that improve outcomes.


Optimization

ILOG
ILOG is a recognized industry leader in Business Rule Management Systems (BRMS), visualization components, optimization and supply chain solutions.


Visualization

Tableau
Tableau is an easy-to-use analytics and visualization software package. Tableau Public is a free service that lets users create and share data visualizations on websites, blogs and social media like Facebook and Twitter. There's a revolutionary technology under the hood of Tableau Public called VizQL: A Visual Query Language. Users don't need to be programmers to use Tableau Public.


Live Post - last updated by Frank on 2012.03.31

Lecture: Margot Gerritsen & Computational Mathematics

Stanford Professor Margot Gerritsen illustrates how mathematics and computer modeling influence the design of modern airplanes, yachts, trucks and cars. This lecture, given on Oct 23, 2010, was offered as part of the Classes Without Quizzes series at Stanford's 2010 Reunion Homecoming.

Notable quote from the lecture: "When you look at it very carefully, all of the equations that govern fluid flow processes … be it climate models, weather models, optimization of sail design for competitive yacht races … optimizing wings … fluid flow in oil and gas reservoirs, aquifers, ground water models, coastal oceans, wind turbine optimization … All of these processes that may seem completely different are all governed by these equations, it's all the same stuff … they look very complex … [but] it's all relatively simple."

Margot Gerritsen, PhD, is an Associate Professor of Energy Resources Engineering, with expertise in mathematical and computational modeling of energy and fluid flow processes. She teaches courses in energy and the environment, computational mathematics and computing at Stanford University.


Links:

Wednesday, March 28, 2012

Powering Smart Cities with HPC, Big Data and Cloud



What is a Smart City? How can a city or region leverage computing and analytics technologies such as Big Data, HPC and Cloud to better coordinate its services, promote economic development, and understand the needs and opinions of its citizens? Can St. Louis become such a Smart City by taking advantage of the Loop Media Hub initiative?


These were the topics discussed by over 40 people at the March 15th event hosted by STLhpc.net at the Missouri History Museum in St. Louis. This free, public event was organized by Gary Stiehr, the founder of STLhpc.net and a research computing manager at the WashU Genome Center.


Below is the agenda of the event with the speakers.
  • An Overview of Big Data, HPC & Cloud in Smarter Cities  (Frank Lee, Ph.D., IBM Senior IT Architect)
     
  • Leveraging Distributed Computing to Enhance Customer and Visitor Relationship Management  (John Leach, CEO at Incite Retail)
     
  • Smart Cities: Where Transportation Research Meets High Performance Computing  (Gary Stiehr, Founder, STLhpc.net)
     
  • Applying Distributed Computing to SmartGrid  (Brad Molander, Technology Evangelist at NISC) Electrical grids across the nation have recently been deploying modernized assets such as smart meters that produce enormous amounts of data. NISC over the last two years has designed, engineered and deployed an internal private cloud environment to support the modern data needs of SmartGrid initiatives. This presentation will cover some fundamentals of smart grid as well as illustrate the challenges and benefits of private cloud environments.


The topics I covered in my presentation included some that I've blogged about:




 Link:

Forecast Weather (Hyper-locally): To A Neighborhood Near You

This post is part 2 of the "Forecasting Weather" series. The last post is here.

Current weather forecasting mostly focuses on temperature, wind and precipitation over a seven-day period on a regional scale, such as the metropolitan area of St. Louis. Predicting weather for a specific geographical area on a short-term scale remains a highly challenging computing problem. Take my subdivision, which narrowly escaped the New Year's Eve tornado, as an example: to predict the incoming tornado and alert the public (my neighbors), the forecast would need to be provided at the granular scale of a mile and a minute.
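To get a feel for why "a mile and a minute" is so demanding, here is a rough estimate in Python; the area and grid sizes are my own illustrative assumptions, not Deep Thunder's actual configuration:

    # Rough estimate of how the workload grows at hyper-local scale.
    # All numbers are illustrative assumptions, not a real model grid.
    metro_area_sq_miles = 8_000                        # greater St. Louis region, roughly
    cells_regional = metro_area_sq_miles / (10 * 10)   # ~10-mile regional grid
    cells_hyperlocal = metro_area_sq_miles / (1 * 1)   # 1-mile grid

    steps_regional = 7 * 24       # hourly steps over a seven-day outlook
    steps_hyperlocal = 84 * 60    # minute-by-minute over 84 hours

    work_ratio = (cells_hyperlocal * steps_hyperlocal) / (cells_regional * steps_regional)
    print(f"~{work_ratio:.0f}x more grid-point updates than the regional forecast")
    # -> on the order of a few thousand times more work, before adding
    #    finer vertical levels or more detailed physics.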

And it's this type of "hyper-local" and "near-realtime" forecasting that IBM's Deep Thunder system aims to provide. 

Initiated by IBM Research scientists as a collaboration with the National Weather Service to forecast weather for the 1996 Summer Olympics in Atlanta, Deep Thunder has evolved over the years into a full-scale Deep Computing project focused on much shorter-term forecasts, predicting everything from where flooding and downed power lines will likely occur to where winds will be too high for utility workers, up to 84 hours into the future.

Deep Thunder's forecast for Hurricane Irene, based on its model for New York's weather, predicted two days ahead that it would be reduced to tropical storm status. (Source: IBM)

Such forecast capability could be valuable to organizations with weather-sensitive needs. For example, in anticipation of high-wind conditions, air traffic controllers and airlines can take precautionary measures to reroute air traffic and prevent massive cancellations or passengers stuck on airplanes.

In another application, a power utility company could learn which areas to prepare for outages in the event of a storm. With this foresight, it could reduce downtime by scheduling maintenance workers to fix a line expected to fail.

The computing platform behind Deep Thunder has several key components: receiving and processing data, modeling, post-processing analysis, visualization and dissemination. To achieve accurate predictions, Deep Thunder combines weather data from the National Oceanic and Atmospheric Administration, NASA, the U.S. Geological Survey, WeatherBug, and ground sensors.
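Those components map naturally onto a staged pipeline. The sketch below shows that flow in Python; the function names and bodies are placeholders of mine, not IBM's actual code:

    # Hypothetical sketch of the staged flow described above:
    # ingest -> model -> post-process -> visualize/disseminate.
    def ingest(sources):
        """Pull observations from the listed data providers."""
        return [{"source": s, "data": "..."} for s in sources]

    def run_model(observations, hours=84):
        """Stand-in for the physics-based weather model."""
        return {"horizon_hours": hours, "fields": ["wind", "precipitation", "temperature"]}

    def post_process(forecast):
        """Derive decision-ready products, e.g. likely outage or flood areas."""
        return {"products": ["outage_risk_map", "flood_risk_map"], **forecast}

    def disseminate(products):
        """Push results to dashboards, mobile apps and operations centers."""
        print("publishing:", products["products"])

    sources = ["NOAA", "NASA", "USGS", "WeatherBug", "ground sensors"]
    disseminate(post_process(run_model(ingest(sources))))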

The simulation codes used for weather modeling have existed for decades. Some trace their origins to the 1970s, but they have evolved and improved considerably since then. Instead of inventing new weather models, IBM scientists have been adapting, refining and applying existing models and their simulation codes. Additionally, they are developing new methods of data visualization, analysis and dissemination, and techniques for improving computational performance and system automation.

A readout from Deep Thunder's minute-by-minute wind speed forecasts for lower Manhattan on March 13, from the iOS visualization interface. (Source: IBM)
In the latest iteration of the Deep Thunder project, IBM has taken the technology mobile, putting it on an iPad app and showing it off to lawmakers on Capitol Hill and to reporters in New York City in March. The iPad app shows a simple line graph representing precipitation, wind speed and temperature over an 84-hour period. But the underlying implications of the minute-by-minute, highly localized forecasting technology are much more impressive, catching the attention of governments and private companies around the world.

The government of Rio de Janeiro, Brazil, for example, entered into a partnership with IBM in December 2010 to use Deep Thunder in a new weather prediction center designed to help the city adequately prepare people for flash floods, which left over 200 dead earlier in 2010.

The Deep Thunder group has also been able to dovetail with other analytics-driven projects such as Smarter Cities. Working with colleagues in the new IBM Research center in Brazil as well as the IBM India Research Lab, the team is leading the Rio de Janeiro project to better anticipate flooding, and predict where mudslides might be triggered by severe storms. Here, highly targeted weather modeling is only part of the story. Through a new city command center, weather data can be integrated with other city information systems to determine how best to respond to such situations, including where and when to deploy emergency crews, make optimal use of shelters and monitor hospital bed availability.

With the World Cup coming to Rio in 2014 and the Summer Olympics in 2016, the forecast for the business-of-weather approach pioneered by Deep Thunder looks bright.

The Unified Rio Emergency Operation Center, Powered by Deep Thunder Technology for Hyper-local Weather and Flood Forecast

IBM's other partnership is with the University of Brunei Darussalam, which is using the technology at a national level for flood forecasting, and as part of a program to predict the impact of climate change on the country's rainforests.

Please comment below about this technology. And make a prediction about when Deep Thunder will be available to forecast weather for my subdivision, and yours.

A final word: I wrote this blog on an American Airlines flight from Dallas to San Francisco. It was delayed for about 40 minutes, and the pilot apologetically mentioned that it was because SFO had shut down two runways that night, due to high winds…


Links

Sunday, March 25, 2012

Forecasting Weather (Tornado): For My Subdivision?

An EF3 tornado touched down in my city of Sunset Hills, Missouri, on Dec 31, 2010. We were in Los Angeles preparing to participate in the annual Rose Bowl Parade when a friend from our subdivision called with the shocking news; turning on CNN filled in the rest of the details. This friend was doing OK, but his house suffered some wind damage. Our house lost some shingles. We were all lucky compared to the neighbors and people around town who lost their properties completely.

An EF3 Tornado Hit Sunset Hills, MO on Dec 31, 2010

The so-called New Year's Eve tornado cut a long trail from Fenton, about 10 miles to the west, leapt over the southwestern corner of our subdivision, which sits on a hilltop, and touched down near the intersection of Lindbergh & Watson, about 4 miles to the east of our house, wiping out one city block of mostly residential buildings. This is the one and only tornado that has ever hit Sunset Hills, and luckily no life was lost despite some injuries.

An incoming storm can produce damaging hail and lightning right next to an area with sunny skies. In the case of a tornado, the damage is almost always confined to the terrain and area where the funnel touches down from the cloud to the ground. So while we get most of our weather forecast information from watching TV, that information isn't very actionable, as the forecast is either too broad and regional or too far removed from the imminent event. In most cases, we stay glued to the TV while the storm approaches, just to find out what has already happened.


Sunset Hills, A Beautiful Town in Southwest St. Louis County, Missouri

Weather forecasts should be much more relevant if they're more localized and short-term. In one possible scenario, while facing an incoming storm, what if Sunset Hills residents could log on, type in a home address and get a minute-by-minute weather forecast within a one-mile radius, leading up to the touchdown of a tornado? This would tremendously help readiness and preparation for the approaching disaster, reduce casualties and avoid loss of life.
In the next post of this series, I will discuss how to turn "should" into "can" as scientists and technologists tackle this challenge.


Journey Blogs:
Links:


Friday, March 23, 2012

Watson Takes on Cancer

Watson, the "Jeopardy!"-playing IBM supercomputer, is getting another job: tackling cancer.

If a doctor at Memorial Sloan-Kettering Cancer Center in New York City asks a question about the best treatment for a patient's stage-three breast cancer, he or she will soon be talking to Watson, a supercomputer that combines natural language processing with machine learning.

Doctors at Memorial Sloan-Kettering in New York City

The computer, which is best known for the way it crushed its human competition on "Jeopardy!", can interpret spoken queries and then use statistical analysis to deliver evidence-based, statistically ranked responses. The two organizations, IBM and Memorial Sloan-Kettering, are still discussing terms of the partnership, but the big idea is to create an online decision-support service with the smarts of Watson and the clinical insights of Memorial Sloan-Kettering.

Watson is architected to store and analyze data in parallel and at high speed. As an example, I can read (but not remember) one page of data in three minutes, while Watson can read and memorize all 200 million pages of its data in three seconds.
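Taking those two figures at face value, the gap works out as follows (a quick Python check, nothing more):

    # Taking the figures above at face value: 1 page per 3 minutes for a person,
    # 200 million pages in 3 seconds for Watson.
    pages = 200_000_000
    human_seconds = pages * 3 * 60     # 3 minutes per page
    watson_seconds = 3

    print(f"human: ~{human_seconds / (3600 * 24 * 365):,.0f} years of non-stop reading")
    print(f"speed-up: ~{human_seconds / watson_seconds:.0e}x")
    # -> over a thousand years of reading versus three seconds,
    #    a speed-up on the order of 10^10.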

The Watson version being designed for the hospital works similarly to the way it played "Jeopardy!" When it is asked a question, it will provide suggestions -- the one in which Watson feels most confident, and then some strong alternatives.

Watson competing against Ken Jennings and Brad Rutter on the Jeopardy! game show

However, Memorial Sloan-Kettering is feeding Watson more information than it got on "Jeopardy!", in the form of patients' medical backgrounds. Watson is learning the hospital's advanced electronic medical records system, making it even smarter and able to make very strong recommendations based on patients' backgrounds. IBM and Memorial Sloan-Kettering are working to feed Watson loads of oncology information. Both expect it to be a process that will evolve over the next year.

"We are going to work with the machine and teach the machine how to make medical decisions," Memorial's Dr. Larry Norton told ABC News. "It's going to take not only molecular disease and clinical research findings into account but also the patients' social and psychological situations and patients' expressed wishes, lifestyle -- all that comes into play when making a high quality medical decision."



Sloan-Kettering and IBM are already developing the first applications using Watson related to lung, breast and prostate cancers, and aim to begin piloting the solutions with some oncologists in late 2012, with wider distribution planned for late 2013.

WellPoint, an insurance company, also hired Watson last year to guide treatment decisions for members. The first Watson deployment at WellPoint is underway now with nurses who manage complex patient cases and review treatment requests from medical providers. See my earlier blog post "Watson Debuts for Healthcare".

Links:

Thursday, March 22, 2012

j-Tool: IBM LanguageWare

IBM LanguageWare is a technology which provides a full range of text analysis functions. It is used extensively throughout the IBM product suite and is successfully deployed in solutions which focus on mining facts from large repositories of text.  IBM LanguageWare Resource Workbench is an Eclipse application for building custom language analysis into IBM LanguageWare resources and their associated UIMA annotators.


Link: 

Updates:
 2012.04.12 in St. Louis - focused on NLP

Wednesday, March 21, 2012

Cli-G Takes on Personalized Medicine

On March 14, IBM announced the creation of a new clinical genomics analytics platform to help physicians and administrators at Italy's Fondazione IRCCS Istituto Nazionale dei Tumori make decisions about which treatments could work most effectively for individual patients.

This effort in personalized medicine, or choosing specific treatments based on a patient’s personal genetic profile, is an important new area that could take some of the guesswork out of treating diseases like cancer and AIDS.

"Data from patients is a gold mine that helps us discover ways to treat other cases more successfully," said Boaz Carmeli, a healthcare researcher at IBM Research – Haifa. Carmeli's team has been leading the research into how clinical genomics can use data to get deeper insights into medical processes and how computers can be used to evolve and improve those processes.

"Before beginning this research, I always thought that when we get sick the doctor will tell us what treatment will work best," said Carmeli. "In reality, although we can have one diagnosis, there are many treatment options available. Choosing the best one depends on a huge number of factors, including our genetic profile, age, weight, family history, general health, and the current state of the disease."

The new Clinical Genomics analysis platform, also known as Cli-G ('clee-gee'), integrates and analyzes clinical data and evidence from patient records and incorporates expertise gathered from leading medical specialists, clinical healthcare guidelines, and other sources of knowledge. Clinicians access the system through a standard web browser, which offers a simple and intuitive graphical user interface.

Cli-G is a cousin of IBM Watson, the deep question-and-answer technology that beat two past champions on the Jeopardy! TV quiz show. Both are learning systems. While Watson focuses primarily on gathering unstructured textual information from published sources, Cli-G is aimed at gathering specific types of information. Carmeli and his colleagues have coined a term, Evicase, to describe the way they structure information. It's a combination of evidence-based medicine, which is statistical analysis of treatment outcomes, with case-based reasoning, which is knowledge gathered by studying the best practices of top physicians. Scientists at IBM Research are exploring how Cli-G and Watson could be used to complement one another.
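The article does not describe Evicase's internals, but the combination it names, statistical evidence plus case-based reasoning, can be sketched roughly as follows; the scoring functions and the equal weighting are my assumptions for illustration, not IBM's actual method:

    # Hypothetical sketch of the "evidence + cases" idea behind Evicase.
    # The scoring functions and the 50/50 weighting are assumptions for
    # illustration, not a description of IBM's actual method.
    def evidence_score(treatment, patient, guidelines):
        """Statistical support from clinical-trial guidelines, 0..1."""
        rule = guidelines.get(treatment, {})
        if patient["age"] > rule.get("max_age", 200):
            return 0.0
        return rule.get("response_rate", 0.0)

    def case_score(treatment, patient, past_cases):
        """Fraction of similar past cases where this treatment worked."""
        similar = [c for c in past_cases
                   if c["treatment"] == treatment and abs(c["age"] - patient["age"]) <= 10]
        if not similar:
            return 0.0
        return sum(c["good_outcome"] for c in similar) / len(similar)

    def recommend(patient, treatments, guidelines, past_cases):
        """Rank options: best-supported first, strong alternatives after."""
        return sorted(
            treatments,
            key=lambda t: 0.5 * evidence_score(t, patient, guidelines)
                        + 0.5 * case_score(t, patient, past_cases),
            reverse=True,
        )

    guidelines = {"chemo": {"response_rate": 0.30, "max_age": 75}}
    past_cases = [{"treatment": "chemo", "age": 62, "good_outcome": True},
                  {"treatment": "hormonal", "age": 58, "good_outcome": True}]
    print(recommend({"age": 60}, ["chemo", "hormonal"], guidelines, past_cases))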

The project grew out of a long-term effort by IBM’s Haifa scientists to develop a network that every party involved in healthcare delivery can tap into to share information. Cli-G adds sophisticated analytics to the network. The technology uses all of the information available to predict the most likely outcomes for a particular patient for various treatment options. Then, based on Evicase, it recommends what it considers to be the best treatment.



"Most physicians base their treatment decisions on guidelines from what is known as 'evidence-based medicine'," explained Carmeli. "These guidelines stem primarily from the results of clinical trials, and help guide doctors with rules for what treatment works best. But how relevant are these rules for an 80-year-old woman where no treatment at all may represent the best option?

“The clinical trials don't cover all populations and other evidence doesn't take into account factors such as the patient's emotional state, lifestyle, family history, or genetic profile."

In 30 to 40 percent of the cases, physicians formally declare that they are recommending treatment options outside the evidence-based medicine guidelines. The IBM solution gathers data from hospitals to gain insight into what was done in these cases, alongside information from other knowledge sources – including the physicians themselves.

"When we started the project, we had many questions on when and why physicians stay within – or diverge from – the guidelines," said Carmeli. "There are no clear answers, but we do see many different personalities, disciplines, hospital cultures, and patients. So, the alternatives are different for everyone."

By analyzing the cases, identifying trends, and then introducing formalistic tools, IBM researchers aim to provide insight into treatment options and allow the physicians to investigate the reasoning behind these options.

One example of how this solution can help is in the area of breast cancer. Medical research discovered that certain genetic sequences can indicate whether a breast cancer patient is predisposed to respond positively to chemotherapy – but the test used to identify this genetic sequence is expensive and hospitals are not sure it's worthwhile.



Today, adjuvant chemotherapy or hormonal therapy helps treat breast cancer in about 30 percent of the cases. Yet treatment is being given to 80 percent of breast cancer patients. As a result, more than half the patients are receiving treatment that will not help them. Using Cli-G, medical staff can verify whether the tests can accurately identify the different cases and predict the effectiveness of the treatment.
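A quick check of that claim, assuming everyone who benefits is among those treated (an assumption of mine, not stated in the article):

    # Quick check of the figures above, assuming everyone who benefits
    # is among those treated.
    patients = 100
    treated = 80      # ~80% receive adjuvant chemo or hormonal therapy
    helped = 30       # ~30% of cases actually benefit

    not_helped_but_treated = treated - helped
    print(f"{not_helped_but_treated} of {treated} treated patients "
          f"({not_helped_but_treated / treated:.0%}) get no benefit")
    # -> 50 of 80, about 63%, which matches "more than half" above.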

"Computers are used nowadays to provide support and assistance wherever possible," explained Carmeli. "We don't expect to know or remember everything, but many of us walk around with mobile devices that allow us to have extensive knowledge at our fingertips.

"The same support can make healthcare work smarter … We envision a world in which physicians also have immediate access to the value of decision support provided by technology. In the end, this ultimately enables patients to get the best possible treatment."

Links:

Tuesday, March 20, 2012

Adding Speed to Acceleration

The world's largest and meanest scientific instrument just paired up with the world's fastest and baddest computing machine, thanks to an I/O system overhaul. According to a recent blog post from the Swiss National Supercomputing Center (CSCS), the supercomputer "Phoenix", which serves the number crunching for the ATLAS LHC detector, just set a world record in the speed at which data is served to the full compute cluster.

Here is the background on the ATLAS detector and its associated data storage and analysis challenge:

ATLAS (A Toroidal LHC Apparatus) is one of the seven particle detector experiments (ALICE, ATLAS, CMS, TOTEM, LHCb, LHCf and MoEDAL) constructed at the Large Hadron Collider (LHC), a new particle accelerator at the European Organization for Nuclear Research (CERN) in Switzerland.




The detector generates unmanageably large amounts of raw data, about 25 megabytes per event (raw; zero suppression reduces this to 1.6 MB) times 23 events per beam crossing, times 40 million beam crossings per second in the center of the detector, for a total of 23 petabytes per second of raw data. Offline event reconstruction is performed on all permanently stored events, turning the pattern of signals from the detector into physics objects, such as jets, photons, and leptons. Grid and HPC are extensively used for event reconstruction, allowing the parallel use of university and laboratory computer networks throughout the world for the CPU-intensive task of reducing large quantities of raw data into a form suitable for physics analysis.
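The 23 petabytes per second figure can be checked directly from the numbers in that paragraph:

    # Sanity check of the raw data rate quoted above.
    raw_mb_per_event = 25            # ~25 MB per event before zero suppression
    events_per_crossing = 23
    crossings_per_second = 40_000_000

    raw_mb_per_second = raw_mb_per_event * events_per_crossing * crossings_per_second
    print(f"~{raw_mb_per_second / 1e9:.0f} PB/s of raw data")   # -> ~23 PB/s
    # After zero suppression (1.6 MB per event) and the trigger system
    # discarding nearly all events, only a small fraction is ever stored.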

It has been a long journey and struggle for ATLAS analysis sites worldwide to find a balance of cost and performance in their HPC infrastructure, such as a storage system and architecture that can sustain high demands for data I/O and reliability. As more workloads outstrip the capability of local hard drives, some sites have started to investigate parallel file systems that can provide fast I/O storage from low- to medium-cost disks.

In the test the CSCS carried out and published on February 16, 2012, the engineers load-tested their recently configured storage system based on IBM's GPFS file system. To their delight, the results showed that they now have the fastest I/O supercomputer among all the ATLAS sites. According to the results, CPU efficiency in this cluster reached a "whopping value of 0.99", which means that during the data-intensive computation there is virtually no I/O wait and all the CPUs were available close to 100% for the ATLAS jobs. This is about 20% better than the next best site measured with this workload. (See the plotted performance in the figure below, with the yellow line representing the CSCS GPFS system.)



What's striking about this speed-up is that there was little hardware change except swapping 12 SAS HDDs for two SSDs (for the metadata service). GPFS delivers superb I/O performance over the existing network infrastructure (Mellanox QDR InfiniBand), with 8 data servers and 2 metadata servers backed by these SSD drives.

Another surprise is how the new system outperforms the previous one based on the Lustre file system, a competitor of GPFS in the small family of parallel file systems. According to the CSCS blog, GPFS outperforms the previous configuration by far, reaching a sustained speed of 7.2 GB/sec during transfer. Another plus for the new architecture is that pure metadata operations (running 'ls', scanning the file system, etc.) no longer affect the performance of the system, thanks to the small hardware change to SSDs.
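For a sense of scale, the sustained rate spread over the data servers mentioned above works out to roughly:

    # Rough per-server view of the sustained throughput quoted above.
    sustained_gb_per_s = 7.2
    data_servers = 8

    print(f"~{sustained_gb_per_s / data_servers:.1f} GB/s per data server")  # -> ~0.9 GB/s
    # Metadata traffic is handled separately by the two SSD-backed metadata
    # servers, which is why 'ls' and file-system scans no longer slow the jobs.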

In the final analysis, if we are investing top talent and instruments to reveal the building blocks of God's universe, we should also start to think about how to match them with a smarter computing platform. It's about time, as time is what we cannot afford to lose in the chase for particles.

More from:

Friday, March 16, 2012

Wings Beneath The Wind

I spoke about the Vestas project yesterday (2012.03.15) at the event called "A New St. Louis? Powering Smart Cities with HPC, Big Data and Cloud". The event was held at the Missouri History Museum and organized by STLhpc.net, the brainchild of Gary Stiehr of Washington University.



The project showcases how Big Data and HPC converge to tackle both data-intensive and compute-intensive problems that are becoming more common in public and private data analytics. As a starter, the project will build a supercomputer capable of analyzing diverse and large weather data sets reaching 20-plus petabytes, reducing the time-to-results from weeks to less than an hour.

On October 24, 2011, IBM and Danish wind turbine manufacturer Vestas announced that a supercomputer code-named "Firestorm" will analyze petabytes of data to optimize the placement and maximize the energy output of Vestas turbines.

"Firestorm" will crunch through weather reports, moon and tidal phase, geospatial sensor data, satellite images, and deforestation maps to generate the best placement of turbines.
"Firestorm" will also help Vestas to see into the future and prescribe solutions. Software on the system will be used by anaylsts to model and research weather in predicting future performance. Vestas engineers will run other softwre to figure out the best time to do maintenance of the turbines.

Predicting energy output of turbines is vitally important to project developers who put up money for wind farms with an expectation of selling a certain amount of energy to the grid. Although wind power is growing in many places around the world, project developers are seeking better methods, including better wind speed measurement, to better match expected and actual performance.

The IBM software package is called BigInsights and it took four years to develop. BigInsights runs on the open-source Apache Hadoop framework for parallel processing of very large data sets. The software provides a framework for large-scale parallel processing and scalable storage for terabyte- to petabyte-level data, plus the ability to enable "what-if" scenarios with its BigSheets component. BigInsights is part of IBM's Big Data software platform, which includes InfoSphere Streams software that analyzes data coming into an organization in real time and monitors it for any changes that may signify a new pattern or trend.
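As a flavor of what Hadoop-style "parallel processing of very large data sets" looks like at the code level, here is a minimal map-and-reduce sketch in Python; the record layout and the local driver are my own illustration, not Vestas's schema or BigInsights code, and under Hadoop the two functions would run as separate streaming tasks across the cluster:

    from itertools import groupby

    # Hypothetical input records: "site_id,timestamp,wind_speed_m_s"
    # (the field layout is an illustration, not Vestas's actual schema).

    def mapper(lines):
        """Map step: emit (site_id, wind_speed) pairs."""
        for line in lines:
            site, _timestamp, speed = line.strip().split(",")
            yield site, float(speed)

    def reducer(pairs):
        """Reduce step: average wind speed per site."""
        for site, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
            speeds = [speed for _, speed in group]
            yield site, sum(speeds) / len(speeds)

    if __name__ == "__main__":
        sample = ["siteA,2011-10-24T00:00,7.1",
                  "siteA,2011-10-24T00:10,8.3",
                  "siteB,2011-10-24T00:00,5.4"]
        for site, avg_speed in reducer(mapper(sample)):
            print(site, round(avg_speed, 2))   # siteA 7.7, siteB 5.4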



A bit more on the "Firestorm" itself: the supercomputer has 1,222 connected, workload optimized System x iDataPlex servers and is capable of 150 trillion calculations per second -- equivalent to 30 million calculations per Danish citizen per second. Firestorm is #53 on the Top500 list of the world’s fastest supercomputers and the third largest commercial system on the list.

One of my colleagues from the IBM Worldwide Deep Computing team, Scott D, got to architect the system and worked with Vestas closely on the project. He's a lucky guy!

Links:

Wednesday, March 14, 2012

"The People's Oscar"

Do you endorse or abhor the picks made by Tinseltown's elite on Oscar night? Now you can weigh in with your own picks and views through social media, and the results will count, which means computers can help settle the film-buff debate, courtesy of a new tool co-developed by the University of Southern California, the LA Times and IBM.

The project "People's Oscar" relies on new sophisticated analytics and natural language recognition technologies to gauge positive and negative opinions shared in millions of public tweets.

Focused on the Best Actor, Best Actress and Best Picture categories for the 84th Academy Awards, the goal is to establish a model for measuring the volume and tone of worldwide Twitter sentiment to better understand moviegoers' opinions. The results are intended to illuminate how advances in technology can help identify important consumer trends, and, if I may add, public opinion on any given subject or topic, in near real time.
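At its simplest, measuring "volume and tone" means classifying each tweet and tallying the results per nominee. Here is a deliberately naive toy sketch in Python; the word lists and example tweets are mine, and bear no resemblance to the natural language technology actually used in the project:

    import re

    # Toy sentiment tally: classify each tweet, then count volume and tone
    # per nominee. Deliberately naive, for illustration only.
    POSITIVE = {"love", "brilliant", "deserves", "amazing", "best"}
    NEGATIVE = {"overrated", "boring", "robbed", "worst", "snub"}

    def tone(tweet):
        words = set(re.findall(r"[a-z']+", tweet.lower()))
        score = len(words & POSITIVE) - len(words & NEGATIVE)
        return "positive" if score > 0 else "negative" if score < 0 else "neutral"

    tweets = [
        ("The Artist", "The Artist was brilliant, it deserves Best Picture"),
        ("The Artist", "So overrated, silent films are boring"),
        ("Hugo", "Hugo was amazing in 3D"),
    ]

    tally = {}
    for nominee, text in tweets:
        counts = tally.setdefault(nominee, {"positive": 0, "negative": 0, "neutral": 0})
        counts[tone(text)] += 1

    print(tally)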



The "near real-time" can be realized if the data can be processed on a Smarter Computing platform, such as one based on IBM iDataplex platform at USC. With thousands of computing cores and high-performance networking infrastructure, this Big Data problem can be easily tackled.

After all, this is "The Age of Big Data", as The New York Times put it recently.

Links:

Tuesday, March 13, 2012

Catching up with Light?

IBM scientists have developed a prototype optical chipset, called Holey Optochip, that is the first parallel optical transceiver to transfer one trillion bits – one terabit – of information per second. So technically speaking, this is a Tbps transceiver, about 100x faster than today's technology.

For those of us who appreciate the speed in the context of digital downloads and Big Data, this speed is about the equivalent of downloading 500 high-definition movies. The results will be reported at the Optical Fiber Communication Conference taking place in Los Angeles, Calif.



For the fun of it, here are some more specs (with a quick check after the list):

The raw speed of one transceiver is equivalent to the bandwidth consumed by 100,000 users at today’s typical 10 Mb/s high-speed internet access.

Or, it would take just around an hour to transfer the entire U.S. Library of Congress web archive through the transceiver.

The transceiver consumes less than 5 watts; the power consumed by a 100W light bulb could power 20 transceivers.
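A quick check of the first comparison, plus the power math:

    # Quick check of the comparisons above.
    transceiver_bps = 1e12      # one terabit per second
    user_bps = 10e6             # a typical 10 Mb/s broadband connection

    print(f"{transceiver_bps / user_bps:,.0f} simultaneous users")   # -> 100,000
    print(f"{100 // 5} transceivers on a 100 W light-bulb budget")   # -> 20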

Links: