Monday, April 30, 2012

I Built A Supercomputer for KU


It took a three-year journey, but we finally arrived at a point worth celebrating: the University of Kansas announced its first state-of-the-art supercomputer system, powered by IBM high-performance computing (HPC) technologies.

As the lead solution architect and research liaison for this collaborative supercomputing project, I not only designed the full system but also enabled the research collaboration between KU and IBM Research that secured the funding through the IBM Shared University Research (SUR) grant.

So I'm happy to count another success in helping my customers and partners build and expand centralized HPC infrastructure for research. Below is the KU news release.


IBM, KU to empower researchers with world-class supercomputing

 April 30th, 2012 
 
The University of Kansas will partner with Armonk, N.Y.-based IBM Corp. to help advance supercomputing at KU, the school announced today.

The IBM Shared University Research (SUR) award adds five compute blades, a large-memory blade, a graphics processing unit blade, two storage servers and 72 terabytes of disk storage to the renovated Bioinformatics Computing Facility (BCF). The KU award builds on a donation earlier this year of three IBM BladeCenter chassis to the BCF.

The BCF renovation is being funded through a $4.6 million grant from the National Institutes of Health as part of the American Recovery & Reinvestment Act of 2009.

The BCF, which is set to open this summer, will greatly enhance the computing capabilities of the university, giving researchers a 20-fold increase in computing power to support investigations ranging from biology and disease to national security and climate change.

“At most universities, researchers work department-by-department or individually to get the computing resources they need,” said Perry Alexander, director of the Information and Telecommunication Technology Center, which houses the new BCF. “The BCF unites university resources and provides an outstanding staff to maintain a secure, energy efficient, world-class computing facility. Now, KU researchers can spend less time managing computational resources and more time conducting scholarly work.”

IBM’s Shared University Research Award program strives to connect researchers at universities with IBM Research, IBM Life Sciences, IBM Global Services and IBM's development and product labs.

The KU-IBM partnership will develop new hardware and software approaches to modeling and simulations of complex real-world systems. Researchers will be able to process and analyze huge volumes of structured and unstructured data, share their findings, explore new approaches and store the results of their research. Advanced systems modeling will enable more accurate predictions and large-scale analyses that incorporate data from multiple disciplines into a single framework with the goal of accelerating scientific breakthroughs.

IBM Systems and Technology Group University Alliances Executive Keith Brown sponsored the award to help the University of Kansas expand its High Performance Computing capabilities.

“We are pleased to help provide KU with the computational framework needed to develop and evaluate a hybrid computing cluster that is optimized for a number of simulation paradigms,” said Brown. “Modeling cell processes and structures, predicting the impact of climate change on biodiversity and exploring massive data sets using visual and analytical techniques are examples of how HPC technology can be used to achieve our goals of helping to create a Smarter Planet.”

Gerald Lushington, director of KU’s Molecular Graphics and Modeling Laboratory, uses the BCF in developing computational methods able to extract information from voluminous medical and chemical research. 

“Laboratory instruments for studying problems in molecular biology and medicine have grown incredibly sophisticated very quickly, to the point where they produce such huge volumes of useful data that we need very powerful computers to meaningfully analyze data,” Lushington said. “The renovated BCF in Nichols Hall provides the high performance computing hardware necessary to do this work, and the IBM SUR grant will deliver a valuable infusion of computing power for these calculations.”
Provided by University of Kansas

Update:
  • 2012.04.30 - original post

Wednesday, April 25, 2012

IBM Acquires an Advanced Navigator of Big Data


Big Blue has landed another Big Data treasure, in this case an advanced navigation tool for Big Data analytics.

IBM said today (2012.04.25) that it is buying Vivisimo, a privately held company based in Pittsburgh that helps organizations find and analyze large amounts of data. The acquisition advances IBM's big data analytics initiatives, adding federated discovery and navigation capabilities that allow organizations to access, navigate, and analyze the full variety, velocity and volume of structured and unstructured data without having to move it.

Vivisimo software excels at capturing and delivering quality information across the broadest range of data sources, regardless of format or location. The software automates the discovery of data and helps employees navigate it with a single view across the enterprise, providing insights that drive better decision-making for operational challenges.

IBM estimates that 2.5 quintillion bytes of data are created every day. It said the research group IDC estimates the market for big data technology and services will grow 40 percent annually to reach $16.9 billion by 2015.

IBM's big data platform is based on open source Apache Hadoop. The platform makes it easier for data-intensive applications to manage and analyze petabytes of big data by providing clients with an integrated approach to analytics, helping them turn information into insights for improved business outcomes.

IBM is also expanding its big data platform to run on other distributions of Hadoop, beginning with Cloudera. Cloudera is a top contributor to the Hadoop development community, and an early provider of Hadoop-based systems to clients across a broad range of industries including financial services, government, telecommunications, media, retail, energy and healthcare. As a result, Cloudera Hadoop clients can now take advantage of IBM's big data platform to perform complex analytics and build a new generation of software applications.


In a blog post on Vivisimo's website, chief scientist and co-founder Jerome Pesenti wrote: "I am honored to announce that Vivisimo has signed a definitive agreement to be acquired by IBM. Over the past 12 years, the Vivisimo team has been dedicated to building an amazing product, helping customers extract tremendous value out of their information. We can be proud of what we have accomplished, and IBM's decision to acquire us is a remarkable validation."



Update:
  • 2012.04.25 - original post 

Sunday, April 22, 2012

From the Frontier: Battery for the Future

The days of $1.50 gasoline are long gone, and high costs coupled with environmental concerns have ignited a search for the battery that will power the cars of the future. IBM is at the forefront of this effort with its development of a lithium-air super-battery that could power an electric vehicle for 500 miles on a single charge. On Friday (2012.04.20), Big Blue announced that materials innovators Asahi Kasei and Central Glass had joined its Battery 500 Project team to develop new battery technology for electric vehicles.


Challenge for Today’s Electric Car
President Obama has made the development of electric vehicles a priority of his administration, providing a US$7,500 tax credit for buyers of plug-in cars as well as billions of dollars in grants and loans to companies for vehicle and battery development through Energy Department programs.
However, drivers aren't plugging in yet. The mass adoption of electric vehicles has fallen short of optimistic projections, and even highly touted vehicles such as the Chevy Volt have missed sales targets, according to a report from Lux Research. Two impediments have been the cost of the vehicles and the range of their batteries.
"The limited range of electric cars is slowing down their adoption," Michael Holman, research director at Lux Research, told TechNewsWorld. "This new technology has the potential to address those issues for car buyers. If it's possible to increase the range, that might make the decision to buy an electric car easier for consumers." 
A Battery of the Future
The goal "is to create a battery that will power the typical family car about 500 miles between recharges," explains Winfried Wilcke, Senior Manager, Nanoscale Science and Technology with IBM. "Today's batteries...fall short of this goal by quite a factor, with [the best batteries] only lasting approximately 200 to 240 miles."

Bridging this gap requires drastically increasing the battery's energy density by making it lighter. IBM reduces the weight by getting rid of the heavy transition metal oxides like cobalt oxide or manganese oxide and replacing them with a lightweight, high-surface carbon structure.

The lithium air battery represents "the highest energy density of any imaginable system," says Wilcke, "but it's not easy to do. It's a long-term project currently in its early science phases, but in the last six or seven months we have gotten a lot of positive results, which make me cautiously optimistic that this can actually work."
Materials scientists for years have been pursuing lithium air batteries, which use oxygen from the air to react with lithium ions to discharge and charge electric energy. It still remains in the realm of research but Wilcke said that IBM has made progress understanding the basic chemistry and made important decisions on how a working battery would be engineered. 

"New materials development is vitally important to ensuring the viability of lithium-air battery technology," said Tatsuya Mori, director and executive managing officer of Central Glass. "As a longstanding partner of IBM and leader in developing high-performance electrolytes for batteries, we're excited to share each other's chemical and scientific expertise in a field as exciting as electric vehicles."

The key difference with lithium-air batteries is that these have higher energy density than lithium-ion batteries, while the primary "fuel" is the oxygen, which is readily available in the atmosphere. As a result, lithium-air technology promises a battery that offers a longer range and one that could be smaller than existing batteries -- but other issues remain.
"This technology is still similar in ways to conventional lithium-ion batteries," said Holman, "so the number of cycle lives could still be an issue. With each charge, tendrils form -- and these get larger each time until the battery can no longer be recharged."

Wilcke hopes to have a lithium air battery in cars by 2020. A battery that could power a car for 500 miles would certainly be worth the wait.


Big Blue’s Blue-sky Research

 IBM's gamble swims against a 50-year trend. U.S. companies used to perform their own basic research, but they have increasingly turned this over to universities and the government. Today, most companies' research and development divisions focus on applied research -- work that is likely to make money for the company soon.
In late 2009, IBM applied for a Department of Energy grant to defray some of the cost of this risky research. But DOE chose to fund two other lithium-air projects -- not IBM's. All the grants occurred under the Advanced Research Projects Agency-Energy, or ARPA-E. IBM's choice to continue the research puts it in a rare category: a big company willing to take a big risk.

Wilcke said it's been tricky to make the battery rechargeable -- even to measure that it's recharging. But after seeing progress over the last six months, he said, "I have got a lot more optimistic that it will work, actually."

IBM wants a "substantial demonstration" or lab demo in three years, Wilcke said. He wouldn't say how much money or how many people it has put to the task. ARPA-E's lithium-air awardees received about $5 million and $1 million, respectively.

Holman, who keeps an eye on breakthrough battery technologies like lithium-air, calls IBM "a bit of a throwback" to a time when companies did "blue-sky" research.

Some companies still do this today, Holman said, not least because the "R&D playgrounds" attract great scientists.

When asked why IBM is pursuing this research, Wilcke observed that the company doesn't currently make its money in batteries: "We have the resources, we can think long-term."

He said world demand for cars is about to double, thanks to India and China. "Now we would have to be blind not to see the mother of all opportunities: environment and the clean world," he said.

In three years, Wilcke said, he should have enough information to either advance the technology with IBM's commercial partners, or to say, "Nah, doesn't work, shut it down. I'm perfectly willing to do the latter if it turns out to be the right thing to do." 

Update:
  • 2012.04.22 - original post

Thursday, April 19, 2012

Cycle Computing Creates Supercomputer on Amazon

Cycle Computing provisioned a 50,000-core utility supercomputer in the Amazon Web Services (AWS) cloud for Schrödinger and Nimbus Discovery to accelerate lead identification via virtual screening. This milestone – the largest of its kind – is Cycle Computing’s fifth massive cluster in less than two years on the heels of a 30,000-core cluster in October 2011, illustrating Cycle’s continued leadership in delivering full-featured and scalable cluster deployments. Cycle Computing revealed the cluster creation during today’s opening keynote at the AWS Summit in New York City.

Schrödinger’s widely used computational docking application, Glide, performs high-throughput virtual screening of compound libraries to identify drug discovery leads. Computing resource and time constraints traditionally limit the extent to which ligand conformations can be explored, potentially leading to false negatives, while the same constraints may force a less accurate level of scoring, which can lead to false positives. Tapping into Cycle’s utility supercomputing, Schrödinger ran a virtual screen, in collaboration with Nimbus Discovery, of 21 million compounds against a protein target. The run required 12.5 processor-years and completed in less than three hours.

“Typically, we have to weigh tradeoffs between time and accuracy in a project,” said Ramy Farid, President, Schrödinger. “With Cycle’s utility supercomputing, we didn’t have to compromise the accuracy in favor of faster throughput, and we were able to run the virtual screen using the appropriate levels of scoring and sampling.”

The global 50,000-core cluster was run with CycleCloud, Cycle’s flagship HPC-in-the-cloud service that runs on AWS. CycleCloud replicated data across seven AWS regions and automated the provisioning of resources; run time per job averaged 11 minutes, and the total work completed topped 100,000 core-hours. Schrödinger’s researchers completed over 4,480 days of work, nearly 12.5 years of computation, in a few hours, at a peak cost of under $4,900 per hour and with no upfront capital.
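
As a back-of-the-envelope check, the figures quoted above are mutually consistent; the short script below redoes the arithmetic using only numbers taken from this release (the calculation is mine, not Cycle's).

  # Sanity-check the quoted CycleCloud run statistics (all inputs are from the release above).
  processor_years = 12.5
  core_hours = processor_years * 365 * 24        # ~109,500 core-hours, i.e. "topping 100,000 hours"

  days_of_work = 4480
  years_of_work = days_of_work / 365.0           # ~12.3 years, i.e. "nearing 12.5 years"

  cost_per_hour = 4828.85                        # total infrastructure cost per hour at peak
  cores = 51132
  cost_per_core_hour = cost_per_hour / cores     # ~$0.094 per core-hour

  wall_clock_hours = 3.0
  avg_busy_cores = core_hours / wall_clock_hours # ~36,500 cores busy on average over the run

  print("core-hours: %.0f, years: %.1f, $/core-hour: %.3f, avg busy cores: %.0f"
        % (core_hours, years_of_work, cost_per_core_hour, avg_busy_cores))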

“By leveraging AWS, Cycle Computing is able to perform highly sophisticated computations in minutes at a fraction of what it would cost for businesses to purchase the high performance computing infrastructure themselves,” said Terry Wise, Director of Business Development, Amazon Web Services. “Cycle Computing brings an incredible amount of innovation to our partner ecosystem and we’re excited to continue working with them to enable businesses to take advantage of AWS’s highly scalable, elastic and low cost technology infrastructure.”    

CycleServer, Cycle's cluster and performance analytics software, tracked utilization, diagnosed performance and managed the scientific workflow. As on earlier runs, Cycle engineers built on open source tools, including Condor, Linux, and Opscode's Chef cloud-infrastructure automation system. Cycle's Chef monitoring and analytics plug-in, called Grill, provided visibility into the scaling of the infrastructure and, with its alerting on installation data, eliminated the need for additional Chef servers, driving down preparation and operational overhead.
Leveraging CycleCloud software and Cycle’s HPC proficiency delivered these stats:
  • Infrastructure: 6,742 Amazon EC2 instances / 58.79 TB RAM
  • Global scale: Multi-datacenter clusters with simple user interfaces
  • Cluster size: 51,132 cores, 58.78 TB RAM
  • Security: Engineered with HTTPS, SSH & 256-bit AES encryption
  • AWS regions: 7 (us-east, us-west1, us-west2, eu-west, sa-east, ap-northeast, ap-southeast)
The end-user experience for using CycleCloud is:
  • User Effort: One-click global cluster at massive scale
  • Start-up Time: Thousands of cores in minutes, full cluster in 2 hours
  • Up-front Capital Investment: $0
  • Total Infrastructure Cost: $4,828.85/hour
“Researchers can now meet their aspirations and bottom line through secure, mega-elastic and fully-supported utility supercomputers,” said Jason Stowe, founder and CEO, Cycle Computing. “By harvesting the raw infrastructure from AWS, we empower Schrödinger’s scientific accuracy while allowing them to push the boundaries of computation research. Creating robust, reliable and importantly repeatable supercomputers means any industry from life sciences, risk management, quantitative finance to product design can reap the benefits as we tip the scales towards the next generation of massive clusters.”
To learn more about the development of the 50,000 core-cluster and Cycle’s projects leading up to this accomplishment, please visit the Cycle Computing blog: Compute Cycles (http://blog.cyclecomputing.com/).

About Schrödinger
Schrödinger is committed to innovation and scientific advancement in computational chemistry. Schrödinger’s complete software solutions deliver advanced simulation technologies that accelerate R&D activities, make possible novel discoveries, and provide infrastructural support for research organizations, streamlining workflows and facilitating enterprise data sharing, management, and visualization among modelers, medicinal chemists, biologists, structural biologists, and other members of a multi-disciplinary team. The predictive power of Schrödinger's software has been demonstrated in a series of successful drug discovery collaborations with pharmaceutical and biotech companies, resulting in numerous patents.
Schrödinger employs approximately 100 full-time Ph.D. scientists, and operates from locations in New York, Oregon, California, Massachusetts, Maryland, Germany, France, the UK, and Japan in order to provide the best possible service and support for its more than 2,000 commercial, government, and academic customers worldwide. For more information, visit www.schrodinger.com.

About Nimbus
Nimbus Discovery is applying advances in computer-based drug discovery to develop new medicines against important drug targets and thereby unlock fundamental biological pathways.  Nimbus has established a first-of-its-kind partnership with Schrödinger, the leader in computational drug discovery, to gain privileged access to cutting-edge technology and exclusive rights to key targets.  Nimbus has already delivered selective, potent, and differentiated compounds within the first year for two disease targets that are pivotal in the progression of an aggressive form of Non-Hodgkin’s lymphoma and obesity, respectively.  Nimbus has built a virtually integrated, globally distributed R&D organization that leverages an experienced internal drug discovery team across an external network of R&D partners.  The resulting organization is scalable, capital efficient and has attracted world-class talent.  Nimbus seeks to partner its programs with larger pharmaceutical companies early in the development process allowing Nimbus to focus on its competitive advantage in novel drug discovery.  Nimbus programs are held in target-specific subsidiaries under an LLC umbrella.  Nimbus Discovery was founded in 2009 by Atlas Venture and Schrödinger, Inc.  In 2010, Nimbus received three Qualifying Therapeutic Discovery Project Tax Credit (QTDP) grants for its programs.  For more information please visit www.nimbusdiscovery.com.

About Cycle Computing
Cycle Computing is the leader in Utility Supercomputing software. A bootstrapped, profitable software company, Cycle has delivered proven, secure and flexible high performance computing (HPC) and data solutions since 2005. Cycle helps clients maximize existing infrastructure and speed computations on servers, VMs, and on-demand in the cloud. Our products help clients maximize internal infrastructure and increase power as research demands, like the 10,000-core cluster for Genentech and the 30,000+ core cluster for a Top 5 Pharma that were covered in Wired, The Register, BusinessWeek, Bio-IT World, and Forbes. Starting with three initial Fortune 100 clients, Cycle has grown to deploy proven implementations at Fortune 500s, SMBs, and government and academic institutions including JP Morgan Chase, Purdue University, Pfizer and Lockheed Martin.

About Amazon Web Services
Launched in 2006, Amazon Web Services (AWS) began exposing key infrastructure services to businesses in the form of web services -- now widely known as cloud computing. The ultimate benefit of cloud computing, and AWS, is the ability to leverage a new business model and turn capital infrastructure expenses into variable costs. Businesses no longer need to plan and procure servers and other IT resources weeks or months in advance. Using AWS, businesses can take advantage of Amazon's expertise and economies of scale to access resources when their business needs them, delivering results faster and at a lower cost. Today, Amazon Web Services provides a highly reliable, scalable, low-cost infrastructure platform in the cloud that powers hundreds of thousands of enterprise, government and startup businesses in 190 countries around the world. AWS offers over 28 different services, including Amazon Elastic Compute Cloud (Amazon EC2), Amazon Simple Storage Service (Amazon S3) and Amazon Relational Database Service (Amazon RDS). AWS services are available to customers from data center locations in the U.S., Brazil, Europe, Japan and Singapore.

Update:
  • 2012.04.19 - original news release posted

Tuesday, April 17, 2012

jPage: Big Data Analytics

Big Data consists of datasets that grow so large (and often so fast) that they become extremely challenging to work with using traditional database management tools on limited computing platforms. These challenges include capture, storage, search, sharing, analytics, and visualization.

The rapid rise of websites, online gaming, social media and network/cloud computing drives typical data volumes into terabytes and petabytes rather than megabytes, with unstructured data making up the bulk of the increase. At the same time, the need to understand business operations, customers and prospects has never been more important, as all businesses face stiff competition for finite customer revenue. Data volumes are also surging in research fields such as engineering, metrology, and biotech.

Social media and internet search are leading the way in big data application development and adoption. A good example is Apache Hadoop. As big data tools and methods mature, traditional business and research organizations are starting to adopt the technology.

One current feature of big data is the difficulty of working with it using standard analytics tools; it requires instead massively parallel software running on hundreds, or even thousands, of servers. The size of "big data" varies depending on the capabilities of the organization managing the set. For some organizations, facing hundreds of gigabytes of data for the first time may trigger a need to reconsider data management options. For others, it may take tens or hundreds of terabytes before data size becomes a significant consideration. (Wikipedia)

This post will track the development and adoption of big data solutions. As my interest in Big Data centers on research, I will also add use cases from fields such as life sciences and healthcare.

I. Traditional Enterprise Data Warehouse

1.1. MPP Data Warehouse Appliance

Big Data analytics solutions are typically constructed on massively parallel processing (MPP) platforms with high query performance and platform scalability. Essentially, these platforms are supercomputers with data warehouses that allow rapid data manipulation and hyper fast calculation speeds. Big volumes of data and nearly unlimited processing capability enable solutions that were inconceivable in the past. 

Each Netezza data warehouse appliance features IBM Netezza Analytics, an embedded software platform that fuses data warehousing and predictive analytics to provide petascale performance. IBM Netezza Analytics provides the technology infrastructure to support enterprise deployment of parallel in-database analytics. The programming interfaces and parallelization options make it straightforward to move a majority of analytics inside the appliance, regardless of whether they are being performed using tools such as IBM SPSS or SAS or written in languages such as R, C/C++ or Java.  A key performance advantage of the IBM Netezza data warehouse appliance family comes from its patented Asymmetric Massively Parallel Processing (AMPP™) architecture. AMPP combines open, blade-based servers and commodity disk storage with IBM's patented data filtering using Field Programmable Gate Arrays (FPGAs). This combination delivers blisteringly fast query performance and modular scalability on highly complex mixed workloads, and supports tens of thousands of BI, advanced analytics and data warehouse users. 
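
To illustrate the in-database idea in general terms: rather than pulling millions of rows out of the warehouse into an analytics tool, the heavy computation is expressed in SQL (or as an in-database function) so that only a small result set leaves the appliance. The sketch below is a minimal, generic illustration of that pattern, not Netezza's actual analytics API; the ODBC data source name and the sales table are hypothetical.

  # Minimal sketch of the in-database analytics pattern: push the aggregation
  # into the MPP warehouse and fetch only the summary rows.
  # The DSN ("WAREHOUSE") and the `sales` table are placeholders.
  import pyodbc

  conn = pyodbc.connect("DSN=WAREHOUSE")
  cur = conn.cursor()

  # The appliance scans and aggregates in parallel across its data slices;
  # the client receives just one row per region.
  cur.execute("""
      SELECT region,
             COUNT(*)    AS n_orders,
             SUM(amount) AS revenue,
             AVG(amount) AS avg_order
      FROM   sales
      WHERE  order_date >= '2012-01-01'
      GROUP  BY region
  """)

  for region, n_orders, revenue, avg_order in cur.fetchall():
      print("%-12s %8d  $%14.2f  $%10.2f" % (region, n_orders, revenue, avg_order))

  conn.close()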

SAS In-Database Processing
SAS In-Database processing is a flexible, efficient way to leverage increasing amounts of data by integrating select SAS technology into databases or data warehouses. It utilizes the massively parallel processing (MPP) architecture of the database or data warehouse for scalability and better performance. Moving relevant data management, analytics and reporting tasks to where the data resides is beneficial in terms of speed, reducing unnecessary data movement and promoting better data governance. According to SAS, the solution is jointly developed by SAS and Teradata and consists of SAS accelerator for Teradata (scoring acceleration, analytics acceleration) and Teradata Enterprise Data Warehouse (see more at SAS Analytic Advantage for Teradata).

Teradata provides database software for data warehouses and analytic applications. Its products are meant to consolidate data from different sources and make the data available for analysis.

Aginity combines deep experience in big data and big math to build and deploy customer analytic solutions, creating customer intimacy at scale. Its solutions and methodologies address the challenges of disparate data, database size, data quality, automated advanced analytics, and interactive reporting. The solution it provides is the Aginity Netezza Workbench.

1.2. Database Appliance

Oracle Exadata is a database appliance with support for both OLTP and OLAP workloads. Exadata was initially manufactured, delivered and supported by HP. Since Oracle's acquisition of Sun Microsystems in January 2010, Exadata has shifted to Sun-based hardware. Oracle claims that it is "the fastest database server on the planet".

II. Evolving Scale-out Architecture

2.1 Apache Hadoop

Apache Hadoop is a powerful open source software package designed for sophisticated analysis and transformation of both structured and unstructured complex data. Technically, Hadoop consists of two key services: reliable data storage using the Hadoop Distributed File System (HDFS) and high-performance parallel data processing using a technique called MapReduce. Inspired by Google's internal systems and originally developed and employed at Web companies like Yahoo and Facebook, Hadoop is now widely used in finance, technology, telecom, media and entertainment, government, research institutions and other industries with significant data. With Hadoop, organizations can explore complex data using custom analyses tailored to their information and questions.
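
As a concrete taste of the MapReduce model, here is the classic word-count job written for Hadoop Streaming, which lets a plain script act as mapper and reducer by reading stdin and writing stdout. This is a minimal sketch; the script name and the HDFS input/output paths in the usage comment are placeholders.

  #!/usr/bin/env python
  # wordcount.py -- usable as both mapper and reducer under Hadoop Streaming.
  # Mapper:  emits "word<TAB>1" for every word read from stdin.
  # Reducer: sums the counts per word (Hadoop sorts mapper output by key first).
  import sys

  def mapper():
      for line in sys.stdin:
          for word in line.split():
              print("%s\t%d" % (word.lower(), 1))

  def reducer():
      current, total = None, 0
      for line in sys.stdin:
          word, count = line.rstrip("\n").split("\t")
          if word != current:
              if current is not None:
                  print("%s\t%d" % (current, total))
              current, total = word, 0
          total += int(count)
      if current is not None:
          print("%s\t%d" % (current, total))

  if __name__ == "__main__":
      mapper() if len(sys.argv) > 1 and sys.argv[1] == "map" else reducer()

  # Submitted with the Hadoop Streaming jar, for example:
  #   hadoop jar hadoop-streaming.jar -mapper "wordcount.py map" \
  #       -reducer "wordcount.py reduce" -input <hdfs-in> -output <hdfs-out>
  # HDFS stores the input blocks; MapReduce schedules mappers next to the data.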

While the open source Hadoop technology offers nearly unlimited scalability at very low cost, it is a raw toolset with a command line interface that requires extensive Java programming and a significant IT resource commitment to function as an analytic solution. The following solutions provide Hadoop-based, user-friendly analytics platforms in the form of appliances or software.

Available for free download, Cloudera's Distribution Including Apache Hadoop (CDH) delivers a streamlined path for putting Apache Hadoop to work solving business problems in production.

IBM InfoSphere BigInsights brings the power of Hadoop to the enterprise.  BigInsights enhances Apache Hadoop technology to withstand the demands of enterprise, adding administrative, workflow, provisioning, and security features, along with best-in-class analytical capabilities from IBM Research. The result is a more developer and user-friendly solution for complex, large scale analytics.

The Oracle Big Data Appliance is a pre-integrated full-rack configuration with 18 of Oracle's Sun servers, including InfiniBand and Ethernet connectivity to simplify implementation and management. It supports Apache Hadoop through Cloudera CDH, and its additional system software includes Oracle NoSQL Database, Oracle Linux, the Oracle Java HotSpot VM, and an open source distribution of R.

Datameer leverages the scalability, flexibility and cost-effectiveness of Apache Hadoop to deliver an end-user focused BI solution for big data analytics. Datameer overcomes Hadoop's complexity and lack of tools by providing business and IT users with BI functionality across data integration, analytics and data visualization in the world's first BI platform for Hadoop.  

Pentaho Business Analytics also offers native support for the most popular big data sources including Hadoop, NoSQL and analytic databases. Using Pentaho Business Analytics with Hadoop allows easy management, integration, and speed-of-thought analysis and visualization of Hadoop data.

2.2 Apache Cassandra

The Apache Cassandra database is an open source distributed database management system. It is designed for scalability and high availability without compromising performance. Linear scalability and proven fault tolerance on commodity hardware or cloud infrastructure make it a strong platform for mission-critical data. Cassandra's support for replicating across multiple data centers is best-in-class, providing lower latency for your users and the peace of mind of knowing that you can survive regional outages.
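
To make that concrete, here is a minimal write/read sketch using pycassa, the common Python client for Cassandra at the time. The keyspace, column family and host are placeholders and are assumed to already exist.

  # Minimal Cassandra read/write sketch using the pycassa client (circa 2012).
  # Assumes a local node and a pre-created keyspace "Journey" containing a
  # column family "Users"; all names here are placeholders.
  import pycassa

  pool = pycassa.ConnectionPool("Journey", server_list=["localhost:9160"])
  users = pycassa.ColumnFamily(pool, "Users")

  # The write goes to whichever nodes own the row key and is replicated
  # according to the keyspace's replication strategy.
  users.insert("jlee", {"name": "Journey Lee", "city": "St. Louis"})

  print(users.get("jlee"))      # returns the row's columns as an ordered mapping
  pool.dispose()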




Updates:
  • 2012.03.29 - original post
  • 2012.04.06 - adding Datameer
  • 2012.04.12 - adding Pentaho
  • 2012.04.16 - adding SAS In-database Processing 
  • 2012.09.23 - this blog post becomes part of the jPage

j-Tool (file system): Ceph


Ceph is a distributed network storage and file system designed to provide excellent performance, reliability, and scalability.  Ceph is based on a reliable and scalable distributed object store, with a distributed metadata management cluster layered on top to provide a distributed file system with POSIX semantics.  There are a variety of ways to interact with the system:
  • Distributed file system - The Ceph Distributed Filesystem (CephFS) is a scalable network file system that aims for high performance, large data storage, and POSIX compliance.
  • Object storage - RADOS (Reliable Autonomic Distributed Object Storage) is the lower object storage layer of the Ceph storage system. A librados library provides applications direct access to the underlying distributed object store (a short librados sketch follows this list). Clients talk directly with storage nodes to store named blobs of data and attributes, while the cluster transparently handles replication and recovery internally.
  • Swift and S3-compatible storage - The RADOS Gateway allows seamless data access to objects through direct or signed URLs, provides a RESTful API for the distributed object storage (RADOS) and is compatible with both Swift and S3 REST APIs.
  • Rados block device (RBD) - The RBD driver provides a shared network block device via a Linux kernel block device driver (2.6.37+) or a Qemu/KVM storage driver based on librados.
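
To make the librados item above concrete, here is a minimal object write/read using the python-rados bindings. It assumes a running Ceph cluster, a readable ceph.conf, and an existing pool named "data"; the pool name and paths are placeholders.

  # Minimal RADOS object store sketch using the python-rados bindings.
  import rados

  cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")
  cluster.connect()

  ioctx = cluster.open_ioctx("data")                        # I/O context bound to one pool
  ioctx.write_full("hello-object", b"hello from librados")  # store a named blob
  print(ioctx.read("hello-object"))                         # read it back; replication is transparent

  ioctx.close()
  cluster.shutdown()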

History

Ceph has grown out of the petabyte-scale storage research at the Storage Systems Research Center at the University of California, Santa Cruz. The project was funded primarily by a grant from the Lawrence Livermore, Sandia, and Los Alamos national laboratories.



Updates:
  • 2012.04.17 - original post

Monday, April 16, 2012

j-Tool: OpenStack

OpenStack is a global collaboration of developers and cloud computing technologists that seeks to produce a ubiquitous Infrastructure as a Service (IaaS) open source cloud computing platform for public and private clouds.

OpenStack was founded jointly by Rackspace Hosting and NASA in July 2010. Over time, 150+ companies have joined the project to various degrees, including Dell, AMD, Intel, Cisco, HP, Red Hat, SUSE and Canonical. IBM and Red Hat joined as Platinum sponsors in April 2012.

OpenStack has taken the next crucial step toward a truly open environment, establishing an independent, open governance structure to develop and mature the code and the ecosystem surrounding the project.
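
As a small illustration of what IaaS means in practice, the sketch below boots a single instance against an OpenStack compute endpoint using the python-novaclient library of that era. The credentials, endpoint URL, image and flavor names are all placeholders.

  # Minimal OpenStack (Nova) sketch: boot one instance via python-novaclient
  # (the v1_1 API current around 2012). All names and credentials are placeholders.
  from novaclient.v1_1 import client

  nova = client.Client("myuser", "mypassword", "myproject",
                       "http://openstack.example.com:5000/v2.0/",
                       service_type="compute")

  image = nova.images.find(name="ubuntu-12.04")   # image to boot from
  flavor = nova.flavors.find(name="m1.small")     # instance size

  server = nova.servers.create(name="journey-test",
                               image=image.id,
                               flavor=flavor.id)
  print("%s %s" % (server.id, server.status))     # typically BUILD, then ACTIVE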


Updates:
  • 2012.04.15 - original post 

jPage: Guavus

In the HPC Hunger Games of Big Data, how does one make sense of large amounts of data? The following is a WSJ blog post about how Guavus tackles the challenge.

Words are a likely culprit, said Anukool Lakhina, the founder and chief executive of Guavus, a data analytics start-up serving the telecommunications industry. “The vocabulary that’s being used to describe this space is tools and technologies” and that confuses customers because it doesn’t help them understand how data can be used to solve their problems, he said.

Most of the emerging technologies that are behind the excitement around data, like Hadoop and NoSQL databases, imply that if companies store their data with the right tools, they’ll be able to extract value from it later. But how can companies expect to justify a $20 million investment in storage without knowing what the return will be, Lakhina said.

Though Guavus also provides storage tools–as well as every other piece of technology that a company needs to make sense of large amounts of data, from ingestion through analytics–it puts the analytics first in its sales pitch and that’s resonating so far.

Its customers are three of the four largest mobile operators in the U.S., including Sprint, and four of the top five Internet backbone operators. Most of its deals are for multiple millions of dollars, including one customer that’s generating at least $10 million in revenue for the San Mateo, Calif. company, Lakhina said, though he wouldn’t be more specific about the company’s finances or customers.

Guavus initially helps customers translate their business problems into data problems, and then over the course of about a month, Guavus implements its analytics technology onto a customer’s existing data infrastructure to demonstrate the value of the technology without a big upfront investment. Once the value is clear, customers are willing to expand their use of Guavus, Lakhina said.

It can help customers build capabilities to analyze the potential for customer churn, bundling and pricing and to analyze network traffic and routing.

Guavus was founded in 2006, but the genesis of the company goes back more than 10 years, to when Lakhina was at Sprint Labs working on a project that collected sensor data to better understand the company's networks. Back then, the team would fill up its entire data storage system in an hour or two, remove the hard drives and mail them to a central repository to be analyzed; the amount of data would have overwhelmed the network.

“We had continuous measurement of what was happening on the network, but what was missing was the ability to make timely decisions,” Lakhina said.

It took the company two years to learn the needs of the telecommunications industry, but Lakhina said the learning curve would be shorter as it branches into new sectors.


Though it started in telecommunications, Guavus has pilot tests running at companies in utilities, transportation and manufacturing. “All these folks are seeing the data avalanche, but are often ignored,” he said.



Saturday, April 14, 2012

Bring Kids to Science - Tiffany & Science Olympiad

My daughter Tiffany Lee likes science and is a member of the Lindbergh High School Science Olympiad team. Today (2012.04.14), her team spent a full day competing at the Missouri Science Olympiad State Tournament. The event was held on the Mizzou campus, so they took a school bus to and from Columbia, Missouri. I drove to Lindbergh about 9 p.m. to pick up Tiffany. The members of the team seemed exhausted yet happy.

The Lindbergh team qualified for the state tournament at the Missouri Region 6 Tournament, held at Lindenwood University on 2012.02.04.

At the State Tournament, Lindbergh High placed 8th in the team event and Tiffany took third place in an individual event. Tiffany had also placed third in all of her individual events at the Regional Tournament.


The three events they competed in are: 
  • Disease Detective - This event requires students to apply principles of epidemiology to a published report of a real-life health situation or problem. (Food Borne Illness)
  • Microbe Mission - Teams will answer questions, solve problems and analyze data pertaining to microbes.
  • Fermi - A Fermi Question is a science-related question that seeks a fast, rough estimate of a quantity that is difficult or impossible to measure directly. Answers are estimated within an order of magnitude and recorded in powers of ten.

Recognized as a model program by the National Governors Association Center for Best Practices in the 2007 report, Innovation America: Building a Science, Technology, Engineering and Math Agenda, Science Olympiad is committed to increasing global competitiveness for the next generation of scientists.

Science Olympiad

Missouri State Science Olympiad is a non-profit organization that operates under the National Science Olympiad which is a national non-profit organization dedicated to improving the quality of K-12 science education, increasing male, female and minority interest in science, creating a technologically literate workforce and providing recognition for outstanding achievement by both students and teachers.  These goals are achieved by participating in Science Olympiad tournaments and non-competitive events, incorporating Science Olympiad into classroom curriculum and attending teacher training institutes.

For over 25 years, Science Olympiad has led a revolution in science education.  In the face of shrinking college enrollment in science majors, falling science test scores and a nationwide shortage of K-12 science teachers, Science Olympiad continues to challenge, inspire and inform the academic and professional careers of students and instructors across America.

Science Olympiad Competition

Each team is allowed to bring 15 students who may participate in a variety of events in their skill set.  Practices vary from monthly meetings to weekly study sessions to daily work as tournaments near.  Science Olympiad competitions are like academic track meets, consisting of a series of 23 team events in each division.  Every year, a portion of the events are rotated to reflect the ever changing nature of genetics, earth science, chemistry, anatomy, physics, geology, astronomy, mechanical engineering and technology.

Emphasis is placed on active, hands-on, group participation.  Through the Olympiad, students, teachers, coaches, principals, business leaders and parents bond together and work toward a shared goal.  Teamwork is a required skill in most scientific careers today, and Science Olympiad encourages group learning by designing events that forge alliances.

The prestige of winning a medal at a Science Olympiad tournament, whether regional, state, or national,  is often a springboard to success.




Updates:
  • 2012.04.14 - original post

The Science Behind an Answer - Everything about Watson Jeopardy!

IBM's Watson supercomputer took on Jeopardy! all-time champions Ken Jennings and Brad Rutter in a three-day IBM Challenge round in February 2011. Here I compile the most pertinent information to recap the historic event and provide some food for thought.

The challenge in building a computer system like Watson lies in developing its ability to understand the language of a clue, register the intent of a question, scour millions of lines of human language, weigh the evidence from the analysis, and return a single, precise answer, all in a split second on the Jeopardy! stage.

Preparing for the Game
Preparing Watson for the Jeopardy! stage posed a unique challenge to the team: how to represent a system of 90 servers and hundreds of custom algorithms for the viewing public. IBM, in collaboration with a team of partners, created a representation of this computing system for the viewing audience -- from its stage presence to its voice.




The Jeopardy! Game
After competing against the two greatest Jeopardy! champions of all time, the technology behind Watson will now be applied to some of the world's most enticing challenges. Watch a breakdown of the match from Ken Jennings, Brad Rutter and the IBM team members as they look toward the future.



More videos about the three-day competition:

Post-Game Panel Review
Join host Stephen Baker, author of "Final Jeopardy: Man vs. Machine and the Quest to Know Everything", as he discusses Watson's performance on Jeopardy! and the possible real-world applications of this technology with the following panelists: IBM Watson Principal Investigator Dr. David Ferrucci, IBM Fellow and CTO of IBM's SOA Center for Excellence Kerrie Holley, and Columbia University Professor of Clinical Medicine Dr. Herbert Chase. See the video at TED.com.


How Watson Answers the Question
In this video, the four steps of Watson's question answering technology are covered (a toy sketch of the pipeline follows the list):
  • question analysis
  • hypothesis generation
  • hypothesis and evidence scoring
  • final merging & ranking
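
The listing below is a toy sketch of how those four stages can be wired together as a pipeline, with each stage narrowing or re-scoring a set of candidate answers. It is purely illustrative; the candidate generation and scoring here are simple stand-ins, not IBM's DeepQA algorithms.

  # Toy question-answering pipeline mirroring the four stages listed above.

  def analyze_question(clue):
      """Stage 1: question analysis -- pull out keywords (DeepQA also infers the answer type)."""
      return [w.strip(".,?!").lower() for w in clue.split() if len(w) > 3]

  def generate_hypotheses(keywords, corpus):
      """Stage 2: hypothesis generation -- candidate answers from a keyword search over the corpus."""
      return [doc["title"] for doc in corpus
              if any(k in doc["text"].lower() for k in keywords)]

  def score_evidence(candidate, corpus):
      """Stage 3: hypothesis and evidence scoring -- how much corpus support does the candidate have?"""
      return sum(doc["text"].lower().count(candidate.lower()) for doc in corpus)

  def merge_and_rank(candidates, scores):
      """Stage 4: final merging & ranking -- best answer plus a crude confidence estimate."""
      ranked = sorted(zip(candidates, scores), key=lambda cs: cs[1], reverse=True)
      best, best_score = ranked[0]
      return best, best_score / float(sum(scores) or 1)

  corpus = [
      {"title": "Toronto", "text": "Toronto is the largest city in Canada."},
      {"title": "Chicago", "text": "Chicago has two airports; Chicago's O'Hare is named for a war hero."},
  ]

  keywords = analyze_question("Its largest airport is named for a World War II hero?")
  candidates = generate_hypotheses(keywords, corpus)
  scores = [score_evidence(c, corpus) for c in candidates]
  print(merge_and_rank(candidates, scores))   # -> ('Chicago', 0.66...)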




Watson after Jeopardy!
Watson was optimized to tackle a specific challenge: competing against the world's best Jeopardy! contestants. Beyond Jeopardy!, the IBM team is working to deploy this technology across industries such as healthcare, finance and customer service.




Blog updates:
  • 2012.04.13 - original post

jPage: IBM Watson

IBM Watson is a new class of industry specific analytical capability that uses deep content analysis, evidence-based reasoning and natural language processing to identify relationships buried in large volumes of data that can be used to improve decision making.



j-Tool: Varicent

Varicent Software Incorporated, a global pioneer in Incentive Compensation and Sales Performance Management, delivers innovative, industry-leading solutions for finance, sales, human resources and IT departments in high-performing companies across industries.

Varicent solutions streamline administrative processes, maximize efficiencies and drive improved sales performance.
  • Varicent for Enterprise - Design and manage complex pay-for-performance and compensation programs including sales commissions, MBOs and non-cash rewards.
  • Varicent for Midmarket - Create and administer variable pay programs, automate commission calculations, and quickly distribute personalized payout results to your entire sales team.

In addition to ranking in the top 5 two years running in the PROFIT 100 list of Canada's Fastest-Growing Companies by PROFIT Magazine, Varicent was named the highest overall "hot" vendor in the Ventana Research 2011 SPM Value Index.


Blog Updates:
  • 2012.04.14 - original post

Friday, April 13, 2012

IBM Analytics Goes Deep with Varicent Acquisition

IBM announced today (2012.04.13) that it has acquired Varicent Software, a company that creates sales analytics software. Financial terms of the deal were not disclosed.
 
Varicent analyzes sales data to help organizations streamline compensation processes for employees, improve sales performance, and more. Varicent's software automates and analyzes sales data across a number of parts of an organization, including the finance, sales, human resources and IT departments, and can uncover trends that could lead to better sales and revenue for a company. Unlike traditional tools, which often involve labor-intensive processes, Varicent provides a single management system that relies on a sophisticated calculation engine to model and analyze the effectiveness of incentive spend.
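
As a toy illustration of the kind of pay-for-performance logic an incentive compensation engine evaluates at scale, here is a generic tiered-commission calculation. The plan tiers and figures are made up for the example and have nothing to do with Varicent's actual calculation engine.

  # Toy tiered sales-commission calculation: each slice of quota attainment
  # is paid at its tier's rate. A hypothetical plan, for illustration only.
  TIERS = [(0.0, 0.02), (0.8, 0.04), (1.0, 0.06)]   # (attainment threshold, commission rate)

  def commission(sales, quota):
      attainment = sales / float(quota)
      payout = 0.0
      for i, (lower, rate) in enumerate(TIERS):
          upper = TIERS[i + 1][0] if i + 1 < len(TIERS) else attainment
          band = max(0.0, min(attainment, upper) - lower)
          payout += band * quota * rate
      return payout

  # Example: $950,000 in sales against an $800,000 quota -> $28,200 commission
  print("commission: $%.2f" % commission(950000, 800000))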

Varicent’s software is used by over 200 banks, insurance companies, retailers, information technology and telecommunications providers. Clients include Starwood Hotels, Covidien, Dex One, Manpower, Hertz, Office Depot, Farmers, SugarCRM, Reliance Standard Life Insurance, Silverpop, Tribune, and AAA Northern California. Varicent offers a software catered to larger enterprises as well as a product that focuses on smaller teams. 

This acquisition ties into IBM’s focus on providing in-depth analytics offerings to businesses.
IBM will combine Varicent with its R&D and prior acquisitions including Algorithmics, Clarity Systems, OpenPages, Cognos and SPSS, to expand IBM capabilities in business analytics and optimization across finance, sales, and customer service operations. These acquisitions are part of IBM's larger focus on analytics, which spans hardware, software, services and research.



Thursday, April 12, 2012

j-Tool: Pentaho

Pentaho tightly couples data integration with business analytics in a modern platform that brings together IT and business users to easily access, visualize and explore all data that impacts business results.

Pentaho Business Analytics is a complete end-to-end solution that includes business intelligence, data integration and data mining capabilities – providing power for technologists and rapid insight for users. Pentaho Business Analytics enables business users to intuitively access, discover and analyze their data, empowering them to make information-driven decisions that positively impact their organization's performance.   

The Forrester Wave gave Pentaho good marks in its Q1 2012 report: "Pentaho is a Strong Performer with an impressive Hadoop data integration tool. Among data integration vendors that have added Hadoop functionality to their products over the past year, it has the richest functionality and the most extensive integration with open source Apache Hadoop..."

The following are some key features of Pentaho Business Analytics:
  • Access to any data - Pentaho provides a web-based interface for business users to access any data they wish to use in reporting, analysis and dashboards. With a simple wizard-based approach, business users can turn their data into insight and make information-driven decisions in minutes.
  • Reporting - Reporting capabilities span the entire continuum from self-service interactive reporting for casual business users, to high-volume, highly formatted enterprise reporting.
  • Dashboards - Pentaho dashboards deliver key performance indicators in a highly graphical user interface.
  • Data analysis - With an intuitive, interactive web user interface, non-technical users can freely explore and visualize their business data by multiple dimensions such as product, geography and customer.
  • Data integration & quality -  Pentaho Data Integration delivers powerful extract, transform and load (ETL) capabilities using an intuitive and rich graphical design environment that enables users to do exactly what most skilled code developers can accomplish – in a fraction of the time.
  • Data mining and predictive analytics - The powerful, state-of-the-art machine learning algorithms and data processing tools in Pentaho Business Analytics enable users to uncover meaningful patterns and correlations that may otherwise be hidden with standard analysis and reporting. The sophisticated analytics help you understand prior business performance for better planning of future outcomes. Dozens of powerful algorithms are included, spanning classification, regression, clustering and association.
  • Mobile -  Pentaho gives business users on the go complete data discovery and interactive analysis capabilities with a powerful visual experience on the iPad. Mobile users instantly become more productive by accessing, analyzing and sharing business information from anywhere.
  • Multiple deployment models - Pentaho is designed for flexible deployment anywhere, including:
    • On-premise – running on your own in-house servers
    • Cloud – running on public or private Cloud platforms, including Amazon EC2 and others
    • Integrated or embedded – into other on-premise or SaaS applications, for example CRM, ERP or financial applications

Pentaho Business Analytics also offers native support for the most popular big data sources including Hadoop, NoSQL and analytic databases.

Using Pentaho Business Analytics with Hadoop allows easy management, integration, and speed-of-thought analysis and visualization of Hadoop data and enables:
  • Quick and easy analytics against big data
  • Easier to maintain solutions
  • Integration of big data tasks into the overall IT/ETL/BI solutions
  • ETL engine distributed across the Hadoop cluster
  • Support for multiple Hadoop distributions



IBM Joins OpenStack Bandwagon

The open-source OpenStack cloud infrastructure stack has gained a number of additional powerful allies, as IBM and Red Hat have both agreed to support the OpenStack Foundation, organizers behind the soon-to-be-created organization announced yesterday (2012.04.11).  The group expects to make more announcements about the foundation's progress next week at the OpenStack Spring Conference, being held in San Francisco.

Both companies have agreed to join the foundation as platinum members, meaning they will contribute US$500,000 per year for the next three years. The companies will also contribute code changes to the software stack as well.

In addition to these two companies, AT&T, Canonical, Hewlett-Packard, Nebula, Rackspace and SUSE are also listed by the OpenStack Foundation as platinum members. A number of other companies have pledged their support at a lower level, including Cisco, ClearPath Networks, Cloudscaling, Dell, DreamHost, NetApp, Piston Cloud Computing and Yahoo. This second tier, called the gold level, requires contributions of between $50,000 and $200,000 a year, depending on company revenue.

Last year, Rackspace, which has been overseeing the OpenStack development process, announced that it would spin off the OpenStack project as a stand-alone foundation. Since then, the project's managers and contributors have been working out the details of how the foundation would work.

With the legal help of the committed sponsors, the organizers behind the foundation will write a set of bylaws for the organization, which then will be posted for community review. They expect to ratify the final draft by September.

This expansion of the foundation is an important milestone for OpenStack as it evolves from an effort driven by Rackspace and NASA into a broad-based coalition. Its goal is to provide an open source cloud platform alternative to Amazon Web Services. Another reason IBM is such an important addition is its track record in making the Eclipse Foundation a success, largely by ceding control over the open source Java IDE to an outside body.

As Mark Collier wrote in his OpenStack announcement blog - "at the start of the process, Jonathan Bryce and I spent the first couple of months learning as much as we could about successful open source foundations, like the ASF, Eclipse, and the Linux Foundation, reading foundation meeting minutes into the wee hours of the morning ..."

Looks like their efforts are starting to pay off.



Blog updates:
  • 2012.04.12 - original post

Wednesday, April 11, 2012

IBM Unveils PureSystems (Expert Integrated Systems)

A few months ago, I was pulled aside in a Kansas City office and signed a company non-disclosure agreement before being briefed on Project Troy, an IBM-confidential initiative to build and launch a next generation platform (NGP) for converged and cloud computing.

IBM today (2012.04.11) unveiled PureSystems, a ground-breaking new family of expert integrated systems. Representing a $2 billion investment and involving thousands of IBMers in 17 labs in 37 countries, these solutions give customers the flexibility of a general purpose system, the elasticity of cloud and the simplicity of an appliance tuned to the workload. Through launch events in New York City, London, Sao Paulo, Mumbai, Shanghai and Tokyo, along with a global web broadcast and hundreds of smaller local events, IBM is showcasing the first two members of the PureSystems family:

PureFlex System: integrates server, software, networking and storage nodes, and arrives pre-integrated and ready to deploy. Key attributes of PureFlex include:
    • Factory integrated and optimized system infrastructure
    • Management integration across physical and virtual resources
    • Automation and optimization expertise
    • Built for cloud, as a foundation for Infrastructure as a Service offering
PureApplication System: inherits infrastructure elements from the PureFlex System – like virtualization and hardware management – and delivers pre-integrated capabilities at the middleware stack to allow clients to handle a wide variety of workloads. The software stack includes web, database, and Java applications. It is built for cloud, offering simplicity, efficiency, and control – serving as a virtualized application platform that optimizes application workloads and accelerates time to value. Key attributes include:
    • Built-in expertise pre-optimizes web, database, and application workloads, with elastic scalability
    • Repeatable self-service provisioning, on a resilient, secure, scalable infrastructure
    • Simplified platform management with a single console
    • A foundation for Platform as a Service offerings



Expert integrated systems have three distinct characteristics:

  • Built-in expertise: Systems must capture and automate best practices and expertise, reducing the manual steps that slow a project's time to value, with an open architecture that allows participating solution providers to optimize their application workloads. For the first time, IBM is embedding technology and industry expertise through first-of-a-kind software that allows the systems to automatically handle basic, time-consuming tasks such as configuration, upgrades, and application requirements. This enables customers to consolidate and reduce their IT footprint and lower costs, facilitate the sharing of resources, and accelerate the movement to the cloud. 
  • Integration by Design: All hardware and software components must be integrated by design, tuned in the lab, and pre-packaged in the factory into a single ready-to-go system optimized for the business task. Each of the hardware systems can support a mix of server CPUs (x86 and Power) as well as operating systems (Windows, Linux, AIX), allowing customers to configure the system to meet their specific needs and simplify migration of existing environments.
  • Simplified Experience: IT staff and the lines of business that consume IT experience a simplified systems lifecycle. Collections of hardware, middleware, and application components no longer need to be separately procured, configured, integrated, tuned, and managed. PureSystems are simply ordered, unpacked, plugged in and managed as a single system with a single interface.

PureFlex System and PureApplication System share technology with IBM SmartCloud and provide IaaS and PaaS capabilities out-of-the-box, dramatically reducing the time and effort it takes to create private/hybrid cloud environments. They mark an important step forward in  how customers will think about, and interact with, their IT environments in the future. With the hardware running the IBM SmartCloud, one could describe this offering as a cloud-in-a-box.

Customers will also be able to access applications optimized for PureSystems from more than 100 IBM partners and solution providers through the new online PureSystems Centre. Now, in the same way a consumer can view and download apps to their phone, the clients can select and install new software to their enterprise IT.




PureSystems represents a new and exciting mile marker in IBM's second century of innovation. There is much anticipation around what this next era of innovation -- and integration --  will bring. I'm excited that IBM continues to make ground-breaking investment in the fields of technical and cloud computing, first the acquisition of Platform Computing earlier this year and now the release of PureSystems this month.

Notable quotes:
  • "A new era of computing requires a new kind of infrastructure. IBM PureSystems have been designed and engineered from the start to work together to be flexible, open and easy to manage.  All of this fundamentally improves the economics of IT for our clients."  - Rod Adkins, senior vice president, IBM Systems & Technology Group
  • "When looked at as a TCO question over the life of the applications, the fact that applications get certified for operation in the PureSystems environment, and the potential that has to limit problems that can be encountered when one rolls their own solutions, IBM has a good argument for their design choices." - David Chernicoff, ZDNet (blog)


Blog Updates:
  • 2012.04.11 - original post
  • 2012.04.12 - updated with more system features

jPage: IBM PureSystems


Updates:
  • 2012.04.11 - original  post
  • 2012.04.16 - added "The Four Hundred" post and link to PureSystems Center