Tuesday, April 17, 2012

jPage: Big Data Analytics

Big Data consists of datasets that grow so large (and often so fast) that they become extremely challenging to work with using traditional database management tools on limited computing platforms. These challenges include capturing, storing, searching, sharing, analyzing, and visualizing data.

The rapid rise of websites, online gaming, social media and network/cloud computing drives typical data volumes into terabytes and petabytes rather than megabytes, with non-structured data making up the bulk of the increase. At the same time, the need to understand business operations, customers and prospects has never been more important, as all businesses face stiff competition for finite customer revenue. Data volumes are also surging in research fields such as engineering, metrology, and biotech.

Social media and internet search are leading the way in big data application development and adoption; Apache Hadoop is a good example. As big data tools and methods mature, traditional business and research organizations are starting to adopt the technology.

One current feature of big data is the difficulty of working with it using standard analytics tools, requiring instead massively parallel software running on hundreds or even thousands of servers. The size of "big data" varies depending on the capabilities of the organization managing the set. For some organizations, facing hundreds of gigabytes of data for the first time may trigger a need to reconsider data management options. For others, it may take tens or hundreds of terabytes before data size becomes a significant consideration. (Wikipedia)

This post will track the development and adoption of solutions for big data. As my interest in Big Data centers on research, I will also add use cases from fields such as life sciences and healthcare.

I. Traditional Enterprise Data Warehouse

1.1. MPP Data Warehouse Appliance

Big Data analytics solutions are typically built on massively parallel processing (MPP) platforms that offer high query performance and platform scalability. Essentially, these platforms are supercomputers purpose-built for data warehousing that allow rapid data manipulation and hyper-fast calculation speeds. Large volumes of data and nearly unlimited processing capability enable solutions that were inconceivable in the past.
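The core MPP idea above can be sketched in a few lines: rows are hash-partitioned across nodes, each node computes a partial aggregate over its own slice in parallel, and a coordinator merges the partials. This is a toy illustration, not any vendor's implementation; the data and node count are made up.

```python
# Toy sketch of MPP scatter-gather aggregation (illustrative only).

def partition(rows, n_nodes):
    """Hash-distribute rows across n_nodes, as an MPP loader would."""
    slices = [[] for _ in range(n_nodes)]
    for row in rows:
        slices[hash(row["customer"]) % n_nodes].append(row)
    return slices

def partial_sum(rows):
    """Runs on each node independently: SUM(amount) GROUP BY customer."""
    out = {}
    for row in rows:
        out[row["customer"]] = out.get(row["customer"], 0) + row["amount"]
    return out

def merge(partials):
    """Coordinator combines the per-node partial aggregates."""
    total = {}
    for p in partials:
        for k, v in p.items():
            total[k] = total.get(k, 0) + v
    return total

sales = [{"customer": "acme", "amount": 10},
         {"customer": "acme", "amount": 5},
         {"customer": "zen", "amount": 7}]
result = merge(partial_sum(s) for s in partition(sales, 4))
print(result["acme"])  # 15
```

In a real appliance the per-node step runs simultaneously on dozens of blades (with FPGA-assisted filtering in Netezza's case), which is where the speedup comes from; the merge step is cheap because only small partial results travel between nodes.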

IBM Netezza
Each Netezza data warehouse appliance features IBM Netezza Analytics, an embedded software platform that fuses data warehousing and predictive analytics to deliver petascale performance. IBM Netezza Analytics provides the technology infrastructure to support enterprise deployment of parallel in-database analytics. Its programming interfaces and parallelization options make it straightforward to move most analytics inside the appliance, whether they are performed with tools such as IBM SPSS or SAS or written in languages such as R, C/C++ or Java. A key performance advantage of the IBM Netezza appliance family comes from its patented Asymmetric Massively Parallel Processing (AMPP™) architecture, which combines open, blade-based servers and commodity disk storage with IBM's patented data filtering using Field Programmable Gate Arrays (FPGAs). This combination delivers blisteringly fast query performance and modular scalability on highly complex mixed workloads, and supports tens of thousands of BI, advanced analytics and data warehouse users.

SAS In-Database Processing
SAS In-Database processing is a flexible, efficient way to leverage increasing amounts of data by integrating select SAS technology into databases or data warehouses. It uses the massively parallel processing (MPP) architecture of the database or data warehouse for scalability and better performance. Moving relevant data management, analytics and reporting tasks to where the data resides speeds processing, reduces unnecessary data movement and promotes better data governance. According to SAS, the solution is jointly developed by SAS and Teradata and consists of SAS Accelerator for Teradata (scoring acceleration, analytics acceleration) and the Teradata Enterprise Data Warehouse (see more at SAS Analytic Advantage for Teradata).
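The in-database principle is simple to demonstrate: instead of fetching every row into the analytics client, the aggregation is pushed into the database engine and only small result rows cross the wire. Here SQLite stands in for a Teradata warehouse purely for illustration; the table and values are made up.

```python
# Toy illustration of in-database processing: the SQL engine does the
# work where the data lives; the client receives one row per group.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE scores (customer TEXT, score REAL)")
con.executemany("INSERT INTO scores VALUES (?, ?)",
                [("a", 1.0), ("a", 0.5), ("b", 0.25)])

# Pushed-down aggregation: no raw rows leave the database.
rows = con.execute(
    "SELECT customer, AVG(score) FROM scores GROUP BY customer "
    "ORDER BY customer"
).fetchall()
print(rows)  # [('a', 0.75), ('b', 0.25)]
```

With terabyte-scale tables, the difference between shipping billions of raw rows to a SAS client and shipping a handful of aggregates is the whole argument for in-database scoring and analytics.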

Teradata provides database software for data warehouses and analytic applications. Its products are meant to consolidate data from different sources and make the data available for analysis.

Aginity combines deep experience in big data and big math to build and deploy customer analytic solutions, creating intimacy at scale. Its solutions and methodologies address challenges of disparate data, database size, data quality, automated advanced analytics, and interactive reporting. The solution it provides is the Aginity Netezza Workbench.

1.2. Database Appliance

Oracle Exadata is a database appliance that supports both OLTP and OLAP workloads. Exadata was initially manufactured, delivered and supported by HP; since Oracle's acquisition of Sun Microsystems in January 2010, Exadata has shifted to Sun-based hardware. Oracle claims it is "the fastest database server on the planet".

II. Evolving Scale-out Architecture

2.1 Apache Hadoop

Apache Hadoop is a powerful open source software package designed for sophisticated analysis and transformation of both structured and unstructured complex data. Technically, Hadoop consists of two key services: reliable data storage using the Hadoop Distributed File System (HDFS) and high-performance parallel data processing using a technique called MapReduce. Originally developed and employed at Web companies such as Yahoo and Facebook, building on techniques published by Google, Hadoop is now widely used in finance, technology, telecom, media and entertainment, government, research institutions and other industries with significant data. With Hadoop, organizations can easily explore complex data using custom analyses tailored to their information and questions.
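The MapReduce model mentioned above can be sketched without a cluster: a map phase emits key/value pairs, a shuffle groups them by key, and a reduce phase folds each group. This is the classic word-count example expressed in plain Python; Hadoop's contribution is running the same three phases in parallel across HDFS blocks on many machines.

```python
# Minimal single-process sketch of the MapReduce word-count pattern.
from itertools import groupby

def mapper(line):
    # map phase: emit (word, 1) for every word in an input record
    for word in line.split():
        yield (word, 1)

def reducer(word, values):
    # reduce phase: fold all values emitted for one key
    return (word, sum(values))

lines = ["big data big analytics", "big hadoop"]
# shuffle phase: sort all mapper output so equal keys become adjacent
pairs = sorted(kv for line in lines for kv in mapper(line))
counts = dict(reducer(k, (v for _, v in grp))
              for k, grp in groupby(pairs, key=lambda kv: kv[0]))
print(counts["big"])  # 3
```

In real Hadoop jobs the mapper and reducer are Java classes (or scripts run via Hadoop Streaming), and the framework handles the shuffle, fault tolerance, and data locality automatically.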

While the open source Hadoop technology offers virtually unlimited scalability at very low cost, it is a raw technology toolset with a command line interface that requires extensive Java programming and a significant IT resource commitment to function as any sort of analytic solution. The following solutions provide Hadoop-based, user-friendly analytics platforms in the form of appliances or software platforms.

Cloudera's Distribution including Apache Hadoop (CDH), available for free download, delivers a streamlined path for putting Apache Hadoop to work solving business problems in production.

IBM InfoSphere BigInsights brings the power of Hadoop to the enterprise.  BigInsights enhances Apache Hadoop technology to withstand the demands of enterprise, adding administrative, workflow, provisioning, and security features, along with best-in-class analytical capabilities from IBM Research. The result is a more developer and user-friendly solution for complex, large scale analytics.

Oracle Big Data Appliance is a pre-integrated full rack configuration with 18 of Oracle's Sun servers that include InfiniBand and Ethernet connectivity to simplify implementation and management. It includes support for Apache Hadoop through Cloudera CDH. Additional system software includes Oracle NoSQL Database, Oracle Linux, Oracle Java HotSpot VM, and an open source distribution of R.

Datameer leverages the scalability, flexibility and cost-effectiveness of Apache Hadoop to deliver an end-user focused BI solution for big data analytics. Datameer overcomes Hadoop's complexity and lack of tools by providing business and IT users with BI functionality across data integration, analytics and data visualization in the world's first BI platform for Hadoop.  

Pentaho Business Analytics also offers native support for the most popular big data sources including Hadoop, NoSQL and analytic databases. Using Pentaho Business Analytics with Hadoop allows easy management, integration, and speed-of-thought analysis and visualization of Hadoop data.

2.2 Apache Cassandra

The Apache Cassandra database is an open source distributed database management system designed for scalability and high availability without compromising performance. Linear scalability and proven fault tolerance on commodity hardware or cloud infrastructure make it a strong platform for mission-critical data. Cassandra's support for replicating across multiple data centers is best-in-class, providing lower latency for users and the peace of mind of knowing that you can survive regional outages.
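Cassandra's scalability and fault tolerance rest on a consistent-hashing ring: each row key hashes to a token, the node owning that token range stores the row, and the next RF-1 nodes clockwise hold replicas. The sketch below compresses this into a 0-255 token space with made-up node tokens and a replication factor of 2; real Cassandra uses a much larger token space and pluggable replication strategies.

```python
# Toy sketch of Cassandra-style ring placement (node tokens made up).
import bisect
import hashlib

RING = [(0, "node-A"), (85, "node-B"), (170, "node-C")]  # (token, node)
TOKENS = [t for t, _ in RING]
RF = 2  # replication factor: each row lives on RF nodes

def token(key):
    # hash the partition key onto a small 0-255 token space
    return hashlib.md5(key.encode()).digest()[0]

def replicas(key):
    # find the node owning the key's token range, then walk
    # clockwise around the ring for RF-1 additional replicas
    i = bisect.bisect_right(TOKENS, token(key)) - 1
    return [RING[(i + j) % len(RING)][1] for j in range(RF)]
```

Because any RF consecutive nodes hold a copy, a single node (or, with data-center-aware placement, a whole region) can fail while reads and writes continue against the surviving replicas.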




Updates:
  • 2012.03.29 - original post
  • 2012.04.06 - adding Datameer
  • 2012.04.12 - adding Pentaho
  • 2012.04.16 - adding SAS In-database Processing 
  • 2012.09.23 - this blog post becomes part of the jPage
