Friday, May 25, 2012

A Big Data View on Memorial Day Weekend Travel

If the Memorial Day weekend is a bellwether for summer spending trends in the U.S., the travel and retail industries are in for a sunnier season, according to an IBM social media sentiment analysis. The IBM Social Sentiment Index shows a 46 percent jump in the amount of social media chatter about Memorial Day travel, and a five-fold increase in the “Desire Ratio” – the ratio of positive to negative comments about shopping – this Memorial Day compared to last year. http://bit.ly/LrPXJi

IBM Viewpoint

  • IBM’s deep analytics capabilities enable clients to crunch enormous volumes of social media data in real time to better understand and respond to customers. The Social Sentiment Index around Memorial Day demonstrates how analytics can help travel providers and retailers prepare for and capture an upswing in consumer spending.
  • The exponential growth of social networks like Twitter makes them a valuable avenue for monitoring public opinion. IBM's ability to measure social sentiment can help organizations make more informed decisions, predict trends, or detect fraud by capturing valuable insights into unfiltered consumer attitudes and actions.
  • Analysis of social media data can reveal more than just what customers are saying -- it can enable companies to answer more complex questions such as what is motivating people to discuss a certain topic or to take a certain action.

What They're Saying

  • The Wall Street Journal reports that the IBM Social Sentiment Index gives insight into the unofficial kick-off of the summer road-trip and vacation season. http://on.wsj.com/MNszf1
  • TIME explains how the IBM Social Sentiment Index reveals something interesting about Americans’ plans for the Memorial Day holiday weekend. http://ti.me/LNcQxc
  • IDC analyst Ruthbea Yesner Clarke blogs about how local governments can use social media analytics. http://bit.ly/LnHQxU

Update
  • 2012.05.26 - original post

Thursday, May 10, 2012

jTool: Apache Lucene

I had a full-day client briefing today in St. Louis on Big Data analytics for enterprise research.  During the briefing, Lucene was mentioned by architects and developers from both sides over a dozen times as a useful tool for text indexing, even at the enterprise level. 

It says a lot for an open-source project like Lucene when two 100-year-old companies agree on its importance. 


So I am now adding a second Apache project (the first being Apache Hadoop) to the jTool catalog. 


Overview of Apache Lucene

Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform.

Lucene offers powerful features through a simple API (a minimal indexing-and-search sketch follows the feature lists below): 

Scalable, High-Performance Indexing
  • over 95GB/hour on modern hardware
  • small RAM requirements -- only 1MB heap
  • incremental indexing as fast as batch indexing
  • index size roughly 20-30% the size of text indexed

Powerful, Accurate and Efficient Search Algorithms

  • ranked searching -- best results returned first
  • many powerful query types: phrase queries, wildcard queries, proximity queries, range queries and more
  • fielded searching (e.g., title, author, contents)
  • date-range searching
  • sorting by any field
  • multiple-index searching with merged results
  • allows simultaneous update and searching

Cross-Platform Solution

  • Available as Open Source software under the Apache License which lets you use Lucene in both commercial and Open Source programs
  • 100%-pure Java
  • Implementations in other programming languages available that are index-compatible
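
To make the API concrete, here is a minimal index-then-search sketch in Java. It targets the Lucene 3.x API that was current when this was written (class and package names shift between releases), and the field names and sample text are illustrative choices of mine, not taken from any official example.

// Index two small documents in an in-memory directory, then run a ranked search.
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class LuceneQuickStart {
    public static void main(String[] args) throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_36);
        Directory index = new RAMDirectory();  // in-memory index, fine for a demo

        // Incremental indexing: addDocument can be called at any time, batch or not.
        IndexWriter writer = new IndexWriter(index, new IndexWriterConfig(Version.LUCENE_36, analyzer));
        writer.addDocument(makeDoc("Lucene overview", "high-performance full-text search engine library"));
        writer.addDocument(makeDoc("Hadoop overview", "distributed batch processing of large data sets"));
        writer.close();

        // Fielded search: parse a free-text query against the "body" field; results come back ranked best-first.
        Query query = new QueryParser(Version.LUCENE_36, "body", analyzer).parse("full-text search");
        IndexReader reader = IndexReader.open(index);
        IndexSearcher searcher = new IndexSearcher(reader);
        for (ScoreDoc hit : searcher.search(query, 10).scoreDocs) {
            System.out.println(searcher.doc(hit.doc).get("title") + "  (score " + hit.score + ")");
        }
        searcher.close();
        reader.close();
    }

    private static Document makeDoc(String title, String body) {
        Document doc = new Document();
        doc.add(new Field("title", title, Field.Store.YES, Field.Index.ANALYZED));
        doc.add(new Field("body", body, Field.Store.YES, Field.Index.ANALYZED));
        return doc;
    }
}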

Links

TABI - Near Real-time Business Intelligence

Timely Analytics for Business Intelligence (TABI) is a system that delivers near real-time business intelligence for analysts relying on massive amounts of data. It was developed by IBM Research and first field-tested in a First-of-a-Kind (FOAK) project with the telecommunications industry.

The project integrates two research assets from IBM Research – Haifa: the Massive Collection System (MCS) from the Software and Services department, and the Parallel Machine Learning (PML) toolbox developed jointly by the ML group and the Data Analytics department at the IBM T.J. Watson Research lab.


Potential TABI use cases abound in fields such as telecommunications, banking, and transportation. 
TABI connects to the data stream of the customer, extracting only relevant information and making rapid business intelligence decisions using methods such as prediction, clustering, social connectivity graph analysis, and association rules.

TABI has demonstrated that it can learn from hundreds of millions of records per day, with relatively modest hardware requirements.

An IBM Case Study

A joint research project between IBM and a large mobile communications provider offered strategic insight into the wireless customers’ social network calling patterns, giving the provider valuable information it could use to improve customer loyalty and reduce churn.

Mobile phone customers are notoriously fickle, and churn rates are a major headache for most wireless carriers. IBM’s Timely Analytics for Business Intelligence (TABI) uses sophisticated software analytics and algorithms to predict the likelihood of defection among current subscribers, giving the provider the opportunity to target those customers with special offers to encourage retention.

IBM Research conducted a joint first-of-a-kind project with the client, using software analytics and algorithms developed by IBM Research Haifa. Through TABI, the IBM Research Services team and the client discovered that customers’ calling patterns provide important clues about their brand loyalty. Looking at a customer’s network of friends and determining who calls whom is proving to be a reliable predictor of whether and how seriously that customer will consider jumping to a competitor.

The IBM Research Services team worked with the client to analyze more than 250 million call detail records (CDRs) every day using IBM’s TABI technology. TABI is based on a combination of IBM’s Massive Collection System and Parallel Machine Learning system. Unlike traditional methods, which “warehouse” and analyze data that is quickly outdated, TABI perceives patterns by analyzing massive amounts of CDRs as they are generated. This ongoing analytical process, which can handle billions of CDRs a day, helps TABI deliver results that are up to date. The use of social analytics provides predictions that are up to 50 percent more accurate than previous techniques used to predict churn. Customer privacy is maintained by identifying social network patterns based on CDRs, which are decoupled from any personal customer information.
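
TABI's own algorithms are not published, so the following is only a toy Java sketch of the general idea behind the social-connectivity approach described above: build a who-calls-whom graph from CDR-like records and score a subscriber's churn risk by how much of their calling volume goes to contacts who have already churned. The class, the scoring rule, and the sample data are illustrative assumptions of mine, not IBM's method.

// Illustrative only: a tiny call-graph model with a naive neighborhood-based churn score.
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class CallGraphChurnSketch {

    // caller -> (callee -> number of calls)
    private final Map<String, Map<String, Integer>> callGraph = new HashMap<>();
    private final Set<String> churned = new HashSet<>();

    public void addCdr(String caller, String callee) {
        callGraph.computeIfAbsent(caller, k -> new HashMap<>())
                 .merge(callee, 1, Integer::sum);
    }

    public void markChurned(String subscriber) {
        churned.add(subscriber);
    }

    // Share of a subscriber's outgoing call volume that goes to contacts who already churned.
    public double churnRisk(String subscriber) {
        Map<String, Integer> contacts = callGraph.getOrDefault(subscriber, Collections.emptyMap());
        int total = 0, toChurned = 0;
        for (Map.Entry<String, Integer> e : contacts.entrySet()) {
            total += e.getValue();
            if (churned.contains(e.getKey())) {
                toChurned += e.getValue();
            }
        }
        return total == 0 ? 0.0 : (double) toChurned / total;
    }

    public static void main(String[] args) {
        CallGraphChurnSketch model = new CallGraphChurnSketch();
        model.addCdr("alice", "bob");
        model.addCdr("alice", "bob");
        model.addCdr("alice", "carol");
        model.markChurned("bob");
        System.out.printf("alice churn risk: %.2f%n", model.churnRisk("alice"));  // prints 0.67
    }
}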

For each set of data, TABI can apply one of several analytical operations to generate a Business Intelligence model that clients can use to glean actionable knowledge. There are currently five operations:
  • prediction, in which forecasts are made according to the Key Performance Indicators of each user or entity in the data;
  • clustering, where users are divided into classes according to similarities between them;
  • connectivity graph analysis, which is used to analyze the data according to links between users;
  • association rule mining, which automatically determines correlations between data items; and
  • quality of service analysis, which provides advanced methods that ensure the network is functioning at a high level of service.

In addition to loyalty reinforcement and cross-selling, TABI can be used in marketing campaign management to target likely customers and track progress, and for fraud or abnormal-activity detection by rapidly adjusting its prediction model to spot recent examples of behavioral anomalies. It can also be used in network quality tracking, alerting a network operator to a potential problem and helping provide the opportunity to react quickly to emerging problems before they have far-reaching effects.

The Net of TABI

  • a cluster-forming algorithm
  • good at finding the lead of a cluster
  • near real-time decision-making


Links:
  • TABI (IBM Research)

Wednesday, May 9, 2012

jTool: IBM SmartCloud Provisioning

IBM SmartCloud Provisioning is a breakthrough entry-level solution that allows quick cloud deployment and features automated provisioning, parallel scalability and integrated fault tolerance to increase operational efficiency and respond to user needs. It also provides the foundation to integrate more advanced cloud capabilities.

The following is a recent post by IBMer Mauro Arcese, the first in a series introducing IBM SmartCloud tools. In this post, IBM SmartCloud Provisioning's rapid deployment capability is reviewed in detail. 

I’ve been impressed by the speed of provisioning a set of virtual machines in just a few tens of seconds using IBM SmartCloud Provisioning. In most cases, you can get a running virtual machine in less than one minute.
The IBM SmartCloud Provisioning technology has been devised and particularly optimized for managing the following cloud infrastructure scenarios:
  • Infrastructure composed of homogeneous resources
  • High level of standardization, with a relatively small set of master images used to provision many instances from the same image
  • A typical life cycle in which provisioned virtual instances have a short average lifetime

Many other workloads can be deployed and easily automated on top of IBM SmartCloud Provisioning. For example, traditional stateful applications can be deployed for simple high availability (HA) solutions. However, you get the maximum performance from SmartCloud Provisioning when operating in the context of the above scenarios.

To achieve this level of performance, IBM SmartCloud Provisioning has been designed around an optimized virtualization infrastructure based on OS streaming: there is no need to copy large image files over the network when provisioning.

Image copying is the single biggest bottleneck in virtual machine (VM) provisioning today, in terms of CPU, memory, I/O, and bandwidth usage. In traditional cloud provisioning approaches, all of this work is pure overhead: nobody builds a cloud in order to provision systems. Provisioning is simply what is required to get systems on which business workloads can be deployed, and any overhead competes with the business workload.

The key element of this infrastructure is the so-called ephemeral instance: a virtual machine with no persistent state. When an instance is terminated, all data associated with it is deleted as well. Ephemeral instances are clones of a master image, and each clone has a primary virtual disk that is itself ephemeral: when the instance goes, so does its storage (IBM SmartCloud Provisioning does provide mechanisms for persistence where a scenario needs it).

Because master images are read-only resources replicated across the storage cluster, IBM SmartCloud Provisioning uses copy-on-write (CoW) technology and the iSCSI protocol to stream them when creating a new instance, avoiding expensive copying. Each iSCSI session results in a valid block device being created in the host OS.

Of course, each guest OS (corresponding to a given instance) requires a writable block device representing the main disk of the system. All supported hypervisors have a storage virtualization layer that includes copy-on-write technology. For example, Kernel-based Virtual Machine (KVM) qcow2 files can be configured to implement CoW by referencing a backing storage device. VMware has something called redo files, which effectively do the same thing. In each case, the hypervisor can natively use the CoW file referencing the iSCSI block device to expose a virtual block device to the virtual machine.
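
As a concrete, purely illustrative example of the KVM case, the snippet below shells out to qemu-img to create a per-instance qcow2 copy-on-write overlay whose backing store is the read-only master image exposed as a local block device by an iSCSI session. This is not SmartCloud Provisioning's actual agent code, and the device path and file name are hypothetical; writes land only in the overlay, while reads of unmodified blocks are served from the streamed master.

// Sketch: create a qcow2 CoW overlay backed by an iSCSI-exposed master image.
import java.io.IOException;

public class CowOverlaySketch {
    public static void main(String[] args) throws IOException, InterruptedException {
        // Hypothetical paths: the iSCSI-backed master device and the per-instance overlay file.
        String masterDevice = "/dev/disk/by-path/ip-10.0.0.5:3260-iscsi-master-lun-0";
        String overlayFile  = "/var/lib/instances/vm-001.qcow2";

        Process p = new ProcessBuilder(
                "qemu-img", "create",
                "-f", "qcow2",        // format of the new overlay
                "-b", masterDevice,   // backing device: the streamed, read-only master
                overlayFile)
            .inheritIO()
            .start();
        System.exit(p.waitFor());
    }
}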

Depending on the hypervisor and guest OS, this device will show up as something like one of the following lines:

/dev/sda 

c:\

The CoW files are stored locally on the hypervisor’s file system.

When the instance is terminated, the IBM SmartCloud Provisioning agent simply discards the CoW file and checks if any other instances are using the same iSCSI device. If the device is no longer in use, the agent also stops the iSCSI device.

Thanks to this infrastructure, provisioning a new virtual machine is a fast and reliable process: individual systems can be created in tens of seconds, and peak loads of thousands of systems per hour can be handled.


Links:


Update:
  • 2012.05.09 - original post

Saturday, May 5, 2012

j-Warrior: Juan Iglesias of UTB

It was my honor to sponsor Dr. Juan R. Iglesias, associate professor and chair of the Department of Computer and Information Sciences at The University of Texas at Brownsville, for a 2010 IBM Faculty Award for his involvement with computer science and research computing. The 2010 IBM Faculty Award was given to 55 recipients at 47 universities in 15 countries. The award promotes work opportunities, teaching and research innovation, and collaboration among faculty researchers.


The award was $40,000 and paid for an HPC visualization system located in the computer science laboratory on the UTB campus.  

Dr. Iglesias and his colleague, Dr. Hansheng Lei, had earlier won a $704,000 Major Research Instrumentation grant from the NSF toward building Futuro, UTB's third-generation computing cluster. I was fortunate enough to architect the system and lead the IBM team that delivered it on-site in the summer of 2010. The cluster went into operation in early 2011 with a full-day user orientation and training.  We celebrated Dr. Iglesias's award and the launch of the system in February 2011.


Link:

Update:
  • 2012.05.05 - original post

Thursday, May 3, 2012

IBM Takes Big Data Analytics to Crime-fighting


Last fall, IBM acquired i2 and said it would integrate the company's technology with its own data collection, analysis, and warehousing software. Today IBM announced the first rollout of the resulting analytics software package, which it expects will help law enforcement, government agencies, and private businesses wade through the massive amounts of data they collect to predict, disrupt, and prevent criminal, terrorist, and fraudulent activities.

IBM said the package will give customers powerful visualization and analysis capabilities coupled with advanced data access to help organizations to manage and process information in less time, giving them more time to spend on analysis.

The development is an example of how, when Big Blue has a vision and a plan, the acquisition and the acquired all seem to fall into place like chess pieces on the board.


History of i2 Acquisition

The package, IBM i2 Intelligence Analysis portfolio, is based on the security software it picked up last year when it bought i2. Law enforcement agencies and corporate security departments use i2's software to pinpoint fraudulent or improper activity within their logs of operational data. The company's Analyst's Notebook digital forensic software can display a visual diagram of people, places or other entities, showing how different parties are linked. At the time of the buy (2011.08.31), i2 had more than 4,500 customers across 150 countries. The company said that 12 of the top 20 retail banks use its software. The Boston Police Department and the Criminal Justice System in Orange County, Calif., share criminal data through i2's Coplink platform. In a $9.6 million contract, the U.S. Army procured an enterprise license to use Analyst's Notebook in its troubled Distributed Common Ground System -- Army (DCGS-A) intelligence sharing system. Defense contractor Northrop Grumman folded i2's Coplink into a system it is providing to the Navy to track criminal information from multiple sources. (IBM press)

"Helping governments and businesses improve public safety and corporate security is a significant part of IBM's business strategy," said Craig Hayman, general manager, IBM Industry Solutions.  "Through our acquisition of i2, we are strengthening our ability to help cities, countries, international organizations and private enterprises create safer environments for conducting business."

i2 supports:
  • Association, network, link, temporal, geospatial, and statistical analysis to help build a comprehensive analytical picture, revealing relationships, patterns and trends in data that can help save time and increase efficiency.
  • Social Network Analysis (SNA) and quantitative analysis techniques that combine organization theories with mathematical models to help better understand and target the dynamics of groups, networks, and organizations.
  • Collaborative working capabilities that support the greater organization in working together on cases, supporting sharing, teamwork, and inter- and intra-organizational communication, helping investigations to be resolved more quickly.
  • Advanced connectivity and multi-source, simultaneous search capabilities that automate and accelerate the lengthy research process of capturing, collating, and enriching data.
  • Searching unstructured data using powerful search capabilities to cast the net wide and deep to ensure that no data is missed in supporting investigative and operational activities.
  • Real-time exploration of intelligence, delivering an extensible, scalable, and collaborative environment supporting operational analysis and faster, more informed decision-making across an enterprise.

i2 Intelligence Analysis portfolio:

The portfolio includes:
  • IBM i2 Analyst's Notebook, which lets government agencies and private businesses maximize the value of the mass of information that they collect to discover and disseminate actionable intelligence that may help identify, predict, and prevent criminal, terrorist, and fraudulent activities.
  • IBM i2 Analyst's Notebook Connector for Esri integrates the capabilities of IBM i2 Analyst's Notebook with the capabilities of the Esri ArcGIS server to drive timely and informed operational decision-making. IBM i2 Analyst's Notebook Connector for Esri enables access to published mapping services from available Esri Server environments. It supports a range of mapping and geospatial analysis tools that enable analysts to do fundamental geospatial analysis tasks without the need to go to dedicated GIS analyst teams, which also allows these teams to concentrate on the in-depth geospatial analysis tasks. This self-service access to available data-rich Esri Servers (providing access to base maps, dynamic feature layers, and GIS services) directly from IBM i2 Analyst's Notebook allows analysts to move forward with intelligence products quickly, helping to increase the productivity of the intelligence production process, IBM said.
  • IBM i2 Information Exchange for Analysis Search for Analyst's Notebook lets IBM i2 Analyst's Notebook connect to disparate and distributed data sources, perform a single search across these multiple resources, and then bring information back into IBM i2 Analyst's Notebook for visualization and analysis.
  • IBM i2 Information Exchange for Analysis Search Services SDK allows for secure access to multiple data sources, irrespective of type or location, across heterogeneous IT environments. It provides a set of developer tools and documentation allowing the creation, extension, and customization of web services components and IBM i2 Information Exchange for Analysis Search for Analyst's Notebook.
  • IBM i2 Analyst's Notebook Premium builds on the capabilities of IBM i2 Analyst's Notebook with a local analysis repository and new capabilities for enhanced data management, information discovery, and charting to address today's increasing volumes of structured and unstructured data. The addition of a local analysis repository to IBM i2 Analyst's Notebook and the new capabilities introduced for enhanced data management, information discovery, charting, and chart management complete the offering for an individual who requires access to greater volumes of data than IBM i2 Analyst's Notebook can contain in charts alone, the company stated. IBM said this version of the notebook will let users import higher volume data from structured data files via the wizard-style Visual Importer, letting users quickly visualize and analyze a wide range of data types including telephone call records, financial transactions, computer IP logs, and mobile forensics data.
  • IBM i2 Intelligence Analysis Platform offers a scalable and collaborative environment supporting operational intelligence sharing and faster, more informed decision-making across your enterprise.
  • IBM i2 Fraud Intelligence Analysis enables insights across large, complex, and disparate data sets to help investigate, disrupt, and prevent fraud. Fraud Intelligence Analysis helps turn massive, unrelated data sets into actionable intelligence and presents the results in an easily digestible way, IBM said. IBM i2 Fraud Intelligence Analysis can be used as a stand-alone fraud identification and analysis solution where there is currently either no capability or only a manual one. It can also be deployed in conjunction with other IBM products for predictive analytics, identity management, business process rules, and case management.
  • IBM i2 iBase is a comprehensive SQL Server database application that lets collaborative teams of analysts capture, control, and analyze multi-source data in a secure environment and disseminate the results as actionable intelligence in support of intelligence-led operations.

Social Media
Links

Update:
  • 2012.05.02 - original post