Big Data and Predictive Business Analytics
Ying Liu
2014-01-01
Abstract:

EXECUTIVE SUMMARY

Today, the amount of data we are able to collect has been exploding. As a result, Big Data have become a new buzzword in information technology. Storing, managing, and analyzing Big Data is challenging, and will soon become a major differentiator between high-performing and low-performing organizations. This article discusses the issue of Big Data, including the four dimensions of Big Data and the opportunities and challenges created by them. It also discusses various Big Data analytics applications.

Every day, we use several different devices to generate large amounts of data; for example, searching online, making purchases through e-commerce web sites, making transactions in supermarkets, reading data from sensors, using social media to interact with our friends, and using GPS. All data are accumulated and stored somewhere, which we call Big Data.

WHAT IS BIG DATA?

According to the McKinsey Global Institute, Big Data refer to datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze. The EdTech Report to the nation in 2013 states that every day we create 2.5 quintillion (2.5 × 10^18) bytes of data, so much that 90% of the data in the world today has been created in the last two years alone. These data come from everywhere: sensors used to gather climate information, posts to social media sites, digital pictures and videos, purchase transaction records, and cell phone GPS signals, to name a few. These data are Big Data. The concept of Big Data is actually not new. We have been accumulating data since the beginning of recorded time. However, as technology advances, data are accumulating at an alarming rate.

FOUR DIMENSIONS OF BIG DATA

IBM data scientists break Big Data down into four dimensions: volume, velocity, variety, and veracity (the 4-Vs). The volume dimension refers to the scale of data. From the beginning of recorded time until 2003, we created 5 billion gigabytes (5 exabytes) of data. In 2011, the same amount was created nearly every two days.
In 2013, the same amount of data was created every 10 minutes. Velocity refers to the analysis of streaming data. As data are accumulated every second, data quickly become out-of-date. Therefore, it is important to use data as fast as possible. The third dimension, variety, refers to the different types of data we collect, e.g., structured data, unstructured data, text data, numerical data, image data, and audio and video data. Veracity refers to the uncertainty in data. The data we collect may contain noise, but we do not know which data are accurate and which have noise. This is why many business leaders do not trust information generated from them. What's more, according to IBM, poor data quality costs the U.S. economy around $3.1 trillion each year.

WHY DO WE CARE ABOUT BIG DATA?

The model of generating and consuming data has changed. The old model was that a few companies were generating data; all others were consuming data. As technology advanced, a new model has evolved. Many companies are now both generating and consuming data. But our ultimate goal is not just generating, storing, and managing data. Once data are generated and stored, the next step is to analyze the data to find useful information. Information is then converted into knowledge to make decisions that optimize profit. Figure 1 shows an overview of the data analysis and decision-making process.

The article by Clay Dillow, published in Fortune magazine on September 4, 2013, states that Big Data are now viewed as "the new oil" to drive economies in the century ahead, the same way oil did at the beginning of the last century. So, we are experiencing a Big Data employment boom. Dillow's claim is supported by job trends from indeed.com, an employment-related meta-search engine for job listings. According to these job trends, the number of job postings related to Big Data increased exponentially after 2011. …