BIG DATA FOR DUMMIES
“Big Data” refers to large, complex volumes of structured and unstructured business data from daily activities. These data can be too large for traditional data processing software to handle. Big data analytics is the process of reviewing and comparing data sets to reveal patterns and trends that may lend themselves to improvements and informed business decisions.
The primary characteristics of Big Data are:
Volume: The quantity of generated and stored data. The size of the data determines its value and potential insight, and whether it counts as Big Data at all.
Variety: The type and nature of the data, which shape how the resulting insight can be used.
Velocity: The speed at which the data is generated and processed to meet the demands of growth and development.
Variability: Data inconsistencies can encumber processing and use.
Veracity: Data quality can vary greatly, which affects analytical accuracy.
Growth and Trends
The term “Big Data” was first used in the mid-1990s to describe increasing data volume. Doug Laney, an analyst with Meta Group, extended the idea to cover the growing variety of data generated by organizations and the speed at which the data were sourced and updated. Gartner, after acquiring Meta Group in 2005, popularized what are now known as the three Vs of big data: volume, velocity, and variety.
Big data applications were initially the domain of large internet companies such as Google, Yahoo and Facebook, along with e-commerce websites and analytics and marketing service providers. Big data analytics has since been adopted by manufacturing firms, energy companies, healthcare organizations and others. As of 2012, roughly 2.5 exabytes (2.5×10^18 bytes) of data were being generated per day. A significant question for large enterprises is how to establish ownership and responsibility for big-data initiatives.
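To get a feel for the scale of the 2.5 exabytes per day figure cited above, it helps to convert it to a per-second rate. The short calculation below is only a back-of-the-envelope sketch; it assumes decimal (SI) units, where one exabyte is 10^18 bytes and one terabyte is 10^12 bytes.

```python
# Assumption: decimal (SI) units, 1 exabyte = 10**18 bytes, 1 terabyte = 10**12 bytes.
BYTES_PER_EXABYTE = 10**18
BYTES_PER_TERABYTE = 10**12
SECONDS_PER_DAY = 24 * 60 * 60

daily_bytes = 2.5 * BYTES_PER_EXABYTE
bytes_per_second = daily_bytes / SECONDS_PER_DAY
terabytes_per_second = bytes_per_second / BYTES_PER_TERABYTE

print(f"{terabytes_per_second:.1f} TB per second")
```

Under these assumptions the figure works out to roughly 29 terabytes every second.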
Specialized big data analytics systems and software matter to businesses because they help improve marketing, increase operational efficiency and generate new revenue opportunities. Broadly speaking, these techniques provide a means of analyzing data sets to draw reasonable projections and conclusions that support sound business decisions.
Basic tools and technology
Traditional data processing software may not have the capacity to handle big data sets, particularly when the data are unstructured or semi-structured. The problem is sharper still for real-time data that require frequent, ongoing updates, such as stock market activity, website visitor tracking, or mobile application performance.
Many organizations that collect data can process and analyze it with database tools such as:
HBase: A column-oriented key-value data store, valued for its ability to run on top of the Hadoop Distributed File System (HDFS).
Spark: An open-source parallel processing framework designed for large-scale data analysis applications. It supports a wide array of data formats and storage systems.
YARN: A cluster-management technology and one of the major features of second-generation Hadoop.
Kafka: This tool is designed to replace traditional message brokering. It is a distributed publish-subscribe messaging system.
Pig: An open-source technology that offers a high-level mechanism for parallel programming of MapReduce jobs executed on Hadoop clusters.
Hive: An open-source data warehousing system used for querying and analyzing big data sets stored in Hadoop files.
MapReduce: A software framework that lets developers write programs that process large amounts of unstructured data in parallel across a distributed cluster of processors or stand-alone computers.
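The MapReduce idea in the list above can be illustrated with a small word count. This is a minimal single-process sketch in plain Python, not Hadoop's actual API: the function names map_phase, shuffle_phase and reduce_phase are made up for illustration, and a real framework would run the map and reduce steps on many machines at once.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Group emitted values by key, as a framework would between the phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values collected for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    # Each document could be mapped on a different machine; here it is a loop.
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


documents = ["big data is big", "data about data"]
counts = word_count(documents)
print(counts)
```

Because each call to map_phase looks at only one document and each call to reduce_phase looks at only one key, the work can be split across a cluster without the pieces needing to coordinate, which is the property that makes the model suit very large data sets.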
In some cases, NoSQL and Hadoop cluster systems serve as a staging area for data before it is loaded into an analytical database.
Sound data management is an essential first step in the analysis of big data. Many users are now adopting the Hadoop data lake, which serves as the primary storage area for new raw data.
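The pattern of landing raw data first and loading it into an analytical database afterwards can be sketched on a very small scale. Everything in the example below is an assumption made for illustration: the sample records, the page_views table and the parse_events helper are hypothetical, and an in-memory SQLite database stands in for a real analytical database.

```python
import json
import sqlite3

# Hypothetical raw events as they might land in a staging area: one JSON
# document per line, with a missing field and one malformed record.
raw_lines = [
    '{"user_id": "a1", "page": "/home", "ms": 120}',
    '{"user_id": "b2", "page": "/cart"}',
    'not valid json',
    '{"user_id": "a1", "page": "/cart", "ms": 340}',
]


def parse_events(lines):
    """Yield (user_id, page, ms) tuples, skipping lines that fail to parse."""
    for line in lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # a real pipeline would log or quarantine this record
        yield event["user_id"], event["page"], event.get("ms")


# An in-memory SQLite database stands in for the analytical database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (user_id TEXT, page TEXT, ms INTEGER)")
conn.executemany("INSERT INTO page_views VALUES (?, ?, ?)", parse_events(raw_lines))

# Once loaded, the data can be queried with ordinary SQL.
views_per_page = dict(
    conn.execute("SELECT page, COUNT(*) FROM page_views GROUP BY page")
)
print(views_per_page)
```

Keeping the raw lines untouched and applying structure only at load time means a malformed or incomplete record can be skipped or repaired later without losing the original data.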
Challenges
The large sample size and high variety of Big Data introduce computational and statistical challenges, including:
Storage or warehouse limitations;
The difficulty of acquiring and retaining the skills needed.