DATA INTEGRATION FOR DUMMIES
Data integration is the compilation of data from several sources into a unified output, giving users a single, comprehensive data set. The compiled data may be technical or financial in nature and can arrive as text, charts, or tables. In every case, the objective is to make the combined data easier to understand than the separate sources.
Evolution of Data Integration
Data integration involves systems that facilitate the interoperability of heterogeneous databases, or information silos. In 1991, the first computer-based data integration process was conducted at the University of Minnesota. Semantic integration addresses how to resolve semantic conflicts between heterogeneous data sources. Combining research results from different sources requires benchmarking their similarities against a common criterion, such as positive predictive value. In 2011, issues related to data isolation, an artifact of data modelling technology that leads to disparate data models, were tackled. New methods for data compilation and manipulation have been developed to eliminate these artifacts and to produce integrated data models. Data hub and data lake approaches, which combine unstructured or varied data in one location but do not always require a master relational schema to structure and define the data in the hub, have also become popular.
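Positive predictive value, the benchmarking criterion mentioned above, is the share of a source's positive results that turn out to be correct: true positives divided by the sum of true and false positives. The sketch below shows how two sources might be scored on that common scale. The source names and counts are hypothetical and only illustrate the arithmetic.

```python
def positive_predictive_value(true_positives: int, false_positives: int) -> float:
    """PPV = TP / (TP + FP): the share of positive calls that are correct."""
    total_positive_calls = true_positives + false_positives
    if total_positive_calls == 0:
        raise ValueError("PPV is undefined when a source makes no positive calls")
    return true_positives / total_positive_calls


# Hypothetical validation counts for two data sources (illustrative numbers).
sources = {
    "source_a": {"true_positives": 80, "false_positives": 20},
    "source_b": {"true_positives": 45, "false_positives": 5},
}

# Score every source with the same criterion so the results are comparable.
ppv_by_source = {
    name: positive_predictive_value(**counts) for name, counts in sources.items()
}
for name, ppv in ppv_by_source.items():
    print(f"{name}: PPV = {ppv:.2f}")
```

Because both sources are measured the same way, their scores can be compared directly or used as weights when the results are combined.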
Importance of Data Integration
When a user queries a variety of information sources, each with its own schema and stored in an independent database, the combined data may be difficult to collect, extensive, and full of duplicates. Adapters are used to transform the local query results returned by the respective websites or databases into an easily processed form. The information that emerges from the compiled data can be used by business and government organizations to identify areas where business development is required.
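The adapter idea described above can be sketched in a few lines. In this minimal example, two sources return the same kind of record in different shapes, one adapter per source converts its rows into a common form, and duplicates are dropped by email address. The source layouts, field names, and the `Customer` type are all assumptions made for illustration, not part of any particular product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Customer:
    """Common record shape that every adapter produces (hypothetical)."""
    name: str
    email: str


def crm_adapter(row: dict) -> Customer:
    # Hypothetical CRM source: separate name fields, mixed-case email.
    full_name = f"{row['first_name']} {row['last_name']}".strip()
    return Customer(name=full_name, email=row["Email"].strip().lower())


def billing_adapter(row: tuple) -> Customer:
    # Hypothetical billing source: (email, "Last, First") tuples.
    email, raw_name = row
    last, first = [part.strip() for part in raw_name.split(",")]
    return Customer(name=f"{first} {last}", email=email.strip().lower())


def integrate(crm_rows, billing_rows) -> list:
    """Run each source through its adapter and keep one record per email."""
    unified = {}
    for record in map(crm_adapter, crm_rows):
        unified.setdefault(record.email, record)
    for record in map(billing_adapter, billing_rows):
        unified.setdefault(record.email, record)
    return list(unified.values())


crm_rows = [
    {"first_name": "Ada", "last_name": "Lovelace", "Email": "Ada@Example.com"},
    {"first_name": "Alan", "last_name": "Turing", "Email": "alan@example.com"},
]
billing_rows = [
    ("ada@example.com", "Lovelace, Ada"),
    ("grace@example.com", "Hopper, Grace"),
]

customers = integrate(crm_rows, billing_rows)
for customer in customers:
    print(customer)
```

Matching on a normalized email address is only one possible rule for spotting duplicates. Real sources often need fuzzier matching on names or addresses.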
Data integration also allows for the comparison of data. When data from different business units is compared, conclusions about performance can be drawn. Furthermore, with the increasing need for cloud computing, data integration becomes an invaluable management resource.
Data Integration Tools
Data integration tools are the backbone of every successful integration process. These tools allow one to make informed decisions about which data to discard and which to retain in a secured environment. Another feature of data integration tools is the identification of data that is accessible to the public and the encryption of data that is not.
Data integration techniques can be organized according to their level of complexity.
Manual Integration or Common User Interface: Provides no unified view of the data.
Application-Based Integration: Is feasible only with a very limited number of applications.
Middleware Data Integration: Transfers the integration logic from particular applications to a new middleware layer, although some applications must still take part in the integration.
Uniform Data Access or Virtual Integration: Leaves data in the source systems and defines a set of views that provide access to the unified view. The main benefits are minimal latency of data updates to the consolidated view and no need for separate storage of the consolidated data. Drawbacks include limited history and version management, applicability only to ‘similar’ data sources, and extra load on the source systems.
Common Data Storage or Physical Data Integration: Creates a new system that stores and manages a copy of the source data independently. Benefits include data version management and the ability to combine data from disparate sources. The main drawback is that physical integration requires a separate system to hold the copied data.
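The difference between virtual and physical integration can be illustrated with a small sketch using Python's built-in `sqlite3` module. It assumes two hypothetical source tables, `sales_eu` and `sales_us`, that stand in for separate source systems. In practice the sources would usually be separate databases, so this is a simplification rather than a production design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Two hypothetical source systems, modelled here as tables in one database.
cur.execute("CREATE TABLE sales_eu (order_id INTEGER, amount REAL)")
cur.execute("CREATE TABLE sales_us (order_id INTEGER, amount REAL)")
cur.executemany("INSERT INTO sales_eu VALUES (?, ?)", [(1, 100.0), (2, 250.0)])
cur.executemany("INSERT INTO sales_us VALUES (?, ?)", [(10, 80.0)])

# Virtual integration: a view defines the unified result; no data is copied.
cur.execute("""
    CREATE VIEW all_sales_virtual AS
    SELECT 'EU' AS region, order_id, amount FROM sales_eu
    UNION ALL
    SELECT 'US' AS region, order_id, amount FROM sales_us
""")

# Physical integration: a separate table stores its own copy of the data.
cur.execute("CREATE TABLE all_sales_physical AS SELECT * FROM all_sales_virtual")

# A new order arrives in one source system after the copy was made.
cur.execute("INSERT INTO sales_us VALUES (11, 40.0)")

virtual_count = cur.execute("SELECT COUNT(*) FROM all_sales_virtual").fetchone()[0]
physical_count = cur.execute("SELECT COUNT(*) FROM all_sales_physical").fetchone()[0]
print("virtual:", virtual_count, "physical:", physical_count)
conn.close()
```

The view picks up the new order immediately, which is the low update latency of virtual integration, but every query against it runs on the source tables. The physical copy stays unchanged until it is reloaded, which is what allows it to keep history and versions independently of the sources.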
Examples of data integration tools from major vendors include the following.
Informatica: Offers scalable data governance, data migration, data quality, data synchronization, and data warehousing capabilities.
IBM: The end-to-end information integration capabilities of InfoSphere Information Server allow for cleansing, monitoring, transforming, and delivering understandable data.
Oracle: Oracle Data Integrator is a comprehensive data integration platform that covers all data integration requirements and can deliver bulk or real-time data from heterogeneous systems while minimizing the impact on source systems.
SAP: SAP BusinessObjects Integration software provides direct connectivity to enterprise applications, which allows vast amounts of data to be consolidated and transformed into reports, analyses, visualizations, and dashboards while ensuring data security and compliance.
Talend: Talend’s data integration products provide extensible, high-performing open-source tools to access, transform, and integrate data from any business system to meet operational and analytical data integration needs.
Broad questions in the life sciences frequently require the comparison of disparate data sets for meta-analysis. Both business and government organizations are in constant need of more centralized and integrated data storage facilities. Data integration allows for disparate data to be unified and compared in a meaningful way.