Big Data provides intelligence, enhances our capabilities, and creates new business opportunities.
Many business organisations use Big Data generated from various sources such as transaction systems, media, and streaming data from the Internet of Things.
In this post, I introduce the lifecycle management of the Big Data process at a high level and in simplified language, gleaned from methods I used in my data solutions. The key roles in this process are data architects, technical data specialists, data analysts, and data scientists.
Big Data architects and specialists start solutions by understanding the lifecycle. They engage in all phases, and their roles and responsibilities may differ from stage to stage. However, they need to stay on top of lifecycle management end to end.
Based on my experience, I introduce 12 distinct phases in the overall data lifecycle management, which can also apply to Big Data. I combined some relevant activities in a single phase to make it concise and easily understandable.
These phases may be implemented under different names in various data solution teams. There is no universal systematic approach to the Big Data lifecycle as the field is still evolving. For guiding purposes, I propose the following distinct phases in this area:
- Phase 1: Foundations
- Phase 2: Acquirement
- Phase 3: Preparation
- Phase 4: Input and Access
- Phase 5: Processing
- Phase 6: Output and Interpretation
- Phase 7: Storage
- Phase 8: Integration
- Phase 9: Analytics
- Phase 10: Consumption
- Phase 11: Retention, Backup, and Archival
- Phase 12: Destruction
These phases can be customised based on the need. They are not set in stone.
The foundation phase covers several aspects of the data management process. The most critical is understanding, capturing, analysing, and validating data requirements. Then comes the solution scope, including roles and responsibilities.
During the foundation phase, data architects prepare infrastructure and document technical and non-technical considerations. This document of understanding includes data rules in an organisation.
This phase requires a detailed plan facilitated ideally by a data project manager with substantial input from the Big Data solution architect and domain specialists.
A Big Data solution project definition report (PDR) can include planning, funding, commercials, risks, dependencies, issues, and resourcing. Project Managers author the PDR; however, the solution overview in this artefact is covered by the Big Data architects and specialists.
Data acquirement refers to collecting data.
Data can be obtained from various sources, internal and external to the organisation. Sources can be structured, such as transfers from a data warehouse or transaction systems; semi-structured, such as Web or system logs; or unstructured, such as media files consisting of videos, audio, or pictures.
Even though various specialists conduct data collection with the help of administrators, the Big Data architects can have a substantial role in optimally facilitating this phase.
Data governance, security, privacy, and quality controls start with the data collection phase. Thus, the Big Data architects take technical and architectural leadership of this phase.
In the data preparation phase, the collected raw data is cleaned.
In this phase, data is rigorously checked for any inconsistencies, errors, and duplicates. Any redundant, duplicated, incomplete, and incorrect data are removed. The goal is to have clean and useable datasets.
The Big Data solution architect facilitates this phase. However, data cleaning tasks can be performed by data specialists trained in data preparation and cleaning techniques.
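As a minimal sketch of typical cleaning tasks, assuming a small list of customer records with hypothetical field names, duplicate and incomplete rows can be removed like this:

```python
# Minimal data-cleaning sketch: drop duplicate and incomplete records.
# The records and field names here are hypothetical examples.

def clean_records(records):
    """Remove rows missing required fields, then remove exact duplicates."""
    seen = set()
    cleaned = []
    for rec in records:
        # Skip incomplete rows (missing required fields).
        if rec.get("customer_id") is None or not rec.get("email"):
            continue
        # Skip duplicates already seen.
        key = (rec["customer_id"], rec["email"])
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(rec)
    return cleaned

raw = [
    {"customer_id": 1, "email": "a@example.com"},
    {"customer_id": 1, "email": "a@example.com"},   # duplicate
    {"customer_id": 2, "email": None},              # incomplete
    {"customer_id": 3, "email": "c@example.com"},
]
print(clean_records(raw))  # keeps only customers 1 and 3
```

Real pipelines apply the same logic at scale with dedicated tooling, but the checks (completeness, then deduplication) follow this order.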
Data Input and Access
Data input refers to sending data to planned target data repositories or systems. Common targets are a CRM (Customer Relationship Management) system, a data lake, and a data warehouse. In this phase, data specialists transform the raw data into a useable format.

Data access refers to retrieving data through various methods, such as relational databases, flat files, and NoSQL stores.
The Big Data solution architects lead the input and access phases. However, usually, a data specialist, with the help of database administrators, performs the input and access related tasks during this phase.
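As an illustrative sketch of input and access against a relational target, SQLite stands in here for a real repository, and the table and column names are hypothetical:

```python
import sqlite3

# SQLite stands in for a real target repository; the schema is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, region TEXT, amount REAL)")

# Data input: load transformed rows into the target table.
rows = [(1, "EMEA", 120.0), (2, "APAC", 75.5), (3, "EMEA", 42.0)]
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)

# Data access: query the repository through SQL.
total = conn.execute(
    "SELECT SUM(amount) FROM sales WHERE region = 'EMEA'"
).fetchone()[0]
print(total)  # 162.0
```

The same input-then-access pattern applies whatever the target is; only the connector and query dialect change.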
Data processing starts with the raw data. Data specialists convert it into a readable format, giving it form and context. After this activity, data analysts and data scientists can interpret the data using data analytics tools.
They can use open-source Big Data processing tools such as Hadoop, MapReduce, Impala, Hive, Pig, and Spark SQL. A common store for real-time data access is HBase, and a near-real-time processing tool is Spark Streaming.
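The map-and-reduce pattern behind tools like MapReduce and Spark can be sketched in plain Python; this is a conceptual illustration of the technique, not actual Hadoop code:

```python
from functools import reduce
from collections import Counter

# Conceptual word-count in the MapReduce style (a stand-in, not Hadoop itself).
lines = ["big data big value", "data lifecycle", "big picture"]

# Map step: emit (word, 1) pairs from each input line.
mapped = [(word, 1) for line in lines for word in line.split()]

# Reduce step: sum the counts per word (the framework would shuffle first).
def reducer(acc, pair):
    word, count = pair
    acc[word] += count
    return acc

counts = reduce(reducer, mapped, Counter())
print(counts["big"])  # 3
```

In a real cluster, the map and reduce steps run distributed across nodes; the logic per record is the same.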
- Data processing also includes data annotation, data integration, data aggregation, and data representation.
- Data annotation is labelling the data. Once data is labelled, it can be ready for machine learning.
- Data integration aims to combine data from different sources and provide consumers with a unified view of it.
- Data representation refers to the way data is processed, transmitted, and stored. These three essential functions depict the representation of data in the lifecycle.
- Data aggregation aims to compile data from multiple databases into combined datasets to be used for data processing.
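Data aggregation in the sense above can be sketched as combining rows from multiple sources into one dataset keyed by record type; the sources and keys here are hypothetical:

```python
from collections import defaultdict

# Hypothetical extracts from two source databases: (record_type, count) pairs.
source_a = [("orders", 10), ("returns", 2)]
source_b = [("orders", 5), ("invoices", 7)]

# Aggregate: compile both sources into one combined dataset keyed by type.
aggregated = defaultdict(int)
for key, value in source_a + source_b:
    aggregated[key] += value

print(dict(aggregated))  # {'orders': 15, 'returns': 2, 'invoices': 7}
```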
Data Output and Interpretation
In the data output phase, the data is ready for consumption by the business users. Data specialists can transform data into useable formats such as plain text, graphs, processed images, and video files.
The output phase signals that the data is ready for use and passes it to the next stage for storage. In some organisations, this phase is called data ingestion, which aims to import data, for immediate or future use, into a database format.
The data ingestion process can be a real-time or batch process. Some commonly used data ingestion tools are Sqoop, Flume, and Spark Streaming.
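A batch ingestion run can be sketched as reading source records in chunks and writing them to the target store; this is a simplified stand-in for tools like Sqoop or Flume, with hypothetical data and schema:

```python
import sqlite3

# Simplified batch ingestion: load source records into a target table in chunks.
# SQLite and the schema are stand-ins for a real target and ingestion tool.
source_records = [(i, f"event-{i}") for i in range(10)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, payload TEXT)")

BATCH_SIZE = 4
for start in range(0, len(source_records), BATCH_SIZE):
    batch = source_records[start:start + BATCH_SIZE]
    conn.executemany("INSERT INTO events VALUES (?, ?)", batch)
    conn.commit()  # commit once per batch, as a batch job would

ingested = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
print(ingested)  # 10
```

A real-time pipeline would replace the loop with a continuous stream of small commits, but the load-and-commit structure is similar.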
Once the data output phase is completed, data specialists store data in designated storage units. These units are part of the data platform and infrastructure considering non-functional aspects such as capacity, scalability, security, compliance, performance and availability.
The data platform infrastructure can consist of storage area network (SAN), network-attached storage (NAS), or direct-attached storage (DAS) formats. Data and database administrators can manage stored data and allow access to the defined user groups.
Big Data storage includes technology stacks such as database clusters, relational data storage, and extended data storage.
The file formats such as text, binary, or other specialised structures such as Sequence, Avro and Parquet are considered in the data storage design phase.
In traditional models, once the data is stored, the process ends.
However, for Big Data, there may be a need to integrate stored data for various purposes.
Data integration is a complex process. Big Data architects design the use of various data connectors for the integration of Big Data solutions. There may be use cases and requirements for many connectors such as ODBC, JDBC, Kafka, DB2, Amazon S3, Netezza, Teradata, Oracle and many more based on the data sources used in the solution.
Some data models may require the integration of data lakes with a data warehouse or a data mart. There may also be application integration requirements.
For example, some integration activities may comprise integrating data with dashboards, tableau, websites, or data visualisation applications. This activity may overlap with the next phase, which is data analytics.
Integrated data can be valuable and productive for data analytics.
Data analytics is a significant component of Big Data solutions. This phase is critical because of the business value generated by Big Data.
The commonly used tools for data analytics are notebooks based on Scala, Python, and R.
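As a toy illustration of the analytics step, assuming a small hypothetical numeric dataset, descriptive statistics can be computed with Python's standard library; a real notebook would use far richer tooling:

```python
import statistics

# Hypothetical daily revenue figures produced by the earlier lifecycle phases.
revenue = [120.0, 135.5, 99.0, 150.25, 110.0]

# Basic descriptive analytics: central tendency and spread.
mean = statistics.mean(revenue)
spread = statistics.stdev(revenue)

print(round(mean, 2))    # 122.95
print(round(spread, 2))
```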
There can be a team responsible for data analytics led by a chief data scientist. The data architect has a limited role in this phase. Data architects must ensure the stages of the lifecycle are completed with rigour.
Once the data analytics phase is completed, the data is turned into information ready for consumption. Consumers can be internal or external users.
Data consumption requires policies, rules, regulations, principles, and guidelines. The consumption can be based on a service provision process. Data governance bodies create rules for the provision of data.
The lead Big Data Solution Architect facilitates creating these policies, rules, principles and guidelines using an architectural framework.
Retention, Backup, & Archival
Critical data need to be backed up for protection. Backup is also an industry compliance requirement. There are established data backup strategies, techniques, methods, and tools.
The Big Data Solution Architect usually delegates the design of this phase to an infrastructure architect assisted by several data, database, storage, and recovery domain specialists.
Some data may need to be archived for a defined period for regulatory and business compliance reasons. The data retention strategy must be documented and approved by the governing body.
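A retention rule can be sketched as computing a cutoff from the mandated period and flagging records that exceed it; the seven-year period here is purely a hypothetical example, not a statement of any regulation:

```python
from datetime import date, timedelta

# Hypothetical retention rule: archive records older than seven years.
RETENTION_DAYS = 7 * 365

def due_for_archival(record_date, today):
    """Return True if the record has exceeded the retention period."""
    return (today - record_date) > timedelta(days=RETENTION_DAYS)

today = date(2024, 1, 1)
print(due_for_archival(date(2015, 6, 1), today))  # True
print(due_for_archival(date(2023, 6, 1), today))  # False
```

In practice, such rules are driven by the documented retention strategy per data class, not a single hard-coded constant.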
There may be regulatory requirements to destroy a particular type of data after a period. These requirements may vary across the industries and organisations that own the data.
Big Data lifecycle management is a recursive process. Each solution can use a specific lifecycle process.
Even though many solutions follow a chronological order for the data life cycle management, some phases may overlap and can be done in parallel.
The life cycle proposed in this article is only a guideline. This proposed lifecycle management can be customised based on the structure of the data solution, unique data platforms, data solution requirements, use cases, industry compliance, and dynamics of the departments in an organisation.