Implementation of Change Data Capture in ETL Process for Data Warehouse Using HDFS and Apache Spark
IWBIS 2017

978-1-5386-2038-0/17/$31.00 © 2017 IEEE

parallel processing. MapReduce programming was used for simple processes such as the transfer of data from the database to HDFS using Apache Sqoop, an add-on tool for Hadoop that moves data from a data source into the Hadoop environment. The main purpose of this research is to reduce the time of the ETL process by using CDC with a distributed approach. The CDC technique can reduce the amount of data that has to be processed in ETL, and it can filter out only the updated data, so the ETL process becomes more efficient. Meanwhile, Apache Spark and HDFS are used so that the CDC technique runs on a distributed system to increase its performance. This paper is organized as follows. Sections II and III discuss ETL and CDC, respectively. Our design and implementation for distributed ETL are presented in Section IV, the results of our experiments in Section V, and the comparison between the existing model and the proposed model in Section VI.

II. ETL PROCESS IN DATA WAREHOUSE

The data from the operational system comprise different types and structures, so they cannot be used directly in the data warehouse. Thus, the data from the operational system need to be processed before entering the data warehouse. This processing aims to integrate and clean the data and to change it into a predetermined structure. Processing of operational data prior to its use in the data warehouse is known as Extract, Transform, Load (ETL). Extraction is the initial process in ETL, in which data from the source are transferred into the ETL environment for processing. Transformation is the task of changing the data structures into the predetermined format as well as improving data quality. Load refers to the transfer of the transformed data into the data warehouse or repository; it is also known as delivering. According to [3], delivering is the process of transferring transformed data into a dimensional model accessed by the user. In a data warehouse that uses the star schema, the load process can be classified into two processes: the load process for fact tables and the load process for dimension tables.

The ETL process, as a backend, needs to fulfill three important roles, viz., to deliver the data effectively to the data warehouse user, to add value to the data during the cleaning and conforming phase, and to protect and document data lineage [4]. These three roles of ETL take most of the development time of the whole data warehouse system; [4] states that 70% of time and energy are spent only on transforming data from the source to the data warehouse.

III. CHANGE DATA CAPTURE (CDC)

As previously mentioned, Change Data Capture (CDC) methods can be used to improve the ETL process. CDC has several definitions. One is that CDC is a method to integrate data based on the identification, capture, and delivery of only the changes to the data in the operational system [5]. Another is that CDC is a technique to monitor the operational data source with a focus on data changes [3]. Based on these two definitions, it can be concluded that CDC is a method to determine and detect changes in the data that occur during transactions in the operational system.

In general, CDC can be used to support the ETL system. The goal is to reduce the amount of data processed by ETL. The ETL process can run more efficiently because it only processes data that have been changed. This also enables more frequent updates from operational databases to the data warehouse.

A. CDC Methods

In general, applications of CDC can be categorized into immediate data extraction and deferred data extraction [6]. Immediate data extraction allows extraction in real time when a change occurs in the data source. In the deferred data extraction approach, the extraction is performed periodically (at specified intervals), so the data extracted are those that have changed since the last extraction.

Fig. 2. CDC methods for immediate data extraction (adapted from [6]).

There are three methods of CDC for immediate data extraction, depicted in Figure 2: using the transaction log, using database triggers, and capture in the source application. The CDC method using the transaction log utilizes the log recorded by the RDBMS of a database data source. It works by reading every Create, Update, Delete (CRUD) event that the RDBMS records in a log file; systems that use this method find changed or added data by reading the contents of that log file. The second method utilizes database triggers. A database trigger is a procedural function, generally provided by the RDBMS, that takes an action when a CRUD operation is performed on particular data; this function can be used to propagate updates when the data change. The third method is capture in the source application. This method relies on the application or system in front of the data source having the capacity to apply CDC itself. It is quite effective at reducing the load of the ETL process, especially during the extraction phase. However, this method is limited to applications that can be modified; it is not applicable to proprietary applications.
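For illustration, a minimal sketch of the trigger-based method is shown below. It uses SQLite only for brevity; the table, columns, and change-log layout are hypothetical and are not taken from the paper, whose data source is not SQLite.

```python
import sqlite3

# Generic sketch of trigger-based CDC: every INSERT/UPDATE/DELETE on the
# source table is also recorded in a change table that the extraction step
# can read later. Table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL);
CREATE TABLE orders_changes (
    id INTEGER, amount REAL, op TEXT,
    changed_at TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE TRIGGER orders_ins AFTER INSERT ON orders BEGIN
    INSERT INTO orders_changes (id, amount, op) VALUES (NEW.id, NEW.amount, 'I');
END;
CREATE TRIGGER orders_upd AFTER UPDATE ON orders BEGIN
    INSERT INTO orders_changes (id, amount, op) VALUES (NEW.id, NEW.amount, 'U');
END;
CREATE TRIGGER orders_del AFTER DELETE ON orders BEGIN
    INSERT INTO orders_changes (id, amount, op) VALUES (OLD.id, OLD.amount, 'D');
END;
""")

conn.execute("INSERT INTO orders VALUES (1, 10.0)")
conn.execute("UPDATE orders SET amount = 12.5 WHERE id = 1")
conn.commit()

# The ETL extraction would read only orders_changes instead of the full table.
print(conn.execute("SELECT * FROM orders_changes").fetchall())
```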


Fig. 3. CDC methods for deferred data extraction (adapted from [6]).

CDC for deferred data extraction is generally classified into two methods, shown in Figure 3: one looks at the data based on a timestamp, and the other compares files. The first method utilizes the time column that most data sources have. In several cases, the time column provides detailed information such as when the data were entered and when they were changed, which can easily be used to detect data changes. The second method is more flexible because it does not depend on the attributes of the data source: it compares the data from the previous extraction with the current data, attribute by attribute, to detect the records that have changed.
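A minimal sketch of the timestamp-based method in PySpark is shown below; the HDFS paths, the modified_at column, and the stored watermark value are assumptions for illustration, not details from the paper.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("timestamp-cdc-sketch").getOrCreate()

# Watermark saved by the previous ETL run (hypothetical value and storage).
last_extraction = "2017-06-01 00:00:00"

# Hypothetical staging copy of the source table in HDFS.
source = spark.read.parquet("hdfs:///staging/orders")

# Only rows changed since the last run are passed on to transformation/loading.
changed = source.filter(F.col("modified_at") > F.lit(last_extraction))
changed.write.mode("overwrite").parquet("hdfs:///etl/orders_delta")
```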


B. CDC Limitation

Like the extraction process, the CDC method needs to be adjusted to the characteristics of the data source, so the same CDC method can be applied differently in different systems. The CDC methods mentioned previously cannot be applied to all types of data; each method has limits that influence the CDC process. One limitation is that a CDC method can be too dependent on the system of the data source: several methods, such as timestamps, triggers, and transaction logs, require configuration on the side of the operational system. The timestamp method can only be used if the data carry a time record for every change; otherwise the data structure has to be changed to add a timestamp column, which is one of the weaknesses of this method [7]. As with timestamps, trigger-based CDC requires access to the data source system to create a function that detects changes in the data; in principle, it relies on the trigger facility that most RDBMSs provide. Aside from the drawback of having to create such a function, this method is also limited in the choice of data, as the data have to be in a database managed by an RDBMS. The other method utilizes the RDBMS transaction log as the record of each change to the database. Its drawback is that the RDBMS has to be monitored while it records a log of every transaction, to prevent the loss of transactions [7].

To overcome the limitations of these methods, changes can be detected by comparing data, an approach commonly referred to as snapshot differential. Initially, this process accesses the data using a full extract of the entire current state. This requires large amounts of resources and influences the performance and running time of the CDC process; as the amount of data grows, more time and computation are required. Other approaches are the delta view and parallel processing. The delta view is applied to data sources in the form of a database. The principle is to create a view that contains the keys of the records to be handled by the ETL process: the delta view stores the keys of the updated, deleted, and inserted records, which serve as the change information for the ETL process [7]. Even though this approach still requires access to the data source, it does not change the data structure, so it is easier to apply. On the other hand, parallel processing can overcome the problem of processing large amounts of data with CDC. The principle is to reduce the load on the system by splitting the process across several resources that run simultaneously, so the CDC process can be completed in less time.
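As a rough illustration of the delta-view idea: the paper places the view inside the source RDBMS, but the same principle can be expressed with Spark SQL temporary views, as sketched below; all table, key, and path names are assumptions.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-view-sketch").getOrCreate()

# Hypothetical previous and current snapshots registered as SQL views.
spark.read.parquet("hdfs:///snapshots/orders_prev").createOrReplaceTempView("orders_prev")
spark.read.parquet("hdfs:///snapshots/orders_curr").createOrReplaceTempView("orders_curr")

# "Delta views" holding only the keys of inserted and deleted records.
spark.sql("CREATE OR REPLACE TEMPORARY VIEW inserted_keys AS "
          "SELECT id FROM orders_curr EXCEPT SELECT id FROM orders_prev")
spark.sql("CREATE OR REPLACE TEMPORARY VIEW deleted_keys AS "
          "SELECT id FROM orders_prev EXCEPT SELECT id FROM orders_curr")

# The ETL process then touches only the rows whose keys appear in a delta view.
inserted_rows = spark.sql(
    "SELECT c.* FROM orders_curr c JOIN inserted_keys k ON c.id = k.id")
inserted_rows.write.mode("overwrite").parquet("hdfs:///etl/orders_inserted")
```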

C. CDC in Distributed Systems

A distributed system is a group of subsystem components connected in a network that can communicate with each other. In a distributed system, hardware and software components communicate and coordinate their processes through message passing [8]. The main goal of a distributed system is to share resources in order to enhance system performance. In practice, two things can be done in a distributed system: using a distributed file system to increase storage capacity, and using parallel processing to increase throughput.

Parallel processing can increase the performance of the CDC process by dividing a large process into smaller processes that run simultaneously on different computers (nodes) in a cluster. Each node can run its process without waiting for the completion of processes on other nodes, which reduces the time needed to complete a task. In addition to saving time, parallel processing also saves resources, since no single computer has to perform the whole large task. There are two ways to implement parallel processing in the CDC method: the MapReduce programming paradigm with Apache Hadoop, and Spark SQL from Apache Spark.


Fig. 4. MapReduce framework [8].







MapReduce is a programming paradigm that carries out parallel computation using two main functions: map and reduce. In essence, the principle of MapReduce is the same as that of parallel computation in general: the input data are divided into several parts, each part is processed, and at the end the results are combined into one final result [8]. Figure 4 illustrates the MapReduce process: the map function takes data in the form of key-value pairs as input to be processed, and the reduce function receives the map results and processes them to obtain the output. The CDC method can be implemented with MapReduce by adopting the divide-and-conquer principle, similar to the approach in [9]: the data are divided into several parts that are processed separately, and each processed part then enters the reduce phase, which detects the changes in the data.

As an alternative to MapReduce, parallel processing for the CDC method can be implemented with Spark SQL. Spark SQL is a module of Apache Spark that integrates relational processing with the Apache Spark API [2]. Spark SQL can run queries in the same way as data processing in a database, so the CDC method can be expressed with commonly used operations such as JOIN, FILTER, and OUTER JOIN; as a result, CDC processing with Spark SQL is easier to implement than with MapReduce.

MapReduce and Spark SQL cannot be used for parallel processing without distributed storage, because every process on every node needs to be able to access the data being processed, so the data have to be available and accessible at each node. Distributed storage keeps the data by replicating them onto several nodes in a cluster, so that the data can be accessed from any node. A commonly used platform for distributed storage is HDFS (Hadoop Distributed File System).

A CDC method using parallel processing can greatly reduce the processing time needed to detect changes, but it requires a lot of configuration and preparation before it is ready to be used. In this study, the process was implemented on a distributed system infrastructure (HDFS and Apache Spark). The process was run using the Spark SQL library from Apache Spark and the MapReduce programming paradigm from Apache Hadoop.
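A rough sketch of the MapReduce-style change detection described above, expressed with PySpark RDD operations, is given below; the snapshot paths and the key column id are assumptions, and this is not the paper's program.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("mapreduce-style-cdc-sketch").getOrCreate()

# Hypothetical previous and current snapshots stored in HDFS.
old_df = spark.read.parquet("hdfs:///snapshots/orders_prev")
new_df = spark.read.parquet("hdfs:///snapshots/orders_curr")

# Map phase: tag every record with its snapshot of origin, keyed by "id".
old_rdd = old_df.rdd.map(lambda r: (r["id"], ("old", tuple(r))))
new_rdd = new_df.rdd.map(lambda r: (r["id"], ("new", tuple(r))))

def classify(pair):
    """Reduce-style step: decide what happened to one key."""
    key, tagged_rows = pair
    versions = dict(tagged_rows)  # {"old": row, "new": row}; either may be missing
    if "old" not in versions:
        return (key, "inserted")
    if "new" not in versions:
        return (key, "deleted")
    return (key, "updated" if versions["old"] != versions["new"] else "unchanged")

changes = (old_rdd.union(new_rdd)
           .groupByKey()            # gather the old/new versions of each key
           .mapValues(list)
           .map(classify)
           .filter(lambda kv: kv[1] != "unchanged"))

print(changes.take(10))
```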



 











Fig. 5. The previous ETL process (above) and proposed incremental ETL process using distributed system (below).

IV. DESIGN AND IMPLEMENTATION

The proposed ETL process uses an incremental extract method and only processes changed data. As shown in Figure 5, the current ETL process uses full extraction from the databases and then performs transformation and loading on the whole extracted data set; the whole process is performed with Pentaho Data Integration (PDI). The proposed ETL process, in contrast, extracts only new and changed data using the MapReduce framework and Apache Sqoop, while the transformation and loading are still performed with PDI. This section describes our big data cluster environment and the implementation of CDC.

A. Server Configuration

The distributed system implemented was peer-to-peer. Peer-to-peer is an architectural design in which every process has the same role: nodes interact without a distinction between client and server, or between the computers on which an activity runs [8]. The main reason for using this design is to maximize resources, because with this architecture all nodes carry out the processing simultaneously. Figure 6 displays the hostname and IP address of each server.

B. Apache Hadoop Configuration

Hadoop is a framework that provides facilities for distributed storage and for processing large amounts of data with the MapReduce programming paradigm. One of the important characteristics of Hadoop is that data partitioning and computation are spread over several hosts, and Hadoop can run the application in parallel [10]. Hadoop consists of two main components, viz., the Hadoop Distributed File System (HDFS), which provides distributed storage, and MapReduce, which supports parallel computation. In this study, the most active component used was HDFS, because the parallel processing itself used Apache Spark and Sqoop.
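As an illustration of the database-to-HDFS transfer step mentioned above, a Sqoop import might look roughly like the following; the JDBC URL, credentials, table, column, and directory names are placeholders rather than values from the paper.

```python
import subprocess

# Hypothetical incremental Sqoop import: pull only rows whose "modified_at"
# column is newer than the value recorded by the previous run, into HDFS.
subprocess.run([
    "sqoop", "import",
    "--connect", "jdbc:mysql://dbhost/sales",   # placeholder source database
    "--username", "etl_user",
    "--password", "secret",
    "--table", "orders",
    "--target-dir", "/staging/orders",          # HDFS directory
    "--incremental", "lastmodified",
    "--check-column", "modified_at",
    "--last-value", "2017-06-01 00:00:00",
    "-m", "4",                                  # number of parallel map tasks
], check=True)
```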


Fig. 6. Configuration of distributed system.

Fig. 7. Spark architecture [12].

HDFS consists of two parts: the namenode and the datanodes. The role of the namenode is to maintain the tree structure of the file system and its metadata; it also maintains the block locations of each file allocated on the datanodes. The datanodes, in contrast, are the block storage sites managed by the namenode; they store the data and report to the namenode which blocks they hold. Thus, the data stored on the datanodes can only be accessed through the namenode. The namenode is generally located on a separate server from the datanodes. However, in this study the cluster approach used was peer-to-peer in order to maximize the available resources, so all servers were configured as datanodes and one server also as the namenode, as shown in Figure 6.

C. Apache Spark Configuration

Spark is a platform for processing conducted in a cluster. When processing large amounts of data, speed becomes a priority, so Spark was designed for rapid data processing; one of Spark's features is its capacity for in-memory data processing [11]. Spark can also be used from several programming languages, such as Java, Scala, Python, and R. In this study, the language and library used was Python version 3.0.

The Spark architecture is generally classified into three parts: the Spark core, the cluster manager, and supporting components (MLlib, Spark SQL, Spark Streaming, and graph processing), as illustrated in Figure 7. The Spark core provides the basic functions, such as task scheduling and memory management. In Apache Spark there are two server roles that have to be configured: master and worker, as shown in Figure 6. The master manages the process and allocates resources to the worker servers, while the workers execute the process. In this experiment, every server has the worker role and one server serves as both master and worker, so that all resources in the experiment are used and every server is at the same level.

A cluster manager is needed by Spark to maximize flexibility in cluster management. Spark can work with a third-party cluster manager such as YARN or Mesos, and it also has its own cluster manager, the standalone scheduler. In this study, Spark was configured as a standalone cluster.

D. Implementation

Fig. 8. Our CDC method with the snapshot difference approach.

In this study, the method used was to compare two data sets and find the differences between them. The CDC method using snapshot difference was divided into several stages, as illustrated in Figure 8. The first stage takes the data from the source; the data were taken by full load, i.e., the entire data from the beginning up to the current state. The result of the extraction was loaded into HDFS. The data were then processed by the program created to run the CDC process using the snapshot difference technique. The snapshot difference was implemented using the outer-join function of Spark SQL: the program looks at the null values in the outer-join result of the two records, which is the indicator for the newest data. If damage to the data source was detected during the process, the program could automatically change the reference data used for comparison and store the old data as a snapshot.
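The outer-join logic described above might look roughly like the following PySpark sketch; the HDFS paths, the key column id, and the compared column amount are assumptions for illustration, not the paper's actual program.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("snapshot-diff-sketch").getOrCreate()

# Previous and current snapshots of the source table, stored in HDFS.
old = spark.read.parquet("hdfs:///snapshots/orders_prev")
new = spark.read.parquet("hdfs:///snapshots/orders_curr")

# Full outer join on the key: a null on either side classifies the record.
joined = old.join(new, on=old["id"] == new["id"], how="full_outer")

changes = joined.select(
    F.coalesce(new["id"], old["id"]).alias("id"),
    F.when(old["id"].isNull(), "inserted")
     .when(new["id"].isNull(), "deleted")
     .when(old["amount"] != new["amount"], "updated")
     .otherwise("unchanged").alias("change_type"),
).filter(F.col("change_type") != "unchanged")

# Only the detected changes are handed to the transformation/loading steps.
changes.write.mode("overwrite").parquet("hdfs:///etl/orders_changes")
```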
V. EVALUATION AND ANALYSIS

The ETL model using the CDC method was tested in the same way as the existing ETL method. Testing was done using a script with data from the SCELE log database and the Apache

Fig. 11. Comparison of the running time of the existing ETL method and the CDC-based ETL method.











 







