Intro to Data Analytics - ASSIGNMENT TASK 1

Title	Intro to Data Analytics - ASSIGNMENT TASK 1
Author	Aaima Kausar
Course	Introduction to Data Analytics
Institution	University of Technology Sydney

Summary

Assessment report 1 for Intro to Data Analytics...

Description

Assignment 1, 31250 Introduction to Data Analytics SPR_2017

Project Proposal: Breast Cancer Research Centre
Identifying the genetic sequence responsible for the rapid progression of breast cancer

1.1 Abstract

This document is a project proposal prepared for the Breast Cancer Research Centre. It describes a data mining method designed to navigate the genetic information of the centre’s patients and identify a genetic sequence, present in breast cancer patients, that drives the progression of the disease. The method requires DNA microarray data and SNP data, both of which the Breast Cancer Research Centre already holds. Once integrated into the data analytics model, this data will give a better understanding of the genetic sequences present in the cancerous tissue of patients whose disease is progressing, allowing a gene sequence linked to the rapid progression of breast cancer to be identified. The centre can then focus on the gene in question and research ways in which the mutation might be reversed, altered or treated. The Breast Cancer Research Centre has access to data from patients and databases around the world, which gives it a broad basis for identifying the gene sequence that plays a key role in the rapid progression of cancer in patients.

1.2 Aims

The aim of the data analytics method is to identify the gene sequence pattern that plays a key role in the rapid progression of breast cancer in some patients. The results will be of great importance to the Breast Cancer Research Centre: they will support research aimed at improving quality of life for breast cancer patients and give researchers insight into the location of the genetic sequence. Data mining is well suited to analysing the large volume of data the centre holds and to filtering noise and unnecessary data points from the dataset. By combining several data mining and data analytics methods, it should be possible to locate the most frequently recurring gene among breast cancer patients whose disease progresses faster than that of their slower-progressing counterparts. Existing data mining techniques can sort the data and flag potential gene mutations, patterns and inherited sequences in order to identify the gene, or set of genes, that plays a major role in rapid progression. This includes sifting through the data of patients who have an inherited link to breast cancer as well as those with no apparent family history of the disease. Identifying one gene sequence, or a set of gene sequences, would be a daunting task to carry out manually; with the proposed combination of data mining methods, it becomes feasible.

Aaima Kausar, 12582728 –


1.3 Objectives

To accomplish the aim in section 1.2, a number of objectives linked to the identification of the gene must first be achieved. The Breast Cancer Research Centre has access to, and must consider, a large amount of patient data in order to correctly identify the genes related to the progression of breast cancer. To do this effectively and efficiently, a system must be put in place to separate the most relevant patient data from data that would become noise during data understanding. This can be done by considering only patients who show rapid progression of tumorous breast cancer for sequential pattern analysis and similarity analysis.

A second objective is clarity of data. The data analytics process must be transparent and legible to the researchers so that they can correctly identify the gene sequence in question. By considering only patients who display rapid progression of the same type of breast cancer, researchers will not have to cross-check against conflicting data, which makes similarity and comparative analysis easier.

The data must also be separated into patients who have an inherited link to breast cancer and patients who have no family history of the disease. To correctly identify the gene sequence responsible for rapid progression, inherited data must not be allowed to counteract non-inherited data: breast cancers in the two sets behave differently, so each set must be analysed in isolation.
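The filtering and separation described in these objectives can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Patient` record, its field names and the sample values are hypothetical and do not reflect the centre's actual data schema.

```python
from dataclasses import dataclass


@dataclass
class Patient:
    # All fields are hypothetical placeholders for the centre's clinical data.
    patient_id: str
    cancer_type: str
    rapid_progression: bool  # assumed flag derived from clinical records
    family_history: bool     # True if there is an inherited link to breast cancer


def select_cohorts(patients, cancer_type):
    """Keep rapid progressors of one cancer type, split by inheritance history."""
    inherited, non_inherited = [], []
    for p in patients:
        if p.cancer_type != cancer_type or not p.rapid_progression:
            continue  # treated as noise for this analysis
        target = inherited if p.family_history else non_inherited
        target.append(p.patient_id)
    return inherited, non_inherited


patients = [
    Patient("P1", "ductal", True, True),
    Patient("P2", "ductal", True, False),
    Patient("P3", "ductal", False, True),   # slow progression: excluded
    Patient("P4", "lobular", True, False),  # different cancer type: excluded
]
inherited, non_inherited = select_cohorts(patients, "ductal")
print(inherited, non_inherited)
```

Keeping the two cohorts in separate lists mirrors the requirement that inherited and non-inherited cases are analysed in isolation.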

1.4 Possible Outcomes

The desired outcome of the method is to identify the gene related to the rapid progression of breast cancer. By overlaying the results with existing data, researchers will be able to draw conclusions about the patterns and behaviour of breast cancer as it progresses in patients, which may point to new and more effective treatments to slow or reverse its effects. By identifying the gene sequence or pattern linked with rapid progression, the Breast Cancer Research Centre may also be able to locate the genes most susceptible to mutations that not only propagate the disease but are also likely to trigger its onset.

Another possible outcome is the identification of a variant of the known BRCA1 or BRCA2 gene as the propagating factor in some types of breast cancer. As both genes have already been linked to the onset of breast cancer, a variant of one or both could be identified as a catalyst for rapid progression. This could greatly aid the search for a cure, as identifying key genetic patterns can give insight into treatment methods and prevention measures that were previously overlooked or not considered. By finding a genetic link to an increased rate of cancer progression, medical specialists can further develop methods to slow the progression of breast cancer and thereby extend the lives of patients.


2.1 Background Information (Breast Cancer Research)

2.1.1 Overview of the Problem Focus

In medical research on cancer, a key focus is the analysis of DNA sequences. The DNA sequence is the base genetic code of all living organisms and strongly influences their phenotype and disease susceptibility. DNA has a double-helix structure, and its sequence is determined by the order in which nucleotides are joined to one another. If the order of the nucleotides is altered, a genetic mutation occurs. Mutations in the genes BRCA1 and BRCA2 have been found to increase the probability of being diagnosed with breast cancer (Journal of Clinical Oncology, 2008). However, while these gene sequences have been linked to the onset of breast cancer, it has not been determined what drives the rapid progression of the disease in some people.

The Breast Cancer Research Centre has access to microarray data, clinical data from patients at varying stages of breast cancer, and a demographic database of patients. These data sources can be extremely useful in identifying the key gene sequence that causes the rapid progression of breast cancer. By combining the information the centre holds, researchers can single out the group of people who display a recurring pattern of declining health as the cancer progresses and look for a gene that manifests only in these individuals.

Metabolic heterogeneity is a phenomenon recorded in other types of cancer. It has been shown to increase the rate of cancer progression and contribute to tumour growth in human lung cancer, and can be attributed to a genetic mutation (Cell, 2016). Metabolic heterogeneity and the genetic mutation that occurs in such tumours should therefore be considered, as the gene sequence that drives rapid progression may be linked to the gene sequence that enables metabolic heterogeneity in some tumorous tissue. BRCA1 and BRCA2 are the best-known genes directly related to breast cancer. While mutations in these genes are associated with the onset of breast cancer, one of their numerous variants may also explain the rapid progression of the disease in patients who carry that specific variant.

2.1.2 Current Methods in Practice

Sequential pattern mining is a data mining method that has been used to discover gene interactions and their contextual information (Journal of Biomedical Semantics, 2015). It extracts information from sequential data in order to uncover interesting, unexpected and useful patterns within a database. In cancer research it is applied to databases that store the interactions of genes with other genes or with proteins, which are compiled to reveal interaction patterns that recur in cancerous samples. While this is an effective way of detecting general patterns, it is less suited to uncovering specific interactions between genes. If a researcher wishes to analyse the patterns and interactions specific to a single gene sequence, the database will not support such a complex search and will return only general interaction patterns.
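To make the idea of sequential pattern mining concrete, the following is a minimal brute-force sketch in Python. It is not the algorithm used in the cited work; it simply counts how many sequences contain each ordered pattern (gaps allowed) and keeps those that meet a minimum support. The event labels "A", "B" and "C" are made-up placeholders, and exhaustive candidate enumeration like this would only be practical for short sequences.

```python
from itertools import combinations


def is_subsequence(pattern, sequence):
    """True if the items of pattern occur in sequence in the same order (gaps allowed)."""
    it = iter(sequence)
    return all(item in it for item in pattern)


def frequent_patterns(sequences, min_support, max_len=3):
    """Return {pattern: support} for ordered patterns found in >= min_support sequences."""
    candidates = set()
    for seq in sequences:
        for n in range(1, max_len + 1):
            # combinations() preserves order, so each tuple is a subsequence of seq
            candidates.update(combinations(seq, n))
    result = {}
    for pattern in candidates:
        support = sum(is_subsequence(pattern, seq) for seq in sequences)
        if support >= min_support:
            result[pattern] = support
    return result


# Hypothetical event sequences, one per sample.
events = [
    ["A", "B", "C"],
    ["A", "C"],
    ["B", "A", "C"],
]
patterns = frequent_patterns(events, min_support=3)
print(patterns)
```

In this toy input, "A" followed later by "C" is the only multi-item pattern present in all three sequences, which is the kind of recurrence the method is meant to surface.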


The National Human Genome Research Institute has devised a visual resource, BRCA1 Circos, that compiles and displays functional data on all documented BRCA1 variants. It draws on data from functional assays and bioinformatic predictions to aid interpretation. This makes it an extremely useful resource for analysing the consequences of variants of the gene. Although it covers only the BRCA1 gene, it can help narrow down a demographic based on the variants and their effects. Furthermore, the concept behind this visualisation is key to presenting data in a transparent and legible format. One example of the BRCA1 Circos imaging that the institute provides is a side-by-side comparison of the BRCA1 gene in which one exon (exon 11) is shown reduced in size relative to the whole gene, versus exon 11 shown in proportion to the gene (Fig.1a and Fig.1b). This illustrates the potential of the visualisation, as it allows researchers to assess the differences between BRCA1 variants more thoroughly.

Fig.1a and Fig.1b: (a) Circos image of BRCA1 with exon 11 in proportion to the whole gene (left) and (b) Circos image of BRCA1 with exon 11 reduced in proportion to the whole gene (right) (National Human Genome Research Institute, 2017)

Another method currently used to analyse DNA from cancer samples is microarray data analysis. This method uses a database of microarray data to discover new types of cancerous tumours, predict their class and compare classes of cancer (Landes Bioscience, 2013). It has yet to be widely accepted for the diagnosis of human cancers, although the number of studies using it is growing rapidly. Several analysis methods have been developed specifically for microarray data in cancer research, and they are effective for both haematological and solid tumours across various cancer types. Microarray data can be used to obtain reliable gene profiles. Although the method has proved effective and has helped improve the accuracy of classic diagnostic techniques for tumour-specific markers, it has been less effective in non-tumorous cancer research. Despite these shortcomings, it is an emerging approach with the potential to accomplish more complex tasks than sequential pattern mining.


3.1 Data Analytics Scenario and Methodology based on CRISP-DM

3.1.1 Objective

The focus of this proposal is to provide a process to detect and identify the gene sequence responsible for the rapid progression of breast cancer in some patients. The Breast Cancer Research Centre has access to a large amount of microarray DNA data, SNP data and clinical records of cancer patients. Data mining techniques can be applied to this data to narrow down the patient demographic and identify the specific group of patients whose cancer has advanced rapidly from stage to stage. With further analytics methods, the microarray DNA data can then be analysed to determine which gene sequence is responsible for the rapid deterioration of these patients’ health. The key objective of this proposal is therefore to present a process, built on data mining techniques and technology, that detects rapid progression of breast cancer in certain patients and identifies the gene sequence pattern responsible for it. The project will be judged a success if it discovers and identifies the gene that is present and active in patients with rapidly progressing cancer.

3.1.2 Assessment of the Situation

Several data mining methods must be combined to achieve the objective of this project. The base volume of data needs to be narrowed to patients who face rapid progression of their disease and who also have an inherited influence of gene sequences. The Breast Cancer Research Centre has access to a large volume of data stored in varying forms, each of which must be considered separately yet given equal weight. To obtain the most accurate result, elements of various data mining methods must be combined to gather the most relevant data and find overlaps between the different data forms, producing a single complete dataset that can then undergo sequential pattern analysis to find patterns of similarity, contrast and recurrence.

To use sequential pattern mining effectively, the microarray DNA data must first be categorised using microarray data classification. An example of this is the prediction of cancer classes in a 1999 study of 38 bone marrow samples from patients with acute leukemia (Landes Bioscience, 2013). The authors conducted a supervised analysis for class prediction and assigned the 38 samples to two classes, acute myeloid leukemia and acute lymphoblastic leukemia. The two are distinct forms of leukemia that behave differently, which affects both disease progression and the treatment required. The same method can be applied to the microarray DNA data held by the Breast Cancer Research Centre. By applying class prediction, the microarray data can be grouped into classes of breast cancer and each class assessed for its rate of progression, effectively narrowing the microarray DNA data to the rapidly progressing breast cancer samples.
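Supervised class prediction of this kind can be illustrated with a very small nearest-centroid classifier. This is a stand-in chosen for brevity: the source does not say which classifier the cited study used, and the two-gene expression vectors below are invented values, not real microarray measurements.

```python
from math import dist  # Euclidean distance, available from Python 3.8


def fit_centroids(samples, labels):
    """Average the expression vectors of each class to get one centroid per class."""
    sums, counts = {}, {}
    for vec, label in zip(samples, labels):
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, value in enumerate(vec):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict(centroids, vec):
    """Assign a new sample to the class whose centroid is closest."""
    return min(centroids, key=lambda label: dist(centroids[label], vec))


# Hypothetical training data: expression levels of two genes per sample.
train = [[5.0, 1.0], [4.0, 2.0], [1.0, 5.0], [2.0, 4.0]]
labels = ["AML", "AML", "ALL", "ALL"]

centroids = fit_centroids(train, labels)
predicted = predict(centroids, [4.5, 1.2])
print(centroids, predicted)
```

For the breast cancer data, the class labels would instead be the cancer classes found in the centre's microarray dataset, and each class could then be assessed for its rate of progression.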


3.1.3 Data Mining Goals

The main goal of this project is to assess patterns in various datasets and identify the gene sequence, or set of gene sequences, responsible for the rapid progression of breast cancer in certain patients. These patterns will be found using a variety of data mining methods, chosen to produce the most accurate and clearly legible compilation of the data from which the gene in question can be identified.
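One simple way to express this goal in code is to compare how often each candidate variant appears in rapidly progressing patients versus slowly progressing ones. The sketch below is an assumption about how such a comparison might look, not a method specified in this proposal; the variant names `v1` to `v3` and the patient sets are invented, and both groups are assumed to be non-empty.

```python
from collections import Counter


def enrichment(rapid, slow):
    """Carrier fraction in the rapid group minus carrier fraction in the slow group."""
    rapid_counts = Counter(v for patient in rapid for v in set(patient))
    slow_counts = Counter(v for patient in slow for v in set(patient))
    variants = set(rapid_counts) | set(slow_counts)
    return {
        v: rapid_counts[v] / len(rapid) - slow_counts[v] / len(slow)
        for v in variants
    }


# Hypothetical data: the set of variants observed in each patient.
rapid = [{"v1", "v2"}, {"v1"}, {"v1", "v3"}, {"v1", "v2"}]
slow = [{"v2"}, {"v3"}, {"v2", "v3"}, {"v1"}]

scores = enrichment(rapid, slow)
top = max(scores, key=scores.get)
print(top, scores[top])
```

A variant with a score near 1 is carried by almost all rapid progressors and almost none of the slow ones, which would make it a candidate for closer study rather than a confirmed cause.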

3.1.4 Plan

The data required has been collected by the Breast Cancer Research Centre and its global partners and is currently held in their systems as datasets of various forms. Table 3.1 below summarises the project plan as a series of tasks, objectives and factors to consider, leading to an analysis of the data that will identify the gene sequence responsible for the rapid progression of breast cancer in some patients.

Table 3.1 – Project Plan Overview

Task: Sorting the Microarray DNA dataset
Objective
▪ Cluster Microarray Data
▪ Identify cancer classes

Task: Sequential Pattern Mining to detect recurrent patterns of progression
Objective
▪ Rank in order of progression using the sequential pattern method
▪ Assess and conclude class most susceptible to rapid progression
Factors to Consider
➢ There may be more than one class to present a considerable degree of rapid progression

Task: Sequential Pattern Mining of Clinical Patient Data
Objective
▪ Cluster Patient Data by cancer stage
▪ Identify late-stage patient datasets with a later diagnosis date
▪ Determine which clusters of patients have had the fastest progression
Factors to Consider
➢ When identifying patients with a later diagnosis date, consider stage of cancer at diagnosis

Task: Overlay Data
Objective
▪ Compare datasets obtained from Microarray DNA and Clinical data
▪ Combine overlapping or repeated data to prevent duplication of data
Factors to Consider
➢ Some clinical records may not have corresponding microarray DNA data, expect missing datapoints

Task: Analyse Data Patterns
Objective
▪ Identify recurring sequences
▪ Cluster new data by rate of progression
▪ Identify datapoints with the highest rate of progression
▪ Identify common genetic patterns of the data
▪ Utilise algorithm systems to visualise datapoints
▪ Isolate most recurring gene sequences and patterns
▪ Analyse and identify genetic sequence
Factors to Consider
➢ Some data may need to be omitted
➢ Analysis of metabolically heterogeneous tissue may be helpful
➢ The gene sequence identified may be a variation of the BRCA1 or BRCA2 gene
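The clinical-data step in Table 3.1, including the note about taking the stage at diagnosis into account, could be approximated by ranking patients on stages advanced per year. This is a hedged sketch: the record layout, the numeric stage encoding and the dates are all hypothetical, and a real analysis would need a clinically validated measure of progression.

```python
from datetime import date


def progression_rate(stage_at_diagnosis, current_stage, diagnosed, observed):
    """Stages advanced per year between diagnosis and the latest observation."""
    years = (observed - diagnosed).days / 365.25
    if years <= 0:
        raise ValueError("observation must be after diagnosis")
    return (current_stage - stage_at_diagnosis) / years


# Hypothetical records: (stage at diagnosis, current stage, diagnosis date, last observation).
records = {
    "P1": (1, 4, date(2016, 1, 1), date(2017, 1, 1)),
    "P2": (2, 3, date(2014, 1, 1), date(2017, 1, 1)),
    "P3": (3, 4, date(2016, 7, 1), date(2017, 1, 1)),
}

rates = {pid: progression_rate(*rec) for pid, rec in records.items()}
ranked = sorted(rates, key=rates.get, reverse=True)  # fastest progression first
print(ranked)
```

Using the change in stage rather than the current stage alone avoids treating a patient who was simply diagnosed late as a rapid progressor.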


The plan outlined in Table 3.1 is the project plan proposed in this document. It draws on a number of data mining techniques, each of which helps narrow the dataset towards one genetic sequence that contributes to the rapid progression of breast cancer in some patients. The plan uses sequential pattern mining together with microarray data analysis so that the method applied is suited to the dataset being analysed. Overlaying the data before analysing recurring points ensures that those points come from different patients and are not the result of duplicate data, allowing a more accurate identification of the gene sequence pattern sought in this project.
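The overlay step can be sketched as a join on a shared patient identifier that drops duplicate rows and reports clinical records with no matching microarray entry. The field names (`patient_id`, `stage`, `profile`) and the sample rows are assumptions made for illustration only.

```python
def overlay(clinical_rows, microarray_rows):
    """Join clinical and microarray records on patient ID, dropping duplicates."""
    clinical = {}
    for row in clinical_rows:
        clinical.setdefault(row["patient_id"], row)  # keep the first copy only
    microarray = {}
    for row in microarray_rows:
        microarray.setdefault(row["patient_id"], row)

    merged = [
        {**clinical[pid], **microarray[pid]}
        for pid in clinical
        if pid in microarray
    ]
    missing = [pid for pid in clinical if pid not in microarray]
    return merged, missing


# Hypothetical rows; P1 is duplicated and P3 has no microarray data.
clinical_rows = [
    {"patient_id": "P1", "stage": 4},
    {"patient_id": "P1", "stage": 4},
    {"patient_id": "P2", "stage": 3},
    {"patient_id": "P3", "stage": 4},
]
microarray_rows = [
    {"patient_id": "P1", "profile": "sample-001"},
    {"patient_id": "P2", "profile": "sample-002"},
]

merged, missing = overlay(clinical_rows, microarray_rows)
print(merged, missing)
```

Because each patient appears at most once in the merged result, any pattern that recurs in it reflects several patients rather than repeated copies of one record.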


Appendix 1: References

1. Shulzhenko, N., Perez-Diez, A., Morgun, A., 2000-2013 ‘Microarrays for Cancer Diagnosis and Classification’, Landes Bioscience
2. Chen, S., Parmigiani, G., 2008 ‘Meta-Analysis of BRCA1 and BRCA2 Penetrance’, Journal of Clinical Oncology, Vol.11, pp1329-1333
3. National Human Genome Research Institute, 2017 ‘BRCA1 Circos’
4. Lisboa, J.G.P., Vellido, A., Tagliaferri, R., Napolitano, F., Ceccarelli, M., Martin-Guerrero, J.D., Biganzoli, E., 2010 ‘Data Mining in Cancer Research [Application Notes]’, IEEE Computational Intelligence Magazine, Vol.5, Issue.1, pp14-18
5. Fournier-Viger, P., 2017 ‘An Introduction to Sequential Pattern Mining’, The Data Mining Blog, <http://data-mining.philippe-fournier-viger.com/introduction-sequential-pattern-mining/>
6. National Cancer Institute, 2015 ‘BRCA1 and BRCA2: Cancer Risk and Genetic Testin...