Analytics Exam notes PDF

Title	Analytics Exam notes
Course	Business Analytics
Institution	Auckland University of Technology
Pages	25
File Size	857.2 KB
File Type	PDF
Total Downloads	202
Total Views	982

Summary

Description

Analytics Class Notes

Table of Contents
Analytics Class Notes
Glossary
Abbreviations
Calculations
Graph explanations
Concepts
Week 1
Week 2
Week 3
Week 4
Week 5
Week 6
Week 7
Week 8
Week 9
Week 10
Week 12

Glossary

Words: Meaning
(E) Extract: Extracting the data from source systems (silos).
(T) Transform: Converting (transforming) the data into standardized formats.
(L) Load: Loading & integrating data into a system (such as a data warehouse).
Customer Lifetime Value: The present value of all future streams of profits that an individual customer generates over the life of his or her business with the firm.
Cluster analysis: Does not perform well with a large number of variables, as it becomes increasingly difficult to detect differences among groups as the number of variables increases.
Sequence Analysis: Related to market basket analysis; looks at which items go together from one time to another. This can create opportunities for best-next-offer campaigns.
Association Analysis (market basket analysis): The detection of association rules is a descriptive method, popular in data mining. Alternative name: ‘market basket analysis’.
Decision trees:
  o Good for classifying data into segments based on response/activity
  o Easy to explain & target groups
  o Better than linear regression, as it can identify nonlinear relationships
  o Classify observations based on the values of nominal, binary, or ordinal targets
  o Predict outcomes for interval targets
  o Easy to interpret
  o Interactive Trees
Linear & Logistic regressions:
  o Good for understanding relationships between variables and outcomes
  o Easy to explain
  o Relative power
  o Ease of use
  o Robustness with a variety of data and levels of measurement
  o Ease of interpretability
Binary Target:
  o Linear regression does not work, because whatever the form of the equation, the results are generally unbounded.

Binary Target (continued):
  o Instead, you work with the probability p that the event will occur rather than a direct classification.
Neural networks:
  o Picks up nonlinear relationships in the data
  o Tends to overfit the data
  o Difficult to explain/interpret
  o Can be used as a predictive technique or as a clustering technique
  o Most important feature of neural networks is that they can ‘learn’
  o The process is continuously iterative
Misclassification Rate (MR): Derived from “percentage correct”, i.e. how accurate the forecast was; MR itself is the share of cases classified incorrectly (1 minus percentage correct). Used for binary outcomes.
ROC Curve: Used for comparing predictions from different models; the model with the largest area under the curve is the best model.
Lift: The higher the lift, the stronger the association rule (and the more useful the model). A lift of 1 indicates the items are independent; values > 1.0 are usually good.
N-grams:
  o Sequences of adjacent words as terms; bi-grams = adjacent pairs of words
  o Useful when particular phrases are significant, but their component words may not be. The disadvantage of n-grams is that they greatly increase the size of the feature set: there are many more word pairs than individual words, and still more word triples.
Network: A set of nodes (or vertices or entities) connected by ties (or edges or links).
Social Network: People are the nodes, and social relationships are the ties.
Data expert: Understands structure, size & format of data.
Domain expert: Knows specifics of business problem, relevant knowledge, context, terminology, and strengths & weaknesses of current approaches.
Analytics expert: Understands capabilities & limitations of various methods and their relevance to different problems.
Data Partitioning: Partition available data into training and validation sets. The model is fit on the training data set, and model performance is evaluated on the validation data set.
Waterfall development: Sequential process, only one direction (software projects approach).
Agile development: Iterative process; breaks tasks into smaller chunks and repeats the sequence multiple times (software projects approach).
Gradient Boosting node:
  • Sequential combination of many trees
  • Extremely good predictions
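To make the N-grams entry above concrete, here is a minimal sketch in Python. The three-document corpus is invented for illustration and is not from the notes; it only shows how the number of distinct terms grows when moving from single words to pairs and triples.

```python
def ngrams(tokens, n):
    """Return the n-grams of a token list as tuples of n adjacent tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical toy corpus (an assumption, for illustration only).
docs = [
    "the cat sat on the mat",
    "the mat sat on the cat",
    "on the mat the cat sat",
]

def distinct_terms(n):
    """Collect the distinct n-grams found across all documents."""
    terms = set()
    for doc in docs:
        terms.update(ngrams(doc.split(), n))
    return terms

for n in (1, 2, 3):
    print(n, len(distinct_terms(n)))
```

Even with only five distinct words, the feature set grows with n, which is the disadvantage the entry describes.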

Gradient Boosting node (continued):
  • Very effective at variable selection
Ensemble node: Creates new models by combining the posterior probabilities (for class targets) or the predicted values (for interval targets) from multiple predecessor models.
  • Three methods: Average, Maximum, Voting
Two Stage node:
  • Enables you to model a class target and an interval target. The interval target variable is usually the value that is associated with a level of the class target.
  • Choose between Sequential or Concurrent modeling.
  • Choose type of model at each stage: Tree, Regression, or Neural.
  • The score code of the Two Stage node is a composite of the class and value models.
Model comparison node:
  • Provides a common framework for comparing models
  • Misclassification Rate = the disagreement between what was predicted and what actually happened
ROC chart: The validation data is sorted from high to low. Each point on the ROC chart corresponds to a specific fraction of the sorted data. The model with the largest area under the curve is the best model.
Score Ranking:
  • The validation data is sorted from highest to lowest (either by prediction rankings or estimates).
  • Each point on the ROC chart corresponds to a specific fraction of cases.
Score Node:
  • Enables you to score, that is, predict or classify, an observation from a data set whose role is score.
  • Each node in the process flow generates scoring code, and this code is compiled and passed to the score node. This enables the scoring data set to be scored outside of Enterprise Miner.
Enterprise Data Warehouse:
  • Prerequisite for business analytics: it helps the organization obtain value from its data sources by preparing and storing the enterprise data in a repository designed to support decision making
  • Contains a copy of transaction data structured for querying and reporting
Cluster node: Generate clusters and perform segmentation using automatic settings and with user-defined settings.
Segment profile node: Compare within-segment distributions of selected inputs to overall distributions. This helps you understand segment definition.
Association node: Conduct market basket and sequence analysis on transactions data. A data source must have one target, one ID, and (if desired) one sequence variable.
Data lakes: Integrate disparate data into one repository without having to define the metadata (i.e. how the different data sets are linked to each other); “schema on read, not on write”. Instead, the links are defined when the data is queried. The reservoir of datasets where you can run analytics on all data.
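The three combination methods listed for the Ensemble node (Average, Maximum, Voting) can be sketched roughly as follows. This is a simplified illustration in Python, not the Enterprise Miner implementation: the function names, the 0.5 cutoff, and the example posterior probabilities are all assumptions.

```python
def ensemble_average(posteriors):
    """Average the posterior probabilities from the predecessor models."""
    return sum(posteriors) / len(posteriors)

def ensemble_maximum(posteriors):
    """Take the largest posterior probability among the models."""
    return max(posteriors)

def ensemble_vote(posteriors, cutoff=0.5):
    """Simplified majority vote: each model votes 'event' if its posterior
    reaches the cutoff (the cutoff value is an assumption)."""
    votes_for = sum(1 for p in posteriors if p >= cutoff)
    return votes_for > len(posteriors) - votes_for

# Hypothetical posteriors for one observation from three models.
posteriors = [0.9, 0.4, 0.45]
print(ensemble_average(posteriors))
print(ensemble_maximum(posteriors))
print(ensemble_vote(posteriors))
```

The example is chosen so that the methods disagree: one confident model pulls the average and maximum above 0.5, while the majority vote goes the other way.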


Data lineage: The different processes, business rules, dependencies and other attributes that explain where data come from and how they were used to calculate results.
Risk:
  • Potential event that has a negative outcome
  • Probability of event x potential loss
Text mining: The process of discovering and extracting meaningful patterns and relationships from text collections.
Algorithms: Process or set of rules to be followed in calculations or other problem-solving operations, especially by a computer.
Natural language processing (NLP):
  • Branch of artificial intelligence that helps computers understand, interpret and manipulate human language
  • Aims to understand the meaning underlying the text
Token: Unit of analysis; can be a word or a phrase.
Tokenizing: Decomposing or grouping text into these units.
Corpus: List of documents; a large and structured set of texts.
Stylometry: The statistical analysis of variations in literary style between one writer or genre and another (determining authorship).
Anomaly: Something that deviates from what is standard, normal, or expected.
Forensic linguistics: Applies stylometry to crime investigation, and is related to anomaly detection for crime prevention.
Topic modeling: Surfaces themes or issues underlying a corpus and organizes the documents within the corpus according to those themes.
Latent semantic analysis: A technique in natural language processing, in particular distributional semantics, of analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms.
Boolean Search: The largest value of the query occurs for the document (or documents) that most closely matches the query.
Stop list: Stop words are natural language words which have very little meaning, such as "and", "the", "a", "an", and similar words (words that are not inherently interesting and clutter our analysis).
Sparseness: Sparseness of a term t is commonly measured by an equation called inverse document frequency (IDF).
Weighted Term-Document Matrix: Normalizing the term frequencies with respect to document length.
Parsing:
  • Used to build the corpus dictionary
  • Associates terms with parts of speech and controls which parts of speech to recognize
Stemming: Equates terms that are different verb tenses of the same verb, or terms that are singular and plural versions of the same noun.
Case conversion: Converting the entire text to all upper- or all lower-case ensures upper- and lower-case instances of the same word are treated the same.
Corpus: List of documents; a large and structured set of texts.
Bag-of-words: Representation of text without ordering information.
Word cloud: A visual representation of text data. The importance of each tag/word is shown with font size or colour.
Edge list: List of rows, with each row showing a link between the nodes in columns 1 and 2.
Directed edge: Clear origin and destination (e.g. Twitter user following another user, hyperlink from one page to another). May be reciprocated or not. Represented as a line with an arrow head.
Undirected edge: A mutual relationship with no origin/destination, e.g. marriage, Facebook friend. Does not exist unless it is reciprocated.
Unweighted edge: Exists or does not exist.
Weighted edge: Has a value attached to indicate the strength of the relationship, e.g. in a Twitter network, a follower edge could be weighted by the number of re-tweets.
Egonet: A network consisting of the focal node/person (“ego”) and the people he/she is connected to (“alters”).
Unimodal network: Contains only one type of node/vertex.
Multimodal network: Contains more than one type of node/vertex.
  • Bimodal networks (e.g. individuals and the Wikipedia articles they’ve written) can be used to create unimodal networks of individuals and articles
Network-level: Metrics describing the network as a whole.
  • Size, density, inclusiveness, centralisation
Node-level: Metrics describing individual nodes within the network.
  • In/outdegree, betweenness centrality, Eigenvector centrality
Visualization: Displays data values using charts, geographic maps, and word clouds.
Machine learning:
  • Also known as artificial intelligence (AI) or cognitive computing
  • Computers running statistics (e.g. decision trees, neural networks) automatically (i.e. without human involvement) to make predictions
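As a small illustration of the edge list, directed edge, egonet, and network-level/node-level entries above, the following Python sketch computes in-degree, out-degree, and density from a directed edge list. The node names and edges are invented for the example; density is taken as edges divided by n × (n − 1), the usual definition for a directed network without self-loops.

```python
# Hypothetical directed edge list: each row is (origin, destination),
# e.g. "ann follows bob".
edges = [("ann", "bob"), ("bob", "ann"), ("ann", "cat"), ("dan", "ann")]

nodes = sorted({node for edge in edges for node in edge})

# Node-level metrics: out-degree counts edges leaving a node,
# in-degree counts edges arriving at it.
out_degree = {node: 0 for node in nodes}
in_degree = {node: 0 for node in nodes}
for origin, destination in edges:
    out_degree[origin] += 1
    in_degree[destination] += 1

# Network-level metrics: size and density.
size = len(nodes)
density = len(edges) / (size * (size - 1))

# Egonet of "ann": everyone directly linked to the ego in either direction.
alters = ({d for o, d in edges if o == "ann"}
          | {o for o, d in edges if d == "ann"})

print(in_degree, out_degree, density, sorted(alters))
```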

Hadoop/MapReduce: Platform to store data in a distributed manner (HDFS) and analyse the scattered data (MapReduce). Not used for real-time data.
Apache Spark: Used for analyzing real-time data.

Abbreviations

Abbreviations: Long forms
CRISP-DM Conceptual model: CRoss-Industry Standard Process for Data Mining
ETL: Extract, Transform, Load
CLV: Customer Lifetime Value
CSV: Comma-separated values
MR: Misclassification Rate
ROC: Receiver operating characteristic
OLTP: Online transactional processing
EDW: Enterprise Data Warehouse
SSADM: Structured systems analysis and design method
SVM: Support Vector Machine
LSA: Latent semantic analysis
IDF: Inverse document frequency
AWS: Amazon Web Services

Calculations

Support: The proportion of the times this item pair occurs in the dataset.

Misclassification Rate (MR) = (FP + FN) / (TP + TN + FP + FN)

ROC curve
Sensitivity (true positive rate) = 1 – FN / (FN + TP)
Specificity (true negative rate) = 1 – FP / (FP + TN)

Probability odds = p / (1 – p)

Inverse document frequency (IDF): measures the sparseness of a term t.
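The sensitivity and specificity definitions above, together with the standard misclassification rate, (FP + FN) divided by all cases, can be checked with a few lines of Python. The confusion-matrix counts are invented for illustration, and the function name is hypothetical.

```python
def classification_rates(tp, fp, tn, fn):
    """Compute rates from confusion-matrix counts.

    tp/fp = true/false positives, tn/fn = true/false negatives.
    """
    total = tp + fp + tn + fn
    return {
        # Share of all cases that were predicted incorrectly.
        "misclassification_rate": (fp + fn) / total,
        # True positive rate: 1 - FN / (FN + TP).
        "sensitivity": 1 - fn / (fn + tp),
        # True negative rate: 1 - FP / (FP + TN).
        "specificity": 1 - fp / (fp + tn),
    }

# Hypothetical validation results for a binary target.
rates = classification_rates(tp=40, fp=10, tn=45, fn=5)
for name, value in rates.items():
    print(name, round(value, 3))
```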

Market Basket Analysis

Support (A => B) = (transactions containing every item in A and B) / (all transactions)

Confidence (A => B) = (transactions containing every item in A and B) / (transactions containing the items in A)
Probability that a rule is correct for a new transaction with the item on the left.

Expected Confidence (A => B) = (transactions containing every item in B) / (all transactions)

Lift (A => B) = Confidence (A => B) / Expected Confidence (A => B)
The ratio by which the confidence of a rule exceeds the expected confidence.
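A worked example may help with the four market basket measures above. The sketch below is in Python and uses five invented transactions; the function names are hypothetical and simply mirror the formulas.

```python
# Hypothetical transactions, each a set of purchased items.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk", "eggs"},
]

def count_containing(items):
    """Number of transactions that contain every item in `items`."""
    return sum(1 for t in transactions if items <= t)

def support(a, b):
    return count_containing(a | b) / len(transactions)

def confidence(a, b):
    return count_containing(a | b) / count_containing(a)

def expected_confidence(b):
    return count_containing(b) / len(transactions)

def lift(a, b):
    return confidence(a, b) / expected_confidence(b)

a, b = {"bread"}, {"milk"}
print(support(a, b), confidence(a, b), expected_confidence(b), lift(a, b))
```

In this toy data the lift of bread => milk comes out slightly below 1, so buying bread does not make milk more likely than its baseline rate.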

TFIDF = product of Term Frequency (TF) and Inverse Document Frequency (IDF) TFIDF of a term t in a given document d: TFIDF (t, d) = TF (t, d) × IDF(t)
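The TFIDF product can be sketched in Python as follows. Several details are assumptions rather than part of the notes: the three-document corpus and stop list are invented, TF is normalized by document length, and IDF uses one common variant, 1 + ln(total documents / documents containing the term); other variants exist.

```python
import math
from collections import Counter

# Hypothetical corpus and stop list, for illustration only.
corpus = {
    "d1": "The cat and the dog",
    "d2": "A dog and a bone",
    "d3": "The cat sat",
}
STOP_WORDS = {"and", "the", "a", "an"}

def tokenize(text):
    """Case-convert, split on whitespace, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

tokens = {doc_id: tokenize(text) for doc_id, text in corpus.items()}

def tf(term, doc_id):
    """Term frequency, normalized by document length."""
    words = tokens[doc_id]
    return Counter(words)[term] / len(words) if words else 0.0

def idf(term):
    """Assumed variant: 1 + ln(N / number of documents containing the term)."""
    containing = sum(1 for words in tokens.values() if term in words)
    if containing == 0:
        return 0.0  # term never occurs, so it carries no weight
    return 1.0 + math.log(len(tokens) / containing)

def tfidf(term, doc_id):
    """TFIDF(t, d) = TF(t, d) x IDF(t)."""
    return tf(term, doc_id) * idf(term)

for term in ("dog", "bone"):
    print(term, round(tfidf(term, "d2"), 3))
```

Within document d2, "bone" scores higher than "dog" because it appears in fewer documents, which is the sparseness effect IDF is meant to capture.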

Graph explanations

Default: The default selection uses different statistics based on the type of target variable and whether a profit/loss matrix has been defined. If a profit/loss matrix is defined for a categorical target, the average profit or average loss is used. If no profit/loss matrix is defined for a categorical target, the misclassification rate is used. If the target variable is interval, the average squared error is used.
Akaike's Information Criterion: chooses the model with the smallest Akaike's Information Criterion value.
Average Squared Error: chooses the model with the smallest average squared error value.
Mean Squared Error: chooses the model with the smallest mean squared error value.
ROC: chooses the model with the greatest area under the ROC curve.
Captured Response: chooses the model with the greatest captured response values using the decile range that is specified in the Selection Depth property.
Gain: chooses the model with the greatest gain using the decile range that is specified in the Selection Depth property.
Gini Coefficient: chooses the model with the highest Gini coefficient value.
Kolmogorov-Smirnov Statistic: chooses the model with the highest Kolmogorov-Smirnov statistic value.
Lift: chooses the model with the greatest lift using the decile range that is specified in the Selection Depth property.
Misclassification Rate: chooses the model with the lowest misclassification rate.
Average Profit/Loss: chooses the model with the greatest average profit/loss.
Percent Response: chooses the model with the greatest % response.
Cumulative Captured Response: chooses the model with the greatest cumulative % captured response.
Cumulative Lift: chooses the model with the greatest cumulative lift.
Cumulative Percent Response: chooses the model with the greatest cumulative % response.

Concepts

1. What makes up the CRISP-DM Conceptual model?
(1) Business understanding -> (2) Data understanding -> (3) Data Preparation -> (4) Modelling -> (5) Evaluation -> (6) Deployment

2. What’s affected when analytics is introduced in an organisation?
• Technology infrastructure
  o Hardware, networking, storage, personal computing devices
• Data management
  o Transactional database systems, reporting systems, changing data definitions
• Business infrastructure
  o Organizational chart, corporate policies, business processes

3. What data management activities does business analytics require?
Data access, movement, transformation, aggregation, augmentation
a. What type of data do these tasks require?
Simple flat files, files with comma-separated values, Microsoft Excel files, SAS tables, Oracle tables
b. What are some examples of what the data combines?
Simple flat files, files with comma-separated values, Microsoft Excel files, SAS tables, and Oracle tables

4. What are the signs that your analytics environment is at risk?
o The spreadsheets that just won’t die
o Everybody asks for help
o Nobody asks for help
o Grumbles about usability during lunch/breaktime
o “Good old days” syndrome
o Usage numbers decline over time
o BI tools aren’t used in strategy/planning discussions
o Executive sponsors lose enthusiasm
o Executive sponsors lose their jobs
o Resistance to upgrades and expansion

5. What is the ethics of analytics?
o Myth: more data = better understanding

o Definition of ...