COMP2521 Notes
Brandon Nguyen, University of New South Wales


Contents

Sorting
○ Selection Sort - O(n^2)
○ Bubble Sort - O(n^2)
○ Insertion Sort - O(n^2)
○ Quick Sort - O(n^2) (worst case)
○ Merge Sort - O(n log n)
Graphs
○ DFS Path Checking - O(V+E)
○ BFS Path Checking - O(V+E)
○ Buggy Cycle Check
○ Connected Components
○ Hamiltonian Path Checking - O((V-1)!)
○ Euler Path Checking - O(V)
○ Minimum Spanning Trees - O(n^(n-2))
○ Kruskal's Algorithm - O(E log E)
○ Prim's Algorithm - O(V log E) or O(VE)
○ Dijkstra's Algorithm - O(V^2)
Trees
○ BST Search
○ Splay Tree - O(n)
○ AVL Tree - O(log n)
○ 2-3-4 Tree - O(log n)
○ Red-Black Tree - O(log n)

Analysis of Algorithms
 Primitive operations: basic computations performed by an algorithm, e.g. evaluating an expression, indexing into an array, calling/returning from a function

Counting Primitive Operations
 Count them by inspecting the pseudocode

Estimating Running Times
Changing the hardware/software environment changes T(n) only by a constant factor.

Big-Oh
Given functions f(n) and g(n), we say that f(n) is O(g(n)) if there are positive constants c and n0 such that f(n) ≤ c·g(n) for all n ≥ n0.

Big-Oh Rules
 If f(n) is a polynomial of degree d ⇒ f(n) is O(n^d)
  o lower-order terms are ignored
  o constant factors are ignored
 Use the smallest possible class of functions
  o say "2n is O(n)" instead of "2n is O(n^2)"
 Use the simplest expression of the class
  o say "3n + 5 is O(n)" instead of "3n + 5 is O(3n)"


Asymptotic Analysis
Asymptotic analysis of algorithms determines running time in big-Oh notation:
 find the worst-case number of primitive operations as a function of input size
 express this function using big-Oh notation
Example: algorithm arrayMax executes at most 5n - 2 primitive operations ⇒ algorithm arrayMax "runs in O(n) time"

Abstract Data Types

Data Types:
 o a set of values
 o a collection of operations on those values
ADT:
 o an implementation of a data type
 o separates the interface from the implementation
 o users of the ADT see only the interface, i.e. function prototypes, not how those functions actually work
 o builders of the ADT implement it
ADTs are usually a collection of items, e.g. structure: list, usage: queue.

Sorting

Sorts on Linux
The sort command
 o sorts a file of text, understands fields in lines
 o can sort alphabetically, numerically, in reverse, randomly
The qsort() function
 o qsort(void *base, size_t nmemb, size_t size, int (*cmp)(const void *, const void *))
 o sorts any kind of array (nmemb objects, each of size bytes)
 o requires the user to supply a comparison function (e.g. strcmp())
 o sorts the items using the order given by cmp()
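For example, a minimal sketch (not from the notes) of sorting an int array with qsort() and a user-supplied comparator:

#include <stdio.h>
#include <stdlib.h>

// Comparator for qsort: negative/zero/positive, like strcmp
static int cmpInt(const void *a, const void *b) {
    int x = *(const int *)a;
    int y = *(const int *)b;
    return (x > y) - (x < y);
}

int main(void) {
    int a[] = {5, 1, 4, 2, 3};
    int n = sizeof a / sizeof a[0];
    qsort(a, n, sizeof a[0], cmpInt);
    for (int i = 0; i < n; i++) printf("%d ", a[i]);  // prints 1 2 3 4 5
    printf("\n");
    return 0;
}

The comparator must impose a total order; returning the difference x - y is a common but overflow-prone shortcut, so the comparison form above is safer.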

Selection Sort
 o Go through the array, find the smallest element, put it at the 1st index
 o Then the second smallest -> 2nd index… and so on (sketch below)
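A minimal selection sort sketch in C (the function name is illustrative):

// Selection sort: repeatedly select the minimum of the unsorted
// suffix and swap it into place; O(n^2) comparisons
void selectionSort(int a[], int n) {
    for (int i = 0; i < n - 1; i++) {
        int min = i;
        for (int j = i + 1; j < n; j++)
            if (a[j] < a[min]) min = j;
        int tmp = a[i]; a[i] = a[min]; a[min] = tmp;
    }
}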

Bubble Sort
 o Go through the array indices from 1..N
 o For each pair of adjacent array values, swap if out of order
 o Keep track of whether a swap has occurred; if not, the array is sorted (see the sketch below)
 o Complexity is O(n^2)
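A sketch in C of bubble sort with the early-exit flag described above:

#include <stdbool.h>

// Bubble sort with early exit: stop as soon as a full pass
// performs no swaps, meaning the array is already sorted
void bubbleSort(int a[], int n) {
    for (int i = 0; i < n - 1; i++) {
        bool swapped = false;
        for (int j = 0; j < n - 1 - i; j++) {
            if (a[j] > a[j + 1]) {
                int tmp = a[j]; a[j] = a[j + 1]; a[j + 1] = tmp;
                swapped = true;
            }
        }
        if (!swapped) break;  // no swaps => sorted, stop early
    }
}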

Insertion Sort
 o Treat the first element as a sorted prefix of length 1
 o Take the next element and insert it into its place within the sorted prefix
 o The sorted prefix grows by 1 each time (sketch below)
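A sketch in C, assuming an int array as in the other examples:

// Insertion sort: grow a sorted prefix by inserting each new
// element into position, shifting larger elements right; O(n^2)
void insertionSort(int a[], int n) {
    for (int i = 1; i < n; i++) {
        int key = a[i];
        int j = i - 1;
        while (j >= 0 && a[j] > key) {
            a[j + 1] = a[j];  // shift right to make room
            j--;
        }
        a[j + 1] = key;
    }
}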

Shell Sort
 o Improvement on insertion sort: sorts elements that are far apart first, then reduces the gap
 o For gap 3, the subsequence of every 3rd index is sorted (based at 0, then 1, then 2)
 o Complexity is usually O(n^(3/2)) or O(n^(4/3)), depending on the gap sequence (sketch below)
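A sketch in C, assuming a simple halving gap sequence (the notes do not fix one; better sequences give the bounds quoted above):

// Shell sort: h-sort the array for a decreasing sequence of gaps,
// finishing with h = 1 (plain insertion sort on a nearly sorted array)
void shellSort(int a[], int n) {
    for (int h = n / 2; h > 0; h /= 2) {   // simple halving gap sequence
        for (int i = h; i < n; i++) {
            int key = a[i];
            int j = i;
            while (j >= h && a[j - h] > key) {
                a[j] = a[j - h];           // shift within the h-subsequence
                j -= h;
            }
            a[j] = key;
        }
    }
}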

Quicksort
 o Choose an item to be the pivot
 o Re-arrange (partition) the array so that
   - all items to the left are smaller than the pivot
   - all items to the right are greater than the pivot
 o Recursively sort each of the partitions (sketch below)
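A quicksort sketch in C, using the last element as the pivot for simplicity (the notes recommend choosing a better pivot, e.g. median-of-three):

// Partition around the last element as pivot; returns the pivot's
// final index. All items left of it are smaller, all right are greater.
static int partition(int a[], int lo, int hi) {
    int pivot = a[hi], i = lo;
    for (int j = lo; j < hi; j++) {
        if (a[j] < pivot) {
            int tmp = a[i]; a[i] = a[j]; a[j] = tmp;
            i++;
        }
    }
    int tmp = a[i]; a[i] = a[hi]; a[hi] = tmp;  // move pivot into place
    return i;
}

void quickSort(int a[], int lo, int hi) {
    if (lo >= hi) return;                // 0 or 1 items: already sorted
    int p = partition(a, lo, hi);
    quickSort(a, lo, p - 1);
    quickSort(a, p + 1, hi);
}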

Performance


Best case: O(n log n) comparisons
 o choice of pivot gives two equal-sized partitions
 o the same happens at every recursive level
 o each "level" requires approx. n comparisons
 o halving at each level ⇒ log2 n levels
Worst case: O(n^2) comparisons
 o always choosing the lowest/highest value as pivot (choose wisely!)
 o partitions are of size 1 and n-1
 o each "level" requires approx. n comparisons
 o partitioning into 1 and n-1 ⇒ n levels
To improve this, find an intermediate pivot value by median-of-three. Quicksort is also inefficient on small inputs due to the recursive calls: below about 5 items, switch to insertion sort.

Mergesort
Mergesort: basic idea
 split the array into two equal-sized partitions
 (recursively) sort each of the partitions
 merge the two partitions into a new sorted array
 copy back to the original array

Merging: basic idea
 copy elements from the two inputs one at a time
 give preference to the smaller of the two
 when one input is exhausted, copy the rest of the other
(A sketch combining both ideas follows below.)
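A mergesort sketch in C (helper layout is illustrative):

#include <stdlib.h>
#include <string.h>

// Mergesort: recursively sort each half, then merge into a temporary
// array and copy back; O(N log N) time, O(N) extra space
void mergeSort(int a[], int lo, int hi) {
    if (lo >= hi) return;
    int mid = (lo + hi) / 2;
    mergeSort(a, lo, mid);
    mergeSort(a, mid + 1, hi);

    int n = hi - lo + 1;
    int *tmp = malloc(n * sizeof(int));
    int i = lo, j = mid + 1, k = 0;
    while (i <= mid && j <= hi)            // prefer the smaller element
        tmp[k++] = (a[i] <= a[j]) ? a[i++] : a[j++];
    while (i <= mid) tmp[k++] = a[i++];    // one input exhausted:
    while (j <= hi)  tmp[k++] = a[j++];    // copy the rest of the other
    memcpy(a + lo, tmp, n * sizeof(int));  // copy back to original array
    free(tmp);
}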

Performance
Best case: O(N log N) comparisons
 split array into equal-sized partitions
 the same happens at every recursive level
 each "level" requires ≤ N comparisons
 halving at each level ⇒ log2 N levels

Worst case: O(N log N) comparisons
 partitions are exactly interleaved
 need to compare all the way to the end of the partitions

Disadvantage compared to quicksort: needs O(N) extra storage.

Heap Sort



Graph
 Vertex = node
 Edge = connection

Graph Traversal
 o Is there a path from A -> C? (src -> dest)
 o What order do you have to follow to get from A -> C?

BFS goes one layer at a time (adjacent vertices first). Uses a queue. Finds the shortest path. Stops with the first-found path. (See the sketch below.)
DFS goes all the way down one path, backtracks, then goes all the way down another path. Uses a stack. Uses a visited[] global array.
A spanning tree of a graph
 includes all vertices
 uses a subset of edges, without cycles
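A BFS path-check sketch in C, assuming an adjacency-matrix representation and a fixed bound MAX_V (both illustrative, not from the notes):

#include <stdbool.h>

#define MAX_V 100   // hypothetical fixed bound for illustration

// BFS path check: visit vertices one layer at a time using a queue;
// O(V^2) with a matrix, O(V+E) with adjacency lists
bool hasPathBFS(bool adj[MAX_V][MAX_V], int nV, int src, int dest) {
    bool visited[MAX_V] = {false};
    int queue[MAX_V], head = 0, tail = 0;

    queue[tail++] = src;
    visited[src] = true;
    while (head < tail) {
        int v = queue[head++];           // dequeue
        if (v == dest) return true;      // stop with first-found path
        for (int w = 0; w < nV; w++) {
            if (adj[v][w] && !visited[w]) {
                visited[w] = true;
                queue[tail++] = w;       // enqueue unvisited neighbour
            }
        }
    }
    return false;
}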

Connected Components

o Components are connected subgraphs
o Are two vertices in the same connected subgraph?

Initialise all vertices’ components to -1. Start with component ID 0. Then look through the vertices and their components: if a vertex’s component is not set to anything, do a dfsComponents on it with the component ID you’re currently at, then move to the next ID. dfsComponents will mark every vertex reachable from that vertex with the current component ID (sketch below).
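A sketch in C of the component-labelling scheme just described (the adjacency matrix and MAX_V bound are illustrative assumptions):

#include <stdbool.h>

#define MAX_V 100   // hypothetical fixed bound for illustration

// Label every vertex reachable from v with the current component ID
void dfsComponents(bool adj[MAX_V][MAX_V], int nV, int v,
                   int componentOf[], int compID) {
    componentOf[v] = compID;
    for (int w = 0; w < nV; w++)
        if (adj[v][w] && componentOf[w] == -1)
            dfsComponents(adj, nV, w, componentOf, compID);
}

void components(bool adj[MAX_V][MAX_V], int nV, int componentOf[]) {
    for (int v = 0; v < nV; v++) componentOf[v] = -1;  // not yet assigned
    int compID = 0;
    for (int v = 0; v < nV; v++)
        if (componentOf[v] == -1)                      // new component found
            dfsComponents(adj, nV, v, componentOf, compID++);
}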

Hamiltonian Path and Circuit
 A simple path that includes each vertex exactly once, checked using DFS.
 Generate all possible simple paths, keep a counter of vertices visited in the current path, and stop when a path is found.

Euler Paths and Circuits
 A path that includes each edge exactly once.
 Can check by brute force over each possible path, resulting in factorial time complexity.
 Theorem: a graph has an Euler circuit if and only if it is connected and all vertices have even degree.
 Theorem: a graph has a non-circuitous Euler path if and only if it is connected and exactly two vertices have odd degree.

Directed Graphs
Directed Graphs (Digraphs)
 Directed path: a sequence of n ≥ 2 vertices v1 → v2 → … → vn, where (vi, vi+1) ∈ edges(G) for all vi, vi+1 in the sequence; if v1 = vn, we have a directed cycle.
 Degree of vertex: deg(v) = number of edges of the form (v, _) ∈ edges(G).
 Indegree of vertex: deg⁻¹(v) = number of edges of the form (_, v) ∈ edges(G).
 Reachability: w is reachable from v if ∃ a directed path v, …, w.
 Strong connectivity: every vertex is reachable from every other vertex.
 Directed acyclic graph (DAG): a graph containing no directed cycles.

Digraph Representation
 Can be represented as an array of edges (directed), a vertex-indexed adjacency matrix (non-symmetric), or vertex-indexed adjacency lists.

Weighted Graphs
 o Edges have values (weights)

Minimum Spanning Trees
Minimum = the smallest possible sum of all edge weights
 Weighted graph
 Set of edges
  o touches all vertices
  o minimal weight
 Not all edges may be used
 No cycles/loops

Prim’s Algorithm
 Start with a starting vertex.
 Look at all the surrounding vertices and their edges.
 Pick the path with the lowest-weighted edge. Connect that edge to the MST; its vertex is now connected.
 Look through all connected vertices and their respective edges. Find the lowest-weighted edge to an unconnected vertex and follow it, adding that vertex/edge to the set. REPEAT until all vertices are connected.

Kruskal’s Algorithm
1. Start with an empty MST
2. Consider edges in increasing weight order
 a. Add an edge if it does not form a cycle in the MST
3. Repeat until V-1 edges have been added
(A sketch follows below.)
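A Kruskal sketch in C, assuming the edges are already sorted by weight (e.g. with qsort) and using a simple union-find for the cycle check (the Edge type and MAX_V bound are illustrative):

#define MAX_V 100   // hypothetical fixed bound for illustration

typedef struct { int v, w, weight; } Edge;

// Union-find without path compression: follow parent links to the root
static int find(int parent[], int x) {
    while (parent[x] != x) x = parent[x];
    return x;
}

// sortedEdges[] must be in increasing weight order.
// Returns the number of edges placed in mst[].
int kruskal(Edge sortedEdges[], int nE, int nV, Edge mst[]) {
    int parent[MAX_V];
    for (int i = 0; i < nV; i++) parent[i] = i;   // each vertex alone

    int added = 0;
    for (int i = 0; i < nE && added < nV - 1; i++) {
        int rv = find(parent, sortedEdges[i].v);
        int rw = find(parent, sortedEdges[i].w);
        if (rv != rw) {              // different trees => no cycle formed
            parent[rv] = rw;         // union the two trees
            mst[added++] = sortedEdges[i];
        }
    }
    return added;
}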

Shortest Path
Dijkstra’s Algorithm
https://www.youtube.com/watch?v=pVfj6mxhdMw

 Let the distance of the start vertex from the start vertex = 0
 Initialise all other vertices’ distances to -1 (not yet found)
Repeat:
1. Visit the unvisited vertex with the smallest known distance from the start vertex
2. For the current vertex, examine its unvisited neighbours
3. For the current vertex, calculate the distance of each neighbour from the start vertex (current vertex’s distance + edge weight)
4. If the calculated distance of a vertex is less than its known distance, update the shortest distance
5. Update the previous vertex for each of the updated distances
6. Add the current vertex to the list of visited vertices
(Until all vertices are visited; see the sketch below.)
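A Dijkstra sketch in C matching the steps above, assuming an adjacency matrix where weight 0 means "no edge" (an illustrative convention, not from the notes):

#include <stdbool.h>

#define MAX_V 100   // hypothetical fixed bound for illustration
#define NO_EDGE 0   // weight 0 taken to mean "no edge" in this sketch

// O(V^2) Dijkstra: dist[] holds the shortest known distance from src
// (-1 = not yet found), pred[] the previous vertex on that path
void dijkstra(int adj[MAX_V][MAX_V], int nV, int src,
              int dist[], int pred[]) {
    bool visited[MAX_V] = {false};
    for (int v = 0; v < nV; v++) { dist[v] = -1; pred[v] = -1; }
    dist[src] = 0;

    for (int i = 0; i < nV; i++) {
        // step 1: visit the unvisited vertex with smallest known distance
        int u = -1;
        for (int v = 0; v < nV; v++)
            if (!visited[v] && dist[v] != -1 && (u == -1 || dist[v] < dist[u]))
                u = v;
        if (u == -1) break;              // remaining vertices unreachable
        visited[u] = true;

        // steps 2-5: relax the edges out of u
        for (int w = 0; w < nV; w++) {
            if (adj[u][w] != NO_EDGE && !visited[w]) {
                int d = dist[u] + adj[u][w];
                if (dist[w] == -1 || d < dist[w]) {
                    dist[w] = d;         // shorter path found
                    pred[w] = u;         // remember the previous vertex
                }
            }
        }
    }
}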

Week 07: Trees, BSTs, Balanced BSTs
Trees
Trees are connected graphs with no cycles
 Contain edges and nodes
 Each node has a data value
 Links to child nodes

preorder (NLR) … visit root, then left subtree, then right subtree
 o useful for building a tree: {4, 2, 1, 3, 8, 6, 9}
inorder (LNR) … visit left subtree, then root, then right subtree
 o natural (sorted) order: {1, 2, 3, 4, 6, 8, 9}
postorder (LRN) … visit left subtree, then right subtree, then root
 o useful for evaluation: {1, 3, 2, 6, 9, 8, 4}
level-order … visit root, then all its children, then all their children
 o used for printing the tree: {4, 2, 8, 1, 3, 6, 9}
(Traversal sketches below.)
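Traversal sketches in C, assuming a minimal BST node type (the struct below is illustrative, not from the notes):

#include <stdio.h>

typedef struct node {
    int value;
    struct node *left, *right;
} Node;

// Preorder (NLR): root, left subtree, right subtree
void preorder(Node *t) {
    if (t == NULL) return;
    printf("%d ", t->value);
    preorder(t->left);
    preorder(t->right);
}

// Inorder (LNR): left subtree, root, right subtree — sorted order for a BST
void inorder(Node *t) {
    if (t == NULL) return;
    inorder(t->left);
    printf("%d ", t->value);
    inorder(t->right);
}

// Postorder (LRN): left subtree, right subtree, root
void postorder(Node *t) {
    if (t == NULL) return;
    postorder(t->left);
    postorder(t->right);
    printf("%d ", t->value);
}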

Binary Search Trees
 Main operations: search, insert, remove
Level of a node = path length from root to that node
Height/depth = maximum path length from root to a leaf
Time complexity of search is O(height): O(log n) when the tree is balanced
Everything in the left subtree is less than the root; everything in the right subtree is greater than the root

Insertion
Insertion at Root
Method for inserting at root:
 base case:
  o tree is empty; make a new node and make it the root
 recursive case:
  o insert the new node as root of the appropriate subtree
  o lift the new node to the root by rotation
Time complexity: O(height)

Deletion
4 cases:
 Empty tree
 Zero subtrees (deleting a leaf)
 One subtree (replace the node by its child)
 Two subtrees (replace the node by its successor, the smallest node in its right subtree, then join the two subtrees)

Tree rotation
4 common imbalances ("the grandparent has to pay and do the work (rotate)"):
 https://www.youtube.com/watch?v=NczBLeco6XA
1. Right child's right subtree: left rotation
2. Left child's left subtree: right rotation
3. Right child's left subtree: right-left rotation
4. Left child's right subtree: left-right rotation
(Rotation sketches below.)
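Rotation sketches in C, reusing the illustrative Node type from the traversal sketch above; each returns the new subtree root:

// Left rotation: the right child becomes the new root of the subtree
Node *rotateLeft(Node *t) {
    if (t == NULL || t->right == NULL) return t;
    Node *newRoot = t->right;
    t->right = newRoot->left;   // old root adopts new root's left subtree
    newRoot->left = t;          // old root becomes the left child
    return newRoot;
}

// Right rotation: mirror image of rotateLeft
Node *rotateRight(Node *t) {
    if (t == NULL || t->left == NULL) return t;
    Node *newRoot = t->left;
    t->left = newRoot->right;
    newRoot->right = t;
    return newRoot;
}

The double rotations are compositions: a right-left rotation is rotateRight on the right child followed by rotateLeft on the root, and symmetrically for left-right.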

Joining Trees
The highest value in the left tree (t1) must be less than the lowest value in the right tree (t2).
Method:
 Find the min node in the right tree (t2)
 Replace the min node by its right subtree (connect the min node's right child to the min node's parent)
 Elevate the min node to be the new root of both trees: the left tree is smaller, so it goes on the left; the right tree is bigger, so it goes on the right

Week 08: Search Tree Algorithms 2
Self-adjusting trees
Splay trees
 Not as balanced as AVL trees, and therefore faster
 Most operations O(log N), some O(N)
 Fast access to recently accessed elements
 After each operation, a splay occurs, e.g. if a key is found, it is moved to the root. Splaying allows the tree to balance itself by bumping x (the node you last worked on) to the top, usually via the zig-zig/zig-zag procedures:
  o Zig-zig: e.g. left child's left child
  o Zig-zag: e.g. left child's right child
 Insert: bring the inserted key up to the root.

AVL trees
 The difference in height between the left and right subtrees of any node must be at most 1
 Not good for ordering/sorting data

Hash Tables
A hash table has two parts:
1. A hash function -> returns a nonnegative integer value called a hash code
2. An array capable of storing data of the type we wish to place into the data structure
We run the data through the hash function and use the hash code to find where we want to store the data.

 A collision occurs when two pieces of data, when run through the hash function, return the same hash code.
 We want to store both pieces of data in the hash table, so we don't overwrite.
Linear probing:
 If a slot is taken, put the item in the next available spot in the array. However, the fuller the hash map, the more costly this is; it deviates from constant-time operations.
 It causes clustering: once there's a miss, two adjacent cells will contain data, making it more likely in the future that the cluster will grow.

Chaining:
 The array holds more than one piece of data per slot
 Each element of the array is a pointer to the head of a linked list (sketch below)
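A chaining sketch in C for string keys (the table size and hash function are illustrative choices, not from the notes):

#include <stdlib.h>
#include <string.h>

#define TABLE_SIZE 101   // hypothetical prime table size

// One chain node per stored key
typedef struct chainNode {
    char *key;
    struct chainNode *next;
} ChainNode;

// Illustrative hash function: multiply-and-add over the characters,
// reduced modulo the table size to give a nonnegative index
static unsigned hash(const char *key) {
    unsigned h = 0;
    for (; *key != '\0'; key++) h = h * 31 + (unsigned char)*key;
    return h % TABLE_SIZE;
}

// Insert at the head of the slot's chain; colliding keys share a
// linked list instead of overwriting each other
void chainInsert(ChainNode *table[], const char *key) {
    unsigned i = hash(key);
    ChainNode *n = malloc(sizeof(ChainNode));
    n->key = malloc(strlen(key) + 1);
    strcpy(n->key, key);
    n->next = table[i];
    table[i] = n;
}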

Week 11: Text Processing Algorithms
Alphabets and words
Strings
 A string is a sequence of characters drawn from a set of possible characters called the alphabet.

Pattern matching
 Given two strings T (text) and P (pattern), the pattern matching problem consists of finding a substring of T equal to P.
 Brute-force pattern matching runs in O(n*m) time, where n is the length of the text and m is the length of the pattern (sketch below).
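A brute-force matching sketch in C:

// Brute-force pattern matching: try every alignment of P in T; O(n*m).
// Returns the index of the first match, or -1 if there is none.
int bruteMatch(const char *T, int n, const char *P, int m) {
    for (int i = 0; i + m <= n; i++) {
        int j = 0;
        while (j < m && T[i + j] == P[j]) j++;
        if (j == m) return i;   // all m characters matched at shift i
    }
    return -1;
}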

Boyer-Moore
 Based on two heuristics:
  o Looking-glass heuristic: compare P with a subsequence of T moving backwards.
  o Character-jump heuristic: when a mismatch occurs at T[i] = c:
    - if P contains c ⇒ shift P so as to align the last occurrence of c in P with T[i]
    - otherwise ⇒ shift P so as to align P[0] with T[i+1] (AKA the big jump)
 The algorithm pre-processes the pattern P and the alphabet to build the last-occurrence function L.
  o L maps the alphabet to integers, with L(c) defined as the largest index i such that P[i] = c, or -1 if no such index exists.
  o L can be represented by an array indexed by the numeric codes of the characters and can be computed in O(m + s) time.
 Cost analysis: it runs in O(n*m + s) time, where m is the length of the pattern, n is the length of the text and s is the size of the alphabet.

Knuth-Morris-Pratt Algorithm
 Compares pattern to text left-to-right, but shifts the pattern more intelligently than the brute-force algorithm.
 When a mismatch occurs, we shift so as to reuse the largest prefix of P[0..j] that is a suffix of P[i..j].
 Pre-processes the pattern to find matches of its prefixes with itself.
 The failure function F(j) is defined as the size of the largest prefix of P[0..j] that is also a suffix of P[1..j].
  o If a mismatch occurs at P[j], then advance to F(j-1).

 Analysis of the failure function computation:
  o At each iteration of the while loop, either i increments by one or the shift amount i-j increases by at least one.
  o Hence there are no more than 2*m iterations of the while loop, so the failure function can be computed in O(m) time. (Sketch below.)
 Analysis of the KMP algorithm:
  o Similar to the failure function, with no more than 2*n iterations of the while loop.
  o Therefore, KMP runs in optimal time O(m + n).
Boyer-Moore vs KMP
 Boyer-Moore decides how far to jump ahead based on the mismatched character in the text, and works best on large alphabets and natural-language texts.
 KMP uses information embodied in the pattern to determine where the next match could begin, and works best on small alphabets.
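A sketch in C of the failure function computation (the array F and loop structure follow the standard KMP formulation):

// KMP failure function: F[j] = length of the largest proper prefix of
// P[0..j] that is also a suffix of P[1..j]; O(m) total work
void failureFunction(const char *P, int m, int F[]) {
    F[0] = 0;
    int j = 0;                       // length of the current matched prefix
    for (int i = 1; i < m; i++) {
        while (j > 0 && P[i] != P[j])
            j = F[j - 1];            // fall back to a shorter prefix
        if (P[i] == P[j]) j++;       // extend the match
        F[i] = j;
    }
}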

Tries
Pre-processing Strings
 Pre-processing the pattern speeds up pattern matching queries: after pre-processing P, KMP performs pattern matching in time proportional to the text length.
 If the text is large, immutable and searched often, we can pre-process the text instead of the pattern.

Tries
 A trie is a compact data structure for representing a set of strings, supporting pattern matching queries in time proportional to the pattern size.
 Tries are trees organised using parts of keys rather than whole keys.
 Each node in a trie contains one part of a key and may have up to 26 children.
  o A node may be tagged as a finishing node but can still have children.

Trie Operations
 Basic operations: search for a key, insert a key (sketch below).
 Tries require O(n) space, and insert and search run in O(d*m) time,
  o where n is the total size of the text, m is the size of the string parameter of the operation, and d is the size of the underlying alphabet.
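A trie insert/search sketch in C, assuming lowercase alphabetic keys to match the 26-children assumption above (node layout is illustrative):

#include <ctype.h>
#include <stdbool.h>
#include <stdlib.h>

#define ALPHABET 26   // lowercase letters only, per the notes

typedef struct trieNode {
    struct trieNode *child[ALPHABET];
    bool isFinishing;                // tagged as the end of a stored key
} TrieNode;

TrieNode *trieNew(void) {
    return calloc(1, sizeof(TrieNode));   // all children start NULL
}

// Insert: one character (key part) per level, creating nodes as needed
void trieInsert(TrieNode *root, const char *key) {
    TrieNode *cur = root;
    for (; *key != '\0'; key++) {
        int c = tolower((unsigned char)*key) - 'a';
        if (cur->child[c] == NULL) cur->child[c] = trieNew();
        cur = cur->child[c];
    }
    cur->isFinishing = true;
}

// Search follows one child pointer per character: O(m) for a key of length m
bool trieSearch(TrieNode *root, const char *key) {
    TrieNode *cur = root;
    for (; *key != '\0'; key++) {
        int c = tolower((unsigned char)*key) - 'a';
        if (cur->child[c] == NULL) return false;
        cur = cur->child[c];
    }
    return cur->isFinishing;
}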

Word Matching With Tries  For pre-processing the text, insert all searchable words of a text into a trie, where each leaf stores the occurrence(s) of the associated word in the text.

Compressed Tries
 Internal nodes have degree ≥ 2; compressed tries are obtained from standard tries by compressing redundant chains of nodes.
 A possible compact representation of a compressed trie encodes an array S of strings:
  o Nodes store ranges of indices instead of substrings, using the triple (i, j, k) to represent the substring S[i][j..k].
  o Requires O(s) space, where s is the number of strings in the array S.

Pattern Matching with Suffix Tries
 The suffix trie of a text T is the compressed trie of all suffixes of T.
 Input: the compact suffix trie for text T and a pattern P.
 Analysis:
  o The suffix trie for a text of size n can be constructed in O(n) time and uses O(n) space.
  o It supports pattern matching queries in O(s*m) time, where m is the length of the pattern and s is the size of the alphabet.

Text compression
 A code is a mapping of each character to a binary word.
 A prefix code is a binary code such that no code word is a prefix of another code word.
 An encoding tree represents a prefix code: each leaf stores a character, and its code word is given by the path from the root to the leaf.
  o 0 for a left child and 1 for a right child.
 Use short code words for frequent characters and long code words for rare characters.

Huffman code
 Computes the frequency f(c) for each character c and encodes high-frequency characters with short codes.
 No code word is a prefix of another code word; an optimal encoding tree determines the code words.
 Successively combines the pair of lowest-frequency trees to build up the encoding tree bottom-up.
 Analysis of the algorithm:

o O(n + d log d) time, where n is the length of the input text T and d is the number of distinct characters in T.

