Looking for a Tutor Near You?

Post Learning Requirement »
x

Choose Country Code

x

Direction

x

Ask a Question

x

x
x
x
Hire a Tutor

Data Structures And Algorithms-Part 2

Loading...

Published in: PHP And MySQL
2,397 Views

Algorithm is a step by step procedure, which defines a set of instructions to be executed in certain order to get the desired output. Algorithms are generally created independent of underlying languages, i.e. an algorithm can be implemented in more than one programming language.

Prantik S / Kolkata

11 years of teaching experience

Qualification: MCA (Jaipur National University - [JNU], Jaipur - 2017)

Teaches: Basic Computer, Computer for official job, MS Office, School Level Computer, ICT Training, Computer Science, Information Practice, IT & Computer Subjects, BCA Tuition, IT, Computer, C / C++, C# (C Sharp), Java And J2EE, Python Programming, Visual Basic, BCA Subjects, Hardware Training, Networking, Java Script

Contact this Tutor
  1. Part 2 Continued
  2. Converting to a Binary Tree Left child/right sibling representation essentially stores a binary tree. Use this process to convert any general tree to a binary tree. A forest is a collection of one or more general trees. (a) (b)
  3. Sequential Implementations (1) List node values in the order they would be visited by a preorder traversal. Saves space, but allows only sequential access. Need to retain tree structure for reconstruction. Example: For binary trees, us a symbol to mark null 1 links. AB/D//CEG///FH//I//
  4. Sequential Implementations (2) Example: For full binary trees, mark nodes as leaf or internal. A'B'/DC'E'G/F'HI Example: For general trees, mark the end of each subtree.
  5. Sorting Each record contains a field called the key. — Linear order: comparison. Measures of cost: — Comparisons — Swaps
  6. Insertion Sort 42 20 17 13 28 14 23 20 42 17 13 28 14 23 17 20 42 13 28 14 23 13 17 20 42 28 14 23 13 17 20 28 42 14 23 (1) 13 14 17 20 28 42 23 13 14 17 20 23 28 42 13 14 17 20 23 28 42
  7. Insertion Sort (2) template void inssort (Elem A [l int n) { for (int i = 1; i < n; i++) for (int j=i; (Comp: : It (A A[j-l])) swap (A, Best Case: Worst Case: Average Case:
  8. Bubble Sort 42 20 17 13 28 14 23 i=0 13 42 20 17 14 28 23 13 14 42 20 28 23 13 14 42 20 23 28 13 14 17 42 20 23 28 (1) 13 14 17 20 42 23 28 13 14 17 20 23 42 28 13 14 17 20 23 28 42
  9. Bubble Sort (2) template void bubsort (Elem A [l int n) { for (int i =0; i
  10. Selection Sort 42 20 13 28 14 23 i=0 13 20 17 42 28 14 23 14 17 42 28 20 23 14 42 28 20 23 13 14 17 28* 20 23 42 (1) 13 14 17 20 28* 23? 42 13 14 17 20 23 28- 42 14 17 20 23 28 42 10
  11. Selection Sort (2) template void selsort (Elem A [l int n) { for (int i =0; i i; j ) // Find least for (int j=n-l; if (Comp: : It (A [j], A [lowindex] ) ) // Put it in place low index swap (A, i, low index) ; Best Case: Worst Case: Average Case: 11
  12. (a) Pointer Swapping (b) 12
  13. Summary Insertion Bubble Comparisons: Best Case Average Case Worst Case Swaps Best Case Average Case Worst Case 2) 2) 2) 2) 2) 2) 2) 2) 2) selection 2) 2) 2) 13
  14. Exchange Sorting All of the sorts so far rely on exchanges of adjacent records. What is the average number of exchanges red? — There are ! permutations — Consider permuation and its reverse, — Together, every pair requires ( exchanges. 14
  15. Shellsort 59 36 28 11 11 20 20 14 13 17 13 28 14 23 11 11 14 13 13 14 28 36 23 17 14 20 15 20 23 17 28 23 83 15 15 20 28 36 59 59 36 36 98 98 41 41 41 11 17 23 42 42 70 70 70 70 59 65 65 65 59 65 41 41 98 83 70 42 42 42 65 83 15 83 83 98 98 15
  16. Shellsort // Modified version of Insertion Sort template < class El em, class Comp> int n, int inc r) void inssort2 (Elem A [l for (int i=incr; i < n; i+=incr) for (int j=i; (j>=incr) && (Comp: : It (A [j], A[j-incr] ) ) ; j -incr) ; swap (A, template < class El em, class Comp> —inc r) int n) { // Shell sort void shell sort (Elem A [l , for (int i=n/2; i > 2; i/=2) // For each incr // Sort sublists for (int 3=0; inssort2 n-j , i); inssort2 (A, n, 1) ' 16
  17. Quicksort template < class El em, class Comp> void qsort (Elem , int i, int j) if (j int findpivot (Elem A [l , int i, { return (i + j) / 2; int j) 17
  18. Quicksort Partition template < class El em, class Comp> int partition (Elem int l, int r, Elem& pivot) { do { // Move the bounds in until they meet while (Comp: : It (A [++1], pivot) ) ; while ( (r ! 0) && Comp: : g t (A[--r], pivot) ) ; // Swap out-of-place values swap (A, l, r); // Stop when they cross } while (l < r); swap (A, l, r) ; return l; // Reverse last swap // Return first pos on right The cost for partition is ). 18
  19. Partition Example Pass 1 Swap 1 Pass 2 Swap 2 Pass 3 Swap 3 Reverse Swap 72 72 48 48 48 48 48 48 6 6 6 6 6 6 6 6 57 57 57 57 57 57 57 57 88 88 88 88 42 42 85 42 85 85 85 85 85 85 42 | 85 42 42 42 42 88 88 88 88 83 83 83 83 83 83 83 83 73 73 73 73 73 73 73 73 48 48 72 72 72 72 72 72 60 60 60 60 60 60 60 60 19
  20. Quicksort Example Pivot = 60 Pivot = 42 72 6 57 48 6 57 Pivot = 6 6 42 57 48 Pivot = 57 42 48 57 42 48 6 42 48 88 60 42 60 57 60 42 88 72 72 73 48 85 83 73 83 73 72 85 Pivot = 73 85 88 83 Pivot = 88 85 83 88 83 85 85 88 73 83 Final Sorted Array 20
  21. Cost of Quicksort Best case: Always partition in half. Worst case: Bad partition. Average case: Optimizations for Quicksort: — Better Pivot — Better algorithm for small sublists — Eliminate recursion 21
  22. Mergesort List mergesort (List in list) { if (in list. length()
  23. Mergesort Implementation template < class Elem class Comp> void mergesort fill, tem [l 1 nt left 1 nt righ€) (left+rxght5 / 2; int mid if (left right) return; mergesort A, temp, left mid) mergesort A, temp, mid+f right) for (int i
  24. Optimized Mergesort template < class Elem class Comp> void mergesort fill, tem [l 1 nt left 1 nt ri h?) if ( (right-left)
  25. Mergesort Cost Mergesort cost: Mergsort is also good for sorting linked lists. Mergesort requires twice the space. 25
  26. Heapsort template int n) { // Heapsort void heapsort (Elem , Elem mval; maxheap H (A, n, n) ; // Now sort for (int i =0; i < n; i++) // Put max at end H . removemax (mval ) Use a max-heap, so that elements end up sorted within the array. Cost of heapsort: Cost of finding K largest elements: 26
  27. Heapsort Example (1) 73 88 85 6 85 57 83 Original Numbers 88 60 42 83 Build Heap 72 73 42 57 Remove 88 72 60 42 57 72 6 6 48 85 48 60 48 88 6 88 72 48 85 85 72 6 48 60 73 72 6 48 60 73 60 73 88 85 73 83 42 42 42 57 83 83 83 57 57 27
  28. Heapsort Example Remove 88 85 73 83 72 60 42 57 6 Remove 85 83 73 57 72 60 42 48 6 Remove 83 73 72 57 6 60 42 48 83 48 88 85 88 85 88 (2) 28
  29. Binsort (1) A simple, efficient sort: for (i =0; i < n; i++) Ways to generalize: — Make each bin the head of a list. — Allow more keys than records. 29
  30. Binsort (2) template void binsort (Elem A [l int n) { List B [MaxKeyValue] ; Elem item; for (i =0; i < n; i++) B [A [i]] . append (A [i]); for (i =0; i
  31. Radix Sort (1) Initial List: First pass (on right digit) 27 91 1 1 5 97 72 17 17 23 84 28 Second pass (on left digit) 72 5 27 97 67 72 67 25 Result of first pass: Result of second pass: 1 23 84 23 25 5 25 27 28 84 67 91 28 97 31
  32. Radix Sort (2) template void radix (Elem A [l, Elem BC], int n, int k, int r, int cnt[]) // cnt [i] stores # of records in bin [i] int j; for (int i =0, rtok=l; i < k; i++, rtok*=r) for (3=0; j < r; j++) cnt [j] O; // Count # of records for each bin j < n; j++) cnt[ (A [j] /rtok) 00 r] ++; for (3=0; // cnt [j] will be last slot of bin j. for (3=1; cnt [j] cnt[j-l] + cnt[j]; for (j=n-l• for (3=0; 32
  33. Radix Initial Input: Array A First pass values for Count. rtok = 1. Count array: Index positions for Array B. End of Pass 1: Array A. Second pass values for Count. rtok = 10. Count array: Index positions for Array B. End of Pass 2: Array A. Sort Example o 2 2 1 2 1 2 1 1 1 2 1 2 3 2 4 2 7 3 1 3 4 3 0 3 7 4 1 4 5 4 4 7 5 2 5 7 5 o 5 7 6 6 7 6 1 6 8 7 4 7 11 7 1 7 9 8 1 8 12 8 1 8 10 9 9 12 9 2 9 12 33
  34. Cost: HOW do Radix Sort Cost , and relate? If key range is small, then this can be ). If there are distinct keys, then the length of a key must be at least log — Thus, Radix Sort is log ) in general case 34
  35. Empirical Comparison (1) Smt st•/o QI?k QI?k/O Merge Merge/O Ra•/4 10 mli mli mli mil mli mli 100 .0T2 .om .01ó 9.13 0.33 0.27 061 0.38 2.31 352.1 $435 99 33 33 236 170 10 94 IM 1430 0.0 513.5 2.8 16 6.0 5.0 812.9 6.1 2.2 16 6.1 38 5.0 35
  36. Empirical Comparison (2) ??e? ??e?jo Msae MsaejO He4 10 m?? m?e m?? m?e m?r m?c m12 100 .114 m3 .044 .191 1136 041 034 34 10 K 9.8 3.9 9.1 8.8 IOOK 118 379 03 29 29 15 178 916.9 1012.2 5.2 3.9 1.5 35.4 36
  37. Sorting Lower Bound We would like to know a lower bound for all possible sorting algorithms. Sorting is 0( log ) (average, worst cases) because we know of algorithms with this upper bound. Sorting I/O takes Q( ) time. We will now prove Q( log ) lower bound for sorting. 37
  38. Decision Trees Yes YXZ YXZ YZX ZYX Yes A[2] < All]? yzx (z < X?) YZX ZYX Yes All] < ALO]? Z Y X (Z < Y?) YZX XYZ XYZ YZX XZY ZXY YXZ ZYX All] < ALO]? Yes No XYZ XYZ XZY ZXY No YXZ Yes A[2] < All]? xzy (Z < Y?) XYZ XZY ZXY All] < ALO]? ZXY (Z < X?) XZY 38
  39. Lower Bound Proof There are ! permutations. A sorting algorithm can be viewed as determining which permutation has been input. Each leaf node of the decision tree corresponds to one permutation. A tree with n nodes has Q(log ) levels, so the tree with n! leaves has Q(log !) = log ) levels. Which node in the decision tree corresponds to the worst case? 39
  40. Primary vs. Secondary Storage Primary storage: Main memory (RAM) Secondary Storage: Peripheral devices Disk drives Tape drives 40
  41. Comparisons Medium Early 1996 Mid 1997 Early 2000 RAM Disk Floppy Tape 0.25 0.50 0.03 7.00 0.10 0.36 0.01 1.50 0.01 0.25 0.001 RAM is usually volatile. RAM is about 1/4 million times faster than disk. 41
  42. Golden Rule of File Processing Minimize the number of disk accesses! 1 Arrange information so that you get what you want with few disk accesses. 2. Arrange information to minimize future disk accesses. An organization for data on disk is often called a file structure. Disk-based space/time tradeoff: Compress information to save processing time by reducing disk accesses. 42
  43. Sectors Head 6 3 8 5 Intersector Gaps Sectors Bits of Data 8 7 6 5 1 2 3 4 Rotation (a) Head 1 4 7 2 Rotation (b) A sector is the basic unit of I/O. Interleaving factor: Physical distance between logically adjacent sectors on a track. 43
  44. Terms Locality of Reference: When record is read from disk, next request is likely to come from near the same place in the file. Cluster: Smallest unit of file allocation, usually several sectors. Extent: A group of physically contiguous clusters. Internal fragmentation: Wasted space within sector if record size does not match sector size; wasted space within cluster if file size is not a multiple of cluster size. 44
  45. Seek Time Seek time: Time for I/O head to reach desired track. Largely determined by distance between I/O head and desired track. Track-to-track time: Minimum time to move from one track to an adjacent track. Average Seek time: Average time to reach a track for random access. 45
  46. Other Factors Rotational Delay or Latency: Time for data to rotate under I/O head. One half of a rotation on average. At 7200 rpm, this is 8.3/2 = 4.2ms. Transfer time: Time for data to move under the I/O head. At 7200 rpm: Number of sectors read/Number of sectors per track * 8.3ms. 46
  47. Disk Access Cost Example (1) Read a IMB file divided into 2048 records of 512 bytes (1 sector) each. Assume all records are on 8 contiguous tracks. First track: 9.5 + 11 .1/2 +3 x 11.1 = 48.4 ms Remaining 7 tracks: 2.2 + 11 .1/2 + 3 x 11.1 = 41.1 ms. Total: 48.4 + 7*41.1 = 335.7ms 47
  48. Disk Access Cost Example (2) Read a IMB file divided into 2048 records of 512 bytes (1 sector) each. Assume all file clusters are randomly spread across the disk. 256 clusters. Cluster read time is (3 x 8)/256 of a rotation for about 1 ms. 256(9.5 + 11 .1/2 + (3 x 8)/256) is about 3877 ms. or nearly 4 seconds. 48
  49. How Much to Read? Read time for one track: + 11 +3 x = Read time for one sector: 9.5 + 11 .1/2 + (1/256)11 1 . = 15.1ms. Read time for one byte: 9.5 + 11.1/2 = 15.05 ms. Nearly all disk drives read/write one sector at every I/O access. Also referred to as a page. 49
  50. Buffers The information in a sector is stored in a buffer or cache. If the next I/O access is to the same buffer, then no need to go to disk. There are usually one or more input buffers and one or more output buffers. 50
  51. Buffer Pools A series of buffers used by an application to cache disk data is called a buffer pool. Virtual memory uses a buffer pool to imitate greater RAM memory by actually storing information on disk and "swapping" between disk and RAM. 51
  52. Buffer Pools Physical Memory (on disk) Virtual Memory (in main memory) 52
  53. Organizing Buffer Pools Which buffer should be replaced when new data must be read? First-in, First-out: Use the first one on the queue. Least Frequently Used (LFU): Count buffer accesses, reuse the least used. Least Recently used (LRU): Keep buffers on a linked list. When buffer is accessed, bring it to front. Reuse the one at end. 53
  54. Design Issues Disadvantage of message passing: Messages are copied and passed back and forth. Disadvantages of buffer passing: The user is given access to system memory (the buffer itself) The user must explicitly tell the buffer pool when buffer contents have been modified, so that modified data can be rewritten to disk when the buffer is flushed. The pointer might become stale when the bufferpool replaces the contents of a buffer. 54
  55. Programmer's View of Files Logical view of files: An a array of bytes. A file pointer marks the current position. Three fundamental operations: Read bytes from current position (move file pointer) Write bytes to current position (move file pointer) Set file pointer to specified byte position. 55
  56. C++ File Functions # include < f stream. h > void f stream: : open (char* name, openmode mode) Example: ios: : in I ios: : binary void f stream: : close ( ) • f stream: . read (char* ptr, int numbytes) ' f stream: : write (char* ptr, int numbtyes) ' f stream: : see kg (int pos) ; f stream: : see kg (int pos, f stream: . seekp (int pos) ; f stream: : seekp (int pos, ios: : cur r) i os : : end) ; 56
  57. External Sorting Problem: Sorting data sets too large to fit into main memory. Assume data are stored on disk drive. To sort, portions of the data must be brought into main memory, processed, and returned to disk. An external sort should minimize disk accesses. 57
  58. Model of External Computation Secondary memory is divided into equal-sized blocks (512, 1024, etc...) A basic I/O operation transfers the contents of one disk block to/from main memory. Under certain circumstances, reading blocks of a file in sequential order is more efficient. (When?) Primary goal is to minimize I/O operations. Assume only one disk drive is available. 58
  59. Key Sorting Often, records are large, keys are small. Ex: Payroll entries keyed on ID number Approach 1 : Read in entire records, sort them, then write them out again. Approach 2: Read only the key values, store with each key the location on disk of its associated record. After keys are sorted the records can be read and rewritten in sorted order. 59
  60. Simple External Mergesort (1) Quicksort requires random access to the entire set Of records. Better: Modified Mergesort algorithm. Process elements in ) passes. A group of sorted records is called a 60
  61. Simple External Mergesort (2) Split the file into two files. Read in a block from each file. Take first record from each block, output them in sorted order. Take next record from each block, output them to a second file in sorted order. Repeat until finished, alternating between output files. Read new input blocks as needed. Repeat steps 2-5, except this time input files have runs of two sorted records that are merged together. Each pass through the files provides larger runs. 61
  62. Problems with Simple Mergesort Is each pass through input and output files sequential? What happens if all work is done on a single disk drive? How can we reduce the number of Mergesort passes? In general, external sorting consists of two phases: Break the files into initial runs Merge the runs together into a single run. 62
  63. Breaking a File into Runs General approach: Read as much of the file into memory as possible. Perform an in-memory sort. Output this group of records as a single run. 63
  64. Replacement Selection (1 ) Break available memory into an array for the heap, an input buffer, and an output buffer. Fill the array from disk. Make a min-heap. Send the smallest value (root) to the output buffer. 64
  65. Replacement Selection (2) If the next key in the file is greater than the last value output, then Replace the root with this key else Replace the root with the last key in the array Add the next record in the file to a new heap (actually, stick it at the end of the array). llJbru Bnuek uvn orubfil WII-I E!l€ orubru 65
  66. RS Example 66
  67. Snowplow Analogy (1 ) Imagine a snowplow moving around a circular track on which snow falls at a steady rate. At any instant, there is a certain amount of snow on the track. Some falling snow comes in front of the plow, some behind. During the next revolution of the plow, all of this is removed, plus 1/2 of what falls during that revolution. Thus, the plow removes 2 amount of snow. 67
  68. Snowplow Analogy (2) Falling Snow Future snow Existing snow Snowplow Movement Start time T 68
  69. Problems with Simple Merge Simple mergesort: Place runs into two files. Merge the first two runs to output file, then next two runs, etc. Repeat process until only one run remains. How many passes for r initial runs? Is there benefit from sequential reading? Is working memory well used? Need a way to reduce the number of passes. 69
  70. Multiway Merge (1) With replacement selection, each initial run is several blocks long. Assume each run is placed in separate file. Read the first block from each file into memory and perform an -way merge. When a buffer becomes empty, read a block from the appropriate run file. Each record is read only once from disk during the merge process. 70
  71. Multiway Merge (2) In practice, use only one file and seek to appropriate block. Input Runs Output Buffer 12 18 20 71
  72. Limits to Multiway Merge (1 ) Assume working memory is blocks in size. How many runs can be processed at one The runs are 2 blocks long (on average). How big a file can be merged in one pass? 72
  73. Limits to Multiway Merge (2) Larger files will need more passes -- but the run size grows quickly! This approach trades (log b) (possibly) sequential passes for a single or very few random (block) access passes. 73
  74. General Principles A good external sorting algorithm will seek to do the following: Make the initial runs as long as possible. At all stages, overlap input, processing and output as much as possible. Use as much working memory as possible. Applying more memory usually speeds processing. If possible, use additional disk drives for more overlapping of processing with I/O, and allow for more sequential file processing. 74
  75. Search Given: Distinct keys l, 2, and collection of records of the form where is the information associated with key for 1 . Search Problem: For key value , locate the record ( ) in such that Searching is a systematic method for locating the record(s) with key value 75
  76. Successful vs. Unsuccessful A successful search is one in which a record with key is found. An unsuccessful search is one in which no record with is found (and presumably no such record exists). 76
  77. Approaches to Search 1 . Sequential and list methods (lists, tables, arrays). 2. Direct access by key value (hashing) 3. Tree indexing methods. 77
  78. Searching Ordered Arrays Sequential Search Binary Search Dictionary Search 78
  79. Self-Organizing Lists Self-organizing lists modify the order of records within the list based on the actual pattern of record accesses. Self-organizing lists use a heuristic for deciding how to reorder the list. These heuristics are similar to the rules for managing buffer pools. 79
  80. 1. 2. 3. Heuristics Order by actual historical frequency of access. (Similar to LFU buffer pool replacement strategy.) Move-to-Front: When a record is found, move it to the front of the list. Transpose: When a record is found, swap it with the record ahead of it. 80
  81. Text Compression Example Application: Text Compression. Keep a table of words already seen, organized via Move-to-Front heuristic. If a word not yet seen, send the word. Otherwise, send (current) index in the table. The car on the left hit the car I left. The car on 3 left hit 35 1 5. This is similar in spirit to Ziv-LempeI coding. 81
  82. Searching in Sets For dense sets (small range, high percentage Of elements in set). Can use logical bit operators. Example: To find all primes that are odd numbers, compute: 0011010100010100 & 0101010101010101 82
  83. Hashing (1) Hashing: The process of mapping a key value to a position in a table. A hash function maps key values to positions. It is denoted by h. A hash table is an array that holds the records. It is denoted by HT. HT has slots, indexed form Oto -I. 83
  84. Hashing (2) For any value in the key range and some hash function h, h( ) = , such that key (H T [ = Hashing is appropriate only for sets (no duplicates). Good for both in-memory and disk-based applications. Answers the question "What record, if any, has key value K?' 84
  85. Simple Examples (1) Store the n records with keys in range 0 to — Store the record with key in slot . — Use hash function h( ) = (2) More reasonable example: — Store about 1000 records with keys in range 0 to 16,383. — Impractical to keep a hash table with 16,384 slots. — We must devise a hash function to map the key range to a smaller table. 85
  86. Collisions (1) Given: hash function h with keys I and ß is a slot in the hash table. If h( l) 2), then I and 2 have a collision at ß under h. Search for the record with key 1. Compute the table location h( 2. Starting with slot h( ), locate the record containing key using (if necessary) a collision resolution policy. 86
  87. Collisions (2) Collisions are inevitable in most applications. Example: 23 people are likely to share a birthday. 87
  88. Hash Functions (1) A hash function MUST return a value within the hash table range. To be practical, a hash function SHOULD evenly distribute the records stored among the hash table slots. Ideally, the hash function should distribute records with equal probability to all hash table slots. In practice, success depends on distribution of actual records stored. 88
  89. Hash Functions (2) If we know nothing about the incoming key distribution, evenly distribute the key range over the hash table slots while avoiding obvious opportunities for clustering. If we have knowledge of the incoming distribution, use a distribution-dependent hash function. 89
  90. Examples (1) int h (int x) { return (x 00 This function is entirely dependent on the lower 4 bits of the key. Mid-square method: Square the key value, take the middle bits from the result for a hash table of 2 slots. 90
  91. Examples (2) For strings: Sum the ASCII values of the letters and take results modulo int h (char* x) { int i, sum; for (sum=0, i =0; x [i] sum += (int) x [i]; return (sum 00 M) ' This is only good if the sum is large compared to 91
  92. Examples (3) ELF Hash: From Executable and Linking Format (ELF), UNIX System V Release 4. int ELF hash (char* key) { unsigned long h O; while ( *key) { h ( h < < 4) + *key + + ; unsigned long g h & OxFOOOOOOOL; if (g) h return h M; 92
  93. Open Hashing What to do when collisions occur? Open hashing treats each hash table slot as a bin. 1 2 3 4 5 6 7 8 9 3013 9877 9879 9530 2007 1 057 93
  94. Bucket Hashing Divide the hash table slots into buckets. Example: 8 slots/bucket. Include an overflow bucket. Records hash to the first slot of the bucket, and fill bucket. Go to overflow if necessary. When searching, first check the proper bucket. Then check the overflow. 94
  95. Closed Hashing Closed hashing stores all records directly in the hash table. Each record has a home position h( If another record occupies 's home position, then another slot must be found to store The new slot is found by a collision resolution p.Q!jcy. Search must follow the same policy to find records not in their home slots. 95
  96. Collision Resolution During insertion, the goal of collision resolution is to find a free slot in the table. Probe sequence: The series Of slots visited during insert/search by following a collision resolution policy. Let ß =h( ). Let (ßo ß , 1, . ) be the series of0slots making up the probe sequence. 96
  97. Insertion // Insert e into hash table HT template < class Key, class El em, class KEComp, class EEComp> bool hashdict: : hash Insert (const Elem& e) { // Home position for e int home; // Init int pos home h (get key (e) ) ; for (int i = 1; ! (EEComp: HT [pos] ) ) ; (home + p (get key (e) , M; pos if (EEComp: :eq(e, HT [pos] ) ) return false; // Duplicate HT [pos] return true; // Insert e 97
  98. Search // Search for the record with Key K template < class Key, class El em, class KEComp, class EEComp> bool hashdict: : hashSearch (const Key & K, Elem& e) const { // Home position for K int home; // Initial posit int pos home for (int i 1; ! KEComp: HT [pos]) ! EEComp: HT [pos] ) ; M; // Next (home + p (K, i)) pos { // Found it if (KEComp: HT [pos] ) ) HT [pos]; return true; else return false; // K not in hash table 98
  99. Probe Function Look carefully at the probe function p(). i)) (home + p (get key (e) , M; pos Each time p ( ) is called, it generates a value to be added to the home position to generate the slot to be examined. p ( ) is a function both of the element's key value, and of the number of steps taken along the probe sequence. Not all probe functions use both parameters. 99
  100. Linear Probing Use the following probe function: i; Linear probing simply goes to the next slot in the table. Past bottom, wrap around to the top. To avoid infinite loop, one slot in the table must always be empty. 100
  101. Linear Probing Example Primary Clustering: Records tend to cluster in the table under linear probing since the probabilities for which slot to use next are not the same for all slots. 1 2 3 4 5 6 7 8 9 10 I OOI 9537 3016 9874 2009 9875 (a) 1 2 3 4 5 6 7 8 9 10 1001 9537 3016 2009 9875 1052 (b) 101
  102. Improved Linear Probing Instead of going to the next slot, skip by some constant Warning: Pick and carefully. The probe sequence SHOULD cycle through all slots of the table. Pick to be relatively prime to There is still some clustering Probe sequences for I and are linked together. 102
  103. Pseudo-Random Probing(l) The ideal probe function would select the next slot on the probe sequence at random. An actual probe function cannot operate randomly. (Why?) 103
  104. Pseudo-Random Probing(2) Select a (random) permutation of the numbers from 1 to -1: All insertions and searches use the same permutation. Example: Hash table size of = 101 -5, 3=32. 0=30, Probe sequence for Probe sequence for 104
  105. Quadratic Probing Set the 'th value in the probe sequence as h( Example: -101 0=30, 2) = 29. Probe sequence for Probe sequence for is: 1 2 is: 105
  106. Secondary Clustering Pseudo-random probing eliminates primary clustering. If two keys hash to the same slot, they follow the same probe sequence. This is called secondary clustering. To avoid secondary clustering, need probe sequence to be a function of the original key value, not just the home position. 106
  107. Double Hashing p( Be sure that all probe sequence constants (h2( )) are relatively prime to This will be true if is prime, or if and the constants are odd. Example: Hash table of size 0=30, 2 0=2, 2 Probe sequence for Is: 1 Probe sequence for 2 is: Probe sequence for is: 3 107
  108. Deletion Deleting a record must not hinder later searches. We do not want to make positions in the hash table unusable because of deletion. 108
  109. Tombstones (1) Both of these problems can be resolved by placing a special mark in place of the deleted record, called a tombstone. A tombstone will not stop a search, but that slot can be used for future insertions. 109
  110. Tombstones (2) Unfortunately, tombstones add to the average path length. Solutions: 1. Local reorganizations to try to shorten the average path length. 2. Periodically rehash the table (by order of most frequently accessed record). 110
  111. Indexing Goals: — Store large files — Support multiple search keys — Support efficient insert, delete, and range queries 111
  112. Terms(l ) Entry sequenced file: Order records by time of insertion. — Search with sequential search Index file: Organized, stores pointers to actual records. — Could be organized with a tree or other data structure. 112
  113. Terms(2) Primary Key: A unique identifier for records. May be inconvenient for search. Secondary Key: An alternate search key, often not unique for each record. Often used for search key. 113
  114. Linear Indexing Linear index: Index file organized as a simple sequence Of key/record pointer pairs with key values are in sorted order. Linear indexing is good for searching variable-length records. Linear Index 73 Database Records 52 98 37 42 114
  115. Linear Indexing (2) If the index is too large to fit in main memory, a second-level index might be used. 1 2003 5894 10528 Second Level Index 1 2001 2003 5688 5894 9942 10528 10984 Linear Index: Disk Pages 115
  116. Tree Indexing (1) Linear index is poor for insertion/deletion. Tree index can efficiently support all desired operations: — Insert/delete — Multiple search keys (multiple indices) — Key range search 116
  117. Tree Indexing (2) Difficulties when storing tree index on disk: — Tree must be balanced. — Each path from root to leaf should cover few disk pages. 000 0000 117
  118. 2-3 Tree (1 ) A 2-3 Tree has the following properties: 1. Anode contains one or two keys 2. Every internal node has either two children (if it contains one key) or three children (if it contains two keys). 3. All leaves are at the sarne level in the tree, so the tree is always height balanced. The 2-3 Tree has a search tree property analogous to the BST. 118
  119. 2-3 Tree (2) The advantage of the 2-3 Tree over the BST is that it can be updated at low cost. 18 33 23 30 48 119
  120. 2-3 Tree Insertion (1) 18 33 23 30 18 33 23 30 14 120
  121. 12 2-3 Tree 18 23 Insertion (2) 18 33 23 30 33 30 48 52 121
  122. 2-3 Tree Insertion (3) 23 30 122
  123. B-Trees (1 ) The B-Tree is an extension of the 2-3 Tree. The B-Tree is now the standard file organization for applications requiring insertion, deletion, and key range searches. 123
  124. 1. 2. 3. B-Trees (2) B-Trees are always balanced. B-Trees keep similar-valued records together on a disk page, which takes advantage Of locality Of reference. B-Trees guarantee that every node in the tree will be full at least to a certain minimum percentage. This improves space efficiency while reducing the typical number of disk fetches necessary during a search or update operation. 124
  125. B-Tree Definition A B-Tree of order has these properties: The root is either a leaf or has at least two children. Each node, except for the root and the leaves, has between //21 and children. All leaves are at the level in the tree, so the tree is always height balanced. A B-Tree node is usually selected to match the size of a disk block. A B-Tree node could have hundreds of children. 125
  126. B -Tree Search (1 ) Search in a B-Tree is a generalization of search in a 2-3 Tree. 1. Do binary search on keys in current node. If search key is found, then return record. If current node is a leaf node and key is not found, then report an unsuccessful search. 2. Otherwise, follow the proper branch and repeat the process. 126
  127. B + -Trees The most commonly implemented form of the B- Tree is the B+-Tree. Internal nodes of the B+-Tree do not store record - only key values to guild the search. Leaf nodes store records or pointers to records. A leaf node may store more or less records than an internal node stores keys. 127
  128. B+-Tree Example zoo 101215 18 19 20 21 22 23 30 31 33 45 47 48 50 52 128
  129. B+-Tree 10 12 23 33 48 (a) Insertion 10 12 23 101215 101215 18 20 21 18 33 48 18 20 21 23 31 33 45 47 (c) 23 30 31 (d) 33 48 50 (b) 48 50 52 33 45 47 48 50 52 129
  130. B+-Tree Deletion (1) zoo 101215 18 19 20 21 22 sos4 SS 23 30 31 sc 90 33 45 47 93 48 50 52 130
  131. B+-Tree Deletion a?? 101215 101518 (2) 48 18 23 18 19 20 21 22 19 23 192021 22 23 30 31 a?? 23 30 31 33 45 47 33 45 47 48 50 52 a?? 48 50 52 131
  132. B+-Tree Deletion (3) aoo 101215 101215 18 19 20 21 22 23 30 31 45 47 48 50 52 (a) zoo 33 45 47 18 19 20 21 22 23 30 31 (b) 48 50 52 45 47 48 50 52 132
  133. B-Tree Space Analysis (1 ) B+-Trees nodes are always at least half full. The B*-Tree splits two pages for three, and combines three pages into two. In this way, nodes are always 2/3 full. Asymptotic cost of search, insertion, and deletion of nodes from B-Trees is ). Base of the log is the (average) branching factor Of the tree. 133
  134. B-Tree Space Analysis (2) Example: Consider a B+-Tree of order 100 with leaf nodes containing 100 records. 1 level B + -tree: 2 level B-l--tree: 3 level B + -tree: 4 level B + -tree: Ways to reduce the number of disk fetches: Keep the upper levels in memory. Manage B+-Tree pages with a buffer pool. 134
  135. Graph Applications Modeling connectivity in computer networks Representing maps Modeling flow capacities in networks Finding paths from start to goal (Al) Modeling transitions in algorithms Ordering tasks Modeling relationships (families, organizations) 135
  136. Graphs A graph G = (V, E) consists of a set of vertices V, and a set of edges E, such that each edge in E is a connection between a pair of vertices in V. The number of vertices is written IVI, and the number edges is written IEI. 136
  137. Graphs (2) 137
  138. Paths and Cycles Path: A sequence Of vertices of length -1 with an edge from to for 1 < +1 A path is simple if all vertices on the path are distinct. A cycle is a path of length 3 or more that connects to itself. A cycle is simple if the path is simple, except the first and last vertices are the same. 138
  139. Connected Components An undirected graph is connected if there is at least one path from any vertex to any other. The maximum connected subgraphs of an undirected graph are called connected components. 139
  140. Ott (e) 1e1ues91d9H!u0 POIOO.I!Q
  141. Itt ? ?? (e) ? C Z LO 1e1ues91d9H!u0 l!Pl-lßl.P9lO9
  142. RepresentatiOn Costs Adjacency Matrix: Adjacency List: 142
  143. Graph ADT class Graph { // Graph abstract class public : // # of vertices virtual int n ( ) —O; virtual int e ( ) —O; // Return index of first, next neighbor virtual int first (int) —O; virtual int next (int, int) —O; // Store new edge virtual void set Edge (int, int, int) —O; // Delete edge defined by two vertices virtual void del Edge (int, int) —O; // Weight of edge connecting two vertices virtual int weight (int, int) —O; virtual int getMark (int) —O; virtual void setMark (int, int) —O; 143
  144. Graph Traversals Some applications require visiting every vertex in the graph exactly once. The application may require that vertices be visited in some special order based on graph topology. Examples: — Artificial Intelligence Search — Shortest paths problems
  145. Graph Traversals (2) To insure visiting all vertices: void graph Traverse (const Graph* G) { for (v=0; ( ) ; v++) // Initialize UNVISITED) ; for (v=0; ( ) ; v++) if (G->getMark (v) UNVISITED) doTraverse (G, v) 145
  146. Depth First Search (1) // Depth first search void DFS (Graph* G, int v) { // Take action PreVisit (G, v) ; VISITED) ; for (int w=G->first (v); ; G->next (v, w) ) w if (G->getMark (w) UNVISITED) DES (G, w); // Take action Post Visit (G, v) ; 146
  147. Depth First Search (2) Cost: + 14). 147
  148. Breadth First Search (1) Like DFS, but replace stack with a queue. — Visit vertex's neighbors before continuing deeper in the tree. 148
  149. Breadth First Search (2) void BF S (Graph* G, int start, Queue*Q) { int v, w; // Initialize Q Q->enqueue (start) • VISITED) ; { // Process Q while (Q->length ( ) 0) Q->dequeue (v) ; // Take action PreVisit (G, v) • for (w=G->first (v) ;wn ( ) ;w=G->next (v, w) ) if (G->getMark (w) UNVISITED) { VISITED) ; Q->enqueue (w) ; 149
  150. Breadth First Search (3) 150
  151. Topological Sort (1) Problem: Given a set of jobs, courses, etc., with prerequisite constraints, output the jobs in an order that does not violate any of the prerequisites. 151
  152. Topological Sort (2) void topsort (Graph* G) { // Topological sort int i; i++) // Initialize for (i =0; ; UNVISITED) ; i++) // Do vertices for (i =0; ; if (G->getMark (i) tophelp (G, i) ; UNVISITED) // // helper void tophelp (Graph* G, int v) { // Process v VISITED) ; for (int w=G->first (v); ; G->next (v, w) ) w if (G->getMark (w) UNVISITED) tophelp (G, w) ; // Post Visit for Vertex v printout (v) ; 152
  153. Topological Sort (3) 153
  154. Queue-Based Topsort void topsort (Graph* G, Queue* Q) { 1 nt Count [G->n ( ) ] ; int v, w; for v=0; vn O; /// s edges for v=0; vn for (w=G->first (v); ( ) ; G->next (v, w) ) w // Add to v 2' s count for (v=0; vn(); v++) // Initialize Q // No prereqs if (Count [v] 0) Q->enqueue (v) ' while (Q->length (5 0) Q->dequeue (v) ; // PreVisit for V rintout (v) ; or (w=G->first (v); ( ) ; Count [w] - if (Count [w] G->next (v, w) ) w / / One s pre re q 0) // Now free Q->enqueue (w) ; 154
  155. Shortest Paths Problems Input: A graph with weights or costs associated with each edge. Output: The list of edges forming the shortest path. Sample problems: — Find shortest path between two named — Find shortest path from to all other vertices — Find shortest path between all pairs of vertices Will actually calculate only distances. 155
  156. Shortest Paths Definitions d(A, B) is the shortest distance from vertex A to B. W(A, B) is the weight Of the edge connecting A to B. — If there is no such edge, then w(A, B) = 156
  157. Single-Source Shortest Paths Given start vertex , find the shortest path from to all other vertices. Try 1 : Visit vertices in some order, compute shortest paths for all vertices seen so far, then add shortest path to next vertex Problem: Shortest path to a vertex already processed might go through Solution: Process vertices in order of distance from 157
  158. Dijkstra's Algorithm Example Initial Process A Process C Process B Process D Process E A o 0 0 0 0 B 10 5 5 5 5 C 3 3 3 3 3 D 20 20 10 10 10 E 18 18 18 18 158
  159. Dijkstra's Implementation // Compute shortest path distances from s, // return them in D void Dij kstra (Graph* G, int* D, int s) int i, v, w; i++) { // Do vertices for (i =0; ; min Vertex (G, D) if (D [v] INFINITY) return; VISITED) ; for (w=G->first (v); ( ) ; G->next (v, w) ) w if (D [w] > (D [v] + G->weight (v, w)) ) D [v] + G->weight (v, w) ; 159
  160. Implementing min Vertex Issue: How to determine the next-closest vertex? (I.e., implement minVertex) Approach 1: Scan through the table of current distances. - Cost: + IEI) = Approach 2: Store unprocessed vertices using a min-heap to implement a priority queue ordered by D value. Must update priority queue for each edge. - Cost: + IEI)IogIVI) 160
  161. Approach 1 // Find min cost vertex int min Vertex (Graph* G, int* D) { int i, v; // Set v to an unvisited vertex for (i =0; ; if (G->getMark (i) UNVISITED) i; break; } // Now find smallest D value for (i++; ; if ( (G->getMark (i) UNVISITED) i; return v; 161
  162. Approach 2 void Dij kstra (Graph* G, int* D, int s) // v is current vertex 1 nt i v, w; Dij kElem temp; // Heap array DiJ kElem E [G->e ( ) ] ; 0; temp . temp . distance // In It i all ze hea arra E [01 temp • munheap etMark (v) VISITED) ; V}SITED) ; {NFINITY) return; if D [v] fov w=G->first (v) ; ( ) ; w=G->next (v, w) ) f Dlw] > (D [v] + w)) ) 1 temp . distance D rw]; temp . vertex w; / 7 Insert in heap H. insert (temp) ; 162
  163. All-Pairs Shortest Paths For every vertex e V, calculate d( Could run Dijkstra's Algorithm IVI times. Better is Floyd's Algorithm. Define a -path from to to be any path whose intermediate vertices all have indices less than 5 INF INF INF 163
  164. Floyd's Algorithm // Floyd's all-pairs shortest paths algorithm void Floyd (Graph* G) { // Store distances int D [G->n()]; i++) // Initialize for (int i =0; ; for (int 3=0; G->weight (i, j); // Compute all k paths for (int k=0; ( ) ; k++) for (int i =0; ; for (int 3=0; if (DC i] [j] > (DC i] [k] + [j l)) 164
  165. Minimal Cost Spanning Trees Minimal Cost Spanning Tree (MST) Problem: Input: An undirected, connected graph G. Output: The subgraph of G that 1) has minimum total cost as measured by summing the values of all the edges in the subset, and 2) keeps the vertices connected. 165
  166. MST Example 166
  167. Prim's MST Algorithm void Prim (Graph* G, int* D, int s) // Who's closest int V int i, w; i++) { // Do vertices for (i =0; ; int v min Vertex (G, D) VISITED) ; if (v ! s) AddEdgetoMST (V v) ; if (D [v] INFINITY) return; for (w=G->first (v); ( ) ; G->next (v, w) ) w if (D [w] > G->weight (v, w) ) G->weight (v, w) ; // Update dist v; // Update who it came from 167
  168. Alternate Implementation As with Dijkstra's algorithm, the key issue is determining which vertex is next closest. As with Dijkstra's algorithm, the alternative is to use a priority queue. Running times for the two implementations are identical to the corresponding Dijkstra's algorithm implementations. 168
  169. Kruskal's MST Algorithm (1) Initially, each vertex is in its own MST. Merge two MST's that have the shortest edge between them. — Use a priority queue to order the unprocessed edges. Grab next one at each step. How to tell if an edge connects two vertices already in the same MST? - Use the UNION/FIND algorithm with parent- pointer representation. 169
  170. Kruskal's MST Algorithm (2) Step 1 Process edge (C. D) Step 2 Process edge (E, F) Step 3 Process edge (C, F) 170
  171. Kruskal's MST Algorithm (3) Cost is dominated by the time to remove edges from the heap. — Can stop processing edges once all vertices are in the same MST Total cost: + log 14). 171