
Saturday 10 December 2011

Scheduling of Scientific Workflows on Data Grids - I

4 Model
4.1 A Workflow Model

A workflow W is represented by a directed acyclic graph G = (V, E), where V = D ∪ C ∪ J, and E is the set of edges that maintain execution precedence constraints. A directed edge from jx to jy means that jy cannot start executing until jx has completed. A directed edge from dx to jx means that job jx cannot start executing until the required file is transferred, or made available, to the execution host running the job. The components are described as follows:
1. A set of jobs J = {j1, j2, . . . , jn}
2. A set of files F = {f1, f2, . . . , fn}
3. A set of compute-hosts C = {c1, c2, . . . , cn}
4. A set of data-hosts D = {d1, d2, . . . , dn}
A job jx requires a set of files Fx = {f1, f2, . . . , fn} to be staged in for execution. Within the set of files, we denote f^t_k as a temporary file and f^f_k as a fixed file, to distinguish files produced as a result of executing a job from files already hosted by data-hosts, respectively. A fixed file f^f_k may be hosted by multiple data-hosts.
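As an illustration, the workflow model above can be sketched in code. This is only a minimal sketch; the class and field names are our own and not from the paper:

```python
# Minimal sketch of the workflow DAG model: jobs, files, and
# precedence edges. Names here are illustrative, not from the paper.
from dataclasses import dataclass, field

@dataclass(frozen=True)
class File:
    name: str
    fixed: bool  # True: fixed file f^f_k hosted by data-hosts; False: temporary f^t_k

@dataclass
class Job:
    name: str
    requires: list                                     # files F_x staged in before execution
    predecessors: list = field(default_factory=list)   # jobs that must complete first

# Edge (j1 -> j2): j2 cannot start until j1 completes.
f1 = File("f1", fixed=True)
j1 = Job("j1", requires=[f1])
j2 = Job("j2", requires=[], predecessors=[j1])
```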
 
4.2 Resource Model
A compute resource is a high-performance computing platform, such as a cluster, which has a 'head' node that manages a batch job submission system. Each compute-host has its own storage-constrained data-host. There also exist data-hosts that serve storage purposes only and do not provide computation. The communication cost between a compute-host and its own data-host is local and thus minimal in comparison to the communication cost between different hosts [22].

Scheduling of Scientific Workflows on Data Grids

Suraj Pandey and Rajkumar Buyya
Grid Computing and Distributed Systems (GRIDS) Laboratory
Department of Computer Science and Software Engineering
The University of Melbourne, Australia
{spandey,raj}@csse.unimelb.edu.au

Abstract
Selection of resources for the execution of scientific workflows in data grids becomes challenging with the exponential growth of files resulting from the distribution of scientific experiments around the world. With more runs of these experiments, the huge number of data files produced can be made available from numerous resources. There is a lack of work on the optimal selection of data-hosts and compute resources in the presence of replicated files for scientific workflows. Foreseeing this, the thesis work aims at proposing novel workflow scheduling algorithms for data grids with large numbers of replicated files, incorporating the practical constraints of heterogeneous environments such as Grids. In this paper, we define the workflow scheduling problem in the context of data grids, supported by motivating applications; list research issues arising from practical constraints; propose two algorithms for experimenting with the problem; and report simulation results obtained from preliminary studies. The results are promising enough to motivate further research on the stated problem.
 
1 Introduction
Scientific experiments like the Compact Muon Solenoid (CMS) experiment for the Large Hadron Collider (LHC) at CERN, the Laser Interferometer Gravitational-Wave Observatory's (LIGO) science runs, and the projects at the Grid Physics Network produce data on the scale of petabytes. These experiments are usually represented using directed acyclic graphs (DAGs), called workflows, where jobs are linked according to their flow dependencies. A workflow is called compute-intensive when the computational needs of individual jobs are high. Similarly, a workflow is called data-intensive when the data requirements are high. Scheduling such workflows aims to ensure that the makespan and cost of execution are minimized. Cao et al. [3] have demonstrated a Data Monitoring Toolkit (DMT) with the LIGO Data Grid (LDG). We are motivated to design more sophisticated algorithms for scheduling applications like advanced LIGO.
Figure 1: Example workflow structures.

Magnetic Resonance Imaging (MRI) uses radio waves and a strong magnetic field to provide detailed images of internal organs and tissues. Functional MRI (fMRI) is a procedure that uses MRI to measure the signal changes in an active part of the brain. A typical study of fMRI data requires a multi-stage process that begins with the preprocessing of raw data and concludes with a statistical analysis.

Radix sort


            Let a be an array of n numbers. Our aim is to sort this array in ascending order using radix sort. The steps to be followed are
i) Initialize ten queues, namely q[0], q[1], …, q[9]. That is, set rear[i] = front[i] = NULL for all i = 0, 1, 2, …, 9.
ii) Scan the array from left to right and find the least significant digit of each array element.
iii) If it is 0, push the number into q[0]; if it is 1, push the number into q[1]; …; and if it is 9, push the number into q[9].
iv) After pushing all the elements into their respective queues, pop the data from the 0th queue through the 9th queue and restore them in the array a.
v) Scan the array from left to right again and find the second least significant digit of each array element.
vi) Repeat steps (iii) and (iv).
The number of repetitions of the above process depends on the width of the maximum element in the array. That is, if the maximum width of an array element is 3, then scan the array three times, using the 1st, 2nd and 3rd least significant digits in turn.
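The steps above can be sketched as follows; this is a minimal illustration for non-negative integers, using Python deques as the ten queues:

```python
from collections import deque

def radix_sort(a):
    """LSD radix sort: one pass per digit of the maximum element."""
    if not a:
        return a
    width = len(str(max(a)))                # number of passes = width of max element
    for p in range(width):
        queues = [deque() for _ in range(10)]   # q[0] .. q[9]
        for x in a:
            digit = (x // 10 ** p) % 10         # p-th least significant digit
            queues[digit].append(x)
        # Pop from q[0] through q[9] and restore into the array.
        a = [x for q in queues for x in q]
    return a

print(radix_sort([170, 45, 75, 90, 802, 24, 2, 66]))
# [2, 24, 45, 66, 75, 90, 170, 802]
```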

Example

Merge sort


            Let a be an array of n numbers. Our aim is to sort this array in ascending order using merge sort. The steps to be followed are
i) Divide the array into n sub-arrays of size 1.
ii) Merge adjacent pairs of sub-arrays: [a[1]] and [a[2]], [a[3]] and [a[4]], …, [a[n-1]] and [a[n]]. Now we will get n/2 sorted sub-arrays of size 2 or less.
iii) Merge adjacent pairs of these sub-arrays: [a[1], a[2]] and [a[3], a[4]], and so on. Now we will get n/4 sorted sub-arrays of size 4 or less.
iv) Repeat this process until there is only one sub-array of size n.
Now the array is in sorted order.
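The bottom-up process above can be sketched as follows; merging runs of width 1, 2, 4, … matches steps (i)-(iv):

```python
def merge_runs(left, right):
    """Combine two sorted lists into one sorted list."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i]); i += 1
        else:
            out.append(right[j]); j += 1
    out.extend(left[i:])    # whichever side still has elements
    out.extend(right[j:])
    return out

def merge_sort(a):
    """Bottom-up merge sort: merge adjacent runs of size 1, 2, 4, ..."""
    a = list(a)
    n = len(a)
    width = 1
    while width < n:
        for lo in range(0, n, 2 * width):
            mid = min(lo + width, n)
            hi = min(lo + 2 * width, n)
            a[lo:hi] = merge_runs(a[lo:mid], a[mid:hi])
        width *= 2
    return a

print(merge_sort([38, 27, 43, 3, 9, 82, 10]))
# [3, 9, 10, 27, 38, 43, 82]
```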
Example

Merge


            Merging is a process of combining two or more sorted arrays into a single sorted array.
Explanation
Suppose A[m] and B[n] are two sorted arrays. Merging is the process of combining these two arrays into a single sorted array C[m+n]. The steps to be followed are
i) Scan both arrays from left to right.
ii) Compare A[1] with B[1]. If A[1] is less than B[1], put A[1] as C[1] and compare A[2] with B[1], and so on; else put B[1] as C[1] and compare A[1] with B[2], and so on.
iii) Repeat step (ii) until either array becomes empty.
iv) If array A becomes empty, store all the remaining B elements, in the same order, in array C. Similarly, if array B becomes empty, store all the remaining A elements, in the same order, in array C.
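The steps above can be sketched as:

```python
def merge(A, B):
    """Merge sorted arrays A[m] and B[n] into a single sorted array C[m+n]."""
    C, i, j = [], 0, 0
    # Step ii: compare the front elements and copy the smaller one.
    while i < len(A) and j < len(B):
        if A[i] < B[j]:
            C.append(A[i]); i += 1
        else:
            C.append(B[j]); j += 1
    # Step iv: one array is empty; append the rest of the other, in order.
    C.extend(A[i:])
    C.extend(B[j:])
    return C

print(merge([1, 4, 9], [2, 3, 10, 12]))
# [1, 2, 3, 4, 9, 10, 12]
```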
Procedure

Selection sort


            Let A be an array of n elements A[1], A[2], …., A[n]. Our aim is to sort the numbers in the array in ascending order. The steps are as follows
1) Find the position of the smallest element in the list A[1], A[2], …, A[n], say k1, and interchange the values A[1] and A[k1].
2) Find the position of the smallest element in the list A[2], A[3], …, A[n], say k2, and interchange the values A[2] and A[k2].
3) Find the position of the smallest element in the list A[3], A[4], …, A[n], say k3, and interchange the values A[3] and A[k3].
Continuing in this way, the last step is:
n-1) Find the position of the smaller of A[n-1] and A[n], say kn-1, and interchange the values A[n-1] and A[kn-1].
Example -        Sort the array A using selection sort
                        A = (43, 72, 10, 23, 80, 1, 75)
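The steps above, applied to the example array, can be sketched as:

```python
def selection_sort(A):
    """Selection sort: in pass i, swap the smallest of A[i..n-1] into position i."""
    A = list(A)
    n = len(A)
    for i in range(n - 1):
        # Find the position k of the smallest element in A[i..n-1] ...
        k = min(range(i, n), key=lambda j: A[j])
        # ... and interchange the values A[i] and A[k].
        A[i], A[k] = A[k], A[i]
    return A

print(selection_sort([43, 72, 10, 23, 80, 1, 75]))
# [1, 10, 23, 43, 72, 75, 80]
```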
           

Insertion sort


            Let a be an array of numbers a[1], a[2] ….. a[n]. Our aim is to sort them in ascending order using insertion sort.
Principle
Scan the array a from a[1] to a[n]. For each r = 1, 2, …, n, take the element a[r] and insert it into its proper position in the previously sorted sub-array a[1], a[2], …, a[r-1]. This can be explained by the following steps.
i) For r = 1, the previously sorted sub-array is empty; therefore a[1] by itself is sorted.
ii) For r = 2, a[2] is inserted into its proper position in the previously sorted sub-array a[1], i.e., a[2] is inserted before or after a[1].
iii) For r = 3, a[3] is inserted into its proper position in the previously sorted sub-array a[1], a[2], i.e., a[3] is inserted before a[1], after a[2], or in between a[1] and a[2].
The above steps are repeated until the last element a[n] is inserted into its proper position in the sorted sub-array. Now the given array a is in sorted order.
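The steps above can be sketched as:

```python
def insertion_sort(a):
    """Insertion sort: insert each a[r] into the sorted sub-array a[0..r-1]."""
    a = list(a)
    for r in range(1, len(a)):
        key = a[r]
        i = r - 1
        # Shift elements of the sorted sub-array that are greater
        # than key one position to the right.
        while i >= 0 and a[i] > key:
            a[i + 1] = a[i]
            i -= 1
        a[i + 1] = key      # insert key into its proper position
    return a

print(insertion_sort([43, 72, 10, 23, 80, 1, 75]))
# [1, 10, 23, 43, 72, 75, 80]
```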
Example

Bubble sort


            Let a be an array of n numbers. Our aim is to sort the numbers in the array in ascending order. The principle is to scan the array sequentially n-1 times and in each scan do the following steps.
Scan – 1
i) Take the first element and compare it with the second element. If the first is smaller, do not interchange; otherwise interchange the two elements.
ii) Then compare the second element with the third element, interchanging them in the same way.
iii) This process continues until the (n-1)th element is compared with the nth element.
iv) At the end of the first scan, the largest element comes to the nth position.
Scan – 2
The same comparisons are repeated on the first n-1 elements; at the end of the second scan, the second largest element comes to the (n-1)th position. Each subsequent scan works on one fewer element.
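The full procedure, with each scan covering one fewer element, can be sketched as:

```python
def bubble_sort(a):
    """Bubble sort: n-1 scans; each scan bubbles the largest remaining
    element to the end of the unsorted prefix."""
    a = list(a)
    n = len(a)
    for scan in range(n - 1):
        for i in range(n - 1 - scan):
            if a[i] > a[i + 1]:     # interchange only when out of order
                a[i], a[i + 1] = a[i + 1], a[i]
    return a

print(bubble_sort([43, 72, 10, 23, 80, 1, 75]))
# [1, 10, 23, 43, 72, 75, 80]
```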



1 Sorting
General Concepts
            A file is an ordered collection of items. Each item in the file is called a record. The records are named r[1], r[2], …, r[n], where n is the size of the file.
            A key is associated with each record r[i]; it is through this key that the records in the file are identified.