
Scheduling of Scientific Workflows on Data Grids

Suraj Pandey and Rajkumar Buyya
Grid Computing and Distributed Systems (GRIDS) Laboratory
Department of Computer Science and Software Engineering
The University of Melbourne, Australia
{spandey,raj}@csse.unimelb.edu.au

Abstract
Selection of resources for the execution of scientific workflows in data grids becomes challenging with the exponential growth of files resulting from the distribution of scientific experiments around the world. With more runs of these experiments, the huge number of data files produced can be made available from numerous resources. There is a lack of work on the optimal selection of data-hosts and compute resources in the presence of replicated files for scientific workflows. Foreseeing this, the thesis work aims to propose novel workflow scheduling algorithms for data grids with a large number of replicated files that incorporate practical constraints of heterogeneous environments such as Grids. In this paper, we define the workflow scheduling problem in the context of data grids, supported by motivating applications; list research issues arising from practical constraints; propose two algorithms for experimenting with the problem; and report simulation results obtained from preliminary studies. The results are promising enough to motivate further research on the stated problem.
 
1 Introduction
Scientific experiments such as the Compact Muon Solenoid (CMS) experiment for the Large Hadron Collider (LHC) at CERN, the Laser Interferometer Gravitational-Wave Observatory (LIGO) science runs, and the projects at the Grid Physics Network produce data on the scale of petabytes. These experiments are usually represented using directed acyclic graphs (DAGs), called workflows, in which jobs are linked according to their flow dependencies. A workflow is called compute-intensive when the computational needs of its individual jobs are high; similarly, it is called data-intensive when its data requirements are high. The scheduling objective is typically to find a mapping for which the make-span and cost of execution are minimized. Cao et al. [3] have demonstrated a Data Monitoring Toolkit (DMT) with the LIGO Data Grid (LDG). We are motivated to design more sophisticated algorithms for scheduling applications like Advanced LIGO.
Figure 1: Example workflow structures.
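To make the workflow model concrete, the following minimal sketch shows one way a workflow DAG with job dependencies and input files might be represented; the job names, file names and runtime estimates are hypothetical and not taken from the paper.

```python
# Illustrative sketch (not from the paper): a minimal representation of a
# workflow DAG in which each job lists the jobs it depends on and the
# input files it consumes. Names and sizes are hypothetical.
workflow = {
    # job_id: {"parents": [...], "inputs": [logical file names], "runtime_est": seconds}
    "align":    {"parents": [],          "inputs": ["anatomy.img"], "runtime_est": 1200},
    "reslice":  {"parents": ["align"],   "inputs": ["align.out"],   "runtime_est": 900},
    "softmean": {"parents": ["reslice"], "inputs": ["reslice.out"], "runtime_est": 600},
}

def topological_order(wf):
    """Return the jobs so that every job appears after all of its parents."""
    order, seen = [], set()
    def visit(job):
        if job in seen:
            return
        for parent in wf[job]["parents"]:
            visit(parent)
        seen.add(job)
        order.append(job)
    for job in wf:
        visit(job)
    return order

print(topological_order(workflow))  # ['align', 'reslice', 'softmean']
```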

Magnetic Resonance Imaging (MRI) uses radio waves and a strong magnetic field to provide detailed images of internal organs and tissues. Functional MRI (fMRI) is a procedure that uses MRI to measure signal changes in an active part of the brain. A typical study of fMRI data requires a multi-stage process that begins with the preprocessing of raw data and concludes with a statistical analysis.

Hundreds or even thousands of images are involved in such processes. Population-based atlas creation [18] is one of the major fMRI research activities. Figure 1b shows the workflow structure employing the Automated Image Registration (AIR) and FSL suites for creating population-based brain atlases from high-resolution anatomical data [24]. A proper scheduling algorithm is required to select the image files from appropriate data-hosts and execute the jobs on selected compute-hosts. When executing these application workflows, the required data can be retrieved from several data-hosts, as replicas of the data files exist.

Among the large number of data-hosts, the set of data-hosts that hold the files required by a job should be found such that the repeated transfer of these huge files from one site to another is minimized. The problem of finding the set of data-hosts that provides all the files in optimal time is NP-complete [21]. Data has to be staged in before any job can be executed. During or at the end of execution, output data is produced that may be of similar size to the input data. This intermediate data should be stored for the subsequent jobs that require it, and the hosts storing it can be treated as new sources of data, depending on the policy for retaining or deleting the temporary data. The total number of data-hosts thus increases as intermediate output files are stored, and the selection problem becomes more complex.
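As an illustration of the data-host selection step described above, the following sketch uses a greedy, set-cover-style heuristic: repeatedly pick the host that covers the most still-missing data, favouring hosts with higher bandwidth. This is only an assumed approximation for exposition, not the algorithm proposed in this thesis; all host names, file sizes and bandwidths are hypothetical.

```python
# Greedy sketch (an assumption, not the authors' algorithm) of data-host
# selection: choose a small set of hosts that together provide all files a
# job needs, preferring hosts that cover more still-missing bytes and have
# higher bandwidth. Host names, file sizes and bandwidths are hypothetical.

required = {"f1": 20e9, "f2": 5e9, "f3": 12e9}      # logical file -> size (bytes)
hosts = {                                           # host -> (files held, bandwidth B/s)
    "dh1": ({"f1", "f2"}, 100e6),
    "dh2": ({"f2", "f3"}, 50e6),
    "dh3": ({"f3"},       200e6),
}

def select_data_hosts(required, hosts):
    uncovered, chosen = set(required), {}
    while uncovered:
        def score(item):
            _, (files, bw) = item
            gain = sum(required[f] for f in files & uncovered)
            return gain * bw                        # bytes covered, weighted by bandwidth
        host, (files, bw) = max(hosts.items(), key=score)
        newly = files & uncovered
        if not newly:
            raise ValueError("required files are not fully replicated on known hosts")
        chosen[host] = newly
        uncovered -= newly
    return chosen

print(select_data_hosts(required, hosts))
# e.g. {'dh1': {'f1', 'f2'}, 'dh3': {'f3'}}
```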

The computation requirements of these jobs are also very high. After the set of candidate data-hosts is found, the jobs need to be mapped to appropriate compute-hosts for execution. The mapping of these jobs depends on the objective function. Scheduling of the jobs in a workflow primarily focuses on one of several objective functions, or a combination of them: for example, minimizing the total make-span, minimizing the overall cost of execution and data transfer, or meeting deadline and budget constraints. Mapping the workflow jobs to minimize one of these objective functions is a complex subproblem of the general job scheduling problem, as stated in Section 5. The problem becomes even more complex when data sets are replicated and jobs require more than a single file.
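The mapping step can be illustrated with a simple cost model: for each candidate compute-host, estimate the finish time of a job as the stage-in time of its inputs from the chosen data-hosts plus its execution time, and pick the minimum. The model and all names below are assumptions made for illustration, not the objective function used in our algorithms.

```python
# Assumed cost model for illustration: finish time on a compute-host is the
# time to stage in the input files from their chosen data-hosts plus the
# execution time. All names, sizes and rates are hypothetical.

def estimated_finish_time(job_len_mi, input_files, compute_host, placement, ready_time):
    """job_len_mi: job length in million instructions.
    input_files: {logical name: size in bytes}.
    compute_host: {"mips": ..., "bandwidth_to": {data_host: B/s}}.
    placement: {logical name: data-host chosen for that file}.
    """
    stage_in = sum(size / compute_host["bandwidth_to"][placement[f]]
                   for f, size in input_files.items())
    execution = job_len_mi / compute_host["mips"]
    return ready_time + stage_in + execution

compute_hosts = {
    "ch1": {"mips": 2000, "bandwidth_to": {"dh1": 100e6, "dh3": 20e6}},
    "ch2": {"mips": 1000, "bandwidth_to": {"dh1": 10e6,  "dh3": 200e6}},
}
inputs = {"f1": 20e9, "f3": 12e9}
placement = {"f1": "dh1", "f3": "dh3"}

best = min(compute_hosts, key=lambda ch: estimated_finish_time(
    4_000_000, inputs, compute_hosts[ch], placement, ready_time=0.0))
print(best)  # 'ch1' for these hypothetical numbers
```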

3 Related Work 
In order to find the locations of replicated files, the Local Replica Catalog (LRC) of the Replica Location Service (RLS) [5] provides information about the data available at a resource. RLS is a distributed replica management system consisting of local catalogs, which contain logical-to-physical filename mappings, and distributed indexes, which summarize the local catalog contents. Information about the state of the resources can be obtained via the Monitoring and Discovery Service (MDS) [9], which provides the number and type of available resources, processor characteristics, and the amount of available memory [23].
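The idea behind such a catalog can be sketched with a toy in-memory structure: local catalogs map logical file names to physical replica locations, and an index records which catalogs know about which logical names. This is not the actual RLS API, only an assumed illustration with hypothetical names.

```python
# Toy sketch of the idea behind a replica catalog (not the actual RLS API):
# local catalogs map logical file names to physical replica locations, and a
# simple index records which catalogs hold a mapping for each logical name.
local_catalogs = {
    "lrc_melbourne": {"f1": ["gsiftp://dh1.example.org/data/f1"]},
    "lrc_cern":      {"f1": ["gsiftp://dh2.example.org/store/f1"],
                      "f3": ["gsiftp://dh2.example.org/store/f3"]},
}
index = {}                                  # logical name -> catalogs holding a mapping
for catalog, mapping in local_catalogs.items():
    for lfn in mapping:
        index.setdefault(lfn, []).append(catalog)

def locate_replicas(lfn):
    """Return every known physical location for a logical file name."""
    return [pfn for catalog in index.get(lfn, [])
                for pfn in local_catalogs[catalog][lfn]]

print(locate_replicas("f1"))
```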

The work done by Shankar et al. [17] is most closely related to ours. While producing a schedule for the jobs in a workflow, they propose a planner that uses a file location table to determine the locations of cached or replicated files. They take the volume of data into account while scheduling and use a caching mechanism for future use, similar to our notion of re-use. However, they do not consider the best location from which to fetch the data. Moreover, their scheduling targets cluster management systems and is not suitable for heterogeneous environments such as Grids, where instantaneous data replication and global search are not feasible.

Considerable work has been done by Deelman et al. [6, 7] on planning, mapping and data-reuse in workflow scheduling. Their Concrete Workflow Generator (CWG) queries the Transformation Catalog to determine whether the components are available in the execution environment and to identify their locations. However, CWG selects the execution location at random from among the returned locations, which does not necessarily give the best result. Similarly, if the input files are replicated at several locations, it selects the source location at random.

Zinn et al. [25] defined the mapping problem as the Task Handling Problem (THP). They make strong assumptions by considering in-trees and minimal series-parallel graphs with unit communication and execution costs, which is not the case in heterogeneous environments. It is not always possible to transform workflows into in-trees or minimal series-parallel graphs.

The task-to-host assignment heuristic given by Casanova et al. in the XSufferage [4] algorithm targets maximum file re-use by assigning a job to a cluster that already has the required file, provided the file is large relative to the available bandwidth on the cluster's network link. However, they do not consider dependent jobs or replicated files.

Topcuoglu et al. have designed a list scheduling algorithm called Heterogeneous Earliest Finish Time (HEFT). It assigns ranks to jobs according to both communication and computation costs and preserves job execution precedence. We have used a modified version of this algorithm for experimenting with our case. We chose not to test the Dynamic Critical Path (DCP) algorithm [11] due to its higher complexity.
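For reference, the rank computation at the heart of HEFT-style list scheduling can be sketched as follows: a job's upward rank is its average computation cost plus the maximum, over its children, of the average communication cost to that child plus the child's rank, and jobs are then considered in decreasing rank order, which preserves precedence. The small DAG and costs below are hypothetical, and our modifications to HEFT are not shown.

```python
# Sketch of the upward-rank computation used by HEFT-style list scheduling.
# rank(j) = avg_compute(j) + max over children c of (avg_comm(j, c) + rank(c)).
# The DAG and costs are hypothetical.
from functools import lru_cache

avg_compute = {"A": 10, "B": 8, "C": 12, "D": 6}      # average execution cost per job
children = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
avg_comm = {("A", "B"): 4, ("A", "C"): 3, ("B", "D"): 5, ("C", "D"): 2}

@lru_cache(maxsize=None)
def upward_rank(job):
    if not children[job]:                              # exit job
        return avg_compute[job]
    return avg_compute[job] + max(avg_comm[(job, c)] + upward_rank(c)
                                  for c in children[job])

schedule_order = sorted(children, key=upward_rank, reverse=True)
print(schedule_order)   # ['A', 'C', 'B', 'D'] for this example
```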

Ranganathan et al. [16, 15] have used dynamic replication strategies to improve data access. Significant performance improvements are achieved when scheduling is performed according to data availability while a dynamic replication strategy is also used, and locality of access should be exploited by selecting the appropriate replica. However, replication cannot be done instantaneously given the huge data sizes and bandwidth constraints.

In the work of Ramakrishnan et al. [14], workflow input data is staged dynamically and new data products are generated during execution. They focus on determining which data are no longer needed, and when, by adding nodes to the workflow that clean up data along the way. This minimizes disk space usage, since the intermediate files generated are of a similar order of size to the input files.
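The underlying idea can be sketched as follows: an intermediate file becomes removable once the last job that consumes it has finished, so a cleanup action can be attached after that job. This is an assumption about the general technique, made for illustration, not Ramakrishnan et al.'s algorithm; job and file names are hypothetical.

```python
# Sketch (assumed, for illustration) of deciding when intermediate files can
# be cleaned up: a file is removable after the last job that consumes it.
consumers = {                       # file -> jobs that read it
    "align.out":   ["reslice", "softmean"],
    "reslice.out": ["softmean"],
}
finish_order = ["align", "reslice", "softmean"]   # order in which jobs complete

def cleanup_points(consumers, finish_order):
    """Map each file to the job after which it can safely be deleted."""
    position = {job: i for i, job in enumerate(finish_order)}
    return {f: max(jobs, key=position.__getitem__) for f, jobs in consumers.items()}

print(cleanup_points(consumers, finish_order))
# {'align.out': 'softmean', 'reslice.out': 'softmean'}
```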

Venugopal and Buyya [21] have proposed a heuristic based on the Set Covering Problem (SCP) for selecting compute and data resources for data-intensive bag-of-tasks applications in an environment where data is replicated (SCP-MH). However, it does not consider storage constraints on the nodes, dependencies between jobs, or multiple workflows. The work presented in this proposal incorporates these concerns.
