Contemporary e-Science tasks often require data-intensive applications to be incorporated as a major part of a composite solution [1]. A huge amount of diverse data is available for scientific analysis, information retrieval and knowledge discovery. The growth in the volume and variety of available data, together with processing performance requirements, has given rise to the BigData area [2]. The BigData area brings together a set of practices, methods and technologies aimed at better management of large data arrays. Nevertheless, BigData still has many issues to be solved. One of the currently open questions is high-level, user-friendly task definition. Within the e-Science area, the common approach to describing complex domain-specific tasks is the workflow (WF). However, WF concepts usually exploit the data-to-code (D2C) approach (data is transferred to the executable services), whereas BigData solutions are usually based on the code-to-data (C2D) concept (executable modules are transferred to the data sites). This difference makes it difficult to integrate BigData solutions seamlessly into WF management systems (WFMS): existing solutions usually draw a strong conceptual border between the two approaches.
In the presented work we develop a solution that enables the end user to define high-level tasks that seamlessly combine the D2C and C2D approaches through knowledge-based support for simulation-based scientific exploration [3]. As the core part of the solution we developed a dynamic domain-specific language (DSL) that can adapt to a particular problem domain and can be integrated into a cloud computing environment to support the definition of BigData tasks.
Considering typical e-Science tasks and the BigData requirements that arise within them, the following features can be defined for the developed solution:
• A high-level semantic description of the available data, which allows various data sources to be integrated and gives the user the capability to work with domain-specific semantics.
• The semantic description should allow the data to be linked with the structure of the system being analyzed and thus support system-level scientific investigation.
• e-Science tasks are usually solved within the Simulation-Driven Approach (SDA). Data processing should be integrated with the simulation procedures a) by being incorporated into the high-level composite application; b) by performing simulation procedures in the C2D way within the data storage.
• Domain-specific languages (DSL) provide a powerful toolbox for high-level interaction with the user, but they are usually developed for a single predefined problem.
• The e-Science area covers a huge variety of domains and tasks. Thus the developed DSL should be dynamically adaptable to the particular problem domain and task.
The proposed solution is based on the idea of a knowledge-based expressive toolbox [4], in which a hierarchy of languages (with textual or graphical notation) is defined by a set of knowledge from different problem domains. The basic hierarchy of expressive tools includes five levels: L1 – services description; L2 – services composition (WF); L3 – simulation objects description; L4 – objects composition (description of the investigated system); L5 – high-level task definition.
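The five levels listed above can be written down as a small data structure. The following is only an illustrative sketch: the names ExpressiveLevel and lowering_path are hypothetical, and the assumption that a description is interpreted level by level from its starting level down to L1 is ours, not a statement about the actual toolbox implementation.

```python
from enum import IntEnum


class ExpressiveLevel(IntEnum):
    """Levels of the expressive-tool hierarchy as named in the text."""
    SERVICE_DESCRIPTION = 1            # L1
    SERVICE_COMPOSITION = 2            # L2 (WF)
    SIMULATION_OBJECT_DESCRIPTION = 3  # L3
    OBJECT_COMPOSITION = 4             # L4 (investigated system)
    TASK_DEFINITION = 5                # L5


def lowering_path(start: ExpressiveLevel) -> list:
    """Levels visited when a description at `start` is lowered to L1.

    Assumption: interpretation proceeds one level at a time, top-down.
    """
    return [ExpressiveLevel(n) for n in range(start, 0, -1)]


path = lowering_path(ExpressiveLevel.TASK_DEFINITION)
```

Under this assumption a high-level task definition (L5) passes through every lower level before it reaches concrete service descriptions.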
To extend this architecture, a parallel hierarchy was developed that describes the C2D part of the composite application built during high-level task interpretation. To implement this hierarchy, a dynamic DSL is being developed that serves to describe high-level BigData tasks. The C2D part of the composite application is interconnected with the D2C part in several ways within the mentioned hierarchical levels: a) the shared structure of the system is used to generate the general-purpose WF and to support data analysis; b) the dynamic dataset, which is updated during simulation and data analysis, is shareable, which allows the parts to be interconnected automatically; c) the distributed data storage can be used to transfer small datasets to the computational services in the D2C way; d) the cloud management system can control the deployment of portable software to the data storage nodes to support local runs for simulation purposes.
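Points c) and d) above imply a per-step choice between moving data and moving code. A minimal sketch of such a choice is given below. It is an assumption-laden illustration rather than the policy used in the described system: the Step record, the choose_placement function and the 64 MiB threshold are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Placement(Enum):
    D2C = "data-to-code"  # ship the dataset to the computational service
    C2D = "code-to-data"  # deploy the software to the data storage node


@dataclass(frozen=True)
class Step:
    name: str
    dataset_bytes: int
    portable: bool  # can the software be deployed to storage nodes?


# Hypothetical limit for what counts as a "small" dataset.
SMALL_DATASET_BYTES = 64 * 1024 * 1024


def choose_placement(step: Step, small_limit: int = SMALL_DATASET_BYTES) -> Placement:
    """Pick D2C for small datasets, C2D for large ones if the code is portable."""
    if step.dataset_bytes <= small_limit:
        return Placement.D2C
    if step.portable:
        return Placement.C2D
    # Large data but non-portable software: moving the data is the only option.
    return Placement.D2C


small = choose_placement(Step("plot", 10 * 1024, portable=True))
large = choose_placement(Step("aggregate", 10 * 1024**3, portable=True))
legacy = choose_placement(Step("legacy-model", 10 * 1024**3, portable=False))
```

A real policy would likely also weigh network bandwidth and the load on the storage nodes; the single size threshold is used here only to keep the example short.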
The dynamic DSL under development provides a flexible expressive tool for high-level definition of data analysis tasks. The knowledge-based technologies used to process the DSL structure make it possible to interpret the terms and structures defined by the semantics of a particular problem domain. The DSL is integrated with the existing cloud computing environment CLAVIRE (http://clavire.ru/) and extends the set of tools for development of composite applications.
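The idea of a DSL whose terms are resolved against domain knowledge at run time can be sketched as follows. This is a toy illustration under stated assumptions and does not reproduce the DSL or the CLAVIRE interfaces: the make_interpreter function, the flat vocabulary format and the example hydrology terms are all hypothetical.

```python
def make_interpreter(vocabulary):
    """Build an interpreter bound to one domain vocabulary.

    `vocabulary` maps a domain term to the identifier of the operation or
    dataset it denotes. Swapping the vocabulary adapts the same language
    front end to another problem domain.
    """
    def interpret(task):
        resolved = []
        for term in task.lower().split():
            if term not in vocabulary:
                raise KeyError(f"unknown term for this domain: {term!r}")
            resolved.append(vocabulary[term])
        return resolved

    return interpret


# Hypothetical vocabulary for an imagined hydrology domain.
hydrology = {
    "mean": "aggregate.mean",
    "level": "dataset.water_level",
}

interpret = make_interpreter(hydrology)
plan = interpret("mean level")
```

The point of the sketch is that the parsing logic stays fixed while the set of admissible terms comes from the knowledge base, so a term that is valid in one domain is rejected in another.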
Acknowledgements. The research was partly financially supported by the Government of the Russian Federation, Grant 074-U01. Data management facilities were developed within the project “Big data management for computationally intensive applications” (project #14613).
References
[1] T. Hey, S. Tansley, K. Tolle The fourth paradigm: data-intensive scientific discovery, 2009, p. 252.
[2] M.D. Assuncao [et al.] Big Data Computing and Clouds: Challenges, Solutions, and Future Directions // arXiv preprint, arXiv:1312.4722, 2013.
[3] P.A. Smirnov, S.V. Kovalchuk, A.V. Boukhanovsky Knowledge-Based Support for Complex Systems Exploration in Distributed Problem Solving Environments // Communications in Computer and Information Science. Knowledge Engineering and the Semantic Web, Vol. 394, 2013, pp. 147-161.
[4] S.V. Kovalchuk [et al.] Knowledge-based Expressive Technologies within Cloud Computing Environments // Advances in Intelligent Systems and Computing. Practical Applications of Intelligent Systems, Vol. 279, 2014, pp. 1-12.