Many scientific disciplines have become data-driven. As the authors of the book “Big Data: A Revolution That Will Transform How We Live, Work, and Think” point out: “The volume of information had grown so large that the quantity being examined no longer fit into the memory that computers use for processing, so engineers needed to revamp the tools they used for analyzing it all.” Scientific projects now generate data sets so enormous that automated analysis is required. Consider the Large Hadron Collider (LHC), operating at the international CERN Laboratory in Geneva, Switzerland. The unprecedented petascale data volumes generated at the LHC, the largest scientific instrument in the world, present new challenges to computer and data sciences: the raw data rate from an LHC detector reaches 1 PB/s, which translates into petabytes of data recorded world-wide on the Grid.
LHC computing uses workload management systems (WMS) such as AliEn, DIRAC and PanDA for data processing and analysis. The main goal of a WMS is to make distributed resources accessible to all users. A WMS at the LHC operates at a scale where almost every problem encountered is a research problem, since nothing like this has been done before. A WMS provides location transparency of processing and data access for High Energy Physics (HEP), other data-intensive sciences and the wider exascale community, and processes a diverse range of workloads. For example, the PanDA (Production and Distributed Analysis) WMS completes up to 30 million jobs every month at more than 100 sites, running on 130,000 cores worldwide and consuming ~0.2 petaflops at peak; PanDA processed 1.2 exabytes of data in 2013.
It has now become clear that current WMS cannot meet the needs of the Grid infrastructure and the scale of LHC computing. Recently, WMS development has moved toward the integration and unification of new types of computing resources. However, several objective factors prevent these systems from growing their user base and from forming a worldwide meta-network that provides access to extremely large computing resources at low initial cost.
First, the current version of the WMS has no knowledge of data placement, and no intelligent data transfer mechanism is integrated with the workload management system. This makes it difficult to apply the system to general problems with different data sources.
Second, the LHC will increase its petascale data rate five-fold in 2015 and ten-fold by 2020. As LHC data volumes grow, the underlying software stack used for file management is reaching certain limits: file transfer errors have become the main source of inefficiency in Big Data processing on the Grid. Development and integration of new Big Data management technologies with a re-engineered WMS is necessary to overcome these challenges.
Third, the WMS is focused solely on HEP. For example, it is used in the ATLAS (A Toroidal LHC ApparatuS) experiment for managing the workflow of all data processing on the WLCG (Worldwide LHC Computing Grid). Using it outside the LHC Computing Grid requires an improved architecture that provides a higher level of abstraction and modularity, standardized interfaces, and more convenient administration tools.
Fourth, support is limited for resources that provide high-performance data processing. The WMS offers no standard tools for high-performance parallel processing on HPC (High-Performance Computing) facilities, even though HPC could provide a substantial increase in overall performance.
The new project carried out at the National Research Centre "Kurchatov Institute" will reuse existing components of PanDA as much as possible and provide a new level of services for workload and data management. PanDA was chosen because it has demonstrated, at very large scale, the value of automated dynamic brokering of diverse workloads across distributed computing resources, built on well-proven, highly scalable, robust web technologies, and because it scales to the experiments' requirements.
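To illustrate what dynamic brokering of workloads means in practice, the following is a minimal sketch, not PanDA's actual brokerage code: jobs are matched to sites by data locality first (avoiding a transfer) and free capacity second. All class and field names here are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Site:
    name: str
    free_cores: int
    datasets: set  # datasets already resident at this site

@dataclass
class Job:
    job_id: int
    cores: int
    input_dataset: str

def broker(job, sites):
    """Pick a site that can run the job, preferring sites that already
    hold the input data, then sites with the most free capacity."""
    candidates = [s for s in sites if s.free_cores >= job.cores]
    if not candidates:
        return None  # job stays queued until resources free up
    # Sort by (data locality, free cores), best candidate first.
    candidates.sort(key=lambda s: (job.input_dataset in s.datasets,
                                   s.free_cores),
                    reverse=True)
    chosen = candidates[0]
    chosen.free_cores -= job.cores
    return chosen.name

sites = [Site("CERN-T0", 1000, {"run2012A"}),
         Site("Kurchatov-T1", 400, set())]
print(broker(Job(1, 8, "run2012A"), sites))  # CERN-T0 (data is resident there)
print(broker(Job(2, 8, "run2012B"), sites))  # CERN-T0 (no locality; most free cores)
```

A production broker weighs many more factors (site reliability, pilot availability, queue depth, memory limits), but the core idea is the same ranking of candidate sites per job.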
The new system, called MegaPanDA, will make the PanDA WMS available beyond LHC computing and HEP and extend it beyond the Grid. It will add support for large-scale data handling and HPC, as well as cloud and web-based computing services.
The technology base provided by MegaPanDA will enhance the use of a variety of available high-performance computing resources. We are now starting to integrate heterogeneous computing resources: the Grid (the Tier-1 centre at the Kurchatov Institute), a processing centre based on the Kurchatov Institute supercomputer (with a peak performance of 120 teraflops), and clouds (Amazon Elastic Compute Cloud (EC2), Google and other cloud resources). We will explore how MegaPanDA can manage computing jobs in such an extended infrastructure, with the possibility of dynamically acquiring supplementary CPU resources when needed.
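The dynamic acquisition of supplementary CPU resources could follow a simple elastic-scaling rule: when the backlog of queued jobs exceeds what local Grid/HPC slots can absorb, provision cloud nodes; release them as the queue drains. The sketch below is illustrative only (thresholds and the node-sizing model are assumptions, not the project's actual policy):

```python
def rescale(queued_jobs, local_free_slots, cores_per_node=8, max_cloud_nodes=100):
    """Return how many cloud nodes to keep provisioned, given the
    current backlog that local resources cannot absorb."""
    backlog = max(0, queued_jobs - local_free_slots)
    needed = -(-backlog // cores_per_node)   # ceiling division
    return min(needed, max_cloud_nodes)     # cap spending on cloud capacity

print(rescale(queued_jobs=500, local_free_slots=120))  # 48 nodes for the backlog
print(rescale(queued_jobs=50, local_free_slots=120))   # 0: local slots suffice
```

In a real deployment this decision would also account for instance start-up latency, job runtimes and per-provider cost, but the scale-with-backlog loop is the essence of elastic acquisition.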
In collaboration with CERN and the LHC universities and institutes, we will develop an intelligent data transfer mechanism integrated with the WMS to connect the data flow management and job control systems, using Big Data Technologies (BDT) to manage large data volumes and analyze them more efficiently.
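One decision an intelligent, WMS-integrated transfer mechanism must make is whether to move the job to the data or pre-stage the data to the job. A hedged sketch of that cost comparison follows; the cost model (transfer time versus an assumed slowdown for running remotely from the data) and all parameter names are illustrative assumptions:

```python
def plan(dataset_gb, transfer_gbps, job_cpu_hours, remote_slowdown=1.3):
    """Compare the cost (in hours) of staging the data against the
    penalty of running the job remotely from its input, and return
    the cheaper strategy. remote_slowdown models I/O over the WAN."""
    transfer_hours = dataset_gb * 8 / (transfer_gbps * 3600)
    remote_penalty_hours = job_cpu_hours * (remote_slowdown - 1)
    return "move-data" if transfer_hours < remote_penalty_hours else "move-job"

# A 500 TB dataset dwarfs a 100 CPU-hour job: ship the job to the data.
print(plan(dataset_gb=500_000, transfer_gbps=10, job_cpu_hours=100))    # move-job
# A 100 GB dataset feeding 5000 CPU-hours of work: pre-stage the data.
print(plan(dataset_gb=100, transfer_gbps=10, job_cpu_hours=5_000))      # move-data
```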
A central component of the PanDA architecture is its database, which at any given time reflects the state of all payloads; it is currently hosted on an Oracle RDBMS. We plan to improve the scalability and performance of the MegaPanDA monitor database by using a "noSQL" solution for the finalized, reference portion of the data, where consistency is not essential but sheer performance and scalability are.
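The hot/cold split described above can be sketched as follows. This is a toy model, not the MegaPanDA implementation: SQLite stands in for the Oracle RDBMS, and a plain dictionary stands in for the noSQL store; state names and schema are assumptions.

```python
import sqlite3, json

rdbms = sqlite3.connect(":memory:")   # stands in for the Oracle RDBMS
rdbms.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, state TEXT, payload TEXT)")
nosql = {}                            # stands in for a scalable noSQL store

FINAL_STATES = {"finished", "failed", "cancelled"}

def update_job(job_id, state, payload):
    """Live records stay in the transactional store; finalized records
    are immutable, so they move to the read-optimized noSQL archive."""
    rdbms.execute("INSERT OR REPLACE INTO jobs VALUES (?, ?, ?)",
                  (job_id, state, json.dumps(payload)))
    if state in FINAL_STATES:
        nosql[job_id] = {"state": state, **payload}
        rdbms.execute("DELETE FROM jobs WHERE id = ?", (job_id,))

def get_job(job_id):
    """Serve hot records from the RDBMS, cold ones from the archive."""
    row = rdbms.execute("SELECT state, payload FROM jobs WHERE id = ?",
                        (job_id,)).fetchone()
    if row:
        return {"state": row[0], **json.loads(row[1])}
    return nosql.get(job_id)

update_job(1, "running", {"site": "Kurchatov-T1"})
update_job(2, "finished", {"site": "CERN-T0"})
print(get_job(1))  # served from the RDBMS (still mutating)
print(get_job(2))  # served from the noSQL archive (finalized)
```

The design choice here is that strong consistency is paid for only while a record can still change; once a job reaches a final state, its record is reference data and can live in a store tuned purely for read throughput.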
MegaPanDA will be integrated as a computational kernel in Problem Solving Environments (PSE) that require analysis of large amounts of data. In the near future, MegaPanDA will be used for data processing in biology, astroparticle physics and geophysics.
Enhanced with the latest Big Data processing capabilities, MegaPanDA has an opportunity to become a flagship product for the world-class team of researchers being assembled at the National Research Centre "Kurchatov Institute". Designed for wider use among mega-science projects and built upon a state-of-the-art data management component, the Big Data Technologies project has a unique opportunity to achieve breakthrough, world-class research results relevant to fundamental science and industry.