Fakultät für Mathematik und Naturwissenschaften

Work Package 2: Caching Services and Cloud Storage Integration

Caching is crucial for efficient usage of HPC systems. XCache is part of the XRootD storage ecosystem developed in the US and is a well-established high-performance caching service used in WLCG. The current operation model is tailored to the specific US environment and does not match the requirements of sites in Germany regarding setup, deployment, and monitoring. The goal is to provide flexible XCache deployment showcases and an XCache monitoring service. While XCache supports fewer features than dCache, it remains a necessary tool because it is usable even in the very restrictive environments typically found at Tier-1 HPC centers.

The second objective of this task area is to provide an interface layer for existing grid storage systems that allows access via the S3 cloud protocol. The S3 protocol is very popular in data science, particularly in non-HEP communities.

Finally, a Sync&Share service will be developed that allows for easy exchange and portability of results for publications.

  • This work package addresses several goals:

    • Establish a support infrastructure for XCache services in Germany. The current default deployment procedures provided by groups in the US are specific to the US environment and do not match the requirements and constraints of German sites. We plan to provide well-tested setup examples using lightweight containers. Another important component is a dedicated monitoring service that collects and aggregates the information of the XCache instances, which is crucial to operate and optimize the XCache services running at the participating sites.

    • Providing an S3 layer on top of existing storage infrastructures is crucial. The use of cloud-native storage systems, along with dedicated community-based systems, is an important step to equip research communities with the necessary capabilities to bridge into federated storage and other federated infrastructures. S3 storage bindings are widely used in various applications, including big data analytics (for hosting CSV, Parquet, or HDF5 files), CI/CD workflows (for hosting artifacts and Docker containers), and ML/AI pipelines (for storing models and results). The use of S3 storage is also significant for related projects, such as the ongoing PUNCH4NFDI project. REANA workflows using S3 interfaces have already been explored and will be an important feature for the Analysis Facilities project planned in ErUM-Data.

    • Sync&Share storage. In scientific analyses, one has to deal not only with large primary datasets but also with many small data files, such as the output of processing steps, source files, Jupyter notebooks, and ML training data, and scientists need to share these data with colleagues and/or use them on different platforms. The goal is to provide such a Sync&Share storage service based on Nextcloud and to adapt it to the requirements of our partner project FAIRUM regarding functionality and AAI services.

  • This work package is structured into different task areas.

    • Task area 1: Orchestration, monitoring and feature evaluation of the XCache proxy server

      • XCache’s official deployment strategies, primarily using Kubernetes and the SLATE tool (tested by the Wuppertal group during FIDIUM), are not feasible for Germany’s NHR centers. Instead, new strategies will be needed with a focus on integrating XCache into NHR infrastructure. This will likely require services to migrate to lightweight containers, such as Apptainer, which is commonly supported at HPC centers, or to simple, systemd-based services, which can be installed on any virtual or physical machine.
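
        As an illustration of the container-based approach, the following minimal Python launcher starts an XCache instance via Apptainer and could itself serve as the ExecStart of a simple systemd unit. This is only a sketch; the image path, configuration file, and cache directory are assumptions.

        #!/usr/bin/env python3
        """Launcher sketch: run XCache (an xrootd proxy/cache) inside an
        Apptainer container. All paths below are illustrative."""
        import subprocess
        import sys

        IMAGE = "/opt/images/xcache.sif"    # hypothetical container image
        CONFIG = "/etc/xrootd/xcache.cfg"   # hypothetical XCache configuration
        CACHE_DIR = "/scratch/xcache"       # local disk space used as the cache

        cmd = [
            "apptainer", "exec",
            "--bind", f"{CACHE_DIR}:{CACHE_DIR}",  # expose the cache space
            "--bind", f"{CONFIG}:{CONFIG}:ro",     # mount the config read-only
            IMAGE,
            "xrootd", "-c", CONFIG,                # start xrootd with the cache config
        ]
        sys.exit(subprocess.run(cmd).returncode)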

        In addition to deployment, monitoring is essential for data caching systems, especially within global data management frameworks. A lightweight monitoring system will be developed to track performance, optimize cache hit rates using Rucio data, and ensure availability. Collaboration with Work Package 1 on monitoring integration is planned.
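
        A possible starting point for the collector is XRootD's built-in summary monitoring, which can periodically send statistics as XML datagrams (configured with the xrd.report directive). The sketch below receives and decodes such reports; the port and the exact statistics fields are assumptions that depend on the XRootD version and configuration.

        """Lightweight collector sketch for XRootD summary-monitoring reports."""
        import socket
        import xml.etree.ElementTree as ET

        HOST, PORT = "0.0.0.0", 9931  # hypothetical report destination

        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.bind((HOST, PORT))

        while True:
            data, addr = sock.recvfrom(65535)  # one XML report per datagram
            try:
                root = ET.fromstring(data.decode())
            except ET.ParseError:
                continue                        # skip malformed reports
            site = root.get("site", addr[0])
            # Here the selected metrics would be forwarded to the aggregation
            # backend (e.g. a time-series database shared with Work Package 1).
            for stats in root.iter("stats"):
                print(site, stats.get("id"), {c.tag: c.text for c in stats})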

        Finally, we plan to evaluate extended functionalities of the XCache service, namely operating it as a caching service for the WebDAV/HTTP protocol and as an S3 caching service. These features have recently been added or are still under development, and they are potentially promising for the task areas below (’S3 frontend’ and ’Sync&Share’).
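
        As a simple functional test of the HTTP mode, a file can be requested through the cache endpoint instead of the origin; the first request populates the cache, and repeated reads should then be served locally. Host name, port, and path convention in this sketch are assumptions.

        """Fetch a file through a (hypothetical) XCache HTTP(S) endpoint."""
        import requests

        CACHE = "https://xcache.example.de:8443"       # hypothetical cache endpoint
        ORIGIN_PATH = "/store/data/example/file.root"  # file path as known at the origin

        r = requests.get(f"{CACHE}{ORIGIN_PATH}", stream=True, timeout=60)
        r.raise_for_status()
        nbytes = sum(len(chunk) for chunk in r.iter_content(1 << 20))
        print(f"read {nbytes} bytes via the cache")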

    • Task area 2: S3 Frontend for POSIX-compliant Storage

      • This task focuses on implementing a lightweight, easily deployable S3 layer on top of existing POSIX-compliant storage, enabling scientists to use cloud-native tools while still interoperating with users who access the backend storage directly. The S3 protocol, developed by Amazon, is a de-facto standard for cloud-storage access and is supported by a wide range of open-source tools. It allows access from all operating systems and all major programming languages used in scientific computing. The setup aims for seamless deployment by local admins managing systems such as Lustre, GPFS, or dCache, without altering the existing storage systems, thereby enhancing the integration between HPC workflows and remote data access.
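
        From the user's perspective, access through such a layer would look like any other S3 endpoint, as in the following sketch; the endpoint URL, bucket name, and credentials are placeholders.

        """List and download files that physically live on the POSIX backend,
        using standard S3 tooling (boto3)."""
        import boto3

        s3 = boto3.client(
            "s3",
            endpoint_url="https://s3.example-hpc.de",  # hypothetical gateway endpoint
            aws_access_key_id="ACCESS_KEY",
            aws_secret_access_key="SECRET_KEY",
        )

        # A POSIX directory tree is exposed as a bucket/prefix structure.
        for obj in s3.list_objects_v2(Bucket="project-data",
                                      Prefix="results/").get("Contents", []):
            print(obj["Key"], obj["Size"])

        s3.download_file("project-data", "results/summary.parquet", "summary.parquet")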

        This goal will be achieved by using or adapting an open-source S3 protocol implementation capable of mapping POSIX files to S3 objects while keeping modifications on either side transparent. The Versity S3 Gateway (github.com/versity/versitygw) is a strong candidate, but other options will also be evaluated by LMU.

        The MinIO S3 object store at AIP will serve as a compatibility reference. While MinIO lacks POSIX interoperability, it is highly S3-compliant. AIP will run benchmarks based on the StarHorse project, which uses S3 storage and Parquet files, to assess the suitability of the candidate solutions. Deployment and packaging of the best solution will follow.
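
        A benchmark in the spirit of the StarHorse use case could time a column scan over a Parquet file read directly from S3, as sketched below; endpoint, bucket, file, and column names are assumptions.

        """Time a single-column scan of a Parquet file on S3 (pyarrow)."""
        import time
        import pyarrow.dataset as ds
        import pyarrow.fs as pafs

        fs = pafs.S3FileSystem(
            endpoint_override="https://s3.example-hpc.de",  # gateway or MinIO endpoint
            access_key="ACCESS_KEY",
            secret_key="SECRET_KEY",
        )

        t0 = time.perf_counter()
        table = ds.dataset("starhorse/catalog.parquet", filesystem=fs,
                           format="parquet").to_table(columns=["source_id"])
        print(f"{table.num_rows} rows in {time.perf_counter() - t0:.2f} s")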

        We will analyze S3 requirements for read/write functionality on supported backend storage systems. While read-only access avoids mapping S3 credentials to user accounts, write access requires translating user accounts into S3 access keys for interoperability. Additionally, roles and access rights must align with S3 equivalents.
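
        The following toy sketch illustrates the two sides of this mapping problem: issuing per-account S3 key pairs and reducing POSIX permission bits to the coarser S3 actions. It is purely illustrative; in practice the mapping would live in the gateway's identity and access layer.

        """Toy mapping of POSIX accounts and permission bits to S3 concepts."""
        import pwd
        import secrets

        def issue_s3_credentials(username: str) -> dict:
            """Create an S3 key pair bound to an existing POSIX account."""
            account = pwd.getpwnam(username)  # raises KeyError for unknown accounts
            return {
                "uid": account.pw_uid,
                "access_key": f"{username.upper()}-{secrets.token_hex(4).upper()}",
                "secret_key": secrets.token_urlsafe(30),
            }

        def s3_actions_from_mode(mode_bits: int) -> list:
            """Translate POSIX r/w bits into the closest S3 action sets."""
            actions = []
            if mode_bits & 0o4:  # read bit
                actions += ["s3:GetObject", "s3:ListBucket"]
            if mode_bits & 0o2:  # write bit
                actions += ["s3:PutObject", "s3:DeleteObject"]
            return actions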

        Our goals include optimizing S3 data access, ensuring security and integrity, and addressing user mapping across systems. Efficient handling of S3’s parallel uploads and adapting to varying POSIX compliance levels, particularly dCache’s file modification restrictions, are priorities. Development will leverage existing solutions by LMU and AIP.
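
        Parallel uploads enter through S3's multipart mechanism, in which a large file is split into independently transferred parts that the backend must reassemble; this is exactly where dCache's restriction on modifying closed files becomes relevant. A client-side sketch (endpoint and names are placeholders):

        """Multipart upload with explicit part size and concurrency (boto3)."""
        import boto3
        from boto3.s3.transfer import TransferConfig

        s3 = boto3.client(
            "s3",
            endpoint_url="https://s3.example-hpc.de",  # hypothetical gateway endpoint
            aws_access_key_id="ACCESS_KEY",
            aws_secret_access_key="SECRET_KEY",
        )

        # 64 MiB parts, up to 8 parts in flight at once.
        cfg = TransferConfig(multipart_chunksize=64 * 1024 * 1024, max_concurrency=8)
        s3.upload_file("training-set.h5", "project-data", "ml/training-set.h5",
                       Config=cfg)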

    • Task area 3: Sync&Share service

      • This task area aims to implement a Sync&Share service for scientific data analyses that in particular matches the requirements of our partner project FAIRUM.

        The goal is to offer dedicated storage that can be used to access data shared with or by scientists and their collaborators, e.g. small data sets, analysis results, and plots. Shared data should be available from Jupyter notebooks and through a web-browser based graphical interface. For the integration with Jupyter, the directory should be automatically mounted and unmounted, and the status of the synchronization must be visible to the user. The sharing should be easily configurable through the same interface. Ideally, direct POSIX access should be possible; if not, POSIX access should be facilitated through Linux FUSE mounts.
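
        One possible realization of the FUSE fallback is an rclone mount of the Sync&Share WebDAV endpoint into the user's Jupyter workspace, as sketched below; the pre-configured remote name "nextcloud:" and the mount point are assumptions.

        """Mount a Sync&Share WebDAV endpoint via rclone/FUSE for POSIX access."""
        import subprocess
        from pathlib import Path

        mount_point = Path.home() / "SyncShare"
        mount_point.mkdir(exist_ok=True)

        # After this, Jupyter sees the shared files as ordinary POSIX paths.
        subprocess.run([
            "rclone", "mount", "nextcloud:", str(mount_point),
            "--vfs-cache-mode", "writes",  # buffer writes locally before upload
            "--daemon",                    # detach and keep the mount alive
        ], check=True)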

        The service will be based on Nextcloud, which is already in use at some of the participating institutes. Data will be stored in a central Nextcloud instance and can be accessed using federated AAI, employing the methods developed by the FAIRUM consortium.

        The existing Sync&Share service at DESY is based on Nextcloud and dCache. Transparent direct access through dCache via NFS and WebDAV (bypassing Nextcloud) will be developed; this will substantially improve throughput and latency. For users and facilities outside of DESY, direct access through the existing dCache WebDAV interface is possible. Both functionalities will be developed and rolled out at the DESY Analysis Facility. Direct access to data shared through Nextcloud is currently not possible and will be developed by DESY.
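
        Once a file shared through Nextcloud can be resolved to its dCache location, a client could read it straight from the dCache WebDAV door instead of routing the data through the Nextcloud server, roughly as follows; host, path, and the token-based authentication are assumptions.

        """Read a shared file directly from a dCache WebDAV door."""
        import requests

        WEBDAV_DOOR = "https://dcache.example.de/webdav"  # hypothetical door URL
        PATH = "/users/alice/shared/results.csv"

        # A bearer token (e.g. a dCache macaroon or an OIDC token) is assumed.
        r = requests.get(f"{WEBDAV_DOOR}{PATH}",
                         headers={"Authorization": "Bearer <token>"}, timeout=60)
        r.raise_for_status()
        with open("results.csv", "wb") as f:
            f.write(r.content)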

        For the Sync&Share service, one has to deal with the many different aspects required to ensure the propagation of users and their varying roles and access rights (i.e., POSIX accounts) across the different file systems and domains, and to assess the related security questions. Similar issues arise in task area 2 for read-write access via the S3 protocol; the work here will therefore build on those results.