This document is a proposal for an engagement to verify that a cloud deployment of the LSST Data Management (DM) system (including both Alert Production and Data Release Production) at Interim Data Facility (IDF) scale is feasible, measure its performance, determine its final discounted cost, and investigate more-native cloud options for handling system components that may be costly to develop or maintain.
1 Goals
A Proof of Concept (PoC) engagement has been identified to address the following use cases and goals:
- Leveraging the existing Rubin private 10 Gbit/s link between Chile and Miami, set up an interconnect between the Rubin and Google networks in Miami, then validate that we can upload image files of 7-12 GB from Chile to a Google Cloud Storage (GCS) bucket hosted in the us-central region, or to distributor hosts in Google Compute Engine (GCE)/Google Kubernetes Engine (GKE), on the cadence necessary for Alert Production (less than 7 seconds transmission time, repeated every 17 seconds).
- Set up the Alert Production and Data Release Production pipelines in Google Cloud Platform (GCP) and validate that they can use GCS as file storage and work properly, similar to what was documented in [1].
- Run desired batch jobs on preemptible instances and perform cost analysis for the Interim Data Facility (pre-Operations production) scale and for the 10-year Operations scale.
2 Target Architecture
Figure 1 shows the high-level architecture for the entire deployment. For this PoC, we will primarily test data transmission from Chile to cloud storage and then run Alert Production and DRP in GCP. The Rubin Science Platform and Qserv database were already tested in the 2018 PoC [2].
2.1 Alert Production
Alert Production will be executed on GKE nodes with GCE nodes as a backup. For the PoC, the target APDB will be PostgreSQL. Alert Distribution is not a key component for the PoC as it has already been tested in the cloud.
2.1.1 Transmission
Transmission of images from La Serena to Google Cloud is a key item to be tested. The goal is to test at full speed over a private 10 Gbit/sec interconnect between the Rubin international network and the Google Cloud network. This connection will be provisioned in Miami, where the two networks are present in the same building. The destination of the images will be Google Cloud Storage (for archiving, and, if sufficiently performant, for Prompt Processing) and the Prompt Processing execution nodes.
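As a rough feasibility check, assuming decimal gigabytes, a single stream, and no compression, transfer times at the nominal line rate can be computed directly; the smaller images fit the 7-second budget while the largest exceed it at line rate, so compression, parallelism, or protocol efficiency very close to line rate will matter:

```python
# Back-of-the-envelope transfer times at line rate (no protocol overhead assumed).
LINK_GBITS_PER_S = 10                    # private interconnect
for size_gb in (7, 12):                  # visit image size range from Section 1
    seconds = size_gb * 8 / LINK_GBITS_PER_S
    print(f"{size_gb} GB at {LINK_GBITS_PER_S} Gbit/s: {seconds:.1f} s (budget: < 7 s)")
# 7 GB -> 5.6 s (fits); 12 GB -> 9.6 s (exceeds the 7 s budget at line rate)
```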
2.1.2 Alert Pipeline
The ap_pipe pipeline will be used, along with the DECam data used for ap_verify.
2.1.3 End-to-End Demonstration
If possible, an end-to-end demonstration using images from ComCam at the Base and a limited processCcd pipeline will be executed.
2.2 Data Release Production
Data Release Production will execute on GCE nodes under HTCondor or under a managed GCP workflow service. Data will be taken from GCS and the Butler databases will be on PostgreSQL.
2.2.1 Datasets
We will run at HSC RC2 scale to start. If possible, we will scale up to PDR2 or a similarly sized dataset from DC2 to demonstrate that there are no bottlenecks at full pre-Ops production scale.
2.2.2 Pipelines
The full set of pipelines used for the biannual production runs will be exercised.
2.2.3 Middleware
Gen3 middleware will be used.
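For illustration, a minimal sketch of reading data through a Gen3 Butler backed by cloud storage; the repository URI and collection name here are hypothetical, and the exact retrieval call depends on the middleware version:

```python
from lsst.daf.butler import Butler

# Hypothetical repository: a Butler config whose datastore root points at the
# PoC bucket, with the registry in the PostgreSQL instance described above.
butler = Butler("s3://rubin-poc-butler/butler.yaml",
                collections=["HSC/RC2/defaults"])  # hypothetical collection

# Enumerate a few calibrated exposures to confirm registry and datastore work.
for ref in butler.registry.queryDatasets("calexp",
                                         collections=["HSC/RC2/defaults"]):
    print(ref.dataId)
    calexp = butler.get(ref)  # getDirect(ref) on older middleware versions
    break
```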
3 Proposed Phases
3.1 Phase 1a: PoC Project Onboarding
3.1.1 Owners
K.T. + Vinod/Dom/Flora
3.1.2 Milestones
Be able to complete project onboarding and test uploads via the public internet.
3.1.3 Tasks
- [Both] PoC project onboarding: Org, Billing, Credit, Project, IAM, VPC, GCS, etc.
- [Both] Plan interconnect details
- [LSST] Prepare testing data
- [Both] Create a GCS client using the Cloud SDK (see the sketch after this list).
- [LSST] Test upload to GCS:
  - gsutil
  - custom client
  - transfer services
- [LSST] Transfer small datasets to GCS
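As a reference point for the SDK-client task above, a minimal upload sketch using the google-cloud-storage Python library; the project, bucket, and object names are placeholders:

```python
from google.cloud import storage

# Placeholder names; the real project and bucket come out of PoC onboarding.
client = storage.Client(project="rubin-poc")
bucket = client.bucket("rubin-poc-raw")

blob = bucket.blob("comcam/test_exposure.fits")
blob.upload_from_filename("test_exposure.fits")  # single-stream upload
print(f"uploaded to gs://{bucket.name}/{blob.name}")
```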
3.2 Phase 1b: Set Up DRP
3.2.1 Owners
Hsin-Fang + Karan/Ross/Flora
3.2.2 Milestones
Be able to run DRP with GCS as file storage.
3.2.3 Tasks
- [Both] Port DRP and the Gen3 Data Butler to GCP, including any needed adaptations to boto and PostgreSQL (one possible path is sketched after this list)
- [LSST] Execute DRP at small scale
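One possible porting path, following the AWS PoC [1], is to keep the Butler's boto-based S3 datastore and point it at GCS's S3-compatible interoperability endpoint. A minimal sketch, assuming HMAC interoperability keys have been issued for a service account (all names are placeholders):

```python
import boto3

# GCS's interoperability (XML) endpoint speaks the S3 protocol; HMAC keys for
# a service account stand in for AWS credentials. All names are placeholders.
s3 = boto3.client(
    "s3",
    endpoint_url="https://storage.googleapis.com",
    aws_access_key_id="GOOG1EXAMPLEHMACID",      # placeholder HMAC access ID
    aws_secret_access_key="EXAMPLEHMACSECRET",   # placeholder HMAC secret
)
s3.upload_file("test_exposure.fits", "rubin-poc-raw", "comcam/test_exposure.fits")
print(s3.list_objects_v2(Bucket="rubin-poc-raw", Prefix="comcam/")["KeyCount"])
```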
3.3 Phase 1c: Set Up Alerts
3.3.1 Owners
K.T. + Dom/Karan/Ross/Flora
3.3.2 Milestones
Be able to run Alerts with GCS as file storage and/or with a custom distributor service.
3.3.3 Tasks
- [Both] Re-architect Alerts to use GCS and deploy it in GCP
- [LSST] Develop a custom distributor service for Alert input and deploy it in GCP (an illustrative sketch follows this list)
- [LSST] Execute Alert Production on pre-positioned test data
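The distributor service does not yet exist, so purely as an illustration of its shape, here is a minimal sketch of an HTTP receiver that lands incoming image data on local disk for the prompt pipeline to pick up; the framework, port, and paths are hypothetical choices, not a committed design:

```python
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer
from pathlib import Path

LANDING_DIR = Path("/data/incoming")  # hypothetical local scratch for the pipeline

class DistributorHandler(BaseHTTPRequestHandler):
    """Accept PUT /<name> with raw image bytes and write them to local disk."""

    def do_PUT(self):
        length = int(self.headers["Content-Length"])
        target = LANDING_DIR / Path(self.path).name  # flatten to a bare file name
        target.write_bytes(self.rfile.read(length))
        self.send_response(201)  # Created
        self.end_headers()

if __name__ == "__main__":
    LANDING_DIR.mkdir(parents=True, exist_ok=True)
    ThreadingHTTPServer(("", 8080), DistributorHandler).serve_forever()
```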
3.4 Phase 2a: Execute DRP at Scale and Perform Cost Analysis
3.4.1 Owners
Hsin-Fang + Dom/Karan/Ross/Flora
3.4.2 Milestones
Be able to execute desired batch jobs on preemptible instances and perform cost analysis.
3.4.3 Tasks
- [Both] Configure a GCE cluster for the desired batch jobs.
- [Google] Adjust quotas/limits for the PoC project to match the testing target.
- [Both] Perform any needed adaptations to HTCondor, including obtaining preemptible nodes (one possible provisioning call is sketched after this list)
- [Both] Perform cost analysis
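As a sketch of the "obtaining preemptible nodes" piece, the following uses the google-cloud-compute client to request a preemptible instance that could join an HTCondor pool; the machine type, image, and names are placeholders:

```python
from google.cloud import compute_v1

def create_preemptible_worker(project: str, zone: str, name: str) -> None:
    """Request a preemptible GCE instance that could join an HTCondor pool."""
    disk = compute_v1.AttachedDisk(
        boot=True,
        auto_delete=True,
        initialize_params=compute_v1.AttachedDiskInitializeParams(
            source_image="projects/debian-cloud/global/images/family/debian-11",
            disk_size_gb=50,
        ),
    )
    instance = compute_v1.Instance(
        name=name,
        machine_type=f"zones/{zone}/machineTypes/n2-standard-8",  # placeholder
        disks=[disk],
        network_interfaces=[
            compute_v1.NetworkInterface(network="global/networks/default")],
        scheduling=compute_v1.Scheduling(preemptible=True),  # the key setting
    )
    op = compute_v1.InstancesClient().insert(
        project=project, zone=zone, instance_resource=instance)
    op.result()  # block until provisioning completes

create_preemptible_worker("rubin-poc", "us-central1-b", "condor-worker-0001")
```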
3.5 Phase 2b: Network Validation
3.5.1 Owners
K.T./Jeronimo + Vinod/Flora
3.5.2 Milestones
Be able to upload 7-12 GB of data from Chile to a GCS bucket hosted in us-central in under 7 seconds, repeatedly (every 17 seconds).
3.5.3 Tasks
- [Both] Interconnect setup
- [LSST] Prepare testing data and hosts
- [LSST] Test upload to GCS via the interconnect (a timing harness for the custom-client path is sketched after this list):
  - gsutil
  - custom client
  - transfer services
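For the custom-client path, a minimal timing harness, assuming the transfer_manager module of the google-cloud-storage library is available (it parallelizes a single large upload across chunks); project, bucket, object, and file names are placeholders:

```python
import time
from google.cloud import storage
from google.cloud.storage import transfer_manager

client = storage.Client(project="rubin-poc")            # placeholder project
blob = client.bucket("rubin-poc-raw").blob("visit/raw_12GB.fits")

start = time.monotonic()
# Split the local file into chunks and upload them concurrently (XML multipart).
transfer_manager.upload_chunks_concurrently(
    "raw_12GB.fits", blob, chunk_size=256 * 1024 * 1024, max_workers=16)
elapsed = time.monotonic() - start
print(f"uploaded in {elapsed:.1f} s (target: < 7 s)")
```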
3.6 Phase 3: End-to-End Alerts
3.6.1 Owners
K.T. + Vinod/Karan/Ross/Flora
3.6.2 Milestones
Be able to run Alerts end-to-end with data from Chile.
3.6.3 Tasks
- [Both] Integrate the data transfer mechanism with Alerts (one plausible mechanism is sketched after this list)
- [LSST] Execute Alert Production on “live” test data
- [LSST] Stretch goal: execute the prompt calibration processing pipeline on live ComCam calibration data
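One plausible integration mechanism (an assumption, not a committed design) is GCS object notifications delivered via Pub/Sub, with a subscriber that triggers the alert pipeline as each image lands; the project and subscription names are placeholders and the pipeline launch is stubbed out:

```python
from google.cloud import pubsub_v1

def on_new_image(message: pubsub_v1.subscriber.message.Message) -> None:
    """Handle a GCS object notification and trigger alert processing."""
    attrs = message.attributes
    if attrs.get("eventType") == "OBJECT_FINALIZE":   # new object finished writing
        gcs_uri = f"gs://{attrs['bucketId']}/{attrs['objectId']}"
        print(f"new image {gcs_uri}; launching ap_pipe")  # stub: submit pipeline here
    message.ack()

subscriber = pubsub_v1.SubscriberClient()
subscription = subscriber.subscription_path("rubin-poc", "raw-image-events")  # placeholders
future = subscriber.subscribe(subscription, callback=on_new_image)
future.result()  # block, processing notifications as they arrive
```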
4 Success Criteria
- Validate that the architecture works on GCP and meets its required service levels.
- Validate that the deployment can scale from small (PoC/IDF) to full (10-year goal); we do not plan to test at the 10-year scale, but we want to gather enough data points to support a reasonable analysis.
- Validate that GCP is cost-efficient.
- Validate that GCP is easy to work with.
5 Reports and Conclusion
A report will be produced as a DMTN. A presentation will be given to Rubin Observatory Operations and Construction management and technical personnel summarizing the results and conclusions.
References
[1] DMTN-137. Hsin-Fang Chiang, Dino Bektesevic, and the AWS-PoC team. AWS Proof of Concept Project Report. 2020. LSST Data Management Technical Note. URL: https://dmtn-137.lsst.io
[2] DMTN-125. Kian-Tat Lim. Google Cloud Engagement Results. 2019. LSST Data Management Technical Note. URL: https://dmtn-125.lsst.io