Research: Minerva / RAP Platform Architecture
| Field | Value |
|---|---|
| Type | Research |
| Status | Active |
| Author | xRED Dev Team |
| Created | 2026-04-07 |
| Source | code.roche.com/iix-science-and-research/architecture — 21 ADRs |
| Related | Secure Application Blueprint, Janus |
1. What is the Platform?
The RISE AWS Platform (also called RAP, Minerva, and most recently Nebula) is Roche’s managed EKS-based developer platform for research informatics. These names refer to overlapping but distinct scopes within the same ecosystem:
- RAP (Research Architecture Platform) — the broader SnR platform initiative
- Minerva — specifically the pRED EKS platform
- Nebula — the latest branding (referenced in ADR 0020)
The platform documentation is served via MkDocs at platform.apps.science.roche.com and
the architecture repo contains 21 ADRs covering technology choices, patterns, and
operational decisions.
2. Environment Strategy
5 static Kubernetes clusters:
| Cluster | Purpose | Developer Access |
|---|---|---|
| DEV | Development | Standard |
| SBX | Sandbox | Full kubectl freedom |
| TST | Testing | Standard |
| UAT | User acceptance testing | Restricted |
| PRD | Production | Read-only dashboards only |
Ephemeral environments (triggered by GitLab MR labels) are being introduced for isolated E2E testing — namespace-based isolation within existing clusters, integrated with ArgoCD, Vault, Helm, and Datadog (ADR 0019).
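As a rough illustration of namespace-based isolation, the sketch below derives a per-MR namespace name. The trigger label `ephemeral-env` and the `<app>-mr-<iid>` naming scheme are assumptions made for this example; ADR 0019 is not quoted here and may define different conventions.

```python
import re

# Hypothetical label name: the source only says that GitLab MR labels
# trigger ephemeral environments, not which label is used.
TRIGGER_LABEL = "ephemeral-env"


def wants_ephemeral_env(mr_labels):
    """Return True when the merge request carries the assumed trigger label."""
    return TRIGGER_LABEL in mr_labels


def ephemeral_namespace(app_name, mr_iid):
    """Build a namespace name such as 'my-app-mr-42' (assumed naming scheme)."""
    suffix = f"-mr-{mr_iid}"
    # Lowercase, replace anything outside [a-z0-9-] with '-', trim stray dashes.
    slug = re.sub(r"[^a-z0-9-]+", "-", app_name.lower()).strip("-")
    # Kubernetes namespace names are DNS labels, limited to 63 characters.
    slug = slug[: 63 - len(suffix)].rstrip("-")
    return f"{slug}{suffix}"
```

Keeping the MR identifier in the namespace name makes it straightforward to locate and tear down the environment once the MR is merged or closed.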
3. Platform Service Catalog
API Gateway: Gravitee.io (ADR 0001)
Gravitee scored 93.44 in evaluation against AWS API Gateway (91.88) and MuleSoft (85.69).
Why Gravitee won:
- Deployed on the EKS cluster itself — no latency from cross-account hops
- No cross-account configuration needed
- Native alerting (Slack/email)
- Supports Keycloak authentication
- Both RAP and Minerva (pRED) use Gravitee — standardisation advantage
Capabilities: Developer portal, OpenAPI console, API lifecycle management (versioning, deprecation), subscription management, rate limiting, CORS, multi-endpoint failover, logging, metrics, analytics.
Limitation: All configuration is click-ops — no infrastructure-as-code support.
Authentication: Janus (ADR 0018)
Two architectural approaches documented for multi-tenant Janus integration:
Istio-based:
NLB → Istio Ingress Gateway → Envoy Proxy → OAuth2-Proxy (per-tenant) → App
Nginx-based:
NLB → Nginx Ingress Controller → OAuth2-Proxy (per-tenant) → App
- Each tenant gets a separate Cognito user pool
- OAuth2-Proxy handles OIDC authentication, JWT injection, cookie management
- Per-tenant OAuth2-Proxy instances provide isolation and independent scaling
- DNS pattern: `tenantA.apps.science.roche.com`
- Redirects unauthenticated users to Amazon Cognito Hosted UI
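The per-tenant routing step can be pictured as a lookup from the request host to a tenant, and from the tenant to its own OAuth2-Proxy instance. The sketch below is illustrative only: in the documented designs this mapping is done by the ingress layer (Istio or Nginx), and the `oauth2-proxy-<tenant>` service name is a hypothetical convention.

```python
BASE_DOMAIN = "apps.science.roche.com"


def tenant_from_host(host):
    """Extract the tenant label from '<tenant>.apps.science.roche.com'."""
    # Drop an optional port, normalise case (DNS is case-insensitive).
    host = host.split(":")[0].lower().rstrip(".")
    suffix = "." + BASE_DOMAIN
    if not host.endswith(suffix):
        return None
    tenant = host[: -len(suffix)]
    # Only the single-label tenant pattern from this section is accepted.
    if not tenant or "." in tenant:
        return None
    return tenant


def oauth2_proxy_service(tenant):
    """Hypothetical naming scheme for the per-tenant OAuth2-Proxy Service."""
    return f"oauth2-proxy-{tenant}"
```

Rejecting unknown or nested hosts before choosing an upstream keeps one tenant's traffic from ever reaching another tenant's proxy and user pool.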
Auth Standard (Governance ADR 0001)
AuthService (Keycloak) with OAuth 2.0 is the governance-level standard for auth across SnR, with gCustoms for role management (a custom RBAC app integrated with Keycloak). Cognito is noted as a potential future alternative. PingFederate and SailPoint were avoided due to complex GIS onboarding.
Observability: Datadog (ADRs 0003, 0011, 0012)
- Datadog for monitoring, APM, tracing — universally praised by platform users
- OpenTelemetry Operator on EKS with W3C Trace Context propagation standard
- Java services auto-instrumented via annotation: `instrumentation.opentelemetry.io/inject-java: observability/otel-instrumentation`
- Dashboards cover: ArgoCD sync/health, K8s pod metrics, app logs, error traces, frontend Core Web Vitals, backend latency/traffic
- Datadog Synthetic Tests for no-code E2E testing of web apps
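For readers unfamiliar with W3C Trace Context, the sketch below shows the shape of the `traceparent` header that carries a trace across services. It is a simplified illustration of the header format only; on the platform the OpenTelemetry auto-instrumentation is expected to handle propagation, and the function names here are made up for the example.

```python
import re
import secrets

# traceparent = version "-" trace-id "-" parent-id "-" trace-flags
_TRACEPARENT = re.compile(
    r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})"
)


def new_traceparent(sampled=True):
    """Start a new trace: random 16-byte trace id and 8-byte span id."""
    flags = "01" if sampled else "00"
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-{flags}"


def parse_traceparent(header):
    """Return the header fields as a dict, or None if the header is invalid."""
    match = _TRACEPARENT.fullmatch(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero ids are invalid per the specification.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return {
        "version": version,
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


def child_traceparent(header):
    """Header for an outgoing call: same trace id, fresh span id."""
    parsed = parse_traceparent(header)
    if parsed is None:
        return new_traceparent()
    flags = "01" if parsed["sampled"] else "00"
    return f"00-{parsed['trace_id']}-{secrets.token_hex(8)}-{flags}"
```

Because every hop keeps the trace id and only replaces the span id, a tracing backend can stitch spans from different services into one end-to-end trace.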
Runtime Security: Sysdig (ADR 0012)
- Sysdig selected as CNAPP (Cloud Native Application Protection Platform)
- Covers: runtime vulnerability scanning, infrastructure visibility, CIEM (least privilege), compliance, forensics
- Deployed on DEV, UAT, PRD only (not TST or RPLATFORM)
- Currently in POC mode with limited licenses
- Limitation: security data not correlated with observability data
Code Quality: SonarQube (ADRs 0004, 0006)
- SonarQube at `sonarqube.roche.com` for static code analysis
- Integrated into the EKS Base Pipeline (platform CI/CD)
- Quality gate currently advisory only — does not fail the pipeline (gradual rollout)
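The difference between an advisory and an enforcing quality gate comes down to how the pipeline maps the gate status to a job exit code. The sketch below assumes the gate status is reported as `OK` or `ERROR`; the function name and the `enforce` switch are hypothetical and not taken from the EKS Base Pipeline.

```python
def evaluate_gate(gate_status, enforce=False):
    """Map a quality gate status to (exit_code, message) for a CI job."""
    if gate_status == "OK":
        return 0, "quality gate passed"
    if enforce:
        # Enforcing mode: a failed gate fails the pipeline.
        return 1, "quality gate failed: blocking pipeline"
    # Advisory mode (current rollout stage): report, but do not fail.
    return 0, "quality gate failed: advisory only, pipeline continues"
```

Flipping `enforce` to `True` per project would be one way to roll out blocking gates gradually.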
Deployment: Argo Rollouts (ADR 0016)
- Argo Rollouts for Canary and Blue-Green deployment strategies
- Integrated with existing ArgoCD GitOps setup
- All deployments managed via Helm charts following GitOps practices
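A canary strategy is expressed as a list of steps on a `Rollout` resource. The sketch below builds such a manifest as a Python dictionary to show the structure. The weights, pause duration, replica count, and image name are illustrative assumptions; on the platform the equivalent manifest would live in a Helm chart.

```python
import json


def canary_rollout(name, image, weights=(20, 50), pause="5m", replicas=4):
    """Build a minimal Rollout manifest with a stepwise canary strategy."""
    steps = []
    for weight in weights:
        # Shift the given percentage of traffic to the new version...
        steps.append({"setWeight": weight})
        # ...then wait before moving to the next step.
        steps.append({"pause": {"duration": pause}})
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Rollout",
        "metadata": {"name": name},
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": {"app": name}},
            "template": {
                "metadata": {"labels": {"app": name}},
                "spec": {"containers": [{"name": name, "image": image}]},
            },
            "strategy": {"canary": {"steps": steps}},
        },
    }


# Hypothetical application and image names, for illustration only.
manifest = canary_rollout("demo-api", "registry.example/demo-api:1.2.0")
print(json.dumps(manifest["spec"]["strategy"], indent=2))
```

After the last step completes, the rollout is promoted to serve all traffic from the new version.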
Feature Flags: GitLab Feature Flags (ADR 0017)
- GitLab Feature Flags scored 90.19% in evaluation
- Supports: percentage rollouts, user targeting, environment-specific toggling, RBAC, SSO, audit trails
- No direct cost for RAP platform
- Limitation: no OpenFeature standard support yet
Off-Hours Scaling (ADR 0020)
- KEDA (Cron Scaler) for workload downscaling during off-hours
- Karpenter for automatic node management and consolidation
- Opt-in per application
- Covers 6am–6pm Mon–Fri across US/Europe/India timezones
- Goal: lower monthly AWS bills as tenant count grows
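One way to express the working-hours window with KEDA's cron scaler is one trigger per timezone, so the workload stays up while any region is inside its 6am–6pm window and drops to the minimum otherwise. The sketch below builds such a `ScaledObject` as a dictionary. The three timezone names, the replica count, and the `-off-hours` name suffix are assumptions; ADR 0020 may configure this differently.

```python
# Assumed representative zones for US, Europe and India.
OFFICE_TIMEZONES = ("America/New_York", "Europe/Zurich", "Asia/Kolkata")


def off_hours_scaled_object(deployment, working_replicas=2,
                            timezones=OFFICE_TIMEZONES):
    """Build a KEDA ScaledObject that keeps a Deployment up in working hours."""
    triggers = [
        {
            "type": "cron",
            "metadata": {
                "timezone": tz,
                "start": "0 6 * * 1-5",   # 06:00 Monday to Friday
                "end": "0 18 * * 1-5",    # 18:00 Monday to Friday
                "desiredReplicas": str(working_replicas),
            },
        }
        for tz in timezones
    ]
    return {
        "apiVersion": "keda.sh/v1alpha1",
        "kind": "ScaledObject",
        "metadata": {"name": f"{deployment}-off-hours"},
        "spec": {
            "scaleTargetRef": {"name": deployment},
            # Outside every window the workload scales down to this value.
            "minReplicaCount": 0,
            "triggers": triggers,
        },
    }
```

Since the feature is opt-in, a manifest like this would only be created for applications that request it; once pods are scaled down, Karpenter can consolidate the freed nodes.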
Developer Portal: Backstage (ADRs 0007, 0010, 0014)
- Backstage (Spotify) as the RAP Developer Portal and Software Catalog
- Services registered via `catalog-info.yaml` following the Backstage System Model
- Multi-tenancy via namespace-driven UI switching
- Entities: Components (microservices), Systems (logical apps), APIs, Groups (teams)
- Single Backstage instance recommended for all of Roche (per Spotify engineer feedback)
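To make the registration step concrete, the sketch below builds a minimal Component entity of the kind a `catalog-info.yaml` file describes. The component, owner, and system names are hypothetical. The output is printed as JSON, which is also valid YAML, to avoid depending on a YAML library.

```python
import json


def component_entity(name, owner, system,
                     component_type="service", lifecycle="production"):
    """Build a minimal Backstage Component entity."""
    return {
        "apiVersion": "backstage.io/v1alpha1",
        "kind": "Component",
        "metadata": {"name": name},
        "spec": {
            "type": component_type,   # e.g. service, website, library
            "lifecycle": lifecycle,   # e.g. experimental, production
            "owner": owner,           # a Group entity (team)
            "system": system,         # the logical application it belongs to
        },
    }


# Hypothetical names, for illustration only.
entity = component_entity("eln-api", owner="xred-dev-team", system="xred-eln")
print(json.dumps(entity, indent=2))
```

The `owner` and `system` fields are what link a microservice to the Groups and Systems listed above.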
Infrastructure-as-Code (ADR 0015)
The platform transitioned from AWS CDK to Terraform Cloud, driven by successful Terraform adoption across the Minerva, MLOps, and RAP teams.
4. Networking
Ingress
Two options: Istio Ingress Gateway or Nginx Ingress Controller, both front-ended by AWS Network Load Balancer (NLB) for transparent pass-through.
DNS pattern: <app-name>.<env>.apps.science.roche.com
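The DNS pattern can be captured in a small helper that validates its inputs before building a hostname. This sketch assumes that the `<env>` label is the lowercase short name of one of the five clusters; the source does not state the exact environment labels used in DNS.

```python
import re

# Assumed: DNS env labels are the lowercase cluster short names.
ENVIRONMENTS = ("dev", "sbx", "tst", "uat", "prd")

# A DNS label: lowercase alphanumerics and '-', 1 to 63 characters,
# starting and ending with an alphanumeric character.
_LABEL = re.compile(r"[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?")


def app_hostname(app_name, env):
    """Return '<app-name>.<env>.apps.science.roche.com' for valid inputs."""
    env = env.lower()
    if env not in ENVIRONMENTS:
        raise ValueError(f"unknown environment: {env!r}")
    if _LABEL.fullmatch(app_name) is None:
        raise ValueError(f"invalid app name for a DNS label: {app_name!r}")
    return f"{app_name}.{env}.apps.science.roche.com"
```

For example, `app_hostname("eln", "DEV")` yields `eln.dev.apps.science.roche.com` under these assumptions.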
Security Layers
| Layer | Technology | Scope |
|---|---|---|
| Edge | Cloudflare WAF | DDoS, bot protection |
| Network | RCP Inspection VPC | Traffic flow controls |
| Governance | Service Control Policies | Account-level guardrails |
| Runtime | Sysdig | Container forensics, CIEM |
| Code | SonarQube | Static analysis |
5. Disaster Recovery (ADRs 0005, 0008, 0009)
Strategy: Multi-Region + Pilot Light
| Target | Value |
|---|---|
| RPO | < 4 hours |
| RTO | < 2 hours |
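The two targets measure different things: RPO bounds how much data may be lost (the time between the last replicated state and the outage), while RTO bounds how long the service may be down. A minimal sketch of checking a DR drill against both targets, with hypothetical function and field names:

```python
from datetime import datetime, timedelta

RPO = timedelta(hours=4)  # maximum tolerated data-loss window
RTO = timedelta(hours=2)  # maximum tolerated downtime


def check_recovery(last_replicated, outage_start, service_restored):
    """Compare one outage (or DR drill) against the RPO and RTO targets."""
    data_loss = outage_start - last_replicated
    downtime = service_restored - outage_start
    return {
        "data_loss": data_loss,
        "downtime": downtime,
        "rpo_met": data_loss < RPO,
        "rto_met": downtime < RTO,
    }


# Illustrative timestamps only, not real incident data.
result = check_recovery(
    last_replicated=datetime(2026, 1, 1, 8, 0),
    outage_start=datetime(2026, 1, 1, 10, 30),
    service_restored=datetime(2026, 1, 1, 12, 0),
)
```

In this illustrative drill the data-loss window is 2.5 hours and the downtime is 1.5 hours, so both targets are met.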
Traffic management: Route 53 Application Recovery Controller (ARC) with routing controls (ON/OFF switches) and safety rules, served from 5 redundant regional endpoints. It was chosen over AWS Global Accelerator on two grounds: no cost and better Nginx integration.
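Routing controls and safety rules can be pictured as a set of boolean switches plus a guard that rejects unsafe combinations. The toy model below mimics an assertion-style rule ("at least one control must stay ON"). It does not call the AWS API, and the control names are hypothetical.

```python
def apply_change(states, changes, min_on=1):
    """Apply routing-control changes unless they violate the safety rule.

    states and changes map a control name to True (ON) or False (OFF).
    """
    proposed = {**states, **changes}
    # Safety rule: never leave fewer than `min_on` controls switched ON,
    # otherwise no region would receive traffic.
    if sum(proposed.values()) < min_on:
        raise ValueError("safety rule violated: too few routing controls ON")
    return proposed


# Hypothetical control names for a primary and a pilot-light region.
controls = {"primary-region": True, "standby-region": False}

# Failover: switch both controls in a single change.
after_failover = apply_change(
    controls, {"primary-region": False, "standby-region": True}
)
```

Switching the primary OFF without switching the standby ON in the same change would be rejected, which is the kind of operator mistake a safety rule is meant to prevent during a stressful failover.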
Data resilience:
- EFS: AWS EFS Replication for continuous near-real-time cross-region sync
- EBS: Snapshot-based replication across regions
- Velero (open-source, K8s-native) for incremental backups of persistent volumes and K8s objects (Deployments, Services, ConfigMaps)
6. Known Platform Challenges (User Feedback, Nov 2023)
| Issue | Detail |
|---|---|
| Lack of self-service | Onboarding a new app “takes days instead of an hour” — dominant complaint |
| Environment instability | Non-PRD environments cause frustration |
| Support model | Issues spread across Slack, Jira, ServiceNow, gChat with no aggregation |
| ArgoCD/GitOps friction | Perceived as hindrance for quick MVCs/PoCs |
| AWS login complexity | Multiple accounts across environments |
| Communication overload | Announcements scattered across channels |
| Datadog | Most praised feature — universally valued |
| PDB status | Pod Disruption Budget status unclear to teams |
Key quote: “It is the RAP platform that should know something wrong is happening, but instead the users are informing DevOps about downtimes.”
7. Relevance to xRED ELN
xRED ELN runs on Minerva and inherits its service catalog. Key platform capabilities used by xRED:
| Platform Service | xRED Usage |
|---|---|
| Gravitee | Internal API gateway for all integration traffic |
| ArgoCD | GitOps deployment from infrastructure repo |
| Vault | All secrets via External Secrets Operator |
| Datadog | Monitoring and observability |
| NGINX Ingress | K8s routing and TLS termination |
| RDS PostgreSQL | Persistent storage |
| ElastiCache Redis | Session store |
Not currently used by xRED but available: Backstage, SonarQube, GitLab Feature Flags, Sysdig, KEDA off-hours scaling, Confluent Kafka, Argo Rollouts (Blue/Green/Canary).
The Gravitee click-ops limitation is notable — xRED’s API definitions must be configured manually in the Gravitee console, not via infrastructure-as-code.