Data Lake Storage Architecture for Medicare Advantage Payers

Most infrastructure conversations start with cost-per-terabyte math or cloud benchmark scores. In a Medicare Advantage (MA) environment, neither is the right starting point. The real question is whether the storage layer can reconstruct exactly what the plan knew, on exactly what date, across eligibility, claims, and risk adjustment pipelines simultaneously. If the honest answer is "kind of," there is already a revenue problem.
The stakes are real. In 2025, 34.1 million of about 62.8 million eligible Medicare beneficiaries were enrolled in MA, representing 54% of the eligible Medicare population. In this article, we'll go over the core data lake storage decisions for MA payers: file format selection, zone architecture, partitioning strategy, storage tiering, multi-source ingestion, and PHI governance.
How data lake storage differs from a traditional EDW in a payer context
A traditional enterprise data warehouse enforces schema at write time. A well-designed data lake applies schema at read time. Raw CMS files land intact, and transformation logic lives downstream where it belongs.
Why payer data types require schema-flexible storage from day one
Consider what a regional health plan ingests in a single month:
- 834 enrollment transactions
- 835 remittance files
- 837 claims
- MOR reconciliation files
- MMR payment summaries
- ADT feeds from HIEs
- FHIR R4 payer-to-payer payloads
- Pharmacy benefit manager extracts
- In-home assessment vendor flat files
Each arrives on a different schedule, in a different format, with different downstream sensitivity. No schema-on-write system handles that variety without brittle ETL. Schema flexibility is a prerequisite, not a preference.
The revenue connection: how storage decisions affect RAF accuracy, STARS performance, and MLR
Risk Adjustment Factor scores drive per-member-per-month revenue for every MA enrollee. If a diagnosis is not documented, submitted, and accepted by CMS, the plan loses revenue for a condition that still exists in the patient. STARS performance is equally sensitive: post-discharge follow-up rates and gap closure metrics depend on timely data moving through the storage layer. The Medical Loss Ratio, which MA contracts must keep at or above 85% under the ACA, adds pressure for real-time claims monitoring rather than quarterly snapshots.
The Three Storage Layer Failures That Cost Payers Revenue
Regional health plans usually discover storage architecture problems at the worst possible moment, such as during a CMS audit or when a HITRUST assessment reveals controls that should have existed years earlier.
- Schema rigidity and the HCC transition risk
When CMS transitions HCC models, payers managing rigid schema structures face a difficult choice: defer the update and let risk adjustment logic drift across versions, or force a migration that breaks dependent pipelines. The practical failure is usually quiet. Teams fork pipelines, logic diverges, and by the time someone notices, RAF submissions are inconsistent with the underlying data.
- No time-travel capability and CMS audit exposure
CMS auditors do not just want current data. They want eligibility state, diagnosis support, and claim adjudication records as they existed on a specific submission date. Without table-level versioning, that reconstruction requires re-running transforms and locating whatever intermediate snapshots happened to survive. For fiscal year 2021, based on calendar year 2019 payments, CMS calculated over $15 billion in Part C overpayments, representing nearly 7% of total Part C payments. RADV audits are the primary corrective mechanism for overpayments where diagnosis documentation cannot support submitted RAF scores. If the storage layer cannot produce a clean audit package, the plan is financially exposed, not just operationally exposed.
- PHI governance gaps that trigger HITRUST rework and re-platforming
PHI governance applied inconsistently across storage buckets cannot produce consistent evidence during a HITRUST assessment. Plans that reach certification without proper access logging, encryption at rest, and data residency controls typically discover they need to rebuild infrastructure before they can certify. HITRUST publishes that a readiness assessment report begins at $3,625, but that figure does not capture the engineering cost of re-platforming a storage layer that was never designed with governance built in from the start.
File Format Selection for Payer Workloads
The real format decision for a payer is not CSV versus Parquet. It is whether the implementation uses an open table format that provides versioned, auditable tables on top of object storage.
Delta Lake vs Apache Iceberg vs Apache Hudi for Medicare Advantage data
All three formats support ACID transactions, schema evolution, and time-travel queries. The differences show up in the details that matter for payer workloads.
Delta Lake stores versioned Parquet files plus a transaction log to track commits and provide ACID guarantees. Delta Lake's documentation explains this architecture clearly and is worth reading before any format evaluation.
Apache Iceberg is built around snapshots, where each write or update creates a new snapshot while preserving older data and metadata. Iceberg's evolution documentation covers hidden partitioning and partition evolution, meaning partition layouts can change without rewriting historical data.
Apache Hudi defines three query types that map directly to payer needs:
- Snapshot queries for latest state
- Time-travel queries for historical state
- Incremental queries for changes between two points
Apache Hudi's technical specification is the clearest explanation of why incremental queries are useful for "what changed since the last MOR run" without triggering full rescans.
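To make the three query types concrete, here is a minimal in-memory sketch. It is a toy, not the Hudi, Delta Lake, or Iceberg API: the class name, the member IDs, and the RAF values are all made up for illustration, and it assumes commits are appended in date order.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Commit:
    version: int
    committed_at: str  # ISO-8601 date string, so string comparison matches date order
    upserts: dict      # member_id -> row written in this commit


class ToyVersionedTable:
    """In-memory stand-in for a versioned table (hypothetical, not a real table-format API)."""

    def __init__(self):
        self.commits = []

    def commit(self, committed_at, upserts):
        self.commits.append(Commit(len(self.commits) + 1, committed_at, dict(upserts)))

    def snapshot(self):
        """Snapshot query: latest state after replaying every commit."""
        return self._replay(self.commits)

    def as_of(self, date):
        """Time-travel query: state using only commits made on or before `date`."""
        return self._replay([c for c in self.commits if c.committed_at <= date])

    def incremental(self, after_version):
        """Incremental query: rows written by commits newer than `after_version`."""
        changed = {}
        for c in self.commits:
            if c.version > after_version:
                changed.update(c.upserts)
        return changed

    @staticmethod
    def _replay(commits):
        state = {}
        for c in commits:
            state.update(c.upserts)
        return state


# Made-up member rows for illustration only.
table = ToyVersionedTable()
table.commit("2025-01-05", {"M001": {"raf": 1.10}, "M002": {"raf": 0.85}})
table.commit("2025-02-05", {"M002": {"raf": 0.92}})
table.commit("2025-03-05", {"M003": {"raf": 1.40}})

print(table.snapshot())           # latest state
print(table.as_of("2025-01-31"))  # state as it stood at the end of January
print(table.incremental(1))       # what changed after commit 1
```

In a real table format the same three questions are answered from the transaction log or snapshot metadata rather than by replaying Python dictionaries, but the shape of the questions is the same.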
Why Delta Lake's transaction log matters for MOR reconciliation and CMS audit defense
Delta Lake's transaction log records every write operation with full metadata:
- What changed
- When
- What the table state was at each commit
For MOR reconciliation, querying eligibility and payment data as it existed on any prior date becomes a single query rather than a reconstruction project. For CMS audit defense, the evidence package maps directly to submission state rather than to whatever intermediate outputs happened to survive.
Zone Architecture Designed for Payer Data Realities
The most effective data lake architectures for MA use a three-zone model: raw, conformed, and curated. Each zone has a distinct purpose, a distinct governance posture, and a distinct update cadence.
Raw zone ingestion for 834/835/837, MOR/MMR, ADT feeds, FHIR R4 payloads, and vendor flat files
The raw zone is the immutable source of truth. Every file lands here intact, in its original format, keyed by ingestion timestamp and upstream file identity to preserve chain-of-custody. EDI transactions, CMS MOR and MMR files, HL7 ADT feeds, FHIR R4 payloads, pharmacy benefit manager extracts, and in-home assessment vendor files all belong here without transformation. When a downstream transformation error surfaces, teams replay from raw rather than re-requesting files from CMS or trading partners.
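One way to key raw-zone objects by ingestion timestamp and file identity is sketched below. The path layout, the source label, and the choice of a SHA-256 content hash as "file identity" are assumptions for illustration, not a prescribed convention.

```python
import hashlib
from datetime import datetime, timezone


def raw_zone_key(source, filename, content, ingested_at):
    """Build an object key from source, ingestion time, and a content hash.

    The layout is hypothetical. `ingested_at` must be timezone-aware so the
    key is unambiguous regardless of where the ingestion job runs.
    """
    if ingested_at.tzinfo is None:
        raise ValueError("ingested_at must be timezone-aware")
    digest = hashlib.sha256(content).hexdigest()
    stamp = ingested_at.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return f"raw/{source}/ingested_at={stamp}/sha256={digest[:16]}/{filename}"


# Made-up file name and payload for illustration.
key = raw_zone_key(
    source="cms_mor",
    filename="mor_example.txt",
    content=b"example file bytes",
    ingested_at=datetime(2025, 3, 1, 12, 0, tzinfo=timezone.utc),
)
print(key)
```

Because the hash is derived from the bytes that landed, a re-delivered file with different content gets a different key instead of silently overwriting the earlier copy.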
Conformed zone patterns for eligibility normalization and claims staging
The conformed zone is where raw data becomes queryable and comparable across sources. Eligibility records from multiple payers normalize to a common member schema. Claims stage with standard diagnosis and procedure code structures. NPI-to-TIN matching, which connects individual providers to their organizational identifiers, resolves here. Errors in conformed zone normalization cascade directly into claims processing, RAF submissions, and STARS quality measures, so this zone deserves disproportionate engineering attention.
Curated zone outputs for RAF reporting, STARS dashboards, and MLR monitoring
The curated zone is where operational answers live: RAF scoring tables, STARS measure tracking, MLR dashboards, and CMS submission staging outputs. This is the zone analysts, actuaries, and care management teams query directly. It should be optimized for read performance and updated on cadences that match operational needs, typically daily incremental refreshes with full reconciliation at monthly CMS cycles.
Partitioning Strategy for CMS Submission Cadences, Not Calendar Dates
How partitioning is structured determines how fast queries run and how directly the storage layer supports actual payer workflows. Generic guidance defaults to date-based partitioning. In an MA environment, that default creates measurable, avoidable cost.
Why date-based partitioning creates query performance problems for MA workloads
Claims in MA carry a 30 to 60 day lag between service delivery and adjudication. A claim for services in January may not finalize until March. Calendar-date partitioning forces queries that reconstruct a member's complete claims history to scan across multiple partition ranges.
The cost is direct. Amazon Athena SQL queries are priced at $5 per TB scanned. Athena's own pricing documentation shows that columnar formats and aligned partitions can reduce a $15 query scanning 3 TB to $1.25 scanning 0.25 TB. Wrong partitioning is not just a performance problem. It is a recurring operating cost.
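The arithmetic behind those figures is simple enough to sketch. The price per TB and the two scan sizes come from the paragraph above; the run count of 200 queries per month is a hypothetical number chosen only to show how the difference compounds.

```python
PRICE_PER_TB_USD = 5.00  # Athena SQL price per TB scanned, as cited above


def scan_cost(tb_scanned, price_per_tb=PRICE_PER_TB_USD):
    """Cost in USD of a single query that scans `tb_scanned` terabytes."""
    return round(tb_scanned * price_per_tb, 2)


def monthly_cost(tb_scanned, runs_per_month):
    """Cost in USD of running the same query repeatedly over a month."""
    return round(scan_cost(tb_scanned) * runs_per_month, 2)


unaligned = scan_cost(3.0)   # full scan across misaligned partitions
aligned = scan_cost(0.25)    # columnar format plus aligned partitions
print(unaligned, aligned)    # 15.0 1.25

runs = 200  # hypothetical number of reconciliation queries per month
print(monthly_cost(3.0, runs), monthly_cost(0.25, runs))  # 3000.0 250.0
```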
Aligning storage partitions to claims lag windows, eligibility monthly cycles, and CMS sweep calendars
Better partitioning aligns to the cadences the payer business actually uses. Eligibility tables partition by the monthly effective coverage period. Claims tables work better partitioned by the CMS sweep calendar date governing when diagnoses are accepted for risk adjustment, not by service date embedded in the claim. MOR and MMR files should partition by CMS submission batch identifier so point-in-time reconstruction maps cleanly to audit requests without cross-partition scanning.
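A minimal sketch of that mapping follows. The column names (`coverage_month`, `sweep_id`, `submission_batch_id`), the path layout, and the sweep label are hypothetical; a real plan would substitute whatever identifiers its own CMS calendar and file contracts use.

```python
# Partition column per table, following the cadences described above.
# All column names here are assumptions for illustration.
PARTITION_COLUMN = {
    "eligibility": "coverage_month",
    "claims": "sweep_id",
    "mor": "submission_batch_id",
    "mmr": "submission_batch_id",
}


def partition_path(table, record):
    """Return the partition path a record belongs to, based on its table's cadence."""
    column = PARTITION_COLUMN[table]
    return f"conformed/{table}/{column}={record[column]}"


# Two made-up claims with different service dates that fall under the same sweep.
claim_jan = {"claim_id": "C1", "service_date": "2025-01-14", "sweep_id": "sweep-A"}
claim_mar = {"claim_id": "C2", "service_date": "2025-03-02", "sweep_id": "sweep-A"}

print(partition_path("claims", claim_jan))
print(partition_path("claims", claim_mar))
print(partition_path("eligibility", {"member_id": "M001", "coverage_month": "2025-03"}))
```

Both claims land in the same partition even though their service dates are weeks apart, which is the point: a query scoped to one sweep reads one partition.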
Optimizing member snapshot queries for risk adjustment reconciliation and MAO004 validation
MAO004 files are CMS's confirmation receipts for accepted ICD-10 diagnosis codes submitted for risk adjustment. Validating MAO004 acceptance against submitted records requires joining across eligibility snapshots, HCC submission records, and CMS response files. When partitioning aligns these three data types to the same CMS processing cycle, those joins are fast and predictable. Misaligned partitioning turns a routine reconciliation query into a multi-hour scan with unpredictable cost.
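The reconciliation itself can be sketched as a three-way comparison scoped to one processing cycle. The record fields below (`member_id`, `icd10`, `cycle`) are simplified, hypothetical stand-ins and do not reflect the actual MAO004 file layout.

```python
def reconcile_cycle(cycle, submitted, accepted, eligible_members):
    """Classify each diagnosis submitted in `cycle` against eligibility and CMS responses.

    `eligible_members` maps a cycle label to the set of member IDs eligible in it.
    """
    accepted_keys = {
        (r["member_id"], r["icd10"]) for r in accepted if r["cycle"] == cycle
    }
    eligible = eligible_members.get(cycle, set())
    results = {}
    for r in submitted:
        if r["cycle"] != cycle:
            continue  # other cycles live in other partitions and are not scanned
        key = (r["member_id"], r["icd10"])
        if r["member_id"] not in eligible:
            results[key] = "not_eligible"
        elif key in accepted_keys:
            results[key] = "accepted"
        else:
            results[key] = "not_accepted"
    return results


# Made-up records for illustration.
submitted = [
    {"member_id": "M001", "icd10": "E11.9", "cycle": "2025-03"},
    {"member_id": "M002", "icd10": "I50.9", "cycle": "2025-03"},
    {"member_id": "M003", "icd10": "E11.9", "cycle": "2025-03"},
]
accepted = [{"member_id": "M001", "icd10": "E11.9", "cycle": "2025-03"}]
eligible_members = {"2025-03": {"M001", "M002"}}

status = reconcile_cycle("2025-03", submitted, accepted, eligible_members)
for key, value in status.items():
    print(key, value)
```

When all three inputs share the same cycle key as their partition column, each lookup in this sketch corresponds to reading a single partition per table.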
Storage Tiering for Risk Adjustment Lookback Requirements
Under MA contract provisions, MA organizations must maintain records for 10 years, with audit and inspection rights extending through 10 years from the end of the final contract period or completion of audit, whichever is later. That single provision is why hot-only storage designs are economically unsustainable. Payers are committing to multi-year, audit-grade retention whether they planned for it or not.
Multi-year claims and eligibility retention without hot-tier cost exposure
A practical structure puts the current plan year and the immediately prior year in hot or warm storage for active queries. Two to five prior years move to cold storage with retrieval SLAs aligned to audit workflows. Anything older moves to archive-tier storage with a documented retention policy tied to CMS contract obligations and applicable state regulations.
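The tiering rule described above reduces to a small function of plan-year age. The tier names are illustrative labels, not cloud provider storage class names, and the year boundaries simply restate the structure in the preceding paragraph.

```python
def storage_tier(data_plan_year, current_plan_year):
    """Assign a storage tier from how many plan years old the data is."""
    age = current_plan_year - data_plan_year
    if age < 0:
        raise ValueError("data_plan_year cannot be later than current_plan_year")
    if age <= 1:
        return "hot_or_warm"  # current plan year and the immediately prior year
    if age <= 5:
        return "cold"         # two to five prior years
    return "archive"          # anything older, under a documented retention policy


for year in (2025, 2024, 2023, 2020, 2019):
    print(year, storage_tier(year, current_plan_year=2025))
```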
Retrieval performance benchmarks for audit defense and retrospective reconciliation use cases
Retrieval speed is a first-class design variable, not an afterthought. Standard retrievals from S3 Glacier Flexible Retrieval typically finish in 3 to 5 hours, standard retrievals from S3 Glacier Deep Archive within 12 hours, and bulk Deep Archive retrievals within 48 hours. On Azure, rehydrating from Archive to Hot or Cool typically takes up to 15 hours. These timelines translate into a concrete compliance question: whether a complete "as-of" eligibility, claims, and diagnosis evidence package can be produced within the window the legal team expects during an active audit.
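That compliance question can be checked with simple arithmetic before an audit arrives. In the sketch below, the retrieval hours are the upper ends of the figures quoted above; the 24-hour response window and the 8 hours of downstream processing are hypothetical values that a plan would replace with its own.

```python
# Upper-bound retrieval times in hours, taken from the figures quoted above.
WORST_CASE_HOURS = {
    "s3_glacier_flexible_standard": 5,
    "s3_glacier_deep_archive_standard": 12,
    "s3_glacier_deep_archive_bulk": 48,
    "azure_archive_rehydrate": 15,
}


def fits_window(tier, window_hours, processing_hours=0):
    """True if worst-case retrieval plus downstream processing fits the response window."""
    return WORST_CASE_HOURS[tier] + processing_hours <= window_hours


# Hypothetical audit scenario: 24-hour window, 8 hours to assemble the package.
for tier in WORST_CASE_HOURS:
    print(tier, fits_window(tier, window_hours=24, processing_hours=8))
```

Under these assumed numbers only the bulk Deep Archive path misses the window, which is the kind of result that should drive where audit-relevant datasets are placed.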
Multi-Source Ingestion Architecture for the Payer Data Lake
An MA plan ingests from a wider variety of sources than most data infrastructure teams anticipate at the start of architecture planning, and each source has different format constraints, delivery schedules, and downstream sensitivity.
Ingesting EDI transactions without schema-on-write constraints
EDI file parsing is complex. The 834, 835, and 837 each have multiple implementation guides, and trading partner variations make strict schema validation fragile at ingestion. The right architecture lands these files in the raw zone as-is, performs validation in a separate processing step, and stores both the original file and the parsed output so re-parsing is always possible without going back to the source.
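The land-first, parse-second pattern can be sketched as follows. This is deliberately not a real X12 parser: it assumes `*` as the element separator and `~` as the segment terminator, whereas real files declare their delimiters in the ISA segment. The in-memory `store` dictionary stands in for object storage, and the sample bytes are a made-up fragment.

```python
def parse_segments(raw_text, element_sep="*", segment_term="~"):
    """Split EDI text into segments and elements using assumed delimiters."""
    segments = []
    for chunk in raw_text.split(segment_term):
        chunk = chunk.strip()
        if chunk:
            segments.append(chunk.split(element_sep))
    return segments


def land_and_parse(store, file_id, raw_bytes):
    """Land the original bytes first, then parse as a separate, repeatable step.

    Returns True if parsing succeeded. The raw copy is kept either way, so the
    file can be re-parsed later without going back to the trading partner.
    """
    store[f"raw/{file_id}"] = raw_bytes
    try:
        text = raw_bytes.decode("ascii")
    except UnicodeDecodeError:
        return False
    store[f"parsed/{file_id}"] = parse_segments(text)
    return True


store = {}  # hypothetical stand-in for object storage
sample = b"ST*837*0001~BHT*0019*00*123~SE*3*0001~"  # made-up fragment
ok = land_and_parse(store, "file-001", sample)
bad = land_and_parse(store, "file-002", b"\xff\xfe not ascii")
print(ok, bad, sorted(store))
```

The second file fails to parse but is still present under `raw/`, so a corrected parser can be run against it later.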
HL7 ADT feeds from HIEs and FHIR R4 under CMS-0057-F
HIEs deliver ADT feeds that provide real-time hospital census visibility. Without them, post-discharge follow-up measures for STARS go unmet. The CMS-0057-F final rule sets API compliance dates of January 1, 2027 for MA organizations and requires Patient Access API metrics reporting starting January 1, 2026. The Federal Register final rule text documents these deadlines in full. Both HL7 v2 ADT messages and FHIR R4 bundles need raw-zone landing with format-appropriate parsing rather than being forced into a shared schema at ingestion time.
Pharmacy benefit manager files and in-home assessment vendor extracts in the raw zone
Pharmacy benefit manager data feeds medication adherence measures directly into STARS ratings and provides strong signals for chronic condition documentation. In-home assessment vendor results support suspect diagnosis confirmation and gap closure workflows. Both arrive as flat files or proprietary extracts with formats that change without notice. Treating them as raw-zone residents, transformed downstream rather than at ingestion, keeps pipelines stable across vendor format changes and preserves the original file for re-processing if transformation logic needs correction.
PHI Governance and Compliance at the Storage Layer
Compliance is not a layer added after the architecture is built. It is a design constraint that shapes every storage decision from the beginning, and the storage layer is where PHI governance either starts correctly or has to be rebuilt expensively later.
HIPAA Security Rule controls that storage architecture decisions must satisfy
Under the HIPAA Security Rule technical safeguards, covered entities and business associates must implement audit controls to record and examine activity in systems containing ePHI, with encryption described as an addressable specification. 45 CFR 164.312 is the authoritative text. In practice, encryption at rest must be enforced by configuration rather than team convention, access logging must be immutable and queryable, and least-privilege access must be demonstrable per dataset across all three zones.
HITRUST CSF requirements for PHI at rest, access logging, and data residency
HITRUST CSF certification requires specific storage-layer controls: documented data residency, tamper-evident audit trails, and key management for encryption at rest. Plans that attempt HITRUST certification without these embedded in their storage architecture consistently encounter infrastructure gaps during the assessment that cannot be resolved through policy changes alone. They require engineering changes, often significant ones.
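One common way to make an audit trail tamper-evident is a hash chain, where each entry's hash covers the previous entry's hash. The sketch below illustrates that general technique only; it is not a control HITRUST prescribes, and the event fields and dataset names are made up.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _entry_hash(prev_hash, event):
    """Hash the previous entry's hash together with a canonical form of the event."""
    payload = json.dumps(event, sort_keys=True)
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append_entry(log, event):
    """Append an access event, chaining it to the entry before it."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({"event": event, "prev_hash": prev_hash, "hash": _entry_hash(prev_hash, event)})


def verify(log):
    """Recompute the chain; any edited, removed, or reordered entry breaks it."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["hash"] != _entry_hash(prev_hash, entry["event"]):
            return False
        prev_hash = entry["hash"]
    return True


# Made-up access events for illustration.
audit_log = []
append_entry(audit_log, {"user": "analyst_a", "dataset": "curated/raf", "action": "read"})
append_entry(audit_log, {"user": "analyst_b", "dataset": "conformed/claims", "action": "read"})
print(verify(audit_log))  # True

audit_log[0]["event"]["user"] = "someone_else"  # simulate tampering
print(verify(audit_log))  # False
```

In production the chain would live in write-once storage with the latest hash anchored somewhere separate, but the verification logic is the same idea.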
Access control and audit trail architecture for multi-tenant payer environments
In multi-payer environments where data from multiple CMS contracts co-exists in the same data lake, row-level and column-level access controls are non-negotiable. An analyst working with one payer's MA population should not be able to query another payer's eligibility or claims data, even if both data sets share a conformed zone table. Attribute-based access control implemented at the storage layer enforces this separation without requiring separate physical data stores per payer contract.
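The separation described above can be sketched as a filter driven by user attributes. This is application-level Python for illustration only; a real deployment would enforce the same policy in the storage or catalog layer. The contract IDs, column names, and user attributes are all hypothetical.

```python
def authorized_rows(rows, user):
    """Return only rows for the user's contracts, restricted to the user's columns.

    Row-level control: the row's contract must be in the user's contracts.
    Column-level control: columns outside the user's grant are dropped.
    """
    visible = []
    for row in rows:
        if row["contract_id"] in user["contracts"]:
            visible.append({k: v for k, v in row.items() if k in user["columns"]})
    return visible


# A shared conformed-zone table holding rows from two made-up contracts.
eligibility = [
    {"contract_id": "CONTRACT_A", "member_id": "M001", "dob": "1950-01-01", "plan": "HMO"},
    {"contract_id": "CONTRACT_B", "member_id": "M900", "dob": "1948-06-15", "plan": "PPO"},
]

analyst = {
    "contracts": {"CONTRACT_A"},
    "columns": {"contract_id", "member_id", "plan"},  # no access to date of birth
}

print(authorized_rows(eligibility, analyst))
```

The analyst sees one row from the shared table and never sees the `dob` column, without the two contracts needing separate physical stores.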
How to Evaluate Data Lake Storage Options for a Regional Medicare Advantage Plan
The most common evaluation mistake is benchmarking storage options against generic cloud performance metrics rather than the CMS submission workflows the plan actually runs. Vendor benchmarks answer questions the business is not asking.
Decision criteria tied to CMS file contracts and submission workflows, not vendor benchmarks
Evaluation criteria should answer operational questions directly. Can the storage system reconstruct member-month eligibility, claim adjudication state, and diagnosis support as of specific submission windows? Does the storage layer support the repeatable exports and access logging that the CMS-0057-F interoperability rule requires, with API compliance dates of January 1, 2027 for MA organizations and Patient Access API metrics reporting beginning January 1, 2026? Generic benchmarks reveal nothing about any of these.
Delta Lakehouse vs cloud-native object store vs managed warehouse: trade-offs for mid-size health plans
A Delta Lakehouse running on Databricks or a compatible runtime gives mid-size MA plans strong time-travel capability, schema evolution support, and incremental processing without a fully managed warehouse. Pure cloud-native object storage without an open table format layer lacks the transactional consistency and audit trail depth that payer workloads require. Managed warehouses like Snowflake offer strong query performance and access control but become expensive at the data volumes that multi-year HCC retention demands, and they reintroduce schema rigidity through native table structures.
What sound payer storage architecture looks like before building on top of it
Before building RAF scoring models, STARS dashboards, or MLR pipelines on top of the data lake, the foundation should meet a clear set of conditions. Raw CMS files land intact and immutable. Schema evolution is handled at the table format level, not the ETL code level. Time-travel queries against eligibility, claims, and HCC tables return results in seconds. PHI access is logged with a tamper-evident trail. Storage tiers align to actual access patterns for HCC historicals and prior-period CMS files. Partitioning follows CMS submission cadences rather than calendar dates. If these conditions are not met, every system built on top inherits the risk.
Final Thoughts
Data lake storage is easy to underestimate until something breaks, and in an MA environment, what breaks is revenue. Schema rigidity breaks when CMS updates HCC models. Missing time-travel capability breaks when an auditor asks for point-in-time reconstruction. PHI governance gaps break when a HITRUST assessment surfaces controls that should have been in place from day one.
The plans that navigate this well treat storage architecture as a revenue infrastructure decision. They align file formats, zone structures, partitioning strategies, and governance controls to the actual operational cadences and regulatory obligations of an MA plan. That alignment is what separates a data lake that supports RAF revenue integrity from one that quietly drains it.
Frequently Asked Questions
How can Invene help build a compliant data lake for our MA plan?
Invene is a healthcare technology firm that specializes in building compliant, production-grade data infrastructure for payers, providers, and life sciences organizations. For MA plans, Invene brings deep healthcare domain expertise alongside hands-on engineering capability—spanning file format selection, zone architecture design, partitioning strategy, and PHI governance. Rather than applying a generic data platform template, Invene aligns storage architecture decisions to each plan's specific operational needs: CMS submission cadences, RADV audit requirements, HCC model transitions, and multi-source ingestion from EDI, HIE, and vendor feeds. Plans working with Invene benefit from a team that understands both the technical tradeoffs and the downstream revenue implications of storage decisions. Learn more at invene.com.
What is the difference between data lake storage and a data warehouse for a Medicare Advantage plan?
A traditional data warehouse enforces schema at write time, meaning structure must be defined before ingestion. A data lake applies schema at read time, so raw CMS files, EDI transactions, and vendor extracts land intact and transform downstream. For MA plans managing frequent CMS file format changes, that flexibility is operationally critical.
Why does time-travel capability matter for CMS RADV audit defense?
RADV audits require plans to substantiate RAF scores against records that match specific submission dates. Time-travel capability in formats like Delta Lake or Iceberg turns that reconstruction into a query rather than a manual project. With CMS having calculated over $15 billion in Part C overpayments tied to unsupported diagnosis coding, the audit exposure for plans that cannot produce this reconstruction is significant.
What HIPAA and HITRUST controls must be built into the storage layer from the start?
At minimum: AES-256 encryption at rest enforced by configuration, row-level and column-level access controls tied to user roles and payer contracts, tamper-evident audit logging for all PHI access, and data residency documentation satisfying both HIPAA Security Rule requirements and HITRUST CSF criteria. These cannot be retrofitted effectively after the architecture is in production.
How should a regional Medicare Advantage plan approach storage tiering given the 10-year CMS retention obligation?
Put the current plan year and immediately prior year in hot or warm storage for active queries. Move two to five prior years to cold storage with retrieval SLAs aligned to audit workflows. Archive anything older with a documented retention policy tied to CMS contract obligations. The critical design variable is retrieval time: placing datasets that are likely to be retrieved in deep archive to save storage costs creates compliance risk that the savings rarely justify.
James founded Invene with a 20-year plan to build the world's leading partner for healthcare innovation. A Forbes Next 1000 honoree, James specializes in helping mid-market and enterprise healthcare companies build AI-driven solutions with measurable PnL impact. Under his leadership, Invene has worked with 20 of the Fortune 100, achieved 22 FDA clearances, and launched over 400 products for their clients. James is known for driving results at the intersection of technology, healthcare, and business.