AWS-Native Serverless Data Architecture for Political Intelligence
From Committed Static Artifacts to a Serverless Knowledge-Graph Platform (2026-2037)
Document Owner: CEO | Version: 4.1 | Last Updated: 2026-05-31 (UTC) | Release: v1.0.1
Review Cycle: Quarterly | Next Review: 2026-08-31
Classification: Public (Open Source European Parliament Monitoring Platform)
| Document | Focus | Description | Documentation Link |
|---|---|---|---|
| Architecture | Architecture | C4 model showing current system structure | View Source |
| Future Architecture | Architecture | C4 model showing future system structure | View Source |
| Mindmaps | Concept | Current system component relationships | View Source |
| Future Mindmaps | Concept | Future capability evolution | View Source |
| SWOT Analysis | Business | Current strategic assessment | View Source |
| Future SWOT Analysis | Business | Future strategic opportunities | View Source |
| Data Model | Data | Current data structures and relationships | View Source |
| Future Data Model | Data | AWS-native serverless data architecture | This Document |
| Flowcharts | Process | Current data processing workflows | View Source |
| Future Flowcharts | Process | Enhanced AI-driven workflows | View Source |
| State Diagrams | Behavior | Current system state transitions | View Source |
| Future State Diagrams | Behavior | Enhanced adaptive state transitions | View Source |
| Security Architecture | Security | Current security implementation | View Source |
| Future Security Architecture | Security | Security enhancement roadmap | View Source |
| Threat Model | Security | STRIDE threat analysis | View Source |
| Future Threat Model | Security | Future threat landscape & controls | View Source |
| Classification | Governance | CIA classification & BCP | View Source |
| CRA Assessment | Compliance | Cyber Resilience Act | View Source |
| Workflows | DevOps | CI/CD documentation | View Source |
| Future Workflows | DevOps | Planned CI/CD enhancements | View Source |
| Business Continuity Plan | Resilience | Recovery planning | View Source |
| Financial Security Plan | Financial | Cost & security analysis | View Source |
| End-of-Life Strategy | Lifecycle | Technology EOL planning | View Source |
| Unit Test Plan | Testing | Unit testing strategy | View Source |
| E2E Test Plan | Testing | End-to-end testing | View Source |
| Performance Testing | Performance | Performance benchmarks | View Source |
| Security Policy | Security | Vulnerability reporting & security policy | View Source |
This future data model is designed to implement all controls from Hack23 AB's ISMS framework as the EU Parliament Monitor platform evolves from a committed static-file corpus to an AWS-native serverless data platform. Every data store named below is governed by least-privilege IAM, encrypted with AWS KMS customer-managed keys, and constrained to PUBLIC open European Parliament data and the platform's own derived analysis artifacts. No private-life or non-public personal data is ingested.
| Policy Domain | Policy | Planned Implementation |
|---|---|---|
| Core Security | Information Security Policy | Overall governance for the serverless data platform |
| AI Governance | AI Policy | Bedrock Guardrails, human-accountable RAG, no autonomous deploy |
| Development | Secure Development Policy | Schema-as-code, IaC review gates, data-contract tests |
| Network | Network Security Policy | VPC isolation, PrivateLink, WAF, Shield for data APIs |
| Cryptography | Cryptography Policy | KMS CMK encryption at rest, TLS 1.3 in transit, integrity hashes |
| Access Control | Access Control Policy | Cognito identity, IAM least-privilege, fine-grained table access |
| Data Classification | Data Classification Policy | Per-store PUBLIC classification & tagging |
| Vulnerability | Vulnerability Management | Inspector, automated dependency & infra scanning |
| Incident Response | Incident Response Plan | GuardDuty + Security Hub automated detection & response |
| Backup & Recovery | Backup Recovery Policy | PITR, S3 versioning, cross-region replication |
| Business Continuity | Business Continuity Plan | Multi-AZ serverless, static-edge fallback |
| Third-Party | Third Party Management | EP MCP / EP Open Data / IMF / World Bank source assurance |
| Classification | Classification Framework | Business impact analysis for platform |
| Framework | Version | Relevant Controls |
|---|---|---|
| ISO 27001 | 2022 | A.5.1, A.8.10, A.8.11, A.8.12, A.8.25, A.8.26, A.8.27, A.8.28 |
| NIST CSF | 2.0 | GV.OC, GV.RM, ID.AM, PR.AA, PR.DS, DE.CM |
| CIS Controls | v8.1 | Control 1-6, 11, 13, 14, 16 |
| GDPR | 2016/679 | Art. 5 (minimisation), Art. 6 (lawful basis: public-interest transparency), Art. 89 |
This document defines the evolution of EU Parliament Monitor's data model from its current committed static-file corpus (versioned analysis artifacts plus pre-rendered, per-language HTML on Amazon S3 + Amazon CloudFront) toward an AWS-native serverless data platform capable of real-time European Parliament event ingestion, a queryable political knowledge graph, semantic / RAG search over the full analysis corpus, and an API ecosystem for journalists and researchers.
It supersedes the obsolete v3.0 polyglot blueprint (PostgreSQL + MongoDB + Redis + Elasticsearch + Neo4j on self-managed infrastructure). That generic-cloud framing is retired. Every data tier is now expressed in managed, serverless AWS primitives so the platform retains zero-ops economics while gaining query power.
| Horizon | Name | Data Posture | Primary Stores |
|---|---|---|---|
| v2.0 | Enhanced Static Intelligence (2026 H2 - 2027) | Stay file-based. Committed analysis artifacts (manifest.json runs) + per-language HTML, now augmented with richer pre-computed party / political-landscape dashboard datasets baked at build time. | Git-versioned artifacts, S3 build outputs, CloudFront edge cache |
| v3.0+ | AWS-Native Serverless Platform (2028+) | Dynamic layer behind the static edge. Hot key-value, relational voting history, full-text + vector search, political knowledge graph, S3 data lake + BI, and a managed RAG layer. | DynamoDB · Aurora Serverless v2 · OpenSearch Serverless · Neptune Serverless · S3/Glue/Athena/QuickSight · Bedrock Knowledge Bases |
| 10-yr | AI Lookahead (2026 - 2037) | Model-agnostic semantic fabric; multi-parliament ontology; quantum-safe crypto migration. | Bedrock (model-agnostic) + Neptune ontology + linked-data exports |
| Aspect | Current (v1.0.x) | v2.0 (Static-Enhanced) | v3.0+ (AWS Serverless) | Benefit |
|---|---|---|---|---|
| Storage | Committed markdown + HTML on S3 | + pre-computed dashboard JSON baked at build | DynamoDB + Aurora Serverless v2 + S3 lake | Query flexibility, scale |
| Structure | Provenance manifest.json + typed src/types | + party/landscape datasets | Relational + key-value + graph + vector | Rich relationships |
| Search | Static client filter | Pre-indexed facets | OpenSearch Serverless (BM25 + kNN vector) | Semantic + RAG search |
| Relationships | Implicit in prose | Coalition/actor graph JSON | Neptune Serverless property graph | Native graph queries |
| Update cadence | Daily gh-aw batch | Daily batch + richer datasets | EventBridge + Kinesis near-real-time | Sub-minute freshness |
| Query API | None (static) | None (static) | API Gateway + AppSync GraphQL | Programmatic access |
| Historical data | Git history | Git history | Aurora temporal + Neptune time-versioned | Trend & "as-of" analysis |
| AI/RAG | Build-time LLM authoring (gh-aw) | Same | Bedrock Knowledge Bases over corpus | NL query, grounded answers |
| Data sources | EP MCP + World Bank + IMF | Same | + multi-parliament adapters | Comparative coverage |
The strategic invariant across all horizons: the static HTML edge remains the public, cacheable, low-cost front door. Dynamic v3.0+ features are layered behind CloudFront, never replacing it. See FUTURE_ARCHITECTURE.md for the matching C4 view and DATA_MODEL.md for the current schema.
gantt
title Data Model Evolution Roadmap (2026-2030)
dateFormat YYYY-MM
section v2.0 Static-Enhanced
Party Landscape Datasets (build-time) :v2a, 2026-07, 3M
Coalition Graph JSON (pre-computed) :v2b, 2026-09, 2M
OSINT Tradecraft Schema Hardening :v2c, 2026-10, 3M
Seat-Projection Dataset Bake :v2d, 2027-01, 2M
section v3.0 Foundation
DynamoDB Single-Table Design :v3a, 2027-06, 3M
Aurora Serverless v2 Voting Schema :v3b, 2027-08, 3M
S3 Data Lake + Glue Catalog :v3c, 2027-09, 2M
section v3.0 Intelligence
OpenSearch Serverless (vector + BM25) :v3d, 2028-01, 3M
Neptune Serverless Knowledge Graph :v3e, 2028-03, 4M
Bedrock Knowledge Bases (RAG) :v3f, 2028-06, 3M
section v3.0 Real-Time
EventBridge + Kinesis Ingestion :v3g, 2028-09, 3M
Athena + QuickSight Analytics :v3h, 2028-11, 2M
Multi-Parliament Adapters :v3i, 2029-03, 6M
v2.0 introduces no servers and no databases. It deepens the existing file-based corpus and adds pre-computed dashboard datasets baked at build time, so the public surface stays pure static HTML on S3 + CloudFront while gaining party-level and political-landscape analytics.
- Analysis runs committed under analysis/daily/<YYYY-MM-DD>/<slug>/ with a manifest.json provenance record (schema 1.4.0+) that the deterministic aggregator (src/aggregator/**) reads to render HTML.
- Pre-rendered articles (news/*.html, 14 languages) generated by src/aggregator/article-generator.ts, never authored by an LLM directly.
- Typed schemas in src/types/*.ts (strict ESM).
- Data sources: EP MCP (european-parliament-mcp-server@1.3.20, 60+ tools), worldbank-mcp (optional), and IMF REST.

A new build step emits static, versioned dataset files (JSON, hydrated client-side by Chart.js 4 + D3 7) focused on parties and political groups:
data/landscape/<term>/
├── political-groups.json             ← seat share, cohesion %, leadership
├── group-cohesion-timeseries.json
├── coalition-mathematics.json        ← winning-coalition combinatorics per dossier
├── coalition-network.json            ← nodes/edges for cross-party alliance graph
├── mep-scorecards.json               ← per-MEP activity/loyalty/influence indices
├── voting-heatmap.json               ← group x policy-area alignment matrix
├── seat-projection-2029.json         ← electoral-cycle forecast bands
└── manifest.json                     ← dataset provenance + source EP MCP versions
These files are produced deterministically from EP MCP tool output during CI, hashed
for integrity (SHA-256), and committed alongside the analysis run. Because they are
plain static assets, CloudFront caches them at the edge with no compute cost. The
v2.0 graph datasets (coalition-network.json) use the same conceptual schema as
the v3.0 Neptune property graph, so the later migration is a loader change, not a
remodelling exercise.
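The deterministic hashing step described above could be sketched as follows. This is a minimal illustration rather than the project's build code: `canonicalJson`, `describeDataset`, and the `DatasetEntry` shape are hypothetical names, and sorting object keys before hashing is an assumption about how the output is kept byte-stable.

```typescript
import { createHash } from "node:crypto";

// Hypothetical manifest entry; the real manifest.json schema is not shown here.
interface DatasetEntry {
  name: string;
  sha256: string;
}

// Serialise JSON-compatible data with sorted object keys, so the same
// content always produces the same bytes regardless of key insertion order.
function canonicalJson(value: unknown): string {
  if (Array.isArray(value)) {
    return `[${value.map((item) => canonicalJson(item)).join(",")}]`;
  }
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const fields = Object.keys(obj)
      .sort()
      .map((key) => `${JSON.stringify(key)}:${canonicalJson(obj[key])}`);
    return `{${fields.join(",")}}`;
  }
  return JSON.stringify(value);
}

// Hash the canonical form with SHA-256 and return the manifest entry.
function describeDataset(name: string, data: unknown): DatasetEntry {
  const sha256 = createHash("sha256")
    .update(canonicalJson(data), "utf8")
    .digest("hex");
  return { name, sha256 };
}

// Illustrative values only, not real seat data.
const entry = describeDataset("political-groups.json", {
  term: "EP10",
  groups: [{ groupCode: "EPP", seatCount: 1 }],
});
console.log(entry.name, entry.sha256.length); // political-groups.json 64
```

Because the hash is taken over a canonical serialisation, two CI runs that fetch identical EP MCP output would commit identical hashes, which makes an unexpected diff in the dataset manifest a meaningful signal.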
erDiagram
ANALYSIS_RUN ||--|| RUN_MANIFEST : "described by"
ANALYSIS_RUN ||--o{ ANALYSIS_ARTIFACT : "emits"
ANALYSIS_RUN ||--o{ ARTICLE_HTML : "renders"
ANALYSIS_RUN ||--o{ LANDSCAPE_DATASET : "bakes"
LANDSCAPE_DATASET ||--o{ POLITICAL_GROUP_FACT : "contains"
LANDSCAPE_DATASET ||--o{ COALITION_EDGE : "contains"
LANDSCAPE_DATASET ||--o{ MEP_SCORECARD : "contains"
RUN_MANIFEST {
string articleType "ArticleCategory enum"
string runId "gh-aw run id"
string generatedAt "ISO 8601 UTC"
string sourceCommit "git SHA"
string epMcpVersion "1.3.20"
string ghAwVersion "v0.71.6"
string schemaVersion "1.4.0+"
string dataMode "full | reduced"
}
ANALYSIS_ARTIFACT {
string relativePath "path under run dir"
string category "classification | threat | risk | ..."
int lineCount "vs reference-quality floor"
}
ARTICLE_HTML {
string language "ISO 639-1"
string path "news/<slug>_<lang>.html"
}
LANDSCAPE_DATASET {
string name "political-groups | coalition-network | ..."
string term "EP10"
string sha256 "integrity hash"
}
POLITICAL_GROUP_FACT {
string groupCode "EPP | SD | Renew | ..."
int seatCount
float cohesionPct
string policyArea
}
COALITION_EDGE {
string groupA
string groupB
float coVoteRate "0..1"
string dossierScope
}
MEP_SCORECARD {
string mepId "EP identifier"
float participationRate
float loyaltyScore
float influenceIndex
}
GDPR note (all horizons): MEP_SCORECARD and every MEP-linked record store only public parliamentary-role attributes (votes cast in plenary, committee membership, tabled questions). No private-life, contact-beyond-public-office, or protected-characteristic data is held. Lawful basis is public-interest transparency (GDPR Art. 6(1)(e)); processing is documented per Art. 30.
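As an illustration of how COALITION_EDGE records might be derived at build time, the sketch below computes a pairwise co-vote rate from each group's majority position per roll-call vote. The input shape (`GroupPositions`) and the function name are assumptions made for the example; only the edge fields come from the diagram above.

```typescript
type Position = "for" | "against" | "abstain";

// Assumed input: for one roll-call vote, each group's majority position.
type GroupPositions = Record<string, Position>;

// Mirrors the COALITION_EDGE entity in the diagram above.
interface CoalitionEdge {
  groupA: string;
  groupB: string;
  coVoteRate: number; // 0..1
  dossierScope: string;
}

// For every pair of groups, count the votes in which both took part and the
// votes in which they took the same position; the ratio is the co-vote rate.
function buildCoalitionEdges(
  votes: GroupPositions[],
  dossierScope: string,
): CoalitionEdge[] {
  const shared = new Map<string, number>();
  const together = new Map<string, number>();
  for (const vote of votes) {
    const groups = Object.keys(vote).sort();
    for (let i = 0; i < groups.length; i++) {
      for (let j = i + 1; j < groups.length; j++) {
        const key = `${groups[i]}|${groups[j]}`;
        shared.set(key, (shared.get(key) ?? 0) + 1);
        if (vote[groups[i]] === vote[groups[j]]) {
          together.set(key, (together.get(key) ?? 0) + 1);
        }
      }
    }
  }
  return Array.from(shared.entries()).map(([key, total]) => {
    const [groupA, groupB] = key.split("|");
    return {
      groupA,
      groupB,
      coVoteRate: (together.get(key) ?? 0) / total,
      dossierScope,
    };
  });
}

// Illustrative positions only, not real voting data.
const edges = buildCoalitionEdges(
  [
    { EPP: "for", SD: "for", Renew: "against" },
    { EPP: "for", SD: "against", Renew: "against" },
  ],
  "example-scope",
);
console.log(edges.map((e) => `${e.groupA}-${e.groupB}:${e.coVoteRate}`).join(" "));
// EPP-Renew:0 EPP-SD:0.5 Renew-SD:0.5
```

Only group-level positions enter the calculation, so the resulting edges carry no per-person data beyond what the public roll call already records.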
From 2028 the file-based corpus becomes the immutable source of record that hydrates a set of purpose-fit, fully-managed AWS serverless stores. Each store is selected for a specific access pattern; none is self-managed; all scale to zero or near-zero when idle.
erDiagram
MEP ||--o{ VOTE_CAST : "casts"
MEP }o--|| POLITICAL_GROUP : "member of"
MEP }o--|| NATIONAL_PARTY : "represents"
MEP }o--|| COUNTRY : "elected in"
MEP ||--o{ COMMITTEE_MEMBERSHIP : "holds"
MEP ||--o{ QUESTION : "tables"
COMMITTEE ||--o{ COMMITTEE_MEMBERSHIP : "staffed by"
COMMITTEE ||--o{ DOSSIER : "responsible for"
DOSSIER ||--o{ VOTE : "decided by"
VOTE ||--o{ VOTE_CAST : "aggregates"
PLENARY_SESSION ||--o{ VOTE : "includes"
POLITICAL_GROUP ||--o{ COALITION_MEMBERSHIP : "joins"
COALITION ||--o{ COALITION_MEMBERSHIP : "comprises"
ANALYSIS_ARTIFACT ||--o{ EMBEDDING : "vectorised as"
ANALYSIS_ARTIFACT }o--|| ANALYSIS_RUN : "produced by"
EP_DOCUMENT ||--o{ EMBEDDING : "vectorised as"
KNOWLEDGE_BASE ||--o{ EMBEDDING : "indexes"
MEP {
string mep_id PK "EP identifier"
string full_name "public"
string country FK
string group_code FK
string national_party FK
date term_start
date term_end
}
POLITICAL_GROUP {
string group_code PK "EPP | SD | Renew | ..."
string name
string ideology_band
int seat_count
}
COUNTRY {
string iso_code PK "ISO 3166-1 alpha-2"
string name
int ep_seats
}
COMMITTEE {
string code PK "LIBE | ECON | ENVI | ..."
string name
string policy_area
}
DOSSIER {
string procedure_ref PK "2024/0001(COD)"
string title
string committee_code FK
string stage "committee | plenary | trilogue | adopted"
}
VOTE {
string vote_id PK "EP vote identifier"
string procedure_ref FK
string session_id FK
datetime vote_time
int for_count
int against_count
int abstain_count
string result "passed | rejected"
}
VOTE_CAST {
string vote_id FK
string mep_id FK
string position "for | against | abstain | absent"
}
PLENARY_SESSION {
string session_id PK
date session_date
string location "Strasbourg | Brussels"
}
COMMITTEE_MEMBERSHIP {
string mep_id FK
string committee_code FK
string role "chair | vice | member | substitute"
}
QUESTION {
string question_id PK
string mep_id FK
string type "written | oral"
date tabled_date
}
COALITION {
string coalition_id PK
string dossier_scope
float winning_margin
}
COALITION_MEMBERSHIP {
string coalition_id FK
string group_code FK
}
ANALYSIS_RUN {
string run_id PK
string article_type
datetime generated_at
string source_commit
}
ANALYSIS_ARTIFACT {
string artifact_id PK
string run_id FK
string relative_path
string category
}
EP_DOCUMENT {
string document_id PK
string document_type
date publication_date
}
EMBEDDING {
string embedding_id PK
string source_id FK
string source_type "artifact | ep_document"
string model "amazon.titan-embed | cohere"
int dims "1024 | 1536"
}
KNOWLEDGE_BASE {
string kb_id PK
string name "ep-corpus-kb"
string vector_store "OpenSearch Serverless collection"
}
The logical entities above are physically distributed across five AWS serverless stores plus a managed RAG layer, each mapped to its natural access pattern in the sections that follow.
graph TD
SOR["S3 Source of Record<br/>committed artifacts + EP feeds"]:::s3
SOR --> DDB["Amazon DynamoDB<br/>hot key-value / single-table"]:::aws
SOR --> AUR["Amazon Aurora Serverless v2<br/>relational voting history"]:::aws
SOR --> OSS["Amazon OpenSearch Serverless<br/>BM25 + vector kNN"]:::aws
SOR --> NEP["Amazon Neptune Serverless<br/>political knowledge graph"]:::aws
SOR --> LAKE["S3 Data Lake + Glue + Athena<br/>QuickSight BI"]:::aws
OSS --> KB["Amazon Bedrock<br/>Knowledge Bases (RAG)"]:::ai
NEP --> KB
AUR --> KB
classDef s3 fill:#e8f5e9,stroke:#2e7d32,color:#000
classDef aws fill:#fff3e0,stroke:#e65100,color:#000
classDef ai fill:#ede7f6,stroke:#4527a0,color:#000
Role: ultra-low-latency, scale-to-near-zero store for sessions, the analysis run index, and real-time EP event state. Replaces the obsolete MongoDB document store and the Redis cache (the latter via DynamoDB DAX for microsecond reads, or ElastiCache Serverless where a true cache-aside is needed).
A single table epm-core uses a generic partition/sort key with overloaded item
types, GSIs for inverted access, and a TTL attribute for ephemeral real-time state.
| Access pattern | PK | SK | Notes |
|---|---|---|---|
| Run index by date | RUN#<date> | SLUG#<slug> | List runs for a day |
| Run provenance | RUN#<runId> | META | Mirrors manifest.json |
| Artifact catalogue | RUN#<runId> | ART#<path> | Per-artifact metadata |
| Live vote tally | VOTE#<voteId> | STATE | TTL-expiring real-time tally |
| Session (Cognito) | SESSION#<userId> | <sessionId> | API consumer session |
| Saved query | USER#<userId> | QUERY#<id> | Researcher saved searches |
{
"PK": "RUN#2028-06-01-breaking-run07",
"SK": "META",
"itemType": "AnalysisRun",
"articleType": "breaking",
"generatedAt": "2028-06-01T05:12:44Z",
"sourceCommit": "a1b2c3d",
"epMcpVersion": "1.3.20",
"schemaVersion": "1.4.0",
"dataMode": "full",
"gsi1pk": "TYPE#breaking",
"gsi1sk": "DATE#2028-06-01",
"ttl": null
}
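A small set of key-builder helpers can keep the overloaded PK/SK formats from the access-pattern table in one place. The sketch below is one possible shape; the helper names (`keys`, `gsi1ForRun`, `ttlFromNow`) are hypothetical. The one external fact it relies on is that DynamoDB TTL attributes are expressed in epoch seconds.

```typescript
interface Key {
  PK: string;
  SK: string;
}

// One builder per access pattern in the table above.
const keys = {
  runForDay: (date: string, slug: string): Key => ({ PK: `RUN#${date}`, SK: `SLUG#${slug}` }),
  runMeta: (runId: string): Key => ({ PK: `RUN#${runId}`, SK: "META" }),
  artifact: (runId: string, path: string): Key => ({ PK: `RUN#${runId}`, SK: `ART#${path}` }),
  liveVote: (voteId: string): Key => ({ PK: `VOTE#${voteId}`, SK: "STATE" }),
  session: (userId: string, sessionId: string): Key => ({ PK: `SESSION#${userId}`, SK: sessionId }),
  savedQuery: (userId: string, id: string): Key => ({ PK: `USER#${userId}`, SK: `QUERY#${id}` }),
};

// GSI1 attributes for the inverted "runs by article type and date" lookup.
function gsi1ForRun(articleType: string, generatedAt: string) {
  return {
    gsi1pk: `TYPE#${articleType}`,
    gsi1sk: `DATE#${generatedAt.slice(0, 10)}`,
  };
}

// TTL value in epoch seconds for ephemeral items such as live vote tallies.
function ttlFromNow(nowMs: number, lifetimeSeconds: number): number {
  return Math.floor(nowMs / 1000) + lifetimeSeconds;
}

const meta = keys.runMeta("2028-06-01-breaking-run07");
console.log(meta.PK, meta.SK); // RUN#2028-06-01-breaking-run07 META
console.log(gsi1ForRun("breaking", "2028-06-01T05:12:44Z").gsi1sk); // DATE#2028-06-01
```

Centralising the formats this way means a change to a key prefix is made once, rather than in every Lambda that reads or writes the table.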
sha256 of their source artifact; PITR enabled.
Old → new: MongoDB document store → DynamoDB single-table; Redis cache → DynamoDB DAX / ElastiCache Serverless.
Role: the system of relational truth for MEPs, votes, full per-MEP voting
history, committees, dossiers, and temporal "as-of" queries. Replaces the obsolete
self-managed PostgreSQL/TimescaleDB tier with an auto-scaling, scale-to-low
(0.5 ACU) serverless Postgres that supports the pgvector extension for in-row
embeddings where co-location with relational filters is valuable.
-- Reference tables come first so the foreign keys on mep resolve at creation time
CREATE TABLE country (
  iso_code      CHAR(2) PRIMARY KEY,            -- ISO 3166-1 alpha-2
  name          TEXT NOT NULL,
  ep_seats      INT
);
CREATE TABLE political_group (
  group_code    TEXT PRIMARY KEY,               -- EPP, SD, Renew, ...
  name          TEXT NOT NULL,
  ideology_band TEXT,
  seat_count    INT
);
-- Members of the European Parliament (public role attributes only)
CREATE TABLE mep (
  mep_id         TEXT PRIMARY KEY,              -- EP identifier
  full_name      TEXT NOT NULL,                 -- public
  country_iso    CHAR(2) NOT NULL REFERENCES country(iso_code),
  group_code     TEXT REFERENCES political_group(group_code),
  national_party TEXT,
  term_start     DATE NOT NULL,
  term_end       DATE,
  -- No-op marker constraint: documents the GDPR rule, enforces nothing by itself
  CONSTRAINT public_role_only CHECK (true)      -- GDPR: no private-life columns
);
CREATE TABLE committee (
code TEXT PRIMARY KEY, -- LIBE, ECON, ENVI, ...
name TEXT NOT NULL,
policy_area TEXT
);
CREATE TABLE dossier (
procedure_ref TEXT PRIMARY KEY, -- 2024/0001(COD)
title TEXT NOT NULL,
committee_code TEXT REFERENCES committee(code),
stage TEXT -- committee|plenary|trilogue|adopted
);
CREATE TABLE plenary_vote (
vote_id TEXT PRIMARY KEY,
procedure_ref TEXT REFERENCES dossier(procedure_ref),
session_id TEXT NOT NULL,
vote_time TIMESTAMPTZ NOT NULL,
for_count INT, against_count INT, abstain_count INT,
result TEXT -- passed | rejected
);
-- Per-MEP roll-call positions (the high-volume, append-only history)
CREATE TABLE vote_cast (
vote_id TEXT REFERENCES plenary_vote(vote_id),
mep_id TEXT REFERENCES mep(mep_id),
position TEXT NOT NULL, -- for|against|abstain|absent
PRIMARY KEY (vote_id, mep_id)
);
CREATE INDEX idx_vote_cast_mep ON vote_cast (mep_id);
-- Temporal committee membership for "as-of" composition queries
CREATE TABLE committee_membership (
mep_id TEXT REFERENCES mep(mep_id),
code TEXT REFERENCES committee(code),
role TEXT, -- chair|vice|member|substitute
valid_from DATE NOT NULL,
valid_to DATE,
PRIMARY KEY (mep_id, code, valid_from)
);
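The valid_from / valid_to pair on committee_membership is what makes "as-of" composition queries possible. A minimal in-memory sketch of that filter follows; it assumes valid_to is inclusive and that dates are ISO 8601 strings (which compare correctly as plain strings). The `Membership` shape and the MEP identifiers are hypothetical.

```typescript
interface Membership {
  mepId: string;
  code: string;
  role: string;
  validFrom: string; // ISO date, e.g. "2024-07-16"
  validTo: string | null; // null means the membership is still open
}

// Return the memberships of one committee that were in force on a given date.
// Assumption: validTo is inclusive; adjust the comparison if it is exclusive.
function membersAsOf(rows: Membership[], code: string, asOf: string): Membership[] {
  return rows.filter(
    (r) =>
      r.code === code &&
      r.validFrom <= asOf &&
      (r.validTo === null || asOf <= r.validTo),
  );
}

// Hypothetical rows for illustration only.
const sample: Membership[] = [
  { mepId: "M1", code: "LIBE", role: "member", validFrom: "2024-07-16", validTo: null },
  { mepId: "M2", code: "LIBE", role: "member", validFrom: "2024-07-16", validTo: "2027-01-31" },
  { mepId: "M3", code: "LIBE", role: "substitute", validFrom: "2027-10-01", validTo: null },
  { mepId: "M4", code: "ECON", role: "chair", validFrom: "2024-07-16", validTo: null },
];
console.log(membersAsOf(sample, "LIBE", "2027-09-01").map((r) => r.mepId)); // [ 'M1' ]
```

The equivalent SQL predicate would be `valid_from <= :as_of AND (valid_to IS NULL OR :as_of <= valid_to)`, under the same inclusivity assumption.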
-- Group cohesion: share of a group voting with its modal position, averaged per dossier.
-- Each roll-call vote counts once per group, however many members cast a position.
WITH position_counts AS (
  SELECT m.group_code, vc.vote_id, vc.position, COUNT(*) AS cnt
  FROM vote_cast vc
  JOIN mep m ON m.mep_id = vc.mep_id
  WHERE m.group_code IS NOT NULL
  GROUP BY m.group_code, vc.vote_id, vc.position
),
vote_share AS (
  SELECT group_code, vote_id,
         MAX(cnt)::float / NULLIF(SUM(cnt), 0) AS share
  FROM position_counts
  GROUP BY group_code, vote_id
)
SELECT vs.group_code,
       pv.procedure_ref,
       AVG(vs.share) AS cohesion
FROM vote_share vs
JOIN plenary_vote pv ON pv.vote_id = vs.vote_id
WHERE pv.procedure_ref IS NOT NULL
GROUP BY vs.group_code, pv.procedure_ref;
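The cohesion definition used by the query above (the share of a group's recorded positions that match its most common position, averaged over votes) can be checked on a handful of rows with an in-memory sketch such as the one below. The `CastRow` shape and function names are hypothetical, and the rows passed in are assumed to belong to a single dossier.

```typescript
interface CastRow {
  voteId: string;
  groupCode: string;
  position: string; // for | against | abstain | absent
}

// Share of positions equal to the most common one; 0 for an empty list.
function modalShare(positions: string[]): number {
  if (positions.length === 0) return 0;
  const counts = new Map<string, number>();
  for (const p of positions) counts.set(p, (counts.get(p) ?? 0) + 1);
  return Math.max(...Array.from(counts.values())) / positions.length;
}

// Average modal share across all votes in which the group has recorded positions.
function groupCohesion(rows: CastRow[], groupCode: string): number | null {
  const byVote = new Map<string, string[]>();
  for (const row of rows) {
    if (row.groupCode !== groupCode) continue;
    const list = byVote.get(row.voteId) ?? [];
    list.push(row.position);
    byVote.set(row.voteId, list);
  }
  if (byVote.size === 0) return null;
  let total = 0;
  byVote.forEach((positions) => {
    total += modalShare(positions);
  });
  return total / byVote.size;
}

// Illustrative rows only: vote v1 splits 2-1-1, vote v2 is unanimous.
const rows: CastRow[] = [
  { voteId: "v1", groupCode: "Renew", position: "for" },
  { voteId: "v1", groupCode: "Renew", position: "for" },
  { voteId: "v1", groupCode: "Renew", position: "against" },
  { voteId: "v1", groupCode: "Renew", position: "abstain" },
  { voteId: "v2", groupCode: "Renew", position: "for" },
  { voteId: "v2", groupCode: "Renew", position: "for" },
  { voteId: "v1", groupCode: "EPP", position: "against" },
];
console.log(groupCohesion(rows, "Renew")); // 0.75
```

A fixture like this is small enough to verify by hand (0.5 for v1, 1.0 for v2), which makes it a convenient parity check against the SQL result during migration.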
valid_from/valid_to ranges + system_time-style snapshots enable "what did committee LIBE look like on 2027-09-01?" analysis.
Old → new: self-managed PostgreSQL / TimescaleDB → Aurora Serverless v2.
Role: unified lexical (BM25) and semantic (kNN vector) search across the analysis corpus, EP documents, and dashboards. Replaces the obsolete Elasticsearch tier. A single collection serves both keyword search and the vector store backing Bedrock Knowledge Bases.
{
"settings": { "index.knn": true },
"mappings": {
"properties": {
"source_id": { "type": "keyword" },
"source_type": { "type": "keyword" },
"article_type":{ "type": "keyword" },
"language": { "type": "keyword" },
"title": { "type": "text", "analyzer": "standard" },
"body": { "type": "text" },
"published_at":{ "type": "date" },
"mep_ids": { "type": "keyword" },
"group_codes": { "type": "keyword" },
"committee": { "type": "keyword" },
"embedding": {
"type": "knn_vector",
"dimension": 1024,
"method": { "name": "hnsw", "space_type": "cosinesimil", "engine": "faiss" }
}
}
}
}
{
"size": 10,
"query": {
"hybrid": {
"queries": [
{ "match": { "body": "carbon border adjustment cohesion" } },
{ "knn": { "embedding": { "vector": [/* 1024-d query embedding */],
"k": 10 } } }
]
}
},
"post_filter": { "term": { "language": "en" } }
}
Old → new: Elasticsearch → OpenSearch Serverless (lexical + vector).
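The kNN field above ranks documents by cosine similarity, with HNSW providing an approximate search. An exact brute-force baseline, sketched below, is one way to sanity-check recall on a small sample of embeddings; the function and type names are hypothetical, and the vectors are two-dimensional only to keep the example readable.

```typescript
interface EmbeddedDoc {
  sourceId: string;
  embedding: number[];
}

interface ScoredDoc {
  sourceId: string;
  score: number;
}

// Cosine similarity of two equal-length vectors; 0 if either has zero length.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("dimension mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Exact k-nearest-neighbour search: score every document, sort, take the top k.
function bruteForceKnn(query: number[], docs: EmbeddedDoc[], k: number): ScoredDoc[] {
  return docs
    .map((d) => ({ sourceId: d.sourceId, score: cosineSimilarity(query, d.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

const top = bruteForceKnn(
  [1, 0],
  [
    { sourceId: "a", embedding: [1, 0] },
    { sourceId: "b", embedding: [0, 1] },
    { sourceId: "c", embedding: [1, 1] },
  ],
  2,
);
console.log(top.map((d) => d.sourceId)); // [ 'a', 'c' ]
```

The brute-force version is linear in the corpus size, so it is only practical for test fixtures; the managed index exists precisely to avoid that cost at scale.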
Role: the political knowledge graph linking MEPs, political groups, committees, dossiers, votes, and countries, plus derived coalition and actor networks. Replaces the obsolete Neo4j tier with a managed, auto-scaling property-graph + RDF engine (openCypher / Gremlin / SPARQL). This is the spine of v3.0 OSINT capability: natural network queries ("which Renew MEPs broke with their group on ENVI dossiers and how does that cluster by country?") that are awkward in relational SQL.
graph LR
MEP(("MEP")):::node
GRP(("Political Group")):::node
CTE(("Committee")):::node
DOS(("Dossier")):::node
VOTE(("Vote")):::node
CNTRY(("Country")):::node
COAL(("Coalition")):::node
PARTY(("National Party")):::node
MEP -->|MEMBER_OF| GRP
MEP -->|ELECTED_IN| CNTRY
MEP -->|REPRESENTS| PARTY
MEP -->|SITS_ON| CTE
MEP -->|CAST| VOTE
CTE -->|RESPONSIBLE_FOR| DOS
DOS -->|DECIDED_BY| VOTE
GRP -->|JOINS| COAL
COAL -->|ON_DOSSIER| DOS
PARTY -->|AFFILIATED_WITH| GRP
classDef node fill:#e3f2fd,stroke:#1565c0,color:#000
// Vertices carry only PUBLIC role attributes
(:MEP {mep_id, full_name, term_start, term_end})
(:PoliticalGroup {group_code, name, ideology_band, seat_count})
(:Country {iso_code, name, ep_seats})
(:Committee {code, name, policy_area})
(:Dossier {procedure_ref, title, stage})
(:Vote {vote_id, vote_time, result})
(:Coalition {coalition_id, winning_margin})
(:NationalParty {party_id, name})
// Edges
(:MEP)-[:MEMBER_OF {since}]->(:PoliticalGroup)
(:MEP)-[:ELECTED_IN]->(:Country)
(:MEP)-[:REPRESENTS]->(:NationalParty)
(:MEP)-[:SITS_ON {role, valid_from, valid_to}]->(:Committee)
(:MEP)-[:CAST {position}]->(:Vote)
(:Committee)-[:RESPONSIBLE_FOR]->(:Dossier)
(:Dossier)-[:DECIDED_BY]->(:Vote)
(:PoliticalGroup)-[:JOINS {co_vote_rate}]->(:Coalition)
(:Coalition)-[:ON_DOSSIER]->(:Dossier)
// Cross-party defection clustering on environment dossiers
// Note: modal_position is not part of the vertex schema above. The query assumes the
// loader precomputes the group's majority position for each vote being compared.
MATCH (m:MEP)-[:MEMBER_OF]->(g:PoliticalGroup {group_code:'Renew'}),
      (m)-[c:CAST]->(v:Vote)<-[:DECIDED_BY]-(d:Dossier),
      (cte:Committee {code:'ENVI'})-[:RESPONSIBLE_FOR]->(d),
      (m)-[:ELECTED_IN]->(country:Country)
WHERE c.position <> g.modal_position
RETURN country.iso_code, count(DISTINCT m) AS defectors
ORDER BY defectors DESC;
vote_id / procedure_ref, preserving evidence chains required by the OSINT tradecraft methodology.
valid_from/valid_to so coalition graphs can be reconstructed "as of" any plenary week.
coalition-network.json dataset is the same logical graph, exported for client-side D3 rendering.
Old → new: Neo4j knowledge graph → Amazon Neptune Serverless.
Role: cost-efficient analytics and BI over the full historical corpus. The committed artifacts and EP feeds land in an S3 data lake (partitioned Parquet), are catalogued by AWS Glue, queried ad-hoc with Amazon Athena (serverless SQL), and visualised in Amazon QuickSight dashboards for internal analysts and partner journalists.
s3://epm-datalake/
├── raw/        ep_mcp/ , imf/ , worldbank/               (immutable landing, JSON)
├── curated/    votes/ meps/ dossiers/ coalitions/        (Parquet, partitioned by year/term)
└── analytics/  scorecards/ cohesion/ projections/        (aggregated marts)
graph LR
RAW["S3 raw zone<br/>JSON landing"]:::s3
GLUE["AWS Glue<br/>crawlers + ETL jobs"]:::aws
CUR["S3 curated zone<br/>partitioned Parquet"]:::s3
ATH["Amazon Athena<br/>serverless SQL"]:::aws
QS["Amazon QuickSight<br/>BI dashboards"]:::aws
RAW --> GLUE --> CUR --> ATH --> QS
classDef s3 fill:#e8f5e9,stroke:#2e7d32,color:#000
classDef aws fill:#fff3e0,stroke:#e65100,color:#000
Old → new: ad-hoc warehouse / Datadog dashboards → S3 + Glue + Athena + QuickSight.
Role: the managed Retrieval-Augmented-Generation layer over the EP corpus and the platform's own analysis artifacts, enabling grounded natural-language query ("Summarise how the EPP voted on migration dossiers this term, with citations"). Replaces any bespoke OpenAI/LangChain gateway with a model-agnostic Bedrock layer.
graph TD
SRC["Sources<br/>S3 artifacts + EP documents"]:::s3
EMB["Bedrock Embeddings<br/>Titan / Cohere"]:::ai
VEC["OpenSearch Serverless<br/>vector collection"]:::aws
KB["Bedrock Knowledge Base<br/>ep-corpus-kb"]:::ai
AG["Bedrock Agents<br/>tool use / OSINT workflows"]:::ai
GR["Bedrock Guardrails<br/>neutrality + PII/GDPR + hallucination"]:::ai
APP["AppSync / API Gateway<br/>NL query endpoint"]:::aws
SRC --> EMB --> VEC --> KB --> AG --> GR --> APP
classDef s3 fill:#e8f5e9,stroke:#2e7d32,color:#000
classDef aws fill:#fff3e0,stroke:#e65100,color:#000
classDef ai fill:#ede7f6,stroke:#4527a0,color:#000
Old → new: OpenAI / LangChain gateway → Amazon Bedrock + Knowledge Bases + Agents + Guardrails.
v3.0 ingestion is event-driven and serverless end-to-end. EP MCP / EP Open Data changes are detected, streamed, transformed, and projected into each purpose-fit store with eventual consistency. The S3 source of record stays authoritative; every derived store can be rebuilt from it.
sequenceDiagram
participant EP as EP MCP / EP Open Data
participant ING as Lambda Ingestor
participant KIN as Amazon Kinesis
participant EB as Amazon EventBridge
participant SFN as Step Functions
participant S3 as S3 Source of Record
participant DDB as DynamoDB
participant AUR as Aurora Serverless v2
participant OSS as OpenSearch Serverless
participant NEP as Neptune Serverless
EP->>ING: poll feeds / webhook (votes, sessions, docs)
ING->>S3: write immutable raw payload (hashed)
ING->>KIN: emit change record
KIN->>EB: route by detail-type
EB->>SFN: start projection workflow
SFN->>DDB: upsert run index / live state
SFN->>AUR: upsert relational rows (idempotent)
SFN->>OSS: index doc + embedding
SFN->>NEP: upsert vertices/edges
SFN-->>EB: emit projection-complete event
| Concern | AWS Service | Behaviour |
|---|---|---|
| Scheduled / triggered polling | EventBridge Scheduler + Lambda | Pulls EP MCP sliding/fixed-window feeds |
| Change streaming | Amazon Kinesis Data Streams | Ordered, replayable change records |
| Routing / fan-out | Amazon EventBridge | Content-based routing by detail-type |
| Orchestration | AWS Step Functions | Idempotent multi-store projection sagas |
| Async buffering | Amazon SQS / SNS | Backpressure + retry + DLQ |
| Source of record | Amazon S3 | Immutable, versioned, hashed payloads |
Old → new: Kafka event bus → EventBridge + Kinesis + SQS/SNS; Socket.io / Apollo → API Gateway WebSocket / AppSync for live push.
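Because Kinesis records are replayable, each projection step has to tolerate seeing the same change more than once. One common way to get that property is to key every change on its entity id plus content hash and skip anything already applied. The sketch below shows the idea with an in-memory ledger; in a real deployment the ledger would need to be durable, and `ChangeRecord` and `ProjectionLedger` are hypothetical names.

```typescript
interface ChangeRecord {
  entityId: string; // e.g. an EP vote or document identifier
  detailType: string; // routing key, as used for content-based routing
  sha256: string; // hash of the raw payload in the source of record
}

// Remembers the last payload hash applied per entity, so replays become no-ops.
class ProjectionLedger {
  private readonly applied = new Map<string, string>();

  apply(
    record: ChangeRecord,
    project: (record: ChangeRecord) => void,
  ): "applied" | "skipped" {
    if (this.applied.get(record.entityId) === record.sha256) {
      return "skipped"; // same content already projected
    }
    project(record);
    this.applied.set(record.entityId, record.sha256);
    return "applied";
  }
}

const ledger = new ProjectionLedger();
const seen: string[] = [];
const change: ChangeRecord = { entityId: "vote-1", detailType: "vote.updated", sha256: "abc" };
console.log(ledger.apply(change, (r) => seen.push(r.entityId))); // applied
console.log(ledger.apply(change, (r) => seen.push(r.entityId))); // skipped
console.log(seen.length); // 1
```

A changed payload produces a different hash, so genuine updates still flow through while duplicates from retries or stream replays are dropped.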
Migration is additive and reversible: the file-based corpus keeps running and serving traffic while serverless stores are populated from it. No "big bang."
flowchart TD
A["Phase 0: v1.0.x baseline<br/>committed artifacts on S3+CloudFront"] --> B
B["Phase 1: Backfill S3 data lake<br/>load historical artifacts + EP feeds (Parquet)"] --> C
C["Phase 2: Stand up read stores<br/>Aurora + DynamoDB + OpenSearch from lake"] --> D
D["Phase 3: Build knowledge graph<br/>Neptune loader from curated zone"] --> E
E["Phase 4: Enable RAG<br/>Bedrock KB over OpenSearch vectors"] --> F
F["Phase 5: Real-time ingestion<br/>EventBridge+Kinesis live projection"] --> G
G["Phase 6: Expose APIs<br/>API Gateway + AppSync + Cognito"] --> H
H["Phase 7: Multi-parliament adapters<br/>pluggable source mappers"]
| Principle | Implementation |
|---|---|
| Source of record unchanged | Git + S3 artifacts remain authoritative; stores are projections |
| Idempotent loaders | Re-runnable Glue / Lambda jobs keyed on EP ids + hashes |
| Reversibility | Any store can be dropped and rebuilt from S3 SOR |
| Zero downtime | Static edge keeps serving; dynamic features feature-flagged |
| Validation gates | Row-count + checksum parity checks before cut-over |
| Cost discipline | Scale-to-zero serverless; backfill on Spot/serverless batch |
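// Project a committed analysis run into DynamoDB + OpenSearch (idempotent)
import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocument } from "@aws-sdk/lib-dynamodb";
import { readManifest } from "../src/aggregator/analysis-aggregator.js";
// Assumed helpers, not defined in this document (the module path is hypothetical):
// readArtifact returns an artifact's text, bedrockEmbed returns its embedding, and
// opensearch is a SigV4-signed @opensearch-project/opensearch client.
import { readArtifact, bedrockEmbed, opensearch } from "./projection-clients.js";

// The document client marshals plain objects; undefined manifest fields are dropped.
const ddb = DynamoDBDocument.from(new DynamoDBClient({}), {
  marshallOptions: { removeUndefinedValues: true },
});

export async function projectRun(runDir: string): Promise<void> {
  const manifest = await readManifest(runDir);         // src/types/analysis.ts
  const key = `RUN#${manifest.runId}`;
  await ddb.put({                                      // full-item overwrite, so re-runs are idempotent
    TableName: "epm-core",
    Item: { PK: key, SK: "META", ...manifest,
            gsi1pk: `TYPE#${manifest.articleType}`,
            gsi1sk: `DATE#${manifest.generatedAt.slice(0, 10)}` },
  });
  for (const path of manifest.files.classification ?? []) {
    const text = await readArtifact(runDir, path);
    const embedding = await bedrockEmbed(text);        // Titan / Cohere
    await opensearch.index({
      index: "epm-corpus",
      id: `${manifest.runId}:${path}`,                 // deterministic id
      body: { source_id: path, source_type: "artifact", // opensearch-js takes the document as `body`
              article_type: manifest.articleType, body: text, embedding },
    });
  }
}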
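The "validation gates" principle (row-count and checksum parity before cut-over) could be expressed as a comparison between the source of record and a projected store. The sketch below assumes both sides can be summarised as a map from record id to content hash; `checkParity` and `ParityReport` are hypothetical names.

```typescript
interface ParityReport {
  rowCountsMatch: boolean;
  missing: string[]; // in the source of record, absent from the store
  unexpected: string[]; // in the store, absent from the source of record
  mismatched: string[]; // present in both, but with different hashes
  passed: boolean;
}

// Compare id -> sha256 maps for the source of record and a projected store.
function checkParity(
  source: Map<string, string>,
  projected: Map<string, string>,
): ParityReport {
  const missing: string[] = [];
  const mismatched: string[] = [];
  const unexpected: string[] = [];
  source.forEach((hash, id) => {
    const other = projected.get(id);
    if (other === undefined) missing.push(id);
    else if (other !== hash) mismatched.push(id);
  });
  projected.forEach((_hash, id) => {
    if (!source.has(id)) unexpected.push(id);
  });
  const passed =
    missing.length === 0 && unexpected.length === 0 && mismatched.length === 0;
  return {
    rowCountsMatch: source.size === projected.size,
    missing,
    unexpected,
    mismatched,
    passed,
  };
}

const report = checkParity(
  new Map([["v1", "aaa"], ["v2", "bbb"]]),
  new Map([["v1", "aaa"], ["v2", "ccc"]]),
);
console.log(report.rowCountsMatch, report.mismatched, report.passed);
// true [ 'v2' ] false
```

The example shows why the checksum matters: the row counts agree, yet the gate still fails because one record's content has drifted.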
| Dimension | Mechanism | AWS Service |
|---|---|---|
| Freshness | Feed lag vs EP publication timestamps | CloudWatch metric + alarm |
| Completeness | Row/edge parity vs S3 SOR | AWS Glue Data Quality |
| Schema conformance | Contract tests on ingest | Glue Data Quality rulesets |
| Integrity | SHA-256 of raw payloads vs stored | Lambda check + CloudTrail |
| Referential integrity | Orphan vote_cast / dangling edges | Athena scheduled query |
| Drift | Nightly reconciliation diff | Glue job → CloudWatch dashboard |
| Cost / capacity | ACU / OCU / NCU utilisation | CloudWatch + Budgets alerts |
-- Athena data-quality probe: votes with no recorded per-MEP positions
SELECT pv.vote_id, pv.vote_time
FROM curated.plenary_vote pv
LEFT JOIN curated.vote_cast vc ON vc.vote_id = pv.vote_id
WHERE vc.vote_id IS NULL
ORDER BY pv.vote_time DESC;
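Two of the dimensions in the table above, schema conformance and freshness, reduce to small predicates that can run at ingest time. The sketch below is illustrative only: the function names are hypothetical and the SLA threshold in the usage example is an assumed value, not a figure from this document.

```typescript
// Allowed values for vote_cast.position, as used throughout the schema.
const ALLOWED_POSITIONS = ["for", "against", "abstain", "absent"];

interface VoteCastRow {
  voteId: string;
  mepId: string;
  position: string;
}

// Contract check: return the rows whose position is outside the allowed set.
function invalidPositions(rows: VoteCastRow[]): VoteCastRow[] {
  return rows.filter((r) => !ALLOWED_POSITIONS.includes(r.position));
}

// Feed lag in seconds between EP publication and ingestion (ISO 8601 timestamps).
function feedLagSeconds(publishedAt: string, ingestedAt: string): number {
  return (Date.parse(ingestedAt) - Date.parse(publishedAt)) / 1000;
}

// True when the lag exceeds the given SLA threshold.
function breachesFreshnessSla(lagSeconds: number, slaSeconds: number): boolean {
  return lagSeconds > slaSeconds;
}

const lag = feedLagSeconds("2028-06-01T05:00:00Z", "2028-06-01T05:02:30Z");
console.log(lag); // 150
console.log(breachesFreshnessSla(lag, 60)); // true (60 s is an assumed threshold)
console.log(
  invalidPositions([
    { voteId: "v1", mepId: "M1", position: "for" },
    { voteId: "v1", mepId: "M2", position: "yes" },
  ]).length,
); // 1
```

In the target architecture the lag would be published as a CloudWatch metric and the contract violations surfaced through the data-quality rulesets, rather than logged to the console.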
position IN ('for','against','abstain','absent')), and freshness SLAs at ingest time.
reference-quality-thresholds.json philosophy: every projection records its source artifact and line/row counts.
Security is defence-in-depth and PUBLIC-data-only by design.
| Control | Implementation |
|---|---|
| Encryption at rest | AWS KMS customer-managed keys for DynamoDB, Aurora, OpenSearch, Neptune, S3 |
| Encryption in transit | TLS 1.3 everywhere; VPC endpoints / PrivateLink for store access |
| Identity | Amazon Cognito user pools (journalists / researchers / API consumers), federated IdP |
| Authorization | IAM least-privilege roles per Lambda; fine-grained DynamoDB / row-level Aurora policies |
| Network | Private subnets, security groups, AWS WAF + Shield on public APIs |
| Secrets | AWS Secrets Manager for any source credentials; no secrets in code |
| Audit | AWS CloudTrail + Security Hub + GuardDuty anomaly detection |
| PII / GDPR | Bedrock Guardrails block non-public-role personal data in any RAG output |
The stores above hold the base political data: MEPs, groups, votes, dossiers. This section adds the intelligence-grade data structures required by the OSINT capability roadmap in FUTURE_MINDMAP.md and realised architecturally in FUTURE_ARCHITECTURE.md. Each new entity family is the data backbone of a missing capability a senior intelligence operative would expect, and every one stays inside the PUBLIC open-data, public-parliamentary-role boundary with provenance attached.
GDPR / neutrality invariant. New entities describe public roles, public declarations, public documents, and public discourse only. No private-life attribute, no protected characteristic, and no psychographic field is ever modelled. Integrity findings are stored as questions with evidence, never as adjudicated accusations.
| Capability | New Entities | Store | Horizon |
|---|---|---|---|
| Collection management / PIR | IntelligenceRequirement, CollectionTask, CoverageGap | DynamoDB | 🟢 v2.0 → 🔵 v3.0 |
| Indications and Warning | Indicator, Watchlist, WarningEvent, Tripwire | DynamoDB + Kinesis | 🔵 v3.0 |
| Integrity / conflict-of-interest | FinancialInterest, OutsideActivity, RegisterEntry, Meeting, InterestOrganization | Aurora + Neptune | 🔵 v3.1 |
| Verbatim speech intelligence | Speech, Utterance, StanceSignal | Aurora + OpenSearch | 🔵 v3.1 |
| Counter-FIMI | Narrative, MediaItem, NarrativeCampaign, FIMIIncident | OpenSearch + Neptune | ⚪ v3.2 |
| Forecasting + ACH | Forecast, Hypothesis, ConfidenceAssessment, RedTeamReview | Aurora | 🔵 v3.0 |
| Provenance + authenticity | EvidenceChain, SourceGrade, ContentCredential | S3 + Neptune | 🟢 → ⚪ v3.2 |
These vertices and edges extend the v3.0 Neptune political knowledge graph so multi-hop influence and integrity tracing becomes a single query.
// New PUBLIC-only intelligence vertices
(:InterestOrganization {org_id, name, register_id, country, category})
(:RegisterEntry {entry_id, declared_interest, lobby_budget_band, updated})
(:Meeting {meeting_id, date, subject, dossier_ref})
(:FinancialInterest {decl_id, type, declared_on, source})
(:Narrative {narrative_id, theme, first_seen, languages})
(:FIMIIncident {incident_id, disarm_ttp, confidence_band, status})
(:Indicator {indicator_id, name, baseline, threshold})
(:WarningEvent {warning_id, level, raised_at, confirmed_by})
// New edges: every edge references a PUBLIC EP / register source
(:MEP)-[:DECLARED {declared_on}]->(:FinancialInterest)
(:MEP)-[:MET_WITH {date}]->(:InterestOrganization)
(:InterestOrganization)-[:LISTED_IN]->(:RegisterEntry)
(:InterestOrganization)-[:LOBBIED_ON]->(:Dossier)
(:Meeting)-[:CONCERNS]->(:Dossier)
(:Narrative)-[:REFERENCES]->(:Dossier)
(:FIMIIncident)-[:AMPLIFIES]->(:Narrative)
(:Indicator)-[:WATCHES]->(:PoliticalGroup)
(:WarningEvent)-[:RAISED_BY]->(:Indicator)
// Integrity question (NOT an accusation): rapporteurs whose dossier overlaps a
// declared outside interest AND a registered lobby meeting on the same dossier.
MATCH (m:MEP)-[:SITS_ON {role:'Rapporteur'}]->(:Committee)-[:RESPONSIBLE_FOR]->(d:Dossier),
(m)-[:DECLARED]->(fi:FinancialInterest),
(m)-[:MET_WITH]->(o:InterestOrganization)-[:LOBBIED_ON]->(d)
RETURN m.full_name AS public_role, d.procedure_ref, fi.type, o.name
ORDER BY d.procedure_ref;
// Output is a SOURCED prompt for journalistic review, WEP-banded, human-reviewed.
The I&W store turns the abstract "warning problem" into queryable indicators with calibrated tripwires and an auditable promotion history.
{
  "PK": "WATCHLIST#coalition-collapse",
  "SK": "INDICATOR#cohesion-decline-EPP",
  "indicator": "EPP roll-call cohesion 30-day rolling mean",
  "baseline": 0.92,
  "current": 0.81,
  "tripwire": 0.85,
  "deviation_sigma": 2.4,
  "confidence_band": "likely (WEP 55-70%)",
  "state": "WARNING",
  "evidence_vote_ids": ["RCV-2029-0412", "RCV-2029-0418"],
  "raised_at": "2029-03-14T09:00:00Z",
  "human_confirmed_by": "analyst-on-duty",
  "anchoring_methodology": "coalition-dynamics-analysis.md"
}
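
How a record like the one above might arrive at its `state` can be sketched as follows. This is an illustrative sketch only: it assumes a "lower is worse" indicator (such as cohesion), the `NORMAL` label and the function names are hypothetical, and the standard deviation used for `deviation_sigma` is assumed to be supplied from historical data the source does not specify.

```typescript
// Hypothetical subset of the watchlist item's numeric fields.
interface IndicatorReading {
  baseline: number;
  current: number;
  tripwire: number;
}

// "NORMAL" is an assumed label; only "WARNING" appears in the record above.
type IndicatorState = "NORMAL" | "WARNING";

// Assumes a "lower is worse" indicator: the tripwire fires when the current
// value falls below the threshold.
function evaluateTripwire(r: IndicatorReading): IndicatorState {
  return r.current < r.tripwire ? "WARNING" : "NORMAL";
}

// Deviation from baseline in units of a caller-supplied historical
// standard deviation (an assumption; the source does not define it).
function deviationSigma(r: IndicatorReading, stdDev: number): number {
  if (!(stdDev > 0)) throw new RangeError("stdDev must be positive");
  return Math.abs(r.baseline - r.current) / stdDev;
}

const reading: IndicatorReading = { baseline: 0.92, current: 0.81, tripwire: 0.85 };
console.log(evaluateTripwire(reading)); // "WARNING" (0.81 is below 0.85)
```

In keeping with the record's `human_confirmed_by` field, a computed state like this would be a proposal for an analyst to confirm, not an automatic promotion.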
Predictive products are stored with their competing hypotheses, confidence band, and red-team review attached, never as a bare point estimate, so every forecast is auditable against the outcome and the analytic track record is measurable.
CREATE TABLE forecast (
  forecast_id      uuid PRIMARY KEY,
  subject_ref      text NOT NULL,   -- dossier / coalition / election
  question         text NOT NULL,   -- the estimative question
  estimate         numeric,         -- probability 0..1
  wep_band         text NOT NULL,   -- Kent / Words of Estimative Probability
  confidence       text NOT NULL,   -- ICD 203 low/moderate/high
  competing_hyps   jsonb NOT NULL,  -- >= 2 hypotheses required
  red_team_review  jsonb,           -- devil's-advocate dissent record
  evidence_chain   jsonb NOT NULL,  -- cited PUBLIC sources
  resolved_outcome numeric,         -- filled after the event for calibration
  human_signoff    text NOT NULL
);
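
The column comments above state invariants (probability in 0..1, at least two competing hypotheses) that the DDL itself does not enforce. One way an application-side check might look is sketched below. `ForecastDraft` and `validateForecast` are hypothetical names, and the non-empty `evidence_chain` rule is an added assumption, not something the schema states.

```typescript
// Hypothetical in-memory mirror of the columns that carry invariants.
interface ForecastDraft {
  estimate: number | null; // nullable, as in the table
  competingHyps: unknown[];
  evidenceChain: unknown[];
  humanSignoff: string;
}

// Returns a list of violations; an empty list means the draft passes.
function validateForecast(f: ForecastDraft): string[] {
  const problems: string[] = [];
  if (f.estimate !== null) {
    if (Number.isNaN(f.estimate) || f.estimate < 0 || f.estimate > 1) {
      problems.push("estimate must lie in 0..1");
    }
  }
  if (f.competingHyps.length < 2) {
    problems.push("at least 2 competing hypotheses required");
  }
  // Assumption: a forecast with no cited source is rejected.
  if (f.evidenceChain.length === 0) {
    problems.push("evidence_chain must cite at least one source");
  }
  if (f.humanSignoff.trim() === "") {
    problems.push("human_signoff must not be empty");
  }
  return problems;
}
```

Returning every violation at once, instead of failing on the first, lets a reviewer fix a draft in a single pass.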
Calibration loop. `resolved_outcome` closes the feedback arm of the intelligence cycle: forecast accuracy is scored over time (Brier-style), feeding the analytic track record and re-tasking the collection plan. This is the data-model realisation of "be early and honest about it".
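
The Brier-style scoring mentioned above can be illustrated with a short sketch. It assumes `resolved_outcome` is recorded as a binary 0 or 1 (the column is `numeric`, so other encodings are possible); `ResolvedForecast` and `brierScore` are hypothetical names.

```typescript
// Hypothetical projection of resolved rows from the forecast table.
interface ResolvedForecast {
  estimate: number;       // forecast probability, 0..1
  resolvedOutcome: 0 | 1; // assumed encoding: 1 if the event occurred, else 0
}

// Brier score: mean squared difference between forecast probability and
// outcome. 0 is perfect; always forecasting 0.5 scores 0.25.
function brierScore(forecasts: ResolvedForecast[]): number {
  if (forecasts.length === 0) {
    throw new RangeError("no resolved forecasts to score");
  }
  const total = forecasts.reduce(
    (sum, f) => sum + (f.estimate - f.resolvedOutcome) ** 2,
    0,
  );
  return total / forecasts.length;
}
```

Lower is better, so a falling score across successive review cycles would indicate improving calibration.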
As AI models evolve (major upgrades roughly annually, with competitors evaluated at each release), the data model evolves toward a model-agnostic semantic fabric on the same AWS serverless substrate. Bedrock's model abstraction means new foundation models are adopted by configuration, not rearchitecture.
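
"Adopted by configuration" might look roughly like the sketch below, in which calling code depends only on an `Embedder` function and the model identifier comes from configuration. Everything here is hypothetical: `ModelConfig`, `makeEmbedder`, and the injected `invoke` callback (a stand-in for the real Bedrock call) are illustrative, and the identifier in the usage lines is a placeholder, not a real Bedrock model ID.

```typescript
// Hypothetical configuration shape.
interface ModelConfig {
  embeddingModelId: string;
}

// The only contract downstream code depends on.
type Embedder = (text: string) => Promise<number[]>;

// Stand-in signature for the real model invocation (an assumption).
type Invoke = (modelId: string, text: string) => Promise<number[]>;

// Binds an embedder to whichever model the configuration names, so swapping
// models changes configuration and leaves callers untouched.
function makeEmbedder(config: ModelConfig, invoke: Invoke): Embedder {
  return (text) => invoke(config.embeddingModelId, text);
}

// Usage with a fake invoker that returns a one-element vector.
const fakeInvoke: Invoke = async (_modelId, text) => [text.length];
const embed = makeEmbedder({ embeddingModelId: "placeholder-embed-v1" }, fakeInvoke);
void embed("roll-call vote");
```

Injecting `invoke` also keeps the sketch testable without AWS credentials.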
| Year | AI Model | DevSecOps Capability Evolution |
|---|---|---|
| 2026 | Opus 4.6–4.9 | 🟢 AI-assisted code review, automated test generation, agentic CI/CD workflows |
| 2027 | Opus 5.x | 🔵 Predictive vulnerability detection, intelligent dependency management |
| 2028 | Opus 6.x | 🟣 Multi-modal security analysis (code + architecture + runtime), automated threat modeling |
| 2029 | Opus 7.x | 🟠 Autonomous security pipeline orchestration, self-healing build systems |
| 2030 | Opus 8.x | 🔴 Near-expert automated security review, AI-driven architecture validation |
| 2031–2033 | Opus 9–10.x / Pre-AGI | ⚪ Autonomous secure development lifecycle management |
| 2034–2037 | AGI / Post-AGI | ⭐ Transformative software engineering with built-in security assurance |
Assumptions: major AI model upgrades annually; competitors (OpenAI, Google, Meta, EU sovereign AI) evaluated at each release; architecture accommodates potential paradigm shifts (quantum AI, neuromorphic computing). Full cross-perspective analysis lives in the Hack23 Information Security Strategy § AI Model Evolution Strategy; governance per AI Policy (AI = proposal generator, human accountability, no autonomous deploy).
| Era | Years | Data Model Paradigm | Key Additions |
|---|---|---|---|
| Near-Term | 2027-2029 | AWS serverless multi-store + Neptune graph | Hybrid vector/lexical search, Bedrock RAG, real-time projection |
| Mid-Term | 2029-2032 | Unified semantic fabric | OWL/RDF parliament ontology over Neptune, automated schema evolution with human gates |
| Long-Term | 2032-2035 | Autonomous data curation | Self-healing pipelines, predictive indexing, causal inference layer |
| Visionary | 2035-2037 | AGI-ready, quantum-safe | Dynamic schema generation, universal multi-parliament intelligence, PQC migration |
Document Status: ✅ APPROVED FOR PLANNING
Version: 4.1 | Last Updated: 2026-05-31 (UTC) | Release: v1.0.1
Next Review: 2026-08-31 (Quarterly)
Classification: Public (Open Source European Parliament Monitoring Platform)