Verification, Provenance, and Reproducibility in Shared AI-Augmented Data Pipelines
Co-located with IEEE Big Data 2026 · December 14–17, 2026
Modern big data pipelines rarely operate in isolation. Datasets, intermediate artifacts, trained models, and downstream decisions increasingly flow between teams, across organizational boundaries, and through multi-institution collaborations, as seen in federated healthcare analytics, financial risk platforms, and large-scale scientific consortia.
Unlike traditional deterministic query engines, learned models can behave non-deterministically: their outputs may change with the model version, the training data, and any sampling used at inference time. When an AI-augmented pipeline flags a fraudulent transaction, recommends a treatment plan, or triggers an infrastructure response, the inability to audit the chain of data, model versions, and intermediate decisions is not merely an academic limitation; it is an operational risk.
This workshop addresses the foundational question: in a world of shared, AI-augmented data pipelines, how do we verify that results are correct, and how do we ensure they can be reproduced?
Data Provenance and Lineage in Shared Big Data Pipelines
Reproducibility of AI/ML Model Decisions at Scale
Verification of Cross-Organizational Data and Model Artifacts
Model Drift and Pipeline Staleness Detection
Auditing and Accountability Across Multi-Party and Cross-Organizational Workflows
Federated Auditing and Privacy-Preserving Verification
Explainability and Auditability of Black-Box Pipeline Components
Versioning Strategies for Data, Models, and Code in Multi-Party Workflows
Trust Propagation and Accountability Across Pipeline Stages
Reproducibility Challenges in Foundation-Model, RAG, and Agentic Data Workflows
Benchmarking and Evaluation Under Real-World Deployment Conditions
Case Studies in Deployed Systems: Failures, Reproducibility Incidents, and Lessons Learned