arxiv:2602.10870

FedPS: Federated data Preprocessing via aggregated Statistics

Published on Feb 11

· Submitted by

Xuefeng Xu on Feb 12

Upvote

Authors:

Xuefeng Xu ,

Abstract

FedPS is a federated data preprocessing framework that uses aggregated statistics and data-sketching techniques to enable efficient and privacy-preserving data preparation for collaborative machine learning across distributed datasets.

AI-generated summary

Federated Learning (FL) enables multiple parties to collaboratively train machine learning models without sharing raw data. However, before training, data must be preprocessed to address missing values, inconsistent formats, and heterogeneous feature scales. This preprocessing stage is critical for model performance but is largely overlooked in FL research. In practical FL systems, privacy constraints prohibit centralizing raw data, while communication efficiency introduces further challenges for distributed preprocessing. We introduce FedPS, a unified framework for federated data preprocessing based on aggregated statistics. FedPS leverages data-sketching techniques to efficiently summarize local datasets while preserving essential statistical information. Building on these summaries, we design federated algorithms for feature scaling, encoding, discretization, and missing-value imputation, and extend preprocessing-related models such as k-Means, k-Nearest Neighbors, and Bayesian Linear Regression to both horizontal and vertical FL settings. FedPS provides flexible, communication-efficient, and consistent preprocessing pipelines for practical FL deployments.