Published on 01 January 2024

Generalized data thinning using sufficient statistics

Dharamshi, Ameer; Neufeld, Anna; Motwani, Keshav; Gao, Lucy L.; Witten, Daniela; Bien, Jacob

Description

Our goal is to develop a general strategy to decompose a random variable $X$ into multiple independent random variables, without sacrificing any information about unknown parameters. A recent paper showed that for some well-known natural exponential families, $X$ can be thinned into independent random variables $X^{(1)}, \dots, X^{(K)}$, such that $X = \sum_{k=1}^{K} X^{(k)}$. These independent random variables can then be used for various model validation and inference tasks, including in contexts where traditional sample splitting fails. In this paper, we generalize their procedure by relaxing this summation requirement and simply asking that some known function of the independent random variables exactly reconstruct $X$. This generalization of the procedure serves two purposes. First, it greatly expands the families of distributions for which thinning can be performed. Second, it unifies sample splitting and data thinning, which on the surface seem to be very different, as applications of the same principle. This shared principle is sufficiency. We use this insight to perform generalized thinning operations for a diverse set of families.
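
As an illustration of the sum-based thinning that the description takes as its starting point, the sketch below thins a Poisson count into two parts. It is not code from the paper or the dataset. It relies on the standard fact that if X ~ Poisson(λ) and X(1) given X is Binomial(X, ε), then X(1) and X(2) = X − X(1) are independent Poisson(ελ) and Poisson((1 − ε)λ) variables that sum to X. The function names (`sample_poisson`, `thin_poisson`, `covariance`) and the values of `lam`, `eps`, and `n` are assumptions chosen for this example only.

```python
import math
import random


def sample_poisson(lam, rng):
    """Draw one Poisson(lam) variate with Knuth's multiplication method (suitable for small lam)."""
    threshold = math.exp(-lam)
    count = 0
    product = rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def thin_poisson(x, eps, rng):
    """Split a count x into (x1, x2), where x1 ~ Binomial(x, eps) and x2 = x - x1."""
    x1 = sum(1 for _ in range(x) if rng.random() < eps)
    return x1, x - x1


def covariance(a, b):
    """Sample covariance of two equal-length sequences."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b)) / (len(a) - 1)


# Illustrative settings (assumptions, not values from the source).
rng = random.Random(0)
lam, eps, n = 5.0, 0.3, 20000

draws = [sample_poisson(lam, rng) for _ in range(n)]
folds = [thin_poisson(x, eps, rng) for x in draws]
x1 = [fold[0] for fold in folds]
x2 = [fold[1] for fold in folds]

mean_x1 = sum(x1) / n
mean_x2 = sum(x2) / n
cov_x1_x2 = covariance(x1, x2)

# The two folds always add back up to the original count.
reconstructs = all(a + b == x for (a, b), x in zip(folds, draws))

print("mean of X(1):", mean_x1, "target:", eps * lam)
print("mean of X(2):", mean_x2, "target:", (1 - eps) * lam)
print("sample covariance:", cov_x1_x2)
print("folds sum to X:", reconstructs)
```

The sample means should land close to ελ and (1 − ε)λ, and the sample covariance should be near zero, which is consistent with the two folds being independent. In the generalized setting described above, the sum in the reconstruction step would be replaced by some other known function of the folds.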

Metrics

Dataset Index

2.4

FAIR Score

85%

Citations

1

Mentions

0

Publication Details

DOI

Publisher

Taylor & Francis

Assigned Domain

Subfield

Statistics and Probability

Field

Mathematics

Domain

Physical Sciences

Confidence Score

44%

Source

Scholar Data Model

Keywords

Biophysics; Biochemistry; Physical Sciences not elsewhere classified; Microbiology; FOS: Biological sciences; Cell Biology; Molecular Biology; Biotechnology; Evolutionary Biology; Chemical Sciences not elsewhere classified; Ecology; Immunology; FOS: Clinical medicine; Marine Biology; Cancer; Infectious Diseases; FOS: Health sciences

Normalization Factors

FT

13.46

CTw

1.00

MTw

1.00