Tier 3 Fuser Engine

Hybrid Dataset Fuser & Extrapolator

Combine raw academic baseline relationships with synthetic scaling to expand datasets up to 100x while preserving Pearson correlations. Runs 100% offline, powered by copula algorithms.

Academic Baselines

Scaling Coefficient

Dataset Expansion: 25x size
Range: 2x multiplier to 100x expansion
Original baseline: 5 records
Extrapolated output: 125 records

Interactive Pearson Correlation Heatmap

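| | Age | Cholesterol | Max_HR | Systolic_BP | Risk_Score |
| --- | --- | --- | --- | --- | --- |
| Age | 1.00 | +0.45 | -0.60 | +0.35 | +0.55 |
| Cholesterol | +0.45 | 1.00 | -0.20 | +0.40 | +0.65 |
| Max_HR | -0.60 | -0.20 | 1.00 | -0.15 | -0.50 |
| Systolic_BP | +0.35 | +0.40 | -0.15 | 1.00 | +0.48 |
| Risk_Score | +0.55 | +0.65 | -0.50 | +0.48 | 1.00 |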

Hover over matrix cells to inspect Pearson $r$ strengths. Blue cells represent positive correlation, while red cells indicate negative (inverse) correlation.

Scientific Guide: Hybrid Copula Extrapolation & Trend Preservation

The Concept of Hybrid Intelligence

While purely synthetic datasets are highly secure, they often lack the authentic nuance of real-world patterns needed to train high-fidelity ML or deep learning models. Conversely, relying solely on real, un-augmented academic datasets leads to data scarcity and bottlenecks. **Hybrid Dataset Expansion** (Tier 3) is designed to combine the strengths of both approaches.

By fusing a small, highly vetted, compliant academic baseline with a synthetic generator, we extract genuine multivariate relationship coefficients first. Then, using client-side copula algorithms, we scale the dataset by 10x-100x, generating realistic records that mirror original patterns without replicating sensitive individual profiles.
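The expansion step described above can be sketched as follows. This is a minimal illustration, not the tool's actual implementation: it assumes a Gaussian copula (the copula family is not stated here), uses only the Python standard library, and the helper names `cholesky`, `empirical_quantile`, and `expand` as well as the baseline values are hypothetical. The only figure taken from this page is the Age / Max_HR coefficient of -0.60 and the 5-record, 25x setting.

```python
import math
import random
from statistics import NormalDist


def cholesky(matrix):
    """Return the lower-triangular Cholesky factor of a positive-definite matrix."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            partial = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - partial)
            else:
                lower[i][j] = (matrix[i][j] - partial) / lower[j][j]
    return lower


def empirical_quantile(sorted_values, u):
    """Linearly interpolate the baseline's empirical quantile at probability u."""
    position = u * (len(sorted_values) - 1)
    low = int(math.floor(position))
    high = min(low + 1, len(sorted_values) - 1)
    weight = position - low
    return sorted_values[low] * (1 - weight) + sorted_values[high] * weight


def expand(baseline_columns, correlation, factor, seed=0):
    """Draw factor * n synthetic rows whose latent normals follow `correlation`."""
    rng = random.Random(seed)
    lower = cholesky(correlation)
    sorted_columns = [sorted(column) for column in baseline_columns]
    n_rows = factor * len(baseline_columns[0])
    std_normal = NormalDist()
    rows = []
    for _ in range(n_rows):
        independent = [rng.gauss(0.0, 1.0) for _ in sorted_columns]
        # Correlate the independent draws by multiplying with the Cholesky factor.
        correlated = [
            sum(lower[i][k] * independent[k] for k in range(i + 1))
            for i in range(len(sorted_columns))
        ]
        # Map each latent normal to (0, 1), then onto the baseline's marginal.
        rows.append([
            empirical_quantile(sorted_columns[i], std_normal.cdf(z))
            for i, z in enumerate(correlated)
        ])
    return rows


# Hypothetical 5-record baseline; only the -0.60 coefficient comes from the heatmap.
age = [41.0, 48.0, 55.0, 62.0, 70.0]
max_hr = [172.0, 165.0, 150.0, 141.0, 128.0]
correlation = [[1.0, -0.60], [-0.60, 1.0]]

synthetic = expand([age, max_hr], correlation, factor=25)
print(len(synthetic))  # 125 rows: 5 baseline records at a 25x scaling coefficient
```

Because values are interpolated between baseline observations, no synthetic row has to copy an original record, although every value stays within the baseline's observed range. Note also that a Gaussian copula fixes the correlation of the latent normal variables; the Pearson $r$ of the output is typically close to the target but not identical to it, which is one reason to re-measure it after expansion.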

Pearson Correlations & Mathematical Fit

To confirm that our hybrid dataset expansion does not drift away from genuine patterns, we evaluate the pairwise relationship between every two variables using the **Pearson Correlation Coefficient ($r$)**:

$$r = \frac{\operatorname{Cov}(X, Y)}{\sigma_X \, \sigma_Y}$$
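As a small worked example, the coefficient can be computed directly from its definition. The function name `pearson_r` and the sample values are illustrative assumptions; the sketch uses population (divide-by-n) covariance and standard deviations, and the choice of n versus n-1 cancels out in the ratio.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return covariance / (std_x * std_y)


# Hypothetical values: ys falls by exactly 2 for every unit increase in xs.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [10.0, 8.0, 6.0, 4.0, 2.0]
r = pearson_r(xs, ys)  # perfectly linear and decreasing, so r is -1 up to rounding
```

Running the same function on each pair of columns before and after expansion gives a direct check on how far the synthetic correlations have moved from the baseline matrix.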

Our client-side copula simulator uses these extracted coefficients to keep the expanded dataset structurally close to the baseline. We validate the fit of the joint probability distribution using the **Wasserstein Distance** (also known as the Earth Mover's Distance) and the **Kullback-Leibler (KL) Divergence**. We target fit distances below 0.05, a threshold we associate with 99.5% academic research utility, at which point the synthetic outputs are treated as model-ready for downstream AI/ML tasks.
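To make the two fit measures concrete, here is a simplified per-column sketch. It is an assumption-laden illustration rather than the validation routine used by the tool: `wasserstein_1d` handles only one variable at a time and requires equal-sized samples, `kl_divergence` expects two histograms over the same bins, and both names are hypothetical. The Wasserstein distance is expressed in the units of the variable, so a fixed threshold such as 0.05 is only meaningful if columns are first rescaled to a common range (an assumption, since the normalization is not described here).

```python
import math


def wasserstein_1d(sample_a, sample_b):
    """1-D Wasserstein-1 distance between two equal-sized samples."""
    if len(sample_a) != len(sample_b):
        raise ValueError("this sketch assumes equal sample sizes")
    # For equal-sized samples, the optimal transport plan pairs sorted values.
    pairs = zip(sorted(sample_a), sorted(sample_b))
    return sum(abs(a - b) for a, b in pairs) / len(sample_a)


def kl_divergence(p, q, epsilon=1e-12):
    """KL(P || Q) in nats for two discrete distributions over the same bins."""
    return sum(
        pi * math.log(pi / max(qi, epsilon))  # epsilon guards against empty bins in q
        for pi, qi in zip(p, q)
        if pi > 0
    )


# Hypothetical samples: the second is the first shifted by exactly 1.
shift_distance = wasserstein_1d([0.0, 1.0, 2.0], [1.0, 2.0, 3.0])

# Hypothetical two-bin histograms (each sums to 1).
divergence = kl_divergence([0.5, 0.5], [0.25, 0.75])
```

Here `shift_distance` equals the size of the shift, and `divergence` equals `0.5 * ln(4/3)`. Unlike the Wasserstein distance, the KL divergence is not symmetric, so it matters which distribution (baseline or synthetic) is passed as `p`.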