Synthesizing tabular data with Gaussian copulas

How do you generate fake data that closely resembles your original data? This post discusses one approach to doing it.

Introduction

Recent advances in machine learning have revolved around unstructured data: data such as audio, images and graphs that can't be represented in a tabular format. However, the data most often used in businesses tends to be tabular, stored in spreadsheets or databases.

This data sometimes needs to be shared internally within an organization, or externally with other organizations, in a way that preserves privacy. One approach to preserving privacy is to use synthetic data, which contains fictitious information in place of personally identifying information. The fictitious data should nevertheless retain as much utility as the original: it should share enough of the original's statistical characteristics that an analysis run on it reaches conclusions similar to the same analysis run on the original data.

This approach to privacy is quite popular, and there are vendors like Synthetic Data Vault that offer synthetic data generation as a service. One of the methods they provide for generating synthetic data is GaussianCopulaSynthesizer(), which is based on the Gaussian copula. In this post I'll briefly talk about the Gaussian copula: what it is and how it is used to generate synthetic tabular data.
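As a quick taste, here is a minimal sketch of what that looks like with SDV's Python API (the interface below follows the 1.x releases and may differ in other versions; the toy dataframe is purely illustrative):

```python
import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer

# Toy table standing in for real business data.
df = pd.DataFrame({
    "age": [34, 45, 23, 51, 38],
    "salary": [52_000.0, 81_000.0, 39_000.0, 95_000.0, 61_000.0],
    "department": ["sales", "eng", "sales", "hr", "eng"],
})

metadata = SingleTableMetadata()
metadata.detect_from_dataframe(data=df)        # infer column types from the data

synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(df)                            # fit marginals and the copula
fake = synthesizer.sample(num_rows=100)        # generate synthetic rows
```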

Gaussian copulas

Tabular datasets are made up of different types of columns, and the data in each of these columns comes from some unknown distribution. Often the columns are related: for any row, there is a relationship amongst that row's values. Hence the distributions of the columns are not independent; to sample a given row we have to draw from the unknown joint distribution of the columns. This is where copulas come in handy.

A copula is simply a multivariate cumulative distribution function (CDF) with standard Uniform marginals. This means that we can decompose the CDF of a joint distribution into independent univariate CDFs and a copula that describes the dependency amongst the variables of the distribution (Ruppert & Matteson, 2011; Wiecki, 2018).

Suppose we have a random vector, $\textbf{X} = (X_{1}, ..., X_{d})$, and marginal CDFs, $(F_{X_{1}}, ..., F_{X_{d}})$. The multivariate CDF, $F_{\textbf{X}}$, can then be expressed via a copula, $C_{\textbf{X}}$, as follows

$$
\begin{aligned}
F_{\textbf{X}}(x_{1}, ..., x_{d}) &= \text{Pr}(X_{1} \leq x_{1}, ..., X_{d} \leq x_{d}) \\
&= \text{Pr}(F_{X_{1}}(X_{1}) \leq F_{X_{1}}(x_{1}), ..., F_{X_{d}}(X_{d}) \leq F_{X_{d}}(x_{d})) \\
&= \text{Pr}(U_{1} \leq u_{1}, ..., U_{d} \leq u_{d}) = C_{\textbf{X}}(u_{1}, ..., u_{d})
\end{aligned} \tag{1}
$$

The CDF of each random variable, $X_{i}$, is used to transform said variable into a standard Uniform variable, $U_{i} = F_{X_{i}}(X_{i})$, and the dependency between the variables is then captured by the copula. The handiness of copulas comes from the fact that we don't need to know the joint distribution of our original data; once we assume a dependency structure, we can use the copula to sample random vectors. It is therefore important to choose a copula whose dependency structure best fits the data, so that the sampled values resemble those of the original distribution.

There are different families of copulas, and each assumes something about the dependency structure of the data. The Gaussian copula assumes there are no tail dependencies within the joint distribution. Tail dependence is a measure of association between the extreme values of a pair of random variables: upper tail dependence is an association between the extremely large values of each variable, and lower tail dependence is an association between the extremely small values. If we tend to observe extreme values of one random variable whenever another takes extreme values, the joint distribution exhibits tail dependence and a Gaussian copula would not be suitable. The exception is when the variables are perfectly correlated; that is the only case in which the Gaussian copula exhibits tail dependence.
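To make this precise, the upper tail dependence coefficient of a pair of variables $(X_{1}, X_{2})$ is commonly defined as

$$ \lambda_{U} = \lim_{q \to 1^{-}} \text{Pr}\left(X_{2} > F_{X_{2}}^{-1}(q) \mid X_{1} > F_{X_{1}}^{-1}(q)\right) $$

For the Gaussian copula, $\lambda_{U} = 0$ whenever the correlation is strictly less than 1, and the lower tail coefficient behaves the same way.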

Inference

To use a Gaussian copula, we need to infer the correlation between the columns of the data. We don't use the Pearson correlation here because it is a linear correlation and depends on the marginal distributions of the data: when we transform a column using its CDF, its distribution changes and so does its linear correlation with the other columns. A copula is invariant to strictly increasing transformations such as CDF transformations, so we want a measure of correlation that persists across such transformations.

Rank correlations don't depend on the marginal distributions, which makes them a more suitable choice: when we transform a column using its CDF and its distribution changes, its rank correlation with the other columns remains the same. Hence the correlations used for copulas are rank correlations such as Spearman's rho and Kendall's tau.
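For the Gaussian copula in particular, both rank correlations have closed-form relationships with the Pearson correlation, $\rho$, of the underlying Gaussian variables, so we can estimate a rank correlation from the data and convert it:

$$ \rho = \sin\left(\frac{\pi}{2}\tau\right), \qquad \rho = 2\sin\left(\frac{\pi}{6}\rho_{S}\right) $$

where $\tau$ is Kendall's tau and $\rho_{S}$ is Spearman's rho.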

Data Synthesis

Once we have the correlation matrix for the data, we can use it to synthesize fake data as follows:

  1. Generate a random Gaussian vector
  2. Correlate the variables using the correlation matrix
  3. Transform the correlated Gaussian vector into a uniform vector
  4. Inverse transform the uniform vector into a vector from the joint distribution

By using a combination of the CDF and quantile transforms, we can convert a random vector from the Gaussian distribution into a random vector from our joint distribution. Throughout this transformation, we retain the dependency structure, or copula, that we've inferred from our data.

The aforementioned steps are described by the following algorithm:

Compute the correlation matrix, $\textbf{P}$, of the data
Compute the Cholesky decomposition, $\textbf{A}$, of $\textbf{P}$ such that $\textbf{P} = \textbf{A}^{T}\textbf{A}$
For $n$ samples:

  Sample a random vector, $\textbf{Z} \sim \mathcal{N}(\textbf{0}, \textbf{I}_{d})$
  Compute the correlated vector, $\textbf{X} = \textbf{A}^{T}\textbf{Z}$
  Compute the copula vector, $\textbf{U} = (\Phi(X_{1}), ..., \Phi(X_{d}))$, using the standard Gaussian CDF, $\Phi$
  Compute the joint distribution vector, $\textbf{Y} = (F_{1}^{-1}(\Phi(X_{1})), ..., F_{d}^{-1}(\Phi(X_{d})))$, using the quantile functions, $F_{1}^{-1}, ..., F_{d}^{-1}$, of the columns
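Below is a minimal NumPy/SciPy sketch of this algorithm for a two-column table. The toy data and the use of empirical quantiles as the marginal quantile functions are my own assumptions; note also that numpy's Cholesky routine returns a lower-triangular $\textbf{A}$ with $\textbf{P} = \textbf{A}\textbf{A}^{T}$, so the correlating step is written accordingly:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Toy "original" data: two dependent columns with non-Gaussian marginals.
n = 5_000
latent = rng.normal(size=(n, 2)) @ np.linalg.cholesky([[1.0, 0.7], [0.7, 1.0]]).T
data = np.column_stack([
    stats.expon.ppf(stats.norm.cdf(latent[:, 0])),       # skewed marginal
    stats.beta(2, 5).ppf(stats.norm.cdf(latent[:, 1])),  # bounded marginal
])

# 1. Infer the copula correlation matrix from Kendall's tau.
tau, _ = stats.kendalltau(data[:, 0], data[:, 1])
rho = np.sin(np.pi * tau / 2)          # Gaussian copula conversion
P = np.array([[1.0, rho], [rho, 1.0]])

# 2. Cholesky factor (lower triangular here: P = A @ A.T).
A = np.linalg.cholesky(P)

# 3. Sample: correlate standard normals, map them to uniforms via the
#    Gaussian CDF, then push each uniform through its column's
#    (empirical) quantile function.
Z = rng.normal(size=(1_000, 2))
X = Z @ A.T                            # correlated Gaussian vectors
U = stats.norm.cdf(X)                  # copula sample in [0, 1]^2
Y = np.column_stack([np.quantile(data[:, j], U[:, j]) for j in range(2)])
```

The rows of Y are the synthetic samples: their marginals match the original columns and their dependence follows the fitted Gaussian copula.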

Categorical Representation

Copulas only work with continuous distributions, but most tabular data contains both categorical and continuous columns. So how do we handle the categorical data?

The approach suggested by Patki et al. (2016) is to use a continuous distribution to represent each category, $k$, in a column. We first establish an ordering for the categories in the column. Starting from 0, we create a set of cumulative sums of the categorical probabilities such that the final sum is 1. The ordered sums identify a sub-interval, $(a_{k}, b_{k})$, within $[0, 1]$ for each category. Each sub-interval is then represented by a truncated Gaussian with the following parameters

$$ \mu_{k} = \frac{a_{k} + b_{k}}{2}, \quad \sigma_{k} = \frac{b_{k} - a_{k}}{6} $$

We can go back and forth between continuous and categorical representation using this approach. To get the continuous representation of a category, we sample from the corresponding truncated gaussian of said category. To get the categorical representation of a continuous value, we simply find which sub-interval along $[0, 1]$ it falls into and return the corresponding category.
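Here is a small sketch of this back-and-forth mapping (the helper names are hypothetical; scipy's truncnorm expects its bounds in standard-deviation units, hence the rescaling):

```python
import numpy as np
from scipy import stats

def categorical_to_continuous(values, categories, probs, rng):
    """Replace each category with a draw from its truncated Gaussian."""
    edges = np.concatenate([[0.0], np.cumsum(probs)])  # cumulative sums: 0, ..., 1
    out = np.empty(len(values))
    for k, cat in enumerate(categories):
        a_k, b_k = edges[k], edges[k + 1]
        mu, sigma = (a_k + b_k) / 2, (b_k - a_k) / 6
        lo, hi = (a_k - mu) / sigma, (b_k - mu) / sigma  # bounds in sigma units
        mask = np.asarray(values) == cat
        out[mask] = stats.truncnorm.rvs(lo, hi, loc=mu, scale=sigma,
                                        size=mask.sum(), random_state=rng)
    return out

def continuous_to_categorical(x, categories, probs):
    """Map each value in [0, 1] back to the category whose sub-interval holds it."""
    edges = np.cumsum(probs)
    return [categories[np.searchsorted(edges, v)] for v in np.asarray(x)]

rng = np.random.default_rng(0)
cats, p = ["red", "green", "blue"], [0.5, 0.3, 0.2]
cont = categorical_to_continuous(["red", "blue", "green"], cats, p, rng)
assert continuous_to_categorical(cont, cats, p) == ["red", "blue", "green"]
```

Because each truncated Gaussian is centered in its sub-interval with $\pm 3 \sigma_{k}$ spanning it, draws stay inside the interval and the round trip recovers the original categories.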

Conclusion

In this post I've briefly outlined what a Gaussian copula is, what assumptions it makes about the data, and how to generate fake data using it. There are other approaches to generating fake data; one such approach is to use deep learning methods like the Conditional Tabular GAN (CTGAN). It would therefore be interesting to compare data generated by copulas against data generated by other methods: how does it compare with the original, how useful is it, and how much privacy does it preserve?

  1. Ruppert, D., & Matteson, D. S. (2011). Statistics and data analysis for financial engineering (Vol. 13). Springer.
  2. Wiecki, T. (2018). An Intuitive, Visual Guide To Copulas. https://twiecki.io/blog/2018/05/03/copulas/
  3. Patki, N., Wedge, R., & Veeramachaneni, K. (2016). The synthetic data vault. 2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA), 399–410.