Star Republic: Guide for Biologists

Data analysis --- missing data fill-in

1. cDNA or oligo arrays frequently contain data points of poor quality. These noisy data points are usually flagged and removed from the dataset.

2. Many statistical algorithms are not designed to handle missing data.

3. To run these algorithms, missing data must be filled in with a calculated guess that should have a minimal impact on the analytic results.

4. Three options are frequently used:

A. Fill in missing data with 0 for ratio data in log scale (log(R/G)) and 1 for ratio data in linear (unlogged) scale (R/G). This is equivalent to assuming that the gene's expression does not change relative to the control. Since the expression of the great majority of genes does not change, this is a reasonable assumption.

B. Column (array) mean --- Fill in missing data with the average expression level of all the genes on the same array.

C. Row (gene) mean --- Fill in missing data with the average expression level of all the samples for the same spot.
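
The three options above can be sketched in a few lines of Python with NumPy. This is only an illustration, not how any particular tool implements them. It assumes the data sit in a matrix with one row per gene (spot) and one column per array, and that removed data points are stored as NaN. The function name fill_missing and its parameters are made up for this example.

```python
import numpy as np

def fill_missing(data, method, log_scale=True):
    """Return a copy of `data` (genes x arrays) with NaN entries filled in.

    method: "constant" (option A), "column_mean" (option B),
            or "row_mean" (option C).
    """
    filled = np.array(data, dtype=float)   # work on a copy
    missing = np.isnan(filled)
    rows, cols = np.where(missing)         # positions of the missing spots

    if method == "constant":
        # 0 for log ratios, 1 for linear ratios: "no change vs. control"
        filled[missing] = 0.0 if log_scale else 1.0
    elif method == "column_mean":
        # mean of the observed values on the same array
        filled[missing] = np.nanmean(filled, axis=0)[cols]
    elif method == "row_mean":
        # mean of the observed values for the same gene across arrays
        filled[missing] = np.nanmean(filled, axis=1)[rows]
    else:
        raise ValueError("unknown method: " + str(method))
    return filled

# Toy log-ratio matrix: 3 genes (rows) x 3 arrays (columns)
log_ratios = np.array([
    [1.0, np.nan, 3.0],
    [2.0, 4.0, np.nan],
    [3.0, 2.0, 1.0],
])

by_constant = fill_missing(log_ratios, "constant")
by_column = fill_missing(log_ratios, "column_mean")
by_row = fill_missing(log_ratios, "row_mean")
```

The means are computed from the observed values only (np.nanmean skips NaN), so the filled-in values do not depend on the order in which the spots are processed.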

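5. When similarity is measured by Pearson correlation, filling in with the column mean or row mean has an interesting effect:

As far as sample-sample correlation is concerned, filling in with the column (array) mean is in effect equivalent to ignoring the missing data. The filled-in value equals the array's mean, so its deviation from the mean is zero and it adds nothing to the correlation sum.

As far as gene-gene correlation is concerned, filling in with the row (gene) mean is in effect equivalent to ignoring the missing data, for the same reason.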
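
A small numerical check of this effect, again as a sketch with made-up numbers: two arrays are compared, one of which has a flagged spot that is filled in with the array mean. The variable names are hypothetical. Note that the spot's value on the other, complete array still enters that array's own mean and variance, so the result can differ slightly from dropping the spot from both arrays.

```python
import numpy as np

# Two arrays (columns) measured over the same five spots
x = np.array([1.0, 2.0, np.nan, 4.0, 6.0])   # array A, one spot flagged
y = np.array([2.0, 1.0, 5.0, 3.0, 7.0])      # array B, complete

observed = ~np.isnan(x)
x_filled = np.where(observed, x, np.nanmean(x))  # column (array) mean fill-in

# Deviations from the mean, as used in the Pearson correlation
dx = x_filled - x_filled.mean()
dy = y - y.mean()

# The filled-in spot has zero deviation on array A ...
filled_deviation = dx[~observed][0]

# ... so the sum of cross-products is the same with or without that spot
numerator_all = np.sum(dx * dy)
numerator_observed = np.sum(dx[observed] * dy[observed])

# Pearson correlation computed from the filled-in data
r_filled = numerator_all / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2))
```

The same check applies to gene-gene correlation with row (gene) mean fill-in if x and y are taken to be two rows of the matrix instead of two columns.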

6. The above data fill-in options are provided in MicroHelper.

7. For more sophisticated missing data fill-in options, consult a statistician.