1. cDNA or oligo arrays frequently have data points that
are of poor quality. These noisy data points are usually flagged
and removed from the dataset.
2. Many statistical algorithms are not designed to handle
missing data.
3. To run these algorithms, missing data must be filled in with
a calculated guess that should have a minimal impact on the analytic results.
4. Three options are frequently used:
A. Fill in missing data with 0 for ratio data in log scale (log(R/G)) and 1 for
ratio data in unlogged scale (R/G). This is equivalent to assuming that
the gene's expression doesn't change relative to the control. Since the expression of the great
majority of genes doesn't change, this is a reasonable assumption.
B. Column (array) mean --- Fill in missing data with the average expression level of all the genes on
the same array.
C. Row (gene) mean --- Fill in missing data with the average expression level of
all the samples for the same spot.
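The three options above can be sketched in a few lines of Python with NumPy. This is only an illustration, not how MicroHelper implements them: the function names, the sample matrix, and the convention that missing values are stored as NaN with rows as genes and columns as arrays are all assumptions made for the example.

```python
import numpy as np

def fill_constant(data, log_scale=True):
    """Option A: use 0 for log ratios, 1 for unlogged ratios (hypothetical helper)."""
    filled = data.copy()
    filled[np.isnan(filled)] = 0.0 if log_scale else 1.0
    return filled

def fill_column_mean(data):
    """Option B: replace each missing value with the mean of its column (array)."""
    col_means = np.nanmean(data, axis=0)            # one mean per array
    return np.where(np.isnan(data), col_means, data)

def fill_row_mean(data):
    """Option C: replace each missing value with the mean of its row (gene)."""
    row_means = np.nanmean(data, axis=1, keepdims=True)  # one mean per gene
    return np.where(np.isnan(data), row_means, data)

# Assumed layout: rows are genes, columns are arrays, NaN marks a removed spot.
data = np.array([[1.0, 2.0, np.nan],
                 [0.0, np.nan, 3.0],
                 [2.0, 4.0, 6.0]])

print(fill_constant(data))
print(fill_column_mean(data))
print(fill_row_mean(data))
```

All three helpers return a new matrix and leave the original untouched, so the options can be compared side by side on the same data.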
5. When similarity is measured by Pearson correlation, filling in with the
column mean or row mean has an interesting effect, because a filled-in value
has zero deviation from the mean and adds nothing to the covariance:
As far as sample-sample correlation is concerned, filling in with the
column (array) mean is essentially equivalent to ignoring the missing data.
As far as gene-gene correlation is concerned, filling in with the
row (gene) mean is essentially equivalent to ignoring the missing data.
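The reason behind this effect can be checked numerically. The sketch below uses two made-up arrays, x and y, with one missing spot in x (the values and names are assumptions for illustration only). A value filled in with the array mean has zero deviation from that mean, so it contributes nothing to the sum of cross-products in the numerator of the Pearson correlation. The correlation coefficient itself may still differ slightly from the one computed after dropping the spot, because the variance of the other array is still computed over all of its spots.

```python
import numpy as np

def cross_product_sum(x, y):
    """Numerator of the Pearson correlation: sum of products of deviations."""
    return float(np.sum((x - x.mean()) * (y - y.mean())))

# Two hypothetical arrays (columns); x has one missing spot marked as NaN.
x = np.array([1.0, 2.0, np.nan, 4.0, 6.0])
y = np.array([2.0, 1.0, 5.0, 3.0, 7.0])

observed = ~np.isnan(x)
x_filled = np.where(observed, x, np.nanmean(x))   # column (array) mean fill-in

# Numerator with the fill-in versus numerator with the missing spot dropped.
num_filled = cross_product_sum(x_filled, y)
num_ignored = cross_product_sum(x[observed], y[observed])

# Full correlation coefficients, for comparison.
r_filled = np.corrcoef(x_filled, y)[0, 1]
r_ignored = np.corrcoef(x[observed], y[observed])[0, 1]

print(num_filled, num_ignored)
print(r_filled, r_ignored)
```

The same check applies to gene-gene correlation with the row (gene) mean if x and y are taken to be two rows instead of two columns.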
6. The above data fill-in options are provided in MicroHelper.
7. For more sophisticated missing data fill-in options, consult a statistician.