FAQ
-
How do you measure coexpression?
CressExpress performs linear regression using expression values harvested from publicly available microarray data. When you enter a list of query probe set ids (or genes), the tool regresses each query's expression values against the expression values for every probe set on the chosen array platform. The query probe sets correspond to the y axis in the regression plot. Each regression yields the following values:
- r-squared - the square of Pearson's correlation coefficient. It ranges from 0 to 1, and values closer to 1 indicate a tighter fit.
- p value - the probability of obtaining an r-squared at least this large under the assumption that there is no linear relationship between variables x and y, e.g., your query probe set and another probe set on the array
- DOFs - degrees of freedom, the number of points included in the regression, minus one
- Slope - the slope of the regression line. Negative slopes indicate a negative co-expression relationship, and positive slopes indicate positive co-expression
For some examples of positive and negative co-expression, see Figure 1 in Wei et al., Plant Physiology, 2006 Oct;142(2):762-74 [PMID: 16920875 / DOI: 10.1104/pp.106.080358]
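As a rough illustration of the values listed above, the following Python sketch fits a single regression using only the standard library. The function name `regress` and the returned field names are hypothetical and are not taken from the CressExpress code. The degrees of freedom follow the definition given in this FAQ (number of points minus one). The p value is left out because it needs a t-distribution, which the standard library does not provide; a statistics package such as SciPy (`scipy.stats.linregress`) reports one.

```python
def regress(x, y):
    """Least-squares fit of y on x; y plays the role of the query probe set.

    x and y are equal-length lists of expression values, one per array.
    """
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("x and y must have the same length (at least 3)")

    mean_x = sum(x) / n
    mean_y = sum(y) / n

    # Sums of squares and cross-products around the means.
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("expression values must not be constant")

    slope = sxy / sxx
    return {
        "slope": slope,                       # sign gives the direction
        "intercept": mean_y - slope * mean_x,
        "r_squared": sxy * sxy / (sxx * syy), # square of Pearson's r
        "dof": n - 1,                         # as defined in this FAQ
    }
```

A negative `slope` together with a high `r_squared` would correspond to a tight negative co-expression relationship.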
-
Which queries should I use? Probe set ids or gene ids?
A probe set is a collection of probes on the Affymetrix array that measures expression of a target gene. Because the labeled sample RNAs are anti-sense with respect to the target gene, the probes themselves precisely match the sequence of the target gene. Because most genomes contain many duplicated sequence segments, many probe sets are non-specific and can interrogate more than one gene. In addition, some genes are not represented on some array platforms; it depends on when the gene was discovered, when the array was designed, and other factors. For these reasons, we strongly recommend using probe set ids, rather than gene ids, as inputs to the co-expression analysis tool. However, if you prefer to enter gene ids, the tool will attempt to match the gene ids you entered with probe set ids and expression values from the array platform you choose. Either way, the tool output will report a pairing between target gene ids and probe set ids; we get these from an array annotation file TAIR provides on its FTP site.
-
Which data set release should I use?
We have found that the RMA-processed, quantile-normalized data (Releases 2.0 and 3.0 below) seem to provide the best co-expression results, as judged by analyzing pairs of genes we expect to be co-expressed, such as genes in the same metabolic pathway. However, we also provide MAS5-processed data to facilitate comparisons with other co-expression tools that use MAS5 and to let users see how the different microarray processing algorithms affect results.
Details on individual data releases appear below:
| Release | Data sources | Number of arrays (slides) | Array processing/normalization | Published example analyses |
|---|---|---|---|---|
| 2.0 | AffyWatch I, II | 486 ATH1, 80 AG | RMA | Wei et al.; Cui et al. |
| 3.0 | AffyWatch I, II, III | 1779 ATH1, 80 AG | RMA | None as yet |
| 3.1 | AffyWatch I, II, III | 1779 ATH1 | GCRMA | None as yet |
| 3.2 | AffyWatch I, II, III | 1779 ATH1 | MAS5/log2 transformation, divide by average | None as yet |
-
Which array type should I choose?
The CressExpress database includes data harvested from the Affymetrix AG (Arabidopsis Genome) and ATH1 (see Redman et al. [PMID: 15086809 / DOI: 10.1111/j.1365-313X.2004.02061.x]) arrays. The database includes many more ATH1 than AG arrays, so for most purposes ATH1 is likely the better choice.
-
What tissue types should I select?
The tissue types listed in Step 3 are taken directly from the descriptions of samples and experiments as provided by the NASC AffyWatch service. The tissue types are essentially just free text descriptions of individual samples that were provided by the experimenters who donated their data to the NASC array collection. If you expect that your query genes are likely to be expressed in a subset of tissues, it would probably be a good idea to select those. However, we would recommend you try selecting ALL (to include all sample types) for at least one run of the tool, so that you can get an idea of how patterns of correlated expression vary across different cellular settings.
-
What experiments should I include in an analysis?
Step 4 presents a table of experiments available for inclusion in the analysis. Only experiments corresponding to the array type and tissue types selected on previous screens should appear. Some experiments may include only a few arrays that match the previously selected tissue types. If you select these experiments, then only the arrays that fit the tissue type criteria selected in Step 3 will be included in the analysis.
The table displays Experiment IDs (from NASC) along with some text describing each experiment. The text is from descriptions provided by the original experimenters. To find out more about each experiment, follow the links to a page at NASC.
-
What is PLC (pathway-level co-expression) analysis?
PLC identifies genes and probe sets that are co-expressed with all or
some of the query genes/probe sets you entered in Step 2. PLC and how it
works are described in detail in
Wei et al. [PMID: 16920875 / DOI: 10.1104/pp.106.080358].
PLC uses the output from the co-expression linear regression analysis to
identify probe sets and target genes that are co-expressed with two or
more of your query genes. PLC then reports these co-expression partners
in order of the number of query genes with which they are co-expressed
and, then, to break ties, in order of average r-squared values.
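The ranking described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `plc_rank`, the input layout (one `(query, partner, r_squared)` tuple per regression), and the `r2_cutoff` threshold used to decide whether a pair counts as co-expressed are all hypothetical; this FAQ does not say which cutoff the tool applies.

```python
from collections import defaultdict


def plc_rank(pairs, r2_cutoff=0.5, min_queries=2):
    """Rank co-expression partners shared by two or more query genes.

    pairs: iterable of (query_id, partner_id, r_squared) tuples.
    r2_cutoff: hypothetical threshold for calling a pair co-expressed.
    Returns (partner_id, number_of_queries, average_r_squared) tuples.
    """
    # partner -> {query: r_squared} for every pair that passes the cutoff
    hits = defaultdict(dict)
    for query, partner, r2 in pairs:
        if partner != query and r2 >= r2_cutoff:
            hits[partner][query] = r2

    rows = []
    for partner, by_query in hits.items():
        if len(by_query) >= min_queries:
            avg_r2 = sum(by_query.values()) / len(by_query)
            rows.append((partner, len(by_query), avg_r2))

    # Most query genes first; ties broken by higher average r-squared.
    rows.sort(key=lambda row: (-row[1], -row[2], row[0]))
    return rows
```

Partners that pass the cutoff for only one query gene are dropped, which mirrors the "two or more of your query genes" requirement.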
In essence, PLC analysis identifies the part of the overall co-expression
network that directly neighbors two or more of your query genes or probe
sets. The co-expression tool doesn't actually compute the entire network; it
knows enough about the network to find the shared neighbors for all your
query genes.
When you download the results, you'll get several files: an HTML file for
browsing results; several files you can open in
Cytoscape to visualize the partial co-expression network
that contains just your query genes and their neighbors; and a comma-separated
file with PLC results that you can load into Excel or other programs.
-
Why do I need to enter my email in Step 6?
CressExpress analyses will typically take several minutes (possibly more) to finish. When your analysis run completes, the tool will send you an email with a URL telling you where you can obtain your results files.
If you like, you can enter multiple email addresses. Just enter them as a comma-separated list (e.g., mary@stevens.org,foo@bar.net).
-
What is the K-S test and how do I use it to filter "bad" chips?
You can choose to filter out "bad" chips using results from quality-control
analysis we've performed. The QC procedure we use is based on analysis of studentized deleted residuals; details on how this works appear in Persson
et al. (2005) [PMID: 15932943 / DOI: 10.1073/pnas.0503392102]
and Trivedi et al. [PMID: 15813968 / DOI: 10.1186/1471-2105-6-86]. In a nutshell, to exclude lower-quality arrays, decrease
the K-S parameter, which should be a value between 0 and 1.
How this works
For each probe set's expression value on each array in a group,
we calculate a deleted residual by subtracting the mean expression
value for that probe set across all arrays in the same group. We
then calculate the studentized deleted residual by dividing the
deleted residual from the first step by the probe set's standard deviation,
again computed across all the arrays in the same group. We then examine
all the studentized deleted residuals from a single array. If their distribution
deviates significantly from a t-distribution with
N-2 degrees of freedom, where N is the number of arrays in a group, then
we consider that array an outlier and recommend excluding it from further analyses.
To identify these outlier chips, we use a Kolmogorov-Smirnov (K-S) goodness-of-fit
test, which produces a test statistic (D) indicating how well the observed
distribution matches the theoretical t-distribution. The D statistic
ranges from 0 to 1, and larger numbers indicate greater deviation and
lower quality. We recommend 0.15 as a good default cutoff, which typically
removes 10-20% of the arrays.
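The procedure above can be sketched in Python as follows. This is a minimal illustration, not the code used to build the CressExpress QC results. It follows the description in this FAQ literally (mean and standard deviation taken across all arrays in the group), uses the sample standard deviation, and approximates the t-distribution CDF with Simpson's rule so that only the standard library is needed. All function names and the `expr[probe_set][array]` layout are assumptions made for the example.

```python
import math
from statistics import mean, stdev


def t_cdf(x, df, steps=2000):
    """Student's t CDF, integrating the density with Simpson's rule."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))

    def pdf(t):
        return math.exp(log_norm - (df + 1) / 2 * math.log1p(t * t / df))

    upper = abs(x)
    if upper == 0:
        return 0.5
    h = upper / steps  # steps must be even for Simpson's rule
    total = pdf(0.0) + pdf(upper)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    return min(1.0, max(0.0, 0.5 + math.copysign(total * h / 3, x)))


def ks_statistic(sample, cdf):
    """K-S D: largest gap between the empirical CDF and the given cdf."""
    xs = sorted(sample)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


def array_d_statistics(expr):
    """expr[p][a] is the expression of probe set p on array a (one group).

    Returns one D statistic per array.
    """
    n_arrays = len(expr[0])
    df = n_arrays - 2
    per_array = [[] for _ in range(n_arrays)]
    for values in expr:
        m, s = mean(values), stdev(values)
        if s == 0:
            continue  # a constant probe set carries no information
        for a, v in enumerate(values):
            per_array[a].append((v - m) / s)  # studentized residual
    cdf = lambda v: t_cdf(v, df)
    return [ks_statistic(res, cdf) for res in per_array]


def keep_arrays(d_stats, cutoff=0.15):
    """Indices of arrays whose D does not exceed the cutoff."""
    return [a for a, d in enumerate(d_stats) if d <= cutoff]
```

Lowering `cutoff` in `keep_arrays` excludes more arrays, which matches the advice above to decrease the K-S parameter when stricter filtering is wanted.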