FAQ


  • How do you measure coexpression?

    CressExpress performs linear regression using expression values harvested from publicly-available microarray data. When you enter a list of query probe set ids (or genes), the tool performs a linear regression comparing your query's expression values to expression values for all probe sets on a particular array platform. The query probe sets correspond to the y axis in the regression plot. Each regression yields the following values:

    • r-squared value, the square of Pearson's correlation coefficient. It ranges from 0 to 1, and numbers closer to 1 indicate a better, tighter regression.
    • a p value, the probability of obtaining an r-squared at least as large as the one observed, under the assumption that there is no linear relationship between variables x and y, e.g., your query probe set and another probe set on the array
    • DOFs - degrees of freedom, the number of points included in the regression, minus one
    • Slope - the slope of the regression line. Negative slopes indicate a negative co-expression relationship, and positive slopes indicate positive co-expression.

    For some examples of positive and negative co-expression, see Figure 1 in Wei, et al., Plant Physiology, 2006 Oct;142(2):762-74 [PMID: 16920875 / DOI: 10.1104/pp.106.080358].
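
    The regression values listed above can be sketched in a few lines of Python. This is a minimal illustration, not the CressExpress implementation: the function name regress and the example values are hypothetical, the DOF figure simply follows the definition given in this FAQ (number of points minus one), and the p value is left out because the Python standard library has no t-distribution (a library routine such as scipy.stats.linregress reports one).

```python
from statistics import mean

def regress(x, y):
    """Least-squares fit of y on x.

    y holds the query probe set's expression values (the y axis, as in the
    FAQ); x holds the values of another probe set on the same arrays.
    """
    n = len(x)
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return {
        "slope": slope,
        "intercept": my - slope * mx,
        "r_squared": (sxy * sxy) / (sxx * syy),  # square of Pearson's r
        "dof": n - 1,  # as defined in this FAQ: points minus one
    }

# Hypothetical expression values for two probe sets across four arrays.
other = [1.0, 2.0, 3.0, 4.0]
query_up = [2.0, 4.0, 6.0, 8.0]    # rises with the other probe set
query_down = [8.0, 6.0, 4.0, 2.0]  # falls as the other probe set rises

positive = regress(other, query_up)
negative = regress(other, query_down)
```

    Both fits give an r-squared of 1 because the points fall exactly on a line; only the sign of the slope separates positive from negative co-expression.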



  • Which queries should I use? Probe set ids or gene ids?

    A probe set is a collection of probes on the Affymetrix array that measures expression of a target gene. Because the labeled sample RNAs are anti-sense with respect to the target gene, the probes themselves precisely match the sequence of the target gene. Because most genomes contain many duplicated sequence segments, many probe sets are non-specific and can interrogate more than one gene. In addition, some genes are not represented on some array platforms; this depends on when the gene was discovered, when the array was designed, and other factors. For these reasons, we strongly recommend using probe set ids, rather than gene ids, as inputs to the co-expression analysis tool. However, if you prefer to enter gene ids, the tool will attempt to match the gene ids you entered with probe set ids and expression values from the array platform you choose. In either case, the tool output will report a pairing between target gene ids and probe set ids; we get these pairings from an array annotation file that TAIR provides on its ftp site.



  • Which data set release should I use?

    We have found that the RMA-processed, quantile-normalized data (Releases 2.0 and 3.0 below) seem to provide the best co-expression results, as judged by analyzing pairs of genes we think are co-expressed, such as certain genes in the same metabolic pathway. However, we also provide MAS5-processed data to facilitate comparisons with other co-expression tools that use MAS5 and to allow users to see how the different microarray processing algorithms affect results.

    Details on individual data releases appear below:

    Release  Data sources        Number of arrays (slides)  Array processing/normalization               Published example analyses
    2.0      AffyWatch I,II      486 ATH1, 80 AG            RMA                                          Wei, et al; Cui, et al
    3.0      AffyWatch I,II,III  1779 ATH1, 80 AG           RMA                                          None as yet
    3.1      AffyWatch I,II,III  1779 ATH1                  GCRMA                                        None as yet
    3.2      AffyWatch I,II,III  1779 ATH1                  MAS5/log2 transformation, divide by average  None as yet



  • Which array type should I choose?

    The CressExpress database includes data harvested from the Affymetrix AG (Arabidopsis Genome) and ATH1 arrays (see Redman, et al. [PMID: 15086809 / DOI: 10.1111/j.1365-313X.2004.02061.x]). The database includes many more ATH1 than AG arrays, so for most purposes ATH1 is likely the better choice.



  • What tissue types should I select?

    The tissue types listed in Step 3 are taken directly from the descriptions of samples and experiments as provided by the NASC AffyWatch service. The tissue types are essentially just free text descriptions of individual samples that were provided by the experimenters who donated their data to the NASC array collection. If you expect that your query genes are likely to be expressed in a subset of tissues, it would probably be a good idea to select those. However, we would recommend you try selecting ALL (to include all sample types) for at least one run of the tool, so that you can get an idea of how patterns of correlated expression vary across different cellular settings.



  • What experiments should I include in an analysis?

    Step 4 presents a table of experiments available for inclusion in the analysis. Only experiments corresponding to the array type and tissue types selected on previous screens should appear. Some experiments may include only a few arrays corresponding to the previously selected tissue types. If you select these experiments, then only the arrays that fit the tissue type criteria selected in Step 3 will be included in the analysis.

    The table displays Experiment IDs (from NASC) along with some text describing each experiment. The text is from descriptions provided by the original experimenters. To find out more about each experiment, follow the links to a page at NASC.



  • What is PLC (pathway-level co-expression) analysis?

    PLC identifies genes and probe sets that are co-expressed with all or some of the query genes/probe sets you entered in Step 2. PLC is described in detail in Wei et al. [PMID: 16920875 / DOI: 10.1104/pp.106.080358]. It uses the output from the co-expression linear regression analysis to identify probe sets and target genes that are co-expressed with two or more of your query genes. PLC then reports these co-expression partners in order of the number of query genes with which they are co-expressed and, to break ties, in order of average r-squared value.

    In essence, PLC analysis identifies the part of the overall co-expression network that directly neighbors two or more of your query genes or probe sets. The co-expression tool doesn't actually compute the entire network; it knows enough about the network to find the shared neighbors for all your query genes.
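
    The ranking step described above can be sketched as follows. This is an illustration under stated assumptions rather than the tool's actual code: the function name plc_rank, the r-squared cutoff used to decide that two probe sets count as co-expressed, the choice to average r-squared only over the queries that passed the cutoff, and the example ids are all hypothetical.

```python
from collections import defaultdict

def plc_rank(pairs, r2_cutoff=0.36, min_queries=2):
    """Rank co-expression partners shared by several queries.

    pairs: iterable of (query_id, partner_id, r_squared) tuples taken from
    the pairwise regression output. r2_cutoff is a hypothetical threshold
    for calling a pair co-expressed.
    """
    hits = defaultdict(dict)  # partner -> {query: r_squared}
    for query, partner, r2 in pairs:
        if partner != query and r2 >= r2_cutoff:
            hits[partner][query] = r2

    rows = []
    for partner, by_query in hits.items():
        if len(by_query) >= min_queries:  # shared by two or more queries
            avg_r2 = sum(by_query.values()) / len(by_query)
            rows.append((partner, len(by_query), avg_r2))

    # Most queries first; ties broken by higher average r-squared.
    rows.sort(key=lambda row: (-row[1], -row[2]))
    return rows

# Hypothetical regression output for three queries q1..q3.
example_pairs = [
    ("q1", "a", 0.90), ("q2", "a", 0.80), ("q3", "a", 0.70),
    ("q1", "b", 0.95), ("q2", "b", 0.90),
    ("q1", "c", 0.99),
    ("q2", "d", 0.50), ("q3", "d", 0.60),
]
ranking = plc_rank(example_pairs)
```

    In this example, partner "a" ranks first because it neighbors all three queries, "b" precedes "d" on average r-squared, and "c" is dropped because it neighbors only one query.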

    When you download the results, you'll get several files: an HTML file for browsing results; several files you can open in Cytoscape to visualize the partial co-expression network that contains just your query genes and their neighbors; and a comma-separated file with PLC results that you can load into Excel or other programs.



  • Why do I need to enter my email in Step 6?

    CressExpress analyses will typically take several minutes (possibly more) to finish. When your analysis run completes, the tool will send you an email with a URL telling you where you can obtain your results files.

    If you like, you can enter multiple email addresses. Just enter them as a comma-separated list (e.g., mary@stevens.org,foo@bar.net).



  • What is the K-S test and how do I use it to filter "bad" chips?

    You can choose to filter out "bad" chips using results from a quality-control analysis we've performed. The QC procedure we use is based on analysis of studentized deleted residuals; details on how this works appear in Persson et al., 2005 [PMID: 15932943 / DOI: 10.1073/pnas.0503392102] and Trivedi et al. [PMID: 15813968 / DOI: 10.1186/1471-2105-6-86]. In a nutshell, to exclude lower-quality arrays, decrease the K-S parameter, which should be a value between 0 and 1.

    How this works
    For each probe set's expression value on each array in a group, we calculate a deleted residual by subtracting the mean expression value for that probe set across all arrays in the same group. We then calculate the studentized deleted residual by dividing the deleted residual from the first step by the probe set's standard deviation, again computed across all the arrays in the same group. We then examine all the studentized deleted residuals from a single array. If their distribution deviates significantly from a t-distribution with N-2 degrees of freedom, where N is the number of arrays in a group, then we consider the array an outlier and recommend excluding it from further analyses. To identify these outlier chips, we use a Kolmogorov-Smirnov (K-S) goodness-of-fit test, which produces a test statistic (D) indicating how well the observed distribution matches the theoretical t-distribution. The D statistic ranges from 0 to 1, and larger numbers indicate greater deviation and lower quality. We recommend 0.15 as a good default cutoff, which typically removes 10-20% of the arrays.
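
    The procedure described above can be sketched in plain Python. This is a rough illustration that follows the wording of this FAQ literally (mean and standard deviation taken across all arrays in the group), not the published QC code: all function names and the example data are hypothetical, and the t-distribution CDF is approximated numerically because the standard library does not provide one.

```python
import math
from statistics import mean, stdev

def t_cdf(x, df, steps=2000):
    """Approximate Student's t CDF by Simpson's rule on the density."""
    if x == 0:
        return 0.5
    log_coef = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    coef = math.exp(log_coef) / math.sqrt(df * math.pi)
    pdf = lambda t: coef * (1 + t * t / df) ** (-(df + 1) / 2)
    upper = abs(x)
    h = upper / steps  # steps must be even for Simpson's rule
    total = pdf(0.0) + pdf(upper)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    area = total * h / 3  # integral of the density from 0 to |x|
    return 0.5 + area if x > 0 else 0.5 - area

def ks_statistic(values, cdf):
    """One-sample K-S statistic D: largest gap between empirical and theoretical CDF."""
    xs = sorted(values)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        d = max(d, (i + 1) / n - f, f - i / n)
    return d

def array_d_statistics(expr):
    """expr maps probe set id -> expression values, one per array, same array order.

    Returns one D statistic per array.
    """
    n_arrays = len(next(iter(expr.values())))
    residuals = [[] for _ in range(n_arrays)]
    for values in expr.values():
        m, s = mean(values), stdev(values)
        if s == 0:
            continue  # a constant probe set carries no information
        for j, v in enumerate(values):
            residuals[j].append((v - m) / s)  # studentized residual
    df = n_arrays - 2
    return [ks_statistic(r, lambda x: t_cdf(x, df)) for r in residuals]

def passing_arrays(d_stats, cutoff=0.15):
    """Indices of arrays whose D does not exceed the cutoff."""
    return [j for j, d in enumerate(d_stats) if d <= cutoff]

# Hypothetical group of four arrays and three probe sets.
example_expr = {
    "ps1": [1.0, 2.0, 3.0, 10.0],
    "ps2": [2.0, 2.5, 3.0, 9.0],
    "ps3": [5.0, 5.5, 6.0, 1.0],
}
example_d = array_d_statistics(example_expr)
```

    Lowering the cutoff passed to passing_arrays keeps fewer arrays, which matches the advice above to decrease the K-S parameter when you want to exclude lower-quality chips.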

