James C. Liao's
Metabolic Engineering Website
lcDNA Instruction Manual - lcDNA_v0.03 - 17 June 2003
NOTES: All files must use the .txt extension. All files must be tab-delimited unless otherwise specified. Do not use headers in your data files. NO whitespace in gene names. Only one file can be selected at a time. When importing data from Imagene files only the median intensity values are imported.
Starting lcDNA
- The program is located in lcDNA/bin. Do not move it.
- GNU/Linux: Go to the directory you installed it in. At the command prompt type: ./lcDNA
- Microsoft Windows: Click on lcDNA.exe. You can create a shortcut and put it on your desktop or in your start menu, but you must not move the program from its directory.
When the program starts you should see the Load Data tab (Figure 1).
Figure 1: Load Data Tab. This is the tab where users are able to load data for further analysis.
Loading your data for analysis
Files that you need:
- Intensity Data File (Table 1): You will need to create an intensity file. The intensity file must have the following columns, in this order: channel 1, channel 2, Gene ID, Gene Number, Gene Name (do not include the column headers). *If you use Biodiscovery's (www.biodiscovery.com) Imagene and know your arrayer coordinate system, see the section "Method for loading data from Imagene files".
Table 1. Intensity Data File
- Gene ID File: **Optional; Liao Lab use only. This file must be created by the "Create Gene ID" function.
- ***If you use Gene ID Files that have different content but the same name you must either restart the program, or select "cDNA Analysis" to remove any residue left by the previous file. If your Gene ID Files have different names then you will not have a problem with residual info.
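Before loading an intensity file it can be helpful to check it against the rules in the NOTES and in the Intensity Data File description above (tab-delimited, five columns, no header row, no whitespace in gene names). The following is a minimal sketch of such a check, not part of lcDNA; the function name and the sample values are made up for illustration.

```python
import csv
import io

def check_intensity_rows(text):
    """Return (line_number, problem) tuples for tab-delimited intensity data.

    Hypothetical helper: expects the columns channel 1, channel 2, Gene ID,
    Gene Number, Gene Name, with no header row.
    """
    problems = []
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    for lineno, row in enumerate(reader, start=1):
        if len(row) != 5:
            problems.append((lineno, f"expected 5 columns, found {len(row)}"))
            continue
        ch1, ch2, gene_id, gene_number, gene_name = row
        # A header row shows up here as non-numeric channel values.
        for label, value in (("channel 1", ch1), ("channel 2", ch2)):
            try:
                float(value)
            except ValueError:
                problems.append((lineno, f"{label} is not numeric: {value!r}"))
        if any(c.isspace() for c in gene_name):
            problems.append((lineno, "gene name contains whitespace"))
    return problems

# Made-up sample: line 2 is a forbidden header, line 3 has a space in the name.
sample = (
    "1200\t950\t1\tb0001\tthrL\n"
    "Ch1\tCh2\tID\tNumber\tName\n"
    "800\t640\t2\tb0002\tthr A\n"
)
for lineno, problem in check_intensity_rows(sample):
    print(lineno, problem)
```

To check a real file, read it with open(path, newline="").read() and pass the text to the function.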
There are three methods for loading your files:
- Loading Data without a Gene ID File
- Loading Data with a Gene ID File
- Method for loading data from Imagene files
- Loading Data without a Gene ID File
- Loading Raw Intensity Files (Table 1)
- In the "File" Menu, select the "Load w/o Gene ID" option.
- Load your intensity data files. Use the navigator box to locate your files. Left-click to select, right-click to pick up, then drag it to the box under "Loaded Intensity Files".
- Loading Normalized Intensity Files
- The file is the same as the Raw Intensity File (Table 1) with one additional column. You must put the column of normalized log ratios in between the Gene ID # and Gene Number Columns.
- In the "File" menu, select the "Load Normalized Data" option.
- Load your intensity data files. Use the navigator box to locate your files. Left-click to select, right-click to pick up, then drag it to the box under "Loaded Intensity Files".
- Loading Data with a Gene ID File
- Loading your Gene ID file. Use the navigator box in the center of the tab to navigate to your Gene ID file. Left-click to select the Gene ID file, right-click to pick it up, then drag it to the label "Drag Gene ID File Here". If you did everything correctly then the label should change to "Gene ID File: Your Gene ID File".
- Load your intensity data files. Use the navigator box to locate your files. Left-click to select, right-click to pick up, then drag it to the box under "Loaded Intensity Files".
Method for loading data from Imagene files
- Loading your Gene ID File.
- If you have already used the "Create Gene ID" function to create your Gene ID file then you can drag this file to the label "Drag Gene ID File Here". Otherwise, see the Create Gene ID Tutorial.
- If you wish to import from intensity files but did not use the Virtek arrayer, please contact the program developer for support. We may be able to help you fabricate a Gene ID file that will work properly with the importing option. ***Do not use a homemade Gene ID file here. It is likely that your data will not be analyzed properly.***
- Loading your Imagene files. From the "Options" menu select "Imagene Mode." Navigate to your files. Left-click to select the file, right-click to pick it up, then drag the intensity file to the box under "Channel 1" (or "Channel 2"). Next, drag the corresponding intensity file to the box under "Channel 2" (or "Channel 1"). This function performs the following operations:
- The background median is subtracted from the signal median; this difference is taken to be the signal intensity.
- All values that are flagged or NaN are removed.
- ID_numbers are assigned to the intensity values; if the ID_number corresponds to a "blank" or "water" spot in the Gene ID file then the intensity values are ignored.
- The Channel 1 rootname and corresponding Gene ID file rootname are added to the "Loaded Intensity Files" and "Gene ID Files" boxes, respectively. A file and its corresponding Gene ID file name can be removed by left-clicking and then right-clicking on the file.
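The import operations listed above (background subtraction, removal of flagged or NaN values, and skipping of "blank" and "water" spots) can be sketched as follows. This is an illustration only and not lcDNA's code: the dictionary keys are hypothetical, and treating any nonzero flag as "flagged" is an assumption.

```python
import math

def import_spots(spots, gene_ids):
    """Return (id_number, intensity) pairs for the spots that survive import.

    spots: list of dicts with the hypothetical keys 'id', 'signal_median',
           'background_median' and 'flag' (assumed: 0 means not flagged).
    gene_ids: dict mapping ID number to the name in the Gene ID file.
    """
    kept = []
    for spot in spots:
        name = gene_ids.get(spot["id"], "")
        if name.lower() in ("blank", "water"):
            continue  # control spots are ignored
        if spot["flag"] != 0:
            continue  # flagged spots are removed
        # Signal intensity = signal median minus background median.
        intensity = spot["signal_median"] - spot["background_median"]
        if math.isnan(intensity):
            continue  # NaN values are removed
        kept.append((spot["id"], intensity))
    return kept

# Made-up example data.
spots = [
    {"id": 1, "signal_median": 500.0, "background_median": 100.0, "flag": 0},
    {"id": 2, "signal_median": 300.0, "background_median": 50.0, "flag": 1},
    {"id": 3, "signal_median": float("nan"), "background_median": 50.0, "flag": 0},
    {"id": 4, "signal_median": 400.0, "background_median": 60.0, "flag": 0},
]
gene_ids = {1: "thrL", 2: "thrA", 3: "thrB", 4: "blank"}
print(import_spots(spots, gene_ids))
```

Only spot 1 survives in this example: spot 2 is flagged, spot 3 is NaN, and spot 4 is a blank.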
Analysis Options
Click on the Analysis Options menu. Since there is a perforation you can tear the menu off (Figure 3). Unless you are loading files that have been processed to some degree, all the options above the one that you wish to run should be selected. **lcDNA does not have a memory; it always starts calculations with the files in the state that they were loaded. E.g. If you want to normalize data that has been filtered you cannot select and run the quality filtering option, deselect the quality filtering option, and then select and run the normalization option. If you did this then you would be normalizing unfiltered data.
Figure 3: Analysis Options Menu. Once you have loaded/imported your file you may want to select how the data is to be analyzed.
Eliminate Extreme Values (Figure 4).
- This function allows the user to specify the minimum (Lower Bound) and maximum (Upper Bound) intensity values that are acceptable for data analysis. The Lower Bound must be greater than 0. The Upper Bound must be less than 65535 (the largest value that the 16-bit TIFF file generated by the scanner can hold).
- The output file should have 5 columns: 1) Channel 1 Intensity, 2) Channel 2 Intensity, 3) Assigned Gene ID Number, 4) Gene Name, 5) Gene Number.
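As a rough illustration of the bounds check described above, the sketch below keeps a spot only when both channel intensities lie within the bounds. Whether lcDNA tests both channels, and whether the bounds are inclusive, are assumptions made here; the function name is hypothetical.

```python
def eliminate_extreme_values(rows, lower_bound, upper_bound):
    """Keep rows whose channel 1 and channel 2 intensities are within bounds.

    rows: sequences whose first two items are the channel 1 and channel 2
          intensities. Bounds are treated as inclusive (an assumption).
    """
    if not (0 < lower_bound < upper_bound < 65535):
        raise ValueError("need 0 < Lower Bound < Upper Bound < 65535")
    return [
        row for row in rows
        if lower_bound <= row[0] <= upper_bound
        and lower_bound <= row[1] <= upper_bound
    ]

# Made-up rows: (channel 1, channel 2, assigned Gene ID number).
rows = [(100, 200, 1), (5, 200, 2), (100, 65000, 3)]
print(eliminate_extreme_values(rows, 10, 60000))
```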
Quality Filtering
- In order to use this function your microarray must have two or more spots for each gene.
- This function removes genes that have large deviations between spots on a slide. It compares the coefficient of variation (cv) of each gene with the cv values of a set number of genes of similar mean intensity (Size of Window) and rejects the gene if its cv exceeds a set percentile (threshold) within that window.
- Size of window: the number of genes with similar mean intensity values that you want to compare each spot with; this number should not be too large unless all your spots have approximately the same intensity values.
- threshold: the maximum permissible percentile for a gene's cv.
- The output file should have 5 columns: 1) Channel 1 Intensity, 2) Channel 2 Intensity, 3) Assigned Gene ID Number, 4) Gene Name, 5) Gene Number
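The windowed cv comparison can be sketched as follows. This is only one plausible reading of the description above (see Tseng et al. for the actual method): the window is taken as the nearest genes by mean intensity, and the "percentile" of a gene is the fraction of genes in its window with a smaller cv. Function and parameter names are hypothetical.

```python
import statistics

def quality_filter(genes, window_size, threshold):
    """Return the set of gene ids whose cv is acceptable within their window.

    genes: dict mapping gene id to a list of two or more spot intensities.
    window_size: number of genes of similar mean intensity to compare with.
    threshold: maximum permissible percentile as a fraction, e.g. 0.75.
    """
    stats = []
    for gene_id, spots in genes.items():
        mean = statistics.mean(spots)
        cv = statistics.stdev(spots) / mean  # coefficient of variation
        stats.append((mean, cv, gene_id))
    stats.sort(key=lambda s: s[0])  # order genes by mean intensity

    half = window_size // 2
    passed = set()
    for i, (mean, cv, gene_id) in enumerate(stats):
        window = stats[max(0, i - half): i + half + 1]
        # Fraction of genes in the window with a smaller cv than this gene.
        percentile = sum(1 for s in window if s[1] < cv) / len(window)
        if percentile < threshold:
            passed.add(gene_id)
    return passed

# Made-up duplicate-spot intensities; gene "c" has poorly agreeing spots.
genes = {
    "a": [100, 102], "b": [110, 111], "c": [120, 180],
    "d": [200, 201], "e": [210, 212],
}
print(sorted(quality_filter(genes, window_size=5, threshold=0.75)))
```

In this example only gene "c" is rejected, because its cv is the largest in its window.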
Normalization
- This function performs two types of normalization.
- Global normalization if the files are calibration files
- Non-linear normalization, using the rank-invariant method, and the lowess fitting function for the comparative files.
Parameters:
- File Parameters:
- Calibration: Select for calibration hybridizations. Performs nonlinear normalization using the LOWESS method.
- Comparative: Select for normalizing comparative data. Can either be normalized using the rank invariant method or total intensity.
- Rank Invariant (RI) Normalization Parameters
- Iteration: T indicates that RI normalization with iteration will be performed. Requires a large number of genes on the microarray (greater than 3000). F indicates no iteration will be performed.
- Ext. Threshold (an integer): The maximum difference in rank, for channel 1 vs channel 2, that a gene may have to be considered invariant.
- % Threshold:
- For Iteration = T (a decimal 0.05 for 5%): After the first iteration, the maximum allowable difference in rank is determined by multiplying the number of invariant genes from the previous iteration by this value.
- For Iteration = F (an integer): The number of genes to exclude from the upper or lower bounds (e.g. If we have 100 genes and set this value to 5 then we will only consider genes ranked 5 through 95.).
- The output file should have 6 columns: 1) Channel 1 Intensity, 2) Channel 2 Intensity, 3) Assigned Gene ID Number, 4) Normalized Log (Channel 2 / Channel 1), 5) Gene Name, 6) Gene Number.
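To make the Rank Invariant parameters more concrete, the sketch below selects invariant genes for the non-iterative case (Iteration = F). It is a simplified illustration rather than lcDNA's implementation: the argument names are hypothetical, ranks start at 0, and the choice of inclusive comparisons is an assumption. The lowess fit through the selected genes is not shown.

```python
def rank_of(values):
    """Return the rank (0 = smallest) of each value; ties keep input order."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, index in enumerate(order):
        ranks[index] = rank
    return ranks

def rank_invariant_genes(ch1, ch2, ext_threshold, exclude):
    """Return indices of genes treated as rank invariant.

    ext_threshold: maximum allowed rank difference between the channels.
    exclude: number of genes to drop at each end of the ranking.
    """
    r1, r2 = rank_of(ch1), rank_of(ch2)
    n = len(ch1)
    selected = []
    for i in range(n):
        mean_rank = (r1[i] + r2[i]) / 2
        close = abs(r1[i] - r2[i]) <= ext_threshold
        not_extreme = exclude <= mean_rank < n - exclude
        if close and not_extreme:
            selected.append(i)
    return selected

# Made-up intensities for six genes.
ch1 = [10, 20, 30, 40, 50, 60]
ch2 = [12, 18, 65, 41, 52, 33]
print(rank_invariant_genes(ch1, ch2, ext_threshold=1, exclude=1))
```

Here genes 2 and 5 change rank by three positions between channels and gene 0 falls in the excluded lower end, so genes 1, 3 and 4 are selected.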
Assess Expression
- For a detailed explanation see: Tseng et al. and supplementary data in Nucleic Acids Research.
Experimental Requirements
- Two or more biologically independent samples.
- Two or more independent calibration hybridizations. Not necessary but highly beneficial. The presence of calibration hybridizations will, in most cases, increase the yield of genes that are categorized as differentially expressed (Hyduke et al.).
- One technical replicate for each independent sample, or calibration. Because there is slide-dependent variation in current-day microarray technology a technical replicate is recommended for each independent sample/calibration. lcDNA can function without technical replicates.
- Clicking on the button will open a window that will allow the user to select the data type for each loaded/imported data set.
General Parameters:
- Number of Genes: Enter the number of genes to be analyzed in your data set.
- Technical Replicates: Select if the data set contains technical replication.
Slide-Specific Parameters:
- Calibration: Select if the slide is from a calibration hybridization.
- Comparative: Select if the slide is from an experimental (comparative) hybridization.
- DataSet: Each series of microarray slides that are used to probe a single question (e.g. one time point in a time course experiment) and the corresponding calibration slides belong to a single data set. (Must be an integer greater than 0).
- Experiment Number: Each biologically independent slide in a data set is considered a separate experiment and will have a different experiment number. Technical replicates have the same experiment number but a different slide number. (Must be a positive integer).
- Slide Number: Each technical replicate within an experiment must have a unique slide number. (Must be a positive integer; must be sequential).
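The numbering rules for DataSet, Experiment Number and Slide Number can be checked before running the analysis. The sketch below is a hypothetical helper, not part of lcDNA; it assumes that "sequential" means the slide numbers within one experiment run 1, 2, 3, ... without gaps.

```python
def check_slide_numbers(slides):
    """Return a list of problems found in (data_set, experiment, slide) tuples."""
    problems = []
    by_experiment = {}
    for data_set, experiment, slide in slides:
        if data_set < 1 or experiment < 1 or slide < 1:
            problems.append(
                f"{(data_set, experiment, slide)}: numbers must be positive integers"
            )
        by_experiment.setdefault((data_set, experiment), []).append(slide)
    for (data_set, experiment), numbers in sorted(by_experiment.items()):
        # Assumed meaning of "sequential": 1..n with no repeats or gaps.
        if sorted(numbers) != list(range(1, len(numbers) + 1)):
            problems.append(
                f"data set {data_set}, experiment {experiment}: "
                f"slide numbers {sorted(numbers)} are not unique and sequential"
            )
    return problems

# Made-up assignment: experiment 2 skips slide number 2.
slides = [(1, 1, 1), (1, 1, 2), (1, 2, 1), (1, 2, 3)]
for problem in check_slide_numbers(slides):
    print(problem)
```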
- The output file is named with the root from the first file in the MCMC data set and has 9 columns:
- Assigned Gene ID Number
- Ecb: indicates the number of calibration experiments in which the gene was detected on at least one of the slides; if 0 then the gene was never detected. If calibration slides were not employed then it indicates the gene is present in a comparative experiment slide.
- Ecmp (similar to Ecb but for comparative experiments).
- 97.5q
- 2.5q
97.5q and 2.5q denote the upper and lower bounds, respectively, of the range in which the average expression is 95% likely to be found (they can be thought of as error bars; remember that the distribution is not normal).
- Average Theta G: the average log ratio (channel 2/ channel 1) for the corresponding gene.
- Score: if it is above 0.975 then you are at least 95% confident that the expression ratio is negative; if it is below 0.025 then you are at least 95% confident that the expression ratio is positive.
- Gene number.
- Gene name.
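When post-processing the output file, the Score column can be turned into a call using the two cutoffs given above. The following is a small sketch under that reading; the function name, the labels and the sample scores are made up.

```python
def interpret_score(score):
    """Translate a Score value into a call using the 0.975 / 0.025 cutoffs."""
    if score > 0.975:
        return "negative"  # at least 95% confident the ratio is negative
    if score < 0.025:
        return "positive"  # at least 95% confident the ratio is positive
    return "no call"

# Made-up (gene name, score) pairs.
results = [("geneA", 0.990), ("geneB", 0.500), ("geneC", 0.010)]
for name, score in results:
    print(name, interpret_score(score))
```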
Recommended Reading:
- Tseng, G.C., Oh, M.-K., Rohlin, L., Liao, J.C. and Wong, W.H. (2001) Issues in cDNA Microarray Analysis: Quality Filtering, Channel Normalization, Models of Variations and Assessment of Gene Effects. Nucleic Acids Res., 29, 2549-2557.
- Hyduke, D.R., Rohlin, L., Kao, K.C. and Liao, J.C. (2003) A Software Package for cDNA Microarray Data Normalization and Assessing Confidence Intervals. Submitted.
Nomenclature
- Comparative experiment - Each dye represents a different condition (e.g. glucose vs acetate)
- Calibration experiment - Each dye is from the same condition and batch of RNA.
- Technical Replication - Hybridization of a pool of labeled sample to multiple slides. Aids in assessing the impact of slide-to-slide variance.