Expression QTL Derivation

For both mice and rats, mean expression levels within strains were used as phenotypic values in a QTL analysis implemented in QTL Reaper, which is written in C and compiled as a Python module. QTL Reaper can be downloaded from http://qtlreaper.sourceforge.net/. A weighted marker regression analysis was used within QTL Reaper to calculate a likelihood ratio statistic (LRS) for each marker. LRS values were transformed to LOD scores for convenience by dividing by 2 x ln(10). The regression is weighted to account for the different numbers of arrays within strains used to calculate strain means. The weight is based on the repeatability of the transcript intensity and the number of arrays used to calculate the strain mean (Carlborg et al. 2005). For each transcript, the empirical p-value of the maximum LOD score was calculated by permutation (Churchill and Doerge, 1994). The empirical p-value adjusts for the multiple comparisons due to the multiple markers per transcript, but not for the multiple comparisons due to the many transcripts. To adjust for the latter, false discovery rates were calculated according to Benjamini and Hochberg (1995).
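The chain of calculations described above (marker regression, LRS to LOD conversion, permutation p-value, false discovery rate) can be sketched in plain Python. This is a minimal illustration and not the QTL Reaper implementation: the function names are hypothetical, the LRS is assumed to be n x ln(RSS_null / RSS_full) from a weighted least-squares fit, the weights are taken as given rather than derived from repeatability, and the permutation p-value uses an add-one correction that the text does not specify.

```python
import math
import random

def weighted_lrs(y, g, w):
    """LRS for one marker: weighted regression of strain means y on genotype g.
    Assumes y is not constant and g is polymorphic; w holds positive weights."""
    sw = sum(w)
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    gbar = sum(wi * gi for wi, gi in zip(w, g)) / sw
    sxx = sum(wi * (gi - gbar) ** 2 for wi, gi in zip(w, g))
    sxy = sum(wi * (gi - gbar) * (yi - ybar) for wi, gi, yi in zip(w, g, y))
    rss_null = sum(wi * (yi - ybar) ** 2 for wi, yi in zip(w, y))
    # Guard against a perfect fit, which would make the ratio infinite.
    rss_full = max(rss_null - sxy ** 2 / sxx, 1e-12)
    return len(y) * math.log(rss_null / rss_full)

def lrs_to_lod(lrs):
    """Convert an LRS to a LOD score by dividing by 2 x ln(10)."""
    return lrs / (2 * math.log(10))

def max_lod(y, w, markers):
    """Maximum LOD score for one transcript across all markers."""
    return max(lrs_to_lod(weighted_lrs(y, g, w)) for g in markers)

def permutation_p(y, w, markers, n_perm=1000, seed=0):
    """Empirical p-value of the maximum LOD: strain means (with their weights)
    are shuffled relative to the genotypes."""
    rng = random.Random(seed)
    observed = max_lod(y, w, markers)
    idx = list(range(len(y)))
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(idx)
        y_perm = [y[i] for i in idx]
        w_perm = [w[i] for i in idx]
        if max_lod(y_perm, w_perm, markers) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one correction (an assumption)

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

In this sketch, permutation_p would be called once per transcript, and the resulting p-values for all transcripts would then be passed together to benjamini_hochberg.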

Mouse, whole brain, Affymetrix Mouse 430 version 2 array

Whole-brain gene expression data were obtained for a panel of 30 BXD RI strains plus the two parental strains on the MOE430v2 array from Affymetrix. Probes were eliminated prior to normalization if their sequence did not match any part of the NCBI m37 build of the mouse genome, if their sequence matched multiple locations in the mouse genome, or if the location in the genome that the probe did match contained a single nucleotide polymorphism (SNP) between C57BL/6J and DBA/2J according to the whole genome sequence data obtained from the Sanger Institute (Keane et al. 2011). Entire probe sets were eliminated if fewer than four of the original probes remained after filtering. Probe set intensities were normalized and summarized using RMA. If a probe set did not have at least one present call throughout all samples, the probe set was dropped from the data set. Of the 41,581 probe sets retained after masking, 30,031 probe sets remained after filtering by present/absent calls. Data were thoroughly examined for batch effects related to processing. The microarrays were run over a period of a year and a half, resulting in 15 batches. Both batches and strains can contribute to non-random data distribution, so a new method for removing batch effects while retaining strain effects was used (personal communication, Evan Johnson, Boston University) on the set of 30,031 probe sets detected above background. This method combines a simple rank test and a Bayesian hierarchical framework similar to the empirical Bayes method, Combating Batch Effects When Combining Batches of Gene Expression Microarray Data (ComBat) (Johnson et al., 2007). This version of the data is available in the Download Resources section. The two parental strains were not included in eQTL calculations.
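The two filtering steps described above (probe masking, then present/absent filtering of probe sets) can be illustrated with a short sketch. The record layout, the field names (probe_id, probe_set, genome_hits, overlaps_snp), and the call code "P" for a present call are assumptions made for illustration; they do not come from the original pipeline.

```python
def mask_probes(probes, min_probes=4):
    """Keep probes that map to exactly one genome location and do not overlap a
    known SNP, then keep only probe sets with at least min_probes such probes."""
    kept = {}
    for p in probes:
        if p["genome_hits"] == 1 and not p["overlaps_snp"]:
            kept.setdefault(p["probe_set"], []).append(p["probe_id"])
    return {ps: ids for ps, ids in kept.items() if len(ids) >= min_probes}

def filter_by_present_calls(calls, present="P"):
    """Keep probe sets with at least one present call across all samples.
    calls maps a probe set ID to its list of per-sample detection calls."""
    return [ps for ps, sample_calls in calls.items()
            if any(c == present for c in sample_calls)]
```

Setting min_probes=3 would give the threshold described for the exon arrays.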

An original set of 115,183 SNP markers and their genotypes for 89 BXD RI strains and the two parental strains was downloaded from the Wellcome-CTC Mouse Strain SNP Genotype Set (http://gscan.well.ox.ac.uk/gsBleadingEdge/mouse.snp.selector.cgi). The locations of these markers are based on Mouse Build 37/mm9. The set of markers used for the eQTL analyses was reduced from the original set by eliminating SNPs with missing genotype information for the 30 RI strains, SNPs that did not differ between the RI strains, and SNPs with genotype calls that did not follow the known breeding scheme of the panel. This reduced the SNP set to 7,226 SNPs. This set was then reduced to unique strain distribution patterns with respect to the 30 RI strains used in our analysis. The final set contained 986 informative strain distribution patterns. Both the normalized expression data and the markers used for the eQTL analysis are available for download from the PhenoGen website.
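Collapsing markers to unique strain distribution patterns can be sketched as follows. The genotype coding (one character per strain, with "N" for a missing call) and the function name are hypothetical, and the check against the breeding scheme of the panel is not shown because the text does not describe how it was done.

```python
def unique_sdps(markers, missing="N"):
    """Group markers by strain distribution pattern (SDP).

    markers is a list of (marker_name, genotype_string) pairs, where the string
    holds one allele code per RI strain in a fixed strain order. Markers with any
    missing call, or with the same allele in every strain, are dropped.
    Returns a dict mapping each unique pattern to the markers that share it."""
    patterns = {}
    for name, genotypes in markers:
        if missing in genotypes:
            continue  # missing genotype information
        if len(set(genotypes)) < 2:
            continue  # does not differ between the RI strains
        patterns.setdefault(genotypes, []).append(name)
    return patterns
```

Only one test per pattern is then needed in the eQTL scan, because markers that share a pattern give identical regression results.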

Mouse, whole brain, Affymetrix Mouse Exon Array

Whole-brain gene expression data were obtained for a panel of 59 LXS RI strains on the Affymetrix Mouse Exon Array 1.0 ST. Individual probes were eliminated prior to normalization if their sequence did not match any part of the NCBI m37 build of the mouse genome, if their sequence matched multiple locations in the mouse genome, or if the location in the genome that the probe did match contained a SNP between any of the 19 strains in the public Inbred Mice dataset for which genotype data are available from the Imputed Genotype Resource at the Jackson Laboratory (http://cgd.jax.org/datasets/popgen/imputed.shtml; the same mask that is implemented on PhenoGen). Entire probe sets were eliminated if fewer than three of the original probes remained after filtering.

Arrays were examined for quality, and arrays that did not meet quality standards were eliminated. Probe intensities were normalized and summarized into core transcript clusters using RMA. The dataset, including arrays from ILS, ISS, C57BL/6J, and DBA/2J, was adjusted for batch effects using the empirical Bayes method outlined in Johnson et al. (2007). Two C57BL/6J arrays were analyzed in each batch, and most batches also had two DBA/2J arrays. Strain means were calculated after adjusting for the effect of breeding location. For most strains, three animals were bred at the Jackson Laboratory in Bar Harbor, Maine and three animals were bred at the University of Colorado School of Medicine in Aurora, Colorado. The marker set used for eQTL calculations on the LXS RI panel came from Gary Churchill at the Jackson Laboratory. SNP genotypes were assessed using the Affymetrix Mouse Diversity Genotyping Array. Of the 314,865 SNPs retrieved, 34,475 SNPs indicated different homozygous genotypes between parental strains (ILS and ISS), had valid dbSNP identifiers, and had no missing or heterozygous genotype calls. The set of markers used for the eQTL analyses was reduced from this set to eliminate markers that had identical strain distributions (with respect to the 59 strains used in our analysis). This final set contained 1,475 informative markers. Both the normalized expression data and the markers used for the eQTL analysis are available for download from the PhenoGen website.
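The three SNP selection criteria listed above can be expressed as a simple filter. This is only a sketch: the record layout, the two-letter genotype strings with None for a missing call, and the test for a "valid" dbSNP identifier (the prefix "rs" followed by digits) are all assumptions made for illustration.

```python
def select_informative_snps(snps, parent_a="ILS", parent_b="ISS"):
    """Return the IDs of SNPs that pass all three criteria:
    a valid dbSNP identifier, no missing or heterozygous calls in any strain,
    and different homozygous genotypes in the two parental strains."""
    selected = []
    for snp in snps:
        snp_id, calls = snp["id"], snp["calls"]  # calls: strain -> "AA", "AG", None
        if not (snp_id[:2] == "rs" and snp_id[2:].isdigit()):
            continue  # assumed definition of a valid dbSNP identifier
        if parent_a not in calls or parent_b not in calls:
            continue  # a parental strain was not genotyped at all
        if any(c is None or len(c) != 2 or c[0] != c[1] for c in calls.values()):
            continue  # missing or heterozygous call somewhere in the panel
        if calls[parent_a] == calls[parent_b]:
            continue  # parental strains do not differ
        selected.append(snp_id)
    return selected
```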

Rat, whole brain, CodeLink Whole Genome Rat Array

Whole-brain gene expression data were obtained for a panel of 25 HXB/BXH RI strains plus the two parental strains on the CodeLink Rat Whole Genome Array. In preparation for normalization, probes were removed from the datasets if they were one of the negative or positive controls placed on the array by the manufacturer. Next, individual values were eliminated based on the quality flags assigned by the CodeLink Expression Analysis Software. Values were eliminated if they were flagged as M (spot was identified as defective through image inspection at manufacturing), C (spot has a high level of background contamination), I (spot has an irregular shape), or S (spot has a high number of saturated pixels). Values were retained if they were flagged G (spot is good) or L (spot is below local background noise). In addition, to allow a log base 2 transformation of the background-adjusted intensity values, all background-adjusted intensity values below zero were replaced with the value 0.00001. The data were then normalized using a cyclic LOESS procedure executed in R, which accounted for the missing intensity values. Genotype information for the rats was downloaded from the STAR Consortium's website (http://www.snp-star.eu/). SNP locations are based on RGSC version 3.2. The downloaded SNP data were cleaned by eliminating SNPs that did not differ between the parental strains, SNPs that were not genotyped in either parental strain, and SNPs that were heterozygous in either of the parental strains. Unknowns were recoded if the surrounding SNPs had the same genotype. Double recombinants were also recoded. SNPs were eliminated if more than two strains were missing genotype information. After the dataset had been cleaned, 1,460 unique strain distribution patterns were identified and used in the eQTL analysis. Both the normalized expression data and the markers used for the eQTL analysis are available for download from the PhenoGen website.
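The flag-based filtering and the log transformation of the CodeLink values can be sketched as below. The function name is hypothetical, eliminated values are represented as None, and a value of exactly zero is floored along with the negative values so that the logarithm is defined, which is an assumption beyond what the text states. The cyclic LOESS normalization itself is not shown.

```python
import math

DROP_FLAGS = {"M", "C", "I", "S"}  # defective, contaminated, irregular, saturated
KEEP_FLAGS = {"G", "L"}            # good, below local background noise
FLOOR = 0.00001                    # replacement for non-positive intensities

def clean_spot(value, flag):
    """Return the log2 background-adjusted intensity for one spot,
    or None if the quality flag marks the value for elimination."""
    if flag in DROP_FLAGS:
        return None
    if flag not in KEEP_FLAGS:
        raise ValueError(f"unexpected quality flag: {flag!r}")
    if value <= 0:  # the text specifies values below zero; zero is included here
        value = FLOOR
    return math.log2(value)
```

The None entries stand for the missing intensity values that the normalization step then has to accommodate.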

Rat, whole brain/left ventricle/liver/brown adipose tissue, Affymetrix Rat Exon Array

Whole-brain, heart, liver, and brown adipose tissue gene expression data were obtained for a panel of 21 HXB/BXH RI strains (only 19 RI strains for the brown adipose tissue) and six related inbred strains on the Affymetrix Rat Exon Array 1.0 ST. Individual probes were eliminated prior to normalization if their sequence did not match any part of the RGSC version 3.2 of the rat genome, if their sequence matched multiple locations in the rat genome, or if the location in the genome that the probe did match contained a SNP between the Brown Norway (BN/SsNHsdMcwi) inbred strain (the reference strain) and the spontaneously hypertensive rat (SHR/OlaIpcv) strain, which was recently sequenced using next generation sequencing (Atanur et al. 2010), or a SNP detected in DNA sequencing of the BN-Lx/CubPrin and SHR/OlaIpcvPrin strains (the same mask that is implemented on PhenoGen). DNA sequence data for the BN/SsNHsdMcwi and SHR/OlaIpcv strains were downloaded directly from the Ensembl ftp site at ftp://ftp.ebi.ac.uk/pub/databases/ensembl/snp/rat/shr/. Of the 4,022,111 original probes, 604,601 were removed (472,072 did not map uniquely to the genome; 132,529 contained a SNP). Entire probe sets were eliminated if fewer than three of the original probes remained after filtering. Arrays were examined for quality, and arrays that did not meet quality standards were eliminated. Probe intensities were normalized and summarized into core transcript clusters using RMA. The dataset, including arrays from the six relevant inbred strains, was adjusted for batch effects using the empirical Bayes method outlined in Johnson et al. (2007). Only the 21 recombinant inbred strains (19 for the brown adipose tissue data set) were included in the eQTL analysis. The marker set used for eQTL calculations on the HXB/BXH RI panel was originally downloaded from the Ensembl link to the STAR Consortium data (http://www.ensembl.org/Rattus_norvegicus/Info/Content?file=star/index.html). SNP locations are based on RGSC version 3.2. The downloaded SNP data were cleaned by eliminating SNPs that did not differ between the parental strains, SNPs that were not genotyped in either parental strain, and SNPs that were heterozygous in either of the parental strains. Unknowns were recoded if the surrounding SNPs had the same genotype. Double recombinants were also recoded. SNPs were eliminated if more than two strains were missing genotype information. After the dataset had been cleaned, 761 unique strain distribution patterns were identified and used in the eQTL analysis. Both the normalized expression data and the markers used for the eQTL analysis are available for download from the PhenoGen website.
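The recoding steps in the genotype cleaning can be illustrated per strain, with markers ordered along one chromosome. This is a sketch under stated assumptions: the allele codes, the "N" code for an unknown call, and the function names are hypothetical, and a "double recombinant" is taken here to mean a single marker whose call differs from two agreeing neighbors, which the text does not define.

```python
def fill_unknowns(calls, missing="N"):
    """Recode runs of unknown calls when the flanking known calls agree.
    calls is the ordered list of calls for one strain on one chromosome."""
    out = list(calls)
    n, i = len(out), 0
    while i < n:
        if out[i] != missing:
            i += 1
            continue
        j = i
        while j < n and out[j] == missing:
            j += 1  # the unknown run spans positions i..j-1
        if i > 0 and j < n and out[i - 1] == out[j]:
            for k in range(i, j):
                out[k] = out[i - 1]
        i = j
    return out

def recode_double_recombinants(calls, missing="N"):
    """Recode an isolated call that disagrees with two agreeing neighbors."""
    out = list(calls)
    for i in range(1, len(out) - 1):
        left, right = out[i - 1], out[i + 1]
        if left == right and left != missing and out[i] not in (missing, left):
            out[i] = left
    return out

def drop_sparse_snps(matrix, max_missing=2, missing="N"):
    """Drop SNPs for which more than max_missing strains lack a genotype.
    matrix maps a SNP name to its list of per-strain calls."""
    return {snp: calls for snp, calls in matrix.items()
            if list(calls).count(missing) <= max_missing}
```

Unknown calls at the ends of a chromosome have only one flanking marker and are left unchanged in this sketch.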

References

Johnson WE, Li C, and Rabinovic A (2007). Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics 8(1):118-127.
Atanur SS, Birol I, Guryev V, Hirst M, Hummel O, Morrissey C, Behmoaras J, Fernandez-Suarez XM, Johnson MD, McLaren WM, Patone G, Petretto E, Plessy C, Rockland KS, Rockland C, Saar K, Zhao Y, Carninci P, Flicek P, Kurtz T, Cuppen E, Pravenec M, Hubner N, Jones SJ, Birney E, Aitman TJ (2010). The genome sequence of the spontaneously hypertensive rat: Analysis and functional significance. Genome Research 20(6):791-803.
Churchill GA and Doerge RW (1994). Empirical threshold values for quantitative trait mapping. Genetics 138:963-971.
Benjamini Y and Hochberg Y (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Statist Soc B 57:289-300.