Gene co-expression networks (GCNs) are graphic representations that depict the coordinated transcription of genes in response to certain stimuli. GCNs provide functional annotations of genes whose function is unknown and are further used in studies of translational functional genomics among species. In this work, a methodology for the reconstruction and comparison of GCNs is presented. This approach was applied using gene expression data that were obtained from immunity experiments in Arabidopsis thaliana, rice, soybean, tomato and cassava. After the evaluation of diverse similarity metrics for the GCN reconstruction, we recommended the mutual information coefficient measurement and a clustering coefficient-based method for similarity threshold selection. To compare GCNs, we proposed a multivariate approach based on the Principal Component Analysis (PCA). Branches of plant immunity that were exemplified by each experiment were analyzed in conjunction with the PCA results, suggesting both the robustness and the dynamic nature of the cellular responses. The dynamic of molecular plant responses produced networks with different characteristics that are differentiable using our methodology. The comparison of GCNs from plant pathosystems, showed that in response to similar pathogens plants could activate conserved signaling pathways. The results confirmed that the closeness of GCNs projected on the principal component space is an indicative of similarity among GCNs. This also can be used to understand global patterns of events triggered during plant immune responses.
Keywords: Gene co-expression networks, Similarity measures, Similarity threshold, Principal Component Analysis, Networks comparison, Plant immunity
Molecular biological high-throughput techniques have provided a great amount of diverse and informative gene expression data, currently available in genomic databases. These data, if properly analyzed, allow for a better understanding of the biological processes in different organisms. The construction of functional gene networks that are based on gene expression data are termed gene co-expression networks (GCNs), which reflect information based on the relationships between genes (and/or the proteins they encode) that indicate a coordinated participation in a common biological process or pathway (Atias, Chor & Chamovitz, 2009; Hwang et al., 2011). GCNs predict functional annotations for genes whose function is unknown (Ficklin & Feltus, 2011). Some studies have also confirmed through experimental validation that the predictions are accurate (Seo et al., 2011).
Several methodologies have been used for the construction of GCNs in plants in order to understand important biological processes (López-Kleine, Leal & López, 2013), trying to represent as much information as possible using gene expression data from heterogeneous experiments (Atias, Chor & Chamovitz, 2009). Most of these methodologies share four main steps that are solved in different manners: (1) gene expression data selection and the construction of expression matrices, (2) the selection of a similarity measurement and the construction of gene similarity matrices (Butte & Kohane, 2000; Mahanta et al., 2012), (3) similarity threshold selection (Elo et al., 2007; Luo et al., 2007) and (4) the comparison of GCNs that were obtained from different samples or species, as has been proposed as the final step by several works (Elo et al., 2007; Skinner et al., 2011).
The confidence in the obtained GCNs depends on the reliability and objectiveness of the approach used at each of these steps. Additionally, when heterogeneous gene expression samples are used in conjunction, special care is required to maintain a high signal/noise ratio. Selecting a similarity metric that captures the relationship between gene expression profiles is the first critical decision in the methodology (Zhang & Horvath, 2005). The Pearson Correlation Coefficient (PCC) is the most used similarity metric due to its simple implementation and appropriateness for this task (Edwards et al., 2010; Ouyang et al., 2012). Nevertheless, as expression profiles can be correlated non-linearly, many genes with an interesting coordinated co-expression are not retained for inclusion in the final GCN using PCC (Bandyopadhyay & Bhattacharyya, 2011). Furthermore, the PCC is affected by outlying observations that originate pairs of genes that are co-expressed incorrectly (Mutwil, 2010). Studies have confirmed that the PCC is high even if genes are neither overexpressed nor underexpressed across conditions (Bandyopadhyay & Bhattacharyya, 2011) and that it also fails in the detection of proximity between expression profiles (Mahanta et al., 2012). Several metrics have been introduced to detect any dependence between expression profiles while enhancing the robustness if noisy data are available (Numata, Ebenhöh & Knapp, 2008; Bandyopadhyay & Bhattacharyya, 2011). Metrics that are based on information theory, such as the Non-linear Correlation coefficient based on Mutual Information (NCMI), perform well with expression data, due to the lack of distribution assumptions and the fact that these metrics are not affected by data transformations (Numata, Ebenhöh & Knapp, 2008). Recently, the Normalized Mean Residue Similarity (NMRS) showed good performance in detecting shifted patterns of expression profiles (Mahanta et al., 2012). An evaluation of these metrics compared to the PCC is essential to establish their strengths or weaknesses in capturing functional lineal and non-lineal relationships between genes.
Once an appropriate similarity measure has been applied, the second step is selecting the similarity threshold. Selecting a similarity threshold is a decision that frequently relies on subjective criteria or previous biological knowledge (Ala et al., 2008). Elaborated approaches for selecting the threshold objectively have been proposed (Nayak et al., 2009). Methods based on the clustering coefficient of graphs (Elo et al., 2007), spectral graph theory (Perkins & Langston, 2009) and random matrix theory (Luo et al., 2007) attempt to differentiate true co-expressed genes from random noise. In these methods, the structure of GCNs is revealed in a systematic way without subjective intervention (Luo et al., 2007). However, their complexity and dependence on assumptions makes them restrictive. Among these methods, clustering coefficient-based methods are robust and intuitive (Elo et al., 2007).
Regarding the comparison of networks as a final step in most of the studies constructing GCNs, some strategies aim to study conserved topological or biological information between GCNs (Mutwil et al., 2011). The comparison of networks using graph variables and multivariate approaches has also been developed (Costa et al., 2005; Elo et al., 2007). Only topological or spectral variables are used to characterize networks, therefore, genomic information is not reflected in graph properties, and biological conclusions are not revealed. An efficient strategy to characterize and compare GCNs based on a multivariate analysis, allowing researchers to include and also obtain valuable genomic data from networks and to infer global similarities, is still not available.
In the present work, we constructed GCNs based on gene expression data that were obtained from plant immunity experiments. The plants represent an important source of nutrients for most organisms. To gain access to these nutrients, pathogens have to survive the plant responses. Plant immunity has been classified into two branches according the molecules involved in the recognition (Jones & Dangl, 2006). The first branch depends on the recognition of microorganism-associated molecular patterns (MAMPs) by pattern recognition receptors (PRRs). This immunity is named MAMP-triggered immunity (MTI also known as PTI) (Zipfel, 2009). The second branch of plant immunity depends on the recognition of pathogen effector proteins, which are translocated and recognized in the plant cytoplasm by resistance (R) proteins. This branch has been called effector-triggered immunity (ETI) (Jones & Dangl, 2006). The PTI and/or ETI induce a systemic acquired resistance (SAR) that confers a broad-spectrum and long-term resistance (Durrant & Dong, 2004). The recognition of MAMPs or effectors triggers a diverse array of responses, including ion fluxes, the production of reactive oxygen species (ROS) and the activation of MAP kinase signaling pathways, leading to the activation of transcription factors that in its turn modulate the host gene expression (Dodds & Rathjen, 2010). The changes (induction and repression) in gene expression during different plant immune responses have been studied in several plant pathosystems (Glazebrook, 2005; Birkenbihl & Somssich, 2011), but Arabidopsis thaliana-Pseudomonas syringae remain the primary models for the study of plant–pathogen interactions (Nishimura & Dangl, 2010).
In the present work, we performed the four steps of GCN construction, carefully evaluating the statistical robustness and objectivity during each step. The careful selection of the best method and some improvements during the threshold selection step allowed us to obtain a general picture of gene expression reprogramming during plant pathogen immunity through the GCN construction. Pathogen resistance microarray datasets from Arabidopsis, rice (Oryza sativa), soybean (Glycine max), tomato (Solanum lycopersicum) and cassava (Manihot esculenta) were used. We evaluated the performance of the Absolute value of the Pearson Correlation Coefficient (APCC) against two metrics, NCMI and NMRS. For the similarity threshold selection, a modification of the clustering coefficient-based method is proposed to select the similarity thresholds. For the comparison step, the GCNs were characterized and a Principal Component Analysis was performed. The GCNs were clustered based on the principal component (PC) space using the K-means clustering algorithm. We found that the distance between the GCNs in the PC space can be used to analyze their structural and functional similarities within and between species. The comparative analyses allowed for the identification of common elements, indicating cross-talk between the different signaling responses to pathogens in the studied plant species.
Materials & Methods
Expression matrices construction
Pathogen resistance microarray data was used in this work. GEO DataSet repositories were queried for the expression data from microarray experiments (http://www.ncbi.nlm.nih.gov/geo/). A total of 40 non-processed datasets for Arabidopsis thaliana, 8 for rice, 5 for soybean and 3 for tomato were collected. Three cassava microarray datasets were obtained from previous studies (López et al., 2005).
The datasets were independently pre-processed through noise reduction, normalization and log2 transformation. The Robust Multiarray Average (RMA) method (Bolstad et al., 2003) was applied to Affymetrix data using the R affy library (R Development Core Team, 2011), while the two-color microarray data were pre-processed using the marray and Agi4x44PreProcess libraries.
The probe IDs were converted into gene IDs using a conversion table for each platform. Single probes that matched more than one gene were removed. For those multiple probes that matched a single gene, the maximum expression was assigned to the gene.
A filter of the samples and genes was applied to the datasets to reduce missing data. First, a common gene list was obtained, and those samples representing less than 50% of the common genes were removed. Afterwards, those genes that were represented in less than 75% of the total samples were removed.
At this point, two groups of expression matrices were constructed from pre-processed datasets. The first group of expression matrices was obtained by merging all of the expression data from one species (see Fig. 1A). The GCNs that were constructed with these expression matrices were called GCNs based on multiple experiments (M-GCNs). The second group of expression matrices was constructed for each microarray experiment independently (see Fig. 1B). For each experiment, genes showing differential expression were identified and retained using the Significance Analysis of Microarrays (SAM) (Tusher, Tibshirani & Chu, 2001).
The overall steps for the construction and comparison of the GCNs.
The GCNs that were constructed using this approach were called GCNs from single experiments (S-GCNs).
Similarity measurement selection
A square similarity matrix (Snxn) was calculated for every single Enxp. The elements of Snxn or similarities (si,j) between pairs of genes i and j were calculated using a similarity measure. We evaluated three similarity measures: the Absolute value of the Pearson Correlation Coefficient (APCC) (Zhang & Horvath, 2005), the Non-linear Correlation coefficient based on Mutual Information (NCMI) (Dionisio, Menezes & Mendes, 2004; Numata, Ebenhöh & Knapp, 2008) and the Normalized Mean Residue Similarity (NMRS) (Mahanta et al., 2012) (see Article S1, section 1). These measures were used to calculate the dependence between Xi and Xj, where Xi is a continuous random variable denoting the expression level of the ith gene across samples (Meyer, Lafitte & Bontempi, 2008). These similarity measures take values in the same interval [0, 1], where 0 indicates non-dependence between Xi and Xj, and 1 indicates total dependence or maximum similarity. A detailed description of each similarity measure is given in the Article S1, section 1.
The Snxn were contrasted in dispersion plots. The similarity measurement that better detected not only the linear dependences between Xi and Xj but also the non-linear and scaled patterns was chosen.
Similarity threshold selection
Once the Snxn was calculated using the chosen similarity measure, a similarity threshold τ* was selected. The τ* allowed us to determine the GCN edges according to the adjacency function given by Eq. (1) (Zhang & Horvath, 2005). Each GCN was represented by an adjacency matrix Anxn whose elements ai,j take the value of 1 when the genes/nodes i and j are connected by an edge. We restricted the GCNs to have undirected edges and no self-loops; therefore, Anxn is symmetric with diagonal elements equal to 0. The GCNs were drawn using Cytoscape (Shannon et al., 2003).
In this work, we followed an intuitive method based on the network’s topological properties for τ* selection (Elo et al., 2007). The observed clustering coefficient in the GCN C(τv) was compared with the expected clustering coefficient Cr(τv) for a randomized GCN with the same degree distribution of the original GCN (Newman, 2003). Both clustering coefficients are contrasted as the similarity threshold increased (Eqs. (2) and (3)).
In Eq. (2), the observed clustering coefficient C(τv) is the average of the clustering coefficients of all the nodes in the GCN, so it could be also called “average clustering coefficient”; ki denotes the number of neighbors of gene i or node degree; Di denotes the number of edges between the neighbors of gene i. K is the number of genes with ki > 1.
In Eq. (3): N denotes the number of connected nodes in the GCN, , and .
According to Elo et al. (2007), the similarity threshold selection is determined by finding the minimum threshold τv for which the difference between the clustering coefficients is maximum. Although this strategy is useful for a wide broad of networks, it is not suitable for those networks where (C(τv) − Cr(τv)) < 0. Here, we use the absolute difference between clustering coefficients (Eq. (4)). Thus, τ* is the first local maximum of the curve |C(τv) − Cr(τv)|.
In Eq. (4), τ* is the selected similarity threshold; τv+1 = τv + 0.01 with τv∈[0.01, 0.99].
The validity of this modification was evaluated with simulated networks. The simulation procedure and results are described in Article S1, section 2; Fig. S3.
GCN comparison by Principal Component Analysis (PCA)
The GCNs were characterized by eight graph variables (Fig. 1C). These informative measurements were selected following different requirements. Initially, we selected a subset of four variables that explain topological properties of reconstructed networks. For example, to study the structure of networks and their tendency to form sets of tightly connected edges, the clustering coefficient was used. Besides, the density of edges allowed us to measure whether the network is tight or cohesive (Horvath & Dong, 2008). To average the importance of nodes in terms of its centrality a measure of centralization was used. This measure assumes that the greater the number of paths in which a node participates, the higher the importance for the network (Costa et al., 2005). Equally, networks could show high variance in their nodes connectivity, especially in scale-free topologies. We assessed the heterogeneity measure to reveal whether the networks have heterogeneous connectivity (Horvath & Dong, 2008).
Subsequently, we planned to study the structure of networks adding external information. For this purpose, a subset of four variables was proposed as follows. Given that nodes in coexpression networks also represent coded proteins with different biological functions, it’s interesting to consider that nodes are not homogenous. To measure how much the nodes link to others with similar or dissimilar characteristics, a pair of assortativity coefficients was introduced. These coefficients merge current topological information with external Gene ontology (GO) annotations and PFAM annotations.
In the same way, we used graph theory to study the relationship between gene significance and connectivity. We assessed the correlation between node degree and presence of typical domains found in the immunity proteins. The correlation takes a reference dataset of genes encoding proteins involved in defense. We evaluated whether highly connected hub nodes are central to the network structure but also biologically significant in immune responses.
As this work focused on plant pathogen interactions, the tolerance to attacks as represented by the action of the effectors as suppressors of plant immunity was considered important. It was recently demonstrated that effector proteins from pathogens are directed to hubs of plant immunity networks (Mukhtar et al., 2011). Here, we analyzed the resistance to these perturbations by means of the average path length (Albert & Barabasi, 2002). A detailed description of these eight variables is annexed in the Article S1, section 3.
The M-GCNs and S-GCNs were compared in separated collections after characterization. Initially, the characterization matrices Tgxt of g networks by t variables were formed for M-GCNs and S-GCNs. Subsequently, a PCA for every single Tgxt was conducted (Jolliffe, 2002). Those principal components (PCs) retaining more variance were selected.
The M-GCNs and S-GCNs were analyzed using the PCs planes. In addition, two procedures were considered for S-GCNs comparison:
We classified every S-GCN by the treatment studied in the experiment. In this work, the experiments included stresses caused not only by pathogens but also by chemical substances that are related to pathogen activity and the plant immune system. Those stresses sharing similar pathogens or chemical substances were grouped (Table 1). Subsequently, those S-GCNs belonging to the same stress group were depicted on the PCs planes.
Pathogen resistance microarray data collected.
The K-means algorithm was used to find clusters of S-GCNs on the PCs planes. We selected the optimum number of clusters based on the Bayesian Information Criterion (BIC). This selection was achieved using the R adegenet library (R Development Core Team, 2011). The clusters were analyzed with the stress groups as previously defined.
The R code for the construction and comparison of GCNs is given in the Script S1.
With the aim of generating a general picture of the immunity networks, microarray data from different plants in response to pathogens were used to construct GCNs. The general methodology that was followed to construct and compare the GCNs involved four steps: (1) the construction of expression matrices, (2) the selection of a similarity measurement and the construction of gene similarity matrices, (3) the similarity threshold selection and (4) the comparison of GCNs (Fig. 1).
Expression matrices construction
A total of 59 raw microarray datasets from pathogen-infected plants were obtained from publicly available data that were pre-processed and filtered (see Methods and Table 1). Arabidopsis and rice were represented by more experiments than were the other species; 40 and 8 experiments, respectively. In Arabidopsis, studies with the pathogens Botrytis cinerea and Pseudomonas syringae pv. tomato were the most abundant. For rice, experiments involving Magnaporthe oryzae and Xanthomonas oryzae pv. oryzae were the most common. Soybean, tomato and cassava are less studied plants and, therefore, the number of experiments using these species was scarce. A total of 5, 3 and 3 experiments, respectively, involving these species were used.
Two groups of expression matrices were constructed from pre-processed datasets. The expression matrices used to construct the M-GCNs are summarized in Table 2. The expression matrices used to construct the S-GCNs are summarized in Table S1. As expected, the number of samples and genes in the expression matrices was higher for plants with more experiments (Arabidopsis and rice).
Main results for M-GCN construction: expression matrices dimensions, similarity thresholds and network sizes.
Similarity measurement selection and construction of similarity matrices
Three similarity measurements were evaluated to assess the similarity matrix between genes. We compared the dispersion plots of the similarities that were calculated using the APCC (), NCMI () and NMRS (); formally vs. and vs. (Fig. S1).
For low in which no linear similarity is detected, the high values of and evidence a nonlinear correlation (Fig. S1). In other words, for low Pearson coefficients in which no linear similarity is detected, the NCMI and NMRS were able to detect nonlinear correlation. The genes with linearly correlated expression profiles are placed in the upper right corner, and the genes with nonlinearly correlated expression profiles can be found in the upper left corner. Based on these comparisons, we concluded that NMRS and NCMI are both useful measures in detecting linear and non-linear correlations. Nevertheless, non-linear correlations were better revealed by NCMI. This result is especially important when a similarity threshold τ* is chosen based on the gene pairwise similarity matrix, because some gene pairs with a non-linear correlation would be included in the final gene network. Moreover, for any τ* > 0.5, the number of edges from the non-linearly correlated profiles will be greater if is used (Fig. S1). Given that our goal was to construct GCNs including linear and non-linear relationships between genes, we decided that NCMI was the best metric among the three approaches that were evaluated.
Similarity threshold selection and GCN construction
The similarity matrices were used to test the methodology for the threshold selection. In the M-GCN construction, Fig. 2 shows the difference between the expected clustering coefficient of the random network Cr(τv) (Elo et al., 2007) and the real clustering coefficient that was based on the constructed network C(τv) (see Methods). The curves show a first phase of continuous growth where the non-significant edges are gradually removed (Fig. 2). The maximum difference is reached when well-defined clusters are formed due to the removal of non-relevant edges. The clustering coefficient of the random network should remain lower than that of the real network, as assumed by Elo et al. (2007); however, the curve of Arabidopsis did not show the expected behavior.
The application of methodology for the similarity threshold selection in the M-GCN.
The Arabidopsis curve (Fig. 2) showed that the methodology proposed by Elo et al. (2007) is not suitable for networks where (C(τv) − Cr(τv)) < 0, indicating that Cr(τv) > C(τv). In this work, a minor adaptation of the method was proposed (see Eq. (4) in Methods). Indeed, several alternative ways to utilize the clustering coefficient in the threshold selection can be studied (Elo et al., 2007) and the global optimization problem expressed in Eq. (4) is not unique. Through simulation we determined that the absolute value of the differences between C(τv) and Cr(τv) is suitable for the threshold selection. Accordingly, the maximum absolute value between clustering coefficients is still a reference point to identify the transition between the underlying biological system and those random relationships embedded in the similarity matrix. The adaptation relies also in the basis that the maximum the absolute value, the maximum the difference between real and randomized systems. We successfully applied this adaptation for the entire threshold selections performed in our work.
The similarity threshold that was obtained for the Arabidopsis M-GCN was the lowest (0.89), and its network was the largest among the five plants (Table 2; Fig. 3). The thresholds for the S-GCNs had a wide range of values (0.27–0.93) for all of the species (Table S1). The largest S-GCNs (ids: 8, 44, 6, 13, 40) had more than 1,500 nodes and belonged to experiments that used Arabidopsis and rice.
From Table 2 and Fig. 3 we inferred that the species with more expression data or experiments have larger M-GCNs. Indeed, an association between the number of nodes and the number of samples in the expression matrix was found: PCC = 0.98 (p-value = 0.002). Consequently, the size of the M-GCNs is due to the inclusion of very diverse experiments. When a greater number of different types of experiments are included in the expression matrix, the number of nodes/genes required to represent the underlying immunity system is higher. This requirement is because more information about several functions is presented as different experiments are used.
For the S-GCNs, however, we did not found a clear relationship between the quantity of expression data and the network size. The correlation between the number of nodes and the number of samples from each S-GCN is very low: PCC = −0.24 (p-value = 0.004). In other words, although the size of the S-GCNs is highly variable, this variation is neither correlated with the number of experimental data points nor dependent on the organism.
Comparison of GCNs by Principal Component Analysis (PCA)
For these analyses, we focused on the two groups of GCNs, 59 S-GCNs (summarized in Table S1) and 5 M-GCNs (summarized in Table 2). We aimed to compare the obtained networks between species and experiments. The networks were characterized by eight graph variables: (1) the clustering coefficient, (2) the centralization, (3) the coefficient of variation of the node degree (also known as heterogeneity), (4) the network density, (5)–(6) assortativity coefficients, (7) the tolerance to attacks and (8) the correlation between the node degree and the presence of immunity domains (see Methods).
The characterization matrices for the S-GCNs and M-GCNs were constructed with these variables (Tables S1 and S2). These variables were then summarized using the PCA. The S-GCNs and M-GCNs were projected in the principal component (PC) space (Fig. 4; Fig. S2).
The differentiation of the S-GCNs using the PCA.
Analysis of PCs used to project S-GCNs
The first three PCs were selected and used to represent the data structure in 2D plots (Fig. 4). PC1, PC2 and PC3 explain 33%, 20% and 14% of the total variance, respectively. Accordingly, 67% of the total information is represented in these plots. The PC1 (33%) explains primarily the information that is contained in the variables of heterogeneity and density, the clustering coefficient and the assortativity coefficient (PFAM), predominantly topological information (see Fig. 4C; Table S3 shows each variable’s contribution to the principal components). The PC2 (20%) explains the assortativity coefficient and the centralization, primarily non-topological information. The PC3 (14%) explains the tolerance to attacks and the dependence between node degree and immunity domains (see Fig. 4D). These last variables were not explained by PC1 or PC2; consequently, PC3 is associated mainly with the robustness of the immunity processes.
The dependence of the graph variables with the network size was also studied to verify that characterization of networks was not affected by their size. The PCC between the number of nodes and the graph variables clearly shows that all of the variables exhibited a very small correlation with the size of the network (Table S3); this assures that the PCA was not affected or biased by differences in the S-GCN sizes.
Differentiation of S-GCNs between species
The PCA plots allowed us to differentiate S-GCNs among species. The Arabidopsis S-GCNs are spread over the planes PC1–PC2 and PC1–PC3 (Figs. 4A and 4B). Due to this dispersion, we deduced that Arabidopsis S-GCNs have very different graph variables depending on the experiment analyzed.
In contrast, S-GCNs from other plants were more similar based on the eight variables and, therefore, clustered into specific zones (Figs. 4A and 4B). For example, there was a clear difference between cassava and tomato S-GCNs on PC1. Tomato S-GCNs are denser and more clustered than cassava S-GCNs. Cassava S-GCNs have high heterogeneity. Furthermore, the cassava and soybean S-GCNs were significantly more tolerant to attacks than those of the other species.
Another example of differentiation among species was found in rice. There is a defined group of 5 rice S-GCNs near to the center of the PC1–PC2 plane (Fig. 4A). Their assortativity coefficients are slightly higher than other S-GCNs, indicating that co-expressed genes in rice networks shared more functional annotations than did genes from other plants. These examples demonstrate that variables used for the characterization were useful in differentiating S-GCNs among species. In Article S1, section 4, we explain the position of S-GCNs by the contribution of each variable to the PCs.
Differentiation of S-GCNs between stress groups
The PCA plots allowed us to find similar S-GCNs based on stress groups. A total of five stress groups were defined: Bacteria, fungi, induced resistance, oomycetes and PTI (see Table 1). These stress groups are highlighted using different symbols in Fig. 4.
Networks that were constructed under conditions from the same stress group were found close to each other. For instance, we found that networks 27, 41, 43 and 48 are close to each other and no separation is observed in both planes (Fig. 4). These networks are associated with studies of bacteria in Arabidopsis (id 27; Pseudomonas syringae pv. tomato) and rice (ids 41, 43, 48; Xanthomonas oryzae pv. oryzae and Xanthomonas oryzae pv. oryzicola). In this way, they showed similar graph variables but also could represent comparable immunity process against bacteria in these two species.
Some S-GCNs sharing similar stress groups were also identified in quadrant I of the PC1–PC2 plane (Fig. 4A). For example, networks 34 and 47, which are related to fungi experiments in Arabidopsis (ids 34; Botrytis cinerea) and rice (id 47; Magnaporthe oryzae). In the PC1–PC2 plane, they are forming a closer pair; therefore, their topological variables (clustering coefficient, density, heterogeneity and centralization) are analogous. Because of their position in PC2, we can conclude that they are disassortative and their linked genes do not share many functional annotations. Both networks are also close in the PC1–PC3 plane. Therefore, we can infer that the immunity processes that are represented in these networks (derived from plant–pathogen interactions of rice-Magnaporthe oryzae and Arabidopsis-Botrytis cinerea) could share some similarities.
Despite the previous examples, some networks from the same group of stresses were also found separated. An example of opposing S-GCNs is the pair of networks 9–39. They are related to fungal (Botrytis cinerea, Plasmodiophora brassicae) experiments in Arabidopsis. Both networks are in total opposition in the three PCs. While network 9 is robust and assortative, network 39 is less tolerant to attacks and shows high heterogeneity. A similar result was observed for Arabidopsis networks 10, 35 and 45 from Ralstonia solanacearum. Consequently, even when two networks are associated with the same stress or group of stresses, their graph variables could differ.
Clustering of S-GCNs using the K-means algorithm
The K-means algorithm was used with the aim of finding clusters of S-GCNs (see Methods). We selected an optimum of 10 clusters (Fig. S4). Mainly, induced resistance experiments were gathered together in cluster 7, and PTI stresses were in cluster 8 (Fig. 5). Bacteria and fungi were present in almost all of the clusters.
The results of the S-GCN clustering using the K-means algorithm.
Possible associations between clusters and stress groups were revealed (Fig. 5A; Table S1). For example, cluster 9 grouped some Arabidopsis, rice and tomato networks. In this cluster, networks 36 (Botrytis cinerea) and 54 (Colletotrichum coccodes) share the same stress group (Fungi). Networks 27 and 32 (Pseudomonas syringae pv. tomato) were both from bacteria stresses. Also, networks 7, 18 and 45 were related to induced resistance experiments. This result shows that, for specific networks, a small distance in the PC space could have a biological meaning in correspondence with the experiment.
Besides, experiments related to PTI and salicylic acid (SA) were grouped together (Table S1). For instance, in cluster 5, we found that network 26 from PTI was grouped with network 21 from SA. In cluster 7, network 29 from PTI was grouped with networks 19, 20, 23, 24 and 25 from SA. These findings implied that some stress groups, such as PTI and the induced resistance by SA, are potentially related to similar co-expression behaviors.
As expected, some clusters are enriched with S-GCNs from specific species (Fig. 5B). For instance, cluster 3 is useful to compare experiments from soybean and cassava. While clusters 4, 5, 6 and 8 are exclusively conformed by Arabidopsis networks. Accordingly, the clustering of S-GCNs with the K-means algorithm allowed a straightforward identification of theoretically similar networks based on topological and biological characteristics.
Comparison of M-GCNs
In relation to the M-GCNs comparison, two PCs were analyzed (Fig. S2). We verified that networks with low clustering coefficients had high heterogeneity. Both assortativity coefficients showed information that was different from that of the topological variables, such as the density and clustering coefficient.
From the PCA plot, we conclude that Arabidopsis M-GCN constitute a network with high heterogeneity, but is also more tolerable to attacks. Cassava M-GCN is a disassortative and non-centralized network, and rice, tomato and soybean M-GCNs constitute highly clustered and dense networks.
With the aim of obtaining a general representation of the events that are triggered during plant immune responses and to compare these responses in different plants against diverse pathogens or pathogen response stimuli, GCNs were constructed from the available microarray data from Arabidopsis, rice, soybean, tomato and cassava. A careful selection of the methodology at each step was undertaken to fulfill two main criteria: enhanced objectiveness and enhanced information extraction from the gene expression data.
The careful analyses of the linear and non-linear relationships between gene expression profiles allowed us to select NCMI as the best metric approach. Then, the similarity thresholds were defined by the clustering coefficient method. The GCNs were obtained for the different plants in response to different stimuli. Networks were characterized by graph variables and a PCA was applied. Each network showed a specific pattern and topology, indicating that the networks are species-specific, dynamic entities, and even for the same species in response to the same pathogen, the networks can be quite different (Fig. 4). The comparative GCN analyses between species allowed for the identification of some common elements, indicating a cross-talk between the different signaling responses to pathogens (Fig. 5).
We investigated different factors that should be considered when GCNs are used to propose biological hypotheses. For some plant species, both the number of experiments and the completeness of the genome annotations were inadequate. In some cases, expression data were missing for several genes. These factors reduced the data representativeness, especially for tomato, cassava and soybean, for which expression data were not available for all of the genes of the genome. We observed that the genes in the expression matrices from these plants were incomplete, considering the number of genes that were reported in their genomes (Table 2). The microarray data for Arabidopsis and rice were of better quality, and the expression matrices contained information for almost all of the known genes. These differences in data availability were reflected in the final GCNs in the sense that the information represented in the networks from the plants with less data was also sparse.
Regardless of differences in the quantity and quality of the data, the experiments covered a broad spectrum of conditions. We considered experiments using plants inoculated with bacteria, fungi and oomycetes, including ETI and PTI responses and induced resistance experiments. This choice of experiments allowed for the gathering of a broad representation of immunity processes. Fifty-nine experiments offered a good balance between the representation of plant immunity processes and a sufficient number of samples for statistical analyses (Steuer et al., 2002).
Our methodology aims to have a simple application, low-level computational resources and accurate results to be easily implemented. This methodology for the construction of GCNs falls in a group of methodologies that are usually termed Relevance Networks based on their pairwise measures of similarity (Butte & Kohane, 2000). Evidently, more elaborate strategies involving further mathematical and statistical complexities at each step can be studied (López-Kleine, Leal & López, 2013); however, our interest was neither to study the molecular mechanisms in detail nor causal regulatory relationships among gene products. In this sense, at each step of the methodology, we objectively chose the best method from several available options. We recommend the following methods:
NCMI as the similarity measurement: although the NCMI estimation was more complex than that of the APCC or NMRS, its advantages included the detection of non-linearly correlated pairs of genes and flexibility in detecting any type of dependence between expression profiles.
The threshold definition based on the modified clustering coefficient method: among the methods proposed to objectively select a threshold, we used a method based on the topological features of graphs (Elo et al., 2007) that is easy to implement and is based on a simpler mathematical background (Luo et al., 2007). The method was slightly adapted to consider networks with high heterogeneity, as was the case for the Arabidopsis M-GCN.
The characterization and comparison of GCNs using a PCA: the network comparison based on the topological variables such as density, heterogeneity or centrality allowed for the discovery of only similar patterns of morphology between GCNs. We added new non-topological variables to characterize the GCNs, including tolerance to pathogen attacks, assortativity coefficients related to functional annotations and dependence between node degree and immunity domains. These variables produced a better differentiation of GCNs in the PCs space and revealed biological conclusions about the co-expression systems studied.
The characterization of GCNs depends on the use of variables able to extract the most relevant features. There is an unlimited set of variables that could be selected to characterize networks (Costa et al., 2005). Thus, the inclusion or exclusion of variables relies on the knowledge of the problem. Here, we aimed to compare global patterns of immune responses reflected in coexpression networks. We included a set of variables that mutually exposed the differences among the studied phenomena and extract as much information as possible. However, we found that variables like the density and clustering coefficient were highly correlated, implying redundancy (Figs. 4C and 4D). Similarly, both assortativity coefficients contained equivalent information. We could expect that results will not be drastically altered after removing some of these variables. The clustering coefficient and the assortativity coefficient from GO could summarize adequately the variability observed in their counterparts. Alternatively, removing non-correlated variables could obscure the variability observed and results will change. For example, excluding the tolerance to attacks will reduce the differences between soybean S-GCNs and those of the other species (Fig. 4B). Likewise, adding new variables could reveal relationships not presented in our plots. As expressed by Costa et al. (2005), before altering the characterization matrix, it is of importance to have a good knowledge not only of the most useful variables, but also of their properties and interpretation.
The confidence in the constructed S-GCNs allowed for us to analyze the networks that were obtained for extracting biological knowledge and especially for comparing behaviors between and within species. As stated before, most of the experiments that were analyzed in this study were from Arabidopsis. A broad spectrum of gene expression data for this model plant is available (Schenk et al., 2000; Tao et al., 2003; Zipfel, 2009). The zigzag model that was developed to explain the evolution of plant immunity was constructed based on the knowledge of the pathosystem Arabidopsis-Pseudomonas (Jones & Dangl, 2006; Nishimura & Dangl, 2010). In this sense, the S-GCNs that were constructed during the SAR response or that were induced by SA were based on Arabidopsis data; these and other experiments have contributed significantly to a major understanding of this phenomenon (Schenk et al., 2000), including the identification and action mode of NPR1 and the WRKY transcription factors (Wang, Amornsiripanitch & Dong, 2006; Dempsey & Klessig, 2012).
We compared S-GCNs that were obtained from a deeply studied plant such as Arabidopsis with S-GCNs that were obtained from an almost unstudied plant with scarce transcriptomic data such as cassava. The S-GCNs comparison between these two plants showed that there are few common elements and that their topologies are different. However, the K-means allowed us to obtain a cluster that grouped Arabidopsis and cassava networks (cluster 10). This result is important because, for some genes with unknown functions in cassava, a role in immunity processes could be assigned based on these networks. Several studies have reported the utility of this strategy in assigning a putative function to unknown genes (Ficklin & Feltus, 2011; Hwang et al., 2011). Further experiments employing mutant versions of these genes and using silencing approaches will help to determinate the function of these genes in plant immunity.
We observed that the S-GCNs that were generated from Arabidopsis-Pseudomonas syringae pv. tomato DC3000 (PstDC3000) were very distant, even when they came from the same pathosystem (ids. 13, 27, 32, 37). However, even though these experiments belonged to the same plant–pathogen interaction (Arabidopsis-Pseudomonas), some of them used pathogens (ids. 13, 27 and 32) or plants (ids. 13 and 37) exhibiting mutations in particular genes. Furthermore, the samples were taken at different time-points in all of the experiments (see link to summary of experiments in Table 1). Taking together, these results suggest that minor changes, such as the mutation of individual genes in the plant or the pathogen, produce networks with different topologies. In addition, networks seem very dynamic given the important changes they suffer considering different time-points during the immune responses. This aspect indicates that the construction of a network represents only a reduced aspect of the whole gene co-expression in the cell at a given moment, and no generalization can be made for the entire life cycle of a plant cell.
The PTI and ETI responses shared similar responses (ion fluxes, production of ROS and activation of Map kinases); we expected to observe more similarities for the PTI and ETI networks. However, we observed that several PTI networks (ids. 5, 6, 8) were not similar to ETI GCNs, due to the highly dynamic nature of these cellular responses. Similar results were obtained experimentally, where the expression of only a few genes showed an overlap between the PTI and ETI (Navarro et al., 2004).
On the other hand, we observed that different networks that were constructed from experiments involving the PTI were very similar to each other, even when they correspond to induction for different MAMPs. For example, networks 5 and 29 are closer in PC2 and exemplify the induction of different MAMPs: flg22 and chitin. Previous studies have reported a very similar response to flagellin and Elongation Factor Tu (Zipfel, 2009). A similar situation was observed with networks that were constructed from induced resistance and that were grouped together (cluster 7). This result suggests that the PTI and induced responses are robust and are not strongly influenced by other environmental conditions. These types of robust responses were previously reported for incompatible interactions (Tao et al., 2003).
It is also interesting to note that the GCNs that were obtained from the PTI and induced responses were also similar (clusters 7 and 5), supporting previous experimental studies (Tsuda et al., 2008). The ETI has been considered a stronger but very specific response for a particular race of pathogens (Jones & Dangl, 2006). The distal-induced resistance that is activated once the ETI has started or the response induced by hormones such as SA also produces a weak but efficient response against a broad spectrum of pathogens. The PTI is weak as well but can confer resistance to a larger group of non-adapted pathogens. It would be interesting to study more in detail whether there is a relationship between a robust, weak response and the spectrum of resistance.
The rice networks in response to two different bacteria (X. oryzae pv. oryzae and X. oryzae pv. oryzicola) showed a high degree of similarity (ids. 41, 43, 48, Fig. 4). This result is interesting given that the two bacteria employ different strategies of infection. The first bacteria colonize the vascular system, and the others reside on the apoplast. Consequently, both bacteria produce different symptoms (Hajri et al., 2012). The similar network topologies that were observed in our study suggest that, although the colonization is different, the molecular plant responses and genes involved are related in both cases.
Another example comprises the network 46. This network was obtained from rice plants that were inoculated with X. oryzae pv. oryzae, but also shows some degree of similarity with a network from Magnaporthe oryzae (id. 41). Some of the pathways can be shared in response to different pathogens at particular times during the infection or response. Consequently, the networks can exhibit this type of similarity.
In response to similar pathogens, plants can activate conserved signaling pathways. For example, we observed that two unrelated plants such as Arabidopsis and rice (dicotyledonous and monocotyledonous) react in similar ways in response to bacteria (ids. 27, 41, 43, 48, Fig. 4). This response does not indicate that the genes are the same, but rather that some degree of conservation of their function exists. Therefore, it is possible that some plant responses to a particular group of pathogens can be more “stable” and conserved. Considering all of these observations, it is important to consider aspects such as the type of interaction (compatible, incompatible, non-host) evolutionary relationship and mode of colonization between pathogens, as well as the time-points after pathogen inoculation when identifying common or shared elements between the networks.
The networks that were constructed for a species by merging several experiments are different from each other. They have also different characteristics from the networks that were constructed from only one microarray experiment. Differences between S-GCNs and M-GCNs are especially striking for Arabidopsis, which questions the validity of the global network merging all of the experiments. Our results indicate that a global immunity process gene co-expression network is very difficult to construct and could hardly resume global information on this complex process. Moreover, the high level of diversity found between S-GCNs indicates that, depending on the pathogen and type of immunity process that is triggered, the obtained network will be different. Therefore, we conclude that global networks such as those that were previously constructed by Atias, Chor & Chamovitz (2009), Pop et al. (2010) and Mutwil et al. (2011) could mask important gene relationships that are characteristic of a particular process. Also, these global networks could enhance relationships that are specific to only one biological process. Those gene relationships that arise only under special environmental and biological circumstances are better represented by process-oriented networks such as those that were previously constructed by Nakashima, Ito & Yamaguchi-Shinozaki (2009) and Lee et al. (2011).
As a major finding, the closeness of GCNs on the principal component space is indicative of similar plant immune responses and conserved signaling pathways. The comparison of GCNs suggests cross-talk between the different responses to pathogens within plant species. It is possible that some plant responses to a particular group of pathogens are not only conserved but also more robust. Theses similarities between S-GCNs are a valuable source of predictions that can be considered in future works.
The representation of coordinated transcription through GCNs is necessary to gain comprehensible knowledge from the underlying transcriptomes. We showed that global immunity process should not be explored using the M-GCN approach. The comparative S-GCNs analyses allowed to conclude that dynamic of molecular plant responses produce networks with different characteristics. As a consequence, M-GCNs cannot properly summarize the experimental data.
Co-expression network construction and topological analysis
Constructing co-expression networks
Most of the existing co-expression analyses construct value-based networks. We believe that the value-based methods are significantly limited by their use of a homogeneous threshold for all the genes in the network. In reality, genes in different functional pathways may be regulated by different mechanisms, and therefore may exhibit different patterns of co-expression. In particular, genes in one functional pathway may be strongly mutually co-expressed, while genes in another functional pathway may be only weakly co-expressed. As a result, if we choose a stringent global threshold, many genes in the weakly co-expressed pathway may be disconnected. On the other hand, if we attempt to connect the weakly co-expressed genes into the network, the threshold may become so low that the genes in the strongly co-expressed pathway may have many links to genes in other pathways, making further analysis difficult. For example, as shown in Figure 1, to construct a co-expression network for the 3000 yeast genes that we will see in the next subsection, if we allow only 10% of the genes to have no connections, most genes will have more than 300 connections, while if we reduce the median degree to 10, more than one third of the genes will have no connection at all.
Median degree and number of singleton nodes in a value-based yeast co-expression network. Horizontal axis: the Pearson correlation coefficient threshold for the value-based network construction. Left vertical axis: median number of co-expression links...
To deal with this problem, we propose a simple rank-based method to construct co-expression networks. We first calculate the Pearson correlation coefficient (or some other similarity measure) between every pair of genes. For each gene gi, we rank all other genes by their similarity to gi. We then connect every gene to the d genes that are most similar to it. Compared to the value-based method, the rank-based method essentially uses different local similarity threshold for different genes. It is important to mention that even with a fixed d, the number of connections for different genes is not constant. This is because of the asymmetric nature of the ranking. In other words, the rank of gene i with respect to gene j is not necessarily equal to the rank of gene j with respect to gene i. Therefore, although gene i has only d genes on its top-d list, other genes that are not on i's list may list i as one of their top-d genes. The mean degree is between d and 2d, the minimum degree is d, and the maximum degree can be as large as n - 1, with n being the number of genes in the network.
The rank-based method may appear to be limited by a similar drawback of the value-based method - the former uses a global rank threshold and the latter uses a global value threshold for all genes. However, as we discussed above, in the rank-based network, different genes can have different number of connections, even though all genes have the same rank threshold, because of the asymmetric nature of ranking. More importantly, our objective is not to identify all co-expressed genes for each gene, but to construct a sparse network such that the modular structure of the system can be successfully identified. To achieve this, a good co-expression network needs to have the following two properties: (i) there are very few false-positive connections, and (ii) nodes within modules are well connected into a single component, while connections between modules are sparse. A value-based network can hardly provide the two properties simultaneously, for reasons given above. In contrast, the key idea in our rank-based method is that by using a uniform small value of d, we ensure that (i) the network only contains highly reliable edges, and (ii) each module of the network is (almost) fully connected into a single component, both theoretically and empirically (see below). As in most clustering algorithms, we assume that gene expressions in different modules are generated by different distributions, while gene expressions in the same module are generated by a common (unknown) distribution. Therefore, the rank-based sub-network of genes from the same module is a nearest neighbor graph constructed on a set of random geometric points. Theoretically, it is known that a nearest neighbor graph on random geometric points has a high probability to be connected even with a very small number of neighbors (d) . To empirically test this as well as to find the range of d for typical microarray data, we randomly generated a data set with 1000 genes and various dimensions (conditions) using Gaussian distribution. We then constructed a rank-based co-expression network using different values of d and measured the number of disconnected components in the resulting network. Remarkably, we find that for data of dimension > 10, a nearest neighbor graph with 1000 nodes is almost always fully connected with d = 2 neighbors. Even for data of smaller dimensions, the graph can be connected with at most 4 neighbors (Figure 2). The results do not vary significantly when the number of genes or the type of distribution is changed. In the next subsection, we also show that the yeast gene co-expression network is connected with d = 2. In practice we find a value of d between 3 and 5 is sufficient for most cases. This simple network construction method can also be combined with other strategies that were developed for value-based networks. For example, the raw similarity values can be refined by considering local neighborhood or shortest path information before rank transformation ; when selecting edges according to ranks, a threshold based on raw similarity values may be imposed simultaneously to ensure confidence in the edges being created. Ideally, methods can also be developed to automatically select the optimal d, as in . The rank-based method can also be applied to construct networks of other entities, as long as a similarity measure can be defined. One example is to construct a network of samples from microarray data, where the nodes are samples and the similarity between two samples can be measured by the Pearson correlation coefficient between their gene expression profiles. Later we will show an application of a sample co-expression network where each sample is a cell type.
Connectivity of rank-based co-expression networks on random data. Each data set contains 1000 random geometric points in a certain number of dimensions, generated using the standard Gaussian distribution. Y-axis shows the number of disconnected components...
Topology of yeast co-expression networks
Previous studies have analyzed the topologies of various networks, including biological and social networks, and suggested three common topological properties: scale-free, small-world, and hierarchically modular [26,27,32-34]. Although debate exists [35,36], it is generally believed that these properties may be related to the robustness and stability of the underlying systems [26,27,32-34]. For example, a small-world network has a small diameter and a large clustering coefficient (see Methods), which is believed to be related to an efficient and controlled flow of information [26,34]. In a scale-free network, the probability for a node to have k edges follows a power-law distribution, i.e. P(k) = c × k-γ. The implication of the scale-free property is that a few nodes in the network are highly connected, acting as hubs, while most nodes have low degrees. Scale-free networks are believed to be robust to random failures, but vulnerable to deliberate attacks [27,32]. In comparison, in a random network (specifically, an Erdos-Renyi random network ), connections are spread almost uniformly across all nodes [26,34]. Furthermore, although a random network may also have a small diameter, it usually has a near zero clustering coefficient [26,34]. Several studies have analyzed the value-based gene co-expression networks, and reported some interesting but controversial results [11-13,25]. Here we analyze the topologies of both the rank-based and the value-based networks, and compare with previous results.
We obtained a set of yeast gene expression data measured in 173 different time points under various stress conditions , and selected 3000 genes that showed the highest variations. We constructed four gene co-expression networks using the rank-based method with d = 2, 3, 4 and 5, respectively. For each rank-based network, we constructed two random networks as follows. First, we randomly permuted the expression data of each gene independently, and constructed a rank-based network using the permuted data. Second, we randomly rewired the connections in a true rank-based network, but preserved the degree for every node . For comparison, we also constructed four value-based networks, using the Pearson correlation coefficient as a similarity measure. The thresholds were chosen such that the average degrees are 10, 30, 50, and 100, respectively, in the resulting networks. Similar to the rank-based networks, we obtained two random networks for each true value-based network, one constructed from randomly permuted data and the other by randomly rewiring the true network.
Table 1 lists some statistics of these networks. In the rank-based networks, almost all genes are linked to the largest component with d as small as 2. Furthermore, compared to both the randomly rewired networks and the networks constructed from randomly permuted data, the true rank-based co-expression networks have slightly larger average path lengths and diameters, but much larger clustering coefficients, indicating that the rank-based co-expression networks have the small-world property. In contrast, the true value-based co-expression networks contain many singletons. For example, with a Pearson correlation coefficient threshold of 0.69, about 900 genes are singletons, even though the average node degree is much higher than in the rank-based networks. Furthermore, although the value-based networks have high clustering coefficients, their randomly rewired counterparts have almost similarly high clustering coefficients. This observation suggests that the high clustering coefficient of the value-based networks is partially because their non-singleton nodes are almost completely connected, in which case the structure cannot be destroyed by any random rewiring.
Statistics of yeast co-expression networks
Figure 3(a) and 3(b) shows the degree distributions of these networks. As indicated by a linear relationship in the log-log plot, the rank-based networks constructed from the real data exhibit a power-law degree distribution for all the d values considered. This suggests that an overall scale-free topology is a fairly robust feature of the co-expression networks. In contrast, the networks constructed from randomly permuted gene expression data contain significantly fewer high-degree nodes, and exhibit exponential degree distributions. The value-based networks appear to follow power-law degree distributions as well; however, they have a much larger number of high degree nodes than the rank-based networks.
Topological properties of co-expression networks. (a) Degree distribution of rank-based co-expression networks. (b) Degree distribution of value-based co-expression networks. (c) Relationship between clustering coefficient and degree in rank-based and...
To quantify the difference between the degree distributions of the value-based and rank-based networks, we fitted a power-law function for the degree distribution of each network to determine its γ parameter. The values of γ in the rank-based networks are consistently between two and three. This is typical in many biological networks such as PPI networks and metabolic networks, as well as in real-world social and technology networks [26,34]. In comparison, the γ values in the value-based networks are below one (Figure 3b). Theoretically, it is known that a scale-free network with γ < 2 has no finite mean degree when its size grows to infinity, and is dominated by nodes with large degrees . Therefore, small values of γ for co-expression networks were reported in several previous studies as a significant difference between the co-expression networks and other biological networks [15,17]. Our results suggest that this difference may simply be an artifact of the network construction method. Consider that genes in some modules are strongly co-expressed with one another, while genes in some other modules are weakly co-expressed. Using the value-based method, when the similarity cutoff is gradually decreased, the genes within the strongly co-expressed modules will be first connected, up to a point that they are almost completely connected, before any gene in the weakly co-expressed modules can be connected to their within-module partners. As a result, the co-expression network will have many genes with large degrees, resulting in a small slope in the log-log plot. In contrast, with the rank-based method, genes in both strongly and weakly co-expressed modules can be connected, as essentially a different similarity threshold is used for each gene. Therefore, rank-based networks can usually capture the topology of both strongly and weakly co-expressed modules, while value-based networks are often dominated by the strongly co-expressed modules.
Moreover, previous studies have reported that gene co-expression networks lack the hierarchically modular property [11,12]. This property is characterized by a reciprocal relationship between a node's degree and its clustering coefficient . Again, we have found that this claim only applies to the value-based networks. As shown in Figure 3(c), there is a clear reciprocal relationship between the node degree and node clustering coefficient in the rank-based networks, when compared to the value-based networks. This suggests that gene co-expression network can also have hierarchical structures.
Together, these experiments show that the rank-based co-expression networks have all the common topological properties of many other biological networks, while the value-based networks seem to differ significantly. Although these do not necessarily prove that rank-based networks are biologically more meaningful than value-based networks, the former seems to be able to capture the underlying topological structures better.
Module discovery and analysis in gene co-expression networks
Gene co-expression networks with thousands of nodes are difficult to visualize and comprehend. A useful strategy for analyzing such a network is to partition it into subnetworks, where the nodes within each subnetwork are relatively densely connected to one another but have fewer connections to the other subnetworks. In gene co-expression networks, such subnetworks can be considered as candidates of functional modules, as genes within each subnetwork are mutually co-expressed, while co-expression between genes in different subnetworks are sparse. Many graph partitioning algorithms have been developed in computer science . Similar to clustering, one major difficulty in graph partitioning is to determine the number of partitions. Some methods do not require this to be explicitly determined in advance, but require other parameters, which are also difficult to obtain. For example, MCL, one of the best graph partitioning algorithms, requires an inflation parameter, and setting the parameter to different values may result in very different results .
To address this difficulty, we introduce an algorithm that we have developed recently for identifying "communities" in arbitrary networks . The main motivation for the algorithm is that each "community", or a subnetwork, must contain more intra-community edges than would be expected by chance if the connections were random. With this motivation, we developed an algorithm to optimize an objective function called modularity, which is precisely defined as the percentage of intra-community edges minus the random expectation (see Methods). The algorithm, named Qcut, has been shown to be effective in finding statistically significant and practically interesting graph partitions in many synthetic networks, social networks, and biological networks, without any user-tunable parameters, and has outperformed the existing algorithms based on similar motivations .
We evaluate the performance of Qcut on gene co-expression networks in several ways. We first use synthetic microarray data where the true modular structure is known, so that we can directly measure the accuracy. We then use two real microarray data sets to evaluate the overall biological significance of the identified gene modules, with two different metrics. The first metric is a commonly used approach based on the enrichment of specific Gene Ontology terms in the modules, which may be biased by the number of modules and module sizes. The second is a new metric that we introduced based on the idea of reference networks, which can be obtained from a variety of sources, such as gene annotations or protein-protein interaction networks (See methods).
Evaluation using synthetic microarray data
To objectively evaluate the accuracy of the modules detected by Qcut, we tested it on a large collection of synthetic gene expression data. The data sets, available at http://www.biostat.pitt.edu/bioinfo/publication.htm, were used to evaluate many clustering algorithms in a previous study . Each data set contains simulated expression data of approximately 600 genes under 50 conditions. Each gene was pre-assigned to one of fifteen clusters, and the genes in the same cluster had their expression profiles generated from a common log normal distribution. Gaussian noises were then added to the data set to simulate experimental noises. A higher level of Gaussian noise generally makes the data more difficult to cluster. Since the correct clusters are known, we used a well-known metric called the adjusted Rand Index to measure the accuracy of Qcut (see Methods) .
We first compared the accuracy of Qcut on co-expression networks constructed by three methods: value-based, rank-based, and CLR . We used Euclidean distance as the basis to measure the dissimilarity between two genes. For the value-based method, we normalized the distance to be between 0 and 1, and constructed a series of co-expression networks for each data set using different threshold values. As shown in Figure 4(a), the threshold that results in the best clustering accuracy varies for different data set. For more noisy data set a larger threshold value is needed, which suggests that choosing a right threshold is critical for the value-based method. The CLR method, in contrast, by converting the raw distances to z-scores, effectively removed such dependency and the best clustering accuracy is achieved at the same z-score equal to 2, corresponding to a p-value 0.05, for all data sets (Figure 4b). Interestingly, for the rank-based method, the clustering accuracy is almost invariant for rank cutoffs between 2 and 8 (Figure 4c). Figure 4(d) shows the best accuracy that can be achieved on the three types of networks. As can be seen, the rank-based networks clearly have the highest accuracy for intermediate levels of noises (SD = 0.4 or 0.8). For data with lower noises, all three methods resulted in perfect accuracy, and for data set with the highest level of noise (SD = 1.2), all three methods converges to about the same accuracy. Next we compared the clustering accuracy of Qcut on rank-based networks with several widely-used clustering algorithms including k-means clustering, hierarchical clustering , and tight clustering , applied directly to the gene expression data without deriving co-expression networks. In this test, Qcut was applied to rank-based co-expression networks constructed using d values equal to 4. In addition, we also tested one of the best graph partitioning algorithms called the Markov Clustering algorithm (MCL) , which is applied to rank-based networks as well. Since the results of MCL depend heavily on the choice of an inflation parameter (I), we applied MCL to the rank-based networks constructed with d fixed at 4, but varied I from 1.3 to 1.7, with an increment of 0.1, and took the best clustering accuracy resulted from these parameters. We used the MATLAB (the MathWorks Inc.) implementation of the k-means and hierarchical clustering algorithms. k-means clustering was run with k equal to 15, and was repeated 50 times for each data set to obtain the best results. The hierarchical clustering was performed using average linkage, and the final cluster tree was cut at an appropriate depth to generate 15 clusters. The accuracies of tight clustering were directly obtained from the original study that was done on the same data set . Our evaluation results show that, even without any parameter tuning, Qcut outperformed the competing algorithms in identifying the true modular structures embedded within the synthetic microarray data. As shown in Figure 5, the clustering accuracy of Qcut is clearly better than that of the hierarchical clustering and tight clustering. The accuracy of Qcut is similar to k-means, except for the data sets of the highest level of noise. The synthetic data set with the highest level of noise may represent an extreme case in practice, as many of the clusters in this data set are not distinguishable visually (Figure S1 in Additional File 1). However, k-means achieved this accuracy with the number of clusters given explicitly, while Qcut did not have this information at all. In these synthetic data sets, the number of clusters is the single most important parameter and k-means is expected to work well when that is known. We tried to combine k-means with several popular methods to automatically determine the number of clusters, including the gap statistic , Silhouette , and the Dunn's Index . Our results suggest that if the values of k are automatically determined, k-means performs much poorer than our method, especially for data sets with SD ≥ 0.4 (Figure 5). The results of MCL are two-fold. On one hand, when an appropriate inflation parameter is chosen (I = 1.5 in this experiment), MCL has an accuracy similar to that of Qcut, except for the data set with the highest noises, indicating a superior performance of graph-based algorithms in general. On the other hand, the accuracy of MCL depends on the choice of the inflation parameter, and may be much lower than that of clustering algorithms if a suboptimal inflation parameter is used (data not shown).
Effects of network construction methods on the clustering accuracy of Qcut. (a) Clustering accuracy on value-based networks, as a function of the distance cutoffs. (b) Clustering accuracy on CLR co-expression networks, as a function of the Z-score cutoffs....
Comparison of different clustering methods using synthetic microarray data. Qcut and MCL are applied to rank-based networks constructed with d = 4. Each data point in the plot is an average over 100 synthetic microarray data sets.
Functional modules in a yeast gene co-expression network
To evaluate the performance of our algorithm on real biological data, we applied Qcut to cluster the four rank-based yeast co-expression networks constructed in the previous subsection. The best numbers of clusters suggested by Qcut for the four networks are 24, 20, 12 and 12, respectively. As shown on synthetic data, when the number of clusters is known, the clustering algorithms that explicitly use this information, such as k-means, perform better than those without this information, such as the hierarchical clustering. Therefore, here we compared the performance of Qcut to k-means and two other popular clustering algorithms, namely self-organizing maps (SOM)  and spectral clustering , both of which require the number of clusters to be given. Applying the value-based method to the yeast data set, as we showed in Figure 1, the network either contains too many singletons, or are too densely connected to be partitioned by Qcut (or any other graph algorithms). Therefore, we did not evaluate its accuracy on the real data set. Interestingly CLR had similar behavior as the value-based method on the yeast data set, indicating that it is partially value-based.
We used the SOM implementation in the microarray software suite TM4 from TIGR, and implemented our own version of spectral clustering. K-means, SOM and spectral clustering were applied directly to the expression data, using Pearson correlation coefficient as the similarity metric. We obtained 24, 20, 12 and 9 clusters using each of the three competing algorithms. To test if the competing algorithms may give a better result with a different setting of the number of clusters, we also applied the spectral clustering with the number of clusters k equal to 5, 6, ..., 25. SOM was executed on 4 × 6, 4 × 5, 3 × 4, and 3 × 3 grids to produce the desired number of clusters. Because Qcut identified 12 clusters on both the d = 4 and d = 5 networks, we compared the 12 clusters identified from the d = 5 network by Qcut with the 9 clusters identified by the competing algorithms to avoid redundant comparison. Another reason for this comparison is that Qcut often produces a few small clusters, while the clusters of the competing algorithms are relatively uniform in size. Therefore, the "effective" number of clusters is smaller for Qcut than for other algorithms, so we used this comparison to compensate for the differences in cluster sizes.
To validate the biological significance of the clusters, we first counted the number of Gene Ontology (GO) terms enriched in the clusters and the percentage of clusters that had at least one enriched GO term, at various significance levels. As shown in Figure 6, the clusters identified by Qcut contain more enriched GO terms than the competing algorithms for most of the p-value cutoffs (Figure 6(a)-(d)). Furthermore, the percentage of clusters containing at least one enriched GO term is also higher for Qcut than for the other algorithms (Figure 6(e)-(h)).
Enrichment of GO terms in yeast co-expression networks. Vertical axes in (a)-(d): number of GO terms enriched in the clusters. Vertical axes in (e)-(h): percentage of clusters that are enriched with at least one GO term. Horizontal axes: p-value cutoff...
Second, as the above measurement may be biased by the number of clusters or the cluster size distributions, we compared these algorithms using a novel metric based on three reference networks that capture different functional interactions between genes: a co-annotation network based on GO terms of biological processes, a co-regulation network based on ChIP-chip data, and a PPI network (see Methods). While the PPI network has unweighted edges, the GO-based and ChIP-based reference networks have weighted edges. We discretized the GO-based and ChIP-based reference networks using a series of weight cutoffs. We then scored the gene modules identified by each algorithm using these reference networks (see Methods). As shown in Figure 7 and Figure S2 in Additional File 1, the gene modules identified by Qcut always had the highest score, using all three types of reference networks. The spectral clustering algorithm performed better than the other clustering algorithms. As Qcut and the spectral clustering algorithm share some similar spirit in capturing the topology embedded in a data matrix, this result seems to suggest that topological features are important for achieving good clustering results on gene expression data. We also randomly shuffed the gene modules identified by Qcut while preserving the sizes of the modules, and scored the random clusters using the reference networks. The scores for the random modules are very close to zero in all cases (Figure 7 and Figure S2 in Additional File 1), indicating that the results obtained by
Yeast gene co-expression network module scores based on reference networks. Reference networks are derived from GO annotations (a-d) and ChIP-chip data (e-h). Horizontal axes: edge weight cutoff for the reference networks.
Qcut and the clustering algorithms are not due to chance
The modules in the d = 2 co-expression network have the worst scores according to all three reference networks (Figure 7 and Figure S2 in Additional File 1), indicating that the d = 2 network might be too sparse to capture all functional relationships. The d = 4 co-expression network has the highest scores according to all three reference networks, while the networks with d = 3 or 5 give slightly worse results. As stated above, to test if the competing algorithms may give a better result with a different setting of the number of clusters, we applied the spectral clustering, which has a better accuracy than the other two algorithms, with the number of clusters k equal to 5, 6, ..., 25. We scored the gene modules using the GO-based reference network at a weight cutoff of 0.8. As shown in Figure S3 in Additional File 1, the best module score achieved by spectral clustering is 0.323 (at k = 13), which is significantly lower than the module score by Qcut (Q = 0.384). This shows that our evaluation results have not been biased by the number of clusters.
In addition, we combined k-means with gap-statistic , which estimated the optimal number of clusters to be 6. We also tested SOTA , a hierarchical version of SOM that can automatically determine the number of clusters. SOTA returned 11 clusters, a number similar to our optimal number. However, only 6 of the 11 clusters had significantly enriched functions, and the most significant function had a p-value 10-49, as compared to 10-106 in our method (Table 2 and Table S1 in Additional File 1).
Functional modules in a yeast co-expression network
Table 2 summarizes the modules identified by Qcut from the d = 4 network. For each module, we show the most significantly enriched GO biological process terms and the transcription factors whose targets are significantly enriched in the module (see Methods). As shown, most modules contain highly coherent functional terms, and are co-regulated by common transcription factors. For example, the majority of genes in module 12 are involved in protein biosynthesis (p = 10-85), and can be bound by FHL1 (p = 10-105) and RAP1 (p = 10-48), both of which are known to be involved in rRNA processing and regulating ribosomal proteins . Module 9 is significantly enriched with genes that are involved in generation of precursor metabolites and energy (p = 10-33), and can be bound by HAP4 (p = 10-16), a TF regulating carbohydrate metabolism . Module 2 contains almost two-thirds of the ribosome biogenesis genes (p = 10-106), but no TFs were found to bind to this set of genes specifically. Module 11 is enriched with genes that can be bound by eight different TFs. Interestingly, all eight TFs are known cell-cycle regulators . Several small modules correspond to specific functional groups. For example, 17 of the 22 genes in module 10 are involved in Ty element transposition (p = 10-29). Half of the 18 genes in module 3 are related to chromatin assembly or disassembly (p = 10-13); six of them are regulated by transcription factors HIR1/2/3, which are known to be involved in the transcription of histone genes . Modules 5 and 7 contain both a large fraction of genes with unknown functions and groups of genes with significantly enriched common functions or common TFs. It is possible that these uncharacterized genes also have the functions that are enriched in the module.
We also found a very interesting small module that may deserve further investigation. Among the 25 genes in module 4, four genes have a common function in telomere maintenance (p = 10-7). While the majority (16) of the remaining genes encode hypothetic proteins and have unknown functions (p = 10-6), a careful inspection showed that all four characterized genes and five of the 16 uncharacterized ones are located near telomeric regions . Moreover, a significant number of genes in this module are regulated by four common transcription factors: GAT3 (p = 10-21), YAP5 (p = 10-17), PDR1 (p = 10-11), and MSN4 (p = 10-11) (Figure 8 and Table 2). Our results suggest that these uncharacterized genes as well as the four transcription factors are very likely to be involved in the function or maintenance of telomere, which has not been reported in the literature.
A network of co-expressed and co-regulated genes with functions in telomerase maintenance. Each directed edge pointing from a TF to a gene represents a protein-DNA interaction. All other edges represent co-expression relationships.
Functional modules in an Arabidopsis gene co-expression network
To test our method on high organisms, we applied it to a set of Arabidopsis gene expression data downloaded from the AtGenExpress database http://www.uni-tuebingen.de/plantphys/AFGN/atgenex.htm. We studied the cold stress response, for which the dataset contains the expression of 22 k Arabidopsis genes in the root and shoot tissues in six time points following cold stress treatment and under the normal condition. We selected 2545 genes that are up-or down-regulated by at least five-folds in at least one of the six time points in either tissue. We then constructed a co-expression network by connecting each gene to its top three correlated genes (i.e. d = 3), resulting in 5838 co-expression links in total.
Our clustering algorithm partitioned the network into 19 modules, with a Q value of 0.81, indicating a strong modular structure. Similar to the previous experiment, we first examined the GO terms enriched in the clusters at various significance levels, and compared them with the results of the k-means algorithm with k = 19. As shown in Figure 9(a) and 9(b), the modules identified by our method contain significantly more enriched GO terms than that identified by k-means, and GO terms are enriched in more modules in our method than in k-means. Furthermore, our method achieved significantly higher module scores in the GO-based reference networks than the k-means algorithm (Fig 9(c)). Table S2 in Additional File 1 lists the most enriched functional categories for each module. Several modules are enriched with functions that are known to be related to cold stress responses, e.g. modules 7 (photosynthesis, p = 10-16), 11 (circadian rhythm, p = 10-5), 14 (response to heat, p = 10-15), 15 (antiporter activity, p = 10-6) and 18 (lipid binding, p = 10-8). Due to the scarcity of functional annotations in Arabidopsis, the enrichment p-values as well as the module scores of the gene modules are less significant in Arabidopsis than in yeast, which is to be expected.
Enrichment of GO terms in the Arabidopsis co-expression network. (a) Number of enriched GO terms; (b) Percentage of clusters with at least one enriched GO term; (c) Gene module scores measured by the GO-based reference networks. Horizontal axes in (a)...
Cancer subtype classification from sample co-expression networks
In this subsection, we show that our co-expression network analysis method can also be applied to identify sample modules. Conceptually, there is no difference in constructing and analyzing gene or sample co-expression networks: in the latter we treat each sample as a network node, and connect two samples when their expression profiles are similar according to some similarity measure. In practice, however, sample co-expression networks are usually smaller than gene co-expression networks, but the edges in sample co-expression networks may be computed from very high-dimensional data, since each sample is described by thousands of genes. It is therefore interesting to see whether the same network construction and module detection methods can be applied here without much tuning of the parameter, i.e. the value of d. As an application of our approach, we applied our method towards an automatic classification of tumor cells.
An accurate classification of tumor cells is crucial for effective therapy . Traditionally, tumors have been classified by their morphologic appearance, which is, unfortunately, often very subjective. Furthermore, tumors with similar histological features may respond very differently to chemotherapy . To address this problem, a promising alternative or complementary strategy is to classify tumors based on their genetic profiles, i.e. the activity of hundreds or thousands of genes that are involved in the disease. Most of the existing tumor classification approaches are based on supervised learning, such as support vector machines or decision trees, which aimed at identifying genetic features to distinguish two or more known tumor (sub-)types [49,50].
Here, we ask whether it is possible to automatically classify tumor samples in the absence of training data that provide known tumor (sub-)types or even the number of distinct (sub-)types. This unsupervised learning approach has a few advantages over supervised learning methods. First, the existing tumor classifications are based on histological features, which may be unreliable themselves. Second, using unsupervised learning, we may be able to discover novel tumor sub-types that have not been characterized by histological features previously. On the other hand, it is crucial to confirm whether the automatically discovered tumor (sub-)types are biologically meaningful and indeed provide useful information for understanding the disease mechanism or for improving its treatment.
In this study, we chose to focus on lymphoma, a family of tumors involving cells of the immune system. We obtained a data set containing the expression data of 4026 genes for 96 samples belonging to nine cell types, including three different types of tumors, namely, diffuse large B cell lymphoma (DLBCB), chronic lymphocytic leukemia (CLL), and follicular lymphoma (FL), as well as normal B and T cells at different stages of cell differentiation . We constructed a rank-based network of the samples using Pearson correlation coefficient as the similarity measure and the value of d at 5. Edges with a Pearson correlation coefficient < 0.2 were removed to reduce false connections.
Applying Qcut to the network, we identified eight modules. Remarkably, without any prior knowledge about the number of cell types in the samples, our algorithm automatically separated different cell types into different modules, with a few exceptions (Figure 10 and Table S3 in Additional File 1). Furthermore, the results are almost invariant when we varied d between 3 and 7, indicating a very stable modular structure among the samples. As shown in Figure 10, blood T cells and activated blood B cells are perfectly classified into their own modules. CLL and resting B cells are grouped into a single module, which is not surprising since the CLL has a very low proliferation rate, similar to resting B cells . The two germinal centre B (GCB) cell samples are grouped with the FL cells and are far away from the activated B cells, which confirms the hypotheses that GCB represents a distinct stage of B cell differentiation, and that the FL arises from this stage of B-cell differentiation . The transformed cell line module also contains three DLBCL cells. A closer inspection showed that two of the DLBCB samples in this module (OCI-Ly1 and OCI-Ly3) are laboratory-cultivated cell lines rather than samples from real patients (Figure 10), which may be the reason that they are grouped with the transferred cell lines. The Lymph node/tonsil cells are grouped with DLBCL as in previous studies .
A co-expression network of cancer cells. Each cluster is shown with a different color. Each cell type is represented by a unique combination of the shape and text inside a node. Square nodes with D inside represent DLBCL cells. DLBCL outliers that were...
Interestingly, the majority of the DLBCB samples are grouped into three modules (DLBCL-1, 2, and 3). It is well known that not all DLBCL tumors are equal: 40% of patients respond well to chemotherapy and have prolonged survival, while the others have a much shorter survival time after treatment . Previous studies have suggested that DLBCL tumors can be categorized into two subtypes: GCB-like DLBCL and activated B-like DLBCL . The GCB-like DLBCL shares gene signatures that distinguished germinal centre B cells from B cells in other stages, while the activated B-like DLBCL shared many gene signatures with normal lymph node and tonsil. On average, GCB-like DLBCL patients have a much higher survival rate than activated B-like DLBCL patients after comparable chemotherapy . In our sample co-expression network, the GCB cells connect to DLBCL-1 cells while the tonsil and lymph node cells fall in the DLBCL-2 modules, suggesting that the two modules may correspond to the two well-known DLBCL subtypes. Indeed, the median survival durations for the patients in the DLBCL-1 and DLBCL-2 modules are 71 and 22 months, respectively. Furthermore, 11 out of 15 (73%) patients in the DLBCL-1 module lived more than five years after treatment, while only 2 out of 13 (15%) patients in the DLBCL-2 module survived that long. Even more interestingly, however, the DLBCL-3 module, which is apart from GCB and Tonsil/Lymph nodes, has the lowest survival rate overall. The median survival durations for this module is 12 months, and only and 1 out of 10 (10%) patients in this module survived more than five years after treatment. The lack of GCB-like or activated B-like signatures and the lower survival rate seems to suggest DLBCL-3 to be a separate subtype.