Bioinformatics and Computational Biology

Protein Clusters

Approximately 8.76M proteins were clustered using the multi-core pClust program on an HPC platform (the pClust program with a graphical user interface can be used for up to 500,000 proteins; the program is available from the BCB software page*). These proteins came from all the bacteria and their plasmids in the phylum Proteobacteria, covering the 2,307 complete proteomes available in the NCBI database in February 2016. The total number of clusters obtained was 707,311, of which 224,442 are shared by more than one organism, 10,049 are singleton clusters whose proteins all occur in a single organism, and 472,820 are “true” singleton clusters, i.e., 472,820 proteins didn’t align with any other protein. The cluster file is a text file that gives, for each cluster, the number of proteins in the cluster and the cluster number, followed by the first line of the FASTA file for each protein sequence. The zipped file for all 707,311 clusters is about 158MB in size and about 765MB unzipped (http://www.eecs.wsu.edu/~shira/clusters.zip). pClust gives very accurate results because it uses optimal alignment (Smith-Waterman for local alignment or Needleman-Wunsch for global/semi-global alignment).
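
The three cluster categories above can be illustrated with a minimal Python sketch. It is not part of pClust and does not parse the actual cluster file; the function name `classify_clusters`, the in-memory layout (a dict mapping cluster number to a list of protein/organism pairs), and the toy data are all assumptions made for illustration. It also assumes that a single-organism cluster contains more than one protein, which is how it is distinguished here from a "true" singleton.

```python
from collections import Counter


def classify_clusters(clusters):
    """Tally clusters by the three categories described in the text.

    `clusters` is a hypothetical in-memory form: a dict mapping a cluster
    number to a list of (protein_id, organism) pairs.
    """
    counts = Counter()
    for members in clusters.values():
        organisms = {organism for _, organism in members}
        if len(members) == 1:
            # One protein that aligned with nothing else.
            counts["true_singleton"] += 1
        elif len(organisms) == 1:
            # Several proteins, all from the same organism.
            counts["single_organism"] += 1
        else:
            # Proteins from more than one organism.
            counts["shared"] += 1
    return counts


# Toy data (made up for illustration, not taken from the cluster file).
toy_clusters = {
    0: [("p1", "orgA"), ("p2", "orgB")],
    1: [("p3", "orgA"), ("p4", "orgA")],
    2: [("p5", "orgC")],
}
toy_counts = classify_clusters(toy_clusters)

# The three reported categories add up to the reported total.
assert 224_442 + 10_049 + 472_820 == 707_311
```

Applied to the toy data, the sketch places one cluster in each category. Using it on the real cluster file would first require a parser that extracts the organism from each FASTA header line, which is not shown here.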
 
L1 distance matrices were created using the 2,307 species and all the protein sequences, and these were visualized using the visone software. An example network is shown on the Networks page for all 2,307 bacteria, using the 707,311 clusters and normalized distances. It should be noted that the gamma-proteobacteria outnumber the others by a factor of approximately 3. The delta/epsilon-proteobacteria are hidden by the others. All work was performed by Svetlana Lockwood as part of her dissertation research.
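
As an illustration of what an L1 distance matrix between organisms might look like, here is a minimal Python sketch. The page does not say how organisms were represented or how the distances were normalized, so both are assumptions here: each organism is described by a hypothetical profile of protein counts per cluster, and the L1 distance is divided by the combined protein count of the two organisms so that it falls between 0 and 1. The function name `l1_distance_matrix` and the toy profiles are likewise made up.

```python
def l1_distance_matrix(profiles, normalize=True):
    """Pairwise L1 distances between organisms.

    `profiles` is a hypothetical layout: a dict mapping an organism name to
    a dict of {cluster number: protein count in that cluster}.
    """
    names = sorted(profiles)
    matrix = [[0.0] * len(names) for _ in names]
    for i, a in enumerate(names):
        for j in range(i + 1, len(names)):
            b = names[j]
            # Clusters absent from a profile count as zero.
            clusters = profiles[a].keys() | profiles[b].keys()
            dist = sum(
                abs(profiles[a].get(c, 0) - profiles[b].get(c, 0))
                for c in clusters
            )
            if normalize:
                # Assumed normalization: divide by the combined protein count.
                total = sum(profiles[a].values()) + sum(profiles[b].values())
                dist = dist / total if total else 0.0
            matrix[i][j] = matrix[j][i] = float(dist)
    return names, matrix


# Toy profiles (made up for illustration).
toy_profiles = {
    "orgA": {1: 2, 2: 1},
    "orgB": {1: 1, 3: 1},
    "orgC": {1: 2, 2: 1},
}
names, matrix = l1_distance_matrix(toy_profiles)
```

Organisms with identical profiles get distance 0, and organisms sharing no clusters get the maximum normalized distance of 1. The resulting symmetric matrix is the kind of input a network visualization tool could take, though the export format expected by visone is not covered here.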
 
This project was supported by the National Science Foundation under the Advances in Biological Informatics program, Award 1262664.
 
* Close to 1 million protein sequences were clustered using the GUI program, but the run took 15 hours and required a machine with 16GB of RAM. By comparison, run times were 5 minutes for about 127,000 proteins and 36 minutes for 270,000.