Proteobacteria Clusters

Approximately 8.76M proteins were clustered using the pClust program, which is soon to be made available (for up to 500,000 proteins, the GUI-run pClust program can be used and is available from the BCB software page*).  These proteins come from all the bacteria and their plasmids in the phylum Proteobacteria, covering the 2,326 complete proteomes available in the NCBI database in February 2016.  The total number of clusters obtained was 707,311, of which 224,442 are shared by more than one organism, 10,049 are singleton clusters whose proteins all occur in a single organism, and 472,820 are "true" singleton clusters, i.e., 472,820 proteins did not align with any other protein.  The cluster file is a text file giving, for each cluster, the number of proteins it contains and the cluster number, followed by the first line of the FASTA file for each protein sequence.  The zipped file for all 707,311 clusters is about 158MB in size and about 765MB unzipped.  pClust gives very accurate results because it uses optimal alignment (Smith-Waterman for local alignment or Needleman-Wunsch for global/semi-global alignment).
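The three cluster categories described above can be distinguished by cluster size and by how many distinct organisms a cluster covers.  The following is a minimal Python sketch, assuming each cluster has already been parsed into a list of (organism, protein ID) pairs.  That in-memory layout, the function names, and the toy records are hypothetical illustrations and are not the pClust output format.

```python
from collections import Counter


def classify_cluster(members):
    """Classify one cluster given a list of (organism, protein_id) pairs.

    Returns "true_singleton" for a single unaligned protein,
    "single_organism" when all proteins come from one organism,
    and "shared" when more than one organism is represented.
    """
    if not members:
        raise ValueError("a cluster must contain at least one protein")
    if len(members) == 1:
        return "true_singleton"
    organisms = {organism for organism, _ in members}
    if len(organisms) == 1:
        return "single_organism"
    return "shared"


def summarize(clusters):
    """Count how many clusters fall into each category."""
    return Counter(classify_cluster(members) for members in clusters)


# Hypothetical toy data, not taken from the real cluster file.
clusters = [
    [("organism_A", "p1"), ("organism_B", "p2")],  # shared
    [("organism_A", "p3"), ("organism_A", "p4")],  # single organism
    [("organism_C", "p5")],                        # true singleton
]
summary = summarize(clusters)
print(dict(summary))

# Consistency check on the counts reported in the text.
reported = {"shared": 224_442, "single_organism": 10_049, "true_singleton": 472_820}
reported_total = sum(reported.values())
print(reported_total == 707_311)
```

The final check only confirms that the three reported categories add up to the reported total of 707,311 clusters.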
L1 distance matrices were created using the 2,326 species and all the protein sequences, and these were visualized using visone software.  An example of a network is shown on the Proteobacteria Network page for all 707,311 clusters using normalized distances.  It should be noted that the gamma Proteobacteria outnumber the others by a factor of approximately 3, and that in the network the delta/epsilon Proteobacteria are hidden by the others.  All work was performed by Svetlana Lockwood as part of her dissertation research.
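To illustrate what an L1 distance matrix between species can look like, here is a minimal Python sketch.  It assumes each species is represented by a vector counting how many of its proteins fall into each cluster, and it assumes "normalized" means dividing each vector by its total before comparison.  Both choices are assumptions made for illustration; the profile definition and normalization actually used in this work are not specified here, and the species names and counts are made up.

```python
def l1_distance(a, b):
    """Sum of absolute coordinate differences between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("profiles must have the same length")
    return sum(abs(x - y) for x, y in zip(a, b))


def normalize(profile):
    """Scale a profile so its entries sum to 1 (all-zero profiles stay zero)."""
    total = sum(profile)
    if total == 0:
        return [0.0] * len(profile)
    return [x / total for x in profile]


def l1_distance_matrix(profiles, normalized=False):
    """Return (sorted species names, square matrix of pairwise L1 distances)."""
    names = sorted(profiles)
    vectors = {
        name: normalize(profiles[name]) if normalized else list(profiles[name])
        for name in names
    }
    matrix = [
        [l1_distance(vectors[row], vectors[col]) for col in names]
        for row in names
    ]
    return names, matrix


# Hypothetical per-species cluster counts (three clusters, three species).
profiles = {
    "species_A": [2, 0, 1],
    "species_B": [1, 1, 1],
    "species_C": [0, 3, 0],
}

names, raw = l1_distance_matrix(profiles)
_, norm = l1_distance_matrix(profiles, normalized=True)

for name, row in zip(names, raw):
    print(name, row)
```

A matrix of this form, written out as a table of pairwise distances, is the kind of input a network visualization tool can turn into a weighted graph.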
This project was supported by the National Science Foundation under the Advances in Biological Informatics program, Award 1262664.
* Close to 1 million protein sequences were clustered using the GUI program, but it took 15 hours and required a machine with 16GB of RAM.  This compares to run times of 5 minutes for about 127,000 proteins and 36 minutes for 270,000.