Biochemistry Online: An Approach Based on Chemical Logic

CHAPTER 2 - PROTEIN STRUCTURE

G: PREDICTING PROTEIN PROPERTIES FROM SEQUENCES

BIOCHEMISTRY - DR. JAKUBOWSKI

Last Update: 3/9/16

Learning Goals/Objectives for Chapter 2G: After class and this reading, students will be able to:

find web-based proteomics programs to analyze protein sequences and structures
describe the basis for methods used to predict the secondary structure and hydrophobic structures of proteins
analyze secondary structure and hydropathy plots from web-based proteomics programs
describe differences between integral and peripheral membrane proteins, and how each could be purified
explain how hydropathy and secondary structure plots can be used to predict membrane spanning sequences of proteins
describe in general the theoretical and empirically based methods to predict protein tertiary structure from a primary sequence
describe possible early intermediates in protein folding as determined by theoretical methods

G1. Introduction to Bioinformatics, Computational Biology and Proteomics

With the sequencing of the human genome, intensive effort has been devoted to its analysis to determine the number and transcriptional regulation of the encoded genes. Much has been learned from comparative genomics, as genomes from mice, rats, chimpanzees, and a variety of prokaryotes are compared in an effort to help understand the nature of genes and their transcriptional regulation. The vast amount of genomic data that has to be "mined" has required the development of computational methods and computer programs to enable the analysis. Two relatively new fields have subsequently arisen: bioinformatics and computational biology. (On a personal note, the words computational biology seem somewhat restrictive, since the field of computational chemistry, which has a longer history, has significant overlap with "computational biology". I prefer computational biochemistry.) These fields have significant overlap (as do physical chemistry/chemical physics and biochemistry/molecular biology/chemical biology), so I defer to others to define them.

The NIH Biomedical Information Science and Technology Initiative Consortium: "This consortium has agreed on the following definitions of bioinformatics and computational biology, recognizing that no definition could completely eliminate overlap with other activities or preclude variations in interpretation by different individuals and organizations.

Bioinformatics: Research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive, analyze, or visualize such data.

Computational Biology: The development and application of data-analytical and theoretical methods, mathematical modeling and computational simulation techniques to the study of biological, behavioral, and social systems."

This web book has been developed as a first-semester biochemistry text, and choices have been made to limit the scope of the material to exclude content covered in detail in a molecular biology/genetics class. Hence, this text will not discuss in significant detail the genome and transcriptome, or the mechanisms of replication, transcription, or translation. However, given this text's emphasis on protein structure and function, proteomics (the characterization of the structure and function of all proteins within a cell) is a logical candidate for inclusion.

In the last several years, computational biology/chemistry tools and web-based programs have become available for the systematic analysis of individual proteins, and for the comparative analysis of many proteins, based on either their DNA or amino acid sequence. Clearly the ultimate goal in the description of a protein would be to determine, from the amino acid or nucleotide sequence, the three-dimensional structure of the protein and its biological function, including all its binding partners.

Here is a list of proteome web resources and tutorials

Voluminous databases of biomolecule sequence and structural data, as well as analysis software packages, are available at a variety of web sites, including:

BioGrid: General Repository for Interaction (protein, NA) Datasets
GenBank: DNA sequence database (over 100 billion bases as of 9/05), from the NCBI
BLAST: finds regions of similarity between biological sequences
UniProtKB/Swiss-Prot: manually annotated and reviewed section of the UniProt Knowledgebase (UniProtKB)
ProSite: database of protein families and domains. It consists of biologically significant sites, patterns and profiles that help to reliably identify to which known protein family (if any) a new sequence belongs. From the Swiss Institute of Bioinformatics
Swiss-2D Gel Database: from the Swiss Institute of Bioinformatics
RCSB Protein Data Bank: Protein and nucleic acid 3D structures from X-ray crystallography and NMR spectroscopy (about 33,000 as of 9/15/05)
SWISS-MODEL Repository: 3D comparative protein structure models (675,000) generated by the fully automated homology-modeling pipeline SWISS-MODEL (again from the Swiss Institute of Bioinformatics)
ExPASy (Expert Protein Analysis System) server of the Swiss Institute of Bioinformatics

The NCBI has an extensive array of available tools (free), including:

literature databases: including word searches in many books
All resources: including nucleotide, protein, structure, genome, chemical
Entrez: the life science search engine
Blast Quick Start: easy way to start a BLAST search
complete human proteome from UniProtKB/Swiss-Prot

A summary of three important sites:

• NCBI-Protein: The Protein database is a collection of sequences from several sources, including translations from annotated coding regions in GenBank, RefSeq and TPA, as well as records from SwissProt, PIR, PRF, and PDB. Protein sequences are the fundamental determinants of biological structure and function.
• UniProt: The UniProt Knowledgebase (UniProtKB) is the central hub for the collection of functional information on proteins, with accurate, consistent and rich annotation. In addition to capturing the core data mandatory for each UniProtKB entry (mainly, the amino acid sequence, protein name or description, taxonomic data and citation information), as much annotation information as possible is added.
• GeneCards: GeneCards is a searchable, integrative database that provides comprehensive, user-friendly information on all annotated and predicted human genes. It automatically integrates gene-centric data from ~125 web sources, including genomic, transcriptomic, proteomic, genetic, clinical and functional information.

The table below (taken directly from Wikipedia) shows some of the extensive information available on the proteome and genome of each human chromosome.

Table: Human proteome and genome from Wikipedia
(Data source: Ensembl genome browser release 68, July 2012)

Chromosome	Length (mm)	Base pairs	Variations	Confirmed Proteins	Putative Proteins	Pseudogenes	miRNA	rRNA	snRNA	snoRNA	misc ncRNA	Links
1	85	249,250,621	4,401,091	2,012	31	1,130	134	66	221	145	106	EBI
2	83	243,199,373	4,607,702	1,203	50	948	115	40	161	117	93	EBI
3	67	198,022,430	3,894,345	1,040	25	719	99	29	138	87	77	EBI
4	65	191,154,276	3,673,892	718	39	698	92	24	120	56	71	EBI
5	62	180,915,260	3,436,667	849	24	676	83	25	106	61	68	EBI
6	58	171,115,067	3,360,890	1,002	39	731	81	26	111	73	67	EBI
7	54	159,138,663	3,045,992	866	34	803	90	24	90	76	70	EBI
8	50	146,364,022	2,890,692	659	39	568	80	28	86	52	42	EBI
9	48	141,213,431	2,581,827	785	15	714	69	19	66	51	55	EBI
10	46	135,534,747	2,609,802	745	18	500	64	32	87	56	56	EBI
11	46	135,006,516	2,607,254	1,258	48	775	63	24	74	76	53	EBI
12	45	133,851,895	2,482,194	1,003	47	582	72	27	106	62	69	EBI
13	39	115,169,878	1,814,242	318	8	323	42	16	45	34	36	EBI
14	36	107,349,540	1,712,799	601	50	472	92	10	65	97	46	EBI
15	35	102,531,392	1,577,346	562	43	473	78	13	63	136	39	EBI
16	31	90,354,753	1,747,136	805	65	429	52	32	53	58	34	EBI
17	28	81,195,210	1,491,841	1,158	44	300	61	15	80	71	46	EBI
18	27	78,077,248	1,448,602	268	20	59	32	13	51	36	25	EBI
19	20	59,128,983	1,171,356	1,399	26	181	110	13	29	31	15	EBI
20	21	63,025,520	1,206,753	533	13	213	57	15	46	37	34	EBI
21	16	48,129,895	787,784	225	8	150	16	5	21	19	8	EBI
22	17	51,304,566	745,778	431	21	308	31	5	23	23	23	EBI
X	53	155,270,560	2,174,952	815	23	780	128	22	85	64	52	EBI
Y	20	59,373,566	286,812	45	8	327	15	7	17	3	2	EBI
mtDNA	0.0054	16,569	929	13	0	0	0	2	0	0	22	EBI
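
The columns of the table can be related to one another with a little arithmetic. The following is only a minimal sketch in Python, using four rows copied from the table. The rise of 0.34 nm per base pair (typical of B-DNA) is an assumption introduced here and is not stated in the table, and the names ROWS, length_mm, and proteins_per_mbp are purely illustrative. It estimates the extended length of each chromosome's DNA from its base-pair count and normalizes the number of confirmed proteins by chromosome size.

```python
# Four rows copied from the table above: chromosome -> (base pairs, confirmed proteins).
ROWS = {
    "1": (249_250_621, 2_012),
    "13": (115_169_878, 318),
    "19": (59_128_983, 1_399),
    "Y": (59_373_566, 45),
}

# Assumption (not stated in the table): a rise of about 0.34 nm per base pair, as in B-DNA.
NM_PER_BP = 0.34


def length_mm(base_pairs: int) -> float:
    """Estimate the extended length of the DNA in millimetres."""
    return base_pairs * NM_PER_BP * 1e-6  # 1 mm = 1e6 nm


def proteins_per_mbp(base_pairs: int, proteins: int) -> float:
    """Confirmed proteins per million base pairs."""
    return proteins / (base_pairs / 1e6)


if __name__ == "__main__":
    for name, (bp, proteins) in ROWS.items():
        print(f"chr {name:>2}: ~{length_mm(bp):5.1f} mm, "
              f"{proteins_per_mbp(bp, proteins):5.2f} confirmed proteins per Mbp")
```

Under that assumption, the estimate for chromosome 1 rounds to the 85 mm listed in the table. Among these four rows, chromosome 19 carries by far the most confirmed proteins per million base pairs and chromosome Y the fewest, even though the two are nearly the same size.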

This chapter will describe programs that allow predictions of secondary and tertiary structures of proteins. Specific exercises using web-based bioinformatics programs can be found at the end.

Biochemistry Online by Henry Jakubowski is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.