PDB FILES


ATOM

Overview

The ATOM records present the atomic coordinates for standard residues. They also present the occupancy and temperature factor for each atom. Heterogen coordinates use the HETATM record type. The element symbol is always present on each ATOM record; segment identifier and charge are optional.

Record Format

COLUMNS        DATA TYPE       FIELD         DEFINITION
---------------------------------------------------------------------------------
 1 -  6        Record name     "ATOM  "

 7 - 11        Integer         serial        Atom serial number.

13 - 16        Atom            name          Atom name.

17             Character       altLoc        Alternate location indicator.

18 - 20        Residue name    resName       Residue name.

22             Character       chainID       Chain identifier.

23 - 26        Integer         resSeq        Residue sequence number.

27             AChar           iCode         Code for insertion of residues.

31 - 38        Real(8.3)       x             Orthogonal coordinates for X in
                                             Angstroms.

39 - 46        Real(8.3)       y             Orthogonal coordinates for Y in
                                             Angstroms.

47 - 54        Real(8.3)       z             Orthogonal coordinates for Z in
                                             Angstroms.

55 - 60        Real(6.2)       occupancy     Occupancy.

61 - 66        Real(6.2)       tempFactor    Temperature factor.

73 - 76        LString(4)      segID         Segment identifier, left-justified.

77 - 78        LString(2)      element       Element symbol, right-justified.

79 - 80        LString(2)      charge        Charge on the atom.
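As a rough illustration of the column layout above, the following Python sketch slices one ATOM line into typed fields. The names AtomRecord and parse_atom_line are made up for this example and do not come from any PDB software. The sketch assumes a well-formed line and pads short lines to 80 columns before slicing.

```python
from dataclasses import dataclass


@dataclass
class AtomRecord:
    serial: int
    name: str
    alt_loc: str
    res_name: str
    chain_id: str
    res_seq: int
    i_code: str
    x: float
    y: float
    z: float
    occupancy: float
    temp_factor: float
    seg_id: str
    element: str
    charge: str


def parse_atom_line(line: str) -> AtomRecord:
    """Slice one ATOM line using the 1-based columns of the record format table."""
    padded = line.rstrip("\n").ljust(80)  # pad so short lines slice safely

    def col(first: int, last: int) -> str:
        # Convert 1-based inclusive columns to a Python slice.
        return padded[first - 1:last]

    return AtomRecord(
        serial=int(col(7, 11)),
        name=col(13, 16).strip(),
        alt_loc=col(17, 17).strip(),
        res_name=col(18, 20).strip(),
        chain_id=col(22, 22).strip(),
        res_seq=int(col(23, 26)),
        i_code=col(27, 27).strip(),
        x=float(col(31, 38)),
        y=float(col(39, 46)),
        z=float(col(47, 54)),
        occupancy=float(col(55, 60)),
        temp_factor=float(col(61, 66)),
        seg_id=col(73, 76).strip(),
        element=col(77, 78).strip(),
        charge=col(79, 80).strip(),
    )
```

Stripping the string fields is a convenience of this sketch; a program that must write the record back out would need to keep the original justification.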

Details

* ATOM records for proteins are listed from amino to carboxyl terminus.

* Nucleic acid residues are listed from the 5' to the 3' terminus.

* No ordering is specified for polysaccharides.

* The list of ATOM records in a chain is terminated by a TER record.

* If more than one model is present in the entry, each model is delimited by MODEL and ENDMDL records.

* For more information on atom naming conventions, see Appendix 3, and for residue names, see Appendix 4 and the HET section of this document.

* If an atom is provided in more than one position, then a non-blank alternate location indicator must be used for each of the positions. Within a residue, all atoms that are associated with each other in a given conformation are assigned the same alternate location indicator.

* For atoms that are in alternate sites indicated by the alternate site indicator, sorting of atoms in the ATOM/HETATM list uses the following general rules:

- In the simple case that involves a few atoms or a few residues with alternate sites, the coordinates occur one after the other in the entry.

- In the case of a whole macromolecular chain, or significant portion of a chain, having alternate sites, the atoms for each alternate position are listed together. The two conformers are delineated by MODEL/ENDMDL records. In this case each MODEL must represent the entire molecular assemblage, including any heterogen group which is not necessarily disordered. Such is the case when DNA molecules are placed in UP and DOWN positions.

- In the case of large heterogen groups which are disordered, the atoms for each conformer are listed together. The two lists are not separated by MODEL/ENDMDL as is done for macromolecular chains.
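For the simple case where alternate positions are interleaved in the ATOM list, a reader often wants one conformer per residue. The sketch below is one possible policy, not something the format prescribes: for each residue it keeps the alternate location label with the largest summed occupancy, and always keeps atoms whose indicator is blank. The function name and the dictionary keys ('res', 'alt_loc', 'occupancy') are hypothetical.

```python
from collections import defaultdict


def select_conformer(atoms):
    """Keep one alternate location per residue, chosen by summed occupancy.

    `atoms` is a list of dicts with keys 'res' (any hashable residue key),
    'alt_loc' (one character, blank if there is no alternate) and 'occupancy'.
    """
    # Sum occupancy per (residue, label); blank labels are not alternates.
    totals = defaultdict(float)
    for atom in atoms:
        label = atom["alt_loc"].strip()
        if label:
            totals[(atom["res"], label)] += atom["occupancy"]

    # For each residue remember the label with the largest total.
    best = {}
    for (res, label), total in totals.items():
        if res not in best or total > totals[(res, best[res])]:
            best[res] = label

    # Atoms without a label are kept unconditionally.
    return [
        atom for atom in atoms
        if not atom["alt_loc"].strip()
        or atom["alt_loc"].strip() == best[atom["res"]]
    ]
```

Choosing by residue rather than by atom follows the rule that all atoms of one conformation within a residue share the same indicator.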

* Addition of atoms to side chains of standard residues is handled as follows:

The additional atoms (modifying group) are represented as a HET group which is assigned its own residue name. The chainID, sequence number, and insertion code assigned to the HET group are those of the standard residue to which it is attached.

* Chemical modifications of standard residue side chains by addition of new atoms are handled as follows:

- The new atoms are represented as a HET group. This group is assigned the chain name, sequence number, and insertion code of the standard residue that it modifies.

- The atoms comprising these het groups are listed as HETATM and are inserted in the ATOM list immediately after the TER record of the chain. These groups are listed in the same order as the standard residue to which they are bonded (i.e., from the N- to C-terminus for polypeptides and from the 5' to 3' end for nucleic acids).

- Modified standard residues and the modifying het group may be assigned the same SEGID to further describe the relationship between the groups. PDB will use this mechanism only if SEGID's were not assigned to these atoms for other purposes.

- Modified standard residues must have a corresponding MODRES record.

* The insertion code is commonly used in sequence numbering and is described here. In most cases, the amino acids that comprise a protein are numbered sequentially starting with 1. However, there are a number of situations that may give rise to different numbering schemes:

- Homologous proteins can exist in a number of different species. Depositors may use a residue numbering scheme in order to preserve the homology. The reference protein may be numbered sequentially starting with 1, then the homologous protein from another species aligned to it. If residues are not present in the homologous sequence, residue numbers may be skipped so that alignment can be preserved. If additional residues are present relative to the reference protein, they may have a letter, called an insertion code, appended to the sequence number. Negative numbers and zeros are permitted if they are needed to align the N-terminus.

     REFERENCE PROTEIN NUMBERING        HOMOLOGOUS PROTEIN NUMBERING
     ---------------------------------------------------------------------
                 59                                  59
                 60                                  60
                 61
                 62                                  62

     REFERENCE PROTEIN NUMBERING         HOMOLOGOUS PROTEIN NUMBERING
     ---------------------------------------------------------------------
                 85                                  85
                 86                                  86
                                                     86A
                                                     86B
                 87                                  87

- The numbering of a proenzyme may be used for the enzyme following cleavage.

- The molecule studied might be a portion of the whole protein. The residue numbering scheme could show the relationship to the intact protein.

- The protein might be a mutant with residues inserted and deleted. As above, the residue numbering of the native protein could be preserved by appropriate use of gaps in the numbering and/or insertion codes.

- The nucleic acid community generally numbers structures sequentially. For double-stranded nucleic acids, entries usually use two different chain identifiers. For example, an octameric duplex would be numbered 1 - 8 for chain A, and 9 - 16 for chain B.
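Because a residue identifier may carry an insertion code and may be zero or negative, it cannot be treated as a plain integer. The sketch below splits an identifier such as 86A into its number and insertion code; the function name is hypothetical. Sorting on the resulting pair reproduces the order in the tables above, but that is an assumption of this example: the order in which residues appear in the file should be treated as authoritative.

```python
import re

# Optional sign, digits, then at most one insertion-code letter.
_RES_ID = re.compile(r"^\s*(-?\d+)([A-Za-z]?)\s*$")


def split_residue_id(res_id: str):
    """Return (sequence number, insertion code); the code is '' when absent."""
    match = _RES_ID.match(res_id)
    if match is None:
        raise ValueError(f"not a residue identifier: {res_id!r}")
    return int(match.group(1)), match.group(2)
```

For example, sorted(["87", "86B", "86", "86A"], key=split_residue_id) places 86 before 86A and 86B, and 87 last, because the empty string sorts before any letter.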

* If the depositor provides the data, then the isotropic B value is given for the temperature factor.

* If there is no isotropic B value from the depositor, but there is an ANISOU record with anisotropic temperature factors, then the B equivalent is stored in the tempFactor field, as calculated by:

B(eq) = 8pi**2{1/3[U(1,1) + U(2,2) + U(3,3)]}

- This will obviate the need to check if ANISOU records are present before interpreting the contents of the temperature factor field.

- In some previously released PDB entries with anisotropic temperature factors provided as ANISOU records, the temperature factor field of the corresponding ATOM or HETATM record contained the equivalent U-isotropic [U(eq)] which is calculated by:

U(eq) = 1/3[U(1,1) + U(2,2) + U(3,3)] x 10**-4

* If there are neither isotropic B values from the depositor, nor anisotropic temperature factors in ANISOU, then the default value of 0.0 is used for the temperature factor.
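The three rules for the temperature factor field can be summarized in a short sketch. It assumes that the diagonal ANISOU terms are supplied as the integers stored in the file, that is, U scaled by 10**4, which is the scaling implied by the U(eq) formula above. The function names are hypothetical.

```python
import math


def b_equivalent(u11: int, u22: int, u33: int) -> float:
    """B(eq) from ANISOU diagonal terms given as integers scaled by 10**4."""
    u_eq = (u11 + u22 + u33) / 3.0 * 1e-4  # U(eq) with the scaling removed
    return 8.0 * math.pi ** 2 * u_eq


def temp_factor(depositor_b=None, anisou_diag=None) -> float:
    """Pick the tempFactor value following the order of the rules above."""
    if depositor_b is not None:
        return depositor_b                 # isotropic B from the depositor
    if anisou_diag is not None:
        return b_equivalent(*anisou_diag)  # B equivalent from ANISOU
    return 0.0                             # default when neither is available
```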

* In some entries, the occupancy and temperature factor fields are used for other quantities. In these cases, an explanation is provided in the remarks.

* Columns 73 - 76 identify specific segments of the molecule. The segment id is a string of up to four (4) alphanumeric characters, left-justified, and may include a space, e.g., CH86, A 1, NASE. The segment itself may consist of a complete chain or a portion of a chain. The importance of this new field can be appreciated if one considers an antibody structure having two molecules in the asymmetric unit. Since each chain must have a unique chain identifier, the two heavy chains and two light chains cannot currently be labeled to indicate their nature. Segment id's of CH, VH1, VH2, VH3, CL, and VL would clearly identify regions of the chains and the relationship between them. Users of X-PLOR will be familiar with SEGID as used in the refinement application of X-PLOR.

* Columns 77 - 78 contain the atom's element symbol (as given in the periodic table), right-justified. This is especially needed because in some cases it has not been possible to follow the convention that columns 13 - 14 of the atom name contain the element symbol. The most common cases are:

- In large het groups it sometimes is not possible to follow the convention of having the first two characters be the chemical symbol and still use atom names that are meaningful to users. An example is nicotinamide adenine dinucleotide, whose atom names begin with an A or N, depending on which portion of the molecule they appear in, e.g., AC6 or NC6, AN1 or NN1.

- Hydrogen naming sometimes conflicts with IUPAC conventions. For example, a hydrogen named HG11 in columns 13 - 16 is differentiated from a mercury atom by the element symbol in columns 77 - 78. Columns 13 - 16 present a unique name for each atom.

* Columns 79 - 80 indicate any charge on the atom, e.g., 2+, 1-. In most cases these are blank.
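A small sketch of reading the last two fields follows. It takes the element from columns 77 - 78 rather than guessing it from the atom name, and converts a charge such as 2+ or 1- to a signed integer. Returning None for a blank charge is a choice made for this example, and both function names are hypothetical.

```python
def parse_charge(field: str):
    """Convert a charge field such as '2+' or '1-' to an int; blank gives None."""
    text = field.strip()
    if not text:
        return None
    if len(text) != 2 or not text[0].isdigit() or text[1] not in "+-":
        raise ValueError(f"unexpected charge field: {field!r}")
    magnitude = int(text[0])
    return magnitude if text[1] == "+" else -magnitude


def element_and_charge(line: str):
    """Return (element symbol, charge) from columns 77 - 78 and 79 - 80."""
    padded = line.rstrip("\n").ljust(80)  # pad so short lines slice safely
    return padded[76:78].strip(), parse_charge(padded[78:80])
```

With this approach an atom named HG11 is reported as hydrogen or mercury according to what columns 77 - 78 contain, independent of its name.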

Verification/Validation/Value Authority Control

PDB checks ATOM/HETATM records for PDB format, sequence information, and packing. The PDB reserves the right to return deposited coordinates to the author for transformation into PDB format.

PDB intends to verify the coordinates against the experimental structure factor data when available. Details on this will be forthcoming.

Relationships to Other Record Types

The ATOM records are compared to the corresponding sequence database. Residue discrepancies appear in the SEQADV record. Missing atoms are annotated in the remarks. HETATM records are formatted in the same way as ATOM records. The sequence implied by ATOM records must be identical to that given in SEQRES, with the exception that residues that have no coordinates, e.g., due to disorder, must appear in SEQRES. Remark 550 is used to describe the meaning assigned to any segment identifiers used.

Example

                    22:Chain

recnamserno atnm res cres#    |--x---||--y---||--z---|occupatempfa      seq#ElQQ
         1         2         3         4         5         6         7         8
12345678901234567890123456789012345678901234567890123456789012345678901234567890
ATOM    145  N   VAL A  25      32.433  16.336  57.540  1.00 11.92      A1   N
ATOM    146  CA  VAL A  25      31.132  16.439  58.160  1.00 11.85      A1   C
ATOM    147  C   VAL A  25      30.447  15.105  58.363  1.00 12.34      A1   C
ATOM    148  O   VAL A  25      29.520  15.059  59.174  1.00 15.65      A1   O
ATOM    149  CB AVAL A  25      30.385  17.437  57.230  0.28 13.88      A1   C
ATOM    150  CB BVAL A  25      30.166  17.399  57.373  0.72 15.41      A1   C
ATOM    151  CG1AVAL A  25      28.870  17.401  57.336  0.28 12.64      A1   C
ATOM    152  CG1BVAL A  25      30.805  18.788  57.449  0.72 15.11      A1   C
ATOM    153  CG2AVAL A  25      30.835  18.826  57.661  0.28 13.58      A1   C
ATOM    154  CG2BVAL A  25      29.909  16.996  55.922  0.72 13.25      A1   C

seq# left justified

Known Problems

Due to the ever-increasing size of protein structures in the PDB, the atom serial number field may soon need to be increased. An increase of one column will allow for cases where entries have more than 99,999 atoms. Only 5 digits are available for the atom serial number, but some structures have already been received with more than 99,999 atoms.

No distinction is made between ribo- and deoxyribonucleotides in the SEQRES records. These residues are identified with the same residue name (i.e., A, C, G, T, U).


CARD FILES:  FROM http://www.scripps.edu/brooks/c29docs/io.html#%20Coordinate

Reading coordinates


        The reading of coordinates is done with the READ COOR command,
and there are several options (which may change in future versions).
Coordinates may be read into the main set or the comparison coordinate set
using the COMP keyword.

        There are three possible file formats that can be used to read in
coordinates. They are coordinate binary files, dynamics coordinate
trajectories, and coordinate card images. Protein Data Bank (PDB) formatted
files can also be read. They do however require some editing first. All
the HEADER and other junk before the actual coordinate section has to
be removed and optionally replaced by a standard CHARMM title. There should
be no line with NATOM (= number of atoms) preceding the actual coordinates.
CHARMM does no translation whatsoever of residue or atom names, so you
would either have to rename some entries in the PSF or in the coordinate
file in case there are differences.

        For all formats, a subset of the atoms in the PSF may be selected
using the standard atom selection syntax. For binary files, this is a
risky maneuver, and warning messages are given when this is attempted.
Only coordinates of selected atoms may be modified. When reading binary
files, or using the IGNOre keyword, coordinate values are mapped into
the selected atoms sequentially (NO checking is done!).

        The reading of the first two file formats is specified with the
FILE option. The program reads the file header to tell which format it
is dealing with. The coordinate binary files have a file header of
'COOR' and contain only one set of coordinates. These are created with a
WRIT COOR FILE command. The dynamics coordinate trajectories have a file
header of 'CORD' and have multiple coordinate sets. These files are
created by the dynamics function of the program. To specify which
coordinate set in the trajectory is to be read, the IFILE option is
provided. One specifies the coordinate set's position within the file. The
default value for this option will cause the first coordinate set to be
read. If the IFILE value is negative, then the next set (other than
the first one) will be read. This will only work if a set has already been
read from the file with a positive IFILE value.

        For binary files, the APPEnd command will 'deselect' all atoms
up to the highest one with a known position. This is done in addition
to the normal atom selection. This is useful for structures with several
distinct segments where it is desirable to keep separate coordinate
modules.

        The CARD file format is the standard means in CHARMM for
providing a human readable and writable coordinate file. The format is
as follows:

         title
         NATOM (I5)
         ATOMNO RESNO   RES  TYPE  X     Y     Z   SEGID RESID Weighting
           I5    I5  1X A4 1X A4 F10.5 F10.5 F10.5 1X A4 1X A4 F10.5

        The title is a title for the coordinates; see the Syntactic
Glossary (usage.doc) for details. Next comes the
number of coordinates. If this number is zero or too large, the entire
file will be read. Finally, there is one line for each coordinate.
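The card layout above can be read with fixed-width slicing. The sketch below is only an illustration of the format descriptor, not CHARMM's own reader. It assumes that title lines begin with an asterisk (a CHARMM convention not stated in the text above) and that a blank weighting field means 0.0. The function names and dictionary keys are hypothetical.

```python
def parse_card_line(line: str) -> dict:
    """Slice one coordinate line: I5 I5 1X A4 1X A4 3F10.5 1X A4 1X A4 F10.5."""
    padded = line.rstrip("\n").ljust(70)
    weight = padded[60:70]
    return {
        "atomno": int(padded[0:5]),
        "resno": int(padded[5:10]),
        "res": padded[11:15].strip(),
        "type": padded[16:20].strip(),
        "x": float(padded[20:30]),
        "y": float(padded[30:40]),
        "z": float(padded[40:50]),
        "segid": padded[51:55].strip(),
        "resid": padded[56:60].strip(),
        # Assumption: treat a missing weighting value as 0.0.
        "weight": float(weight) if weight.strip() else 0.0,
    }


def read_card(lines) -> list:
    """Read title, NATOM, then coordinate lines from an iterable of strings."""
    rows = [ln.rstrip("\n") for ln in lines if ln.strip()]
    start = 0
    # Assumption: title lines start with '*'.
    while start < len(rows) and rows[start].startswith("*"):
        start += 1
    natom = int(rows[start][:5])
    body = rows[start + 1:]
    # A count of zero, or one larger than the file, means "read everything".
    if 0 < natom < len(body):
        body = body[:natom]
    return [parse_card_line(row) for row in body]
```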

        ATOMNO gives the number of the atom in the file. It is ignored
on reading. RESNO gives the residue number of the atom. It must be
specified relative to the first residue in the PSF. The OFFSet option
should be specified if one wishes to read coordinates into other positions.
The APPEnd option adds an additional offset which points to
the residue just beyond the highest one with known positions. This option
also 'deselects' all atoms below this residue (inclusive).
For example, if one is reading in coordinates for the second segment of a
two chain protein using two card files, and the APPEnd option is used,
RESNO must start at 1 in both files for the file reading to work
correctly.

        It should also be remembered that for card images, residues are
identified by RESIDUE NUMBER. This number can be modified by using the
OFFSet feature, which allows coordinates to be read from a different PSF.
Both positive and negative values are allowed. The RESId option will
cause the residue number field to be ignored and map atoms from SEGID
and RESID labels instead.

        RES gives the residue type of the atom. RES is checked against
the residue type in the PSF for consistency. TYPE gives the IUPAC name
of the atom. The coordinates of an atom within a residue need not be
specified in any particular order. A search is made within each residue
in the PSF for an atom whose IUPAC name is given in the coordinate file.

        The RESId option overrides the residue number and fills coordinates
based on the SEGID and RESID identifiers in the coordinate file.
This is the recommended method where different PSF's are used.

        The IGNORE option allows one to read in a card coordinate file
while bypassing the normal tests of the residue name, number, and atom
name. When IGNORE is specified in place of card, the identifying
information is ignored completely. Starting from the first selected
atom, the coordinates are copied sequentially from the file.

        The PDB option works very much like the CARD option, but expects the
actual file format to be according to Protein Data Bank standards:

 text IATOM  TYPE  RES  IRES      X  Y  Z    W
  A6   I5  2X A4   A4    I5  4X     3F8.3 6X F6.2
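To compare this descriptor with the ATOM record table, it can help to turn the field widths into column ranges. The sketch below does that arithmetic. The helper name column_spans and the list CHARMM_PDB are hypothetical, and the widths are a literal reading of the descriptor above, with skipped (X) fields entered under the name None.

```python
def column_spans(fields):
    """Map each named field to its 1-based inclusive (first, last) columns."""
    spans = {}
    column = 1
    for name, width in fields:
        if name is not None:  # None marks skipped columns (nX)
            spans[name] = (column, column + width - 1)
        column += width
    return spans


# A6 I5 2X A4 A4 I5 4X 3F8.3 6X F6.2, read literally.
CHARMM_PDB = [
    ("text", 6), ("IATOM", 5), (None, 2), ("TYPE", 4), ("RES", 4),
    ("IRES", 5), (None, 4), ("X", 8), ("Y", 8), ("Z", 8), (None, 6), ("W", 6),
]
```

Read this way, X, Y and Z fall on columns 31 - 54 as in the ATOM table, and W falls on columns 61 - 66, which the ATOM table assigns to tempFactor.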

        Normally, the coordinates are not reinitialized before new values
are read, but if this is desired, the INITialize keyword will cause the
coordinate values for all selected atoms to be initialized. Note that only
atoms that have been selected will be initialized (to 9999.0). The COOR INIT
command provides a more general way to initialize coordinates.

The READ COOR DYNR variant reads a full coordinate set from a dynamics
RESTart file. It REQUIRES a matching PSF and allows no selections or
other manipulations. A restart file (usually) contains three sets of
atom data, and you choose which one to read in with keywords:
 CURR     the current coordinates
 DELT     the displacement to be taken from the current coordinates
 VEL      the current velocities (in AKMA units)

NOTE: The restart file written after a crash may be slightly different;
at present (c28a2) it contains the previous coordinates instead of velocities.
 


X-PLOR Protein Structure Files (PSF)  from

http://www.lrz-muenchen.de/~heller/ego/manual/node88.html

Protein Structure Files (PSF) are used by EGO as a summary of the atom type, mass, partial charge and connectivity of the molecular system. PSF files are generated from the original PDB file in combination with an X-PLOR topology file using X-PLOR.

The topology data files used by X-PLOR specify the atom parameters and connectivity for all amino acids and nucleotides. X-PLOR extracts all the information necessary (along with patches and modifications from the default configuration) for a given molecule in the PSF file in the form of:

 

1. a list of atoms with the atom types (CH3E, CH1E, O, N, ...), partial charges and masses,
2. a list of atom number pairs representing bonds,
3. a list of atom number triples representing the angles between pairs of bonds,
4. a list of atom number quadruples representing dihedral angles,
5. atom number quadruples representing improper angles[3,4],
6. atom numbers defining hydrogen bond donors and acceptors,
7. explicit nonbonded interaction exclusions (See also Section 4.9).

An example PSF file follows (pti.psf):

PSF 
       4 !NTITLE
 REMARKS FILENAME="pti.psf"                                                     
 REMARKS BPTI COORDINATES TAKEN FROM CRISTALLOGRAPHIC DATA W/O WATERS           
 REMARKS HYDROGEN POSITIONS GENERATED USING HBUILD (2 ITERATIONS)               
 REMARKS RMS FLUCTUATIONS (T. ICHEYE) FOR HEAVY PROTEIN ATOMS INCLUDED          
 REMARKS DATE:24-Apr-89  02:34:58       created by user: heller                 

     568 !NATOM
       1 MAIN 1    ARG  HT1  HC     0.260000       1.00800           0
       2 MAIN 1    ARG  HT2  HC     0.260000       1.00800           0
       3 MAIN 1    ARG  N    NH3    0.000000E+00   14.0067           0
       4 MAIN 1    ARG  HT3  HC     0.260000       1.00800           0
...

     582 !NBOND: bonds
       3       5       5      18      18      19       5       6
       6       7       7       8       8       9       9      10
...

     834 !NTHETA: angles
       3       5      18       3       5       6       5      18      19
      18       5       6       5       6       7       6       7       8
...

     351 !NPHI: dihedrals
       3       5       6       7       5       6       7       8
       6       7       8       9       7       8       9      11
...

     259 !NIMPHI: impropers
       5       3      18       6       9       8      11      10
      11      12      15       9      22      20      25      23
...

     114 !NDON: donors
       9      10      12      13      12      14      15      16
      15      17       3       1       3       2       3       4
...

      79 !NACC: acceptors
      19      18      26      25      32      31      33      31
      35      34      47      46      54      53      63      62
...

      24 !NNB
      45      44      43      97      96      95     210     209
     208     224     223     222     236     235     234     328
...

     222       0 !NGRP
       0       0       0       5       0       0       7       0       0
...
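Each section of the example begins with a count followed by a tag such as !NATOM or !NBOND. The sketch below, which is only an illustration and not how X-PLOR or EGO read the file, collects those counts and groups the integers under a section into tuples (pairs for bonds, triples for angles, quadruples for dihedrals). The function names are hypothetical, and reading stops at a blank line or at an elision such as the "..." in the listing.

```python
import re

# A count, optional further integers, then "!TAG" (e.g. "222  0 !NGRP").
_SECTION = re.compile(r"^\s*(\d+)(?:\s+\d+)*\s+!(\w+)")


def psf_section_counts(lines) -> dict:
    """Return {tag: count} for every section header found."""
    counts = {}
    for line in lines:
        match = _SECTION.match(line)
        if match:
            counts[match.group(2)] = int(match.group(1))
    return counts


def read_index_tuples(lines, section: str, size: int) -> list:
    """Group the atom numbers listed under `!section` into tuples of `size`."""
    numbers = []
    inside = False
    for line in lines:
        match = _SECTION.match(line)
        if match:
            if inside:
                break  # reached the next section header
            inside = match.group(2) == section
            continue
        if inside:
            tokens = line.split()
            if not tokens or not all(t.isdigit() for t in tokens):
                break  # blank line or truncated listing
            numbers.extend(int(t) for t in tokens)
    # Keep only complete tuples.
    return [tuple(numbers[i:i + size])
            for i in range(0, len(numbers) - size + 1, size)]
```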


From:  http://www.sinica.edu.tw/~scimath/msi/insight2K/charmm_principles/Ch02_model_build.FM5.html#444493

The structure of a PSF is as follows:

- Atom number
- Segment name
- Residue identifier
- Residue name
- Atom name
- Atom type
- Atomic charge
- Atomic mass
- A flag to indicate whether the atom is constrained
 

Creating a PSF