Crystal Structures of MW1337R and Lin2004: Representatives of a Novel Protein Family That Adopt a Four‐helical Bundle Fold
Piotr Koźbiał, Qingping Xu, Hsiu‐Ju Chiu, Daniel McMullan, Sanjeev Krishna, Mitchell D. Miller, Polat Abdubek, Claire Acosta, Tamara Astakhova, Herbert L. Axelrod, Dennis Carlton, Thomas Clayton, Marc C. Deller, Lian Duan, Ylva Elias, Marc‐André Elsliger, Julie Feuerhelm, Slawomir K. Grzechnik, Joanna Hale, Gye Won Han, Lukasz Jaroszewski, Kevin K. Jin, Heath E. Klock, Mark W. Knuth, Eric Koesema, Abhinav Kumar, David Marciano, Andrew T. Morse, Kevin D. Murphy, Edward Nigoghossian, Linda Okach, Silvya Oommachen, Ron Reyes, Christopher L. Rife, Glen Spraggon, Christina V. Trout, Henry van den Bedem, Dana Weekes, Aprilfawn White, Guenter Wolf, Chloé Zubieta, Keith O. Hodgson, John Wooley, Ashley M. Deacon, Adam Godzik, Scott A. Lesley, Ian A. Wilson
DOI: https://doi.org/10.1002/prot.22020
2008-01-01
Abstract: To extend the structural coverage of proteins with unknown functions, we targeted a novel protein family (Pfam accession number PF08807, DUF1798)1 for which we proposed and determined the structures of two representative members. The MW1337R gene of Staphylococcus aureus subsp. aureus Rosenbach (Wood 46) encodes a protein with a molecular weight of 13.8 kDa (residues 1–116) and a calculated isoelectric point of 5.15. The lin2004 gene of the nonspore-forming bacterium Listeria innocua Clip11262 encodes a protein with a molecular weight of 14.6 kDa (residues 1–121) and a calculated isoelectric point of 5.45. MW1337R, lin2004, and their homologs, which have so far been found only in Bacillus, Staphylococcus, Listeria, and related genera (Geobacillus, Exiguobacterium, and Oceanobacillus), have unknown functions and are annotated as hypothetical proteins. The genomic contexts of MW1337R and lin2004 are similar and conserved in related species. In prokaryotic genomes, functionally interacting proteins are most often encoded by genes that are colocated in conserved operons.2 Proteins from the same operon as MW1337R and lin2004 either have unknown functions (i.e., belong to DUF1273, Pfam accession number PF06908) or are similar to ypsB from Bacillus subtilis. The function of ypsB is unclear, although it has strong similarity to the N-terminal region of DivIVA, which was characterized as a bifunctional protein with distinct roles during vegetative growth and sporulation.3, 4 In addition, members of the DUF1273 family display distant sequence similarity with the DprA/Smf protein, which acts downstream of the DNA uptake machinery, possibly in conjunction with RecA.5 The RecA activities in Bacillus subtilis are modulated by RecU Holliday-junction resolvase.6 In all analyzed cases, the gene coding for RecU is in the vicinity of MW1337R, lin2004, or their orthologs, but in a different operon located on the complementary DNA strand.
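The strand relationship described above can be illustrated with a minimal Python sketch that classifies how two neighboring genes are oriented relative to each other. The `Gene` record, the `orientation` helper, and all coordinates are hypothetical and chosen only for illustration; they are not taken from any annotated genome.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    name: str
    start: int   # leftmost coordinate on the reference strand
    end: int     # rightmost coordinate on the reference strand
    strand: str  # "+" (forward) or "-" (complementary)


def orientation(a: Gene, b: Gene) -> str:
    """Classify the relative transcription direction of two neighboring genes."""
    if a.strand == b.strand:
        return "co-directional"
    left, right = sorted((a, b), key=lambda g: g.start)
    # Left gene on "-" and right gene on "+": both are read away from the
    # shared intergenic region, i.e. they are transcribed divergently.
    if left.strand == "-" and right.strand == "+":
        return "divergent"
    return "convergent"


# Hypothetical coordinates (assumption, for illustration only).
rec_u = Gene("recU", 1000, 1600, "-")
duf1798_member = Gene("DUF1798 family member", 1750, 2100, "+")
print(orientation(rec_u, duf1798_member))
```

In this made-up layout the two genes lie on opposite strands and point away from each other, so the helper reports a divergent arrangement. Genes in the same operon would instead share a strand and be reported as co-directional.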
Here, we report the crystal structures of MW1337R and lin2004, which were determined using the semiautomated, high-throughput pipeline of the Joint Center for Structural Genomics (JCSG),7 part of the National Institute of General Medical Sciences' Protein Structure Initiative. The gene encoding MW1337R was amplified by polymerase chain reaction (PCR) from Staphylococcus aureus subsp. aureus Rosenbach (Wood 46) genomic DNA using PfuTurbo DNA polymerase (Stratagene) and primers corresponding to the predicted 5′ and 3′ ends. The Rosenbach (Wood 46) strain used here (ATCC 10832) contains a single amino acid substitution (M45I) when compared to the available GenBank sequence (MW1337; NP_646154.1; GI: 21283066) from Staphylococcus aureus subsp. aureus strain MW2. The PCR product was cloned into plasmid pMH4, which encodes an expression and purification tag (MGSDKIHHHHHH) at the amino terminus of the full-length protein. The cloning junctions were confirmed by DNA sequencing. Protein expression was performed in a selenomethionine-containing medium using the Escherichia coli strain GeneHogs (Invitrogen). At the end of fermentation, lysozyme was added to the culture to a final concentration of 250 μg/mL, and the cells were harvested. After one freeze/thaw cycle, the cells were sonicated in lysis buffer 1 [50 mM Tris pH 7.9, 50 mM NaCl, 10 mM imidazole, 1 mM Tris(2-carboxyethyl)phosphine hydrochloride (TCEP)], and the lysate was clarified by centrifugation at 32,500 × g for 30 min. The soluble fraction was passed over nickel-chelating resin (GE Healthcare) pre-equilibrated with lysis buffer 1, the resin was washed with wash buffer 1 [50 mM Tris pH 7.9, 300 mM NaCl, 40 mM imidazole, 10% (v/v) glycerol, 1 mM TCEP], and the protein was eluted with elution buffer 1 [20 mM Tris pH 7.9, 300 mM imidazole, 10% (v/v) glycerol, 1 mM TCEP]. 
The eluate was diluted 10-fold with buffer Q [20 mM Tris pH 7.9, 5% (v/v) glycerol, 1 mM TCEP] containing 50 mM NaCl and loaded onto a RESOURCE Q column (GE Healthcare) pre-equilibrated with the same buffer. MW1337R was eluted with a linear gradient of 50–500 mM NaCl in buffer Q, buffer exchanged with crystallization buffer 1 [20 mM Tris pH 7.9, 150 mM NaCl, 1 mM TCEP], and concentrated for crystallization assays to 18 mg/mL by centrifugal ultrafiltration (Millipore). MW1337R was crystallized using the nanodroplet vapor diffusion method8 with standard JCSG crystallization protocols.7 The crystallization reagent that produced the MW1337R crystal used for structure solution contained 0.2 M MgCl2, 2.5 M NaCl, and 0.1 M Tris pH 7.0. Glycerol was added to the crystal as a cryoprotectant to a final concentration of 15% (v/v). Initial screening for diffraction was carried out using the Stanford Automated Mounting (SAM) system9 at the Stanford Synchrotron Radiation Laboratory (SSRL, Menlo Park, CA). The MW1337R crystal was trigonal, belonging to space group R32 (indexed in the hexagonal setting; Table I).10, 11 Molecular weight and oligomeric state of MW1337R were determined using a 1 × 30 cm Superdex 200 column (GE Healthcare) in combination with static light scattering (SEC/SLS; Wyatt Technology). The mobile phase consisted of 20 mM Tris pH 8.0, 150 mM NaCl, and 0.02% (w/v) sodium azide. The gene encoding lin2004 (GenBank: NP_471338.1; GI: 16801070) was amplified by PCR from genomic DNA using PfuTurbo DNA polymerase (Stratagene) and primers corresponding to the predicted 5′ and 3′ ends. The PCR product was cloned into plasmid pSpeedET, which encodes an expression and purification tag followed by a tobacco etch virus (TEV) protease cleavage site (MGSDKIHHHHHHENLYFQG) at the amino terminus of the full-length protein. The cloning junctions were confirmed by DNA sequencing. 
Protein expression was performed in a selenomethionine-containing medium using the Escherichia coli strain GeneHogs (Invitrogen). At the end of fermentation, lysozyme was added to the culture to a final concentration of 250 μg/mL, and the cells were harvested. After one freeze/thaw cycle, the cells were sonicated in lysis buffer 2 [50 mM HEPES pH 8.0, 50 mM NaCl, 10 mM imidazole, 1 mM TCEP], and the lysate was clarified by centrifugation at 32,500 × g for 30 min. The soluble fraction was loaded onto nickel-chelating resin (GE Healthcare) pre-equilibrated with lysis buffer 2, the resin was washed with wash buffer 2 [50 mM HEPES pH 8.0, 300 mM NaCl, 40 mM imidazole, 10% (v/v) glycerol, 1 mM TCEP], and the protein was eluted with elution buffer 2 [20 mM HEPES pH 8.0, 300 mM imidazole, 10% (v/v) glycerol, 1 mM TCEP]. The eluate was buffer exchanged with crystallization buffer 2 [20 mM HEPES pH 8.0, 200 mM NaCl, 40 mM imidazole, 1 mM TCEP] and incubated with 1 mg of TEV protease per 10 mg of eluted protein. The protease-treated eluate was passed over nickel-chelating resin (GE Healthcare) pre-equilibrated with crystallization buffer 2, and the resin was washed with the same buffer. Because TEV cleavage was unsuccessful, lin2004 was eluted with elution buffer 2, buffer exchanged with crystallization buffer 2, and concentrated for crystallization assays to 12.5 mg/mL by centrifugal ultrafiltration (Millipore). Lin2004 was crystallized using the nanodroplet vapor diffusion method with standard JCSG crystallization protocols. Crystals obtained from two different crystallization conditions were used for the structure solution. The crystallization reagent yielding the crystal used for MAD phasing contained 0.2 M Ca(OAc)2, 30% (v/v) polyethylene glycol (PEG) 400, and 0.1 M acetate pH 4.5. Ethylene glycol was added to the crystal as a cryoprotectant to a final concentration of 5% (v/v).
The crystal used for refinement was obtained using a precipitating reagent containing 30% (v/v) glycerol, 5.6% (w/v) PEG 4000, and 0.1 M acetate pH 4.6. No additional cryoprotectant was added to the crystal. Initial screening for diffraction was carried out using the SAM system at the SSRL. Lin2004 crystals were indexed in hexagonal space group P6322 (Table II). Molecular weight and oligomeric state of lin2004 were estimated using a size exclusion chromatography (SEC) column pre-calibrated with gel filtration standard (Bio-Rad). The column and mobile phase used for SEC/SLS of MW1337R were also used for this analysis. For the MW1337R structure, single-wavelength anomalous diffraction (SAD) data were collected at the Advanced Photon Source (APS, Chicago, IL) on beamline 23-ID-D at a wavelength corresponding to the selenium absorption peak. The data sets were collected using the BLU-ICE environment12 at 100 K with a MarMosaic 300 CCD detector (Mar USA). The SAD data were integrated and reduced using MOSFLM13 and then scaled with the program SCALA from the CCP4 suite.10 Phasing was performed with SHELXD14 and autoSHARP,15 and automated model building was performed with RESOLVE.16 Model completion and refinement were performed using Coot17 and REFMAC5.18 Data and refinement statistics for MW1337R are summarized in Table I. For lin2004, multiple-wavelength anomalous diffraction (MAD) data were collected at the SSRL on beamline BL11-1 at wavelengths corresponding to the inflection, high energy remote, and absorption peak of a selenium MAD experiment. Data from a second crystal were collected to higher resolution on the same beamline at a wavelength of 1.0 Å. The data sets were collected using the BLU-ICE environment at 100 K with an ADSC Q315 CCD detector. The data were integrated and reduced using MOSFLM and then scaled with the program SCALA from the CCP4 suite. MAD phasing was performed with SHELXD and autoSHARP. 
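The data-collection wavelengths mentioned above are related to photon energies through λ = hc/E. The following Python sketch performs the conversion. The constant hc ≈ 12.398 keV·Å and the nominal selenium K-edge energy of about 12.658 keV are standard tabulated values, not numbers reported in this work, and the peak and inflection wavelengths actually used for a given crystal are normally chosen from a fluorescence scan.

```python
HC_KEV_ANGSTROM = 12.398419843  # Planck constant times speed of light, in keV * Angstrom

# Nominal tabulated Se K-edge energy (assumption; not a value from this paper).
SE_K_EDGE_KEV = 12.658


def wavelength_from_energy(energy_kev: float) -> float:
    """Return the X-ray wavelength in Angstrom for a photon energy in keV."""
    if energy_kev <= 0:
        raise ValueError("energy must be positive")
    return HC_KEV_ANGSTROM / energy_kev


def energy_from_wavelength(wavelength_angstrom: float) -> float:
    """Return the photon energy in keV for an X-ray wavelength in Angstrom."""
    if wavelength_angstrom <= 0:
        raise ValueError("wavelength must be positive")
    return HC_KEV_ANGSTROM / wavelength_angstrom


print(f"Se K-edge: {wavelength_from_energy(SE_K_EDGE_KEV):.4f} A")
print(f"1.0 A corresponds to {energy_from_wavelength(1.0):.3f} keV")
```

Under these assumptions, the 1.0 Å wavelength used for the higher-resolution lin2004 data set lies at slightly lower energy than the nominal selenium K edge, which sits near 0.98 Å.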
Automated model building was performed with ARP/wARP,19 yielding a starting model composed of 122 residues (−3 to 118). After the initial refinement against the MAD data set, model building and refinement were completed using the higher resolution data set with Coot and REFMAC5. Data and refinement statistics for lin2004 are summarized in Table II. Analysis of the stereochemical quality of the model was accomplished using AutoDepInputTool,20 MolProbity,21 SFcheck 4.0,10 and WHATIF 5.0.22 Protein quaternary structure analysis was performed using the PQS server.23 Figure 1(B) was adapted from an analysis using PDBsum24; Figure 2(A) was prepared using ProtSkin (http://www.mcgnmr.ca/ProtSkin/) and PyMOL (DeLano Scientific); Figure 2(B) was prepared using CHROMA25; Figure 4 was prepared using the SEED genomic platform (http://theseed.uchicago.edu/FIG/index.cgi); and all other figures were prepared with PyMOL (DeLano Scientific). Atomic coordinates and experimental structure factors for MW1337R at 2.25 Å resolution and for lin2004 at 1.74 Å resolution have been deposited in the PDB and are accessible under the codes 2ets and 2huj, respectively.
Figure 1. Crystal structure of MW1337R from Staphylococcus aureus subsp. aureus Rosenbach (Wood 46). A: Stereo ribbon diagram of the MW1337R monomer color-coded from N-terminus (blue) to C-terminus (red); α-helices H1–H4 are indicated. B: Diagram showing the secondary structural elements of MW1337R superimposed on its primary sequence. Dashed lines indicate regions with no electron density, which are not included in the protein structure.
Figure 2. Conserved residues in MW1337R. A: Sequence conservation among homologous proteins from species related to Staphylococcus aureus subsp. aureus Rosenbach (Wood 46) is indicated by the intensity of red color mapped onto the surface of the MW1337R protein structure. The most conserved residues are labeled. The PO₄³⁻ ion is shown as white sticks.
B: Representative sequences from multiple sequence alignment of MW1337R and its homologs from related species. Abbreviations: NP_646154, S. aureus subsp. aureus MW2 [MW1337R from S. aureus subsp. aureus Rosenbach (Wood 46) contains a single amino acid substitution: M45I]; NP_390109 (yppE), Bacillus subtilis subsp. subtilis str. 168; NP_471338 (lin2004), Listeria innocua Clip11262; YP_301386, Staphylococcus saprophyticus subsp. saprophyticus ATCC 15305; AAW54386, Staphylococcus epidermidis RP62A; BAE04773, Staphylococcus haemolyticus JCSC1435. The invariant Asp30, Phe31, Pro37, and Tyr100 residues are marked with an asterisk. Residues conserved between related species are marked with a colon; residues conserved between related species that are also present in the hydrophobic core of unrelated proteins with a four-helical bundle fold are marked with one dot. The highlighting shows generally conserved amino acid residues (uncharged, yellow; polar, grey). C: Residues in the hydrophobic core of MW1337R (Leu4, Val5, Leu8, Val12, Phe19, Ile42, Ile45, Leu46, Ile49, Leu70, Ile71, Leu95, Val98, and Leu102) that are conserved in several unrelated protein domains with a four-helical bundle fold are shown as sticks on the ribbon diagram of the MW1337R structure.
The crystal structure of MW1337R [Fig. 1(A)] was determined to 2.25 Å resolution using the SAD method. Data collection, model, and refinement statistics are summarized in Table I. The final model includes a monomer [residues 1–110, as well as the last five residues of the N-terminal expression and purification tag (HHHHH)], one PO₄³⁻ ion, three Cl− ions, and 55 water molecules in the asymmetric unit. No electron density was observed for the first seven residues of the expression and purification tag (MGSDKIH) or the last six C-terminal residues of the protein (KEGTDG). Four His residues of the purification tag (–4 to –1) have poorly defined density, and the side chains of Ser44, Glu48, Glu54, and Arg85 are disordered.
The Matthews coefficient (Vm)26 for MW1337R is 2.79 Å³/Da, and the estimated solvent content is 55.5%. The Ramachandran plot produced by MolProbity27 shows that 97.4% and 100% of the residues are in favored and allowed regions, respectively. The crystal structure of lin2004 was determined to 1.74 Å resolution using the MAD method. Data collection, model, and refinement statistics are summarized in Table II. The final model includes a monomer [residues 1–120, as well as the last five residues of the TEV protease cleavage site (LYFQG)] and 97 water molecules in the asymmetric unit. No electron density was observed for the expression and purification tag (MGSDKIHHHHHH), the first two residues of the protease cleavage site (EN), and the C-terminal residue (Met121). The following residues have poor electron density for their side chains: Leu(-4), Glu26, Lys35, Arg63, Arg109, Lys117, and Glu119. The Matthews coefficient (Vm)26 for lin2004 is 2.85 Å³/Da, and the estimated solvent content is 56.6%. The Ramachandran plot produced by MolProbity27 shows that all of the residues are in favored regions. MW1337R is composed of four α-helices (residues 1–26, 34–57, 63–81, and 87–109) with a total α-helical content of 84.6% [Fig. 1(B)]. The MW1337R monomer is an up-down-up-down helical bundle with a right-hand twist. A DALI search28 with this domain found two putative homologs, namely, lin2004 (described here) and yppE (PDB code 2im8, DALI Z-score 16.5, RMSD 1.7 Å, and 20% sequence identity for 114 superimposed Cα atoms), as well as a number of hits to proteins that are less likely to be evolutionarily related. The function of yppE is unknown, although its gene was found to be induced during sporulation in an Spo0A-dependent manner.29 Lin2004 is composed of four α-helices (residues 3–26, 37–62, 69–85, and 93–118), and the total α-helical content is 79.2%. The lin2004 monomer is also an up-down-up-down helical bundle with a right-hand twist.
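The Matthews coefficients and solvent contents quoted above are linked by a simple relation. The Python sketch below assumes the commonly used approximation that the protein fraction of the crystal volume is about 1.23/Vm, with Vm in Å³/Da. The function names are illustrative, and because no unit-cell parameters are given in this text, only the Vm values reported above are used as inputs.

```python
def matthews_coefficient(cell_volume_a3: float, n_molecules: int,
                         mol_weight_da: float) -> float:
    """Vm = unit-cell volume / (molecules per cell * molecular weight), in A^3/Da."""
    return cell_volume_a3 / (n_molecules * mol_weight_da)


def solvent_fraction(vm: float, protein_constant: float = 1.23) -> float:
    """Approximate solvent fraction, assuming a protein volume fraction of 1.23/Vm."""
    return 1.0 - protein_constant / vm


# Vm values as reported in the text for the two structures.
for name, vm in (("MW1337R", 2.79), ("lin2004", 2.85)):
    print(f"{name}: Vm = {vm:.2f} A^3/Da, solvent ~ {100 * solvent_fraction(vm):.1f}%")
```

With this approximation the estimates come out at roughly 56% and 57%, within about half a percentage point of the 55.5% and 56.6% reported above. The small difference presumably reflects a slightly different constant in the program used for the original calculation.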
A DALI search28 with this domain finds yppE [Fig. 2(B) and Fig. 3; PDB code 2im8, DALI Z-score 18.4, RMSD 1.6 Å, and 27% sequence identity for 123 structurally aligned residues] and MW1337R [Fig. 2(B) and Fig. 3; PDB code 2ets, DALI Z-score 16.4, RMSD 1.4 Å, and 16% sequence identity for 113 structurally aligned residues], as well as a number of hits with DALI Z-scores of 9.6 or less to evolutionarily unrelated proteins. MW1337R, lin2004, and yppE are not only structurally similar (Fig. 3), but also share a similar pattern of highly conserved residues [Fig. 2(B)].
Figure 3. Stereo view of structure superposition of MW1337R homologs. MW1337R (PDB code 2ets) is shown in green, lin2004 (PDB code 2huj) is shown in red, and yppE (PDB code 2im8) is shown in blue.
To date, structures of more than 100 proteins that adopt a four-helical bundle fold are available. No significant sequence similarity exists between MW1337R and other four-helical bundle proteins (besides lin2004 and yppE) that align structurally with a high DALI Z-score (notably, PDB structures 1dov, 1st6, and 1rkc have Z-scores of ∼9.2 and RMSDs of ∼2.7 Å, but only ∼6% sequence identity). Despite some similarity in shared hydrophobic residues that are located inside their cores [Fig. 2(C)], a common origin is unlikely for most of them. Sequence alignment [Fig. 2(B)] of MW1337R, lin2004, and their homologs from related species shows an interesting spatial pattern of residue conservation (excluding residues of the hydrophobic core, which may be conserved for physicochemical reasons). These residues are primarily located in the loop between helices H1 and H2, in the adjacent loop between helices H3 and H4, and on the side of helices H3 and H4, where a shallow, polar cleft is located [Fig. 2(A)]. Both analysis of the crystallographic packing using the PQS server23 and SEC suggest that lin2004 is a monomer.
PQS analysis indicates that MW1337R is likely a hexamer, but this prediction is not consistent with SEC/SLS, which suggests a monomer. The structurally similar yppE (PDB: 2im8) is monomeric based on PQS analysis. Analysis of the MW1337R and lin2004 structures using the CastP server30 does not reveal any deep cavity that may serve as a ligand-binding site. The PO₄³⁻ and three Cl− ions in the MW1337R structure are coordinated by the side chains of residues that are not conserved in related species. An analysis of the MW1337R structure using the Fuzzy-Oil-Drop server31 predicts that phosphate-interacting Glu75 (PDB: 2ets) belongs to a ligand-binding site. This server also predicts that Glu11 (PDB: 2ets) may be involved in ligand binding, but belongs to a different site from Glu75. The same analysis for lin2004 predicts Gln13, Lys45, and Gln80 as ligand-binding residues, where the locations of Lys45 and Gln80 correspond to the phosphate-binding site in MW1337R. One approach to identifying and assigning protein function is to analyze the neighborhood of its gene as well as those of its orthologs. The yppE orthologs are found in similar neighborhoods in all known genomes of Bacillales (Fig. 4). The most common neighbor of yppE in these genomes is RecU (also called prfA), which lies on the strand complementary to that of yppE, such that the two genes are transcribed divergently. The arrangement of yppE orthologs (including MW1337R and lin2004) and RecU in genomes of Bacillus, Geobacillus, Oceanobacillus, Staphylococcus, and Listeria suggests that the expression of these proteins may be coregulated (Fig. 4).
Figure 4. Genomic context of MW1337R, lin2004, yppE, and their orthologs. The arrows indicate the direction of transcription in the predicted open reading frames (ORFs). The genes encoding orthologs of MW1337R are labeled as 1. The genes encoding RecU, penicillin-binding protein 1A, and DivIVA are labeled as 2, 3, and 4, respectively.
The generic names (Bacillus, Geobacillus, Listeria, Oceanobacillus, and Staphylococcus) are abbreviated. Overlapping ORFs are represented by arrows that are shifted down. Similar genes are highlighted by the same color. Genes that are not conserved in the genomic neighborhood are colored grey.
MW1337R and lin2004 are likely to interact with other proteins from their genomic neighborhood. The conserved surface residues of these proteins (Fig. 2) are likely to be involved in such interactions, although there is as yet no direct biochemical evidence supporting this assertion. Our structures, as well as those of yppE and RecU (PDB codes 2im8 and 1zp7, respectively), can now be used in the prediction of such interactions. The overexpression of ponA (which encodes penicillin-binding protein 1A and lies in the two-gene operon that also encodes RecU) was found to be toxic and to cause nucleoid condensation in E. coli.32 It is likely that RecU and PonA activities are tightly controlled. Thus, it is tempting to speculate that the function of yppE, MW1337R, lin2004, and their orthologs is to regulate RecU/PonA expression and activity. In conclusion, we report here the first structural representatives of a new Pfam family (PF08807). The structural and genomic neighborhood analyses presented here suggest a regulatory role for these proteins. Additional biochemical and biophysical studies could provide valuable insights into the precise function of these proteins. The yppE protein from Bacillus subtilis is induced during sporulation. If MW1337R and lin2004 are functionally linked to their genomic neighbor, ypsB, a protein similar to the N-terminal region of DivIVA, they could be involved in DNA processing during vegetative cell division or sporulation. The JCSG has developed The Open Protein Structure Annotation Network (TOPSAN), a wiki-based community project to collect, share, and distribute information about protein structures determined at PSI centers.
TOPSAN offers a combination of automatically generated, as well as comprehensive, expert-curated annotations, provided by JCSG personnel and members of the research community. Additional information about MW1337R and lin2004 is available at https://www.topsan.org/2ETS and https://www.topsan.org/2HUJ, respectively.
Portions of this research were carried out at the APS and SSRL. The APS is supported by the U.S. Department of Energy, Office of Basic Energy Sciences. GM/CA CAT has been funded by the National Cancer Institute and the National Institute of General Medical Sciences. SSRL is a national user facility operated by Stanford University on behalf of the U.S. Department of Energy, Office of Basic Energy Sciences. The SSRL Structural Molecular Biology Program is supported by the Department of Energy, Office of Biological and Environmental Research, and by the National Institutes of Health (National Center for Research Resources, Biomedical Technology Program, and the National Institute of General Medical Sciences). The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institute of General Medical Sciences or the National Institutes of Health. We thank the reviewer for insightful comments and suggestions that led to a putative function assignment for the proteins described in this manuscript.