Top-down mass spectrometry is a promising technology that has developed rapidly in recent years. However, only a few algorithms for processing this kind of data efficiently have been proposed so far. In particular, the only published result on de novo sequencing of proteins from top-down data that we are aware of is [1], which proposes a method for processing combined sets of top-down and bottom-up spectra: the top-down spectra serve as a scaffold for assembling the peptides that PEAKS [2] produces from the bottom-up spectra.
In this work, we describe a method for de novo sequencing of proteins from pure sets of top-down tandem mass spectra, which, to the best of our knowledge, is the first attempt to address this problem.
The raw mass spectra are first deconvoluted with MS-Deconv [3], and the resulting set of spectra is passed as input to our algorithm. For each spectrum, the algorithm constructs a spectrum graph, extracts a longest path from each of its connected components, and derives from each such path all possible k-mers for a fixed k. We compute the frequency (the total number of occurrences) of each k-mer, retain only those whose frequency exceeds a certain threshold f, and consider them in order of decreasing frequency. Each retained k-mer is extended by an iterative algorithm that uses the theoretical spectrum of the current sequence fragment P as a scaffold for aligning the input spectra, constructs from the aligned spectra a superspectrum consisting of well-confirmed peaks, and computes in the spectrum graph of this superspectrum an optimal path that spells out either an immediate extension of P or a new fragment separated from P by the mass of one or a few amino acids.
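The seed-generation step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation, and it rests on several assumptions that are not taken from the text: each deconvoluted spectrum is reduced to a plain list of neutral masses, the alphabet is cut down to five residues, two peaks are joined by an edge when their mass difference matches a residue mass within a hypothetical tolerance `tol`, and a single longest path is taken over the whole graph rather than one per connected component. The names `prefix_masses`, `longest_path_string`, `frequent_kmers`, and `min_freq` are illustrative, and a k-mer is kept once its count reaches `min_freq`.

```python
from collections import Counter

# Monoisotopic residue masses in Da for a deliberately tiny alphabet
# (assumption: a real run would use all amino acids and handle I/L, K/Q ties).
RESIDUE_MASS = {"G": 57.02146, "A": 71.03711, "S": 87.03203,
                "V": 99.06841, "L": 113.08406}


def prefix_masses(seq):
    """Toy spectrum: cumulative residue masses of `seq`, starting at 0."""
    masses, total = [0.0], 0.0
    for aa in seq:
        total += RESIDUE_MASS[aa]
        masses.append(total)
    return masses


def longest_path_string(peaks, tol=0.01):
    """Spell the residues along a longest path in the spectrum graph of `peaks`.

    Nodes are peak masses; an edge i -> j exists when masses[j] - masses[i]
    matches a residue mass within `tol`. The graph is acyclic because edges
    always point towards larger masses, so dynamic programming suffices.
    """
    masses = sorted(peaks)
    n = len(masses)
    if n == 0:
        return ""
    best_len = [0] * n   # number of edges on the longest path ending at j
    back = [None] * n    # (previous node, residue letter) on that path
    for j in range(n):
        for i in range(j):
            diff = masses[j] - masses[i]
            for aa, mass in RESIDUE_MASS.items():
                if abs(diff - mass) <= tol and best_len[i] + 1 > best_len[j]:
                    best_len[j] = best_len[i] + 1
                    back[j] = (i, aa)
    end = max(range(n), key=best_len.__getitem__)
    letters = []
    while back[end] is not None:
        end, aa = back[end]
        letters.append(aa)
    return "".join(reversed(letters))


def frequent_kmers(strings, k, min_freq):
    """Count all k-mers over `strings`; keep those seen at least `min_freq`
    times, ordered by decreasing frequency."""
    counts = Counter(s[i:i + k] for s in strings
                     for i in range(len(s) - k + 1))
    return [(kmer, c) for kmer, c in counts.most_common() if c >= min_freq]
```

For example, calling `longest_path_string` on each of `prefix_masses("GASVL")`, `prefix_masses("ASVLG")`, and `prefix_masses("SVL")` and passing the three resulting strings to `frequent_kmers(..., k=3, min_freq=2)` keeps only the 3-mers shared by at least two of the toy spectra. The sketch ignores peak intensities, ion types, and the direction in which a path is read, all of which matter for real top-down data.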
We benchmarked our method on a top-down dataset consisting of 4,951 ETD and 4,930 HCD spectra acquired from the Fab region of alemtuzumab [4]. Extension was performed for 8-mers with frequency at least 7. In total, our method retrieved 69.1% of the light chain and 39.6% of the Fd region. The retrieved fraction of the Fd region is substantially smaller due to poorer coverage. We also tested our approach on a top-down dataset comprising 1,349 CID and 1,330 ETD spectra acquired from histone H4, and retrieved 57.2% of its sequence. We point out that the prefix of the histone H4 sequence that could not be reconstructed is subject to several post-translational modifications, the presence of which was captured by our algorithm. Further details are summarized in Fig. 1.
Future directions of this research include the development of procedures capable of reconstructing post-translationally modified protein sequences, as well as more efficient algorithmic solutions for the most time-consuming steps of the algorithm, whose running time may sometimes amount to a few hours.
Fig. 1. The de novo sequencing results produced by our method for the light chain and Fd region of alemtuzumab, and for histone H4. Red fragments were retrieved entirely and correctly; blue fragments were represented in the reconstruction by correct mass gaps; the seed 8-mers are underlined.
Acknowledgements
V.D., P.A.P., and K.V. were partially supported by the Government of the Russian Federation (grant 11.G34.31.0018). L.D. and M.V. were supported by the Netherlands Organization for Scientific Research (NWO), Zenith grant 93511034. The presented software tool partly reuses code developed earlier by Yakov Sirotkin, Sonya Alexandrova, and Mikhail Dvorkin at the Algorithmic Biology Laboratory, Saint Petersburg Academic University.
References
[1] X. Liu, L. Dekker, S. Wu, M. Vanduijn, T. Luider, N. Tolić, Q. Kou, M. Dvorkin, S. Alexandrova, K. Vyatkina, L. Paša-Tolić, and P. A. Pevzner, “De Novo Protein Sequencing by Combining Top-Down and Bottom-Up Tandem Mass Spectra.” Journal of Proteome Research, 13(7):3241-48 (2014)
[2] B. Ma, K. Zhang, C. Hendrie, C. Liang, M. Li, A. Doherty-Kirby, and G. Lajoie, “PEAKS: powerful software for peptide de novo sequencing by tandem mass spectrometry.” Rapid Communications in Mass Spectrometry, 17(20):2337-42 (2003)
[3] X. Liu, Y. Inbar, P. C. Dorrestein, C. Wynne, N. Edwards, P. Souda, J. P. Whitelegge, V. Bafna, and P. A. Pevzner, “Deconvolution and Database Search of Complex Tandem Mass Spectra of Intact Proteins: A Combinatorial Approach.” Molecular & Cellular Proteomics, 9(12):2772-82 (2010)
[4] L. Dekker, S. Wu, M. Vanduijn, N. Tolić, C. Stingl, R. Zhao, T. Luider, and L. Paša-Tolić, “An integrated top-down and bottom-up proteomic approach to characterize the antigen binding fragment of antibodies.” Proteomics, 14(10):1239-48 (2014)
[5] Z. Tian, N. Tolić, R. Zhao, R. J. Moore, S. M. Hengel, E. W. Robinson, D. L. Stenoien, S. Wu, R. D. Smith, and L. Paša-Tolić, “Enhanced top-down characterization of histone post-translational modifications.” Genome Biology, 13:R86 (2012)