"One read per gene per cell is optimal for single-cell RNA-Seq", M. J. Zhang, V. Ntranos, D. Tse, Nature Communications, 2019. In brief, every cell of every organism has a genome, which can be thought as a long string of A, C, G, and T. We considered this problem and firstly studied fundamental limits for being able to reconstruct the genome perfectly. We attempt to close the gap between the blue and green curves in the rightmost plot by introducing the truncated normal (TN) test. Existing workflows perform clustering and differential expression on the same dataset, and clustering forces separation regardless of the underlying truth, rendering the p-values invalid. The IBM Functional Genomics Platform contains over 300 million bacterial and viral sequences, enriched with genes, proteins, domains, and metabolic pathways. The past ten years there has been an explosion of genomics data -- the entire DNA sequences of several organisms, including human, are now available. While several differential expression methods exist, none of these tests correct for the data snooping problem as they were not designed to account for the clustering process. We study the fundamental limits of this problem and design scalable algorithms for this. Many high-throughput sequencing based assays have been designed to make various biological measurements of interest. Students may discuss and work on problems in groups of at most three people but must write up their own solutions. State-of-the-art pipelines perform differential analysis after clustering on the same dataset. An underlying question for virtually all single-cell RNA sequencing experiments is how to allocate the limited sequencing budget: deep sequencing of a few cells or shallow sequencing of many cells? 
Stanford Genomics, formerly the Stanford Functional Genomics Facility (SFGF), provides services for high-throughput sequencing, single-cell assays, gene expression and genotyping studies utilizing microarray and real-time PCR, and related services to researchers within the Stanford community and to other institutions. "Optimal Assembly for High Throughput Shotgun Sequencing", Guy Bresler, Ma'ayan Bresler, David Tse, 2013. The TN test is an approximate test based on the truncated normal distribution that corrects for a significant portion of the selection bias. The most important problem in computational genomics is that of genome assembly. Single-cell RNA sequencing (scRNA-Seq) technologies have revolutionized biological research over the past few years by providing us with the tools to simultaneously interrogate the transcriptional states of hundreds of thousands of cells in a single experiment. GBSC is set up to facilitate massive-scale genomics at Stanford and supports omics, microbiome, sensor, and phenotypic data types. This question has attracted a lot of attention in the literature, but as of now, there has not been a clear answer. We also drew connections between this problem and community detection problems and used them to derive a spectral algorithm for it. A natural experimental design question arises: how should we choose to allocate a fixed sequencing budget across cells, in order to extract the most information out of the experiment? Founded in 2012, the Center for Computational, Evolutionary and Human Genomics (CEHG) supports and showcases the cutting-edge scientific research conducted by faculty and trainees in 40 member labs across the School of Humanities and Sciences and the School of Medicine.
More reads can significantly reduce the effect of the technical noise in estimating the true transcriptional state of a given cell, while more cells can provide us with a broader view of the biological variability in the population. "Community Recovery in Graphs with Locality", Yuxin Chen, Govinda Kamath, Changho Suh, David Tse, 2016. At the center, our group is closely involved in the Program for Conservation Genomics, which enables the use of genomics in conservation management. The remaining major barriers to applying genomic tools in conservation management lie in the complexity of designing and analyzing genomic experiments. These two copies are almost identical with some polymorphic sites and regions (less than 0.3% of the genome). Hence we studied the complementary question of what was the most unambiguous assembly one could obtain from a set of reads. We introduce a method for correcting the selection bias induced by clustering. However, this seemingly unconstrained increase in the number of samples available for scRNA-Seq introduces a practical limitation in the total number of reads that can be sequenced per cell. Jonathan's lab uses statistical and computational methods to study questions in genomics and evolutionary biology. Applications of these tools to sequence analysis will be presented: comparing genomes of different species, gene finding, gene regulation, whole genome sequencing and assembly. The Stanford Genetics and Genomics Certificate Program utilizes the expertise of the Stanford faculty along with top industry leaders to teach cutting-edge topics in the field of genetics and genomics. "Partial DNA Assembly: A Rate-Distortion Perspective", Ilan Shomorony, Govinda M. Kamath, Fei Xia, Thomas A. Courtade, David N. Tse, 2016. The genome assembly problem is to reconstruct the genome from these reads.
The area of computational genomics includes both applications of older methods and development of novel algorithms for the analysis of genomic sequences. This cloud-based platform traverses biological entities seamlessly, accelerating discovery of disease mechanisms to address global public health challenges. A Zero-Knowledge Based Introduction to Biology, Molecular Evolution and Phylogenetic Tree Reconstruction. "HINGE: long-read assembly achieves optimal repeat resolution", Govinda M. Kamath, Ilan Shomorony, Fei Xia, Thomas A. Courtade, David N. Tse, 2017. African Wild Dog De Novo Genome Assembly: we are collaborating with 10X Genomics to adapt their long-range genomic libraries to allow high-quality genome assemblies at low cost. Sequence alignments, hidden Markov models, multiple alignment algorithms and heuristics such as Gibbs sampling, and the probabilistic interpretation of alignments will be covered. We observe that these p-values are often spuriously small. Interestingly, our results indicate that the corresponding optimal estimator is not the commonly used plug-in estimator, but one developed via empirical Bayes (EB). "Valid post-clustering differential analysis for single-cell RNA-Seq", Jesse M. Zhang, Govinda M. Kamath, David N. Tse, 2019. Stanford Center for Genomics and Personalized Medicine: large computational cluster. Extraordinary advances in sequencing technology in the past decade have revolutionized biology and medicine. We observe that because clustering forces separation, reusing the same dataset generates artificially low p-values and hence false discoveries, and we introduce a valid post-clustering differential analysis framework which corrects for this problem.
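The effect of clustering and testing on the same data can be seen in a toy simulation. The sketch below is only an illustration under simplifying assumptions, not the TN test or the framework from the paper: it draws one gene from a single homogeneous population, "clusters" the cells by splitting at the median (a stand-in for one-dimensional two-means), and compares the resulting p-value with one from labels assigned at random. The function names `welch_p_value` and `simulate` are hypothetical, and the p-value uses a normal approximation to the Welch statistic.

```python
import math
import random
import statistics


def welch_p_value(a, b):
    """Two-sided p-value for a difference in means.

    Uses the Welch statistic with a normal reference distribution,
    which is an approximation chosen to keep the sketch stdlib-only.
    """
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    z = (statistics.mean(a) - statistics.mean(b)) / se
    return math.erfc(abs(z) / math.sqrt(2))


def simulate(n_cells=200, seed=0):
    """Return (p-value after data-driven split, p-value after random split)."""
    rng = random.Random(seed)
    # One homogeneous population: a single gene, no true subgroups.
    expr = [rng.gauss(0.0, 1.0) for _ in range(n_cells)]

    # "Clustering": split at the median, so the groups are separated by design.
    cut = statistics.median(expr)
    low = [x for x in expr if x <= cut]
    high = [x for x in expr if x > cut]
    p_clustered = welch_p_value(low, high)

    # Control: labels assigned without looking at the expression values.
    labels = [rng.random() < 0.5 for _ in expr]
    group_a = [x for x, flag in zip(expr, labels) if flag]
    group_b = [x for x, flag in zip(expr, labels) if not flag]
    p_random = welch_p_value(group_a, group_b)
    return p_clustered, p_random


if __name__ == "__main__":
    p_clustered, p_random = simulate()
    print(f"p-value after clustering on the same data: {p_clustered:.3g}")
    print(f"p-value with labels chosen independently:  {p_random:.3g}")
```

Even though no real subgroups exist, the data-driven split yields an extremely small p-value, while the independent labels do not. This is the selection bias that a valid post-clustering test has to correct.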
Problems studied include genome assembly, haplotype phasing, and RNA-Seq quantification, with an emphasis on developing scalable algorithms for the analysis of genomic sequences. We found that the conditions derived here to guarantee recovery were not satisfied in most practical datasets. Makinen, Belazzougui, Cunial, Tomescu: Genome-Scale Algorithm Design. Genomics: Tools for Understanding Disease, edited by Gary Peltz. Krogh, Mitchison: Biological Sequence Analysis. "Haplotype assembly from high-throughput Mate-Pair reads", Govinda Kamath, Ma'ayan Bresler, David Tse, 2015.