Module – Bioinformatics & Big Data
1. You can find, interpret and assess the quality of data obtained from bioinformatic databases and can use these
databases to solve biomedical problems.
2. You can interpret commonly used visualizations of high-dimensional data.
3. You can navigate the genome to identify different functional elements in the genome.
4. You can obtain and analyse DNA and protein sequences, perform alignments, understand the underlying
implications, and use these alignments to predict protein function.
5. You understand the protein folding process, the role of the amino acids and different forces in this process, and
you understand the (im)possibilities of protein structure determination and prediction tools, so you can apply
these tools when necessary.
6. You can obtain and combine information from several data sources and use this information to understand the
relation between variations in the DNA/amino acid sequence and their effect on health and disease.
7. You can develop search-strategies, using public databases, in order to identify data relevant to disease
prevention and management.
Lecture introduction bioinformatics and big data on the 24th of March by Peter-Bram ‘t Hoen
A bioinformatician combines knowledge of informatics, statistics and biology. Bioinformatics is the
science of developing and applying computer algorithms to biological data sets with the aim of acquiring new
biological insights.
Why are some people asymptomatic after a corona infection while others die of the virus (same variant)? Exome
sequencing revealed specific mutations in the TLR7 gene that make some people more susceptible to the virus.
The virus enters through the ACE2 receptor on the host cell.
Sequencing has become cheaper throughout the years.
Big data, the four V’s: volume; velocity; variety and veracity.
Thesaurus / vocabulary: collection of terms, definitions and synonyms (example: Unified Medical Language System,
UMLS, which includes MeSH). An ontology describes the relationships (e.g. hierarchy) between terms in the thesaurus
and links to external databases (a hierarchy is a tree-like structure). You can search for a term, but also for the
terms below it in the tree structure.
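The idea of searching a term together with everything below it in the hierarchy can be sketched in a few lines of Python. This is only a minimal illustration: the `CHILDREN` dictionary and the function `expand_term` are hypothetical, and the terms are loosely modelled on a MeSH-style tree rather than copied from the real MeSH hierarchy.

```python
# Hypothetical mini-hierarchy (parent -> list of child terms).
# Illustrative only; this is NOT the real MeSH tree.
CHILDREN = {
    "Diseases": ["Virus Diseases", "Neoplasms"],
    "Virus Diseases": ["Coronavirus Infections", "Influenza"],
    "Coronavirus Infections": ["COVID-19"],
    "Neoplasms": [],
    "Influenza": [],
    "COVID-19": [],
}


def expand_term(term, children=CHILDREN):
    """Return the term itself plus every term below it in the tree."""
    expanded = [term]
    for child in children.get(term, []):
        expanded.extend(expand_term(child, children))
    return expanded


# Searching "Virus Diseases" also covers the narrower terms underneath it.
print(expand_term("Virus Diseases"))
# ['Virus Diseases', 'Coronavirus Infections', 'COVID-19', 'Influenza']
```

A search engine that uses the ontology would then query the database with all expanded terms instead of only the one that was typed in.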
Curation: annotators perform a manual check and, because their time is limited, only add a topic when they
consider it important. The manual check is important for high-confidence, reliable information. Databases are
either curated (manual check) or non-curated (high-throughput/automated).
Important curated resources: Gene, Disease, Protein, Chemical, Drug, Pathway.
EntrezGene: gene-centered information; it relies on other databases and links to them, so it acts as a portal
to those databases → what a gene does.
OMIM: genetically inherited diseases (two types of information, disease centered and gene centered).
SwissProt (more curated) or UniProt: the central protein resource → function, expression and structure of proteins
Human Protein Atlas: cell and tissue expression → information on proteins and their distribution across human
tissues and even within cells.
ChEMBL: a chemical resource → chemical structure and its activity in the cell.
DrugBank: drug-centered information (on chemical itself but also all the clinical trials on the drug).
Conclusions:
◦ Bioinformatics is a key component of current biomedical research
◦ One analysis on the computer may save you a lot of time in the lab
◦ Make effective use of existing data
◦ Do not blindly rely on the computer, but carefully inspect results and underlying data
◦ Computers work with structured data and knowledge. We need to put effort in formal knowledge
representation that is also machine readable
Bio.tools
Protein folding on the 31st of March by Hanka Venselaar
How to distinguish between the good and bad mutations? We need knowledge about a protein, and preferably
also a 3D structure.
From the previous course we know how genes are transcribed into RNA, which is translated into protein by the
ribosome. We know that proteins are the workhorses of the body; almost every process is carried out by proteins.
You will see that the shape of the protein defines what a protein can do, and exactly that shape is defined by the
genome. So the fields of DNA and protein structure are of course closely connected.
Mutations in the coding sequence of genes can affect the protein’s shape and therefore what it can do. You will
see that mutations that are detrimental to the structure are usually not retained in the genome (and are
therefore rare in the population). Mutations that by accident make a protein better might be retained in the
population. And this is how the function of proteins affects what happens in the genome too…on a population
level.
Everyone is different (which is a good thing) and the differences between us are caused by small changes in the
DNA, which cause small changes in protein behavior. Many differences are not a problem at all, and during these
days we hope to make you see why that is the case. These small changes do not affect a protein very much. Or, as in
the case of eye color, the changes affect a protein that is not essential for life. These are changes that can
spread throughout the population and cause variation within species (which is a good process: evolution). A
single-nucleotide polymorphism (SNP) is a germline substitution of a single nucleotide at a specific position in the
genome. However, when a mutation changes an essential protein in such a way that it stops working, you have a
problem. This is either lethal or will, at least, make life much more difficult. Because carriers find it more
difficult to reproduce, these mutations (luckily) remain rare in the population.
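To make the effect of a single-nucleotide substitution concrete, the sketch below translates a short coding sequence before and after a substitution and reports whether the protein changes. It is a minimal illustration under stated assumptions: the sequence, the function names and the partial codon table are chosen for this example (the codon assignments themselves follow the standard genetic code), and codons outside the table, including stop codons, are not handled.

```python
# Partial codon table: only the codons needed for this example
# (assignments follow the standard genetic code).
CODON_TABLE = {
    "ACT": "Thr",
    "CCT": "Pro",
    "GAG": "Glu",
    "GAA": "Glu",
    "GTG": "Val",
}


def translate(dna):
    """Translate a coding DNA string codon by codon (raises KeyError for unknown codons)."""
    return [CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3)]


def classify_substitution(dna, position, new_base):
    """Classify a substitution at a 0-based position as synonymous or missense."""
    mutated = dna[:position] + new_base + dna[position + 1:]
    if translate(dna) == translate(mutated):
        return "synonymous"  # same amino acids: the protein is unchanged
    return "missense"        # one amino acid is replaced by another


reference = "ACTCCTGAGGAG"  # illustrative sequence: Thr-Pro-Glu-Glu
print(translate(reference))
print(classify_substitution(reference, 8, "A"))  # GAG -> GAA, still Glu
print(classify_substitution(reference, 7, "T"))  # GAG -> GTG, Glu becomes Val
```

The first substitution leaves the protein untouched, while the second swaps one amino acid for another; whether that swap matters depends on where it sits in the protein.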
Mutations in a protein can affect its function, but this is not always the case. Whether a mutation has any effect
at all depends on its position in the protein, the function of that particular area, and even on the role of the
protein in cell processes.
Proteins are the workhorses in your body. You can group them by type of function, such as enzymes (the
proteins that catalyze a reaction), structural proteins that build the important structures in cells (the actin
cytoskeleton), transport proteins (that transport goodies from A to B), defense proteins, etc.
What’s also important to remember is that proteins all have a different shape…and it is that shape that defines
what the protein can do. Very often a protein is very specialized to do exactly 1 thing. It evolved over many years
to be really good at that one particular function. The main message you need to remember from all of
this is that each protein has its own unique shape, and that shape defines its unique function.
Enzymes may be the best-known group of proteins because they all “do” something to other
molecules. Keep in mind that every enzyme is a protein, but enzymes are just one group among all possible
proteins. Still, they are usually the best-known example because they are directly related to reactions that take
place in cells. These proteins each take care of 1 step in a reaction mechanism. Usually you can describe this as
substance A -> substance B. Enzymes all have an active site; of the large protein only ~3 residues are involved
in the reaction itself…the rest is necessary to make sure that the active site is correctly shaped and regulated. And
this is important to make sure that the enzyme binds the correct molecule. So, an
enzyme always binds to something else, the substrate, and produces something new, the products. The protein
itself is not used up in this reaction, which is exactly the definition of a catalyst.
In the detailed picture you can see a small molecule interacting in the active site of a protein. These interactions
are extremely important, otherwise the protein would start acting on other molecules as well. Enzymes are often
shown in a very simplified way (such as the picture on the right). In real life, the active site can be studied in detail.
Each atom in the active site needs to be placed exactly at the right position.
Here are a few facts about enzymes. You can find enzymes for more than 5,000 chemical reactions in human cells.
Most of these reactions either create or break down other molecules. So, in chemical words…a bond is being
broken and/or created in the small molecules that bind. Often the enzymes need another molecule too, a cofactor.
This is necessary to provide some extra energy (ATP, NADH) to start a reaction (most reactions cannot take place
spontaneously), or to act as extra electron storage. Regulation is extremely important; this is done by regulating the
protein itself (for example, by adding an inhibition domain), or even by regulating the transcription or translation
of the gene. Diseases that are associated with these enzymes often result in a depletion or an excess of a certain
molecule. Think about phenylketonuria (PKU), the disease in which the amino acid phenylalanine cannot be broken
down and therefore builds up to toxic levels. Of course, you could treat this by making sure that a person doesn’t
eat the substance that needs to be broken down. This is not always possible, but for PKU it is. And last but not
least, enzymes are also used in commercial settings…because the reactions they catalyze are often faster and
more efficient than with any other method.
Transport proteins. In cells, we have to move molecules from A to B. This can be done by diffusion, but that might
take long, and many molecules cannot pass a membrane. Also, when something in the bloodstream needs to go
into a cell, you need something to transport it. Therefore nature has built protein channels, structures that
pass through a membrane and allow certain molecules to pass. They can be regulated by other molecules, or by
voltage or gradients. Or they might simply be a hole in the membrane, such as aquaporin, which allows water to
flow freely through the membrane. Very typical of these proteins is that they often contain many alpha helices
that run through the membrane, and their outside is hydrophobic just like the membrane is. Problems with these
proteins can lead to many diseases in which the transport of water or other molecules is disturbed.
Not all transport proteins are located in the membrane; one very well-known example is hemoglobin, the oxygen
transporter in blood. This protein travels throughout the body in our bloodstream, takes up oxygen in places
where the oxygen level is high (the lungs) and releases it wherever needed (the rest of our body). The special thing
about hemoglobin is that it needs an extra cofactor, the heme group, and the heme group in turn needs an iron atom.
Defense proteins (such as antibodies) recognize and attack pathogens. This is especially relevant now, while we wait
until we have all created the correct antibodies against the corona virus. We need something that can defend our
bodies from pathogenic attacks. We have a first-line defense of proteins that simply recognize anything dangerous
and start the reactions that you see in infections…but the antibodies are a specific group of the defense proteins.
They can recognize very specifically just 1 type of pathogen, and they will remember it for a long time. You
can learn much more about this intricate system during an immunology course. But what is interesting for us is
the structure of an antibody, which contains 4 chains: 2 long (heavy) chains and 2 short (light) chains. Together
they form a complex with a rough Y shape, and at the top of the Y we find two binding sites that are used to
recognize a pathogen. Each recognition site consists of amino acids from 6 loops, 3 from the heavy and 3 from
the light chain. The amino acids in these loops vary a lot, and therefore we can recognize many different
pathogens.
Other defense proteins are known, such as the MHC and the complement system. The latter is part of the first line
of defense, your innate immune system. You can imagine that if something goes wrong with these antibodies, it
might have severe results. Non-functional antibodies will leave you very susceptible to any kind of pathogen
(immunodeficiency). But overactive immune systems also lead to disease; allergies and arthritis are just two
examples, and in auto-immune disease the proteins attack the body's own cells. Antibodies are also
widely used in research because you can use them to interact specifically with something.
Well, the secret of proteins lies in the amino acids, the building blocks of proteins. Here we have to dive into
the details at the atomic level. Each amino acid has the same basic structure: an amino (NH2) group on one side and
an acid (COOH) group on the other. In the middle we find a carbon atom, known as the C-alpha, which is the central
point of the amino acid. The acid group of one amino acid is connected to the amino group of another
one during the translation process of a protein. So, our cell’s machinery, the ribosome, reads RNA and knows
which amino acid to connect next. The amino group, C-alpha and acid group form the backbone of a protein. The
C-alpha is also connected to a sidechain, the R-group, and this sidechain is different for each amino acid.
You have probably seen some variation of this picture before. During the translation phase, amino acids
are connected by the ribosome. The acid side of one amino acid is connected to the amino side of a new residue.
During this process a water molecule is released and a new bond is formed. We call the connection
between the 2 amino acids a peptide bond. Note that the R-group, or sidechain, of each amino acid does not play
a role in this process. Another word for amino acid that you will hear is residue. We have 20 different ones; some
of them are essential, which means that you have to obtain them from food, while others can be created from the
essential ones and are therefore non-essential. The basic shape is the same, but the sidechain makes each amino
acid unique. When the protein is created by the ribosome, it is basically produced as a single long chain; however,
the chain folds almost directly into a protein shape. This is what we are looking at in this course. The secret of
the amino acids is their unique sidechain shape. It is this sidechain that gives each amino acid its unique properties.
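The bookkeeping behind peptide bond formation can be shown with a small calculation: a chain of n residues contains n - 1 peptide bonds, and each bond releases one water molecule. The sketch below is only an illustration; the function names are hypothetical and the masses are approximate, rounded average masses of the free amino acids in daltons.

```python
# Approximate average masses of the free amino acids in daltons (rounded values).
FREE_AMINO_ACID_MASS = {"Gly": 75.07, "Ala": 89.09, "Ser": 105.09}
WATER_MASS = 18.02  # one water molecule is released per peptide bond


def peptide_bonds(residues):
    """A linear chain of n residues contains n - 1 peptide bonds."""
    return max(len(residues) - 1, 0)


def peptide_mass(residues):
    """Sum of the free amino acid masses minus one water per peptide bond."""
    total = sum(FREE_AMINO_ACID_MASS[name] for name in residues)
    return total - peptide_bonds(residues) * WATER_MASS


tripeptide = ["Gly", "Ala", "Ser"]
print(peptide_bonds(tripeptide))           # 2
print(round(peptide_mass(tripeptide), 2))  # 233.21
```

Note that the sidechains only enter through the masses; the number of bonds and released water molecules depends on the chain length alone, in line with the sidechain playing no role in bond formation.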
And here is an overview of all 20 amino acids, in which you can see their molecular shapes. Important to know is
that each amino acid has a full name, but we also abbreviate the name with a 3-letter code and a 1-letter code. We
see large differences in size; compare the smallest, glycine, with the largest, tryptophan. The most important
properties in general are size, charge and hydrophobicity, but this chart indicates many other properties as well.
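Because sequences in databases are almost always written in the 1-letter code, it helps to have the two abbreviation schemes side by side. The lookup table below lists the standard 3-letter and 1-letter codes of the 20 amino acids; the helper functions and the dash-separated input format are assumptions made for this sketch.

```python
# Standard 3-letter -> 1-letter codes for the 20 amino acids.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gln": "Q", "Glu": "E", "Gly": "G", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}
# Reverse lookup: 1-letter -> 3-letter.
ONE_TO_THREE = {one: three for three, one in THREE_TO_ONE.items()}


def to_one_letter(three_letter_sequence):
    """Convert dash-separated 3-letter codes, e.g. 'Gly-Ala-Trp', to 'GAW'."""
    return "".join(THREE_TO_ONE[code] for code in three_letter_sequence.split("-"))


def to_three_letter(one_letter_sequence):
    """Convert a 1-letter sequence, e.g. 'MKV', to 'Met-Lys-Val'."""
    return "-".join(ONE_TO_THREE[letter] for letter in one_letter_sequence)


print(to_one_letter("Gly-Ala-Trp"))  # GAW
print(to_three_letter("MKV"))        # Met-Lys-Val
```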