The Use of Bioinformatic Tools to Assess the Structure and Function of an
Unidentified Coding and Non-Coding Sequence
19002410
Abstract
Nucleotide sequences 4a and 4b were first identified using NCBI Nucleotide BLAST and then
characterised with a range of bioinformatic tools. The aims were to establish the characteristics
of each sequence and its function in regulating cellular processes or its role in pathology.
Outputs were interpreted relative to the purpose of each tool, providing information on the
structural features of each sequence or its role in cell signalling. Sequence 4a was identified as
HOTAIR, a long non-coding RNA found to be upregulated in several diseases. Sequence 4b was
identified as the coding sequence BECN1, which regulates autophagy and is closely associated with
other protein subunits within the PI3K complex. STRING, HuRI, AlphaFold, Genevisible and DisGeNET
were then used to further characterise each sequence and establish its function. Outputs were
interpreted in terms of protein-protein interactions, splice variants, or levels of expression of
each sequence across a range of tissues. The level of expression of each gene within a tissue was
associated with normal or abnormal cellular function, which is relevant to medical research.
Information provided by each database informed subsequent investigation into the role of each
sequence within the Homo sapiens genome.
Introduction
Bioinformatics has allowed large volumes of biological and statistical information to be
processed, stored, and collated in databases. Querying these datasets can provide a wide range of
information for determining the identity and characteristics of sequences. Of interest here was the
role of non-coding and coding nucleotide sequences, which were analysed in terms of their homology
to a range of genes; results showed differences arising from splice variants and from interactions
with structural proteins. Approximately 98% of the genome can be categorised as non-coding, so a
great deal of information can be explored and deduced from sequences that do not code for
proteins (Perenthaler et al., 2019). A variety
of bioinformatic tools can be used together to gain further insight into genomics, whereby
sequences are analysed to reveal their role in disease presentation, cellular function or
signalling pathways. Notably, the Human Genome Project helped the field of bioinformatics to gain
traction within the scientific community as an efficient way to store information, and established
a valuable relationship with the related disciplines of mathematics and computer science (Hood and
Rowen, 2013). Using a range of bioinformatic tools allowed a more objective view of the role of a
gene within the body and a clearer understanding of its functional significance, because
deductions were drawn from several databases with broad coverage rather than from a single
resource. The aims of this project were to understand the function of two sequences within the
human genome and to describe the key features that distinguish them from other elements of the
transcriptome. Bioinformatic tools were used as the basis of research into each sequence: each
search result led to further investigation using other databases, to gain insight into the
significance of non-coding sequence 4a and coding sequence 4b.
Methods
The bioinformatic tools used to obtain structural and functional information on sequences 4a and
4b during this investigation are summarised in Table 1.
Table 1. Bioinformatic tools used and their purpose in the characterisation of sequences 4a and 4b. A
range of bioinformatic tools was used to determine the structure and function of each sequence, starting with NCBI Nucleotide
BLAST. Each sequence was identified as either coding or non-coding and named according to its homology to
predicted sequences stored within the chosen database. Outputs provided by each tool informed subsequent investigation using
other databases to explore the role of each sequence in more detail.
Name of Bioinformatic Tool | Purpose of Tool
NCBI Nucleotide BLAST | To confirm the identity of each sequence by comparison with several predicted and experimentally confirmed sequences
STRING | To assess the interaction of sequence 4b with other co-regulatory proteins within protein complexes
HuRI | To check the coverage of STRING and the objectivity of its outputs for sequence 4b by comparing the two databases
AlphaFold | To visualise the structural features of sequence 4b in relation to its function
Genevisible | To compare the level of expression of sequence 4a across an array of tissue types in healthy samples and in cancer presentation
DisGeNET | To compare the role of splice variants within exonic and intronic regions of sequence 4a in disease presentation, and to determine the chromosome on which it is located
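The first step listed in Table 1, the NCBI Nucleotide BLAST search, can also be expressed as a programmatic request. The sketch below only assembles the form fields for such a request and sends nothing. The field names follow NCBI's BLAST URL API as commonly documented and should be checked against the current documentation. The choice of the `nt` database, the use of megablast to stand for "highly similar sequences", the function name `build_blast_submission`, and the placeholder query are all assumptions made for illustration; the web interface was used in this investigation.

```python
from urllib.parse import urlencode

# Endpoint of NCBI's BLAST URL API.
BLAST_URL = "https://blast.ncbi.nlm.nih.gov/Blast.cgi"


def build_blast_submission(query_sequence: str, organism: str = "Homo sapiens") -> dict:
    """Return the form fields for a nucleotide BLAST submission (nothing is sent)."""
    # Remove whitespace and line breaks so the query is one continuous sequence.
    sequence = "".join(query_sequence.split()).upper()
    if not sequence:
        raise ValueError("query sequence is empty")
    return {
        "CMD": "Put",                              # submit a new search
        "PROGRAM": "blastn",                       # nucleotide query vs nucleotide database
        "MEGABLAST": "on",                         # assumption: "highly similar sequences"
        "DATABASE": "nt",                          # assumption: database not stated in the report
        "ENTREZ_QUERY": f"{organism}[Organism]",   # restrict hits to one organism
        "QUERY": sequence,
    }


# Hypothetical placeholder; the real sequence 4a (2158 nucleotides) is not reproduced here.
placeholder_query = "ACGT ACGT\nTTGA"
fields = build_blast_submission(placeholder_query)
encoded_body = urlencode(fields)
print(BLAST_URL)
print(encoded_body)
```

Keeping the request as a plain dictionary makes the search settings (organism restriction and similarity mode) explicit and easy to record alongside the results.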
Results
Non-coding sequence
A total of 11 sequences were returned by NCBI Nucleotide BLAST, confirming the identity of
sequence 4a as the non-coding HOX antisense intergenic RNA (HOTAIR). HOTAIR showed 100% identity
to the query across its full length of 2158 nucleotides. The search was restricted to the
Homo sapiens genome, and only highly similar sequences were selected (Figure 1).
Figure 1. Output from NCBI Nucleotide BLAST using nucleotide sequence 4a (NCBI, 2021). Sequences similar to
query sequence 4a are listed, with the search restricted to the Homo sapiens genome.
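The two quantities used to judge a BLAST hit, percentage identity and query coverage, can be illustrated with a short calculation. The sketch below is a minimal illustration and not the BLAST implementation: it assumes a pre-aligned pair of sequences with gaps written as "-", and the function names and toy sequences are hypothetical. Only the query length of 2158 nucleotides comes from this report.

```python
def percent_identity(aligned_query: str, aligned_subject: str) -> float:
    """Percentage of alignment columns in which both sequences carry the same base."""
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_query:
        raise ValueError("alignment is empty")
    # A column counts as a match only if the bases agree and neither is a gap.
    matches = sum(
        q == s and q != "-"
        for q, s in zip(aligned_query.upper(), aligned_subject.upper())
    )
    return 100.0 * matches / len(aligned_query)


def query_coverage(query_start: int, query_end: int, query_length: int) -> float:
    """Percentage of the query spanned by the alignment (1-based, inclusive coordinates)."""
    if not 1 <= query_start <= query_end <= query_length:
        raise ValueError("coordinates must satisfy 1 <= start <= end <= query length")
    return 100.0 * (query_end - query_start + 1) / query_length


# Toy alignments (hypothetical sequences, for illustration only).
identical = percent_identity("ACGTACGT", "ACGTACGT")
one_mismatch = percent_identity("ACGTACGT", "ACGAACGT")

# An alignment spanning positions 1 to 2158 of a 2158-nucleotide query covers all of it.
full_coverage = query_coverage(1, 2158, 2158)
print(identical, one_mismatch, full_coverage)
```

A hit is a full-length exact match only when both values are 100%; a high identity over a short aligned region would give a much lower coverage.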
Following the identification of sequence 4a, the extent of HOTAIR expression
within healthy tissue was examined using data from Genevisible (Figure 2).