W2 Ex Genbank-new-answers - 22111

Weekly exercise

Course: Introduction to Bioinformatics (22111)

University: Danmarks Tekniske Universitet

Academic year: 2021/2022

ExGenbank-new-answers

From 22111

Note: numbers in Part 2 and Part 3 are updated on February 7, 2022.

Part 1

QUESTION 1.1

a) Inspecting the FEATURE table of the entry reveals that two CDS regions are defined; therefore there are two genes in this entry. As stated on the GenBank hand-out, "CDS" is the most stable definition of a protein-coding gene used in the GenBank format. Sometimes "gene" will also be present, but CDS is more commonly used.

b) Columba livia (Rock pigeon / domestic pigeon)

c) The HEADER contains general information about the entry: organism, publication references, keywords, accession ID, etc. The FEATURE table contains information that refers to coordinates in the DNA sequence, for example the definition of CDS regions.

QUESTION 1.2

a) Since the FEATURE table has been thrown away, we no longer have the coordinates for the genes. The genes are still "in there" somewhere, but we cannot find them without using external information.

b) The entire "ORIGIN" block (all the DNA sequence) has been converted to FASTA format. The FEATURE table is discarded. From the HEADER block, the definition (title) and accession number are preserved; the rest is discarded.
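The conversion described in b) can be sketched in a few lines of Python. This is only an illustration of which parts survive, not the converter used in the exercise: the function name `genbank_to_fasta`, the exact header layout (accession followed by the definition) and the toy entry `EXAMPLE` are assumptions made for this sketch.

```python
def genbank_to_fasta(genbank_text: str, width: int = 60) -> str:
    """Rough sketch: keep accession + definition + sequence, drop everything else."""
    accession = ""
    definition_parts = []
    sequence_parts = []
    section = None  # which block of the entry we are currently reading

    for line in genbank_text.splitlines():
        if line.startswith("DEFINITION"):
            section = "definition"
            definition_parts.append(line[len("DEFINITION"):].strip())
        elif line.startswith("ACCESSION"):
            section = None
            fields = line.split()
            accession = fields[1] if len(fields) > 1 else ""
        elif line.startswith("ORIGIN"):
            section = "origin"
        elif line.startswith("//"):
            break  # end of the entry
        elif section == "definition" and line.startswith(" "):
            # continuation line of a multi-line DEFINITION
            definition_parts.append(line.strip())
        elif section == "origin":
            # strip the position numbers and spaces, keep only the letters
            sequence_parts.append("".join(ch for ch in line if ch.isalpha()))
        else:
            # any other header or FEATURE line is discarded
            section = None

    definition = " ".join(definition_parts)
    sequence = "".join(sequence_parts)
    chunks = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([f">{accession} {definition}"] + chunks) + "\n"


# Made-up miniature entry, for illustration only.
EXAMPLE = """\
LOCUS       TEST0001                  12 bp    DNA     linear   VRT 01-JAN-2000
DEFINITION  Hypothetical test entry,
            complete cds.
ACCESSION   X00000
FEATURES             Location/Qualifiers
     CDS             1..12
ORIGIN
        1 atggcctaat ga
//
"""

print(genbank_to_fasta(EXAMPLE))
```

Note that the CDS line of the toy entry never reaches the output, which is exactly why the gene coordinates are lost in a).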

QUESTION 1.3

The downloaded file has Unix line endings. Remember from the jEdit exercise that line endings are indicated by the letters "U", "W" or "M" in the lower right-hand corner of the jEdit window.
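The same check can be made outside jEdit by counting the end-of-line bytes in the file. The sketch below assumes the usual conventions (Unix: LF, Windows: CR+LF, classic Mac: CR) and reuses the jEdit letters as return values; the function name `detect_line_endings` is made up for this example.

```python
def detect_line_endings(data: bytes) -> str:
    """Return "U", "W" or "M" for the dominant line ending, or "none"."""
    crlf = data.count(b"\r\n")        # Windows: carriage return + line feed
    lf = data.count(b"\n") - crlf     # Unix: lone line feed
    cr = data.count(b"\r") - crlf     # classic Mac: lone carriage return
    counts = {"U": lf, "W": crlf, "M": cr}
    if not any(counts.values()):
        return "none"
    return max(counts, key=counts.get)


print(detect_line_endings(b">seq1\nACGT\n"))      # Unix-style file
print(detect_line_endings(b">seq1\r\nACGT\r\n"))  # Windows-style file
```

For a real file, read it in binary mode (`open(path, "rb").read()`) so that Python does not translate the line endings before they are counted.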

QUESTION 1.4

a) The "join" statement defines how to extract the coding sequence from the entire length of DNA in the entry: "join(1104..1192,1306..1510,1614..1742)" is basically a recipe stating that the three intervals should be pasted together. The result is the protein-coding part of the gene: the coding exons glued together. The CDS will always start with a START codon (e.g. ATG) and end with a STOP codon (e.g. TAA).

b) The gene contains three coding exons. Note: from a CDS definition we don't get any information about the UnTranslated Regions (UTRs) that are often found before and after the coding region in the mRNA.
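The "recipe" in a) can be written out as a short Python sketch. It assumes plain forward-strand locations with 1-based, inclusive coordinates (no "complement(...)" and no "<" or ">" markers); the helper names `parse_join` and `extract_cds` and the toy sequence are made up for illustration.

```python
import re


def parse_join(location: str) -> list[tuple[int, int]]:
    """Turn 'join(3..5,8..10)' into [(3, 5), (8, 10)]."""
    return [(int(a), int(b)) for a, b in re.findall(r"(\d+)\.\.(\d+)", location)]


def extract_cds(sequence: str, location: str) -> str:
    """Paste the listed intervals together (GenBank positions are 1-based, inclusive)."""
    return "".join(sequence[start - 1:end] for start, end in parse_join(location))


# Toy sequence: exons in upper case, everything else in lower case.
toy = "ccATGccGCAccTAAcc"
print(extract_cds(toy, "join(3..5,8..10,13..15)"))  # ATGGCATAA

# Exon lengths for the join statement discussed above.
exons = parse_join("join(1104..1192,1306..1510,1614..1742)")
print([end - start + 1 for start, end in exons])    # [89, 205, 129]
```

The three exon lengths add up to 423 nucleotides, i.e. 141 codons including the STOP codon, which is consistent with a complete coding sequence.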

QUESTION 1.5

The first number is the Gene Identifier (taken from the VERSION line in the header). The subsequent numbers are the positions (coordinates) in the original gene entry (taken from the join line).

Part 2

QUESTION 2.1.1

a) 210,468 hits

b) No. For example, the first hit, M57671.1, "Octodon degus insulin mRNA, complete cds", is from a degu (http://en.wikipedia.org/wiki/Degu), a rat-like rodent from Chile. In fact, you can see on the right side of the results page that only 11,216 hits are from human. There is no reason to expect only human results from GenBank, since it is not a human-centric database.

c) No. There are many hits to complete or partial chromosome sequences, which contain a lot of other genes. An example is JWIN03000075.1, "Camelus dromedarius breed African isolate Drom800 Contig74, whole genome shotgun sequence".

QUESTION 2.

a) In the Search details box, you find "insulin[All Fields]".

QUESTION 2.

a) 17,520 hits.

b) Yes, it is among the hits on the first page of results.

Title: Homo sapiens insulin (INS) gene, complete cds Accession: AH

c) ("Homo sapiens"[Organism] OR human[All Fields]) AND insulin[All Fields]

QUESTION 2.

a) 5,431 hits.

b) Yes (except for 10 hits that are synthetic constructs, but based on human sequence). See the "Top Organisms" box on the right.

c) No. There are many examples of insulin-degrading enzyme, insulin-like growth factor, insulin receptor and insulin-induced genes. Many entries are mRNA and therefore not gene entries.

QUESTION 2.

a) 9 hits.

b) 13 hits.

c) Accession codes: AH002844 J00265 J00268; locus name: AH002844; definition (title): "Human insulin gene, complete cds".

QUESTION 2.

The important thing here is not the precise search string, but that you understand the principle of using "kill-words". One possible answer could be:

insulin[title] complete[title] NOT mRNA[title] NOT receptor[title] NOT receptor-like[title] NOT "insulin like"[title] NOT "insulin degrading"[title] NOT "growth factor"[title] NOT "family member"[title] NOT "insulin induced"[title] NOT "insulin dependent"[title] NOT "insulin promoter"[title]

which gives 18 hits, representing 12 organisms and some synthetic constructs.

Note: double quotes ("") are used to add two-word "kill phrases".

Note: don't kill "insulin precursor"! Insulin is always synthesized as a precursor, preproinsulin, that contains a signal peptide, a propeptide, and the two mature chains. More about insulin in the exercises next week.
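To see what the kill-words do, the title logic can be imitated locally on a handful of titles taken from this answer sheet. This is a rough emulation only: the real GenBank search indexes titles in its own way, and the word-based matching rule used here (case-insensitive, punctuation ignored) is an assumption, as are the helper names.

```python
import re


def normalise(text: str) -> str:
    """Lower-case, drop punctuation, and pad with spaces so whole words can be matched."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return " " + " ".join(words) + " "


def title_matches(title: str, required: list[str], kill_words: list[str]) -> bool:
    """True if every required word/phrase is in the title and no kill-word is."""
    haystack = normalise(title)

    def has(phrase: str) -> bool:
        return normalise(phrase) in haystack

    return all(has(w) for w in required) and not any(has(w) for w in kill_words)


titles = [
    "Homo sapiens insulin (INS) gene, complete cds",
    "Octodon degus insulin mRNA, complete cds",
    "Homo sapiens insulin receptor (INSR) gene, complete cds",
    "Camelus dromedarius breed African isolate Drom800 Contig74, "
    "whole genome shotgun sequence",
]

kept = [t for t in titles
        if title_matches(t, ["insulin", "complete"], ["mRNA", "receptor"])]
print(kept)
```

Only the insulin gene entry survives: the degu entry is removed by the kill-word "mRNA", the receptor entry by "receptor", and the camel contig never matched "insulin" in the title at all.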

Part 3

QUESTION 3.

It's a good idea to separate the two logical parts of the search string:

One for narrowing down the species:

(rat[ORGANISM] OR mouse[ORGANISM])

And one for actually searching for insulin:

QUESTION 3.

Like in 3, it will never be possible to do this query perfectly - a good attempt could be:

actin[title] AND actin[protein name] NOT mRNA[title] NOT partial[title]

which yields 409 hits.

Note that this will miss entries that are not annotated with "Protein name". Alternatively, you could search with the "Title" field, but that requires a lot of "kill words":

actin[title] complete[title] NOT mRNA[title] NOT pseudogene[title] NOT regulator[title] NOT binding[title] NOT associated[title] NOT related[title]

yields 925 hits and still requires some cleanup.

QUESTION 3.

human[organism] "insulin receptor"[title] NOT mRNA[title] NOT substrate[title] NOT partial[title]

gives 74 hits, with #1 or #2 being the right one:

NG_008852 Homo sapiens insulin receptor (INSR), RefSeqGene on chromosome 19

AH002851 Homo sapiens insulin receptor (INSR) gene, complete cds
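The search strings in Part 3 all follow the same pattern: one part that narrows down the species, one part that names what to look for, and a list of kill-words. A minimal sketch of assembling such a string in Python is shown below. The function names are hypothetical, and the sketch writes the field tag as "[organism]" throughout, on the assumption that the search treats field tags case-insensitively.

```python
def field_term(word: str, field: str) -> str:
    """Format one search term; multi-word phrases are wrapped in double quotes."""
    text = f'"{word}"' if " " in word else word
    return f"{text}[{field}]"


def organism_clause(organisms: list[str]) -> str:
    """Species part of the query; several organisms are OR-ed inside parentheses."""
    if not organisms:
        raise ValueError("at least one organism is required")
    terms = [field_term(o, "organism") for o in organisms]
    return terms[0] if len(terms) == 1 else "(" + " OR ".join(terms) + ")"


def build_query(organisms: list[str], required_title: list[str],
                kill_title: list[str]) -> str:
    """Combine the species part, the required title terms and the kill-words."""
    parts = [organism_clause(organisms)]
    parts += [field_term(w, "title") for w in required_title]
    query = " ".join(parts)
    for word in kill_title:
        query += " NOT " + field_term(word, "title")
    return query


print(organism_clause(["rat", "mouse"]))
print(build_query(["human"], ["insulin receptor"], ["mRNA", "substrate", "partial"]))
```

The second call reproduces the insulin receptor search string given above; keeping the kill-words in a list makes it easy to add or remove one and rerun the search.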

Retrieved from "https://teaching.healthtech.dtu.dk/22111/index.php?title=ExGenbank-new-answers&oldid=1153"

This page was last modified on 7 February 2022, at 15:31.
