Transposable elements

Can knowledge graphs help us understand transposons better?

Genetics

Anna Fensel

AI & Data science

What is this about?

We want to construct a knowledge graph (KG) to describe transposable elements better.

We will do this by:

  1. Getting funded …

  2. Creating an overview of TE knowledge.

  3. Applying this KG on a small use case.

  4. Generalising the KG to the wider TE domain.

Who are we?

Me

  • BSc in Life Sciences at the HAN.

  • MSc programme in Bioinformatics at WUR.

  • MSc thesis on data FAIRification.

Collaborators

  • Anna Fensel

  • Like Fokkens

  • Daniel Croll

  • Clément Goubert

  • The TE Hub

What is the issue?

  • Lots of databases.

  • Querying between these databases.

  • Finding in which hosts a TE occurs.

  • Relating this to TE characteristics.

Overview of which TEs are described per organism type. Figure from [1]

A knowledge graph may be the answer

The Knowledge graph

By combining graphs we can make new inferences.

Figure 1: Knowledge graph alignment. [Diagram of two small graphs joined by a "same as" link. Node labels: Arnold S., Schwarzenegger, California, Sacramento, Predator 1987, 20th century studios, Die Hard 1988. Edge labels: Governor of, Capital of, Starred in, Produced (twice), same as.]
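To make the alignment idea concrete, here is a minimal Python sketch using plain tuples as triples. The node and edge labels come from Figure 1; the edge directions, the predicate spellings and the helper names (`align`, `studios_behind_governors`) are assumptions made for illustration, not part of the proposal.

```python
# Toy triples (subject, predicate, object). Labels follow Figure 1;
# edge directions and predicate spellings are assumptions.
politics = {
    ("Schwarzenegger", "governor_of", "California"),
    ("Sacramento", "capital_of", "California"),
}
films = {
    ("Arnold S.", "starred_in", "Predator 1987"),
    ("20th century studios", "produced", "Predator 1987"),
    ("20th century studios", "produced", "Die Hard 1988"),
}

# The "same as" link: both names refer to the same person.
same_as = {"Arnold S.": "Schwarzenegger"}


def align(graph, mapping):
    """Rewrite node names so that aligned nodes share one identifier."""
    return {(mapping.get(s, s), p, mapping.get(o, o)) for s, p, o in graph}


def studios_behind_governors(graph):
    """Studios that produced a film starring someone listed as a governor."""
    governors = {s for s, p, _ in graph if p == "governor_of"}
    starring = {o for s, p, o in graph if p == "starred_in" and s in governors}
    return {s for s, p, o in graph if p == "produced" and o in starring}


unaligned = politics | films
merged = politics | align(films, same_as)

print(studios_behind_governors(unaligned))  # no link between the two graphs
print(studios_behind_governors(merged))     # the question can now be answered
```

Without the "same as" link the two graphs share no node, so the question spanning both has no answer; after alignment it does.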

The aim & research questions

Can a KG provide a better way for the TE community to interact with its data?

  • How can a KG describe the classification of TEs?

  • Is the use of a model organism useful to trial a KG?

  • Can KGs be used for data integration in the TE domain?

  • How can KGs be used for complex question answering in the TE domain?
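As a rough illustration of the first question, a classification can be stored as "subclass of" edges and queried by walking up the hierarchy. The sketch below uses a few names from the classification of Wicker et al. [6]; the selection is illustrative, the subclass level is left out for brevity, and the helper names (`lineage`, `members`) are hypothetical.

```python
# child -> parent ("subclass of") edges. Names follow Wicker et al. [6];
# only a small, simplified selection is shown.
subclass_of = {
    "Copia": "LTR",
    "Gypsy": "LTR",
    "LTR": "Class I (retrotransposons)",
    "LINE": "Class I (retrotransposons)",
    "Tc1-Mariner": "TIR",
    "hAT": "TIR",
    "TIR": "Class II (DNA transposons)",
    "Class I (retrotransposons)": "Transposable element",
    "Class II (DNA transposons)": "Transposable element",
}


def lineage(term):
    """Return the term followed by all of its ancestors, nearest first."""
    path = [term]
    while path[-1] in subclass_of:
        path.append(subclass_of[path[-1]])
    return path


def members(ancestor):
    """Return every term that sits anywhere below the given ancestor."""
    return sorted(t for t in subclass_of if ancestor in lineage(t)[1:])


print(lineage("Copia"))
print(members("Class I (retrotransposons)"))
```

In a real KG the same walk would be a transitive subclass query over the ontology rather than a Python loop.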

What has been done already?

How did other people unify biological data?

Centralisation of TE databases

TE Hub [12] gives an overview of most of the current data systems.

Knowledge graph creation

SPARQLing-genomics data layers [13]

GenomicKB [14]

WikiPathways [15]

UniProt [16]

More examples

  • OrthoDB

  • Rhea

TEs in A. thaliana

Confais et al. [17] built a TE knowledge graph for Arabidopsis thaliana.

How

Figure 2: Suggested work plan for this proposal. Figure made with Mermaid. [Flowchart labels: Get a funding source; Project; WP1 – Producing the KG; WP2 – Integrating data; WP3 – Analyse the KG; WP4 – Generalise the KG; Usecases; Usecase 1; Usecase 2.]

Funding

NWO-ENW grant

Supports a full PhD project. 10% acceptance rate.

Dutch Open Science fund

Supports a short, one-year project. The first WP can be funded this way.

ZonMw

A call geared towards medical applications.

Experimental plant science group PhD programme

A call from Wageningen University.

WP1–KG construction

  • Take the Zymoseptoria tritici fungal model as a start.

  • Ontology engineering:

    • Sequence Ontology [19].

    • Gene Ontology [18].

    • ACLAME ontology [20].

  • Relevant data types and annotations.

    • Distance to the closest gene.
    • …
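One of the annotations above, the distance to the closest gene, can be computed directly from genome coordinates. The sketch below assumes closed intervals on the same coordinate system, ignores strand, and reports 0 bp when a TE overlaps a gene; the records, identifiers and predicate names are made up for illustration.

```python
# Hypothetical annotation records; coordinates are closed intervals in bp.
genes = [
    {"id": "geneA", "chrom": "chr1", "start": 1000, "end": 2000},
    {"id": "geneB", "chrom": "chr1", "start": 5000, "end": 6000},
    {"id": "geneC", "chrom": "chr2", "start": 100, "end": 900},
]
tes = [
    {"id": "TE1", "chrom": "chr1", "start": 2500, "end": 3000},
    {"id": "TE2", "chrom": "chr1", "start": 5500, "end": 5800},
    {"id": "TE3", "chrom": "chr3", "start": 10, "end": 50},
]


def gap_bp(a_start, a_end, b_start, b_end):
    """Distance in bp between two intervals; 0 when they overlap."""
    return max(0, b_start - a_end, a_start - b_end)


def closest_gene(te, genes):
    """Return (distance, gene_id) for the nearest gene on the TE's chromosome.

    Returns None when the chromosome carries no annotated gene.
    """
    candidates = [
        (gap_bp(te["start"], te["end"], g["start"], g["end"]), g["id"])
        for g in genes
        if g["chrom"] == te["chrom"]
    ]
    return min(candidates) if candidates else None


# Turn the result into triples that could be loaded into the KG.
triples = []
for te in tes:
    hit = closest_gene(te, genes)
    if hit is not None:
        distance, gene_id = hit
        triples.append((te["id"], "closest_gene", gene_id))
        triples.append((te["id"], "distance_to_closest_gene_bp", distance))

for triple in triples:
    print(triple)
```

The scan over all genes is fine for a sketch; for a full genome, a sorted index or an interval tree would be the more usual choice.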

WP2–Integration

  • Include literature annotations
    • Expression studies.
    • Population genomes.
  • Link to databases:
    • Dfam and RepetDB: federated queries.
    • For databases that only provide a dump: ingestion and storage.
  • Data processing:
    • Annotate the gene models on the TE consensus sequences.

WP3–Usecases

  1. Are there TEs that are implicated in a phenotypic trait (fungicide resistance or climate adaptation)? Are these TEs the same throughout the whole population of Z. tritici?

  2. Can the characteristics of a TE (transposase sequence, length, etc.) explain the differing number of TE insertions in Z. tritici populations?
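The first use case can be phrased as a query over insertion records. A minimal sketch, assuming each insertion has already been linked to its nearest gene and each trait to a set of genes; all isolates, TE families and genes below are invented placeholders, and `families_near_trait` is a hypothetical helper.

```python
# Hypothetical toy data: (isolate, TE family, gene nearest to the insertion).
insertions = [
    ("isolate_1", "TE_fam_A", "gene_x"),
    ("isolate_2", "TE_fam_A", "gene_x"),
    ("isolate_3", "TE_fam_A", "gene_x"),
    ("isolate_1", "TE_fam_B", "gene_x"),
    ("isolate_2", "TE_fam_C", "gene_y"),
]

# Hypothetical trait-to-gene links, e.g. taken from literature annotations.
trait_genes = {"fungicide resistance": {"gene_x"}}


def families_near_trait(trait, insertions, trait_genes):
    """Map each TE family inserted near a trait gene to True/False.

    True means the family is found near a trait gene in every isolate.
    """
    genes = trait_genes[trait]
    all_isolates = {isolate for isolate, _, _ in insertions}
    seen_in = {}
    for isolate, family, gene in insertions:
        if gene in genes:
            seen_in.setdefault(family, set()).add(isolate)
    return {family: isolates == all_isolates for family, isolates in seen_in.items()}


result = families_near_trait("fungicide resistance", insertions, trait_genes)
for family, in_all in sorted(result.items()):
    print(family, "present in all isolates" if in_all else "present in some isolates")
```

Proximity to a trait gene is only a candidate filter here; whether a TE actually contributes to the trait would still need the expression and population data from WP2.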

WP4–Generalisation

  • Based on the usecase experience, improve the KG.

  • Generalise it for fungi.

  • Host workshops to teach the usage of the KG.

  • Host it at TE Hub.

Timeline

Figure 3: PhD project timeline. [Gantt chart, time axis from 0 to 16. Sections: WP1, WP2, WP3, WP4, writing. Tasks: Create the ontology, Data preparation, Data integration, Ch. 1, KG analysis, Ch. 2, Ch. 3, Generalising the KG, Finalise thesis, Dissemination, Ch. 4.]

Questions to you

  • How far should I try to generalise it?

  • How should I balance storing/referencing data?

  • What usecases do you want to answer?

Additional slides for discussion

  1. Risks

Risks

  • KGs are an emerging technology: not every user understands them.

  • Creating a standardised pipeline system is hard: there are many methods available, each with a specific purpose, and integrating them is a challenge.

  • Some data may not be easily available: some websites only provide a data dump, not an API. In this case, special provisions are needed for data ingestion and storage.

More information

I thank my advisors Johana Rhodes, Mariana Silva and Christopher Watamba for their help.

My repository

Cited works

1.
Rodriguez F, Arkhipova IR (2023) An Overview of Best Practices for Transposable Element Identification, Classification, and Annotation in Eukaryotic Genomes. In: Branco MR, de Mendoza Soler A (eds) Transposable Elements: Methods and Protocols. Springer US, New York, NY, pp 1–23
2.
Storer J, Hubley R, Rosen J, Wheeler TJ, Smit AF (2021) The Dfam community resource of transposable element families, sequence models, and genome annotations. Mobile DNA 12(1):2. https://doi.org/10.1186/s13100-020-00230-y
3.
Ross K, Varani AM, Snesrud E, et al (2021) TnCentral: A Prokaryotic Transposable Element Database and Web Portal for Transposon Analysis. mBio 12(5):e0206021. https://doi.org/10.1128/mBio.02060-21
4.
Amselem J, Cornut G, Choisne N, et al (2019) RepetDB: A unified resource for transposable element references. Mobile DNA 10(1):6. https://doi.org/10.1186/s13100-019-0150-y
5.
Bao W, Kojima KK, Kohany O (2015) Repbase Update, a database of repetitive elements in eukaryotic genomes. Mobile DNA 6(1):11. https://doi.org/10.1186/s13100-015-0041-9
6.
Wicker T, Sabot F, Hua-Van A, et al (2007) A unified classification system for eukaryotic transposable elements. Nat Rev Genet 8(12):973–982. https://doi.org/10.1038/nrg2165
7.
Tansirichaiya S, Rahman MA, Roberts AP (2019) The Transposon Registry. Mobile DNA 10(1):40. https://doi.org/10.1186/s13100-019-0182-3
8.
Du J, Grant D, Tian Z, et al (2010) SoyTEdb: A comprehensive database of transposable elements in the soybean genome. BMC Genomics 11(1):113. https://doi.org/10.1186/1471-2164-11-113
9.
Shao F, Wang J, Xu H, Peng Z (2018) FishTEDB: A collective database of transposable elements identified in the complete genomes of fish. Database (Oxford) 2018:bax106. https://doi.org/10.1093/database/bax106
10.
Xu Z, Liu J, Ni W, et al (2017) GrTEdb: The first web-based database of transposable elements in cotton (Gossypium raimondii). Database (Oxford) 2017:bax013. https://doi.org/10.1093/database/bax013
11.
Gu X, Wang M, Zhang X-O (2024) TE-TSS: An integrated data resource of human and mouse transposable element (TE)-derived transcription start site (TSS). Nucleic Acids Research 52(D1):D322–D333. https://doi.org/10.1093/nar/gkad1048
12.
Elliott TA, Heitkam T, Hubley R, Quesneville H, Suh A, Wheeler TJ (2021) TE Hub: A community-oriented space for sharing and connecting tools, data, resources, and methods for transposable element annotation. Mobile DNA 12(1):1–5. https://doi.org/10.1186/s13100-021-00244-0
13.
14.
Feng F, Tang F, Gao Y, et al (2023) GenomicKB: A knowledge graph for the human genome. Nucleic Acids Research 51(D1):D950–D956. https://doi.org/10.1093/nar/gkac957
15.
Home | WikiPathways. https://www.wikipathways.org/. Accessed 2 Feb 2024
16.
The UniProt Consortium (2019) UniProt: A worldwide hub of protein knowledge. Nucleic Acids Research 47(D1):D506–D515. https://doi.org/10.1093/nar/gky1049
17.
Confais J, Wan M, Saidi S, Francillonne N, Quesneville H (2022) Transposable elements, from their annotation to their integration into knowledge graphs. In: Journée thématique - Annotation, Intelligence Artificielle et Text-mining. PEPI IBIS, Jouy-en-Josas, France
18.
Ashburner M, Ball CA, Blake JA, et al (2000) Gene Ontology: Tool for the unification of biology. Nat Genet 25(1):25–29. https://doi.org/10.1038/75556
19.
Sant DW, Sinclair M, Mungall CJ, et al (2021) Sequence Ontology terminology for gene regulation. Biochimica et Biophysica Acta (BBA) - Gene Regulatory Mechanisms 1864(10):194745. https://doi.org/10.1016/j.bbagrm.2021.194745
20.
Leplae R, Lima-Mendez G, Toussaint A (2010) ACLAME: A CLAssification of Mobile genetic Elements, update 2010. Nucleic Acids Research 38:D57–D61. https://doi.org/10.1093/nar/gkp938