Prepare the gene annotation databases

This function prepares the gene annotation databases for a given species and set of annotation sources. It retrieves the necessary information from various annotation packages or external resources and organizes it into a list. The list contains the annotation data for each specified annotation source.

Usage

PrepareDB(
  species = c("Homo_sapiens", "Mus_musculus"),
  db = c("GO", "GO_BP", "GO_CC", "GO_MF", "KEGG", "WikiPathway", "Reactome", "CORUM",
    "MP", "DO", "HPO", "PFAM", "CSPA", "Surfaceome", "SPRomeDB", "VerSeDa", "TFLink",
    "hTFtarget", "TRRUST", "JASPAR", "ENCODE", "MSigDB", "CellTalk", "CellChat",
    "Chromosome", "GeneType", "Enzyme", "TF"),
  db_IDtypes = c("symbol", "entrez_id", "ensembl_id"),
  db_version = "latest",
  db_update = FALSE,
  convert_species = TRUE,
  Ensembl_version = 103,
  mirror = NULL,
  biomart = NULL,
  max_tries = 5,
  custom_TERM2GENE = NULL,
  custom_TERM2NAME = NULL,
  custom_species = NULL,
  custom_IDtype = NULL,
  custom_version = NULL
)

Arguments

species: A character vector specifying the species for which the gene annotation databases should be prepared. Default is c("Homo_sapiens", "Mus_musculus").
db: A character vector specifying the annotation sources to be included in the gene annotation databases. Default is c("GO", "GO_BP", "GO_CC", "GO_MF", "KEGG", "WikiPathway", "Reactome", "CORUM", "MP", "DO", "HPO", "PFAM", "CSPA", "Surfaceome", "SPRomeDB", "VerSeDa", "TFLink", "hTFtarget", "TRRUST", "JASPAR", "ENCODE", "MSigDB", "CellTalk", "CellChat", "Chromosome", "GeneType", "Enzyme", "TF").
db_IDtypes: A character vector specifying the desired ID types to be used for gene identifiers in the gene annotation databases. Default is c("symbol", "entrez_id", "ensembl_id").
db_version: A character vector specifying the version of the gene annotation databases to be retrieved. Default is "latest".
db_update: A logical value indicating whether the gene annotation databases should be forcefully updated. If set to FALSE, the function will attempt to load the cached databases instead. Default is FALSE.
convert_species: A logical value indicating whether to use a species-converted database when the annotation is missing for the specified species. Default is TRUE.
Ensembl_version: The Ensembl release to use. If NULL, the current release is used. Default is 103.
mirror: The Ensembl mirror to connect to. Valid options are "www", "uswest", "useast", and "asia". Default is NULL.
biomart: The name of the BioMart database to connect to. Possible options include "ensembl", "protists_mart", "fungi_mart", and "plants_mart". Default is NULL.
max_tries: The maximum number of attempts to connect to the BioMart service. Default is 5.
custom_TERM2GENE: A data frame containing a custom TERM2GENE mapping for the specified species and annotation source. Default is NULL.
custom_TERM2NAME: A data frame containing a custom TERM2NAME mapping for the specified species and annotation source. Default is NULL.
custom_species: A character vector specifying the species name to be used in a custom database. Default is NULL.
custom_IDtype: A character vector specifying the ID type to be used in a custom database. Default is NULL.
custom_version: A character vector specifying the version to be used in a custom database. Default is NULL.
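
The custom_* arguments are meant to be supplied together. The following is a minimal sketch rather than a definitive recipe. It assumes that custom_TERM2GENE and custom_TERM2NAME are two-column data frames (term/gene and term/name; the column names, term names, and the database name "MyGeneSets" are hypothetical) and that a custom source is requested by passing its name in db. The call is guarded so the snippet does nothing when `PrepareDB` is not available in the session.

```r
# Toy gene sets; term names and memberships are illustrative only.
custom_TERM2GENE <- rbind(
  data.frame(term = "SET_A", gene = c("TP53", "MDM2", "CDKN1A")),
  data.frame(term = "SET_B", gene = c("MKI67", "TOP2A"))
)

# Optional human-readable names for each term (assumed two-column layout).
custom_TERM2NAME <- data.frame(
  term = c("SET_A", "SET_B"),
  name = c("Example set A", "Example set B")
)

# Basic sanity checks before handing the tables to PrepareDB.
stopifnot(
  ncol(custom_TERM2GENE) == 2,
  all(custom_TERM2GENE$term %in% custom_TERM2NAME$term)
)

# Run only when PrepareDB has been loaded into the session.
if (exists("PrepareDB", mode = "function")) {
  db_list <- PrepareDB(
    species = "Homo_sapiens",
    db = "MyGeneSets",  # hypothetical name for the custom source
    custom_TERM2GENE = custom_TERM2GENE,
    custom_TERM2NAME = custom_TERM2NAME,
    custom_species = "Homo_sapiens",
    custom_IDtype = "symbol",
    custom_version = "v1"
  )
}
```

Keeping custom_IDtype consistent with the identifiers actually present in the gene column ("symbol" here) matters, since that is how the function is told what kind of IDs the custom table contains.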

Value

A list containing the prepared gene annotation databases:

TERM2GENE: mapping of gene identifiers to terms
TERM2NAME: mapping of terms to their names
semData: semantic similarity data for gene sets (only for Gene Ontology terms)
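
As an illustration of how these elements might be used together, the sketch below builds a toy stand-in for the returned list and joins TERM2GENE with TERM2NAME. The nesting by species and then by annotation source, and the column names Term, Name, and symbol, are assumptions made for this example; inspect the real object with `str()` before relying on them.

```r
# Toy stand-in for the returned list; real contents come from PrepareDB.
db_list <- list(
  Homo_sapiens = list(
    GO_BP = list(
      TERM2GENE = data.frame(
        Term = c("GO:0006915", "GO:0006915", "GO:0007049"),
        symbol = c("TP53", "BAX", "CDK1")
      ),
      TERM2NAME = data.frame(
        Term = c("GO:0006915", "GO:0007049"),
        Name = c("apoptotic process", "cell cycle")
      )
    )
  )
)

# Select one species and one annotation source (assumed nesting).
go_bp <- db_list[["Homo_sapiens"]][["GO_BP"]]

# Attach term names to the term-gene pairs.
annotated <- merge(go_bp[["TERM2GENE"]], go_bp[["TERM2NAME"]], by = "Term")

# Collect the genes belonging to each named term.
genes_by_term <- split(as.character(annotated$symbol), annotated$Name)
```

The resulting `genes_by_term` is a plain named list of gene vectors, which is a convenient shape for passing gene sets to other tools.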

Details

In addition to the built-in annotation sources listed under db, `PrepareDB` supports creating custom databases from user-provided gene sets, supplied through the custom_TERM2GENE, custom_TERM2NAME, custom_species, custom_IDtype, and custom_version arguments. When db_update is FALSE, the function attempts to load cached databases rather than rebuilding them.

Examples
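
A minimal usage sketch, assuming the package that provides `PrepareDB` has already been loaded and that network access is available for retrieving the annotations. The choice of species and annotation sources is illustrative, and the call is guarded so that nothing runs when the function is not present.

```r
# Restrict the request to one species and two annotation sources.
species <- "Homo_sapiens"
db <- c("GO_BP", "KEGG")

if (exists("PrepareDB", mode = "function")) {
  # db_update = FALSE asks for cached databases to be reused when possible.
  db_list <- PrepareDB(species = species, db = db, db_update = FALSE)

  # Show the top levels of the returned list without printing its contents.
  str(db_list, max.level = 2)
}
```

Requesting only the sources that are needed keeps the first run shorter than preparing every source in the default db vector.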