Integrate single-cell RNA-seq data using various integration methods.

Usage

Integration_SCP(
  srtMerge = NULL,
  batch,
  append = TRUE,
  srtList = NULL,
  assay = NULL,
  integration_method = "Uncorrected",
  do_normalization = NULL,
  normalization_method = "LogNormalize",
  do_HVF_finding = TRUE,
  HVF_source = "separate",
  HVF_method = "vst",
  nHVF = 2000,
  HVF_min_intersection = 1,
  HVF = NULL,
  do_scaling = TRUE,
  vars_to_regress = NULL,
  regression_model = "linear",
  scale_within_batch = FALSE,
  linear_reduction = "pca",
  linear_reduction_dims = 50,
  linear_reduction_dims_use = NULL,
  linear_reduction_params = list(),
  force_linear_reduction = FALSE,
  nonlinear_reduction = "umap",
  nonlinear_reduction_dims = c(2, 3),
  nonlinear_reduction_params = list(),
  force_nonlinear_reduction = TRUE,
  neighbor_metric = "euclidean",
  neighbor_k = 20L,
  cluster_algorithm = "louvain",
  cluster_resolution = 0.6,
  seed = 11,
  ...
)

Arguments

srtMerge

A merged Seurat object that includes the batch information.

batch

A character string specifying the batch variable name.

append

Logical, if TRUE, the integrated data will be appended to the original Seurat object (srtMerge).

srtList

A list of Seurat objects to be checked and preprocessed.

assay

The name of the assay to be used for downstream analysis.

integration_method

A character string specifying the integration method to use. Supported methods are: "Uncorrected", "Seurat", "scVI", "MNN", "fastMNN", "Harmony", "Scanorama", "BBKNN", "CSS", "LIGER", "Conos", "ComBat". Default is "Uncorrected".

do_normalization

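A logical value indicating whether data normalization should be performed. Default is NULL; in the example runs below, the function then checks whether each dataset is already log-normalized and normalizes only the ones that are not.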

normalization_method

The normalization method to be used. Possible values are "LogNormalize", "SCT", and "TFIDF". Default is "LogNormalize".
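
As a rough illustration of what "LogNormalize" computes, the base-R sketch below rescales each cell's counts to a fixed total and applies log1p. The scale factor of 10000 is the default of Seurat's NormalizeData and is assumed here; the toy matrix and the helper name log_normalize are hypothetical and not part of SCP.

```r
# Toy count matrix: genes in rows, cells in columns (hypothetical data).
counts <- matrix(
  c(10,  0, 5,
    90, 20, 5),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("geneA", "geneB"), c("cell1", "cell2", "cell3"))
)

# Hypothetical helper mirroring the LogNormalize formula:
# log1p(count / total counts of the cell * scale_factor).
log_normalize <- function(counts, scale_factor = 10000) {
  totals <- colSums(counts)
  log1p(sweep(counts, 2, totals, "/") * scale_factor)
}

norm <- log_normalize(counts)
print(round(norm, 3))
```

Zero counts stay at zero after the transformation, and cells with different sequencing depth become comparable because each column is divided by its own total.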

do_HVF_finding

A logical value indicating whether highly variable feature (HVF) finding should be performed. Default is TRUE.

HVF_source

The source of highly variable features. Possible values are "global" and "separate". Default is "separate".

HVF_method

The method for selecting highly variable features. Default is "vst".

nHVF

The number of highly variable features to select. Default is 2000.

HVF_min_intersection

The minimum number of batches in which a feature must be selected as highly variable in order to be kept as a highly variable feature. The default value is 1.
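
The effect of this threshold can be sketched in base R. The example assumes that the filter simply counts in how many per-batch HVF sets each feature appears; the gene sets and the helper filter_hvf are hypothetical and are not taken from the SCP source.

```r
# Hypothetical HVF sets found separately in three batches.
hvf_by_batch <- list(
  batch1 = c("INS", "GCG", "SST", "PPY"),
  batch2 = c("INS", "GCG", "KRT19"),
  batch3 = c("INS", "SST", "KRT19")
)

# Count in how many batches each feature was selected.
hvf_counts <- table(unlist(lapply(hvf_by_batch, unique)))

# Hypothetical helper: keep features selected in at least
# `min_intersection` batches.
filter_hvf <- function(hvf_counts, min_intersection = 1) {
  sort(names(hvf_counts)[hvf_counts >= min_intersection])
}

print(filter_hvf(hvf_counts, 1))  # union of all batch-level sets
print(filter_hvf(hvf_counts, 2))  # features shared by at least two batches
print(filter_hvf(hvf_counts, 3))  # features shared by all three batches
```

Raising the threshold shrinks the feature set, which matches the examples below, where HVF_min_intersection = 5 reduces the number of available HVF from 2000 to 264.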

HVF

A vector of highly variable features. Default is NULL.

do_scaling

A logical value indicating whether to perform scaling. If TRUE, the function scales the data using the ScaleData function. Default is TRUE.

vars_to_regress

A vector of variable names to include as additional regression variables. Default is NULL.

regression_model

The regression model to use for scaling. Options are "linear", "poisson", or "negativebinomial" (default is "linear").

scale_within_batch

Whether to scale data within each batch. Only valid when the integration_method is one of "Uncorrected", "Seurat", "MNN", "Harmony", "BBKNN", "CSS", "ComBat".
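
The difference between global and within-batch scaling can be illustrated with a small base-R sketch. It assumes that within-batch scaling means centering and scaling each batch separately; the values below are hypothetical, and the actual implementation in SCP may differ in detail.

```r
# Toy expression values of one gene in six cells from two batches (hypothetical).
expr  <- c(1, 2, 3, 11, 12, 13)
batch <- factor(c("A", "A", "A", "B", "B", "B"))

# Global scaling: a single mean and standard deviation for all cells.
scaled_global <- as.numeric(scale(expr))

# Within-batch scaling: each batch is centered and scaled on its own.
scaled_within <- ave(expr, batch, FUN = function(x) as.numeric(scale(x)))

print(round(scaled_global, 2))
print(scaled_within)
```

With global scaling the batch offset survives (all cells of batch A fall below zero), whereas within-batch scaling puts both batches on the same scale.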

linear_reduction

The linear dimensionality reduction method to use. Options are "pca", "svd", "ica", "nmf", "mds", or "glmpca" (default is "pca").

linear_reduction_dims

The number of dimensions to keep after linear dimensionality reduction (default is 50).

linear_reduction_dims_use

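The dimensions to use for downstream analysis. If NULL, the function chooses the dimensions itself; in the example runs below it used the first 10 to 12 of the 50 computed dimensions rather than all of them.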

linear_reduction_params

A list of parameters to pass to the linear dimensionality reduction method.

force_linear_reduction

A logical value indicating whether to force linear dimensionality reduction even if the specified reduction is already present in the Seurat object.

nonlinear_reduction

The nonlinear dimensionality reduction method to use. Options are "umap", "umap-naive", "tsne", "dm", "phate", "pacmap", "trimap", "largevis", or "fr" (default is "umap").

nonlinear_reduction_dims

The number of dimensions to keep after nonlinear dimensionality reduction. If a vector is provided, a separate embedding is computed for each value (default is c(2, 3), which yields a 2D and a 3D embedding).

nonlinear_reduction_params

A list of parameters to pass to the nonlinear dimensionality reduction method.

force_nonlinear_reduction

A logical value indicating whether to force nonlinear dimensionality reduction even if the specified reduction is already present in the Seurat object.

neighbor_metric

The distance metric to use for finding neighbors. Options are "euclidean", "cosine", "manhattan", or "hamming" (default is "euclidean").

neighbor_k

The number of nearest neighbors to use when building the neighbor graph (default is 20).
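
To make the roles of the distance metric and the neighbor count concrete, here is a minimal base-R sketch of a brute-force k-nearest-neighbor search on a small embedding. The helper knn_index and the random embedding are hypothetical; base R's dist() only covers some of the metrics listed above (for example "euclidean" and "manhattan"), and the neighbor search used in practice is not implemented this way.

```r
set.seed(11)

# Hypothetical embedding: 6 cells in 2 dimensions.
emb <- matrix(rnorm(12), nrow = 6,
              dimnames = list(paste0("cell", 1:6), NULL))

# Hypothetical helper: row i holds the indices of the k cells closest to cell i.
knn_index <- function(emb, k, metric = "euclidean") {
  d <- as.matrix(dist(emb, method = metric))
  diag(d) <- Inf  # exclude each cell from its own neighbor list
  nn <- do.call(rbind, lapply(seq_len(nrow(d)), function(i) {
    order(d[i, ])[seq_len(k)]
  }))
  rownames(nn) <- rownames(emb)
  nn
}

nn_euclidean <- knn_index(emb, k = 2, metric = "euclidean")
nn_manhattan <- knn_index(emb, k = 2, metric = "manhattan")
print(nn_euclidean)
```

A larger k connects each cell to more of its surroundings, which generally gives a denser graph for the clustering step.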

cluster_algorithm

The clustering algorithm to use. Options are "louvain", "slm", or "leiden" (default is "louvain").

cluster_resolution

The resolution parameter to use for clustering. Larger values result in more clusters (default is 0.6).

seed

An integer specifying the random seed for reproducibility. Default is 11.

...

Additional arguments to be passed to the integration method function.

Value

A Seurat object.

Examples

data("panc8_sub")
panc8_sub <- Integration_SCP(
  srtMerge = panc8_sub, batch = "tech",
  integration_method = "Uncorrected"
)
#> [2023-11-21 07:22:20.499585] Start Uncorrected_integrate
#> [2023-11-21 07:22:20.504665] Spliting srtMerge into srtList by column tech... ...
#> [2023-11-21 07:22:20.75097] Checking srtList... ...
#> Data 1/5 of the srtList is raw_normalized_counts. Perform NormalizeData(LogNormalize) on the data ...
#> Perform FindVariableFeatures on the data 1/5 of the srtList...
#> Data 2/5 of the srtList is raw_normalized_counts. Perform NormalizeData(LogNormalize) on the data ...
#> Perform FindVariableFeatures on the data 2/5 of the srtList...
#> Data 3/5 of the srtList is raw_normalized_counts. Perform NormalizeData(LogNormalize) on the data ...
#> Perform FindVariableFeatures on the data 3/5 of the srtList...
#> Data 4/5 of the srtList is raw_counts. Perform NormalizeData(LogNormalize) on the data ...
#> Perform FindVariableFeatures on the data 4/5 of the srtList...
#> Data 5/5 of the srtList is raw_counts. Perform NormalizeData(LogNormalize) on the data ...
#> Perform FindVariableFeatures on the data 5/5 of the srtList...
#> Use the separate HVF from srtList...
#> Number of available HVF: 2000
#> [2023-11-21 07:22:22.999887] Finished checking.
#> [2023-11-21 07:22:23.744863] Perform integration(Uncorrected) on the data...
#> [2023-11-21 07:22:23.745] Perform ScaleData on the data...
#> [2023-11-21 07:22:23.917545] Perform linear dimension reduction (pca) on the data...
#> Warning: The following arguments are not used: force.recalc
#> Warning: The following arguments are not used: force.recalc
#> [2023-11-21 07:22:24.772714] Perform FindClusters (louvain) on the data...
#> [2023-11-21 07:22:24.865321] Reorder clusters...
#> [2023-11-21 07:22:24.942215] Perform nonlinear dimension reduction (umap) on the data...
#> Non-linear dimensionality reduction(umap) using Reduction(Uncorrectedpca, dims:1-10) as input
#> Found more than one class "dist" in cache; using the first, from namespace 'BiocGenerics'
#> Also defined by ‘spam’
#> Found more than one class "dist" in cache; using the first, from namespace 'BiocGenerics'
#> Also defined by ‘spam’
#> Non-linear dimensionality reduction(umap) using Reduction(Uncorrectedpca, dims:1-10) as input
#> Found more than one class "dist" in cache; using the first, from namespace 'BiocGenerics'
#> Also defined by ‘spam’
#> Found more than one class "dist" in cache; using the first, from namespace 'BiocGenerics'
#> Also defined by ‘spam’
#> [2023-11-21 07:22:33.153057] Uncorrected_integrate done
#> Elapsed time: 12.65 secs 
CellDimPlot(panc8_sub, group.by = c("tech", "celltype"))


panc8_sub <- Integration_SCP(
  srtMerge = panc8_sub, batch = "tech",
  integration_method = "Uncorrected",
  HVF_min_intersection = 5
)
#> [2023-11-21 07:22:33.601557] Start Uncorrected_integrate
#> [2023-11-21 07:22:33.606674] Spliting srtMerge into srtList by column tech... ...
#> [2023-11-21 07:22:33.982132] Checking srtList... ...
#> Data 1/5 of the srtList has been log-normalized.
#> Perform FindVariableFeatures on the data 1/5 of the srtList...
#> Data 2/5 of the srtList has been log-normalized.
#> Perform FindVariableFeatures on the data 2/5 of the srtList...
#> Data 3/5 of the srtList has been log-normalized.
#> Perform FindVariableFeatures on the data 3/5 of the srtList...
#> Data 4/5 of the srtList has been log-normalized.
#> Perform FindVariableFeatures on the data 4/5 of the srtList...
#> Data 5/5 of the srtList has been log-normalized.
#> Perform FindVariableFeatures on the data 5/5 of the srtList...
#> Use the separate HVF from srtList...
#> Number of available HVF: 264
#> [2023-11-21 07:22:35.497287] Finished checking.
#> [2023-11-21 07:22:36.376903] Perform integration(Uncorrected) on the data...
#> [2023-11-21 07:22:36.377034] Perform ScaleData on the data...
#> [2023-11-21 07:22:36.431986] Perform linear dimension reduction (pca) on the data...
#> Warning: The following arguments are not used: force.recalc
#> Warning: The following arguments are not used: force.recalc
#> [2023-11-21 07:22:37.085666] Perform FindClusters (louvain) on the data...
#> [2023-11-21 07:22:37.201152] Reorder clusters...
#> [2023-11-21 07:22:37.262771] Perform nonlinear dimension reduction (umap) on the data...
#> Non-linear dimensionality reduction(umap) using Reduction(Uncorrectedpca, dims:1-11) as input
#> Found more than one class "dist" in cache; using the first, from namespace 'BiocGenerics'
#> Also defined by ‘spam’
#> Found more than one class "dist" in cache; using the first, from namespace 'BiocGenerics'
#> Also defined by ‘spam’
#> Non-linear dimensionality reduction(umap) using Reduction(Uncorrectedpca, dims:1-11) as input
#> Found more than one class "dist" in cache; using the first, from namespace 'BiocGenerics'
#> Also defined by ‘spam’
#> Found more than one class "dist" in cache; using the first, from namespace 'BiocGenerics'
#> Also defined by ‘spam’
#> [2023-11-21 07:22:44.938949] Uncorrected_integrate done
#> Elapsed time: 11.34 secs 
CellDimPlot(panc8_sub, group.by = c("tech", "celltype"))


panc8_sub <- Integration_SCP(
  srtMerge = panc8_sub, batch = "tech",
  integration_method = "Uncorrected",
  HVF_min_intersection = 5, scale_within_batch = TRUE
)
#> [2023-11-21 07:22:45.433616] Start Uncorrected_integrate
#> [2023-11-21 07:22:45.438893] Spliting srtMerge into srtList by column tech... ...
#> [2023-11-21 07:22:45.794249] Checking srtList... ...
#> Data 1/5 of the srtList has been log-normalized.
#> Perform FindVariableFeatures on the data 1/5 of the srtList...
#> Data 2/5 of the srtList has been log-normalized.
#> Perform FindVariableFeatures on the data 2/5 of the srtList...
#> Data 3/5 of the srtList has been log-normalized.
#> Perform FindVariableFeatures on the data 3/5 of the srtList...
#> Data 4/5 of the srtList has been log-normalized.
#> Perform FindVariableFeatures on the data 4/5 of the srtList...
#> Data 5/5 of the srtList has been log-normalized.
#> Perform FindVariableFeatures on the data 5/5 of the srtList...
#> Use the separate HVF from srtList...
#> Number of available HVF: 264
#> [2023-11-21 07:22:47.383402] Finished checking.
#> [2023-11-21 07:22:48.120754] Perform integration(Uncorrected) on the data...
#> [2023-11-21 07:22:48.120892] Perform ScaleData on the data...
#> [2023-11-21 07:22:48.178095] Perform linear dimension reduction (pca) on the data...
#> Warning: The following arguments are not used: force.recalc
#> Warning: The following arguments are not used: force.recalc
#> [2023-11-21 07:22:48.901565] Perform FindClusters (louvain) on the data...
#> [2023-11-21 07:22:49.048645] Reorder clusters...
#> [2023-11-21 07:22:49.108998] Perform nonlinear dimension reduction (umap) on the data...
#> Non-linear dimensionality reduction(umap) using Reduction(Uncorrectedpca, dims:1-12) as input
#> Found more than one class "dist" in cache; using the first, from namespace 'BiocGenerics'
#> Also defined by ‘spam’
#> Found more than one class "dist" in cache; using the first, from namespace 'BiocGenerics'
#> Also defined by ‘spam’
#> Non-linear dimensionality reduction(umap) using Reduction(Uncorrectedpca, dims:1-12) as input
#> Found more than one class "dist" in cache; using the first, from namespace 'BiocGenerics'
#> Also defined by ‘spam’
#> Found more than one class "dist" in cache; using the first, from namespace 'BiocGenerics'
#> Also defined by ‘spam’
#> [2023-11-21 07:22:57.015141] Uncorrected_integrate done
#> Elapsed time: 11.58 secs 
CellDimPlot(panc8_sub, group.by = c("tech", "celltype"))


panc8_sub <- Integration_SCP(
  srtMerge = panc8_sub, batch = "tech",
  integration_method = "Seurat"
)
#> [2023-11-21 07:22:57.506948] Start Seurat_integrate
#> [2023-11-21 07:22:57.513201] Spliting srtMerge into srtList by column tech... ...
#> [2023-11-21 07:22:57.868443] Checking srtList... ...
#> Data 1/5 of the srtList has been log-normalized.
#> Perform FindVariableFeatures on the data 1/5 of the srtList...
#> Data 2/5 of the srtList has been log-normalized.
#> Perform FindVariableFeatures on the data 2/5 of the srtList...
#> Data 3/5 of the srtList has been log-normalized.
#> Perform FindVariableFeatures on the data 3/5 of the srtList...
#> Data 4/5 of the srtList has been log-normalized.
#> Perform FindVariableFeatures on the data 4/5 of the srtList...
#> Data 5/5 of the srtList has been log-normalized.
#> Perform FindVariableFeatures on the data 5/5 of the srtList...
#> Use the separate HVF from srtList...
#> Number of available HVF: 2000
#> [2023-11-21 07:22:59.443209] Finished checking.
#> [2023-11-21 07:23:00.186965] Perform FindIntegrationAnchors on the data...
#> [2023-11-21 07:23:19.293753] Perform integration(Seurat) on the data...
#> [2023-11-21 07:23:26.460136] Perform ScaleData on the data...
#> [2023-11-21 07:23:26.556868] Perform linear dimension reduction (pca) on the data...
#> Warning: The following arguments are not used: force.recalc
#> Warning: The following arguments are not used: force.recalc
#> [2023-11-21 07:23:27.440833] Perform FindClusters (louvain) on the data...
#> [2023-11-21 07:23:27.569122] Reorder clusters...
#> [2023-11-21 07:23:27.639545] Perform nonlinear dimension reduction (umap) on the data...
#> Non-linear dimensionality reduction(umap) using Reduction(Seuratpca, dims:1-12) as input
#> Found more than one class "dist" in cache; using the first, from namespace 'BiocGenerics'
#> Also defined by ‘spam’
#> Found more than one class "dist" in cache; using the first, from namespace 'BiocGenerics'
#> Also defined by ‘spam’
#> Non-linear dimensionality reduction(umap) using Reduction(Seuratpca, dims:1-12) as input
#> Found more than one class "dist" in cache; using the first, from namespace 'BiocGenerics'
#> Also defined by ‘spam’
#> Found more than one class "dist" in cache; using the first, from namespace 'BiocGenerics'
#> Also defined by ‘spam’
#> [2023-11-21 07:23:35.54961] Seurat_integrate done
#> Elapsed time: 38.04 secs 
CellDimPlot(panc8_sub, group.by = c("tech", "celltype"))


panc8_sub <- Integration_SCP(
  srtMerge = panc8_sub, batch = "tech",
  integration_method = "Seurat",
  FindIntegrationAnchors_params = list(reduction = "rpca")
)
#> [2023-11-21 07:23:36.032687] Start Seurat_integrate
#> [2023-11-21 07:23:36.038165] Spliting srtMerge into srtList by column tech... ...
#> [2023-11-21 07:23:36.661033] Checking srtList... ...
#> Data 1/5 of the srtList has been log-normalized.
#> Perform FindVariableFeatures on the data 1/5 of the srtList...
#> Data 2/5 of the srtList has been log-normalized.
#> Perform FindVariableFeatures on the data 2/5 of the srtList...
#> Data 3/5 of the srtList has been log-normalized.
#> Perform FindVariableFeatures on the data 3/5 of the srtList...
#> Data 4/5 of the srtList has been log-normalized.
#> Perform FindVariableFeatures on the data 4/5 of the srtList...
#> Data 5/5 of the srtList has been log-normalized.
#> Perform FindVariableFeatures on the data 5/5 of the srtList...
#> Use the separate HVF from srtList...
#> Number of available HVF: 2000
#> [2023-11-21 07:23:38.28451] Finished checking.
#> [2023-11-21 07:23:39.311497] Use 'rpca' integration workflow...
#> [2023-11-21 07:23:39.311634] Perform ScaleData on the data 1 ...
#> [2023-11-21 07:23:39.362197] Perform linear dimension reduction (pca) on the data 1 ...
#> [2023-11-21 07:23:39.516416] Perform ScaleData on the data 2 ...
#> [2023-11-21 07:23:39.577217] Perform linear dimension reduction (pca) on the data 2 ...
#> [2023-11-21 07:23:39.717211] Perform ScaleData on the data 3 ...
#> [2023-11-21 07:23:39.780633] Perform linear dimension reduction (pca) on the data 3 ...
#> [2023-11-21 07:23:39.914314] Perform ScaleData on the data 4 ...
#> [2023-11-21 07:23:39.992768] Perform linear dimension reduction (pca) on the data 4 ...
#> [2023-11-21 07:23:40.255761] Perform ScaleData on the data 5 ...
#> [2023-11-21 07:23:40.308695] Perform linear dimension reduction (pca) on the data 5 ...
#> [2023-11-21 07:23:40.478852] Perform FindIntegrationAnchors on the data...
#> [2023-11-21 07:23:49.166569] Perform integration(Seurat) on the data...
#> [2023-11-21 07:23:56.003096] Perform ScaleData on the data...
#> [2023-11-21 07:23:56.098095] Perform linear dimension reduction (pca) on the data...
#> Warning: The following arguments are not used: force.recalc
#> Warning: The following arguments are not used: force.recalc
#> [2023-11-21 07:23:56.921441] Perform FindClusters (louvain) on the data...
#> [2023-11-21 07:23:57.045694] Reorder clusters...
#> [2023-11-21 07:23:57.119271] Perform nonlinear dimension reduction (umap) on the data...
#> Non-linear dimensionality reduction(umap) using Reduction(Seuratpca, dims:1-11) as input
#> Found more than one class "dist" in cache; using the first, from namespace 'BiocGenerics'
#> Also defined by ‘spam’
#> Found more than one class "dist" in cache; using the first, from namespace 'BiocGenerics'
#> Also defined by ‘spam’
#> Non-linear dimensionality reduction(umap) using Reduction(Seuratpca, dims:1-11) as input
#> Found more than one class "dist" in cache; using the first, from namespace 'BiocGenerics'
#> Also defined by ‘spam’
#> Found more than one class "dist" in cache; using the first, from namespace 'BiocGenerics'
#> Also defined by ‘spam’
#> [2023-11-21 07:24:05.337769] Seurat_integrate done
#> Elapsed time: 29.31 secs 
CellDimPlot(panc8_sub, group.by = c("tech", "celltype"))


if (FALSE) {
integration_methods <- c(
  "Uncorrected", "Seurat", "scVI", "MNN", "fastMNN", "Harmony",
  "Scanorama", "BBKNN", "CSS", "LIGER", "Conos", "ComBat"
)
for (method in integration_methods) {
  panc8_sub <- Integration_SCP(
    srtMerge = panc8_sub, batch = "tech",
    integration_method = method,
    linear_reduction_dims_use = 1:50,
    nonlinear_reduction = "umap"
  )
  print(CellDimPlot(panc8_sub,
    group.by = c("tech", "celltype"),
    reduction = paste0(method, "UMAP2D"),
    xlab = "", ylab = "", title = method,
    legend.position = "none", theme_use = "theme_blank"
  ))
}

nonlinear_reductions <- c("umap", "tsne", "dm", "phate", "pacmap", "trimap", "largevis", "fr")
panc8_sub <- Integration_SCP(
  srtMerge = panc8_sub, batch = "tech",
  integration_method = "Seurat",
  linear_reduction_dims_use = 1:50,
  nonlinear_reduction = nonlinear_reductions
)
for (nr in nonlinear_reductions) {
  print(CellDimPlot(panc8_sub,
    group.by = c("tech", "celltype"),
    reduction = paste0("Seurat", nr, "2D"),
    xlab = "", ylab = "", title = nr,
    legend.position = "none", theme_use = "theme_blank"
  ))
}
}