Selects informative genes based on k-nearest neighbour analysis.

This function selects genes based on k-nearest neighbour analysis. It takes a Seurat object or a gene expression matrix as input and computes the distance to the k-nearest neighbour (DKNN) for each gene/feature. A DKNN threshold is then set based on permutation analysis and FDR computation.
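
To make the DKNN idea concrete, the following is a minimal base-R sketch on a made-up matrix. It is not the package's implementation: the toy data, the use of `1 - cor()` as the Pearson distance, and reading "distance to k-nearest neighbour" as the distance to the k-th closest other gene are all assumptions made for illustration.

```r
# Toy expression matrix: 6 genes (rows) x 8 cells (columns); values are made up.
set.seed(123)
expr <- matrix(rnorm(6 * 8), nrow = 6,
               dimnames = list(paste0("gene", 1:6), paste0("cell", 1:8)))

# Make gene2 and gene3 noisy copies of gene1 so these three genes
# form a tight neighbourhood (i.e. a co-expressed group).
expr["gene2", ] <- expr["gene1", ] + rnorm(8, sd = 0.05)
expr["gene3", ] <- expr["gene1", ] + rnorm(8, sd = 0.05)

# Assumption: the Pearson "distance" between two genes is 1 - correlation.
dist_mat <- 1 - cor(t(expr), method = "pearson")

# Assumption: DKNN is the distance from a gene to its k-th closest other gene.
# The gene itself is excluded by index before sorting.
k <- 2
dknn <- sapply(seq_len(nrow(dist_mat)),
               function(i) sort(dist_mat[i, -i])[k])
names(dknn) <- rownames(dist_mat)

# Genes in a tight neighbourhood get a small DKNN; isolated genes get a large one.
print(round(dknn, 3))
```

In this sketch, genes with a small DKNN sit in a dense neighbourhood and are the ones a threshold would retain, while genes with a large DKNN behave like noise.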

select_genes(
  data = NULL,
  distance_method = c("pearson", "cosine", "euclidean", "spearman", "kendall"),
  noise_level = 5e-05,
  k = 80,
  row_sum = 1,
  fdr = 5e-05,
  assay = NULL,
  layer = c("data", "sct", "counts"),
  no_dknn_filter = FALSE,
  no_anti_cor = FALSE,
  seed = 123
)

Arguments

data: A matrix, data.frame or Seurat object.
distance_method: A character string indicating the method for computing distances (one of "pearson", "cosine", "euclidean", "spearman" or "kendall").
noise_level: This parameter controls the fraction of genes with high DKNN (i.e. noise) whose neighborhood (i.e. associated distances) will be used to compute simulated DKNN values. A value of 0 means all genes are used. A value close to 1 means only genes with high DKNN (i.e. close to noise) are used.
k: An integer specifying the size of the neighborhood.
row_sum: A feature/gene whose row sum is below this threshold will be discarded. Use -Inf to keep all genes.
fdr: A numeric value indicating the false discovery rate threshold (range: 0 to 1).
assay: The assay to use in the Seurat object. If NULL, the function will try to guess.
layer: A character string indicating which layer (slot) to use from the input scRNA-seq object (one of "data", "sct" or "counts").
no_dknn_filter: A logical indicating whether to skip the k-nearest-neighbors (KNN) filter. If TRUE, all genes are kept for the next steps.
no_anti_cor: If TRUE, correlations below 0 are set to zero (applies to "pearson", "cosine", "spearman" and "kendall"). This may increase the relative weight of positive correlations (as true anti-correlation may be rare).
seed: An integer specifying the random seed to use.
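
As an illustration of what the `row_sum` and `no_anti_cor` arguments describe, here is a small base-R sketch on made-up counts. It is not taken from the package source: the toy matrix, the `>=` comparison used for the row-sum filter, and the `1 - correlation` distance are assumptions.

```r
# Toy count matrix: 4 genes x 5 cells (made-up values).
counts <- matrix(c(0, 0, 0, 1, 0,    # geneA: row sum 1
                   3, 2, 4, 1, 5,    # geneB: row sum 15
                   5, 4, 3, 2, 1,    # geneC: row sum 15
                   2, 1, 4, 3, 5),   # geneD: row sum 15
                 nrow = 4, byrow = TRUE,
                 dimnames = list(c("geneA", "geneB", "geneC", "geneD"), NULL))

# row_sum-style filter: discard genes whose row sum is below the threshold.
# Assumption: genes exactly at the threshold are kept.
row_sum <- 5
kept <- counts[rowSums(counts) >= row_sum, , drop = FALSE]

# no_anti_cor-style clamp: negative correlations are set to zero
# before being turned into distances (assumed here to be 1 - correlation).
cor_mat <- cor(t(kept), method = "pearson")
cor_mat[cor_mat < 0] <- 0
dist_mat <- 1 - cor_mat

print(rownames(kept))
print(round(dist_mat, 2))
```

With the clamp applied, anti-correlated pairs such as geneB and geneC end up at distance 1, the same as uncorrelated genes, instead of being pushed further apart.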

Value

A ClusterSet object.

References

- Lopez F., Textoris J., Bergon A., Didier G., Remy E., Granjeaud S., Imbert J., Nguyen C. and Puthier D. TranscriptomeBrowser: a powerful and flexible toolbox to explore productively the transcriptional landscape of the Gene Expression Omnibus database. PLoS ONE, 2008;3(12):e4001.

Author

Julie Bavais, Sebastien Nin, Lionel Spinelli and Denis Puthier

Examples


# Restrict verbosity to info messages only.
set_verbosity(1)

# Load a dataset
load_example_dataset("7871581/files/pbmc3k_medium")
#> |-- INFO :  Dataset 7871581/files/pbmc3k_medium was already loaded. 

# Select informative genes
res <- select_genes(pbmc3k_medium,
                    distance_method = "pearson",
                    row_sum = 5)
#> |-- INFO :  Number of selected rows/genes (row_sum): 1164 
#> |-- INFO :  Computing distances using selected method: pearson 
#> |-- INFO :  Computing distances to KNN. 
#> |-- INFO :  Computing simulated distances to KNN. 
#> |-- INFO :  Computing distances to KNN threshold (DKNN threshold). 
#> |-- INFO :  Selecting informative genes. 
#> |-- INFO :  Instantiating a ClusterSet object. 

# Result is a ClusterSet object
is(res)
#> [1] "ClusterSet"
slotNames(res)
#> [1] "data"                     "gene_clusters"            "top_genes"                "gene_clusters_metadata"   "gene_cluster_annotations" "cells_metadata"           "dbf_output"               "parameters"              

# The selected genes
nrow(res)
#> [1] 293
head(row_names(res))
#> [1] "PTCRA"     "ACRBP"     "TUBB1"     "SDPR"      "HIST1H2AC" "C2orf88"