This function selects genes based on k-nearest neighbour analysis. It takes a Seurat object or a gene expression matrix as input and computes the distance to the k-nearest neighbour (DKNN) for each gene/feature. A threshold is then set based on permutation analysis and FDR computation.

select_genes(
  data = NULL,
  distance_method = c("pearson", "cosine", "euclidean", "spearman", "kendall"),
  noise_level = 5e-05,
  k = 80,
  row_sum = 1,
  fdr = 0.005,
  which_slot = c("data", "sct", "counts"),
  no_dknn_filter = FALSE,
  no_anti_cor = FALSE,
  seed = 123
)

Arguments

data

A matrix, data.frame or Seurat object.

distance_method

A character string indicating the method for computing distances (one of "pearson", "cosine", "euclidean", "spearman" or "kendall").

noise_level

This parameter controls the fraction of genes with high DKNN (i.e. noise) whose neighborhood (i.e. associated distances) will be used to compute simulated DKNN values. A value of 0 means that all genes are used. A value close to 1 means that only genes with high DKNN (i.e. close to noise) are used.

k

An integer specifying the size of the neighborhood.

row_sum

A feature/gene whose row sum is below this threshold will be discarded. Use -Inf to keep all genes.

fdr

A numeric value indicating the false discovery rate threshold (range: 0 to 100).

which_slot

A character string indicating which slot to use from the input scRNA-seq object (one of "data", "sct" or "counts").

no_dknn_filter

A logical indicating whether to skip the k-nearest-neighbors (KNN) filter. If TRUE, all genes are kept for the next steps.

no_anti_cor

If TRUE, correlations below 0 are set to zero (applies to "pearson", "cosine", "spearman" and "kendall"). This may increase the relative weight of positive correlations (as true anti-correlation may be rare).

seed

An integer specifying the random seed to use.
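
As a rough illustration of how row_sum, distance_method, no_anti_cor and k could interact, the base R sketch below filters a toy matrix, builds a Pearson-based distance and takes the distance to the k-th nearest neighbour of each gene. It is not the package's implementation: the helper `dknn_sketch()`, the toy data, the use of `1 - correlation` as the distance and the choice of the k-th neighbour (rather than, say, an average over the k neighbours) are all assumptions made for the example.

```r
# Toy expression matrix (hypothetical data): 60 genes (rows) x 40 cells (columns).
set.seed(123)
n_genes <- 60
n_cells <- 40
expr <- matrix(rpois(n_genes * n_cells, lambda = 2), nrow = n_genes,
               dimnames = list(paste0("gene", seq_len(n_genes)), NULL))

# Make the first 10 genes co-expressed by adding a shared signal to each of them.
signal <- rpois(n_cells, lambda = 10)
expr[1:10, ] <- expr[1:10, ] + matrix(signal, nrow = 10, ncol = n_cells, byrow = TRUE)

# Hypothetical helper, not part of the package.
dknn_sketch <- function(mat, k = 5, row_sum = 1, no_anti_cor = FALSE) {
  # Discard genes whose row sum is below the threshold.
  mat <- mat[rowSums(mat) >= row_sum, , drop = FALSE]
  # Gene-to-gene Pearson correlation (cor() works on columns, hence t()).
  cr <- cor(t(mat), method = "pearson")
  # Optionally set negative correlations to zero.
  if (no_anti_cor) cr[cr < 0] <- 0
  # Assumption: the distance is taken as 1 - correlation.
  d <- 1 - cr
  # Ignore the distance of each gene to itself.
  diag(d) <- NA
  # Distance to the k-th nearest neighbour of each gene (sort() drops the NA).
  apply(d, 1, function(x) sort(x)[k])
}

dknn <- dknn_sketch(expr, k = 5)

# Co-expressed genes are expected to sit closer to their neighbours than noise genes.
mean(dknn[1:10])
mean(dknn[11:60])
```

In this toy setting, genes sharing a signal end up with a small DKNN, while genes behaving like noise have a larger one. That contrast is what the selection relies on.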

Value

A ClusterSet object.

References

- Lopez F., Textoris J., Bergon A., Didier G., Remy E., Granjeaud S., Imbert J., Nguyen C. and Puthier D. TranscriptomeBrowser: a powerful and flexible toolbox to explore productively the transcriptional landscape of the Gene Expression Omnibus database. PLoS ONE, 2008;3(12):e4001.

Author

Julie Bavais, Sebastien Nin, Lionel Spinelli and Denis Puthier

Examples


# Restrict verbosity to info messages only.
set_verbosity(1)

# Load a dataset
load_example_dataset("7871581/files/pbmc3k_medium")
#> |-- INFO :  Dataset 7871581/files/pbmc3k_medium was already loaded. 

# Select informative genes
res <- select_genes(pbmc3k_medium,
                    distance_method = "pearson",
                    row_sum = 5)
#> |-- INFO :  Computing distances using selected method: pearson 
#> |-- INFO :  Computing distances to KNN. 
#> |-- INFO :  Computing simulated distances to KNN. 
#> |-- INFO :  Computing threshold of distances to KNN (DKNN threshold). 
#> |-- INFO :  Selecting informative genes. 
#> |-- INFO :  Creating the ClusterSet object. 

# Result is a ClusterSet object
is(res)
#> [1] "ClusterSet"
slotNames(res)
#> [1] "data"                     "gene_clusters"           
#> [3] "top_genes"                "gene_clusters_metadata"  
#> [5] "gene_cluster_annotations" "cells_metadata"          
#> [7] "dbf_output"               "parameters"              

# The selected genes
nrow(res)
#> [1] 293
head(row_names(res))
#> [1] "PTCRA"     "ACRBP"     "TUBB1"     "SDPR"      "HIST1H2AC" "C2orf88"
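
The idea of setting a DKNN threshold from permutations and an FDR estimate can be illustrated with a small, self-contained base R sketch. This is a generic permutation scheme written under stated assumptions, not the procedure used by select_genes() (which, according to the noise_level argument, derives simulated DKNN values from the neighbourhoods of noisy genes). The helper `kth_nn_dist()`, the toy data and the `fdr_cutoff` variable (expressed here as a proportion) are hypothetical.

```r
# Toy data (hypothetical): 60 genes x 40 cells, the first 10 genes share a signal.
set.seed(123)
n_genes <- 60
n_cells <- 40
k <- 5
expr <- matrix(rnorm(n_genes * n_cells), nrow = n_genes)
shared <- rnorm(n_cells)
expr[1:10, ] <- expr[1:10, ] + 3 * matrix(shared, nrow = 10, ncol = n_cells, byrow = TRUE)

# Hypothetical helper: distance (1 - Pearson correlation) to the k-th nearest gene.
kth_nn_dist <- function(mat, k) {
  d <- 1 - cor(t(mat))
  diag(d) <- NA
  apply(d, 1, function(x) sort(x)[k])
}

observed <- kth_nn_dist(expr, k)

# Null model (assumption): shuffle each gene's values across cells independently,
# which destroys gene-to-gene structure while keeping each gene's distribution.
permuted <- t(apply(expr, 1, sample))
simulated <- kth_nn_dist(permuted, k)

# For each candidate threshold, estimate the FDR as the number of simulated
# values passing the threshold divided by the number of observed values passing it.
candidates <- sort(observed)
fdr_est <- sapply(candidates, function(th) sum(simulated <= th) / sum(observed <= th))

# Keep genes up to (but not including) the first candidate exceeding the cutoff.
fdr_cutoff <- 0.05  # a proportion in this sketch
first_fail <- which(fdr_est > fdr_cutoff)[1]
n_keep <- if (is.na(first_fail)) length(candidates) else first_fail - 1
selected <- order(observed)[seq_len(n_keep)]

# Indices of the genes retained by the sketch.
sort(selected)
```

Lowering `fdr_cutoff` makes the selection stricter, since fewer candidate thresholds keep the simulated-to-observed ratio below the cutoff.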