API

Dataset

This is the primary class for processing the data once it is saved in HDF5 format. It allows filtering of cells and genes, normalization and scaling of data, identification of highly variable genes and dimensionality reduction using PCA. This class has been designed to work seamlessly with the HDF5 format such that only a minimal portion of data is ever loaded into memory. The methods (functions) associated with this class have also been designed so that external data can easily be plugged in at any stage. For example, users can bring in normalization factors (cell size factors), a list of highly variable genes, etc.

class nabo.Dataset(h5_fn: h5py._hl.files.File, mito_patterns: List[str] = None, ribo_patterns: List[str] = None, force_recalc: bool = False)

Class for performing filtering, normalization and dimensionality reduction.

Parameters:
  • h5_fn – Path to input HDF5 file
  • mito_patterns – Patterns (regular expressions) used to identify mitochondrial gene names
  • ribo_patterns – Patterns (regular expressions) used to identify ribosomal gene names
  • force_recalc – If set to True then all the data saved by a previous instance of this class will be deleted
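A minimal usage sketch (the file path and gene-name patterns below are hypothetical; the HDF5 file is assumed to already exist in the format expected by Nabo):

    import nabo

    # Create a Dataset backed by an existing HDF5 file; mitochondrial and
    # ribosomal genes are identified by regex patterns on gene names.
    ds = nabo.Dataset('example_data.h5',
                      mito_patterns=['^MT-'],
                      ribo_patterns=['^RPS', '^RPL'])
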
correct_var(n_bins: int = 100, lowess_frac: float = 0.4) → None

Removes the mean-variance trend in the dataset and adds the corrected variance as a ‘fixed_var’ column to geneStats.

Parameters:
  • n_bins – Number of bins for expression values. A larger number of bins will provide a better fit for outliers but may also result in overfitting.
  • lowess_frac – Value for the frac parameter of statsmodels’ lowess function
Returns:

None

dump_hvgs(hvgs: List[str]) → None

Save HVGs to HDF5 file

Parameters:hvgs – List of highly variable genes to save in the HDF5 file
Returns:None
export_as_dataframe(genes: List[str], normalized: bool = True, clr_normed: bool = False, clr_axis: int = 0) → pandas.core.frame.DataFrame

Export data for given genes. Data is exported only for cells that are present in the keepCellsIdx attribute.

Parameters:
  • genes – Genes to be exported
  • normalized – Perform library size normalization (default: True)
  • clr_normed – Perform CLR normalization (default: False)
Returns:

Pandas dataframe with cells as rows and genes as columns
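For instance, normalized values for a few genes of interest can be pulled into a DataFrame (the gene names below are illustrative):

    # Library-size normalized expression for selected genes, restricted to
    # the cells retained in keepCellsIdx.
    df = ds.export_as_dataframe(['GENE1', 'GENE2', 'GENE3'], normalized=True)
    print(df.shape)  # cells x genes
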

filter_data(min_exp: int = 1000, max_exp: int = inf, min_ngenes: int = 100, max_ngenes: int = inf, min_mito: int = -1, max_mito: int = 101, min_ribo: int = -1, max_ribo: int = 101, min_gene_abundance: int = 10, rm_mito: bool = True, rm_ribo: bool = True, verbose: bool = True) → None

Filter cells and genes

Parameters:
  • min_exp – Minimum total expression value for each cell (if count data then these would be minimum number of reads or UMI per cell)
  • max_exp – Maximum total expression value for each cell.
  • min_ngenes – Minimum number of genes expressed per cell.
  • max_ngenes – Maximum number of genes expressed per cell.
  • min_mito – Minimum percentage of mitochondrial genes’ expression
  • max_mito – Maximum percentage of mitochondrial genes’ expression
  • min_ribo – Minimum percentage of ribosomal genes’ expression
  • max_ribo – Maximum percentage of ribosomal genes’ expression
  • min_gene_abundance – Minimum total expression of gene
  • rm_mito – if True, exclude mitochondrial genes (default: True)
  • rm_ribo – if True, exclude ribosomal genes (default: True)
  • verbose – if True then report the number of cells/genes removed by each cutoff (default: True)
Returns:

None
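A typical filtering call might look like the sketch below; the cutoff values are illustrative and should be chosen after inspecting the QC plots:

    # Inspect QC distributions of the raw data.
    ds.plot_raw()

    # Remove low-quality cells and rarely detected genes.
    ds.filter_data(min_exp=1500, min_ngenes=200, max_mito=20,
                   min_gene_abundance=10, rm_mito=True, rm_ribo=True)

    # Re-inspect the distributions for the retained cells and genes.
    ds.plot_filtered()
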

find_hvgs(var_min_thresh: float = None, nzm_min_thresh: float = None, var_max_thresh: float = inf, nzm_max_thresh: float = inf, min_cells: int = 0, plot: bool = True, use_corrected_var: bool = False, update_cache: bool = False) → None

Identifies highly variable genes using cutoff provided for corrected variance and non-zero mean expression. Saves the result in attribute hvgList.

NOTE: Input threshold values are considered to be in log scale if use_corrected_var is True.
Parameters:
  • var_min_thresh – Minimum corrected variance
  • nzm_min_thresh – Minimum non-zero mean
  • var_max_thresh – Maximum corrected variance
  • nzm_max_thresh – Maximum non-zero mean
  • min_cells – Minimum number of cells where a gene should have non-zero value
  • plot – if True then a scatter plot of mean vs variance is displayed, highlighting in blue the genes selected as HVGs.
  • use_corrected_var – if True then uses the corrected variance (default: False)
  • update_cache – If True then dump the HVG list to the HDF5 file (default: False)
Returns:

None
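The variance-correction and HVG-selection steps can be chained as in this sketch after filtering (threshold values are illustrative):

    # Gene-wise mean/variance, mean-variance trend removal, then HVG
    # selection based on the corrected variance.
    ds.set_gene_stats()
    ds.correct_var(n_bins=100, lowess_frac=0.4)
    ds.find_hvgs(var_min_thresh=0.5, nzm_min_thresh=0.1,
                 min_cells=10, use_corrected_var=True)
    print(len(ds.hvgList))
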

fit_ipca(genes: List[str], n_comps: int = 100, batch_size: int = None, disable_tqdm: bool = False) → None

Fit PCA with genes of interest. The fitted PCA object is saved as instance attribute ipca. This function uses scikit-learn’s incremental PCA.

Parameters:
  • genes – List of genes to use to fit PCA
  • n_comps – Number of components to use
  • batch_size – Number of cells to use for fitting in a batch
  • disable_tqdm – if True, progress will not be displayed ( default: False)
Returns:

None

get_cell_lib(hto_patterns: List[str], min_ratio: float = 0.5) → pandas.core.series.Series

Get cell library based on hashtag ratios.

Parameters:
  • hto_patterns – Pattern to search for hashtags
  • min_ratio – Minimum ratio (default: 0.5)
Returns:

Pandas series

get_cum_exp(genes: List[str], report_missing: bool = False) → numpy.ndarray

Calculates cumulative expression of provided genes for each cell.

Parameters:
  • genes – List of gene names
  • report_missing – if True then a warning message is printed if a gene is not found, otherwise remains silent (default: False)
Returns:

numpy.ndarray of length rawNCells where each element is the total expression of the given genes in a cell.

get_gene_abundance() → numpy.ndarray
Returns:ndarray where each element is the number of cells where a gene is expressed. Order is same as in attribute genes.
get_genes_by_pattern(patterns: List[str]) → List[str]

Get names of genes that match the pattern

Parameters:patterns – List of regex patterns
Returns:List of gene names matching the pattern
get_genes_per_cell() → numpy.ndarray
Returns:ndarray where each element is number of expressed genes from a cell. Order is same as in attribute cells.
get_lvgs(nzm_cutoff: float = None, log_nzm_cutoff: float = None, n: int = None, use_corrected_var: bool = False) → list

Get names of genes with the least corrected variance.

Parameters:
  • nzm_cutoff – Minimum non-zero mean values for returned genes
  • log_nzm_cutoff – Minimum non-zero mean values (log scale) for returned genes
  • n – Number of genes to return (default: same number as HVGs)
  • use_corrected_var – if True then uses the corrected variance (default: False)
Returns:

A list of gene names

get_norm_exp(gene: str, as_dict: bool = False, key_suffix: str = '', only_valid_cells: bool = False)

Get normalized expression of a gene across all the cells.

Parameters:
  • gene – Valid name of a gene. Gene name will be converted to upper case.
  • as_dict – if True, returns a dictionary with cell names as keys and normalized expression as values, otherwise returns a list of normalized expression values. The order of values is the same as the cell names in attribute cells (default: False)
  • key_suffix – A character/string to append to each cell name ( default: ‘’)
  • only_valid_cells – If True then returns only valid cells i.e. cells removed during filtering step are not included. (default: False)
Returns:

List of normalized expression values, or a dictionary if as_dict is True

get_scaled_values(scaling_params: pandas.core.frame.DataFrame, tqdm_msg: str = '', disable_tqdm: bool = False, fill_missing: bool = False) → Generator[Tuple[str, numpy.ndarray], None, bool]

Generator that yields cell wise scaled expression values. The yielded vector will have genes in same order as provided in the input list.

Parameters:
  • scaling_params – Scaling parameters i.e. mean and std. dev. for each gene in the form of a pandas DataFrame. This can be obtained from the get_scaling_params method of Dataset.
  • tqdm_msg – Message for progress bar (default: ‘’)
  • disable_tqdm – if True, progress will not be displayed (default: False)
  • fill_missing – If True, then gene names in scaling_params that are not present in the Dataset are assigned a value of 0. (Default: False, raises an error if a gene name is not found). Using this parameter is not recommended as of now because the implications of doing this have not been thoroughly tested.
Returns:

(cell name, scaled values)

get_scaling_params(genes: List[str] = None, only_valid: bool = True) → pandas.core.frame.DataFrame

Get genes’ mean and standard deviation (uncorrected)

Parameters:
  • genes – Name of genes whose parameters are required. Returns every gene’s parameter if value is None (default: None)
  • only_valid – if True, include only valid genes (default: True)
Returns:

A pandas DataFrame containing columns ‘mu’ and ‘sigma’

get_total_exp_per_cell() → numpy.ndarray
Returns:ndarray where each element is total expression values/counts from a cell. Order is same as in attribute cells.
plot_filtered(color: str = 'coral', display_stats: bool = True, savename: str = None, showfig: bool = True) → None

Plot total expression, genes/cell, % mitochondrial expression and % ribosomal expression for each cell from filtered data

Parameters:
  • color
  • display_stats
  • savename
  • showfig
Returns:

None

plot_raw(color: str = 'skyblue', display_stats: bool = True, savename: str = None, showfig: bool = True) → None

Plot total expression, genes/cell, % mitochondrial expression and % ribosomal expression for each cell from raw data

Returns:None
remove_cells(cell_names: List[str], update_cache: bool = False, verbose: bool = False) → None

Remove list of cells by providing their names. Note that no data is actually deleted from the dataset but just the keepCellsIdx attribute is modified.

Parameters:
  • cell_names – List of cell names to remove
  • verbose – Print message about number of cells removed ( Default: False)
  • update_cache – If True then the ‘keep_cells_idx’ dataset in the H5 file is updated. This will override the saved list of cells (keepCellsIdx) when the dataset is loaded in the future.
Returns:

remove_genes(gene_names: List[str], update_cache: bool = False, verbose: bool = False) → None

Remove genes by providing their names. Note that no data is actually deleted from the dataset but just the keepGenesIdx attribute is modified.

Parameters:
  • gene_names – List of gene names to remove
  • verbose – Print message about number of genes removed (Default: False)
  • update_cache – If True then the ‘keep_genes_idx’ dataset in the H5 file is updated. This will override the saved list of genes (keepGenesIdx) when the dataset is loaded in the future.
Returns:

set_gene_stats() → None

Calculates the gene-wise expression mean and variance values. Sets geneStats attribute as a pandas DataFrame of shape ( nGeneEntries, 4)

Returns:None
set_sf(sf: Dict[str, float] = None, size_scale: float = 1000.0, all_genes: bool = False) → None

Set size factor for each cell. Updates sf attribute

Parameters:
  • sf – size factor dict with keys same as the dataset names in ‘cell_data’ group of H5 file and values as size factor for the corresponding cell to be used for normalization.
  • size_scale – Values are scaled to this factor after normalization using size factor (default: 1000, set to None to disable size scaling).
  • all_genes – Use expression values from all genes, even those which were filtered out, to calculate size factor.
Returns:

None

transform_pca(out_file: str, pca_group_name: str, transformer, scaling_params: pandas.core.frame.DataFrame, disable_tqdm: bool = False, fill_missing: bool = False) → None

Transforms values into PCA space and saves them in HDF5 format

Parameters:
  • out_file – Name of HDF5 file for output
  • pca_group_name – Name of HDF5 group wherein PCA transformed values will be written. If the group exists then it will be deleted
  • transformer – sklearn’s incremental PCA instance on which fit function has already been called
  • scaling_params – a DataFrame as returned by the get_scaling_params method of the reference sample
  • disable_tqdm – if True, progress will not be displayed (default: False)
  • fill_missing – If True, then gene names in scaling_params that are not present in the Dataset are assigned 0 value. (Default: False, raises error if gene name not found). Using this parameter is not recommended as of now because the implications of doing this have not been thoroughly tested.
Returns:

None
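The scaling and PCA steps fit together roughly as in the sketch below (file and group names are hypothetical):

    # Set cell-wise size factors used for normalization.
    ds.set_sf()

    # Fit incremental PCA on the highly variable genes.
    hvgs = list(ds.hvgList)
    ds.fit_ipca(genes=hvgs, n_comps=50)

    # Per-gene mean and standard deviation used for scaling.
    scaling = ds.get_scaling_params(genes=hvgs)

    # Transform cells into PCA space and write the result to an HDF5 group.
    ds.transform_pca(out_file='pca_data.h5', pca_group_name='ref_pca',
                     transformer=ds.ipca, scaling_params=scaling)
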

update_exp_suffix(suffix: str, delimiter: str = '_') → None

Set the suffix for cell names obtained from exp attribute

Parameters:
  • suffix – Add this string to end of cell names
  • delimiter – delimiter character/string separating cell name from suffix
Returns:

Mapping

This is the core class that performs reference graph building by calculating Euclidean distances between cells and then identifying shared nearest neighbours among cells. Cells from any number of target samples can then be mapped over this graph by calculating reference-target cell distances. All the results are saved in an HDF5 file of the user’s choice.

class nabo.Mapping(mapping_h5_fn: str, ref_name: str, ref_pca_fn: str, ref_pca_grp_name: str, overwrite: bool = False)

This class encapsulates functions required to perform cell mapping.

Parameters:
  • mapping_h5_fn – Output filename. Results will be written to this HDF5 file. If the file already exists then it may be used to read existing data
  • ref_name – Label for reference samples.
  • ref_pca_fn – Path to HDF5 file that contains input data. Ideally this should be the transformed PCA data file generated by Nabo’s Dataset class.
  • ref_pca_grp_name – Name of the group in the input HDF5 file wherein the data exists
  • overwrite – Deletes all the data saved in the mapping_h5_fn to start from scratch (default: False)
calc_dist(target_fn: str, target_grp: str, dist_grp: str, sorted_dist_grp: str, ignore_ref_cells: List[str]) → None

Calculates the Euclidean distance between each pair of reference cells, or a modified Canberra distance between each reference-target cell pair.

Parameters:
  • target_fn – input HDF5 file for the target sample. If the distances are being calculated for the reference sample then this is the reference file name
  • target_grp – group name within HDF5 file wherein data is located
  • dist_grp – Name of output group name where distances will be saved
  • sorted_dist_grp – Name of group where distance sorted cell indices will be saved
  • ignore_ref_cells – List of names of reference cells to which distance should not be calculated.
Returns:

None

calc_snn(target_sorted_dist_grp: str, target_name: str, graph_grp: str, fix_graph_attempts: int = 5, fix_weight: float = None) → None

Creates a shared nearest neighbour graph based on distances calculated by ‘calc_dist’ method.

Parameters:
  • target_sorted_dist_grp – Name of HDF5 group wherein the indices of distance sorted cells are saved.
  • target_name – A label for target sample. This will be appended to each target cell name in the graph
  • graph_grp – Name of output group where graph will be saved
  • fix_graph_attempts – Number of attempts to connect a disconnected graph. This parameter will soon be removed.
  • fix_weight – Weight of edges used to connect disconnected components of graph (default: 0.5/(2*(k-1))-0.5)
Returns:

None

make_ref_graph(use_stored_distances: bool = False)

A wrapper that runs the calc_dist and calc_snn methods to create the reference graph

Parameters:use_stored_distances – If True then the distance between reference cells is not calculated and Nabo will try to use the stored distance matrix. (Default: False)
Returns:None
map_target(target_name: str, target_pca_fn: str, target_pca_grp_name: str, ignore_ref_cells: List[str] = None, use_stored_distances: bool = False, overwrite: bool = False) → None

A wrapper that runs the calc_dist and calc_snn functions for mapping target cells onto the reference graph. If the same target name is provided twice then the data is overwritten.

Parameters:
  • target_name – Label/name of target sample
  • target_pca_fn – Filename of input data for target. In typical usage this would be the HDF5 file generated using Nabo’s Dataset class.
  • target_pca_grp_name – Name of group containing data in HDF5 file
  • ignore_ref_cells – List of reference cell names to be excluded from mapping
  • use_stored_distances – If True then distance between target and reference cells is not calculated and Nabo will try to use the stored distance matrix. (Default: False)
  • overwrite – Overwrite the target data
Returns:

None

set_parameters(use_comps: int, k: int, dist_factor: float, chunk_size: int) → None

Set run parameters for mapping

Parameters:
  • use_comps – Number of input dimensions to use. In typical usage this would mean number of PCA components starting from first.
  • k – Number of nearest neighbours to consider. This is same as k in kNN.
  • dist_factor – In a given dimension i, if the value of the target cell is x_i and the value for the reference cell is y_i then the distance will be saved only if abs(x_i - y_i)/abs(x_i) < dist_factor; otherwise the distance is given the highest value, i.e. 1.
  • chunk_size – Number of cells to load in memory in one go. For smaller RAM usage set a smaller value.
Returns:

None
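A typical mapping run, chaining the methods above, might look like this sketch (file names, group names and parameter values are hypothetical):

    mapping = nabo.Mapping(mapping_h5_fn='mapping.h5', ref_name='ref',
                           ref_pca_fn='pca_data.h5',
                           ref_pca_grp_name='ref_pca')

    # Run parameters: PCA components to use, k for the kNN search,
    # the distance factor and how many cells to load per chunk.
    mapping.set_parameters(use_comps=30, k=11, dist_factor=2, chunk_size=1000)

    # Build the reference SNN graph, then map a target sample onto it.
    mapping.make_ref_graph()
    mapping.map_target(target_name='treated',
                       target_pca_fn='target_pca.h5',
                       target_pca_grp_name='target_pca')
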

Graph

This class allows users to interact with the reference graph containing projected target cells. It is possible to visualize the graph, perform clustering on the graph and generate statistics to assess the mappings.

class nabo.Graph

Class for storing Nabo’s SNN graph. Inherits from networkx’s Graph class
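A Graph is typically populated from the HDF5 file written by the Mapping class (see load_from_h5 below). The file and sample names here follow the hypothetical mapping sketch above:

    # Load the reference graph and a mapped target sample.
    graph = nabo.Graph()
    graph.load_from_h5('mapping.h5', name='ref', kind='reference')
    graph.load_from_h5('mapping.h5', name='treated', kind='target')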

calc_contiguous_spl(nodes: List[str]) → float

Calculates the mean of shortest path lengths between consecutive nodes of the input list on the reference graph.

Parameters:nodes – List of nodes from reference sample
Returns:Mean shortest path length
calc_diff_potential(r: Dict[str, float] = None) → Dict[str, float]

Calculate differentiation potential of cells.

This function is a reimplementation of the population balance analysis (PBA) approach published in Weinreb et al. 2017, PNAS. It computes the random walk normalized Laplacian matrix of the reference graph, L_rw = I - A/D, and then calculates a Moore-Penrose pseudoinverse of L_rw. The method takes an optional but recommended parameter ‘r’ which represents the relative rates of proliferation and loss in different gene expression states (R). If not provided then a vector of ones is used. The differentiation potential is the dot product of the pseudoinverse of L_rw and R.

Parameters:r – Same as parameter R in the above reference. Should be a dictionary with each reference cell name as a key and its corresponding R value.
Returns:V (Vector potential) as dictionary. Smaller values represent less differentiated cells.
calc_modularity() → float

Calculates modularity of the reference graph. The clusters should have already been defined.

Returns:Value between 0 and 1
classify_target(target: str, weight_frac: float = 0.5, min_degree: int = 2, min_weight: float = 0.1, cluster_dict: Dict[str, int] = None, na_label: str = 'NA', ret_counts: bool = False) → dict

This classifier identifies the total weight of all the connections made by each target cell to each cluster (of reference cells). If a target cell has more than 50% (default value) of its total connection weight in one of the clusters then the target cell is labeled as being from that cluster. One useful aspect of this classifier is that it will not classify the target cell into any cluster if it fails to reach the threshold (default, 50%) for any cluster; such target cells are assigned the na_label (‘NA’ by default).

Parameters:
  • target – Name of target sample
  • weight_frac – Required minimum fraction of weight in a cluster to be classified into that cluster
  • min_degree – Minimum degree of the target node
  • min_weight – Minimum edge weight. Edges with less weight than min_weight will be ignored but will still contribute to total weight.
  • cluster_dict – Cluster labels for each reference cell. If not provided then the stored cluster information is used.
  • na_label – Label for cells that failed to get classified into any cluster
  • ret_counts – If True, then returns the number of target cells classified to each cluster, else returns the predicted cluster for each target cell
Returns:

Dictionary. Keys are target cell names and values are their predicted cluster if ret_counts is False. Otherwise, keys are cluster labels and values are the number of target cells classified to that cluster
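For example, once reference clusters are available (see make_clusters, make_leiden_clusters or import_clusters below), target cells can be assigned to them; parameter values here are illustrative:

    # Predicted reference cluster for each target cell; cells that do not
    # reach the weight threshold get the na_label.
    predicted = graph.classify_target('treated', weight_frac=0.5, min_degree=2)

    # Number of target cells classified into each cluster.
    counts = graph.classify_target('treated', ret_counts=True)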

get_cells_from_clusters(clusters: List[str], remove_suffix: bool = True) → List[str]

Get cell names for input cluster numbers

Parameters:
  • clusters – list of cluster identifiers
  • remove_suffix – Remove suffix from cell names
Returns:

List of cell names

get_cluster_identity_weights() → pandas.core.series.Series
Returns:Cluster identity weights for each cell
get_k_path_neighbours(nodes: List[str], k_dist: int, full_trail: bool = False, trail_start: int = 0) → List[str]

Get set of nodes at a given distance

Parameters:
  • nodes – Input nodes
  • k_dist – Path distance from input node
  • full_trail – If True then returns only nodes at k_dist path distance, else returns all nodes up to k_dist (default: False)
  • trail_start – If full_trail is True, then the trail starts at this path distance (default: 0).
Returns:

List of nodes

get_mapped_cells(target: str, ref_cells: str, remove_suffix: bool = True) → List[str]

Get target cells that map to a given list of reference cells.

Parameters:
  • target – Name of target sample
  • ref_cells – List of reference cell names
  • remove_suffix – If True then removes target name suffix from end of node name
Returns:

List of target cell names

get_mapping_score(target: str, min_weight: float = 0, min_score: float = 0, weighted: bool = True, by_cluster: bool = False, sorted_names_only: bool = False, top_n_only: int = None, all_nodes: bool = True, score_multiplier: int = 1000, ignore_nodes: List[str] = None, include_nodes: List[str] = None, remove_suffix: bool = False, verbose: bool = False)

Calculate a weighted/unweighted degree of incident target nodes on reference nodes.

Parameters:
  • target – Target sample name
  • min_weight – Ignore an edge if its weight is smaller than this value in the SNN graph. Only applicable if calculating a weighted mapping score
  • min_score – If the score is smaller than this value then it is reset to zero
  • weighted – Use edge weights if True
  • by_cluster – If True, then combine scores from nodes of same cluster into a list. The keys are cluster number in the output dictionary
  • sorted_names_only – If True, then return only sorted list of base cells from highest to lowest mapping score. Cells with mapping score less than min_score are not reported (default: False)
  • top_n_only – If sorted_names_only is True and an integer value is provided then this method will return top n number of nodes sorted based on score. min_score value will be ignored.
  • all_nodes – if False, then returns only nodes with non-zero score (after resetting using min_score)
  • score_multiplier – Score is multiplied by this number after normalizing for total number of target cells.
  • ignore_nodes – List of nodes from ‘target’ sample to be ignored while calculating the score (default: None).
  • include_nodes – List of target nodes from ‘target’ sample. Mapping score will be calculated ONLY for those reference cells that are connected to this subset of target cells in the graph. By default mapping score will be calculated against each target node.
  • remove_suffix – Remove suffix from cell names (default: False)
  • verbose – Prints graph stats
Returns:

Mapping score
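A sketch of scoring reference cells by how strongly the target cells map onto them (parameter values are illustrative):

    # Weighted mapping score for every reference node.
    scores = graph.get_mapping_score('treated', min_weight=0.1, weighted=True)

    # Names of the 100 highest-scoring reference cells only.
    top_cells = graph.get_mapping_score('treated', sorted_names_only=True,
                                        top_n_only=100)
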

get_mapping_specificity(target_name: str, fill_na: bool = True) → Dict[str, float]

Calculates the mapping specificity of target nodes. Mapping specificity of a target node is calculated as the mean of shortest path lengths between all pairs of mapped reference nodes.

Parameters:
  • target_name – Name of target sample
  • fill_na – if True, then nan values will be replaced with largest value (default: True)
Returns:

Dictionary with target node names as keys and mapping specificity as values

get_random_nodes(n: int) → List[str]

Get random list of nodes from reference graph.

Parameters:n – Number of nodes to return
Returns:A list of reference nodes
get_ref_specificity(target: str, target_values: Dict[str, float], incl_unmapped: bool = False) → Dict[str, float]

Calculates the average mapping specificity of all target nodes that mapped to a given reference node. Requires that the mapping specificity of target nodes has already been calculated.

Parameters:
  • target – Name of target sample
  • target_values – Mapping specificity values of target nodes
  • incl_unmapped – If True, then includes unmapped reference nodes in the dictionary with value set to 0 (Default: False)
Returns:

Dictionary with reference node names as keys and values as mean mapping specificity of their mapped target nodes

static get_score_percentile(score: Dict[str, int], p: int) → float

Get the value at a given percentile

Parameters:
  • score – Mapping score or any other dictionary where values are numbers
  • p – Percentile
Returns:

Percentile value

import_clusters(cluster_dict: Dict[str, str] = None, missing_val: str = 'NA') → None

Import cluster information for reference cells.

Parameters:
  • cluster_dict – Dictionary with cell names as keys and cluster number as values. Cluster numbers should start from 1
  • missing_val – Value used for cells with missing cluster information (Default: NA)
Returns:

None

import_clusters_from_csv(csv: str, csv_sep: str = ', ', cluster_col: int = 0, header=None, append_ref_name: bool = False)
Parameters:
  • csv – Filename containing cluster information. Make sure that the first column contains cell names and the second contains the cluster labels.
  • csv_sep – Separator for CSV file (default: ‘,’)
  • cluster_col – Column number (0 based count) where cluster info is present (Default: 0)
  • append_ref_name – Append the reference name to the cell name (Default: False)
Returns:

None

import_clusters_from_json(fn)

Import clusters from a JSON file

Parameters:fn – Input file in JSON format.
Returns:None
import_layout(pos_dict) → None

Import a graph layout by providing a dictionary with keys as node names and values as coordinate (x, y) tuples.

Parameters:pos_dict – Dictionary with keys as node names and values as 2D coordinates of nodes on the graph.
Returns:None
import_layout_from_csv(csv: str, csv_sep: str = ', ', dim_cols: tuple = (0, 1), header=None, append_ref_name: bool = False)

Import graph layout coordinates from a CSV file

Parameters:
  • csv – Filename containing layout coordinates. Make sure that the first column contains cell names and the second and third contain the x and y coordinates
  • csv_sep – Separator for CSV file (default: ‘,’)
  • append_ref_name – Append the reference name to the cell name (Default: False)
Returns:

None

import_layout_from_json(fn)
Parameters:fn – Input json file
Returns:
layout

Copies ‘pos’ attribute values (x/y coordinate tuples) from graph nodes and returns them as a dictionary.

load_from_gml(fn: str) → None

Load data from GML format file. It is critical that this graph was generated using Nabo’s Mapping class.

Parameters:fn – Full path of GML file
Returns:None
load_from_h5(fn: str, name: str, kind: str) → None

Loads a graph saved by Mapping class in HDF5 format

Parameters:
  • fn – Path to HDF5 file
  • name – Label/name of sample used in Mapping object. This function assumes that the group in HDF5 containing graph data is named: name + ‘_graph’
  • kind – Can have a value of either ‘reference’ or ‘target’. Only one sample can have kind=’reference’ for an instance of this class
Returns:

None

make_clusters(n_clusters: int) → None

Performs graph agglomerative clustering using algorithm in Newman 2004

Parameters:n_clusters – Number of clusters
Returns:None
make_leiden_clusters(resolution: float = 1.0, random_seed=4466) → None

Leiden clustering

Parameters:
  • resolution – Resolution parameter for the Leiden clustering (default: 1.0)
  • random_seed – Seed for the random number generator (default: 4466)
Returns:None
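Clusters can be generated directly on the reference graph, for example (parameter values are illustrative):

    # Agglomerative clustering into a fixed number of clusters ...
    graph.make_clusters(n_clusters=10)

    # ... or Leiden clustering, controlled by a resolution parameter.
    graph.make_leiden_clusters(resolution=1.0)

    # Save the cluster labels for later reuse (see below).
    graph.save_clusters_as_csv('ref_clusters.csv')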
save_clusters_as_csv(outfn)
Parameters:outfn – Output CSV file
Returns:
save_clusters_as_json(outfn)
Parameters:outfn – Output JSON file
Returns:
save_graph(save_name: str) → None

Save graph in GML format

Parameters:save_name – Output filename with path
Returns:None
save_layout_as_csv(out_fn)

Saves the layout in CSV format

Parameters:out_fn – Output CSV file
Returns:
save_layout_as_json(out_fn)
Parameters:out_fn – Output json file
Returns:
set_de_groups(target: str, min_score: float, node_dist: int, from_clusters: List[str] = None, full_trail: bool = False, trail_start: int = 1, stringent_control: bool = False) → None

Categorises reference nodes into either the ‘Test’, ‘Control’ or ‘Other’ group. Nodes with a mapping score higher than min_score are categorized as ‘Test’, cells at node_dist path distance from them are categorized as ‘Control’ and the rest of the nodes are categorized as ‘Other’.

Parameters:
  • target – Name of target sample whose corresponding mapping scores to be considered
  • min_score – Minimum mapping score
  • node_dist – Path distance
  • from_clusters – List of cluster numbers. ‘Test’ cells will be limited to these clusters.
  • full_trail – If True then returns only nodes at node_dist path distance, else returns all nodes up to node_dist (default: False)
  • trail_start – If full_trail is True, then the trail starts at this path distance (default: 1).
  • stringent_control – If True then control group will not contain cells that have mapping score higher than min_score
Returns:

None

set_ref_layout(niter: int = 500, verbose: bool = True, init_pos: dict = None, disable_rescaling: bool = False, outbound_attraction_distribution: bool = True, edge_weight_influence: float = 1.0, jitter_tolerance: float = 1.0, barnes_hut_optimize: bool = True, barnes_hut_theta: float = 1.2, scaling_ratio: float = 1.0, strong_gravity_mode: bool = False, gravity: float = 1.0) → None

Calculates a 2D graph layout using ForceAtlas2 algorithm. The ForceAtlas2 implementation being used here will not prevent nodes in the graph from overlapping with each other. We aim to improve this in the future.

Parameters:
  • niter – Number of iterations (default: 500)
  • verbose – Print the progress (default: True)
  • init_pos – Initial positions of nodes
  • disable_rescaling – If True then layout coordinates are not rescaled to only have non negative positions (Default: False)
  • outbound_attraction_distribution
  • edge_weight_influence
  • jitter_tolerance
  • barnes_hut_optimize
  • barnes_hut_theta
  • scaling_ratio
  • strong_gravity_mode
  • gravity
Returns:

None
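Computing and storing a layout could be sketched as follows (the iteration count is illustrative):

    # Compute a 2D ForceAtlas2 layout for the reference graph.
    graph.set_ref_layout(niter=500)

    # Persist the coordinates so the same layout can be reloaded later.
    graph.save_layout_as_csv('ref_layout.csv')
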

Marker

This module contains functions to identify marker genes (genes with significantly high expression) for sub-populations of interest.

nabo.run_de_test(dataset1: nabo._dataset.Dataset, dataset2, test_cells: List[str], control_cells: List[List[str]], test_label: str = None, control_group_labels: list = None, exp_frac_thresh: float = 0.25, log2_fc_thresh: float = 1, qval_thresh: float = 0.05, tqdm_msg: str = '') → pandas.core.frame.DataFrame

Identifies differentially expressed genes using Mann Whitney U test.

Parameters:
  • dataset1 – nabo.Dataset instance
  • dataset2 – nabo.Dataset instance or None
  • test_cells – list of cells for which markers have to be found. These could be cells from a cluster, cells with a high mapping score, etc.
  • control_cells – List of cell groups against which markers need to be found. This could be just one group of cells or multiple groups of cells.
  • test_label – Label for test cells.
  • control_group_labels – Labels of control cell groups
  • exp_frac_thresh – Fraction of cells that should have a non zero value for a gene.
  • log2_fc_thresh – Threshold for log2 fold change
  • qval_thresh – Threshold for adjusted p value
  • tqdm_msg – Message to print while displaying progress
Returns:

pd.Dataframe
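A sketch of a direct call, where test_cells and control_cells are hypothetical lists of cell names (for example the ‘Test’ and ‘Control’ groups produced by Graph.set_de_groups):

    # Genes up-regulated in the test cells relative to one control group.
    de_table = nabo.run_de_test(dataset1=ds, dataset2=None,
                                test_cells=test_cells,
                                control_cells=[control_cells],
                                exp_frac_thresh=0.25,
                                log2_fc_thresh=1, qval_thresh=0.05)
    print(de_table.head())
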

nabo.find_cluster_markers(clusters: dict, dataset: nabo._dataset.Dataset, de_frequency: int, exp_frac_thresh: float = 0.25, log2_fc_thresh: float = 0.5, qval_thresh: float = 0.05) → (pandas.core.frame.DataFrame, Dict[int, List[str]])

Identifies marker genes for each cluster in a Graph. This function works as a wrapper for run_de_test.

Parameters:
  • clusters – Dictionary with cell names as keys and cluster labels as values
  • dataset – nabo.Dataset
  • de_frequency – Minimum number of clusters against which a gene should be significantly differentially expressed for it to qualify as a marker
  • exp_frac_thresh – Fraction of cells that should have a non zero value for a gene.
  • log2_fc_thresh – Threshold for log2 fold change
  • qval_thresh – Threshold for adjusted p value
Returns:

A tuple where first element is a pandas DataFrame and second element is a dictionary where keys are cluster numbers and values are lists of marker genes for the corresponding clusters
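A sketch of cluster-wise marker detection, where cluster_dict is a hypothetical dictionary mapping cell names to cluster labels (for instance derived from a clustered Graph):

    # Markers must be differentially expressed against at least 5 other clusters.
    de_table, markers = nabo.find_cluster_markers(clusters=cluster_dict,
                                                  dataset=ds, de_frequency=5)

    # markers maps each cluster number to its list of marker genes.
    print(markers[1][:10])
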

GraphPlot

This class allows a highly customized graph visualization. It is created to work seamlessly with Graph class instances. When called it will, by default, automatically produce the graph visualization. This class requires that the set_ref_layout method has been called on the Graph object.

class nabo.GraphPlot(g: nabo._graph.Graph, only_ref=True, vc='steelblue', cmap=None, vc_attr=None, vc_default='grey', vec=None, vc_min=None, vc_max=None, vc_percent_trim=None, max_ncolors=40, vs=2, vs_scale=15, vs_min=None, vs_max=None, vs_percent_trim=0, vlw=0, v_alpha=0.6, v_zorder=None, draw_edges: str = 'all', ec='k', elw=0.1, e_alpha=0.1, bundle_edges: bool = False, bundle_bw: float = 0.1, bundle_decay: float = 0.7, edge_min_weight: float = 0, texts=None, texts_fs=20, title=None, title_fs=30, label_attr=None, label_attr_type='centroid', label_attr_pos=(1, 1), label_attr_space=0.05, label_attr_fs=16, rasterized=True, save_name=None, dpi=300, fig_size=(5, 5), show_fig=True, remove_axes=True, ax=None, verbose=False)

Class for customized Graph drawing

Parameters:
  • g – A Graph class instance from Nabo
  • only_ref – Only the reference graph is drawn if True (default: True)
  • vc – vertex colour. Can be a valid matplotlib color string, or a dictionary with node names as keys and values as matplotlib color strings/floats/RGB tuples. If floats, then the colour will be selected from the colormap. This parameter is overridden by vc_attr.
  • cmap – A valid matplotlib colour map
  • vc_attr – Name of graph attribute to be used for colors. Attribute values should either be floats or ints
  • vc_default – Default color of a node. Should be either a valid matplotlib string or RGB tuple.
  • vc_min – Minimum value for vertex colour. If value is less than this threshold, then value will be reset to this threshold.
  • vc_max – Maximum value for vertex colour. If a value is greater than this threshold, then it will be reset to this threshold.
  • vc_percent_trim – Percentage of values to be ceiled or floored. This will set vc_min and vc_max values based on percentiles. For example, setting it to 1 will cause the lowest 1% of values to be reset to the next largest value, and values larger than the 99th percentile (100-1) to be reset to the 99th percentile.
  • max_ncolors – Maximum number of colours to use
  • vs – Vertex size. Should be an integer or float value to set the size of all nodes, or a dictionary with node names as keys and values as either float or int.
  • vs_scale – Multiplier for vs
  • vs_min – Same as vc_min but for vertex size
  • vs_max – Same as vc_max but for vertex size
  • vs_percent_trim – Same as vc_percent_trim but for vertex size
  • vlw – Vertex line width
  • v_alpha – Transparency/alpha value for vertices. Should be between 0 and 1
  • draw_edges – Can be either: ‘all’, ‘ref’, ‘target’, ‘none’ ( Default: ‘all’)
  • ec – Edge colour
  • elw – Edge line width
  • e_alpha – Edge transparency/alpha value
  • texts – Text to be placed on the graph. Should be a dictionary with keys as texts and values as tuple for xy coordinates
  • texts_fs – Font size for texts
  • title – Title for the Graph
  • title_fs – Title font size
  • label_attr – Node attribute to use to retrieve labels
  • label_attr_type – Can be either ‘legend’ or ‘centroid’
  • label_attr_pos – Tuple for xy coords to position start of labels. Only used when label_attr_type is ‘centroid’
  • label_attr_space – Spacing between labels. Only used when label_attr_type is ‘centroid’
  • label_attr_fs – Label font size
  • rasterized – If True, then rasterize the scatter points
  • save_name – File name for saving figure
  • fig_size – Figure size. Should be a tuple (width, height)
  • show_fig – If True then show figure
  • remove_axes – Remove axis and ticklabels if set to True (Default: True)
  • ax – Matplotlib axis. Draws on this axis rather than create new.
  • verbose – If True, then prints messages.
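
A minimal plotting sketch, colouring and sizing reference nodes by the mapping-score dictionary from the earlier Graph example (all values are illustrative):

    import matplotlib.pyplot as plt

    # Draw only the reference graph; nodes are coloured and sized by mapping score.
    nabo.GraphPlot(graph, only_ref=True,
                   vc=scores, cmap=plt.cm.viridis, vc_percent_trim=1,
                   vs=scores, vs_percent_trim=1,
                   title='Mapping of the treated sample',
                   save_name='mapping_scores.svg')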