API¶

Dataset¶

This is the primary class for processing the data once it is saved in HDF5 format. It allows filtering of cells and genes, normalization and scaling of data, identification of highly variable genes and dimensionality reduction using PCA. This class has been designed to work seamlessly with the HDF5 format such that only a minimal portion of the data is ever loaded in memory. The methods associated with this class have also been designed so that external data can be easily plugged in at any stage. For example, users can bring in normalization factors (cell size factors), a list of highly variable genes, etc.

class nabo.Dataset(h5_fn: h5py._hl.files.File, mito_patterns: List[str] = None, ribo_patterns: List[str] = None, force_recalc: bool = False)¶

Class for performing filtering, normalization and dimensionality reduction.

Parameters:

h5_fn – Path to input HDF5 file
mito_patterns – Pattern to grep mitochondrial gene names
ribo_patterns – Pattern to grep ribosomal gene names
force_recalc – If set to True then all the saved data from previous instance of this class will be deleted

correct_var(n_bins: int = 100, lowess_frac: float = 0.4) → None¶

Removes the mean-variance trend in the dataset and adds the corrected variance as a ‘fixed_var’ column to geneStats.

Parameters:	n_bins – Number of bins for expression values. Larger number of bins will provide a better fit for outliers but may also result in overfitting. lowess_frac – value for parameter frac in statsmodels’ lowess function
Returns:	None

dump_hvgs(hvgs: List[str]) → None¶

Save HVGs to HDF5 file

Parameters:	hvgs – List of highly variable genes to save in the HDF5 file
Returns:

export_as_dataframe(genes: List[str], normalized: bool = True, clr_normed: bool = False, clr_axis: int = 0) → pandas.core.frame.DataFrame¶

Export data for given genes. Data is exported only for cells that are present in keepCellsIdx attribute.

Parameters:

genes – Genes to be exported
normalized – Perform library size normalization (default: True)
clr_normed – Perform CLR normalization (default: False)

Returns:

Pandas dataframe with cells as rows and genes as columns
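
The clr_normed option refers to centred log-ratio (CLR) normalization. The following is a minimal sketch of one common CLR definition, not Nabo's own code: the `clr` helper and the pseudocount of 1 are assumptions made for illustration.

```python
import math


def clr(values, pseudocount=1.0):
    """Centred log-ratio: log of each value minus the mean of the logs.

    The pseudocount (an assumption here) avoids taking the log of zero.
    """
    logs = [math.log(v + pseudocount) for v in values]
    mean_log = sum(logs) / len(logs)
    return [log_value - mean_log for log_value in logs]


# Hypothetical expression values of three genes in one cell.
clr_values = clr([0.0, 9.0, 99.0])
print(clr_values)
```

By construction the transformed values are centred, so they sum to zero (up to floating point error).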

filter_data(min_exp: int = 1000, max_exp: int = inf, min_ngenes: int = 100, max_ngenes: int = inf, min_mito: int = -1, max_mito: int = 101, min_ribo: int = -1, max_ribo: int = 101, min_gene_abundance: int = 10, rm_mito: bool = True, rm_ribo: bool = True, verbose: bool = True) → None¶

Filter cells and genes

Parameters:

min_exp – Minimum total expression value for each cell (if count data then these would be minimum number of reads or UMI per cell)
max_exp – Maximum total expression value for each cell.
min_ngenes – Minimum number of genes expressed per cell.
max_ngenes – Maximum number of genes expressed per cell.
min_mito – Minimum percentage of mitochondrial genes’ expression
max_mito – Maximum percentage of mitochondrial genes’ expression
min_ribo – Minimum percentage of ribosomal genes’ expression
max_ribo – Maximum percentage of ribosomal genes’ expression
min_gene_abundance – Minimum total expression of gene
rm_mito – if True, exclude mitochondrial genes (default: True)
rm_ribo – if True, exclude ribosomal genes (default: True)
verbose – if True then report the number of cells/genes removed using each cutoff (default: True)

Returns:

None
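
The cell-level cutoffs listed above amount to a set of range checks on per-cell summary statistics. The sketch below illustrates that idea in plain Python; it is not Nabo's implementation. The `cells` dictionary and the `passes_filters` helper are hypothetical, and treating the bounds as inclusive is an assumption.

```python
import math

# Hypothetical per-cell summary statistics (not read from a real HDF5 file).
cells = {
    "cell_a": {"total_exp": 5200, "n_genes": 1800, "pct_mito": 3.5, "pct_ribo": 12.0},
    "cell_b": {"total_exp": 400, "n_genes": 90, "pct_mito": 1.0, "pct_ribo": 8.0},
    "cell_c": {"total_exp": 7100, "n_genes": 2500, "pct_mito": 35.0, "pct_ribo": 10.0},
}


def passes_filters(stats, min_exp=1000, max_exp=math.inf,
                   min_ngenes=100, max_ngenes=math.inf,
                   min_mito=-1, max_mito=101, min_ribo=-1, max_ribo=101):
    """Return True if a cell lies inside every cutoff range (inclusive bounds assumed)."""
    return (min_exp <= stats["total_exp"] <= max_exp
            and min_ngenes <= stats["n_genes"] <= max_ngenes
            and min_mito <= stats["pct_mito"] <= max_mito
            and min_ribo <= stats["pct_ribo"] <= max_ribo)


# With the default cutoffs only the low-coverage cell is dropped.
kept_default = [name for name, s in cells.items() if passes_filters(s)]
# A stricter mitochondrial cutoff also drops the high-mito cell.
kept_strict = [name for name, s in cells.items() if passes_filters(s, max_mito=20)]
print(kept_default)
print(kept_strict)
```

The default values of -1 and 101 for the percentage cutoffs mean that no cell is excluded on mitochondrial or ribosomal content unless the user tightens them.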

find_hvgs(var_min_thresh: float = None, nzm_min_thresh: float = None, var_max_thresh: float = inf, nzm_max_thresh: float = inf, min_cells: int = 0, plot: bool = True, use_corrected_var: bool = False, update_cache: bool = False) → None¶

Identifies highly variable genes using cutoff provided for corrected variance and non-zero mean expression. Saves the result in attribute hvgList.

NOTE: Input threshold values are considered to be in log scale if use_corrected_var is True.

Parameters:

var_min_thresh – Minimum corrected variance
nzm_min_thresh – Minimum non-zero mean
var_max_thresh – Maximum corrected variance
nzm_max_thresh – Maximum non-zero mean
min_cells – Minimum number of cells where a gene should have non-zero value
plot – if True then scatter plots of mean and variance are displayed, highlighting in blue the genes that were selected as HVGs.
use_corrected_var – if True then uses corrected variance (default: False)
update_cache – If True then dump the HVG list to the HDF5 file (default: False)

Returns:

None
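
Selecting HVGs with these thresholds can be thought of as keeping genes whose variance and non-zero mean fall inside the given windows. A minimal sketch under stated assumptions: the `gene_stats` values and the `select_hvgs` helper are hypothetical, and the use of strict inequalities is a guess rather than Nabo's documented behaviour.

```python
import math

# Hypothetical per-gene statistics; in Nabo these would come from geneStats.
gene_stats = {
    "GENE1": {"var": 2.5, "nzm": 1.2, "n_cells": 40},
    "GENE2": {"var": 0.1, "nzm": 3.0, "n_cells": 200},
    "GENE3": {"var": 4.0, "nzm": 0.05, "n_cells": 3},
    "GENE4": {"var": 1.8, "nzm": 0.9, "n_cells": 75},
}


def select_hvgs(stats, var_min_thresh, nzm_min_thresh,
                var_max_thresh=math.inf, nzm_max_thresh=math.inf, min_cells=0):
    """Keep genes whose variance and non-zero mean lie inside the thresholds."""
    return sorted(
        gene for gene, s in stats.items()
        if var_min_thresh < s["var"] < var_max_thresh
        and nzm_min_thresh < s["nzm"] < nzm_max_thresh
        and s["n_cells"] >= min_cells
    )


hvgs = select_hvgs(gene_stats, var_min_thresh=1.0, nzm_min_thresh=0.1, min_cells=10)
print(hvgs)
```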

fit_ipca(genes: List[str], n_comps: int = 100, batch_size: int = None, disable_tqdm: bool = False) → None¶

Fit PCA with genes of interest. The fitted PCA object is saved as instance attribute ipca. This function uses scikit-learn’s incremental PCA.

Parameters:	genes – List of genes to use to fit PCA n_comps – Number of components to use batch_size – Number of cells to use for fitting in a batch disable_tqdm – if True, progress will not be displayed ( default: False)
Returns:	None

get_cell_lib(hto_patterns: List[str], min_ratio: float = 0.5) → pandas.core.series.Series¶

Get cell library based on hashtag ratios

Parameters:

hto_patterns – Pattern to search for hashtags
min_ratio – Minimum ratio (default: 0.5)

Returns:

Pandas series

get_cum_exp(genes: List[str], report_missing: bool = False) → numpy.ndarray¶

Calculates cumulative expression of provided genes for each cell.

Parameters:	genes – List of gene names report_missing – if True then will print a warning message if gene is not found otherwise will remain silent ( default: False)
Returns:	numpy.ndarray of shape rawNCells with each element as total expression value of param genes in a cell.

get_gene_abundance() → numpy.ndarray¶

Returns:	ndarray where each element is the number of cells where a gene is expressed. Order is same as in attribute genes.

get_genes_by_pattern(patterns: List[str]) → List[str]¶

Get names of genes that match the pattern

Parameters:	patterns – List of regex patterns
Returns:	List of gene names matching the pattern
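
Pattern-based gene selection of this kind can be sketched with Python's re module. This is an illustration only: the `genes_matching` helper and the example patterns are hypothetical, and whether Nabo anchors patterns at the start of the name or searches anywhere in it is an assumption.

```python
import re
from typing import List


def genes_matching(genes: List[str], patterns: List[str]) -> List[str]:
    """Return the genes that match at least one regex pattern."""
    compiled = [re.compile(p) for p in patterns]
    return [g for g in genes if any(c.search(g) for c in compiled)]


gene_names = ["MT-CO1", "MT-ND1", "RPS6", "RPL13A", "ACTB", "MRPL12"]
mito = genes_matching(gene_names, ["^MT-"])
ribo = genes_matching(gene_names, ["^RPS", "^RPL"])
print(mito)
print(ribo)
```

Anchoring the patterns with ^ keeps MRPL12 out of the ribosomal list even though its name contains "RPL".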

get_genes_per_cell() → numpy.ndarray¶

Returns:	ndarray where each element is number of expressed genes from a cell. Order is same as in attribute cells.

get_lvgs(nzm_cutoff: float = None, log_nzm_cutoff: float = None, n: int = None, use_corrected_var: bool = False) → list¶

Get names of genes with the least corrected variance.

Parameters:

nzm_cutoff – Minimum non-zero mean values for returned genes
log_nzm_cutoff – Minimum non-zero mean values (log scale) for returned genes
n – Number of genes to return (default: same number as HVGs)
use_corrected_var – if True then uses corrected variance (default: False)

Returns:

A list of gene names

get_norm_exp(gene: str, as_dict: bool = False, key_suffix: str = '', only_valid_cells: bool = False)¶

Get normalized expression of a gene across all the cells.

Parameters:

gene – Valid name of a gene. Gene name will be converted to upper case.
as_dict – if True, returns a dictionary with cell names as keys and normalized expression as values; otherwise returns a list of normalized expression values. The order of values is the same as the cell names in attribute cells (default: False)
key_suffix – A character/string to append to each cell name ( default: ‘’)
only_valid_cells – If True then returns only valid cells i.e. cells removed during filtering step are not included. (default: False)

Returns:

get_scaled_values(scaling_params: pandas.core.frame.DataFrame, tqdm_msg: str = '', disable_tqdm: bool = False, fill_missing: bool = False) → Generator[Tuple[str, numpy.ndarray], None, bool]¶

Generator that yields cell wise scaled expression values. The yielded vector will have genes in same order as provided in the input list.

Parameters:

scaling_params – Scaling parameters i.e. mean and std. dev. for each genes in form of a pandas DataFrame. This can be obtained from get_scaling_params method of Dataset.
tqdm_msg – Message for progress bar (default: ‘’)
disable_tqdm – if True, progress will not be displayed (default: False)
fill_missing – If True, then gene names in scaling_params that are not present in the Dataset are assigned 0 value. (Default: False, raises error if gene name not found). Using this parameter is not recommended as of now because the implications of doing this have not been thoroughly tested.

Returns:

(cell name, scaled values)
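
The generator pattern described above can be sketched as follows. This is not Nabo's code: the `scaled_values` function and the plain-dictionary inputs are hypothetical stand-ins for the HDF5-backed data and the pandas DataFrame, and scaling as (value - mean) / standard deviation is an assumption based on the description of scaling_params.

```python
from typing import Dict, Iterator, List, Tuple


def scaled_values(expression: Dict[str, Dict[str, float]],
                  scaling_params: Dict[str, Tuple[float, float]]
                  ) -> Iterator[Tuple[str, List[float]]]:
    """Yield (cell name, scaled values), one cell at a time.

    Genes are emitted in the order in which they appear in scaling_params.
    """
    genes = list(scaling_params)
    for cell, values in expression.items():
        scaled = [(values[g] - scaling_params[g][0]) / scaling_params[g][1]
                  for g in genes]
        yield cell, scaled


# Hypothetical normalized expression values and (mean, std. dev.) per gene.
expression = {
    "cell_a": {"GENE1": 4.0, "GENE2": 1.0},
    "cell_b": {"GENE1": 2.0, "GENE2": 5.0},
}
params = {"GENE1": (3.0, 1.0), "GENE2": (3.0, 2.0)}
result = dict(scaled_values(expression, params))
print(result)
```

Yielding one cell at a time is what keeps memory use low: the caller never needs the full scaled matrix in memory.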

get_scaling_params(genes: List[str] = None, only_valid: bool = True) → pandas.core.frame.DataFrame¶

Get genes’ mean and standard deviation (uncorrected)

Parameters:	genes – Name of genes whose parameters are required. Returns every gene’s parameter if value is None (default: None) only_valid – if True, include only valid genes (default: True)
Returns:	A pandas DataFrame contains columns ‘mu’ and ‘sigma’

get_total_exp_per_cell() → numpy.ndarray¶

Returns:	ndarray where each element is total expression values/counts from a cell. Order is same as in attribute cells.

plot_filtered(color: str = 'coral', display_stats: bool = True, savename: str = None, showfig: bool = True) → None¶

Plot total expression, genes/cell, % mitochondrial expression and % ribosomal expression for each cell from filtered data

Parameters:	color – display_stats – savename – showfig –
Returns:	None

plot_raw(color: str = 'skyblue', display_stats: bool = True, savename: str = None, showfig: bool = True) → None¶

Plot total expression, genes/cell, % mitochondrial expression and % ribosomal expression for each cell from raw data

Returns:	None

remove_cells(cell_names: List[str], update_cache: bool = False, verbose: bool = False) → None¶

Remove list of cells by providing their names. Note that no data is actually deleted from the dataset but just the keepCellsIdx attribute is modified.

Parameters:	cell_names – List of cell names to remove verbose – Print message about number of cells removed ( Default: False) update_cache – If True then the ‘keep_cells_idx’ dataset in the H5 file is updated. This will override the saved list of cells (keepCellsIdx) when the dataset is loaded in the future.
Returns:

remove_genes(gene_names: List[str], update_cache: bool = False, verbose: bool = False) → None¶

Remove genes by providing their names. Note that no data is actually deleted from the dataset but just the keepGenesIdx attribute is modified.

Parameters:

gene_names – List of gene names to remove
verbose – Print message about number of genes removed (default: False)
update_cache – If True then the ‘keep_genes_idx’ dataset in the H5 file is updated. This will override the saved list of genes (keepGenesIdx) when the dataset is loaded in the future.

Returns:

set_gene_stats() → None¶

Calculates the gene-wise expression mean and variance values. Sets geneStats attribute as a pandas DataFrame of shape ( nGeneEntries, 4)

Returns:	None

set_sf(sf: Dict[str, float] = None, size_scale: float = 1000.0, all_genes: bool = False) → None¶

Set size factor for each cell. Updates sf attribute

Parameters:

sf – size factor dict with keys same as the dataset names in ‘cell_data’ group of H5 file and values as size factor for the corresponding cell to be used for normalization.
size_scale – Values are scaled to this factor after normalization using size factor (default: 1000, set to None to disable size scaling).
all_genes – Use expression values from all genes, even those which were filtered out, to calculate size factor.

Returns:

None
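
Size-factor normalization followed by size scaling can be illustrated with a short sketch. The arithmetic below is an assumption about how the two parameters combine, not a transcription of Nabo's code; the `normalize_cell` helper is hypothetical, and using a cell's total expression as its default size factor is likewise assumed.

```python
def normalize_cell(counts, size_factor=None, size_scale=1000.0):
    """Divide raw values by the cell's size factor, then multiply by size_scale.

    If no size factor is given, the cell's total expression is used (assumption).
    Passing size_scale=None disables the final scaling step.
    """
    if size_factor is None:
        size_factor = sum(counts.values())
    scale = 1.0 if size_scale is None else size_scale
    return {gene: value * scale / size_factor for gene, value in counts.items()}


# Hypothetical raw counts for one cell.
raw = {"GENE1": 30, "GENE2": 10, "GENE3": 60}
norm = normalize_cell(raw)
print(norm)
```

With the default size_scale, each normalized cell sums to 1000 under this sketch, which makes values comparable between cells of different sequencing depth.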

transform_pca(out_file: str, pca_group_name: str, transformer, scaling_params: pandas.core.frame.DataFrame, disable_tqdm: bool = False, fill_missing: bool = False) → None¶

Transforms values into PCA space and saves them in HDF5 format

Parameters:

out_file – Name of HDF5 file for output
pca_group_name – Name of HDF5 group wherein PCA transformed values will be written. If the group exists then it will be deleted
transformer – sklearn’s incremental PCA instance on which fit function has already been called
scaling_params – a DataFrame as returned by get_scaling_params method of reference sample
disable_tqdm – if True, progress will not be displayed (default: False)
fill_missing – If True, then gene names in scaling_params that are not present in the Dataset are assigned 0 value. (Default: False, raises error if gene name not found). Using this parameter is not recommended as of now because the implications of doing this have not been thoroughly tested.

Returns:

None

update_exp_suffix(suffix: str, delimiter: str = '_') → None¶

Set the suffix for cell names obtained from exp attribute

Parameters:	suffix – Add this string to end of cell names delimiter – delimiter character/string separating cell name from suffix
Returns:

Mapping¶

This is the core class that performs reference graph building by calculating Euclidean distances between cells and then identifying shared nearest neighbours among cells. Cells from any number of target samples can then be mapped over this graph by calculating reference-target cell distances. All the results are saved as a HDF5 file of user’s choice.

class nabo.Mapping(mapping_h5_fn: str, ref_name: str, ref_pca_fn: str, ref_pca_grp_name: str, overwrite: bool = False)¶

This class encapsulates functions required to perform cell mapping.

Parameters:

mapping_h5_fn – Output filename. Results will be written to this HDF5 file. If the file already exists then it may be used to read existing data
ref_name – Label for reference samples.
ref_pca_fn – Path to HDF5 file that contains input data. Ideally this should be the transformed PCA data file generated by Nabo’s Dataset class.
ref_pca_grp_name – Name of the group in the input HDF5 file wherein the data exists
overwrite – Deletes all the data saved in the mapping_h5_fn to start from scratch (default: False)

calc_dist(target_fn: str, target_grp: str, dist_grp: str, sorted_dist_grp: str, ignore_ref_cells: List[str]) → None¶

Calculates the Euclidean distance between each pair of reference cells, or a modified Canberra distance between each pair of reference and target cells.

Parameters:

target_fn – input HDF5 file for target sample. If the distances are being calculated for reference sample then this is reference file name
target_grp – group name within HDF5 file wherein data is located
dist_grp – Name of output group name where distances will be saved
sorted_dist_grp – Name of group where distance sorted cell indices will be saved
ignore_ref_cells – List of names of reference cells to which distance should not be calculated.

Returns:

None

calc_snn(target_sorted_dist_grp: str, target_name: str, graph_grp: str, fix_graph_attempts: int = 5, fix_weight: float = None) → None¶

Creates a shared nearest neighbour graph based on distances calculated by ‘calc_dist’ method.

Parameters:

target_sorted_dist_grp – Name of HDF5 group wherein the indices of distance sorted cells are saved.
target_name – A label for target sample. This will be appended to each target cell name in the graph
graph_grp – Name of output group where graph will be saved
fix_graph_attempts – Number of attempts to connect a disconnected graph. This parameter will soon be removed.
fix_weight – Weight of edges used to connect disconnected components of graph (default: 0.5/(2*(k-1))-0.5)

Returns:

None
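
The idea behind a shared nearest neighbour (SNN) graph is that two cells are connected when their k-nearest-neighbour lists overlap. The sketch below shows that idea only; it does not reproduce Nabo's edge weighting or its graph-fixing step. The `snn_edges` helper, the `knn` lists and the use of the raw overlap count as edge weight are all assumptions.

```python
from itertools import combinations


def snn_edges(knn, min_shared=1):
    """Connect two cells if their neighbour lists share at least min_shared cells.

    Returns a dict mapping (cell_a, cell_b) to the number of shared neighbours.
    """
    edges = {}
    for a, b in combinations(sorted(knn), 2):
        shared = len(set(knn[a]) & set(knn[b]))
        if shared >= min_shared:
            edges[(a, b)] = shared
    return edges


# Hypothetical nearest-neighbour lists (k = 2) for five cells.
knn = {
    "c1": ["c2", "c3"],
    "c2": ["c1", "c3"],
    "c3": ["c1", "c2"],
    "c4": ["c3", "c5"],
    "c5": ["c4", "c3"],
}
edges = snn_edges(knn)
print(edges)
```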

make_ref_graph(use_stored_distances: bool = False)¶

A wrapper to run calc_dist and calc_snn methods creating the reference graph

Param:	use_stored_distances: If True then distance between reference cells is not calculated and Nabo will try to use the stored distance matrix. (Default: False)
Returns:	None

map_target(target_name: str, target_pca_fn: str, target_pca_grp_name: str, ignore_ref_cells: List[str] = None, use_stored_distances: bool = False, overwrite: bool = False) → None¶

A wrapper to run calc_dist and calc_snn function for mapping target cells onto reference graph. If same target name is provided twice then the data is overwritten.

Parameters:

target_name – Label/name of target sample
target_pca_fn – Filename of input data for target. In typical usage this would be the HDF5 file generated using Nabo’s Dataset class.
target_pca_grp_name – Name of group containing data in HDF5 file
ignore_ref_cells – List of reference cell names to be excluded from mapping
use_stored_distances – If True then distance between target and reference cells is not calculated and Nabo will try to use the stored distance matrix. (Default: False)
overwrite – Overwrite the target data

Returns:

None

set_parameters(use_comps: int, k: int, dist_factor: float, chunk_size: int) → None¶

Set run parameters for mapping

Parameters:

use_comps – Number of input dimensions to use. In typical usage this would mean number of PCA components starting from first.
k – Number of nearest neighbours to consider. This is same as k in kNN.
dist_factor – In a given dimension i, if the value of the target cell is x_i and the value of the reference cell is y_i, then the distance will be saved only if abs(x_i - y_i)/abs(x_i) < dist_factor; otherwise the distance is given the highest value, i.e. 1.
chunk_size – Number of cells to load in memory in one go. For smaller RAM usage set a smaller value.

Returns:

None
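
The dist_factor rule can be illustrated with a small sketch of a Canberra-style distance. Only the per-dimension rule is taken from the parameter description above; the `modified_canberra` function, the standard Canberra term abs(x_i - y_i)/(abs(x_i) + abs(y_i)) and the summation over dimensions are assumptions about how the pieces fit together.

```python
def modified_canberra(x, y, dist_factor):
    """Sum per-dimension Canberra terms, capping a dimension at 1 when the
    relative difference abs(x_i - y_i) / abs(x_i) is not below dist_factor."""
    total = 0.0
    for xi, yi in zip(x, y):
        denom = abs(xi) + abs(yi)
        if denom == 0:
            continue  # both values are zero: the term contributes nothing
        if xi != 0 and abs(xi - yi) / abs(xi) < dist_factor:
            total += abs(xi - yi) / denom
        else:
            total += 1.0  # highest possible value for a single dimension
    return total


# Hypothetical PCA coordinates of a target cell and a reference cell.
target = [2.0, 4.0, 1.0]
reference = [1.0, 4.0, 5.0]
d = modified_canberra(target, reference, dist_factor=2.0)
print(d)
```

In this example the first dimension contributes 1/3, the second contributes 0, and the third is capped at 1 because its relative difference (4.0) is not below dist_factor.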

Graph¶

This class allows users to interact with the reference graph containing projected target cells. It is possible to visualize the graph, perform clustering on the graph and generate statistics to assess the mappings.

class nabo.Graph¶

Class for storing Nabo’s SNN graph. Inherits from networkx’s Graph class

calc_contiguous_spl(nodes: List[str]) → float¶

Calculates mean of shortest path lengths between subsequent nodes provided in the input list in reference graph.

Parameters:	nodes – List of nodes from reference sample
Returns:	Mean shortest path length
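
The quantity described here can be sketched with a breadth-first search on a plain adjacency dictionary. This is an illustration rather than Nabo's implementation: the helpers are hypothetical and path length is counted in unweighted hops, which is an assumption.

```python
from collections import deque


def shortest_path_length(adj, source, dest):
    """Number of edges on the shortest path between two nodes (BFS)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == dest:
            return dist[node]
        for nbr in adj[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    raise ValueError(f"no path between {source} and {dest}")


def contiguous_spl(adj, nodes):
    """Mean shortest path length between subsequent nodes in the list."""
    lengths = [shortest_path_length(adj, a, b) for a, b in zip(nodes, nodes[1:])]
    return sum(lengths) / len(lengths)


# Hypothetical reference graph: a simple chain a - b - c - d.
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
mean_spl = contiguous_spl(adj, ["a", "c", "d"])
print(mean_spl)
```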

calc_diff_potential(r: Dict[str, float] = None) → Dict[str, float]¶

Calculate differentiation potential of cells.

This function is a reimplementation of population balance analysis (PBA) approach published in Weinreb et al. 2017, PNAS. This function computes the random walk normalized Laplacian matrix of the reference graph, L_rw = I-A/D and then calculates a Moore-Penrose pseudoinverse of L_rw. The method takes an optional but recommended parameter ‘r’ which represents the relative rates of proliferation and loss in different gene expression states (R). If not provided then a vector with ones is used. The differentiation potential is the dot product of inverse L_rw and R

Parameters:	r – Same as parameter R in the above said reference. Should be a dictionary with each reference cell name as a key and its corresponding R values.
Returns:	V (Vector potential) as dictionary. Smaller values represent less differentiated cells.

calc_modularity() → float¶

Calculates modularity of the reference graph. The clusters should have already been defined.

Returns:	Value between 0 and 1

classify_target(target: str, weight_frac: float = 0.5, min_degree: int = 2, min_weight: float = 0.1, cluster_dict: Dict[str, int] = None, na_label: str = 'NA', ret_counts: bool = False) → dict¶

This classifier identifies the total weight of all the connections made by each target cell to each cluster (of reference cells). If a target cell has more than 50% (default value) of its total connection weight in one of the clusters then the target cell is labeled as belonging to that cluster. One useful aspect of this classifier is that it will not assign a target cell to any cluster if the cell fails to reach the threshold (default: 50%) for every cluster; such target cells are labeled with na_label (‘NA’ by default).

Parameters:

target – Name of target sample
weight_frac – Required minimum fraction of weight in a cluster to be classified into that cluster
min_degree – Minimum degree of the target node
min_weight – Minimum edge weight. Edges with less weight than min_weight will be ignored but will still contribute to total weight.
cluster_dict – Cluster labels for each reference cell. If not provided then the stored cluster information is used.
na_label – Label for cells that failed to get classified into any cluster
ret_counts – If True, then returns number of target cells classified to each cluster, else returns predicted cluster for each target cell

Returns:

Dictionary. If ret_counts is False, keys are target cell names and values are their predicted clusters. Otherwise, keys are cluster labels and values are the number of target cells classified to that cluster
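
The classification rule can be sketched for a single target cell as follows. This is a reading of the description above, not Nabo's source: the `classify` helper and the edge-list input format are hypothetical.

```python
from collections import defaultdict


def classify(edges, cluster_dict, weight_frac=0.5, min_degree=2,
             min_weight=0.1, na_label="NA"):
    """Assign one target cell to a cluster based on its edge weights.

    edges: list of (reference cell, edge weight) pairs for the target cell.
    """
    if len(edges) < min_degree:
        return na_label
    total = sum(weight for _, weight in edges)
    if total == 0:
        return na_label
    per_cluster = defaultdict(float)
    for ref, weight in edges:
        # Weak edges are skipped here but still count towards the total.
        if weight >= min_weight:
            per_cluster[cluster_dict[ref]] += weight
    for cluster, weight in per_cluster.items():
        if weight / total > weight_frac:
            return cluster
    return na_label


# Hypothetical cluster labels of three reference cells.
clusters = {"r1": 1, "r2": 1, "r3": 2}
label_a = classify([("r1", 0.5), ("r2", 0.25), ("r3", 0.25)], clusters)
label_b = classify([("r1", 0.5), ("r3", 0.5)], clusters)
label_c = classify([("r3", 1.0)], clusters)
print(label_a, label_b, label_c)
```

The second cell splits its weight evenly between two clusters and the third has too few edges, so neither is classified.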

get_cells_from_clusters(clusters: List[str], remove_suffix: bool = True) → List[str]¶

Get cell names for input cluster numbers

Parameters:	clusters – list of cluster identifiers remove_suffix – Remove suffix from cell names
Returns:	List of cell names

get_cluster_identity_weights() → pandas.core.series.Series¶

Returns:	Cluster identity weights for each cell

get_k_path_neighbours(nodes: List[str], k_dist: int, full_trail: bool = False, trail_start: int = 0) → List[str]¶

Get set of nodes at a given distance

Parameters:

nodes – Input nodes
k_dist – Path distance from input node
full_trail – If True then returns only nodes at k_dist path distance else return all nodes up to k_dist (default: False)
trail_start – If full_trail is True, then the trail starts at this path distance (default: 0).

Returns:

List of nodes

get_mapped_cells(target: str, ref_cells: str, remove_suffix: bool = True) → List[str]¶

Get target cells that map to a given list of reference cells.

Parameters:	target – Name of target sample ref_cells – List of reference cell names remove_suffix – If True then removes target name suffix from end of node name
Returns:	List of target cell names

get_mapping_score(target: str, min_weight: float = 0, min_score: float = 0, weighted: bool = True, by_cluster: bool = False, sorted_names_only: bool = False, top_n_only: int = None, all_nodes: bool = True, score_multiplier: int = 1000, ignore_nodes: List[str] = None, include_nodes: List[str] = None, remove_suffix: bool = False, verbose: bool = False)¶

Calculate a weighted/unweighted degree of incident target nodes on reference nodes.

Parameters:

target – Target sample name
min_weight – Ignore an edge if its weight in the SNN graph is smaller than this value. Only applicable if calculating a weighted mapping score
min_score – If a score is smaller than this value then it is reset to zero
weighted – Use edge weights if True
by_cluster – If True, then combine scores from nodes of same cluster into a list. The keys are cluster number in the output dictionary
sorted_names_only – If True, then return only sorted list of base cells from highest to lowest mapping score. Cells with mapping score less than min_score are not reported (default: False)
top_n_only – If sorted_names_only is True and an integer value is provided then this method will return top n number of nodes sorted based on score. min_score value will be ignored.
all_nodes – if False, then returns only nodes with non-zero score (after resetting using min_score)
score_multiplier – Score is multiplied by this number after normalizing for total number of target cells.
ignore_nodes – List of nodes from ‘target’ sample to be ignored while calculating the score (default: None).
include_nodes – List of target nodes from ‘target’ sample. Mapping score will be calculated ONLY for those reference cells that are connected to this subset of target cells in the graph. By default mapping score will be calculated against each target node.
remove_suffix – Remove suffix from cell names (default: False)
verbose – Prints graph stats

Returns:

Mapping score
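
A mapping score of this kind can be sketched as a (weighted) count of target edges arriving at each reference node. The code below is an interpretation of the parameter descriptions, not Nabo's implementation; the `mapping_score` helper and the edge-tuple format are hypothetical, and only a subset of the parameters is modelled.

```python
def mapping_score(edges, n_target_cells, weighted=True, min_weight=0.0,
                  min_score=0.0, score_multiplier=1000):
    """Sum target-to-reference edge weights (or edge counts) per reference node.

    edges: iterable of (target cell, reference cell, edge weight).
    Scores are normalized by the number of target cells, multiplied by
    score_multiplier, and reset to zero when below min_score.
    """
    raw = {}
    for _, ref, weight in edges:
        if weighted and weight < min_weight:
            continue
        raw[ref] = raw.get(ref, 0.0) + (weight if weighted else 1.0)
    scores = {}
    for ref, value in raw.items():
        value = score_multiplier * value / n_target_cells
        scores[ref] = value if value >= min_score else 0.0
    return scores


# Hypothetical edges between two target cells and two reference cells.
edges = [("t1", "r1", 0.5), ("t2", "r1", 0.25), ("t2", "r2", 0.5)]
scores = mapping_score(edges, n_target_cells=2)
unweighted = mapping_score(edges, n_target_cells=2, weighted=False)
print(scores)
print(unweighted)
```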

get_mapping_specificity(target_name: str, fill_na: bool = True) → Dict[str, float]¶

Calculates the mapping specificity of target nodes. Mapping specificity of a target node is calculated as the mean of shortest path lengths between all pairs of mapped reference nodes.

Parameters:	target_name – Name of target sample fill_na – if True, then nan values will be replaced with largest value (default: True)
Returns:	Dictionary with target node names as keys and mapping specificity as values

get_random_nodes(n: int) → List[str]¶

Get random list of nodes from reference graph.

Parameters:	n – Number of nodes to return
Returns:	A list of reference nodes

get_ref_specificity(target: str, target_values: Dict[str, float], incl_unmapped: bool = False) → Dict[str, float]¶

Calculates the average mapping specificity of all target nodes that mapped to a given reference node. Requires that the mapping specificity of target nodes is already calculated.

Parameters:

target – Name of target sample
target_values – Mapping specificity values of target nodes
incl_unmapped – If True, then includes unmapped reference nodes in the dictionary with value set at 0 (default: False)

Returns:

Dictionary with reference node names as keys and values as mean mapping specificity of their mapped target nodes

static get_score_percentile(score: Dict[str, int], p: int) → float¶

Get the value at a given percentile

Parameters:	score – Mapping score or any other dictionary where values are numbers p – Percentile
Returns:	Percentile value

import_clusters(cluster_dict: Dict[str, str] = None, missing_val: str = 'NA') → None¶

Import cluster information for reference cells.

Parameters:	cluster_dict – Dictionary with cell names as keys and cluster number as values. Cluster numbers should start from 1 missing_val – This value will be filled in when fill_missing is True (Default: NA)
Returns:	None

import_clusters_from_csv(csv: str, csv_sep: str = ', ', cluster_col: int = 0, header=None, append_ref_name: bool = False)¶

Parameters:

csv – Filename containing cluster information. Make sure that the first column contains cell names and second contains the cluster labels.
csv_sep – Separator for CSV file (default: ‘,’)
cluster_col – Column number (0 based count) where cluster info is present (default: 0)
append_ref_name – Append the reference name to the cell name (default: False)

Returns:

None

import_clusters_from_json(fn)¶

Import clusters from a JSON file

Parameters:	fn – Input file in JSON format.
Returns:	None

import_layout(pos_dict) → None¶

Alternatively one can provide a dictionary with keys as node name and values as coordinate (x, y) tuple.

Parameters:	pos_dict – Dictionary with keys as node names and values as 2D coordinates of nodes on the graph.
Returns:	None

import_layout_from_csv(csv: str, csv_sep: str = ', ', dim_cols: tuple = (0, 1), header=None, append_ref_name: bool = False)¶

Import graph layout coordinates from a CSV file

Parameters:

csv – Filename containing layout coordinates. Make sure that the first column contains cell names and the second and third contain the x and y coordinates
csv_sep – Separator for CSV file (default: ‘,’)
append_ref_name – Append the reference name to the cell name (default: False)

Returns:

None

import_layout_from_json(fn)¶

Parameters:	fn – Input json file
Returns:

layout¶

Copies ‘pos’ attribute values (x/y coordinate tuple) from graph nodes and returns a dictionary

load_from_gml(fn: str) → None¶

Load data from GML format file. It is critical that this graph was generated using Nabo’s Mapping class.

Parameters:	fn – Full path of GML file
Returns:	None

load_from_h5(fn: str, name: str, kind: str) → None¶

Loads a graph saved by Mapping class in HDF5 format

Parameters:

fn – Path to HDF5 file
name – Label/name of sample used in Mapping object. This function assumes that the group in HDF5 containing graph data is named: name + ‘_graph’
kind – Can have a value of either ‘reference’ or ‘target’. Only one sample can have kind=‘reference’ for an instance of this class

Returns:

None

make_clusters(n_clusters: int) → None¶

Performs graph agglomerative clustering using algorithm in Newman 2004

Parameters:	n_clusters – Number of clusters
Returns:	None

make_leiden_clusters(resolution: float = 1.0, random_seed=4466) → None¶

Leiden clustering

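Parameters:

resolution – Resolution parameter for Leiden clustering (default: 1.0)
random_seed – Seed for the random number generator (default: 4466)

Returns:

None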

save_clusters_as_csv(outfn)¶

Parameters:	outfn – Output CSV file
Returns:

save_clusters_as_json(outfn)¶

Parameters:	outfn – Output JSON file
Returns:

save_graph(save_name: str) → None¶

Save graph in GML format

Parameters:	save_name – Output filename with path
Returns:	None

save_layout_as_csv(out_fn)¶

Saves the layout in CSV format

Parameters:	out_fn – Output CSV file
Returns:

save_layout_as_json(out_fn)¶

Parameters:	out_fn – Output json file
Returns:

set_de_groups(target: str, min_score: float, node_dist: int, from_clusters: List[str] = None, full_trail: bool = False, trail_start: int = 1, stringent_control: bool = False) → None¶

Categorises reference nodes into either ‘Test’, ‘Control’ or ‘Other’ group. Nodes with mapping score higher than min_score are categorized as ‘Test’, cells at node_dist path distance are categorized as ‘Control’ and rest of the nodes are categorized as ‘Other’.

Parameters:

target – Name of target sample whose corresponding mapping scores to be considered
min_score – Minimum mapping score
node_dist – Path distance
from_clusters – List of cluster number. ‘Test’ cells will only be limited to these clusters.
full_trail – If True then returns only nodes at node_dist path distance else return all nodes up to node_dist (default: False)
trail_start – If full_trail is True, then the trail starts at this path distance (default: 1).
stringent_control – If True then control group will not contain cells that have mapping score higher than min_score

Returns:

None
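
The grouping into ‘Test’, ‘Control’ and ‘Other’ can be sketched with a layered breadth-first search from the high-scoring nodes. This is an illustration under assumptions, not Nabo's code: the `de_groups` helper is hypothetical, path distance is counted in unweighted hops, and the from_clusters, full_trail and stringent_control options are not modelled.

```python
def de_groups(adj, scores, min_score, node_dist):
    """Label reference nodes as 'Test', 'Control' or 'Other'.

    Test: mapping score above min_score.
    Control: nodes exactly node_dist hops away from the nearest Test node.
    """
    test = {node for node, score in scores.items() if score > min_score}
    dist = {node: 0 for node in test}
    frontier = set(test)
    for hops in range(1, node_dist + 1):
        next_frontier = set()
        for node in frontier:
            for nbr in adj[node]:
                if nbr not in dist:
                    dist[nbr] = hops
                    next_frontier.add(nbr)
        frontier = next_frontier
    control = {node for node, hops in dist.items() if hops == node_dist} - test
    return {node: "Test" if node in test
            else "Control" if node in control
            else "Other"
            for node in adj}


# Hypothetical reference graph (a chain) and mapping scores.
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "e"], "e": ["d"]}
scores = {"a": 10.0, "b": 0.0, "c": 0.0, "d": 0.0, "e": 0.0}
groups = de_groups(adj, scores, min_score=5.0, node_dist=2)
print(groups)
```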

set_ref_layout(niter: int = 500, verbose: bool = True, init_pos: dict = None, disable_rescaling: bool = False, outbound_attraction_distribution: bool = True, edge_weight_influence: float = 1.0, jitter_tolerance: float = 1.0, barnes_hut_optimize: bool = True, barnes_hut_theta: float = 1.2, scaling_ratio: float = 1.0, strong_gravity_mode: bool = False, gravity: float = 1.0) → None¶

Calculates a 2D graph layout using ForceAtlas2 algorithm. The ForceAtlas2 implementation being used here will not prevent nodes in the graph from overlapping with each other. We aim to improve this in the future.

Parameters:

niter – Number of iterations (default: 500)
verbose – Print the progress (default: True)
init_pos – Initial positions of nodes
disable_rescaling – If True then layout coordinates are not rescaled to only have non negative positions (Default: False)
outbound_attraction_distribution –
edge_weight_influence –
jitter_tolerance –
barnes_hut_optimize –
barnes_hut_theta –
scaling_ratio –
strong_gravity_mode –
gravity –

Returns:

None

Marker¶

This module contains functions to identify marker genes (genes with significantly high expression) for sub-populations of interest.

nabo.run_de_test(dataset1: nabo._dataset.Dataset, dataset2, test_cells: List[str], control_cells: List[List[str]], test_label: str = None, control_group_labels: list = None, exp_frac_thresh: float = 0.25, log2_fc_thresh: float = 1, qval_thresh: float = 0.05, tqdm_msg: str = '') → pandas.core.frame.DataFrame¶

Identifies differentially expressed genes using Mann Whitney U test.

Parameters:

dataset1 – nabo.Dataset instance
dataset2 – nabo.Dataset instance or None
test_cells – List of cells for which markers have to be found. These could be cells from a cluster, cells with a high mapping score, etc.
control_cells – List of cell groups against which markers need to be found. This could be just one group of cells or multiple groups of cells.
test_label – Label for test cells.
control_group_labels – Labels of control cell groups
exp_frac_thresh – Fraction of cells that should have a non zero value for a gene.
log2_fc_thresh – Threshold for log2 fold change
qval_thresh – Threshold for adjusted p value
tqdm_msg – Message to print while displaying progress

Returns:

pandas DataFrame
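
The three per-gene quantities behind the thresholds above can be sketched in plain Python. This is not Nabo's implementation: the helper functions are hypothetical, the pseudocount in the fold change is an assumption, and the p-value and multiple-testing correction are omitted (in practice a statistics library would supply them).

```python
import math


def mann_whitney_u(test, control):
    """U statistic for the test group: number of (test, control) pairs in which
    the test value is larger, with ties counted as one half."""
    u = 0.0
    for t in test:
        for c in control:
            if t > c:
                u += 1.0
            elif t == c:
                u += 0.5
    return u


def log2_fold_change(test, control, pseudocount=1.0):
    """log2 ratio of group means; the pseudocount avoids division by zero."""
    mean_test = sum(test) / len(test)
    mean_control = sum(control) / len(control)
    return math.log2((mean_test + pseudocount) / (mean_control + pseudocount))


def expressed_fraction(values):
    """Fraction of cells with a non-zero value."""
    return sum(v > 0 for v in values) / len(values)


# Hypothetical normalized expression of one gene in two groups of cells.
test_vals = [5.0, 7.0, 3.0, 0.0]
control_vals = [0.0, 1.0, 0.0, 0.0]
u = mann_whitney_u(test_vals, control_vals)
fc = log2_fold_change(test_vals, control_vals)
frac = expressed_fraction(test_vals)
print(u, fc, frac)
```

In this example the gene would pass the default exp_frac_thresh (0.25) and log2_fc_thresh (1) cutoffs; whether it passes qval_thresh depends on the p-value, which this sketch does not compute.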

nabo.find_cluster_markers(clusters: dict, dataset: nabo._dataset.Dataset, de_frequency: int, exp_frac_thresh: float = 0.25, log2_fc_thresh: float = 0.5, qval_thresh: float = 0.05) -> (<class 'pandas.core.frame.DataFrame'>, typing.Dict[int, typing.List[str]])¶

Identifies marker genes for each cluster in a Graph. This function works a wrapper for run_de_test.

Parameters:

clusters – dict
dataset – nabo.Dataset
de_frequency – Minimum number of clusters against which a gene should be significantly differentially expressed for it to qualify as a marker
exp_frac_thresh – Fraction of cells that should have a non-zero value for a gene.
log2_fc_thresh – Threshold for log2 fold change
qval_thresh – Threshold for adjusted p value

Returns:

A tuple where the first element is a pandas DataFrame and the second element is a dictionary where keys are cluster numbers and values are lists of marker genes for the corresponding clusters
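The role of de_frequency can be illustrated with a small sketch. It assumes that pairwise tests have already produced, for every cluster, the set of genes that are significant against each other cluster; the function name `markers_by_de_frequency` and the layout of `de_hits` are hypothetical and are not part of Nabo.

```python
from collections import Counter


def markers_by_de_frequency(de_hits, de_frequency):
    """Keep genes that are significant against at least de_frequency clusters.

    de_hits maps cluster -> {control cluster -> set of significant genes}.
    """
    markers = {}
    for cluster, per_control in de_hits.items():
        # Count in how many pairwise comparisons each gene was significant.
        counts = Counter(g for genes in per_control.values() for g in set(genes))
        markers[cluster] = sorted(g for g, c in counts.items() if c >= de_frequency)
    return markers


# Hypothetical pairwise results for three clusters.
de_hits = {
    1: {2: {"A", "B"}, 3: {"A"}},
    2: {1: {"C"}, 3: {"C", "D"}},
    3: {1: {"E"}, 2: set()},
}
cluster_markers = markers_by_de_frequency(de_hits, de_frequency=2)
```

A higher de_frequency makes the selection stricter: a gene has to separate the cluster from more of the other clusters before it is reported as a marker.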

GraphPlot¶

This class allows a highly customized graph visualization. It is created to work seamlessly with Graph class instances. When called it will, by default, automatically produce the graph visualization. This class requires that the set_ref_layout method has been called on the Graph object.

class nabo.GraphPlot(g: nabo._graph.Graph, only_ref=True, vc='steelblue', cmap=None, vc_attr=None, vc_default='grey', vec=None, vc_min=None, vc_max=None, vc_percent_trim=None, max_ncolors=40, vs=2, vs_scale=15, vs_min=None, vs_max=None, vs_percent_trim=0, vlw=0, v_alpha=0.6, v_zorder=None, draw_edges: str = 'all', ec='k', elw=0.1, e_alpha=0.1, bundle_edges: bool = False, bundle_bw: float = 0.1, bundle_decay: float = 0.7, edge_min_weight: float = 0, texts=None, texts_fs=20, title=None, title_fs=30, label_attr=None, label_attr_type='centroid', label_attr_pos=(1, 1), label_attr_space=0.05, label_attr_fs=16, rasterized=True, save_name=None, dpi=300, fig_size=(5, 5), show_fig=True, remove_axes=True, ax=None, verbose=False)¶

Class for customized Graph drawing

Parameters:

g – A Graph class instance from Nabo
only_ref – Only the reference graph is drawn if True (default: True)
vc – Vertex colour. Can be a valid matplotlib color string or a dictionary with node names as keys and matplotlib color strings, floats or RGB tuples as values. If floats are given, colors are selected from the colormap. This parameter is overridden by vc_attr.
cmap – A valid matplotlib colour map
vc_attr – Name of graph attribute to be used for colors. Attribute values should either be floats or ints
vc_default – Default color of a node. Should be either a valid matplotlib string or RGB tuple.
vc_min – Minimum value for vertex colour. If value is less than this threshold, then value will be reset to this threshold.
vc_max – Maximum value for vertex colour. If a value is greater than this threshold, it will be reset to this threshold.
vc_percent_trim – Percentage of values to be ceiled or floored. This will set vc_min and vc_max values based on percentiles. For example, setting it to 1 will cause the lowest 1% of values to be reset to the next largest value, and values above the 99th percentile (100-1) to be set to the 99th percentile.
max_ncolors – Maximum number of colours to use
vs – Vertex size. Should be an integer or float value to set the size for all nodes, or a dictionary with node names as keys and floats or ints as values.
vs_scale – Multiplier for vs
vs_min – Same as vc_min but for vertex size
vs_max – Same as vc_max but for vertex size
vs_percent_trim – Same as vc_percent_trim but for vertex size
vlw – Vertex line width
v_alpha – Transparency/alpha value for vertices. Should be between 0 and 1
draw_edges – Can be one of: ‘all’, ‘ref’, ‘target’, ‘none’ (Default: ‘all’)
ec – Edge colour
elw – Edge line width
e_alpha – Edge transparency/alpha value
texts – Text to be placed on the graph. Should be a dictionary with keys as texts and values as tuple for xy coordinates
texts_fs – Font size for texts
title – Title for the Graph
title_fs – Title font size
label_attr – Node attribute to use to retrieve labels
label_attr_type – Can be either ‘legend’ or ‘centroid’
label_attr_pos – Tuple for xy coords to position start of labels. Only used when label_attr_type is ‘centroid’
label_attr_space – Spacing between labels. Only used when label_attr_type is ‘centroid’
label_attr_fs – Label font size
rasterized – If True, then rasterize the scatter points
save_name – File name for saving figure
fig_size – Figure size. Should be a tuple (width, height)
show_fig – If True then show figure
remove_axes – Remove axis and ticklabels if set to True (Default: True)
ax – Matplotlib axis. Draws on this axis rather than creating a new one.
verbose – If True, then prints messages.
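The effect of vc_percent_trim and vs_percent_trim, as described in the parameter list above, can be sketched as percentile-based clipping. The helpers `percentile` and `trim_values` are hypothetical; how Nabo computes the percentiles internally (for example, the interpolation rule) is an assumption here.

```python
def percentile(sorted_vals, q):
    """Percentile (q in 0..100) of a pre-sorted list, linear interpolation."""
    pos = (len(sorted_vals) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)


def trim_values(values, percent_trim):
    """Clip values to the [percent_trim, 100 - percent_trim] percentile range."""
    s = sorted(values)
    vmin = percentile(s, percent_trim)
    vmax = percentile(s, 100 - percent_trim)
    # Values below vmin are raised to vmin, values above vmax lowered to vmax.
    return [min(max(v, vmin), vmax) for v in values]


# Hypothetical vertex values 0..100, trimmed by 10 percent on each side.
trimmed = trim_values(list(range(101)), percent_trim=10)
```

Trimming of this kind keeps a few extreme values from compressing the rest of the colour or size scale.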