API
Dataset
This is the primary class for processing the data once it is saved in HDF5 format. It supports filtering of cells and genes, normalization and scaling of data, identification of highly variable genes and dimensionality reduction using PCA. The class has been designed to work seamlessly with the HDF5 format such that only a minimal portion of the data is ever loaded in memory. The methods of this class have also been designed so that external data can be plugged in at any stage. For example, users can bring in normalization factors (cell size factors), a list of highly variable genes, etc.
-
class
nabo.
Dataset
(h5_fn: h5py._hl.files.File, mito_patterns: List[str] = None, ribo_patterns: List[str] = None, force_recalc: bool = False)¶ Class for performing filtering, normalization and dimensionality reduction.
Parameters: - h5_fn – Path to input HDF5 file
- mito_patterns – Pattern to grep mitochondrial gene names
- ribo_patterns – Pattern to grep ribosomal gene names
- force_recalc – If set to True then all the saved data from previous instance of this class will be deleted
-
correct_var
(n_bins: int = 100, lowess_frac: float = 0.4) → None¶ Removes the mean-variance trend in the dataset and adds the corrected variance as the ‘fixed_var’ column to geneStats.
Parameters: - n_bins – Number of bins for expression values. Larger number of bins will provide a better fit for outliers but may also result in overfitting.
- lowess_frac – value for parameter frac in statsmodels’ lowess function
Returns: None
-
dump_hvgs
(hvgs: List[str]) → None¶ Save HVGs to HDF5 file
Parameters: hvgs – List of highly variable genes to save in the HDF5 file Returns: None
-
export_as_dataframe
(genes: List[str], normalized: bool = True, clr_normed: bool = False, clr_axis: int = 0) → pandas.core.frame.DataFrame¶ Export data for given genes. Data is exported only for cells that are present in keepCellsIdx attribute.
Parameters: - genes – Genes to be exported
- normalized – Perform library size normalization (default: True)
- clr_normed – Perform CLR normalization (default: False)
Returns: Pandas dataframe with cells as rows and genes as columns
-
filter_data
(min_exp: int = 1000, max_exp: int = inf, min_ngenes: int = 100, max_ngenes: int = inf, min_mito: int = -1, max_mito: int = 101, min_ribo: int = -1, max_ribo: int = 101, min_gene_abundance: int = 10, rm_mito: bool = True, rm_ribo: bool = True, verbose: bool = True) → None¶ Filter cells and genes
Parameters: - min_exp – Minimum total expression value for each cell (if count data then these would be minimum number of reads or UMI per cell)
- max_exp – Maximum total expression value for each cell.
- min_ngenes – Minimum number of genes expressed per cell.
- max_ngenes – Maximum number of genes expressed per cell.
- min_mito – Minimum percentage of mitochondrial genes’ expression
- max_mito – Maximum percentage of mitochondrial genes’ expression
- min_ribo – Minimum percentage of ribosomal genes’ expression
- max_ribo – Maximum percentage of ribosomal genes’ expression
- min_gene_abundance – Minimum total expression of gene
- rm_mito – if True, exclude mitochondrial genes (default: True)
- rm_ribo – if True, exclude ribosomal genes (default: True)
- verbose – if True then report the number of cells/genes removed using each cutoff. (default: True)
Returns: None
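The cell-level cutoffs above can be pictured as range checks applied to per-cell summaries. The following is a minimal sketch in plain Python and not Nabo's implementation: the `cells` records and the helper `passes_filters` are hypothetical, and the use of strict inequalities is an assumption.

```python
import math

# Hypothetical per-cell summaries: total expression, number of expressed
# genes, and percentage of expression from mitochondrial/ribosomal genes.
cells = {
    "cell_a": {"total_exp": 5200, "ngenes": 1800, "mito": 4.0, "ribo": 22.0},
    "cell_b": {"total_exp": 600, "ngenes": 90, "mito": 2.0, "ribo": 18.0},
    "cell_c": {"total_exp": 3100, "ngenes": 1200, "mito": 35.0, "ribo": 20.0},
}


def passes_filters(stats, min_exp=1000, max_exp=math.inf, min_ngenes=100,
                   max_ngenes=math.inf, min_mito=-1, max_mito=101,
                   min_ribo=-1, max_ribo=101):
    """Return True if a cell lies strictly inside every cutoff range.

    Strict inequalities are an assumption made for this sketch.
    """
    return (min_exp < stats["total_exp"] < max_exp
            and min_ngenes < stats["ngenes"] < max_ngenes
            and min_mito < stats["mito"] < max_mito
            and min_ribo < stats["ribo"] < max_ribo)


# Default cutoffs drop only the low-coverage cell.
kept_default = [c for c, s in cells.items() if passes_filters(s)]
# A tighter mitochondrial cutoff additionally drops the high-mito cell.
kept_strict = [c for c, s in cells.items() if passes_filters(s, max_mito=10)]
```

With the default values the mitochondrial and ribosomal ranges (-1 to 101) cannot exclude any cell, so those filters only take effect once the user narrows them.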
-
find_hvgs
(var_min_thresh: float = None, nzm_min_thresh: float = None, var_max_thresh: float = inf, nzm_max_thresh: float = inf, min_cells: int = 0, plot: bool = True, use_corrected_var: bool = False, update_cache: bool = False) → None¶ Identifies highly variable genes using cutoff provided for corrected variance and non-zero mean expression. Saves the result in attribute hvgList.
NOTE: Input threshold values are considered to be in log scale if use_corrected_var is True.
Parameters: - var_min_thresh – Minimum corrected variance
- nzm_min_thresh – Minimum non-zero mean
- var_max_thresh – Maximum corrected variance
- nzm_max_thresh – Maximum non-zero mean
- min_cells – Minimum number of cells where a gene should have non-zero value
- plot – if True then scatter plots of mean and variance are displayed, highlighting in blue the gene datapoints that were selected as HVGs.
- use_corrected_var – if True then uses corrected variance (default: False)
- update_cache – If True then the HVG list is saved to the HDF5 file (default: False)
Returns: None
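Threshold-based selection of this kind can be sketched as follows. This is an illustration only: the `gene_stats` table, the helper `select_hvgs` and the strict comparisons are assumptions, and "non-zero mean" is taken here to mean the mean over cells with non-zero expression.

```python
import math

# Hypothetical per-gene statistics: variance, non-zero mean and the number
# of cells in which the gene has a non-zero value.
gene_stats = {
    "GENE1": {"var": 9.0, "nzm": 4.0, "ncells": 120},
    "GENE2": {"var": 0.2, "nzm": 3.0, "ncells": 300},
    "GENE3": {"var": 7.5, "nzm": 0.1, "ncells": 50},
    "GENE4": {"var": 6.0, "nzm": 2.5, "ncells": 3},
}


def select_hvgs(stats, var_min, nzm_min, var_max=math.inf, nzm_max=math.inf,
                min_cells=0):
    """Keep genes whose variance and non-zero mean fall inside the cutoffs."""
    return sorted(
        gene for gene, s in stats.items()
        if var_min < s["var"] < var_max
        and nzm_min < s["nzm"] < nzm_max
        and s["ncells"] > min_cells
    )


hvgs = select_hvgs(gene_stats, var_min=1.0, nzm_min=1.0, min_cells=10)
hvgs_any_cells = select_hvgs(gene_stats, var_min=1.0, nzm_min=1.0)
```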
-
fit_ipca
(genes: List[str], n_comps: int = 100, batch_size: int = None, disable_tqdm: bool = False) → None¶ Fit PCA with genes of interest. The fitted PCA object is saved as instance attribute ipca. This function uses scikit-learn’s incremental PCA.
Parameters: - genes – List of genes to use to fit PCA
- n_comps – Number of components to use
- batch_size – Number of cells to use for fitting in a batch
- disable_tqdm – if True, progress will not be displayed ( default: False)
Returns: None
-
get_cell_lib
(hto_patterns: List[str], min_ratio: float = 0.5) → pandas.core.series.Series¶ Get cell library based on hashtag ratios
Parameters: - hto_patterns – Pattern to search for hashtags
- min_ratio – Minimum ratio (default: 0.5)
Returns: Pandas series
-
get_cum_exp
(genes: List[str], report_missing: bool = False) → numpy.ndarray¶ Calculates cumulative expression of provided genes for each cell.
Parameters: - genes – List of gene names
- report_missing – if True then will print a warning message if gene is not found otherwise will remain silent ( default: False)
Returns: numpy.ndarray of shape rawNCells with each element as total expression value of param genes in a cell.
-
get_gene_abundance
() → numpy.ndarray¶ Returns: ndarray where each element is the number of cells where a gene is expressed. Order is same as in attribute genes.
-
get_genes_by_pattern
(patterns: List[str]) → List[str]¶ Get names of genes that match the pattern
Parameters: patterns – List of Regex pattern Returns: List of gene names matching the pattern
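Pattern-based gene selection can be sketched with the standard `re` module. The gene names and patterns below are hypothetical examples of common naming conventions, not Nabo's defaults, and anchoring the match at the start of the name (via `re.match`) is an assumption.

```python
import re

# Hypothetical gene names.
genes = ["MT-CO1", "MT-ND1", "RPS6", "RPL13", "ACTB", "GAPDH"]


def genes_by_pattern(gene_names, patterns):
    """Return gene names matching at least one regular expression."""
    compiled = [re.compile(p) for p in patterns]
    return [g for g in gene_names if any(c.match(g) for c in compiled)]


mito_genes = genes_by_pattern(genes, [r"^MT-"])
ribo_genes = genes_by_pattern(genes, [r"^RPS", r"^RPL"])
```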
-
get_genes_per_cell
() → numpy.ndarray¶ Returns: ndarray where each element is number of expressed genes from a cell. Order is same as in attribute cells.
-
get_lvgs
(nzm_cutoff: float = None, log_nzm_cutoff: float = None, n: int = None, use_corrected_var: bool = False) → list¶ Get names of genes with the least corrected variance.
Parameters: - nzm_cutoff – Minimum non-zero mean values for returned genes
- log_nzm_cutoff – Minimum non-zero mean values (log scale) for returned genes
- n – Number of genes to return (default: same number as HVGs)
- use_corrected_var – if True then uses corrected variance (default: False)
Returns: A list of gene names
-
get_norm_exp
(gene: str, as_dict: bool = False, key_suffix: str = '', only_valid_cells: bool = False)¶ Get normalized expression of a gene across all the cells.
Parameters: - gene – Valid name of a gene. Gene name will be converted to upper case.
- as_dict – if True, returns a dictionary with cell names as keys and normalized expression as values; otherwise returns a list of normalized expression values. The order of values is the same as the cell names in attribute cells (default: False)
- key_suffix – A character/string to append to each cell name ( default: ‘’)
- only_valid_cells – If True then returns only valid cells i.e. cells removed during filtering step are not included. (default: False)
Returns:
-
get_scaled_values
(scaling_params: pandas.core.frame.DataFrame, tqdm_msg: str = '', disable_tqdm: bool = False, fill_missing: bool = False) → Generator[Tuple[str, numpy.ndarray], None, bool]¶ Generator that yields cell wise scaled expression values. The yielded vector will have genes in same order as provided in the input list.
Parameters: - scaling_params – Scaling parameters i.e. mean and std. dev. for each genes in form of a pandas DataFrame. This can be obtained from get_scaling_params method of Dataset.
- tqdm_msg – Message for progress bar (default: ‘’)
- disable_tqdm – if True, progress will not be displayed (default: False)
- fill_missing – If True, then gene names in scaling_params that are not present in the Dataset are assigned 0 value. (Default: False, raises error if gene name not found). Using this parameter is not recommended as of now because the implications of doing this have not been thoroughly tested.
Returns: (cell name, scaled values)
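The cell-wise scaling described above amounts to standardizing each gene with a supplied mean and standard deviation. A minimal sketch of such a generator follows; the data and the helper `scaled_values` are hypothetical and plain dictionaries stand in for the pandas DataFrame.

```python
# Hypothetical scaling parameters (mean, standard deviation) per gene and
# normalized expression values per cell.
scaling_params = {"GENE1": (2.0, 1.0), "GENE2": (10.0, 4.0)}
expression = {
    "cell_a": {"GENE1": 3.0, "GENE2": 2.0},
    "cell_b": {"GENE1": 1.0, "GENE2": 14.0},
}


def scaled_values(exp, params):
    """Yield (cell name, scaled values) with genes in the order of `params`."""
    for cell, values in exp.items():
        yield cell, [
            (values[gene] - mu) / sigma for gene, (mu, sigma) in params.items()
        ]


scaled = dict(scaled_values(expression, scaling_params))
```

Yielding one cell at a time keeps memory use flat, which is in line with the stated goal of loading only a small portion of the data at once.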
-
get_scaling_params
(genes: List[str] = None, only_valid: bool = True) → pandas.core.frame.DataFrame¶ Get genes’ mean and standard deviation (uncorrected)
Parameters: - genes – Name of genes whose parameters are required. Returns every gene’s parameter if value is None (default: None)
- only_valid – if True, include only valid genes (default: True)
Returns: A pandas DataFrame containing columns ‘mu’ and ‘sigma’
-
get_total_exp_per_cell
() → numpy.ndarray¶ Returns: ndarray where each element is total expression values/counts from a cell. Order is same as in attribute cells.
-
plot_filtered
(color: str = 'coral', display_stats: bool = True, savename: str = None, showfig: bool = True) → None¶ Plot total expression, genes/cell, % mitochondrial expression and % ribosomal expression for each cell from filtered data
Parameters: - color –
- display_stats –
- savename –
- showfig –
Returns: None
-
plot_raw
(color: str = 'skyblue', display_stats: bool = True, savename: str = None, showfig: bool = True) → None¶ Plot total expression, genes/cell, % mitochondrial expression and % ribosomal expression for each cell from raw data
Returns: None
-
remove_cells
(cell_names: List[str], update_cache: bool = False, verbose: bool = False) → None¶ Remove list of cells by providing their names. Note that no data is actually deleted from the dataset but just the keepCellsIdx attribute is modified.
Parameters: - cell_names – List of cell names to remove
- verbose – Print message about number of cells removed ( Default: False)
- update_cache – If True then the ‘keep_cells_idx’ dataset in the H5 file is updated. This will override the saved list of cells (keepCellsIdx) when the dataset is loaded in the future.
Returns:
-
remove_genes
(gene_names: List[str], update_cache: bool = False, verbose: bool = False) → None¶ Remove genes by providing their names. Note that no data is actually deleted from the dataset but just the keepGenesIdx attribute is modified.
Parameters: - gene_names – List of gene names to remove
- verbose – Print message about number of genes removed (Default: False)
- update_cache – If True then the ‘keep_genes_idx’ dataset in the H5 file is updated. This will override the saved list of genes (keepGenesIdx) when the dataset is loaded in the future.
Returns:
-
set_gene_stats
() → None¶ Calculates the gene-wise expression mean and variance values. Sets geneStats attribute as a pandas DataFrame of shape ( nGeneEntries, 4)
Returns: None
-
set_sf
(sf: Dict[str, float] = None, size_scale: float = 1000.0, all_genes: bool = False) → None¶ Set size factor for each cell. Updates sf attribute
Parameters: - sf – size factor dict with keys same as the dataset names in ‘cell_data’ group of H5 file and values as size factor for the corresponding cell to be used for normalization.
- size_scale – Values are scaled to this factor after normalization using size factor (default: 1000, set to None to disable size scaling).
- all_genes – Use expression values from all genes, even those which were filtered out, to calculate size factor.
Returns: None
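Size-factor normalization followed by rescaling can be sketched as below. This is not Nabo's code: taking the size factor to be each cell's total expression is an assumption, and `size_factors` and `normalize` are hypothetical helpers.

```python
# Hypothetical raw counts per cell.
counts = {
    "cell_a": {"GENE1": 10, "GENE2": 30},
    "cell_b": {"GENE1": 5, "GENE2": 5},
}


def size_factors(raw):
    """Assumed size factor: the total expression of each cell."""
    return {cell: sum(values.values()) for cell, values in raw.items()}


def normalize(raw, sf, size_scale=1000.0):
    """Divide each value by its cell's size factor, then rescale.

    Passing size_scale=None skips the rescaling step.
    """
    scale = 1.0 if size_scale is None else size_scale
    return {
        cell: {gene: value / sf[cell] * scale for gene, value in values.items()}
        for cell, values in raw.items()
    }


sf = size_factors(counts)
normed = normalize(counts, sf)
```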
-
transform_pca
(out_file: str, pca_group_name: str, transformer, scaling_params: pandas.core.frame.DataFrame, disable_tqdm: bool = False, fill_missing: bool = False) → None¶ Transforms values into PCA space and saves them in HDF5 format
Parameters: - out_file – Name of HDF5 file for output
- pca_group_name – Name of HDF5 group wherein PCA transformed values will be written. If the group exists then it will be deleted
- transformer – sklearn’s incremental PCA instance on which fit function has already been called
- scaling_params – a DataFrame as return by get_scaling_params method of reference sample
- disable_tqdm – if True, progress will not be displayed (default: False)
- fill_missing – If True, then gene names in scaling_params that are not present in the Dataset are assigned 0 value. (Default: False, raises error if gene name not found). Using this parameter is not recommended as of now because the implications of doing this have not been thoroughly tested.
Returns: None
-
update_exp_suffix
(suffix: str, delimiter: str = '_') → None¶ Set the suffix for cell names obtained from exp attribute
Parameters: - suffix – Add this string to end of cell names
- delimiter – delimiter character/string separating cell name from suffix
Returns:
Mapping
This is the core class that performs reference graph building by calculating Euclidean distances between cells and then identifying shared nearest neighbours among cells. Cells from any number of target samples can then be mapped over this graph by calculating reference-target cell distances. All the results are saved as an HDF5 file of the user’s choice.
-
class
nabo.
Mapping
(mapping_h5_fn: str, ref_name: str, ref_pca_fn: str, ref_pca_grp_name: str, overwrite: bool = False)¶ This class encapsulates functions required to perform cell mapping.
Parameters: - mapping_h5_fn – Output filename. Results will be written to this HDF5 file. If the file already exists then it may be used to read existing data
- ref_name – Label for reference samples.
- ref_pca_fn – Path to HDF5 file that contains input data. Ideally this should be the transformed PCA data file generated by Nabo’s Dataset class.
- ref_pca_grp_name – Name of the group in the input HDF5 file wherein the data exists
- overwrite – Deletes all the data saved in the mapping_h5_fn to start from scratch (default: False)
-
calc_dist
(target_fn: str, target_grp: str, dist_grp: str, sorted_dist_grp: str, ignore_ref_cells: List[str]) → None¶ Calculates the Euclidean distance between each pair of reference cells, or a modified Canberra distance between each pair of reference and target cells.
Parameters: - target_fn – input HDF5 file for target sample. If the distances are being calculated for reference sample then this is reference file name
- target_grp – group name within HDF5 file wherein data is located
- dist_grp – Name of output group name where distances will be saved
- sorted_dist_grp – Name of group where distance sorted cell indices will be saved
- ignore_ref_cells – List of names of reference cells to which distance should not be calculated.
Returns: None
-
calc_snn
(target_sorted_dist_grp: str, target_name: str, graph_grp: str, fix_graph_attempts: int = 5, fix_weight: float = None) → None¶ Creates a shared nearest neighbour graph based on distances calculated by ‘calc_dist’ method.
Parameters: - target_sorted_dist_grp – Name of HDF5 group wherein the indices of distance sorted cells are saved.
- target_name – A label for target sample. This will be appended to each target cell name in the graph
- graph_grp – Name of output group where graph will be saved
- fix_graph_attempts – Number of attempts to connect a disconnected graph. This parameter will soon be removed.
- fix_weight – Weight of edges used to connect disconnected components of graph (default: 0.5/(2*(k-1))-0.5)
Returns: None
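The idea of a shared nearest neighbour (SNN) graph can be sketched in a few lines. This is an illustration under stated assumptions and not Nabo's algorithm: the coordinates are hypothetical, each cell is counted as part of its own neighbourhood, and the edge weight (shared neighbours divided by k + 1) is a choice made for this sketch.

```python
import math

# Hypothetical 2D coordinates standing in for PCA-transformed cells.
points = {
    "a": (0.0, 0.0), "b": (0.0, 1.0), "c": (1.0, 0.0),
    "x": (10.0, 10.0), "y": (10.0, 11.0), "z": (11.0, 10.0),
}


def knn(pts, k):
    """Map each cell to the set of its k nearest cells (Euclidean distance)."""
    neighbours = {}
    for name, coord in pts.items():
        others = sorted(
            (o for o in pts if o != name),
            key=lambda o: (math.dist(coord, pts[o]), o),  # ties broken by name
        )
        neighbours[name] = set(others[:k])
    return neighbours


def snn_graph(pts, k):
    """Connect cells that share neighbours; weight = shared fraction."""
    nn = {name: nbrs | {name} for name, nbrs in knn(pts, k).items()}
    names = sorted(nn)
    edges = {}
    for i, u in enumerate(names):
        for v in names[i + 1:]:
            shared = len(nn[u] & nn[v])
            if shared:
                edges[(u, v)] = shared / (k + 1)
    return edges


snn_edges = snn_graph(points, k=2)
```

In this toy input the two groups of three cells share no neighbours, so the resulting graph has two disconnected components. That is the situation the fix_graph_attempts and fix_weight parameters are described as addressing.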
-
make_ref_graph
(use_stored_distances: bool = False)¶ A wrapper to run calc_dist and calc_snn methods creating the reference graph
Parameters: use_stored_distances – If True then distance between reference cells is not calculated and Nabo will try to use the stored distance matrix. (Default: False) Returns: None
-
map_target
(target_name: str, target_pca_fn: str, target_pca_grp_name: str, ignore_ref_cells: List[str] = None, use_stored_distances: bool = False, overwrite: bool = False) → None¶ A wrapper to run calc_dist and calc_snn function for mapping target cells onto reference graph. If same target name is provided twice then the data is overwritten.
Parameters: - target_name – Label/name of target sample
- target_pca_fn – Filename of input data for target. In typical usage this would be the HDF5 file generated using Nabo’s Dataset class.
- target_pca_grp_name – Name of group containing data in HDF5 file
- ignore_ref_cells – List of reference cell names to be excluded from mapping
- use_stored_distances – If True then distance between target and reference cells is not calculated and Nabo will try to use the stored distance matrix. (Default: False)
- overwrite – Overwrite the target data
Returns: None
-
set_parameters
(use_comps: int, k: int, dist_factor: float, chunk_size: int) → None¶ Set run parameters for mapping
Parameters: - use_comps – Number of input dimensions to use. In typical usage this would mean number of PCA components starting from first.
- k – Number of nearest neighbours to consider. This is same as k in kNN.
- dist_factor – In a given dimension i, if the value of the target cell is x_i and the value of the reference cell is y_i then the distance will be saved only if abs(x_i - y_i)/abs(x_i) < dist_factor; otherwise the distance is given the highest value, i.e. 1.
- chunk_size – Number of cells to load in memory in one go. For smaller RAM usage set a smaller value.
Returns: None
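The effect of dist_factor can be sketched as a per-dimension cap on a Canberra-style term. The source does not spell out the exact formula of the modified Canberra distance, so the function below is a hypothetical reading: averaging the per-dimension terms and the handling of zero values are assumptions.

```python
def capped_canberra(target, reference, dist_factor):
    """Mean per-dimension Canberra term, set to 1 for distant values.

    `target` and `reference` are equal-length sequences of coordinates.
    """
    terms = []
    for x, y in zip(target, reference):
        if x == y:
            terms.append(0.0)  # identical values contribute no distance
        elif x == 0 or abs(x - y) / abs(x) >= dist_factor:
            terms.append(1.0)  # too far apart: highest value
        else:
            terms.append(abs(x - y) / (abs(x) + abs(y)))
    return sum(terms) / len(terms)


near = capped_canberra([2.0, 4.0], [2.0, 4.0], dist_factor=0.5)
# First dimension: 1 / 5 = 0.2; second dimension exceeds the cap and becomes 1.
mixed = capped_canberra([2.0, 4.0], [3.0, 40.0], dist_factor=1.0)
```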
Graph
This class allows users to interact with the reference graph containing projected target cells. It is possible to visualize the graph, perform clustering on the graph and generate statistics to assess the mappings.
-
class
nabo.
Graph
¶ Class for storing Nabo’s SNN graph. Inherits from networkx’s Graph class
-
calc_contiguous_spl
(nodes: List[str]) → float¶ Calculates mean of shortest path lengths between subsequent nodes provided in the input list in reference graph.
Parameters: nodes – List of nodes from reference sample Returns: Mean shortest path length
-
calc_diff_potential
(r: Dict[str, float] = None) → Dict[str, float]¶ Calculate differentiation potential of cells.
This function is a reimplementation of population balance analysis (PBA) approach published in Weinreb et al. 2017, PNAS. This function computes the random walk normalized Laplacian matrix of the reference graph, L_rw = I-A/D and then calculates a Moore-Penrose pseudoinverse of L_rw. The method takes an optional but recommended parameter ‘r’ which represents the relative rates of proliferation and loss in different gene expression states (R). If not provided then a vector with ones is used. The differentiation potential is the dot product of inverse L_rw and R
Parameters: r – Same as parameter R in the above said reference. Should be a dictionary with each reference cell name as a key and its corresponding R values. Returns: V (Vector potential) as dictionary. Smaller values represent less differentiated cells.
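Written out, the computation described above takes the following form. Reading A/D as the row-normalized adjacency matrix is an assumption of this sketch; the symbol names follow the description.

```latex
L_{rw} = I - D^{-1} A, \qquad D_{ii} = \sum_{j} A_{ij}, \qquad V = L_{rw}^{+} R
```

Here A is the adjacency matrix of the reference graph, D is the diagonal degree matrix, L_rw^+ denotes the Moore-Penrose pseudoinverse of L_rw, R is the vector built from the parameter r (all ones when r is not provided) and V is the returned potential.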
-
calc_modularity
() → float¶ Calculates modularity of the reference graph. The clusters should have already been defined.
Returns: Value between 0 and 1
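As an illustration of the quantity involved, the standard Newman modularity of an unweighted, undirected graph can be computed as below. Whether Nabo uses edge weights is not stated here, so ignoring weights is an assumption; the graph and cluster labels are hypothetical.

```python
def modularity(edges, clusters):
    """Newman modularity: sum over clusters of e_c / m - (d_c / 2m) ** 2."""
    m = len(edges)
    internal = {}    # number of edges inside each cluster
    degree_sum = {}  # summed degree of the nodes of each cluster
    for u, v in edges:
        cu, cv = clusters[u], clusters[v]
        degree_sum[cu] = degree_sum.get(cu, 0) + 1
        degree_sum[cv] = degree_sum.get(cv, 0) + 1
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
    return sum(
        internal.get(c, 0) / m - (d / (2 * m)) ** 2
        for c, d in degree_sum.items()
    )


# Hypothetical graph: two triangles joined by a single edge.
graph_edges = [("a", "b"), ("b", "c"), ("a", "c"),
               ("x", "y"), ("y", "z"), ("x", "z"), ("c", "x")]
node_clusters = {"a": 1, "b": 1, "c": 1, "x": 2, "y": 2, "z": 2}
q = modularity(graph_edges, node_clusters)
```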
-
classify_target
(target: str, weight_frac: float = 0.5, min_degree: int = 2, min_weight: float = 0.1, cluster_dict: Dict[str, int] = None, na_label: str = 'NA', ret_counts: bool = False) → dict¶ This classifier identifies the total weight of all the connections made by each target cell to each cluster (of reference cells). If a target cell has more than 50% (default value) of its total connection weight in one of the clusters then the target cell is labeled to be from that cluster. One useful aspect of this classifier is that it will not classify the target cell to be from any cluster if it fails to reach the threshold (default, 50%) for any cluster (such target cells are labeled with na_label, ‘NA’ by default).
Parameters: - target – Name of target sample
- weight_frac – Required minimum fraction of weight in a cluster to be classified into that cluster
- min_degree – Minimum degree of the target node
- min_weight – Minimum edge weight. Edges with less weight than min_weight will be ignored but will still contribute to total weight.
- cluster_dict – Cluster labels for each reference cell. If not provided then the stored cluster information is used.
- na_label – Label for cells that failed to get classified into any cluster
- ret_counts – If True, then returns number of target cells classified to each cluster, else returns predicted cluster for each target cell
Returns: Dictionary. Keys are target cell names and values are their predicted clusters if ret_counts is False. Otherwise, keys are cluster labels and values are the number of target cells classified to that cluster
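The classification rule described above can be sketched as a weighted vote. The edge lists, cluster labels and the helper `classify` are hypothetical, and details such as the strict comparison against weight_frac are assumptions rather than Nabo's exact behaviour.

```python
from collections import defaultdict

# Hypothetical edges from target cells to reference cells: (reference, weight).
target_edges = {
    "t1": [("r1", 0.6), ("r2", 0.3), ("r3", 0.1)],
    "t2": [("r1", 0.3), ("r3", 0.3), ("r4", 0.4)],
    "t3": [("r3", 0.9)],
}
# Hypothetical cluster label of each reference cell.
ref_clusters = {"r1": 1, "r2": 1, "r3": 2, "r4": 3}


def classify(edges, clusters, weight_frac=0.5, min_degree=2, min_weight=0.1,
             na_label="NA"):
    """Assign each target cell to the cluster holding enough of its weight."""
    labels = {}
    for cell, links in edges.items():
        labels[cell] = na_label
        if len(links) < min_degree:
            continue  # too few connections to classify
        total = sum(w for _, w in links)  # weak edges still count here
        per_cluster = defaultdict(float)
        for ref, w in links:
            if w >= min_weight:
                per_cluster[clusters[ref]] += w
        best = max(per_cluster, key=per_cluster.get, default=None)
        if best is not None and per_cluster[best] / total > weight_frac:
            labels[cell] = best
    return labels


labels = classify(target_edges, ref_clusters)
```

In this example `t1` is assigned to cluster 1, `t2` stays unclassified because no cluster holds more than half of its weight, and `t3` is skipped because its degree is below min_degree.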
-
get_cells_from_clusters
(clusters: List[str], remove_suffix: bool = True) → List[str]¶ Get cell names for input cluster numbers
Parameters: - clusters – list of cluster identifiers
- remove_suffix – Remove suffix from cell names
Returns: List of cell names
-
get_cluster_identity_weights
() → pandas.core.series.Series¶ Returns: Cluster identity weights for each cell
-
get_k_path_neighbours
(nodes: List[str], k_dist: int, full_trail: bool = False, trail_start: int = 0) → List[str]¶ Get set of nodes at a given distance
Parameters: - nodes – Input nodes
- k_dist – Path distance from input node
- full_trail – If True then returns only nodes at k_dist path distance else return all nodes up to k_dist (default: False)
- trail_start – If full_trail is True, then the trail starts at this path distance (default: 0).
Returns: List of nodes
-
get_mapped_cells
(target: str, ref_cells: str, remove_suffix: bool = True) → List[str]¶ Get target cells that map to a given list of reference cells.
Parameters: - target – Name of target sample
- ref_cells – List of reference cell names
- remove_suffix – If True then removes target name suffix from end of node name
Returns: List of target cell names
-
get_mapping_score
(target: str, min_weight: float = 0, min_score: float = 0, weighted: bool = True, by_cluster: bool = False, sorted_names_only: bool = False, top_n_only: int = None, all_nodes: bool = True, score_multiplier: int = 1000, ignore_nodes: List[str] = None, include_nodes: List[str] = None, remove_suffix: bool = False, verbose: bool = False)¶ Calculate a weighted/unweighted degree of incident target nodes on reference nodes.
Parameters: - target – Target sample name
- min_weight – Ignore an edge if its weight in the SNN graph is smaller than this value. Only applicable if calculating a weighted mapping score
- min_score – If a score is smaller than this value then it is reset to zero
- weighted – Use edge weights if True
- by_cluster – If True, then combine scores from nodes of same cluster into a list. The keys are cluster number in the output dictionary
- sorted_names_only – If True, then return only sorted list of base cells from highest to lowest mapping score. Cells with mapping score less than min_score are not reported (default: False)
- top_n_only – If sorted_names_only is True and an integer value is provided then this method will return top n number of nodes sorted based on score. min_score value will be ignored.
- all_nodes – if False, then returns only nodes with non-zero score (after resetting using min_score)
- score_multiplier – Score is multiplied by this number after normalizing for total number of target cells.
- ignore_nodes – List of nodes from ‘target’ sample to be ignored while calculating the score (default: None).
- include_nodes – List of target nodes from ‘target’ sample. Mapping score will be calculated ONLY for those reference cells that are connected to this subset of target cells in the graph. By default mapping score will be calculated against each target node.
- remove_suffix – Remove suffix from cell names (default: False)
- verbose – Prints graph stats
Returns: Mapping score
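The core of the score, a weighted or unweighted degree normalized by the number of target cells and multiplied by score_multiplier, can be sketched as follows. The edge list and the helper `mapping_score` are hypothetical, and options such as min_score, by_cluster and include_nodes are left out of this sketch.

```python
# Hypothetical edges between target and reference cells:
# (target, reference, weight).
mapping_edges = [
    ("t1", "r1", 0.5), ("t1", "r2", 0.25),
    ("t2", "r1", 0.25), ("t3", "r1", 0.05),
]
reference_cells = ["r1", "r2", "r3"]


def mapping_score(edges, ref_cells, min_weight=0.0, weighted=True,
                  score_multiplier=1000):
    """Sum incident target edges per reference cell, per target cell."""
    n_targets = len({target for target, _, _ in edges})
    scores = {ref: 0.0 for ref in ref_cells}
    for _, ref, weight in edges:
        if weighted and weight < min_weight:
            continue  # weak edges are skipped only for weighted scores
        scores[ref] += weight if weighted else 1.0
    return {ref: score_multiplier * s / n_targets for ref, s in scores.items()}


weighted_scores = mapping_score(mapping_edges, reference_cells, min_weight=0.1)
unweighted_scores = mapping_score(mapping_edges, reference_cells, weighted=False)
```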
-
get_mapping_specificity
(target_name: str, fill_na: bool = True) → Dict[str, float]¶ Calculates the mapping specificity of target nodes. Mapping specificity of a target node is calculated as the mean of shortest path lengths between all pairs of mapped reference nodes.
Parameters: - target_name – Name of target sample
- fill_na – if True, then nan values will be replaced with largest value (default: True)
Returns: Dictionary with target node names as keys and mapping specificity as values
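Mapping specificity as defined above can be sketched with a breadth-first search. The graph and mappings are hypothetical, and using unweighted path lengths is an assumption of this sketch.

```python
from collections import deque
from itertools import combinations

# Hypothetical reference graph as an adjacency list (a simple path r1..r5).
ref_graph = {
    "r1": ["r2"], "r2": ["r1", "r3"], "r3": ["r2", "r4"],
    "r4": ["r3", "r5"], "r5": ["r4"],
}
# Hypothetical reference cells that each target cell maps to.
mapped_refs = {"t1": ["r1", "r2", "r3"], "t2": ["r1", "r5"]}


def path_length(graph, start, end):
    """Breadth-first search for the unweighted shortest path length."""
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == end:
            return dist
        for nbr in graph[node]:
            if nbr not in seen:
                seen.add(nbr)
                queue.append((nbr, dist + 1))
    return float("nan")  # no path between the two nodes


def mapping_specificity(graph, refs):
    """Mean shortest path length over all pairs of mapped reference nodes."""
    lengths = [path_length(graph, u, v) for u, v in combinations(refs, 2)]
    return sum(lengths) / len(lengths) if lengths else float("nan")


specificity = {
    target: mapping_specificity(ref_graph, refs)
    for target, refs in mapped_refs.items()
}
```

A smaller value means that the reference cells a target cell maps to lie close together in the graph: `t1` maps to three adjacent cells, whereas `t2` maps to the two ends of the path.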
-
get_random_nodes
(n: int) → List[str]¶ Get random list of nodes from reference graph.
Parameters: n – Number of nodes to return Returns: A list of reference nodes
-
get_ref_specificity
(target: str, target_values: Dict[str, float], incl_unmapped: bool = False) → Dict[str, float]¶ Calculates the average mapping specificity of all target nodes that mapped to a given reference node. Requires that the mapping specificity of target nodes is already calculated.
Parameters: - target – Name of target sample
- target_values – Mapping specificity values of target nodes
- incl_unmapped – If True, then includes unmapped reference nodes in the dictionary with value set at 0 (Default: False)
Returns: Dictionary with reference node names as keys and values as mean mapping specificity of their mapped target nodes
-
static
get_score_percentile
(score: Dict[str, int], p: int) → float¶ Get value at a given percentile
Parameters: - score – Mapping score or any other dictionary where values are numbers
- p – Percentile
Returns: Percentile value
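A percentile over the values of a score dictionary can be sketched without external libraries. Linear interpolation between neighbouring ranks is an assumption of this sketch, and `score_percentile` is a hypothetical stand-in for the method.

```python
def score_percentile(score, p):
    """Percentile of the dictionary's values with linear interpolation."""
    values = sorted(score.values())
    if not values:
        raise ValueError("score must contain at least one value")
    rank = (len(values) - 1) * p / 100
    lower = int(rank)
    upper = min(lower + 1, len(values) - 1)
    return values[lower] + (values[upper] - values[lower]) * (rank - lower)


# Hypothetical mapping scores for five reference cells.
scores = {"r1": 0, "r2": 10, "r3": 20, "r4": 30, "r5": 40}
median = score_percentile(scores, 50)
p90 = score_percentile(scores, 90)
```

Such a value can serve as a data-driven cutoff, for example as the min_score argument of get_mapping_score.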
-
import_clusters
(cluster_dict: Dict[str, str] = None, missing_val: str = 'NA') → None¶ Import cluster information for reference cells.
Parameters: - cluster_dict – Dictionary with cell names as keys and cluster number as values. Cluster numbers should start from 1
- missing_val – This value will be filled in when fill_missing is True (Default: NA)
Returns: None
-
import_clusters_from_csv
(csv: str, csv_sep: str = ', ', cluster_col: int = 0, header=None, append_ref_name: bool = False)¶ Parameters: - csv – Filename containing cluster information. Make sure that the first column contains cell names and second contains the cluster labels.
- csv_sep – Separator for CSV file (default: ‘,’)
- cluster_col – Column number (0 based count) where cluster info is present (Default: 0)
- append_ref_name – Append the reference name to the cell name (Default: False)
Returns: None
-
import_clusters_from_json
(fn)¶ Import clusters from a JSON file
Parameters: fn – Input file in JSON format. Returns: None
-
import_layout
(pos_dict) → None¶ Alternatively one can provide a dictionary with keys as node name and values as coordinate (x, y) tuple.
Parameters: pos_dict – Dictionary with keys as node names and values as 2D coordinates of nodes on the graph. Returns: None
-
import_layout_from_csv
(csv: str, csv_sep: str = ', ', dim_cols: tuple = (0, 1), header=None, append_ref_name: bool = False)¶ Import graph layout coordinates from a CSV file
Parameters: - csv – Filename containing layout coordinates. Make sure that the first column contains cell names and the second and third contain the x and y coordinates
- csv_sep – Separator for CSV file (default: ‘,’)
- append_ref_name – Append the reference name to the cell name (Default: False)
Returns: None
-
import_layout_from_json
(fn)¶ Parameters: fn – Input json file Returns:
-
layout
¶ Copies ‘pos’ attribute values (x/y coordinate tuple) from graph nodes and returns them as a dictionary
-
load_from_gml
(fn: str) → None¶ Load data from GML format file. It is critical that this graph was generated using Nabo’s Mapping class.
Parameters: fn – Full path of GML file Returns: None
-
load_from_h5
(fn: str, name: str, kind: str) → None¶ Loads a graph saved by Mapping class in HDF5 format
Parameters: - fn – Path to HDF5 file
- name – Label/name of sample used in Mapping object. This function assumes that the group in HDF5 containing graph data is named: name + ‘_graph’
- kind – Can have a value of either ‘reference’ or ‘target’. Only one sample can have kind=’reference’ for an instance of this class
Returns: None
-
make_clusters
(n_clusters: int) → None¶ Performs graph agglomerative clustering using algorithm in Newman 2004
Parameters: n_clusters – Number of clusters Returns: None
-
make_leiden_clusters
(resolution: float = 1.0, random_seed=4466) → None¶ Leiden clustering
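Parameters: - resolution – Resolution parameter for Leiden clustering (default: 1.0)
- random_seed – Seed for the random number generator (default: 4466)
Returns: None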
-
save_clusters_as_csv
(outfn)¶ Parameters: outfn – Output CSV file Returns:
-
save_clusters_as_json
(outfn)¶ Parameters: outfn – Output JSON file Returns:
-
save_graph
(save_name: str) → None¶ Save graph in GML format
Parameters: save_name – Output filename with path Returns: None
-
save_layout_as_csv
(out_fn)¶ Saves the layout in CSV format
Parameters: out_fn – Output CSV file Returns:
-
save_layout_as_json
(out_fn)¶ Parameters: out_fn – Output json file Returns:
-
set_de_groups
(target: str, min_score: float, node_dist: int, from_clusters: List[str] = None, full_trail: bool = False, trail_start: int = 1, stringent_control: bool = False) → None¶ Categorises reference nodes into either ‘Test’, ‘Control’ or ‘Other’ group. Nodes with mapping score higher than min_score are categorized as ‘Test’, cells at node_dist path distance are categorized as ‘Control’ and rest of the nodes are categorized as ‘Other’.
Parameters: - target – Name of target sample whose corresponding mapping scores to be considered
- min_score – Minimum mapping score
- node_dist – Path distance
- from_clusters – List of cluster number. ‘Test’ cells will only be limited to these clusters.
- full_trail – If True then returns only nodes at node_dist path distance else return all nodes up to node_dist (default: False)
- trail_start – If full_trail is True, then the trail starts at this path distance (default: 1).
- stringent_control – If True then control group will not contain cells that have mapping score higher than min_score
Returns: None
-
set_ref_layout
(niter: int = 500, verbose: bool = True, init_pos: dict = None, disable_rescaling: bool = False, outbound_attraction_distribution: bool = True, edge_weight_influence: float = 1.0, jitter_tolerance: float = 1.0, barnes_hut_optimize: bool = True, barnes_hut_theta: float = 1.2, scaling_ratio: float = 1.0, strong_gravity_mode: bool = False, gravity: float = 1.0) → None¶ Calculates a 2D graph layout using ForceAtlas2 algorithm. The ForceAtlas2 implementation being used here will not prevent nodes in the graph from overlapping with each other. We aim to improve this in the future.
Parameters: - niter – Number of iterations (default: 500)
- verbose – Print the progress (default: True)
- init_pos – Initial positions of nodes
- disable_rescaling – If True then layout coordinates are not rescaled to only have non negative positions (Default: False)
- outbound_attraction_distribution –
- edge_weight_influence –
- jitter_tolerance –
- barnes_hut_optimize –
- barnes_hut_theta –
- scaling_ratio –
- strong_gravity_mode –
- gravity –
Returns: None
-
Marker
This module contains functions to identify marker genes (genes with significantly high expression) for sub-populations of interest.
-
nabo.
run_de_test
(dataset1: nabo._dataset.Dataset, dataset2, test_cells: List[str], control_cells: List[List[str]], test_label: str = None, control_group_labels: list = None, exp_frac_thresh: float = 0.25, log2_fc_thresh: float = 1, qval_thresh: float = 0.05, tqdm_msg: str = '') → pandas.core.frame.DataFrame¶ Identifies differentially expressed genes using Mann Whitney U test.
Parameters: - dataset1 – nabo.Dataset instance
- dataset2 – nabo.Dataset instance or None
- test_cells – List of cells for which markers have to be found. These could be cells from a cluster, cells with high mapping score, etc.
- control_cells – List of cell groups against which markers need to be found. This could be just one group of cells or multiple groups of cells.
- test_label – Label for test cells.
- control_group_labels – Labels of control cell groups
- exp_frac_thresh – Fraction of cells that should have a non zero value for a gene.
- log2_fc_thresh – Threshold for log2 fold change
- qval_thresh – Threshold for adjusted p value
- tqdm_msg – Message to print while displaying progress
Returns: pd.Dataframe
-
nabo.
find_cluster_markers
(clusters: dict, dataset: nabo._dataset.Dataset, de_frequency: int, exp_frac_thresh: float = 0.25, log2_fc_thresh: float = 0.5, qval_thresh: float = 0.05) → Tuple[pandas.core.frame.DataFrame, Dict[int, List[str]]]¶ Identifies marker genes for each cluster in a Graph. This function works as a wrapper for run_de_test.
Parameters: - clusters – dict
- dataset – nabo.Dataset
- de_frequency – Minimum number of clusters against which a gene should be significantly differentially expressed for it to qualify as a marker
- exp_frac_thresh – Fraction of cells that should have a non-zero value for a gene.
- log2_fc_thresh – Threshold for log2 fold change
- qval_thresh – Threshold for adjusted p value
Returns: A tuple where first element is a pandas DataFrame and second element is a dictionary where keys are cluster numbers and values are lists of marker genes for the corresponding clusters
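The role of de_frequency can be sketched with a small standalone example. This is an illustration under stated assumptions, not Nabo's code: the function `markers_by_frequency` and the `pairwise_hits` input layout (per cluster, the set of significant genes from each comparison against another cluster) are hypothetical.

```python
from collections import Counter


def markers_by_frequency(pairwise_hits, de_frequency):
    """Select genes significant against at least `de_frequency` clusters.

    pairwise_hits maps cluster -> {other_cluster: set of significant genes}
    (a hypothetical layout used only for this sketch).
    """
    markers = {}
    for cluster, hits in pairwise_hits.items():
        # Count in how many pairwise comparisons each gene was significant.
        counts = Counter(g for genes in hits.values() for g in set(genes))
        markers[cluster] = sorted(
            g for g, c in counts.items() if c >= de_frequency)
    return markers


pairwise_hits = {
    1: {2: {"Gata1", "Klf1"}, 3: {"Gata1"}},
    2: {1: {"Cd34"}, 3: {"Cd34", "Mpo"}},
}
result = markers_by_frequency(pairwise_hits, de_frequency=2)
```

A higher de_frequency yields fewer but more cluster-specific markers, since a gene must stand out against more of the other clusters.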
GraphPlot¶
This class allows a highly customized graph visualization. It is created to work seamlessly with Graph class instances. When called it will, by default, automatically produce the graph visualization. This class requires that the set_ref_layout method has been called on the Graph object.
-
class
nabo.
GraphPlot
(g: nabo._graph.Graph, only_ref=True, vc='steelblue', cmap=None, vc_attr=None, vc_default='grey', vec=None, vc_min=None, vc_max=None, vc_percent_trim=None, max_ncolors=40, vs=2, vs_scale=15, vs_min=None, vs_max=None, vs_percent_trim=0, vlw=0, v_alpha=0.6, v_zorder=None, draw_edges: str = 'all', ec='k', elw=0.1, e_alpha=0.1, bundle_edges: bool = False, bundle_bw: float = 0.1, bundle_decay: float = 0.7, edge_min_weight: float = 0, texts=None, texts_fs=20, title=None, title_fs=30, label_attr=None, label_attr_type='centroid', label_attr_pos=(1, 1), label_attr_space=0.05, label_attr_fs=16, rasterized=True, save_name=None, dpi=300, fig_size=(5, 5), show_fig=True, remove_axes=True, ax=None, verbose=False)¶ Class for customized Graph drawing
Parameters: - g – A Graph class instance from Nabo
- only_ref – Only the reference graph is drawn if True (default: True)
- vc – Vertex colour. Can be a valid matplotlib color string, or a dictionary with node names as keys and matplotlib color strings / floats / RGB tuples as values. If floats, the color is selected from the colormap. This parameter is overridden by vc_attr.
- cmap – A valid matplotlib colour map
- vc_attr – Name of graph attribute to be used for colors. Attribute values should either be floats or ints
- vc_default – Default color of a node. Should be either a valid matplotlib string or RGB tuple.
- vc_min – Minimum value for vertex colour. If value is less than this threshold, then value will be reset to this threshold.
- vc_max – Maximum value for vertex colour. If value is greater than this threshold, then value will be reset to this threshold.
- vc_percent_trim – Percentage of values to be ceiled or floored. This will set vc_min and vc_max values based on percentiles. For example, setting it to 1 resets the lowest 1% of values to the 1st percentile and values above the 99th percentile (100 - 1) to the 99th percentile.
- max_ncolors – Maximum number of colours to use
- vs – Vertex size. Should be an integer or float value to set the size for all nodes, or a dictionary with node names as keys and either float or int values.
- vs_scale – Multiplier for vs
- vs_min – Same as vc_min but for vertex size
- vs_max – Same as vc_max but for vertex size
- vs_percent_trim – Same as vc_percent_trim but for vertex size
- vlw – Vertex line width
- v_alpha – Transparency/alpha value for vertices. Should be between 0 and 1
- draw_edges – Can be either: ‘all’, ‘ref’, ‘target’, ‘none’ ( Default: ‘all’)
- ec – Edge colour
- elw – Edge line width
- e_alpha – Edge transparency/alpha value
- texts – Text to be placed on the graph. Should be a dictionary with keys as texts and values as tuple for xy coordinates
- texts_fs – Font size for texts
- title – Title for the Graph
- title_fs – Title font size
- label_attr – Node attribute to use to retrieve labels
- label_attr_type – Can be either ‘legend’ or ‘centroid’
- label_attr_pos – Tuple for xy coords to position start of labels. Only used when label_attr_type is ‘centroid’
- label_attr_space – Spacing between labels. Only used when label_attr_type is ‘centroid’
- label_attr_fs – Label font size
- rasterized – If True, then rasterize the scatter points
- save_name – File name for saving figure
- fig_size – Figure size. Should be a tuple (width, height)
- show_fig – If True then show figure
- remove_axes – Remove axis and ticklabels if set to True (Default: True)
- ax – Matplotlib axis. Draws on this axis rather than create new.
- verbose – If True, then prints messages.
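The clamping described for vc_min, vc_max and vc_percent_trim (and their vs_* counterparts) can be sketched as follows. This is a standalone illustration, not the class's actual code: the helpers `percentile` and `trim_values` are hypothetical, linear interpolation of percentiles is an assumption, and so is the choice that a percent trim takes precedence over explicit minimum and maximum values.

```python
def percentile(sorted_vals, q):
    """Linearly interpolated percentile (q in [0, 100]) of a sorted list."""
    pos = (len(sorted_vals) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (sorted_vals[hi] - sorted_vals[lo]) * (pos - lo)


def trim_values(values, percent_trim=None, vmin=None, vmax=None):
    """Clamp values to [vmin, vmax], optionally derived from percentiles."""
    ordered = sorted(values)
    if percent_trim is not None:
        # Assumption: percentiles replace any explicit vmin / vmax.
        vmin = percentile(ordered, percent_trim)
        vmax = percentile(ordered, 100 - percent_trim)
    if vmin is None:
        vmin = ordered[0]
    if vmax is None:
        vmax = ordered[-1]
    # Values below vmin are raised to vmin; values above vmax are lowered.
    return [min(max(v, vmin), vmax) for v in values]


trimmed = trim_values(list(range(101)), percent_trim=1)
clamped = trim_values([5, 1, 9], vmin=2, vmax=8)
```

Trimming of this kind keeps a few extreme values from dominating the colour or size scale, so differences among the bulk of the nodes remain visible.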