816. Ambiguous Coordinates

This post presents an algorithm for parsing an ambiguous two-dimensional coordinate string such as "(123)" and returning every coordinate pair it could originally have been, e.g. ["(1, 23)", "(12, 3)", "(1.2, 3)", "(1, 2.3)"]. Starting from a string whose commas, decimal points, and spaces were removed, the algorithm reconstructs all legal coordinates, ensuring every candidate is valid and none is repeated.

We had some 2-dimensional coordinates, like "(1, 3)" or "(2, 0.5)".  Then, we removed all commas, decimal points, and spaces, and ended up with the string S.  Return a list of strings representing all possibilities for what our original coordinates could have been.

Our original representation never had extraneous zeroes, so we never started with numbers like "00", "0.0", "0.00", "1.0", "001", "00.01", or any other number that can be represented with fewer digits.  Also, a decimal point within a number never occurs without at least one digit occurring before it, so we never started with numbers like ".1".

The final answer list can be returned in any order.  Also note that all coordinates in the final answer have exactly one space between them (occurring after the comma.)

 

Example 1:
Input: "(123)"
Output: ["(1, 23)", "(12, 3)", "(1.2, 3)", "(1, 2.3)"]
Example 2:
Input: "(00011)"
Output:  ["(0.001, 1)", "(0, 0.011)"]
Explanation: 
0.0, 00, 0001 or 00.01 are not allowed.
Example 3:
Input: "(0123)"
Output: ["(0, 123)", "(0, 12.3)", "(0, 1.23)", "(0.1, 23)", "(0.1, 2.3)", "(0.12, 3)"]
Example 4:
Input: "(100)"
Output: ["(10, 0)"]
Explanation: 
1.0 is not allowed.

 

Note:

  • 4 <= S.length <= 12.
  • S[0] = "(", S[S.length - 1] = ")", and the other elements in S are digits.

 

Approach #1: Brute Force. [Java] [Memory Limit Exceeded]

import java.util.ArrayList;
import java.util.List;

class Solution {
    public List<String> ambiguousCoordinates(String S) {
        List<String> ans = new ArrayList<>();
        // work on the digit string without the surrounding parentheses
        StringBuilder sb = new StringBuilder(S.substring(1, S.length() - 1));
        for (int i = 1; i < sb.length(); ++i) {
            StringBuilder prefix = new StringBuilder(sb.substring(0, i));
            StringBuilder suffix = new StringBuilder(sb.substring(i));
            if (valid(prefix) && valid(suffix)) {
                List<String> l1 = split(prefix);
                List<String> l2 = split(suffix);

                for (int j = 0; j < l1.size(); ++j) {
                    for (int k = 0; k < l2.size(); ++k) {
                        String temp = "(" + l1.get(j) + ", " + l2.get(k) + ")";
                        ans.add(temp);
                    }
                }
            }
        }
        return ans;
    }

    // enumerate every legal way to write the digit string, with or without a decimal point
    public List<String> split(StringBuilder sb) {
        List<String> ret = new ArrayList<>();
        if (sb.length() == 1) {
            // a single digit is always legal as-is
            ret.add(sb.toString());
            return ret;
        } else if (sb.charAt(0) == '0') {
            // leading zero: "0.xxx" is the only possible form, and only if the last digit is not '0'
            if (sb.charAt(sb.length() - 1) != '0') {
                ret.add("0." + sb.substring(1));
            }
            return ret;
        } else if (sb.charAt(sb.length() - 1) == '0') {
            // trailing zero: a decimal point would create an extraneous zero, so keep the integer form only
            ret.add(sb.toString());
            return ret;
        } else {
            ret.add(sb.toString());                          // the integer form
            for (int i = 1; i < sb.length(); ++i) {
                StringBuilder temp = new StringBuilder(sb);  // copy so insertions do not accumulate
                temp.insert(i, '.');
                ret.add(temp.toString());
            }
        }
        return ret;
    }

    // a digit string can form at least one legal number unless it both starts and ends with '0'
    public boolean valid(StringBuilder sb) {
        if (sb.length() == 1) return true;
        return !(sb.charAt(0) == '0' && sb.charAt(sb.length() - 1) == '0');
    }
}

  

Approach #2: String. [Java]

import java.util.ArrayList;
import java.util.List;

class Solution {
    public List<String> ambiguousCoordinates(String S) {
        List<String> ans = new ArrayList<>();
        int n = S.length();
        // split the digits between the parentheses into a left part and a right part
        for (int i = 1; i < n - 1; ++i) {
            List<String> A = f(S.substring(1, i)), B = f(S.substring(i, n - 1));
            for (String a : A) for (String b : B) ans.add("(" + a + ", " + b + ")");
        }
        return ans;
    }

    // return every legal number the digit string s could represent
    public List<String> f(String s) {
        int n = s.length();
        List<String> ret = new ArrayList<>();
        // empty, or starts and ends with '0': no legal form
        if (n == 0 || n > 1 && s.charAt(0) == '0' && s.charAt(n - 1) == '0') return ret;
        // leading zero: "0.xxx" is the only legal form
        if (n > 1 && s.charAt(0) == '0') {
            ret.add("0." + s.substring(1));
            return ret;
        }
        ret.add(s);  // the integer form
        // a single digit or a trailing zero rules out any decimal placement
        if (n == 1 || s.charAt(n - 1) == '0') return ret;
        for (int i = 1; i < n; ++i) {
            ret.add(s.substring(0, i) + "." + s.substring(i, n));
        }
        return ret;
    }
}
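
For a quick sanity check, here is a minimal driver sketch that exercises Approach #2 on Examples 1 and 4. The class name Demo is arbitrary, and it assumes the Solution class above is compiled alongside it.

import java.util.List;

public class Demo {
    public static void main(String[] args) {
        Solution sol = new Solution();

        List<String> ex1 = sol.ambiguousCoordinates("(123)");
        System.out.println(ex1);  // expected, in any order: [(1, 23), (1, 2.3), (12, 3), (1.2, 3)]

        List<String> ex4 = sol.ambiguousCoordinates("(100)");
        System.out.println(ex4);  // expected: [(10, 0)] -- "1.0" and "0.0" would carry extraneous zeroes
    }
}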

 

Analysis:

The helper f(S) enumerates every legal number that a digit string S could represent, where X stands for an arbitrary digit:

if S == "" : return []

if S == "0" : return [S]

if S == "0XXXX0" : return []

if S == "0XXXX" : return ["0.XXXX"]

if S == "XXXX0" : return [S]

return [S, "X.XXX", "XX.XX", "XXX.X" ...]
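
To tie the case analysis back to the code, a small check sketch of f on representative digit strings follows. It assumes the Solution class from Approach #2 (whose f method is public), and AnalysisCheck is just a hypothetical wrapper name; the expected lists in the comments follow directly from the cases above.

public class AnalysisCheck {
    public static void main(String[] args) {
        Solution sol = new Solution();
        System.out.println(sol.f(""));     // []                -> empty string
        System.out.println(sol.f("0"));    // [0]               -> a lone zero
        System.out.println(sol.f("0120")); // []                -> "0XXX0" has no legal form
        System.out.println(sol.f("012"));  // [0.12]            -> "0XXX" forces "0.XX"
        System.out.println(sol.f("120"));  // [120]             -> a trailing zero forbids a decimal point
        System.out.println(sol.f("123"));  // [123, 1.23, 12.3] -> the integer form plus every decimal placement
    }
}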

  

Reference:

https://leetcode.com/problems/ambiguous-coordinates/discuss/123851/C%2B%2BJavaPython-Solution-with-Explanation

 

Reposted from: https://www.cnblogs.com/ruruozhenhao/p/10786871.html
