Closed
Changed equality checks, e.g. from "x == None" to "x is None"
…in sequences Found more mixed spaces and tabs
….g. chrNA) 2. Lots of linting (spaces vs tabs) 3. valid VCF output
changed default to 1
set default chromosomes
Automatic hg19 conversion Verified targeted and non-targeted conversion
# Conflicts: # ensembldb/main.py # requirements.txt
Update requirements
1.3.4.1 Release
Major reordering for better use.
Resolved #3
…dges of target region. Does not include variants that include the target region without straddling or being fully inside.
… added repeats tests.
…), and proper dup/ins at splice junctions. Added Selenocysteine (cause a crash in ODP). Speed improvements (5X).
…se or repeats vs ins vs dup
Steven-N-Hart
requested changes
Mar 18, 2022
Owner
Steven-N-Hart
left a comment
There are many things that need to be fixed in this PR before I can spend the effort to review the changes you propose. So far, there are 2 main reasons I am declining this PR as it stands:
- Many of the "changes" appear to be relative to the main (source) CAVA code, and not Steven-N-Hart/CAVA. Please start with the main branch and then add your edits.
- There are many unnecessary files that got accidentally committed (e.g. cava/share/doc/pycurl/INSTALL.rst, SampleAll_*, etc.)
| ID CURATED_CSN NEW_CSN N NEW_HGVSC NEW_HGVSP DIFF_CSN | ||
| ID=chr3-37050680-G-GGAGAGA NM_000249.4(MLH1):c.+27_+28insGAGAGA c.*28_*27GA[0]%3B[3] NC_000003.12(NM_000249.4):c.*28_*27GA[0]%3B[3] . --- >>c.+27_+28insGAGAGA<< | ||
Owner
This file should not be committed
New Feature:
Added CAVA_HGVSg with support for ins, del, dup, and repeats.
Support gVCF <NON_REF> or <*> alleles (they are repeated unchanged in the output).
Remove results of previous CAVA annotation runs from the INFO field before adding new ones.
Added "build" option to the config file (defaults to GRCh38).
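The INFO clean-up and gVCF pass-through described above can be sketched roughly as follows. This is only an illustration, not the code in this PR: the function names are hypothetical, and it assumes that every CAVA annotation key starts with the prefix `CAVA_` (as `CAVA_HGVSg` does).

```python
# Hypothetical helpers; the "CAVA_" key prefix is an assumption.
SYMBOLIC_ALTS = {"<NON_REF>", "<*>"}


def strip_cava_info(info, prefix="CAVA_"):
    """Drop INFO entries whose key starts with `prefix`, keep the rest."""
    if info in (".", ""):
        return "."
    kept = [
        field
        for field in info.split(";")
        if field and not field.split("=", 1)[0].startswith(prefix)
    ]
    # An empty INFO column is written as "." in VCF.
    return ";".join(kept) if kept else "."


def is_symbolic_alt(alt):
    """True for gVCF placeholder alleles that are copied to the output as-is."""
    return alt in SYMBOLIC_ALTS
```

A caller would run `strip_cava_info` on each record before appending fresh annotations, and skip annotation for any ALT where `is_symbolic_alt` is true.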
Bug fixes.
Fixed reference format to be NC_...(NM...):c.* instead of NM(gene):c.* to allow support for introns.
HGVS around splice sites is now handled properly for ins/dups/repeats; added numerous unit tests.
Support for variants at the ends of chromosome M (does not circularize, though). Circularization is not a big issue since there are no transcripts annotated near the ends.
Updated to the new HGVS format for dup and del (the sequence after dup/del is omitted as redundant, e.g. 123dup rather than 123dupT).
Updated to use cDNA notation with a start and stop range for insertions and repeats (122_123insCC, not 123insCC).
Repeat annotation can change the alleles relative to ref after shifting (as per HGVS).
Fixed variants 1 bp before the initiator methionine and within the first codon.
Fixed off-by-one problem with extensions (bug introduced when 'ext' was added in the previous version).
Added cDNA inversions to CSN (prioritized over ins).
Added repeats, properly prioritized ahead of dup/ins and not applied in coding regions.
Fixed annotation in CSN for variants after the stop codon (+3 becomes *3).
Fixed bugs for indels in 1st codon.
Changed protein prediction for variants in first codon from 'p.?' to 'p.Met1?'
Added proper treatment of genes with selenocysteine: the cDNA/stop annotation is trusted and the sequence is not scanned for a stop codon.
Centralized chromosome mappings between 'chrNumber' and 'Number' and the various chromosome M nomenclatures (MT, chrM, chrMT, M).
Proper annotation of variants partly overlapping the ends of a transcript (the pos in -pos or *pos can extend past the edge of the annotated transcript, as these are uncertain, as per communication with the HGVS team).
Fixed CAVA_HGVSc to appropriately limit repeats with a multiple of 3 in coding sequence.
Fixed CSN so that any variant modifying the first AA is reported as Met1?
Fixed CSN so that essential splice site modifications have _p.?
Fixed repeat calculation code (both genomic and cDNA) since last checkout.
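The dup/del and insertion naming rules listed above can be illustrated with a minimal sketch. The helper names are hypothetical and not taken from the CAVA code, and only plain integer coding positions are handled (no intronic offsets, `-pos` or `*pos`).

```python
# Hypothetical formatting helpers for plain integer cDNA positions.
def _span(start, end):
    """Single position as '123', a range as '123_125'."""
    return str(start) if start == end else f"{start}_{end}"


def format_dup(start, end):
    """Duplication without the redundant trailing sequence, e.g. '123dup'."""
    return f"{_span(start, end)}dup"


def format_del(start, end):
    """Deletion without the redundant trailing sequence, e.g. '10_12del'."""
    return f"{_span(start, end)}del"


def format_ins(pos_before, seq):
    """Insertion named by both flanking positions, e.g. '122_123insCC'."""
    return f"{pos_before}_{pos_before + 1}ins{seq}"
```

For example, `format_ins(122, "CC")` gives `122_123insCC`, matching the notation in the list above.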
Speed improvements based on profiling with py-top:
Cache variant getters (e.g. isInsertion() calls become a check of the .is_insertion flag, isDeletion() calls become a check of the .is_deletion flag)
Cache 1bp before indels instead of refetching.
Cache Sequence Fetch in big chunks (marginal improvement for exome, big improvement for whole genome).
Cache transcripts loading and building, so it is only done once. Big speedup.
Speed up left/right shifting by fetching chunks of sequence rather than small bits at a time.
Also merge fetching of the left and right sequence into one big chunk surrounding the variant.
Speed up protein annotation by rewriting some code to trim non-mutated AA (based on profiling)
Includes code for HGVS DNA nomenclature (but not added yet).
Avoid calling the “calculateCSNCoordinates” function multiple times on the same data by passing parameters around.
Cached the repeat normalization code so it only runs once.
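The chunked sequence caching mentioned above could look something like the sketch below. It is an assumption-laden illustration rather than the PR's implementation: `ChunkedSequence` and the `fetch(chrom, start, end)` callback (1-based, inclusive) are hypothetical names, and the chunk size and cache size are arbitrary.

```python
from functools import lru_cache


class ChunkedSequence:
    """Serve small sequence requests from large cached chunks (hypothetical)."""

    def __init__(self, fetch, chunk_size=100_000):
        # fetch(chrom, start, end) -> str is assumed 1-based and inclusive.
        self._fetch = fetch
        self._chunk_size = chunk_size
        self._get_chunk = lru_cache(maxsize=64)(self._load_chunk)

    def _load_chunk(self, chrom, index):
        start = index * self._chunk_size + 1
        end = start + self._chunk_size - 1
        return self._fetch(chrom, start, end)

    def get(self, chrom, start, end):
        """Return bases start..end (1-based, inclusive)."""
        pieces = []
        first = (start - 1) // self._chunk_size
        last = (end - 1) // self._chunk_size
        for index in range(first, last + 1):
            chunk = self._get_chunk(chrom, index)
            chunk_start = index * self._chunk_size + 1
            lo = max(start, chunk_start) - chunk_start
            hi = min(end, chunk_start + len(chunk) - 1) - chunk_start + 1
            pieces.append(chunk[lo:hi])
        return "".join(pieces)
```

Repeated lookups near the same variant, such as left and right shifting, then hit the in-memory chunk instead of going back to the reference file each time.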