Hgvsg#6

Closed
sicotteh wants to merge 102 commits into Steven-N-Hart:master from sicotteh:hgvsg

@sicotteh

New features:
Added CAVA_HGVSg, with support for ins, del, dup, and repeats.
Support gVCF <NON_REF> or <*> alleles (they are simply repeated in the output).
Clean results from previous CAVA annotation runs out of the INFO field before adding new ones.
Added "build" options to the config file (defaults to GRCh38).
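
As a rough illustration of the INFO clean-up step above, here is a minimal sketch. It assumes that earlier CAVA output can be recognized by a `CAVA_` key prefix (as in CAVA_HGVSg); the function name `strip_cava_info` and that prefix rule are hypothetical and not taken from the CAVA code itself.

```python
def strip_cava_info(info, prefix="CAVA_"):
    """Return a VCF INFO string without entries whose key starts with `prefix`.

    Assumption: previous annotation runs wrote keys beginning with `prefix`.
    An INFO field that ends up empty is written as '.', as VCF requires.
    """
    if info in ("", "."):
        return "."
    kept = []
    for entry in info.split(";"):
        if not entry:
            continue  # tolerate stray ';;'
        key = entry.split("=", 1)[0]  # flag entries have no '=' and are their own key
        if not key.startswith(prefix):
            kept.append(entry)
    return ";".join(kept) if kept else "."
```

Calling it on every record before annotation would keep repeated runs from stacking duplicate fields.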

Bug fixes.

Fixed reference format to be NC_...(NM...):c.* instead of NM(gene):c.* to allow support for introns.
HGVS around splice sites is now handled properly for ins/dups/repeats; added numerous unit tests.
Support for variants at the ends of chromosome M (it does not circularize, though). Circularization is not a big issue since there are no transcripts annotated near the ends.
Updated to the new HGVS for dup and del (omit the sequence after dup/del since it is redundant, e.g. 123dup rather than 123dupT).
Updated to use cDNA notation with a start and stop range for insertions and repeats (122_123insCC, not 123insCC).
Repeat annotation can change the alleles compared to ref after shifting (as per HGVS).
Fixed variants 1 bp before the Methionine and right in the first codon.
Fixed off-by-one problem with extensions (bug introduced when 'ext' was introduced in the previous version).
Added cDNA inversions to CSN (prioritized over ins).
Added repeats, properly prioritized ahead of dup/ins and not applied in coding regions.
Fixed annotation in CSN for variants after the stop codon (+3 becomes *3).
Fixed bugs for indels in the first codon.
Changed protein prediction for variants in the first codon from 'p.?' to 'p.Met1?'.
Added proper treatment of genes with Selenocysteine: trust the cDNA/stop annotation and do not scan for a stop codon.
Centralized chromosome mappings between 'chrNumber' and 'Number' and the various chromosome M nomenclatures (MT, chrM, chrMT, M).
Proper annotation of variants partly overlapping the ends of a transcript: the pos in -pos or *pos can extend past the edge of the annotated transcript, as these are uncertain (as per communication with the HGVS team).
Fixed CAVA_HGVSc to appropriately limit repeats with multiple of 3 in coding sequence.
Fixed CSN so that any variant modifying the first AA is reported as Met1?.
Fixed CSN so that essential splice site modifications get _p.?.
Fixed repeat calculation code (both genomic and cDNA) since last checkout.
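
The notation changes listed above (no trailing sequence on dup/del, explicit flanking range for insertions, `*` positions after the stop codon) can be sketched as follows. The helper names are hypothetical and only illustrate the string formats; they are not the functions used in this PR, and positions are passed in already resolved.

```python
def hgvs_range(start, end):
    """Format a position range; a single position is written only once."""
    return str(start) if start == end else f"{start}_{end}"


def hgvs_del(start, end):
    """Deletion without the redundant deleted sequence, e.g. 123del."""
    return f"{hgvs_range(start, end)}del"


def hgvs_dup(start, end):
    """Duplication without the redundant duplicated sequence, e.g. 123dup."""
    return f"{hgvs_range(start, end)}dup"


def hgvs_ins(left, right, inserted):
    """Insertion named by both flanking positions, e.g. 122_123insCC."""
    return f"{left}_{right}ins{inserted}"


def after_stop(offset):
    """Position in the 3' UTR, counted from the stop codon, e.g. *3."""
    return f"*{offset}"
```

For example, `hgvs_ins(122, 123, "CC")` gives `122_123insCC` and `hgvs_dup(123, 123)` gives `123dup`.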

Speed improvements based on profiling with py-top:
Cache variant getters (e.g. isInsertion() calls become a check of the .is_insertion flag, isDeletion() calls become a check of the .is_deletion flag).
Cache the 1 bp before indels instead of refetching it.
Cache sequence fetches in big chunks (marginal improvement for exome, big improvement for whole genome).
Cache transcript loading and building so it is only done once. Big speedup.
Speed up left/right shifting by fetching chunks of sequence rather than small bits at a time.
.. Also merge fetching of left/right sequence into one big chunk surrounding the variant.
Speed up protein annotation by rewriting some code to trim non-mutated AA (based on profiling).
Includes code for HGVS DNA nomenclature (but not added yet).
Avoid calling the “calculateCSNCoordinates” function multiple times on the same data by passing parameters around.

Cached repeat normalization code so it only runs once.
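
To show the idea behind the chunked sequence cache, here is a minimal sketch. The class name `ChunkedSequenceCache`, the `fetch(chrom, start, end)` callable (0-based, half-open) and the default chunk size are all assumptions made for illustration; the code in this PR may be organized differently.

```python
class ChunkedSequenceCache:
    """Serve small sequence requests from one large cached chunk.

    `fetch` is any callable (chrom, start, end) -> str using 0-based,
    half-open coordinates, e.g. a thin wrapper around a FASTA reader.
    """

    def __init__(self, fetch, chunk_size=100_000):
        self._fetch = fetch
        self._chunk_size = chunk_size
        self._key = None   # (chrom, chunk index) of the cached chunk
        self._chunk = ""

    def get(self, chrom, start, end):
        size = self._chunk_size
        index = start // size
        chunk_start = index * size
        if end > chunk_start + size:
            # Request crosses a chunk boundary: fetch it directly.
            return self._fetch(chrom, start, end)
        if self._key != (chrom, index):
            # Cache miss: load the whole chunk once.
            self._chunk = self._fetch(chrom, chunk_start, chunk_start + size)
            self._key = (chrom, index)
        return self._chunk[start - chunk_start:end - chunk_start]
```

Because a sorted VCF visits neighbouring positions in order, a single cached chunk is usually enough; whole-genome runs benefit most since nearly every request hits the cache.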

dariober and others added 30 commits September 28, 2018 11:14
Changed equality e.g from "x == None" to "x is None"
…in sequences

Found more mixed spaces and tabs
….g. chrNA)

2. Lots of linting (spaces vs tabs)
3. valid VCF output
changed default to 1
set default chromosomes
Automatic hg19 conversion
Verified targeted and non-targeted conversion
# Conflicts:
#	ensembldb/main.py
#	requirements.txt
Steven-N-Hart and others added 28 commits June 11, 2021 16:11
Major reordering for better use.
…dges of target region. Does not include variants that include the target region without straddling or being fully inside.
…), and proper dup/ins at splice junctions. Added Selenocysteine (cause a crash in ODP). Speed improvements (5X).
Owner

@Steven-N-Hart Steven-N-Hart left a comment

There are many things that need to be fixed in this PR before I can spend the effort to review the changes you propose. So far, there are 2 main reasons I am declining this PR as it stands:

  1. Many of the "changes" appear to be relative to the main (source) CAVA code, and not Steven-N-Hart/CAVA. Please start with the main branch and then add your edits.
  2. There are many unnecessary files that got accidentally committed (e.g. cava/share/doc/pycurl/INSTALL.rst, SampleAll_*, etc.).

ID CURATED_CSN NEW_CSN N NEW_HGVSC NEW_HGVSP DIFF_CSN
ID=chr3-37050680-G-GGAGAGA NM_000249.4(MLH1):c.+27_+28insGAGAGA c.*28_*27GA[0]%3B[3] NC_000003.12(NM_000249.4):c.*28_*27GA[0]%3B[3] . --- >>c.+27_+28insGAGAGA<<


Owner

This file should not be committed

@sicotteh sicotteh closed this Mar 18, 2022