Hgvsg by sicotteh · Pull Request #6 · Steven-N-Hart/CAVA

sicotteh · 2022-03-18T04:24:10Z

New features:
Added CAVA_HGVSg, with support for ins, del, dup, and repeats.
Support for gVCF <NON_REF> or <*> alleles (they are simply repeated in the output).
Results from previous CAVA annotation runs are cleaned from the INFO field before new ones are added.
Added a "build" option to the config file (defaults to GRCh38).
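
For illustration, cleaning earlier annotations out of the INFO field could look roughly like the sketch below. This is not the code in this PR: the helper name `strip_cava_info` and the assumption that every CAVA key starts with `CAVA_` are hypothetical.

```python
def strip_cava_info(info, prefix="CAVA_"):
    """Return a VCF INFO string without entries whose key starts with `prefix`.

    Hypothetical helper: assumes all earlier CAVA annotations use keys
    beginning with `prefix`, which may not match the real key names.
    """
    if info in ("", "."):
        return "."
    kept = []
    for entry in info.split(";"):
        # Flag entries have no "=", so the whole entry is the key.
        key = entry.split("=", 1)[0]
        if entry and not key.startswith(prefix):
            kept.append(entry)
    # An empty INFO column is written as "." in VCF.
    return ";".join(kept) if kept else "."


print(strip_cava_info("DP=10;CAVA_HGVSg=x;AF=0.5"))
```

Other INFO entries, including flags without a value, are passed through untouched.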

Bug fixes.

Fixed reference format to be NC_...(NM...):c.* instead of NM(gene):c.* to allow support for introns.
HGVS around splice sites is now handled properly for ins/dups/repeats, and numerous unit tests were added.
Support for variants at the ends of chromosome M (does not circularize, though). Circularization is not a big issue since there are no transcripts annotated near the ends.
Updated to the new HGVS notation for dup and del (the sequence after dup/del is omitted as redundant, e.g. 123dup rather than 123dupT).
Updated to use cDNA notation with a start and stop range for insertions and repeats (122_123insCC, not 123insCC).
Repeat annotation can change the alleles compared to ref after shifting (as per HGVS).
Fixed handling of variants 1 bp before the Methionine and right in the first codon.
Fixed off-by-one problem with extensions (bug introduced when 'ext' was introduced in the previous version).
Added cDNA inversions to CSN (prioritized over ins).
Added repeats, properly prioritized ahead of dup/ins and not applied in coding regions.
Fixed annotation in CSN for variants after the stop codon (+3 becomes *3).
Fixed bugs for indels in 1st codon.
Changed protein prediction for variants in first codon from 'p.?' to 'p.Met1?'
Added proper treatment of genes with Selenocysteine: the cDNA/stop annotation is trusted and the sequence is not scanned for a stop codon.
Centralized chromosome mappings between 'chrNumber' and 'Number' and the various chromosome M (MT, chrM, chrMT, M) nomenclatures.
Proper annotation of variants partly overlapping the ends of a transcript: the pos in -pos or *pos can extend past the edge of the annotated transcript, as these are uncertain (as per communication with the HGVS team).
Fixed CAVA_HGVSc to appropriately limit repeats to multiples of 3 in coding sequence.
Fixed CSN so that any variant modifying the first AA is reported as Met1?
Fixed CSN so that essential splice site modifications have _p.?
Fixed repeat calculation code (both genomic and cDNA) since last checkout.
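
As a rough illustration of the dup/del and insertion notation described in the list above, here is a minimal sketch. The function names are hypothetical and not taken from CAVA, and positions are plain integers, so intronic offsets and `*`/`-` positions are deliberately left out.

```python
def _span(start, end):
    """Format a position or a start_end range (plain integer positions only)."""
    return str(start) if start == end else f"{start}_{end}"


def format_dup(start, end):
    """Duplication without the redundant duplicated sequence, e.g. 123dup."""
    return _span(start, end) + "dup"


def format_del(start, end):
    """Deletion without the redundant deleted sequence, e.g. 10del."""
    return _span(start, end) + "del"


def format_ins(left, inserted):
    """Insertion written with both flanking positions, e.g. 122_123insCC."""
    return f"{left}_{left + 1}ins{inserted}"


print(format_dup(123, 123))
print(format_ins(122, "CC"))
```

The two calls reproduce the examples from the description: `123dup` and `122_123insCC`.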

Speed improvements based on profiling with py-top:
Cache variant getters (e.g. isInsertion() becomes .is_insertion, and isDeletion() calls become a check of the .is_deletion flag).
Cache the 1 bp before indels instead of refetching.
Cache sequence fetches in big chunks (marginal improvement for exome, big improvement for whole genome).
Cache transcript loading and building, so it is only done once. Big speedup.
Speed up left/right shifting by fetching chunks of sequence rather than small bits at a time.
Also merge fetching of left/right sequence into one big chunk surrounding the variant.
Speed up protein annotation by rewriting some code to trim non-mutated AA (based on profiling).
Includes code for HGVS DNA nomenclature (but not added yet).
Avoid calling the “calculateCSNCoordinates” function multiple times on the same data by passing parameters around.

Cached repeat normalization code to only run once.
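
The chunked sequence caching idea can be sketched as follows. This is an illustration under stated assumptions, not the implementation in this PR: the class name `ChunkedSequenceCache`, the `fetch(chrom, start, end)` callable (standing in for whatever reads the reference FASTA), and the 0-based half-open coordinates are all hypothetical.

```python
class ChunkedSequenceCache:
    """Serve small sequence requests from one large cached chunk.

    `fetch` is a hypothetical callable (chrom, start, end) -> str using
    0-based, half-open coordinates.
    """

    def __init__(self, fetch, chunk_size=100_000):
        self._fetch = fetch
        self._chunk_size = chunk_size
        self._key = None   # (chrom, chunk index) of the cached chunk
        self._chunk = ""

    def get(self, chrom, start, end):
        size = self._chunk_size
        index = start // size
        # Requests crossing a chunk boundary bypass the cache in this sketch.
        if end > (index + 1) * size:
            return self._fetch(chrom, start, end)
        key = (chrom, index)
        if key != self._key:
            # One large fetch replaces many small ones for nearby variants.
            self._chunk = self._fetch(chrom, index * size, (index + 1) * size)
            self._key = key
        offset = start - index * size
        return self._chunk[offset:offset + (end - start)]


# Toy usage with an in-memory "genome" and a call counter.
genome = "ACGT" * 10
calls = []

def toy_fetch(chrom, start, end):
    calls.append((chrom, start, end))
    return genome[start:end]

cache = ChunkedSequenceCache(toy_fetch, chunk_size=16)
first = cache.get("1", 2, 6)
second = cache.get("1", 5, 9)   # served from the cached chunk
```

Keeping only the most recent chunk suits input sorted by position; unsorted input would refetch more often.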


Steven-N-Hart

There are many things that need to be fixed in this PR before I can spend the effort to review the changes you propose. So far, there are 2 main reasons I am declining this PR as it stands:

Many of the "changes" appear to be relative to the main (source) CAVA code, and not Steven-N-Hart/CAVA. Please start with the main branch and then add your edits.
There are many unnecessary files that got accidentally committed (e.g. cava/share/doc/pycurl/INSTALL.rst, SampleAll_*, etc.).

Steven-N-Hart · 2022-03-18T13:04:30Z

SampleAll_diff.edited.txt

+ID                      CURATED_CSN                     NEW_CSN N       NEW_HGVSC                               NEW_HGVSP   DIFF_CSN
+ID=chr3-37050680-G-GGAGAGA	NM_000249.4(MLH1):c.+27_+28insGAGAGA	c.*28_*27GA[0]%3B[3]	NC_000003.12(NM_000249.4):c.*28_*27GA[0]%3B[3]	.   ---   >>c.+27_+28insGAGAGA<<
+
+


This file should not be committed

dariober and others added 30 commits September 28, 2018 11:14

Fix filename in if-condition

1282f6f

Resolved merge conflict by incorporating both suggestions.

923fd97

Merge remote-tracking branch 'origin/master'

f0018da

Fixed inconsistent tabs and spaces

2818a6b

Changed equality e.g from "x == None" to "x is None"

fixed some PEP8 violations and an incorrect import when getting prote…

f5a8d18

…in sequences Found more mixed spaces and tabs

updated version number, fixed "feature" that silently dropped variant…

82fdf37

…s with no alt.

line endings and updated release info

abd587b

Added HGVSc

4641f12

fixed incorrect HGVS INFO filed order.

ea1d69f

1. No longer drops chromosomes not present in the definition file. (e…

da6a0c9

….g. chrNA) 2. Lots of linting (spaces vs tabs) 3. valid VCF output

using python 3.8.2, had to use newer versions of gevent & six

732af07

Update config_template.txt

79ff780

changed default to 1

Update config_template.txt

327d660

set default chromosomes

Updates to get CAVA running in ODP

ee81136

Added Docker file and README.md

0f9ba23

Added Docker file and updated README.md

2cc7e31

fixed parameter error message for normalized_mitochondrial_chrom

89def67

allow sorting by M and consider transcripts for chrom 'M'

939b648

Standardized VERSION

a37215e

Automatic hg19 conversion Verified targeted and non-targeted conversion

added test VCF files

6960c19

added test VCF files & procedures

21f05e9

Merge pull request #13 from Steven-N-Hart/ref_db_build

81fcd94

Ref db build

Changed instructions for compiling tx.db

c1a847e

Changed instructions for compiling tx.db

62585d2

updated tests

8e49d47

adding mt normalization

8cbcb23

Merge branch 'mane' into update-requirements

f57d603

# Conflicts: # ensembldb/main.py # requirements.txt

fixed line separators

b44013b

Merge pull request #14 from Steven-N-Hart/update-requirements

6fd59d2

Update requirements

updated version info. Scripts for MANE transcripts.

2bbeebe

Steven-N-Hart and others added 28 commits June 11, 2021 16:11

Added end2end testing

c4135c8

added process to get test reference data

ff48b03

fix import structures

cfcddc6

e2e test passing

16bb1ee

removed extra configs, reformats

3a47a35

Both tests passed, reformatting, and re-architecting methods from Pyc…

df59f08

…harm suggestions

updated README.md

d4cbfb9

Released on pypi

39a8458

Fixed locations of database builds

a7d7a9e

Added Hugues tests for Met1toAA and EarlyStopInMiddle3BP

ce507a5

Merge pull request #1 from Steven-N-Hart/dev

edb77d4

1.3.4.1 Release

Reorder

259005d

Merge pull request #2 from Steven-N-Hart/dev

0ab447f

Major reodering for better use.

Adding support for gVCFs

957934f

Update mane_db_prep.py

e4c29f3

Resolved #3

Create python-publish.yml

41c7ac7

Update VERSION

46dea55

Merge pull request Steven-N-Hart#4 from Steven-N-Hart/master

ad28642

Master

Add back test directory AND allow variants that either straddle the e…

3fd9823

…dges of target region. Does not include variants that include the target region without straddling or being fully inside.

push changes for fix HGVS issues

525390e

added tests for BRCA1 deletions/insertions right outside the ends and…

8f50358

… added repeats tests.

another data dir added to .gitignore

2c27261

temporararely remote file from .gitignore

b570a19

temporararely remote file from .gitignore

6d10c7b

add .gitignore after removing large file from history using filter

538aa27

Fixed bugs around splice site, Methionine. and added 5X speed improve…

33a233f

…ments

Added unit tests, support repeats [] annotations, dna inversions (inv…

14d6725

…), and proper dup/ins at splice junctions. Added Selenocysteine (cause a crash in ODP). Speed improvements (5X).

Added CAVA_HGVSg, removing old CAVA annotation, and debugged proper u…

2e9c132

…se or repeats vs ins vs dup

Steven-N-Hart requested changes Mar 18, 2022


sicotteh closed this Mar 18, 2022

sicotteh wants to merge 102 commits into Steven-N-Hart:master from sicotteh:hgvsg


4 participants
