GitHub - whr812756608/GODESS_preprocess: Data cleaning and preprocessing code for Simulated GODESS dataset, also include a 2D GNN model

Data cleaning and preprocessing directory for the GODESS dataset.

Data cleaning, preprocessing and annotating doc for simulated GODESS dataset

We summarize the whole process in the 7 steps below. Researchers interested in uploading carbohydrate data to this or similar datasets should note that this preprocessing pipeline could be streamlined or simplified substantially if more completely and consistently annotated experimental data and metadata were provided at the time of original upload to public repositories. Uniform standards in NMR and structure file annotation are a high-priority issue for future carbohydrate research. For the current work, we had to complete the annotation ourselves where ambiguities existed, and we only uploaded carbohydrate data that could feasibly be resolved in a trustworthy way.

1, Read in the PDB file and the label file that contains the NMR shift values, then create an atom connection file from the PDB file.

PDB structure files produced by GODESS contain pairwise atom bond information, while Glycosciences PDB files do not. Glycosciences files, however, contain helpful monosaccharide linkage notes at the bottom of the PDB files that are not present in GODESS files. Thus, in GODESS, the atom connections (stored in PDB format) helped match the monosaccharide IDs between the PDB file and the csv file in a more automated way.
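As a minimal sketch of how an atom connection table could be derived, the snippet below reads bond information from a PDB file, assuming the bonds are stored as standard fixed-width `CONECT` records. That assumption, and the function name `parse_conect`, are illustrative and are not taken from the repository code.

```python
from collections import defaultdict


def parse_conect(pdb_text):
    """Return {atom_serial: sorted bonded serials} from CONECT records.

    Assumes standard PDB fixed-width CONECT lines: the source atom serial
    sits in columns 7-11 and up to four bonded serials in columns 12-31.
    """
    bonds = defaultdict(set)
    for line in pdb_text.splitlines():
        if not line.startswith("CONECT"):
            continue
        # Slice the five 5-character serial fields that follow "CONECT".
        fields = [line[i:i + 5].strip() for i in range(6, 31, 5)]
        serials = [int(f) for f in fields if f]
        if not serials:
            continue
        src, partners = serials[0], serials[1:]
        for dst in partners:
            # Store each bond in both directions so lookups are symmetric.
            bonds[src].add(dst)
            bonds[dst].add(src)
    return {serial: sorted(nbrs) for serial, nbrs in sorted(bonds.items())}


example = "CONECT    1    2    5\nCONECT    2    1\n"
connections = parse_conect(example)
```

Storing the bonds symmetrically keeps later steps simple: finding the neighbours of any atom is a single dictionary lookup, regardless of which atom the original record was written for.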

2, Identify the monosaccharide components in the PDB file. Align the monosaccharide IDs between the PDB file and the label file with the help of the atom connections in GODESS or the monosaccharide linkage notes at the bottom of Glycosciences PDB files. Identify non-monosaccharide components with the help of the atom connection information.

2.1 For example, the monosaccharide bond '(1-3)' indicates that the carbon at position 1 is connected to the carbon at position 3 via a dehydration synthesis reaction. Therefore, with the complete atom connection information, we can manually assign monosaccharide IDs to all the atoms in the PDB file.
2.2 GODESS's naming format for terminal C and H atoms, and for doubled C/H atoms, is also internally consistent (e.g. H61, H62). In contrast, Glycosciences data uses various inconsistent schemes: sometimes H61 is written as H6A, or H62 as H6B, or similar variants (e.g. C3a, C31). We created lookup tables to resolve this inconsistent formatting in Glycosciences, but this was less needed for GODESS's data.
2.3 In Glycosciences, the acetyl (Ac) component is usually included within the monosaccharide name (e.g. GlcpNAc). When not specified, Ac is usually a (1-2) linkage, though not always, and a less common Ac linkage is usually written explicitly, for example Glcp4Ac for Ac on C4. GODESS treats Ac as a separate component that is more clearly and consistently labeled (GlcpNAc -> Glcp / N / Ac -> Ac(1-2)GlcpN) in the linear chemical formula, as well as in all shift and structure data files. This consistency enables us to easily use the Ac interaction as an extra atom-level feature when modeling the GODESS data. Other non-monosaccharide components in the PDB file can also be obtained (and verified) using atom connection information. Future work could additionally use the linear chemical formula to aid annotation, especially if NMR shifts are to be predicted for a variety of more complex or unusual non-monosaccharide components, for example for carbohydrates with more types of modifications or amino acid / protein conjugation.
2.4 GODESS did not have missing-shift issues to the extent Glycosciences did, since it is an extensive simulation program rather than purely experimental data.
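Two of the ideas in step 2 can be sketched in a few lines: reading the carbon positions out of a linkage token such as '(1-3)', and normalizing atom names with a lookup table. The helper names and the two lookup entries below are hypothetical illustrations built from the examples in the text. They are not the actual lookup tables used in this project.

```python
import re

# Hypothetical lookup entries based on the H6A/H6B example above;
# a real table would need many more entries.
ATOM_NAME_LOOKUP = {"H6A": "H61", "H6B": "H62"}


def parse_linkage(token):
    """Parse a linkage token like '(1-3)' into a (from_pos, to_pos) tuple."""
    match = re.fullmatch(r"\((\d+)-(\d+)\)", token.strip())
    if match is None:
        raise ValueError(f"not a linkage token: {token!r}")
    return int(match.group(1)), int(match.group(2))


def normalize_atom_name(name, lookup=ATOM_NAME_LOOKUP):
    """Map a variant atom name to the consistent form, if a rule exists."""
    stripped = name.strip()
    # Names without a lookup entry are returned unchanged (minus whitespace).
    return lookup.get(stripped.upper(), stripped)


positions = parse_linkage("(1-3)")
names = [normalize_atom_name(n) for n in ["H6A", " h6b ", "C1"]]
```

Keeping the rules in a plain dictionary makes each new exception a one-line addition that can be reviewed manually.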

3, To verify whether we aligned the monosaccharides correctly between the PDB file (documented with three-letter abbreviations from the simulation software) and the label file (documented with full monosaccharide names), we created a matching table to allow easier manual or semi-automated validation of the matching between NMR and PDB rows. If there are exceptions, we go back to step 2.
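A matching table of this kind could look roughly like the sketch below. The abbreviation-to-name mapping, the residue IDs, and the function name `build_matching_table` are all hypothetical placeholders. The real abbreviations come from the simulation software and the real names from the label files.

```python
def build_matching_table(pdb_residues, label_residues, abbrev_to_name):
    """Build one row per residue ID and flag whether PDB and label agree.

    pdb_residues:   {residue_id: three-letter abbreviation from the PDB file}
    label_residues: {residue_id: full monosaccharide name from the label file}
    abbrev_to_name: mapping used to compare the two naming schemes
    """
    rows = []
    for res_id in sorted(set(pdb_residues) | set(label_residues)):
        abbrev = pdb_residues.get(res_id)
        label_name = label_residues.get(res_id)
        expected = abbrev_to_name.get(abbrev)
        rows.append({
            "residue_id": res_id,
            "pdb_abbrev": abbrev,
            "label_name": label_name,
            # A row only matches when the abbreviation is known and agrees.
            "match": expected is not None and expected == label_name,
        })
    return rows


# Hypothetical placeholder data, for illustration only.
abbrev_to_name = {"GLC": "Glcp", "GAL": "Galp"}
pdb_residues = {1: "GLC", 2: "GAL", 3: "GLC"}
label_residues = {1: "Glcp", 2: "Glcp"}
table = build_matching_table(pdb_residues, label_residues, abbrev_to_name)
exceptions = [row["residue_id"] for row in table if not row["match"]]
```

Rows where `match` is false are the exceptions that would send a file back to step 2 for manual review.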

4, We apply feature engineering to the full monosaccharides, including extracting useful features such as the Fischer projection, bond information, etc.

5, Ac in particular is sometimes annotated inconsistently. For example, within a single file, the atoms of one Ac might be given a separate Ac monosaccharide column label in one monosaccharide unit (despite Ac not being a monosaccharide), while the atoms of a different Ac in the same file are merged into the parent monosaccharide label. Analogously ambiguous situations can exist in the NMR files. Using a list of exceptions to establish whether these or similar inconsistencies exist, we find them semi-automatically and validate them with manual checks, then go back to step 2 and re-run the whole process. These Ac mislabels usually caused obvious NMR shift prediction problems on the carbon bonded to the Ac, so they could also be identified quickly in parallel, for validation, by following the process in the next step.
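One way to flag the mixed-annotation case semi-automatically is sketched below. It assumes the acetyl groups in each file have already been located (for example from the atom connections) and that each is recorded together with the residue label its atoms carry. The input format and the function name are hypothetical.

```python
def find_mixed_ac_files(ac_groups):
    """Return files in which Ac is labeled both separately and merged.

    ac_groups: iterable of (file_name, residue_label) pairs, one per
    acetyl group, where residue_label is the label its atoms carry.
    """
    styles_by_file = {}
    for file_name, residue_label in ac_groups:
        # "separate": the Ac atoms carry their own "Ac" label;
        # "merged": they carry the parent monosaccharide's label.
        style = "separate" if residue_label == "Ac" else "merged"
        styles_by_file.setdefault(file_name, set()).add(style)
    return sorted(f for f, styles in styles_by_file.items() if len(styles) > 1)


# Hypothetical records, for illustration only.
records = [
    ("a.pdb", "Ac"),
    ("a.pdb", "GlcpN"),
    ("b.pdb", "Ac"),
    ("b.pdb", "Ac"),
]
mixed = find_mixed_ac_files(records)
```

Files returned by such a check would still need manual validation before re-running the pipeline from step 2.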

6, Over the course of the project, to find the worst or most common outliers caused by ambiguous annotation more efficiently, we repeatedly ran a simple GNN model on coarsely curated data and checked for outliers ranked by discrepancy. We then examined them manually; if an outlier came from an annotation problem, as was usually the case, we went back to step 2 and re-ran the whole process. Outliers not directly caused by an annotation issue were typically a secondary effect of a related, common primary issue that biased the model in ways relevant to these types of outlier atoms. This iterative process, combined with chemistry knowledge, allowed us to compile exhaustive lists and lookup tables to resolve the annotation ambiguities in the original dataset. Again, we strongly encourage experimental researchers in glycosciences to adopt more uniform annotation standards in the future, to avoid the need for such extensive curation as datasets grow.
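The ranking part of this loop can be sketched as follows, assuming each atom record holds its labeled shift and the shift predicted by the model. The record fields, the absolute-error metric, and the function name are illustrative assumptions. The text does not specify the metric the project actually used.

```python
def rank_outliers(records, top_k=10):
    """Return the top_k (atom_id, abs_error) pairs, largest error first.

    records: iterable of dicts with 'atom_id', 'observed' and 'predicted'
    shift values (ppm).
    """
    scored = [
        (abs(r["predicted"] - r["observed"]), r["atom_id"]) for r in records
    ]
    # Sort on the error only, so ties keep their original order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(atom_id, error) for error, atom_id in scored[:top_k]]


# Hypothetical shift values, for illustration only.
records = [
    {"atom_id": "a", "observed": 100.0, "predicted": 101.5},
    {"atom_id": "b", "observed": 70.0, "predicted": 70.25},
    {"atom_id": "c", "observed": 60.0, "predicted": 64.0},
]
worst = rank_outliers(records, top_k=2)
```

Reviewing only the top of the ranked list each round keeps the manual examination focused on the atoms most likely to reveal an annotation problem.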

7, We identified some rarely occurring monosaccharide units (e.g. appearing only once or twice) and dropped them due to insufficient statistics. These dropped glycans were not included in our dataset size count.
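A minimal sketch of this filtering step is shown below. It assumes each glycan is represented as a list of monosaccharide names, that occurrences are counted across the whole dataset, and that a threshold of three occurrences separates rare from common units. The representation, the counting rule, and the threshold are assumptions for illustration.

```python
from collections import Counter


def drop_rare_glycans(glycans, min_count=3):
    """Drop glycans containing a monosaccharide seen fewer than min_count times.

    glycans: {glycan_id: list of monosaccharide names}
    Returns (kept glycans, sorted list of rare monosaccharide names).
    """
    counts = Counter(m for monos in glycans.values() for m in monos)
    rare = {name for name, count in counts.items() if count < min_count}
    kept = {
        glycan_id: monos
        for glycan_id, monos in glycans.items()
        if not rare.intersection(monos)
    }
    return kept, sorted(rare)


# Hypothetical glycans, for illustration only.
glycans = {
    "g1": ["Glcp", "Galp"],
    "g2": ["Glcp", "Manp"],
    "g3": ["Glcp", "Galp"],
    "g4": ["Galp"],
}
kept, rare = drop_rare_glycans(glycans)
```

Returning the rare names alongside the kept glycans makes it easy to report what was excluded from the dataset size count.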

Preliminary results of 2D GNN.

This was created to visualize the overall performance of the 2D GNN on the GODESS dataset.

Folders and files

.idea
D2O
Process_doc
WrongList_DuplicatesPossible
__pycache__
dataset
figures
logo
matlab_match_pdb
model_zoo
visulize
GODDESS.py
GODESSscrape.py
README.md
Second_round_pipeline.ipynb
Step_1_Alignment_pdb_residual.ipynb
Step_2_Assign_shift_and_enrich_pdb.ipynb
Step_3_Check_exceptions.ipynb
Step_4_Reformulate_monosaccharide.ipynb
Step_5_Check_labeled_pdbs.ipynb
Step_6_replace_hyb_create_atom_shift_interaction_features.ipynb
Step_7_post_process_data_after_interaction feature.ipynb
Visualize_results.ipynb
Visulize_all_result.ipynb
count_mono_atom_labeled.ipynb
create_adjaency_matrix_from_labeled_pdb.py
create_graph_data.py
create_graph_data_experiment.py
create_graph_data_noablation.py
first_round_pipeline.ipynb
new_create_graph.ipynb
node_embeddings.py
node_embeddings_godess.py
preprocess_data_get_matched_shift.ipynb
step1_copy_glycans.py
step2_process_glycans_pdb_csv.py
step3_simple_match_pdb_csv.py
testing_complete_label_v4.png
train_evaluate.py
