Isfrag
Pasting an old recipe from the Zettelkasten
Linked to 202002251433 (Private) in silico fragmentation workflow.
Child workflow on the beast MAPP
Side note: try to write everything as scripts so that they can be launched without a GUI (e.g. on X2Go or directly on Baobab).
Tutorial for NPatlas in silico fragmentation data treatment
Prior to fragmentation
Prepare a space-delimited input file
The complete metadata file is converted to a space-separated list of unique IDs and SMILES (for CFM-ID input).
python frag_list_preparator.py ../npatlas_data/np_atlas_2019_12.tsv ../npatlas_data/np_atlas_2019_12_for_frag.tsv NPAID SMILES
python frag_list_preparator.py ../open_np_db_data/open_NP_db.tsv ../open_np_db_data/open_NP_db_tofrag.txt shortinchikey shortinchikey smiles
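The frag_list_preparator.py script itself is not reproduced here; a minimal sketch of what such a conversion does, assuming a tab-separated metadata file with a header row (the column names come from the commands above):

```python
import csv

def prepare_frag_list(in_path, out_path, id_col, smiles_col):
    """Convert a tab-separated metadata file to a space-separated
    'ID SMILES' list suitable as cfm-predict input."""
    with open(in_path, newline="") as fin, open(out_path, "w") as fout:
        reader = csv.DictReader(fin, delimiter="\t")
        for row in reader:
            fout.write(f"{row[id_col]} {row[smiles_col]}\n")
```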
Split the file
(Works on Linux-based shells.)
split -a 5 -l 500 -d ../opennp_db_data/open_NP_db_tofrag.txt ../open_np_db_data/opennpdb_tofrag/opennpdb
split -a 5 -l 500 -d ./lotusdata/lotus_data_for_frag.txt ./lotus_data/lotus_data_for_frag/lotus_data
split -a 5 -l 500 -d ./ ./lotusdata/lotus_data_for_frag/lotus_data
split -a 5 -l 500 -d cfm/cfminput/platinum_tofrag.tsv cfm/cfm_input/splitted/lotus_to_frag
This one preserves the extension and is based on the desired number of chunks, without splitting lines:
split -a 5 -n l/199 -d --additional-suffix=.txt cfminput/sub_platinum_tofrag.tsv cfm_input/splitted/lotus_to_frag
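For reference, the fixed-line-count splitting done by `split -l 500 -d --additional-suffix=.txt` can be sketched in Python (the zero-padded numbering mimics `-a 5 -d`):

```python
def split_lines(path, chunk_size, prefix, suffix=".txt"):
    """Split a text file into numbered chunks of at most `chunk_size`
    lines each, mimicking `split -a 5 -l N -d --additional-suffix`."""
    with open(path) as fin:
        lines = fin.readlines()
    written = []
    for i in range(0, len(lines), chunk_size):
        name = f"{prefix}{i // chunk_size:05d}{suffix}"
        with open(name, "w") as fout:
            fout.writelines(lines[i:i + chunk_size])
        written.append(name)
    return written
```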
When using the Docker CLI, files need to have an extension (otherwise they are treated as folders?):
for f in *; do mv "$f" "$f.txt"; done
If you made a mistake:
find . -type f -name '*.txt' -exec sh -c 'mv -- "$0" "${0%.txt}"' {} \;
Prepare multiple bash files to launch on Baobab
(Strangely enough, the .sh incrementer script returns an error on Linux but ran OK on the macOS command line.)
Note: apparently sbatch has been replaced by srun on the Baobab server side. Be sure to update the scripts accordingly.
Fetching cfm-predict results
Download the cfm-predict fragmentation results from the Baobab server using the rsync command:
rsync -rvz -e 'ssh' --progress allardp@baobab2.unige.ch:/home/allardp/CFM_results/npatlas ./results
rsync -rvz -e 'ssh' --progress allardp@baobab2.unige.ch:/home/allardp/CFM_results/coconut/ .
find ./ -type f -name '*.mgf' | wc -l
This counts all files in a folder. Here, 25090 files for NPatlas.
COCONUT: 384222 .log files, 384150 .mgf files.
We may need to move all files from subfolders recursively up into the top-level folder:
find ./bacasable -type f -print0 | xargs -0 mv -t ./bacasable
find ./results_coconut -type f -print0 | xargs -0 mv -t ./results_coconut
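The same flattening can be sketched in Python, assuming that moving a file onto an existing name should simply overwrite it (as `mv` does):

```python
import os
import shutil

def flatten(root):
    """Move every file found below `root` directly into `root`,
    mirroring `find root -type f -print0 | xargs -0 mv -t root`."""
    for dirpath, _dirnames, filenames in os.walk(root):
        if os.path.abspath(dirpath) == os.path.abspath(root):
            continue  # files already at the top level stay put
        for name in filenames:
            shutil.move(os.path.join(dirpath, name),
                        os.path.join(root, name))
```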
Pruning the raw log files
The output of cfm-predict consists of .log files containing mass spectra, where each fragment is individually labelled and possibly linked to a substructure. Such information might be useful later, but for now we only want to keep the raw MS data.
(A help function still needs to be defined here.)
python raw_log_treater_npatlas.py ../npatlas_data/results_npatlas/npatlas/ .log
python raw_log_treater.py ../coconut_data/results_coconut/ .log
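The raw_log_treater scripts are not shown here; a minimal sketch of the pruning step, assuming we keep only the first two numeric fields of each peak line (m/z and intensity) and drop headers and fragment annotations:

```python
def prune_spectrum(lines):
    """Keep only 'm/z intensity' pairs from cfm-predict .log lines,
    dropping headers, fragment annotations and metadata lines."""
    kept = []
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # headers such as 'energy0' or comment lines
        try:
            float(fields[0])
            float(fields[1])
        except ValueError:
            continue  # non-numeric lines are metadata, not peaks
        kept.append(f"{fields[0]} {fields[1]}")
    return kept
```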
At this step, the .log files should be pruned and contain only numeric values (m/z and intensities).
For COCONUT: Parsing directory ../coconut_data/results_coconut/ with file extension .log
A number of files raised a 'mass' error during parsing (e.g. ../coconut_data/results_coconut/CNP0402147.log, among ~70 others). Treated 384150 files, skipped 72.
Populating the mgf headers
Preparation of the adducted metadata table
We need to prepare an adducted dataframe containing the protonated and deprotonated masses.
This script requires RDKit, so we build an environment.yml file from a dedicated conda env:
conda env export -n conda-env -f /path/to/environment.yml
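The adduct computation itself reduces to adding or subtracting a proton mass; a minimal sketch, assuming the neutral exact (monoisotopic) mass has already been obtained (in practice via RDKit, e.g. Descriptors.ExactMolWt on the parsed SMILES):

```python
PROTON_MASS = 1.007276  # proton mass in Da

def adduct_masses(exact_mass):
    """Return the ([M+H]+, [M-H]-) m/z values for a neutral exact mass."""
    return exact_mass + PROTON_MASS, exact_mass - PROTON_MASS
```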
You may need to fetch the file from the internet first (use wget):
wget 'https://zenodo.org/record/3547718/files/COCONUT4MetFrag.csv?download=1'
Since SMILES or InChI can sometimes yield errors, and since there is an MF field, the exact mass could in principle be calculated directly from the MF as described here: https://bioinformatics.stackexchange.com/a/9273. Nope, actually not working: the GetMass() function yields a molecular weight, not an exact mass.
The table_adducter_script.py script is adapted to cope with different delimiters. Beware: the $ prefix is mandatory to input the tab delimiter ($'\t').
python table_adducter_npatlas_script.py ../npatlas_data/np_atlas_2019_12.tsv ../npatlas_data/np_atlas_2019_12_adducted.tsv
python table_adducter_script.py ../coconut_data/COCONUT4MetFrag.csv ',' ../coconut_data/COCONUT4MetFrag_adducted.csv $'\t'
Addition of the metadata to the individual mgf headers
We can now populate each raw .mgf file with its corresponding metadata. For this we use the treat_npatlas.py (or mgf_header_populater.py) script:
python treat_npatlas.py ../npatlas_data/np_atlas_2019_12_adducted.tsv ../npatlas_data/results_npatlas/npatlas/
python mgf_header_populater.py ../coconut_data/COCONUT4MetFrag_adducted.csv ../coconut_data/results_coconut/ Identifier
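The populater scripts are not shown here; a minimal sketch of assembling one MGF entry from the pruned peaks plus metadata. The header field set (SCANS, PEPMASS, CHARGE, SMILES) is illustrative, not necessarily what the actual scripts emit:

```python
def build_mgf_entry(identifier, pepmass, smiles, peaks):
    """Assemble one MGF block from metadata and (m/z, intensity) pairs."""
    lines = [
        "BEGIN IONS",
        f"SCANS={identifier}",
        f"PEPMASS={pepmass}",
        "CHARGE=1+",
        f"SMILES={smiles}",
    ]
    lines += [f"{mz} {inten}" for mz, inten in peaks]
    lines.append("END IONS")
    return "\n".join(lines) + "\n"
```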
On COCONUT: Treated 384150 files, skipped 28161.
Generating the final spectral file
We concatenate all documented .mgf files into a single spectral .mgf file.
find ./ -type f -name '*.mgf' | while read F; do cat "${F}" >> ../../npatlas_ISDB_pos.mgf; done
find ./ -type f -name '*.mgf' | while read F; do cat "${F}" >> ../../coconut_ISDB_pos.mgf; done
Outputting non-fragmented entries
For several reasons (charged compounds, some tautomers, structures too heavy to be fragmented in a reasonable amount of time) some entries might not have been fragmented.
To find them, we first list all correctly converted .mgf files:
find ./ -type f -name '*.mgf' | sed 's!^\./!!' > list_mgf.txt
Here possibly without the extension:
find ./ -type f -name '*.mgf' | sed 's!^\./!!' | sed 's!\.mgf$!!' > ../../list_mgf.txt
The list is then compared to the initial input using the table_comparator.py script:
python table_comparator.py ../npatlas_data/npatlas_for_frag.txt ../npatlas_data/list_mgf.txt ../npatlas_data/unfragged_list.txt
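table_comparator.py is not shown here; the comparison boils down to a set difference between the input IDs and the generated .mgf basenames, sketched as:

```python
def unfragmented_ids(input_ids, mgf_ids):
    """Return the IDs present in the fragmentation input but missing
    from the generated .mgf list, preserving input order."""
    done = set(mgf_ids)
    return [i for i in input_ids if i not in done]
```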