Bioinformatics Asked on December 23, 2020
I have been asked to write a script in R to edit, using data.frame, this .fa file that is the result from extracting gene sequences from a repeat masker coordinate file on the genome I am working with. Each input looks as such:
>rnd-5_family-5445_Unspecified(-)
TTTCACCGTAAATTAACTTTGAGAGGAGCTAATTCCTAAAAGAATTATACCGGCCTATTTG
>rnd-5_family-2614_Unspecified(-)
AATCTGAGATTAGATTTATTTATCTGTTATCCGCTCAGGTTGAAAAGTTTGCGCAATGTATAACTATA
AAAATCTTTCTACCACTTTTGATTCATTTTTTAAATCTGGGGTCACATTTTATTCACGGTAGAAATTG
TAATTTACAAAATGAATTACTTGAAGGCAACACGAATCCAGAGTGATGCTTTACATAAATCTGCTTCT
ACCGATGCCAAAAATTGACGATATTCTATTATTTAATCTAAATGTTAGTCTTTACATACCCTCCCCTA
ATTGTTAGAATTTTATGAAATTTGATTTCAGGGGTCAGTTTAGCATGCTAAATCTAATTCAATAGATT
GATATTTTTCTTCAGGTAAAGAAAATTTTTGCGTCAAAGTAATCATATTCCTCCACGATTGCATATAA
CTATGGTATATAATTTAAAAGATTACACTTTACGTAATGAAAAATCGGCCAATCATTCAAAAGTTATG
AAAGTGATC
>rnd-5_family-2614_Unspecified(-)
TTTCTAGTTTGTATTTGAGGAAAGAAAATCCTACTATCACTTTTTAAAAAATTCACCAATCTGAGATT
AGATTTATTTATCTGTTATCCGCTCAGGTTGAAAAGTTTGCGCAATGTATAACTATAAAAATCTTTCT
ACCACTTTTGATTCATTTTTTAAATCTGGGGTCACATTTTATTCACGGTAGAAATTGTAATTTACAAA
ATGAATTACTT
>rnd-4_family-798_Unspecified(-)
ATGCTTAATTTGCTAAGCATTTAATAACCTTTTTGGGTTTTGTGATAAAGGATGCTGTAGACATTAAA
ATAAACCTTATACTGCTAT
This is the repeat masker coordinates that the above were derived from:
>KB824701.1 417 478 rnd-5_family-5445_Unspecified . -
>KB824701.1 587 1072 rnd-5_family-2614_Unspecified . -
>KB824701.1 914 1129 rnd-5_family-2614_Unspecified . -
>KB824701.1 1138 1225 rnd-4_family-798_Unspecified . -
and ideally I would like it to be in the format of this, for example using the first one:
>rnd-5_family-5445_KB824701.1_417_478
I was told to use data.frame in R but I am not sure exactly how this will work? Can someone help please
In case an answer would still be relevant after a year and a half, below is a solution with the Biostrings
package.
library(Biostrings)
# loading data.table for the fread() function.
# fread() is not only fast but is quite good at guessing how
# the file is formatted
library(data.table)
# readDNAStringSet() function creates a "DNAStringSet" object
my_fasta <- readDNAStringSet("fasta.fa")
> my_fasta
A DNAStringSet instance of length 4
width seq names
[1] 61 TTTCACCGTAAATTAACTTTGAGAGGAGCTAATTCCTAAAAGAATTATACCGGCCTATTTG rnd-5_family-5445...
[2] 485 AATCTGAGATTAGATTTATTTATCTGTTATCC...GGCCAATCATTCAAAAGTTATGAAAGTGATC rnd-5_family-2614...
[3] 215 TTTCTAGTTTGTATTTGAGGAAAGAAAATCCT...TAGAAATTGTAATTTACAAAATGAATTACTT rnd-5_family-2614...
[4] 87 ATGCTTAATTTGCTAAGCATTTAATAACCTTT...GTAGACATTAAAATAAACCTTATACTGCTAT rnd-4_family-798_...
# the fasta header is stored at the @NAMES slot of the @ranges slot of the
# DNAStringSet object created above
> my_fasta@ranges@NAMES
[1] "rnd-5_family-5445_Unspecified(-)" "rnd-5_family-2614_Unspecified(-)"
[3] "rnd-5_family-2614_Unspecified(-)" "rnd-4_family-798_Unspecified(-)"
# removing the trailing "_Unpecified..."
my_fasta@ranges@NAMES <- sub("_Unspecified.*",
"",
my_fasta@ranges@NAMES)
my_coord <- fread("repeat_masker.txt")
> my_coord
V1 V2 V3 V4 V5 V6
1: KB824701.1 417 478 rnd-5_family-5445_Unspecified . -
2: KB824701.1 587 1072 rnd-5_family-2614_Unspecified . -
3: KB824701.1 914 1129 rnd-5_family-2614_Unspecified . -
4: KB824701.1 1138 1225 rnd-4_family-798_Unspecified . -
#removing the ">" sign
my_coord$V1 <- sub(">",
"",
my_coord$V1)
# appending first 3 columns of the coordinate file to "NAMES" as requested by the OP
my_fasta@ranges@NAMES <- paste(my_fasta@ranges@NAMES,
my_coord$V1,
my_coord$V2,
my_coord$V3,
sep = "_")
# writing the modified DNAStringSet object into a fasta file
writeXStringSet(my_fasta, "edited_fasta.fa")
>rnd-5_family-5445_KB824701.1_417_478
TTTCACCGTAAATTAACTTTGAGAGGAGCTAATTCCTAAAAGAATTATACCGGCCTATTTG
>rnd-5_family-2614_KB824701.1_587_1072
AATCTGAGATTAGATTTATTTATCTGTTATCCGCTCAGGTTGAAAAGTTTGCGCAATGTATAACTATAAAAATCTTTCTA
CCACTTTTGATTCATTTTTTAAATCTGGGGTCACATTTTATTCACGGTAGAAATTGTAATTTACAAAATGAATTACTTGA
AGGCAACACGAATCCAGAGTGATGCTTTACATAAATCTGCTTCTACCGATGCCAAAAATTGACGATATTCTATTATTTAA
TCTAAATGTTAGTCTTTACATACCCTCCCCTAATTGTTAGAATTTTATGAAATTTGATTTCAGGGGTCAGTTTAGCATGC
TAAATCTAATTCAATAGATTGATATTTTTCTTCAGGTAAAGAAAATTTTTGCGTCAAAGTAATCATATTCCTCCACGATT
GCATATAACTATGGTATATAATTTAAAAGATTACACTTTACGTAATGAAAAATCGGCCAATCATTCAAAAGTTATGAAAG
TGATC
>rnd-5_family-2614_KB824701.1_914_1129
TTTCTAGTTTGTATTTGAGGAAAGAAAATCCTACTATCACTTTTTAAAAAATTCACCAATCTGAGATTAGATTTATTTAT
CTGTTATCCGCTCAGGTTGAAAAGTTTGCGCAATGTATAACTATAAAAATCTTTCTACCACTTTTGATTCATTTTTTAAA
TCTGGGGTCACATTTTATTCACGGTAGAAATTGTAATTTACAAAATGAATTACTT
>rnd-4_family-798_KB824701.1_1138_1225
ATGCTTAATTTGCTAAGCATTTAATAACCTTTTTGGGTTTTGTGATAAAGGATGCTGTAGACATTAAAATAAACCTTATA
CTGCTAT
Answered by haci on December 23, 2020
Get help from others!
Recent Answers
Recent Questions
© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP