TransWikia.com

Help Adjusting Bedtools Output in R

Bioinformatics Asked on December 23, 2020

I have been asked to write a script in R to edit, using data.frame, this .fa file that is the result from extracting gene sequences from a repeat masker coordinate file on the genome I am working with. Each input looks as such:

>rnd-5_family-5445_Unspecified(-)
TTTCACCGTAAATTAACTTTGAGAGGAGCTAATTCCTAAAAGAATTATACCGGCCTATTTG
>rnd-5_family-2614_Unspecified(-)
AATCTGAGATTAGATTTATTTATCTGTTATCCGCTCAGGTTGAAAAGTTTGCGCAATGTATAACTATA
AAAATCTTTCTACCACTTTTGATTCATTTTTTAAATCTGGGGTCACATTTTATTCACGGTAGAAATTG
TAATTTACAAAATGAATTACTTGAAGGCAACACGAATCCAGAGTGATGCTTTACATAAATCTGCTTCT
ACCGATGCCAAAAATTGACGATATTCTATTATTTAATCTAAATGTTAGTCTTTACATACCCTCCCCTA
ATTGTTAGAATTTTATGAAATTTGATTTCAGGGGTCAGTTTAGCATGCTAAATCTAATTCAATAGATT
GATATTTTTCTTCAGGTAAAGAAAATTTTTGCGTCAAAGTAATCATATTCCTCCACGATTGCATATAA
CTATGGTATATAATTTAAAAGATTACACTTTACGTAATGAAAAATCGGCCAATCATTCAAAAGTTATG
AAAGTGATC 
>rnd-5_family-2614_Unspecified(-)
TTTCTAGTTTGTATTTGAGGAAAGAAAATCCTACTATCACTTTTTAAAAAATTCACCAATCTGAGATT
AGATTTATTTATCTGTTATCCGCTCAGGTTGAAAAGTTTGCGCAATGTATAACTATAAAAATCTTTCT
ACCACTTTTGATTCATTTTTTAAATCTGGGGTCACATTTTATTCACGGTAGAAATTGTAATTTACAAA
ATGAATTACTT 
>rnd-4_family-798_Unspecified(-)
ATGCTTAATTTGCTAAGCATTTAATAACCTTTTTGGGTTTTGTGATAAAGGATGCTGTAGACATTAAA
ATAAACCTTATACTGCTAT

This is the repeat masker coordinates that the above were derived from:

>KB824701.1 417 478 rnd-5_family-5445_Unspecified . -
>KB824701.1 587 1072 rnd-5_family-2614_Unspecified . -
>KB824701.1 914 1129 rnd-5_family-2614_Unspecified . -
>KB824701.1 1138 1225 rnd-4_family-798_Unspecified . -

and ideally I would like it to be in the format of this, for example using the first one:

>rnd-5_family-5445_KB824701.1_417_478

I was told to use data.frame in R but I am not sure exactly how this will work? Can someone help please

One Answer

In case an answer would still be relevant after a year and a half, below is a solution with the Biostrings package.

library(Biostrings)

# loading data.table for the fread() function.
# fread() is not only fast but is quite good at guessing how
# the file is formatted
library(data.table)

# readDNAStringSet() function creates a "DNAStringSet" object
my_fasta <- readDNAStringSet("fasta.fa")

> my_fasta
  A DNAStringSet instance of length 4
    width seq                                                                names               
[1]    61 TTTCACCGTAAATTAACTTTGAGAGGAGCTAATTCCTAAAAGAATTATACCGGCCTATTTG      rnd-5_family-5445...
[2]   485 AATCTGAGATTAGATTTATTTATCTGTTATCC...GGCCAATCATTCAAAAGTTATGAAAGTGATC rnd-5_family-2614...
[3]   215 TTTCTAGTTTGTATTTGAGGAAAGAAAATCCT...TAGAAATTGTAATTTACAAAATGAATTACTT rnd-5_family-2614...
[4]    87 ATGCTTAATTTGCTAAGCATTTAATAACCTTT...GTAGACATTAAAATAAACCTTATACTGCTAT rnd-4_family-798_...

# the fasta header is stored at the @NAMES slot of the @ranges slot of the
# DNAStringSet object created above
> my_fasta@ranges@NAMES
[1] "rnd-5_family-5445_Unspecified(-)" "rnd-5_family-2614_Unspecified(-)"
[3] "rnd-5_family-2614_Unspecified(-)" "rnd-4_family-798_Unspecified(-)"

# removing the trailing "_Unpecified..."
my_fasta@ranges@NAMES <- sub("_Unspecified.*",
                              "",
                              my_fasta@ranges@NAMES)

my_coord <- fread("repeat_masker.txt")
> my_coord
           V1   V2   V3                            V4 V5 V6
1: KB824701.1  417  478 rnd-5_family-5445_Unspecified  .  -
2: KB824701.1  587 1072 rnd-5_family-2614_Unspecified  .  -
3: KB824701.1  914 1129 rnd-5_family-2614_Unspecified  .  -
4: KB824701.1 1138 1225  rnd-4_family-798_Unspecified  .  -
#removing the ">" sign

my_coord$V1 <- sub(">",
                   "",
                   my_coord$V1)

# appending first 3 columns of the coordinate file to "NAMES" as requested by the OP
my_fasta@ranges@NAMES <- paste(my_fasta@ranges@NAMES,
                               my_coord$V1,
                               my_coord$V2,
                               my_coord$V3,
                               sep = "_")

# writing the modified DNAStringSet object into a fasta file
writeXStringSet(my_fasta, "edited_fasta.fa")

>rnd-5_family-5445_KB824701.1_417_478
TTTCACCGTAAATTAACTTTGAGAGGAGCTAATTCCTAAAAGAATTATACCGGCCTATTTG
>rnd-5_family-2614_KB824701.1_587_1072
AATCTGAGATTAGATTTATTTATCTGTTATCCGCTCAGGTTGAAAAGTTTGCGCAATGTATAACTATAAAAATCTTTCTA
CCACTTTTGATTCATTTTTTAAATCTGGGGTCACATTTTATTCACGGTAGAAATTGTAATTTACAAAATGAATTACTTGA
AGGCAACACGAATCCAGAGTGATGCTTTACATAAATCTGCTTCTACCGATGCCAAAAATTGACGATATTCTATTATTTAA
TCTAAATGTTAGTCTTTACATACCCTCCCCTAATTGTTAGAATTTTATGAAATTTGATTTCAGGGGTCAGTTTAGCATGC
TAAATCTAATTCAATAGATTGATATTTTTCTTCAGGTAAAGAAAATTTTTGCGTCAAAGTAATCATATTCCTCCACGATT
GCATATAACTATGGTATATAATTTAAAAGATTACACTTTACGTAATGAAAAATCGGCCAATCATTCAAAAGTTATGAAAG
TGATC
>rnd-5_family-2614_KB824701.1_914_1129
TTTCTAGTTTGTATTTGAGGAAAGAAAATCCTACTATCACTTTTTAAAAAATTCACCAATCTGAGATTAGATTTATTTAT
CTGTTATCCGCTCAGGTTGAAAAGTTTGCGCAATGTATAACTATAAAAATCTTTCTACCACTTTTGATTCATTTTTTAAA
TCTGGGGTCACATTTTATTCACGGTAGAAATTGTAATTTACAAAATGAATTACTT
>rnd-4_family-798_KB824701.1_1138_1225
ATGCTTAATTTGCTAAGCATTTAATAACCTTTTTGGGTTTTGTGATAAAGGATGCTGTAGACATTAAAATAAACCTTATA
CTGCTAT

Answered by haci on December 23, 2020

Add your own answers!

Ask a Question

Get help from others!

© 2024 TransWikia.com. All rights reserved. Sites we Love: PCI Database, UKBizDB, Menu Kuliner, Sharing RPP