
Retain only part of a file name and fasta header in fasta directory

Bioinformatics Asked by Malia W on June 3, 2021

I have a MainDirectory/hundreds_of_subdirectories/thousands_of_fastas structure. For each fasta in the bottom level, I am trying to change the file name as well as the first line in the file to retain only some parts of the file.

So that the following text, which makes up both the file name and the first line of the file:

947-length-1150-cov-1000|contig:JAECWU010000213.1|slice:817050-818200|uce:uce-1452|match:817550-817700|orient:{'+'}|probes:4.unaligned.fasta

becomes just:

uce-1452.unaligned.fasta

For editing the first line in the file, I have tried

sed '1 s/^[^|uce-]*\(|uce-[0-9]\).*/\1/' hundreds_of_subdirectories/*

but this isn’t the ticket..

Then messing around with some kind of loop…

for i in **.unaligned.fasta
do
  sed -E 's/^[^|uce-]*(|uce-[0-9]).*/\1/'
done

For the file name, it was suggested that I use rename instead of sed, but with sed I have tried something like:

for x in hundreds_of_subdirectories/thousands_of_fastas*unaligned.fasta
do
  echo $x | sed -r 's/^[^|uce-]*(|uce-[0-9]).*/mv' -v "" "|uce-[0-9].*/1/.unaligned.fasta"/
done

Am I even barking up the right tree? Thanks!

2 Answers

I think you're working too hard to capture the uce: expression.

Try

sed -Ee '1s/^.+uce:(uce-[0-9]+).+/\1.unaligned.fasta/' file_glob

to replace the first line content. You may want to add the in-place flag after testing the expression on a few files.
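As a quick check before touching real data, the expression can be tried on a throwaway file. This is only a sketch: the sample header is the name from the question with a leading > assumed, and the sequence line is made up.

```shell
#!/usr/bin/env bash
# Throwaway sample file: an assumed header line plus one made-up sequence line.
tmp=$(mktemp)
printf '%s\n' \
  ">947-length-1150-cov-1000|contig:JAECWU010000213.1|slice:817050-818200|uce:uce-1452|match:817550-817700|orient:{'+'}|probes:4.unaligned.fasta" \
  "ACGTACGT" > "$tmp"

# Without an in-place flag sed only prints the result; the file is unchanged.
# The leading "1" restricts the substitution to the first line.
preview=$(sed -E '1s/^.+uce:(uce-[0-9]+).+/\1.unaligned.fasta/' "$tmp")
printf '%s\n' "$preview"

rm -f -- "$tmp"
```

Once the preview looks right, GNU sed takes -i for in-place editing (BSD/macOS sed needs -i ''), and -i.bak keeps a backup copy on both.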

For the mv, use command substitution, which runs the command in a sub-shell and uses the output value as the parameter:

for f in file_glob
do    
    mv -- "$f" "$(echo "$f" | sed '...')"
done
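One way the '...' placeholder might be filled in is sketched below. The helper name new_path and the sed expression are assumptions for illustration, not the expression the answer had in mind. The sketch applies sed to the base name only, so the file stays in its own subdirectory.

```shell
#!/usr/bin/env bash
# new_path is a hypothetical helper: it prints the renamed path for one file.
new_path() {
  local dir base
  dir=$(dirname -- "$1")
  base=$(basename -- "$1")
  # Rewrite only the base name so the directory part is preserved.
  printf '%s/%s\n' "$dir" \
    "$(printf '%s\n' "$base" | sed -E 's/^.*uce:(uce-[0-9]+).*(\.unaligned\.fasta)$/\1\2/')"
}

new_path "taxonA/947-length-1150-cov-1000|contig:JAECWU010000213.1|slice:817050-818200|uce:uce-1452|match:817550-817700|orient:{'+'}|probes:4.unaligned.fasta"
```

Inside the loop this would be used as mv -- "$f" "$(new_path "$f")". The directory name taxonA is made up.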

Answered by Ram RS on June 3, 2021

If you are using bash, you can enable the globstar option (shopt -s globstar) which allows ** to match recursively. That way, **/*fasta will find all files whose name ends with fasta in this directory and all of its subdirectories. So that will let you easily iterate over all the target names. Next, if you have the perl rename command (called rename on Debian and Ubuntu and similar systems, prename or perl-rename on others), you can do it quite easily.
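To see what globstar changes, the following sketch counts what **/*fasta expands to in a scratch directory, first with the option off and then with it on. The helper count_fastas and the file names are made up for the demonstration.

```shell
#!/usr/bin/env bash
# count_fastas is a hypothetical helper: it prints how many paths **/*fasta
# expands to under the directory given as $1 (it runs in a subshell).
count_fastas() (
  cd -- "$1" || exit 1
  shopt -s nullglob            # an unmatched pattern expands to nothing
  files=(**/*fasta)
  echo "${#files[@]}"
)

scratch=$(mktemp -d)           # throwaway tree, not real data
mkdir -p "$scratch/a/b"
touch "$scratch/top.fasta" "$scratch/a/one.fasta" "$scratch/a/b/two.fasta"

shopt -u globstar
without=$(count_fastas "$scratch")   # ** behaves like *: exactly one level down
shopt -s globstar
with=$(count_fastas "$scratch")      # ** now matches zero or more directories
echo "without globstar: $without, with globstar: $with"

rm -rf -- "$scratch"
```

With the option off only a/one.fasta matches; with it on, all three files do.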

Note that perl-rename has the -n option that makes it print out the renaming it would do without actually renaming anything. So I suggest you first run this to test what you will be doing:

shopt -s globstar
for file in **/*.fasta; do
   echo perl -i.bak -pe 's/^\s*>.*uce:(uce-\d+).*(\.unaligned\.fasta)/$1$2/' "$file"
   rename -n 's/[^\/]*uce:(uce-\d+).*(\.unaligned\.fasta)$/$1$2/' "$file"
done

That will echo the perl command, so you can try it manually first and make sure it works, and will print out the rename operations without actually doing anything. If you are satisfied it will work, run again without the echo or -n:

for file in **/*fasta; do
    perl -i.bak -pe 's/^\s*>.*uce:(uce-\d+).*(\.unaligned\.fasta)/$1$2/' "$file"
    rename 's/[^\/]*uce:(uce-\d+).*(\.unaligned\.fasta)$/$1$2/' "$file"
done

Or, if you don't have perl-rename, you can use the shell's string manipulation for this:

for file in **/*.fasta; do
    perl -i.bak -pe 's/^\s*>.*uce:(uce-\d+).*(\.unaligned\.fasta)/$1$2/' "$file"
    tmpFileName="${file#*uce:}"
    # Keep the file in its own directory; only the base name changes.
    mv -- "$file" "$(dirname -- "$file")/${tmpFileName/|*/.unaligned.fasta}"
done
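The mv variant has no equivalent of rename -n, but the same kind of dry run can be had by printing the old and new paths instead of moving anything. A minimal sketch, in which preview_renames is a hypothetical helper name and the glob is narrowed to names containing uce: as an assumption:

```shell
#!/usr/bin/env bash
shopt -s globstar

# preview_renames is a hypothetical helper: it prints "old -> new" for every
# matching file under the current directory and renames nothing.
preview_renames() {
  local file tmpFileName
  for file in **/*uce:*.unaligned.fasta; do
    [ -f "$file" ] || continue        # skip an unmatched glob or a directory
    tmpFileName="${file#*uce:}"
    printf '%s -> %s\n' "$file" \
      "$(dirname -- "$file")/${tmpFileName/|*/.unaligned.fasta}"
  done
}
```

Running preview_renames from MainDirectory and scanning the right-hand column for duplicates is a cheap way to spot two files that would end up with the same name in one directory.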

Explanation

  • shopt -s globstar: this enables the globstar bash option for the current shell session. It only needs to be run once; its effects last until you close the terminal or start a new shell session. As for what it does, see man bash:

    globstar

    If set, the pattern ** used in a pathname expansion context will match all files and zero or more directories and subdirectories. If the pattern is followed by a /, only directories and subdirectories match.

  • for file in **/*fasta; do ... ; done: iterate over all files (and directories) whose name ends in .fasta, saving each file name as $file.

  • perl -i.bak -pe 's/^\s*>.*uce:(uce-\d+).*(\.unaligned\.fasta)/$1$2/' "$file": this changes the fasta header. The -i.bak option tells perl to edit the file in place, but to save the original file with a .bak extension. This can help you revert any changes if there is a mistake. Once you have finished, you could run rm **/*fasta.bak to delete the backups. Alternatively, use plain -i (without the .bak) if you don't want to keep backups.

    The -p option tells perl to process its input line by line and print every line after applying whatever script you give it with -e.

    The s/old/new/ is the substitution operator, same as sed. The regular expression looks for 0 or more whitespace characters and then a > at the beginning of the line (^\s*>). It then allows for 0 or more characters (.*) until it finds the string uce: followed by uce- and one or more digits (uce:(uce-\d+)). The parentheses "capture" the pattern, storing it as $1. Next, we want 0 or more characters again until we reach .unaligned.fasta (.*(\.unaligned\.fasta)). We capture the .unaligned.fasta as $2 (second set of parentheses) only to avoid typing it out again. Note that this assumes you will never have anything after the .unaligned.fasta. Finally, we replace everything matched above with just $1 (the uce-numbers) and $2 (.unaligned.fasta). The leading > is part of the match but is not captured, so it is removed as well; write the replacement as >$1$2 if the header line should keep it.

  • rename 's/[^\/]*uce:(uce-\d+).*(\.unaligned\.fasta)$/$1$2/' "$file": the rename operation. This matches from the start of the file name itself (the [^\/]* cannot cross a /, so any leading directories are left alone) up to the last occurrence of uce:, captures the uce-NNNN that follows as $1, then skips the remaining characters until .unaligned.fasta and finally renames the file as just $1$2 (the capturing logic is the same as above).

  • tmpFileName="${file#*uce:}": The ${var#pattern} syntax means "print the value of $var, but remove the shortest match of pattern from the front of the variable's value". The ${var/pattern/replacement} syntax means "print the value of $var and replace the first occurrence of pattern with replacement". Here, the pattern is |* which means "a | and everything after it" (note that this is shell stuff and is using globs and not regular expressions). This is probably easier to understand with an example:

    $ echo "$file"
    947-length-1150-cov-1000|contig:JAECWU010000213.1|slice:817050-818200|uce:uce-1452|match:817550-817700|orient:{'+'}|probes:4.unaligned.fasta
    $ tmpFileName="${file#*uce:}"
    $ echo "$tmpFileName"
    uce-1452|match:817550-817700|orient:{'+'}|probes:4.unaligned.fasta
    $ echo "${tmpFileName/|*/.unaligned.fasta}"
    uce-1452.unaligned.fasta
    

    So, given tmpFileName="${file#*uce:}", "${tmpFileName/|*/.unaligned.fasta}" is the desired file name and this is what we pass to mv.
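The expansions used above have close relatives that are easy to mix up: # versus ## for removing a prefix, and / versus // for replacing. A small sketch on a made-up value shows the difference:

```shell
#!/usr/bin/env bash
name='a|b|c'                 # made-up value with two | separators

echo "${name#*|}"            # shortest prefix match removed: b|c
echo "${name##*|}"           # longest prefix match removed:  c
echo "${name/|/-}"           # first | replaced:              a-b|c
echo "${name//|/-}"          # every | replaced:              a-b-c
echo "${name/|*/.txt}"       # | and everything after it:     a.txt
```

The last line is the same form as the one passed to mv: because * matches as much as possible, a single / is enough to replace everything from the first | onwards.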

Answered by terdon on June 3, 2021
