Bioinformatics Asked by Malia W on June 3, 2021
I have a MainDirectory/hundreds_of_subdirectories/thousands_of_fastas
structure. For each fasta in the bottom level, I am trying to change the file name as well as the first line in the file to retain only some parts of the file.
So that the following text, which makes up both the file name and the first line of the file:
947-length-1150-cov-1000|contig:JAECWU010000213.1|slice:817050-818200|uce:uce-1452|match:817550-817700|orient:{'+'}|probes:4.unaligned.fasta
becomes just:
uce-1452.unaligned.fasta
For editing the first line in the file, I have tried
sed '1 s/^[^|uce-]*\(|uce-[0-9]\).*/\1/' hundreds_of_subdirectories/*
but this isn’t the ticket..
Then messing around with some kind of loop…
for i in **.unaligned.fasta
do
sed -E 's/^[^|uce-]*(|uce-[0-9]).*/\1/'
done
For the file name, it was suggested that I use rename instead of sed, but with sed I have tried something like:
for x in hundreds_of_subdirectories/thousands_of_fastas*unaligned.fasta
do
echo $x | sed -r 's/^[^|uce-]*(|uce-[0-9]).*/mv' -v "" "|uce-[0-9].*/1/.unaligned.fasta"/
done
Am I even barking up the right tree? Thanks!
I think you're working too hard to capture the uce: expression.
Try
sed -Ee '1s/^.+uce:(uce-[0-9]+).+/\1.unaligned.fasta/' file_glob
to replace the content of the first line. You may want to add the in-place flag (-i) after testing the expression on a few files.
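Before adding the in-place flag, the expression can be checked against a single record piped in on standard input. This is only a minimal sketch: it assumes the header starts with > and follows the layout shown in the question, and the sequence line ACGT is made up for illustration.

```shell
# Sample header in the question's layout (the leading '>' is an assumption).
header=">947-length-1150-cov-1000|contig:JAECWU010000213.1|slice:817050-818200|uce:uce-1452|match:817550-817700|orient:{'+'}|probes:4.unaligned.fasta"

# Feed a two-line fake FASTA record through the expression.
# The "1" address restricts the substitution to the first line, and
# the leading ^.+ also consumes the '>' at the start of the header.
result=$(printf '%s\nACGT\n' "$header" | sed -E '1s/^.+uce:(uce-[0-9]+).+/\1.unaligned.fasta/')
printf '%s\n' "$result"
```

Nothing is written to disk here, so the command can be rerun freely while adjusting the pattern.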
For the mv, use command substitution, which runs the command in a sub-shell and uses its output as the parameter:
for f in file_glob
do
mv -- "$f" "$(echo "$f" | sed '...')"
done
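One way the elided sed call might be filled in is sketched below. The helper name new_name and the pattern are assumptions based on the example file name in the question, not part of the answer above. The helper rewrites only the file name and leaves the directory part alone, and the final line prints the mv command rather than running it.

```shell
# Hypothetical helper: print the new path for one file, leaving its directory unchanged.
new_name() (
  dir=$(dirname -- "$1")
  base=$(basename -- "$1")
  # Rewrite only the file name; the pattern is an assumption based on the question's example.
  short=$(printf '%s\n' "$base" | sed -E 's/^.*uce:(uce-[0-9]+).*(\.unaligned\.fasta)$/\1\2/')
  printf '%s/%s\n' "$dir" "$short"
)

f="subdir/947-length-1150-cov-1000|contig:JAECWU010000213.1|slice:817050-818200|uce:uce-1452|match:817550-817700|orient:{'+'}|probes:4.unaligned.fasta"

# Dry run: print the mv command instead of executing it.
echo mv -- "$f" "$(new_name "$f")"
```

Inside the loop, the same call would become mv -- "$f" "$(new_name "$f")" once the printed commands look right.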
Answered by Ram RS on June 3, 2021
If you are using bash, you can enable the globstar option (shopt -s globstar), which allows ** to match recursively. That way, **/*fasta will find all files whose name ends with fasta in this directory and all of its subdirectories. That lets you easily iterate over all the target names. Next, if you have the perl rename command (called rename on Debian, Ubuntu and similar systems, prename or perl-rename on others), you can do the renaming quite easily.
Note that perl-rename has the -n option, which makes it print out the renaming it would do without actually renaming anything. So I suggest you first run this to test what you will be doing:
shopt -s globstar
for file in **/*.fasta; do
echo perl -i.bak -pe 's/^\s*>.*uce:(uce-\d+).*(\.unaligned\.fasta)/$1$2/' "$file"
rename -n 's/^.*uce:(uce-\d+).*(\.unaligned\.fasta)/$1$2/' "$file"
done
That will echo the perl command, so you can try it manually first and make sure it works, and will print out the rename operations without actually doing anything. If you are satisfied it will work, run again without the echo or -n:
for file in **/*fasta; do
perl -i.bak -pe 's/^\s*>.*uce:(uce-\d+).*(\.unaligned\.fasta)/$1$2/' "$file"
rename 's/^.*uce:(uce-\d+).*(\.unaligned\.fasta)/$1$2/' "$file"
done
Or, if you don't have perl-rename, you can use the shell's string manipulation for this:
for file in **/*.fasta; do
perl -i.bak -pe 's/^\s*>.*uce:(uce-\d+).*(\.unaligned\.fasta)/$1$2/' "$file"
tmpFileName="${file#*uce:}"
mv -- "$file" "${tmpFileName/|*/.unaligned.fasta}"
done
shopt -s globstar: this enables the globstar bash option for the current shell session. It only needs to be run once; its effects last until you close the terminal or start a new shell session. As for what it does, see man bash:
globstar
If set, the pattern ** used in a pathname expansion context will match all files and zero or more directories and subdirectories. If the pattern is followed by a /, only directories and subdirectories match.
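Where globstar is not available (it was added in bash 4.0, so older bash versions and other shells lack it), find can produce a comparable recursive listing. This is only a sketch of an alternative; the helper name list_fastas is hypothetical.

```shell
# Hypothetical helper: list every regular file under "$1" whose name ends in .fasta.
list_fastas() {
  find "$1" -type f -name '*.fasta'
}

# Example: list the matching files below the current directory.
list_fastas .
```

Unlike the glob, -type f restricts the results to regular files, so a directory whose name happens to end in .fasta is skipped.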
for file in **/*fasta; do ... ; done: iterate over all files (and directories) whose name ends in fasta, saving each file name as $file.
perl -i.bak -pe 's/^\s*>.*uce:(uce-\d+).*(\.unaligned\.fasta)/$1$2/' "$file": this changes the fasta header. The -i.bak option tells perl to edit the file in place, but to save the original file with a .bak extension. This can help you revert any changes if there is a mistake. Once you have finished, you could run rm **/*fasta.bak to delete the backups. Alternatively, use plain -i instead of -i.bak if you don't want to keep backups.
The -p option tells perl to process its input line by line and print every line after applying whatever script you give it with -e.
The s/old/new/ is the substitution operator, same as in sed. The regular expression looks for 0 or more whitespace characters and then a > at the beginning of the line (^\s*>). It then allows for 0 or more characters (.*) until it finds the string uce: followed by uce- and one or more digits (uce:(uce-\d+)). The parentheses "capture" the pattern, storing it as $1. Next, we want 0 or more characters again until we reach .unaligned.fasta (.*(\.unaligned\.fasta)). We capture the .unaligned.fasta as $2 (second set of parentheses) only to avoid typing it out again. Note that this assumes you will never have anything after the .unaligned.fasta. Finally, we replace everything matched above with just $1 (the uce-numbers) and $2 (.unaligned.fasta).
rename 's/^.*uce:(uce-\d+).*(\.unaligned\.fasta)/$1$2/' "$file": the rename operation. This will match everything from the beginning of the file name until the last occurrence of uce:uce-NNNN, capture the uce-NNNN as $1, then skip the remaining characters until .unaligned.fasta, and finally rename the file as just $1$2 (the capturing logic is the same as above).
tmpFileName="${file#*uce:}": The ${var#pattern} syntax means "expand to the value of $var, but remove the shortest match of pattern from the front of the variable's value". The ${var/pattern/replacement} syntax means "expand to the value of $var and replace the first occurrence of pattern with replacement". Here, the pattern is |* which means "a | and anything after it" (note that this is shell stuff and is using globs and not regular expressions). This is probably easier to understand with an example:
$ echo "$file"
947-length-1150-cov-1000|contig:JAECWU010000213.1|slice:817050-818200|uce:uce-1452|match:817550-817700|orient:{'+'}|probes:4.unaligned.fasta
$ tmpFileName="${file#*uce:}"
$ echo "$tmpFileName"
uce-1452|match:817550-817700|orient:{'+'}|probes:4.unaligned.fasta
$ echo "${tmpFileName/|*/.unaligned.fasta}"
uce-1452.unaligned.fasta
So, given tmpFileName="${file#*uce:}", "${tmpFileName/|*/.unaligned.fasta}" is the desired file name and this is what we pass to mv.
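One detail is worth checking before running this on nested directories. When $file comes from **/*.fasta it carries its directory path, and ${file#*uce:} strips that path along with the rest of the prefix, so mv would place the renamed file in the current directory. A possible sketch that applies the same idea to the file name only is shown below. The helper name target_path is hypothetical, and it uses the POSIX ${var%%pattern} form (remove the longest match from the end) in place of the bash-only ${var/pattern/replacement}.

```shell
# Hypothetical helper: build the target path with parameter expansion only.
target_path() (
  file=$1
  case $file in
    */*) dir=${file%/*}; base=${file##*/} ;;  # split at the last slash
    *)   dir=.; base=$file ;;                 # no directory part
  esac
  tmp=${base#*uce:}   # drop everything up to and including the first "uce:"
  # Cut from the first "|" onward, then append the fixed suffix.
  printf '%s/%s\n' "$dir" "${tmp%%|*}.unaligned.fasta"
)

file="sub/dir/947-length-1150-cov-1000|contig:JAECWU010000213.1|slice:817050-818200|uce:uce-1452|match:817550-817700|orient:{'+'}|probes:4.unaligned.fasta"
target_path "$file"
```

In the loop, the mv line would then read mv -- "$file" "$(target_path "$file")".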
Answered by terdon on June 3, 2021