Bioinformatics Asked by C. Zeil on July 19, 2021
I am experiencing timeout problems when downloading the NCBI nr preformatted blast database using the update_blastdb script (version 504861).
I run the script with the following parameters:
update_blastdb --decompress --passive --verbose nr
and I get the following error message (in verbose mode):
Downloading nr (45 volumes) ...
Downloading nr.00.tar.gz...Net::FTP=GLOB(0x5610fb59b8f8)>>> PASV
Net::FTP=GLOB(0x5610fb59b8f8)<<< 227 Entering Passive Mode (165,112,9,229,195,144).
Net::FTP=GLOB(0x5610fb59b8f8)>>> RETR nr.00.tar.gz
Net::FTP=GLOB(0x5610fb59b8f8)<<< 150 Opening BINARY mode data connection for nr.00.tar.gz (18745730730 bytes)
Net::FTP: Net::Cmd::getline(): timeout at /usr/share/perl/5.26/Net/FTP/dataconn.pm line 82.
Unable to close datastream at /usr/bin/update_blastdb line 202.
Net::FTP=GLOB(0x5610fb59b8f8)>>> PASV
Net::FTP: Net::Cmd::getline(): unexpected EOF on command channel: Connection reset by peer at /usr/bin/update_blastdb line 203.
Failed to download nr.00.tar.gz.md5!
Net::FTP: Net::Cmd::_is_closed(): unexpected EOF on command channel: Connection reset by peer at /usr/bin/update_blastdb line 101.
Net::FTP: Net::Cmd::_is_closed(): unexpected EOF on command channel: Connection reset by peer at /usr/bin/update_blastdb line 101.
The timeout happens after ~35 minutes, by which point a file of approximately 18 GB has been downloaded, which matches the expected file size. The checksum file (nr.00.tar.gz.md5) is not downloaded, so I'm not sure which of the two files is actually the problem.
I tested downloading the nt database and everything works fine, so I don't think the script itself is the problem. For comparison, this is the nt download output for the first file:
Downloading nt.00.tar.gz...Net::FTP=GLOB(0x562575a73168)>>> PASV
Net::FTP=GLOB(0x562575a73168)<<< 227 Entering Passive Mode (165,112,9,229,196,51).
Net::FTP=GLOB(0x562575a73168)>>> RETR nt.00.tar.gz
Net::FTP=GLOB(0x562575a73168)<<< 150 Opening BINARY mode data connection for nt.00.tar.gz (4065912989 bytes)
Net::FTP=GLOB(0x562575a73168)<<< 226 Transfer complete
Net::FTP=GLOB(0x562575a73168)>>> PASV
Net::FTP=GLOB(0x562575a73168)<<< 227 Entering Passive Mode (165,112,9,229,195,107).
Net::FTP=GLOB(0x562575a73168)>>> RETR nt.00.tar.gz.md5
Net::FTP=GLOB(0x562575a73168)<<< 150 Opening BINARY mode data connection for nt.00.tar.gz.md5 (47 bytes)
Net::FTP=GLOB(0x562575a73168)<<< 226 Transfer complete
Any help would be appreciated.
NCBI support's answer stated that the size of the first download is likely the problem and suggested trying another tool such as rsync. Because I'm unfamiliar with rsync, I decided to write a Python script to do the job instead:
import hashlib
import json
import re
import tarfile
import urllib.request

base_url = "https://ftp.ncbi.nlm.nih.gov/blast/db"

# download the manifest file to get the filenames
manifest_file = 'blastdb-manifest.json'
urllib.request.urlretrieve(f"{base_url}/{manifest_file}", manifest_file)
with open(manifest_file) as f:
    manifest_data = json.load(f)

# download everything
for file in manifest_data['nr']['files']:
    # download the checksum file
    checksum_file = f"{file}.md5"
    urllib.request.urlretrieve(f"{base_url}/{checksum_file}", checksum_file)
    # download the archive file
    urllib.request.urlretrieve(f"{base_url}/{file}", file)
    # check that the checksums match
    calculated_checksum = get_md5_for_file(file)
    with open(checksum_file) as f:
        for line in f:
            line = line.strip()
            if line.endswith(file):
                checksum = re.split(r"\s+", line)[0]
                if checksum != calculated_checksum:
                    raise Exception(f"Checksum doesn't match expected for file {file}")
    # unpack the archive into the current directory
    with tarfile.open(file) as tar:
        tar.extractall()
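One weakness of the script above is that urlretrieve restarts an interrupted transfer from scratch, which is painful for 18 GB volumes. A more robust variant resumes a partial file with an HTTP Range request. This is a hedged sketch, assuming the HTTPS server honors Range headers (and falls back to a clean restart if it doesn't); download_with_resume is a hypothetical helper, not part of any NCBI tooling:

```python
import os
import urllib.request


def download_with_resume(url, dest, chunk_size=1 << 20):
    """Download url to dest, resuming from a partial file if one is present."""
    start = os.path.getsize(dest) if os.path.exists(dest) else 0
    req = urllib.request.Request(url)
    if start:
        req.add_header("Range", f"bytes={start}-")
    with urllib.request.urlopen(req) as resp, \
            open(dest, "ab" if start else "wb") as out:
        # status 206 means the server accepted the Range request;
        # a plain 200 means it restarted from the beginning, so we
        # truncate the partial file and write the full body instead
        if start and resp.status != 206:
            out.seek(0)
            out.truncate()
        while True:
            chunk = resp.read(chunk_size)
            if not chunk:
                break
            out.write(chunk)
```

With this in place, a failed run can simply be restarted and each partially downloaded volume picks up where it left off instead of beginning again.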
The function to calculate the md5 checksum is the following (borrowed from this question)
def get_md5_for_file(file):
    hash_md5 = hashlib.md5()
    with open(file, 'rb') as f:
        for chunk in iter(lambda: f.read(4096), b""):
            hash_md5.update(chunk)
    return hash_md5.hexdigest()
Note that this is not the exact script I use, so I can't guarantee it works exactly as written, but I tried to include all the important parts.
In my version I include checks for whether my existing BLAST database needs to be updated (I keep the manifest files and compare the manifest['nr']['last_updated'] values), and I also try to skip already unpacked archives while iterating (by checking whether the nr.xx.phd file exists), to save time in case the script fails due to download problems.
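The two checks described above can be sketched as follows. This is a minimal illustration, assuming the manifest layout shown earlier (manifest_data['nr']['last_updated']) and that each unpacked protein volume nr.xx leaves an nr.xx.phd file on disk; the function names are illustrative, not part of any NCBI tool:

```python
import json
import os


def needs_update(old_manifest_path, new_manifest, db="nr"):
    """Compare last_updated in the saved manifest against a fresh one."""
    if not os.path.exists(old_manifest_path):
        return True  # no saved manifest: treat the database as stale
    with open(old_manifest_path) as f:
        old_manifest = json.load(f)
    return old_manifest[db]["last_updated"] != new_manifest[db]["last_updated"]


def already_unpacked(archive_name):
    """Skip nr.xx.tar.gz if the nr.xx.phd file it contains already exists."""
    volume = archive_name.removesuffix(".tar.gz")  # e.g. "nr.00"
    return os.path.exists(f"{volume}.phd")
```

Inside the download loop, a simple `if already_unpacked(file): continue` before the urlretrieve calls avoids re-downloading volumes that were already extracted in a previous run.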
Correct answer by C. Zeil on July 19, 2021
Same here. I used axel to download it a week ago, but it failed to finish. The downloaded taxonomy archive (60 MB) also doesn't match its md5sum. Apparently, they update the FTP files every 30 minutes to an hour.
Answered by Life_Searching_Steps on July 19, 2021