Unix & Linux
Asked by Kris on December 4, 2020
How do I remove the first 300 million lines from a 700 GB text file
on a system with 1 TB disk space total, with 300 GB available?
(My system has 2 GB of memory.)
The answers I found use sed, tail, or head, but I think (please correct me) I cannot use them, because they produce a new file and/or use a temporary file during processing, and my disk space is limited to 1 TB.
The file contains database records in JSON format.
If you have enough space to compress the file, which should free a significant amount of space, allowing you to do other operations, you can try this:
gzip file && zcat file.gz | tail -n +300000001 | gzip > newFile.gz
That will first gzip the original input file (file) to create file.gz. Then, you zcat the newly created file.gz, pipe it through tail -n +300000001 to remove the first 300 million lines, compress the result to save disk space, and save it as newFile.gz. The && ensures that you only continue if the gzip operation was successful (it will fail if you run out of space).
Note that text files are very compressible. For example, I created a test file using seq 400000000 > file, which prints the numbers from 1 to 400,000,000; this resulted in a 3.7G file. When I compressed it using the commands above, the compressed file was only 849M and the newFile.gz I created was only 213M.
Correct answer by terdon on December 4, 2020
i'd do it as
<?php
$fp1 = fopen("file.txt", "rb");
// skip the first 300 million lines
for ($i = 0; $i < 300_000_000; ++$i) {
    fgets($fp1);
}
// the next fgets($fp1) call will read line 300,000,001 :)
$fp2 = fopen("file.txt", "cb");
// copy all remaining lines from fp1 to fp2
while (false !== ($line = fgets($fp1))) {
    fwrite($fp2, $line);
}
fclose($fp1);
// remove every line that wasn't copied over to fp2
ftruncate($fp2, ftell($fp2));
fclose($fp2);
or, if i need it to run fast for some reason, i'd do the same in C++ with mmap() memory mapping, which should run much faster:
#include <fstream>
#include <string>
#include <sys/stat.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
int main(){
    const std::string target_file = "file.txt";
    std::fstream fp1(target_file, std::fstream::in | std::fstream::out | std::fstream::binary);
    fp1.exceptions(std::fstream::failbit | std::fstream::badbit);
    fp1.seekg(0, std::fstream::end);
    const std::streamoff total_file_size_before_truncation = fp1.tellg();
    fp1.seekg(0, std::fstream::beg);
    const int fd = open(target_file.c_str(), O_RDWR);
    // map the whole file so it can be scanned and copied without loading it into RAM
    char *content_mmaped = (char *)mmap(NULL, total_file_size_before_truncation, PROT_READ, MAP_PRIVATE, fd, 0);
    // find the newline that terminates line 300,000,000
    std::streamoff line_no = 0;
    std::streamoff i = 0;
    for(; i < total_file_size_before_truncation; ++i){
        if(content_mmaped[i] == '\n'){
            ++line_no;
            if(line_no >= 300000000){
                break;
            }
        }
    }
    // copy everything after that newline to the start of the file...
    const std::streamoff remaining = total_file_size_before_truncation - (i + 1);
    fp1.write(&content_mmaped[i + 1], remaining);
    fp1.close();
    munmap(content_mmaped, total_file_size_before_truncation);
    // ...then cut off the now-duplicated tail
    ftruncate(fd, remaining);
    close(fd);
}
(but if i don't need the speed, i would probably use the first approach, as the code is much easier to read and probably less likely to contain bugs as a result)
Answered by hanshenrik on December 4, 2020
There are various approaches to removing the first lines. I recommend splitting the file up into chunks, changing them (removing the first lines), and concatenating the chunks again.
In your case it would be very dangerous to change the file in place: if something goes wrong, you have no fallback option!
Here is my working solution (bash). You will probably need to make some improvements ...
function split_into_chunks {
    BIG_FILE=$1
    while [ $(stat -c %s $BIG_FILE) -gt 0 ]
    do
        CHUNK_FILE="chunk.$(ls chunk.* 2>/dev/null | wc -l)"
        tail -10 $BIG_FILE > $CHUNK_FILE
        test -s $CHUNK_FILE && truncate -s -$(stat -c %s $CHUNK_FILE) $BIG_FILE
    done
}
function concat_chunks {
    BIG_FILE=$1
    test ! -s $BIG_FILE || (echo "ERROR: target file is not empty"; return)
    for CHUNK_FILE in $(ls chunk.* | sort -t . -k2 -n -r)
    do
        cat $CHUNK_FILE >> $BIG_FILE
        rm $CHUNK_FILE
    done
}
Test:
$ seq 1000 > big-file.txt
$ stat -c "%s %n" chunk.* big-file.txt 2>/dev/null | tail -12
3893 big-file.txt
$ md5sum big-file.txt; wc -l big-file.txt
53d025127ae99ab79e8502aae2d9bea6 big-file.txt
1000 big-file.txt
$ split_into_chunks big-file.txt
$ stat -c "%s %n" chunk.* big-file.txt | tail -12
40 chunk.9
31 chunk.90
30 chunk.91
30 chunk.92
30 chunk.93
30 chunk.94
30 chunk.95
30 chunk.96
30 chunk.97
30 chunk.98
21 chunk.99
0 big-file.txt
$ # here you could change the chunks
$ # the test here shows that the file will be concatenated correctly again
$ concat_chunks big-file.txt
$ stat -c "%s %n" chunk.* big-file.txt 2>/dev/null | tail -12
3893 big-file.txt
$ md5sum big-file.txt; wc -l big-file.txt
53d025127ae99ab79e8502aae2d9bea6 big-file.txt
1000 big-file.txt
Hint: You definitely need to make sure that all your chunks are not too small (very long processing time) and not too big (not enough disk space)! My example uses 10 lines per chunk - I assume that is too low for your task.
Answered by sealor on December 4, 2020
You can just read and write to the file in place and then truncate the file. There may even be a way to do this with cli tools, not sure, but here it is in Java (untested).
RandomAccessFile out = new RandomAccessFile("file.txt", "rw");
RandomAccessFile in = new RandomAccessFile("file.txt", "r");
String line = null;
long rows = 0;
while( (line=in.readLine()) != null ){
    // keep everything from line 300,000,001 onwards
    if( rows >= 300000000 ) {
        out.writeBytes(line);
        out.write('\n');
    }
    rows++;
}
in.close();
out.setLength( out.getFilePointer() );
out.close();
Answered by Chris Seline on December 4, 2020
Think of Towers of Hanoi. Sort of.
First, move the lines you want into a new file:
find the start of line 300 million and 1
create a new, empty file
repeat {
read a decent number of blocks from the end of the old file
append the blocks to the end of the new file
truncate the old file by that many blocks
} until you get to the start of line 300 million and 1.
You should now have a file that contains just the lines you want, but not in the right order.
So let's do the same thing again to put them into the right order:
Truncate the original file to zero blocks (i.e. delete the first 300 million lines)
repeat {
read the same number of blocks from the end of the new file (except the first time, when you won't have an exact number of blocks unless the first 300 million lines were an exact number of blocks long)
append those blocks to the end of the original file
truncate the new file by that many blocks
} until you have processed the whole file.
You should now have just the lines you want, and in the right order.
Actual working code is left as an exercise for the reader.
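As a starting point, the first pass might look something like this rough, untested shell sketch (it assumes GNU head, stat, tail and truncate; file stands for the big file and reversed is a scratch-file name of my choosing):
offset=$(head -n 300000000 file | wc -c)   # bytes taken up by the 300M unwanted lines
chunk=$((1024*1024*1024))                  # move about 1 GiB per step
while [ "$(stat -c %s file)" -gt "$offset" ]; do
    n=$(( $(stat -c %s file) - offset ))   # bytes still to move
    [ "$n" -gt "$chunk" ] && n=$chunk
    tail -c "$n" file >> reversed          # append the last n bytes to the new file
    truncate -s "-$n" file                 # and cut them off the old file
done
The second pass is the mirror image: empty the original file, then move pieces back from the end of reversed, which restores the original order while never needing more than about one chunk of extra space.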
Answered by Ben Aveling on December 4, 2020
What about using vim for in-place editing?
Vim is already capable of reasoning about lines:
vim -c ":set nobackup nowritebackup" -c ":300000000delete" -c ":wq" filename
Explanation:
vim will execute the commands passed to the -c switches as if they were typed in an interactive session. So:
:set nobackup nowritebackup keeps vim from writing a backup copy of the file, which would need extra disk space.
:1,300000000delete deletes lines 1 through 300,000,000.
:wq writes the modified file back and quits.
That should do the trick. I have used vim in a similar fashion in the past; it works. It may not be copy-paste safe, so the OP should do some tests and possibly adapt the command to their needs.
Just to be sure, you might want to remove the -c ":wq" switch at the end and visually inspect the file for correctness.
Answered by znpy on December 4, 2020
With ksh93:
tail -n +300000001 < file 1<>; file
The 1<>; operator is a ksh93-specific variation on the standard 1<> operator (which opens in read+write mode without truncation), that truncates the file, after the command has returned, at the position the command left its stdout at, if that command was successful.
With other shells, you can always do the truncating-in-place afterwards by hand with perl, for instance:
{
tail -n +300000001 &&
perl -e 'truncate STDOUT, tell STDOUT'
} < file 1<> file
To get a progress bar, using pv:
{
head -n 300000000 | pv -s 300000000 -lN 'Skipping 300M lines' > /dev/null &&
cat | pv -N 'Rewriting the rest' &&
perl -e 'truncate STDOUT, tell STDOUT'
} < file 1<> file
(using head | pv and cat | pv as pv would refuse to work if its input and output were pointing to the same file. pv -Sls 300000000 would also not work as pv doesn't leave the pointer within the file just after the 300000000th line after exiting like head does (and is required to by POSIX for seekable files). pv | cat instead of cat | pv would allow pv to know how much it needs to read and give you an ETA, but it's currently bogus in that it doesn't take into account the cases where it's not reading from the start of that file, as is the case here).
Note that those are dangerous as the file is being overwritten in place. There is a chance that you run out of disk space if the first 300M lines contained holes (shouldn't happen for a valid text file), and the rest of the file takes up more space than you have spare space on the FS.
Answered by Stéphane Chazelas on December 4, 2020
Another vote for a custom program, if you really DO need this done. C or any powerful enough dynamic language like Perl or Python will do. I won't write out the source here, but will describe an algorithm that will prevent data loss while you move data around:
Read your big file from the end, counting line breaks. After gathering a pre-defined number of lines that safely fits in the free space, write that chunk out as a separate file and cut the big file's tail off by the same amount. Use the chunk's filename to record which line numbers it holds.
Repeat until the big file is empty and the same data sits in a set of much smaller chunk files.
Count your 300 million lines: every chunk that only contains unwanted lines can be deleted outright, since you know which lines each chunk holds.
If you don't actually need the big file afterwards, you can work on the remaining chunks directly with whatever tools you need, stringing them together with cat as necessary.
If you do need the big file back, concatenate the remaining chunks in order with cp or cat, deleting each chunk as soon as it has been appended so the operation stays within the available space.
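For illustration, the first step might look roughly like this (an untested shell sketch; bigfile, the chunk size, and the chunk.FIRST-LAST naming scheme are placeholders of my choosing):
total=$(wc -l < bigfile)                   # total number of lines in the file
per_chunk=1000000                          # lines per chunk, tune to taste
while [ "$total" -gt 0 ]; do
    n=$per_chunk
    [ "$n" -gt "$total" ] && n=$total
    first=$(( total - n + 1 ))
    tail -n "$n" bigfile > "chunk.$first-$total"   # chunk name records its line range
    truncate -s "-$(stat -c %s "chunk.$first-$total")" bigfile
    total=$(( total - n ))
done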
Answered by Oleg V. Volkov on December 4, 2020
I created a tool that may be of use to you: hexpeek is a hex editor designed for working with huge files and runs on any recent POSIX-like system (tested on Debian, CentOS, and FreeBSD).
One can use hexpeek or an external tool to find the 300-millionth newline. Then, assuming that X is the hexadecimal zero-indexed position of the first octet after the 300-millionth newline, the file can be opened in hexpeek and a single command 0,Xk will delete the first X octets in the file.
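To compute X with standard tools first (a sketch, independent of hexpeek itself), one can count the bytes taken up by the first 300 million lines and print the result in hexadecimal:
# X = hex offset of the first octet after the 300-millionth newline
printf '%x\n' "$(head -n 300000000 file | wc -c)"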
hexpeek requires no tmpfile to perform this operation; although the optional backup mode does and would likely need to be disabled via the -backup flag (sadly the current backup algorithm does not accommodate a rearrangement affecting more file space than is available for the backup file).
Of course, a custom C program can accomplish the same thing.
Answered by resiliware on December 4, 2020
The limiting factor in this problem is the amount of storage, wherever it is located. Significant RAM is not required, since fundamentally you can simply read one byte [character] from wherever your file is stored and then either write it or not write it to a new file, wherever that may reside. The infile and outfile can be in totally separate places: on separate partitions, disks, or across a network. You do not need to read and write to the same folder. So for the attached program, you can simply give full path names for the infile and outfile to work around disk space limitations. You will be at the mercy of other limitations, such as disk or network I/O speed, but it will work. Taking a very long time to work is better than it not being possible at all.
LL is a hardcoded line length, used to read in a whole line at a time from the text file; I set it to 2048 characters. Set it to 1000000 if you like, which would require 1 MB of RAM should you have extremely long lines in the text file.
If your text file is huge, you can run gzip -9 on it to create a mytextfile.gz first; being a text file, it will likely compress to 5% of the size, which is helpful considering disk i/o speed vs cpu speed.
Your new file, minus the n_deleted_lines, gets written to an uncompressed text file, so that will likely be huge.
Run it like: delete_n_lines.x /home/ron/mybigfile.txt /some_nfs_mounted_disk/mybigfile_deletedlines.txt 300000000
/* this file named delete_n_lines.c
   compile by gcc -W delete_n_lines.c -o delete_n_lines.x -lz
   have your huge text file already compressed via "gzip -9" to save disk space
   this program will also read a regular uncompressed text file
*/
# include <stdlib.h>
# include <stdio.h>
# include <string.h>
# include <zlib.h>
# define LL 2048 /* line length, number of characters up to '\n' */
int main ( int argc, char *argv[] )
{
   gzFile fin;
   FILE *fout;
   char line[LL];
   long int i, n = 0;
   long int n_lines_to_delete = 0;
   if ( argc != 4 )
   {
      printf(" Usage: %s <infile> <outfile> <first_N_lines_to_delete>\n\n", argv[0] );
      exit( 0 );
   }
   n = sscanf( argv[3], "%ld", &n_lines_to_delete );
   if ( n == 0 )
   {
      printf("\n Error: problem reading N lines to delete\n\n" );
      exit( 0 );
   }
   if ( strcmp( argv[1], argv[2] ) == 0 )
   {
      printf("\n Error: infile and outfile are the same.\n" );
      printf(" don't do that\n\n");
      exit( 0 );
   }
   fout = fopen( argv[2], "w" );
   if ( fout == NULL )
   {
      printf("\n Error: could not write to %s\n\n", argv[2] );
      exit( 0 );
   }
   fin = gzopen( argv[1], "r" );
   if ( fin == NULL )
   {
      printf("\n Error: could not read %s\n\n", argv[1] );
      fclose( fout );
      exit( 0 );
   }
   n = 0;
   gzgets( fin, line, LL );
   while ( ! gzeof( fin ) )
   {
      if ( n < n_lines_to_delete )
         n++;
      else
         fputs( line, fout );
      gzgets( fin, line, LL );
   }
   gzclose( fin );
   fclose( fout );
   printf("\n deleted the first %ld lines of %s, output file is %s\n\n", n, argv[1], argv[2] );
   return 0;
}
Answered by ron on December 4, 2020
You can do it with losetup, as an alternative to the dd method described in another answer below. Again, this method is dangerous all the same.
Again, the same test file and sizes (remove lines 1-300 from 1000 lines file):
$ seq 1 1000 > 1000lines.txt
$ stat -c %s 1000lines.txt
3893 # total bytes
$ head -n 300 1000lines.txt | wc -c
1092 # first 300 lines bytes
$ echo $((3893-1092))
2801 # target filesize after removal
Create a loop device:
# losetup --find --show 1000lines.txt
/dev/loop0
losetup: 1000lines.txt:
Warning: file does not fit into a 512-byte sector;
the end of the file will be ignored.
# head -n 3 /dev/loop0
1
2
3
# tail -n 3 /dev/loop0
921
922
923
Whoops. There are numbers missing. What's going on?
Loop devices require their backing files to be a multiple of the sector size. Text files with lines don't usually fit that scheme, so in order not to miss the end-of-file content (the last partial sector), just append some more data first, then try again:
# head -c 512 /dev/zero >> 1000lines.txt
# losetup --find --show 1000lines.txt
/dev/loop1
losetup: 1000lines.txt:
Warning: file does not fit into a 512-byte sector;
the end of the file will be ignored.
# tail -n 3 /dev/loop1
999
1000
The warning persists but the content is complete now, so that's okay.
Create another one, this time with the 300 line offset:
# losetup --find --show --offset=1092 1000lines.txt
/dev/loop2
losetup: 1000lines.txt:
Warning: file does not fit into a 512-byte sector;
the end of the file will be ignored.
# head -n 3 /dev/loop2
301
302
303
# tail -n 3 /dev/loop2
999
1000
Here's the nice thing about loop devices. You don't have to worry about truncating the file by accident. You can also easily verify that your offsets are indeed correct before performing any action.
Finally, just copy it over, from offset device to full:
cp /dev/loop2 /dev/loop1
Dissolve loop devices:
losetup -d /dev/loop2 /dev/loop1 /dev/loop0
(Or: losetup -D to dissolve all loop devices.)
Truncate the file to target filesize:
truncate -s 2801 1000lines.txt
The result:
$ head -n 3 1000lines.txt
301
302
303
$ tail -n 3 1000lines.txt
998
999
1000
Answered by frostschutz on December 4, 2020
On some filesystems like ext4 or xfs, you can use the fallocate() system call for that: its FALLOC_FL_COLLAPSE_RANGE mode removes a range of bytes from the file without rewriting the rest, provided the offset and length are multiples of the filesystem block size.
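For instance, with the util-linux fallocate tool, a rough, untested sketch might be (the offset must be rounded down to the block size, and the few leftover unaligned bytes at the front would still have to be removed by one of the other methods, such as dd plus truncate):
offset=$(head -n 300000000 file | wc -c)   # bytes used by the first 300M lines
blksz=$(stat -f -c %S file)                # filesystem block size
aligned=$(( offset / blksz * blksz ))      # collapse-range needs block-aligned offset/length
fallocate --collapse-range --offset 0 --length "$aligned" file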
Answered by pink slime on December 4, 2020
Removing the first n lines (or bytes) can be done in-place using dd (or alternatively using loop devices). It does not use a temporary file and there is no size limit; however, it is dangerous since there is no tracking of progress, and any error leaves you with a broken file.
Example: Create a sample file with 1000 lines:
$ seq 1 1000 > 1000lines.txt
$ head -n 3 1000lines.txt
1
2
3
$ tail -n 3 1000lines.txt
998
999
1000
We want to remove the first 300 lines. How many bytes does it correspond to?
$ stat -c %s 1000lines.txt
3893 # total bytes
$ head -n 300 1000lines.txt | wc -c
1092 # first 300 lines bytes
$ echo $((3893-1092))
2801 # target filesize after removal
The file is 3893 bytes, we want to remove the first 1092 bytes, leaving us with a new file of 2801 bytes.
To remove these bytes, we use the GNU dd command with conv=notrunc, as otherwise the file would be truncated to zero length before its contents could be copied:
$ dd conv=notrunc iflag=skip_bytes skip=1092 if=1000lines.txt of=1000lines.txt
5+1 records in
5+1 records out
2801 bytes (2.8 kB, 2.7 KiB) copied, 8.6078e-05 s, 32.5 MB/s
This removes the first 300 lines, but now the last 1092 bytes repeat, because the file is not truncated yet:
$ truncate -s 2801 1000lines.txt
This reduces the file to its final size, removing duplicated lines at end of file.
The result:
$ stat -c %s 1000lines.txt
2801
$ head -n 3 1000lines.txt
301
302
303
$ tail -n 3 1000lines.txt
998
999
1000
The process for a larger file is similar. You may need to set a larger blocksize for better performance (the blocksize option for dd is bs).
The main issue is determining the correct byte offset for the exact line number. In general it can only be done by reading and counting. With this method, you have to read the entire file at least once even if you are discarding a huge chunk of it.
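For example, with file standing in for the big file, the offset can be obtained with something like:
# bytes occupied by the first 300 million lines; use this as the skip=
# value for dd, and subtract it from the total size for the final truncate
head -n 300000000 file | wc -c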
Answered by frostschutz on December 4, 2020