Applying function to file line by line or read entirely into structure first?

Software Engineering Asked on October 29, 2021

I’ve often found myself with the need to develop tools that process large files over a network and perform an operation to every element in that file. An element may be an individual line or an object that is parsed out based on its structure (XML, JSON, binary format). A key feature of these tools is what I often call "user feedback", and it tends to manifest itself as a progress bar that is updated periodically. I’ve found the only way to do this is to use the "line by line" approach:

for file in file_set:
    with open(file, 'r') as f:
        for element in f:
            do_thing_to(element)
            # periodically (e.g. after some elapsed time) update the progress bar

This seems idiomatic and straightforward. But I’ve often wondered whether reading the entire file into some structure first and then applying or mapping a function over that structure would result in faster performance. However, doing so means I lose the ability to track progress and inform the user at the granular level I’ve chosen; instead the progress reporting has to become coarser.
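
For concreteness, the "read entirely first" alternative I have in mind would look roughly like this (same placeholder names as above):

for file in file_set:
    with open(file, 'r') as f:
        elements = f.readlines()          # the whole file sits in memory at once
    results = [do_thing_to(element) for element in elements]
    # progress can now only be reported per file, not per element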

This is obviously system-dependent and requires benchmarking, but which approach tends to be typical for this kind of problem?

An immediate concern I have with the "read entirely first" method is the memory constraint, but that’s all I can really think of. Speed and memory efficiency are the main concerns, as usual. If both approaches benchmark about the same, I’d default to the one with the lower memory profile.

5 Answers

There are a number of factors here but we can definitely lay out some principles around these kinds of situations. Let's start with the basic framework. Consider the following visualization:

time it takes to load    |----------|
time it takes to process |----------|

The length of the line represents time. The units involved matter in practice but not at the conceptual level.

Now here's what it looks like when you load the data and then process it:

loading    |----------|
process               |----------|

We can simply add the time it takes to load to the time it takes to process. Now consider if we don't wait for loading to finish before we process it. It might look something like this:

loading    |----------|
process     |----------|

Now I've made an assumption here that the loading process can happen in parallel with processing. While this isn't guaranteed, it's absolutely doable with non-blocking IO. Even with regular IO, this is often still roughly how things happen.

Now if either the loading or the processing is insignificant, this won't have a major impact either way. But when both take long enough to matter, stream processing can make a serious dent in the total time. Another case where this can make a big difference is when you chain processing steps, such as in a 'pipes and filters' design. For example, you could have this:

|----------|
           |----------|
                      |----------|
                                 |----------|
                                            |----------|

Or this:

|----------|
 |----------|
  |----------|
   |----------|
    |----------|

This is simplifying some things, of course, but at a high level it's absolutely true. So with regard to your situation, the most costly step is likely the download of the file. You don't seem to be considering that, but if you wanted to stream, it would really be against the data as you pull it down. But if your processing is relatively quick, there's not much advantage, and it could introduce some complexity.
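
As a rough sketch of what streaming against the download could look like in Python, using the third-party requests library (url is a placeholder and do_thing_to is the question's placeholder function):

import requests

with requests.get(url, stream=True) as response:   # start pulling the file down
    response.raise_for_status()
    for line in response.iter_lines(decode_unicode=True):
        if line:                                    # iter_lines can yield empty keep-alive lines
            do_thing_to(line)                       # process while the rest is still downloading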

Another factor to consider if you really want to eke out every last drop of performance: it takes time to allocate memory. Let's say you need 1 KiB of memory per line and there are 1024 lines. That's 1 MiB of memory if you pre-load, versus roughly 1 KiB at a time if you process at the line level. It takes a lot longer to allocate a megabyte of memory than a kilobyte, and then that memory needs to be reclaimed, which also takes time.

Ultimately, at a high level, if you are processing data sequentially, pre-loading the data is going to take more time and resources. When you are loading small files from disk or SSD, it won't matter, and you might even get a small speed boost from pre-loading because of how your hardware manages IO. But for any significant amount of data, pre-loading is less efficient.

It's important to note that there are other considerations such as how it can be more complex to handle errors in a streaming solution. If you need all the data for a calculation or need to access the same values repeatedly, streaming can become impractical or impossible.

Answered by JimmyJames on October 29, 2021

For many formats you have no choice but to parse the complete file. For example, with JSON, adding a single zero byte to the end of a perfectly fine JSON file makes it invalid. And parsing the complete structure is likely easier than writing a function that processes it line by line.

That said, you can avoid problems with very large files by passing largish blocks (say 64 KiB at a time) to the parser. If you expect that not all of the file contents will be used, you can parse the file without building all of the data structures.
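
As a minimal sketch of what that can look like in Python, using the standard library's XMLPullParser, which accepts arbitrary chunks via feed() (the 'record' tag and do_thing_to are made-up placeholders):

import xml.etree.ElementTree as ET

parser = ET.XMLPullParser(events=['end'])
with open('huge.xml', 'rb') as f:
    while True:
        chunk = f.read(64 * 1024)            # feed the parser 64 KiB at a time
        if not chunk:
            break
        parser.feed(chunk)
        for event, elem in parser.read_events():
            if elem.tag == 'record':         # hypothetical element of interest
                do_thing_to(elem)
                elem.clear()                 # discard processed content to bound memory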

Answered by gnasher729 on October 29, 2021

You can always measure, but you might be surprised at the results, especially for sequential access. People tend not to think about the optimizations done at lower levels of abstraction. For example, your operating system caches files in memory:

$ free -h
              total        used        free      shared  buff/cache   available
Mem:           31Gi       4.9Gi        22Gi       445Mi       4.2Gi        25Gi
Swap:         1.0Gi          0B       1.0Gi

Here on my system, I currently have 4.2G of file cache. Your language's standard library also does buffering. Some, like Java's BufferedReader, are more explicit than others. Even your disk drive has its own buffering. These things have all been optimized by some very smart people.

In other words, your application is not going out to physically read from the disk every time you read another line. If you try to optimize by doing your own buffering, you might end up throwing the filesystem cache out to make room in RAM. You might end up writing another application's memory to a swap file in order to make room in RAM. You might choose buffer strategies that can't take advantage of faster levels of CPU cache. You don't want to undo optimizations other people have made on your behalf.
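
As a small illustration in Python (do_thing_to is the question's placeholder): iterating a file object already reads in buffered blocks behind the scenes, so there is usually no need to roll your own buffering.

import io

print(io.DEFAULT_BUFFER_SIZE)          # the block size Python's buffered I/O uses by default

with open('data.txt', 'r') as f:       # buffered by default, no extra work needed
    for line in f:                     # lines come out of the buffer, not one disk read each
        do_thing_to(line)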

Answered by Karl Bielefeldt on October 29, 2021

For 90% of the problems most people encounter, reading the file in its entirety and then parsing it completely is faster, simpler, and easier. This should be your default choice when working with smaller data.

You should only use incremental parsing/stream processing when your program may be used in a context where it needs to process a very large input, where slurping the entire file may cause unacceptable memory usage, or where processing takes so much time that you really need to report partial progress.

Answered by Lie Ryan on October 29, 2021

This is often a tradeoff between

  • memory usage, and

  • ease of implementation

As you already noted, reading a file entirely first has the drawbacks of requiring more memory and making it more complicated to report progress.

However, reading a structured file entirely first may be necessary (or at least simpler) when further processing cannot be easily implemented sequentially. For example, let us say you have to process a complex XML file, and the processing requires several XSLT queries into the data, where the result of a previous query may influence the next one. For such a situation, reading the XML into a DOM document structure first may be much simpler than trying to build some sequential processing.

So here is how I usually deal with it: ask yourself

  • is the expected maximum file size "small enough" to be handled in entirety?

  • does reading the file entirely make further processing simpler?

If the answer to both questions is "yes", then I would prefer reading the file completely into a suitable data structure. Otherwise, I would prefer a sequential (i.e. "line-by-line") approach.

Let me add that I have sometimes had to deal with situations where reading the entire file was not feasible, but the requirements did not fit a sequential approach well, either. These cases may require a mixed approach, for example one where a first sequential processing step is used to filter the required data down to a smaller subset, or to transform it into a different representation, so that the non-sequential processing can take place afterwards.
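
A minimal sketch of that mixed approach in Python (is_relevant and process_all stand in for the problem-specific logic):

def filter_pass(path, is_relevant):
    subset = []
    with open(path, 'r') as f:
        for line in f:                 # sequential first pass, bounded memory per line
            if is_relevant(line):
                subset.append(line)    # keep only what the second step needs
    return subset

subset = filter_pass('big_input.txt', is_relevant)
result = process_all(subset)           # non-sequential processing on the reduced data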

Answered by Doc Brown on October 29, 2021
