Processing large files in Scala


Part of my research lately has been analyzing the heap behavior of several network applications. This involves logging every malloc, free, load, and store that occurs while a program is executing, then running a number of analysis tools on the log later.

I decided to write my log processing tools in Scala, since the combination of pattern matching and good data structures allows me to write new analysis tools very quickly. While writing these tools, I experienced a lot of performance problems.

Some of the logs can get quite large, on the order of gigabytes. Here are some of the techniques I used to make things run at an acceptable speed.

Increase the maximum heap size

By default, Scala sets the maximum heap size to 256MB. This is not nearly enough if you have any reasonably large data set that you want to hold in memory. If you exceed this limit, you will get an OutOfMemoryError, and your program will crash. There will also be substantial garbage collection overhead as you approach the limit.

With Java, you can increase the maximum heap size by passing the command line option -Xmx followed by the maximum size of the heap. So to start a Scala program with a 4GB heap, you would run it like this:
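(Here MyProgram stands in for your main class; the scala launcher forwards options prefixed with -J directly to the underlying JVM.)

    scala -J-Xmx4g MyProgram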

Use memory mapped files

If you are reading individual bytes from a binary file with a FileInputStream, you are going to have terrible performance. Every time you read from the stream, Java will make a system call, which is expensive. This can be mitigated by wrapping the stream in a BufferedInputStream, but you still need to make a number of system calls proportional to the size of the file. Java lets you map the contents of a file directly into memory using FileChannel and MappedByteBuffer from the java.nio package.

This works like the mmap system call in C. The file's contents will appear in your address space, but since they won't be on the heap, they will not cause any additional garbage collection overhead. Once a file is mapped, you can read data from it using the methods in ByteBuffer without making any additional system calls.
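You can map a file like this (a minimal sketch; the file name is a placeholder, and a single call to map can cover at most 2GB, as discussed below):

    import java.io.RandomAccessFile
    import java.nio.channels.FileChannel

    // Open the log read-only and map its entire contents into memory.
    val file = new RandomAccessFile("events.log", "r")
    val buffer = file.getChannel.map(FileChannel.MapMode.READ_ONLY, 0, file.length)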

When you read from a part of the file that hasn't been read before, the kernel will load that section of the file automatically. There are a couple of caveats, though. First, each mapping can only cover 2GB of a file. This is apparently because ByteBuffer uses signed 32-bit integers for offsets and positions. Use multiple mappings if you need to map a larger file, one for each 2GB chunk.
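A sketch of what those multiple mappings might look like (ChunkSize and the chunking scheme here are my own, not from the original):

    import java.io.RandomAccessFile
    import java.nio.channels.FileChannel

    val ChunkSize = Int.MaxValue.toLong  // just under 2GB per mapping

    val channel = new RandomAccessFile("events.log", "r").getChannel

    // One read-only mapping per chunk; the last chunk may be shorter.
    val chunks =
      for (start <- 0L until channel.size by ChunkSize)
        yield channel.map(FileChannel.MapMode.READ_ONLY, start,
                          math.min(ChunkSize, channel.size - start))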

Second, there is no way to manually unmap a file. Unmapping occurs automatically when the ByteBuffer object gets garbage collected, but there is no way to control when that occurs. Because of both of these caveats, I would highly recommend running a 64-bit JVM to avoid exhausting your virtual address space.

Make Java do endian conversion for you

Most computers are based on the x86 architecture, which means that binary values in your data are probably in little-endian format (least significant byte first). By default, Java expects file data to be in big-endian format, so you would normally have to run a bit of code to swap bytes every time you read an integer. Fortunately, ByteBuffer can do the conversion for you: set the buffer's byte order with the order method, and methods like getInt and getLong will return correctly converted values. Chances are, the JVM will still do some byte order conversions internally, but these should be on highly optimized code paths, i.e., much cheaper than swapping bytes in your own code.
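With the buffer from the mapping example above, that looks like this (reading a long at offset 0 is just an illustration):

    import java.nio.ByteOrder

    // Declare the buffer's contents little-endian once; afterwards
    // getInt, getLong, etc. return correctly converted values.
    buffer.order(ByteOrder.LITTLE_ENDIAN)
    val firstWord = buffer.getLong(0)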

Use streams instead of lists or arrays

When I first wrote my data processing tools, I followed a very simple pattern: read the entire log into a list in memory, then run each analysis pass over that list. Following this pattern, my programs would read the log quickly at first but would get slower and slower as more data was put on the heap, eventually grinding to a halt. The data processing passes wouldn't even get to run because my programs would run out of memory. Since many of my tools only need to make one pass over the log, it made more sense to present the sequence of events as a stream. Streams in Scala are like lists, but they don't evaluate their contents until requested.

This means a program can read and process data at the same time. Memory usage is kept at a fixed level, since new events aren't read until they are needed, and old events can be garbage collected. If you need to make multiple passes over a large data set, streams are still useful. Even though old events may have been garbage collected, reading them a second time will be fast, since the file will probably still be in the kernel's buffer cache.
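The original stream implementation isn't reproduced here, but a minimal sketch of the idea might look like this (the Event layout, a 1-byte kind plus an 8-byte address, and all names are assumptions; buffer is the mapped ByteBuffer from earlier):

    import java.nio.ByteBuffer

    // Hypothetical fixed-size record: a 1-byte kind plus an 8-byte address.
    case class Event(kind: Byte, address: Long)

    // A lazy, list-like view of the log: nothing beyond `pos` is decoded
    // until it is actually demanded.
    class EventStream(buffer: ByteBuffer, pos: Int) {
      def isEmpty: Boolean = pos >= buffer.limit
      lazy val head: Event = Event(buffer.get(pos), buffer.getLong(pos + 1))
      lazy val tail: EventStream = new EventStream(buffer, pos + 9)

      // One pass over the log; earlier cells become garbage as we advance.
      def foreach(f: Event => Unit): Unit = {
        var s = this
        while (!s.isEmpty) { f(s.head); s = s.tail }
      }
    }

    // Initialize at the start of the buffer and make a single pass.
    val log = new EventStream(buffer, 0)
    log.foreach(e => println(e.address))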

The event and the position of the next event in the buffer are computed lazily. The event is returned by head. The rest of the stream is generated in tail by creating a new EventStream at the position of the next event.

Double-check your code for slow areas

When I write code, I usually strive for simplicity and readability over performance. This is not always a best practice, especially when you are running a very simple, readable O(n²) algorithm on a 30 million element data set. If you can switch from an O(n) container to one that is O(lg n) or O(1), it will probably boost your performance significantly. The extra complication may be worth it.
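For instance, consider a pass that repeatedly looks up allocations by address (a hypothetical example, not from the original):

    case class Allocation(address: Long, size: Long)

    val allocations: Vector[Allocation] =
      Vector(Allocation(0x1000L, 64), Allocation(0x2000L, 128))

    // O(n) per lookup: a linear scan over the whole collection.
    def slowLookup(addr: Long) = allocations.find(_.address == addr)

    // O(1) expected per lookup after building a hash map index once.
    val byAddress = allocations.map(a => a.address -> a).toMap
    def fastLookup(addr: Long) = byAddress.get(addr)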
