Wednesday, 28 March 2018

Understanding stdlib FileIO Caching



This post explains how stdlib FIO (file I/O) caching is implemented.

FIO in stdlib uses its own caching technique to improve performance. It supports three buffering modes:


Line buffering (__SLBF) - characters are transmitted to the system as a block when a new-line character is encountered. Line buffering is meaningful only for text streams.
Full buffering (__SFBF) - characters are transmitted to the system as a block when the buffer is filled.
No buffering (__SNBF) - characters are transmitted to the system as they are written.

How buffering is achieved:-
FIO in stdlib provides buffered IO through in-process caching: each FILE object maintains its own local buffer. The FILE object uses several data members to track the caching state, e.g. fp->_w, fp->_bf._base, fp->_bf._size and fp->_p, to name a few.
Based on the type of caching, the necessary data members are initialized during the first read of the file. At that time the required file data is cached from the disk into the buffer. All buffered reads/writes made later operate on this local buffer.
The buffer is written to disk only when it is full or when the user explicitly flushes the data, for example with fflush() or fclose().

Every call to fread and fwrite first checks whether the stream is buffered, and based on that it either updates the local buffer or reads/writes directly using the system read/write functions.
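
To make the mechanism above concrete, here is a minimal sketch of a write-side buffer. It is not the stdlib implementation: toy_stream and all toy_* functions are hypothetical names, and the fields only loosely echo fp->_p, fp->_w and fp->_bf. It assumes a POSIX system with write(), and it skips details such as retrying on EINTR.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical toy stream; the fields only echo fp->_p, fp->_w and fp->_bf. */
struct toy_stream {
    int fd;                 /* underlying file descriptor */
    int unbuffered;         /* nonzero: bypass the local buffer */
    unsigned char *base;    /* start of the buffer (compare _bf._base) */
    size_t size;            /* buffer capacity (compare _bf._size) */
    unsigned char *p;       /* next free byte in the buffer (compare _p) */
    size_t w;               /* bytes of space left (compare _w) */
    unsigned long syscalls; /* write() calls issued, kept for illustration */
};

static void toy_init(struct toy_stream *s, int fd,
                     unsigned char *buf, size_t size, int unbuffered)
{
    assert(s != NULL);
    s->fd = fd;
    s->unbuffered = unbuffered;
    s->base = buf;
    s->size = size;
    s->p = buf;
    s->w = size;
    s->syscalls = 0;
}

/* Push n bytes to the kernel, looping over short writes. Returns 0 or -1. */
static int toy_write_all(struct toy_stream *s, const unsigned char *data, size_t n)
{
    while (n > 0) {
        ssize_t r = write(s->fd, data, n);
        s->syscalls++;
        if (r <= 0)
            return -1;
        data += r;
        n -= (size_t)r;
    }
    return 0;
}

/* Write out whatever is pending in the buffer and reset it. */
static int toy_flush(struct toy_stream *s)
{
    size_t pending = (size_t)(s->p - s->base);
    if (pending > 0 && toy_write_all(s, s->base, pending) != 0)
        return -1;
    s->p = s->base;
    s->w = s->size;
    return 0;
}

/* Check the buffering mode first, then either write directly or
 * copy into the local buffer and flush only when it becomes full. */
static int toy_write(struct toy_stream *s, const void *data, size_t n)
{
    const unsigned char *src = data;
    if (s->unbuffered || s->size == 0)
        return toy_write_all(s, src, n);
    while (n > 0) {
        size_t chunk = n < s->w ? n : s->w;
        memcpy(s->p, src, chunk);
        s->p += chunk;
        s->w -= chunk;
        src += chunk;
        n -= chunk;
        if (s->w == 0 && toy_flush(s) != 0)
            return -1;
    }
    return 0;
}
```

With a 4-byte buffer, ten 1-byte toy_write() calls issue only two write() system calls, plus one more on the final toy_flush(); in unbuffered mode the same ten calls issue ten system calls.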

Other Related Information:-
1. By default, full buffering (__SFBF) is used for FIO on regular files. We mostly deal with the full buffering mode.

2. We can use the setvbuf() and setbuf() library functions to control buffering. After opening a stream (but before any other operation has been performed on it), we can explicitly specify the kind of buffering we need by passing one of these mode constants to setvbuf():
_IOFBF (for full buffering),
_IOLBF (for line buffering), or
_IONBF (for unbuffered input/output).

3. In case of unbuffered IO (__SNBF), we make direct system-level FIO calls, somewhat similar to what we were doing in fileio.c build 9110.
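
As an illustration of point 2, the sketch below selects a buffering mode right after opening a stream. open_with_buffering is a hypothetical helper, not part of stdlib; it passes NULL as the buffer so that the library allocates its own, and some implementations may ignore the size argument in that case.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: open `path` for writing and choose the buffering
 * mode before any other operation is performed on the stream. */
static FILE *open_with_buffering(const char *path, int mode, size_t size)
{
    assert(mode == _IOFBF || mode == _IOLBF || mode == _IONBF);

    FILE *fp = fopen(path, "w");
    if (fp == NULL)
        return NULL;

    /* NULL buffer: let the library allocate one. setvbuf returns 0 on success. */
    if (setvbuf(fp, NULL, mode, size) != 0) {
        fclose(fp);
        return NULL;
    }
    return fp;
}
```

For example, open_with_buffering("out.log", _IONBF, 0) would give a stream where each write goes straight to the system, while _IOFBF keeps small writes in the local buffer until it fills or is flushed.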

Deductions:-
  1. With stdlib FIO, the performance improvement we get comes from its local caching, but the size of the gain depends entirely on our usage pattern.
  2. We could have leveraged stdlib unbuffered IO for our purpose, but fflush() only issues the write() system call (it does not call fsync()), which does not guarantee that the data is written to disk. One option is to write our own fflush() that calls fsync() after calling the stdlib fflush(). This way we can guarantee that file writes reach the disk.
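
The wrapper suggested in point 2 could look roughly like the sketch below. flush_to_disk is a hypothetical name, and the code assumes a POSIX system where fileno() and fsync() are available.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical wrapper: returns 0 on success, -1 on failure. */
static int flush_to_disk(FILE *fp)
{
    assert(fp != NULL);

    /* Step 1: move data from the stdlib buffer to the kernel. */
    if (fflush(fp) != 0)
        return -1;

    /* Step 2: ask the kernel to push its cached data to the storage device. */
    int fd = fileno(fp);
    if (fd < 0)
        return -1;
    return fsync(fd) == 0 ? 0 : -1;
}
```

Calling fsync() on every flush is expensive, so a wrapper like this is probably best reserved for the points where durability really matters.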

Please share your thoughts and doubts in comments.

*Excuse the poor formatting.