Using a straight log file to journal adds some complexity. I want to load the data into the database fairly quickly and then purge the temporary store; however, if the database is down, I want to be able to queue up a good number of records before stressing about it. With a log file I'd have to wait for the file to be written out, which means a delay between when data is available and when I can use it, and loss on a crash gets more severe the larger the file grows. Ultimately, we can skip all the log-file handling logic and write one file per datum. We'll additionally put this filesystem in its own partition to avoid "cross-contamination" with the rest of the system.
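As a rough sketch of what "one file per datum" means in practice (the path, naming scheme, and sample record below are placeholders, not the final layout): each sample lands in its own file, written under a temporary name and then renamed, so a crash never leaves a half-written sample visible.

# minimal sketch of writing one sample per file
DIR=/n/powerlog2                 # data partition mount point (placeholder)
SAMPLE="$(date) 12.3V 1.2A"      # stand-in for a real ~200-byte record
NAME=$(date +%s).$$              # one file per sample: timestamp + pid
printf '%s\n' "$SAMPLE" > "$DIR/.tmp.$NAME" && mv "$DIR/.tmp.$NAME" "$DIR/$NAME"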
There are a few catches:
- Not all file systems like large numbers of files.
- Some tweaking needs to be done to "optimize" the layout.
- We really don't want the system managing what we've created.
Unix-like file systems tend to use "dynamic" structures to hold the data. I say "dynamic" in that the number isn't fixed purely by the size of the disk, but nor is it flexible while the system is running. The short version, without going into too much depth, is that these structures are called inodes, and you need one per file, plus extra block-pointer overhead once a file spans a lot of blocks (or fragments).
Which brings us to the second bit of flexibility: the chunks of disk that our file actually takes up. The filesystem allocates and tracks space in these chunks, so their size is a trade-off: too small and we burn through the accounting structures too quickly; too large and we waste disk space.
Now, some versions of the utility that builds filesystems on unix-likes have a couple of handy options. newfs(8), on some versions, has -g and -h:
-g average file size
-h average number of files per directory
Unfortunately, this seems to be broken on NetBSD 5.0.2. We'll just have to do it the old-fashioned way:
-f frag size
-b block size
-n number of inodes
I figure I want to be able to keep a couple million samples around, and my files are about 200 bytes each**. Now, the smallest allowed fragment and block sizes are 512 and 4096, so that's what we are stuck with. Pulling a number out of the air, I decide to go with about a 2 GB partition. Let's see where that lands us:
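The back-of-the-envelope arithmetic I'm working from (mine, not anything newfs computes for you):

~2 GB of 512-byte fragments ≈ 4.3 million fragments
one ~200-byte file ≈ 1 fragment + 1 inode
so ask for 4194304 (2^22) inodes, roughly one per fragment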
RedQueen# newfs -f 512 -b 4096 -n 4194304 /dev/rwd0n
/dev/rwd0n: 2126.2MB (4354560 sectors) block size 4096, fragment size 512
using 296 cylinder groups of 7.18MB, 1839 blks, 14176 inodes.
super-block backups (for fsck_ffs -b #) at:
32, 14744, 29456, 44168, 58880, 73592, 88304, 103016, 117728, 132440, 147152, 161864,
RedQueen# df -i /n/powerlog2
Filesystem 1K-blocks Used Avail %Cap iUsed iAvail %iCap Mounted on
/dev/wd0n 1649195 0 1566735 0% 1 4196093 0% /n/powerlog2
OK, that looks good for inodes. Now, let's run a little test that drops 100k 200-byte files into the filesystem. This will actually take a while on the test system; the initial rate is about 100 files/s (more on this later).
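The test itself is nothing fancy; something along these lines (the path and payload are placeholders):

# sketch of the test: 100,000 files of ~200 bytes each
PAYLOAD=$(dd if=/dev/zero bs=200 count=1 2>/dev/null | tr '\0' 'x')
i=0
while [ $i -lt 100000 ]; do
    printf '%s' "$PAYLOAD" > /n/powerlog2/test.$i
    i=$((i + 1))
done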
RedQueen# df -i /n/powerlog2
Filesystem 1K-blocks Used Avail %Cap iUsed iAvail %iCap Mounted on
/dev/wd0n 1649195 51564 1515171 3% 100001 4096093 2% /n/powerlog2
We are now making somewhat efficient use of the disk space (even though we are abusing the filesystem).
We've turned on softdeps to get some better performance. async is a little risky, and I'd want to do more stress tests on wapbl before committing to that.
Which brings us to the third catch: normally, unix-like systems like to keep track of what's going on, and here that works to our disadvantage.
We're going to take some control of our filesystem. Should the system crash, we don't want boot held up by a lengthy repair of millions of inodes. First, let's pull the boot control away from the system.
RedQueen# grep wd0n /etc/fstab
/dev/wd0n /n/powerlog2 ffs rw,noatime,softdep,noauto 0 0
The last 0 tells fsck NOT to process this filesystem during its boot run, and 'noauto' means the system shouldn't attempt to mount it either. We've now got control of the filesystem during boot.
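With that in place, checking and mounting is done by hand (or by whatever script manages the data); something along the lines of:

RedQueen# fsck_ffs -p /dev/rwd0n && mount /n/powerlog2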
The other control we want is over the system's periodic scans. There are 4 scripts that operate on a periodic basis, only 3 of which we need to concern ourselves with (/etc/monthly doesn't do much).
/etc/weekly maintains the locate database, which indexes the entire system. You can either turn this off (/etc/rc.conf: rebuild_locatedb=NO) or disable indexing of our directories. The latter is controlled in /etc/locate.conf by adding:
ignorecontents /n/powerlog*
/etc/daily potentially sends two interesting things our way. An fsck scan (run_fsck), which would be skipped anyway by the '0' above, is disabled by default. A search for core files (find_core) *is* enabled and has no exclusion mechanism. A quick fix is to add "find_core=NO" to /etc/daily.conf. Eventually, an exclusion path similar to locate's, or to the security check below, should be added to the code.
/etc/security checks on a lot of the system's state, but is fairly well behaved. It does have one scan that attempts to traverse the entire system: check_devices, which scans for setuid and device files and reports new/missing ones. It's on by default, but has a mechanism to exclude paths. Add
check_devices_ignore_path="/n/powerlog2"
to /etc/security.conf.
This does mean that one will need to manage these directories oneself: checking integrity and mounting. However, this has the advantage that these chores can run while the rest of the system is live. Depending on one's usage model, one could keep a spare partition ready to capture data while the old partition is maintained, or, if data retention is not needed, just newfs the partition again. Use will be application dependent.
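For the no-retention case, the turnaround is just a sketch like the following (the spare-partition variant would instead mount a second, already-built filesystem in its place):

RedQueen# umount /n/powerlog2
RedQueen# newfs -f 512 -b 4096 -n 4194304 /dev/rwd0n
RedQueen# mount /n/powerlog2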
---
* Jan 1, 1970
** Storing data in symlinks is possible if the target data is short enough:
dinode.h:#define MAXSYMLINKLEN_UFS1 ((NDADDR + NIADDR) * sizeof(int32_t))
dinode.h:#define MAXSYMLINKLEN_UFS2 ((NDADDR + NIADDR) * sizeof(int64_t))
dinode.h:#define NDADDR 12
dinode.h:#define NIADDR 3
So, 60 bytes for UFS1 and 120 for UFS2. If you have to split the data across multiple files, program complexity grows and you incur greater lookup time on add/delete/open.
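For what it's worth, the trick itself needs nothing special; with a made-up record as the example:

RedQueen# ln -s "1262304000 12.3V 1.2A" sample.0   # datum lives in the link target, inside the inode
RedQueen# readlink sample.0
1262304000 12.3V 1.2A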