Hi, I think I finally found out what killed my webserver the last time. I've installed a watchdog and now I have something in my log:

Jan 18 20:40:24 websrv lvcreate: page allocation failure. order:0, mode:0xd0
Jan 18 20:40:24 websrv Call Trace:
Jan 18 20:40:24 websrv [] __alloc_pages+0x2ee/0x350
Jan 18 20:40:24 websrv [] client_alloc_pages+0x31/0x80
Jan 18 20:40:24 websrv [] kcopyd_client_create+0x55/0xb0
Jan 18 20:40:24 websrv [] dm_create_persistent+0xb8/0x130
Jan 18 20:40:24 websrv [] snapshot_ctr+0x29b/0x380
Jan 18 20:40:24 websrv [] dm_table_add_target+0x116/0x190
Jan 18 20:40:24 websrv [] populate_table+0x80/0xe0
Jan 18 20:40:24 websrv [] table_load+0x5e/0x130
Jan 18 20:40:24 websrv [] ctl_ioctl+0xe4/0x170
Jan 18 20:40:24 websrv [] table_load+0x0/0x130
Jan 18 20:40:24 websrv [] sys_ioctl+0xf4/0x2b0
Jan 18 20:40:24 websrv [] sysenter_past_esp+0x52/0x71
Jan 18 20:40:24 websrv
Jan 18 20:40:24 websrv device-mapper: Could not create kcopyd client
Jan 18 20:40:24 websrv device-mapper: error adding target to table
Jan 18 20:40:24 websrv found reiserfs format "3.6" with standard journal
Jan 18 20:40:25 websrv Reiserfs journal params: device dm-9, size 8192, journal first block 18, max trans len 1024, max batch 900, max commit age 30, max trans age 30
Jan 18 20:40:25 websrv reiserfs: checking transaction log (dm-9) for (dm-9)
Jan 18 20:40:27 websrv reiserfs: replayed 31 transactions in 2 seconds
Jan 18 20:40:27 websrv Using r5 hash to sort names
Jan 18 20:40:27 websrv EXT3-fs: INFO: recovery required on readonly filesystem.
Jan 18 20:40:27 websrv EXT3-fs: write access will be enabled during recovery.
Jan 18 20:40:28 websrv kjournald starting. Commit interval 5 seconds
Jan 18 20:40:28 websrv EXT3-fs: recovery complete.
Jan 18 20:40:28 websrv EXT3-fs: mounted filesystem with ordered data mode.
Jan 18 20:41:48 websrv watchdog[9580]: loadavg 25 7 2 is higher than the given threshold 24 18 12!
Jan 18 20:41:48 websrv SOFTDOG: WDT device closed unexpectedly. WDT will not stop!
Jan 18 20:41:48 websrv watchdog[9580]: shutting down the system because of error -3
Jan 18 20:41:50 websrv sshd[11779]: Accepted password for achim from ::ffff:213.23.24.241 port 4290
Jan 18 20:41:58 websrv serio: kseriod exiting
Jan 18 20:41:58 websrv syslog-ng[7200]: syslog-ng version 1.6.0rc3 going down

So kcopyd fails to allocate its buffers (lvcreate -s ...), the load then shoots up, and the watchdog shuts the machine down. The backup script tries to remove all snapshots when something fails. The script logged this:

>>>>>>>> Sun Jan 18 20:40:20 CET 2004 -------------------------------
Logical volume "snap-root" created
Logical volume "snap-boot" created
device-mapper ioctl cmd 9 failed: Cannot allocate memory (ENOMEM)
Couldn't load device 'vg-snap--home'.
Problem reactivating origin home
device-mapper ioctl cmd 6 failed: Invalid argument (EINVAL)
Couldn't resume device 'vg-snap--home'
Aborting. Failed to activate snapshot exception store. Remove new LV and retry.
Logical volume "snap-var" created
mount: device file /dev/vg/snap-home does not exist (ENOENT)
Logical volume "snap-boot" successfully removed
Logical volume "snap-root" successfully removed
Logical volume "snap-portage" successfully removed
Logical volume "snap-var" successfully removed

Last time the script probably failed to reactivate the root volume as well, which is why my log was empty and everything was dead. I'm going to look into this later, but you might want to know.
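For reference, the failing path in the trace above is kcopyd_client_create() filling its page pool via client_alloc_pages() with order-0 allocations (mode 0xd0 is GFP_KERNEL on 2.6): if a single page allocation fails, the whole client creation is rolled back and the table-load ioctl comes back with ENOMEM, which lines up with the "ioctl cmd 9 failed" line in the script log. Here is a small userspace sketch of that all-or-nothing pattern; alloc_buffer_pool() and the pool size are made up for illustration, with malloc() standing in for alloc_page():

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

#define POOL_BUFFERS 512        /* made-up pool size */
#define BUF_SIZE     4096       /* one page */

/* Grab the whole pool up front; if any single allocation fails,
 * release what we already have and report -ENOMEM so the caller
 * fails as a unit, just like "Could not create kcopyd client". */
static int alloc_buffer_pool(void *pool[], unsigned int nr, size_t size)
{
    unsigned int i;

    for (i = 0; i < nr; i++) {
        pool[i] = malloc(size);  /* stands in for alloc_page(GFP_KERNEL) */
        if (!pool[i]) {
            while (i--)          /* undo the partial allocation */
                free(pool[i]);
            return -ENOMEM;      /* what lvcreate's ioctl ultimately sees */
        }
    }
    return 0;
}

int main(void)
{
    void *pool[POOL_BUFFERS];
    unsigned int i;

    if (alloc_buffer_pool(pool, POOL_BUFFERS, BUF_SIZE) < 0) {
        fprintf(stderr, "Could not create kcopyd client\n");
        return 1;
    }
    /* ... copy work would happen here ... */
    for (i = 0; i < POOL_BUFFERS; i++)
        free(pool[i]);
    return 0;
}

The point is that one failed order-0 allocation under memory pressure is enough to make the whole lvcreate -s fail, and the cleanup path then has to cope with a half-built snapshot.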
I've also thought a lot about the snapshot/origin map functions and the read path, and I can't think of a case where doing this is wrong either. The filesystems will never try to read blocks that are currently being flushed. Most elevators will put reads in front of writes anyway, and since dm is also something like an I/O scheduler, we are allowed to do this. A lot of stress testing with different filesystems (creating and removing snapshots while doing heavy I/O on them) didn't show any problems.

Barriers will have to be handled separately anyway; something like deferring all incoming bios when a barrier is encountered while there are pending exceptions, along the lines of the sketch below.
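To make that concrete, here is a compilable toy model of the deferral idea. It is plain userspace C with made-up types and a fixed-size FIFO; the real thing would queue struct bio inside the snapshot target and resubmit from the exception-completion path:

#include <stdio.h>

/* Toy stand-ins; these are NOT the kernel's types.  "barrier" marks
 * a bio that must not overtake, or be overtaken by, earlier writes. */
struct bio {
    int id;
    int barrier;
};

#define MAX_DEFERRED 64

struct snapshot {
    int pending_exceptions;             /* exceptions not yet on disk */
    int barrier_queued;                 /* a barrier is being held back */
    struct bio *deferred[MAX_DEFERRED]; /* FIFO of held-back bios */
    int n_deferred;
};

static void submit(struct bio *bio)
{
    printf("submit bio %d%s\n", bio->id, bio->barrier ? " (barrier)" : "");
}

/* Map path: once a barrier shows up while exceptions are pending,
 * hold it back, and hold back everything behind it too so ordering
 * is preserved. */
static void snapshot_map_sketch(struct snapshot *s, struct bio *bio)
{
    if (s->barrier_queued ||
        (bio->barrier && s->pending_exceptions > 0)) {
        if (bio->barrier)
            s->barrier_queued = 1;
        s->deferred[s->n_deferred++] = bio;
        return;
    }
    submit(bio);
}

/* Completion path: when the last pending exception hits disk,
 * resubmit the deferred bios in their original order. */
static void exception_flushed_sketch(struct snapshot *s)
{
    int i;

    if (--s->pending_exceptions > 0)
        return;
    for (i = 0; i < s->n_deferred; i++)
        submit(s->deferred[i]);
    s->n_deferred = 0;
    s->barrier_queued = 0;
}

int main(void)
{
    struct snapshot s = { .pending_exceptions = 1 };
    struct bio a = { 1, 0 }, b = { 2, 1 }, c = { 3, 0 };

    snapshot_map_sketch(&s, &a);  /* no barrier pending: goes straight through */
    snapshot_map_sketch(&s, &b);  /* barrier with a pending exception: deferred */
    snapshot_map_sketch(&s, &c);  /* behind the barrier: deferred too */
    exception_flushed_sketch(&s); /* flush done: b and c are submitted in order */
    return 0;
}

Running it submits bio 1 immediately and holds bios 2 and 3 back until the pending exception has been flushed.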