7. Troubleshooting, or The Agony of Defeat

When building bootdisks, the first few tries often will not boot. The general approach to building a root disk is to assemble components from your existing system, and try to get the diskette-based system to the point where it displays messages on the console. Once it starts talking to you, the battle is half over because you can see what it is complaining about, and you can fix individual problems until the system works smoothly. If the system just hangs with no explanation, finding the cause can be difficult. The recommended procedure for investigating a system that will not talk to you is as follows:

  • You may see a message like this:
    Kernel panic: VFS: Unable to mount root fs on XX:YY
    This is a common problem and it has only a few causes. First, check the device XX:YY against the list of device codes in /usr/src/linux/Documentation/devices.txt. If it is incorrect, you probably didn't do an rdev -R, or you did it on the wrong image. If the device code is correct, then carefully check the device drivers compiled into your kernel. Make sure the kernel has floppy disk, ramdisk and ext2 filesystem support built-in.

  • If you see many errors like:
    end_request: I/O error, dev 01:00 (ramdisk), sector NNN
    they are I/O errors from the ramdisk driver, usually because the kernel is trying to write beyond the end of the device: the ramdisk is too small to hold the root filesystem. Check your bootdisk kernel's initialization messages for a line like:
            Ramdisk driver initialized : 16 ramdisks of 4096K size
    Check this size against the uncompressed size of the root filesystem. If the ramdisks aren't large enough, make them larger.

  • Check that the root disk actually contains the directories you think it does. It is easy to copy at the wrong level so that you end up with something like /rootdisk/bin instead of /bin on your root diskette.

  • Check that /lib/libc.so exists on the root diskette and is the same link as the one in the /lib directory on your hard disk.

  • Check that any symbolic links in the /dev directory of your existing system also exist on your root diskette filesystem, at least for those links that point to devices you have included on the root diskette. In particular, /dev/console links are essential in many cases.

  • Check that you have included /dev/tty1, /dev/null, /dev/zero, /dev/mem, /dev/ram and /dev/kmem files.

  • Check your kernel configuration -- support for all resources required up to the login point must be built into the kernel, not compiled as modules. So ramdisk and ext2 support must be built-in.

  • Check that your kernel root device and ramdisk settings are correct.
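
The directory and device checks above are easy to automate before you write the diskette. The following is only a minimal sketch: the function name check_root_tree is hypothetical, it assumes the root diskette filesystem is mounted somewhere on your build system, and the lists of directories and devices are taken from the checklist and should be adjusted to your own layout.

```shell
# check_root_tree ROOT
# Print every checklist entry that is missing under ROOT, the directory where
# the root diskette filesystem is mounted. The directory and device lists are
# illustrative assumptions; edit them to match your own diskette.
check_root_tree() (
    # The function body is a subshell, so these variables stay private and
    # "exit" only ends the function.
    root=$1
    status=0

    # Top-level directories. A copy made at the wrong level (for example
    # ROOT/rootdisk/bin instead of ROOT/bin) shows up here.
    for d in bin dev etc lib; do
        if [ ! -d "$root/$d" ]; then
            echo "missing directory: /$d"
            status=1
        fi
    done

    # Device files and links. A symbolic link counts as present even if its
    # target cannot be resolved from the build system.
    for f in console tty1 null zero mem ram kmem; do
        if [ ! -e "$root/dev/$f" ] && [ ! -L "$root/dev/$f" ]; then
            echo "missing device: /dev/$f"
            status=1
        fi
    done

    exit "$status"
)
```

With the diskette filesystem mounted on /mnt (an assumed mount point), running check_root_tree /mnt prints nothing and returns 0 when every listed entry is present.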

Once these general aspects have been covered, here are some more specific files to check:

  1. Make sure init is included as /sbin/init or /bin/init. Make sure it is executable.

  2. Run ldd init to check init's libraries. Usually this is just libc.so, but check anyway. Make sure you included the necessary libraries and loaders.

  3. Make sure you have the right loader for your libraries -- ld.so for a.out or ld-linux.so for ELF.

  4. Check the /etc/inittab on your bootdisk filesystem for the calls to getty (or some getty-like program, such as agetty, mgetty or getty_ps). Double-check these against your hard disk inittab. Check the man pages of the program you use to make sure these make sense. inittab is possibly the trickiest part because its syntax and content depend on the init program used and the nature of the system. The only way to tackle it is to read the man pages for init and inittab and work out exactly what your existing system is doing when it boots. Check to make sure /etc/inittab has a system initialization entry. This should contain a command to execute the system initialization script, which must exist.

  5. As with init, run ldd on your getty to see what it needs, and make sure the necessary library files and loaders were included in your root filesystem.

  6. Be sure you have included a shell program (e.g., bash or ash) capable of running all of your rc scripts.

  7. If you have a /etc/ld.so.cache file on your rescue disk, remake it.
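
The ldd checks in steps 2 and 5 can be combined with a test against the root diskette tree. The sketch below is one possible way to do it, not a definitive tool: missing_libs is a hypothetical helper that reads ldd output on standard input and prints each library or loader path that does not exist under the directory given as its argument. It assumes ldd prints absolute paths, either after "=>" or as the first word on the line.

```shell
# missing_libs ROOT
# Read `ldd` output on standard input and print every absolute library or
# loader path that is not present under ROOT (the mounted root diskette tree).
missing_libs() {
    # Lines look like "libc.so.N => /lib/libc.so.N (0x...)" or, for the
    # loader, "/lib/ld-linux.so.N (0x...)". Anything without an absolute
    # path (for example "statically linked") is skipped.
    awk '$2 == "=>" && $3 ~ /^\// { print $3; next }
         $1 ~ /^\//               { print $1 }' |
    while read -r lib; do
        [ -e "$1$lib" ] || echo "$lib"
    done
}
```

A typical call would be ldd /sbin/init | missing_libs /mnt, where /mnt is an assumed mount point for the root diskette filesystem. Repeat it for your getty and any other program you copied.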

If init starts, but you get a message like:
        Id xxx respawning too fast: disabled for 5 minutes  
it is coming from init, usually indicating that getty or login is dying as soon as it starts up. Check the getty and login executables and the libraries they depend upon. Make sure the invocations in /etc/inittab are correct. If you get strange messages from getty, it may mean the calling form in /etc/inittab is wrong.
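
One quick way to narrow this down is to list the programs that init is asked to respawn and confirm that each one exists on the root diskette. The sketch below assumes a SysV-style inittab with colon-separated id:runlevels:action:process lines; other init programs use a different syntax. The function name check_respawn is hypothetical.

```shell
# check_respawn ROOT
# For every "respawn" entry in ROOT/etc/inittab, report the program if it is
# missing or not executable under ROOT.
check_respawn() {
    # Field 3 is the action and field 4 the command line; the program is the
    # first word of the command line. Comment lines are ignored.
    awk -F: '!/^#/ && $3 == "respawn" { split($4, w, " "); print $1, w[1] }' \
        "$1/etc/inittab" |
    while read -r id prog; do
        [ -x "$1$prog" ] || echo "id $id: $prog is missing or not executable"
    done
}
```

This only confirms that the binaries are there; the arguments passed to getty still need to be checked against its man page.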

If you get a login prompt, and you enter a valid login name but the system prompts you for another login name immediately, the problem may be with PAM or NSS. See Section 4.4. The problem may also be that you use shadow passwords and didn't copy /etc/shadow to your bootdisk.
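
The shadow password case can be checked mechanically. In the sketch below, check_shadow is a hypothetical helper; it relies on the convention that an "x" in the password field of /etc/passwd means the real password is stored in /etc/shadow. It does not examine PAM or NSS configuration.

```shell
# check_shadow ROOT
# Warn if ROOT/etc/passwd refers to shadow passwords but ROOT/etc/shadow
# was not copied to the root diskette tree.
check_shadow() {
    if [ ! -r "$1/etc/passwd" ]; then
        echo "missing /etc/passwd"
        return 1
    fi
    # awk exits with status 0 only if some entry has "x" as its password field.
    if awk -F: '$2 == "x" { found = 1 } END { exit !found }' "$1/etc/passwd" &&
       [ ! -e "$1/etc/shadow" ]; then
        echo "passwd uses shadow entries but /etc/shadow is missing"
        return 1
    fi
    return 0
}
```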

If you try to run some executable, such as df, which is on your rescue disk but it yields a message like: df: not found, check two things: (1) Make sure the directory containing the binary is in your PATH, and (2) make sure you have the libraries (and loaders) the program needs.
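
The PATH half of this check can be done on the build system before you boot the diskette. The following sketch uses a hypothetical function, find_on_rootdisk, that searches the root diskette tree the way the shell would search PATH. The PATH value is whatever your rescue disk's startup scripts set, so the one used in the example call is an assumption.

```shell
# find_on_rootdisk ROOT PATHLIST COMMAND
# Search ROOT for COMMAND in each directory of the colon-separated PATHLIST.
# Print the first match as it will appear on the booted diskette and return 0,
# or return 1 if the command is not found.
find_on_rootdisk() (
    # The function body is a subshell, so changing IFS and disabling
    # globbing here does not affect the caller.
    root=$1
    cmd=$3
    set -f
    IFS=:
    for dir in $2; do
        if [ -n "$dir" ] && [ -x "$root$dir/$cmd" ]; then
            echo "$dir/$cmd"
            exit 0
        fi
    done
    exit 1
)
```

For example, find_on_rootdisk /mnt /bin:/sbin:/usr/bin df reports where df would be found, if anywhere. If the binary is found but still fails with "not found" at run time, run ldd on it to see which libraries and loader it needs.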