When your Debian system fails to boot, displaying a black screen or getting stuck somewhere, it's understandable to feel anxious. At this time, the system kernel's "black box" log—the `dmesg` log—becomes your most powerful diagnostic tool. It records every important action the kernel takes from the moment power is applied, including hardware checks, driver loading, file system mounting, and all critical steps and problems encountered. Mastering how to diagnose boot failures using `dmesg` is like possessing the key to unlock the system's boot black box, allowing you to quickly pinpoint the root cause of the problem.
`dmesg` is a command-line tool used to read and control the kernel's circular buffer. This buffer has a limited size, and the latest information overwrites the oldest. Therefore, obtaining the log as soon as a boot failure occurs is crucial. If the system can still display a command-line interface, simply open a terminal and type
dmesg
to view all information. More commonly, however, the system cannot reach the desktop or a command line at all. In that case you have several ways to obtain the logs. If the system is stuck at some stage but has not completely crashed, try the key combinations `Ctrl+Alt+F2` through `F6` to switch to another virtual console and see whether you can log in and run commands. For more serious failures, you must obtain the logs from an external environment, such as booting a Debian installation USB drive or live image into "rescue mode". Note one subtlety: inside a rescue environment, `dmesg` shows the ring buffer of the rescue kernel, not of the failed boot, so you need the logs the original system persisted to disk instead. In the rescue shell, mount the original root partition and switch into it (the bind mounts also let tools such as `update-initramfs` work later):
mount /dev/sda1 /mnt
for d in dev proc sys; do mount --bind /$d /mnt/$d; done
chroot /mnt
journalctl -k -b -1 > /root/dmesg_boot.log
This saves the previous recorded boot's kernel messages to the `/root` directory of the original system for later analysis; it requires a persistent journal (use `journalctl --list-boots` to pick the right boot, or check `/var/log/kern.log` if rsyslog is installed). Sometimes you will want to compare a failing boot against a normal one; in that case, back up a copy beforehand while the system is still healthy: `dmesg > ~/dmesg_good.log`.
After obtaining the `dmesg` logs, you will face output that may be hundreds of lines long. You need to analyze them strategically, rather than reading them line by line. The first step is to find a clear error level identifier. Kernel messages have different log levels, such as "Emergency," "Alert," "Critical," "Error," "Warning," and "Notice." You can use the command `dmesg -l err,crit,alert,emerg` to directly filter out the highest level errors. A more common and faster filtering method is to use the grep command to search for keywords:
dmesg | grep -iE "error|fail|warn|invalid|unsupported|bug"
This command can quickly identify lines in the log containing errors, failures, warnings, invalid parameters, unsupported operations, or kernel bugs. After finding these lines, carefully read the context before and after them (usually 5-10 lines before and after), which will help you understand the system state when the error occurred. For example, an "I/O error" might appear after an attempt to read a disk sector, indicating a faulty hardware device.
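Reading that surrounding context can be done directly with grep's `-B`/`-A` flags instead of scrolling by hand; a minimal sketch, assuming the log was saved to a hypothetical file `dmesg_boot.log`:

```shell
# Show 5 lines of context before and after each case-insensitive match.
# dmesg_boot.log is a hypothetical saved copy of the kernel log.
grep -iE -B 5 -A 5 "error|fail|warn" dmesg_boot.log
```

The `--` separator lines that grep prints mark the boundaries between separate match groups, which makes it easy to see where one incident ends and the next begins.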
Boot failures in `dmesg` usually have relatively clear patterns. A common type is hardware identification or driver failure. During the initial boot process, the system probes all hardware. If critical hardware (such as the disk controller or the disk containing the root filesystem) cannot be recognized, or if the driver fails to load, the boot process will fail. Typical error messages might include "DRM kernel driver is missing or failed to load" (graphics card driver issue), "Unable to enumerate USB device" (USB device recognition failure), or more directly, "No bootable device." For driver issues, you can look for a failure message after "Loading module" or similar information. Another extremely common failure is filesystem-related errors. When the kernel attempts to mount the root filesystem, if it encounters a problem, explicit information will appear in the logs.
For example, "VFS: Cannot open root device" or "Please append a correct 'root=' boot option" directly tells you that the kernel cannot find the root device, which could be due to incorrect kernel parameters or a change in the disk device name (e.g., from `/dev/sda1` to `/dev/sdb1`). "EXT4-fs error" or "FAT-fs error," on the other hand, indicates a specific filesystem error, which could be disk corruption, superblock corruption, or filesystem inconsistency. These types of errors are often followed by "mount failed" or a system switching to read-only mode. The third typical failure is out-of-memory (OOM) or kernel panic. A kernel panic is a serious error that can cause the system to completely stop. `dmesg` will explicitly print "Kernel panic - not syncing:" followed by the reason, such as "Out of memory" or "Attempted to kill init". Out-of-memory failures may be preceded by numerous "page allocation failure" warnings. These problems are often related to hardware defects (bad memory modules) or driver vulnerabilities.
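For the "VFS: Cannot open root device" case in particular, it helps to compare the `root=` parameter the kernel was actually given with the block devices that are really present; a minimal sketch from a running or rescue shell:

```shell
# The parameters the running kernel was booted with (including root=...):
cat /proc/cmdline
# Block devices and filesystem UUIDs actually present, to compare against root=:
lsblk -o NAME,FSTYPE,UUID 2>/dev/null || blkid
```

If `root=` names a device or UUID that does not appear in the second command's output, the kernel parameter (or the bootloader configuration that generates it) is what needs fixing.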
Besides searching for errors directly, analyzing the `dmesg` timeline can also pinpoint the failure point. Using `dmesg -T` adds human-readable timestamps to each log entry, letting you see exactly when the boot process slowed down or stopped. If the system hangs at a certain stage, a long gap in timestamps after the last few normal messages itself marks the approximate window in which the failure occurred; examine all the messages around that point in time.
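Rather than eyeballing timestamps, you can compute the gaps mechanically. A sketch over a saved log with the default `[ seconds.micros ]` prefixes; `dmesg_boot.log` is a hypothetical filename, and the 5-second threshold is an arbitrary choice:

```shell
# Print any message preceded by a gap of more than 5 seconds.
# Splitting on "[" and "]" puts the numeric timestamp in field 2.
awk -F'[][]' 'NR > 1 && $2 - prev > 5 {
    printf "gap of %.1fs before: %s\n", $2 - prev, $0
} { prev = $2 }' dmesg_boot.log
```

The messages flagged this way are usually the first thing that ran after the stall, which points straight at the subsystem that was blocking.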
Another useful technique is to focus on the order of subsystem initialization. A normal boot log will show a clear process: [Time] Kernel unpacks itself -> Probes CPU and memory -> Initializes PCI bus and detects devices -> Loads driver modules -> Attempts to mount root filesystem -> Starts user-space init process. If your log abruptly ends after a certain stage, such as after "Trying to unpack rootfs image..." without "Freeing initrd memory", the problem is likely with initramfs processing. For complex problems, comparing the logs of two boots (one normal, one faulty) is very effective. You can use the `diff` tool to compare two files:
diff dmesg_good.log dmesg_bad.log | less
The differences are likely the key to the problem.
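One practical wrinkle: a raw `diff` of two boots is dominated by timestamp differences, since every line's `[ seconds ]` prefix changes from boot to boot. Stripping that prefix first leaves only the message text (a sketch, reusing the saved logs from earlier):

```shell
# Remove the "[ seconds ]" prefix so diff compares only the message text.
sed 's/^\[[^]]*\] //' dmesg_good.log > /tmp/good.txt
sed 's/^\[[^]]*\] //' dmesg_bad.log  > /tmp/bad.txt
diff /tmp/good.txt /tmp/bad.txt | less
```

Lines prefixed with `>` in the output appear only in the failing boot and are the first candidates to investigate.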
Mastering diagnostic approaches for specific scenarios can improve efficiency. For example, if your system fails to boot after updating the kernel or installing new hardware, you should focus on checking whether the new kernel module was loaded successfully or whether there are driver conflicts for the new hardware. You can check the logs for detection records related to new hardware (such as "NVIDIA GPU" or "USB 3.0 controller"). A common reason for system boot failures after an update is that the initramfs image is not correctly updated to match the new kernel. In `dmesg`, this might manifest as a "Device or resource busy" error or an inability to find the expected disk UUID when mounting the root filesystem. In this case, rebuilding the initramfs from the rescue environment might resolve the issue:
update-initramfs -u -k all
Another example is if you suspect a kernel parameter is causing the problem. You can temporarily modify the parameter in the bootloader (such as GRUB) and then compare the changes in the `dmesg` log to see if the error disappears or changes.
Once you've located the specific error message using `dmesg`, the troubleshooting path becomes clear. If it's a clear hardware error, such as "SATA link down" or "SMART error," check the hardware connections or consider replacing the drive. If it's a driver issue, you can try adding kernel parameters at boot (such as `nomodeset` to disable kernel mode setting for the graphics driver) or installing the appropriate driver from the rescue environment. For filesystem errors, the `fsck` command is your preferred tool in a rescue environment:
fsck -y /dev/sda1
Be sure to verify the device name before running. If the problem is related to initramfs, rebuilding it is often effective, as mentioned earlier. When encountering tricky, hard-to-locate problems, trying to boot with an earlier, working kernel version is an effective rollback strategy.
Finally, good habits prevent problems before they occur. Regularly back up important system configuration files (such as `/etc/fstab`, `/etc/default/grub`) and known working `dmesg` logs. Consider configuring `systemd-journald` or `rsyslog` to persistently save kernel logs to disk, so that even if the system fails to boot completely, you can find previous boot logs in the `/var/log` directory by mounting the disk.
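With `systemd-journald` (Debian's default), persistence is a one-line setting; a minimal sketch of the relevant configuration fragment:

```
# /etc/systemd/journald.conf
[Journal]
Storage=persistent
```

After creating the directory with `mkdir -p /var/log/journal` and restarting `systemd-journald`, kernel messages from earlier boots become available via `journalctl -k -b -1`, and from a rescue environment you can point journalctl at the mounted disk with `journalctl -D /mnt/var/log/journal`.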