Server Down: What to Do When My Server Crashes?
A failed server is a fairly common occurrence, but because there are so many different types of servers, no single solution fits every server crash. So, a little disclaimer: providing solutions to every type of server crash is not within the scope of this article.
What this tutorial does provide are useful preliminary fixes that apply across server crashes in general. More often than not, these are all you need to bring a downed server back up. So if your server has crashed, it is advisable to try these common fixes first.
In every server crash (as with any other problem), you need to identify the cause first and then find a solution accordingly. Although server crashes cannot be completely avoided, this article also provides a few tips to lower the likelihood of one.
Server down: How to fix it
Step 1: Identify the root cause
View the symptoms
The symptoms can provide important clues to what exactly is wrong with the server, although these diagnoses might not hold in every case. Some common symptoms and their possible causes are provided in the table below.
|Symptom|Possible cause|Basic checks|
|---|---|---|
|The server does not power up|The server hardware has an issue. Most probably, the server power supply has failed.|Before concluding that the power supply has failed, do some basic verifications: check that the system is plugged in and receiving power from the outlet, the surge protector, or the UPS.|
|The server boots but the screen shows the Blue Screen of Death (BSOD)|Hardware failure or device failure. If a new driver was installed recently, the crash could be related to that particular driver.| |
|The server starts and the operating system (OS) loads, but some critical services do not start|Causes vary depending on a number of factors.| |
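The table can also be read as a short decision sequence. The sketch below is only an illustration of that reading: the function name `triage` and its three yes/no inputs are made up for this example, and the returned text simply repeats the possible causes from the table.

```python
def triage(powers_up: bool, blue_screen: bool, services_ok: bool) -> str:
    """Map an observed symptom to the possible cause listed in the table."""
    if not powers_up:
        return ("Hardware issue, most probably a failed power supply "
                "(check the outlet, surge protector and UPS first)")
    if blue_screen:
        return "Hardware or device failure, possibly a recently installed driver"
    if not services_ok:
        return "Critical services did not start; causes vary, check each component"
    return "No symptom from the table matched"


# Example: the server powers up but stops at a blue screen.
print(triage(powers_up=True, blue_screen=True, services_ok=False))
```

The checks run in the same order as the table rows, so an earlier symptom (no power) hides the later ones.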
Decipher the blue screen
Deciphering the blue screen can seem like a difficult and intimidating task, but the screen follows a particular structure and can provide important clues to what is wrong with the server. It is important to understand the structure and content of the blue screen (a brief description is provided below).
Usually, the blue screen occurs when you attempt to start the machine in safe mode and it does not start. The blue screen has four parts, organized in the following order:
Actual error message: There are a number of error messages that could appear on the screen based on the type of error. For example, the image below shows the error message
Other examples of error messages are given below:
KMODE_EXCEPTION_NOT_HANDLED indicates an incorrectly configured device driver.
REGISTRY_ERROR indicates a serious problem in the registry.
INACCESSIBLE_BOOT_DEVICE indicates that the OS is unable to read from the hard disk.
UNEXPECTED_KERNEL_MODE_TRAP indicates a problem with the memory.
BAD_POOL_HEADER is hard to decipher, but it indicates that the issue has something to do with a recent change in the system.
NTFS_FILE_SYSTEM indicates a corrupted hard disk.
KERNEL_DATA_INPAGE_ERROR indicates that the OS was unable to read a page of kernel data from the page file.
NMI_HARDWARE_FAILURE indicates that the hardware abstraction layer was unable to identify the cause of the error.
- OS modules already loaded into memory: The image below shows the modules that are already loaded. This means that the cause of the error is not related to these modules.
- OS modules that could not be loaded due to the crash or error: The image below shows the modules that were unable to load. It could be that one of these modules is the cause of the crash.
- Status of the kernel debugger: This section indicates the current status of the debugger. The debugger connects two computers with the same OS version so that the crash dump can be sent from the system showing the blue screen to the functional system.
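If you often have to look up error messages, the examples listed above can be kept in a small lookup table. The following is a minimal sketch, not an official reference: the hints are the interpretations given in this article, and the names `STOP_CODE_HINTS` and `explain_stop_code` are made up for the illustration.

```python
# Hints are the interpretations given in this article, not an official list.
STOP_CODE_HINTS = {
    "KMODE_EXCEPTION_NOT_HANDLED": "incorrectly configured device driver",
    "REGISTRY_ERROR": "serious problem in the registry",
    "INACCESSIBLE_BOOT_DEVICE": "OS is unable to read from the hard disk",
    "UNEXPECTED_KERNEL_MODE_TRAP": "problem with the memory",
    "BAD_POOL_HEADER": "related to a recent change in the system",
    "NTFS_FILE_SYSTEM": "corrupted hard disk",
    "KERNEL_DATA_INPAGE_ERROR": "OS could not read a page of kernel data from the page file",
    "NMI_HARDWARE_FAILURE": "hardware abstraction layer could not identify the cause",
}


def explain_stop_code(screen_text: str) -> str:
    """Return the hint for the first known error message found in the text."""
    upper = screen_text.upper()
    for code, hint in STOP_CODE_HINTS.items():
        if code in upper:
            return f"{code}: {hint}"
    return "Unknown error message: note it down and look it up"


# Example: text copied from the first part of a blue screen.
print(explain_stop_code("*** STOP: ntfs_file_system"))
```

The match is case-insensitive and only looks for the message name, so it works on text typed from a photo of the screen as well as on copied text.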
Boot the machine in safe mode
Safe mode can give you an idea of which drivers or services could be causing the problem. Safe mode starts with a minimum set of drivers and services, so it does not load the drivers and services that might be causing the problem.
Look for issues in the Event Viewer Logs and Device Manager
Start with the Event Viewer logs. If they do not give any clue, go to the Device Manager and disable the devices that are not required for the OS to start. After that, start the server. If the server boots, then one or more of the devices you disabled is the likely cause of the problem. Enable one device at a time and reboot the machine after each change. If the machine boots normally after a device is enabled, that device is not the cause of the problem. When the machine does not start after you enable a device, you have identified the device that was causing the problem.
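The one-device-at-a-time procedure can be sketched as a loop. This is a simulation under stated assumptions: `boots_ok` is a hypothetical callback standing in for you rebooting the server and watching whether it starts, and the device names are invented for the example.

```python
from typing import Callable, Iterable, Optional, Set


def find_faulty_device(
    optional_devices: Iterable[str],
    boots_ok: Callable[[Set[str]], bool],
) -> Optional[str]:
    """Re-enable devices one at a time; return the first one that stops the boot.

    boots_ok is a hypothetical stand-in for a manual reboot: it receives the
    set of currently enabled optional devices and reports whether the server
    started.
    """
    enabled: Set[str] = set()
    if not boots_ok(enabled):
        # The server fails even with every optional device disabled,
        # so the cause lies elsewhere.
        return None
    for device in optional_devices:
        enabled.add(device)
        if not boots_ok(enabled):
            return device
    return None


# Simulated run: pretend the made-up "raid-controller" device breaks the boot.
devices = ["usb-hub", "raid-controller", "sound-card"]
culprit = find_faulty_device(devices, lambda on: "raid-controller" not in on)
print(culprit)  # prints raid-controller
```

Devices that pass stay enabled, as in the manual procedure, so the loop stops at the first device whose activation prevents the boot.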
Step 2: Fix the issue
The section above might have already given you a few ideas on how to troubleshoot your server. It might have also given you the impression that cause identification and solution can happen almost immediately in that order. This section describes troubleshooting tips for some other issues.
Failure of critical services
This issue was mentioned in step one, and fixing it is a bit complicated because there are no straightforward causes. You need to analyze every component separately. For example, in Microsoft Exchange, if a lower-level service such as the System Attendant fails, you can conclude that Exchange is either corrupt or unable to communicate with Active Directory. In that case, first verify that nothing is hindering the communication with the LDAP directory, and then try reinstalling Exchange Server or the latest service pack.
Another issue arises when the database fails to mount. In such a case, the database is probably corrupted or has some inconsistencies, and you may need to completely reinstall the database.
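For the LDAP check mentioned above, a quick first test is whether a TCP connection to the directory server can be opened at all. The sketch below assumes the directory listens on the standard LDAP port 389. It only tests network reachability, not authentication or directory health, and the function name `can_reach` is made up for the illustration.

```python
import socket


def can_reach(host: str, port: int = 389, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within the timeout.

    Port 389 is the standard LDAP port; pass a different port if your
    directory is configured differently.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False
```

Calling `can_reach("dc01.example.local")`, where the host name is a placeholder for your own domain controller, returns `False` if a firewall, a DNS problem, or a stopped directory service is blocking the connection.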
Tips to prevent your server from crashing
It is worth remembering that preventive steps lower, but do not eliminate, the possibility of server crashes. Still, they are worth taking. The steps to prevent a server crash are given below.
- Ensure that the server room is neat and clean.
- Ensure that the cold air comes from the front and the hot air is expelled from the back. This is especially applicable in the case of multiple servers.
- Keep the doors of the server room closed to prevent dust from entering, as dust buildup can cause overheating.
- Make sure that cold air within the room is reaching all the equipment. It is essential to keep the servers cool.
- Install an air conditioner in the server room that is specifically designed for servers.
- Use a rack enclosure that can have the cooling built into the bottom of the rack.
- Ensure that the room temperature does not exceed 77 degrees Fahrenheit (25 degrees Celsius).
- Use blanking panels over empty slots in the server racks.
- Consider virtualization which can contribute to lower heat generation.
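The temperature limit in the list above is easy to check automatically if you already collect room readings. The sketch below assumes the 77-degree limit is in Fahrenheit and that the readings come from your own sensor. The sample values and the names `MAX_ROOM_TEMP_F` and `too_hot` are made up for the illustration.

```python
MAX_ROOM_TEMP_F = 77.0  # limit from the tips above, read as Fahrenheit (assumption)


def celsius_to_fahrenheit(celsius: float) -> float:
    """Convert a Celsius reading to Fahrenheit."""
    return celsius * 9.0 / 5.0 + 32.0


def too_hot(readings_f, limit_f=MAX_ROOM_TEMP_F):
    """Return the Fahrenheit readings that exceed the limit."""
    return [t for t in readings_f if t > limit_f]


# Made-up sample readings; the last one arrives in Celsius and is converted.
readings = [72.5, 76.9, 78.2, celsius_to_fahrenheit(27.0)]
print(too_hot(readings))
```

A reading of exactly 77 degrees is not flagged, because the tip only asks that the temperature not exceed the limit.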
Some server crashes can be fixed by people who have basic knowledge of the software and hardware, while more complicated problems need expert attention. However, optimum load balancing, preventive care, and good handling can ensure a longer trouble-free life for servers. More importantly, when a server crashes, it can often be fixed with the preliminary solutions above and may not even require an expert. This can save a lot of cost and time.
Kaushik Pal has more than 16 years of experience as a technical architect and software consultant in enterprise application and product development. He has an interest in new technology and innovation, along with technical writing. His main focus is web architecture, web technologies, Java/J2EE, open source, big data, cloud, and mobile technologies. You can find more of his work at www.techalpine.com and you can email him at firstname.lastname@example.org or email@example.com