Case: No POST (no POST output over the serial channel)
Title: No POST - General investigation
Description: This test checks for POST output over the serial channel while power cycling the blade. In this case, no POST output was observed over the serial channel after the power cycle, and the server did not see any DHCP/PXE request from this blade. This indicates that the blade is unable to complete the POST process and likely has a faulty motherboard or power supply.
Component: Serial
Activity Name: BootNode
Exception: DatacenterPlatformServiceOperations.Abstractions.HardwareFailureException.
For Instructor:
Issue 1: Place DIMMs in the wrong channels on the motherboard, install them in an incorrect order, or mix memory modules with different speeds or sizes. The server may fail to boot with this configuration.
Issue 2: If the mainboard can hold two or more CPUs, install one faulty CPU and confirm that it prevents the system from booting (some servers are able to boot with one faulty CPU, so test it before going forward).
Issue 3: Use a faulty PSU in the server.
For Students:
1. Gather Initial Information
· Understand the Problem: Begin by gathering detailed information from the user or environment (ticketing).
· Identify the symptoms, such as error messages, beep codes, LEDs, or unusual behaviors.
· Recent Changes: Inquire about any recent changes to the system, including hardware or software installations, updates, or physical modifications. Are there related tasks on the same device?
2. Safety Precautions
· Static Discharge Protection: Use an anti-static wrist strap or mat to prevent electrostatic discharge, which can damage sensitive components.
· Power Off and Unplug: Ensure the computer is completely powered off and unplugged before opening the case.
3. External and Internal Inspection
· Check the power source.
· Check the Power Supply: Verify the power supply unit (PSU) is functioning properly.
· Use a PSU tester if possible or swap it with a known working unit if necessary.
· Examine External Connections: Inspect all external cables (power, monitor, peripherals) for damage or loose connections.
· Visual Check Inside: Open the case and visually inspect the motherboard and components for obvious signs of damage, such as burnt areas, broken connectors, or swollen capacitors.
· Component Seating: Confirm that all components (RAM, CPU, GPU, storage devices) are securely seated in their slots and that all cables are properly connected.
4. Power-On Test and Initial Diagnostics
· Power On and Listen: Reconnect the power and attempt to power on the computer. Listen for any unusual sounds (beep codes, fan noises) that could indicate specific issues.
· Check the visible LEDs.
· Monitor for POST: Observe whether the system passes the Power-On Self-Test (POST). If it doesn’t, note any beep codes or error lights, which can help pinpoint the issue.
5. Component Isolation and Minimal Boot Configuration
· Minimal Boot Configuration: Strip the system down to the bare essentials: motherboard, CPU, one stick of RAM, and the power supply.
· Disconnect all non-essential components (storage drives, additional RAM, expansion cards, etc.).
· Test Each Component: Gradually reintroduce components one by one, powering on the system each time to isolate the faulty component. If the system fails to boot after adding a specific component, that component is likely the cause.
· Swap Components: If possible, replace suspected faulty components with known working equivalents (e.g., PSU, RAM, CPU).
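The isolation procedure in step 5 can be sketched as a small loop. This is only an illustration under stated assumptions: `boots` is a hypothetical callback that stands for "power on with exactly these components installed and report whether POST completed", and the component names and the simulated fault are made up for the example.

```python
from typing import Callable, Iterable, List, Optional


def isolate_faulty_component(
    minimal: Iterable[str],
    extras: Iterable[str],
    boots: Callable[[List[str]], bool],
) -> Optional[str]:
    """Return the first component whose addition stops the system from booting."""
    installed = list(minimal)
    if not boots(installed):
        # The fault is already inside the minimal set (board, CPU, RAM, PSU):
        # swap those with known working parts instead of adding more components.
        return "minimal configuration"
    for component in extras:
        installed.append(component)  # reintroduce one component at a time
        if not boots(installed):
            return component
    return None  # everything booted; no faulty component found


# Simulated bench (assumption): pretend the second DIMM is the faulty part.
def simulated_boot(installed: List[str]) -> bool:
    return "DIMM B1" not in installed


culprit = isolate_faulty_component(
    minimal=["motherboard", "CPU 0", "DIMM A1", "PSU"],
    extras=["DIMM B1", "SSD 0", "NIC"],
    boots=simulated_boot,
)
print(culprit)
```

Adding one component per power cycle keeps the result unambiguous: the last part added before the first failed boot is the suspect.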
6. Diagnostic Tools and Advanced Troubleshooting
· BIOS/UEFI Diagnostics: If the system reaches the BIOS/UEFI, use built-in diagnostic tools to test the CPU, memory, and storage devices.
· External Diagnostic Tools: Use specialized hardware diagnostic tools (e.g., POST card, multimeter) or bootable diagnostic software (e.g., MemTest86 for RAM, SMART tests for drives) to test individual components.
· BIOS/UEFI Reset: Reset the BIOS/UEFI to default settings by removing the CMOS battery or using the reset jumper. This can resolve issues caused by incorrect BIOS settings.
7. Recheck, Validate, and Stress Test (if needed)
· Reassemble and Test: Once the faulty component is identified and replaced, reassemble the system and conduct thorough testing to ensure stability.
Case: Server DHCP/PXE issue (network investigation)
Title: No working network connection on blade level
Description: The blade is expected to be in the state: Connected, however, the status from the ToR port link reports: NotConnected
This likely indicates that the network cable between the blade and the ToR is not securely connected or is miswired. Please check the network cabling between the blade and the ToR.
ToR Details: (host name : SwitchDevice-XXX and port name : Ethernet22/1)
Component: Network
Activity Name: BootNodeIntoMos
Issue 1: Partially unplug the DAC/network cable between the server and the switch, so that the cable is still attached but not fully connected.
Issue 2: Use a faulty NIC in the server.
Issue 3: Configure an incorrect boot order in the BIOS: set the hard disk/SSD as the first boot option and disable PXE boot.
2. Verify Network Connectivity:
· Ensure the network cable is properly connected and check for any loose connections.
· Test the network port with a known working cable and port to confirm there is no issue with the physical connection.
3. Check BIOS/UEFI Settings:
· Enter the BIOS/UEFI setup during startup (usually by pressing a key like F2, Del, or Esc).
· Verify that the network boot option (PXE) is enabled and properly configured.
· Ensure the boot order includes network boot as an option and that it is prioritized correctly if needed.
4. Inspect PXE Server Configuration:
· Confirm that the PXE server is up and running, and that it is properly configured to provide boot images.
· Verify that the PXE server has the correct IP configuration and that it is reachable from the client machine.
5. Examine Network Boot Settings:
· Check for any network-related settings in the BIOS/UEFI, such as VLAN configurations or network interface settings, that might affect PXE booting.
· Ensure the network adapter is properly initialized and configured in the BIOS/UEFI settings.
6. Test with Another Network Interface:
· If possible, test the boot process with a different network adapter or network card to rule out hardware issues with the current one.
7. Examine Error Messages and Logs:
· Look for any error messages or logs related to the PXE boot process. These can often provide clues about what might be going wrong.
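When working from the ticket, it can help to pull the switch host and port out of the "ToR Details" line so you know exactly which cable to trace. The sketch below assumes the line always has the shape shown in this case; the function names and the regular expression are illustrative and not part of any ticketing tool.

```python
import re
from typing import Dict

# Pattern written for the "ToR Details" line as it appears in this ticket (assumption).
TOR_DETAILS = re.compile(
    r"host name\s*:\s*(?P<host>\S+)\s+and\s+port name\s*:\s*(?P<port>[^)\s]+)"
)


def parse_tor_details(line: str) -> Dict[str, str]:
    """Extract the switch host name and port name from a ToR details line."""
    match = TOR_DETAILS.search(line)
    if match is None:
        raise ValueError("no ToR details found in: " + line)
    return match.groupdict()


def link_mismatch(expected: str, reported: str) -> bool:
    """True when the reported link state differs from the expected one."""
    return expected.strip().lower() != reported.strip().lower()


details = parse_tor_details(
    "ToR Details: (host name : SwitchDevice-XXX and port name : Ethernet22/1)"
)
print(details["host"], details["port"])
print(link_mismatch("Connected", "NotConnected"))
```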
Case: Drive Replacement
Title: Server: Hard Disk (HDD) Replacement
Description: The observed HDD (disk number) is reporting as failed. Please replace it.
Details: In this case, the drive needs to be replaced due to suspected failure.
For Instructor:
Issue 1: If the drive is not properly seated in its slot, or the internal cables to the motherboard are not properly connected, the system may fail to recognize it.
Issue 2: If the drive has physical damage, such as a broken connector or damaged circuitry, it may not function correctly even if it is properly seated.
Issue 3: Use a faulty drive to simulate failure scenarios.
For Students:
Gather Initial Information
Understand the Problem: Begin by gathering detailed information from the user or environment (ticketing).
Identify the hard disk drive and replace it with a new one.
Make sure that the replacement drive has the same capacity as the faulty one.
Safety Precautions
Static Discharge Protection: Use an anti-static wrist strap or mat to prevent electrostatic discharge, which can damage sensitive components.
Power Off and Unplug: Ensure the computer is completely powered off and unplugged before opening the case.
Internal Inspection
Make sure that the new disk is fully seated in its slot and/or the cables are connected.
Re-assembly
Reassemble the device, make sure that all cables are reconnected correctly, and power on the device.
Case: DIMM Replacement
Title: Server: Memory Module (RAM) Replacement
Description: Please replace the bad memory module F1. The blade log is reporting this error: 0239 RecType: 02 | GenId: 0x0001 | Memory | Correctable ECC / other correctable memory error logging limit reached. Module: F1 | SensorType: Memory | SensorName: Mem ECC Error | SensorNum: 0x87 | Assertion | Event Data (1-3): 0xA5010B
Component: Memory
Activity Name: Memory Replacement
Details:
DIMM F1 needs to be replaced due to suspected failure as reported in the logs.
After the replacement, check that the server boots up.
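The log entry quoted in the description can be read by hand, but a short script makes it easier to pick out the module that needs replacing. The sketch below is an illustration only: the entry is copied from this case, while the dictionary keys and patterns are assumptions based on that single example and do not interpret the event data bytes.

```python
import re
from typing import Dict, Optional

# Log entry copied from the ticket description of this case.
LOG_ENTRY = (
    "0239 RecType: 02 | GenId: 0x0001 | Memory | Correctable ECC / other "
    "correctable memory error logging limit reached. Module: F1 | "
    "SensorType: Memory | SensorName: Mem ECC Error | SensorNum: 0x87 | "
    "Assertion | Event Data (1-3): 0xA5010B"
)

# Key names are our own labels (assumption); patterns match the entry above.
FIELD_PATTERNS = {
    "module": r"Module:\s*(\w+)",
    "sensor_type": r"SensorType:\s*([^|]+)",
    "sensor_name": r"SensorName:\s*([^|]+)",
    "sensor_num": r"SensorNum:\s*(0x[0-9A-Fa-f]+)",
    "event_data": r"Event Data \(1-3\):\s*(0x[0-9A-Fa-f]+)",
}


def parse_memory_event(entry: str) -> Dict[str, Optional[str]]:
    """Pull the labelled fields out of one log entry; missing fields become None."""
    fields: Dict[str, Optional[str]] = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = re.search(pattern, entry)
        fields[name] = match.group(1).strip() if match else None
    return fields


event = parse_memory_event(LOG_ENTRY)
print(event["module"], event["sensor_name"])
```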
Issue 2: Use a faulty Memory module in the server.
Issue 3: Make sure that students note the serial numbers of the old and new DIMMs for future reference, and that they record which DIMM was replaced.
Understand the problem: Begin by gathering detailed information from the user or environment (ticketing).
Identify the DIMM that needs replacement, and make sure that the new DIMM is the same size as the one it replaces.
Check whether the faulty DIMM was already replaced recently (less than 5 days ago).
Power off and unplug: Ensure the server is completely powered off and all cables are unplugged before opening it.
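The two checks before swapping a DIMM (same size, and not already replaced within the last 5 days) can be written down as a small helper. This is a sketch under assumptions: the function name, the way the last replacement date is obtained, and the example dates are all hypothetical.

```python
from datetime import datetime, timedelta
from typing import List, Optional

# Threshold taken from the checklist above: less than 5 days ago counts as recent.
RECENT_WINDOW = timedelta(days=5)


def pre_replacement_warnings(
    old_size_gb: int,
    new_size_gb: int,
    last_replaced: Optional[datetime],
    now: datetime,
) -> List[str]:
    """Return a list of warnings to resolve before replacing the DIMM."""
    warnings = []
    if old_size_gb != new_size_gb:
        warnings.append(
            f"size mismatch: old {old_size_gb} GB, new {new_size_gb} GB"
        )
    if last_replaced is not None and now - last_replaced < RECENT_WINDOW:
        warnings.append("DIMM was already replaced less than 5 days ago")
    return warnings


# Example values are invented for illustration.
now = datetime(2024, 1, 10, 9, 0)
recent = pre_replacement_warnings(32, 32, datetime(2024, 1, 8, 9, 0), now)
mismatch = pre_replacement_warnings(32, 16, None, now)
print(recent)
print(mismatch)
```

A repeat failure on a recently replaced DIMM may point at the slot or the board rather than the module, which is why the second check is worth making before ordering another part.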
Case: FPGA Investigation
Title: FPGA card TOR Link Not Connected or miswired
Description: FPGA is not connected to the switch. Please investigate.
Component: Network Interface
Check for proper functioning of the FPGA or its cabling.
In this case, the FPGA or its cabling (internal or external) needs replacement.
The system needs to be tested to ensure it is functioning correctly.
Issue 1: If the FPGA is not properly seated in its slot or not connected to the motherboard with the OCuLink cable (the cable that connects the FPGA to the motherboard), the system may fail to recognize it.
Issue 2: If the FPGA has physical damage, it may not function correctly even if it is properly installed.
Issue 3: Cable between the FPGA and switch is not properly seated.
(Extra Issue 4: The port on the switch has issues, but that needs to be checked by another team.)
Identify the symptoms, such as no light on the switch or FPGA itself.
Check the connection between the FPGA and the motherboard.
Make sure that the FPGA and its cable to the motherboard are seated correctly after replacement.
What is an FPGA:
An FPGA (Field-Programmable Gate Array) is a special type of hardware chip that can be custom-programmed by the user after manufacturing to perform specific logic tasks.
Key features:
· Reprogrammable Hardware: Unlike CPUs (which follow software), FPGAs can be configured to act like custom hardware circuits.
· Highly Flexible: You can reprogram an FPGA multiple times to perform different functions.
· Parallel Processing: FPGAs can handle many tasks at once, making them ideal for tasks like data processing, encryption, and networking.
Common uses:
· Network acceleration (e.g., packet filtering)
· Signal processing (e.g., telecommunications)
· Hardware emulation and prototyping
· Data centers for specialized tasks
Example: In servers, FPGAs are used to offload heavy network tasks from the CPU, making data processing faster.
FPGAs at Microsoft:
· Microsoft has deployed FPGAs in every Azure server over recent years.
· This created a configurable cloud that can adapt to different applications.
Why FPGAs:
· Faster than CPUs for specialized tasks.
· More flexible than fixed ASICs (Application-Specific Integrated Circuits).
· Can be reprogrammed for new functions as needed.
Where Microsoft uses them:
· Bing Search Ranking: Speeds up how search results are ranked.
· Deep Neural Network (DNN) Evaluation: Helps AI models run faster.
· Software Defined Networking (SDN) Acceleration: Makes network functions quicker and more efficient.
Accelerated Networking in Azure:
· Up to 10x lower latency between virtual machines.
· The CPU is freed up for other tasks, boosting overall performance.
Microsoft’s FPGA strategy gives Azure a flexible, powerful edge in AI, networking, and search, balancing speed, flexibility, and scalability.
FPGA (Field-Programmable Gate Array) is a programmable hardware chip that allows users to configure its internal circuits after manufacturing to perform specific tasks.
User-Programmable Hardware: You can change how the FPGA works by uploading a configuration file.
Parallel Processing: FPGAs can process many operations at the same time, making them very fast for certain tasks.
Reconfigurable: Can be reprogrammed multiple times for different functions.
Network Acceleration (e.g., Azure uses them to speed up networking)
AI/Deep Learning (e.g., DNN evaluation)
Data Processing & Encryption
Prototyping Custom Hardware
Microsoft Azure uses FPGAs in their servers to speed up Bing search, AI tasks, and reduce virtual machine latency with FPGA-based networking.
In Microsoft Azure, an FPGA (Field-Programmable Gate Array) is a special hardware accelerator installed in every Azure server to boost performance on certain tasks.
To create a Configurable Cloud — hardware that can adapt to different workloads.
FPGAs give better performance than CPUs for specific tasks but are still flexible (unlike fixed ASICs).
Bing Search Ranking — Speeds up how fast search results are processed and ranked.
AI/Deep Neural Networks (DNN) — Helps run machine learning models faster.
Software-Defined Networking (SDN) — Accelerates networking functions inside Azure.
Accelerated Networking — Cuts inter-VM latency by up to 10x, freeing up CPU resources.
When two Azure virtual machines communicate, FPGA-based networking can make the connection faster and reduce load on the server’s CPU.
Summary: In Microsoft’s Azure, FPGAs make the cloud smarter and faster by boosting AI, networking, and search tasks — with flexibility to adapt as needed.
Case: Processor Replacement
Title: CPU replacement
Description: There is a faulty CPU that causes the server to malfunction.
Component: Processor
A CPU (processor) fault can cause the server to malfunction in different ways. There are usually two CPUs in each device (depending on the configuration), and each controls a different part of the board.
If it is not clear which of them is faulty, it is best to replace them both. This prevents extra work or errors in the future. CPUs are usually the most expensive parts of a server. If a task (or ticket) asks you to replace only one, just follow that.
Issue 1: If the processor is not properly seated in its socket, the system may fail to recognize it.
Issue 2: If the cooling solution is not properly installed or thermal paste is not applied correctly, the processor may overheat and cause system instability.
Issue 3: Use a faulty processor to simulate failure scenarios.
Gather Initial Information and guidelines for the procedure:
Identify the symptoms, such as error messages, beep codes, LEDs, or unusual behaviors.
Make sure you have new cooling paste and alcohol wipes with you, because the cooling element that sits on the CPU needs to be cleaned completely before new paste can be applied.
There is only one way to put the new CPU in its socket. You will see a little triangle on the CPU. On or close to the socket there will be a matching triangle. Make sure they are aligned. If they are not, there is a good chance that the socket will get damaged, which means that the system board needs to be replaced.
Instructions for Applying Thermal Paste and installing the cooling element:
Clean the Surface: Ensure both the processor and the base of the cooling solution are clean. Use isopropyl alcohol and a lint-free cloth to remove any old thermal paste or debris.
Apply Thermal Paste: Apply a small amount of thermal paste (about the size of a pea or a grain of rice) to the center of the processor. For larger processors, you may need a bit more, but avoid using too much.
Spread the Paste: Some thermal pastes spread out evenly when the cooler is installed, but if you prefer, you can use a plastic card or a dedicated spreader to evenly distribute the paste across the surface of the processor.
Install the Cooler: Carefully place the cooling solution onto the processor, ensuring it is aligned correctly. Secure the cooler according to the manufacturer's instructions, applying even pressure to ensure the thermal paste spreads evenly.
Make sure that the CPU and its cooling element are seated correctly after replacement.
Case: Motherboard Replacement
Title: Motherboard or Systemboard Replacement
Description: This is the ultimate replacement task you can get, since all the components that are seated on the board need to be transferred from the old one to the new one.
Component: Motherboard
Since all the components need to be transferred, it is important to check how they are all positioned on the board before you start. If one component is seated incorrectly or not in the right place, the server won't work.
It is also very important that all the HDDs (hard drives) are removed from the old board and placed in the same positions on the new one. The same goes for the DIMMs, especially if different sizes are being used (for example 32 GB and 16 GB). This depends on the configuration of the server.
Make sure that the board remains clean.
The PSU (power supply unit) doesn't have to be removed; only its cables need to be disconnected from the board and, once the board is replaced, reconnected to the same slots. This goes for all components that are not seated directly on the board itself.
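One way to make sure every part ends up in the same position is to write down the layout before you start and compare it after the transfer. The sketch below shows that idea only; the slot names, sizes, and serial placeholder are hypothetical, and in practice a photo or a handwritten table serves the same purpose.

```python
from typing import Dict, List

SlotMap = Dict[str, str]  # slot name -> what is installed there


def layout_differences(before: SlotMap, after: SlotMap) -> List[str]:
    """List every slot whose content differs between the old and new board."""
    problems = []
    for slot in sorted(set(before) | set(after)):
        old = before.get(slot, "empty")
        new = after.get(slot, "empty")
        if old != new:
            problems.append(f"{slot}: expected {old}, found {new}")
    return problems


# Example layout (assumption): two DIMMs of different sizes were swapped by mistake.
before = {"DIMM A1": "32GB", "DIMM B1": "16GB", "Drive bay 0": "HDD serial X"}
after = {"DIMM A1": "16GB", "DIMM B1": "32GB", "Drive bay 0": "HDD serial X"}
differences = layout_differences(before, after)
for line in differences:
    print(line)
```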