Omega on 4/11/2008 at 15:07
Might be a bit of a stretch coming on these boards with a problem I've got at work, but I'm kinda at a loss and I figured maybe someone knew of a solution. At work we've got 5 newly built identical machines, all with the same problem: they hang or restart at irregular times, both when they're being used and when they're idle. They're mainly used for traffic simulation calculations.
The full hardware specifications are:
* AOpen H500A Computer Case with an AOpen FSP350-60HLN power supply.
* Intel Bonetrail (DX48BT2)
* 4x KVR1333D3N9/2G Kingston modules
* Intel Core 2 Duo E8600
* Seagate Barracuda 7200rpm 500GB 32MB Cache
* Lite-on DVD-drive (IDE)
* Lite-on DVD-writer (SATA)
* Asus EN8600GT Silent HTDP 512MB DDR2 PCI-E
* Windows XP 64-bit UK
Does anyone know if there are any incompatibilities with these hardware specifications?
Things we've tried:
* Changing the PSU to a 600W one.
* Detaching the DVD drives.
* Changing the video card.
* Double-checking that the BIOS is up to date.
* Checking the memory with memtest, which sometimes gives errors and sometimes doesn't. I suppose this might be a possible culprit, but it's kinda strange that it would happen on all 5 machines.
Things we can't try:
* Swap memory. Since it's DDR3 and highly pricey, we don't have any other modules in stock.
* Same goes for the motherboard.
Microwave Oven on 4/11/2008 at 16:05
The symptoms you describe usually indicate a thermal problem. My guess is the RAM is getting overheated. Try underclocking it and see if the problem goes away.
bikerdude on 4/11/2008 at 18:26
Evening
Have you checked the event log? There might be an error code in there which will help diagnose the problem. What software is being run on these PCs, and what are the PCs doing when they crash/reboot?
If it is a heating issue (which I find odd as you're running stock Intel kit):
Are these PCs servers?
What kind of case/enclosure are they in?
What is the ambient temp of the room they are in?
biker
TheOutrider on 4/11/2008 at 18:56
Are they, by any chance, all connected to the same outlet (or few outlets) via hilarious power strip tentacles? Might be you're straining the power supplies' power supply so much that *it* starts fluctuating, to the point where the computers aren't getting enough anymore.
Phatose on 5/11/2008 at 18:09
Have you checked your BIOS RAM settings, to ensure that all the values match the manufacturer's specs?
That much bad RAM seems unlikely, but fucked up SPD or manual memory settings are a definite possibility.
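To put a rough number on "unlikely", here is a minimal sketch. It assumes modules fail independently of each other and uses a made-up 2% per-module defect rate; that rate is purely hypothetical, not a figure from Kingston or anyone else. Only the 4 modules per machine and the 5 machines come from the first post.

```python
# Rough estimate: how likely is it that 5 machines out of 5 each contain
# at least one randomly defective module?
# P_MODULE_BAD is a hypothetical defect rate chosen for illustration only.
P_MODULE_BAD = 0.02
MODULES_PER_MACHINE = 4   # from the spec list in the first post
MACHINES = 5              # from the first post


def p_machine_affected(p_module: float, modules: int) -> float:
    """Probability that at least one of `modules` independent modules is bad."""
    return 1.0 - (1.0 - p_module) ** modules


def p_all_machines_affected(p_module: float, modules: int, machines: int) -> float:
    """Probability that every one of `machines` machines has a bad module."""
    return p_machine_affected(p_module, modules) ** machines


p_one = p_machine_affected(P_MODULE_BAD, MODULES_PER_MACHINE)
p_all = p_all_machines_affected(P_MODULE_BAD, MODULES_PER_MACHINE, MACHINES)

print(f"One machine affected:  {p_one:.4f}")
print(f"All {MACHINES} machines affected: {p_all:.2e}")
```

Under those assumptions, all five machines being hit by independent random defects is very improbable, which is why a shared cause (settings, SPD data, or a systematic compatibility issue) looks like the better bet.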
Omega on 6/11/2008 at 21:27
We still haven't found a solution. I'll definitely go double-check the RAM settings. I think a colleague of mine actually did check them, but I'm not sure what his findings were. It's weird though if that's the solution; previously, one could just slot in some bars and be done with it. No BIOS settings needed.
Not sure if it's a heat problem. It might be, I haven't actually felt the memory modules, but I do know that the video card at one point while idle was around 81°C. But even after removing the card, the problem persisted. I also know that the northbridge heatsink is sometimes very hot. So hot you can't touch it.
As for the location of the machines, we pretty much ruled that out by taking them away from the customer and putting them in our own network. Problem still persists.
There's no error in the event logs. It just restarts and all you see is services starting up. Before that, nothing out of the ordinary.
I'm also not sure if it could be the power supply outie; we changed that to a 600W one and it still happened. It definitely seems to point to either the RAM or the motherboard. Kingston has been very supportive, responding pretty much immediately and even offering to send us replacement modules from a different batch within the day (okay, my boss made me exaggerate the number of computers affected, i.e. multiply the number of computers by 10, which would mean 200 modules affected :devil: ). Intel has been silent for the most part, saying only that they're looking into it.
I'll keep you guys updated if we find a solution. Thanks for the suggestions.
baeuchlein on 7/11/2008 at 15:02
The northbridge is too hot to touch? That sounds a bit suspicious.
Try to position a fan (if possible, 80mm x 80mm or larger) close to the northbridge, sending a stream of cool air across the northbridge's heatsink. Do not fix that fan on top of the northbridge's heatsink, as some manufacturers do, but try to stand it on the mainboard, so that there's a 90-degree angle between the mainboard surface and the fan. If you have a tower case where the mainboard is in vertical orientation, see if you can position the fan on top of the graphics card (which often is near the northbridge). Both of these positions are often better than putting the fan on top of the heatsink, although I can't tell you exactly why.
Once you have successfully placed a fan there, see whether that brings more stability to the system. Run the memory tests again as well and see whether they still report errors. If you have to buy any kind of fan, you could try just one fan (with one machine) and then buy more once you know that it works. Then you can think about how to properly fix these fans to the case or the motherboard, so that they don't drop when vibrations rock the computer or someone carries the case around.
If you find some other piece of hardware which seems too hot, you can try the same approach with that as well. If cooling a particular piece of equipment results in a system with improved stability, you've found the culprit.
Omega on 16/12/2008 at 10:17
Sorry for the necromancy, but I just wanted to add the cause of/solution to the problem I posted earlier.
It turned out that the chips on the Kingston modules were incompatible with the Intel motherboard. After we got in touch with Kingston's RMA support, they swapped the modules for the same model but with different chips on them. And then it seemed to work fine.
bikerdude on 16/12/2008 at 13:22
Quote Posted by Omega
Sorry for the necromancy, but I just wanted to add the cause of/solution to the problem I posted earlier.
It turned out that the chips on the Kingston modules were incompatible with the Intel motherboard. After we got in touch with Kingston's RMA support, they swapped the modules for the same model but with different chips on them. And then it seemed to work fine.
That doesn't surprise me; memory is more often than not the cause of a lot of problems in PCs.
jstnomega on 26/12/2008 at 05:23
Quote Posted by Bikerdude
That doesn't surprise me; memory is more often than not the cause of a lot of problems in PCs.
Yup. That's why we have JEDEC standards & the RAMBUS folks are a bunch of flaming assholes.:mad:
<end sarcasm>