Server Upgrade (Epyc Milan + Supermicro H12SSL-NT)

Zarathustra

So,

I have been low key researching and mulling this over for years now, and I finally pulled the trigger a couple of weeks ago.

Wound up getting:
- Epyc 7543 (Milan, 32C/64T, 2.8GHz Base, 3.7GHz Boost) used from tugm4470 on eBay (this seller is highly regarded on Servethehome)
- Supermicro H12SSL-NT-O - new on Amazon
- 8x Hynix 64GB Registered DDR4-3200 - used on eBay from atechcomponents, which seems to be a well-rated seller
- Supermicro SNK-P0064AP4 (92mm Socket SP3 CPU cooler for 4U Cases) - new on Amazon

I'm going to be doing some bench-top testing over the next few weeks before I take down the old server and go for the upgrade.

Did some basic assembly last night:

This is the part that no matter how many times I do it on a Threadripper (and now EPYC) always scares the **** out of me:

1.jpg

4094 hair-fine pins that unlike some others are irreparable if touched.

With my luck I'd drop the CPU on them, which is probably why they include that nifty secondary plastic cover.

2.jpg

And a few minutes later. I wanted to do some benchtop testing last night, but apparently I no longer have a good PSU in my spare parts bin. I'm going to have to yank the one out of my testbench machine, but I just didn't feel like doing it last night.

I'm a little bit concerned about the clearance behind PCIe slot three with those weird Supermicro plastic M.2 clips sticking up, but I'm hoping it will be fine. I'm also considering putting a couple of slim heatsinks on those M.2 drives to ensure they don't get too toasty. The Amazon store page says they only stick up by ~3mm, so hopefully I'll be OK. Those are going to be my mirrored boot drives (using ZFS).

I'm going to be using at least 5, maybe 6, of the PCIe slots, so they can't be blocked. Whether it is 5 or 6 depends on if I can get the funky SlimSAS ports to run my Optane U.2 drives. If I can, then I don't need my U.2 riser card. 3 of the slots will have 16x cards, the rest 8x cards.

It's a little weird to see such a petite CPU cooler on a 225W TDP CPU, but we are not overclocking here. As long as we stay under TJMax at full load I'll be happy. Even better if I can get the full advertised boost out of it.
 
Anyway, in order to validate that everything was working the way it should, I started googling around for known benchmark numbers I could replicate on this thing in its benchtop configuration (without installing Windows).

All I could find was a launchtime Geekbench 4 result (I thought geekbench was for phones?) with a multicore score of ~116k. Well I ran mine and got ~128k, so I am going to call that a success. I'm going to guess that the tests at launch were run with slower memory, not the 3200MT/s stuff I have.

If anyone is curious:
https://browser.geekbench.com/v4/cpu/17037599
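For anyone who wants the gap between the two Geekbench 4 numbers as a percentage, here is a minimal Python sketch. The scores are the rounded ~116k and ~128k figures from the post, and the helper name `percent_gain` is made up for illustration.

```python
def percent_gain(reference: float, measured: float) -> float:
    """Return how much higher `measured` is than `reference`, in percent."""
    return (measured - reference) / reference * 100.0


launch_score = 116_000  # approximate launch-time Geekbench 4 multicore result
my_score = 128_000      # approximate multicore result from this build

gain = percent_gain(launch_score, my_score)
print(f"This build scores about {gain:.1f}% higher than the launch result")
```

A gap of roughly ten percent is at least plausible for a memory speed difference, but that part remains a guess, as stated above.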

Then just for ****s and giggles I ran a Geekbench6 intending to compare it to other results of the same CPU in their browser. I landed on a multicore score of 17139, which seems about right, but it is tough to tell, because the benchmark browser has results all over the place in it.

Again, if anyone is curious:
https://browser.geekbench.com/v6/cpu/3965899

Mine lands among the higher results for a single Epyc 7543, so I'll take that as an indication that at least I wasn't scammed, and I did indeed get the RAM and CPU I was supposed to. There were some that were a couple of hundred points higher, but it looks like they were running some sort of Asus workstation board. I'm guessing they had a bigger cooler than this little 92mm thing, and benefited from better boost clocks.

Now I am going to run Memtest. That's probably going to take... ...a while.

I've never done a memtest with this much RAM before.
 
For what it is worth, I have now tested old v 5.x open source Memtest86+, new 6.x open source Memtest86+ and new PassMark version of Memtest86 v 10.6 (note absence of "+" after 86)

My experiences with them are as follows:

1.) Old open source Memtest 86+ is slow. One pass took about 22 hours in non-SMP mode.

2.) Newer 6.x version of open source Memtest86+ was faster, and I was able to run it in SMP mode. Power use also went from ~150w to ~175w with this version.

3.) PassMark's closed source version of Memtest86 (no plus) was by far the fastest. A typical pass is taking about 11.5 hours. Documentation says the first pass should be a quick pass followed by longer passes after that, but this does not seem to have been the case on my system. All passes are about the same length. By default it appears to want to run 4 passes.

It tricks you. The first ~50% of each pass goes by very fast. Then there is a hangup from ~50% to ~80%, and then the last 20% seems to go by fast again.

Also by default, in SMP mode it seems to "only" run 16 cores. Not sure if it just uses half your cores (or a quarter of your logical CPUs) or if it is capped at 16 by default.

While I have decades of experience with both enterprise and consumer hardware, I figured that I don't know enough about the particulars of RAM, memory channel, and controller design to intelligently pick which settings to run to perform an adequate test. So I went ahead and assumed that people who know much more than me came up with the defaults based on some sort of reasoning, so I ran with those.

Right now I am 75% through pass #3 at ~32 hours.
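As a rough sanity check on those numbers, here is a small Python sketch that estimates elapsed and remaining time from the ~11.5 hours per pass and the default of 4 passes mentioned above. It assumes progress within a pass is linear, which is not quite true given the fast/slow/fast pattern, so treat the output as a ballpark only. The function name `memtest_eta` is hypothetical.

```python
def memtest_eta(hours_per_pass, total_passes, current_pass, fraction_done):
    """Estimate (elapsed_hours, remaining_hours) for a multi-pass memory test.

    Assumes every pass takes the same time and that the progress
    percentage within a pass advances linearly (a simplification).
    """
    completed_passes = (current_pass - 1) + fraction_done
    elapsed = completed_passes * hours_per_pass
    remaining = (total_passes - completed_passes) * hours_per_pass
    return elapsed, remaining


# Figures from the post: ~11.5 h per pass, 4 passes, 75% through pass #3
elapsed, remaining = memtest_eta(11.5, 4, 3, 0.75)
print(f"Estimated elapsed: {elapsed:.1f} h, estimated remaining: {remaining:.1f} h")
```

The elapsed estimate lands close to the ~32 hours actually observed, so the 11.5 hours per pass figure looks consistent.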

Once this is complete - assuming there are no errors- I think I am going to be good to give the RAM seller a positive review.

I'll probably run a 24-48 hour pass of Prime95 (well, mprime to be precise) before I give a positive review to the CPU seller.

Not going to lie, the CPU purchase made me nervous as it came from China. I usually buy all significant parts I get from eBay from U.S. sellers just to avoid scams (or malware in firmware) but the seller, tugm4470 | eBay Stores is very highly regarded on the Servethehome forums. I found this to be necessary as many EPYC chips on eBay are vendor locked and fraudulently sold as unlocked.

He (She/them? Not sure if I am dealing with a big business here, but with 21k items sold, I'm guessing they must be) even has a promotional code for Servethehome forum members that gives you free expedited FedEx shipping from China. (I couldn't find a place to enter the code, so I just mentioned the code in the "message to seller" field.)

The trip from Shenzhen to New England, U.S. took 4 calendar days (picked up by FedEx December 7th at 11am UTC, delivered December 11th at ~3pm). I was pretty impressed with that.

The CPU had a quality stamp right on top of the heat spreader when received. Not sure what the goal with this is, but I dutifully cleaned the top of the CPU with IPA before installing the heatsink, and it wiped right off with no effort.


PXL_20231212_025901251-crop.jpg

I was low key worried it was something they were using to verify they got the same CPU back in case of returns, but the more I think about it it is probably just a "tested by" quality stamp. Google translate tells me it says "Chengmeng Technology", who are probably the sellers behind the tugm4470 store.

I found a company named "Shenzhen Chengmeng Technology Co. LTD" but they seem to manufacture and sell Children's Toys and kids sporting goods. China is confusing, man.

Of course, then I found the actual company:

As expected, while they call themselves a "manufacturer" it looks like they mostly acquire/recycle decommissioned, mostly enterprise, computer parts in bulk, test them, and then sell their own servers based on them, like any number of used enterprise parts brokers. I guess occasionally they just sell the parts as well, presumably with a decent markup for them.

Either way, it seems to be working fine, so I don't think I'll have to return anything.

My primary concern with buying stuff from China is spyware embedded in the firmware and stuff like that. I don't THINK there is a way to do that with the CPU (though admittedly I am not an expert there) but this is why I bought the motherboard new.
 
I figured that I don't know enough about the particulars of RAM, memory channel, and controller design to intelligently pick which settings to run to perform an adequate test. So I went ahead and assumed that people who know much more than me came up with the defaults based on some sort of reasoning, so I ran with those.
I picked up a few (Windows only) tricks that involved specialized / sketchy software - and would load the CPU as well - while exploring DDR5 overclocking.

I learned a lot, but little of it actionable in the sense that any performance improvement was worth potential stability degradation to me.

But the summary is that running RAM in-spec and CPU in-spec resulted in ultimate stability, and that errors would only really occur in situations where things were run noticeably out of spec and allowed to get too hot.

Otherwise, running Memtest of some variation to ensure that at least the various memory-related subsystems are functioning properly is good enough. If you want to do an 'enthusiast-grade' gut check, try and find a y-cruncher build for whatever OS you're using and run through its paces. Start off with lighter settings though as y-cruncher is a serious workload.
 
Doing some mprime (Linux version of Prime95) stability testing before blessing it as stable, but I have a good feeling about this thing now.

It did give me a scare at first though. Would run for a few seconds and then kill all the threads.

Not quite sure what is going on, but I googled it and lots of people are having the same issue with mprime.

When run from the configuration menu, it will be killed after a short while of running. But if you unpack the download afresh and run it with "./mprime -t" to immediately start an all core stress test, it works just fine.

Seems more like some kind of bug, and not a hardware issue, since the "mprime -t" method seems to be stable.

This is what 64 threads running at 100% looks like :p

1702787951134.png

Looks like the System Monitor GUI app in Linux Mint has some trouble with large amounts of memory. It is totally tallying that wrong.


The "free" command from the command line gets it right though.
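If you want to cross-check a GUI tool's memory tally without trusting either program, a minimal Python sketch can read the kernel's own figure. This assumes a Linux system exposing `/proc/meminfo` (the same file `free` gets its numbers from); the function name `mem_total_gib` is made up for illustration.

```python
from pathlib import Path


def mem_total_gib(meminfo_text: str) -> float:
    """Parse the MemTotal line of /proc/meminfo text and return GiB."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            kib = int(line.split()[1])  # the kernel reports this value in kB (KiB)
            return kib / (1024 ** 2)
    raise ValueError("MemTotal line not found")


meminfo_path = Path("/proc/meminfo")  # only present on Linux
if meminfo_path.exists():
    print(f"Kernel reports {mem_total_gib(meminfo_path.read_text()):.1f} GiB total")
```

Expect the result to be a little under the installed capacity, since the kernel reserves some memory before reporting the total.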


Doing all of this from my desktop via the IPMI/BMC's console pass through. Pretty convenient. No need to hook up monitors and keyboards.


With this all core load the cores seem to be clocking at 2771 - 2772 MHz, which is below the advertised base clock of 2.8 GHz, but not by much.


Still that is a tiny bit disappointing, but probably not indicative of a problem.
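To put a number on "not by much", here is a hedged Python sketch that pulls per-core clocks out of `/proc/cpuinfo` and computes the shortfall against the 2.8 GHz base clock. It assumes an x86 Linux system where `/proc/cpuinfo` contains `cpu MHz` lines; the function names are hypothetical.

```python
import re
from pathlib import Path

BASE_CLOCK_MHZ = 2800.0  # advertised base clock of the Epyc 7543


def core_clocks_mhz(cpuinfo_text: str) -> list:
    """Return every 'cpu MHz' value found in /proc/cpuinfo text."""
    pattern = r"^cpu MHz\s*:\s*([\d.]+)"
    return [float(v) for v in re.findall(pattern, cpuinfo_text, flags=re.MULTILINE)]


def shortfall_percent(observed_mhz: float, target_mhz: float = BASE_CLOCK_MHZ) -> float:
    """Return how far below the target clock the observed clock is, in percent."""
    return (target_mhz - observed_mhz) / target_mhz * 100.0


cpuinfo_path = Path("/proc/cpuinfo")  # only present on Linux
if cpuinfo_path.exists():
    clocks = core_clocks_mhz(cpuinfo_path.read_text())
    if clocks:
        lowest = min(clocks)
        print(f"Lowest core: {lowest:.0f} MHz "
              f"({shortfall_percent(lowest):.2f}% below base)")
```

Plugging in the 2771 MHz observed here gives a shortfall of about one percent.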


Core temp is about 63C, and the CPU fan is at about 67% speed.


Might just be Supermicro doing their normal hyper-conservative thing.


I wonder if it is just bouncing off the TDP limiter. (I should probably check what it is set to in BIOS) It is pulling about 295w from the wall with all cores at full load according to my Kill-A-Watt.

For ****s and giggles did a single thread test. The core clocks up to 3676 MHz. So again, a few MHz short of the 3.7 GHz max boost clock. I'm guessing there are some conservative Supermicro clock settings preventing it from hitting max clocks.
 
Passed 48+ hours of mixed prime95 (well, actually mprime, but same thing) last night.

I'm ready to call this thing stable, and leave my positive reviews on eBay.

Next up, to do the actual drop in upgrade into my existing server.

I'll probably do that between Xmas and New Years when I have plenty of time to get it up and running when no one needs it.

Wish me luck!

And if you need server RAM or CPUs, I am happy to recommend atechcomponents (RAM) and tugm4470 (CPU) on eBay. Based on my n=1 experience, they are both stellar sellers. The Servethehome forums also have lots of buyers who are very happy with tugm4470. If ordering there, don't forget to be a Servethehome forum member and message the code to the seller for free expedited shipping.
 
It's crazy how small the H12SSL series motherboards are in the case compared to the monster X9DRI-F.

The end result of this is that one of the two 12v 8pin EPS connectors is like 2mm too short to reach the closest EPS connector.

PXL_20231227_060327767-sml.jpg

Luckily I live near a Microcenter. They have 8" extensions in stock. I don't trust extensions (I've literally had one catch fire in the past), but it looks like I don't have much of a choice.
 
I've bought many things from tugm4470 myself. All EPYC related. One of the motherboards I bought from them had a bad memory slot. I contacted them and they immediately wanted me to do troubleshooting. To which I simply told them I had 5 other EPYC systems with the same RAM that all worked and even took RAM out of those systems to test the system and all experienced the same thing.

They sent me a replacement with no further questions.

It also only took me 2 - 3 days to receive them and I wasn't even aware of the "coupon" thing you mentioned. Overall, I will buy from them again and again. 4 of my 6 motherboards have come from them, but they're all AsRock boards for the single EPYC setups. Only Supermicro one I have is for my dual EPYC setup. I dislike SM so much, their crap takes SO long to boot it's ridiculous. Luckily I hardly ever power them down.

Like you though I am not too confident with buying EPYC CPUs on eBay due to the vendor locking. So far I have had good luck though. My 3 7H12s have come from a Reddit user, 1 of my EPYCs came from a friend (BOINC team member), and the other 3 I got on eBay (7V12). All have worked without issue.

As for your CPU max Mhz. Check the temps on the VRMs. If they get too hot they will throttle the CPU. Since these boards are designed for server chassis with high flow fans blowing across the entire board it means they don't get adequate cooling when placed on a bench or typical PC case. For my 2P SM board I bought 2x 40mm Noctua fans that I zip tied to the heatsink to give them additional cooling which helped a good bit.
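One way to check those temperatures from the OS is through the BMC. Below is a minimal Python sketch that parses the output of `ipmitool sdr type Temperature`. It assumes the usual five-column, pipe-separated output, and that the VRM sensors have "VRM" somewhere in their name, which varies by board. All function names and the sample sensor names in the comments are hypothetical.

```python
import shutil
import subprocess


def parse_sdr_temperatures(sdr_output: str) -> dict:
    """Map sensor name -> degrees C from 'ipmitool sdr type Temperature' text.

    Expects lines like: 'CPU Temp | 01h | ok | 3.1 | 45 degrees C'.
    Sensors without a numeric reading (e.g. 'No Reading') are skipped.
    """
    readings = {}
    for line in sdr_output.splitlines():
        fields = [field.strip() for field in line.split("|")]
        if len(fields) != 5:
            continue
        name, reading = fields[0], fields[4]
        if reading.endswith("degrees C"):
            readings[name] = float(reading.split()[0])
    return readings


def vrm_temperatures(readings: dict) -> dict:
    """Keep only sensors whose name contains 'VRM' (naming is board-specific)."""
    return {name: temp for name, temp in readings.items() if "VRM" in name.upper()}


def read_bmc_temperatures() -> dict:
    """Query the local BMC; needs ipmitool installed and usually root."""
    if shutil.which("ipmitool") is None:
        raise RuntimeError("ipmitool is not installed")
    result = subprocess.run(
        ["ipmitool", "sdr", "type", "Temperature"],
        capture_output=True, text=True, check=True,
    )
    return parse_sdr_temperatures(result.stdout)
```

Calling `vrm_temperatures(read_bmc_temperatures())` while a stress test is running should show whether the VRMs are climbing while the CPU temperature itself still looks fine.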

I bought my RAM from mem-store though.
 
I've bought many things from tugm4470 myself. All EPYC related. One of the motherboards I bought from them had a bad memory slot. I contacted them and they immediately wanted me to do troubleshooting. To which I simply told them I had 5 other EPYC systems with the same RAM that all worked and even took RAM out of those systems to test the system and all experienced the same thing.

They sent me a replacement with no further questions.

That is great. They may just change my mind when it comes to bad experiences with Chinese vendors.

Though they are still just a PLA military intelligence officer's order away from flashing some funny business to the firmware on anything they send you...
As for your CPU max Mhz. Check the temps on the VRMs. If they get too hot they will throttle the CPU. Since these boards are designed for server chassis with high flow fans blowing across the entire board it means they don't get adequate cooling when placed on a bench or typical PC case. For my 2P SM board I bought 2x 40mm Noctua fans that I zip tied to the heatsink to give them additional cooling which helped a good bit.

Good suggestion. I'm pretty sure the front to back airflow in my server case will be sufficient (3x 120mm Noctua Industrial PPc 3000rpm variant sees to that), but if worse comes to worse I can always direct some extra airflow on the VRM's.

I'm betting that is what that weird heatsink near the power connector is for.
 
Here's hoping these guys make good cables...

1703721062601.png

I would hate to have a repeat of this:

1703721127096.png

1703721155855.png
 
That is great. They may just change my mind when it comes to bad experiences with Chinese vendors.

Though they are still just a PLA military intelligence officer's order away from flashing some funny business to the firmware on anything they send you...

My pi-hole should block any communications to them. If not, I will catch it and block it.

Good suggestion. I'm pretty sure the front to back airflow in my server case will be sufficient (3x 120mm Noctua Industrial PPc 3000rpm variant sees to that), but if worse comes to worse I can always direct some extra airflow on the VRM's.

I'm betting that is what that weird heatsink near the power connector is for.

Yes, that heatsink next to the power connector is the VRMs.
 
Took longer than I expected, but the new server has been in place for two days now.

Here is the last pic I took before buttoning it up and sticking it back in the rack:

1704075952199.png

I'm loving the end result, but I ran into a number of bumps along the way, which are finally resolved.

So, rather than decommissioning the old server, I transplanted it into my "testbench" machine, which I keep around for imaging drives, flashing firmware to boards, etc. etc. Stuff I want to be able to do separately from either my desktop or server.

The many PCIe lanes and ECC RAM will come in handy in that role, especially since I often use ZFS on it for redundancy.

1704076095720.png

The Enthoo Pro case is awesome, and fits this massive SSI EEB form factor board. I decided to swap out the noisy fans that came with the Supermicro 4U coolers (92mm Nidec UltraFlo's) with a set of Noctua's that are friendlier to the ears when in my office. Thus far they seem to adequately keep the CPU's cool. I mean, they are "only" 95W each, so that is pretty easy by modern standards.

Seen here in its natural habitat with three hot swap 3.5" drives and six hot swap 2.5" drives. It also fits an additional 6 3.5" drives on the inside, where I have a small (relatively speaking) ZFS pool I use to temporarily store drive images, etc.

1704076400613.png

1704076431744.png

Two 8C/16T Ivy Bridge Xeon E5-2650 V2's with 256GB of ECC RAM. This would have been quite the workstation back when it was new :p

But that was a long time ago, as evident by this following screenshot:

1704076685646.png

Also, here's a reminder that if you install Windows (or simply move a Windows install from an older system) on a system with a large amount of RAM, unless you have a corresponding very large drive, Windows WILL take over your entire drive with hiberfil.sys :p

256GB of RAM = 256GB of hiberfil.sys + swapfile on a 400GB (~380GB available) Intel SSD750 PCIe drive, which I partitioned 100GB for Linux and 280GB for Windows. That doesn't leave much free :p (The SSD750 is the only NVMe drive I've ever found with an OPROM that loads during POST and allows you to boot on non-NVMe aware motherboards.)

"Why is the drive full? I don't remember storing stuff on this drive or filling it with programs... ...oh"

And now we have disabled hibernation and swap. We won't get that fancy fast booting hibernation stuff, but I don't care.

The only question now is what I do with the old Sandy Bridge-E x79 Workstation board I was using in the testbench? It has been with me since 2011. I almost get a little misty-eyed at the thought of it no longer being in service somehow.

1704076749512.png

It was my main desktop from 2011 to 2019, when under water it would hit a heavy overclock of 4.8GHz. I bought it when Bulldozer sucked at launch, and used it until I upgraded to my Threadripper in 2019. Then it went in the testbench, where it has been enjoying a lighter retirement load since.
 
Just for fun, to locally access the BMC web interface (not that I need to, especially when I get more features with ipmitool.)
Yet if they were on the same switch.... you could do the same thing... But consume more ports... so that would be a pain.
 