Feature Request for next generation NVME products

Layer8 2023-04-24 21:14:53 UTC #1

Hello all,

at the moment, all/most products, sockets M.2/3, connectors (U.2/U3 or most SFF-types) on the market are designed to attach a single NVME device with four lanes. Each PCIe 4.0 lane can transfer 2GB/s netto in one direction, which is 8GB/s read and 8GB/s write at the same time per device. Even the most expensive flash based professional data center SSDs cant transfer this rates. Typically, each PCIe generation is twice as fast as the prvevous one, which means that PCIe gen 5 can transfer 4GB/s per lane in one direction.

If you think about it, you may ask yourself why is there such a waste of lanes if one gen 4 lane would be enough for a powerful consume PC? I dont know, but I would like to discuss this topic from a technical point of view and place a feature request.

The best and easiest method would be, if mainboard producers would support x1x1x1x1... bifurcation, which would make it possible to attach up 16 NVMEs to one physical x16 slot (or 8 NVMEs if you would configure x2x2x2... ). The thing is, that no mainboard manufacturer supports this mode. Maybe because its not part of the PCIe-standard, maybe because there are no products out there who would use this kind of bifurcation, maybe its not support in the provided bios from AMD and Intel or maybe CPUs are not made to divide their lanes into x1.

But Broadcoms controllers like 9400 and 9500 series (which are both Tri-Mode) chipsets are designed to support physical x1, x2 and x4 connected NVME devices. They even can allocate up to 32 logical NVMEs.

Without any active silicon, its not possible to attache 32NVMEs to one 9x00-16i controller. This is only possible if you have a professional storage enclosure with active backplanes with PLX chips (PCIe-Switches) on it. But thats not a solution for consumer or semi professional systems, because its to expensive and you need a lot of cooling power.

But its possible to connect 16 NVMEs to one 9x00-16i card, because it provides 16 PCIe lanes over typically four SlimSAS connectors. And thats the ultimate goal: Please make it possible to connect
- 16 NVME to one 16i card (x1 mode for each NVME; of course there is no product at the moment which can hold 16 NVMEs) or
- 8 NVME to one 16i card (x2 mode fore each NVME)
- 8 NVME to one 8i card (x1 mode for each NVME)
- 4 NVME to one 8i card (x2 mode)
and so on.

I know, that this cant be done with a simple cable with one SlimSAS connector on the one and four U.2 connectors on the other end. The reason is, that each PCIe device needs its own refclock signal. And this singla is only availe once on a SlimSAS connector. So you would need a refclock buffer chip in the cable, which multiplies the single refclock signal to four individual signals. This is typically nothing which is build into a cable, but on a PCB.

So, here is my feature request to ICY DOCK:

Please make it possible to operate NVMEs in you products with x4, x2 or x1 mode.

This could be done this way:
1. Please add refclock buffers to your PCBs to multiply the refclock signal so that each single NVME has its own individual refclock signal.
2. Please provide cables which makes it possible to connect the NVMEs with one, two or four lanes like:

one slimsas 4i to one slimsas 4i (current state)
one slimsas 4i to two slimsas 4i
one slimsas 4i to one slimsas 8i (2x2 mode)
one slimsas to two slimsas 8i (4x1 mode)

This technology would drasticaly improve the ratio of controller to NVMEs, which will result in:
- smaller builds
- cooler builds
- cheaper builds
- less power consumption
- maximize the total NVME space in one build
- its more failsafe because of more NVMEs
- better load balancing
- more IOPS

Icydock_Tech 2023-04-24 22:52:29 UTC #2

Hello Layer8

Thank you for your suggestion. While your suggestion may be useful to specific users of PCIe drive products, we do have two major concerns that would give us pause of going with your suggestions.

The largest concern would be cost, as the adding of additional features increase the costs of our products. This would also be compounded by the audience size demanding for this kind of feature. We do not foresee a large audience warranting this specific feature on our products. A majority of customers want to operate drives at full speeds and not deal with reducing amount of PCIe lanes per drives.

Another significant concern would be compatibility to all other devices our products would be connecting to. The added complexity of having chipsets with refclock buffers can have unforeseen conflicts between connecting our products to controller host cards of various types. The management of connections might be split between devices or if the conflict is really bad, results into a complete non-connection.

We do appreciate the suggestion you have provided and I will pass along to our engineering to consider. However, do keep in mind that the likelihood of this specific suggestion becoming a part of our products is slim. If you do have any additional product suggestions that you want us to look into, please let us know.

Layer8 2023-04-29 12:50:39 UTC #3

Thanks for reply.

You say the largest concern would be cost.

If I would buy a ToughArmour MB873MP-B (750EUR for one enclosure), I would need two 16i cards (750EUR for a 95xx) to attache aight NVMEs. This would be 2250EUR in total or 281,25EUR cost per NVME. Thats how it is today.

If I could attache aight NVMEs in x2 mode, I would only need one 16i card. This would be 1500EUR or 187,5EUR per NVME.
If I could attache aight NVMEs in x1 mode, I would only need a 8i card. This would be like 1000EUR or 125EUR per NVME.
If I could attach sixteen NVMEs with two enclosures and one 16i card, it would cost 2250EUR or 140,63EUR per NMVME.

All prices are without NVMEs and cables.

This price difference is massive. But lets have a closer look on your arguments using the example of MB873MP-B.

First topic: Marketing and what your customers need.

Well, in todays world, you can win customers not only with the "bigger/faster" argument. What I suggest is definetly no feature cut, so the old aguments (64GB/s with 8x4) for the product will stay untouched.

What I suggest extends your product by the "more efficeny" argument. And these are arguments customers care today also (less power consumption, less investment, less size of the needed system and infrastructure, less cooling needed).

I think there is a market for this feature if we discuss the arguments. Also, I think that the price increase of your product would be non existing (because of possible higher sales) or very low, but I will also discuss this below.

Second topic: Speed.

The main impact of is the flash itself, not the NVME controller. So, I think that a good NVME controller can loadbalance between all four PCIe lanes.

Scenario 1:

If you attach a drive with PCIe Gen 4 x4, you could read 8GB/s and write 8GB/s at the same time.
This is the scenario if you would like to reach maximum speed and less capacity. Maybe if you have a lot of IOPS or if you read/write a lot of big data streams at the same time.

When we look on the market today, there are some Gen 4 and 5 M.2 NVMEs which can transfer >=7GB/s: https://geizhals.de/?cat=hdssd&xf=221_7000~222_7000~4832_3~7127_40.04~7127_50.04
But if you have a closer look on these NVMEs, you will see that
a. Some come with a big headspreader so you cant attache these products into your productcs
b. These are consumer SSDs. Do I have to say more (caching, speed drop, low TBW, bad cell technology, ...)?

So, there is no realistic scenarion that you can reach the speed of x4. The result is waste of PCIe lanes.

Scenario 2:

If you would attache a drive with PCIe Gen 4 x2, you could read 4GB/s and write 4GB/s at the same time.

Of course, you can find a lot of M.2 NVMEs which can reach this speed. So, in theory, you could utilize the complete PCIe bandwith. This means, you would not waste PCIe lanes.

Argument a and b from scenario 1 is not applicable here. You can find enough SSDs which fit into your NVME trays and you can find enough consumer and professional NVMEs.

Scenario 3:

If you would attach a drive with PCIe Gen x1, you could read 2GB/s and write 2GB/s at the same time.

Is this a questionable scencario? Only if 2GB/s per NVME fits your use case.
This would be a ideal setup for a network attached storage system like a software or hardware based raid system.

There is a third argument, which is applicable for all three scenarios:

Your product like MB873MP-B might have a good cooling system, but its noot good enough to keep very fast NVMEs cool enough to prevent throttling. Maybe its possible in a DataCenter, but never in a typical environment with natural ambient temperature.

Based on user stories I found online, cooling of this product is good enough for energy optimized NVMEs (which are typically used in mobile applications like laptops or office PCs). But fast SSDs, which can reach 2GB/s or more over a longer time, allways need realy good cooling. Otherwise they will throttle. And throttling is absolutely not what we would like when in a 8x4 NVME enclosure, right?

This argument is bigger on scenario 1 and smaller on scenario 3.

So, the thermal aspect is one more argument againts x4. You cant simply benefit of the full speed of Gen 4 x4 because drives will get way to hot under workloads.

Some thoughts about what is needed to utilize 64GB/s in theory:

Lets assume you build a system with 8x4 NVME. You would definetly need two raid controllers to transfer 64GB/s, because 32GB/s is the maximum on a Gen 4 x16 slot. You can transfer 64GB/s over a Gen 5 x16 slot, but at the moment, there are no Gen 5 NVME controllers on the market.

So, we need two Gen 4 x16 slots to theoretically transfer 64GB/s. A consumer platform like Ryzen is equiped with only 24Lanes. You defenetly need a workstation or server class platform. But what will you do with a theoretically 64GB/s datastream if you have enough lanes in one system?

Highpoints latest controllers can handle aight NVMEs per card with a Gen 4 x16 host interface. Broadcoms latest 96xx platform is available with x16 but mostly with x8 host interfaces. I think there is a reason, why host interface and NVME lanes is 1:2.

Balancing the components is the key to more efficeny. You cant build the biggest engine (NVME) into you car if your cooler (enclosure) is to small. You cant bring horse power to the street, if your gear is to small or the tires are to thin.

At last, lets have a look on speed efficeny and realistic use cases, leaving the thermal aspect aside:

In scenario 1, its simply not possible to utilize 8x4 NVMEs near 100% in one system. I would say its even not possible to utilize such a setup with 50%, which is the same like a 8x2 NVME setup. You will waste a lot of resources if you bild a 8x4 setup.

But I also ask if its a realistic scenario to utilize 8x2 (32GB/s). Remember, this is equal to the bandwith of a newer generation graphic card.

Lets assume we build a fileserver with 8x1 attached NVMEs. Typically, you use some kind of raid (software or hardware controlled). If you combine all drives, you could reach 8x2GB, which will be 16GB/s. You would need a 200G network interface (~25GB/s) to serve this bandwith without bootlenecks, but a 100G network interface with 12,5GB/s would also be fine.

For me, 8x1 or 8x2 sounds like a more balanced and realistic use case, with less waste of ressources and money.

Third topic: Pricing.

Needed materials:

OK, if we extend the PCB from only 8x4 to 8x4, 8x2 and 8x1 you need some ICs and SMD parts, because this cant be solved with a simpley PCB without these parts.

If you design a simple circuit, the most expensive part would be the RefClock buffer with aight outputs. One refclock buffer with 8 outputs costs like 3,50EUR each if you buy 450pieces here: https://mou.sr/40Ncldk (its just a example, you can of course use some RefClock buffers from TI).
What do you need more for a simple circuit? Maybe some SMD resistores and coils for the power supply of the RefClock buffer, which cost some cent. And maybe your PCB need more than two layers from now on.

I agree, that some logic to switch between x4/x2/x1 mode is needed, but i think you dont need very complex circuits and IC to solve this problem.

I know I know, there is also logistic, margin, handling, QC and so one. But I realy cant see huge costs here, even if the circuit is more complex.

Addition costs to produce a more complex PCB:

I am not familiar with electronic production, but if I look on the market and see that much more complex PCBs are available for very cheap money, I guess that this kind of modification we discuss here would not raise the price of the product.

Complexity of RefClock:

To illustrate the complexity for the audience here, lets have a look at this developer card from Texas Instrument. You will find the board Layout for 4x4 on page 6: https://www.ti.com/lit/ug/snlu299/snlu299.pdf?ts=1681120395353&ref_url=https%253A%252F%252Fwww.google.com%252F
This layout is so easy, even people who are not familiar with circuits can understand it. The only thing which makes it more complex is, that you would need aight instead of four RefClock signals - one for each drive.
Yes, this a 4x4 bifurcation card and what we would like to do is a bit different, but again, its just to illustrate how PCIe is working.

Fourth topic: Engineering:

I agree, that you need some engineering to design a PCB which supports x4, x2 and x1 mode in one part. But I think this is manageable.

Here is a concept:
1. Add a RefClock buffer with aight outputs to the PCB.
2. Attache each outpout to one drive bay.
3. Add aight switches between the RefClock output and the NVME bay, to activate or deactivate the clock signal if needed.
4. Add a dip switch with three positions and some IC to controle the switches from point 3.
5. Maybe, some error interception is needed to make it compatible to a wider range of controllers. I think your enginees will know it better what to do
6. Provide cables to connect the enclosure in x4 (1:1, still available today), x2 (Y-like cable) or x1 mode (1:4 cable) to the controller.

Points 3 - 5 may be a bit too simplistic, but they illustrate the concept.

Layer8 2023-05-14 09:11:00 UTC #4

Regarding this topic, this is an very interesting video: https://www.youtube.com/watch?v=IFwFDZDQWAA

Linus is showing a APEX card, which can handle 21 Gen4 NVME SSDs with a single up to date PCIe Gen 4 switch on it. It features up to 100 non blocking PCIe 4.0 lanes and a PCIe 4.0 x16 host interface.

Product page of the card: https://www.apexstoragedesign.com/apexstoragex21
Product page of the switch: https://www.microchip.com/en-us/products/interface-and-connectivity/pcie-switches
Switch datasheet: https://ww1.microchip.com/downloads/en/DeviceDoc/00002987.pdf

So, its a professional chipset which I think can nearly reach theoretical bandwith of 64GB/s in a real world test setup . But maximum speed they reach in the video is "only" 25GB/s which is the limit of the CPU.

Of course, at this point we can discuss the sense of the test setup, because they used a up to date ryzen consumer system and not a Threadripper or EPYC system. But I think its demonstrating whats needed to reach even half bandwidth of a PCIe 4.0 x16 slot.