Thanks for the reply.
You say the largest concern would be cost.
If I buy a ToughArmor MB873MP-B (750 EUR for one enclosure), I need two 16i cards (750 EUR each for a 95xx) to attach eight NVMes. That is 2250 EUR in total, or 281.25 EUR per NVMe. That is how it is today.
If I could attach eight NVMes in x2 mode, I would only need one 16i card. That would be 1500 EUR, or 187.50 EUR per NVMe.
If I could attach eight NVMes in x1 mode, I would only need an 8i card. That would be around 1000 EUR, or 125 EUR per NVMe.
If I could attach sixteen NVMes with two enclosures and one 16i card, it would cost 2250 EUR, or 140.63 EUR per NVMe.
All prices are without NVMes and cables.
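To make the numbers easy to check, here is a quick Python sketch (the ~250 EUR for an 8i card is just my rough assumption, derived from the ~1000 EUR figure above):

# Rough cost-per-drive comparison using the example prices from above:
# 750 EUR per MB873MP-B enclosure, 750 EUR per 16i HBA, ~250 EUR assumed for an 8i HBA.
setups = {
    "8 drives in x4, two 16i cards (today)":    (750 + 2 * 750, 8),
    "8 drives in x2, one 16i card":             (750 + 750, 8),
    "8 drives in x1, one 8i card":              (750 + 250, 8),
    "16 drives in x2, two enclosures, one 16i": (2 * 750 + 750, 16),
}
for name, (total, drives) in setups.items():
    print(f"{name}: {total} EUR total, {total / drives} EUR per NVMe")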
This price difference is massive. But let's have a closer look at your arguments, using the MB873MP-B as an example.
First topic: Marketing and what your customers need.
Well, in today's world, you can win customers with more than just the "bigger/faster" argument. What I suggest is definitely not a feature cut, so the old arguments (64 GB/s with 8x4) for the product stay untouched.
What I suggest extends your product with a "more efficiency" argument. And these are arguments customers also care about today (less power consumption, less investment, a smaller system and infrastructure, less cooling).
I think there is a market for this feature once we go through the arguments. I also think the price increase for your product would be nonexistent (because of possibly higher sales) or very low, but I will discuss this below as well.
Second topic: Speed.
The main limiting factor is the flash itself, not the NVMe controller. So I think a good NVMe controller can load-balance across all four PCIe lanes.
Scenario 1:
If you attach a drive with PCIe Gen 4 x4, you could read 8 GB/s and write 8 GB/s at the same time.
This is the scenario if you want maximum speed and less capacity, for example if you need a lot of IOPS or read/write many large data streams at the same time.
Looking at the market today, there are some Gen 4 and Gen 5 M.2 NVMes which can transfer >=7 GB/s: https://geizhals.de/?cat=hdssd&xf=221_7000~222_7000~4832_3~7127_40.04~7127_50.04
But if you have a closer look at these NVMes, you will see that
a. Some come with a big heatspreader, so they do not fit into your products.
b. These are consumer SSDs. Do I have to say more (caching, speed drops, low TBW, inferior cell technology, ...)?
So there is no realistic scenario in which you can reach the speed of x4. The result is wasted PCIe lanes.
Scenario 2:
If you attach a drive with PCIe Gen 4 x2, you could read 4 GB/s and write 4 GB/s at the same time.
Of course, you can find a lot of M.2 NVMes which reach this speed. So, in theory, you could utilize the complete PCIe bandwidth. This means you would not waste PCIe lanes.
Arguments a and b from scenario 1 do not apply here. You can find enough SSDs which fit into your NVMe trays, both consumer and professional models.
Scenario 3:
If you attach a drive with PCIe Gen 4 x1, you could read 2 GB/s and write 2 GB/s at the same time.
Is this a questionable scenario? Not if 2 GB/s per NVMe fits your use case.
This would be an ideal setup for a network-attached storage system, such as a software- or hardware-based RAID.
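For reference, the numbers in all three scenarios follow directly from the per-lane rate of PCIe Gen 4. A small Python sketch (raw link rate after 128b/130b encoding; real throughput is a bit lower because of protocol overhead):

# PCIe Gen 4: 16 GT/s per lane, 128b/130b encoding -> ~1.97 GB/s per lane per direction
gen4_lane_gbps = 16e9 * 128 / 130 / 8 / 1e9

for lanes in (4, 2, 1):
    per_drive = lanes * gen4_lane_gbps
    print(f"x{lanes}: ~{per_drive:.1f} GB/s per drive, ~{8 * per_drive:.0f} GB/s for eight drives")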
There is a third argument, which applies to all three scenarios:
A product like the MB873MP-B might have a good cooling system, but it is not good enough to keep very fast NVMes cool enough to prevent throttling. Maybe it is possible in a data center, but not in a typical environment at natural ambient temperature.
Based on user reports I found online, the cooling of this product is good enough for energy-optimized NVMes (which are typically used in mobile applications like laptops or office PCs). But fast SSDs that sustain 2 GB/s or more over a longer time always need really good cooling; otherwise they will throttle. And throttling is absolutely not what we want in an 8x4 NVMe enclosure, right?
This argument weighs heaviest in scenario 1 and least in scenario 3.
So the thermal aspect is one more argument against x4. You cannot simply benefit from the full speed of Gen 4 x4, because the drives will get way too hot under sustained workloads.
Some thoughts about what is needed to utilize 64 GB/s in theory:
Let's assume you build a system with 8x4 NVMe. You would definitely need two RAID controllers to transfer 64 GB/s, because 32 GB/s is the maximum of a Gen 4 x16 slot. You could transfer 64 GB/s over a Gen 5 x16 slot, but at the moment there are no Gen 5 NVMe controllers on the market.
So we need two Gen 4 x16 slots to theoretically transfer 64 GB/s. A consumer platform like Ryzen comes with only 24 lanes, so you definitely need a workstation- or server-class platform. And even if you have enough lanes in one system, what would you do with a theoretical 64 GB/s data stream?
HighPoint's latest controllers can handle eight NVMes per card with a Gen 4 x16 host interface. Broadcom's latest 96xx platform is available with x16, but mostly with x8 host interfaces. I think there is a reason why the ratio of host-interface lanes to NVMe lanes is 1:2.
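A small Python sketch of why that 1:2 ratio makes sense, assuming roughly 2 GB/s per Gen 4 lane:

# Oversubscription: eight Gen 4 x4 drives behind one Gen 4 x16 host slot
gen4_lane = 2.0                   # ~GB/s per lane per direction (rounded)
drive_side = 8 * 4 * gen4_lane    # 64 GB/s of downstream drive bandwidth
host_side = 16 * gen4_lane        # 32 GB/s through the x16 host slot
print(f"drive side: {drive_side:.0f} GB/s, host side: {host_side:.0f} GB/s, "
      f"oversubscription {drive_side / host_side:.0f}:1")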
Balancing the components is the key to more efficiency. You can't put the biggest engine (NVMe) into your car if your radiator (enclosure) is too small. You can't bring horsepower to the street if your gearbox is too weak or your tires are too thin.
Finally, let's have a look at speed efficiency and realistic use cases, leaving the thermal aspect aside:
In scenario 1, it is simply not possible to utilize 8x4 NVMes anywhere near 100% in one system. I would say it is not even possible to utilize such a setup at 50%, which would be the same as an 8x2 NVMe setup. You will waste a lot of resources if you build an 8x4 setup.
But I also question whether it is realistic to fully utilize 8x2 (32 GB/s). Remember, this is equal to the bandwidth of a current-generation graphics card.
Let's assume we build a file server with 8x1 attached NVMes. Typically you use some kind of RAID (software or hardware controlled). If you combine all drives, you could reach 8 x 2 GB/s, which is 16 GB/s. You would need a 200G network interface (~25 GB/s) to serve this bandwidth without bottlenecks, but a 100G network interface at 12.5 GB/s would also be fine.
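A quick Python sketch of that file-server math (GB/s on the wire, ignoring protocol overhead):

# Eight x1 drives combined in a RAID vs. typical NIC speeds
drives = 8
per_drive = 2.0                 # GB/s per Gen 4 x1 drive (rounded)
array_bw = drives * per_drive   # 16 GB/s aggregate
for nic_gbit in (100, 200):
    nic_bw = nic_gbit / 8       # GB/s on the wire
    print(f"{nic_gbit}G NIC: {nic_bw:.1f} GB/s vs. {array_bw:.0f} GB/s from the array")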
For me, 8x1 or 8x2 sounds like a more balanced and realistic use case, with less waste of resources and money.
Third topic: Pricing.
Needed materials:
OK, if we extend the PCB from 8x4 only to 8x4, 8x2 and 8x1, you need some ICs and SMD parts, because this cannot be solved with a plain PCB without such parts.
If you design a simple circuit, the most expensive part would be the RefClock buffer with eight outputs. One RefClock buffer with 8 outputs costs about 3.50 EUR each if you buy 450 pieces here: https://mou.sr/40Ncldk (it's just an example; you could of course use a RefClock buffer from TI instead).
What else do you need for a simple circuit? Maybe some SMD resistors and coils for the power supply of the RefClock buffer, which cost a few cents. And maybe your PCB will need more than two layers from now on.
I agree that some logic to switch between x4/x2/x1 mode is needed, but I think you do not need very complex circuits and ICs to solve this problem.
I know, I know, there are also logistics, margin, handling, QC and so on. But I really cannot see huge costs here, even if the circuit is more complex.
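A rough back-of-the-envelope in Python for the added BOM cost per enclosure (the figures for the passives and the mode logic are purely my assumptions):

# Rough added BOM cost per enclosure, based on the figures above
refclock_buffer = 3.50   # EUR, 8-output RefClock buffer at ~450-piece pricing
passives        = 0.50   # EUR, assumed: resistors/coils for the buffer supply
mode_logic      = 1.00   # EUR, assumed: DIP switch plus simple glue logic
print(f"added BOM: ~{refclock_buffer + passives + mode_logic:.2f} EUR per enclosure")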
Additional costs to produce a more complex PCB:
I am not familiar with electronics production, but looking at the market, where much more complex PCBs are available very cheaply, I would guess that the kind of modification we are discussing here would not raise the price of the product by much.
Complexity of RefClock:
To illustrate the complexity for the audience here, let's have a look at this development card from Texas Instruments. You will find the board layout for 4x4 on page 6: https://www.ti.com/lit/ug/snlu299/snlu299.pdf?ts=1681120395353&ref_url=https%253A%252F%252Fwww.google.com%252F
This layout is so simple that even people who are not familiar with circuits can understand it. The only thing that makes our case more complex is that you would need eight RefClock signals instead of four - one for each drive.
Yes, this is a 4x4 bifurcation card and what we would like to do is a bit different, but again, it is just to illustrate how PCIe works.
Fourth topic: Engineering.
I agree that you need some engineering to design a PCB which supports x4, x2 and x1 mode in one part. But I think this is manageable.
Here is a concept:
1. Add a RefClock buffer with eight outputs to the PCB.
2. Attach each output to one drive bay.
3. Add eight switches between the RefClock outputs and the NVMe bays to activate or deactivate the clock signal as needed.
4. Add a DIP switch with three positions and some IC to control the switches from point 3 (see the small sketch below).
5. Maybe some error handling is needed to make it compatible with a wider range of controllers. I think your engineers will know better what to do.
6. Provide cables to connect the enclosure to the controller in x4 (1:1, as available today), x2 mode (Y-style cable) or x1 mode (1:4 cable).
Points 3-5 may be a bit too simplistic, but they illustrate the concept.
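Just to make the concept more concrete, here is a small Python model of how I imagine the mode logic; the names and the clock-routing details are purely my assumptions, your engineers will know the real requirements:

# Conceptual model of points 3-5, not real firmware: which lane width each bay gets
# and where its RefClock would come from in each mode.
BAYS = 8

def bay_config(mode):
    lanes_per_bay = {"x4": 4, "x2": 2, "x1": 1}[mode]
    config = []
    for bay in range(BAYS):
        # Assumption: a bay that starts a host x4 connection reuses the RefClock that
        # comes over the cable; the bays sharing that connection in x2/x1 mode get
        # their clock from the onboard 8-output buffer instead.
        host_port = bay * lanes_per_bay // 4
        shares_connection = (bay * lanes_per_bay) % 4 != 0
        config.append((bay, lanes_per_bay, host_port,
                       "buffer" if shares_connection else "host connector"))
    return config

for bay, lanes, port, clock in bay_config("x2"):
    print(f"bay {bay}: x{lanes}, host x4 port {port}, RefClock from {clock}")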