<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Hacker News: kmxdm</title><link>https://news.ycombinator.com/user?id=kmxdm</link><description>Hacker News RSS</description><docs>https://hnrss.org/</docs><generator>hnrss v2.1.1</generator><lastBuildDate>Thu, 09 Apr 2026 14:17:50 +0000</lastBuildDate><atom:link href="https://hnrss.org/user?id=kmxdm" rel="self" type="application/rss+xml"></atom:link><item><title><![CDATA[New comment by kmxdm in "SSDs have become fast, except in the cloud"]]></title><description><![CDATA[
<p>Just for fun, I ran the same workload on a locally-attached Gen4 enterprise-class 7.68TB NVMe SSD on "bare metal" (my home i9 system, which has an E-core/P-core split, hence the added cpus_allowed):<p><pre><code>  sudo fio --name=read_iops_test   --filename=/dev/nvme0n1 --filesize=1500G   --time_based --ramp_time=1s --runtime=15s   --ioengine=io_uring --fixedbufs --direct=1 --verify=0 --randrepeat=0   --bs=4K --iodepth=256 --rw=randread   --iodepth_batch_submit=256  --iodepth_batch_complete_max=256 --cpus_allowed=0-7
  read_iops_test: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=256
  fio-3.28
  Starting 1 process
  Jobs: 1 (f=1): [r(1)][100.0%][r=6078MiB/s][r=1556k IOPS][eta 00m:00s]
  read_iops_test: (groupid=0, jobs=1): err= 0: pid=11085: Wed Feb 21 08:57:35 2024
    read: IOPS=1555k, BW=6073MiB/s (6368MB/s)(89.0GiB/15001msec)
      slat (nsec): min=401, max=93168, avg=7547.42, stdev=4396.47
      clat (nsec): min=1426, max=1958.2k, avg=154599.19, stdev=92730.02
       lat (usec): min=56, max=1963, avg=162.15, stdev=92.68
      clat percentiles (usec):
       |  1.00th=[   71],  5.00th=[   78], 10.00th=[   83], 20.00th=[   92],
       | 30.00th=[  100], 40.00th=[  111], 50.00th=[  124], 60.00th=[  141],
       | 70.00th=[  165], 80.00th=[  200], 90.00th=[  265], 95.00th=[  334],
       | 99.00th=[  519], 99.50th=[  603], 99.90th=[  807], 99.95th=[  898],
       | 99.99th=[ 1106]
     bw (  MiB/s): min= 5823, max= 6091, per=100.00%, avg=6073.70, stdev=47.56, samples=30
     iops        : min=1490727, max=1559332, avg=1554866.87, stdev=12174.38, samples=30
    lat (usec)   : 2=0.01%, 10=0.01%, 20=0.01%, 50=0.01%, 100=30.18%
    lat (usec)   : 250=58.12%, 500=10.55%, 750=1.00%, 1000=0.13%
    lat (msec)   : 2=0.02%
    cpu          : usr=25.41%, sys=74.57%, ctx=2395, majf=0, minf=58
    IO depths    : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=100.0%
       submit    : 0=0.0%, 4=5.7%, 8=14.8%, 16=54.8%, 32=24.3%, 64=0.3%, >=64=0.1%
       complete  : 0=0.0%, 4=2.9%, 8=13.0%, 16=56.9%, 32=26.8%, 64=0.3%, >=64=0.1%
       issued rwts: total=23320075,0,0,0 short=0,0,0,0 dropped=0,0,0,0
       latency   : target=0, window=0, percentile=100.00%, depth=256
  
  Run status group 0 (all jobs):
     READ: bw=6073MiB/s (6368MB/s), 6073MiB/s-6073MiB/s (6368MB/s-6368MB/s), io=89.0GiB (95.5GB), run=15001-15001msec
  
  Disk stats (read/write):
    nvme0n1: ios=24547748/0, merge=1/0, ticks=3702834/0, in_queue=3702835, util=99.35%
</code></pre>
And then again with the rate limited to ~534k IOPS (~2GB/s):<p><pre><code>  sudo fio --name=read_iops_test   --filename=/dev/nvme0n1 --filesize=1500G   --time_based --ramp_time=1s --runtime=15s   --ioengine=io_uring --fixedbufs --direct=1 --verify=0 --randrepeat=0   --bs=4K --iodepth=256 --rw=randread   --iodepth_batch_submit=256  --iodepth_batch_complete_max=256 --cpus_allowed=0-7 --rate_iops=534000
  read_iops_test: (g=0): rw=randread, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=io_uring, iodepth=256
  fio-3.28
  Starting 1 process
  Jobs: 1 (f=1), 0-534000 IOPS: [r(1)][100.0%][r=2086MiB/s][r=534k IOPS][eta 00m:00s]
  read_iops_test: (groupid=0, jobs=1): err= 0: pid=11114: Wed Feb 21 08:59:30 2024
    read: IOPS=534k, BW=2086MiB/s (2187MB/s)(30.6GiB/15001msec)
      slat (nsec): min=817, max=88336, avg=41533.20, stdev=7711.33
      clat (usec): min=7, max=485, avg=93.19, stdev=39.73
       lat (usec): min=65, max=536, avg=134.72, stdev=37.83
      clat percentiles (usec):
       |  1.00th=[   32],  5.00th=[   41], 10.00th=[   47], 20.00th=[   59],
       | 30.00th=[   70], 40.00th=[   79], 50.00th=[   89], 60.00th=[   98],
       | 70.00th=[  110], 80.00th=[  122], 90.00th=[  145], 95.00th=[  167],
       | 99.00th=[  217], 99.50th=[  235], 99.90th=[  277], 99.95th=[  293],
       | 99.99th=[  334]
     bw (  MiB/s): min= 2084, max= 2086, per=100.00%, avg=2086.08, stdev= 0.38, samples=30
     iops        : min=533715, max=534204, avg=534037.57, stdev=97.91, samples=30
    lat (usec)   : 10=0.01%, 20=0.04%, 50=12.42%, 100=49.30%, 250=37.97%
    lat (usec)   : 500=0.28%
    cpu          : usr=11.48%, sys=27.35%, ctx=2278177, majf=0, minf=58
    IO depths    : 1=0.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.1%, >=64=100.0%
       submit    : 0=0.0%, 4=0.4%, 8=0.2%, 16=0.1%, 32=0.1%, 64=0.1%, >=64=99.3%
       complete  : 0=0.0%, 4=95.4%, 8=4.5%, 16=0.1%, 32=0.1%, 64=0.1%, >=64=0.0%
       issued rwts: total=8009924,0,0,0 short=0,0,0,0 dropped=0,0,0,0
       latency   : target=0, window=0, percentile=100.00%, depth=256

  Run status group 0 (all jobs):
     READ: bw=2086MiB/s (2187MB/s), 2086MiB/s-2086MiB/s (2187MB/s-2187MB/s), io=30.6GiB (32.8GB), run=15001-15001msec

  Disk stats (read/write):
    nvme0n1: ios=8543389/0, merge=0/0, ticks=934147/0, in_queue=934148, util=99.33%
</code></pre>
edit: formatting...</p>
]]></description><pubDate>Wed, 21 Feb 2024 14:03:12 +0000</pubDate><link>https://news.ycombinator.com/item?id=39453912</link><dc:creator>kmxdm</dc:creator><comments>https://news.ycombinator.com/item?id=39453912</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=39453912</guid></item><item><title><![CDATA[New comment by kmxdm in "I tested four NVMe SSDs from four vendors – half lose FLUSH'd data on power loss (2022)"]]></title><description><![CDATA[
<p>M.2/2280 makes it hard. Can't use cheaper aluminum-can capacitors due to their size/height. The low-profile (tantalum?) capacitors are expensive and take up a lot of PCB area, forcing a two-sided PCB design on 2280 (the 110mm version would be better here). M.2 only provides 5V. On other form factors you get 12V and can store more charge for the same capacitance (q=CV) without needing a DC-DC converter.</p>
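<p>A back-of-the-envelope sketch of the 5V vs 12V point (capacitance value here is purely illustrative, not from any datasheet):</p>

```python
# Charge and energy stored in hold-up capacitance at M.2's 5V rail
# vs the 12V available on other form factors, for the same capacitance.
def stored(capacitance_f, volts):
    charge_c = capacitance_f * volts           # q = C*V
    energy_j = 0.5 * capacitance_f * volts**2  # E = (1/2)*C*V^2
    return charge_c, energy_j

c = 1000e-6  # 1000 uF, illustrative
q5, e5 = stored(c, 5.0)
q12, e12 = stored(c, 12.0)
print(f"5V:  {q5*1000:.1f} mC, {e5*1000:.1f} mJ")
print(f"12V: {q12*1000:.1f} mC, {e12*1000:.1f} mJ")
print(f"energy ratio: {e12/e5:.2f}x")  # (12/5)^2 = 5.76x
```

<p>The charge advantage is linear in V, but the usable energy advantage is quadratic, which is why the 12V rail matters so much for power-loss protection.</p>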
]]></description><pubDate>Wed, 22 Nov 2023 16:17:46 +0000</pubDate><link>https://news.ycombinator.com/item?id=38381204</link><dc:creator>kmxdm</dc:creator><comments>https://news.ycombinator.com/item?id=38381204</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=38381204</guid></item><item><title><![CDATA[New comment by kmxdm in "I tested four NVMe SSDs from four vendors – half lose FLUSH'd data on power loss (2022)"]]></title><description><![CDATA[
<p>Yeah, but then you have a write amplification problem. Padding is write amplification from the start, and then GC is invoked many more times than it otherwise would be. There is a fundamental problem with (truly) flushing an IO that is smaller than the media write unit. It will cause problems if "abused." The SSD either needs to take on the cost of mitigation (e.g. caps) or it needs some way to provide hints to the host that don't exist today.</p>
]]></description><pubDate>Wed, 22 Nov 2023 13:26:01 +0000</pubDate><link>https://news.ycombinator.com/item?id=38378815</link><dc:creator>kmxdm</dc:creator><comments>https://news.ycombinator.com/item?id=38378815</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=38378815</guid></item><item><title><![CDATA[New comment by kmxdm in "I tested four NVMe SSDs from four vendors – half lose FLUSH'd data on power loss (2022)"]]></title><description><![CDATA[
<p>Yes, GC should be smart enough to free up space from padding. But then there's a write amplification penalty, and meeting endurance specifications becomes impossible. A padded write already carries a write amplification >1, and then GC has to be invoked much more frequently on top of that, driving it even higher. With pathological Flush usage you have to pick your poison: run out of space or run out of SSD life.</p>
]]></description><pubDate>Wed, 22 Nov 2023 12:48:37 +0000</pubDate><link>https://news.ycombinator.com/item?id=38378470</link><dc:creator>kmxdm</dc:creator><comments>https://news.ycombinator.com/item?id=38378470</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=38378470</guid></item><item><title><![CDATA[New comment by kmxdm in "I tested four NVMe SSDs from four vendors – half lose FLUSH'd data on power loss (2022)"]]></title><description><![CDATA[
<p>The more dire problem is the case where the drive runs out of physical capacity before logical capacity. If the host flushes data that is smaller than the physical write unit of the SSD, capacity is lost to padding (if the SSD honors every Flush). A "reasonable" amount of Flush would not make too much of a difference, but a pathological case like flush-after-every-4k would cause the SSD to run out of space prematurely. There should be a better interface to handle all this, but the IO stack would need to be modified to solve what amounts to a cost issue at the SSD level. It's a race to the bottom selling 1TB consumer SSDs for less than $100.</p>
]]></description><pubDate>Wed, 22 Nov 2023 12:36:26 +0000</pubDate><link>https://news.ycombinator.com/item?id=38378377</link><dc:creator>kmxdm</dc:creator><comments>https://news.ycombinator.com/item?id=38378377</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=38378377</guid></item><item><title><![CDATA[New comment by kmxdm in "I tested four NVMe SSDs from four vendors – half lose FLUSH'd data on power loss (2022)"]]></title><description><![CDATA[
<p>Writes are completed to the host when they land on the SSD controller, not when written to Flash. The SSD controller has to accumulate enough data to fill its write unit to Flash (the absolute minimum would be a Flash page, typically 16kB). If it waited for the write to Flash to send a completion, the latency would be unbearable. If it wrote every write to Flash as quickly as possible, it could waste much of the drive's capacity padding Flash pages. If a host tried to flush after every write to force the latter behavior, it would end up with the same problem. Non-consumer drives solve the problem with back-up capacitance. Consumer drives do not have this. Also, if the author repeated this test 10 or 100 times on each drive, I suspect that he would uncover a failure rate for each consumer drive. It's a game of chance.</p>
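<p>The padding cost described above can be sketched with simple arithmetic, assuming the SSD honors every Flush by padding out a full Flash page (16kB page per the comment; flush sizes are illustrative):</p>

```python
# Padding write amplification when the host flushes writes smaller than
# the Flash page and the SSD pads each flushed write out to a page boundary.
PAGE = 16 * 1024  # typical Flash page size, per the comment above

def padding_write_amp(flush_bytes):
    # Each flush consumes whole pages; WA = bytes written to Flash / payload.
    pages = -(-flush_bytes // PAGE)  # ceiling division
    return (pages * PAGE) / flush_bytes

print(padding_write_amp(4 * 1024))   # flush-after-every-4k -> 4.0
print(padding_write_amp(16 * 1024))  # page-aligned flush   -> 1.0
```

<p>So flush-after-every-4k burns 4x the physical media per logical byte before GC even runs, which is the "runs out of physical capacity before logical capacity" case.</p>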
]]></description><pubDate>Wed, 22 Nov 2023 12:06:54 +0000</pubDate><link>https://news.ycombinator.com/item?id=38378132</link><dc:creator>kmxdm</dc:creator><comments>https://news.ycombinator.com/item?id=38378132</comments><guid isPermaLink="false">https://news.ycombinator.com/item?id=38378132</guid></item></channel></rss>