Using the Block::Request_stream API in the NVMe driver
The NVMe driver component has been switched to the Request_stream API in accordance with the road map, where the consolidation of the Block-level components is scheduled for the upcoming 20.05 release. As a side-effect, things got a little bit simpler in the driver. Although it was one of the first drivers to be converted to the new API, it is late to the party nonetheless.
Introduction
When Sebastian started to rewrite the AHCI driver and asked for information on how to use the Request_stream API, naturally I pointed him to the NVMe driver. Lo and behold, to my surprise it still used the old Block::Driver API, although I had already switched it over about a year ago. Well, on a topic branch at least, and somewhere along the line I missed pushing the change upstream. Anyhow, since the consolidation is planned for the next release, I dug up the commit and got to work.
Implementation changes
Since its initial commit, the driver used somewhat conservative settings with regard to the available features of NVMe. Most prominently, the number of I/O requests was limited to 128 and there was only one I/O queue. The memory for DMA was backed by a private dataspace, so data from and to a client had to be copied to this dataspace by the driver. For practical purposes all that was not much of an issue - most machines featuring NVMe devices can spare those extra cycles. Nevertheless, if it is possible to avoid an extra copy, make it so.
Sharing the DMA dataspace with the client was already possible with the old Block::Driver interface, but it was not as straightforward, which is why I opted for doing the initial implementation without it - well, initial things tend to stick…
With the new API it is quite easy, as the dataspace is handed over to the Block_session directly. So, when the client opens the session, the DMA dataspace is allocated, and it is freed when the session is closed (better make sure there are no I/O requests still pending). Away goes the extra copy, as the client now fills the DMA buffer directly. Since the alignment constraints are propagated to the client, we can hook up the data without much hassle.
(Normally the data needs to be memory-page-size (MPS) aligned - other alignments are possible as well, but for now the driver always uses 4K and the first LBA format, i.e., block size, it can find - even if the announced block size is less. Consumer NVMe devices tend to use 512 bytes most of the time and therefore will transfer up to 8 blocks per page.)
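In numbers, that amounts to the following (a trivial sketch, assuming the 4K MPS and 512-byte block size mentioned above; the constant names are made up for illustration):

  #include <cstddef>

  enum : size_t {
      MPS        = 4096u,  /* memory page size used by the driver */
      BLOCK_SIZE =  512u,  /* typical consumer NVMe LBA format    */

      BLOCKS_PER_PAGE = MPS / BLOCK_SIZE,  /* up to 8 blocks per page */
  };

  static_assert(MPS % BLOCK_SIZE == 0, "block size must divide the MPS");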
Without going into too much detail - the NVMe 1.3 spec is easy to follow - there are multiple ways in which large I/O is handled. For one, there is the advanced scatter-gather list (SGL) that most consumer devices do not implement (so we do not use it) and the simpler physical region page (PRP) list (that we use):
An I/O command has two entries, Prp1 and Prp2. Briefly speaking, if your request is up to one MPS-sized chunk, you set the address in Prp1 and set Prp2 to zero. If it is more than one and up to two MPS-sized chunks - well, set the addresses in Prp1 and Prp2. Should the request be larger than two MPS-sized chunks, the entry in Prp2 becomes a list. This list contains the addresses of all following MPS chunks (the last entry can point to the next PRP list page, so you can chain them even further). We already have the data available and MPS-aligned in DMA-able memory; all we need for larger requests is to store the addresses of the overhanging chunks in a list page. As each address takes up to 8 bytes, we can store 512 entries in one page, which gives us 2 MiB of data per request. Requests larger than that are split.
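A rough sketch of that decision logic looks as follows (Prp1/Prp2 follow the description above; the function and structure names are made up for illustration and do not match the driver's code):

  #include <cstdint>
  #include <cstddef>

  enum : size_t { MPS = 4096u };

  struct Io_command_prps { uint64_t prp1; uint64_t prp2; };

  /*
   * 'data'      physical address of the MPS-aligned payload in the DMA buffer
   * 'length'    transfer length in bytes (at most 2 MiB, larger requests
   *             are split beforehand)
   * 'list_page' physical address of the preallocated PRP list page
   * 'list_va'   virtual address of that page, used to fill in the entries
   */
  Io_command_prps setup_prps(uint64_t data, size_t length,
                             uint64_t list_page, uint64_t *list_va)
  {
      size_t const pages = (length + MPS - 1) / MPS;

      if (pages <= 1)
          return { data, 0 };           /* fits into one MPS chunk      */

      if (pages == 2)
          return { data, data + MPS };  /* second chunk goes into Prp2  */

      /* more than two chunks: Prp2 points to a list of the remaining pages */
      for (size_t i = 1; i < pages; i++)
          list_va[i - 1] = data + i * MPS;

      return { data, list_page };
  }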
To keep things simple, we allocate the DMA memory to store the list pages beforehand, according to the maximum number of I/O requests we support. So in practice every request may be a “large” request. As the maximum number of I/O requests was increased to 512, we need 2 MiB of DMA memory for all the list pages.
(Not all controllers support this many I/O requests or have a maximum data transfer size of 2 MiB - in these cases the driver caps the limits accordingly.)
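A minimal sketch of this sizing and capping, where the names as well as the exact way the controller limits are obtained (e.g., from CAP.MQES and the Identify MDTS field) are assumptions for illustration:

  #include <cstddef>
  #include <algorithm>

  enum : size_t {
      MPS              = 4096u,
      MAX_IO_ENTRIES   =  512u,                /* driver's upper bound     */
      MAX_IO_LEN       = 512u * MPS,           /* 2 MiB per request        */
      LIST_PAGE_MEMORY = MAX_IO_ENTRIES * MPS, /* one list page per request,
                                                  2 MiB for all of them    */
  };

  /* limits reported by the controller */
  struct Controller_limits { size_t queue_entries; size_t transfer_len; };

  struct Effective_limits { size_t io_entries; size_t max_io_len; };

  Effective_limits cap(Controller_limits const &ctrl)
  {
      return { std::min<size_t>(MAX_IO_ENTRIES, ctrl.queue_entries),
               std::min<size_t>(MAX_IO_LEN,     ctrl.transfer_len) };
  }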
With the memory handling briefly illustrated, let us focus on the I/O request management:
The old implementation used a home-grown Slot mechanism, which actually was just a convenience wrapper on top of an array, where Request objects were allocated and freed. These objects stored the Block request, the request ID and the PRP list page. The command ID, basically a tag given to the NVMe device, was a wrapping counter. The command ID was part of the request ID and allowed for matching I/O completions to Block requests. The Slot mechanism was central to the operation of the driver.
On the other hand, the new implementation solely uses a bitmap, or rather a bit allocator, to create command IDs. These IDs are then used throughout the driver to reference request objects as well as to calculate memory addresses and offsets for the list pages - there is a 1:1 mapping between the ID and the page - and so on. This simplifies the overall implementation quite a bit. Although the command IDs are no longer unique, i.e., they get reused, that is fine as the bitmap is dimensioned in accordance with the maximum number of I/O entries. An ID will only be reused after we have received the I/O completion from the NVMe device. In the end, central to the operation of the driver is now only managing a few bits - the rest is already in place, statically.
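To give an idea of how little state that amounts to, here is a simplified sketch - not the driver's actual code - of such a command-ID bit allocator and the ID-to-list-page mapping:

  #include <cstdint>
  #include <cstddef>
  #include <bitset>

  enum : size_t { MPS = 4096u, MAX_IO_ENTRIES = 512u };

  struct Command_id_allocator
  {
      std::bitset<MAX_IO_ENTRIES> used { };

      /* return a free command ID or -1 if all entries are in flight */
      int alloc()
      {
          for (size_t id = 0; id < MAX_IO_ENTRIES; id++)
              if (!used[id]) { used[id] = true; return int(id); }
          return -1;
      }

      /* called once the I/O completion for 'id' was received */
      void free(size_t id) { used[id] = false; }
  };

  /* 1:1 mapping between a command ID and its preallocated PRP list page */
  uint64_t list_page_address(uint64_t list_page_base, size_t id)
  {
      return list_page_base + id * MPS;
  }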
As far as features go, the driver now supports SYNC and TRIM requests, implemented via the NVMe FLUSH and WRITE_ZEROES commands.
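The mapping boils down to a pair of NVM command set opcodes (the values are those defined by the NVMe spec, the enum naming is merely illustrative):

  #include <cstdint>

  /* NVM command set opcodes used for the Block operations */
  enum class Nvme_opcode : uint8_t {
      FLUSH        = 0x00,  /* backs Block SYNC requests */
      WRITE        = 0x01,
      READ         = 0x02,
      WRITE_ZEROES = 0x08,  /* backs Block TRIM requests */
  };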
Summary
Through the changes, the logic of the driver got simplified. Some of the former limits were raised (e.g., more I/O entries), zero-copy was introduced, and, depending on the test case, throughput increased as a side-effect. I have not really looked at the latency side of things, but AFAICT it did not get worse. The component is now in a state where starting to apply optimizations is reasonable. The interrupt handling still does not feel quite right to me, and apart from that there are still lots of places that take plain ol' integers and could use some strong(er) typing. But that is for another time.
I have opened an issue on GitHub, referencing the proper commit. The nvme.run run script works well, on Qemu as well as on several test machines. I also tested it in Sculpt using a Shuttle DS57U with an mPCIe → m.2 adapter (works well, even booting from NVMe works, but the device is starved because of the limited PCIe 2.0 x1 bandwidth) and by now use it on my T470p work laptop. I am fairly confident it will not shred my data ☺ (yeah, famous last words…).
That being said, since I have limited access¹ to various NVMe devices, I would appreciate further testing by the community. So, if you feel adventurous and own a system featuring NVMe storage, please give it a try - back up your data first! I would be especially interested in results with prosumer or rather enterprise-grade devices (e.g., ones with multiple namespaces).
¹) So far tested: Corsair MP500, Kingston (do not remember the model), Samsung Evo 970 and Toshiba THNSF5256GPUK.