Writing an open source GPU driver – without the hardware

After six months of reverse-engineering, the new Arm “Valhall” GPUs (Mali-G57, Mali-G78) are getting free and open source Panfrost drivers. With a new compiler, driver patches, and some kernel hacking, these new GPUs are almost ready for upstream.

In 2021, there were no Valhall devices running mainline Linux. While a lack of devices poses an obvious obstacle to device driver development, there is no better time to write drivers than before hardware reaches end-users. Developing and distributing production-quality drivers takes time, and we don’t want users to be reliant on closed source blobs. If development doesn’t start until a device hits shelves, that device could reach “end-of-life” by the time there are mature open drivers. But with a head start, we can have drivers ready by the time devices reach end users.

Let’s see how.

Reverse-engineering without root

Over the summer, Collabora purchased an Android phone with a Mali-G78. The phone isn’t rooted, so we can’t replace its graphics drivers with our own. However, we can put it in developer mode, run test applications against the proprietary graphics driver, and inject our own code with LD_PRELOAD. That lets us inspect the graphics memory prepared by the proprietary driver and “passively” reverse-engineer the hardware. This memory includes compiled shader binaries in the Valhall instruction set, as well as Valhall’s data structures controlling graphics state like textures, blending, and culling.
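
To give a feel for the approach (this is a generic sketch, not Panfrost’s actual tooling), an LD_PRELOAD library can interpose libc’s ioctl so that every command the proprietary userspace driver sends to its kernel driver can be logged, and the GPU memory it references can be dumped. Only the libc/dlfcn calls below are real; the logging is illustrative.

/* shim.c – hypothetical LD_PRELOAD interposer. Build with:
 *   gcc -shared -fPIC -o shim.so shim.c -ldl
 * and load it into a test application with LD_PRELOAD=./shim.so
 */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdarg.h>
#include <stdio.h>
#include <sys/ioctl.h>

int ioctl(int fd, unsigned long request, ...)
{
    /* Look up the real ioctl the first time we are called. */
    static int (*real_ioctl)(int, unsigned long, ...);
    if (!real_ioctl)
        real_ioctl = (int (*)(int, unsigned long, ...)) dlsym(RTLD_NEXT, "ioctl");

    va_list ap;
    va_start(ap, request);
    void *arg = va_arg(ap, void *);
    va_end(ap);

    /* Log the request; a real tool would decode it and snoop the GPU
     * memory that arg points at before and after the call. */
    fprintf(stderr, "ioctl(%d, 0x%lx, %p)\n", fd, request, arg);

    return real_ioctl(fd, request, arg);
}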

Reverse-engineering “actively” is possible, too. We can modify compiled shaders and GPU data structures, allowing us to experiment with individual bits. We can go further, constructing our own shaders and data structures and validating them against the hardware.

To motivate this technique, consider the reverse-engineering of Valhall’s “buffer descriptor”. This new data structure describes a buffer of memory, accessed by a new “load buffer” instruction (LD_BUFFER). After guessing the layout of the buffer descriptor and encoding of LD_BUFFER, we can build our own buffer descriptor and write a shader using LD_BUFFER to validate our guess and probe the low-level semantics.
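
For illustration only – none of the field positions below are the real Valhall encoding – the “guess” step might start with a candidate layout like this, which we then point at a buffer full of known values and read back with a hand-written LD_BUFFER shader:

#include <stdint.h>
#include <string.h>

/* Entirely hypothetical guess at a buffer descriptor layout, shown only to
 * illustrate the method: fill a buffer with a known pattern, build a
 * descriptor pointing at it, run a shader that loads through it with
 * LD_BUFFER, and check whether the values that come back match. */
struct guessed_buffer_descriptor {
    uint64_t address;   /* guess: GPU address of the buffer */
    uint32_t size;      /* guess: size in bytes */
    uint32_t flags;     /* guess: unknown bits, try zero first */
};

static void
build_guessed_descriptor(void *out, uint64_t gpu_address, uint32_t size)
{
    struct guessed_buffer_descriptor desc = {
        .address = gpu_address,
        .size = size,
        .flags = 0,
    };

    memcpy(out, &desc, sizeof(desc));
}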

When reverse-engineering Valhall’s new data structures, we have legacy to guide us. While Valhall reorganizes its data structures to reduce Vulkan driver overhead, the bit-level contents resemble older Mali GPUs. If we find the “contours” of new data structures, we can fill in the details by comparing with older hardware.

As we learn about the data structures, we document our findings in a formal XML hardware description. This file has the same format as the XML for older Mali architectures already supported by Panfrost. Since the Valhall data structures descend from these older architectures, we can fork an older Mali’s XML to save us some typing and keep naming consistent.

After enough reverse-engineering, we can slot our XML into Panfrost, automatically generating code to pack and unpack the data structures. Thanks to tireless work by Collaboran Boris Brezillon, Panfrost’s performance-critical code is specialized at compile-time to the target architecture, allowing us to add new architectures without adding overhead to existing hardware. So with our XML in hand, we can get started writing a Valhall driver.
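
As a rough illustration of what falls out of the XML (the descriptor name, fields, and layout here are invented; the real layouts live in Panfrost’s genxml files), each struct in the XML becomes a C struct plus a pack routine that writes the bit-exact hardware encoding:

#include <stdint.h>

/* Hypothetical output of the XML code generator – it only shows the shape
 * of the generated code, not a real Valhall descriptor. */
struct MALI_EXAMPLE_BUFFER {
    uint64_t address;   /* GPU virtual address of the buffer */
    uint32_t size;      /* size in bytes */
};

static inline void
MALI_EXAMPLE_BUFFER_pack(uint32_t *restrict cl,
                         const struct MALI_EXAMPLE_BUFFER *restrict values)
{
    cl[0] = (uint32_t) values->address;
    cl[1] = (uint32_t) (values->address >> 32);
    cl[2] = values->size;
    cl[3] = 0;   /* reserved, must be zero in this made-up layout */
}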

Writing drivers without hardware

It is November 2021. I’ve written a Valhall compiler. I’ve reverse-engineered enough to write a driver. I still have no Linux hardware to test my code.

That’s a major road block.

Good thing I know a detour.

We can develop the driver on any Linux machine, without testing against real hardware. To pull that off, unit testing is mandatory. With no hardware, we can’t run integration tests, but unit tests can run on any hardware. For the Valhall compiler, I wrote unit tests for everything from instruction packing to optimization. Although the coverage isn’t exhaustive, it caught numerous bugs early on.
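
A minimal sketch of what such a test looks like (the opcode, fields, and expected encoding are invented; the real tests check the compiler’s output against encodings observed in the proprietary driver’s shader binaries):

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical encoder: opcode in the top byte, two 8-bit registers below. */
static uint32_t
pack_move(uint8_t dest, uint8_t src)
{
    return (0x91u << 24) | ((uint32_t) dest << 8) | src;
}

/* Unit test: packing one instruction must produce the expected bit pattern. */
static void
test_pack_move(void)
{
    assert(pack_move(2, 5) == 0x91000205);
}

int
main(void)
{
    test_pack_move();
    printf("All packing tests passed\n");
    return 0;
}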

There is a caveat: unit testing can’t tell us if our expectations of the hardware are correct. However, it can confirm that our code matches our expectations. If our reverse-engineering is thorough, these expectations should be correct.

Even so, unit testing alone isn’t enough.

Enter drm-shim.

drm-shim

Mesa drivers like Panfrost can be run against mock hardware with drm-shim, a small library which stubs out the system calls userspace graphics drivers use to communicate with the kernel. With drm-shim, an unmodified userspace driver thinks it’s running against real hardware – including Valhall hardware.

Graphics guru Emma Anholt designed drm-shim to run Mesa’s compilers as cross-compilers for use in continuous integration (CI). Outside of CI, drm-shim allows testing compilers on our development machines, which may be significantly faster than the embedded devices we target. But it’s not limited to compilers; we can run entire test suites under drm-shim, “cross-testing” for any hardware we please. The tests won’t pass, since drm-shim does no rendering; it is a shim, not an emulator. But it allows us to exercise new driver code paths without the constraints of real hardware.

As drm-shim runs on any Linux machine, I wanted to use the fastest Linux machine I own: my Apple M1. Bizarrely, drm-shim didn’t work on my M1 Linux box, although it works on everyone else’s computers. That calls for a debugging session.

After some poking around, I stumbled on the offending code:

bo->addr = util_vma_heap_alloc(&heap, size, 4096);
mmap(NULL, ..., bo->addr);

This code allocates a chunk of address space aligned to 4096 bytes and uses the resulting address as the offset in a call to mmap. On my system, the mmap call fails, so I consulted the man page for mmap:

offset must be a multiple of the page size as returned by sysconf(_SC_PAGE_SIZE).

The mmap in drm-shim works because the page size on Linux is 4096 bytes (4K)…

Until it isn’t.

Apple’s input/output memory management unit uses larger, 16384-byte (16K) pages. As a consequence, when we run Linux bare metal on Apple platforms, we configure Linux to use 16K pages everywhere to keep life simple. That means that on Apple platforms running Linux, sysconf(_SC_PAGE_SIZE) returns 16384, so the mmap fails. The fix is easy:

bo->addr = util_vma_heap_alloc(&heap, size, sysconf(_SC_PAGE_SIZE));
mmap(NULL, ..., bo->addr);

With that, drm-shim works on systems with page sizes larger than 4K, including my M1. That means I can compile thousands of shaders per second with the Valhall compiler, far more than any system with a Mali GPU could. I can also run Khronos’s OpenGL ES Conformance Test Suite:

PAN_MESA_DEBUG=valhall,trace LIBGL_DRIVERS_PATH=~/lib/dri/ \
LD_PRELOAD=~/mesa/build/src/panfrost/drm-shim/libpanfrost_noop_drm_shim.so \
PAN_GPU_ID=9091 EGL_PLATFORM=surfaceless \
./deqp-gles31 --deqp-surface-type=pbuffer --deqp-gl-config-name=rgba8888d24s8ms0 \
    --deqp-surface-width=256 --deqp-surface-height=256

Long commands like this one run tests and produce pretty-printed dumps of GPU memory, ready for manual inspection. If the dumps look like the dumps from the proprietary driver, there’s a good chance the tests will pass on real hardware, too.

Code sharing

Since Valhall is similar to its predecessors, the years we’ve spent nurturing Panfrost mean we only need to modify the driver in areas where Valhall introduces breaking changes.

For example, Valhall’s instruction set resembles the older “Bifrost” instruction set, so we may embed the Valhall compiler as an additional backend in the existing Bifrost compiler. Shared compiler passes like instruction selection and register allocation “just work” on Valhall, even though they were developed and debugged for Bifrost.

Once we adapt Panfrost for Valhall, we’ll have a conformant, performant driver ready out-of-the-box.

…In theory.

Real hardware, real pain

I couldn’t test on real Valhall hardware until early January, when I procured a Chromebook with a MediaTek MT8192 system-on-chip and a matching serial cable. MT8192 sports a Valhall “Mali-G57” GPU, compatible with the Mali-G78 I’m reverse-engineering. Mainline kernel support for MT8192 is sparse, but Linux does boot. With patches by other Collaborans, USB works too. That’s enough to get to work on the GPU. Sure, the display doesn’t work, but who needs that?!

We’ll start by teaching Linux how to find the GPU. On desktops, ACPI and UEFI let the operating system discover any connected hardware. While these standards exist for Arm, in practice Arm systems require a device tree describing the hardware: what parts there are, which registers and clocks they use, and how they’re connected. We don’t know much about MT8192, but ChromeOS supports it, so ChromeOS has a complete device tree. Adapting that device tree for mainline, we soon see signs of life:

[  1.942843] panfrost 13000000.gpu: unknown id 0x9093 major 0x0 minor 0x0 status 0x0

The kernel cannot identify the connected Mali GPU, but that’s expected – after all, it has never seen a Mali-G57 before. We need to add a mapping from Mali-G57’s hardware ID to its name, feature list, and hardware bug list. Then the driver loads.

[  1.942843] panfrost 13000000.gpu: mali-g57 id 0x9093 major 0x0 minor 0x0 status 0x0
[  1.982322] [drm] Initialized panfrost 1.2.0 20180908 for 13000000.gpu on minor 0
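
Conceptually, the kernel-side change is one more entry in a lookup table keyed on the hardware ID from the log; the structure below is a simplified sketch rather than the actual Panfrost kernel source.

#include <stdint.h>

/* Simplified sketch of mapping a GPU ID to its name, feature list, and
 * hardware bug list; the field types and names are illustrative only. */
struct gpu_model {
    const char *name;
    uint32_t id;
    uint64_t features;    /* bitmask of supported hardware features */
    uint64_t issues;      /* bitmask of known hardware bugs to work around */
};

static const struct gpu_model gpu_models[] = {
    /* ... existing Midgard and Bifrost entries ... */
    { .name = "mali-g57", .id = 0x9093, .features = 0, .issues = 0 },
};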

Based on the downstream kernel module released by Arm, we know the parts of Valhall relevant to the kernel are backwards-compatible with Mali GPUs from a decade ago. Panfrost supports existing Mali hardware, so in theory, we can test drive the Mali-G57 right now.

When it comes to hardware, theory and practice never agree.

Let’s try submitting a “null job” to the hardware, a simple job that does nothing whatsoever:

struct mali_job_descriptor_header job = {
    .job_type = MALI_JOB_TYPE_NULL,
    .job_index = 1
};

Only two bits are set in the entire data structure. We can even hard-code this job into the kernel and submit it as soon as the hardware powers on. Since the job is valid, the hardware should run it without complaint.

[   2.094748] panfrost 13000000.gpu: js fault, js=1, status=DATA_INVALID_FAULT, head=0x6087000, tail=0x6087000

What? The hardware claims the job is invalid, even though the job is clearly valid. Apparently, the hardware is reading something different from memory than we wrote.

That symptom is eerily familiar. When Collaboran Tomeu Vizoso and I added support for Mali-G52 two years ago, we observed the same symptoms on an Amlogic system-on-chip. The culprit was an Amlogic-specific cache coherency issue. That fix doesn’t apply here, so it’s time to hunt for MediaTek-specific bugs.

Crawling through ChromeOS code, I found that MediaTek submitted an unexplained change to the GPU driver, setting a single bit belonging to a clock on MT8192 in order to “disable ACP”, fixing bus faults. This change is the embodiment of a “fix everything” magic bit, the kind only rumoured to exist and the stuff of reverse-engineers’ nightmares.

…But setting that bit in our kernel makes our null job complete successfully.

…Wait, what?

It turns out ACP is the “Accelerator Coherency Port”, responsible for managing cache coherency between the CPU and the GPU. Apparently, ACP was never supposed to be enabled on MT8192, but a hardware bug left it enabled. The kernel must set this bit to disable ACP as a workaround.

Again, what?

Pressing on, we can submit the same null job from userspace. To the hardware, kernelspace and userspace are the same, so this must work.

It does not.

The job times out before completing. Inspecting the kernel log, we notice an earlier timeout, waiting for the GPU to wake up after being reset.

After littering the kernel with printks, we eventually find that the GPU is powered off once Linux boots, and nothing we do will power it back on. No wonder everything times out.

For some problems, we can only hope for a leprechaun to whisper the solution in our ear. Our leprechaun comes in the form of kernel wizard Heiko Stuebner. Heiko suggested that Linux might be powering off the GPU. To save power, Linux turns off unused clocks and power domains. If Linux doesn’t know a clock or power domain is used by the GPU, it’ll turn off the GPU inadvertently.

For debugging, we can disable this mechanism by setting the clk_ignore_unused and pd_ignore_unused kernel arguments. Doing so makes our userspace tests work.

Sometimes the simplest solutions are in front of us.

What is the root cause? MediaTek has a complicated hierarchy of clocks and power domains, and we missed some in our device tree. We’ll need to update our code to teach Linux about the extra clocks and power domains to fix the issue properly.

Nevertheless, we can now test our driver on real hardware. It’s a rough start: the first job we submit returns a Data Invalid Fault. Experimenting, we find that Valhall requires its data structures to be aligned more strictly than Bifrost did. Increasing the alignment at which we allocate fixes the faults, and decreasing it again lets us determine the minimum required alignment. This information is easy to obtain once we can run code on the hardware, but inaccessible when studying the hardware in vitro. Reverse-engineering and driver development are better together.
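
A sketch of that experiment (the 64-byte figure is a stand-in, not the measured Valhall requirement): bump the alignment used when sub-allocating descriptors out of a pool, retest, then walk it back down until the faults return.

#include <stdint.h>

/* Trial alignment for descriptor sub-allocations; adjust and retest to
 * find the minimum the hardware accepts. 64 is a placeholder value. */
#define TRIAL_DESC_ALIGN 64

static uint64_t
alloc_descriptor(uint64_t *pool_offset, uint64_t size)
{
    /* Round the current pool offset up to the trial alignment. */
    uint64_t addr = (*pool_offset + (TRIAL_DESC_ALIGN - 1)) &
                    ~(uint64_t)(TRIAL_DESC_ALIGN - 1);

    *pool_offset = addr + size;
    return addr;
}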

Success at last

With these fixes, we finally see our first passing test, running on real hardware, with data structures prepared by our open source Mesa driver and shaders compiled by our Valhall compiler. Woo!

It only took a few days after getting the hardware and a serial cable to pass hundreds of tests on the new architecture. Months of speculative driver development paid off in a big way.

Sounds like we’ll have Valhall drivers in time for end-users after all.
