I Entered My GitHub Credentials into a Phishing Website!

We all think we’re smart enough to not be tricked by a phishing attempt, right? Unfortunately, I know for certain that I’m not, because I entered my GitHub password into a lookalike phishing website a year or two ago. Oops! Fortunately, I noticed right away, so I simply changed my unique, never-reused password and moved on. But if the attacker were smarter, I might have never noticed. (This particular attack website was relatively unsophisticated and proxied only an unauthenticated view of GitHub, a big clue that something was wrong. Update: I want to be clear that it would have been very easy for the attacker to simply redirect me to the real github.com after stealing my credentials, in which case I would not have noticed the attack and would not have known to change my password.)

You might think multifactor authentication is the best defense against phishing. Nope. Although multifactor authentication is a major security improvement over passwords alone, and the particular attack that tricked me did not attempt to subvert multifactor authentication, it’s unfortunately pretty easy for phishers to defeat most multifactor authentication if they wish to do so:

  • Multifactor authentication based on SMS is extremely insecure (because SS7 is insecure)
  • Multifactor authentication based on phone calls is also insecure (because SIM swapping isn’t going away; determined attackers will steal your phone number if it’s an obstacle to them)
  • Multifactor authentication based on authenticator apps (using TOTP or HOTP) is much better in general, but still fails against phishing. When you paste your one-time access code into a phishing website, the phishing website can simply “proxy” the access code you kindly provided to them by submitting it to the real website. This only allows authenticating once, but once is usually enough.

Fortunately, there is a solution: passkeys. Based on FIDO2 and WebAuthn, passkeys resist phishing because the authentication process depends on the domain of the service that you’re actually connecting to. If you think you’re visiting https://example.com, but you’re actually visiting a copycat website with a Cyrillic а instead of Latin a, then no worries: the authentication will fail, and the frustrated attacker will have achieved nothing.

The most popular form of passkey is local biometric authentication running on your phone, but any hardware security key (e.g. YubiKey) is also a good bet.

target.com Is More Secure than Your Bank!

I am not joking when I say that target.com is more secure than your bank (which is probably still relying on SMS or phone calls, and maybe even allows you to authenticate using easily-guessable security questions):

Screenshot of target.com prompting the user to register a passkey
target.com is introducing passkeys!

Good job for supporting passkeys, Target.

It’s probably perfectly fine for Target to support passkeys alongside passwords indefinitely. Higher-security websites that want to resist phishing (e.g. your employer’s SSO service) should consider eventually allowing only passkeys.

No Passkeys in WebKitGTK

Unfortunately for GNOME users, WebKitGTK does not yet support WebAuthn, so passkeys will not work in GNOME Web (Epiphany). That’s my browser of choice, so I’ve never touched a passkey before and don’t actually know how well they work in practice. Maybe do as I say and not as I do? If you require high security, you will unfortunately need to use Firefox or Chrome instead, at least for the time being.

Why Was Michael Visiting a Fake github.com?

The fake github.com appeared higher than the real github.com in the DuckDuckGo search results for whatever I was looking for at the time. :(

How to Get Hacked by North Korea

Good news: exploiting memory safety vulnerabilities is becoming more difficult. Traditional security vulnerabilities will remain a serious threat, but attackers prefer to take the path of least resistance, and nowadays that is to attack developers rather than the software itself. Once the attackers control your computer, they can attempt to perform a supply chain attack and insert backdoors into your software, compromising all of your users at once.

If you’re a software developer, it’s time to start focusing on the possibility that attackers will target you personally. Yes, you. If you use Linux, macOS, or Windows, take a moment to check your home directory for a hidden .n2 folder. If it exists, alas! You have been hacked by the North Koreans. (Future malware campaigns will presumably be more stealthy than this.)

Attackers who target developers are currently employing two common strategies:

  • Fake job interview: you’re applying to job postings and the recruiter asks you to solve a programming problem of some sort, which involves installing software from NPM (or PyPI, or another language ecosystem’s package manager).
  • Fake debugging request: you receive a bug report and the reporter helpfully provides a script for reproducing the bug. The script may have dependencies on packages in NPM (or PyPI, or another language ecosystem’s package manager) to make it harder to notice that it’s malware. I saw a hopefully innocent bug report that was indistinguishable from such an attack just last week.

But of course you would never execute such a script without auditing the dependencies thoroughly! (Insert eyeroll emoji here.)

By now, most of us are hopefully already aware of typosquatting attacks, and the high risk that untrustworthy programming language packages may be malicious. But you might not have considered that attackers might target you personally. Exercise caution whenever somebody asks you to install packages from these sources. Take the time to start a virtual machine before running the code. (An isolated podman container probably works too?) Remember, attackers will target the path of least resistance. Virtualization or containerization raises the bar considerably, and the attacker will probably move along to an easier target.

Forcibly Set Array Size in Vala

Vala programmers probably already know that a Vala array is equivalent to a C array plus an integer size. (Unfortunately, the size is gint rather than size_t, which seems likely to be a source of serious bugs.) When you create an array in Vala, the size is set for you automatically by Vala. But what happens when receiving an array that’s created by a library?

Here is an example. I recently wanted to use the GTK function gdk_texture_new_for_surface(), which converts from a cairo_surface_t to a GdkTexture, but unfortunately this function is internal to GTK (presumably to avoid exposing more cairo stuff in GTK’s public API). I decided to copy it into my application, which is written in Vala. I could have put it in a separate C source file, but it’s nicer to avoid a new file and rewrite it in Vala. Alas, I hit an array size roadblock along the way.

Now, gdk_texture_new_for_surface() calls cairo_image_surface_get_data() to return an array of data to be used for creating a GBytes object to pass to gdk_memory_texture_new(). The size of the array is cairo_image_surface_get_height (surface) * cairo_image_surface_get_stride (surface). This is GTK’s code to create the GBytes object:

bytes = g_bytes_new_with_free_func (cairo_image_surface_get_data (surface),
                                    cairo_image_surface_get_height (surface)
                                    * cairo_image_surface_get_stride (surface),
                                    (GDestroyNotify) cairo_surface_destroy,
                                    cairo_surface_reference (surface));

The C function declaration of g_bytes_new_with_free_func() looks like this:

GBytes*
g_bytes_new_with_free_func (
  gconstpointer data,
  gsize size,
  GDestroyNotify free_func,
  gpointer user_data
)

Notice it takes both the array data and its size. But because Vala arrays already know their size, Vala bindings do not contain a separate parameter for the size of the array. The corresponding Vala API looks like this:

public Bytes.with_free_func (owned uint8[]? data, DestroyNotify? free_func, void* user_data)

Notice there is no way to pass the size of the data array, because the array is expected to know its own size.

I actually couldn’t figure out how to pass a DestroyNotify in Vala. There’s probably some way to do it (please let me know in the comments!), but I don’t know how. Anyway, I compromised by creating a GBytes that copies its data instead, using this simpler constructor:

public Bytes (uint8[]? data)

My code looked something like this:

unowned uchar[] data = surface.get_data ();
return new Gdk.MemoryTexture (surface.get_width (),
                              surface.get_height (),
                              BYTE_ORDER == ByteOrder.LITTLE_ENDIAN ? Gdk.MemoryFormat.B8G8R8A8_PREMULTIPLIED : Gdk.MemoryFormat.A8R8G8B8_PREMULTIPLIED,
                              new Bytes (data),
                              surface.get_stride ());

But this code crashes because the size of data is unset. Without any clear way to pass the size of the array to the Bytes constructor, I was initially stumped.

My first solution was to create my own array on the stack, use Posix.memcpy to copy the data to my new array, then pass the new array to the Bytes constructor. And that actually worked! But now the data is being copied twice, when the C version of the code doesn’t need to copy the data at all. I knew I had to copy the data at least once (because I didn’t know how else to manually call cairo_surface_destroy() and cairo_surface_ref() in Vala, at least not without resorting to custom language bindings), but copying it twice seemed slightly egregious.

Eventually I discovered there is an easier way:

data.length = surface.get_height () * surface.get_stride ();

It’s that simple. Vala arrays have a length property, and you can assign to it. In this case, the bindings didn’t know how to set the length, so it defaulted to the maximum possible value, leading to the crash. After setting the length manually, the code worked properly. Here it is.

GNOME 45 Core Apps Update

It’s been a few months since I last reviewed the state of GNOME core apps. For GNOME 45, we have implemented the changes proposed in the “Imminent Core App Changes” section of that blog post:

  • Loupe enters core as GNOME’s new image viewer app, developed by Christopher Davis and Sophie Herold. Loupe will be branded as Image Viewer and replaces Eye of GNOME, which will no longer use the Image Viewer branding. Eye of GNOME will continue to be maintained by Felix Riemann, and contributions are still welcome there.
  • Snapshot enters core as GNOME’s new camera app, developed by Maximiliano Sandoval and Jamie Murphy. Snapshot will be branded as Camera and replaces Cheese. Cheese will continue to be maintained by David King, and contributions are still welcome there.
  • GNOME Photos has been removed from core without replacement. This application could have been retained if more developers were interested in maintaining it, but nobody stepped up, so we have decided to remove it. Photos will likely be archived eventually, unless a new maintainer volunteers to save it.

GNOME 45 beta will be released imminently with the above changes. Testing the release and reporting bugs is much appreciated.

We are also looking for volunteers interested in helping implement future core app changes. Specifically, improvements are required for Music to remain in core, and improvements are required for Geary to enter core. We’re also not quite sure what to do with Contacts. If you’re interested in any of these projects, consider getting involved.

GNOME Core Apps Update

It’s been a while since my big core app reorganization for GNOME 3.22. Here is a history of core app changes since then:

  • GNOME 3.26 (September 2017) added Music, To Do (which has since been renamed to Endeavor), and Document Scanner (simple-scan). (I blogged about this at the time, then became lazy and stopped blogging about core app updates, until now.)
  • To Do was removed in GNOME 3.28 (March 2018) due to lack of consensus over whether it should really be a core app. As a result of this, we improved communication between the GNOME release team and design team to ensure both teams agree on future core app changes. Mea culpa.
  • Documents was removed in GNOME 3.32 (March 2019).
  • A new Developer Tools subcategory of core was created in GNOME 3.38 (September 2020), adding Builder, dconf Editor, Devhelp, and Sysprof. These apps are only interesting for software developers and are not intended to be installed by default in general-purpose operating systems like the rest of GNOME core.
  • GNOME 41 (September 2021) featured the first larger set of changes to GNOME core since GNOME 3.22. This release removed Archive Manager (file-roller), since Files (nautilus) is now able to handle archives, and also removed gedit (formerly Text Editor). It added Connections and a replacement Text Editor app (gnome-text-editor). It also added a new Mobile subcategory of core, for apps intended for mobile-focused operating systems, featuring the dialer app Calls. (To date, the Mobile subcategory has not been very successful: so far Calls is the only app included there.)
  • GNOME 42 (March 2022) featured a second larger set of changes. Screenshot was removed because GNOME Shell gained a built-in screenshot tool. Terminal was removed in favor of Console (kgx). We also moved Boxes to the Developer Tools subcategory, to recommend that it no longer be installed by default in general purpose operating systems.
  • GNOME 43 (September 2022) added D-Spy to Developer Tools.

OK, now we’re caught up on historical changes. So, what to expect next?

New Process for Core Apps Changes

Although most of the core app changes have gone smoothly, we ran into some trouble replacing Terminal with Console. Console provides a fresher and simpler user interface on top of vte, the same terminal backend used by Terminal, so Console and Terminal share much of the same underlying functionality. This means the work of the Terminal maintainers is actually key to the success of Console. Using a new terminal app rather than evolving Terminal allowed for bigger changes to the default user experience without upsetting users who prefer the experience provided by Terminal. I think Console is generally nicer than Terminal, but it is missing a few features that Fedora Workstation developers thought were important to have before replacing Terminal with Console. Long story short: this core app change was effectively rejected by one of our most important downstreams. Since then, Console has not seen very much development, and accordingly it is unlikely to be accepted into Fedora Workstation anytime soon. We messed up by adding the app to core before downstreams were comfortable with it, and at this point it has become unclear whether Console should remain in core or whether we should give up and bring back Terminal. Console remains for now, but I’m not sure where we go from here. Help welcome.

To prevent this situation from happening again, Chris and Sophie developed a detailed and organized process for adding or removing core apps, including a new Incubator category designed to provide notice to downstreams that we are considering adding new apps to GNOME core. The new Incubator is much more structured than my previous short-lived Incubator attempt in GNOME 3.22. When apps are added to Incubator, I’ve been proactively asking other Fedora Workstation developers to provide feedback to make sure the app is considered ready there, to avoid a repeat of the situation with Console. Other downstreams are also welcome to watch the Incubator/Submission project and provide feedback on newly-submitted apps, which should allow plenty of heads-up so downstreams can let us know sooner rather than later if there are problems with Incubator apps. Hopefully this should ensure apps are actually adopted by downstreams when they enter GNOME core.

Imminent Core App Changes

Currently there are two apps in Incubator. Loupe is a new image viewer app developed by Chris and Sophie to replace Image Viewer (eog). Snapshot is a new camera app developed by Maximiliano and Jamie to replace Cheese. These apps are maturing rapidly and have received primarily positive feedback thus far, so they are likely to graduate from Incubator and enter GNOME core sooner rather than later. The time to provide feedback is now. Don’t be surprised if Loupe is included in core for GNOME 45.

In addition to Image Viewer and Cheese, we are also considering removing Photos. Photos is one of our “content apps” designed to allow browsing an entire collection of files independently of their filesystem locations. Historically, the other two content apps were Documents and Music. The content app strategy did not work very well for Documents, since a document browser doesn’t really offer many advantages over a file browser, but Photos and Music are both pretty decent at displaying your collection of pictures or songs, assuming you have such a collection. We have been discussing what to do with Photos and the other content apps for a very long time, at least since 2015. It took a very long time to reach some rough consensus, but we have finally agreed that the design of Photos still makes sense for GNOME: having a local app for viewing both local and cloud photos is still useful. However, Photos is no longer actively maintained. Several basic functionality bugs imperiled timely release of Fedora 37 last fall, and the app is less useful than previously because it no longer integrates with cloud services like Google Photos. (The Google integration depends on libgdata, which was removed from GNOME 44 because it did not survive the transition to libsoup 3.) Photos has failed the new core app review process due to lack of active maintenance, and will soon be removed from GNOME core unless a new maintainer steps up to take care of it. Volunteers welcome.

Future Core App Changes

Lastly, I want to talk about some changes that are not yet planned, but might occur in the future. Think of this entire section as brainstorming rather than any concrete plans.

Like Photos, we have also been discussing the status of Music. The popularity of DRM-encumbered cloud music services has increased, and local music storage does not seem to be as common as it used to be. If you do have local music, Music is pretty decent at handling it, but there are prominent bugs and missing features (like the ability to select which folders to index) detracting from the user experience. We do not have consensus on whether having a core app to play local music files still makes sense, since most users probably do not have a local music collection anymore. But perhaps all that is a moot point, because Videos (totem) 3.38 removed support for opening audio files, leaving us with no core apps capable of playing audio for the past 2.5 years. Previously, our default music player was Videos, which was really weird, and now we have none; Music can only play audio files that you’ve navigated to using Music itself, so it’s impossible for Music to be our default music player. My suggestion to rename Videos to Media Player and handle audio files again has not been well-received, so the most likely solution to this conundrum is to teach Music how to open audio files, likely securing its future in core. A merge request exists, but it does not look close to landing. Fedora Workstation is still shipping Rhythmbox rather than Music specifically due to this problem. My opinion is this needs to be resolved for Music to remain in core.

It would be nice to have an email client in GNOME core, since everybody uses email and local clients are much nicer than webmail. The only plausible candidate here is Geary. (If you like Evolution, consider that you might not like the major UI changes and many, many feature removals that would be necessary for Evolution to enter GNOME core.) Geary has only one active maintainer, and adding a big application that depends on just one person seems too risky. If more developers were interested in maintaining Geary, it would feel like a safer addition to GNOME core.

Contacts feels a little out of place currently. It’s mostly useful for storing email addresses, but you cannot actually do anything with them because we have no email application in core. Like Photos, Contacts has had several recent basic functionality bugs that imperiled timely Fedora releases, but these seem to have been largely resolved, so it’s not causing urgent problems. Still, for Contacts to remain in the long term, we’re probably going to need another maintainer here too. And perhaps it only makes sense to keep if we add Geary.

Finally, should Maps move to the Mobile category? It seems clearly useful to have a maps app installed by default on a phone, but I wonder how many desktop users really prefer to use Maps rather than a maps website.

GNOME 44 Core Apps

I’ll end this blog post with an updated list of core apps as of GNOME 44. Here they are:

  • Main category (26 apps):
    • Calculator
    • Calendar
    • Characters
    • Cheese
    • Clocks
    • Connections
    • Console (kgx)
    • Contacts
    • Disks (gnome-disk-utility)
    • Disk Usage Analyzer (baobab)
    • Document Scanner (simple-scan)
    • Document Viewer (evince)
    • Files (nautilus)
    • Fonts (gnome-font-viewer)
    • Help (yelp)
    • Image Viewer (eog)
    • Logs
    • Maps
    • Music
    • Photos
    • Software
    • System Monitor
    • Text Editor
    • Videos (totem)
    • Weather
    • Web (epiphany)
  • Developer Tools (6 apps):
    • Boxes
    • Builder
    • dconf Editor
    • Devhelp
    • D-Spy
    • Sysprof
  • Mobile (1 app):
    • Calls

WebKitGTK API for GTK 4 Is Now Stable

With the release of WebKitGTK 2.40.0, WebKitGTK now finally provides a stable API and ABI for GTK 4 applications. The following API versions are provided:

  • webkit2gtk-4.0: this API version uses GTK 3 and libsoup 2. It is obsolete and users should immediately port to webkit2gtk-4.1. To get this with WebKitGTK 2.40, build with -DPORT=GTK -DUSE_SOUP2=ON.
  • webkit2gtk-4.1: this API version uses GTK 3 and libsoup 3. It contains no other changes from webkit2gtk-4.0 besides the libsoup version. With WebKitGTK 2.40, this is the default API version that you get when you build with -DPORT=GTK. (In 2.42, this might require a different flag, e.g. -DUSE_GTK3=ON, which does not exist yet.)
  • webkitgtk-6.0: this API version uses GTK 4 and libsoup 3. To get this with WebKitGTK 2.40, build with -DPORT=GTK -DUSE_GTK4=ON. (In 2.42, this might become the default API version.)

WebKitGTK 2.38 had a different GTK 4 API version, webkit2gtk-5.0. This was an unstable/development API version and it is gone in 2.40, so applications using it will break. Fortunately, that should be very few applications. If your operating system ships GNOME 42, or any older version, or the new GNOME 44, then no applications use webkit2gtk-5.0 and you have no extra work to do. But for operating systems that ship GNOME 43, webkit2gtk-5.0 is used by gnome-builder, gnome-initial-setup, and evolution-data-server:

  • For evolution-data-server 3.46, use this patch which applies on evolution-data-server 3.46.4.
  • For gnome-initial-setup 43, use this patch which applies on gnome-initial-setup 43.2. (Update: for your convenience, this patch will be included in gnome-initial-setup 43.3.)
  • For gnome-builder 43, all required changes are present in version 43.7.

Remember, patching is only needed for GNOME 43. Other versions of GNOME will have no problems with WebKitGTK 2.40.

There is no proper online documentation yet, but in the meantime you can view the markdown source for the migration guide to help you with porting your applications. Although the API is now stable and it is close to feature parity with the GTK 3 version, there are some problems to be aware of.

Big thanks to everyone who helped make this possible.

Stop Using QtWebKit

Today, WebKit in Linux operating systems is much more secure than it used to be. The problems that I previously discussed in this old, formerly-popular blog post are nowadays a thing of the past. Most major Linux operating systems now update WebKitGTK and WPE WebKit on a regular basis to ensure known vulnerabilities are fixed. (Not all Linux operating systems include WPE WebKit. It’s basically WebKitGTK without the dependency on GTK, and is the best choice if you want to use WebKit on embedded devices.) All major operating systems have removed older, insecure versions of WebKitGTK (“WebKit 1”) that were previously a major security problem for Linux users. And today WebKitGTK and WPE WebKit both provide a webkit_web_context_set_sandbox_enabled() API which, if enabled, employs Linux namespaces to prevent a compromised web content process from accessing your personal data, similar to Flatpak’s sandbox. (If you are a developer and your application does not already enable the sandbox, you should fix that!)
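For reference, enabling the sandbox is essentially a one-liner, but it must run before the first web view is created. Here is a minimal sketch for the GTK 3 API; my understanding is that in the newer GTK 4 API (webkitgtk-6.0) the sandbox is always enabled and this setter no longer exists:

```c
#include <webkit2/webkit2.h>

/* Sketch: opt in to the web process sandbox (available since
 * WebKitGTK 2.26). Call this during startup, before creating any
 * WebKitWebView, so that all web content processes are sandboxed. */
static void
enable_sandbox (void)
{
  WebKitWebContext *context = webkit_web_context_get_default ();

  webkit_web_context_set_sandbox_enabled (context, TRUE);
}
```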

Unfortunately, QtWebKit has not benefited from these improvements. QtWebKit was removed from the upstream WebKit codebase back in 2013. Its current status in Fedora is, unfortunately, representative of other major Linux operating systems. Fedora currently contains two versions of QtWebKit:

  • The qtwebkit package contains upstream QtWebKit 2.3.4 from 2014. I believe this is used by Qt 4 applications. For avoidance of doubt, you should not use applications that depend on a web engine that has not been updated in eight years.
  • The newer qt5-qtwebkit contains Konstantin Tokarev’s fork of QtWebKit, which is de facto the new upstream and without a doubt the best version of QtWebKit available currently. Although it has received occasional updates, most recently 5.212.0-alpha4 from March 2020, it’s still based on WebKitGTK 2.12 from 2016, and the release notes bluntly state that it’s not very safe to use. Looking at WebKitGTK security advisories beginning with WSA-2016-0006, I manually counted 507 CVEs that have been fixed in WebKitGTK 2.14.0 or newer.

These CVEs are mostly (but not exclusively) remote code execution vulnerabilities. Many of those CVEs no doubt correspond to bugs that were introduced more recently than 2.12, but the exact number is not important: what’s important is that it’s a lot, far too many for backporting security fixes to be practical. Since qt5-qtwebkit is two years newer than qtwebkit, the qtwebkit package is no doubt in even worse shape. And because QtWebKit does not have any web process sandbox, any remote code execution is game over: an attacker that exploits QtWebKit gains full access to your user account on your computer, and can steal or destroy all your files, read all your passwords out of your password manager, and do anything else that your user account can do with your computer. In contrast, with WebKitGTK or WPE WebKit’s web process sandbox enabled, attackers only get access to content that’s mounted within the sandbox, which is a much more limited environment without access to your home directory or session bus.

In short, it’s long past time for Linux operating systems to remove QtWebKit and everything that depends on it. Do not feed untrusted data into QtWebKit. Don’t give it any HTML that you didn’t write yourself, and certainly don’t give it anything that contains injected data. Uninstall it and whatever applications depend on it.

Update: I forgot to mention what to do if you are a developer and your application still uses QtWebKit. You should ensure it uses the most recent release of QtWebEngine for Qt 6. Do not use old versions of Qt 6, and do not use QtWebEngine for Qt 5.

Common GLib Programming Errors, Part Two: Weak Pointers

This post is a sequel to Common GLib Programming Errors, where I covered four common errors: failure to disconnect a signal handler, misuse of a GSource handle ID, failure to cancel an asynchronous function, and misuse of main contexts in library or threaded code. Although there are many ways to mess up when writing programs that use GLib, I believe the first post covered the most likely and most pernicious… except I missed weak pointers. Sébastien pointed out that these should be covered too, so here we are.

Mistake #5: Failure to Disconnect Weak Pointer

In object-oriented languages, weak pointers are a safety improvement. The idea is to hold a non-owning pointer to an object that gets automatically set to NULL when that object is destroyed to prevent use-after-free vulnerabilities. However, this only works well because object-oriented languages have destructors. Without destructors, we have to deregister the weak pointer manually, and failure to do so is a disaster that will result in memory corruption that’s extremely difficult to track down. For example:

static void
a_start_watching_b (A *self,
                    B *b)
{
  // Keep a weak reference to b. When b is destroyed,
  // self->b will automatically be set to NULL.
  self->b = b;
  g_object_add_weak_pointer (G_OBJECT (b), (gpointer *) &self->b);
}

static void
a_do_something_with_b (A *self)
{
  if (self->b) {
    // Do something safely here, knowing that b
    // is assuredly still alive. This avoids a
    // use-after-free vulnerability if b is destroyed,
    // i.e. self->b cannot be dangling.
  }
}

Let’s say that the B in this example outlives the A, but A failed to call g_object_remove_weak_pointer(). Then when B is destroyed later, the memory that used to be occupied by self->b will get clobbered with NULL. Hopefully that will result in an immediate crash. If not, good luck trying to debug what’s going wrong when some innocent variable elsewhere in your program gets randomly clobbered. This often results in a frustrating wild goose chase when trying to track down what is going wrong (example).

The solution is to always disconnect your weak pointer. In most cases, your dispose function is the best place to do this:

static void
a_dispose (GObject *object)
{
  A *a = (A *)object;
  g_clear_weak_pointer (&a->b);
  G_OBJECT_CLASS (a_parent_class)->dispose (object);
}

Note that g_clear_weak_pointer() is equivalent to:

if (a->b) {
  g_object_remove_weak_pointer (a->b, &a->b);
  a->b = NULL;
}

but you probably guessed that, because it follows the same pattern as the other clear functions that we’ve used so far.

Common GLib Programming Errors

Let’s examine four mistakes to avoid when writing programs that use GLib, or, alternatively, four mistakes to look for when reviewing code that uses GLib. Experienced GNOME developers will find the first three mistakes pretty simple and basic, but nevertheless they still cause too many crashes. The fourth mistake is more complicated.

These examples will use C, but the mistakes can happen in any language. In unsafe languages like C, C++, and Vala, these mistakes usually result in security issues, specifically use-after-free vulnerabilities.

Mistake #1: Failure to Disconnect Signal Handler

Every time you connect to a signal handler, you must think about when it should be disconnected to prevent the handler from running at an incorrect time. Let’s look at a contrived but very common example. Say you have an object A and wish to connect to a signal of object B. Your code might look like this:

static void
some_signal_cb (gpointer user_data)
{
  A *self = user_data;
  a_do_something (self);
}

static void
some_method_of_a (A *self)
{
  B *b = get_b_from_somewhere ();
  g_signal_connect (b, "some-signal", (GCallback)some_signal_cb, self);
}

Very simple. Now, consider what happens if object B outlives object A, and object B emits some-signal after object A has been destroyed. Then the line a_do_something (self) is a use-after-free, a serious security vulnerability. Drat!

If you think about when the signal should be disconnected, you won’t make this mistake. In many cases, you are implementing an object and just want to disconnect the signal when your object is disposed. If so, you can use g_signal_connect_object() instead of the vanilla g_signal_connect(). For example, this code is not vulnerable:

static void
some_method_of_a (A *self)
{
  B *b = get_b_from_somewhere ();
  g_signal_connect_object (b, "some-signal", (GCallback)some_signal_cb, self, 0);
}

g_signal_connect_object() will disconnect the signal handler whenever object A is destroyed, so there’s no longer any problem if object B outlives object A. This simple change is usually all it takes to avoid disaster. Use g_signal_connect_object() whenever the user data you wish to pass to the signal handler is a GObject. This will usually be true in object implementation code.

Sometimes you need to pass a data struct as your user data instead. If so, g_signal_connect_object() is not an option, and you will need to disconnect manually. If you’re implementing an object, this is normally done in the dispose function:

// Object A instance struct (or priv struct)
struct _A {
  B *b;
  gulong some_signal_id;
};

static void
some_method_of_a (A *self)
{
  B *b = get_b_from_somewhere ();
  g_assert (self->some_signal_id == 0);
  self->b = b;
  self->some_signal_id = g_signal_connect (b, "some-signal", (GCallback)some_signal_cb, self);
}

static void
a_dispose (GObject *object)
{
  A *a = (A *)object;
  g_clear_signal_handler (&a->some_signal_id, a->b);
  G_OBJECT_CLASS (a_parent_class)->dispose (object);
}

Here, g_clear_signal_handler() first checks whether a->some_signal_id is 0. If not, it disconnects the handler and sets a->some_signal_id to 0. Setting your stored signal ID to 0 and checking whether it is 0 before disconnecting is important because dispose may run multiple times to break reference cycles. Attempting to disconnect the signal multiple times is another common programmer error!

Instead of calling g_clear_signal_handler(), you could equivalently write:

if (a->some_signal_id != 0) {
  g_signal_handler_disconnect (a->b, a->some_signal_id);
  a->some_signal_id = 0;
}

But writing that manually is no fun.

Yet another way to mess up would be to use the wrong integer type to store the signal ID, like guint instead of gulong.

There are other disconnect functions you can use to avoid the need to store the signal handler ID, like g_signal_handlers_disconnect_by_data(), but I’ve shown the most general case.

Sometimes, object implementation code will intentionally not disconnect signals if the programmer believes that the object that emits the signal will never outlive the object that is connecting to it. This assumption may usually be correct, but since GObjects are refcounted, they may be reffed in unexpected places, leading to use-after-free vulnerabilities if this assumption is ever incorrect. Your code will be safer and more robust if you disconnect always.

Mistake #2: Misuse of GSource Handler ID

Mistake #2 is basically the same as Mistake #1, but using GSource rather than signal handlers. For simplicity, my examples here will use the default main context, so I don’t have to show code to manually create, attach, and destroy the GSource. The default main context is what you’ll want to use if (a) you are writing application code, not library code, and (b) you want your callbacks to execute on the main thread. (If either (a) or (b) does not apply, then you need to carefully study GMainContext to ensure you do not mess up; see Mistake #4.)

Let’s use the example of a timeout source, although the same style of bug can happen with an idle source or any other type of source that you create:

static gboolean
my_timeout_cb (gpointer user_data)
{
  A *self = user_data;
  a_do_something (self);
  return G_SOURCE_REMOVE;
}

static void
some_method_of_a (A *self)
{
  g_timeout_add (42, (GSourceFunc)my_timeout_cb, self);
}

You’ve probably guessed the flaw already: if object A is destroyed before the timeout fires, then the call to a_do_something() is a use-after-free, just like when we were working with signals. The fix is very similar: store the source ID and remove it in dispose:

// Object A instance struct (or priv struct)
struct _A {
  guint my_timeout_id;
};

static gboolean
my_timeout_cb (gpointer user_data)
{
  A *self = user_data;
  a_do_something (self);
  self->my_timeout_id = 0;
  return G_SOURCE_REMOVE;
}

static void
some_method_of_a (A *self)
{
  g_assert (self->my_timeout_id == 0);
  self->my_timeout_id = g_timeout_add (42, (GSourceFunc)my_timeout_cb, self);
}

static void
a_dispose (GObject *object)
{
  A *a = (A *)object;
  g_clear_handle_id (&a->my_timeout_id, g_source_remove);
  G_OBJECT_CLASS (a_parent_class)->dispose (object);
}

Much better: now we’re not vulnerable to the use-after-free issue.

As before, we must be careful to ensure the source is removed exactly once. If we remove the source multiple times by mistake, GLib will usually emit a critical warning, but if you’re sufficiently unlucky you could remove an innocent unrelated source by mistake, leading to unpredictable misbehavior. This is why we need to write self->my_timeout_id = 0; before returning from the timeout function, and why we need to use g_clear_handle_id() instead of g_source_remove() on its own. Do not forget that dispose may run multiple times!

We also have to be careful to return G_SOURCE_REMOVE unless we want the callback to execute again, in which case we would return G_SOURCE_CONTINUE. Do not return TRUE or FALSE, as that is harder to read and will obscure your intent.

Mistake #3: Failure to Cancel Asynchronous Function

When working with asynchronous functions, you must think about when each operation should be canceled to prevent the callback from executing too late. Because passing a GCancellable to asynchronous function calls is optional, it’s common to see code omit the cancellable. Be suspicious when you see this. The cancellable is optional because sometimes it is really not needed, and when this is true, it would be annoying to require it. But omitting it will usually lead to use-after-free vulnerabilities. Here is an example of what not to do:

static void
something_finished_cb (GObject      *source_object,
                       GAsyncResult *result,
                       gpointer      user_data)
{
  A *self = user_data;
  B *b = (B *)source_object;
  g_autoptr (GError) error = NULL;

  if (!b_do_something_finish (b, result, &error)) {
    g_warning ("Failed to do something: %s", error->message);
    return;
  }

  a_do_something_else (self);
}

static void
some_method_of_a (A *self)
{
  B *b = get_b_from_somewhere ();
  b_do_something_async (b, NULL /* cancellable */, something_finished_cb, self);
}

This should feel familiar by now. If we did not use A inside the callback, then we would have been able to safely omit the cancellable here without harmful effects. But instead, this example calls a_do_something_else(). If object A is destroyed before the asynchronous function completes, then the call to a_do_something_else() will be a use-after-free.

We can fix this by storing a cancellable in our instance struct, and canceling it in dispose:

// Object A instance struct (or priv struct)
struct _A {
  GCancellable *cancellable;
};

static void
something_finished_cb (GObject      *source_object,
                       GAsyncResult *result,
                       gpointer      user_data)
{
  B *b = (B *)source_object;
  A *self = user_data;
  g_autoptr (GError) error = NULL;

  if (!b_do_something_finish (b, result, &error)) {
    if (!g_error_matches (error, G_IO_ERROR, G_IO_ERROR_CANCELLED))
      g_warning ("Failed to do something: %s", error->message);
    return;
  }
  a_do_something_else (self);
}

static void
some_method_of_a (A *self)
{
  B *b = get_b_from_somewhere ();
  b_do_something_async (b, self->cancellable, something_finished_cb, self);
}

static void
a_init (A *self)
{
  self->cancellable = g_cancellable_new ();
}

static void
a_dispose (GObject *object)
{
  A *a = (A *)object;

  g_cancellable_cancel (a->cancellable);
  g_clear_object (&a->cancellable);

  G_OBJECT_CLASS (a_parent_class)->dispose (object);
}

Now the code is not vulnerable. Note that, since you usually do not want to print a warning message when the operation is canceled, there’s a new check for G_IO_ERROR_CANCELLED in the callback.

Update #1: I managed to mess up this example in the first version of my blog post. The example above is now correct, but what I wrote originally was:

if (!b_do_something_finish (b, result, &error) &&
    !g_error_matches (error, G_IO_ERROR, G_IO_ERROR_CANCELLED)) {
  g_warning ("Failed to do something: %s", error->message);
  return;
}
a_do_something_else (self);

Do you see the bug in this version? Cancellation causes the asynchronous function call to complete the next time the application returns control to the main context. It does not complete immediately. So when the operation is canceled from dispose, the callback runs only after A has already been destroyed: the error will be G_IO_ERROR_CANCELLED, we’ll skip the return, and we’ll execute a_do_something_else() anyway, triggering the very use-after-free the example was intended to avoid. Yes, my attempt to show you how to avoid a use-after-free itself contained a use-after-free. You might decide this means I’m incompetent, or you might decide that it means it’s too hard to safely use unsafe languages. Or perhaps both!

Update #2: My original example had an unnecessary explicit check for NULL in the dispose function. Since g_cancellable_cancel() is NULL-safe, the dispose function will cancel only once even if dispose runs multiple times, because g_clear_object() will set a->cancellable = NULL. Thanks to Guido for suggesting this improvement in the comments.

Mistake #4: Incorrect Use of GMainContext in Library or Threaded Code

My fourth common mistake is really a catch-all mistake for the various other ways you can mess up with GMainContext. These errors can be very subtle and will cause functions to execute at unexpected times. Read this main context tutorial several times. Always think about which main context you want callbacks to be invoked on.

Library developers should pay special attention to the section “Using GMainContext in a Library.” It documents several security-relevant rules:

  • Never iterate a context created outside the library.
  • Always remove sources from a main context before dropping the library’s last reference to the context.
  • Always document which context each callback will be dispatched in.
  • Always store and explicitly use a specific GMainContext, even if it often points to some default context.
  • Always match pushes and pops of the thread-default main context.

If you fail to follow all of these rules, functions will be invoked at the wrong time, or on the wrong thread, or won’t be called at all. The tutorial covers GMainContext in much more detail than I possibly can here. Study it carefully. I like to review it every few years to refresh my knowledge. (Thanks Philip Withnall for writing it!)

Properly-designed libraries follow one of two conventions for which main context to invoke callbacks on: they may use the main context that was thread-default at the time the asynchronous operation started, or, for method calls on an object, they may use the main context that was thread-default at the time the object was created. Hopefully the library explicitly documents which convention it follows; if not, you must look at the source code to figure out how it works, which is not fun. If the library documentation does not indicate that it follows either convention, it is probably unsafe to use in threaded code.

Conclusion

All four mistakes are variants on the same pattern: failure to prevent a function from being unexpectedly called at the wrong time. The first three mistakes commonly lead to use-after-free vulnerabilities, which attackers abuse to hack users. The fourth mistake can cause more unpredictable effects. Sadly, today’s static analyzers are probably not smart enough to catch these mistakes. You could catch them if you write tests that trigger them and run them with an address sanitizer build, but that’s rarely realistic. In short, you need to be especially careful whenever you see signals, asynchronous function calls, or main context sources.

Sequel

The adventure continues in Common GLib Programming Errors, Part Two: Weak Pointers.

Best Practices for Build Options

Build options are sometimes tricky to get right. Here’s my take on best practices. The golden rule is to set good upstream defaults. Everything else follows from this.

Rule #1: Choose Good Upstream Defaults

Occasionally I see upstream developers complain that a downstream operating system has built their software “incorrectly,” generally because some important dependency or feature has been disabled. Sometimes downstreams really do mess up, but more often poor upstream defaults are to blame. Upstreams must set good defaults because upstream software developers know far more about their projects than downstream packagers do. Upstreams generally have a good idea of how they expect software to be built by downstreams, whereas downstreams generally do not. Accordingly, do the thinking upstream whenever possible. When you set good defaults, it becomes easier for downstreams to build your software the way you expect, because active effort is required for downstreams to mess things up.

For example, say a project has the following two build options:

Option Name                                 Default Value
--enable-thing-you-usually-want-enabled     false
--disable-thing-you-rarely-want-disabled    true

The thing you usually want enabled is not enabled by default, and the thing you rarely want disabled is disabled by default. Sad. Unfortunately, this pattern used to be extremely common with Autotools build systems, because in the real world, the names of the options are more subtle than this, and also because nobody likes squinting at configure.ac files to audit whether the options make sense. Meson build systems tend to be somewhat better because meson_options.txt is separate from the rest of the build definitions, making it easier to review all your options and check to ensure their defaults are set appropriately. However, there are still a few more subtle ways you can mess up your Meson build system, which I’ll discuss below.
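In Meson, avoiding this antipattern is mostly a matter of writing an explicit, sensible default for every option in meson_options.txt, where reviewers can see them all at once. A hypothetical example (option names invented to mirror the table above):

```meson
# meson_options.txt: every default visible and reviewable in one place
option('thing_you_usually_want', type: 'boolean', value: true,
       description: 'The thing you usually want enabled')
option('thing_you_rarely_disable', type: 'boolean', value: true,
       description: 'The thing you rarely want disabled')
```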

Rule #2: Prefer Upstream Defaults Downstream

Conversely, downstreams should not second-guess upstream defaults unless they have a good reason to do so and really know what they’re doing.

For example, glib-networking’s Meson build system provides you with two different TLS backend options: OpenSSL or GnuTLS. The GnuTLS backend is enabled by default (well, sort of, see the next section on auto dependencies) while the OpenSSL backend is disabled by default. There’s a good reason for this: the OpenSSL backend is half-baked, partially due to bugs in glib-networking, and partially because OpenSSL just cannot do certain things that GnuTLS can. The OpenSSL backend is provided because some environments really do require it for license reasons, but it’s not the right choice for general-purpose operating systems. It may be tempting to think that you can pick whichever library you prefer, but you really should not.

Another example: WebKitGTK’s CMake build system provides a USE_WPE_RENDERER build option, which is enabled by default. This option controls which graphics rendering stack is used: if enabled, rendering uses libwpe and wpebackend-fdo, whereas if disabled, rendering uses a legacy internal Wayland compositor. The option is provided because libwpe and wpebackend-fdo are newer dependencies that are expected to be unavailable on older (pre-2020) operating systems, so older operating systems legitimately need to be able to disable it. But the disabled configuration receives little serious testing and upstream developers do not notice when it breaks, so you really should not be using it unless you have to. This recently caused rendering bugs that appeared to be distribution-specific, which upstream developers were unwilling to investigate because they could not reproduce the issue.

Sticking with upstream defaults is generally safest. Sometimes you really need to override them. If so, go ahead. Just be careful.

Rule #3: Handle Auto Dependencies and Features with Care

The worst default ever is “build with feature enabled only if dependency xyz is installed; otherwise, disable it.” This is called an auto dependency. If using CMake or Autotools, auto dependencies are almost never permissible, and in this case “handle with care” means repent and fix it. Auto dependencies are acceptable only if you are using the Meson build system.

The theory behind auto dependencies is that it’s convenient for people casually building the software to do so with the fewest number of build errors possible, which is true. Problem is, this screws over serious production builds of your software by requiring your downstreams to possess magical knowledge of what dependencies are required to build your software properly. Users generally expect most features to be enabled at build time, but if upstream uses auto dependencies, the result is a build dependencies lottery: your feature will be enabled or disabled due to luck, based on which downstream build dependencies transitively depend on which other build dependencies. Even if it’s built properly today, that could easily change tomorrow when some other dependency changes in some other package. Just say no. Do not expect downstreams to look at your build system at all, let alone study the possible build options and make accurate judgments about which build dependencies are required to enable them. Avoiding auto dependencies is part of setting good upstream defaults.

Look at this example from WebKit’s OptionsGTK.cmake:

if (ENABLE_SPELLCHECK)
    find_package(Enchant)
    if (NOT PC_ENCHANT_FOUND)
        message(FATAL_ERROR "Enchant is needed for ENABLE_SPELLCHECK")
    endif ()
endif ()

ENABLE_SPELLCHECK is ON by default. If you don’t have Enchant installed, the build will fail unless you manually disable spellchecking by passing -DENABLE_SPELLCHECK=OFF. This makes it hard to mess up: downstreams have to make an intentional choice to build with spellchecking disabled. It cannot happen by accident.

Many projects would instead write it like this:

if (ENABLE_SPELLCHECK)
    find_package(Enchant)
    if (NOT PC_ENCHANT_FOUND)
        set(ENABLE_SPELLCHECK OFF)
    endif ()
endif ()

But this is an auto dependency, which results in downstream build dependency lottery. If you write your build system like this, you cannot complain when the feature winds up disabled by mistake in downstream builds. Don’t do this.

Exception: if you use Meson, auto dependencies are acceptable if you use the feature option type and set the default to auto. Although auto features are silently enabled or disabled by default depending on whether the required dependency is present, you can easily override this behavior for serious production builds by passing -Dauto_features=enabled, which enables all the auto features and will result in build failures if dependencies are missing. All major Linux operating systems do this when building Meson packages, so Meson’s auto features should not cause problems.
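For reference, a Meson auto feature looks something like this (the option name and dependency are illustrative, not taken from a real project):

```meson
# meson_options.txt
option('spellcheck', type: 'feature', value: 'auto',
       description: 'Spell checking via Enchant')

# meson.build: with 'auto', the feature is silently disabled if enchant-2
# is missing; with -Dauto_features=enabled, a missing dependency becomes a
# hard build error instead.
enchant_dep = dependency('enchant-2', required: get_option('spellcheck'))
```

Because required: accepts the feature object directly, the downstream override -Dauto_features=enabled flips every such 'auto' option to 'enabled' in one stroke, which is exactly why Meson’s auto features escape the dependency lottery.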

Rule #4: Be Very Careful with Meson’s Build Types

Let’s categorize software builds into production builds or non-production builds. A production build is intended to be either distributed to end users or else run production workloads, whereas a non-production build is intended for testing or development and might have extra debug features enabled, like slow assertions. (These are more commonly called release builds or debug builds, but that terminology would be very confusing in the context of this discussion, as you’re about to see.)

The CMake and Meson build systems give us more than just two build types. Compare CMake build types to the corresponding Meson build types:

CMake Build Type   Meson Build Type   Meson debug Option   Production Build? (excludes Windows)
Release            release            false                Yes
Debug              debug              true                 No
RelWithDebInfo     debugoptimized     true                 Yes, be careful!
MinSizeRel         minsize            true                 Yes, be careful!
N/A                plain              false                Yes

To simplify, let’s exclude Windows from the discussion for now. (We’ll come back to Windows in a bit.) Now, notice the nomenclature difference between CMake’s RelWithDebInfo (“release with debuginfo”) build type versus Meson’s debugoptimized build type. This build type functions exactly the same for both Meson and CMake, but CMake’s name is better because it clearly indicates that this is a release or production build type, whereas the Meson name (together with the fact that Meson’s debug option is set to true) suggests a debug or non-production build type. In fact, it is an optimized production build with debuginfo enabled, the same style of build that almost all Linux operating systems use for their packages (although operating systems use the plain build type instead). The same problem exists for Meson’s minsize build type. This is another production build type where debug is true.

The Meson build type name accurately reflects that the debug option is enabled, but this is very confusing because for most platforms, that option only controls whether debuginfo is generated. Looking at the table above, you can see that you must never use the debug option alone to decide whether you have a production build or a non-production build. As the table indicates, the only non-production build type is the vanilla debug build type, which you can detect by checking the combination of the debug and optimization options. You have a non-production (debug) build if debug is true and if optimization is 0 or g; otherwise, you have a production build.  I wrote this in bold because it is important and not at all obvious. (However, before applying this rule in a cross-platform project, keep reading below to see the huge caveat regarding Windows.)
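In meson.build terms, the rule translates to a sketch like this (the variable names are mine, and remember the rule does not hold on Windows):

```meson
# Non-production (debug) build only when debug is enabled AND the
# optimization level is '0' or 'g'; everything else is production.
is_debug_build = get_option('debug') and (get_option('optimization') in ['0', 'g'])
is_production_build = not is_debug_build
```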

Here’s an example of what not to do in your meson.build:

# Use debug/optimization flags to determine whether to enable debug or disable
# cast checks
gtk_debug_cflags = []
debug = get_option('debug')
optimization = get_option('optimization')
if debug
  gtk_debug_cflags += '-DG_ENABLE_DEBUG'
  if optimization in ['0', 'g']
    gtk_debug_cflags += '-DG_ENABLE_CONSISTENCY_CHECKS'
  endif
elif optimization in ['2', '3', 's']
  gtk_debug_cflags += ['-DG_DISABLE_CAST_CHECKS', '-DG_DISABLE_ASSERT']
endif

This is from GTK’s meson.build. The code based only on the optimization option is OK, but the code that sets -DG_ENABLE_DEBUG is looking only at the debug option. What the code really wants to do is set G_ENABLE_DEBUG if this is a non-production build, but instead it is tied to debuginfo, which is not the desired result. Downstreams are forced to scratch their heads as to what they should do. Impassioned build engineers have held spirited debates about this particular meson.build snippet. Don’t do this! (I will submit a merge request to improve this.)

Here’s a much better, although still not perfect, example of how to do the same thing, this time from GLib’s meson.build:

# Use debug/optimization flags to determine whether to enable debug or disable
# cast checks
glib_debug_cflags = []
glib_debug = get_option('glib_debug')
if glib_debug.enabled() or (glib_debug.auto() and get_option('debug'))
  glib_debug_cflags += ['-DG_ENABLE_DEBUG']
  message('Enabling various debug infrastructure')
elif get_option('optimization') in ['2', '3', 's']
  glib_debug_cflags += ['-DG_DISABLE_CAST_CHECKS']
  message('Disabling cast checks')
endif

if not get_option('glib_assert')
  glib_debug_cflags += ['-DG_DISABLE_ASSERT']
  message('Disabling GLib asserts')
endif

if not get_option('glib_checks')
  glib_debug_cflags += ['-DG_DISABLE_CHECKS']
  message('Disabling GLib checks')
endif

Notice how GLib provides explicit build options that allow downstreams to decide whether debug should be enabled or not. Using explicit build options here was a good idea! The defaults for glib_assert and glib_checks are intentionally set to true to encourage their use in production builds, while G_DISABLE_CAST_CHECKS is based only on the optimization level. But sadly, if not explicitly configured, GLib sets the value of the glib_debug option automatically, based only on the value of the debug option. This is actually an OK use of an auto feature, because it is a carefully-considered attempt to provide good default behavior for downstreams, but it fails here because it assumes that debug means “non-production build,” which we have previously established cannot be determined without checking optimization as well. (I will submit a merge request to improve this.)

Here’s another helpful table that shows how the various build types correspond to CFLAGS:

CMake/Meson Build Type          CMake CFLAGS      Meson CFLAGS
Release/release                 -O3 -DNDEBUG      -O3
Debug/debug                     -g                -O0 -g
RelWithDebInfo/debugoptimized   -O2 -g -DNDEBUG   -O2 -g
MinSizeRel/minsize              -Os -DNDEBUG      -Os -g

Notice Meson’s minsize build type includes debuginfo, while CMake’s does not. Since debuginfo requires a huge amount of space, CMake’s behavior seems better here. We’ll discuss NDEBUG momentarily.

OK, so that all makes sense, right? Well I thought so too, until I ran a draft of this blog post past Jussi, who pointed out that the Meson build types function completely differently on Windows than they do on other platforms. Unfortunately, whereas on most platforms the debug option only controls debuginfo generation, on Windows it instead controls whether the C library enables extra runtime debugging checks. So while debugoptimized and minsize are production build types on Linux and have nice corresponding CMake build types, they are non-production build types on Windows. This is a Meson defect. The point to remember is that the debug option is completely different on Windows than it is on other platforms, so my otherwise-nice rule for detecting production builds does not work properly on Windows. Cross-platform projects need to be especially careful with the debug option. There are various ways this could be fixed in Meson in the future: a nice simple proposal would be to add a new debuginfo option separate from debug, then deprecate the debugoptimized build type and replace it with releasewithdebuginfo.

CMake dodges all these problems and avoids any ambiguity because its build types are named differently: “RelWithDebInfo” and “MinSizeRel” leave no doubt that you are dealing with a release (production) build.

Rule #5: Think about NDEBUG

The other behavior difference visible in the table above is that CMake defines NDEBUG for its production build types, whereas Meson has a separate option b_ndebug that controls whether to define NDEBUG. NDEBUG controls whether the C and C++ assert() macro is enabled: if NDEBUG is defined, asserts are disabled. CMake is the only build system that defines NDEBUG for you automatically. You really need to think about this: if your software is performance-sensitive and contains slow assertions, the consequences of messing this up are severe, e.g. see this historical mesa bug where Fedora’s mesa package suffered a 10x slowdown because mesa upstream accidentally enabled assertions by default. Again, please, do not blame downstreams for bad upstream defaults: downstreams are (usually) not experts on upstream software, and cannot possibly be expected to pick better defaults than upstream’s.

Meson allows developers to explicitly choose whether to enable assertions in production builds. Assertions are enabled in production by default, the opposite of CMake’s behavior. Some developers prefer that all asserts be disabled in production builds to optimize speed as far as possible, but this is usually not the best choice: having assertions enabled in production provides valuable confidence that your code is actually functioning as intended, and often improves security by converting many code execution exploits into denial of service. Most assertions do not have noticeable performance impact, so I prefer to leave most assertions enabled by default in production, and disable only asserts that are slow. Hence, I like Meson’s default behavior. But many engineers disagree with me, and some projects really do need assertions disabled in production; in particular, everyone agrees that performance-sensitive assertions should not be running in production builds. If you’re using Meson and want assertions disabled in production builds, you’re in theory supposed to use b_ndebug=if-release, but it doesn’t actually work because it only disables assertions if your build type is release or plain, while leaving assertions enabled for debugoptimized and minsize builds. We’ve already established that these are both production build types, so sadly that behavior is broken. Instead, it’s better to manually define NDEBUG except in non-production builds. Again, you have a non-production (debug) build if debug is true and if optimization is 0 or g; otherwise, you have a production build (except on Windows).
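Manually defining NDEBUG along those lines might look like the following meson.build sketch (subject to the same Windows caveat):

```meson
# Define NDEBUG (disabling assert()) in every production build, including
# debugoptimized and minsize, which b_ndebug=if-release misses.
if not (get_option('debug') and (get_option('optimization') in ['0', 'g']))
  add_project_arguments('-DNDEBUG', language: ['c', 'cpp'])
endif
```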

Rule #6: plain Means “Production Build,” Not “No Flags”

The GNOME release team recently had an exciting debate about the meaning of Meson’s plain build type. It is impressive how build engineers can be so enthusiastic about build options!

I asked Jussi to explain the plain build type. His response was: “Meson does not, by itself, add compiler flags,” emphasis mine. It does not mean your project should not add its own compiler flags, and it certainly does not mean it’s OK to set bad defaults as long as they are vanilla-flavored. It is a production build type, and you should ensure that it receives defaults in line with the other production build types. You’ll be fine if you follow the same rule we already established: you have a non-production (debug) build if debug is true and if optimization is 0 or g; otherwise, you have a production build (except on Windows).

The plain build type exists because it makes it easier for downstreams to implement their own compiler flags. Downstreams already have to pass -O2 -g via CFLAGS for other build systems, because CMake and Meson are the only build systems that can add these flags automatically, and it’s easier to let downstreams disable this functionality than to force them to set different CFLAGS separately for each supported build system.
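A downstream-style invocation might therefore look something like this (the flag values are illustrative):

```sh
# The packager supplies all compiler flags; with plain, Meson adds none.
export CFLAGS="-O2 -g"
meson setup build --buildtype=plain
```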

Rule #7: Don’t Forget Hardening Flags

Sadly, by default all build systems generate insecure, unhardened binaries that should never be used in production. This is true of Autotools, CMake, Meson, and likely also whatever other build system you are thinking of. You must manually add your own hardening flags or your builds will be insecure. Unfortunately this is a little complicated to do. Fedora and RHEL’s recommended compiler flags are documented here. The freedesktop-sdk and GNOME Flatpak runtimes use these recommendations as the basis for their compiler flags, and by default, so do Flatpak applications based on these runtimes. It’s actually not very easy to replicate the same hardening flags since libraries and executables require different flags, so naively setting CFLAGS is not possible. Fedora and RHEL use GCC spec files to achieve this, whereas freedesktop-sdk relies on building GCC itself with a non-default configuration (yes, second-guessing upstream defaults). The good news is that all major downstreams have figured this out one way or another, so you only need to worry about it if you’re doing your own production builds.
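As a rough illustration only, based on commonly documented Fedora/RHEL-style recommendations (consult your distribution for the current, complete set):

```sh
# Hardening flags must differ between executables and libraries, so a
# single CFLAGS export like this is only an approximation.
CFLAGS="-O2 -g -fstack-protector-strong -D_FORTIFY_SOURCE=2"
LDFLAGS="-Wl,-z,relro -Wl,-z,now"
# Executables additionally want -fPIE (compile) and -pie (link);
# shared libraries want -fPIC instead.
```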

Conclusion

That’s all you need to know. Remember, upstreams know their software better than downstreams do, so the hard thinking should happen upstream. We can minimize mistakes and trouble if upstreams carefully set good defaults, and if downstreams deviate from those defaults only when truly necessary. Keep these rules in mind to avoid unnecessary bug reports from dissatisfied users.

History

I updated this blog post on August 3, 2022 to change the primary guidance to “You have a non-production (debug) build if debug is true and if optimization is 0 or g; otherwise, you have a production build.” Originally, I failed to consider g. -Og means “optimize debugging experience” and it is supposedly a better choice than -O0 for debugging, according to gcc(1). It’s definitely not, actually, but at least that’s the intent.

Jussi responded to this blog post on August 13, 2022 to discuss why Meson’s build types don’t work so well. Read his response.