
Tuesday, April 23, 2024

MediaWiki Users and Developers Conference Spring 2024

Last week I went to Portland for the MediaWiki Users and Developers conference (née EMWCon). This is primarily a conference for people doing stuff with MediaWiki outside of Wikimedia. I had a blast.



I always enjoy conferences on the smaller side. They feel so much more personal. This year's conference had Ward Cunningham as the guest of honour. Ward was a fascinating person to meet and get to talk to.

I also must say hats off to the organizers - the conference ran smoothly, the venue was great, and the food was amazing. Seriously some of the best food I've ever had at any Wikimedia conference.

This was also my first time in Portland. Portland is a beautiful city. I didn't have a huge amount of time to explore the city, but I did manage to go to the Chinese garden, which was absolutely stunning. I also loved how many interesting murals there were in the city. Even the graffiti seemed prettier than normal.

 
While listening to the talks, I realized that a good talk is very similar to a good design doc. Perhaps this is an obvious comparison, but I never really noticed before how similar the two things are. In both cases, you want to give the reader/viewer context about the problem you want to solve, what solution you chose, why you chose it and how it worked out. At the same time you want to avoid the temptation to go too far into implementation details.

I think my favourite talk was Jeffery's. He demo'd using LLMs to answer questions based on the content of the Wiki. The demo deities weren't fully in his favour, but I think it also demonstrated an important point that LLMs are cutting edge technologies that don't always give the expected answer 100% of the time. In any case, he did a great job presenting.

I did get the sense that some participants were disappointed that there was very little representation of WMF management (whether "real" management or product management) at the conference. Birgit did give a remote talk and Selena did come to a happy hour event after the conference, but neither really participated.

I don't think the participants necessarily wanted anything from WMF management, but there is a little bit of a feeling of being unseen. Many of the conference goers use MediaWiki for their own purposes and are interested to know what WMF's plans are for the future and how they will affect them (as do we all).


I think some participants were hoping to maybe make some connections for better mutual understanding and just reduce uncertainty about what is on the roadmap for MediaWiki. In theory Birgit's talk was about the plans for MediaWiki, but I suspect it was too laden with annual planning corporate buzzwords for anyone to figure out what it actually meant concretely.

The flip side of that of course is that open source is a do-ocracy. The corporate MediaWiki users as a general rule do not contribute back to MediaWiki core all that often, and contributing is the price of admission to the various power structures of MediaWiki.

Create Camp

At the create camp, I had a long chat with Mark about what parts of the documentation are unclear to users new to MediaWiki. While I think all of us will admit that our documentation is sub-par (bug 1), it was great to get a fresh perspective on it.
 
I think adding screencasts in addition to the written documentation can help with the problem of assumed knowledge and missing implied steps.

I also heard a bit about SemanticMediaWiki (SMW) bug 5392. This is a bug where sometimes SMW drops properties associated with a page. It seems like there is a lot of frustration among the SMW community over this bug. At the same time, it doesn't seem like anyone has seriously tried to debug it. The bug does look a bit annoying to track down. It appears to be some sort of race condition, appearing somewhat randomly and more often when there are multiple things going on at the same time (e.g. running the job queue with more threads seems to make it more common), but nobody really knows, hence there are no steps to reproduce. Additionally, there have been no attempts to create a minimal test case (e.g. what extensions are needed for the bug to appear), nor has anyone posted any debug logs from the parses in question. No one has even determined if the properties are missing at parse time or if they are being overridden at a later time. Anyways, I suspect it's going nowhere unless people post a lot more information on the task or they hand over a server experiencing the bug to someone good at debugging.

Conclusion

I had a great time. Hopefully I'll be able to come again next year.

Saturday, March 30, 2024

MediaWiki edit summary XSS write-up

Back in January, I discovered a stored XSS vulnerability in core MediaWiki (T355538; CVE-2024-34507). Essentially, by setting a specific edit summary when editing a page, you could run javascript (and take over the account of anyone viewing the edit summary, for example on the history page or recent changes).

MediaWiki core is generally pretty good when it comes to security. There are many sketchy extensions, and sometimes there are issues where an admin might be able to run javascript, but by and large unauthenticated XSS vulns are fairly rare. I think the last one was CVE-2021-44858 from back in 2021. The next one before that was CVE-2017-8815 in 2017, which only applied to wikis configured to have a site language of certain languages (e.g. Serbian and Chinese). At least, those were the ones I found when looking through the list. Hopefully I didn't miss any. In any case, finding XSS triggerable by an unprivileged attacker in MediaWiki core is pretty hard.

So what is the bug? The proof of concept looks like this - Create an edit with the following edit summary:

[[Special:RecentChanges#%1b0000000|link1]] [[PageThatExists#/autofocus/onfocus=alert("xss\n"+document.domain)//|link2]]

This feels a bit random at first glance. How does it work?

The edit summary parser

Whenever you edit a page on MediaWiki, there is a box for your edit summary. This is essentially MediaWiki's version of a commit message.

Very little formatting is allowed in this summary. A major exception is links. You can link to other pages by enclosing the link in [[ and ]].

So this explains a little bit about the proof-of-concept - it involves 2 links. But why 2? It doesn't work with just 1. What is with the weird link targets? They are clearly abnormal, but they also don't look like typical XSS. There are no < or >, there aren't even any unclosed quotes.

Lets take a deeper look at how MediaWiki applies formatting to these edit summaries. The code where all this happens is includes/CommentFormatter/CommentParser.php.

The first thing we might notice is the following line in CommentParser::preprocessInternal: "// \x1b needs to be stripped because it is used for link markers"

In the proof of concept, the first part is [[Special:RecentChanges#%1b0000000|link1]], where %1b appears. This is a good hint that it has something to do with link markers, whatever those are.

Link markers

But what are link markers?

When MediaWiki makes a link, it needs to know whether the page being linked to exists or not, since missing pages use a red colour. The most natural way of doing this is, when encountering a link, to check in the DB whether or not the page exists.

However, there is a problem. When rendering a long page with a lot of links, we have to do a lot of DB lookups. The lookups are simple, but they still go to a separate (albeit nearby) server. Each page to look up involves a local network request to fetch the page status. While that is happening, MW just sits and waits. This is all very fast, but even still it adds up a little bit if you have say 500 links on a page.

The solution to this problem was to batch the queries. Instead of immediately looking up the page, MW would put a small link marker in the page at that point and carry on. Once it is finished, it would look up all the links all at once, and then do another pass to replace all the link markers.

So this is what a link marker is, just a little marker to tell MW to come back to this spot later after it figured out if all the links exist. The format of this marker is \x1B<number> (So \x1B0000000 for the first one, \x1B0000001 for the second, and so on). \x1B is the ASCII escape character.
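To make the two-pass idea concrete, here is a rough standalone sketch of the pattern (a simplified illustration only, not the actual MediaWiki code - the regex and the link formatting are made up):

// Pass 1: swap each [[link]] for a \x1B<number> marker instead of querying the DB per link.
$markers = [];
$comment = 'See [[Foo]] and [[Bar]]';
$withMarkers = preg_replace_callback(
    '/\[\[([^\]|]+)\]\]/',
    function ( $m ) use ( &$markers ) {
        $id = sprintf( "\x1B%07d", count( $markers ) );
        $markers[$id] = $m[1]; // remember which title this marker stands for
        return $id;
    },
    $comment
);

// Pass 2: one batched "do these pages exist?" lookup (faked here), then a second
// string pass that swaps every marker for its finished <a> tag.
$replacements = [];
foreach ( $markers as $id => $title ) {
    $replacements[$id] = '<a href="/wiki/' . rawurlencode( $title ) . '">'
        . htmlspecialchars( $title ) . '</a>';
}
echo strtr( $withMarkers, $replacements ), "\n";

The important property for the bug is that second pass: it is a plain text substitution that assumes every marker it finds is sitting in ordinary HTML context.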

Back to the PoC

This explains the first part of the proof of concept: [[Special:RecentChanges#%1b0000000|link1]] - the link target is a link marker. The code has a line:

                                // Fix up urlencoded title texts (copied from Parser::replaceInternalLinks)
                                if ( strpos( $match[1], '%' ) !== false ) {
                                        $match[1] = strtr(
                                                rawurldecode( $match[1] ),
                                                [ '<' => '&lt;', '>' => '&gt;' ]
                                        );
                                }


This normalizes titles that use percent encoding into the real characters. Thus the %1B gets replaced with an actual 0x1B byte. The code did try to strip 0x1B characters earlier, but at that point it was still just %1b and did not match the check.
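You can see the ordering problem in isolation with a couple of lines of PHP (a standalone illustration of the strip-then-decode order described above, not MediaWiki code):

$summary  = '%1b0000000';
$stripped = str_replace( "\x1B", '', $summary ); // nothing to strip yet: no raw 0x1B byte
$decoded  = rawurldecode( $stripped );           // now begins with a real 0x1B byte
echo bin2hex( $decoded[0] ), "\n";               // prints "1b" - a forged link marker prefix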

We now have a link with a link marker inside of it. An important note here is that Special:RecentChanges is not a normal page. It is a special page. MediaWiki knows it exists without having to consult the database, so it does not get the link marker treatment. This is important because we cannot use it as a fake link marker if it gets replaced by a real link marker.

At this stage after inserting link markers, the proof of concept becomes the following string:

<a href="/w/index.php/Special:RecentChanges#\x1B000000" title="Special:RecentChanges">link1</a> \x1B0000000

A link with a link marker inside it!

The second link

The \x1B0000000 is a stand in for [[PageThatExists#/autofocus/onfocus=alert("xss\n"+document.domain)//|link2]].

The replacement at the end is a normal replacement, and everything is fine. However, there are now two replacements - there is also the replacement inside the link: href="/w/index.php/Special:RecentChanges#\x1B0000000"

This is the fake link marker that we contrived to get inserted. Unlike the normal link markers, this is inside an attribute. The replacement text assumes it is being inserted as normal HTML, not as an attribute. Since it is a full link that also has quotes inside it, the two layers of quotes will interfere with each other.

Once the replacements happen we get the following mangled HTML for our proof of concept:

<a href="/w/index.php/Special:RecentChanges#<a href="/w/index.php/Test#/autofocus/onfocus=alert(&quot;xss\n&quot;+document.domain)//" title="Test">link2</a>" title="Special:RecentChanges">link1</a> <a href="/w/index.php/Test#/autofocus/onfocus=alert(&quot;xss\n&quot;+document.domain)//" title="Test">link2</a>

This obviously looks wrong, but it's a bit unclear how browsers interpret it. A little known fact about HTML - /'s can separate attributes so long as no equal signs have been encountered yet. After the browser hits the second " mark, it thinks the href attribute is closed and that the remaining text is some additional attributes. The browser essentially parses the above html as if it was:

<a href="/w/index.php/Special:RecentChanges#<a href=" w="" index.php="" Test#="" autofocus onfocus="alert(&quot;xss\n&quot;+document.domain)//&quot;" title="Test">link2</a>" title="Special:RecentChanges"&gt;link1</a> <a href="/w/index.php/Test#/autofocus/onfocus=alert(&quot;xss\n&quot;+document.domain)//" title="Test">link2</a>

In other words, an <a> tag, that has an attribute named autofocus and an onfocus event handler. On page load, the link is automatically focused, which triggers the javascript in the onfocus attribute to run, allowing the attacker to do what they want.

Takeaways

I think the major takeaway is that running regexes over partially parsed HTML is always scary. We've had similar issues in the past, for example T110143.

The general pattern we've used to fix this and similar issues is to make sure the replacement token contains special characters that would be mangled if it appeared in an unexpected context. Concretely, we added " and ' to the token; these get escaped if placed in an attribute, so the token no longer matches and no longer gets replaced.
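The effect is easy to demonstrate with a toy token (an illustration of the idea only; the exact token format used may differ):

// A marker containing quote characters cannot survive attribute escaping intact.
$marker = "\x1B\"'0000000";
$inAttr = htmlspecialchars( $marker, ENT_QUOTES ); // " and ' become &quot; and &#039;
var_dump( strpos( $inAttr, $marker ) );            // bool(false): no match, so no replacement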

More generally though, I think this is a good example of why even a minimal CSP policy would be helpful.

CSP is a complex standard that can do a lot of things and has a lot of pieces. One of the things it can do is disable "unsafe-inline" javascript. This means javascript from attributes (like onfocus) and javascript: URLs. Usually this also includes inline <script> tags without a nonce, but that part is optional. A key point here is that this also generally means you cannot execute javascript via .innerHTML anymore, which is a fairly common vector for XSS via javascript.

Normally disabling unsafe-inline would be part of a broader effort to secure javascript, however it's possible to take things a step at a time. This vulnerability would have been stopped just by disabling event attributes. A surprising portion of MediaWiki & extension XSS vulns (excluding the boring "an admin can change something in an unsafe way" issues) involve just html attributes (or javascript: urls), which are web features that nobody really needs for legitimate reasons and are generally considered bad practice in normal usage. Even the most minimal CSP policy might really help MediaWiki's overall security posture against XSS vulns.
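For a sense of what "minimal" could mean, even a policy as small as the following (shown here as a header sent from PHP; the exact directive list a production wiki would need is an assumption and would surely be longer) blocks the onfocus handler in this bug, since inline event handlers are only allowed when 'unsafe-inline' is explicitly granted:

// Scripts only from the site itself, no inline handlers, no javascript: URLs, no plugins.
// Sending it as Content-Security-Policy-Report-Only first is a low-risk way to trial it.
header( "Content-Security-Policy: script-src 'self'; object-src 'none'" );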

For more info about the vulnerability, please see the original report at https://phabricator.wikimedia.org/T355538.

Wednesday, January 10, 2024

Imagining Future MediaWiki

 As we roll into 2024, I thought I'd do something a little different on this blog.

A common product vision exercise is to ask someone, imagine it is 20 years from now, what would the product look like? What missing features would it have? What small (or large) annoyances would it no longer have?

I wanted to do that exercise with MediaWiki. Sometimes it feels like MediaWiki is a little static. Most of the core ideas were implemented a long time ago. Sure there is a constant stream of improvements, some quite important, but the core product has been fixed for quite some time now. People largely interact with MediaWiki the same way they always have. When I think of new fundamental features of MediaWiki, I think of things like Echo, Lua and VisualEditor, which can hardly be considered new at this point (in fairness, maybe DiscussionTools should count as a new fundamental feature, which is quite recent). Alternatively, I might think of things that are on the edges. Wikidata is a pretty big shift, but it's a separate thing from the main experience and also over a decade old at this point.

I thought it would be fun to brainstorm some crazy ideas for new features of MediaWiki, primarily in the context of large sites like Wikipedia. I'd love to hear feedback on if these ideas are just so crazy they might work, or just crazy. Hopefully it inspires others to come up with their own crazy ideas.

What is MediaWiki to me?

Before I start, I suppose I should talk about what I think the goals of the MediaWiki platform are. What is the value that should be provided by MediaWiki as a product, particularly in the context of Wikimedia-type projects?

Often I hear Wikipedia described as a top 10 document hosting website combined with a medium scale social network. While I think there is some truth to that, I would divide it differently.

I see MediaWiki as aiming to serve 4 separate goals:

  • A document authoring platform
  • A document viewing platform (i.e. Some people just want to read the articles).
  • A community management tool
  • A tool to collect and disseminate knowledge

The first two are pretty obvious. MediaWiki has to support writing Wikipedia articles. MediaWiki has to support people reading Wikipedia articles. While I often think the difference between readers and editors is overstated (or perhaps counter-productive as hiding editing features from readers reduces our recruitment pool), it is true they are different audiences with different needs.

What I think is a bit under-appreciated sometimes, but just as important, is that MediaWiki is not just about creating individual articles; it is about creating a place where a community of people dedicated to writing articles can thrive. This doesn't just happen at the scale of tens of thousands of people; all sorts of processes and bureaucracy are needed for such a large group to work together effectively. While not all of that is in MediaWiki, the bulk of it is.

One of my favourite things about the wiki-world is that it is a socio-technical system. The software does not prescribe specific ways of working, but gives users the tools to create community processes themselves. I think this is one of our biggest strengths, which we must not lose sight of. However, we also shouldn't totally ignore this sector and assume the community is fine on its own - we should still be on the lookout for better tools to allow the community to make better processes.

Last of all, MediaWiki aims to be a tool to aid in the collection and dissemination of knowledge¹. Wikimedia's mission statement is: "Imagine a world in which every single human being can freely share in the sum of all knowledge." No one site can do that alone, not even Wikipedia. We should aim to make it easy to transfer content between sites. If a 10 billion page treatise on Pokemon is inappropriate for Wikipedia, it should be easy for an interested party to set up their own site that can house the knowledge that does not fit in existing sites. We should aim to empower people to do their own thing if Wikimedia is not the right venue. We do not have a monopoly on knowledge, nor should we.

As anyone who has ever tried to copy a template from Wikipedia can tell you, making forks or splits from Wikipedia is easy in theory but hard in practice. In many ways I feel this is the area where we have most failed to meet the potential of MediaWiki.

With that in mind, here are my ideas for new fundamental features in MediaWiki:

As a document authoring/viewing platform

Interactivity

Detractors of Wikipedia have often criticized how text based it is. While there are certainly plenty of pictures to illustrate, Wikipedia has typically been pretty limited when it comes to more complex multimedia. This is especially true of interactive multimedia. While I don't have first hand experience, in the early days it was often negatively compared to Microsoft Encarta on that front.

We do have certain types of interactive content, such as videos, slippy maps and 3D models, but we don't really have any options for truly interactive content. For example, physics concepts might be better illustrated with "interactive" experiments, e.g. where you can push a pendulum with a mouse and watch what happens.

One of my favourite illustrations on the web is this one of an Enigma machine. The Enigma machine, for those not familiar, was a mechanical device used in World War II to encrypt secret messages. The interactive illustration shows how an inputted message goes through various wires and rotates various disks to give the scrambled output. I think this illustrates what an Enigma machine fundamentally is better than any static picture or even video would ever be able to.

Right now there are no satisfactory solutions on Wikipedia for making this kind of content. There was a previous effort to do something in the vein of interactive content in the graph extension, which allowed using the Vega domain specific language to make interactive graphs. I've previously written about how I think that was a good effort but ultimately missed the mark. In short, I believe it was too high level, which caused it to lack the flexibility necessary to meet the needs of users, while also being difficult to build simplifying abstractions over top of.

I am a big believer that instead of making complicated projects that prescribe certain ways of doing things, it is better to make simpler, lower level tools that can be combined together in complex ways, as well as abstracted over so that users can make simple interfaces (essentially the unix philosophy). On wiki, I think this has been borne out by the success of using Lua scripting in templates. Lua is low level (relative to other wiki interfaces), but the users were able to use it to accomplish their goals without MediaWiki developers having to think about every possible thing they might want to do. Users were then able to make abstractions that hid the low level details in every day use.

To that end, what I'd like to see is Lua extended to the client side: special lua interfaces that allow calling other lua functions on the client side (run via JS), in order to make parts of the wiki page scriptable while being viewed instead of just while being generated.

I did make some early proof-of-concepts in this direction, see https://bawolff.net/monstranto/index.php/Main_Page for a Demo of Extension:Monstranto. See also a longer piece I wrote, as well as an essay by Yurik on the subject I found inspiring.

Mobile editing

This is one where I don't really know what the answer is, but if I imagine MW in 20 years, I certainly hope this is better.

It's not just MediaWiki; I don't think any website really has authoring long text documents on mobile figured out.

That said, I have seen some interesting ideas around, that I think are worth exploring (None of these are my own ideas)

Paragraph or sentence level editing

This idea was originally proposed about 13 years ago by Jan Paul Posma. In fact, he wrote a whole bachelor's thesis on it.

In essence, mobile editing gets more frustrating the longer the text you are editing is. MediaWiki often works on editing at the granularity of a section, but what about editing at the granularity of a paragraph or a sentence instead? Especially if you just want to fix a typo on mobile, I feel it would be much easier if you could just hit the edit button on a sentence instead of the entire section.

Even better, I suspect that parsoid makes this a lot easier to implement now than it would have been back in the day.

Better text editing UI (e.g. Eloquent)

A while ago I was linked to a very interesting article by Scott Jenson about the problems with text editing on mobile. I think he articulated the reasons it is frustrating very well, and also proposed a better UI which he called Eloquent. I highly recommend reading the article and seeing if it makes sense to you.

In many ways, we can't really do this, as this is an android level UI not something we control in the web app. Even if we did manage to make it in a web app somehow, it would probably be a hard sell to ordinary users not used to the new UI. Nonetheless, I think it would be incredibly beneficial to experiment with alternate UIs like these, and see how far we can get. The world is increasingly going mobile, and Wikipedia is increasingly getting left behind.

Alternative editing interfaces (e.g. voice)

Maybe traditional text editing is not the way of the future. Can we do something with voice control?

It seems like voice controlled IDEs are increasingly becoming a thing. For example, here is a blog post about someone who programs with a voice programming software called Talon. It seems like there are a couple other options out there. I see Serenade mentioned quite a bit.

A project in this space that looks especially interesting is cursorless. The demo looked really cool, and I could imagine that a power user would find it easier to use a system like this to edit large blobs of WikiText than the normal text editing interface on mobile. Anyways, I recommend watching the demo video to see what you think.

All this is to say, I think we should look really hard at the possibilities in this space for editing MediaWiki from a phone. On screen keyboards are always going to suck, might as well look to other options.

As a community building platform

Extensibility

I think it would be really cool if we had "lua" extensions. Instead of normal php extensions, a user would be able to register/upload some lua code that gets subscribed to hooks and does stuff. In this vision, these extension types would not be able to do anything unsafe like raw html, but would be able to do all sorts of stuff that users normally use javascript for.

This could be per user or also global. Perhaps could be integrated with a permission system to control what they can and cannot do.

I'd also like to see a super stable API abstraction layer for these (and normal extensions). Right now our extension API is fairly unstable. I would love to see a simple abstraction layer with hard stability guarantees. It wouldn't replace the normal API entirely, but would allow simpler extensions to be written in such a way that they retain stability in the long term.

Workflows

I think we could do more to support user-created workflows. The wiki is full of user created workflows and processes. Some are quite complex, others simple. For example, nominating an article for deletion or !voting in an RFC.

Sometimes the more complicated ones get turned into javascript wizards, but I think that's the wrong approach. As I said earlier, I am a fan of simpler tools that can be used by ordinary users, not complex tools that do a specific task but can only be edited by developers and exist "outside" the wiki.

There's already an extension in this area (not used by Wikimedia) called PageForms. This is in the vein of what I am imagining, but I think still too heavy. Another option in this space is the PageProperties extension which also doesn't really do what I am thinking of.

What I would really want to see is an extension of the existing InputBox/preload feature.

As it stands right now, when starting a new page or section, you can give a url parameter to preload some text as well as parameters to that text to replace $1 markers.

We also have the InputBox extension to provide a text box where you can put in the name of an article to create with specific text pre-loaded.
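For example, a preload link today looks roughly like this (the wiki domain, page title and template name here are hypothetical):

https://example.org/w/index.php?title=Some_new_page&action=edit&preload=Template:Example_preload&preloadparams%5B%5D=First+value

The preloadparams%5B%5D (i.e. preloadparams[]) values get substituted into the $1, $2, ... markers in the preloaded text.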

I'd like to extend this idea, to allow users to add arbitrary widgets² (form elements) to a page, and bind those widgets to specific parameters to be preloaded.

If further processing or complex logic is needed, perhaps an option to allow the new preloaded text to be pre-processed by a lua module. This would allow complex logic in how the page is edited based on the user's inputs. If there is one theme in this blog post, it is I wish lua could be used for more things on wiki.

I still imagine the user would be presented with a diff view and have to press save, in order to prevent shenanigans where users are tricked into doing something they don't intend to.

I believe this is a very light-weight solution that also gives the community a lot of flexibility to create custom workflows in the wiki that are simple for editors to participate in.

Querying, reporting and custom metadata

This is the big controversial one.

I believe that there should be a way for users to attach custom metadata to pages and do complex queries over that metadata (including aggregation). This is important both for organizing articles as well as organizing behind the scenes workflows.

In the broader MediaWiki ecosystem, this is usually provided by either the SemanticMediaWiki or Cargo extensions. Often in third party wikis this is considered MediaWiki's killer feature. People use them to create complex workflows including things like task trackers. In essence it turns MediaWiki into a no-code/low-code user programmable workflow designer.

Unfortunately, these extensions all scale poorly, preventing their use on Wikimedia. Essentially I dream of seeing the features provided by these extensions on Wikipedia.

The existing approaches are as follows:

  • Vanilla MediaWiki: Category pages, and some query pages.
    • This is extremely limited. Category pages allow an alphabetical list. Query pages allow some limited pre-defined maintenance lists like list of double redirects or longest articles. Despite these limitations, Wikipedia makes great use out of categories.
  • Vanilla mediawiki + bots:
    • This is essentially Wikipedia's approach to solving this problem. Have programs do queries offsite and put the results on a page. I find this to be a really unsatisfying solution. A Wikipedian once told me that every bot is just a hacky workaround to MediaWiki failing to meet its users' needs, and I tend to agree. Less ideologically, the main issue here is it's very brittle - when bots break, often nobody knows who has access to the code or how it can be fixed. Additionally, they often have significant latency for updates (if they run once a week, then the latency is 7 days) and ordinary users are not really empowered to create their own queries.
  • Wikidata (including the WDQS SPARQL endpoint)
    • Wikidata is adjacent to this problem, but not quite trying to solve it. It is more meant as a central clearinghouse for facts, not a way to do querying inside Wikipedia. That said Wikidata does have very powerful query features in the form of SPARQL. Sometimes these are copied into Wikipedia via bots. SPARQL of course has difficult to quantify performance characteristics that make it unsuitable for direct embedding into Wikipedia articles in the MediaWiki architecture. Perhaps it could be iframed, but that is far from being a full solution.
  • SemanticMediaWiki
    • This allows adding semantic annotations to articles (i.e. subject-verb-object type relations). It then allows querying using a custom semantic query language. The complexity of the query language makes performance hard to reason about, and it often scales poorly.
  • Cargo
    • This is very similar to SemanticMediaWiki, except it uses a relational paradigm instead of a semantic paradigm. Essentially users can define DB tables. Typically the workflow is template based, where a template is attached to a table, and specific parameters to the template are populated into the database. Users can then use (Sanitized) SQL queries to query these tables. The system uses an indexing strategy of adding one index for every attribute in the relation.
  • DPL
    • DPL is an extension to do complex querying and display using MediaWiki's built in metadata like categories. There are many different versions of this extension, but all of them have potential queries that scale linearly with the number of pages in the database, and sometimes even worse.

I believe none of these approaches really work for Wikipedia. They either do not support complex queries, or they allow queries that are too complex and have unpredictable performance. I think the requirements are as follows:

  • Good read scalability (by read, I mean scalability when generating pages, during "parse" in MediaWiki speak; on Wikipedia, pages are read and regenerated a lot more often than they are edited).
    • We want any sort of queries to have very low read latency. Having long pauses waiting for I/O during page parsing is bad in the MediaWiki architecture
    • Queries should scale consistently. They should at worst be roughly O(log n) in the number of pages on the wiki. If using a relational style database, we would want the number of rows the DBMS has to look at to be no more than a fixed maximum number
  • Eventual write consistency
    • It is ok if it takes a few minutes for things using the custom metadata to update after it is written. Templates already have a delay for updating.
    • That said, it should still be relatively quick. On the order of minutes ideally. If it takes a day or scales badly in terms of the size of the database, that would also be unacceptable.
    • write performance does not have to scale quite as well as read performance, but should still scale reasonably well. 
  • Predictable performance.
    • Users should not be able to do anything that negatively impacts site performance
    • Users should not have to be an expert (or have any knowledge) in DB performance or SQL optimization.
    • Limits should be predictable. Timeouts suck, they can vary depending on how much load the site is under and other factors. Queries should either work or not work. Their validity should not be run-time dependent. It should be obvious to the user if their query is an acceptable query before they try and run it. There should be clear rules about what the limits of the system are.
  • Results should be usable for further processing
    • e.g. You should be able to use the result inside a lua module and format it in arbitrary ways
  • [Ideally] Able to be isolated from the main database, shardable, etc.
  • Be able to query for a specific page, a range of pages, or aggregates of pages (e.g. Count how many pages are in a range, average of some property, etc)
    • Essentially we want just enough complexity to do interesting user defined queries, but not enough that the user is able to take any action that affects performance.
    • There are some other query types that are more obscure but maybe harder. For example geographic related queries. I don't think we need to support that.
    • Intersection queries are an interesting case, as they are often useful on wiki. Ideally we would support that too.

 

Given these constraints I think the CouchDB model might be the best match for on-wiki querying and reporting.

Much of the CouchDB marketing material is aimed around their local data eventual consistency replication story, which is cool and all but not what I'm interested in here. A good starting point for how their data model works is their documentation on views. To be clear, I'm not necessarily suggesting using CouchDB, just that its data model seems like a good match to the requirements.

CouchDB is essentially a document database based around the ideas of map-reduce. You can make views, which are similar to an index on a virtual column in mysql. You can also make reduce functions, which calculate some function over the view. The interesting part is that the reduce function is indexed in a tree fashion, so you can efficiently get the value of the function applied to any contiguous range of the rows in logarithmic time. This allows computing aggregations of the data very efficiently. Essentially all the read queries are very efficient. Potentially write queries can be less so, but it is easy to build controls around that. Creating or editing reduce functions is expensive because it requires regenerating the index, but that is expected to be a rare operation and users can be informed that results may be unreliable until it completes.
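To make the tree-indexed reduce idea concrete, here is a toy sketch in plain PHP (nothing CouchDB specific, and purely illustrative - the class, the sum reducer and the numbers are made up): partial reductions are precomputed in a binary tree, so reducing over any contiguous key range only combines O(log n) stored nodes instead of touching every row.

// Toy "reduce tree": a segment tree over an ordered list of emitted values.
// A real system would persist this; the point is the query cost, not the storage.
class ReduceTree {
    private array $tree;
    private int $n;
    private $reduce;
    private $identity;

    public function __construct( array $values, callable $reduce, $identity ) {
        $this->n = count( $values );
        $this->reduce = $reduce;
        $this->identity = $identity;
        $this->tree = array_fill( 0, 2 * $this->n, $identity );
        foreach ( array_values( $values ) as $i => $v ) {
            $this->tree[$this->n + $i] = $v; // leaves hold the raw emitted values
        }
        for ( $i = $this->n - 1; $i >= 1; $i-- ) {
            $this->tree[$i] = ( $this->reduce )( $this->tree[2 * $i], $this->tree[2 * $i + 1] );
        }
    }

    // Reduce over rows [$from, $to) by combining O(log n) precomputed tree nodes.
    public function query( int $from, int $to ) {
        $result = $this->identity;
        for ( $from += $this->n, $to += $this->n; $from < $to; $from >>= 1, $to >>= 1 ) {
            if ( $from & 1 ) { $result = ( $this->reduce )( $result, $this->tree[$from++] ); }
            if ( $to & 1 )   { $result = ( $this->reduce )( $result, $this->tree[--$to] ); }
        }
        return $result;
    }
}

// e.g. values emitted by 8 pages, summed over the 3rd through 6th page in key order.
$tree = new ReduceTree( [ 120, 90, 400, 15, 60, 300, 75, 220 ], fn ( $a, $b ) => $a + $b, 0 );
echo $tree->query( 2, 6 ), "\n"; // 775 (= 400+15+60+300), combined from 2 stored subtree nodes

Editing the reducer means rebuilding the whole tree (the expensive re-index case), while emitting a new value only touches the nodes on its path to the root.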

In short, the way the CouchDB data model works as applied to MediaWiki could be as follows:

  • There is an emit( relationName, key, data ) function added to lua. In many ways this is very similar to adding a page to a category named relationName with a sortkey specified by key. data is optional extra data associated with this item. For performance reasons, there may be a (high) limit to the max number of emit() calls on a page to prevent the DB size from exploding.
  • Lua gets a function query( relationName, startKey, endKey ). This returns all pages between startKey and endKey and their associated data. If there are more than X (e.g. 200) number of pages, only return the first X.
  • Lua gets a queryReduced( relationName, reducerName, startKey, endKey ) which returns the reduction function over the specified range. (Main limitation here is the reduce function output must be small in size in order to make this efficient)
  • A way is added to associate a lua module as a reduce function. Adding or modifying these functions is potentially an expensive operation. However, it is probably acceptable to the user that this takes some time.

All the query types here are efficient. It is not as powerful as arbitrary SQL or semantic queries, but it is still quite powerful. It allows computing fairly arbitrary aggregation queries as well as returning results in a user-specified order. The main slow part is when a reduction function is edited or added, which is similar to how a template used on very many pages can take a while to update. Emitting a new item may also be a little slower than reading, since the reducers have to be updated up the tree (with possible contention on the root node), however that is a much rarer operation, and users would likely see it as similar to current delays in updating templates.

I suspect such a system could also potentially support intersection queries with reasonable efficiency subject to a bunch of limitations.

All performance limitations are pretty easy for the user to understand. There is some max number of items that can be emit() from a page to prevent someone from emit()ing 1000 things per page. There is a max number of results that can be returned from a query to prevent querying the entire database, and a max number of queries allowed to be made from a page. The queries involve reading a limited number of rows, often sequential. The system could probably be sharded pretty easily if a lot of data ends up in the database.

I really do think this sort of query model provides the sweet spot of complex querying with predictable, good performance, and would be ideal for a MediaWiki site running at scale that wanted SMW style features.

As a knowledge collection tool

Wikipedia can't do everything. One thing I'd love to see is better integration between different MediaWiki servers to allow people to go to different places if their content doesn't fit in Wikipedia.

Template Modularity/packaging

Anyone who has ever tried to use Wikipedia templates on another wiki knows it is a painful process. Trying to find all the dependencies is a complex process, not to mention if it relies on Wikidata or JsonConfig (the Commons data: namespace).

The templates on a wiki are not just user content, but complex technical systems. I wish we had a better system for packaging and distributing them.

Even within the Wikimedia movement, there is often a call for global templates. A good idea certainly, but it would be less critical if templates could be bundled up and shared. Even still, having distinct boundaries around templates would probably make global templates easier than the current mess of dependencies.

I should note that there are already extensions in this vein, for example Extension:Page_import and Extension:Data_transfer. They are nice and all, but I think it would be cooler to have the concept of discrete template/module units on wiki, so that different components are organized together in a way that is easier to follow.

Easy forking

Freedom to fork is the freedom from which all others flow. In addition to providing an avenue for people who disagree with the status quo to do their own thing, easy forking/mirroring is critical when censorship is at play and people want to mirror Wikipedia somewhere we cannot normally reach. However, running a wiki the size of English Wikipedia is quite hard, even if you don't have any traffic. Simply importing an XML dump into a MySQL DB can be a struggle at the sizes we are talking about.

I think it would be cool if we made ready-to-go sqlite db dumps, perhaps packaged as a phar archive with MediaWiki, so you could essentially just download a huge 100 GB file, plop it somewhere, and have a mirror/fork.

Even better if it could integrate with EventStream to automatically keep things up to date.

Conclusion

So those are my crazy ideas for what I think is missing in MediaWiki (With an emphasis on the Wikipedia usecase and not the third party use-case). Agree? Disagree? Hate it? I'd love to know. Maybe you have your own crazy ideas. You should post them, after all, your crazy idea cannot become reality if you keep it to yourself!

Notes:

¹ I left out "Free", because as much as I believe in "Free Culture" I believe the free part is Wikimedia's mission but not MediaWiki's.

² To clarify, by widgets I mean buttons and text boxes. I do not mean widgets in the sense of the MediaWiki extension named "Widgets".

Tuesday, November 14, 2023

WikiConference North America 2023 (part 1)

 This weekend I attended WikiConference North America. I decided to go somewhat at the last moment, but am really glad I did. This is the first non-technical Wikimedia community conference I have attended since COVID and it was great to hear what the Wikipedia community has been up to.

I was on a bit of a budget, so I decided to get a cheaper hotel that was about an hour away by public transit from the venue. I don't think I'll do that again. Getting back and forth was really smooth - Toronto has great transit. However, it meant an extra hour at the end of the day to get back, and waking up an hour earlier to get there on time, which really added up. By the end I was pretty tired and would much rather have had an extra 2 hours of sleep (or an extra 2 hours chatting with people).

Compared to previous iterations of this conference, there was a much heavier focus on on-wiki governance, power users and "lower-case s" Wikipedia (not Wikimedia) strategy. I found this quite refreshing and interesting since I mostly do MediaWiki dev stuff and do not hear about the internal workings of Wikipedia as much. Previous versions of this conference focused too much (imho) on talks about outreach which while important were often a bit repetitive. The different focus was much more interesting to me.

Key Take-aways

My key take away from this conference was that there is a lot of nervousness about the future. Especially:

  • Wikipedia's power-user demographics curve is shifting in a concerning way. Particularly around admin promotion.
  • AI is changing the way we consume knowledge, potentially cutting Wikipedia out, and this is scary
  • A fear that the world is not as it once was and the conditions that created Wikipedia are no longer present. As the keynote speaker Selena Deckelmann phrased it, "Is Wikipedia a one-generation marvel?"

However, I don't want to overstate this. It's unclear to me how pervasive this view is. Lots of presenters presented views of that form, but does the average Wikipedian agree? If so, is it more an intellectual agreement, or are people actually nervous? I am unsure. My read on it is that people were vaguely nervous about these things, but by no means was anyone panicking about them. Honestly though, I don't really know. However, I think some of these concerns are undercut by there being a long history of people worrying about similar things and yet Wikipedia has endured. Before admin demographics, people were panicking about new user retention. Before AI changing the way we consume content, it was mobile (a threat which I think is actually a much bigger deal).

Admin demographics

That said, I never quite realized the scale of the admin demographic crisis. People always talk about there being fewer admin promotions now than in the past, but I did not realize until it was pointed out that it is not just a little bit less but allegedly 50 times less. There is no doubt that a good portion of the admin base are people who started a decade (or 2) ago, and new admins are fewer and further between.

A particular thing that struck me as related to this at the conference, is how the definition of "young" Wikipedian seems to be getting older. Occasionally I would hear people talk about someone who is in high school as being a young Wikipedian, with the implication that this is somewhat unusual. However when you talk to people who have been Wikipedians for a long time, often they say they were teenagers when they started. It seems like Wikipedians being teenagers was a really common thing early in the project, but is now becoming more rare.

Ultimately though, I suspect the problem will solve itself with time. As more and more admins retire as time goes on, eventually the workload on those remaining will increase until the mop is handed out more readily out of necessity. I can't help but be reminded of all the panic over new user retention, until eventually people basically decided that it didn't really matter.

AI

As far as AI goes, hating AI seems to be a little bit of a fad right now. I generally think it is overblown. In the Wikipedia context, this seems to come down to three things:

  • Deepfakes and other media manipulation to make it harder to have reliable sources (Mis/Dis-information)
  • Using AI to generate articles that get posted, but perhaps are not properly fact checked or are otherwise poor quality in ways that aren't immediately obvious, or in ways existing community practice is not yet well prepared to handle
  • Voice assistants (alexa), LLMs (ChatGPT) and other knowledge distribution methods that use Wikipedia data but cut Wikipedia out of the loop. (A continuation of the concern that started with google knowledge graph)

I think by and large it is the third point that was the most concerning to people at the conference although all 3 were discussed at various points. The third point is also unique to Wikipedia.

There seemed to be two causes of concern for the third point. First, there was worry over the lack of attribution and a feeling that large silicon valley companies are exploitatively profiting off the labor of Wikipedians. Second, there is concern that with Wikipedia cut out of the loop, we lose the ability to recruit people when there is no edit button, and maybe even lose brand awareness. While totally unstated, I imagine the inability to show fundraising banners to users consuming via such systems is probably on the mind of the fundraising department of WMF.

My initial reaction to this is probably one of disagreement with the underlying moral basis. The goal was always to collect the world's knowledge for others to freely use. The free knowledge movement literally has free in the name. The knowledge has been collected and now other people are using it in interesting, useful and unexpected ways. Who are we to tell people what they can and cannot do with it?

This is the sort of statement that is very ideologically based. People come to Wikimedia for a variety of reasons; we are not a monolith. I imagine that people probably either agree with this view or disagree with it, and no amount of argument is going to change anyone's mind about it. Of course, a major sticking point here is that ChatGPT is arguably not complying with our license, and lack of attribution is a reasonable concern.

The more pragmatic concerns are interesting though. The project needs new blood to continue over the long term, and if we are cut out of the distribution loop, how do we recruit. I honestly don't know, but I'd like to see actual data confirming the threat before I get too worried.

The reason I say that, is that I don't think voice assistants and LLMs are going to replace Wikipedia. They may replace Wikipedia for certain use cases but not all use cases, and especially not the use case that our recruitment base is.

Voice assistants generally are good for quick fact questions. "Who is the prime minister of Canada?" type questions. The type of stuff that has a one sentence answer and is probably stored on Wikidata. LLMs are somewhat longer form, but still best for information that can be summarized in a few paragraphs, maybe a page at most, and has a relatively objective "right" answer (from what I hear; I haven't actually used ChatGPT). Complex nuanced topics are not well served by these systems. Want to know the historical context that led to the current flare-up in the Middle East? I don't think LLMs will give you what you want.

Now think about the average Wikipedia editor. Are they interested in one-paragraph answers? I don't know for sure, but I would posit that they tend to be more interested in the larger nuanced story. Yes, other distribution models may threaten our ability to recruit from users using them, but I don't think that is the target audience we would want to focus recruitment on anyways. I suppose time will tell. AI might just be a fad in the end.

Conclusion

I had a great time. It was awesome to see old friends but also meet plenty of new people I did not know. I learned quite a bit, especially about Wikipedia governance. In many ways, it is one of the more surprising wiki conferences I've been to, as it contained quite a bit of content that was new to me. I plan to write a second blog post about my more raw unfiltered thoughts on specific presentations. (Edit: I never did make a second post, and I guess it's kind of late enough at this point that I probably won't, so never mind about that)

Friday, January 20, 2023

The Vector-pocalypse is upon us!

 

tl;dr: [[WP:IDONTLIKEIT]]

Yesterday, a new version of the Vector skin was made default on English Wikipedia.

As will shock absolutely no one who pays attention to Wikipedia politics, the new skin is controversial. Personally I'm a Timeless fan and generally did not like what I saw of new Vector while it was in development. However, now that it is live I thought I'd give it another chance and share my thoughts on the new skin. For reference, I am doing this on my desktop computer, which has a large wide-screen monitor. It looks very different on a phone (I actually like it a lot better on the phone). It might even look different on different monitors with different gamuts.


So the first thing that jumps out is there is excessive whitespace on either side of the page. There is also a lot more hidden by default, notably the "sidebar", which is a prominent feature on most skins. One minor thing that jumps out to me is that Echo notifications look a little wonky when you have more than 100 of them.

On the positive though, the top bar does look very clean. The table of contents is on the left hand side and sticky (Somewhat similar to WikiWand), which I think is a nice change.

When you scroll, you notice the top bar scrolls with it but changes:

On one hand, this is quite cool. However, on reflection I'm not sure if I feel it is quite worth it. It feels like this sticky header is 95% of the way to working but just not quite there. The alignment with the white padding on the right (I don't mean the off-white margin area but the area that comes before that) seems slightly off somehow. Perhaps I am explaining it poorly, but it feels like there should be a division there, since the article ends around the pencil icon. Additionally, the sudden change makes it feel like you are in a different context, but it is all the same tools with different icons. On the whole, I think there is a good idea here with the sticky header, but it could maybe use a few more iterations.

If you expand the Sidebar menu, the result feels very ugly and out of place to me:


idk, I really hate the look of it, and the four levels of different off-whites. More to the point, one of the key features of Wikipedia is that it is edited by users. To get new users you have to hook people into editing. I worry hiding things like "learn to edit" will just make it so people never learn that they can edit. I understand there is a counter-point here, where overwhelming users with links makes users ignore all of them and prevents focus on the important things. I even agree somewhat that there are probably too many links in Monobook/traditional Vector. However, having all the links hidden doesn't seem right either.


On the fixed width

One of the common complaints is that the fixed width design wastes lots of screen real estate. The counter argument is that studies suggest shorter line lengths improve readability.

As a compromise there is a button in the bottom right corner to make it use the full screen. It is very tiny. I couldn't find it even knowing that it is supposed to be somewhere. Someone had to tell me that it is in the lower-right corner. So it definitely lacks discoverability.

Initially, I thought I hated the fixed-width design too. However after trying it out, I realized that it is not the fixed width that I hate. What I really hate is:

  • The use of an off-white background colour that is extremely close to the main background colour
  • Centering the design in the screen

 I really really don't like the colour scheme chosen. Having it be almost but not quite the same colour white really bothers my eyes.

I experimented with using a darker colour for more contrast and found that I like the skin much much better. Tastes vary of course, so perhaps it is just me. Picking a dark blue colour at random and moving the main content to the left looks something like:


 

 Although I like the contrast of the dark background, my main issue is that in the original the colours are almost identical, so even just making it a slightly more off-white off-white would be fine. If you want to do a throwback to monobook, something like this looks fine to me as well:



I don't really know if this is just my particular tastes or if other people agree with me. However, making it more left aligned and increasing the contrast to the background makes the skin go from something I can't stand to something I can see as usable.



Sunday, December 4, 2022

Hardening SQLite against injection in PHP

tl;dr: What are our options in php to make SQLite not write files when given malicious SQL queries as a hardening measure against SQL injection?

 

One of the most famous web application security vulnerabilities is the SQL injection.

This is where you have code like:

doQuery( "SELECT foo1, foo2 from bar where baz = '" . $_GET['fred'] . "';" );

The attacker goes to a url like ?fred='%20UNION%20ALL%20SELECT%20user%20'foo1',%20password%20'foo2'%20from%20users;--

The end result is: doQuery( "SELECT foo1, foo2 from bar where baz ='' UNION ALL SELECT user 'foo1', password 'foo2' from users ;-- ';" );

and the attacker has all your user's passwords. Portswigger has a really good detailed explanation on how such attacks work.

In addition to dumping all your private info, the usual next step is to try and get code execution. In a PHP environment, often this means getting your DB to write a PHP file in the web directory.

In MariaDB/MySQL this looks like:

SELECT '<?php system($_GET["c"]);?>' INTO OUTFILE "/var/www/html/w/foo.php";

Of course, in a properly set up system, permissions are such that mysqld/mariadbd does not have permission to write to the web directory and the DB user does not have FILE privileges, so it cannot use INTO OUTFILE.

In SQLite, the equivalent is to use the ATTACH command to create a new database (or VACUUM). Thus the SQLite equivalent is:

ATTACH DATABASE '/var/www/html/w/foo.php' AS foo; CREATE TABLE foo.bar (stuff text); INSERT INTO foo.bar VALUES( '<?php system($_GET["c"]);?>' );

This is harder than the MySQL case, since it involves multiple commands and you can't just add it as a suffix but have to inject as a prefix. It is very rare you would get this much control in an SQL injection.

Nonetheless it seems like the sort of thing we would want to disable in a web application, as a hardening best practice. After all, dynamically attaching multiple databases is rarely needed in this type of application.

Luckily, SQLite implements a feature called run time limits. There are a number of limits you can set. SQLite docs contain a list of suggestions for paranoid people at https://www.sqlite.org/security.html. In particular, there is a LIMIT_ATTACH which you can set to 0 to disable attaching databases. There is also a more fine grained authorizer API which allows setting a permission callback to check things on a per-statement level.

Unfortunately PHP PDO-SQLITE supports neither of these things. It does set an authorizer if you have open_basedir on to prevent reading/writing outside the basedir, but it exposes no way that I can see for you to set them yourself. This seems really unfortunate. Paranoid people would want to set runtime limits. People who have special use-cases may even want to raise them. I really wish PDO-SQLITE supported setting these, perhaps as a driver specific connection option in the constructor.

On the bright side, if instead of using the PDO-SQLITE php extension you are using the alternative sqlite3 extension, there is a solution. You still cannot set runtime limits, but you can set a custom authorizer:

$db = new SQLite3($dbFileName);
// Deny any ATTACH DATABASE statement; allow everything else.
$db->setAuthorizer(function ( $action, $filename ) {
        return $action === SQLite3::ATTACH ? SQLite3::DENY : SQLite3::OK;
});

After this if you try and do an ATTACH you get:

Warning: SQLite3::query(): Unable to prepare statement: 23, not authorized in /var/www/html/w/test.php on line 17

Thus success! No evil SQL can possibly write files.
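Putting it all together, a minimal sketch of the hardened setup might look something like the following. This is purely illustrative - the database path and the table are made up, and a real application would of course bury this inside its DB layer:

// Hardening sketch using the sqlite3 extension (not PDO-SQLITE).
// The file path and table name here are hypothetical.
$db = new SQLite3( '/var/www/data/app.sqlite' );

// Refuse ATTACH so injected SQL cannot create new files on disk; allow everything else.
$db->setAuthorizer( function ( $action ) {
    return $action === SQLite3::ATTACH ? SQLite3::DENY : SQLite3::OK;
} );

// Normal application queries keep working...
$db->exec( "CREATE TABLE IF NOT EXISTS bar ( baz TEXT )" );
$db->exec( "INSERT INTO bar VALUES ( 'hello' )" );

// ...but the injected ATTACH from earlier is refused: query() returns false
// and emits the "not authorized" warning shown above.
$result = @$db->query( "ATTACH DATABASE '/var/www/html/w/foo.php' AS foo" );
var_dump( $result ); // bool(false)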



Sunday, September 11, 2022

Why don't we ever talk about volunteer PMs in open source?

 Recently, on Wikipedia, there was an open letter to the Wikimedia Foundation, asking them to improve the New Page Patrol feature.

This started the usual debate between "WMF should do something" and "it is open source, {{sofixit}}" (i.e. send a patch). There are valid points on both sides of that debate, which I don't really want to get into.

However, it occurred to me: the people on the {{sofixit}} side always suggest that users should learn how to program (an unreasonable ask), figure out how to fix something, and do it themselves. On the other hand, in a corporate environment, work is never done solely by developers. You usually have either a product manager or a program manager organizing it.

Instead of saying to users - learn PHP and submit a patch, why don't we say: Be the PM for the things you want done, so a programmer can easily just do them without getting bogged down with organizational questions?

At first glance this may sound crazy - after all, ordinary users have no authority. Being a PM is hard enough when people are paid to listen to you; how could it possibly work if nobody has to listen to you? And I agree - not everything a PM does is applicable here, but I think some things are.

Some things a volunteer could potentially do:

  • Make sure that bugs are clearly described with requirements, so a developer could just do them instead of trying to figure out what the users need
  • Make sure tasks are broken down into appropriate sized tickets
  • Make a plan of what they wish would happen. A volunteer can't force people to follow their plan, but if you have a plan, people may just follow it. Too often all that is present is a big list of bugs of varying priority, which makes it hard for a developer to figure out what is important and what isn't
    • For example, what I mean is breaking things into a few milestones, with each milestone containing a small number (3-5) of tickets around a similar theme. This could then be used to promote the project to volunteer developers (using language like "Help us achieve milestone 2") and to track progress. Perhaps even gamifying things.
    • No plan survives contact with the enemy of course, and the point isn't to stick to any plan religiously. The point is to have a short list of what the most pressing things to work on right now are. Half the battle is figuring out what to work on and what to work on first.
  • Coordinate with other groups as needed. Sometimes work might depend on other work other people have planned to do. Or perhaps the current work is dependent on someone else's requirements (e.g. new extensions require security review). Potentially a volunteer PM could help coordinate this or help ensure that everyone is on the same page about expectations and requirements.
  • [not sure about this one] Help find constructive code reviewers. In MediaWiki development, code must be reviewed by another developer before being merged. Finding knowledgeable people can often be difficult and a lot of effort. Sometimes this comes down to personal relationships and politely nagging people until someone bites. For many developers this is a frustrating part of the software development process. It's not clear how productive a non-developer would be here, as you may need to understand the code to know who to talk to. Nonetheless, this is potentially something a non-programmer volunteer could help with.

To use the new page patrol feature as an example - users have a list of 56 feature requests. There's not really any indication of which ones are more important than others. A useful starting point would be to select the 3 most important. There are plenty of volunteer developers in the MediaWiki ecosystem who might work on them. The less time they have to spend figuring out what is wanted, the more likely they are to fix one of these things. There are no guarantees of course, but it is something someone who is not a programmer could do to move things forward.

 To be clear, being a good PM is a skill - all of this is hard and takes practice to be good at. People who have not done it before won't be good at it to begin with. But I think it is something we should talk about more, instead of the usual refrain of fix it yourself or be happy with what you got.


p.s. None of this should be taken as saying that WMF shouldn't fix anything and it should only be up to the communities, simply that there are things non-programmers could do to {{sofixit}} if they were so inclined.

 

Wednesday, July 20, 2022

Interviewed on Between the brackets

 This week I was interviewed by Yaron Koren for the second time for his MediaWiki podcast, Between the Brackets.

Yaron has been doing this podcast for several years now, and I love how he highlights the voices of all the different groups that use, interact with, and develop MediaWiki. He's had some fascinating people on his podcast over the years, and I highly recommend giving it a listen.

Anyhow, it's an honour to be on the program again for episode 117. I was previously on the program 4 years ago for episode 5.


Tuesday, July 12, 2022

Making Instant Commons Quick

 The Wikimedia family of websites includes one known as Wikimedia Commons. Its mission is to collect and organize freely licensed media so that other people can re-use them. More pragmatically, it collects all the files needed by different language Wikipedias (and other Wikimedia projects) into one place.

 

The 2020 Wikimedia Commons Picture of the Year: Common Kingfisher by Luca Casale / CC BY SA 4.0

 As you can imagine, it's extremely useful to have a library of freely licensed photos that you can just use to illustrate your articles.

However, it is not just useful for people writing encyclopedias. It is also useful for any sort of project.

To take advantage of this, MediaWiki, the software that powers Wikipedia and friends, comes with a feature to use this collection on your own Wiki. It's an option you can select when installing the software and is quite popular. Alternatively, it can be manually configured via $wgUseInstantCommons or the more advanced $wgForeignFileRepos.

The Issue

Unfortunately, instant commons has a reputation for being rather slow. As a weekend project I thought I'd measure how slow, and see if I could make it faster.

How Slow?

First things first, I'll need a test page. Preferably something with a large (but not extreme) number of images but not much else. A Wikipedia list article sounded ideal. I ended up using the English Wikipedia article: List of Governors General of Canada (Long live the Queen!). This has 85 images and not much else, which seemed perfect for my purposes.

I took the expanded Wikitext from https://en.wikipedia.org/w/index.php?title=List_of_governors_general_of_Canada&oldid=1054426240&action=raw&templates=expand, pasted it into my test wiki with instant commons turned on in the default config.

And then I waited...

Then I waited some more...

1038.18761 seconds later (17 minutes, 18 seconds) I was able to view a beautiful list of all my viceroys.

Clearly that's pretty bad. 85 images is not a small number, but it is definitely not a huge number either. Imagine how long [[Comparison_of_European_road_signs]] would take with its 3643 images or [[List_of_paintings_by_Claude_Monet]] with 1676.

Why Slow?

This raises the obvious question of why is it so slow. What is it doing for all that time?

When MediaWiki turns wikitext into HTML, it reads through the text. When it hits an image, it stops reading through the wikitext and looks for that image. The image might be cached, in which case MediaWiki can go back to rendering the page right away. Otherwise, it has to actually find it: first it checks the local DB to see if the image is there; if not, it looks at foreign image repositories, such as Commons (if configured).

To see if commons has the file we need to start making some HTTPS requests¹:

  1. We make a metadata request to see if the file is there and get some information about it: https://commons.wikimedia.org/w/api.php?titles=File%3AExample.png&iiprop=timestamp%7Cuser%7Ccomment%7Curl%7Csize%7Csha1%7Cmetadata%7Cmime%7Cmediatype%7Cextmetadata&prop=imageinfo&iimetadataversion=2&iiextmetadatamultilang=1&format=json&action=query&redirects=true&uselang=en
  2.  We make an API request to find the url for the thumbnail of the size we need for the article. For commons, this is just to find the url, but on wikis with 404 thumbnail handling disabled, this is also needed to tell the wiki to generate the file we will need: https://commons.wikimedia.org/w/api.php?titles=File%3AExample.png&iiprop=url%7Ctimestamp&iiurlwidth=300&iiurlheight=-1&iiurlparam=300px&prop=imageinfo&format=json&action=query&redirects=true&uselang=en
  3.  Some devices now have very high resolution screens. Screen displays are made up of dots; high resolution screens have more dots per inch, and thus can display finer detail. Traditionally 1 pixel equalled one dot on the screen. However, if you keep that while increasing the dots-per-inch, suddenly everything on the screen that was measured in pixels is very small and hard to see. Thus these devices now sometimes have 1.5 dots per pixel, so they can display fine detail without shrinking everything. To take advantage of this, we use an image 1.5 times bigger than we normally would, so that when it is displayed at its normal size, we can take advantage of the extra dots and display a much clearer picture. Hence we need the same image but 1.5x bigger: https://commons.wikimedia.org/w/api.php?titles=File%3AExample.png&iiprop=url%7Ctimestamp&iiurlwidth=450&iiurlheight=-1&iiurlparam=450px&prop=imageinfo&format=json&action=query&redirects=true&uselang=en
  4. Similarly, some devices are even higher resolution and use 2 dots per pixel, so we also fetch an image double the normal size:  https://commons.wikimedia.org/w/api.php?titles=File%3AExample.png&iiprop=url%7Ctimestamp&iiurlwidth=600&iiurlheight=-1&iiurlparam=600px&prop=imageinfo&format=json&action=query&redirects=true&uselang=en

 

This is the first problem - for every image we include, we have to make 4 API requests. If we have 85 images, that's 340 requests.
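To make that concrete, here is roughly what the per-image pattern looks like if you strip away all the MediaWiki machinery. This is an illustrative sketch using plain file_get_contents() against the API URLs shown above, not the actual MediaWiki code:

$api = 'https://commons.wikimedia.org/w/api.php';
$file = 'File:Example.png';

// Each call is one HTTPS request (plus its own TCP/TLS setup).
function apiGet( $api, array $params ) {
    $params += [ 'action' => 'query', 'prop' => 'imageinfo', 'format' => 'json' ];
    return json_decode( file_get_contents( $api . '?' . http_build_query( $params ) ), true );
}

// 1. Metadata: does the file exist, and what is it?
$meta = apiGet( $api, [ 'titles' => $file, 'iiprop' => 'timestamp|user|url|size|sha1|mime|mediatype|extmetadata' ] );

// 2-4. Thumbnail info at 1x, 1.5x and 2x of the size used in the article.
foreach ( [ 300, 450, 600 ] as $width ) {
    $thumb = apiGet( $api, [ 'titles' => $file, 'iiprop' => 'url|timestamp', 'iiurlwidth' => $width ] );
}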

Latency and RTT

It gets worse. All of these requests are done in serial. Before doing request 2, we wait until we have the answer to request 1. Before doing request 3 we wait until we get the answer to request 2, and so on.

Internet speed can be measured in two ways - latency and bandwidth. Bandwidth is the usual measurement we're familiar with: how much data can be transferred in bulk - e.g. 10 Mbps.

Latency, ping time or round-trip-time (RTT) is another important measure - it's how long it takes your message to get somewhere and come back.

When we start to send many small messages in serial, latency starts to matter. How big your latency is depends on how close you are to the server you're talking to. For Wikimedia Commons, the data-centers (DCs) are located in San Francisco (ulsfo), Virginia (eqiad), Texas (codfw), Singapore (eqsin) and Amsterdam (esams). For example, I'm relatively close to SF, so my ping time to the SF servers is about 50ms. For someone with a 50ms ping time, all this back and forth will take at a minimum 17 seconds just from latency (340 requests × 50 ms = 17 s).

However, it gets worse; Your computer doesn't just ask for the page and get a response back, it has to setup the connection first (TCP & TLS handshake). This takes additional round-trips.

Additionally, not all data centers are equal. The Virginia data-center (eqiad)² is the main data center which can handle everything; the other DCs only have varnish servers and can only handle cached requests. This makes browsing Wikipedia when logged out very speedy, but the type of API requests we are making here cannot be handled by these caching DCs³. For requests they can't handle, they have to ask the main DC what the answer is, which adds further latency. When I tried to measure mine, I got 255ms, but I didn't measure very rigorously, so I'm not fully confident in that number. In our particular case, the TLS & TCP handshakes are handled by the closer DC, but the actual API response has to be fetched all the way from the DC in Virginia.

But wait, you might say: surely you only need to do the TLS & TCP setup once if communicating with the same host. And the answer would normally be yes, which brings us to major problem #2: each connection is set up and torn down independently, requiring us to re-establish the TCP/TLS session each time. This adds 2 additional round-trips per request. In our 85 image example, we're now up to 1020 round-trips. If you assume 50ms to the caching DC and 255ms to Virginia (these numbers are probably quite idealized; there are probably other things I'm not counting), we're up to roughly 2 minutes: 340 requests × (50 ms TCP + 50 ms TLS + 255 ms API response) ≈ 121 seconds.

To put it altogether, here is a diagram representing all the back and forth communication needed just to use a single image:

12 RTT per image used - 4 requests × 3 round trips each (TCP handshake, TLS handshake, and the request itself)! This is assuming TLS 1.3; earlier versions of TLS would be even worse.

Introducing HTTP/2

In 2015, HTTP/2 came on the scene. This was the first major revision to the HTTP protocol in almost 20 years.

The primary purpose of this revision of HTTP was to minimize the effect of latency when you are requesting many separate small resources around the same time. It works by allowing a single connection to be reused for many requests at the same time, and allowing the responses to come in out of order or interleaved. In HTTP/1.1 you can sometimes be stuck waiting for one request to finish before being allowed to start on the next (head-of-line blocking), resulting in inefficient use of network resources.

This is exactly the problem that instant commons was having.

Now I should be clear: instant commons wasn't using HTTP/1.1 in a very efficient way, and it would be possible to do much better even with HTTP/1.1. However, HTTP/2 is still much better than even an improved usage of HTTP/1.1 would be.

Changing instant commons to use HTTP/2 changed two things:

  1. Instead of creating a new connection each time, with multiple round trips to set up TCP and TLS, we just use a single HTTP/2 connection that only has to do the setup once.
  2. If we have multiple requests ready to go, send them all off at once instead of having to wait for each one to finish before sending the next one.

We still can't do all requests at once, since the MediaWiki parser is serial and stops parsing once it hits an image, so we need to get information about the current image before we know which one we need next. However, this still helps: for each image we send 4 requests (metadata, thumbnail, 1.5dpp thumbnail and 2dpp thumbnail), which can now go out in parallel.
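The extension goes through MediaWiki's HTTP layer, but the underlying idea is the same as in this curl sketch - one multiplexed HTTP/2 connection carrying the four per-image requests at once. It assumes a reasonably modern libcurl, and the $...Url variables stand in for URLs built as in the examples above:

$urls = [ $metadataUrl, $thumbUrl, $thumb15Url, $thumb2Url ]; // built as above

$mh = curl_multi_init();
// Let curl multiplex all requests over a single HTTP/2 connection.
curl_multi_setopt( $mh, CURLMOPT_PIPELINING, CURLPIPE_MULTIPLEX );

$handles = [];
foreach ( $urls as $url ) {
    $ch = curl_init( $url );
    curl_setopt( $ch, CURLOPT_RETURNTRANSFER, true );
    curl_setopt( $ch, CURLOPT_HTTP_VERSION, CURL_HTTP_VERSION_2TLS );
    curl_multi_add_handle( $mh, $ch );
    $handles[] = $ch;
}

// Drive all four transfers to completion concurrently.
do {
    $status = curl_multi_exec( $mh, $active );
    if ( $active ) {
        curl_multi_select( $mh );
    }
} while ( $active && $status === CURLM_OK );

$responses = array_map( 'curl_multi_getcontent', $handles );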


The results are impressive for such a simple change. Where previously my test page took 17 minutes, now it only takes 2 (139 seconds).


Transform via 404

In vanilla MediaWiki, you have to request a specific thumbnail size before fetching it; otherwise it might not exist. This is not true on Wikimedia Commons. If you fetch a thumbnail that doesn't exist, Wikimedia Commons will automatically create it on the spot. MediaWiki calls this feature "TransformVia404".

In instant commons, we make requests to create thumbnails at the appropriate sizes. This is all pointless on a wiki where they will automatically be created on the first attempt to fetch them. We can just output <img> tags, and the first user to look at the page will trigger the thumbnail creation, skipping 3 of the 4 requests.
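In other words, once the metadata request tells us the file exists and what its original URL is, we can construct the thumbnail URLs ourselves. Here is a rough sketch, assuming the standard Wikimedia upload path layout (the hash directories in the example are illustrative, and the real extension may derive this differently):

// e.g. https://upload.wikimedia.org/wikipedia/commons/a/a9/Example.jpg
//  ->  https://upload.wikimedia.org/wikipedia/commons/thumb/a/a9/Example.jpg/300px-Example.jpg
function thumbUrl( $origUrl, $width ) {
    $name = basename( $origUrl );
    $dir = dirname( $origUrl );  // .../commons/a/a9
    $dir = preg_replace( '!/commons/!', '/commons/thumb/', $dir, 1 );
    return "$dir/$name/{$width}px-$name";
}

// Just emit the <img> tag; the remote wiki renders the thumbnail on the first
// request for it (TransformVia404), so requests 2-4 are never needed.
echo thumbUrl( 'https://upload.wikimedia.org/wikipedia/commons/a/a9/Example.jpg', 300 );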

Adding this optimization took the time down from 139 seconds with just HTTP/2 to 18.5 seconds with both this and HTTP/2. This is 56 times faster than what we started with!



Prefetching

18.5 seconds is pretty good. But can we do better?

We might not be able to if we actually have to fetch all the images, but there is a pattern we can exploit.

Generally when people edit an article, they might change a sentence or two, but often don't alter the images. Other times, MediaWiki might re-parse a page even if there are no changes to it (e.g. due to a cache expiry). As a result, the set of images we need is often the same as, or close to, the set that we needed for the previous version of the page. This set is already recorded in the database, in order to display which pages use an image on the image description page.

We can use this. First we retrieve the list of images used on the previous version of the page. We can then fetch all of these at once, instead of having to wait for the parser to tell us, one at a time, which image we need.

It is possible of course, that this list could be totally wrong. Someone could have replaced all the images on the page. If it's right, we speed up by pre-fetching everything we need, all in parallel. If it's wrong, we fetched some things we didn't need, possibly making things slower than if we did nothing.

I believe in the average case, this will be a significant improvement. Even in the case that the list is wrong, we can send off the fetch in the background while MediaWiki does other page processing - the hope being, that MediaWiki does other stuff while this fetch is running, so if it is fetching the wrong things, time is not wasted.
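A sketch of what that prefetch looks like, assuming MediaWiki's imagelinks table (which records which files each page uses). The $dbr and $pageId variables are assumed to come from the surrounding MediaWiki context, and the batched API request is illustrative rather than the extension's exact code:

// $dbr is a replica database connection, $pageId the page being re-parsed.
$res = $dbr->select(
    'imagelinks',
    'il_to',                   // file name (DB key) of each image the page used
    [ 'il_from' => $pageId ],
    __METHOD__
);

$titles = [];
foreach ( $res as $row ) {
    $titles[] = 'File:' . $row->il_to;
}

// One batched imageinfo query for the lot (the API accepts up to 50 titles per
// request), fired off before the parser starts asking for images one by one.
$params = [
    'action' => 'query',
    'prop'   => 'imageinfo',
    'iiprop' => 'timestamp|user|url|size|sha1|mime|mediatype|extmetadata',
    'titles' => implode( '|', $titles ),
    'format' => 'json',
];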

On my test page, using this brings the time to render (where the previous version had all the same images) down to 1.06 seconds - a 980-times speed improvement! It should be noted that this is the total time to render, not just the time to fetch images, so most of that time is probably spent rendering other stuff rather than in instant commons.

Caching

All the above is assuming a local cache miss. It is wasteful to request information remotely, if we just recently fetched it. It makes more sense to reuse information recently fetched.

In many cases, the parser cache, which in MediaWiki caches the entire rendered page, will mean that instant commons isn't called that often. However, some extensions that create dynamic content make the parser cache very short lived, which makes caching in instant commons more important. It is also common for people to use the same images on many pages (e.g. A warning icon in a template). In such a case, caching at the image fetching layer is very important.

There is a downside though: we have no way to tell if upstream has modified the image. This is not that big a deal for most things - Exif data being slightly out of date does not matter that much. However, if the aspect ratio of the image changes, the image will appear squished until InstantCommons' cache is cleared.

To balance these competing concerns, Quick InstantCommons uses an adaptive cache. If the image has existed unchanged for a long time, we cache for a day (configurable). After all, if the image has been stable for years, it seems unlikely it is going to be edited very soon. However, if the image has been edited recently, we use a dynamically determined, shorter time to live. The idea being: if the image was edited 2 minutes ago, there is a much higher chance that it will be edited a second time. Maybe the previous edit was vandalism, or maybe it just got improved further.

As the cache entry for an image gets close to expiring, we refetch it in the background. The idea is that we can use the soon-to-expire version now, while the refetch runs in the background as MediaWiki processes other things. Next time we'll have a fresh version, but we never have to stall the parse waiting to download the image's information. That way things are kept fresh without a negative performance impact.
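The exact tuning isn't the interesting part, but as an illustration, an adaptive TTL amounts to something like this. The numbers and the function name are made up for the sketch, not Quick InstantCommons' actual values:

// Cache an image's metadata longer the longer it has gone without being edited.
function adaptiveTtl( $lastModifiedTimestamp, $maxTtl = 86400 ) {
    $age = time() - $lastModifiedTimestamp;   // seconds since the last upload
    $ttl = (int)( $age * 0.1 );               // e.g. cache for 10% of its current age
    return max( 60, min( $maxTtl, $ttl ) );   // clamp between 1 minute and 1 day
}

// An image uploaded 2 minutes ago is cached for ~1 minute; one that hasn't
// changed in years gets the full day.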

MediaWiki's built-in instant commons did support caching; however, it wasn't configurable and the default time to live was very low. Additionally, the adaptive caching code had a bug that prevented it from working correctly. The end result was that the cache often could not be used effectively.

Missing MediaHandler Extensions

In MediaWiki's built-in InstantCommons feature, you need to have the same set of media extensions installed to view all files. For example, PDFs won't render via instant commons without Extension:PDFHandler.

This is really unnecessary where the file type just renders to a normal image - after all, the complicated bit is all on the other server. My extension fixes that and does its best to show thumbnails for file types it doesn't understand. It can't support advanced features without the appropriate extension (e.g. navigating 3D models), but it will show a static thumbnail.

Conclusion

In the end, by making a few, relatively small changes, we were able to improve the performance of instant commons significantly. 980 times as fast!

Do you run a MediaWiki wiki? Try out the extension and let me know what you think.

Footnotes:

¹ This is assuming default settings and an object cache miss. This may be different if $wgResponsiveImages is false, in which case high-DPI images won't be fetched, or if apiThumbCacheExpiry is set to non-zero, in which case thumbnails will be downloaded locally to the wiki server during the page parse instead of being hotlinked.


² This role actually rotates between the Virginia & Texas data centers. Additionally, the Texas DC (when not primary) does do some things that the caching DCs don't, but they aren't particularly relevant to this topic. There are eventual plans to have multiple active DCs which would all be able to respond to the type of API queries being made here, but they are not complete as of this writing - https://www.mediawiki.org/wiki/Wikimedia_Performance_Team/Active-active_MediaWiki


³ The MediaWiki API actually supports an smaxage=<number of seconds> (shared maximum age) url parameter. This tells the API server you don't care if your request is that many seconds out of date, and to serve it from varnish caches in the local caching data center if possible. Unlike with normal Wikipedia page views, there is no cache invalidation here, so it is rarely used and it is not used by instant commons.