Jump to content

Research:Quarry

From Meta, a Wikimedia project coordination wiki
Querying interface project via SQL
powered by community

Quarry

2021 Coolest Tool
Award Winner

in the category
Developer

Information
Started in2014 or earlier
Statistics
Quarries+70,000
Runs+700,000
Contacts
Discussions
Developers
A screenshot of the Quarry SQL writing interface after execution completed. The results set appears directly beneath the query writing pane.
Query execution. A screenshot of the Quarry SQL writing interface after execution completed. The results set appears directly beneath the query writing pane.

Quarry is a public querying interface for Wiki Replicas, a set of live replica SQL databases of public Wikimedia Wikis. Quarry is designed to make running queries against Wiki Replicas easy. Quarry can also be used to query public databases stored in ToolsDB.

Quarry also provides a means for researchers to share and review each other's queries. Users of Quarry are required to agree to Cloud Services Terms of use.


Writing & executing SQL queries

[edit]

Choosing a database

[edit]

Quarry requires you to enter a wiki replica database in the upper left box just above the SQL box. The name of the database is usually the language code concatenated with the wiki family (wiki for Wikipedia, wikibooks etc.). For multi-language wikis it's the name of the wiki and wiki, e.g. commonswiki, wikidatawiki. See https://quarry.wmcloud.org/query/53805 for a fairly complete list or https://db-names.toolforge.org/ for a more user-friendly lookup.

If you are a long-time user, you are probably aware that wiki replica databases that are publicly accessible have names that end with _p. Note that the database entry box does not require you to add _p at the end, but it will also work if you do. The database you are querying behind the scenes does have that appended as all public versions of the databases do. This routes your query to the correct wiki replica server and database. You should only use the database you selected for most intents and purposes.

Querying ToolsDB public databases

[edit]

Quarry can also be used to query the public databases created by Toolforge tools and stored in ToolsDB. You can type the name of the database in the entry box, and Quarry will automatically detect if it's a ToolsDB database based on its name (e.g. s55926__wishlist_p).

Querying Quarry's own database

[edit]

If you type quarry or quarry_p as the database name, you can run a query against a read-only copy of Quarry's internal database, containing all the queries that users run in Quarry.

Advanced use cases and joining wiki databases

[edit]

In general, you should only connect to the database that you want to query unless you have an advanced use case that requires joining wikis from the same section - and you know what that means.

Technically, you can query a database that shares a section with the database you selected (see the database layout). If you know for sure that you want to query a different database that shares a section number with the database you've entered, you can do this by prefixing your query with USE <dbname>_p; or by specifically referencing the database in a join. Cross-wiki joins are only possible within the same section. For example, the following SQL returns the count of revisions saved by "EpochFail" in metawiki (metawiki is on s7, so you would have needed to connect to another database from the same section, such as arwiki or centralauth).

USE metawiki_p;
SELECT COUNT(*)
FROM revision_userindex 
WHERE
    rev_actor = (SELECT actor_id
                 FROM actor
                 WHERE actor_name = "EpochFail") AND 
    rev_timestamp >= "20140101";

Available tables and columns

[edit]

For an overview of the available tables and columns see MediaWiki database layout. Some extensions add additional tables which can be queried too. These should be documented on the extension's documentation page. The Toolforge SQL Optimizer has a database browser that helps to find the needed table and understand its format. Note that not all tables are available. The following data is missing:

  • The table text containing the actual content of the revisions is missing.
  • Private data is suppressed.

Note that this means that the revision and logging tables do not have indexes on user columns. You can use revision_userindex and logging_userindex instead (see wikitech:Help:Toolforge/Database#Tables_for_revision_or_logging_queries_involving_user_names_and_IDs)

To get a list of all tables, you can run the query (replace enwiki_p with the respective wiki):

SHOW TABLES FROM enwiki_p;

To get the schema of a table, you can use DESCRIBE (e.g. https://quarry.wmcloud.org/query/585)

Downloading a resultset

[edit]
Download data button. 

Quarry lets you download the result of a query. A data download button appears right before the resultset allowing you to get query results in CSV, TSV, JSON or wikitable formats. You can also request resultsets programmatically from wiki tools by issuing requests like below.

Format
https://quarry.wmcloud.org/run/query_run_ID/output/resultset_id/format
Example
https://quarry.wmcloud.org/run/1534/output/0/json
Response
{"rows": [[88]], "meta": {"rev_id": 1534, "query_id": 541, "run_id": 1534}, "headers": ["COUNT(*)"]}

Note that query_run_ID (e.g. /run/10140) is not the same as query number (e.g. /query/153) which is visible in the URL. The query run ID is given in the HTML source code of the query page as "qrun_id".

To download a resultset from the latest run of a query:

Format
https://quarry.wmcloud.org/query/query_ID/result/latest/resultset_id/format
Example
https://quarry.wmcloud.org/query/541/result/latest/0/csv
Response
{"rows": [[88]], "meta": {"rev_id": 1536, "query_id": 541, "run_id": 1536}, "headers": ["COUNT(*)"]}

Example queries

[edit]
[edit]

The interwiki map link quarry: provides shorter linking to queries. For example, you can write

instead of the full link

See also

[edit]