Sunday, September 27, 2009

Importing Wikimedia dumps

We are trying to gather some particular statistics about Portuguese Wikipedia usage.
I volunteered to import the ptwiki-20090926-stub-meta-history dump, which is an XML file, and we'll be running very heavy queries against it (it's my task to optimize them, somehow).

What I'd like to mention is that the import process seems to have been tremendously simplified. I remember testing a couple of tools in the past, without much success (or robustness). This time I gave mwdumper a try, and it seems to do the job. Note, however, that there have been schema changes since the last mwdumper release, so you should have a look at WMF Bug #18328: mwdumper java.lang.IllegalArgumentException: Invalid contributor, which includes a proposed fix that seems to work well. A special note on its memory efficiency: RAM is barely touched!

The xml.gz file is ~550MB, and was converted to a ~499MB sql.gz:

1,992,543 pages (3,458.297/sec), 15,713,915 revs (27,273.384/sec)
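
For the record, the conversion itself boils down to a single mwdumper invocation along these lines (assuming the patched mwdumper.jar from the bug report above; the sql:1.5 output format matched my schema, adjust it if yours differs):
[code]
[myself@speedy ~]$ java -jar mwdumper.jar --format=sql:1.5 ptwiki-20090926-stub-meta-history.xml.gz | gzip > ptwiki-20090926-stub-meta-history.sql.gz
[/code]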


I've copied the schema from a running (up-to-date!) MediaWiki installation to spare some time. The tables seem to be InnoDB by default, so let's simplify I/O a bit (I'm on my laptop). This will also speed up loading times a lot:
[mysql]
mysql> ALTER TABLE `text` ENGINE=Blackhole;
Query OK, 0 rows affected (0.01 sec)
Records: 0 Duplicates: 0 Warnings: 0

mysql> alter table page drop index page_random, drop index page_len;
Query OK, 0 rows affected (0.01 sec)
Records: 0 Duplicates: 0 Warnings: 0

mysql> alter table revision drop index rev_timestamp, drop index page_timestamp, drop index user_timestamp, drop index usertext_timestamp;
Query OK, 0 rows affected (0.01 sec)
Records: 0 Duplicates: 0 Warnings: 0
[/mysql]
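
If you need those indexes later for your queries, they can be recreated once the bulk load is done; their definitions live in MediaWiki's maintenance/tables.sql, and should be roughly these (double-check against your schema version):
[mysql]
ALTER TABLE page ADD INDEX page_random (page_random), ADD INDEX page_len (page_len);
ALTER TABLE revision ADD INDEX rev_timestamp (rev_timestamp),
                     ADD INDEX page_timestamp (rev_page, rev_timestamp),
                     ADD INDEX user_timestamp (rev_user, rev_timestamp),
                     ADD INDEX usertext_timestamp (rev_user_text, rev_timestamp);
[/mysql]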

The important thing here is to avoid the heavy I/O if you don't need it at all. The text table holds the page/revision content, which I'm not interested in at all. As for MySQL's configuration (and as a personal note, anyway), the following settings will give you great InnoDB speeds:
[code]
key_buffer = 512K
sort_buffer_size = 16K
read_buffer_size = 2M
read_rnd_buffer_size = 1M
myisam_sort_buffer_size = 512K
query_cache_size = 0
query_cache_type = 0
bulk_insert_buffer_size = 2M

innodb_file_per_table
transaction-isolation = READ-COMMITTED
innodb_buffer_pool_size = 2700M
innodb_additional_mem_pool_size = 20M
innodb_autoinc_lock_mode = 2
innodb_flush_log_at_trx_commit = 0
innodb_doublewrite = 0
skip-innodb-checksum
innodb_locks_unsafe_for_binlog=1
innodb_log_file_size=128M
innodb_log_buffer_size=8388608
innodb_support_xa=0
innodb_autoextend_increment=16
[/code]
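
These lines go under the [mysqld] section of my.cnf and most of them need a server restart; a quick way to confirm they were picked up, using a couple of them as examples:
[mysql]
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';
SHOW VARIABLES LIKE 'innodb_flush_log_at_trx_commit';
[/mysql]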

Now I'd recommend uncompressing the dump, so it's easier to trace the whole process if it's taking too long:
[code]
[myself@speedy ~]$ gunzip ptwiki-20090926-stub-meta-history.sql.gz
[myself@speedy ~]$ cat ptwiki-20090926-stub-meta-history.sql | mysql wmfdumps
[/code]
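
Since the whole point of uncompressing was being able to follow progress, something like pv (not installed by default, so this is just a suggestion) gives you a progress bar over the same pipe:
[code]
[myself@speedy ~]$ pv ptwiki-20090926-stub-meta-history.sql | mysql wmfdumps
[/code]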

After a few minutes on a dual quad-core 2.0 GHz Xeon, and with 2.4 GB of datafiles, we are ready to rock! I will probably also need the user table later, which Wikimedia doesn't distribute, so I'll rebuild it now:
[mysql]
mysql> alter table user modify column user_id int(10) unsigned NOT NULL;
Query OK, 0 rows affected (0.12 sec)
Records: 0 Duplicates: 0 Warnings: 0

mysql> alter table user drop index user_email_token, drop index user_name;
Query OK, 0 rows affected (0.03 sec)
Records: 0 Duplicates: 0 Warnings: 0

mysql> insert into user(user_id,user_name) select distinct rev_user,rev_user_text from revision where rev_user <> 0;
Query OK, 119140 rows affected, 4 warnings (2 min 4.45 sec)
Records: 119140 Duplicates: 0 Warnings: 0

mysql> alter table user drop primary key;
Query OK, 0 rows affected (0.13 sec)
Records: 0 Duplicates: 0 Warnings: 0

mysql> insert into user(user_id,user_name) values(0,'anonymous');
Query OK, 1 row affected, 4 warnings (0.00 sec)
[/mysql]
It's preferable to join on INTs rather than on VARCHAR(255)s, which is why I reconstructed the user table. I actually removed the PRIMARY KEY, but I set it back after the process. What happens is that some users have been renamed and thus appear with the same id but different user_names. The query to list them all is this:
[mysql]
mysql> select a.user_id,a.user_name from user a join (select user_id,count(1) as counter from user group by user_id having counter > 1 order by counter desc) as b on a.user_id = b.user_id order by user_id DESC;
....
14 rows in set (0.34 sec)

mysql> update user a join (select user_id,GROUP_CONCAT(user_name) as user_name,count(1) as counter from user group by user_id having counter > 1) as b set a.user_name = b.user_name where a.user_id = b.user_id;
Query OK, 14 rows affected (2.49 sec)
Rows matched: 14 Changed: 14 Warnings: 0
[/mysql]
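
As mentioned above, the primary key was only dropped temporarily; after merging the renamed users and cleaning up the leftover duplicates (next paragraph), it goes back on with something like:
[mysql]
ALTER TABLE user ADD PRIMARY KEY (user_id);
[/mysql]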

The duplicates were removed manually (there were just 7). Now let's start to dig deeper. I'm not concerned about optimizing for now; what I wanted to run right away was the query I asked for on the Toolserver more than a month ago:

[mysql]
mysql> CREATE TABLE `teste` (
-> `rev_user` int(10) unsigned NOT NULL DEFAULT '0',
-> `page_namespace` int(11) NOT NULL,
-> `rev_page` int(10) unsigned NOT NULL,
-> `edits` int(1) unsigned NOT NULL,
-> PRIMARY KEY (`rev_user`,`page_namespace`,`rev_page`)
-> ) ENGINE=InnoDB DEFAULT CHARSET=latin1 ;
Query OK, 0 rows affected (0.04 sec)

mysql> insert into teste select r.rev_user, p.page_namespace, r.rev_page, count(1) AS edits from revision r JOIN page p ON r.rev_page = p.page_id GROUP BY r.rev_user,p.page_namespace,r.rev_page;
Query OK, 7444039 rows affected (8 min 28.98 sec)
Records: 7444039 Duplicates: 0 Warnings: 0

mysql> create table edits_per_namespace select straight_join u.user_id,u.user_name, page_namespace,count(1) as edits from teste join user u on u.user_id = rev_user group by rev_user,page_namespace;
Query OK, 187624 rows affected (3.65 sec)
Records: 187624 Duplicates: 0 Warnings: 0

mysql> select * from edits_per_namespace order by edits desc limit 5;
+---------+---------------+----------------+--------+
| user_id | user_name     | page_namespace | edits  |
+---------+---------------+----------------+--------+
|   76240 | Rei-bot       |              0 | 365800 |
|       0 | anonymous     |              0 | 253238 |
|   76240 | Rei-bot       |              3 | 219085 |
|    1740 | LeonardoRob0t |              0 | 145418 |
|  170627 | SieBot        |              0 | 121647 |
+---------+---------------+----------------+--------+
5 rows in set (0.09 sec)
[/mysql]

Well, that's funny: Rei-artur's bot alone beats all the anonymous edits combined on the main namespace :) I still need to set up a way of discarding the bots, since they usually don't count for stats. I'll probably set a flag on the user table myself, but this is enough to get us started.
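
A minimal sketch of that flag idea, assuming a hand-made user_is_bot column (the column name is mine, it's not part of the MediaWiki schema), would be something like:
[mysql]
ALTER TABLE user ADD COLUMN user_is_bot TINYINT(1) NOT NULL DEFAULT 0;
-- mark the bots once you have their names (e.g. Rei-bot and SieBot from the table above)
UPDATE user SET user_is_bot = 1 WHERE user_name IN ('Rei-bot', 'SieBot');
[/mysql]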

3 comments:

  1. hehehe, I had no idea about that...

    not sure if it helps, but you can filter the bots using the API,

    list of bots: http://pt.wikipedia.org/w/api.php?action=query&list=allusers&aufrom=A&augroup=bot&aulimit=500&format=xml

    or individually:
    http://pt.wikipedia.org/w/api.php?action=query&list=users&ususers=Rei-bot&usprop=editcount|groups&format=xml

    best regards

  2. Hi Rei-artur! Indeed, I hadn't thought of exploring the API. The last time I looked at it, it was under heavy development. The first link already helps me a lot, but I'll still squeeze the API and see what comes out ehhehehe. Cheers, and thanks!

  3. [...] time in Portuguese, I decided to give [some] continuity to what I started a few days ago with the Wikipedia dump import. Thanks to Rei-artur's tip it was easy to extract the list of bots, to exclude them from the statistics. [...]
