GPS Humano: August 2009

Monday, 31 August 2009

SugarCRM data generator

SugarCRM ships with a data generator that populates the DB with sample accounts, contacts, etc. It can also be used to generate DBs for testing, namely load testing. From what I understood, this data generator uses a fixed seed so that the data generated for one DB is the same as for another, so that they can be compared with each other, for example. Here are the required steps [used with version 5.5]:

Look for large_scale_test in config.php (around line 200) and change it to true:

[code]
- 'large_scale_test' => false,
+ 'large_scale_test' => true,
[/code]

Place the following script in install/dataGeneratorKit.php. It acts as a wrapper for populateSeedData.php, which already has the logic to populate the DB proportionally:

[php]
<?php
define('sugarEntry', 1);
require_once('include/entryPoint.php');
require_once('install/install_utils.php');

require_once('modules/TableDictionary.php');

require_once "include/database/DBManagerFactory.php";
include "install/populateSeedData.php";
?>
[/php]

You may need to comment out some lines in the i18n files, similar to the ones shown below (they reference SugarThemeRegistry). At least in 5.5beta this was necessary:

[php]
...
// 'LBL_EMAIL_ADDRESS_BOOK_TITLE_ICON' => 'getImageURL('icon_email_addressbook.gif').' align=absmiddle border=0> Address Book',
// 'LBL_EMAIL_ADDRESS_BOOK_TITLE_ICON_SHORT' => 'getImageURL('icon_email_addressbook.gif').' align=absmiddle border=0> Addr...',
...
[/php]

If necessary, change the values in install/populateSeedData.php:
[php]
if($large_scale_test) {
// increase the cuttoff time to 1 hour
ini_set("max_execution_time", "3600");
$number_contacts = 100000;
$number_companies = 15000;
$number_leads = 100000;
}
[/php]

Finally, run:
[code]
php -f install/dataGeneratorKit.php
[/code]

Notes

I don't know how efficient the script is, as it has already crashed on me with ''memory exhausted'' at a memory_limit of 256MB.

I also took the opportunity to copy this article to the SugarCRM Wiki.

Friday, 28 August 2009

Multiple field index vs external primary key calculation

I came across a client that was using a VARCHAR(200) as a MyISAM PRIMARY KEY for detecting collisions; the key was calculated externally (by the app) and fed into MySQL. I suspected the reason for this was to [try to] speed up inserts, or a fear of collisions combined with not wanting unnecessary data fields. The table was just a bunch of e-mail log rows.

The rows were to be quickly processed (selected by a DATETIME range) and archived. The obvious consequences are a big index and poor performance. Let's see why:

The key was the concatenation of some fields plus an extra one, "Message-ID", which might not be unique, as far as I could tell. In the end, the key itself was inserted in random order, since it started with the 'sender' address (see the maillogsmiorig schema below). This obviously led to poor key insertion (random order), but also to duplicate content in the key itself, as the table engine was MyISAM.

Moreover, after loading, operations were slower because of MyISAM: an index would eventually be used, but a separate access to the data file was unavoidable. InnoDB could avoid that as well, since the data is stored close to the index.

For several reasons, including that DATETIME field, I suggested migrating to InnoDB with a PK composed of the necessary fields. Moreover, the first field would be the DATETIME, which imposes a roughly chronological order for better key insertion; at the same time, the processing was being done over specific intervals of that field.

I sampled 1M rows of the live data for a test: I wanted to compare the original method to my overall solution, which you can check in maillogsinnonew below. I took the opportunity to isolate the storage engine benefit by running the test with both engines for both solutions (previous/orig and new):

[mysql]
mysql> SELECT * FROM maillogs INTO OUTFILE '/export/home/gpshumano/teste/maillogsorig.sql';
Query OK, 1000000 rows affected (6.47 sec)

mysql> select ip,sender,receiver,date_time,SUBSTRING(chave,LENGTH(SUBSTRING_INDEX(chave,'|',2))+2) FROM maillogsminew INTO OUTFILE '/export/home/gpshumano/teste/maillogsnew.sql';
Query OK, 1000000 rows affected (4.42 sec)

mysql> CREATE TABLE `maillogsinnonew` (
-> `ip` varchar(16) NOT NULL,
-> `sender` varchar(320) NOT NULL,
-> `receiver` varchar(320) NOT NULL,
-> `date_time` datetime NOT NULL,
-> `message_id` varchar(200) NOT NULL,
-> PRIMARY KEY (`date_time`,`sender`,`receiver`,`message_id`)
-> ) ENGINE=InnoDB DEFAULT CHARSET=latin1;
Query OK, 0 rows affected (1.27 sec)

mysql> CREATE TABLE `maillogsminew` (
-> `ip` varchar(16) NOT NULL,
-> `sender` varchar(320) NOT NULL,
-> `receiver` varchar(320) NOT NULL,
-> `date_time` datetime NOT NULL,
-> `message_id` varchar(200) NOT NULL,
-> PRIMARY KEY (`date_time`,`sender`,`receiver`,`message_id`)
-> ) ENGINE=MyISAM DEFAULT CHARSET=latin1;
Query OK, 0 rows affected (0.13 sec)

mysql> CREATE TABLE `maillogsmiorig` (
-> `chave` varchar(200) NOT NULL,
-> `ip` varchar(16) NOT NULL,
-> `sender` varchar(320) NOT NULL,
-> `receiver` varchar(320) NOT NULL,
-> `date_time` datetime NOT NULL,
-> PRIMARY KEY (`chave`),
-> KEY `maillogs_date_time_idx` (`date_time`)
-> ) ENGINE=MyISAM DEFAULT CHARSET=latin1;
Query OK, 0 rows affected (0.14 sec)

mysql> CREATE TABLE `maillogsinnoorig` (
-> `chave` varchar(200) NOT NULL,
-> `ip` varchar(16) NOT NULL,
-> `sender` varchar(320) NOT NULL,
-> `receiver` varchar(320) NOT NULL,
-> `date_time` datetime NOT NULL,
-> PRIMARY KEY (`chave`),
-> KEY `maillogs_date_time_idx` (`date_time`)
-> ) ENGINE=InnoDB DEFAULT CHARSET=latin1;
Query OK, 0 rows affected (2.81 sec)

mysql> LOAD DATA INFILE '/export/home/gpshumano/teste/maillogsnew.sql' INTO TABLE maillogsminew;
Query OK, 1000000 rows affected (34.83 sec)
Records: 1000000 Deleted: 0 Skipped: 0 Warnings: 0

mysql> LOAD DATA INFILE '/export/home/gpshumano/teste/maillogsnew.sql' INTO TABLE maillogsinnonew;
Query OK, 1000000 rows affected (1 min 40.56 sec)
Records: 1000000 Deleted: 0 Skipped: 0 Warnings: 0

mysql> LOAD DATA INFILE '/export/home/gpshumano/teste/maillogsorig.sql' INTO TABLE maillogsinnoorig;
Query OK, 1000000 rows affected (6 min 54.14 sec)
Records: 1000000 Deleted: 0 Skipped: 0 Warnings: 0

mysql> LOAD DATA INFILE '/export/home/gpshumano/teste/maillogsorig.sql' INTO TABLE maillogsmiorig;
Query OK, 1000000 rows affected (1 min 17.06 sec)
Records: 1000000 Deleted: 0 Skipped: 0 Warnings: 0
[/mysql]

This was the aftermath. Comparing the results, you'll see a 75% gain in speed for InnoDB and 55% for MyISAM. Moreover, after the loads I ran an OPTIMIZE TABLE to get more precise table status:
[mysql]
mysql> show table status;
+----------------------+--------+---------+------------+---------+----------------+-------------+-----------------+--------------+-----------+----------------+---------------------+---------------------+---------------------+-------------------+----------+----------------+---------+
| Name | Engine | Version | Row_format | Rows | Avg_row_length | Data_length | Max_data_length | Index_length | Data_free | Auto_increment | Create_time | Update_time | Check_time | Collation | Checksum | Create_options | Comment |
+----------------------+--------+---------+------------+---------+----------------+-------------+-----------------+--------------+-----------+----------------+---------------------+---------------------+---------------------+-------------------+----------+----------------+---------+
| maillogsinnonew | InnoDB | 10 | Compact | 1048709 | 144 | 152043520 | 0 | 0 | 6144 | NULL | 2009-08-24 20:00:44 | NULL | NULL | latin1_swedish_ci | NULL | | |
| maillogsinnoorig | InnoDB | 10 | Compact | 1041388 | 392 | 408944640 | 0 | 216006656 | 4096 | NULL | 2009-08-24 20:01:35 | NULL | NULL | latin1_swedish_ci | NULL | | |
| maillogsminew | MyISAM | 10 | Dynamic | 1000000 | 126 | 126120088 | 281474976710655 | 100151296 | 0 | NULL | 2009-08-24 20:00:55 | 2009-08-24 20:06:08 | NULL | latin1_swedish_ci | NULL | | |
| maillogsmiorig | MyISAM | 10 | Dynamic | 1000000 | 173 | 173720872 | 281474976710655 | 166667264 | 0 | NULL | 2009-08-24 20:01:01 | 2009-08-24 20:18:24 | 2009-08-24 20:18:31 | latin1_swedish_ci | NULL | | |
+----------------------+--------+---------+------------+---------+----------------+-------------+-----------------+--------------+-----------+----------------+---------------------+---------------------+---------------------+-------------------+----------+----------------+---------+
4 rows in set (0.03 sec)
[/mysql]

[code]
bash-3.00# ls -lah /iscsipool/DATA/tavares/
total 2288740
drwx------ 2 mysql mysql 13 ago 24 20:37 .
drwxr-xr-x 10 mysql mysql 17 ago 24 19:45 ..
-rw-rw---- 1 mysql mysql 65 ago 24 19:45 db.opt
-rw-rw---- 1 mysql mysql 8,5K ago 24 20:29 maillogsinnonew.frm
-rw-rw---- 1 mysql mysql 156M ago 24 20:32 maillogsinnonew.ibd
-rw-rw---- 1 mysql mysql 8,5K ago 24 20:31 maillogsinnoorig.frm
-rw-rw---- 1 mysql mysql 420M ago 24 20:40 maillogsinnoorig.ibd
-rw-rw---- 1 mysql mysql 8,5K ago 24 20:00 maillogsminew.frm
-rw-rw---- 1 mysql mysql 120M ago 24 20:06 maillogsminew.MYD
-rw-rw---- 1 mysql mysql 96M ago 24 20:37 maillogsminew.MYI
-rw-rw---- 1 mysql mysql 8,5K ago 24 20:01 maillogsmiorig.frm
-rw-rw---- 1 mysql mysql 166M ago 24 20:18 maillogsmiorig.MYD
-rw-rw---- 1 mysql mysql 159M ago 24 20:37 maillogsmiorig.MYI
[/code]

As you can see, you gain not only in speed but also in smaller data files: 63% for InnoDB and 33% for MyISAM. These smaller sizes also contribute to a longer stay in the buffer pool and reduced I/O. In any case, since the rows were processed immediately after loading, they would probably all be in InnoDB's buffer pool, avoiding the extra I/O to refetch the data from the MyISAM data files.

Also notice how maillogsinnonew reports its index (PK) size as 0: that's because in InnoDB the fields in the PK are actually part of the data node itself. In practical terms, this index is just a matter of data ordering, i.e., no extra space is required to store the index itself!

Thursday, 27 August 2009
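The percentages quoted here can be recomputed from the listings above. The following is only a sketch of that arithmetic: the helper names are made up for illustration, the timings are copied from the LOAD DATA output, and the sizes (in MB) are read off the `ls -lah` listing, taking the .ibd file for InnoDB and .MYD plus .MYI for MyISAM.

```python
# Recompute the gains from the numbers shown in the listings above.
# Helper names are illustrative; the figures are copied from the post.

def to_seconds(minutes, seconds):
    """Convert an 'M min S sec' timing into seconds."""
    return minutes * 60 + seconds


def reduction(before, after):
    """Relative reduction, in percent, when going from `before` to `after`."""
    return (before - after) / before * 100


# LOAD DATA INFILE timings as (original schema, new schema).
load_times = {
    "InnoDB": (to_seconds(6, 54.14), to_seconds(1, 40.56)),
    "MyISAM": (to_seconds(1, 17.06), to_seconds(0, 34.83)),
}

# On-disk sizes in MB as (original schema, new schema).
disk_sizes = {
    "InnoDB": (420, 156),             # .ibd files
    "MyISAM": (166 + 159, 120 + 96),  # .MYD + .MYI files
}

for engine in ("InnoDB", "MyISAM"):
    time_gain = reduction(*load_times[engine])
    size_gain = reduction(*disk_sizes[engine])
    print(f"{engine}: load time -{time_gain:.1f}%, disk size -{size_gain:.1f}%")
```

The results come out close to the rounded figures given in the text, which suggests the size savings were measured on the files themselves rather than on the SHOW TABLE STATUS columns.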

Update on mysql-query-browser "affected rows"

Yesterday I tried massaging mysql-gui-tools a bit to see if I could make affected rows show up on modification DMLs, such as INSERT, DELETE and UPDATE. Here is a brief summary, along with all the patches I used: some taken from the current Fedora 10 source RPM, plus a crude one of my own to show affected rows.

Update

Fixing the problem at the root will take a bit more time. This kind of protocol information, such as affected_rows, is lost because the guys at MySQL considered that data-changing DML never returns useful results. But it actually does: the response protocol packet comes with lots of info, such as whether an index was used, whether you are in a transaction, etc. It could be that the protocol changed over time and the Query Browser didn't catch up.

This translates into a lot of fixes: adapting the methods that discard results (only when the result variable is NULL) so that they no longer do so, and finding a way to leave affected_rows set somewhere.

So, for the moment, here is the list of patches I used; the ones in bold are my own. The RPM will have to wait, since I wanted to release mysql-gui-tools-5.0r14 (the version you should apply these patches against) instead of the r12 currently available. In the meantime, if I find more patches to submit, maybe I'll have a real look at it. Here are the patches:

mysql-query-browser-sigcfix

mysql-query-browser-missingincludes

mysql-query-browser-gtkdeprecated

mysql-query-browser-default_schema_infinite_loop

mysql-query-browser-affected_rows

mysql-gui-common-sigcfix

mysql-gui-common-encodingfix

mysql-gui-common-affected_rows

Wednesday, 26 August 2009

Compiling mysql-gui-tools

A colleague and I were missing some features in mysql-query-browser and, since no one at MySQL AB seems very interested in supporting it, I thought I could have a look at them myself. The system is Fedora 10 (still), and I use the Query Browser mainly because it's small, simple to use, and it's GTK!

Setting up the build (compile) environment

Got the sources from the notes at MySQL Forge: Building MySQL GUI Tools on Linux:

[code]
svn co http://svn.mysql.com/svnpublic/mysql-gui-common/trunk mysql-gui-common
svn co http://svn.mysql.com/svnpublic/mysql-query-browser/trunk mysql-query-browser
[/code]

You'll need a patch from Oden Eriksson attached to Bug #32184, or you can use the one from the RPM; otherwise you'll get the error "error: ‘SigC’ has not been declared" described in that bug report. I had to split it to build from the SVN tree, patching mysql-gui-common and mysql-query-browser independently.

Building mysql-gui-common is straightforward:

[code]
./autogen.sh
./configure --prefix=/home/nmct/mysql-query-browser/fake
make -j 2
make install
[/code]

Building mysql-query-browser seems to need to be pointed explicitly at libgtkhtml, besides the packages it mentions in the error:

[code]
[root@speedy ~]# rpm -qa | grep gtkhtml | grep devel
gtkhtml3-devel-3.24.5-1.fc10.i386
[root@speedy ~]# rpm -ql gtkhtml3-devel | grep pc
/usr/lib/pkgconfig/gtkhtml-editor.pc
/usr/lib/pkgconfig/libgtkhtml-3.14.pc
[/code]

So it's easy to spot the needed --with switch. I had to apply several other patches that I just took from the source RPM. Most of them were applied with -p2.

[code]
[nmct@speedy mysql-query-browser]$ patch -p 2 < mysql-gui-tools-5.0_p12-libsigc++-2.2.patch
patching file source/linux/MQResultTab.h
[nmct@speedy mysql-query-browser]$ patch -p2 < mysql-gui-tools-gtksourceview-cflags.patch
patching file source/linux/Makefile.in
Hunk #1 succeeded at 119 (offset 17 lines).
[nmct@speedy mysql-query-browser]$ patch -p2 < gtk_deprecated_typedefs.patch
patching file source/linux/gtksourceview/gtksourceview/Makefile.in
...

...
./configure --with-gtkhtml=libgtkhtml-3.14 --prefix=/home/nmct/mysql-query-browser/fake
make -j 2
make install
[/code]

And that should be it - actually there was a path concatenation issue (looking for ...fake/usr/local/share...) which I quickly fixed with symlinks. After that, we should be ready to rock.

First patch: mysql_affected_rows

One of the features I miss most is the number of affected rows of some DML commands, such as UPDATE and INSERT. This was not easy to do in five minutes because of the UI split: mysql_affected_rows() doesn't seem to reach the GUI. So I've made a simple test, and succeeded.

This looks promising. I just set a global var, which will do for now. I still have to check for potential race conditions, but expect the polished patch, along with a new RPM for Fedora 10, at least, in the near future.

Walking through a MyISAM table in a controlled way

I thought I'd share a test I did some time ago. The goal was to show that we can walk through a MyISAM table (and only in this specific case) in a controlled way (i.e., predict the order in which rows are returned), and to check whether that traversal order holds for operations other than SELECT, such as DELETE. An application is easy to imagine: if I want, for example, to move blocks of rows from one table to another, it becomes essential that DELETE also behaves as expected since, as we know, MyISAM is not transactional and, if something fails, we want to be able to reach the rows in this deterministic way to know what to revert.

To some this may seem obvious, but without looking at the code we can never be sure. Besides, the code can change, so it's better to make sure :-) We won't even try to extrapolate the conclusions to InnoDB, because internally it works in a completely different way. In fact, a single aspect of its architecture, the famous clustered index (which may or may not be internal, but always exists!), is enough to suspect that the behaviour is completely different.

So, in practice, what we really want is to be sure which rows will come up over several iterations, and whether that certainty extends to DELETEs (and, eventually, to UPDATEs), i.e., to make our process deterministic.

Let's start with a simple table:

[mysql]
CREATE TABLE `teste` (
`_int` int(11) NOT NULL DEFAULT '0',
`_char` varchar(5) NOT NULL DEFAULT '',
KEY `idx_int` (`_int`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1
[/mysql]

And insert some data:
[mysql]
mysql> INSERT INTO teste VALUES (2,'c'), (1,'e'), (1,'b'), (1,'z'), (2,'b'), (2,'d'),(3,'m');
Query OK, 7 rows affected (0.00 sec)
Records: 7 Duplicates: 0 Warnings: 0

mysql> SELECT SQL_NO_CACHE * FROM teste;
+------+-------+
| _int | _char |
+------+-------+
| 2 | c |
| 1 | e |
| 1 | b |
| 1 | z |
| 2 | b |
| 2 | d |
| 3 | m |
+------+-------+
7 rows in set (0.00 sec)
[/mysql]

The order in which the rows were inserted is fundamental. We can see that this table scan is done in natural order, which the Query Optimizer confirms:
[mysql]
mysql> EXPLAIN SELECT SQL_NO_CACHE * FROM teste;
+----+-------------+-------+------+---------------+------+---------+------+------+-------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+---------------+------+---------+------+------+-------+
| 1 | SIMPLE | teste | ALL | NULL | NULL | NULL | NULL | 7 | |
+----+-------------+-------+------+---------------+------+---------+------+------+-------+
1 row in set (0.00 sec)
[/mysql]

Perfect. I'm asking for all the fields of each row, including _char, which is not indexed, and the Optimizer behaves as expected. But now let's look at a slightly different query:
[mysql]
mysql> EXPLAIN SELECT SQL_NO_CACHE _int FROM teste;
+----+-------------+-------+-------+---------------+---------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+---------------+---------+---------+------+------+-------------+
| 1 | SIMPLE | teste | index | NULL | idx_int | 4 | NULL | 7 | Using index |
+----+-------------+-------+-------+---------------+---------+---------+------+------+-------------+
1 row in set (0.00 sec)
[/mysql]

It's interesting how the Optimizer recognized that, if I only want one field and it happens to be indexed, it can be fetched directly from the index (that's what Using index in the Extra column means), avoiding a trip to the data files. However, this means we'll get the rows in index order:
[mysql]
mysql> SELECT SQL_NO_CACHE _int FROM teste;
+------+
| _int |
+------+
| 1 |
| 1 |
| 1 |
| 2 |
| 2 |
| 2 |
| 3 |
+------+
7 rows in set (0.00 sec)
[/mysql]

This matters more for this test than it may seem. If I'm forced to request more fields than that one, the Optimizer goes back to the table scan. And adding an ORDER BY is not enough...
[mysql]
mysql> explain SELECT SQL_NO_CACHE * FROM teste ORDER BY _int;
+----+-------------+-------+------+---------------+------+---------+------+------+----------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+---------------+------+---------+------+------+----------------+
| 1 | SIMPLE | teste | ALL | NULL | NULL | NULL | NULL | 7 | Using filesort |
+----+-------------+-------+------+---------------+------+---------+------+------+----------------+
1 row in set (0.00 sec)
[/mysql]
... because the Optimizer may want to sort the rows in a separate pass (possibly using temporary storage), which is what Using filesort in the Extra column tells us. This may look like an Optimizer flaw, but the truth is that the Optimizer is smart enough to determine that a ''full table scan'' can be more efficient than walking the index in order and fetching the remaining data from the data files at completely unordered, probably random, disk locations (the I/O cost would show up immediately). Of course that may not be the case for this table, but for very large tables it can be disastrous. So we'd have to explicitly force the use of the index since, at least in my case, not even the USE INDEX hint helped:
[mysql]
mysql> explain SELECT SQL_NO_CACHE * FROM teste FORCE INDEX(idx_int) ORDER BY _int;
+----+-------------+-------+-------+---------------+---------+---------+------+------+-------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+---------------+---------+---------+------+------+-------+
| 1 | SIMPLE | teste | index | NULL | idx_int | 4 | NULL | 7 | |
+----+-------------+-------+-------+---------------+---------+---------+------+------+-------+
1 row in set (0.00 sec)
[/mysql]

In fact, the Optimizer is so stubborn that even when we force the index it discards it if we don't use ORDER BY, because it knows that for a table scan the order of the rows is irrelevant and so it has no need for the index. There must be an explanation for this behaviour, which I'll have to look into, but it suits us very well: if the Optimizer picked whichever index looked good to it, it would be hard to get the rows in natural order without checking with an EXPLAIN first. I think the following is worth pointing out:
[mysql]
mysql> SELECT SQL_NO_CACHE * FROM teste FORCE INDEX(idx_int) ORDER BY _int;
+------+-------+
| _int | _char |
+------+-------+
| 1 | e |
| 1 | b |
| 1 | z |
| 2 | c |
| 2 | b |
| 2 | d |
| 3 | m |
+------+-------+
7 rows in set (0.00 sec)
[/mysql]

In other words, within the index, colliding (duplicate) keys are simply appended at the end. This means that, after sorting, the order in which we get the rows with equal keys is... the natural order.

Anyway, now we can assume, for the time being, that if we walk the table with SELECT ... LIMIT 1, we can fetch row by row either in natural order or in the order of whichever index we want. But my big problem was with deletion, since there is no EXPLAIN for DELETE. Which of the two methods does DELETE use?
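This behaviour is the same as that of a stable sort: rows with equal keys keep their relative (natural) order. A minimal sketch in Python, whose built-in sort is stable; it only mimics the observed result and says nothing about how MyISAM's B-tree is actually implemented:

```python
# Rows in insertion ("natural") order, as in the INSERT above.
rows = [(2, "c"), (1, "e"), (1, "b"), (1, "z"), (2, "b"), (2, "d"), (3, "m")]

# sorted() is stable: rows with the same _int keep their relative order,
# which mirrors duplicate keys being appended after existing equal keys.
by_index = sorted(rows, key=lambda row: row[0])

for _int, _char in by_index:
    print(_int, _char)
```

The printed sequence matches the result of the FORCE INDEX(idx_int) ... ORDER BY _int query shown above.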

[mysql]
mysql> DELETE FROM teste LIMIT 1;
Query OK, 1 row affected (0.00 sec)

mysql> SELECT SQL_NO_CACHE * FROM teste;
+------+-------+
| _int | _char |
+------+-------+
| 1 | e |
| 1 | b |
| 1 | z |
| 2 | b |
| 2 | d |
| 3 | m |
+------+-------+
6 rows in set (0.00 sec)
[/mysql]

Well, so far it looks like natural order. Of course, if I specify ORDER BY _int, the next row to be deleted should be (1,e), because it's the first in the idx_int index, and we know the value in _char will be the one given by natural order. What remains to be seen is whether the Optimizer thinks it needs a temporary table for that ordering. I'm convinced it doesn't: since no specific field is being selected, there's no reason not to use the idx_int index. Let's just confirm:
[mysql]
mysql> DELETE FROM teste ORDER BY _int LIMIT 1;
Query OK, 1 row affected (0.00 sec)

mysql> SELECT SQL_NO_CACHE * FROM teste;
+------+-------+
| _int | _char |
+------+-------+
| 1 | b |
| 1 | z |
| 2 | b |
| 2 | d |
| 3 | m |
+------+-------+
5 rows in set (0.00 sec)
[/mysql]

All good, as predicted. But there's more. Tables rarely have a single index, and our _char field could well be covered by an index, which would make things a bit different:
[mysql]
mysql> ALTER TABLE teste ADD KEY idx_char(_char);
Query OK, 5 rows affected (0.00 sec)
Records: 5 Duplicates: 0 Warnings: 0

mysql> EXPLAIN SELECT SQL_NO_CACHE * FROM teste;
+----+-------------+-------+------+---------------+------+---------+------+------+-------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+---------------+------+---------+------+------+-------+
| 1 | SIMPLE | teste | ALL | NULL | NULL | NULL | NULL | 5 | |
+----+-------------+-------+------+---------------+------+---------+------+------+-------+
1 row in set (0.00 sec)
[/mysql]

I find this interesting because it's as if the Optimizer, «when in doubt, picked none»: since it has no criterion to choose one index over the other, it uses neither.

Actually, the Optimizer uses no index because it has no hint as to which one to grab. As a rule, it will use the one with the highest cardinality, given the hints available to it:

[mysql]
mysql> OPTIMIZE TABLE teste;
+------------+----------+----------+----------+
| Table | Op | Msg_type | Msg_text |
+------------+----------+----------+----------+
| test.teste | OPTIMIZE | status | OK |
+------------+----------+----------+----------+
1 row in set (0.00 sec)

mysql> SHOW INDEXES FROM teste;
+-------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment |
+-------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| teste | 1 | idx_int | 1 | _int | A | 2 | NULL | NULL | | BTREE | |
| teste | 1 | idx_char | 1 | _char | A | 5 | NULL | NULL | | BTREE | |
+-------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
2 rows in set (0.00 sec)

mysql> EXPLAIN SELECT SQL_NO_CACHE * FROM teste WHERE _int = 1 AND _char = 'e';
+----+-------------+-------+------+------------------+----------+---------+-------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+------------------+----------+---------+-------+------+-------------+
| 1 | SIMPLE | teste | ref | idx_int,idx_char | idx_char | 7 | const | 1 | Using where |
+----+-------------+-------+------+------------------+----------+---------+-------+------+-------------+
1 row in set (0.00 sec)
[/mysql]

In other words, idx_char is used for filtering and, since it had the potential to filter out more rows, it's the index chosen by the Optimizer, which also tells us it will need to go through the data files to filter on the _int field (Using where).

I know that * is expanded to all the fields by the Optimizer; so what if there were a covering index?

[mysql]
mysql> ALTER TABLE teste ADD KEY idx_int_char(_int,_char);
Query OK, 5 rows affected (0.01 sec)
Records: 5 Duplicates: 0 Warnings: 0

mysql> EXPLAIN SELECT SQL_NO_CACHE * FROM teste;
+----+-------------+-------+-------+---------------+--------------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+---------------+--------------+---------+------+------+-------------+
| 1 | SIMPLE | teste | index | NULL | idx_int_char | 11 | NULL | 5 | Using index |
+----+-------------+-------+-------+---------------+--------------+---------+------+------+-------------+
1 row in set (0.00 sec)
[/mysql]

As expected, the index will be used. Let's see what happens with DELETE. To test, we force the rows to change position in the index (keeping the natural order) and create another similar table, with the same rows, because this order will matter further down:

[mysql]
mysql> UPDATE teste SET _int = 4 WHERE _int = 1 AND _char = 'b'; UPDATE teste SET _int = 5, _char = 'f' WHERE _int = 2 AND _char = 'b';

mysql> create table teste3 like teste;
Query OK, 0 rows affected (0.00 sec)

mysql> insert into teste3 select * from teste IGNORE INDEX(idx_int_char);
Query OK, 5 rows affected (0.00 sec)
Records: 5 Duplicates: 0 Warnings: 0

mysql> select * from teste3 IGNORE INDEX(idx_int_char);
+------+-------+
| _int | _char |
+------+-------+
| 4 | b |
| 1 | z |
| 5 | f |
| 2 | d |
| 3 | m |
+------+-------+
5 rows in set (0.00 sec)

mysql> DELETE FROM teste3 LIMIT 1;
Query OK, 1 row affected (0.00 sec)

mysql> select * from teste3 IGNORE INDEX(idx_int_char);
+------+-------+
| _int | _char |
+------+-------+
| 1 | z |
| 5 | f |
| 2 | d |
| 3 | m |
+------+-------+
4 rows in set (0.00 sec)
[/mysql]

In other words, with the SELECT before and after we can confirm that DELETE did not pick any index, not even the covering index.

But going back to the teste table: does using a PRIMARY KEY (which is unique and non-null) influence the results? This question will probably only be relevant to those who know InnoDB.

[mysql]
mysql> ALTER TABLE teste DROP KEY idx_int;
Query OK, 5 rows affected (0.01 sec)
Records: 5 Duplicates: 0 Warnings: 0

mysql> ALTER TABLE teste ADD PRIMARY KEY(_int);
Query OK, 5 rows affected (0.01 sec)
Records: 5 Duplicates: 0 Warnings: 0

mysql> EXPLAIN SELECT SQL_NO_CACHE * FROM teste;
+----+-------------+-------+------+---------------+------+---------+------+------+-------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+---------------+------+---------+------+------+-------+
| 1 | SIMPLE | teste | ALL | NULL | NULL | NULL | NULL | 5 | |
+----+-------------+-------+------+---------------+------+---------+------+------+-------+
1 row in set (0.00 sec)
[/mysql]

Very well, nothing to report.

But we also know what happens with MyISAM: after DELETEs, holes start to appear in the data files (which, consequently, end up interfering with concurrent inserts). Let's see. I recreated the teste table with the original dataset and ran the two DELETEs we've done so far. Then:

[mysql]
mysql> SELECT @@concurrent_insert;
+---------------------+
| @@concurrent_insert |
+---------------------+
| 1 |
+---------------------+
1 row in set (0.00 sec)

mysql> INSERT INTO teste VALUES (6,'e'),(5,'f'),(4,'g');
Query OK, 3 rows affected (0.00 sec)
Records: 3 Duplicates: 0 Warnings: 0

mysql> SELECT SQL_NO_CACHE * FROM teste;
+------+-------+
| _int | _char |
+------+-------+
| 5 | f |
| 6 | e |
| 1 | b |
| 1 | z |
| 2 | b |
| 2 | d |
| 3 | m |
| 4 | g |
+------+-------+
8 rows in set (0.00 sec)
[/mysql]

Notice that the first two rows ended up out of natural order, because MyISAM managed to recycle the free space, in exactly the same places. The third one landed in the only place where it fit: at the end. This is also an important observation, because if there are writes to the table during the operations, special care is needed. Making matters worse, we can't even turn this behaviour off. Recreate the teste table again, with the initial INSERT, then the two DELETEs done so far, and then the INSERT from the previous test:

[mysql]
mysql> SET global concurrent_insert = 0;
Query OK, 0 rows affected (0.00 sec)

mysql> INSERT INTO teste VALUES (6,'e'),(5,'f'),(4,'g');
Query OK, 3 rows affected (0.00 sec)
Records: 3 Duplicates: 0 Warnings: 0

mysql> SELECT SQL_NO_CACHE * FROM teste;
+------+-------+
| _int | _char |
+------+-------+
| 5 | f |
| 6 | e |
| 1 | b |
| 1 | z |
| 2 | b |
| 2 | d |
| 3 | m |
| 4 | g |
+------+-------+
8 rows in set (0.00 sec)
[/mysql]

Exactly the same. The natural order is therefore lost.

Note on InnoDB

As a side note, here is a brief explanation of why we don't analyse the InnoDB case:
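The row order seen in both runs is consistent with deleted slots being reused last-freed-first, with new rows appended only once no hole is left. The toy model below is an assumption made purely to illustrate that reading: ToyDataFile and its methods are hypothetical names, and this is not MyISAM's actual code (it also ignores row sizes, which happen to be equal here).

```python
class ToyDataFile:
    """Toy data file that recycles deleted slots (illustrative, not MyISAM code)."""

    def __init__(self):
        self.slots = []  # one row per slot, None where a row was deleted
        self.free = []   # stack of deleted slot positions (assumed LIFO reuse)

    def insert(self, row):
        if self.free:
            # Recycle the most recently freed slot.
            self.slots[self.free.pop()] = row
        else:
            # No hole left: append at the end of the file.
            self.slots.append(row)

    def delete(self, row):
        pos = self.slots.index(row)
        self.slots[pos] = None
        self.free.append(pos)

    def scan(self):
        """Table scan: rows in physical order, skipping holes."""
        return [row for row in self.slots if row is not None]


table = ToyDataFile()
for row in [(2, "c"), (1, "e"), (1, "b"), (1, "z"), (2, "b"), (2, "d"), (3, "m")]:
    table.insert(row)

table.delete((2, "c"))  # DELETE FROM teste LIMIT 1
table.delete((1, "e"))  # DELETE FROM teste ORDER BY _int LIMIT 1

for row in [(6, "e"), (5, "f"), (4, "g")]:
    table.insert(row)

for row in table.scan():
    print(row)
```

Under that assumption the scan reproduces the SELECT output above: (6,e) takes the slot freed last, (5,f) takes the one freed first, and (4,g) goes to the end.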

[mysql]
mysql> CREATE TABLE teste2 like teste;
Query OK, 0 rows affected (0.00 sec)

mysql> INSERT INTO teste2 SELECT * FROM teste;
Query OK, 5 rows affected (0.01 sec)
Records: 5 Duplicates: 0 Warnings: 0

mysql> UPDATE teste2 SET _char = 'k' WHERE _int = 1 and _char = 'b';
Query OK, 1 row affected (0.03 sec)
Rows matched: 1 Changed: 1 Warnings: 0

mysql> ALTER TABLE teste2 engine=INNODB, DROP INDEX idx_int, DROP INDEX idx_char, DROP INDEX idx_int_char;
Query OK, 5 rows affected (0.01 sec)
Records: 5 Duplicates: 0 Warnings: 0

mysql> alter table teste2 add primary key (_char);
Query OK, 5 rows affected (0.02 sec)
Records: 5 Duplicates: 0 Warnings: 0

mysql> EXPLAIN SELECT SQL_NO_CACHE * FROM teste2;
+----+-------------+--------+------+---------------+------+---------+------+------+-------+
| id | select_type | table  | type | possible_keys | key  | key_len | ref  | rows | Extra |
+----+-------------+--------+------+---------------+------+---------+------+------+-------+
|  1 | SIMPLE      | teste2 | ALL  | NULL          | NULL | NULL    | NULL |    5 |       |
+----+-------------+--------+------+---------------+------+---------+------+------+-------+
1 row in set (0.00 sec)

mysql> SELECT * FROM teste2;
+------+-------+
| _int | _char |
+------+-------+
|    2 | b     |
|    2 | d     |
|    1 | k     |
|    3 | m     |
|    1 | z     |
+------+-------+
5 rows in set (0.00 sec)
[/mysql]

What I wanted you to notice is that there is no indication that any key is being used and yet the rows come back ordered by the PRIMARY KEY. As I said at the very start, with InnoDB the results would be different, and this is due to this storage engine's clustered index, which stores the rows (physically, in the datafiles) in key order (the famous primary key order). As a consequence, it is not possible to access the rows in natural order (i.e., in the order in which they were inserted) - and, consistently with what we had seen for MyISAM, the Optimizer prefers not to use any index, passing the same expanded query down to the storage engine.
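A minimal sketch of why a full scan of a clustered table comes back in primary key order. It is a toy model, not InnoDB internals: it assumes the rows simply live in a list kept sorted by primary key, and the insertion order used here is arbitrary.

```python
import bisect


class ToyClusteredTable:
    """Toy table whose rows are physically kept sorted by primary key."""

    def __init__(self):
        self.rows = []  # (primary_key, other_column) tuples, always sorted

    def insert(self, pk, value):
        # The row is placed at its key position, not at the end.
        bisect.insort(self.rows, (pk, value))

    def scan(self):
        """Full scan without any secondary index: physical order."""
        return list(self.rows)


t = ToyClusteredTable()
# The teste2 rows as (_char, _int); this insertion order is made up.
for pk, value in [('z', 1), ('m', 3), ('k', 1), ('d', 2), ('b', 2)]:
    t.insert(pk, value)

print([pk for pk, _ in t.scan()])
```

Whatever the insertion order, the scan yields b, d, k, m, z, which is the order the SELECT on teste2 returned even though EXPLAIN showed no key in use.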

I think I have managed to show, with some confidence, that we can fetch rows with SELECT in a controlled way. Depending on the scenario, it is not always possible to use one method or the other (it may also not be possible to create a new index, for example), so before jumping to conclusions, the best thing is to use EXPLAIN to explore the possibilities!

Saturday, 22 August 2009

VARCHAR index size in InnoDB

Although my previous conclusions about VARCHAR influence on index size could be quite storage engine specific, I'd like to see if we can extend them to InnoDB, so I took the tables still lying on my disk and did:

[mysql]
mysql> alter table idx_varchar_big engine=innodb;
Query OK, 374706 rows affected (10.15 sec)
Records: 374706 Duplicates: 0 Warnings: 0

mysql> alter table idx_varchar_small engine=innodb;
Query OK, 374706 rows affected (10.56 sec)
Records: 374706 Duplicates: 0 Warnings: 0

mysql> alter table idx_varchar_mixed engine=innodb;
Query OK, 374706 rows affected (7.27 sec)
Records: 374706 Duplicates: 0 Warnings: 0

mysql> show table status;
+--------------------------+-----------+---------+------------+--------+----------------+-------------+-----------------+--------------+-----------+----------------+---------------------+---------------------+---------------------+-------------------+----------+----------------+---------+
| Name | Engine | Version | Row_format | Rows | Avg_row_length | Data_length | Max_data_length | Index_length | Data_free | Auto_increment | Create_time | Update_time | Check_time | Collation | Checksum | Create_options | Comment |
+--------------------------+-----------+---------+------------+--------+----------------+-------------+-----------------+--------------+-----------+----------------+---------------------+---------------------+---------------------+-------------------+----------+----------------+---------+
| idx_varchar_big | InnoDB | 10 | Compact | 375091 | 51 | 19447808 | 0 | 13172736 | 5242880 | NULL | 2009-08-22 16:43:50 | NULL | NULL | latin1_swedish_ci | NULL | | |
| idx_varchar_mixed | InnoDB | 10 | Compact | 375257 | 34 | 13123584 | 0 | 6832128 | 4194304 | NULL | 2009-08-22 16:44:31 | NULL | NULL | latin1_swedish_ci | NULL | | |
| idx_varchar_small | InnoDB | 10 | Compact | 375257 | 34 | 13123584 | 0 | 6832128 | 4194304 | NULL | 2009-08-22 16:44:08 | NULL | NULL | latin1_swedish_ci | NULL | | |
+--------------------------+-----------+---------+------------+--------+----------------+-------------+-----------------+--------------+-----------+----------------+---------------------+---------------------+---------------------+-------------------+----------+----------------+---------+
3 rows in set (0.01 sec)
[/mysql]

Apparently, the same initial conclusion applies to InnoDB (except for the rant on the packed index, which is MyISAM specific). Looking at the file sizes (innodb_file_per_table):

[code]
[root@speedy ~]# ls -la /var/lib/mysql/test/idx_varchar_{small,big,mixed}.ibd
-rw-rw---- 1 mysql mysql 41943040 Ago 22 16:43 /var/lib/mysql/test/idx_varchar_big.ibd
-rw-rw---- 1 mysql mysql 28311552 Ago 22 16:44 /var/lib/mysql/test/idx_varchar_mixed.ibd
-rw-rw---- 1 mysql mysql 28311552 Ago 22 16:44 /var/lib/mysql/test/idx_varchar_small.ibd
[/code]

Good to know.
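As a quick arithmetic check of the figures above, the sketch below compares the three tables using only the Data_length and Index_length values copied from the SHOW TABLE STATUS output; nothing else is assumed.

```python
# Bytes, copied from the SHOW TABLE STATUS output above.
status = {
    "idx_varchar_big":   {"data": 19447808, "index": 13172736},
    "idx_varchar_mixed": {"data": 13123584, "index": 6832128},
    "idx_varchar_small": {"data": 13123584, "index": 6832128},
}

# A VARCHAR(20) column holding 5-character values costs the same as VARCHAR(5).
same_as_small = status["idx_varchar_mixed"] == status["idx_varchar_small"]

# How much bigger the table with 20-character values is.
data_ratio = status["idx_varchar_big"]["data"] / status["idx_varchar_small"]["data"]
index_ratio = status["idx_varchar_big"]["index"] / status["idx_varchar_small"]["index"]

print(f"mixed == small: {same_as_small}")
print(f"big/small data ratio:  {data_ratio:.2f}")
print(f"big/small index ratio: {index_ratio:.2f}")
```

The mixed and small tables are byte-for-byte equal in both columns, so only the stored data, not the declared VARCHAR length, moves the numbers.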

Thursday, 20 August 2009

The bigger smaller than the smaller one

I was trying to determine whether the storage size of a VARCHAR field in MySQL has any fixed influence on the key size. I created a few tables, and something interesting came up, as you will see. Let's create the test tables:

[mysql]
create table idx_varchar_size ( a varchar(5) not null, b varchar(20) not null ) ENGINE=MyISAM;

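-- b is VARCHAR(20): in non-strict SQL mode the 40-character value is truncated to 20 characters (with a warning)
insert into idx_varchar_size values ('abcef','1234567890123456789012345678901234567890');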
[/mysql]

I did this a couple of times:
[mysql]
insert into idx_varchar_size select * from idx_varchar_size;
[/mysql]

I then used this table as the source data for the test tables:

[mysql]
mysql> create table idx_varchar_mixed ( a varchar(20) not null, key idx_big(a) ) ENGINE=MyISAM;
Query OK, 0 rows affected (0.01 sec)

mysql> create table idx_varchar_big ( a varchar(20) not null, key idx_big(a) ) ENGINE=MyISAM;
Query OK, 0 rows affected (0.00 sec)

mysql> create table idx_varchar_small ( a varchar(5) not null, key idx_small(a) ) ENGINE=MyISAM;
Query OK, 0 rows affected (0.01 sec)

mysql> insert into idx_varchar_small select a from idx_varchar_size ;
Query OK, 374706 rows affected (2.04 sec)
Records: 374706 Duplicates: 0 Warnings: 0

mysql> insert into idx_varchar_big select b from idx_varchar_size ;
Query OK, 374706 rows affected (3.38 sec)
Records: 374706 Duplicates: 0 Warnings: 0

mysql> insert into idx_varchar_mixed select a from idx_varchar_size ;
Query OK, 374706 rows affected (3.06 sec)
Records: 374706 Duplicates: 0 Warnings: 0
[/mysql]

So I've created a small dataset, a "big" dataset, and a "big" schema holding a small dataset. Let's see the output of SHOW TABLE STATUS:

[mysql]
mysql> show table status;
+-------------------+--------+---------+------------+--------+----------------+-------------+-----------------+--------------+-----------+----------------+---------------------+---------------------+---------------------+-------------------+----------+----------------+---------+
| Name | Engine | Version | Row_format | Rows | Avg_row_length | Data_length | Max_data_length | Index_length | Data_free | Auto_increment | Create_time | Update_time | Check_time | Collation | Checksum | Create_options | Comment |
+-------------------+--------+---------+------------+--------+----------------+-------------+-----------------+--------------+-----------+----------------+---------------------+---------------------+---------------------+-------------------+----------+----------------+---------+
| idx_varchar_big | MyISAM | 10 | Dynamic | 374706 | 24 | 8992944 | 281474976710655 | 818176 | 0 | NULL | 2009-08-20 14:33:52 | 2009-08-20 14:34:07 | 2009-08-20 14:34:09 | latin1_swedish_ci | NULL | | |
| idx_varchar_mixed | MyISAM | 10 | Dynamic | 374706 | 20 | 7494120 | 281474976710655 | 798720 | 0 | NULL | 2009-08-20 14:32:28 | 2009-08-20 14:34:33 | 2009-08-20 14:34:35 | latin1_swedish_ci | NULL | | |
| idx_varchar_size | MyISAM | 10 | Dynamic | 374706 | 32 | 11990592 | 281474976710655 | 5514240 | 0 | NULL | 2009-08-20 13:02:49 | 2009-08-20 13:06:23 | NULL | latin1_swedish_ci | NULL | | |
| idx_varchar_small | MyISAM | 10 | Dynamic | 374706 | 20 | 7494120 | 281474976710655 | 4599808 | 0 | NULL | 2009-08-20 14:32:40 | 2009-08-20 14:32:53 | 2009-08-20 14:32:54 | latin1_swedish_ci | NULL | | |
+-------------------+--------+---------+------------+--------+----------------+-------------+-----------------+--------------+-----------+----------------+---------------------+---------------------+---------------------+-------------------+----------+----------------+---------+
4 rows in set (0.00 sec)
[/mysql]

Well, this is odd. My small table has an Index_length a lot bigger (more than 5 times) than my "big" dataset? Maybe it's a SHOW TABLE STATUS bug, so let's see how much space the index files actually use.

[code]
[root@speedy test]# ls -la /var/lib/mysql/test/idx_varchar_{big,mixed,small}*.MYI
-rw-rw---- 1 mysql mysql 818176 Ago 20 14:34 /var/lib/mysql/test/idx_varchar_big.MYI
-rw-rw---- 1 mysql mysql 798720 Ago 20 14:34 /var/lib/mysql/test/idx_varchar_mixed.MYI
-rw-rw---- 1 mysql mysql 4599808 Ago 20 14:32 /var/lib/mysql/test/idx_varchar_small.MYI
[/code]

Nope. It's true. After some quick thinking, I remembered that MySQL can pack keys, and this looks like the benefit of packed indexes. Let's make a simple comparison with an explicitly packed key:

[mysql]
mysql> create table idx_varchar_small_packed ( a varchar(5) not null, key idx_small(a) ) ENGINE=MyISAM PACK_KEYS=1;
Query OK, 0 rows affected (0.01 sec)

mysql> insert into idx_varchar_small_packed select a from idx_varchar_size ;
Query OK, 374706 rows affected (2.04 sec)
Records: 374706 Duplicates: 0 Warnings: 0
[/mysql]

[code]
[root@speedy test]# ls -la /var/lib/mysql/test/idx_varchar_small_packed.MYI
-rw-rw---- 1 mysql mysql 798720 Ago 20 18:14 /var/lib/mysql/test/idx_varchar_small_packed.MYI
[/code]

Indeed, it does - it's the same size as idx_varchar_mixed. But it already seems to answer our initial question: the declared VARCHAR size won't influence the key size unnecessarily (compare idx_varchar_mixed with idx_varchar_small_packed).
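To get a feel for what packing saves per row, here is a rough calculation from the .MYI sizes listed above. It is only an estimate: it divides the whole file size by the row count, so the file header and block overhead are included in the per-row figure.

```python
ROWS = 374706  # rows inserted into each test table

# .MYI file sizes in bytes, copied from the ls outputs above.
myi_sizes = {
    "idx_varchar_small": 4599808,         # VARCHAR(5), not packed by default
    "idx_varchar_small_packed": 798720,   # VARCHAR(5), PACK_KEYS=1
    "idx_varchar_mixed": 798720,          # VARCHAR(20) holding 5-char values
    "idx_varchar_big": 818176,            # VARCHAR(20) holding 20-char values
}

bytes_per_row = {name: size / ROWS for name, size in myi_sizes.items()}

for name, value in bytes_per_row.items():
    print(f"{name}: {value:.2f} index bytes per row")
```

The unpacked index costs roughly 12 bytes per row, while every packed variant sits a little above 2 bytes per row.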

But now I got curious about the smaller size of the key for the bigger dataset. Is it reasonable to assume that the MyISAM storage engine, when PACK_KEYS is not specified, auto-selects a minimum VARCHAR length above which it considers packing worthwhile? The documentation makes no reference to such a threshold:

PACK_KEYS takes effect only with MyISAM tables. Set this option to 1 if you want to have smaller indexes. This usually makes updates slower and reads faster. Setting the option to 0 disables all packing of keys. Setting it to DEFAULT tells the storage engine to pack only long CHAR, VARCHAR, BINARY, or VARBINARY columns.

[mysql]
create table idx_varchar_big2 ( a varchar(20) not null, key idx_big(a) ) ENGINE=MyISAM PACK_KEYS=1;
Query OK, 0 rows affected (0.01 sec)

insert into idx_varchar_big2 select * from idx_varchar_big ;
Query OK, 374706 rows affected, 65535 warnings (2.27 sec)
Records: 374706 Duplicates: 0 Warnings: 0
[/mysql]

[code]
[root@speedy test]# ls -la /var/lib/mysql/test/idx_varchar_{big,mixed,small,vbig}*.MYI
-rw-rw---- 1 mysql mysql 818176 Ago 20 18:52 /var/lib/mysql/test/idx_varchar_big2.MYI
-rw-rw---- 1 mysql mysql 818176 Ago 20 14:34 /var/lib/mysql/test/idx_varchar_big.MYI
[/code]

So they match to the very byte, and now I want to know what that 'minimum' is! I'll create many tables (using a simple mental algorithm to speed things up) until the packed index shows up. I'll also use the 'b' field from idx_varchar_size to fill each test table's column completely, forcing the key to be bigger (see what happened otherwise with idx_varchar_mixed!), so ignore the warnings after the INSERT INTO. I eventually arrived at the split value:

[mysql]
create table idx_varchar_small7 ( a varchar(7) not null, key idx_verybig(a) ) ENGINE=MyISAM;
Query OK, 0 rows affected (0.01 sec)

create table idx_varchar_small8 ( a varchar(8) not null, key idx_verybig(a) ) ENGINE=MyISAM;
Query OK, 0 rows affected (0.01 sec)
[/mysql]

[mysql]
mysql> insert into idx_varchar_small7 select b from idx_varchar_size;
Query OK, 374706 rows affected, 65535 warnings (2.25 sec)
Records: 374706 Duplicates: 0 Warnings: 374706

mysql> insert into idx_varchar_small8 select b from idx_varchar_size;
Query OK, 374706 rows affected, 65535 warnings (2.34 sec)
Records: 374706 Duplicates: 0 Warnings: 374706
[/mysql]

[code]
[root@speedy test]# ls -la /var/lib/mysql/test/idx_varchar_small?.MYI
-rw-rw---- 1 mysql mysql 5366784 Ago 20 20:07 /var/lib/mysql/test/idx_varchar_small7.MYI
-rw-rw---- 1 mysql mysql 801792 Ago 20 20:08 /var/lib/mysql/test/idx_varchar_small8.MYI
[/code]

I really suspect this value is some kind of measure of efficiency.

I'll postpone some calculations on this (keep reading), because a question just popped up about our conclusion to the initial question: does the declared size of a VARCHAR field in MySQL have any fixed influence on the size of a key that is NOT packed?

To answer that, let's create the 'small' and 'big' tables with explicit PACK_KEYS=0:
[mysql]
mysql> create table idx_varchar_small_nopack ( a varchar(5) not null, key idx_small(a) ) ENGINE=MyISAM PACK_KEYS=0;
Query OK, 0 rows affected (0.01 sec)

mysql> create table idx_varchar_mixed_nopack ( a varchar(20) not null, key idx_small(a) ) ENGINE=MyISAM PACK_KEYS=0;
Query OK, 0 rows affected (0.02 sec)

mysql> insert into idx_varchar_small_nopack select a from idx_varchar_size;
Query OK, 374706 rows affected (2.47 sec)
Records: 374706 Duplicates: 0 Warnings: 0

mysql> insert into idx_varchar_mixed_nopack select a from idx_varchar_size;
Query OK, 374706 rows affected (3.20 sec)
Records: 374706 Duplicates: 0 Warnings: 0
[/mysql]

[code]
[root@speedy ~]# ls -la /var/lib/mysql/test/idx_varchar*{nopack,small,mixed}.MYI
-rw-rw---- 1 mysql mysql 798720 Ago 20 14:34 /var/lib/mysql/test/idx_varchar_mixed.MYI
-rw-rw---- 1 mysql mysql 4599808 Ago 21 00:23 /var/lib/mysql/test/idx_varchar_mixed_nopack.MYI
-rw-rw---- 1 mysql mysql 4599808 Ago 20 14:32 /var/lib/mysql/test/idx_varchar_small.MYI
-rw-rw---- 1 mysql mysql 4599808 Ago 21 00:22 /var/lib/mysql/test/idx_varchar_small_nopack.MYI
[/code]

These new 'nopack' tables are indeed of the same size, so it's safe to say:

VARCHAR size won't influence the key size unnecessarily

The efficiency of a packed key entry

The VARCHAR index entry is typically like this:

1 (number of bytes of same prefix) + N (bytes of different suffix) + 4 (record pointer size)

Remember that my datasets were Latin1, so each character maps to a single byte. Now 8-5=3. If we had packed keys for a VARCHAR(6) and the data was more or less sequential from record to record (like rows generated by nested loops, such as "aaa", "aab", "aac", etc), thus unique but highly "packable", the sum of bytes would be something like this:

1 (number of bytes of same prefix) + 1 (bytes of different suffix) + 4 (record pointer size) = 6

This packed index entry size matches the length of the VARCHAR; anything below 6 would spend the first byte on the prefix for no gain at all, right? Which means that, to be worthwhile, PACK_KEYS should be applied to VARCHARs bigger than 6. Assuming that there is a bytecode separator (High Performance MySQL uses an analogy with a comma, like in "5,a" or "5,b"), we can shift that rule to:

To be worthwhile, PACK_KEYS should be applied to VARCHARs bigger than 7!
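The entry-size formula above can be written down as a small function. This is a sketch of the simplified model used in this post (one prefix-length byte, the differing suffix, a 4-byte record pointer), not the real MyISAM key format; the function names are made up for the illustration.

```python
def common_prefix_len(a: bytes, b: bytes) -> int:
    """Number of leading bytes that two keys share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def packed_entry_size(prev_key: bytes, key: bytes, pointer_size: int = 4) -> int:
    """Entry size under the post's model: 1 + differing suffix + record pointer."""
    suffix_len = len(key) - common_prefix_len(prev_key, key)
    return 1 + suffix_len + pointer_size


# Sequential 6-character keys share all but the last byte.
print(packed_entry_size(b"aaaaaa", b"aaaaab"))  # 1 + 1 + 4

# Fully repeated keys, as in my test tables, leave no suffix at all.
print(packed_entry_size(b"12345678", b"12345678"))  # 1 + 0 + 4
```

With a 4-byte pointer the model gives 6 bytes for the sequential VARCHAR(6) case, matching the sum above, and 5 bytes when consecutive keys are identical.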

Now we should confirm there is indeed a separator. I thought of using a hex viewer to see if I could come up with a pattern. The best option would be to look at the MYI specification (either in the MySQL source code or the MySQL Documentation/Wiki):

[html]
00000000 fe fe 07 01 00 03 01 4d 00 b0 00 64 00 c4 00 01 |.......M...d....|
00000010 00 00 01 00 08 01 00 00 00 00 30 ff 00 00 00 00 |..........0.....|
[... typical file format heading, some 00 and ff ...]
00000400 03 fd 00 08 31 32 33 34 35 36 37 38 00 00 00 00 |....12345678....|
00000410 00 00 0e 14 0e 28 0e 3c 0e 50 0e 64 0e 78 0e 8c |.....(.<.P.d.x..|
00000420 0e a0 0e b4 0e c8 0e dc 0e f0 0d 01 04 0e 18 0e |................|
00000430 2c 0e 40 0e 54 0e 68 0e 7c 0e 90 0e a4 0e b8 0e |,.@.T.h.|.......|
00000440 cc 0e e0 0e f4 0d 02 08 0e 1c 0e 30 0e 44 0e 58 |...........0.D.X|
00000450 0e 6c 0e 80 0e 94 0e a8 0e bc 0e d0 0e e4 0e f8 |.l..............|
[...]
000c3300 0e 80 0e 94 0e a8 0e bc 0e d0 0e e4 0e f8 0d 59 |...............Y|
000c3310 0c 0e 20 0e 34 0e 48 0e 5c 0e 70 0e 84 0e 98 0e |.. .4.H.\.p.....|
000c3320 ac 0e c0 0e d4 00 00 00 00 00 00 00 00 00 00 00 |................|
000c3330 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
000c3400 80 67 00 00 00 00 03 02 00 08 31 32 33 34 35 36 |.g........123456|
000c3410 37 38 00 00 00 71 0d 18 00 00 00 00 03 04 0d 32 |78...q.........2|
[... nonsense data (to me) that could be a file format footer/terminator ...]
000c3860 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
000c3c00
[/html]

Notice the data is shown in plain text ('12345678'). The [...] indicates content that shows the same pattern (sorry, I can't highlight it). Actually, it seems to repeat at intervals of 0x400 (1024) bytes (this value definitely has a name and a reason). Now:

The "number of bytes of same prefix" should be a constant, and that's probably the value that causes the pattern. I can't explain why it's 0x0e and not 0x08, but that's definitely overhead for some reason.

My different suffix is "", since the data is repeated, which is 0 bytes (non-existent) in "bytes of different suffix" above.

Moreover, our data is stored adjacently (the order of "referred rows" in the index is the same as in the data file), and we know that PACK_KEYS can also pack the "record pointer size" above. The offset to the next record (in the data file) is almost certainly smaller than 256 - even with some MYD format overhead, the only field is 8 chars/bytes long - so it can easily fit in a single byte, which is how an intelligent packing of the pointer should be done.

Therefore, I suspect that a packed key row would take 1+1=2 bytes, forming a kind of pattern in which the first byte is a constant value and the second an evenly increasing value - although there isn't an "index entry separator". And that's just what we see in the hexdump above: the second byte increases by 0x14 (20 decimal) each time, which also suggests that each record (of 8 bytes) in the data file is actually stored in 20-byte blocks.
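The pattern can be checked mechanically. The sketch below takes the 24 bytes found at offsets 0x412 to 0x429 of the hexdump above, splits them into 2-byte entries and looks at how the second byte evolves. Reading each pair as one index entry is my interpretation, not something taken from the MYI specification.

```python
# Bytes copied from offsets 0x412-0x429 of the hexdump above.
raw = bytes.fromhex(
    "0e14 0e28 0e3c 0e50 0e64 0e78 0e8c 0ea0 0eb4 0ec8 0edc 0ef0"
)

# Split into (first byte, second byte) pairs, one per presumed index entry.
pairs = [(raw[i], raw[i + 1]) for i in range(0, len(raw), 2)]

first_bytes = {first for first, _ in pairs}
second_bytes = [second for _, second in pairs]
steps = [b - a for a, b in zip(second_bytes, second_bytes[1:])]

print("distinct first bytes:", [hex(b) for b in sorted(first_bytes)])
print("steps between second bytes:", steps)
```

The first byte is always 0x0e and every step is 20, so twelve consecutive entries fit the 2-bytes-per-key reading; the next bytes in the dump (0d 01 04) break the pair layout right where the value would no longer fit in one byte.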

Of course I should back up some of these assumptions by looking at the specs, but it seems pretty obvious, and I just wanted to test what I had worked out minutes before. Tests were done with MySQL 5.1.37 on an x86. :-)

Sunday, 9 August 2009

The importance of Wikipedia as a source of data and not [so much] of information

As soon as I started getting proficient on Wikipedia, I could not help lamenting the waste, in terms of effort, of creating articles in running text from data in raw form - there was, apparently, no real way around it. Indeed, Wikipedia articles are shaped by intrinsic relations between data on a given subject, digested into a given language so that they reach us in the form of information, which makes them more or less eloquent, less raw, but also less isolated and less reusable. For example, IIRC Jorge, one of the pioneers of the Portuguese-language Wikipedia, put an immense effort into creating the Freguesias and Municípios of Portugal as small, succinct articles with as much Portuguese prose as could be generated from a few INE data points. The problem is that the years would go by and there would be no way to update this information other than doing it manually, one by one, because in the meantime someone would have changed the format of the prose. Later, in the project to create the Brazilian municipalities, led IIRC by E2m, someone must have realised this difficulty, and so articles with horrible markup appeared (example), probably to feed bots that would parse the data and do the replacement. But in this case, as someone grumbled months later, editing became dreadful, especially for newcomers, who were already timid about editing and ran away when they saw that markup!

It would take me only 6 months to learn to work with bots and to understand the usefulness of templates - to the point that I was known as the template nutcase [sorry for not providing references, but I would have to dig for them in the early days of my thousands of edits...] - to convince myself that "since we are spending time on this, let's do it in a structured way, getting closer to the language of machines, without harming editing", and I threw myself into doing exactly that: resurrecting the freguesias and municípios with structured data.

With that task finished, it was time to start creating articles based on the structured information while keeping it available (in fact, there were series of articles that were actually built with templates and, in a final pass, instantiated with subst:). But the structured information would now persist and, even if it did not appear in the running text, it would always be accessible (and easily updatable) in the infoboxes - all it takes is running a bot with a simple search & replace for updated data.

I believe that today, perhaps more because it has become the aesthetic norm (people, without meaning to, got used to these infoboxes) than because of the technological benefits, few dare to write any article of this kind (the kind that relies on structured data to build information) without a template: we have Cities, Animals (always tricky because of the several classification schemes, but anyway..), Asteroids, etc.

But why all this? Because today I discovered a most interesting project: DBpedia, which, according to the vision of Tim Berners-Lee, the inventor of the World Wide Web, is the first step towards what he calls Linked Data: we have reached a point where the interrelations between pieces of information are more than established - but what about the interrelations between data? The funny thing is that several of us think this way: OK, a web page does contain information, but how can we use it outside the context of that page - and in large quantities? Are those data - and the effort of publishing them - doomed to be just that: useless to third parties? Extracting information from pages of multiple unstructured sources is virtually impossible (changing a comma or a text colour may be enough to break the parsing), and forcing everyone who wants to use the information to build their own extraction mechanisms strikes me as a gigantic waste of resources.. indeed, one of the uses envisioned for XML/XSL was that it would sooner or later replace HTML, but it seems that will never happen.

So what Tim Berners-Lee proposes is that the dissemination of information be complemented with the raw data that generated it - or that it be made available in a way that lets that data be reused. And this is particularly important at a time when there are so many communities generating content - it is curious how we went from human labour to the PC and evolved to distributed, scalable architectures, and from those to distributed platforms in which the human factor can also (once again) be a generator of substance on a much, much larger scale... but that is another post, for another day..

I leave you with this interesting article about the Semantic Web, which lays out several kinds of data relationships that can be obtained from the web semantically, and how they are being (or may come to be) used:

It is worth a look, especially for those who, like me, think we live in one hell of an era in which anything can happen, including a

Web [in which computers] become capable of analyzing all the data on the Web

Tim Berners-Lee, 1999

Friday, 7 August 2009

Google Translator Toolkit

Translating articles from other Wikipedias into the Portuguese-language Wikipedia is a common way of at least giving articles a head start with some content. I have just learned about the Google Translator Toolkit which, although it proposes simplistic translations, will surely suit many editors: it makes it possible to review and touch up the translation (which is what we all know it to be) in dual view, besides integrating a quick-access dictionary.

But what is fantastic is that Google proposes to use the corrections to improve its own translation engine. One more brilliant example of how communities can generate added value for projects, contrary to the traditional view. Here is the demo video.

I have not tested it yet, but it is in the pipeline :)