[mysql]
create table idx_varchar_size ( a varchar(5) not null, b varchar(20) not null ) ENGINE=MyISAM;
insert into idx_varchar_size values ('abcef','1234567890123456789012345678901234567890');
[/mysql]
I did this a couple of times:
[mysql]
insert into idx_varchar_size select * from idx_varchar_size;
[/mysql]
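For reference, the same growth step could be wrapped in a small stored procedure instead of pressing the up-arrow a few times. This is just a sketch of mine (the procedure name and the target count are made up, not from the original session):
[mysql]
DELIMITER //
CREATE PROCEDURE grow_source(IN target INT)
BEGIN
  DECLARE n INT;
  SELECT COUNT(*) INTO n FROM idx_varchar_size;
  -- Keep doubling the table by selecting it into itself until we pass `target` rows.
  WHILE n < target DO
    INSERT INTO idx_varchar_size SELECT * FROM idx_varchar_size;
    SET n = n * 2;
  END WHILE;
END //
DELIMITER ;
CALL grow_source(300000);
[/mysql]
Either way, the table ends up with a few hundred thousand rows; mine had 374,706, as the inserts below confirm.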
I then used this table as the source of data to feed into the test tables:
[mysql]
mysql> create table idx_varchar_mixed ( a varchar(20) not null, key idx_big(a) ) ENGINE=MyISAM;
Query OK, 0 rows affected (0.01 sec)
mysql> create table idx_varchar_big ( a varchar(20) not null, key idx_big(a) ) ENGINE=MyISAM;
Query OK, 0 rows affected (0.00 sec)
mysql> create table idx_varchar_small ( a varchar(5) not null, key idx_small(a) ) ENGINE=MyISAM;
Query OK, 0 rows affected (0.01 sec)
mysql> insert into idx_varchar_small select a from idx_varchar_size ;
Query OK, 374706 rows affected (2.04 sec)
Records: 374706 Duplicates: 0 Warnings: 0
mysql> insert into idx_varchar_big select b from idx_varchar_size ;
Query OK, 374706 rows affected (3.38 sec)
Records: 374706 Duplicates: 0 Warnings: 0
mysql> insert into idx_varchar_mixed select a from idx_varchar_size ;
Query OK, 374706 rows affected (3.06 sec)
Records: 374706 Duplicates: 0 Warnings: 0
[/mysql]
So I've created a small dataset, a "big" dataset, and a "big" schema holding a small dataset. Let's see the output of SHOW TABLE STATUS:
[mysql]
mysql> show table status;
+-------------------+--------+---------+------------+--------+----------------+-------------+-----------------+--------------+-----------+----------------+---------------------+---------------------+---------------------+-------------------+----------+----------------+---------+
| Name | Engine | Version | Row_format | Rows | Avg_row_length | Data_length | Max_data_length | Index_length | Data_free | Auto_increment | Create_time | Update_time | Check_time | Collation | Checksum | Create_options | Comment |
+-------------------+--------+---------+------------+--------+----------------+-------------+-----------------+--------------+-----------+----------------+---------------------+---------------------+---------------------+-------------------+----------+----------------+---------+
| idx_varchar_big | MyISAM | 10 | Dynamic | 374706 | 24 | 8992944 | 281474976710655 | 818176 | 0 | NULL | 2009-08-20 14:33:52 | 2009-08-20 14:34:07 | 2009-08-20 14:34:09 | latin1_swedish_ci | NULL | | |
| idx_varchar_mixed | MyISAM | 10 | Dynamic | 374706 | 20 | 7494120 | 281474976710655 | 798720 | 0 | NULL | 2009-08-20 14:32:28 | 2009-08-20 14:34:33 | 2009-08-20 14:34:35 | latin1_swedish_ci | NULL | | |
| idx_varchar_size | MyISAM | 10 | Dynamic | 374706 | 32 | 11990592 | 281474976710655 | 5514240 | 0 | NULL | 2009-08-20 13:02:49 | 2009-08-20 13:06:23 | NULL | latin1_swedish_ci | NULL | | |
| idx_varchar_small | MyISAM | 10 | Dynamic | 374706 | 20 | 7494120 | 281474976710655 | 4599808 | 0 | NULL | 2009-08-20 14:32:40 | 2009-08-20 14:32:53 | 2009-08-20 14:32:54 | latin1_swedish_ci | NULL | | |
+-------------------+--------+---------+------------+--------+----------------+-------------+-----------------+--------------+-----------+----------------+---------------------+---------------------+---------------------+-------------------+----------+----------------+---------+
4 rows in set (0.00 sec)
[/mysql]
Well, this is odd. My small table has an Index_length more than five times bigger than my ''big'' dataset's?? Maybe it's a SHOW TABLE STATUS bug; let's see how much space the index files actually use.
[code]
[root@speedy test]# ls -la /var/lib/mysql/test/idx_varchar_{big,mixed,small}*.MYI
-rw-rw---- 1 mysql mysql 818176 Ago 20 14:34 /var/lib/mysql/test/idx_varchar_big.MYI
-rw-rw---- 1 mysql mysql 798720 Ago 20 14:34 /var/lib/mysql/test/idx_varchar_mixed.MYI
-rw-rw---- 1 mysql mysql 4599808 Ago 20 14:32 /var/lib/mysql/test/idx_varchar_small.MYI
[/code]
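By the way, the same two columns can be pulled from information_schema with much less horizontal scrolling; this is just an aside of mine, not something from the original session:
[mysql]
SELECT TABLE_NAME, DATA_LENGTH, INDEX_LENGTH
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = 'test' AND TABLE_NAME LIKE 'idx_varchar%';
[/mysql]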
Nope, it's true. After a moment's thought I remembered that MySQL can pack keys, and this looks exactly like the benefit of packed indexes. Let's make a simple comparison against an explicitly packed key:
[mysql]
mysql> create table idx_varchar_small_packed ( a varchar(5) not null, key idx_small(a) ) ENGINE=MyISAM PACK_KEYS=1;
Query OK, 0 rows affected (0.01 sec)
mysql> insert into idx_varchar_small_packed select a from idx_varchar_size ;
Query OK, 374706 rows affected (2.04 sec)
Records: 374706 Duplicates: 0 Warnings: 0
[/mysql]
[code]
[root@speedy test]# ls -la /var/lib/mysql/test/idx_varchar_small_packed.MYI
-rw-rw---- 1 mysql mysql 798720 Ago 20 18:14 /var/lib/mysql/test/idx_varchar_small_packed.MYI
[/code]
Indeed, it matches: it's exactly the same size as idx_varchar_mixed. And that already seems to answer our initial question: VARCHAR size won't influence the key size unnecessarily (compare idx_varchar_mixed with idx_varchar_small_packed).
But now I'm curious about why the key for the bigger dataset is smaller. Is it feasible to assume that, when PACK_KEYS is not specified, the MyISAM storage engine auto-selects a minimum VARCHAR length above which it thinks packing is worthwhile? The documentation doesn't say what "long" means exactly:
PACK_KEYS takes effect only with MyISAM tables. Set this option to 1 if you want to have smaller indexes. This usually makes updates slower and reads faster. Setting the option to 0 disables all packing of keys. Setting it to DEFAULT tells the storage engine to pack only long CHAR, VARCHAR, BINARY, or VARBINARY columns.
[mysql]
create table idx_varchar_big2 ( a varchar(20) not null, key idx_big(a) ) ENGINE=MyISAM PACK_KEYS=1;
Query OK, 0 rows affected (0.01 sec)
insert into idx_varchar_big2 select * from idx_varchar_big ;
Query OK, 374706 rows affected, 65535 warnings (2.27 sec)
Records: 374706 Duplicates: 0 Warnings: 0
[/mysql]
[code]
[root@speedy test]# ls -la /var/lib/mysql/test/idx_varchar_{big,mixed,small,vbig}*.MYI
-rw-rw---- 1 mysql mysql 818176 Ago 20 18:52 /var/lib/mysql/test/idx_varchar_big2.MYI
-rw-rw---- 1 mysql mysql 818176 Ago 20 14:34 /var/lib/mysql/test/idx_varchar_big.MYI
[/code]
So they match down to the very same byte, and now I want to know what that 'minimum' is! I'll keep creating tables (using a simple mental algorithm to speed things up) until the packed index pops up. I'll also use the 'b' field from idx_varchar_size to fill each test table's column completely, forcing the key to be as big as possible (see what otherwise happened with idx_varchar_mixed!), so ignore the warnings after the INSERT INTO. I eventually arrived at the split value:
[mysql]
create table idx_varchar_small7 ( a varchar(7) not null, key idx_verybig(a) ) ENGINE=MyISAM;
Query OK, 0 rows affected (0.01 sec)
create table idx_varchar_small8 ( a varchar(8) not null, key idx_verybig(a) ) ENGINE=MyISAM;
Query OK, 0 rows affected (0.01 sec)
[/mysql]
[mysql]
mysql> insert into idx_varchar_small7 select b from idx_varchar_size;
Query OK, 374706 rows affected, 65535 warnings (2.25 sec)
Records: 374706 Duplicates: 0 Warnings: 374706
mysql> insert into idx_varchar_small8 select b from idx_varchar_size;
Query OK, 374706 rows affected, 65535 warnings (2.34 sec)
Records: 374706 Duplicates: 0 Warnings: 374706
[/mysql]
[code]
[root@speedy test]# ls -la /var/lib/mysql/test/idx_varchar_small?.MYI
-rw-rw---- 1 mysql mysql 5366784 Ago 20 20:07 /var/lib/mysql/test/idx_varchar_small7.MYI
-rw-rw---- 1 mysql mysql 801792 Ago 20 20:08 /var/lib/mysql/test/idx_varchar_small8.MYI
[/code]
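A quick back-of-the-envelope division of mine (index file size over row count) gives a rough per-entry cost for each variant, index block and tree overhead included:
[mysql]
SELECT 5366784 / 374706 AS bytes_per_entry_small7,
        801792 / 374706 AS bytes_per_entry_small8;
[/mysql]
That's roughly 14.3 bytes per entry for the unpacked VARCHAR(7) index against about 2.1 for the packed VARCHAR(8) one.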
I really suspect this value is some kind of measure of efficiency.
I'll postpone some calculations on this (keep reading), because a question just popped up about our conclusion to the initial question: does the size of a VARCHAR field in MySQL have any fixed influence on a NOT packed key size?
To answer that, let's create the 'small' and 'big' tables with explicit PACK_KEYS=0:
[mysql]
mysql> create table idx_varchar_small_nopack ( a varchar(5) not null, key idx_small(a) ) ENGINE=MyISAM PACK_KEYS=0;
Query OK, 0 rows affected (0.01 sec)
mysql> create table idx_varchar_mixed_nopack ( a varchar(20) not null, key idx_small(a) ) ENGINE=MyISAM PACK_KEYS=0;
Query OK, 0 rows affected (0.02 sec)
mysql> insert into idx_varchar_small_nopack select a from idx_varchar_size;
Query OK, 374706 rows affected (2.47 sec)
Records: 374706 Duplicates: 0 Warnings: 0
mysql> insert into idx_varchar_mixed_nopack select a from idx_varchar_size;
Query OK, 374706 rows affected (3.20 sec)
Records: 374706 Duplicates: 0 Warnings: 0
[/mysql]
[code]
[root@speedy ~]# ls -la /var/lib/mysql/test/idx_varchar*{nopack,small,mixed}.MYI
-rw-rw---- 1 mysql mysql 798720 Ago 20 14:34 /var/lib/mysql/test/idx_varchar_mixed.MYI
-rw-rw---- 1 mysql mysql 4599808 Ago 21 00:23 /var/lib/mysql/test/idx_varchar_mixed_nopack.MYI
-rw-rw---- 1 mysql mysql 4599808 Ago 20 14:32 /var/lib/mysql/test/idx_varchar_small.MYI
-rw-rw---- 1 mysql mysql 4599808 Ago 21 00:22 /var/lib/mysql/test/idx_varchar_small_nopack.MYI
[/code]
These two new 'nopack' tables are indeed exactly the same size, so it's safe to say:
VARCHAR size won't influence the key size unnecessarily
The efficiency of a packed key entry
A packed VARCHAR index entry typically looks like this:
1 (number of bytes of the shared prefix) + N (bytes of the differing suffix) + 4 (record pointer size)
Remember that my datasets were latin1, so each character maps to a single byte. Now, 8 - 5 = 3. If we had packed keys on a VARCHAR(6), and the data was roughly sequential from record to record (like rows generated by nested loops: "aaa", "aab", "aac", and so on), thus unique but highly ''packable'', the sum of bytes would be something like this:
1 (number of bytes of the shared prefix) + 1 (byte of the differing suffix) + 4 (record pointer size) = 6
This packed index entry size matches the VARCHAR length; for anything below 6 the first byte (the prefix length) would be pure overhead with no gain at all, right? Which means that, to be worthwhile, PACK_KEYS should be applied to VARCHARs bigger than 6. Assuming there is also a one-byte separator (High Performance MySQL uses an analogy with a colon, as in "5,a" or "5,b"), we can shift that rule to:
To be worthwhile, PACK_KEYS should be applied to VARCHARs bigger than 7!
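To make the arithmetic concrete, here's a purely schematic view of how three sequential VARCHAR(6) values would pack, ignoring that possible separator byte (my illustration of the idea above, not the actual MYI byte layout):
[code]
'aaaaaa' -> [0][aaaaaa][RRRR]   first entry, nothing shared:           1 + 6 + 4 = 11 bytes
'aaaaab' -> [5][b]     [RRRR]   5 bytes shared with the previous key:  1 + 1 + 4 = 6 bytes
'aaaaac' -> [5][c]     [RRRR]   again:                                 1 + 1 + 4 = 6 bytes
            ([RRRR] = 4-byte record pointer)
[/code]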
Now we should confirm there is indeed a separator. I thought of using a hex viewer to see if I could come up with a pattern. The best approach would be to look at the MYI specification (either in the MySQL source code or in the MySQL documentation/wiki):
[code]
00000000 fe fe 07 01 00 03 01 4d 00 b0 00 64 00 c4 00 01 |.......M...d....|
00000010 00 00 01 00 08 01 00 00 00 00 30 ff 00 00 00 00 |..........0.....|
[... typical file format heading, some 00 and ff ...]
00000400 03 fd 00 08 31 32 33 34 35 36 37 38 00 00 00 00 |....12345678....|
00000410 00 00 0e 14 0e 28 0e 3c 0e 50 0e 64 0e 78 0e 8c |.....(.<.P.d.x..|
00000420 0e a0 0e b4 0e c8 0e dc 0e f0 0d 01 04 0e 18 0e |................|
00000430 2c 0e 40 0e 54 0e 68 0e 7c 0e 90 0e a4 0e b8 0e |,.@.T.h.|.......|
00000440 cc 0e e0 0e f4 0d 02 08 0e 1c 0e 30 0e 44 0e 58 |...........0.D.X|
00000450 0e 6c 0e 80 0e 94 0e a8 0e bc 0e d0 0e e4 0e f8 |.l..............|
[...]
000c3300 0e 80 0e 94 0e a8 0e bc 0e d0 0e e4 0e f8 0d 59 |...............Y|
000c3310 0c 0e 20 0e 34 0e 48 0e 5c 0e 70 0e 84 0e 98 0e |.. .4.H.\.p.....|
000c3320 ac 0e c0 0e d4 00 00 00 00 00 00 00 00 00 00 00 |................|
000c3330 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
000c3400 80 67 00 00 00 00 03 02 00 08 31 32 33 34 35 36 |.g........123456|
000c3410 37 38 00 00 00 71 0d 18 00 00 00 00 03 04 0d 32 |78...q.........2|
[... nonsense data (to me) that could be a file format footer/terminator ...]
000c3860 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
000c3c00
[/code]
Notice that the data is shown in plain text ('12345678'). The [...] lines mark elided content that just continues the pattern (sorry, I couldn't highlight it); actually, the pattern seems to repeat at intervals of 0x400 (a value that surely has a name and a reason). Now:
- The "number of bytes of same prefix" should be a constant, and that's probably the value that causes the pattern. I can't explain why it's 0x0e and not 0x08, but that's definitely overhead for some reason.
- My differing suffix is "", since the data repeats, so the "bytes of the differing suffix" component above is 0 bytes (non-existent).
- Moreover, our data is stored adjacently (the order of the referred rows in the index is the same as in the data file), and we know PACK_KEYS can also pack the "record pointer size" above. The offset to the next record in the data file is almost certainly smaller than 256 (even with some MYD format overhead, the only field is 8 chars/bytes long), so it can easily fit in a single byte, which is how an intelligent packing of the record pointer should work.
Therefore, I suspect a packed key entry here takes 1+1=2 bytes, forming a pattern of a constant first byte and an evenly increasing second byte, even though there isn't an ''index entry separator''. And that's just what we're looking at in the hexdump above: the second byte increases by 0x14 (20 decimal) each time, which also suggests that each record (of 8 bytes) in the data file is actually stored in 20-byte blocks.
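Spelling that step out (my arithmetic, not in the original post): the low bytes at offset 0x410 above read 0x14, 0x28, 0x3c, 0x50, ..., so the distance between consecutive entries is:
[mysql]
SELECT 0x28 - 0x14 AS step_between_entries, CONV('14', 16, 10) AS same_value_in_decimal;
[/mysql]
Both give 20.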
Of course I should back these assumptions up by looking at the specs, but it seems pretty clear, and I just wanted to test what I had worked out minutes earlier. Tests were done with MySQL 5.1.37 on x86. :-)
[...] my previous conclusions about VARCHAR influence on index size could be quite storage engine specific, I'd like to see if we can extend them to InnoDB, so I took [...]
Actually, regarding the value 0x400, I just bumped into this page about Estimating Query Performance, where it reads:
In MySQL, an index block is usually 1,024 bytes and the data pointer is usually four bytes.
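It fits: a quick check of mine confirms that 0x400 is 1024 and that every .MYI size above is an exact multiple of that block size:
[mysql]
SELECT 0x400 + 0 AS index_block_size,
       818176 % 1024 AS big, 798720 % 1024 AS mixed_and_packed, 4599808 % 1024 AS small_unpacked;
[/mysql]
The block size is 1024 and all the remainders are 0.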
Maybe there's more info about the subject, but that's enough for now.
Also worth mentioning are the posts by Peter on the MySQL Performance Blog about the effect of PACK_KEYS in write-intensive scenarios and on JOINs, respectively:
http://www.mysqlperformanceblog.com/2008/08/12/beware-of-myisam-key-cache-mutex-contention/
http://www.mysqlperformanceblog.com/2006/05/13/to-pack-or-not-to-pack-myisam-key-compression/
[...] the size of VARCHAR fields not influencing the size of the indexes, this data type (and other similar ones) has a number of characteristics that can [...]