sexta-feira, 28 de agosto de 2009

Multiple field index vs external primary key calculation

I bumped over a client which was using a VARCHAR(200) as a MyISAM PRIMARY KEY key for detecting colisions which was externally (by the app) calculated and fed into MySQL. I supsected the reason for this was to [try to] speed up inserts or being fear that there were colisions and not wanting to have unnecessary data fields. The table was just a bunch of e-mail log rows.

The rows were to be quickly processed (selected between a DATETIME range) and archived. Obvious consequences are: big index size, poor performance. Let's see why:


  • The index was the concatenation of some fields, and an extra one: "Message-ID" - which could be not unique, AFAIWT. In the end, the key itself was randomly inserted, since it started with the 'sender' address (see below the maillogsmiorig schema). This obvious lead to poor key insertion (random order), of course, but also to duplicate content in the key itself, as the table engine was MyISAM.


  • More, after the loading, the operations were slower: based on MyISAM. An index would eventually be used, but the data access was unavoidable. That could also be circumvented by InnoDB, since the data is close to the index.



Due to various facts - including that DATETIME field - I suggested migrating to InnoDB with a PK composed of the necessary fields. Moreover, the first field would be the DATETIME which would impose some kind of chronological order for better key insertion - at the same time, the processing was being done for specific intervals of that field.

I've sampled 1M rows of the live data for a test: I wanted to compare one method to my overall solution, which you can check in maillogsinnonew below. I took the opportunity to isolate the storage engine benefit, by running the test for the both versions of both solutions (previous/orig and new):

[mysql]
mysql> SELECT * FROM maillogs INTO OUTFILE '/export/home/gpshumano/teste/maillogsorig.sql';
Query OK, 1000000 rows affected (6.47 sec)

mysql> select ip,sender,receiver,date_time,SUBSTRING(chave,LENGTH(SUBSTRING_INDEX(chave,'|',2))+2) FROM maillogsminew INTO OUTFILE '/export/home/gpshumano/teste/maillogsnew.sql';
Query OK, 1000000 rows affected (4.42 sec)


mysql> CREATE TABLE `maillogsinnonew` (
-> `ip` varchar(16) NOT NULL,
-> `sender` varchar(320) NOT NULL,
-> `receiver` varchar(320) NOT NULL,
-> `date_time` datetime NOT NULL,
-> `message_id` varchar(200) NOT NULL,
-> PRIMARY KEY (`date_time`,`sender`,`receiver`,`message_id`)
-> ) ENGINE=InnoDB DEFAULT CHARSET=latin1;
Query OK, 0 rows affected (1.27 sec)


mysql> CREATE TABLE `maillogsminew` (
-> `ip` varchar(16) NOT NULL,
-> `sender` varchar(320) NOT NULL,
-> `receiver` varchar(320) NOT NULL,
-> `date_time` datetime NOT NULL,
-> `message_id` varchar(200) NOT NULL,
-> PRIMARY KEY (`date_time`,`sender`,`receiver`,`message_id`)
-> ) ENGINE=MyISAM DEFAULT CHARSET=latin1;
Query OK, 0 rows affected (0.13 sec)

mysql> CREATE TABLE `maillogsmiorig` (
-> `chave` varchar(200) NOT NULL,
-> `ip` varchar(16) NOT NULL,
-> `sender` varchar(320) NOT NULL,
-> `receiver` varchar(320) NOT NULL,
-> `date_time` datetime NOT NULL,
-> PRIMARY KEY (`chave`),
-> KEY `maillogs_date_time_idx` (`date_time`)
-> ) ENGINE=MyISAM DEFAULT CHARSET=latin1;
Query OK, 0 rows affected (0.14 sec)

mysql> CREATE TABLE `maillogsinnoorig` (
-> `chave` varchar(200) NOT NULL,
-> `ip` varchar(16) NOT NULL,
-> `sender` varchar(320) NOT NULL,
-> `receiver` varchar(320) NOT NULL,
-> `date_time` datetime NOT NULL,
-> PRIMARY KEY (`chave`),
-> KEY `maillogs_date_time_idx` (`date_time`)
-> ) ENGINE=InnoDB DEFAULT CHARSET=latin1;
Query OK, 0 rows affected (2.81 sec)

mysql> LOAD DATA INFILE '/export/home/gpshumano/teste/maillogsnew.sql' INTO TABLE maillogsminew;
Query OK, 1000000 rows affected (34.83 sec)
Records: 1000000 Deleted: 0 Skipped: 0 Warnings: 0

mysql> LOAD DATA INFILE '/export/home/gpshumano/teste/maillogsnew.sql' INTO TABLE maillogsinnonew;
Query OK, 1000000 rows affected (1 min 40.56 sec)
Records: 1000000 Deleted: 0 Skipped: 0 Warnings: 0

mysql> LOAD DATA INFILE '/export/home/gpshumano/teste/maillogsorig.sql' INTO TABLE maillogsinnoorig;
Query OK, 1000000 rows affected (6 min 54.14 sec)
Records: 1000000 Deleted: 0 Skipped: 0 Warnings: 0

mysql> LOAD DATA INFILE '/export/home/gpshumano/teste/maillogsorig.sql' INTO TABLE maillogsmiorig;
Query OK, 1000000 rows affected (1 min 17.06 sec)
Records: 1000000 Deleted: 0 Skipped: 0 Warnings: 0
[/mysql]

This was the aftermath. Comparing the results you'll see a 75% gain in speed for InnoDB, and a 55% in MyISAM. More, after the loadings just did an OPTIMIZE TABLE to get more precise table status:
[mysql]
mysql> show table status;
+----------------------+--------+---------+------------+---------+----------------+-------------+-----------------+--------------+-----------+----------------+---------------------+---------------------+---------------------+-------------------+----------+----------------+---------+
| Name | Engine | Version | Row_format | Rows | Avg_row_length | Data_length | Max_data_length | Index_length | Data_free | Auto_increment | Create_time | Update_time | Check_time | Collation | Checksum | Create_options | Comment |
+----------------------+--------+---------+------------+---------+----------------+-------------+-----------------+--------------+-----------+----------------+---------------------+---------------------+---------------------+-------------------+----------+----------------+---------+
| maillogsinnonew | InnoDB | 10 | Compact | 1048709 | 144 | 152043520 | 0 | 0 | 6144 | NULL | 2009-08-24 20:00:44 | NULL | NULL | latin1_swedish_ci | NULL | | |
| maillogsinnoorig | InnoDB | 10 | Compact | 1041388 | 392 | 408944640 | 0 | 216006656 | 4096 | NULL | 2009-08-24 20:01:35 | NULL | NULL | latin1_swedish_ci | NULL | | |
| maillogsminew | MyISAM | 10 | Dynamic | 1000000 | 126 | 126120088 | 281474976710655 | 100151296 | 0 | NULL | 2009-08-24 20:00:55 | 2009-08-24 20:06:08 | NULL | latin1_swedish_ci | NULL | | |
| maillogsmiorig | MyISAM | 10 | Dynamic | 1000000 | 173 | 173720872 | 281474976710655 | 166667264 | 0 | NULL | 2009-08-24 20:01:01 | 2009-08-24 20:18:24 | 2009-08-24 20:18:31 | latin1_swedish_ci | NULL | | |
+----------------------+--------+---------+------------+---------+----------------+-------------+-----------------+--------------+-----------+----------------+---------------------+---------------------+---------------------+-------------------+----------+----------------+---------+
4 rows in set (0.03 sec)
[/mysql]

[code]
bash-3.00# ls -lah /iscsipool/DATA/tavares/
total 2288740
drwx------ 2 mysql mysql 13 ago 24 20:37 .
drwxr-xr-x 10 mysql mysql 17 ago 24 19:45 ..
-rw-rw---- 1 mysql mysql 65 ago 24 19:45 db.opt
-rw-rw---- 1 mysql mysql 8,5K ago 24 20:29 maillogsinnonew.frm
-rw-rw---- 1 mysql mysql 156M ago 24 20:32 maillogsinnonew.ibd
-rw-rw---- 1 mysql mysql 8,5K ago 24 20:31 maillogsinnoorig.frm
-rw-rw---- 1 mysql mysql 420M ago 24 20:40 maillogsinnoorig.ibd
-rw-rw---- 1 mysql mysql 8,5K ago 24 20:00 maillogsminew.frm
-rw-rw---- 1 mysql mysql 120M ago 24 20:06 maillogsminew.MYD
-rw-rw---- 1 mysql mysql 96M ago 24 20:37 maillogsminew.MYI
-rw-rw---- 1 mysql mysql 8,5K ago 24 20:01 maillogsmiorig.frm
-rw-rw---- 1 mysql mysql 166M ago 24 20:18 maillogsmiorig.MYD
-rw-rw---- 1 mysql mysql 159M ago 24 20:37 maillogsmiorig.MYI
[/code]

As you seen, you not only gain in speed, but also in smaller data files size - 63% for InnoDB and 33% for MyISAM. This smaller sizes also contribute for a longer stay in the buffer pool and reduced I/O. In any case, since the rows were processed immediatelly after the loading, they would probably be all in InnoDB's buffer pool, avoiding the extra I/O to refetch the data from the MyISAM data files.

Also notice as maillogsinnonew considers it's index (PK) size as 0 - that's because in InnoDB the fields in the PK are actually part of the data node itself. In practical terms, this index is just a matter of data ordering, ie, no extra space is required to store the index itself!

Sem comentários:

Enviar um comentário