Tuesday, April 19, 2011

Performance schema faq #2, what are all these server startup options ?

One of the first thing to do to use the performance schema is to add "performance_schema" in the my.cnf file.

This typically prompts a long list of question from users, as there is a lot to know here.

This article attempts to cover more specifically this area.

It is long, but (I hope) also worth it, so keep reading ;)

Also, since some new performance schema features have been advertised in the mysql labs, this article also covers some never seen before parts of the performance schema.

Really, keep reading ;)

Q: Why do I even have to do anything ? Can the performance schema not be enabled by default ?

No. It could, but this is not a good choice.

Let's face it, some users do not even read the documentation.

If someone is not even aware that the performance schema feature exists in 5.5 (or 5.6), turning it on by default for a new 5.5 installation, or even worse, turning it on by default after a 5.1 to 5.5 upgrade is not going to do anyone any good.

The performance schema will collect data, eat memory, spend time computing all kinds of aggregates that can be viewed from nice tables in the new performance_schema database ... that nobody will ever select from.

The performance schema is a new feature, and it is also an optional feature.

Forcing to use it by default would only irritate people doing a migration to a new release, which is never the best time to try new features, no matter how good they are. When doing a migration, the concern is typically to be back in production as soon as possible, and anything that can be sorted out later should be done ... later.

So, the long answer is that, by default, the performance schema is disabled. This is a conscious, conservative, choice.

Q: Ok, so I need to turn P_S on. Now, why are there so many other parameters ?

By the way, "P_S" is the acronym that is used internally for the performance schema, which mirrors "I_S" for the information_schema. This is not to be confused with "PS", which stands for prepared statements.

So yes, the performance schema has in fact many different startup parameters, which can also be seen as server variables.

Looking at mysql-5.6-labs-performance-schema for example:

mysql> show variables like "performance_schema%";
+--------------------------------------------------------+-------+
| Variable_name                                          | Value |
+--------------------------------------------------------+-------+
| performance_schema                                     | ON    |
| performance_schema_events_stages_history_long_size     | 1000  |
| performance_schema_events_stages_history_size          | 10    |
| performance_schema_events_statements_history_long_size | 1000  |
| performance_schema_events_statements_history_size      | 10    |
| performance_schema_events_waits_history_long_size      | 10000 |
| performance_schema_events_waits_history_size           | 10    |
| performance_schema_max_cond_classes                    | 80    |
| performance_schema_max_cond_instances                  | 1000  |
| performance_schema_max_file_classes                    | 50    |
| performance_schema_max_file_handles                    | 32768 |
| performance_schema_max_file_instances                  | 10000 |
| performance_schema_max_hosts                           | 100   |
| performance_schema_max_mutex_classes                   | 200   |
| performance_schema_max_mutex_instances                 | 5000  |
| performance_schema_max_rwlock_classes                  | 30    |
| performance_schema_max_rwlock_instances                | 5000  |
| performance_schema_max_stage_classes                   | 100   |
| performance_schema_max_statement_classes               | 165   |
| performance_schema_max_table_handles                   | 1000  |
| performance_schema_max_table_instances                 | 500   |
| performance_schema_max_thread_classes                  | 50    |
| performance_schema_max_thread_instances                | 200   |
| performance_schema_max_user_hosts                      | 100   |
| performance_schema_max_users                           | 100   |
| performance_schema_setup_actors_size                   | 100   |
| performance_schema_setup_objects_size                  | 100   |
+--------------------------------------------------------+-------+
27 rows in set (0.00 sec)

The reason why these parameters even exist is the result of significant architectural choices, which in turn are driven by very hard constraints on the system, so let's look at some details here ...

Problem: Monitoring a server should not freeze the server.

In other words, executing select * from performance_schema. should not cause other sessions to stop working.

That includes precisely the application connection that is currently executing instrumented code, which in turn feeds the very performance_schema table being read.

This has nothing to do with how the data is exposed, this would be the same for a performance_schema table, an information_schema table, a SHOW command.

This has to do with how the statistics are represented internally.

A way to implement instrumentation that is known to cause problems is for example the implementation of the SHOW PROCESSLIST statement.

A monitoring application that execute SHOW PROCESSLIST will partially freeze the server, preventing new sessions to connect, or old sessions to disconnect. This is because SHOW PROCESSLIST will lock a critical server mutex, LOCK_thread_count, for the time it takes to loop through all the threads.

See Bug#42930, for an example of similar problems.

Solution: the performance schema is lock-less

One of the goal of the performance schema implementation is to never be in that situation. To avoid that the monitoring application causes other threads to choke, the best way is to have *no* lock shared between the monitoring connection and the instrumented code.

No matter how it's done, the performance schema instrumentation should not use locks.

Problem: Monitoring a server should not crash the server

Ok, so let's pretend the SHOW PROCESSLIST implementation used as example here is not locking LOCK_thread_count at all. Developers to not add mutex locks in the code just for pleasure, they add locks to enforce data integrity and prevent bad things to happen.

The data protected by the lock is a linked list of all the connected sessions, where each session is represented by a THD object. The list is global to the server, and each THD is allocated and freed when users connect / disconnect from the server.

Without the lock, very soon, the execution (or this now hypothetical) SHOW PROCESSLIST statement would crash when using a bogus THD record, typically when looking at freed (and worse already reused) memory for a thread that just disconnected.

What this points to is that, when using code that is lock-free, using memory that is dynamically allocated and freed by other threads is going to be a real challenge, since there are no more protections in place.

Solution: the performance schema is not using dynamic memory

Since using malloc / free cause problem, then ... the performance schema code does not use malloc and free.

Instead, memory is allocated once and for all at server startup, and is freed when the server shuts down. During normal operations (everything in between), the performance schema memory is considered "static".

This has actually significant benefits, since a pointer to a region of this "performance schema static memory" is always guaranteed to point to valid memory.

Since using a linked list also causes problems, then ... the performance schema code prefers to use arrays.

The net result is that all the performance schema memory used to keep track of statistics in fact consist of arrays, allocated once.

Different arrays are used for different things. The size of these various arrays is given by the various server startup parameters.

Still looking at the SHOW PROCESSLIST statement, the performance schema instrumentation is in fact exposing a table named performance_schema.threads.

The data for the threads table is stored in an array, and the size of this array is given by the parameter performance_schema_max_thread_instances.

The benefit: select * from performance_schema.threads is lock-less. The inconvenience: performance_schema_max_thread_instances must come from somewhere.

Q: Ok, so there are sizing parameters. But so many of them ?

Every new instrumentation (mutex, file io, table lock, statements, ...) instruments different parts of the code, and needs different types of performance schema buffers. This is actually a good property, because this allows to control each type of instrumentation -- and how much memory to use for it -- independently.

So, each time a new buffer is used, how to size it is exposed as a parameter.

Q: How about default values that work out of the box ?

Given the wide range of deployments where the MySQL server can be installed, from a laptop that will ever see 2 clients and 10 tables for a demo, to a production scale server that can support a very large web farm and data model, whatever "default value" is written in the code will never satisfy everybody.

That is a given, there is no way around it.

If the default values are too big, starting a server on a small machine will consume too much memory, giving the impression that the performance schema is a "monster" and can not be used, say, in development when testing an application.

If the default values are too small, the performance schema will miss events at runtime, and will not be able to collect data for everything, giving the impression that it is broken and does not work.

The default values that are currently in the code should be reasonable enough so that running a server with the performance schema gives meaningful results, so that people can get a first impression of what the performance schema can do, out of the box, by changing just *one* parameter (to enable the performance schema).

Past that, when deploying the performance schema in a real production environment, or to do serious benchmarks, taking a closer look at the configuration itself will unfortunately be a necessary step at some point.

That being said, the picture is actually not so complex, since not so many parameters really need to be changed.

Q: What are the parameters I can just forget about ?

mysql> show variables like "performance_schema_%_classes";
+------------------------------------------+-------+
| Variable_name                            | Value |
+------------------------------------------+-------+
| performance_schema_max_cond_classes      | 80    |
| performance_schema_max_file_classes      | 50    |
| performance_schema_max_mutex_classes     | 200   |
| performance_schema_max_rwlock_classes    | 30    |
| performance_schema_max_stage_classes     | 100   |
| performance_schema_max_statement_classes | 165   |
| performance_schema_max_thread_classes    | 50    |
+------------------------------------------+-------+
7 rows in set (0.00 sec)

All these parameters follow the same principle, so let's take for example performance_schema_max_mutex_classes. What this value of 200 means, is that there is room by default for 200 different mutex instrumentation.

The number of mutex instrumentation varies as more and more code is instrumented, which varies by releases, and the way to find out what is currently instrumented is just to look at the setup_instruments table.

mysql> select count(*) from performance_schema.setup_instruments where name like "wait/synch/mutex/%";
+----------+
| count(*) |
+----------+
|      125 |
+----------+
1 row in set (0.00 sec)

125 of the 200 mutex classes slots are used, so there is still room left: everything is fine here.

There is a catch here, as both the 125 and the 200 comes from code shipped with the server ... so, if one value exceeds the other, it will be adjusted anyway.

The only reason this parameter is even exposed in the first place, is that, if a third party plugin also contains some performance schema instrumentation for mutex classes (and in this case, for a lot of them, since there are still 75 slots left), loading this third party plugin in the server will cause starvation, so the max_mutex_class setting will need to be adjusted.

And no, we can not predict how many mutexes a third party will use.

In practice, this settings can safely be forgotten. The day it will be needed, it will be there, and people might appreciate then the fact that not everything is hard coded.

Q: What are the parameters that are not really sensitive, where the value has little effect on performance and/or functionality ?

mysql> show variables like "performance_schema_%_history%_size";
+--------------------------------------------------------+-------+
| Variable_name                                          | Value |
+--------------------------------------------------------+-------+
| performance_schema_events_stages_history_long_size     | 1000  |
| performance_schema_events_stages_history_size          | 10    |
| performance_schema_events_statements_history_long_size | 1000  |
| performance_schema_events_statements_history_size      | 10    |
| performance_schema_events_waits_history_long_size      | 10000 |
| performance_schema_events_waits_history_size           | 10    |
+--------------------------------------------------------+-------+
6 rows in set (0.01 sec)

These settings affect how many records are kept in the various history buffers, so they have a direct impact on the amount of memory used. The time is takes to write a new record in history (and drop the oldest one) is fixed, and does not depends on the history size, to this sizing parameters do not impact runtime performances.

mysql> show variables like "performance_schema_setup_%_size";
+---------------------------------------+-------+
| Variable_name                         | Value |
+---------------------------------------+-------+
| performance_schema_setup_actors_size  | 100   |
| performance_schema_setup_objects_size | 100   |
+---------------------------------------+-------+
2 rows in set (0.00 sec)

This directly control the size of corresponding tables setup_actor and setup_object. How many entries are in these tables is directly up to the DBA. The performance schema uses a hash index internally, plus some level of cache on top, so that the number of records in these tables has no effect at all on performances.

mysql> show variables like "performance_schema_max_user%";
+-----------------------------------+-------+
| Variable_name                     | Value |
+-----------------------------------+-------+
| performance_schema_max_user_hosts | 100   |
| performance_schema_max_users      | 100   |
+-----------------------------------+-------+
2 rows in set (0.01 sec)

mysql> show variables like "performance_schema_max_host%";
+------------------------------+-------+
| Variable_name                | Value |
+------------------------------+-------+
| performance_schema_max_hosts | 100   |
+------------------------------+-------+
1 row in set (0.00 sec)

For the performance_schema.users, performance_schema.hosts, and performance_schema.user_hosts tables, the performance schema also uses a hash index by user name and/or host name, so the total size of these tables is not going to significantly impact the time to look up a given record, which is done once when a session connects to the server.

Sizing these tables according to how many different users and or machines are connecting to the database server, however, will have a direct impact on whether or not the performance schema can keep statistics for everybody.

Q: What are the parameters I should really pay attention to then ?

mysql> show variables like "performance_schema_%_instances";
+-----------------------------------------+-------+
| Variable_name                           | Value |
+-----------------------------------------+-------+
| performance_schema_max_cond_instances   | 1000  |
| performance_schema_max_file_instances   | 10000 |
| performance_schema_max_mutex_instances  | 5000  |
| performance_schema_max_rwlock_instances | 5000  |
| performance_schema_max_table_instances  | 500   |
| performance_schema_max_thread_instances | 200   |
+-----------------------------------------+-------+
6 rows in set (0.01 sec)

mysql> show variables like "performance_schema_%_handles";
+--------------------------------------+-------+
| Variable_name                        | Value |
+--------------------------------------+-------+
| performance_schema_max_file_handles  | 32768 |
| performance_schema_max_table_handles | 1000  |
+--------------------------------------+-------+
2 rows in set (0.00 sec)

The most critical parameters that affect how the performance schema is performing are the parameters that control the different buffers used to instrument what the code is doing, and the lower level it gets, the more sensitive it gets.

First, to make things clear, the values you see pasted in this blog are nowhere near realistic values. They are the values used by the mysql-test-run test scripts, which are chosen to be very low just to put more stress on the performance schema code itself. The goal of the test suite is to torture the performance schema, not to make it fly.

All these different parameters follow the same principle, so let's take an example again, with performance_schema_max_thread_instances.

A value of N means that the performance schema will ever be able to keep track of N threads at a time. In practice however, performance will degrade badly when reaching the very last -- and rare -- open slots in an array of size N, when trying to allocate a new entry for the very last thread that connects.

Finding an empty slot in an array 50 percent full is easier than finding an empty slot in an array 60, 70, 80 ... 95 ? 99 ? percent full.

The more memory is given to the performance schema for a given buffer, the lower the density of allocated records will be in that buffer, making performances (when allocating a new entry) better. Once an entry is allocated, all the performance schema code directly use the record with a direct pointer, so there is nothing else to tune.

Sizing this parameter to the "proper" value depends on two things a) the (estimated) maximum number of instrumented objects the server will ever see at a given time, and b) the load factor to apply to this buffer.

Estimating the former really depends on the nature of each buffer.

For threads for example, the total number of threads is going to be driven by the --max-connection parameter, plus a few extra threads used internally. There might be a few threads for replication, the event scheduler, storage engines, etc, but not that many, so for example "max-connection + 50" seems very generous already.

Estimating the later (the load factor) really depends on the volatility of each buffer.

For buffers where records are rarely added, like for example the performance_schema.host table (a record is inserted only when a connection from a never before seen machine is found), the record density of the table is not going to have a huge impact on performances.

For buffers where records are allocated and freed all the time, finding a free record the first time beats looping forever to find one, so a very low density is key.

An example of a high volatility buffer is performance_schema_max_table_handles, as table handles are allocated each time a table is opened, which, in a database server, happens very often.

Q: Humm. What happens when a buffer is too small ?

That part is actually very simple: the performance schema will tell you.

mysql> show status like "performance_schema%";
+-------------------------------------------+-------+
| Variable_name                             | Value |
+-------------------------------------------+-------+
| Performance_schema_cond_classes_lost      | 0     |
| Performance_schema_cond_instances_lost    | 0     |
| Performance_schema_file_classes_lost      | 0     |
| Performance_schema_file_handles_lost      | 0     |
| Performance_schema_file_instances_lost    | 0     |
| Performance_schema_hosts_lost             | 0     |
| Performance_schema_locker_lost            | 0     |
| Performance_schema_mutex_classes_lost     | 0     |
| Performance_schema_mutex_instances_lost   | 0     |
| Performance_schema_rwlock_classes_lost    | 0     |
| Performance_schema_rwlock_instances_lost  | 0     |
| Performance_schema_stage_classes_lost     | 0     |
| Performance_schema_statement_classes_lost | 0     |
| Performance_schema_table_handles_lost     | 0     |
| Performance_schema_table_instances_lost   | 0     |
| Performance_schema_thread_classes_lost    | 0     |
| Performance_schema_thread_instances_lost  | 0     |
| Performance_schema_user_hosts_lost        | 0     |
| Performance_schema_users_lost             | 0     |
+-------------------------------------------+-------+
19 rows in set (0.00 sec)

Q: Ok. And what happens when a buffer is too big ?

Nothing really.

The performance schema will use memory depending on the sizing used, so big sizing means a lot of memory, but that's all.

To find how much memory is used, and in particular how each setting affects each buffer size, there is SHOW ENGINE PERFORMANCE_SCHEMA STATUS.

mysql> SHOW ENGINE PERFORMANCE_SCHEMA STATUS;
+--------------------+----------------------------------------------------------------+----------+
| Type               | Name                                                           | Status   |
+--------------------+----------------------------------------------------------------+----------+
| performance_schema | events_waits_current.row_size                                  | 168      |
| performance_schema | events_waits_current.row_count                                 | 800      |
| performance_schema | events_waits_history.row_size                                  | 168      |
| performance_schema | events_waits_history.row_count                                 | 2000     |
| performance_schema | events_waits_history.memory                                    | 336000   |
...
| performance_schema | events_statements_current.row_size                             | 1976     |
| performance_schema | events_statements_current.row_count                            | 200      |
| performance_schema | events_statements_current.memory                               | 395200   |
| performance_schema | performance_schema.memory                                      | 58928908 |
+--------------------+----------------------------------------------------------------+----------+
129 rows in set (0.00 sec)

Q: Can the sizing part be automated somehow ?

It is not automated yet, but it sure can be, to some extent.

What is needed to get that done is first, to identify the magic formula that says, for each instrumentation type, something like:

total number of mutex = (number of singleton) + (number of tables * mutexes by table) + (number of threads * mutexes by thread) + ...

The challenge here is that this magic formula changes all the time, as implementation changes, but overall, there are not that many parameters in it. In any case, that part can be resolved with at least good enough heuristics.

Some parts of the magic formula also comes from the application that sends queries to the server. All things being equal, a statement that joins 20 tables is going to consume more table handles compared to a statement that just uses 1 or 2 tables, so the way the application is written also has an effect.

The second part is some feedback from users, and some benchmarking, to find reasonable values for load factors, based on each instrument type.

This is a long post already, thanks for reading it.

Regards,

-- Marc Alff, Oracle

4 comments:

  1. I think not having performance_schema on by default is a good decision. But there should be a example config with performance_schema enabled. See also Bug #60306.

    ReplyDelete
  2. Marc,

    IS.table_statistics IS.user_statistics do not need an FAQ like this. They are always enabled, even in the benchmarks that I run. They are also always on in production. Can the equivalent of table_stats in P_S be enabled for everyone without fear of CPU or memory overhead?

    ReplyDelete
  3. Hi Daniel.

    Yes, thanks for the comments.

    Regards,
    -- Marc

    ReplyDelete
  4. Hi Mark.

    Thanks for the comments.

    I know we have different views on how to implement instrumentation, but I think we have the same goal, which is effectively to have a solution that can be always enabled in production.

    Regards,
    -- Marc

    ReplyDelete

Followers