Database Management: Optimizer, Statistics and Performance
Ask yourself: what is the purpose of optimizer statistics?
They describe the data in the RDBMS so the optimizer can create a
query plan that finds a fast path to the data; maintaining them is
part of database management.
Statistics are created by running 'ANALYZE' in Oracle, 'UPDATE STATISTICS'
in Informix, or 'ANALYZE' (often as 'VACUUM ANALYZE') in PostgreSQL, and
by similar utilities in other RDBMS. But a performance problem can arise
if gathering the statistics takes a significant amount of time in a
large application.
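For example, the commands look roughly like this; the table name
'orders' is a placeholder, and the exact syntax varies by product and
version:

    -- Oracle (the older ANALYZE command; DBMS_STATS is the newer interface)
    ANALYZE TABLE orders COMPUTE STATISTICS;

    -- Informix
    UPDATE STATISTICS MEDIUM FOR TABLE orders;

    -- PostgreSQL: ANALYZE gathers planner statistics; VACUUM ANALYZE also
    -- reclaims dead row space
    VACUUM ANALYZE orders;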
Sometimes people see large performance gains from fresh statistics and
conclude that they must run the statistics jobs to reflect every change
in the data. This is usually not needed; the optimizer will probably
pick the same access method for a table with a million rows as it does
when the table grows to a million and a half, or even ten million.
So, if the time spent processing statistics, work meant to improve
performance, is itself crippling application performance, what can be done?
If table sizes do not change drastically and the distribution of values
does not change drastically, the easiest way to cut the runtime cost of
the statistics job is to run it less often. Instead of five times a week,
run it once a week or once a month, or split the job so it covers a few
tables per day, as in the sketch below. If the size and distribution of
the data are more or less static and no indexes change, running it once
may be enough.
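One way to split the job, sketched here with Oracle-style commands and
hypothetical table names, is to keep a small SQL script per weekday and
hand each to whatever job scheduler the site already uses:

    -- stats_monday.sql: the largest, most volatile tables
    ANALYZE TABLE orders     COMPUTE STATISTICS;
    ANALYZE TABLE order_line COMPUTE STATISTICS;

    -- stats_thursday.sql: reference tables that change slowly, sampled
    ANALYZE TABLE customer ESTIMATE STATISTICS SAMPLE 20 PERCENT;
    ANALYZE TABLE product  ESTIMATE STATISTICS SAMPLE 20 PERCENT;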
If a table fluctuates in size by orders of magnitude, or the
distribution of attribute values shifts significantly over time, run
the statistics job on that particular table more often, provided it
makes a measurable difference in application performance. If an index
is added to a table, also gather statistics on that table.
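For instance, a single volatile table can be re-analyzed on its own
schedule, and statistics can be refreshed after an index is created;
again the names are hypothetical and the syntax is Oracle-style:

    -- Re-analyze only the volatile table, not the whole schema
    ANALYZE TABLE order_queue COMPUTE STATISTICS;

    -- After adding an index, gather statistics on that table as well
    CREATE INDEX ix_order_queue_status ON order_queue (status);
    ANALYZE TABLE order_queue COMPUTE STATISTICS;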
Some RDBMS utilities can gather statistics from a sample of the data,
or gather only a subset of the statistics, so the statistics job runs
faster. The idea is that some statistics, even if incomplete or not
perfectly accurate, will help the optimizer enough to find a fast path
to the data.
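A rough sketch of sampling, assuming Oracle's DBMS_STATS package and
PostgreSQL; the schema and table names are placeholders:

    -- Oracle: sample about 10 percent of the rows instead of reading them all
    BEGIN
      DBMS_STATS.GATHER_TABLE_STATS(
        ownname          => 'APP',
        tabname          => 'ORDERS',
        estimate_percent => 10);
    END;
    /

    -- PostgreSQL: lower the per-column sample/histogram target, then analyze
    SET default_statistics_target = 10;
    ANALYZE orders;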