Database configuration and maintenance challenges can, if not addressed, cause serious disruptions to database environments and an organization’s ongoing business. A small problem can go unnoticed today but become a larger problem tomorrow that impacts database and company performance, and data security and protection.
KEY AREAS
Backups
Security
Configuration
Maintenance
Disaster recovery
BACKUPS
– Database backups are a standard task and fundamental core responsibility
– Two most common issues that require a database restoration are user errors and hardware failure
– The database is a constantly changing part of a company’s technology landscape. In fact, studies show that a typical database doubles in size every three years. As a database grows, the backup procedures in place do not keep up with the many additions and changes that are made to its structure.
– Point-in-time recovery provides the capability so that in the event of failure, the database can be recovered to the time right before the database failure
– Backups need to be planned in such a way as to make point-in-time recovery possible
– Since most full backups are scheduled at most daily, without point-in-time recovery the organization will lose up to a full day’s worth of transactions
– A large number of broken transaction log chains due to missing backup files
– Only one path to restore a database to the most recent PIT (versus multiple full backups with appropriate differential and transaction log backups)
– A lack of multiple backup chains available (even though disk space is inexpensive)
– Corrupt backup files resulting from a lack of continuous testing. Backups are useless until verified.
– Time and experience demonstrate that a database will eventually fail due to a disk subsystem failure, corrupted metadata, upgrade problems, etc. To meet established RTO SLAs, you must have a complete set of backups. In the event of a disaster, your backup strategy and its execution must meet or exceed the expected levels of data retention.
– For example, a mission-critical database will require near zero data loss in a crash scenario in many cases. Under these business parameters, the database must be in full recovery model and have transaction log backups every 5 to 15 minutes (based on SLAs).
– Many databases in FULL recovery model had no transaction log backups occurring. This scenario leads to several major problems
– The transaction log is not truncated and internal log segments are not marked for reuse unless a transaction log backup is taken (for a database in FULL recovery model).
– This leads to an ever-growing transaction log. It is not uncommon to find a database with the transaction log file multiple times larger than the data file — the log at a 100 percent usage level and continuing to grow.
– In addition, point-in-time recovery is not guaranteed since there is no guarantee that the log can be backed up during a crash scenario.
– Other common mistakes include failing to back up system databases and failing to add recently created databases to existing backup maintenance plans.
– Ensure that the current backup strategy meets the business needs on a database by database basis.
– Ensure that databases are in their proper recovery model, and have the proper backup strategy configured, including nightly full/differential backups and appropriate transaction log backups occurring at a frequency that meets required SLAs for data loss.
– Your organization must also ensure that its database backups are protected by designing and implementing off-site replication solutions for those backups, whether it is to an off-site facility or the cloud.
CONFIGURATION
– It is extremely common to find SQL Server instances (and databases) configured with their out-of-box default settings permanently left in that state. Unfortunately, there are performance and security risks associated with the default configuration. When first launched, the performance impact is not noticed. However, as databases grow and their associated transactional activity increases — along with resource needs — the performance penalties of default configurations can severely impact the environment’s operations. Worse yet, the impact of default configurations from a security perspective is immediately detrimental and exposes a company’s environment to inadvertent and malicious attacks.
– Service packs and cumulative updates were not installed — all essential in plugging security holes, fixing program bugs, and enabling additional features to improve the environment’s efficiency and performance.
– Many configuration errors involving SQL memory settings were discovered. At default settings, the SQL Server buffer pool will expand until it causes contention with the operating system and other third-party services. This causes the entire server to slow down, perform unneeded I/O, and sometimes even hang or crash
– Many default settings were found at the database level, which can lead to performance problems such as file growth factors. For example, on a large multi-GB file, the default 10% growth factor can lead to many gigabytes of allocation in a single growth session, which can cause significant I/O pauses.
– Businesses must optimize configuration settings and install security updates to ensure the best performance and bugfree operation of their databases. Following these protocols, your organization’s environment will be better protected and function at a more efficient level.
– Database administrators (DBAs) must develop and execute upgrade plans for standalone instances as well as those that utilize advanced features such as mirroring, Microsoft Cluster Server (MSCS), replication, and Always On Availability Groups.
– Continually manage the reconfiguration of memory and other instance-level settings to ensure optimal use of server resources.
– From a database level, DBAs must make sure that the settings are aligned with the functionality and performance. For example, it is important to determine the proper file growth factors by taking into account the frequency of growth, the size of the growth session, and the time the growth occurs.
– Physical Memory – Often it is found that the infrastructure had adequate memory but the SQL Server instance was not able to utilize it for a variety of reasons. Consider virtualization technology, which is being deployed increasingly in virtualized database environments where extra care and attention is required
– CPU Pressure – When unacceptable CPU pressure occurs, the root cause is either a configuration setting or inefficient TSQL code. Reconfigure instance settings and/or tune TSQL statements to reduce or eliminate CPU pressure and the need to add hardware.
– Disk Layout – Pay special attention to the disk sub-system and how it interacts with Microsoft SQL Server databases, checking to see how it is optimized for performance and redundancy. Also check the physical layout of the databases with special attention given to data and log files for both user and system types of databases.
– Reconfigure instance settings and/or tune TSQL statements
– Consider virtualization technology
MAINTENANCE
– Database maintenance keeps a SQL Server environment running at optimum performance and can stop an impending disaster before it occurs. Maintenance routines act as automated tuning agents that preemptively remedy a host of issues that can occur naturally inside the databases such as index fragmentation, early database corruption detection, and system databases and file system cleanup.
– Missing or improperly configured maintenance plans
– No regular defragmentation is done on database indexes to ensure efficient data access and avoid unnecessary additional I/O. Similar to the defragmentation done on a server or PC hard drive, the same needs to be done on database indexes.
– Regular validation of data pages was not performed to ensure no corruption. Undetected corruption can grow over time. A database may appear to be fully operational while the corruption is being backed up. At some point, the corruption can spread and cause the database to become unusable. Unfortunately when that happens, the problem has already been backed up for awhile. Therefore, it is critical to detect corruption and handle it as soon as possible to eliminate or minimize data loss.
– Implement comprehensive set of best practices maintenance procedures. These procedures ensure database structures, such as indexes and statistics, are regularly defragmented and updated — resulting in the repair of lost performance due to naturally occurring data changes.
– Proper Purging Routines – Each instance keeps a history (e.g. for SQL jobs) that is kept in the MSDB system database and in the file system. It is important that this history is purged or archived on a regular basis to ensure optimal database performance
– Index Maintenance – To speed database queries, it is common practice to add indexes to key tables within the database. As data is added and removed from the table that contains an index, the index itself can be affected. Over time, these indexes can get lopsided and heavily fragmented and severely impact the performance of the database. To ensure optimal performance of a SQL Server database, it is a best practice to perform maintenance on indexes
– Data Integrity Checks – A data integrity failure can start small and then grow over time. If an integrity failure is left uncorrected, it can make its way into database backups. Eventually a database will go suspect or unavailable. At that point, data loss is inevitable and can be potentially quite large. Perform regular data integrity checks
– Perform regular data integrity checks
– Perform maintenance on indexes
– Purge or archive instance histories on a regular basis
DISASTER RECOVERY
– A disaster recovery (DR) plan is crucial in the event of a critical failure at the primary datacenter site and also during small-scale local events, such as extended power outages.
– There is no one-size-fitsall DR solution across such a diverse range of companies because of budget fluctuations and differences in the DR appropriations.
– DR plans could not meet the service level agreements (SLAs) for recovery time objectives (RTO) and/or recovery point objectives (RPO).
– Backups were shipped off-site to a safe location but no facility was in place to restore them or the related instance-level objectives in a timely manner.
– Customers assumed that periodically restoring backups on a remote server would allow for recovery in the event of failure — a process that doesn’t meet the requirements for a complete and usable DR site.
– Many customers failed to successfully test and validate DR procedures and plans.
– can achieve its DR goals by employing a variety of technologies, including log shipping, database mirroring, geo-dispersed clusters, SAN/virtual machine replication, and Availability Groups.
– The DR infrastructure and procedures must be implemented based on best practices and rigorously tested on an on-going basis.
– Each organization must take a tailored approach to effectively balance RTO and RPO needs against the implementation and maintenance costs.
– Consider a cloud service that eliminates the costs associated with the physical hardware and software of an on-premises solution.
– Microsoft lessens the cost burden of disaster recovery and fosters the concept of maintaining a DR site by requiring passive sites to pay licenses only once they become production sites for a certain period of time.