Recover from disk full errors

We’ve all been there at some point: you set up a new QA server, but you’re busy, so you put off setting up alerts for later. The server gets a ton of usage, and all of a sudden it runs out of space before you had a chance to do something about it. This post is about one way I recover from disk full errors.

In this particular case, we created a new SSIS catalog. The SSISDB database is created by the system, so at first you don’t get to select where the files are located. Sure, you could’ve modified the database defaults post-setup, but you didn’t do that either! Now the log file is on the data volume, and the volume is all filled up. You’d like to move the log file, but you can’t detach SSISDB because, again, the volume is full and nothing works right. So what do you do?

Whenever SQL Server restarts, it reads the entries in sys.master_files and sys.databases to figure out where the database files live. When you alter any of the database file properties, those changes are registered in those catalog views. So what we need to do here is update those entries (not directly, please!) with ALTER DATABASE and then restart the service. Since this particular server is non-prod, restarts are OK! So here’s the syntax:

--run this first to get the current logical name, you'll need this for the next step 
SELECT DB_NAME(database_id),
       name,
       physical_name
FROM sys.master_files; 
--Now the actual trick, where filename is the physical name of the file 
ALTER DATABASE SSISDB
MODIFY FILE
(
    NAME = 'log',
    FILENAME = 'L:\sqllogs\ssisdb.ldf'
);

After this, stop SQL and manually move the file to the new location (as defined in your script — SQL will not move the files for you). When done, start SQL again. Your database should come right up!
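
Once the instance is back up, it’s worth double-checking that SQL Server is pointing at the new path and that the database is online. A quick sanity check against the same catalog views we used above:

--confirm the new file location and that the database came up
SELECT DB_NAME(mf.database_id) AS database_name,
       mf.name AS logical_name,
       mf.physical_name,
       d.state_desc
FROM sys.master_files mf
INNER JOIN sys.databases d ON d.database_id = mf.database_id
WHERE d.name = 'SSISDB';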

Now, let’s say that, in your hurry to get things back up, you restarted the service but forgot to actually move the files. Despair not! As long as SQL hasn’t acquired a filesystem lock on the files, you can still move them to the proper places. Once everything is where it belongs, the following commands will bring the database back online:

ALTER DATABASE SSISDB SET OFFLINE;
ALTER DATABASE SSISDB SET ONLINE;

Useful Links:

Alter Database


Terminate user sessions for fun and profit

This is a bit of a drastic script: every now and then I’ll have a misbehaving user or application that will create way too many sessions, causing all sorts of issues. I could kill the sessions one at a time, but that’s such a drag! So let’s terminate those user sessions!

Even inactive sessions can be an issue: for every open session, SQL Server needs to allocate a bit of RAM and CPU to maintain that connection. If you have a ton of sessions that are just sitting idle, you could be wasting memory that other sessions could use. This is basically how DDoS attacks work: overwhelm the server with connections until it runs out of resources.
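
Before reaching for the big hammer, it helps to see just how bad things are. Here’s a quick way to count sessions per login using sys.dm_exec_sessions (a minimal sketch; adjust the filters to taste):

--how many sessions does each login have, and how many are just sitting idle?
SELECT login_name,
       COUNT(*) AS total_sessions,
       SUM(CASE WHEN status = 'sleeping' THEN 1 ELSE 0 END) AS idle_sessions
FROM sys.dm_exec_sessions
WHERE is_user_process = 1
GROUP BY login_name
ORDER BY total_sessions DESC;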

That’s why I created the script below — you give it a user name, and it kills every session associated with the login. You can even filter it down to just one database! Or not, that’s totally your call. All I can tell you is that it does what the name says: it slays all, and that’s just metal.

DECLARE @spids TABLE (pk INT IDENTITY(1,1), spid SMALLINT)
DECLARE @i INT = 1

--collect every session id owned by the login in question
INSERT INTO @spids (spid)
SELECT s.spid
FROM sys.sysprocesses s
INNER JOIN sys.server_principals p ON s.sid = p.sid
WHERE p.name = 'username' --kill all sessions for that one login
--AND DB_NAME(s.dbid) IN ('dbname') --optionally, only sessions in that one database

--loop through the list and KILL each session
DECLARE @sql VARCHAR(1000)
WHILE @i <= (SELECT MAX(pk) FROM @spids)
BEGIN
    SELECT @sql = 'KILL ' + CAST(spid AS VARCHAR(100)) FROM @spids WHERE pk = @i
    PRINT @sql
    EXEC (@sql)
    SELECT @i = @i + 1
END

Helpful links

KILL (Transact-SQL) documentation


Script – Document your IP addresses

You should always keep detailed documentation on each of your servers as part of your DR plan. This document will become your holy grail if you ever have to recover your servers.

True story: our BI team had 3 servers, and one of them was 2 weeks away from being retired. Suddenly, in the middle of the afternoon, the server went offline. I contacted our infrastructure team and asked them to take a peek at the server itself, in the server room, and tell me what happened. Ten harrowing minutes later, they returned and delivered the grisly news: the motherboard had fried, and the box was dead. Lucky for me, I had current backups of all the databases, but not of the replication publications. Those can be rebuilt, sure, but it would have saved me a bunch of time if I had those pubs scripted out.

Again, your documentation should be thorough. Here are a few things that should be part of it:

  • What are my drive letters?
  • How much disk space do I have?
  • Who are all my local admins?
  • What are my server IPs?
  • If using SQL Replication, save your publication scripts
  • etc….

Sure, you can recover a lot of this information from database backups. But always hope for the best, and plan for the worst!
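
Speaking of drive letters and disk space (two of the items on the list above), you can capture most of that straight from T-SQL with sys.dm_os_volume_stats. A minimal sketch; note that it only reports volumes that actually host database files:

--one row per volume that currently hosts a database file
SELECT DISTINCT vs.volume_mount_point,
       vs.logical_volume_name,
       vs.total_bytes / 1024 / 1024 / 1024 AS total_gb,
       vs.available_bytes / 1024 / 1024 / 1024 AS free_gb
FROM sys.master_files mf
CROSS APPLY sys.dm_os_volume_stats(mf.database_id, mf.file_id) vs;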

For this post, this is the script we use to document the IP addresses for our server farm:

SELECT CONNECTIONPROPERTY('net_transport') AS net_transport,
       CONNECTIONPROPERTY('protocol_type') AS protocol_type,
       CONNECTIONPROPERTY('auth_scheme') AS auth_scheme,
       CONNECTIONPROPERTY('local_net_address') AS local_net_address, --the server-side IP, the one you actually want to document
       CONNECTIONPROPERTY('local_tcp_port') AS local_tcp_port,
       CONNECTIONPROPERTY('client_net_address') AS client_net_address;
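
CONNECTIONPROPERTY only describes your own session. If you’d rather see the same details for every current connection, sys.dm_exec_connections exposes them as well (a sketch; the TCP filter just weeds out shared memory connections):

--server-side view of the same properties, one row per current connection
SELECT c.session_id,
       c.local_net_address, --the server IP the client connected to
       c.local_tcp_port,
       c.client_net_address,
       c.auth_scheme,
       c.net_transport
FROM sys.dm_exec_connections c
WHERE c.net_transport = 'TCP';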

In future posts, we’ll go into more details on our DR documentation strategy.


Script: Where in the disk are my indexes?

Index maintenance is one of those things that you have to stay on top of, in order to keep your data flowing smoothly. In no particular order:

  • Index defrags/rebuilds
  • Removing unused indexes
  • Optimizing existing indexes
  • Saving index definitions (ever had someone complain an index “went missing?”)
  • Moving them to a separate filegroup/disk
  • Etc…

A few months back, we were having issues with heavy I/O, so one of our plans was to move all the indexes into a separate filegroup, located on a different disk. This way, a write would be split across two I/O paths: one for the table update itself, and one for the index update. The snag was that, over time, new indexes kept getting added to the original data filegroup, so now they were all over the place.
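
For reference, setting up that dedicated filegroup is just a couple of ALTER DATABASE statements. A sketch with made-up database, file, and path names:

--create a dedicated filegroup for indexes, with its file on a separate disk
ALTER DATABASE MyDatabase ADD FILEGROUP [INDEXES];

ALTER DATABASE MyDatabase
ADD FILE
(
    NAME = 'MyDatabase_indexes',
    FILENAME = 'I:\sqlindexes\MyDatabase_indexes.ndf',
    SIZE = 10GB,
    FILEGROWTH = 1GB
)
TO FILEGROUP [INDEXES];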

In an attempt to figure out the lay of the land, I wrote the following query. It tells you which filegroup each index lives in, which helps identify what needs to be moved where. It excludes heaps and clustered indexes (the clustered index is the table itself, so it stays with the data) as well as system objects (because you should probably leave those alone).

SELECT ds.name AS FileGroupName,
       s.name AS SchemaName,
       o.name AS TableName,
       i.name AS IndexName
FROM sys.objects o
INNER JOIN sys.indexes i ON o.object_id = i.object_id
INNER JOIN sys.data_spaces ds ON ds.data_space_id = i.data_space_id
INNER JOIN sys.schemas s ON o.schema_id = s.schema_id
WHERE o.is_ms_shipped = 0
AND i.index_id > 1
ORDER BY ds.name,
         o.name,
         i.index_id;

Once I had them identified, it was easy to write some dynamic SQL to rebuild them in the new location. The syntax is simple:

CREATE INDEX [indexname]
ON [table] ([column list])
WITH (ONLINE = ON, DROP_EXISTING = ON)
ON [filegroupname];

You have to write out the entire index definition as it currently stands, but DROP_EXISTING = ON instructs SQL to recreate the index in the specified filegroup instead of just complaining that the index already exists. ONLINE = ON (if you’re on Enterprise Edition) lets you run this without dropping the index first and without significant blocking.
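
To make that concrete, here’s what one of those rebuilds looked like, with made-up table, column, and filegroup names; the important parts are DROP_EXISTING and the trailing ON [filegroup]:

--recreate an existing index on the INDEXES filegroup, keeping it online
CREATE NONCLUSTERED INDEX IX_Orders_CustomerID
ON dbo.Orders (CustomerID)
INCLUDE (OrderDate)
WITH (ONLINE = ON, DROP_EXISTING = ON)
ON [INDEXES];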

Mini-script – Get Session Progress

Don’t you wish you knew how far along that backup or restore has gotten? If you’re running it manually, and you remembered to add the STATS option, you can see the progress in the default output, but what if it’s coming from an enterprise tool, like Commvault?

It would also be great to get progress on other operations, like shrinks, DBCC CHECKDB, etc. Not every operation reports its progress, but this little script will show it for the ones that do.

SELECT  r.session_id 
         ,DB_NAME(r.database_id) AS database_name
         ,r.status 
         ,r.command 
         ,r.wait_type 
         ,r.percent_complete 
         ,CASE WHEN estimated_completion_time < 36000000 THEN '0'
              ELSE ''
         END 
         + RTRIM(r.estimated_completion_time / 1000 / 3600) + ':'
         + RIGHT('0' + RTRIM(( r.estimated_completion_time / 1000 ) % 3600 / 60),2) 
         + ':' + RIGHT('0' + RTRIM(( r.estimated_completion_time / 1000 ) % 60), 2) 
         AS TimeRemaining 
         ,GETDATE() + ( CASE WHEN estimated_completion_time < 36000000 THEN '0'
                            ELSE ''
                       END 
                       + RTRIM(r.estimated_completion_time / 1000 / 3600)
                       + ':' + RIGHT('0' + RTRIM(( r.estimated_completion_time / 1000 ) % 3600 / 60), 2)
                       + ':' + RIGHT('0' + RTRIM(( r.estimated_completion_time / 1000 ) % 60), 2) ) 
         AS EstimatedCompletionTime
FROM    sys.dm_exec_requests r
WHERE   r.percent_complete > 0

SQL Server upgrade shenanigans

It’s late at night, I’m watching Yours, Mine and Ours (the 1968 version), and it’s time to upgrade one of our SQL 2014 dev instances to SQL 2017. We’d already gone through the upgrade advisor, and the issues it raised (mostly CLR) were not a big deal, so let’s go ahead and get this done. It’s all smooth sailing until the setup fails with “Wait on the Database Engine recovery handle failed” (error code 0x851A001A).

Very odd! After some quick googling, I find this page by Pinal Dave –> https://blog.sqlauthority.com/2017/01/27/sql-server-sql-installation-fails-error-code-0x851a001a-wait-database-engine-recovery-handle-failed/

His suggestion is that there’s an issue with the AD service account, and the fix is to switch to Network Service. This is dev anyway, and I’m feeling a little careless, so why not? Unfortunately, it didn’t work! Not only that, when I tried switching service accounts, I got even weirder errors. I absolutely love the “%1 to logoff” one.

After a restart and another attempt at the upgrade, with the same results, I decided to ditch the upgrade summary and take a peek at the SQL Server error log itself. This is the actual service log, saved under MSSQL12.MSSQLSERVER\MSSQL\Log. In there, I was able to verify a few things:

  • The server starts
  • It tries to run the upgrade scripts

So that’s good, right? Then I find the real culprit (database names hidden to protect the innocent):

Based on that log message, the issue appears to be that the upgrade script is trying to upgrade some system objects (you can tell they’re system objects because they begin with MS%), and when that fails, it fails the entire upgrade. Interesting! I know these particular objects are left over from some weird replication issues that were never properly cleaned up, so it’s safe to drop them. But how? I can’t even get the service to come up. Still, I know I have a few options:

  • I can look up the trace flag that skips upgrade scripts (this very handy link lists all the trace flags –> http://www.sqlservercentral.com/articles/trace-flag/152989/ ). The correct flag is 902, which bypasses the database upgrade scripts at startup. Once the server comes up, I can fix the issue, remove the flag, then retry the upgrade.
  • I can be lazy and rename the data and log files in the filesystem, then restart the upgrade. In theory, SQL should fail to attach the database (because I renamed the files!), apply the upgrade scripts to the databases that actually mounted, and then complete the upgrade.

I’m feeling adventurous, so I try the second option, and it works! I have a working instance again. Now I have to clean up after my mess. I switch the service account back to the proper AD account, and I also stop SQL and rename the data files back to their correct names. Once I start SQL again, the database mounts properly. I find it odd that the upgrade scripts didn’t fail this time around, so I can only assume they did not run again. I’m not too worried at this point, because next up is fixing the real issue by dropping the orphaned replication objects (a rough sketch of that cleanup is below). Since I have to apply CU11 afterwards, I figure the upgrade scripts will run again at that point. Ten minutes later, I was done.
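
The cleanup itself ended up being just a couple of statements. A sketch: the database name here is made up, and sp_removedbreplication strips every replication object out of the database, so only run it when you’re sure replication is really gone.

--if the instance won't start at all, add -T902 as a startup parameter first to skip the upgrade scripts
--find the leftover replication objects (they start with MS, as noted above)
SELECT name, type_desc
FROM MyDatabase.sys.objects
WHERE name LIKE 'MS%';

--remove all replication metadata and objects from the database
EXEC sys.sp_removedbreplication @dbname = N'MyDatabase';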

What are the lessons learned? First of all, before you run an upgrade, make sure your databases are clean, with no weird system objects lying around. Second, if you have a failed upgrade, take a look at the SQL Server error log itself; there’s usually a ton of useful information in there to help you get up and running again.

Not a minute to waste!

Ok, so I’ve been really excited about this “deployment automation” stuff! That’s not really new, though: I’ve been trying to implement some form of database source control for a few years now, with minimal success. The reasons vary:

  • The tools aren’t all that great. That’s not really their fault, though.
  • There’s not really a true way of doing it, as far as databases are concerned (which is why the tools aren’t that great). Like everything else in database land, the approach sorta… “depends” on a number of different things.

But this isn’t a post about source control or automation! Not yet, anyway. This is a post about how I’m getting ready for it.

We live in an amazing era; tools and processes that were really difficult 5 years ago are easily available now, and work really well. Take Azure and DevOps, for example! It is really easy to get set up with a free account, download a bunch of tools, and get going. That should be its own post…

I’ve already got my Azure instance up and running, so next for me is to set up a local environment and a bunch of VMs. I picked up a relatively inexpensive machine just two nights ago, a Dell Inspiron 3670. With an 8th-gen Core i5-8400, a 1TB drive, and 12GB of RAM, it should provide enough oomph to run a couple of VMs and set up some basic integration between local and the cloud. In the next few posts, I’m going to detail how I got things set up.

New (old) directions

I just returned from my yearly peregrination to Seattle to attend SQL PASS. This was really an amazing year, and I learned a great many things! I’ve decided that, after 11 years as a full-time DBA, it was time to take up the proverbial pencil and share some of the things I’ve learned, and some of the projects I’m working on.

Things have changed quite a bit since the SQL 2000 days, and there are a lot of exciting things going on. I figured that writing them down will help me keep track of how I’ve accomplished them, as well as share the knowledge with others. So I’ve repurposed the old site, dusted off the cobwebs, made a few tweaks, and now I’m ready to go! I haven’t actively written about work since I left PC Magazine all the way back in the ’90s, when I worked for the Brazilian edition of PCMag, so I’m a little nervous, and more than a little rusty, but hey, what could possibly go wrong?