Renewing the STS cert in vCenter

The STS (Security Token Service) cert is used by vCenter to authenticate against external SSO targets, such as Okta or Active Directory, and STS itself is a key dependency for various other services – like the Web UI!

The fun part about STS is that the cert is autogenerated at some point (probably during install or upgrade), and it will expire two years after that – but vCenter won’t tell you when the expiration is coming up! It’ll just gloriously fail (see https://kb.vmware.com/s/article/82332 and https://kb.vmware.com/s/article/79248), and then you’ll get either an HTTP 500 or a “No healthy upstream” error.

To renew the cert, follow the steps on https://kb.vmware.com/s/article/76719, namely:

1 – SSH into vCenter

ssh -l root <your vcenter url>

2 – Move into /tmp

cd /tmp

3 – Create a bash script and paste the script below

vim fixsts.sh
#!/bin/bash
# Copyright (c) 2020-2021 VMware, Inc. All rights reserved.
# VMware Confidential
#
# Run this from the affected PSC/VC
#
# NOTE: This works on external and embedded PSCs
# This script will do the following
# 1: Regenerate STS certificate
#
# What is needed?
# 1: Offline snapshots of VCs/PSCs
# 2: SSO Admin Password

NODETYPE=$(cat /etc/vmware/deployment.node.type)
if [ "$NODETYPE" = "management" ]; then
    echo "Detected this node is a vCenter server with external PSC."
    echo "Please run this script from a vCenter with embedded PSC, or an external PSC"
    exit 1
fi

if [ "$NODETYPE" = "embedded" ]  &&  [ ! -f  /usr/lib/vmware-vmdir/sbin/vmdird ]; then
    echo "Detected this node is a vCenter gateway"
    echo "Please run this script from a vCenter with embedded PSC, or an external PSC"
    exit 1
fi

echo "NOTE: This works on external and embedded PSCs"
echo "This script will do the following"
echo "1: Regenerate STS certificate"
echo "What is needed?"
echo "1: Offline snapshots of VCs/PSCs"
echo "2: SSO Admin Password"
echo "IMPORTANT: This script should only be run on a single PSC per SSO domain"

mkdir -p /tmp/vmware-fixsts
SCRIPTPATH="/tmp/vmware-fixsts"
LOGFILE="$SCRIPTPATH/fix_sts_cert.log"

echo "==================================" | tee -a $LOGFILE
echo "Resetting STS certificate for $HOSTNAME started on $(date)" | tee -a $LOGFILE
echo ""| tee -a $LOGFILE
echo ""
DN=$(/opt/likewise/bin/lwregshell list_values '[HKEY_THIS_MACHINE\Services\vmdir]' | grep dcAccountDN | awk '{$1=$2=$3="";print $0}'|tr -d '"'|sed -e 's/^[ \t]*//')
echo "Detected DN: $DN" | tee -a $LOGFILE
PNID=$(/opt/likewise/bin/lwregshell list_values '[HKEY_THIS_MACHINE\Services\vmafd\Parameters]' | grep PNID | awk '{print $4}'|tr -d '"')
echo "Detected PNID: $PNID" | tee -a $LOGFILE
PSC=$(/opt/likewise/bin/lwregshell list_values '[HKEY_THIS_MACHINE\Services\vmafd\Parameters]' | grep DCName | awk '{print $4}'|tr -d '"')
echo "Detected PSC: $PSC" | tee -a $LOGFILE
DOMAIN=$(/opt/likewise/bin/lwregshell list_values '[HKEY_THIS_MACHINE\Services\vmafd\Parameters]' | grep DomainName | awk '{print $4}'|tr -d '"')
echo "Detected SSO domain name: $DOMAIN" | tee -a $LOGFILE
SITE=$(/opt/likewise/bin/lwregshell list_values '[HKEY_THIS_MACHINE\Services\vmafd\Parameters]' | grep SiteName | awk '{print $4}'|tr -d '"')
MACHINEID=$(/usr/lib/vmware-vmafd/bin/vmafd-cli get-machine-id --server-name localhost)
echo "Detected Machine ID: $MACHINEID" | tee -a $LOGFILE
IPADDRESS=$(ifconfig | grep eth0 -A1 | grep "inet addr" | awk -F ':' '{print $2}' | awk -F ' ' '{print $1}')
echo "Detected IP Address: $IPADDRESS" | tee -a $LOGFILE
DOMAINCN="dc=$(echo "$DOMAIN" | sed 's/\./,dc=/g')"
echo "Domain CN: $DOMAINCN"
ADMIN="cn=administrator,cn=users,$DOMAINCN"
USERNAME="administrator@${DOMAIN^^}"
ROOTCERTDATE=$(openssl x509  -in /var/lib/vmware/vmca/root.cer -text | grep "Not After" | awk -F ' ' '{print $7,$4,$5}')
TODAYSDATE=$(date +"%Y %b %d")

echo "#" > $SCRIPTPATH/certool.cfg
echo "# Template file for a CSR request" >> $SCRIPTPATH/certool.cfg
echo "#" >> certool.cfg
echo "# Country is needed and has to be 2 characters" >> $SCRIPTPATH/certool.cfg
echo "Country = DS" >> $SCRIPTPATH/certool.cfg
echo "Name = $PNID" >> $SCRIPTPATH/certool.cfg
echo "Organization = VMware" >> $SCRIPTPATH/certool.cfg
echo "OrgUnit = VMware" >> $SCRIPTPATH/certool.cfg
echo "State = VMware" >> $SCRIPTPATH/certool.cfg
echo "Locality = VMware" >> $SCRIPTPATH/certool.cfg
echo "IPAddress = $IPADDRESS" >> $SCRIPTPATH/certool.cfg
echo "Email = email@acme.com" >> $SCRIPTPATH/certool.cfg
echo "Hostname = $PNID" >> $SCRIPTPATH/certool.cfg

echo "==================================" | tee -a $LOGFILE
echo "==================================" | tee -a $LOGFILE
echo ""
echo "Detected Root's certificate expiration date: $ROOTCERTDATE" | tee -a $LOGFILE
echo "Detected today's date: $TODAYSDATE" | tee -a $LOGFILE

echo "==================================" | tee -a $LOGFILE

flag=0
if [[ $TODAYSDATE > $ROOTCERTDATE ]];
then
    echo "IMPORTANT: Root certificate is expired, so it will be replaced" | tee -a $LOGFILE
    flag=1
    mkdir /certs && cd /certs
    cp $SCRIPTPATH/certool.cfg /certs/vmca.cfg
    /usr/lib/vmware-vmca/bin/certool --genselfcacert --outprivkey /certs/vmcacert.key  --outcert /certs/vmcacert.crt --config /certs/vmca.cfg
    /usr/lib/vmware-vmca/bin/certool --rootca --cert /certs/vmcacert.crt --privkey /certs/vmcacert.key
fi

echo "#" > $SCRIPTPATH/certool.cfg
echo "# Template file for a CSR request" >> $SCRIPTPATH/certool.cfg
echo "#" >> $SCRIPTPATH/certool.cfg
echo "# Country is needed and has to be 2 characters" >> $SCRIPTPATH/certool.cfg
echo "Country = DS" >> $SCRIPTPATH/certool.cfg
echo "Name = STS" >> $SCRIPTPATH/certool.cfg
echo "Organization = VMware" >> $SCRIPTPATH/certool.cfg
echo "OrgUnit = VMware" >> $SCRIPTPATH/certool.cfg
echo "State = VMware" >> $SCRIPTPATH/certool.cfg
echo "Locality = VMware" >> $SCRIPTPATH/certool.cfg
echo "IPAddress = $IPADDRESS" >> $SCRIPTPATH/certool.cfg
echo "Email = email@acme.com" >> $SCRIPTPATH/certool.cfg
echo "Hostname = $PNID" >> $SCRIPTPATH/certool.cfg

echo ""
echo "Exporting and generating STS certificate" | tee -a $LOGFILE
echo ""

cd $SCRIPTPATH

/usr/lib/vmware-vmca/bin/certool --server localhost --genkey --privkey=sts.key --pubkey=sts.pub
/usr/lib/vmware-vmca/bin/certool --gencert --cert=sts.cer --privkey=sts.key --config=$SCRIPTPATH/certool.cfg

openssl x509 -outform der -in sts.cer -out sts.der
CERTS=$(csplit -f root /var/lib/vmware/vmca/root.cer '/-----BEGIN CERTIFICATE-----/' '{*}' | wc -l)
openssl pkcs8 -topk8 -inform pem -outform der -in sts.key -out sts.key.der -nocrypt
i=1
until [ $i -eq $CERTS ]
do
    openssl x509 -outform der -in root0$i -out vmca0$i.der
    ((i++))
done

echo ""
echo ""
read -s -p "Enter password for administrator@$DOMAIN: " DOMAINPASSWORD
echo ""

# Find the highest tenant credentials index
MAXCREDINDEX=1
while read -r line
do
    INDEX=$(echo "$line" | tr -dc '0-9')
    if [ $INDEX -gt $MAXCREDINDEX ]
    then
        MAXCREDINDEX=$INDEX
    fi
done < <(/opt/likewise/bin/ldapsearch -h localhost -p 389 -b "cn=$DOMAIN,cn=Tenants,cn=IdentityManager,cn=Services,$DOMAINCN" -D "cn=administrator,cn=users,$DOMAINCN" -w "$DOMAINPASSWORD" "(objectclass=vmwSTSTenantCredential)" cn | grep cn:)

# Sequentially search for tenant credentials up to max index  and delete if found
echo "Highest tenant credentials index : $MAXCREDINDEX" | tee -a $LOGFILE
i=1
if [ ! -z $MAXCREDINDEX ]
then
    until [ $i -gt $MAXCREDINDEX ]
    do
        echo "Exporting tenant $i to $SCRIPTPATH" | tee -a $LOGFILE
        echo ""
        ldapsearch -h localhost -D "cn=administrator,cn=users,$DOMAINCN" -w "$DOMAINPASSWORD" -b "cn=TenantCredential-$i,cn=$DOMAIN,cn=Tenants,cn=IdentityManager,cn=Services,$DOMAINCN" > $SCRIPTPATH/tenantcredential-$i.ldif
                if [ $? -eq 0 ]
                then
                    echo "Deleting tenant $i" | tee -a $LOGFILE
                        ldapdelete -h localhost -D "cn=administrator,cn=users,$DOMAINCN" -w "$DOMAINPASSWORD" "cn=TenantCredential-$i,cn=$DOMAIN,cn=Tenants,cn=IdentityManager,cn=Services,$DOMAINCN" | tee -a $LOGFILE
                else
                    echo "Tenant $i not found" | tee -a $LOGFILE
                    echo ""
                fi
                ((i++))
                done
fi
echo ""

# Find the highest trusted cert chains index
MAXCERTCHAINSINDEX=1
while read -r line
do
    INDEX=$(echo "$line" | tr -dc '0-9')
    if [ $INDEX -gt $MAXCERTCHAINSINDEX ]
    then
        MAXCERTCHAINSINDEX=$INDEX
    fi
done < <(/opt/likewise/bin/ldapsearch -h localhost -p 389 -b "cn=$DOMAIN,cn=Tenants,cn=IdentityManager,cn=Services,$DOMAINCN" -D "cn=administrator,cn=users,$DOMAINCN" -w "$DOMAINPASSWORD" "(objectclass=vmwSTSTenantTrustedCertificateChain)" cn | grep cn:)

# Sequentially search for trusted cert chains up to max index  and delete if found
echo "Highest trusted cert chains index: $MAXCERTCHAINSINDEX" | tee -a $LOGFILE
i=1
if [ ! -z $MAXCERTCHAINSINDEX ]
then
    until [ $i -gt $MAXCERTCHAINSINDEX ]
    do
            echo "Exporting trustedcertchain $i to $SCRIPTPATH" | tee -a $LOGFILE
            echo ""
                ldapsearch -h localhost -D "cn=administrator,cn=users,$DOMAINCN" -w "$DOMAINPASSWORD" -b "cn=TrustedCertChain-$i,cn=TrustedCertificateChains,cn=$DOMAIN,cn=Tenants,cn=IdentityManager,cn=Services,$DOMAINCN" > $SCRIPTPATH/trustedcertchain-$i.ldif
            if [ $? -eq 0 ]
            then
                echo "Deleting trustedcertchain $i" | tee -a $LOGFILE
                ldapdelete -h localhost -D "cn=administrator,cn=users,$DOMAINCN" -w "$DOMAINPASSWORD" "cn=TrustedCertChain-$i,cn=TrustedCertificateChains,cn=$DOMAIN,cn=Tenants,cn=IdentityManager,cn=Services,$DOMAINCN" | tee -a $LOGFILE
            else
                echo "Trusted cert chain $i not found" | tee -a $LOGFILE
            fi
            echo ""
                ((i++))
                done
fi
echo ""

i=1
echo "dn: cn=TenantCredential-1,cn=$DOMAIN,cn=Tenants,cn=IdentityManager,cn=Services,$DOMAINCN" > sso-sts.ldif
echo "changetype: add" >> sso-sts.ldif
echo "objectClass: vmwSTSTenantCredential" >> sso-sts.ldif
echo "objectClass: top" >> sso-sts.ldif
echo "cn: TenantCredential-1" >> sso-sts.ldif
echo "userCertificate:< file:sts.der" >> sso-sts.ldif
until [ $i -eq $CERTS ]
do
    echo "userCertificate:< file:vmca0$i.der" >> sso-sts.ldif
    ((i++))
done
echo "vmwSTSPrivateKey:< file:sts.key.der" >> sso-sts.ldif
echo "" >> sso-sts.ldif
echo "dn: cn=TrustedCertChain-1,cn=TrustedCertificateChains,cn=$DOMAIN,cn=Tenants,cn=IdentityManager,cn=Services,$DOMAINCN" >> sso-sts.ldif
echo "changetype: add" >> sso-sts.ldif
echo "objectClass: vmwSTSTenantTrustedCertificateChain" >> sso-sts.ldif
echo "objectClass: top" >> sso-sts.ldif
echo "cn: TrustedCertChain-1" >> sso-sts.ldif
echo "userCertificate:< file:sts.der" >> sso-sts.ldif
i=1
until [ $i -eq $CERTS ]
do
    echo "userCertificate:< file:vmca0$i.der" >> sso-sts.ldif
    ((i++))
done
echo ""
echo "Applying newly generated STS certificate to SSO domain" | tee -a $LOGFILE

/opt/likewise/bin/ldapmodify -x -h localhost -p 389 -D "cn=administrator,cn=users,$DOMAINCN" -w "$DOMAINPASSWORD" -f sso-sts.ldif | tee -a $LOGFILE
echo ""
echo "Replacement finished - Please restart services on all vCenters and PSCs in your SSO domain" | tee -a $LOGFILE
echo "==================================" | tee -a $LOGFILE
echo "IMPORTANT: In case you're using HLM (Hybrid Linked Mode) without a gateway, you would need to re-sync the certs from Cloud to On-Prem after following this procedure" | tee -a $LOGFILE
echo "==================================" | tee -a $LOGFILE
echo "==================================" | tee -a $LOGFILE
if [ $flag == 1 ]
then
    echo "Since your Root certificate was expired and was replaced, you will need to replace your MachineSSL and Solution User certificates" | tee -a $LOGFILE
    echo "You can do so following this KB: https://kb.vmware.com/s/article/2097936" | tee -a $LOGFILE
fi

4 – chmod the script

chmod a+x fixsts.sh

5 – Run the script! Note the first authentication prompt will require the ROOT login (the same one you used for SSH). The second prompt will require the ADMINISTRATOR@VSPHERE.LOCAL user. Don’t be like me and enter the root user’s password, then wonder why it didn’t work.

./fixsts.sh

6 – Restart the vCenter services; no reboot required.

service-control --stop --all && service-control --start --all

If all goes well, the script should run through every step without any errors.

Then run the service restart – vCenter should come back online shortly after it completes.

Time for some housecleaning

What do you know, it’s been 4 years since I last wrote here. Believe it or not, I had a great many plans and posts prepared! I just never quite got around to them… Much to my surprise, my little site still gets quite a few visits, seemingly from folks very interested in how I use dbatools to improve patching or how I install SQL Server unattended! Thanks folks for all the visits over the years, hopefully you found the content useful!

But it shouldn’t surprise anyone that in the years since I’ve written these posts, I’ve continued to work on those tools, as well as written other tools (hello Poshbot — to be clear, I didn’t write Poshbot, but I sure wrote a lot of stuff that uses it). It’s time I reward all the people that have come looking for these posts with up-to-date stuff, so those should be coming up in the near future.

My life has become quite a bit busier in the last few years — I was already dabbling in virtualization, but now I’m not only a DBA, I’m also an infrastructure manager! I picked up a few new responsibilities, including VMware, BIG-IP, storage, patching, enterprise backups, disaster recovery, etc. I probably have a few things to talk about on those topics.

I also decided to make more room on my site for other things besides work. Along with that comes a bit of a site redesign, but for some reason Edge (don’t @ me) is struggling with it. Don’t worry, I’m reliably told that no one besides me uses Edge, so you probably didn’t even notice!

Using dbatools to improve your patching process

One of the key responsibilities of a production DBA is keeping SQL Servers patched. You have to have a good process in place, though; it won’t do you any good to apply a patch that just broke your application.


How do I get started?

Generally speaking, you’re either running an in-house application or a 3rd-party one. 3rd party is relatively easy! You should always check with your vendor first to see what their supported level is. They might only specify a major version (2012, 2016, etc.), or they might require a minimum Service Pack or CU level. Once you have that information, spin up a test server, patch it and test it. I like to give it plenty of baking time, at least a week, before I’m comfortable with the patch.

An in-house application takes a bit more work, but it should also (hopefully) have more eyes on it. You need to get that patch through every pre-prod environment before you can even begin to think of production. My workflow will usually follow the same path as the application itself:

Dev -> QA -> Stage -> Prod

I still take a week between each pre-prod environment, all the while communicating with the dev and QA teams so they can report back on any issues they notice. Ideally you also have metrics in place to help you spot drops in performance, where you can check against your per-environment baseline.

Taking inventory

Your first step is to figure out where things are at, and organize them in some fashion. I have lots of teams, each with their own environments, and that’s a lot of servers! I’m a huge fan of SQL Server Central Management Servers and Registered Servers; they’re a must-have if you have more than a few servers. I try to keep the organization simple, like this:


Obviously, you can group this however you want! I’ve learned that a smaller number of folders makes it easier to find things quickly. It’s also important to keep your list up-to-date! It may seem like extra work, but it really pays off, particularly for the topic at hand.

dbatools to the rescue

Now that you have your servers organized, it’s time to collect each server’s patch level. First we create a table to store all the information:

CREATE TABLE [dbo].[PatchLevels] (
    [SqlInstance] [NVARCHAR](MAX) NULL,
    [Build] [NVARCHAR](MAX) NULL,
    [NameLevel] [NVARCHAR](MAX) NULL,
    [SPLevel] [NVARCHAR](MAX) NULL,
    [CULevel] [NVARCHAR](MAX) NULL,
    [KBLevel] [NVARCHAR](MAX) NULL,
    [BuildLevel] [NVARCHAR](MAX) NULL,
    [SupportedUntil] [DATETIME2] NULL,
    [MatchType] [NVARCHAR](MAX) NULL,
    [Warning] [NVARCHAR](MAX) NULL,
    [Compliant] [BIT] NULL,
    [MinimumBuild] [NVARCHAR](MAX) NULL,
    [MaxBehind] [NVARCHAR](MAX) NULL,
    [SPTarget] [NVARCHAR](MAX) NULL,
    [CUTarget] [NVARCHAR](MAX) NULL,
    [BuildTarget] [NVARCHAR](MAX) NULL
)

Now to the good stuff! I absolutely love dbatools, and cannot express how grateful I am for the work they’ve done; it’s saved me many hours of work and frustration. I just run this script to update the PatchLevels table:

$datatable = Get-DbaRegServer -SqlInstance <your central management server> | Test-DbaBuild -MaxBehind "2CU"
Write-DbaDbTableData -SqlInstance <your DBA server> -InputObject $datatable -Table dbo.PatchLevels -Database DBA -AutoCreateTable -Truncate
  1. First we create a variable $datatable to store the results of Get-DbaRegServer — this is your registered server list.
  2. Next up we pipe those servers through Test-DbaBuild, which gives us each server’s build level. -MaxBehind flags any server more than 2 Cumulative Updates (CUs) behind as not compliant.
  3. We then call on Write-DbaDbTableData to save the contents of our variable. It doesn’t apply a lot of smarts to the table creation, which is why the columns are all nvarchar(max), but you can definitely optimize the data types and this will still work.

The final result is one row per instance in the PatchLevels table, showing its current build and whether it’s compliant with the target.
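From there, a quick query surfaces the instances that need attention. This is just a minimal sketch against the PatchLevels table created above (I’m assuming it lives in a database called DBA, as in the Write-DbaDbTableData call):

SELECT SqlInstance,
       NameLevel,
       BuildLevel,
       BuildTarget,
       SupportedUntil
FROM DBA.dbo.PatchLevels
WHERE Compliant = 0        -- flagged by Test-DbaBuild as too far behind
ORDER BY SqlInstance;

Anything that comes back is a candidate for the next patching window.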

Look ma, no RDP

This is one of the things I still get giddy about — the actual patching, without needing to log in to each server. Like everything in life, there are a few caveats:

  • We keep all our installation media and patches on a network share, organized by major version, so the script needs to run once for each major version I support. I make that easier by just using a variable.
  • The Update-DbaInstance command can’t use your logged-in credentials, so you have to prompt for them.

And here’s the actual script:

#create a list of servers that you want to patch. Change the group name to match your target
$ServerList = Get-DbaRegServer -SqlInstance mssqlmgmt -Group QA

#create a variable to pass in to the Update-DbaInstance command; this will prompt you for your password
$cred = Get-Credential

#Set the version that you want to update to
$version = '2017'
$UpdatePath = '\\<some server>\SQL\' + $version + '\updates'

#Start Patching! The -Restart option will allow it to restart the SQL Server as needed
Update-DbaInstance -ComputerName $ServerList -Path $UpdatePath -Credential $cred -Version $version -Restart
 

And that’s it! Like, really, that’s it. Just this last weekend, I updated 17 servers across 2014, 2016 and 2017 in about 30 minutes, most of which I spent playing video games while I waited for the patches to finish installing. The installs are not serialized; they run in parallel, so that’s part of the magic.

Once the updates are complete, I just rerun the script above to refresh the PatchLevels table. Then it’s time to let the proper teams know that the updates are done.

Thanks for reading! Here’s a few more links for your enjoyment:

dbatools website

SQL Server builds – a wonderful list of every SQL Server version and build number, along with links to their respective updates

Other things I’ve written

I’ve been gone for a little while

And see what happened! A bunch of people came to visit, but I wasn’t home. How rude of me…

Well, I haven’t been idle! Quite the opposite, which is why this site has been rather neglected lately. Here’s a bit of what I’ve been up to professionally:

  • Driving database automation with SSDT
  • Implementing auditing all over the place
  • Automating all sorts of things
  • Salesforce integration

And on the non-work side of life too:

  • 3 ranks of kung-fu (white -> yellow -> orange, and now working on my purple)
  • Raising my own chickens!
  • Camping (a lot of camping)

I find that the secret to all of this is finding balance — deciding where your priorities are and allocating time so one thing is not needlessly taking more time/effort than others. This is a continuous process, and it takes time to get good at it. A few months back, I had to find balance, and that meant not blogging. Now I’ve done some more balancing, and it means time to write again! So if you’re reading this, thank you for taking the time, and stay tuned for more.

–Marcio

Clean up the SSIS database

SQL Server Integration Services (or SSIS) is a very powerful service and set of tools to import data to and from your database. Some companies will have entire teams dedicated to writing SSIS packages (deployed as ISPAC projects) to perform their ETL (Extract-Transform-Load) processes.

As part of your SSIS maintenance, there are a few things you need to keep an eye on:

  • Is your SSIS database in simple or full recovery? I keep mine in simple, but this is a bit of a personal preference — if I lose my database, I’d just redeploy my ISPACs.
  • Are your maintenance jobs running? — This is really important! Your ISPACs generate all sorts of logging information, and the more packages you have, the more logging. Without the maintenance, your database will balloon out of control with all that logging.
  • How big is the retention window? — You can configure the amount of time that logging data is kept. I’ll decrease it depending on the number of packages I have in the store and how often they run (see the sketch after this list). Disabling the logging does not actually disable all logging! A few other things will still generate logging data.
  • Is the database huge anyway? — Unfortunately, the maintenance job doesn’t do that good of a job at keeping up with growth! There are a few things that are not actually removed by the default maintenance job, but they grow anyway.
The default retention period is 1 whole year! That’s a bit excessive in my opinion…
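
If you’d rather check and change those settings with a query than click through SSMS, the catalog exposes them as properties. A minimal sketch (the 30-day value is just an example, not a recommendation):

-- See the current cleanup settings
SELECT property_name, property_value
FROM SSISDB.catalog.catalog_properties
WHERE property_name IN (N'RETENTION_WINDOW', N'OPERATION_CLEANUP_ENABLED');

-- Shorten the retention window (in days) and make sure the cleanup is enabled
EXEC SSISDB.catalog.configure_catalog @property_name = N'RETENTION_WINDOW', @property_value = 30;
EXEC SSISDB.catalog.configure_catalog @property_name = N'OPERATION_CLEANUP_ENABLED', @property_value = 1;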

So, as much as I try to keep up with the maintenance, sometimes there’s only one thing to be done: nuke the whole thing! Especially if the defaults were never changed.

Of course I had to use this picture. It’s a classic!

I’m sure there’s a variety of ways of cleaning up the catalog, but this so far is my favorite. It manages to remove all the logging information without any operational impact. At worst, I’ve had to stop the SSIS service long enough for the operation to complete. I don’t necessarily recommend shrinking the data file afterwards (unless you really need to!), because your logs are likely to grow back anyway, based on your day-to-day volumes. Credit for this beautiful script goes to Ibrahim Naji (home | twitter) and the original post is here.
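
Before nuking anything, it’s worth seeing where the space actually went. Here’s a quick sketch (not from the original post) that lists the biggest tables in the internal schema; run it inside SSISDB before and after the cleanup:

USE SSISDB;

SELECT TOP (10)
       s.name + N'.' + t.name AS table_name,
       SUM(CASE WHEN ps.index_id IN (0, 1) THEN ps.row_count ELSE 0 END) AS row_count,
       SUM(ps.reserved_page_count) / 128 AS reserved_mb
FROM sys.dm_db_partition_stats AS ps
     INNER JOIN sys.tables  AS t ON t.object_id = ps.object_id
     INNER JOIN sys.schemas AS s ON s.schema_id = t.schema_id
WHERE s.name = N'internal'
GROUP BY s.name, t.name
ORDER BY reserved_mb DESC;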

Now for the script:

USE SSISDB
go

CREATE PROCEDURE [internal].[cleanup_server_retention_window_truncateall]
--WITH EXECUTE AS 'AllSchemaOwner'
AS 
   
----- STRUCTURE OF TABLES:

--[internal].[operations]
--  [internal].[executions]
--      [internal].[executable_statistics]
--      [internal].[execution_component_phases]
--      [internal].[execution_data_statistics]
--      [internal].[execution_data_taps]
--      [internal].[execution_parameter_values]
--      [internal].[execution_property_override_values]
--  [internal].[extended_operation_info]
--  [internal].[operation_messages] --(DATA HEAVY)
--      [internal].[event_messages] --(DATA HEAVY)
--          [internal].[event_message_context]
--  [internal].[operation_os_sys_info]
--  [internal].[operation_permissions]
--  [internal].[validations]

    SET NOCOUNT ON
   
    DECLARE @enable_clean_operation bit
    DECLARE @retention_window_length INT
   
    DECLARE @caller_name nvarchar(256)
    DECLARE @caller_sid  varbinary(85)
    DECLARE @operation_id BIGINT
   
    EXECUTE AS CALLER
        SET @caller_name =  SUSER_NAME()
        SET @caller_sid =   SUSER_SID()
    REVERT
         
   
    BEGIN TRY
        SELECT @enable_clean_operation = CONVERT(bit, property_value)
            FROM [catalog].[catalog_properties]
            WHERE property_name = 'OPERATION_CLEANUP_ENABLED'
       
        IF @enable_clean_operation = 1
        BEGIN
            SELECT @retention_window_length = CONVERT(INT,property_value)  
                FROM [catalog].[catalog_properties]
                WHERE property_name = 'RETENTION_WINDOW'
               
            IF @retention_window_length <= 0
            BEGIN
                RAISERROR(27163    ,16,1,'RETENTION_WINDOW')
            END
           
            INSERT INTO [internal].[operations] (
                [operation_type],  
                [created_time],
                [object_type],
                [object_id],
                [object_name],
                [STATUS],
                [start_time],
                [caller_sid],
                [caller_name]
                )
            VALUES (
                2,
                SYSDATETIMEOFFSET(),
                NULL,                    
                NULL,                    
                NULL,                    
                1,      
                SYSDATETIMEOFFSET(),
                @caller_sid,            
                @caller_name            
                )
            SET @operation_id = SCOPE_IDENTITY()






            -- Remove all [internal].[executions] dependencies
            TRUNCATE TABLE [internal].[executable_statistics]
            TRUNCATE TABLE [internal].[execution_component_phases]
            TRUNCATE TABLE [internal].[execution_data_statistics]
            TRUNCATE TABLE [internal].[execution_data_taps]
            TRUNCATE TABLE [internal].[execution_parameter_values]
            TRUNCATE TABLE [internal].[execution_property_override_values]


            -- Remove all [internal].[event_message_context] dependencies
            TRUNCATE TABLE [internal].[event_message_context]

            -- Remove all non-dependent tables
            TRUNCATE TABLE [internal].[operation_os_sys_info]
            TRUNCATE TABLE [internal].[operation_permissions]
            TRUNCATE TABLE [internal].[validations]
            TRUNCATE TABLE [internal].[extended_operation_info]

            -- Deal with [internal].[event_messages] and [internal].[operation_messages]
            ALTER TABLE [internal].[event_message_context] DROP CONSTRAINT [FK_EventMessageContext_EventMessageId_EventMessages]
       
            TRUNCATE TABLE internal.event_messages
       
            ALTER TABLE [internal].[event_message_context]  WITH CHECK ADD  CONSTRAINT [FK_EventMessageContext_EventMessageId_EventMessages] FOREIGN KEY([event_message_id])
            REFERENCES [internal].[event_messages] ([event_message_id])
            ON DELETE CASCADE

            ALTER TABLE [internal].[event_messages] DROP CONSTRAINT [FK_EventMessages_OperationMessageId_OperationMessage]
       
            TRUNCATE TABLE [internal].[operation_messages]

            ALTER TABLE [internal].[event_messages]  WITH CHECK ADD  CONSTRAINT [FK_EventMessages_OperationMessageId_OperationMessage] FOREIGN KEY([event_message_id])
            REFERENCES [internal].[operation_messages] ([operation_message_id])
            ON DELETE CASCADE

            -- Deal with [internal].[executions]

            ALTER TABLE [internal].[executable_statistics] DROP CONSTRAINT [FK_ExecutableStatistics_ExecutionId_Executions]
            ALTER TABLE [internal].[execution_component_phases] DROP CONSTRAINT [FK_ExecCompPhases_ExecutionId_Executions]
            ALTER TABLE [internal].[execution_data_statistics] DROP CONSTRAINT [FK_ExecDataStat_ExecutionId_Executions]
            ALTER TABLE [internal].[execution_data_taps] DROP CONSTRAINT [FK_ExecDataTaps_ExecutionId_Executions]
            ALTER TABLE [internal].[execution_parameter_values] DROP CONSTRAINT [FK_ExecutionParameterValue_ExecutionId_Executions]
            ALTER TABLE [internal].[execution_property_override_values] DROP CONSTRAINT [FK_ExecutionPropertyOverrideValue_ExecutionId_Executions]

            TRUNCATE TABLE [internal].[executions]

            ALTER TABLE [internal].[execution_property_override_values]  WITH CHECK ADD  CONSTRAINT [FK_ExecutionPropertyOverrideValue_ExecutionId_Executions] FOREIGN KEY([execution_id])
            REFERENCES [internal].[executions] ([execution_id])
            ON DELETE CASCADE

            ALTER TABLE [internal].[execution_parameter_values]  WITH CHECK ADD  CONSTRAINT [FK_ExecutionParameterValue_ExecutionId_Executions] FOREIGN KEY([execution_id])
            REFERENCES [internal].[executions] ([execution_id])
            ON DELETE CASCADE

            ALTER TABLE [internal].[execution_data_taps]  WITH CHECK ADD  CONSTRAINT [FK_ExecDataTaps_ExecutionId_Executions] FOREIGN KEY([execution_id])
            REFERENCES [internal].[executions] ([execution_id])
            ON DELETE CASCADE

            ALTER TABLE [internal].[execution_data_statistics]  WITH CHECK ADD  CONSTRAINT [FK_ExecDataStat_ExecutionId_Executions] FOREIGN KEY([execution_id])
            REFERENCES [internal].[executions] ([execution_id])
            ON DELETE CASCADE
       
            ALTER TABLE [internal].[execution_component_phases]  WITH CHECK ADD  CONSTRAINT [FK_ExecCompPhases_ExecutionId_Executions] FOREIGN KEY([execution_id])
            REFERENCES [internal].[executions] ([execution_id])
            ON DELETE CASCADE
       
            ALTER TABLE [internal].[executable_statistics]  WITH CHECK ADD  CONSTRAINT [FK_ExecutableStatistics_ExecutionId_Executions] FOREIGN KEY([execution_id])
            REFERENCES [internal].[executions] ([execution_id])
            ON DELETE CASCADE
       

            -- Deal with [internal].[operations]
            DECLARE @deleted_ops TABLE(operation_id BIGINT, operation_type SMALLINT)

            DELETE --TOP (@delete_batch_size)
            FROM [internal].[operations]
            OUTPUT DELETED.operation_id, DELETED.operation_type INTO @deleted_ops
            WHERE operation_id != @operation_id

           
           
            DECLARE @execution_id BIGINT
            DECLARE @sqlString              nvarchar(1024)
            DECLARE @key_name               [internal].[adt_name]
            DECLARE @certificate_name       [internal].[adt_name]
           
           
            DECLARE execution_cursor CURSOR LOCAL FOR
                SELECT operation_id FROM @deleted_ops
                WHERE operation_type = 200
           
            OPEN execution_cursor
            FETCH NEXT FROM execution_cursor INTO @execution_id
           
            WHILE @@FETCH_STATUS = 0
            BEGIN
                SET @key_name = 'MS_Enckey_Exec_'+CONVERT(VARCHAR,@execution_id)
                SET @certificate_name = 'MS_Cert_Exec_'+CONVERT(VARCHAR,@execution_id)
                SET @sqlString = 'IF EXISTS (SELECT name FROM sys.symmetric_keys WHERE name = ''' + @key_name +''') '
                    +'DROP SYMMETRIC KEY '+ @key_name
                    EXECUTE sp_executesql @sqlString
                SET @sqlString = 'IF EXISTS (select name from sys.certificates WHERE name = ''' + @certificate_name +''') '
                    +'DROP CERTIFICATE '+ @certificate_name
                    EXECUTE sp_executesql @sqlString
                FETCH NEXT FROM execution_cursor INTO @execution_id
            END
            CLOSE execution_cursor
            DEALLOCATE execution_cursor

            END
    END TRY
    BEGIN CATCH
       
       
        IF (CURSOR_STATUS('local', 'execution_cursor') = 1
            OR CURSOR_STATUS('local', 'execution_cursor') = 0)
        BEGIN
            CLOSE execution_cursor
            DEALLOCATE execution_cursor            
        END
       
        UPDATE [internal].[operations]
            SET [STATUS] = 4,
            [end_time] = SYSDATETIMEOFFSET()
            WHERE [operation_id] = @operation_id;      
        THROW
    END CATCH
   
    RETURN 0

Now to execute the thing

Use ssisdb
Exec [internal].[cleanup_server_retention_window_truncateall]

Other links!

More logging fun!

Get your object sizes

Remove replication objects from your database

As part of your DBA duties, you’re probably restoring your production databases onto a test server. Then one day you notice that your log volume is getting pretty close to full! You take a look in the transaction log folder and notice that the log file is even bigger than the data file. You did remember to set the recovery model to simple, so why is the log file so big?

Naturally, you run DBCC SHRINKFILE against the log file, but the file doesn’t shrink. You take a peek at sys.databases, and the LOG_REUSE_WAIT_DESC column says “REPLICATION”.

“It’s waiting for replication? Odd, I haven’t set up replication on this test server!”
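
You can confirm what the log is waiting on straight from sys.databases with a quick check like this (the database name is a placeholder):

SELECT name,
       recovery_model_desc,
       log_reuse_wait_desc
FROM sys.databases
WHERE name = N'YourRestoredDatabase';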

A bit of background

Internally, a database log file is divided into smaller blocks called virtual log files (VLFs). During the normal course of events, a VLF is filled with transactions, and certain flags control when a VLF can be overwritten. The whole topic is super complex, so for a much more detailed explanation, read here.

When a database is set up as a replication publisher, one of the VLF “don’t overwrite me, bro!” flags is held in place until the replication log reader has had a chance to read the contents of the VLF and copy them into the distribution database (again, this is over-simplifying things).

In this case, the database was backed up while those flags were still in place, so the flags were restored onto the test server! Since the test server isn’t replicated, there’s nothing actively reading the VLFs and clearing the flags, so the VLFs are never allowed to be reused. This is why log_reuse_wait_desc is set to “REPLICATION”, and the result is that the log will just grow forever (or until you run out of space, whichever comes first).

The easiest fix is to run the command below — it’ll remove all the replication objects, flags, etc. from the database. After that, your shrink should work as expected:

USE master
EXEC sys.sp_removedbreplication @dbname = 'your database name', @type ='both'

Other links!

Write a backup report!

Shrink your log files painlessly

All the advice you’ve heard out there is true: it’s not a good idea to regularly shrink your files! If you find yourself in that situation, you probably have other problems to deal with, such as:

  • You might need more storage!
  • Data files are supposed to grow (in most cases)! If they’re growing faster than normal, however, you should find out why, and plan accordingly (and you do have growth baselines, right?)
  • Shrinks cause fragmentation!
  • Shrinks cause unnecessary I/O (because the files are just going to grow again).

Unfortunately, sometimes problems do happen, and now you have a fire to put out. While I have taken the time to size my volumes correctly, sometimes we’ll have a large data load, bulk update, etc. Over time, my log volume will be almost full, and one large update away from filling up. Not good!

Another situation I might run into is when I’m building a new test server using prod backups. The log file in that restore will be as big as it was when the backup took place, and the dev server may not have been sized as big as the prod server (or maybe it was, but as of 2 years ago), so now I can’t fit the log files!

Or you can use a Size-Ray on your database. It worked for Lois Lane!

For those situations, I wrote the script below. It’s not anything super fancy; it’ll just go through each of the user databases (anything that isn’t one of the system databases) and issue a TRUNCATE_ONLY. What that does is instruct SQL to return any empty space that’s already available at the end of the log file back to the operating system. If you set the parameter @ChangetoSimple to 1, it’ll change the recovery model to simple before it issues the truncate_only. The ins and outs of log space usage are a topic for a different post, but for my purposes here, all it means is that it’ll free up space in the log file.

The big-letter caveat is DO NOT SET THIS TO 1 IN PRODUCTION IF YOU DEPEND ON LOG BACKUPS! It will break your log backup chain, and you’ll need to start the chain again with a full backup (see the sketch after the script). I usually only change things to simple in non-production environments.

USE master;

DECLARE @dbname      sysname,
        @logicalname sysname,
        @sql         VARCHAR(200);

DECLARE @ChangetoSimple INT = 0;
DECLARE @simple VARCHAR(200) = ' ';

SELECT name,
       log_reuse_wait_desc
FROM sys.databases
ORDER BY name;

DECLARE list CURSOR FOR
SELECT DISTINCT
       DB_NAME(mf.database_id),
       mf.name
FROM sys.master_files mf
    INNER JOIN sys.databases db
        ON mf.database_id = db.database_id
WHERE DB_NAME(mf.database_id) NOT IN ( 'master', 'tempdb', 'distribution', 'msdb', 'model' )
      AND mf.type_desc = 'log';

OPEN list;

FETCH NEXT FROM list
INTO @dbname,
     @logicalname;
WHILE @@FETCH_STATUS = 0
    BEGIN
        IF (@ChangetoSimple) = 1
            SELECT @simple = 'alter database [' + @dbname + '] set recovery simple';

        SET @sql
            = 'use [' + @dbname + ']; ' + @simple + '  DBCC shrinkfile (N''' + @logicalname
              + ''',0, truncateonly);
		';
        PRINT @sql;
        FETCH NEXT FROM list
        INTO @dbname,
             @logicalname;
        BEGIN TRY
            EXEC (@sql);
            PRINT @sql;
        END TRY
        BEGIN CATCH
            SELECT ERROR_NUMBER(),
                   ERROR_SEVERITY(),
                   ERROR_MESSAGE();
        END CATCH;
    END;
CLOSE list;
DEALLOCATE list;
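
And since I mentioned the log backup chain: if you do flip a database back to full recovery, the chain only starts again after a full backup. A hedged sketch of what that looks like (database name and paths are placeholders):

ALTER DATABASE [YourDatabase] SET RECOVERY FULL;

-- Log backups only become meaningful again after this full backup completes
BACKUP DATABASE [YourDatabase]
    TO DISK = N'\\BackupShare\YourDatabase_full.bak'
    WITH COMPRESSION, CHECKSUM;

BACKUP LOG [YourDatabase]
    TO DISK = N'\\BackupShare\YourDatabase_log.trn'
    WITH COMPRESSION, CHECKSUM;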

Useful links!

More troubleshooting:

Recover from disk full issues

Get the size of all your objects

I try to stay in touch with the internals of my databases, including the size of things. Growth patterns are a good indicator of where most of your database activity lies, and that in turn can help you locate activity hot-spots. For today’s post, I have a simple script that will help you get all your object sizes, including tables, heaps and indexes.

As an added bonus, it’ll also give you suggestions of which indexes are just giving you maintenance overhead. Every DML statement also updates the indexes, so an index has the potential to be pure overhead if it sees more updates than reads. To identify such candidates, I use the formula below:

( user_seeks + user_scans + user_lookups )  / (user_updates) 

The idea here is simple: if the combined seeks, scans and lookups add up to more than the updates (a ratio above 1), then the index is probably earning its keep! If you’re seeing more updates than reads, then the index is probably just baggage, and you should consider tossing it overboard.

That’s a lot of baggage!

There are a few caveats to consider, however:

  • Index stats are not preserved during service restarts.
  • Index stats can also be manually flushed.
  • It is very possible that an index is used for just one report, which only runs once a year! And of course, that one index shaves several hours off the report’s execution time.
  • sys.dm_db_index_usage_stats only holds information for indexes that have been used since the last service start up.
  • sys.dm_db_index_usage_stats does not return information about memory-optimized indexes (see here).

And here’s the code itself:


SELECT 
    ds.name AS filegroupname,
    df.name AS filename,
    df.physical_name,
    SCHEMA_NAME(o.schema_id) AS SCHEMAName,
    OBJECT_NAME(ps.object_id) AS objname,
    i.name indexname,
    o.create_date,
    i.index_id,
    ps.row_count,
    SUM(ps.used_page_count) / 128 AS used_mb,
    SUM(ps.reserved_page_count) / 128 AS reserved_mb,
    SUM(ps.in_row_data_page_count) / 128 In_row_mb,
    SUM(ps.in_row_used_page_count) / 128 AS row_used,
    SUM(ps.in_row_reserved_page_count) / 128 AS row_reserved,
    SUM(ps.lob_used_page_count) / 128 AS lob_mb,
    SUM(ps.lob_reserved_page_count) / 128 AS lob_reserved,
    us.user_updates AS [Total Writes],
    us.user_seeks + us.user_scans + us.user_lookups AS [Total Reads],
    CASE
        WHEN user_updates > 0 THEN
            CONVERT(DECIMAL(6, 2), (user_seeks + user_scans + user_lookups) / (user_updates * 1.0))
        ELSE
            CONVERT(DECIMAL(6, 2), (user_seeks + user_scans + user_lookups))
    END AS ReadToWriteRatio
FROM sys.dm_db_partition_stats ps
    left JOIN sys.objects o
        ON o.object_id = ps.object_id
    left JOIN sys.indexes i
        ON o.object_id = i.object_id
    left JOIN sys.data_spaces ds
        ON ds.data_space_id = i.data_space_id
    left JOIN sys.database_files df
        ON df.data_space_id = ds.data_space_id
           AND df.data_space_id = i.data_space_id
    left JOIN sys.dm_db_index_usage_stats us
        ON us.object_id = i.object_id
           AND us.index_id = i.index_id
GROUP BY 
SCHEMA_NAME(o.schema_id),
         OBJECT_NAME(ps.object_id),
         us.user_seeks + us.user_scans + us.user_lookups,
         us.user_updates - (us.user_seeks + us.user_scans + us.user_lookups),
         ds.name,
         df.name,
         df.physical_name,
         i.name,
         o.create_date,
         i.index_id,
         ps.row_count,
         us.user_updates
ORDER BY objname,
         filegroupname,
         i.name;

For a fresh restore of a StackOverflow database (thanks Brent!), you’ll get one row per index or heap, per filegroup, with its size and read/write numbers. You can save the results into a table (along with a time-stamp) so you can review your usage trends over time.
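Here’s a minimal sketch of that idea: a (hypothetical) dbo.ObjectSizeHistory table plus a slimmed-down version of the size query, appending a time-stamped snapshot. The column choices are just an example.

IF OBJECT_ID('dbo.ObjectSizeHistory') IS NULL
    CREATE TABLE dbo.ObjectSizeHistory (
        captured_at DATETIME2 NOT NULL DEFAULT SYSDATETIME(),
        objname     sysname   NOT NULL,
        indexname   sysname   NULL,
        row_count   BIGINT    NULL,
        used_mb     BIGINT    NULL,
        reserved_mb BIGINT    NULL
    );

INSERT INTO dbo.ObjectSizeHistory (objname, indexname, row_count, used_mb, reserved_mb)
SELECT OBJECT_NAME(ps.object_id),
       i.name,
       SUM(ps.row_count),
       SUM(ps.used_page_count) / 128,
       SUM(ps.reserved_page_count) / 128
FROM sys.dm_db_partition_stats AS ps
     LEFT JOIN sys.indexes AS i
         ON i.object_id = ps.object_id
        AND i.index_id  = ps.index_id
WHERE OBJECT_SCHEMA_NAME(ps.object_id) <> 'sys'
GROUP BY OBJECT_NAME(ps.object_id), i.name;

Run it on a schedule and the captured_at column gives you the trend line.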

Other links!

sys.dm_db_index_physical_stats – get fragmentation values!

Where in the disk are my indexes? – similar to this, but without the usage values

Add a SQL subscription with just a script

SQL Replication is one of my favorite things about SQL Server. For the most part, and as long as it’s properly maintained, it’s rock solid! It’ll just churn along without too much fuss. In this post, I’ll show you how to add a subscription with just a script. Why? There are two main reasons:

  • The default UI for adding a subscription is too screen-y! There are too many steps for something that is relatively simple.
  • If you’re sending data to an Always On subscriber, you can’t even use the UI! That’s because the wizard performs some validation on the host name. Since your AG listener is different from your nodes, the listener name will fail the validation because it isn’t a “true” host.
You’re REALLY not going to pass

So the actual solution here is to use a script! It always surprises me that SSMS doesn’t offer an out-of-the-box option to generate this script, like it does for so many other operations. Lucky for you, I have just the script you need, right here:

--RUN THIS FROM YOUR PUBLISHER DATABASE!
--Not from the distribution or subscriber database
USE [Publishing database]

DECLARE @DBName sysname
DECLARE @PubName sysname
DECLARE @SubDB sysname
DECLARE @Subscriber sysname


SELECT @DBName	= DB_NAME() --current database
DECLARE @pubName2 sysname = 'YourPubName' --publication name
SELECT @PubName = COALESCE(@PubName, @pubName2)

SELECT @SubDB	= 'YourSubscriptionDatabase' --subscription database
SELECT @Subscriber = 'YourSubscriptionServer' --server name


IF (SELECT COUNT(name) FROM dbo.syspublications) =1
	SELECT @PubName = name FROM dbo.syspublications
ELSE
IF @pubName2 IS null
BEGIN
	PRINT 'This database has multiple pubs! You need to set the variable of @pubname2 above!'
	RETURN
END	

IF (DB_NAME() <> @DBName)
BEGIN
	PRINT 'you are in the wrong database!'
	RETURN
end

DECLARE @error VARCHAR(100)

SELECT @error = 'The publication ' + COALESCE(@PubName,@pubName2) + ' does not exist!'

IF NOT EXISTS (
SELECT 1 FROM dbo.syspublications WHERE name = @PubName)
BEGIN
	PRINT @error
	RETURN
end
--No more validation, this is the real thing!
--First we add the subscription
EXEC sp_addsubscription @publication = @PubName,   
       @subscriber = @Subscriber,   
       @destination_db = @SubDB,   
       @subscription_type = N'Push',   
       @sync_type = N'automatic', @article = N'all', @update_mode = N'read only', @subscriber_type = 0;  

--This is for push subscriptions only!
EXEC sp_addpushsubscription_agent @publication = @PubName,   
       @subscriber = @Subscriber,   
       @subscriber_db = @SubDB,   
       @job_login = 'LogReaderServiceAccount', @job_password = 'LogReaderServiceAccountPassword'
	   , @subscriber_security_mode = 1;  
GO  

The script starts with a bunch of data validations: this is just to ensure that you’re in the correct database and that the pub exists. Once the validation passes, we add the subscription to the proper system tables, and then we create the actual SQL Agent job that pushes the data to that particular subscriber. This script is meant only for push subscriptions, where the jobs run from a centralized replication server. This differs from a pull subscription, where the pull jobs run from the subscribers. Neither approach is “better”; it really depends on your environment.
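To double-check that the subscription actually registered, sp_helpsubscription (run from the same publisher database) will list it. The publication name below is a placeholder:

EXEC sp_helpsubscription @publication = N'YourPubName';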

Documentation!

sp_addsubscription

sp_addpushsubscription_agent

Other scripts!

Create a backup report

Create a SQL Backup Report

One of the most important duties of a production DBA is to ensure backups are happening regularly. No one wants to face a manager and tell them “I’m sorry, but the most recent backup I have is from a week ago”. Create a SQL backup report so you can review your situation!

My production environment is rather large, so it would be easy to rely on backup failure alerts. Don’t fall into that trap!

You are trusting that your backup system, your notification system AND your email system are all in working order, and that’s a lot of trust. I much prefer being proactive, which includes verifying manually (as often as I can afford) that my backups ran.

To address that need, I wrote the following script. Once a week, I run it against a Registered Server list (you have one of those, right?). After a few seconds, you will have a complete list with the last time each database was backed up, as well as the backup type and the age of the backup.

 SELECT DISTINCT DatabaseNames.NAME                                 AS NAME,
                Max(bs.backup_start_date)                          AS
                [Latest Backup],
                Datediff(dd, Max(bs.backup_start_date), Getdate()) AS
                [Backup Age],
                CASE
                  WHEN type = 'D' THEN 'Full'
                  WHEN type = 'I' THEN 'Differential'
                  WHEN type = 'L' THEN 'Log'
                  ELSE 'Some other type'
                END                                                AS
                Backup_Type,
                CASE
                  WHEN Max(bs.backup_start_date) IS NULL THEN 'NEEDS BACKUP'
                  WHEN Datediff(dd, Max(bs.backup_start_date), Getdate()) > 1
                THEN
                  'NOT UP TO DATE'
                  WHEN Datediff(dd, Max(bs.backup_start_date), Getdate()) = 1
                THEN
                  'DATABASE BACKED UP'
                  WHEN Datediff(dd, Max(bs.backup_start_date), Getdate()) = 0
                THEN
                  'DATABASE BACKED UP'
                  ELSE 'PLEASE REVIEW MANUALLY'
                END                                                AS Status
FROM   sys.databases AS DatabaseNames
       LEFT OUTER JOIN msdb.dbo.backupset AS bs
                    ON bs.database_name = DatabaseNames.NAME
WHERE  databasenames.NAME NOT IN ( 'tempdb', 'reportservertempdb' )
GROUP  BY DatabaseNames.NAME,
          type
HAVING Datediff(dd, Max(bs.backup_start_date), Getdate()) > 0
        OR Max(bs.backup_start_date) IS NULL
--at least 1 day since the last backup, or never backed up at all
ORDER  BY NAME,
          backup_type  
Look at all those backups!

From here, there are a couple of directions you could go. You could save the output somewhere, if that’s part of your audit requirements! Or you could automate this report and email it to yourself, as long as you have enough discipline to still review it… I find that emails are easy to ignore when you get > 100 emails/day, so if that’s you, I would keep on running it manually.
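If you do want to automate it, Database Mail can run the query and send you the results. A hedged sketch, assuming a mail profile named 'DBA Mail' and that you’ve saved the report query as a stored procedure called dbo.BackupReport (both names are placeholders):

EXEC msdb.dbo.sp_send_dbmail
     @profile_name = N'DBA Mail',
     @recipients   = N'you@yourcompany.com',
     @subject      = N'Weekly SQL backup report',
     @query        = N'EXEC DBA.dbo.BackupReport;',
     @attach_query_result_as_file = 1;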

Useful Links!

Backup Documentation

BackupSet documentation

Other Scripts!

Document your IP

Get Session Progress