Renewing the STS cert in VCenter

The STS cert is used by vcenter to authenticate against external SSO targets, such as Okta or ActiveDirectory, and it’s a key service for various other services – like the Web UI!

The fun part about STS is that it’s autogenerated at some point (probably during install or upgrade), and it will expire 2 years after that – but vCenter won’t tell you when the expiration is coming up! It’ll just gloriously fail (see https://kb.vmware.com/s/article/82332 and https://kb.vmware.com/s/article/79248), and then you’ll get either an http500 or a “No healthy upstream” error.

To renew the cert, follow the steps on https://kb.vmware.com/s/article/76719, namely:

1 – SSH into vcenter

ssh -l root <your vcenter url>

2 – Move into /tmp

cd /tmp

3 – Create a bash script and paste the script below

vim fixsts.sh
#!/bin/bash
# Copyright (c) 2020-2021 VMware, Inc. All rights reserved.
# VMware Confidential
#
# Run this from the affected PSC/VC
#
# NOTE: This works on external and embedded PSCs
# This script will do the following
# 1: Regenerate STS certificate
#
# What is needed?
# 1: Offline snapshots of VCs/PSCs
# 2: SSO Admin Password

NODETYPE=$(cat /etc/vmware/deployment.node.type)
if [ "$NODETYPE" = "management" ]; then
    echo "Detected this node is a vCenter server with external PSC."
    echo "Please run this script from a vCenter with embedded PSC, or an external PSC"
    exit 1
fi

if [ "$NODETYPE" = "embedded" ]  &&  [ ! -f  /usr/lib/vmware-vmdir/sbin/vmdird ]; then
    echo "Detected this node is a vCenter gateway"
    echo "Please run this script from a vCenter with embedded PSC, or an external PSC"
    exit 1
fi

echo "NOTE: This works on external and embedded PSCs"
echo "This script will do the following"
echo "1: Regenerate STS certificate"
echo "What is needed?"
echo "1: Offline snapshots of VCs/PSCs"
echo "2: SSO Admin Password"
echo "IMPORTANT: This script should only be run on a single PSC per SSO domain"

mkdir -p /tmp/vmware-fixsts
SCRIPTPATH="/tmp/vmware-fixsts"
LOGFILE="$SCRIPTPATH/fix_sts_cert.log"

echo "==================================" | tee -a $LOGFILE
echo "Resetting STS certificate for $HOSTNAME started on $(date)" | tee -a $LOGFILE
echo ""| tee -a $LOGFILE
echo ""
DN=$(/opt/likewise/bin/lwregshell list_values '[HKEY_THIS_MACHINE\Services\vmdir]' | grep dcAccountDN | awk '{$1=$2=$3="";print $0}'|tr -d '"'|sed -e 's/^[ \t]*//')
echo "Detected DN: $DN" | tee -a $LOGFILE
PNID=$(/opt/likewise/bin/lwregshell list_values '[HKEY_THIS_MACHINE\Services\vmafd\Parameters]' | grep PNID | awk '{print $4}'|tr -d '"')
echo "Detected PNID: $PNID" | tee -a $LOGFILE
PSC=$(/opt/likewise/bin/lwregshell list_values '[HKEY_THIS_MACHINE\Services\vmafd\Parameters]' | grep DCName | awk '{print $4}'|tr -d '"')
echo "Detected PSC: $PSC" | tee -a $LOGFILE
DOMAIN=$(/opt/likewise/bin/lwregshell list_values '[HKEY_THIS_MACHINE\Services\vmafd\Parameters]' | grep DomainName | awk '{print $4}'|tr -d '"')
echo "Detected SSO domain name: $DOMAIN" | tee -a $LOGFILE
SITE=$(/opt/likewise/bin/lwregshell list_values '[HKEY_THIS_MACHINE\Services\vmafd\Parameters]' | grep SiteName | awk '{print $4}'|tr -d '"')
MACHINEID=$(/usr/lib/vmware-vmafd/bin/vmafd-cli get-machine-id --server-name localhost)
echo "Detected Machine ID: $MACHINEID" | tee -a $LOGFILE
IPADDRESS=$(ifconfig | grep eth0 -A1 | grep "inet addr" | awk -F ':' '{print $2}' | awk -F ' ' '{print $1}')
echo "Detected IP Address: $IPADDRESS" | tee -a $LOGFILE
DOMAINCN="dc=$(echo "$DOMAIN" | sed 's/\./,dc=/g')"
echo "Domain CN: $DOMAINCN"
ADMIN="cn=administrator,cn=users,$DOMAINCN"
USERNAME="administrator@${DOMAIN^^}"
ROOTCERTDATE=$(openssl x509  -in /var/lib/vmware/vmca/root.cer -text | grep "Not After" | awk -F ' ' '{print $7,$4,$5}')
TODAYSDATE=$(date +"%Y %b %d")

echo "#" > $SCRIPTPATH/certool.cfg
echo "# Template file for a CSR request" >> $SCRIPTPATH/certool.cfg
echo "#" >> certool.cfg
echo "# Country is needed and has to be 2 characters" >> $SCRIPTPATH/certool.cfg
echo "Country = DS" >> $SCRIPTPATH/certool.cfg
echo "Name = $PNID" >> $SCRIPTPATH/certool.cfg
echo "Organization = VMware" >> $SCRIPTPATH/certool.cfg
echo "OrgUnit = VMware" >> $SCRIPTPATH/certool.cfg
echo "State = VMware" >> $SCRIPTPATH/certool.cfg
echo "Locality = VMware" >> $SCRIPTPATH/certool.cfg
echo "IPAddress = $IPADDRESS" >> $SCRIPTPATH/certool.cfg
echo "Email = email@acme.com" >> $SCRIPTPATH/certool.cfg
echo "Hostname = $PNID" >> $SCRIPTPATH/certool.cfg

echo "==================================" | tee -a $LOGFILE
echo "==================================" | tee -a $LOGFILE
echo ""
echo "Detected Root's certificate expiration date: $ROOTCERTDATE" | tee -a $LOGFILE
echo "Detected today's date: $TODAYSDATE" | tee -a $LOGFILE

echo "==================================" | tee -a $LOGFILE

flag=0
if [[ $TODAYSDATE > $ROOTCERTDATE ]];
then
    echo "IMPORTANT: Root certificate is expired, so it will be replaced" | tee -a $LOGFILE
    flag=1
    mkdir /certs && cd /certs
    cp $SCRIPTPATH/certool.cfg /certs/vmca.cfg
    /usr/lib/vmware-vmca/bin/certool --genselfcacert --outprivkey /certs/vmcacert.key  --outcert /certs/vmcacert.crt --config /certs/vmca.cfg
    /usr/lib/vmware-vmca/bin/certool --rootca --cert /certs/vmcacert.crt --privkey /certs/vmcacert.key
fi

echo "#" > $SCRIPTPATH/certool.cfg
echo "# Template file for a CSR request" >> $SCRIPTPATH/certool.cfg
echo "#" >> $SCRIPTPATH/certool.cfg
echo "# Country is needed and has to be 2 characters" >> $SCRIPTPATH/certool.cfg
echo "Country = DS" >> $SCRIPTPATH/certool.cfg
echo "Name = STS" >> $SCRIPTPATH/certool.cfg
echo "Organization = VMware" >> $SCRIPTPATH/certool.cfg
echo "OrgUnit = VMware" >> $SCRIPTPATH/certool.cfg
echo "State = VMware" >> $SCRIPTPATH/certool.cfg
echo "Locality = VMware" >> $SCRIPTPATH/certool.cfg
echo "IPAddress = $IPADDRESS" >> $SCRIPTPATH/certool.cfg
echo "Email = email@acme.com" >> $SCRIPTPATH/certool.cfg
echo "Hostname = $PNID" >> $SCRIPTPATH/certool.cfg

echo ""
echo "Exporting and generating STS certificate" | tee -a $LOGFILE
echo ""

cd $SCRIPTPATH

/usr/lib/vmware-vmca/bin/certool --server localhost --genkey --privkey=sts.key --pubkey=sts.pub
/usr/lib/vmware-vmca/bin/certool --gencert --cert=sts.cer --privkey=sts.key --config=$SCRIPTPATH/certool.cfg

openssl x509 -outform der -in sts.cer -out sts.der
CERTS=$(csplit -f root /var/lib/vmware/vmca/root.cer '/-----BEGIN CERTIFICATE-----/' '{*}' | wc -l)
openssl pkcs8 -topk8 -inform pem -outform der -in sts.key -out sts.key.der -nocrypt
i=1
until [ $i -eq $CERTS ]
do
    openssl x509 -outform der -in root0$i -out vmca0$i.der
    ((i++))
done

echo ""
echo ""
read -s -p "Enter password for administrator@$DOMAIN: " DOMAINPASSWORD
echo ""

# Find the highest tenant credentials index
MAXCREDINDEX=1
while read -r line
do
    INDEX=$(echo "$line" | tr -dc '0-9')
    if [ $INDEX -gt $MAXCREDINDEX ]
    then
        MAXCREDINDEX=$INDEX
    fi
done < <(/opt/likewise/bin/ldapsearch -h localhost -p 389 -b "cn=$DOMAIN,cn=Tenants,cn=IdentityManager,cn=Services,$DOMAINCN" -D "cn=administrator,cn=users,$DOMAINCN" -w "$DOMAINPASSWORD" "(objectclass=vmwSTSTenantCredential)" cn | grep cn:)

# Sequentially search for tenant credentials up to max index  and delete if found
echo "Highest tenant credentials index : $MAXCREDINDEX" | tee -a $LOGFILE
i=1
if [ ! -z $MAXCREDINDEX ]
then
    until [ $i -gt $MAXCREDINDEX ]
    do
        echo "Exporting tenant $i to $SCRIPTPATH" | tee -a $LOGFILE
        echo ""
        ldapsearch -h localhost -D "cn=administrator,cn=users,$DOMAINCN" -w "$DOMAINPASSWORD" -b "cn=TenantCredential-$i,cn=$DOMAIN,cn=Tenants,cn=IdentityManager,cn=Services,$DOMAINCN" > $SCRIPTPATH/tenantcredential-$i.ldif
                if [ $? -eq 0 ]
                then
                    echo "Deleting tenant $i" | tee -a $LOGFILE
                        ldapdelete -h localhost -D "cn=administrator,cn=users,$DOMAINCN" -w "$DOMAINPASSWORD" "cn=TenantCredential-$i,cn=$DOMAIN,cn=Tenants,cn=IdentityManager,cn=Services,$DOMAINCN" | tee -a $LOGFILE
                else
                    echo "Tenant $i not found" | tee -a $LOGFILE
                    echo ""
                fi
                ((i++))
                done
fi
echo ""

# Find the highest trusted cert chains index
MAXCERTCHAINSINDEX=1
while read -r line
do
    INDEX=$(echo "$line" | tr -dc '0-9')
    if [ $INDEX -gt $MAXCERTCHAINSINDEX ]
    then
        MAXCERTCHAINSINDEX=$INDEX
    fi
done < <(/opt/likewise/bin/ldapsearch -h localhost -p 389 -b "cn=$DOMAIN,cn=Tenants,cn=IdentityManager,cn=Services,$DOMAINCN" -D "cn=administrator,cn=users,$DOMAINCN" -w "$DOMAINPASSWORD" "(objectclass=vmwSTSTenantTrustedCertificateChain)" cn | grep cn:)

# Sequentially search for trusted cert chains up to max index  and delete if found
echo "Highest trusted cert chains index: $MAXCERTCHAINSINDEX" | tee -a $LOGFILE
i=1
if [ ! -z $MAXCERTCHAINSINDEX ]
then
    until [ $i -gt $MAXCERTCHAINSINDEX ]
    do
            echo "Exporting trustedcertchain $i to $SCRIPTPATH" | tee -a $LOGFILE
            echo ""
                ldapsearch -h localhost -D "cn=administrator,cn=users,$DOMAINCN" -w "$DOMAINPASSWORD" -b "cn=TrustedCertChain-$i,cn=TrustedCertificateChains,cn=$DOMAIN,cn=Tenants,cn=IdentityManager,cn=Services,$DOMAINCN" > $SCRIPTPATH/trustedcertchain-$i.ldif
            if [ $? -eq 0 ]
            then
                echo "Deleting trustedcertchain $i" | tee -a $LOGFILE
                ldapdelete -h localhost -D "cn=administrator,cn=users,$DOMAINCN" -w "$DOMAINPASSWORD" "cn=TrustedCertChain-$i,cn=TrustedCertificateChains,cn=$DOMAIN,cn=Tenants,cn=IdentityManager,cn=Services,$DOMAINCN" | tee -a $LOGFILE
            else
                echo "Trusted cert chain $i not found" | tee -a $LOGFILE
            fi
            echo ""
                ((i++))
                done
fi
echo ""

i=1
echo "dn: cn=TenantCredential-1,cn=$DOMAIN,cn=Tenants,cn=IdentityManager,cn=Services,$DOMAINCN" > sso-sts.ldif
echo "changetype: add" >> sso-sts.ldif
echo "objectClass: vmwSTSTenantCredential" >> sso-sts.ldif
echo "objectClass: top" >> sso-sts.ldif
echo "cn: TenantCredential-1" >> sso-sts.ldif
echo "userCertificate:< file:sts.der" >> sso-sts.ldif
until [ $i -eq $CERTS ]
do
    echo "userCertificate:< file:vmca0$i.der" >> sso-sts.ldif
    ((i++))
done
echo "vmwSTSPrivateKey:< file:sts.key.der" >> sso-sts.ldif
echo "" >> sso-sts.ldif
echo "dn: cn=TrustedCertChain-1,cn=TrustedCertificateChains,cn=$DOMAIN,cn=Tenants,cn=IdentityManager,cn=Services,$DOMAINCN" >> sso-sts.ldif
echo "changetype: add" >> sso-sts.ldif
echo "objectClass: vmwSTSTenantTrustedCertificateChain" >> sso-sts.ldif
echo "objectClass: top" >> sso-sts.ldif
echo "cn: TrustedCertChain-1" >> sso-sts.ldif
echo "userCertificate:< file:sts.der" >> sso-sts.ldif
i=1
until [ $i -eq $CERTS ]
do
    echo "userCertificate:< file:vmca0$i.der" >> sso-sts.ldif
    ((i++))
done
echo ""
echo "Applying newly generated STS certificate to SSO domain" | tee -a $LOGFILE

/opt/likewise/bin/ldapmodify -x -h localhost -p 389 -D "cn=administrator,cn=users,$DOMAINCN" -w "$DOMAINPASSWORD" -f sso-sts.ldif | tee -a $LOGFILE
echo ""
echo "Replacement finished - Please restart services on all vCenters and PSCs in your SSO domain" | tee -a $LOGFILE
echo "==================================" | tee -a $LOGFILE
echo "IMPORTANT: In case you're using HLM (Hybrid Linked Mode) without a gateway, you would need to re-sync the certs from Cloud to On-Prem after following this procedure" | tee -a $LOGFILE
echo "==================================" | tee -a $LOGFILE
echo "==================================" | tee -a $LOGFILE
if [ $flag == 1 ]
then
    echo "Since your Root certificate was expired and was replaced, you will need to replace your MachineSSL and Solution User certificates" | tee -a $LOGFILE
    echo "You can do so following this KB: https://kb.vmware.com/s/article/2097936" | tee -a $LOGFILE
fi

4 – chmod the script

chmod a+x fixsts.sh

5 – Run the script! Note the first authentication prompt will require the ROOT login (same one you used SSH). The second prompt will require the ADMINISTRATOR@VSPHERE.LOCAL user. Don’t be like me and enter the root user’s password, then wonder why it didn’t work.

./fixsts.sh

6 – Restart the vcenter services, no reboot required.

service-control --stop --all && service-control --start --all

If all goes well, the script should return something like this (notice no errors!)

And then the service restart – vcenter should come online shortly after this completes.

Recover from disk full errors

We’ve all been there, at some point: You set up a new QA server, but you’re a busy guy, and put off setting up alerts for later. The server gets a ton of usage, and all of a sudden, it runs out of space before you had a chance do something about it. This post is about one we way I use to recover from disk full errors.

In this particular case, we created a new SSIS catalog. The SSISDB database is created by the system,  so at first you don’t get to select where the files are located. Sure, you could’ve modified the database defaults post-setup, but you didn’t do that either! Now the log file is in the data volume, and the volume is all filled up. You’d like to move the log file, but you can’t detach SSISB because, again, the volume is full and nothing works right. So what do you do?

Whenever SQL server restarts, it reads the entries from sys.master_files and sys.databases to figure out where the databases are. When you alter any of the database properties, those changes are registered in that table. So what we need to do here is update those entries (not directly, please!) and then restart the service. Since this particular server is non-prod, restarts are ok! So here’s the syntax:

--run this first to get the current logical name, you'll need this for the next step 
SELECT DB_NAME(database_id),
       name,
       physical_name
FROM sys.master_files; 
--Now the actual trick, where filename is the physical name of the file 
ALTER DATABASE SSISDB
MODIFY FILE
(
    NAME = 'log',
    FILENAME = 'L:\sqllogs\ssisdb.ldf'
);

After this, stop SQL and manually move the file to the new location (as defined in your script — SQL will not move the files for you). When done, start SQL again. Your database should come right up!

Now, let’s say that, in your hurry to get things back up, you restarted the service but forgot to actually move the files. Despair not! As long as SQL hasn’t acquired a filesystem lock on the files, the following commands will allow you to move the files to the proper places. Once everything is in the proper places, the following commands will initialize the database:

ALTER DATABASE SSISDB SET OFFLINE;
ALTER DATABASE SSISDB SET ONLINE;

Useful Links:

Alter Database

More Troubleshooting!

Upgrade shenanigans

SQL Server upgrade shenanigans

It’s late at night, I’m watching Yours, Mine and Ours (the 1968 version), and it’s time to upgrade one of our SQL2014 dev instances to SQL2017. We’d already gone through the upgrade advisor, the issues raised (mostly CLR) were not a big deal, so let’s go ahead and get this done. It’s all smooth sailing until I get this error:

Very odd! After a quick googling, I find this page by Pinal Dave –> https://blog.sqlauthority.com/2017/01/27/sql-server-sql-installation-fails-error-code-0x851a001a-wait-database-engine-recovery-handle-failed/

His suggestion is that there’s an issue with the AD service account, and the fix is to switch to network service. This is dev anyway, and I feel a little careless, so why not? Unfortunately, though, it didn’t work! Not only that, when I tried switching service accounts, I got even weirder errors:

 

I absolutely love the %1 to logoff.

After a restart and another attempt at the upgrade, and getting the same results, I decided to ditch the upgrade summary and take a peek at the sql server log itself. This is the actual service log, saved under MSQL12.MSSQLSERVER\MSSQL\Log. In there, I’m able to verify a few things:

  • The server starts
  • It tries to run the upgrade scripts

So that’s good, right? Then I find the real culprit (database names hidden to protect the innocent):

Based on this message, what the issue appears to be is that the upgrade script is trying to upgrade some system objects (you can tell they’re system objects because the begin with MS%), and when that fails, it fails the entire upgrade. Interesting! I know that this particular object is left over from some weird replication issues that have not been properly cleaned up, so it’s safe to drop them. But how? I can’t even get the service to come up, but I know I have a few options:

  • I can look up the trace flags that skip upgrade scripts (This very handy link lists all the trace flags –> http://www.sqlservercentral.com/articles/trace-flag/152989/ ) The correct flag is 992 902 — this flag will skip upgrade scripts in the master database. Once the server comes up, I can fix the issue, remove the flag, then retry the upgrade
  • I can be lazy and rename the data file and log files in the filesystem, then restart the upgrade. In theory, SQL should fail to attach the database (because I renamed the files!), then apply the upgrade scripts on any of the databases that actually mounted, and then complete the upgrade. I’m feeling adventurous, so I try that, and it worked! I have a working instance again.So now I have to clean after my mess. I switched the service account back to the proper AD account, and I also stop SQL and rename the data files to the correct extensions. Once I start SQL again, the database mounts properly… I think it’s odd that it didn’t fail the upgrade scripts, so I can only assume that the upgrade scripts did not run again. I’m not super worried, at this point, because next up is fixing the issue by dropping the orphaned replication objects. Since I have to apply CU11 afterwards, I figure that at that point the upgrade scripts will be applied. 10 mins later, I was done.

What are the lessons learned? First of all, before you run an upgrade, make sure your databases are clean, with no weird system objects laying around. Second, if you have a failed upgrade, take a look at the SQL server log itself, there’s usually a ton of useful information in there to help you get up and running again.