Troubleshooting an Obscure DeployStudio Runtime Error

DeployStudio is an old hat classic ’round these parts, and many Mac admins are familiar with its foibles and idiosyncrasies. For those of you who haven’t moved on to Imagr yet, this sad story about troubleshooting DeployStudio may encourage you to hop onto the gravy train and off the failboat.

The story starts with a simple premise: my NetBoot clients would start up DeployStudio Runtime, but then would throw a repository access error when trying to mount the DS repository (which I have configured to be served via SMB):
12637256_10204432329603258_97004526_o

DeployStudio doesn’t like when this happens. It also doesn’t give you very useful information about what happens, because the repository access error triggers the “timeout until restart” countdown. If your trigger is an unforgiving number, i.e. 0 or 1 seconds, this will result in an instant reboot without you being able to troubleshoot the environment at all.

There’s nothing really useful in the log about why it failed, or how, either. Not very helpful, there, DeployStudio.

I’m troubleshooting this remotely, so I don’t have physical access to these machines. I’m doing all this relayed through messages to the local field technicians.

What we know at this point: DS Runtime can’t mount the SMB repo share.

Step 1: Verify DS’s Repo share

Simplest thing: check the server to make sure SMB is running and that DS knows about it. That’s simple enough to do in the System Preferences’ DeployStudio pane, which will show the status of the service and the address of the DS repository it’s offering.

Just for kicks, let’s try restarting the DS service.

Did that work? Nope.

Step 2: Verify SMB sharing

Okay, DS thinks it’s fine. Maybe SMB itself is freaking out?
sudo serveradmin stop smb
sudo serveradmin start smb

Let’s try mounting the share directly on another client:
mkdir -p /Volumes/DS/
mount -t smbfs //username@serverIP/DeployStudio /Volumes/DS
ls /Volumes/DS/

Works fine. Well, gee, that’s both good news and disconcerting news, because if the share works fine on other clients, why are these DS clients not mounting it?

???

So at this point, we know the SMB share works on some clients, fails on other clients, but is otherwise configured correctly on the server. We approach Hour 3 of All Aboard The Fail Boat.

Okay, just in case, let’s try rebuilding the NBI using DS Assistant. Did that fix it? Nope.

Ping test from broken client to server. No packet loss. Connection looks solid.

Telnet test from broken client to server on SMB port. It connects. No firewall, no network ACLs, no change in VLAN, no weird stuff.

Packet capture. Spanning tree set up between ports to carefully monitor traffic. Why are 60% of these clients failing to mount the share, but 40% still working?

Tear your hair out in frustration. Move on to hour 4.

A Glimmer of Hope

Time to get ugly. We need more data to determine what’s happening, and part of that is figuring out the difference between successful SMB authentications and failed ones. To see that, we need log data.

Hat tip to Rich Trouton for helpfully pointing me to this link:
http://web.stanford.edu/group/macosxsig/blog/2011/08/enable-logging-with-107-smbx-w.html

SMB logging sounds good. On 10.10, the above link is an easy solution – just unload the SMB launchd, edit the plist to add in the -debug and -stdout options, reload on the launchd, and watch the system log.

On 10.11, it’s a bit more work – your best bet would be to disable Apple’s launchd for SMB, make a copy of it with a different identifier, and load that (hat tip to @elios in MacAdmins Slack for this).

Once we’ve got logging enabled, let’s look very carefully at a success vs. a failure.
Success:

Failure:

This seems to be the key indicator of success:
kdc: ok user=F5KP60PFF9VN\username proto=ntlmv2
Compare that to the failure log:
kdc failed with -1561745592 proto=ntlmv2

Hmm, what the heck error code is that?

Googling got me to one specific hint, which is what gave the solution away:

Linux cifs mount with ntlmssp against an Mac OS X (Yosemite
10.10.5) share fails in case the clocks differ more than +/-2h:

The clock!

Well, That Was Obvious In Hindsight

I needed to verify the clock on one of the affected machines. Sure enough, the technician confirmed that the date was December 31, 1969. Definitely a bit more than 2 hours difference to the server.

In my defense, I’d like to remind you that I was troubleshooting this remotely and therefore couldn’t have noticed this without someone telling me and yes I’m rationalizing my failures stop looking at me like that I’m hideous don’t even look at me

The real question, then, is why this was happening. DeployStudio NBIs, when built via DeployStudio Assistant or Per Oloffson’s excellent AutoDSNBI, use an NTP server to sync up the date and time to prevent precisely this problem. What went wrong here?

The next silly thing: it turns out we changed our NTP server, and I simply failed to notice. The old NTP server didn’t resolve anymore, and that’s why any client that happened to have an expired clock battery (and therefore set back to the default time) failed to sync back up.

So the 60% fail rate we were seeing was essentially random luck against a pile of old machines, some of whom had been powered off for so long the clock battery ran out and the system time was reset.

Rebuilding the NBIs with the correct NTP server fixed the problem immediately.

The lesson from all of this?

Check the damn clock.

Advertisements