So this is a bit of a doozy; please bear with me.
I’m not an AD expert, but I do have domain admin (lol). For several months now we’ve had a weird issue where Kerberos SSO from Tomcat (BusinessObjects) fails against certain domain controllers. It started after a malicious actor gained access to the network. We contained the breach but were left with lots of issues. In a panic, the company added a bunch of security tools and also had to rebuild a few servers, including one of the domain controllers. I was suspicious of the security apps (FireEye and SentinelOne), but I haven’t been able to pin anything on them while watching in procmon. We’re getting the “message stream modified” error in the logs (and seeing it in netmon and Wireshark). I was new to all of these tools, but I’ve been Googling like a madman to figure it out.
The issue only seems to occur in the negotiation from some clients and Citrix servers to certain domain controllers. All of this stuff (servers, accounts, etc.) is in the same domain.
There is a service account with constrained delegation and services like:
BOCMS/SVC_NAME.FQDN
HTTP/SERVER
HTTP/SERVER.FQDN
…
The machines themselves have the HOST/SERVER SPNs, so I’m not sure if that matters–it’s never been a problem before.
Everything is set up per instructions from SAP and was working before the incident.
IT and Security claim that there’s nothing that has been changed or implemented that would cause this.
SAP Support gave me a big ol’ shrug because it’s Tomcat/AD/Kerberos and not directly related to BusinessObjects.
Microsoft said it’s an SPN issue where the SPN is assigned to the wrong account or there are duplicate SPNs. We spent weeks making no progress with Microsoft Support. We tried recreating the service account and adding the SPNs back to it. I’ve done several checks and eliminated the only pair of duplicate SPNs I could find, which didn’t resolve the issue. We also made sure that the domain controllers were synced.
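For anyone who wants to sanity-check the duplicate hunt, this is roughly the logic of the check, as a minimal Python sketch. It assumes the account-to-SPN mapping has already been exported from AD by some other means, and it treats SPNs as case-insensitive, which is my understanding of how AD compares them. The function name, account names, and host names below are hypothetical placeholders, not our real ones.

```python
from collections import defaultdict


def find_duplicate_spns(spns_by_account):
    """Return {lowercased SPN: sorted account names} for SPNs held by more than one account."""
    owners = defaultdict(set)
    for account, spns in spns_by_account.items():
        for spn in spns:
            # Normalize case and stray whitespace so HTTP/SERVER and http/server collide.
            owners[spn.strip().lower()].add(account)
    return {spn: sorted(accts) for spn, accts in owners.items() if len(accts) > 1}


# Hypothetical sample data standing in for an export from AD.
sample = {
    "svc_bo": ["BOCMS/svc_name.example.internal", "HTTP/SERVER", "HTTP/server.example.internal"],
    "SERVER$": ["HOST/SERVER", "HOST/server.example.internal"],
    "svc_old": ["http/server"],
}

duplicates = find_duplicate_spns(sample)
for spn, accounts in duplicates.items():
    print(spn, "->", ", ".join(accounts))
```

In this made-up sample, the only SPN flagged is the HTTP one that sits on both the current and the old service account. The HOST entries on the machine account are a different service class, so they are not reported as duplicates of the HTTP ones.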
I’ve checked Chrome versions and Internet Options to make sure SSO is allowed. The Chrome version doesn’t seem to matter, and Internet Options allows SSO for the Intranet zone.
I can see an “Audit Failure” in the Security logs when an attempt is made, but I’m not sure what to do with that information or if it’s even actionable.
kinit works normally for me on all of the servers; logging into BusinessObjects manually does too.
I’m lost and everyone else is, too. I can’t think of anything else to check and it seems that everything is a dead end at this point.
I’d welcome any suggestions for further tracing or troubleshooting.
—
Thanks for all of the suggestions and questions! I’m able to check some things myself but I’ll probably have to get with the actual AD folks to make changes, turn on logging, whatever with the domain controllers.
Edit to add some more details, specifically about the attack:
I’m not part of the core IT, Infrastructure, or Security teams that handled the attack and mitigation, so I’m not sure of a lot of the details. I learned most of what I know about it from being close to people on those teams and being part of the remediation. The company’s official statement was that it may have been related to a phishing attack; then they made us take some online phishing training module. The company is still in the process of building out and implementing proper security policies and protocols, because why would they have done that sooner when we have customers with highly regulated data? The whole thing was a circus.
Apparently an IT person overseas who was working the night shift noticed some “suspicious activity” on a domain admin contractor account. One of the things they mentioned noticing was that the attacker was using nslookup to search the domain. They found that infections were in progress and basically just switched the whole thing off. They isolated it from the Internet and everything else and started installing FireEye and SentinelOne. Some servers they were able to “restore” or “clean”, but there was a shortlist of servers that had to be rebuilt from scratch, including at least one of the domain controllers (the one I mentioned above). I don’t know the extent of what else was done. They also said they were making everyone change their passwords, but I never got a forced password reset at that time.
They started up all of the applications on the prod servers, and our hardware choked. Turning on over 100 applications whose hosts had previously gone through hard shutdowns was too much for the DBs and VMs, and most of them died again. They turned everything back off as gracefully as possible and restarted just a few servers, basically the main ones where the customers were threatening us. We had over 100 customers with a complete outage of accounting applications for over a week. They had to turn everything back on slowly and play whack-a-mole with host and database resource issues. About a week and a half into the recovery, they discovered something else that they thought was suspicious, turned off some servers, and disabled a bunch of accounts. They stripped domain admin from almost everyone. I never heard what happened with this because they tried to downplay it, but I believe that a few more servers may have been lost.
It took about 3-4 weeks for them to get most of prod up and running, and issues continued for about 2 months after the incident before we got back to mostly “normal”. About 4 weeks in, they discovered that a related environment had been compromised (not sure if it was on the same domain because it’s completely separate from my business unit). They turned it off and called it a total loss. I think they ended up migrating the customers to a different platform entirely.
It took about 6-8 weeks before the UAT environment was turned back on, and it was smoother because of the lessons learned from production.
I suspect that some other things may have occurred, as many of you have suggested, but it’s difficult to get a comprehensive answer. Everyone seems to know only a small part of what occurred, and because of the panic during the process there aren’t any good records or documentation of what happened and what was done.
I could just let this issue be someone else’s problem because I found out recently that my team is being eliminated soon for offshoring. I kinda want to solve it just to show my competency and for resume/interview content.