0

We have been getting "Login failed for user '[UserName]'" as a transient error happening randomly for all our applications that connect to our on-premise SQL Server Database. Here is a few points to help narrow it down:

  1. The error is transient and occurs randomly at a rate of about 10 errors / day. (Our production DB is queried well over 1m times / day).

  2. As the error is rare it is mainly happening for our 2 most used service-accounts, shared between most applications (I know not best practice to have a single DB user shared between a multitude of apps, but the point is it is not isolated to a single db user). Normally this error is tied to user incorrect user credentials, but the credentials are not changing.

  3. The users are not AD users, they are SQL server Logins. (So I conclude it has nothing to do with connection to the AD services.)

  4. The errors started at a distinct point in time: the 9th Jan 2023. We had no software deployments that day. This leads me to believe that it may possibly be related to a SQL Server patch, Windows Server Update (?), or Hardware change (???). The SQL server is a single on-premise VM running in a private cluster. I don't have access to SQL server update history or any Hardware changes etc as I'm not responsible for that. (I asked the IT staff in charge of VMs if they knew of any changes but it turned up nothing.) At this moment we are running SQL Server 14.0.3456.2

I've googled a ton of course but it seems pretty much all info about this error is related to invalid credentials which does not apply here.

Is there any reasons why this error would show up in a random transient fashion?

RoleyBaxter
  • 223
  • 3
  • 14
  • Have you checked the SQL Server logs for the true authentication error? What was that error? – Thom A Mar 23 '23 at 15:23
  • In addition to a patch or hardware change, it could also be related to load growth over time, that Jan 9 was the day you finally got big enough that logins started timing out during high-load periods. – Joel Coehoorn Mar 23 '23 at 15:46
  • Do you have access to the instance's ERRORLOG file (either via SSMS or the file system)? Start by checking the error 18456 messages to see what `State` number they're reporting and compare that against [MSSQLSERVER_18456](https://learn.microsoft.com/en-us/sql/relational-databases/errors-events/mssqlserver-18456-database-engine-error) for the actual login failure reason. I'd also be suspicious of any Logon Triggers you might have defined due to the potential for locking-related timeouts when under load, or if they're sending data to a remote service that itself is buckling under load. – AlwaysLearning Mar 23 '23 at 21:39
  • I have added an image of the occurrences in the application logs, to the post above. I do have access to the SQL Server Logs. I can see 18456 errors pertaining to the 04/02 incident where you can see a sharp peak in errors, but these are different from the daily occurring errors and was due to the instance entering shutdown state. Other that that there are no 18456 errors in the log...strange. We are using EF Core for the most part in the applications, though not exclusively. Any reason such an error could be thrown in the application if no connection to the SQL Server could be established? – RoleyBaxter Mar 24 '23 at 07:14

0 Answers0