[BUG REPORT, C# SKD v2.16.2905] Managed heap corruption caused by Zebra.Sdk.Comm.Usb.Direct.Internal.UsbDataStream.BeginRead

// Expert user has replied.
M Mijail Todorovich 10 months 2 weeks ago
215 3 0

Hi.

I've been chasing a nasty heap corruption bug for a while and I think I finally found the culprit in Zebra.Sdk.Comm.Usb.Direct.Internal.UsbDataStream.BeginRead.

Using WinDbg's !verifyheap command I found that the heap corruption was always caused by a string similar to this one overwritting something:

{"device.host_status":"030,0,1,0642,003,0,0,0,000,0,0,0\r\n001,0,0,0,1,2,6,0,00000001,1,000\r\n0000,0"}

Which is the output of the following SGD command:

{}{"device.host_status":null}

This points to the Printer.GetCurrentStatus method.

Using dotPeek I looked at the decompiled source code of the SDK (version 2.16.2905), tracing the code for GetCurrentStatus until I arrived at Zebra.Sdk.Comm.Usb.Direct.Internal.UsbDataStream.BeginRead where I found something fishy.

Here's the decompiled source code for BeginRead:

public override unsafe IAsyncResult BeginRead(
     byte[] buffer,
     int offset,
     int count,
     AsyncCallback callback,
     object state)
   {
     NativeMethods.SetLastError(0U);
     GCHandle gcHandle = GCHandle.Alloc((object) buffer, GCHandleType.Pinned);
     UsbAsyncResult ar = new UsbAsyncResult(state, callback, true);
     Overlapped overlapped1 = new Overlapped(0, 0, ar.AsyncWaitHandle.SafeWaitHandle.DangerousGetHandle(), (IAsyncResult) ar);
     try
     {
       byte[] userData = new byte[count];
       NativeOverlapped* overlapped2 = overlapped1.Pack(this.m_IOCompletionCallback, (object) userData);
       byte* buffer1 = (byte*) ((ulong) gcHandle.AddrOfPinnedObject().ToInt64() + (ulong) offset);
       if (gcHandle.IsAllocated)
         gcHandle.Free();
       if (NativeMethods.ReadFile(this.m_hFile, buffer1, (uint) count, out uint _, overlapped2))
       {
         ar.m_bCompletedSynchronously = true;
         return (IAsyncResult) ar;
       }
       int lastWin32Error = Marshal.GetLastWin32Error();
       if (lastWin32Error == 997)
         return (IAsyncResult) ar;
       throw new IOException("Unable to initialize read. Error code: " + lastWin32Error.ToString());
     }
     finally
     {
       if (gcHandle.IsAllocated)
         gcHandle.Free();
     }
   }

I think buffer is being moved by the GC after the assignment of the pointer buffer1 and before ReadFile finishes.

This could be avoided if buffer was passed to overlapped1.Pack instead of userData.

From the documentation for Overlapped.Pack:

Remarks

[...]

The buffer or buffers specified in userData must be the same as those passed to the unmanaged operating system function that performs the asynchronous I/O.

Note:

The runtime pins the buffer or buffers specified in userData for the duration of the I/O operation. If the application domain is unloaded, the runtime keeps the memory pinned until the I/O operation completes.

 

 

I patched that change into SdkApi.Desktop.Usb.dll using dnSpy and ran a 48 hour stress test and the issue has not reappeared (before, it crashed after about 4 to 8 hours). It could be luck, though. I'll continue the stress test as long as I can and report back.

I can provide a few memory dumps if necessary.

Please Register or Login to post a reply

3 Replies

M Mijail Todorovich

After 6 days of continuous stress test with the patched DLL, the issue has still not reappeared.

S Steven Si

This is an interesting finding. Thanks for reporting and sharing solution. How do you conduct the stress test? Do you just keep calling the getCurrentStatus() at a fixed interval?

M Mijail Todorovich

I conducted the stress test using a program I'm developing. It's not a minimal reproducible example by any means, but it's not that complex either.

It's a WPF app which basically boils down to calling getCurrentStatus and printing a label every time a NumberBox is increased. To keep the UI responsive while avoiding two threads from talking to the printer at the same time, there's a queue implemented with a TPL DataFlow which processes the print jobs one at a time on ThreadPool threads. For the test, I added a timer which increases the NumberBox every two seconds.

I ran the test using a paused ZD421. To avoid filling the printer's buffer, I send ~JA every 5 prints (using the queue).

To try and catch the error as close as possible to the corruption, I set the following flags using GFlags: Enable heap tail checking, Enable heap free checking, Enable heap parameter checking, Enable heap validation on call, Enable page heap, Enable heap tagging, Enable application verifier.

 

I think the important parts for reproducing are calling getCurrentStatus in a background thread while on the main thread having enough allocations and deallocations to trigger heap compaction. Also, accessing objects in the heap so corruption is eventually noticed (alternatively, setting the environment varible COMPLUS_HeapVerify=1 and enabling the gcUnmanagedToManaged MDA to verify the heap every time a thread transitions from unmanaged to managed code, though I couldn't get that to work).

CONTACT
Can’t find what you’re looking for?