Code Optimization

Interesting things in software development and code optimization

Desktop Tools: Protractor, Ruler - in C#

Hi dear friend,


This is a free application that lets you measure distances and angles right on your desktop.

It sits in the system tray and is available at any time: just right-click the icon and select a tool.

In some cases it will switch your Windows color scheme, so do not be alarmed.


Click three points to measure an angle; the first point is the vertex of the angle.

Select the ruler to show it and measure distance in cm; use rotation to rotate the ruler.
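For readers curious how an angle can be computed from three clicked points, here is a minimal sketch in C#. It is not taken from the application's source; the class and method names are made up for illustration, and the first point is treated as the vertex, as in the tool.

```csharp
using System;

static class AngleSketch
{
    // Angle in degrees at vertex (cx, cy) between the rays to (ax, ay) and (bx, by).
    public static double AngleDegrees(double cx, double cy,
                                      double ax, double ay,
                                      double bx, double by)
    {
        double a1 = Math.Atan2(ay - cy, ax - cx);
        double a2 = Math.Atan2(by - cy, bx - cx);
        double deg = Math.Abs(a1 - a2) * 180.0 / Math.PI;
        // Report the smaller of the two possible angles.
        return deg > 180.0 ? 360.0 - deg : deg;
    }

    static void Main()
    {
        // Vertex at the origin, one ray along X and one along Y.
        Console.WriteLine(AngleDegrees(0, 0, 10, 0, 0, 10));
    }
}
```

Using Atan2 rather than a dot-product formula avoids dividing by the ray lengths, so a very short ray does not need special handling.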


Thank you


DeskTools.zip (46KB)




DataGridView and huge amount of data rows

Hello my friends,

Have you ever needed to populate a DataGridView control with a lot of data? I'm sure you have.

If you have a huge number of rows, say 10,000 or more, you will run into a serious performance problem.

To avoid the slowdown, you need to set a proper value for the RowHeadersWidthSizeMode property.

So the best way is to disable automatic resizing during data binding:

dataGridView1.RowHeadersWidthSizeMode = DataGridViewRowHeadersWidthSizeMode.DisableResizing;

You can also use EnableResizing, but avoid DataGridViewRowHeadersWidthSizeMode.AutoSizeToAllHeaders.

AutoSizeToAllHeaders is the most time-consuming option.

In addition, it is better to set RowHeadersVisible to false:

dataGridView1.RowHeadersVisible = false;

Now you can bind the data source and then re-enable these settings, or set them to whatever you need.
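Put together, the sequence could look like the sketch below. The helper class, the method name and the showRowHeadersAfter switch are hypothetical; only the two DataGridView properties come from the text above.

```csharp
using System.Data;
using System.Windows.Forms;

static class GridBindingSketch
{
    // Binds a large table with the expensive row-header sizing switched off first.
    // showRowHeadersAfter is a hypothetical switch for restoring the headers afterwards.
    public static void BindLargeTable(DataGridView grid, DataTable table, bool showRowHeadersAfter)
    {
        grid.RowHeadersWidthSizeMode = DataGridViewRowHeadersWidthSizeMode.DisableResizing;
        grid.RowHeadersVisible = false;

        grid.DataSource = table;

        // Restore the headers only after binding; the width mode stays non-automatic.
        grid.RowHeadersVisible = showRowHeadersAfter;
    }
}
```

From a form it would be called as GridBindingSketch.BindLargeTable(dataGridView1, myTable, false), where myTable stands for whatever DataTable you bind.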


Thank you, see you next time.





Language per chat in Skype

Hello,

If you use Skype and write messages in different languages, you have probably noticed that Skype does not remember the input language for each chat, which is really annoying.

I wrote an app that does it for you.


Thank you; feedback is welcome.

ToolsForSkype.zip (111KB)




Remote SQL Backup to local PC

Hello friends,


Each of us has had trouble with remote SQL backup files. Not with the files themselves, but with how to get the .bak files from a remote SQL Server to a local PC.

I googled this question a lot (and I'm sure you did as well) and did not find any solution. To be honest, I knew of no way to do it until yesterday :)

Yesterday I faced this problem again, and it seems I have found a reliable solution, or at least a chance of one. I have to say it is not a 100% working solution; it is about 99%, and I will describe why at the end of this post.

So, in general it looks like this:

- execute an SQL script to back up the database;

- create a temp database with a table that has a varbinary column;

- get the *.bak file and insert it into the temp table;

- stream this row to your local pc and save as file;

- drop temp table and db;


That's it. It does not sound like a very complex task, but there may be some problems you will have to solve, and some of them may not be solvable at all, in which case you are in the unlucky 1% :(


Now let's look at each step more closely. The first step is to create the backup, and here there are two problems. The first is that a binary field can hold at most 2^31-1 bytes (2 GB minus 1 byte), so we have to split the backup into several files. The second is that SQL Server supports splitting a backup into at most 64 files. So if the backup is larger than about 128 GB, I think this approach will not work :(

OK, say we have a database of a few GB, so 64 .bak files (or fewer) are enough. Let's do it.

Let's calculate the size of our database:

USE MyDB

SELECT CAST(SUM(size) * 8. / 1024 AS BIGINT)

FROM sys.master_files WITH(NOWAIT)

WHERE database_id = DB_ID() GROUP BY database_id

This returns the size in MB as a BIGINT value.
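From that number you can estimate how many .bak files to ask for. The sketch below is only an illustration: the class and method names are made up, and it assumes the backup is no larger than the database and is striped evenly across the files.

```csharp
using System;

static class BackupSplitSketch
{
    const long MaxBlobBytes = int.MaxValue; // 2^31 - 1 bytes per binary value
    const int MaxBackupFiles = 64;          // limit on backup files mentioned above

    // Returns how many .bak files are needed so that each one fits into one binary value.
    public static int FilesNeeded(long databaseSizeMb)
    {
        long totalBytes = databaseSizeMb * 1024L * 1024L;
        // Round up: a partial chunk still needs its own file.
        long files = Math.Max(1, (totalBytes + MaxBlobBytes - 1) / MaxBlobBytes);
        if (files > MaxBackupFiles)
            throw new ArgumentOutOfRangeException("databaseSizeMb",
                "The backup would need more than 64 files.");
        return (int)files;
    }

    static void Main()
    {
        Console.WriteLine(FilesNeeded(5000)); // a database of about 5 GB
    }
}
```

In practice it is safer to ask for one or two files more than the estimate, because the stripes are not guaranteed to be exactly equal in size.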

Next step is to create our bak files:

BACKUP DATABASE MyDB TO DISK = N'MyDB_tmp_1.bak'

,DISK = N'MyDB_tmp_2.bak'

,DISK = N'MyDB_tmp_3.bak'

,DISK = N'MyDB_tmp_4.bak'

,DISK = N'MyDB_tmp_5.bak'

WITH NOFORMAT, NOINIT, NAME = N'MyDB-Full Database Backup', SKIP, NOREWIND, NOUNLOAD, STATS = 10

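This backs up our database to the default backup location on the server.

Now it's time to create a temp database and table, or you may use the same database if you have no permission to create a new one. Note that on a case-insensitive server the name TempDB used below refers to SQL Server's own tempdb system database, so choose a different name in practice: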

IF db_id('TempDB') IS NULL

begin

create database [TempDB];

end

else

begin

use master;

ALTER DATABASE [TempDB] SET SINGLE_USER WITH ROLLBACK IMMEDIATE;

drop database [TempDB];

create database [TempDB];

end

use [TempDB];

create table Temp (filename nvarchar(512), [file] varbinary(max));

We also need to know the default backup path, which can be read from the backup history:

SELECT TOP 1 physical_device_name

FROM msdb.dbo.backupset b

JOIN msdb.dbo.backupmediafamily m ON b.media_set_id = m.media_set_id

WHERE database_name = '{0}'

and backup_finish_date >=N'{1:yyyy-MM-dd}'

and backup_finish_date < N'{2:yyyy-MM-dd}'

ORDER BY backup_finish_date DESC

Now we have everything we need to insert each file into the temp table and download the files one by one.

Let's insert a file:

INSERT INTO [{0}].dbo.Temp([filename], [file])

SELECT N'{1}' as [filename], * FROM OPENROWSET(BULK N'{1}', SINGLE_BLOB) AS [file]

Here we may hit another problem: you may get an error saying you have no permission for bulk load operations. This is a real problem as well, so either you stop here, or (if you use web hosting) you may try to upload a web app and code it to add the file to the table as a byte array.

Now everything is ready to download the file, but note that it is better to use streaming instead of the default buffered reading:

SELECT * FROM TempDB.dbo.Temp WHERE [filename] = N'{0}'

and C# code:

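// Select only the row for the current .bak file; the file name is passed as a parameter.
using (SqlCommand sqlCmd = new SqlCommand(
    "SELECT [filename], [file] FROM [" + tmpDBName + "].dbo.Temp WHERE [filename] = @fileName",
    sqlConnection))
{
    sqlCmd.Parameters.AddWithValue("@fileName",
        string.Format(bakFileName, this.defaultBakPath, sqlConnection.Database, i));
    sqlCmd.CommandTimeout = sqlConnection.ConnectionTimeout;

    // SequentialAccess streams the varbinary column instead of buffering the whole row.
    using (SqlDataReader sqldr = sqlCmd.ExecuteReader(System.Data.CommandBehavior.SequentialAccess))
    {
        if (!sqldr.Read())
            throw new InvalidOperationException("The backup row was not found in the temp table.");

        string fileName = sqldr.GetString(0);

        // FileMode.Create truncates a stale file left over from a previous run.
        using (System.IO.FileStream file = new System.IO.FileStream(
            System.IO.Path.Combine(this.localPath, System.IO.Path.GetFileName(fileName)),
            System.IO.FileMode.Create, System.IO.FileAccess.Write))
        {
            long startIndex = 0;
            const int ChunkSize = 1024 * 32; // 32 KB block
            byte[] buffer = new byte[ChunkSize];

            while (true)
            {
                long retrievedBytes = sqldr.GetBytes(1, startIndex, buffer, 0, ChunkSize);
                file.Write(buffer, 0, (int)retrievedBytes);
                startIndex += retrievedBytes;

                // A short read means we have reached the end of the column.
                if (retrievedBytes != ChunkSize)
                    break;
            }
        }
    }
}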

OK, we have the first file. Now we repeat the same steps for the rest: delete the downloaded row from the temp table, insert the next file, download it, and so on.

Finally, delete everything you don't need anymore - temp database and table:

DELETE FROM [TempDB].dbo.Temp

use master; ALTER DATABASE [TempDB] SET SINGLE_USER WITH ROLLBACK IMMEDIATE; drop database [TempDB];

Voila! You have your remote SQL backup file on your local PC. Cool!

I have written a simple C#.NET application that does it all for you, but please make sure the following conditions are met:

- you have the BULK INSERT permission or full admin rights

- your backup files are less than 128 GB in total

- you do not hit some other problem that I have not faced yet


You can extend it by deleting the .bak files from the server disk, by calculating the backup size to split it into fewer .bak files, and more.

If you have any comments, you are welcome.

Thank you.


RemoteToLocalSQLBackup.zip (9.4KB)
RemoteToLocalSQLBackup_src.zip (13.8KB)




Bit manipulations

Hi,

Today I will explain bit manipulations. As you know, each bit in a byte has its own position, from 0 to 7.

As an example, let's use the byte 00000010, which is 0x02 in hexadecimal and 2 in decimal.

It has a "1" at position 1 (counting positions from 0 to 7, right to left).


Shifting to the left moves all bits to the left:

00000010 << 2 = 00001000

As you can see, the 1 is now at position 3 instead of position 1, because we shifted all bits to the left by 2 places. The same rule applies to shifting right.
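The same shifts can be tried directly in C#. This is a small sketch; ToBinary8 is a helper name invented here just to print the low 8 bits.

```csharp
using System;

static class ShiftSketch
{
    // Formats the low 8 bits of a value as a binary string, e.g. 2 -> "00000010".
    public static string ToBinary8(int value)
    {
        return Convert.ToString(value & 0xFF, 2).PadLeft(8, '0');
    }

    static void Main()
    {
        byte b = 0x02;
        Console.WriteLine(ToBinary8(b));      // the original byte
        Console.WriteLine(ToBinary8(b << 2)); // shifted left by 2 places
        Console.WriteLine(ToBinary8(b >> 1)); // shifted right by 1 place
    }
}
```

Note that C# promotes a byte to int before shifting, which is why the result is masked with 0xFF before printing.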


Another type of bit manipulation is Boolean operations such as OR, XOR and AND.

OR sets a bit to 1 if it is 1 in either operand:

0010 OR 0001 = 0011
0010 OR 0010 = 0010


XOR works like OR, except that it outputs 0 where both bits are 1 (it outputs 1 only where the two bits differ):

0010 XOR 0001 = 0011
0010 XOR 0010 = 0000


AND keeps only the bits that are 1 in both operands:

0010 AND 0001 = 0000
0010 AND 0010 = 0010
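In C# these operations are written as |, ^ and &. The sketch below repeats the examples above and shows the usual way to set, toggle and test a single bit; all class and method names are made up for illustration.

```csharp
using System;

static class BitOpsSketch
{
    // Formats the low 4 bits of a value, e.g. 3 -> "0011".
    public static string ToBinary4(int value)
    {
        return Convert.ToString(value & 0xF, 2).PadLeft(4, '0');
    }

    // OR with a one-bit mask sets the bit at the given position.
    public static int SetBit(int value, int position) { return value | (1 << position); }

    // XOR with a one-bit mask flips the bit at the given position.
    public static int ToggleBit(int value, int position) { return value ^ (1 << position); }

    // AND with a one-bit mask tells whether the bit is set.
    public static bool IsBitSet(int value, int position) { return (value & (1 << position)) != 0; }

    static void Main()
    {
        Console.WriteLine(ToBinary4(0x2 | 0x1)); // OR
        Console.WriteLine(ToBinary4(0x2 ^ 0x2)); // XOR
        Console.WriteLine(ToBinary4(0x2 & 0x1)); // AND
    }
}
```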


Ha, nothing too hard, is it? :)


You may ask: why do we need it? I will explain and show an example of where and how to use it.
Let's imagine that you work with bitmaps in C#. Do you want it to be really fast? I guess yes, for sure. So let's figure out how to make it very fast.
As I said, graphics memory and images are, in the most common cases, RGB or RGBA, and we will work with RGBA because it has a transparency component called Alpha. So RGBA is just a group of 4 bytes. Do you remember which C# type is 4 bytes long? Yes, it is Int32. So we are going to use Int32 to work with our images. Here is how:

Rectangle rect = new Rectangle(0, 0, tmpi.Width, tmpi.Height);

BitmapData bmpd = tmpi.LockBits(rect, ImageLockMode.ReadWrite, PixelFormat.Format32bppArgb);

Int32[] src = new Int32[bmpd.Stride / 4 * tmpi.Height];

System.Runtime.InteropServices.Marshal.Copy(bmpd.Scan0, src, 0, src.Length);

Do you think I'm crazy? :)
No: reading and writing 4 bytes at once is much faster than doing it 1 byte at a time, which is why I do not use byte[].
But now we need to split an Int32 into its R, G, B and A components, and here is how we do it:

public void SetPointAsIs(byte r, byte g, byte b, byte a, int x, int y)

{

src[Stride * y + x] = (a << 24) + (r << 16) + (g << 8) + b;

}

public ColorBgra GetPoint(int x, int y)

{

Int32 argb = src[Stride * y + x];

ColorBgra c = new ColorBgra();

c.A = (byte)(argb >> 24);

c.R = (byte)(argb >> 16 & 0xff);

c.G = (byte)(argb >> 8 & 0xff);

c.B = (byte)(argb & 0xff);

return c;

}
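The packing and unpacking can be checked in isolation, without a bitmap. The sketch below is a stand-alone illustration with invented names (ArgbSketch, Pack, Unpack); it does not use the ColorBgra type from the code above.

```csharp
using System;

static class ArgbSketch
{
    // Packs four channel bytes into one 32-bit ARGB value.
    public static int Pack(byte a, byte r, byte g, byte b)
    {
        return (a << 24) | (r << 16) | (g << 8) | b;
    }

    // Splits a 32-bit ARGB value back into its four channel bytes.
    public static void Unpack(int argb, out byte a, out byte r, out byte g, out byte b)
    {
        // Masking with 0xFF also discards the sign bits that >> copies in
        // when alpha is 0x80 or higher and the int is negative.
        a = (byte)((argb >> 24) & 0xFF);
        r = (byte)((argb >> 16) & 0xFF);
        g = (byte)((argb >> 8) & 0xFF);
        b = (byte)(argb & 0xFF);
    }

    static void Main()
    {
        int packed = Pack(0x11, 0x22, 0x33, 0x44);
        Console.WriteLine(packed.ToString("X8"));
    }
}
```

Using | instead of + gives the same result here, because the four shifted values never share a bit.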

This code compiles into a single instruction such as MOV ds:[edx + offset], 0x11223344, instead of four MOV instructions that each move one byte.



Thank you, and see you next time :)




Hardware and Bits

Hello,


As I said, we will go from basic to complex things step by step, and this post clarifies some important things that you must know about hardware and bits.


Each PC consists of different devices such as a monitor, HDD, keyboard, mouse, etc., and each device has its own controller (a chip) that controls it. For example, a mouse has a small chip that controls the laser and exchanges commands and data with the PC.


Almost every device has its own IRQ: an interrupt number that the system assigns to the device so that it can send and receive commands. So when you move your mouse, it triggers an IRQ and provides data such as DX and DY, and the system knows in which direction and by how many points to move the cursor.

This is a general explanation; to understand more, you have to find a book and read it.


So devices talk to each other and to the system in bytes. They send a lot of bytes, and each byte consists of 8 bits, like:

...

0 0 0 0   0 0 0 0

0 1 1 0   1 1 1 1

...

To understand what these bits represent in a more human-readable way, you have to know how to convert from the binary numeral system to hexadecimal or decimal. You can find this on the internet or in a book.
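As a quick illustration, here is how the second byte shown above can be converted in C#. The class and method names are invented for this sketch; the conversion calls are the standard Convert methods.

```csharp
using System;

static class RadixSketch
{
    // Parses a binary string such as "0110 1111" (spaces are ignored).
    public static int FromBinary(string bits)
    {
        return Convert.ToInt32(bits.Replace(" ", ""), 2);
    }

    static void Main()
    {
        int value = FromBinary("0110 1111");
        Console.WriteLine(value);                                     // decimal
        Console.WriteLine(value.ToString("X2"));                      // hexadecimal
        Console.WriteLine(Convert.ToString(value, 2).PadLeft(8, '0')); // back to binary
    }
}
```

By hand the same byte works out as 64 + 32 + 8 + 4 + 2 + 1 = 111, and each group of four bits maps to one hexadecimal digit: 0110 is 6 and 1111 is F.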


When I started to learn programming, my first language was assembler for the Z-80 CPU. I had a book that described all 255 op-codes (CPU commands) with some additional information, but it was not enough for me to understand anything, let alone write even a simple program like the popular "Hello World!".

I learned bits, bytes, bit operations such as shifting left and right, conditional operations such as jumps, CPU registers, etc. That was a lot, but still not enough to write "Hello World!".

Then I figured out the memory structure (my brother helped me with this), and at some point there was a flash in my mind: "Oh God! That is how it all works!".


The main thing is to understand the memory structure and how controllers work. In my case, video memory was mapped from address 16384 to 32767; one part of this memory represented pixels, and the other part colors.

Filling just the pixel memory with the right bits draws a picture on the screen. For example, a pattern can be represented by these bits:

1 1 0 0 1 1 0 0    = 0xCC
1 1 0 0 1 1 0 0    = 0xCC
1 1 0 0 1 1 0 0    = 0xCC
1 1 1 1 1 1 0 1    = 0xFD
1 1 1 1 1 1 1 1    = 0xFF
1 1 0 0 1 1 1 0    = 0xCE
1 1 0 0 1 1 0 1    = 0xCD
1 1 0 0 1 1 0 0    = 0xCC
....

So, as you can see, you have to set the right bits in each byte and write the bytes to the right memory addresses.

Then you can color the picture as you need by writing color bytes to the right place in memory.
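To see how bytes turn into pixels, the rows listed above can be rendered as text. This is only a sketch with invented names; it prints '#' for a set bit and '.' for a clear one, with the highest bit as the leftmost pixel.

```csharp
using System;
using System.Text;

static class GlyphSketch
{
    // Renders one byte of pixel memory as 8 characters, most significant bit first.
    public static string RenderRow(byte row)
    {
        StringBuilder sb = new StringBuilder(8);
        for (int bit = 7; bit >= 0; bit--)
            sb.Append((row & (1 << bit)) != 0 ? '#' : '.');
        return sb.ToString();
    }

    static void Main()
    {
        // The eight bytes from the listing above.
        byte[] rows = { 0xCC, 0xCC, 0xCC, 0xFD, 0xFF, 0xCE, 0xCD, 0xCC };
        foreach (byte row in rows)
            Console.WriteLine(RenderRow(row));
    }
}
```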


Nowadays the graphics memory layout is simpler: each pixel is represented by 3 or 4 bytes (RGB or RGBA). The price of this simplicity is that modern PCs need more memory, resources and speed.


In the next step we will look into bits and bit shifting as well as bit operations, which are very important in software development.


Thank you.





C#.NET and unmanaged static C++ library

Hi friends,

Today I'm going to share my experience with C++ static libraries.


A static library differs from a dynamic library in that the parts of the static library's code that you use are included in the caller's code.

For example, I have an exe whose code calls a function from a static library. The code of that function will be included in my exe, and I will not need to ship any library with the exe.


But with C# things get trickier, because C#.NET is managed code and a static library is unmanaged code, so it cannot be linked and included.

So we need another way to do it. As you know, we can use PInvoke to access functions exported from a DLL, and this is going to help us a lot.

The first step is to create a C++ DLL project and link the static library to it. Our DLL project acts as a wrapper: we add exported functions to it so that we can invoke them from C#.NET code.

Here is how to do it:

extern "C" __declspec(dllexport) int InitializeLib2(int type, const char *data, BOOL useFlag)

{

return ::InitializeLib(type, data, useFlag);

}

Here we have declared our InitializeLib2 function for export so that it can be used via PInvoke from C# code. Inside InitializeLib2 we call the static library's InitializeLib function and simply pass our parameters through.

Now we build it and get a DLL file that can be PInvoked from our C# code.

Here is how to declare it:

[DllImport("MyWrapper.dll", CallingConvention = CallingConvention.Cdecl)]

public static extern int InitializeLib2(int zero, ref byte str, bool b);

and here is how to call this method:

byte[] str = ASCIIEncoding.ASCII.GetBytes("my string data" + ((char)0).ToString());

int a = InitializeLib2(0, ref str[0], false);

So we pass an integer value, then the string (I like to pass strings as byte arrays), and the last parameter is a Boolean value.
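If you prefer to let the marshaller handle the string, an alternative declaration is sketched below. This is an assumption on my side rather than the declaration used above: the InitializeLib2Str name and the helper are made up, while MyWrapper.dll and InitializeLib2 come from the example.

```csharp
using System;
using System.Runtime.InteropServices;
using System.Text;

static class WrapperInteropSketch
{
    // Alternative declaration (an assumption, not the one used above):
    // the marshaller converts the string to a null-terminated ANSI char* for the call.
    [DllImport("MyWrapper.dll", CallingConvention = CallingConvention.Cdecl,
               EntryPoint = "InitializeLib2")]
    public static extern int InitializeLib2Str(
        int type,
        [MarshalAs(UnmanagedType.LPStr)] string data,
        [MarshalAs(UnmanagedType.Bool)] bool useFlag);

    // Builds the same kind of null-terminated ASCII buffer as the byte-array approach.
    public static byte[] ToNullTerminatedAscii(string text)
    {
        byte[] bytes = new byte[Encoding.ASCII.GetByteCount(text) + 1];
        Encoding.ASCII.GetBytes(text, 0, text.Length, bytes, 0);
        return bytes; // the last byte stays 0 and acts as the terminator
    }
}
```

The byte-array form gives you full control over the encoding and the terminator, while the string form is shorter to call; both reach the same const char* parameter on the C++ side.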


That's all. At the end of this you will have MyWrapper.dll and a managed exe file.

Thank you, and good luck :)





C#.NET - Fast Memory Copy method with x86 Assembler

Introduction

I'm Oleksandr Karpov and this is my first article here, thanks for reading it.

Here, I'm going to show and explain how to copy data really fast and how to use assembly under C# and .NET. In my case, I use it in an application that creates video from images, video and sound.
Also, if you have an assembly method or function that you need to use under C#, this will show you how to do it in a quick and simple way.

Background

To understand it all, it helps to know assembly language, memory alignment and some advanced C#, Windows and .NET techniques.
To copy data really fast, the memory addresses must be 16-byte aligned; otherwise the speed is almost the same as the standard copy (in my case, about 1.02 times faster).

The code uses SSE instructions, which are supported by processors starting from Pentium III (KNI/MMX2) and AMD Athlon (AMD EMMX).

I have tested it on my Pentium Dual-Core E5800 3.2GHz with 4GB RAM in dual-channel mode.
For me, the fast copy method is 1.5 times faster than the standard one with 16-byte aligned memory, and almost the same (1.02 times faster) with non-aligned memory addresses.

To allocate 16-byte aligned memory in C# under Windows, we have three options:

a) At the moment it seems that the Bitmap object (actually Windows itself, internally) allocates memory at a 16-byte aligned address, so we can use a Bitmap for quick and easy aligned allocation;

b) As a managed array, by allocating 15 extra bytes (enough to reach the next 16-byte boundary whatever alignment the heap provides) and calculating the 16-byte aligned offset within the allocated memory:

int dataLength = 4096;

// +15 bytes: worst-case padding needed to reach the next 16-byte boundary
byte[] buffer = new byte[dataLength + 15];

// pin the array so that the garbage collector cannot move it while we use its address
GCHandle handle = GCHandle.Alloc(buffer, GCHandleType.Pinned);
IntPtr addr = Marshal.UnsafeAddrOfPinnedArrayElement(buffer, 0);

// round the address up to a multiple of 16 and get the offset of that point within the array
int bufferAlignedOffset = (int)(((long)addr + 15) / 16 * 16 - (long)addr);

// ... use the buffer starting at bufferAlignedOffset, then call handle.Free()

c) By allocating memory with VirtualAlloc API:

IntPtr addr = VirtualAlloc(IntPtr.Zero,

new UIntPtr(dataLength + 8),

AllocationTypes.Commit | AllocationTypes.Reserve,

MemoryProtections.ExecuteReadWrite);


addr = new IntPtr(((long)addr + 15) / 16 * 16);
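The rounding used in options b) and c) can be written as a small helper and checked on plain numbers. This is a sketch with invented names; it uses a bit mask instead of the divide-and-multiply form, which gives the same result for power-of-two alignments.

```csharp
using System;

static class AlignSketch
{
    // Rounds an address up to the next multiple of a power-of-two alignment.
    public static long AlignUp(long address, int alignment)
    {
        if (alignment <= 0 || (alignment & (alignment - 1)) != 0)
            throw new ArgumentException("alignment must be a power of two", "alignment");
        long mask = (long)alignment - 1;
        return (address + mask) & ~mask;
    }

    // True when the low bits of the address are all zero.
    public static bool IsAligned(long address, int alignment)
    {
        return (address & ((long)alignment - 1)) == 0;
    }

    static void Main()
    {
        long aligned = AlignUp(0x1008, 16);
        Console.WriteLine(aligned.ToString("X"));
        Console.WriteLine(IsAligned(aligned, 16));
    }
}
```

The IsAligned test is the same check the assembly code performs with and eax,00F followed by test eax,eax.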

Using the Code

This is a complete performance test that will show you performance measurements and how to use it all.

The FastMemCopy class contains all things for fast memory copy logic.

First, create a default Windows Forms application project and put two buttons and a PictureBox control on the form, as we will test on images.

Let's declare some fields:

string bitmapPath;

Bitmap bmp, bmp2;

BitmapData bmpd, bmpd2;

byte[] buffer = null;

Now, we will create two methods to handle click events for our buttons.

For standard method:

private void btnStandard_Click(object sender, EventArgs e)

{

using (OpenFileDialog ofd = new OpenFileDialog())

{

if (ofd.ShowDialog() != System.Windows.Forms.DialogResult.OK)

return;

bitmapPath = ofd.FileName;

}


//open a selected image and create an empty image with the same size

OpenImage();


//unlock for read and write images

UnlockBitmap();

//copy data from one image to another by standard method

CopyImage();

//lock images to be able to see them

LockBitmap();

//lets see what we have

pictureBox1.Image = bmp2;

}

and for fast method:

private void btnFast_Click(object sender, EventArgs e)

{

using (OpenFileDialog ofd = new OpenFileDialog())

{

if (ofd.ShowDialog() != System.Windows.Forms.DialogResult.OK)

return;

bitmapPath = ofd.FileName;

}

//open a selected image and create an empty image with the same size

OpenImage();

//unlock for read and write images

UnlockBitmap();

//copy data from one image to another with our fast method

FastCopyImage();

//lock images to be able to see them

LockBitmap();

//lets see what we have

pictureBox1.Image = bmp2;

}

OK, now we have the buttons and event handlers, so let's implement the methods that open the images, lock and unlock them, and perform the standard copy:

Open an image:

void OpenImage()

{

pictureBox1.Image = null;

buffer = null;

if (bmp != null)

{

bmp.Dispose();

bmp = null;

}

if (bmp2 != null)

{

bmp2.Dispose();

bmp2 = null;

}

GC.Collect(GC.MaxGeneration, GCCollectionMode.Forced);

bmp = (Bitmap)Bitmap.FromFile(bitmapPath);

buffer = new byte[bmp.Width * 4 * bmp.Height];

bmp2 = new Bitmap(bmp.Width, bmp.Height, bmp.Width * 4, PixelFormat.Format32bppArgb,

Marshal.UnsafeAddrOfPinnedArrayElement(buffer, 0));

}

Lock and unlock bitmaps:

void UnlockBitmap()

{

bmpd = bmp.LockBits(new Rectangle(0, 0, bmp.Width, bmp.Height), ImageLockMode.ReadWrite,

PixelFormat.Format32bppArgb);

bmpd2 = bmp2.LockBits(new Rectangle(0, 0, bmp.Width, bmp.Height), ImageLockMode.ReadWrite,

PixelFormat.Format32bppArgb);

}

void LockBitmap()

{

bmp.UnlockBits(bmpd);

bmp2.UnlockBits(bmpd2);

}

and copy data from one image to another and show measured time:

void CopyImage()

{

//start stopwatch

Stopwatch sw = new Stopwatch();

sw.Start();

//copy-paste data 10 times

for (int i = 0; i < 10; i++)

{

System.Runtime.InteropServices.Marshal.Copy(bmpd.Scan0, buffer, 0, buffer.Length);

}

//stop stopwatch

sw.Stop();

//show measured time

MessageBox.Show(sw.ElapsedTicks.ToString());

}

That's it for the standard copy-paste method. Actually, there is nothing too complex, we use well-known System.Runtime.InteropServices.Marshal.Copy method.

And one more "middle-method" for the fast copy logic:

void FastCopyImage()

{

FastMemCopy.FastMemoryCopy(bmpd.Scan0, bmpd2.Scan0, buffer.Length);

}

Now, let's implement the FastMemCopy class. Here is the declaration of the class and some types we will use inside of it:

internal static class FastMemCopy

{

[Flags]

private enum AllocationTypes : uint

{

Commit = 0x1000, Reserve = 0x2000,

Reset = 0x80000, LargePages = 0x20000000,

Physical = 0x400000, TopDown = 0x100000,

WriteWatch = 0x200000

}

[Flags]

private enum MemoryProtections : uint

{

Execute = 0x10, ExecuteRead = 0x20,

ExecuteReadWrite = 0x40, ExecuteWriteCopy = 0x80,

NoAccess = 0x01, ReadOnly = 0x02,

ReadWrite = 0x04, WriteCopy = 0x08,

GuardModifierflag = 0x100, NoCacheModifierflag = 0x200,

WriteCombineModifierflag = 0x400

}

[Flags]

private enum FreeTypes : uint

{

Decommit = 0x4000, Release = 0x8000

}

[UnmanagedFunctionPointerAttribute(CallingConvention.Cdecl)]

private unsafe delegate void FastMemCopyDelegate();

private static class NativeMethods

{

[DllImport("kernel32.dll", SetLastError = true)]

internal static extern IntPtr VirtualAlloc(

IntPtr lpAddress,

UIntPtr dwSize,

AllocationTypes flAllocationType,

MemoryProtections flProtect);

[DllImport("kernel32")]

[return: MarshalAs(UnmanagedType.Bool)]

internal static extern bool VirtualFree(

IntPtr lpAddress,

uint dwSize,

FreeTypes flFreeType);

}

Now let's declare the method itself:

public static unsafe void FastMemoryCopy(IntPtr src, IntPtr dst, int nBytes)

{

if (IntPtr.Size == 4)

{

//we are in 32 bit mode

//allocate memory for our asm method

IntPtr p = NativeMethods.VirtualAlloc(

IntPtr.Zero,

new UIntPtr((uint)x86_FastMemCopy_New.Length),

AllocationTypes.Commit | AllocationTypes.Reserve,

MemoryProtections.ExecuteReadWrite);

try

{

//copy our method bytes to allocated memory

Marshal.Copy(x86_FastMemCopy_New, 0, p, x86_FastMemCopy_New.Length);

//make a delegate to our method

FastMemCopyDelegate _fastmemcopy =

(FastMemCopyDelegate)Marshal.GetDelegateForFunctionPointer(p,

typeof(FastMemCopyDelegate));

//parameters are stored at the end of our method block;

//use a separate pointer so that p keeps the base address for VirtualFree

IntPtr paramPtr = p + x86_FastMemCopy_New.Length;

//store length param

paramPtr -= 8;

Marshal.Copy(BitConverter.GetBytes((long)nBytes), 0, paramPtr, 4);

//store destination address param

paramPtr -= 8;

Marshal.Copy(BitConverter.GetBytes((long)dst), 0, paramPtr, 4);

//store source address param

paramPtr -= 8;

Marshal.Copy(BitConverter.GetBytes((long)src), 0, paramPtr, 4);

//Start stopwatch

Stopwatch sw = new Stopwatch();

sw.Start();

//copy-paste all data 10 times

for (int i = 0; i < 10; i++)

_fastmemcopy();

//stop stopwatch

sw.Stop();

//get message with measured time

System.Windows.Forms.MessageBox.Show(sw.ElapsedTicks.ToString());

}

catch (Exception ex)

{

//if any exception

System.Windows.Forms.MessageBox.Show(ex.Message);

}

finally

{

//free allocated memory: MEM_RELEASE requires the base address and a size of 0

NativeMethods.VirtualFree(p, 0,

FreeTypes.Release);

GC.Collect(GC.MaxGeneration, GCCollectionMode.Forced);

}

}

else if (IntPtr.Size == 8)

{

throw new ApplicationException("x64 is not supported yet!");

}

}

and assembly code that is represented as an array of bytes with explanation:

private static byte[] x86_FastMemCopy_New = new byte[]

{

0x90, //nop do nothing

0x60, //pushad store all general-purpose registers on stack

0x95, //xchg ebp, eax eax contains memory address of our method

0x8B, 0xB5, 0x5A, 0x01, 0x00, 0x00, //mov esi,[ebp][00000015A] get source buffer address

0x89, 0xF0, //mov eax,esi

0x83, 0xE0, 0x0F, //and eax,00F will check if it is 16 byte aligned

0x8B, 0xBD, 0x62, 0x01, 0x00, 0x00, //mov edi,[ebp][000000162] get destination address

0x89, 0xFB, //mov ebx,edi

0x83, 0xE3, 0x0F, //and ebx,00F will check if it is 16 byte aligned

0x8B, 0x8D, 0x6A, 0x01, 0x00, 0x00, //mov ecx,[ebp][00000016A] get number of bytes to copy

0xC1, 0xE9, 0x07, //shr ecx,7 divide length by 128

0x85, 0xC9, //test ecx,ecx check if zero

0x0F, 0x84, 0x1C, 0x01, 0x00, 0x00, //jz 000000146 ? copy the rest

0x0F, 0x18, 0x06, //prefetchnta [esi] pre-fetch non-temporal source data for reading

0x85, 0xC0, //test eax,eax check if source address is 16 byte aligned

0x0F, 0x84, 0x8B, 0x00, 0x00, 0x00, //jz 0000000C0 ? go to copy if aligned

0x0F, 0x18, 0x86, 0x80, 0x02, 0x00, 0x00, //prefetchnta [esi][000000280] pre-fetch more source data

0x0F, 0x10, 0x06, //movups xmm0,[esi] copy 16 bytes of source data

0x0F, 0x10, 0x4E, 0x10, //movups xmm1,[esi][010] copy more 16 bytes

0x0F, 0x10, 0x56, 0x20, //movups xmm2,[esi][020] copy more

0x0F, 0x18, 0x86, 0xC0, 0x02, 0x00, 0x00, //prefetchnta [esi][0000002C0] pre-fetch more

0x0F, 0x10, 0x5E, 0x30, //movups xmm3,[esi][030]

0x0F, 0x10, 0x66, 0x40, //movups xmm4,[esi][040]

0x0F, 0x10, 0x6E, 0x50, //movups xmm5,[esi][050]

0x0F, 0x10, 0x76, 0x60, //movups xmm6,[esi][060]

0x0F, 0x10, 0x7E, 0x70, //movups xmm7,[esi][070] we've copied 128 bytes of source data

0x85, 0xDB, //test ebx,ebx check if destination address is 16 byte aligned

0x74, 0x21, //jz 000000087 ? go to paste if aligned

0x0F, 0x11, 0x07, //movups [edi],xmm0 paste first 16 bytes to non-aligned destination address

0x0F, 0x11, 0x4F, 0x10, //movups [edi][010],xmm1 paste more

0x0F, 0x11, 0x57, 0x20, //movups [edi][020],xmm2

0x0F, 0x11, 0x5F, 0x30, //movups [edi][030],xmm3

0x0F, 0x11, 0x67, 0x40, //movups [edi][040],xmm4

0x0F, 0x11, 0x6F, 0x50, //movups [edi][050],xmm5

0x0F, 0x11, 0x77, 0x60, //movups [edi][060],xmm6

0x0F, 0x11, 0x7F, 0x70, //movups [edi][070],xmm7 we've pasted 128 bytes of source data

0xEB, 0x1F, //jmps 0000000A6 ? continue

0x0F, 0x2B, 0x07, //movntps [edi],xmm0 paste first 16 bytes to aligned destination address

0x0F, 0x2B, 0x4F, 0x10, //movntps [edi][010],xmm1 paste more

0x0F, 0x2B, 0x57, 0x20, //movntps [edi][020],xmm2

0x0F, 0x2B, 0x5F, 0x30, //movntps [edi][030],xmm3

0x0F, 0x2B, 0x67, 0x40, //movntps [edi][040],xmm4

0x0F, 0x2B, 0x6F, 0x50, //movntps [edi][050],xmm5

0x0F, 0x2B, 0x77, 0x60, //movntps [edi][060],xmm6

0x0F, 0x2B, 0x7F, 0x70, //movntps [edi][070],xmm7 we've pasted 128 bytes of source data

0x81, 0xC6, 0x80, 0x00, 0x00, 0x00, //add esi,000000080 increment source address by 128

0x81, 0xC7, 0x80, 0x00, 0x00, 0x00, //add edi,000000080 increment destination address by 128

0x83, 0xE9, 0x01, //sub ecx,1 decrement counter

0x0F, 0x85, 0x7A, 0xFF, 0xFF, 0xFF, //jnz 000000035 ? continue if not zero

0xE9, 0x86, 0x00, 0x00, 0x00, //jmp 000000146 ? go to copy the rest of data

0x0F, 0x18, 0x86, 0x80, 0x02, 0x00, 0x00, //prefetchnta [esi][000000280] pre-fetch source data

0x0F, 0x28, 0x06, //movaps xmm0,[esi] copy 128 bytes from aligned source address

0x0F, 0x28, 0x4E, 0x10, //movaps xmm1,[esi][010] copy more

0x0F, 0x28, 0x56, 0x20, //movaps xmm2,[esi][020]

0x0F, 0x18, 0x86, 0xC0, 0x02, 0x00, 0x00, //prefetchnta [esi][0000002C0] pre-fetch more data

0x0F, 0x28, 0x5E, 0x30, //movaps xmm3,[esi][030]

0x0F, 0x28, 0x66, 0x40, //movaps xmm4,[esi][040]

0x0F, 0x28, 0x6E, 0x50, //movaps xmm5,[esi][050]

0x0F, 0x28, 0x76, 0x60, //movaps xmm6,[esi][060]

0x0F, 0x28, 0x7E, 0x70, //movaps xmm7,[esi][070] we've copied 128 bytes of source data

0x85, 0xDB, //test ebx,ebx check if destination address is 16 byte aligned

0x74, 0x21, //jz 000000112 ? go to paste if aligned

0x0F, 0x11, 0x07, //movups [edi],xmm0 paste 16 bytes to non-aligned destination address

0x0F, 0x11, 0x4F, 0x10, //movups [edi][010],xmm1 paste more

0x0F, 0x11, 0x57, 0x20, //movups [edi][020],xmm2

0x0F, 0x11, 0x5F, 0x30, //movups [edi][030],xmm3

0x0F, 0x11, 0x67, 0x40, //movups [edi][040],xmm4

0x0F, 0x11, 0x6F, 0x50, //movups [edi][050],xmm5

0x0F, 0x11, 0x77, 0x60, //movups [edi][060],xmm6

0x0F, 0x11, 0x7F, 0x70, //movups [edi][070],xmm7 we've pasted 128 bytes of data

0xEB, 0x1F, //jmps 000000131 ? continue copy-paste

0x0F, 0x2B, 0x07, //movntps [edi],xmm0 paste 16 bytes to aligned destination address

0x0F, 0x2B, 0x4F, 0x10, //movntps [edi][010],xmm1 paste more

0x0F, 0x2B, 0x57, 0x20, //movntps [edi][020],xmm2

0x0F, 0x2B, 0x5F, 0x30, //movntps [edi][030],xmm3

0x0F, 0x2B, 0x67, 0x40, //movntps [edi][040],xmm4

0x0F, 0x2B, 0x6F, 0x50, //movntps [edi][050],xmm5

0x0F, 0x2B, 0x77, 0x60, //movntps [edi][060],xmm6

0x0F, 0x2B, 0x7F, 0x70, //movntps [edi][070],xmm7 we've pasted 128 bytes of data

0x81, 0xC6, 0x80, 0x00, 0x00, 0x00, //add esi,000000080 increment source address by 128

0x81, 0xC7, 0x80, 0x00, 0x00, 0x00, //add edi,000000080 increment destination address by 128

0x83, 0xE9, 0x01, //sub ecx,1 decrement counter

0x0F, 0x85, 0x7A, 0xFF, 0xFF, 0xFF, //jnz 0000000C0 ? continue copy-paste if non-zero

0x8B, 0x8D, 0x6A, 0x01, 0x00, 0x00, //mov ecx,[ebp][00000016A] get number of bytes to copy

0x83, 0xE1, 0x7F, //and ecx,07F get rest number of bytes

0x85, 0xC9, //test ecx,ecx check if there are bytes

0x74, 0x02, //jz 000000155 ? exit if there are no more bytes

0xF3, 0xA4, //rep movsb copy rest of bytes

0x0F, 0xAE, 0xF8, //sfence performs a serializing operation on all store-to-memory instructions

0x61, //popad restore all general-purpose registers

0xC3, //retn return from our method to C#

0x00, 0x00, 0x00, 0x00, //source buffer address

0x00, 0x00, 0x00, 0x00,

0x00, 0x00, 0x00, 0x00, //destination buffer address

0x00, 0x00, 0x00, 0x00,

0x00, 0x00, 0x00, 0x00, //number of bytes to copy-paste

0x00, 0x00, 0x00, 0x00

};

We call this assembly method via the delegate we created earlier.

This method works in 32-bit mode for now; I will implement the 64-bit mode later.
I will add the source code if anyone is interested in it (almost all of the code is already in the article).

Note that the assembly code throws an exception if it is run under Visual Studio, and I still don't understand why.

Points of Interest

While implementing and testing this method, I found that the prefetchnta instruction is not described very clearly even in the Intel specification, so I tried to figure it out myself and via Google.
Also, pay attention to the movntps and movaps instructions, as they work only with 16-byte aligned memory addresses.

History

  • Bitmap and 16 byte memory alignment
  • Source code and memory alignment samples were added
  • First version - 06/23/2015
FastMemoryCopy_src.zip (14.4KB)




Do you want to be the best software developer?

Hi everyone,

I'm going to share two simple things you need to know to be one of the best software developers.

Really, just two things, and they are very simple:

- You must know the English language, at the very least;

- You must understand hardware, bits, bytes and bit operations;


To be honest, one more very useful thing is math. That is all you need.

Huh, it seems not too much, but not too little either.


Let's go through all of this step by step. I will post articles here covering everything you have to go through to succeed.

Thank you, and be patient ;)



