Tuesday, February 12, 2008

Disk Alignment

This is one of the most important topics we have covered so far when it comes to performance. Having the disks that make up a LUN misaligned can cost an application up to 30% in performance. This happens because of the “signature,” or metadata, that a host writes to the beginning of a LUN/disk. To understand it, we first have to look at how the Clariion formats its LUNs.

In an earlier blog entry, we described how the Clariion formats the disks. The Clariion lays data out on each disk in 128-block stripe elements, which is the 64 KB of data that is written to a disk from cache at a time. The trouble starts when an operating system like Windows grabs the LUN and initializes it by writing a disk signature. That signature takes up the first 63 blocks, or 31.5 KB, of disk space. Because the Clariion formats the disks in 128-block (64 KB) elements, that leaves only 65 blocks (32.5 KB) of space on the first disk for the host to write data. The host writes to cache in whatever block size it uses, and cache then holds the data and writes it out to disk in 64 KB chunks. Because of the “signature,” each 64 KB chunk now has to span two physical disks on the Clariion. Usually we say that hitting more disks is better for performance, but with this DISK CROSS, performance on the LUN goes down because cache is now waiting for an acknowledgement from two disks instead of one. If one disk is overloaded with I/O, failing, etc., the acknowledgement back to the Storage Processor is delayed. This happens for every chunk of data cache writes out to this LUN, and it affects not only the LUN cache is writing to, but potentially every LUN in the RAID group.
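
To put numbers on it (block numbers start at 0, and each 128-block stripe element maps to one physical disk):

First element (first disk): blocks 0-127. Second element (second disk): blocks 128-255.
Signature: blocks 0-62.
First 64 KB write from cache: blocks 63-190, which spills from the first disk's element onto the second - a disk cross.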

By setting an offset on the LUN with a host-based utility, i.e. Diskpart or Diskpar for Windows, we allow the Clariion to write each 64 KB data chunk to one physical disk at a time. Essentially, we are giving up the remaining disk space on the first physical disk in the RAID group, as the illustration above shows. Windows still writes its “signature” to the first 63 blocks, but we use Diskpart or Diskpar to push the start of the partition past the remaining space on the first disk. When cache writes out to disk now, it begins at the first block on the second disk in the RAID group, so the full 128-block/64 KB chunk the Clariion wants to write lands on one physical disk.
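
With the offset in place, the same arithmetic works out cleanly:

Signature: blocks 0-62, with blocks 63-127 left unused.
Partition data starts at block 128.
First 64 KB write from cache: blocks 128-255, which fits entirely within the second disk's element - no disk cross.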

The problem with all of this is that the offset/alignment needs to be set on a Windows disk/LUN before any data is written to it. Once there is data on the LUN, this cannot be done without destroying the existing partition and data. The only way to fix the problem at that point is to create a new LUN on the Clariion, assign it to the host, set the offset/alignment, and do a host-based copy/migration. Again, a Clariion LUN Migration is a block-for-block LUN copy/move; all a LUN Migration does is move the problem to a new location on the Clariion.
Windows has two command-prompt utilities that can be used to set the offset/alignment: Diskpar and Diskpart.

Diskpar is used for Windows systems running Windows 2000, or Windows 2003 without at least Service Pack 1. Diskpar can be downloaded as part of the Resource Kit, and through its command-line interface the offset should be set to 128. Diskpar sets the offset in blocks (sectors); since the Clariion formats the disks in 128-block elements, writes to the LUN now start at block number 128, which is the first block on the second disk.
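
From memory, a Diskpar session goes something like this - the disk number is only an example, and the exact prompts vary a bit by Resource Kit version:

C:\> diskpar -s 1
Please specify starting offset (in sectors): 128
Please specify partition length (in MB): (the size you want for the partition)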

Diskpart is for Windows systems running Windows 2003 Service Pack 1 and up. Diskpart sets the alignment in kilobytes. Since the Clariion formats the disks in 64 KB elements, an alignment of 64 starts the partition at the first full 64 KB chunk, which sits on the second physical disk in the RAID group.
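
With Diskpart, the alignment goes on the create partition command; a minimal example (disk number assumed) looks like this:

C:\> diskpart
DISKPART> list disk
DISKPART> select disk 1
DISKPART> create partition primary align=64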

This is also an issue with Linux servers, where an offset needs to be set as well. Here again, the number to use is 128, because fdisk works in blocks, not kilobytes.
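
On Linux, the older fdisk lets you move the start of the data from its expert menu; a rough walk-through (device name is just an example):

# fdisk /dev/sdb
n - create a new primary partition 1, accepting the default start and end
x - switch to the expert menu
b - move the beginning of data in partition 1; enter 128
w - write the partition table and exit
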
The following blog entry will list the steps for setting the offset for Windows 2003, as well as Linux servers.

13 comments:

stucky said...

I had a lengthy discussion with EMC about that.
Everyone always mentions having to set the alignment offset as if a partition table was a mandatory thing to have when it's not.

Scenario 1.
I grab my clariion luns then pvcreate them, then add them to VG's and then carve out LV's.
Then I put an ext3 filesystem onto those LV's - works like a charm and no partition table was ever created. I don't see the point of partitioning my lun further when I wanna use the whole thing anyway.
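
For scenario 1 the whole flow is roughly this (device, VG and LV names made up):

pvcreate /dev/emcpowera
vgcreate datavg /dev/emcpowera
lvcreate -L 100G -n datalv datavg
mkfs.ext3 /dev/datavg/datalv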

Scenario 2.
I grab my clariion luns and hand them straight to ASM. Again - no partition table ever !

I'd assume these 2 scenarios are much more common than using fdisk !
The EMC engineer agreed that the 2 scenarios do *not* require alignment offsets to be set.

san guy said...

That is true... because you are using LVM, we don't have to set the offset on the LUNs for the host. And again, this only applies to Intel-based systems (Windows and Linux).

Unknown said...

Hey San Guy: Do you know if disk alignment is required for an Intel x86 based Solaris system? If so, can you point me to any documentation you might know of that discusses it?

stucky said...

Actually I have to take back my comment about ASM, as it turns out oracle does not support whole disks for ASM. They force you to put at least one partition down.
Their explanation for this insanity?
Apparently they wanna avoid a situation where linux or any other app thinks the disk is unused and therefore available since it's missing a partition table.
I don't know of a situation where linux randomly grabs a disk you didn't tell it to grab but hey - it is what it is, so yes, we're back to having to align the partition here.

DOuG pRATt said...

If the alignment offset is set correctly when defining the LUN within the RAID group, the adjustment in DISKPART isn't needed, is it?

stucky said...

Doug

You can set this on the lun level but for reasons I totally don't remember EMC recommends that you don't do it on the clariion but rather on the host using fdisk or similar tools.
Oracle also has a howto on doing that for ASM so it seems that is the way to go.
If only I remembered the reason...

Unknown said...

You should always use the host-based method for alignment offset when possible. The lun level alignment offset will misalign the I/O requests made from any software that operates on the raw lun. This would include replication products such as san copy and mirrorview.

stucky said...

Thanks Ben..I remember now.

DOuG pRATt said...

Thanks very much. We're about to enable asynch Mirrorview between two locations, and the less overhead the better.

Bob Flynn said...

Regarding alex's comments re asm. Looking at http://www.oracle.com/technology/products/database/asm/pdf/asm_10gr2_bestpractices%2009-07.pdf

on page 9, it seems very blase about getting the alignment correct. If I present luns to a linux server, what steps do I need to complete prior to presenting the luns to ASM, as the document refers to sparc?

Johannes Rohrauer said...

so, if i understand you right: when i create a meta-lun containing a couple of luns inside, point my esx-host to the lun and format it with vmfs, then i am on the safe side up to that point. but then i need to start e.g. a winpe or a linux live cd and create my partitions on it by hand, aligned?

is this the way to go?

nomas said...

In early 2013, when I deployed Oracle 10 as a VM on ESXi with Red Hat, I used fdisk to set the offset to 128 (64 KB) for my VNX 5300 SANs (Clariion). All VM hard disks for Oracle were mapped 1 ASM disk group disk to 1 udev SCSI disk to 1 VM hard disk (on one of four paravirtual controllers, depending on the Oracle/RHFS type) to 1 ESXi datastore to a single LUN on the VNX. All VM hard disks followed the practice of being persistent and fully thick provisioned to avoid future write delays.

Later when VEEAM replication came out, we had to back off to thin provisioning.

My question is: what is the block offset for an all-flash SSD array such as my EMC UNITY 450F, with a single dynamic storage pool and disk encryption turned on (d.a.r.e)?

Using the dynamic storage pool, we went from 13 to 21 LUNs per server down to 3 LUNs per server, with an easy way to 'undercommit' storage and an easy procedure to add a disk drive to the DATA or FRA ASM disk groups as needed, in an optimized disk size.

Every LUN is created with thin provisioning and data reduction across

I'm just curious whether, on UNITY / all-flash, which has other block offset options, EMC still recommends 128 for Oracle workloads.

Thank you in advance,
Larry T