Wednesday, October 21, 2009

Data type changes on a very large table

I had a situation recently where, in looking at some tables, we realized they had originally been created with a number of columns set to BigInt. As we know, the BigInt type can hold values up to 9,223,372,036,854,775,807 and costs 8 bytes per value. We don't think we will ever have values in these specific fields bigger than the max of the Int data type (2,147,483,647), and Int uses half the space (4 bytes). This was the case for around 15-20 tables, with several columns in each, and each table was around 20 GB. In many of the tables every field was a BigInt, so changing the data type to Int would save us half the storage space needed for the table. Without thinking it through, one of the DBAs tried to change the data type on one of the tables (after hours) and the transaction log went through the roof, to the point that we had almost no free space left on the tran log drive...not good. So how to solve the problem?
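For context, what was attempted was essentially an in-place ALTER COLUMN along these lines (DBName, TableName, and SomeBigIntColumn are placeholders, not our real names):

USE DBName
-- Shrinking BIGINT to INT rewrites and validates every row, and that rewrite is
-- fully logged, which is what filled the transaction log drive on a ~20 GB table.
-- Keep the column's original NULL/NOT NULL setting when you run this.
ALTER TABLE dbo.TableName ALTER COLUMN SomeBigIntColumn INT NOT NULL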

These tables were set up on a partition scheme where each month gets its own file; this will come in handy later on.
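For reference, the monthly partitioning looks roughly like this (a sketch: the function, scheme, and filegroup names are made up, and the boundary values would really be whatever TimeKeyID values fall on the first of each month):

-- RANGE RIGHT with one boundary per month, mapped to one filegroup per month.
-- Boundary values shown as YYYYMMDD-style ints purely for illustration.
CREATE PARTITION FUNCTION pf_MonthlyByTimeKey (INT)
AS RANGE RIGHT FOR VALUES (20090101, 20090201, 20090301)

CREATE PARTITION SCHEME ps_MonthlyByTimeKey
AS PARTITION pf_MonthlyByTimeKey
TO (FG_200812, FG_200901, FG_200902, FG_200903)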

/***********************************************
First: create staging table:
***********************************************/
1) This I do with the GUI: right-click the table and choose Script Table As > CREATE To > New Query Editor Window.
2) Do the same thing for indexes, triggers, etc. (but for these I copy to the clipboard and paste into the new table's window).
3) Do a replace-all on the table name to add a _Staging to the end. Note: depending on your naming convention, this should add _Staging to the indexes and triggers as well. A trimmed sketch of the end result is below.
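The scripted-out staging table ends up looking something like this (column names and the partition scheme name here are made up; the important part is that the BIGINT columns are now INT):

-- Sketch only: same structure as the original table, but with INT in place of BIGINT,
-- created on the same partition scheme so the row counts can be compared per partition.
CREATE TABLE dbo.TableName_Staging
(
    TimeKeyID INT NOT NULL, -- was BIGINT
    ColumnA   INT NOT NULL, -- was BIGINT
    ColumnB   INT NULL      -- was BIGINT
) ON PartitionSchemeName (TimeKeyID)

CREATE CLUSTERED INDEX IX_TableName_Staging
    ON dbo.TableName_Staging (TimeKeyID)
    ON PartitionSchemeName (TimeKeyID)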



/***********************************************
Next: determine the min date of the data:
***********************************************/

--We use a date/time dimension to store date values, so dates/times are stored as int values in all tables
SELECT MIN(TDT.DateTimeField)
FROM TableName TN WITH(NOLOCK)
JOIN TimeDimTable TDT WITH(NOLOCK) ON TN.TimeKeyID = TDT.TimeKeyID


/***********************************************
Next: BCP out all the data for each month and
import it into our staging table. I do this in
yearly batches so that I can check row counts and
make sure everything matches as I go. Also, stop
before you get to the current month.
***********************************************/
SET NOCOUNT ON

DECLARE @sql VARCHAR(8000)
DECLARE @TblName VARCHAR (255)
DECLARE @query VARCHAR(8000)
DECLARE @dbname VARCHAR (255)
DECLARE @StartDate VARCHAR(20)
DECLARE @ThisStartDate DATETIME
DECLARE @EndDate VARCHAR(20)
DECLARE @ThisEndDate DATETIME
DECLARE @path VARCHAR(255)

SET @TblName = 'TableName'
SET @dbname = 'DBName'
SET @ThisStartDate = '01/01/2008'
SET @ThisEndDate = '02/01/2008'
SET @path = 'Z:\BCP\'

--The date you enter below needs to be the first of the month, 2 months after the last month you want to process. So 2/1/2009 means that we will process up to 12/31/2008.
WHILE @ThisEndDate < '11/01/2009'
BEGIN

PRINT 'Start Time: ' + CAST(GETDATE() AS VARCHAR(30))
PRINT 'Running for Date: ' + CAST(@ThisStartDate AS VARCHAR(20))

--Get the TimeKeyID range for this month (DateTimeField is the datetime column in the time dimension)
SELECT @startdate = MIN(TimeKeyID), @enddate = MAX(TimeKeyID)-1 FROM TimeDimTable WITH(NOLOCK) WHERE DateTimeField BETWEEN @ThisStartDate AND @ThisEndDate

--Sanity-check the row count for the month
SET @query = 'SELECT COUNT(*) FROM ' + @dbname + '.dbo.' + @tblname + ' WITH(NOLOCK) WHERE TimeKeyID BETWEEN ' + @startdate + ' AND ' + @enddate
PRINT @query
EXEC(@query)

--Build the query that BCP will actually export (the COUNT(*) above is only for verification)
SET @query = 'SELECT * FROM ' + @dbname + '.dbo.' + @tblname + ' WITH(NOLOCK) WHERE TimeKeyID BETWEEN ' + @startdate + ' AND ' + @enddate

-- Writing Format File
PRINT 'Writing Format File: ' + CAST(GETDATE() AS VARCHAR(30))
SELECT @sql = 'bcp ' + @dbname + '.dbo.' + @tblname + ' format nul -f ' + @path + @tblname + '.fmt -t, -r \n -c -T -S' + @@ServerName
EXEC master..xp_cmdshell @sql
--PRINT @sql

-- Writing Data File
PRINT 'Writing Data File: ' + CAST(GETDATE() AS VARCHAR(30))
SELECT @sql = 'bcp "' + @query + '" queryout ' + @path + @tblname + '.csv -t, -r \n -c -T -S' + @@ServerName
EXEC master..xp_cmdshell @sql
--PRINT @sql

-- Import Data
PRINT 'Starting Import: ' + CAST(GETDATE() AS VARCHAR(30))
SET @sql = 'BULK INSERT ' + @dbname + '.dbo.' + @tblname + '_Staging FROM ''' + @path + @tblname + '.csv''
WITH
(
ORDER (TimeKeyID ASC)
,FORMATFILE = ''' + @path + @tblname + '.fmt''
,ROWS_PER_BATCH = 33000000
,TABLOCK

)'

--PRINT @sql
EXEC (@sql)

SET @ThisStartDate = DATEADD(m,1,@ThisStartDate)
SET @ThisEndDate = DATEADD(m,1,@ThisEndDate)
PRINT 'Month Complete: ' + CAST(GETDATE() AS VARCHAR(30)) + CHAR(13) + CHAR(10)
END

PRINT 'Entire Run Complete: ' + CAST(GETDATE() AS VARCHAR(30)) + CHAR(13) + CHAR(10)


/*****************************************************
Next: check that the row counts match up. Here is
where having the data partitioned is nice: I pull
the row count for each partition of both tables and
paste it into Excel with a formula to alert me if
the values don't match: =IF(B3,IF(D3<>E3,"BAD",""),"")

The data I paste into Excel is:
IndexID, PartitionNumber, FileGroupName, RowCt, NewRowCt

Here is the query to get this data:
*****************************************************/
DECLARE @FileGroupName varchar(50)
,@TableName varchar(50)
SET @FileGroupName = 'Months_Partition_Im_Working_With'
SET @TableName = 'TableName' -- do this twice, adding _Staging the second time

SELECT OBJECT_NAME(p.object_id) AS TableName,
-- i.name AS IndexName,
p.index_id AS IndexID,
--NULL AS PartitionScheme,
p.partition_number AS PartitionNumber,
fg.name AS FileGroupName,
NULL AS LowerBoundaryValue,
NULL AS UpperBoundaryValue,
-- NULL AS Boundary,
p.rows
FROM sys.partitions AS p
JOIN sys.indexes AS i ON i.object_id = p.object_id AND i.index_id = p.index_id
JOIN sys.data_spaces AS ds ON ds.data_space_id = i.data_space_id
JOIN sys.filegroups AS fg ON fg.data_space_id = i.data_space_id
WHERE OBJECTPROPERTY(p.object_id, 'ISMSShipped') = 0
and object_name(p.object_id) LIKE @TableName
AND fg.NAME LIKE @FileGroupName
UNION ALL
--get info for partitioned table/indexes
SELECT OBJECT_NAME(p.object_id) AS TableName,
-- i.name AS IndexName,
p.index_id AS IndexID,
--ds.name AS PartitionScheme,
p.partition_number AS PartitionNumber,
fg.name AS FileGroupName,
prv_left.value AS LowerBoundaryValue,
prv_right.value AS UpperBoundaryValue,
-- CASE pf.boundary_value_on_right WHEN 1 THEN 'RIGHT' ELSE 'LEFT' END AS Boundary,
p.rows AS Rows
FROM sys.partitions AS p
JOIN sys.indexes AS i ON i.object_id = p.object_id AND i.index_id = p.index_id
JOIN sys.data_spaces AS ds ON ds.data_space_id = i.data_space_id
JOIN sys.partition_schemes AS ps ON ps.data_space_id = ds.data_space_id
JOIN sys.partition_functions AS pf ON pf.function_id = ps.function_id
JOIN sys.destination_data_spaces AS dds2 ON dds2.partition_scheme_id = ps.data_space_id AND dds2.destination_id = p.partition_number
JOIN sys.filegroups AS fg ON fg.data_space_id = dds2.data_space_id
LEFT JOIN sys.partition_range_values AS prv_left ON ps.function_id = prv_left.function_id AND prv_left.boundary_id = p.partition_number - 1
LEFT JOIN sys.partition_range_values AS prv_right ON ps.function_id = prv_right.function_id AND prv_right.boundary_id = p.partition_number
WHERE OBJECTPROPERTY(p.object_id, 'ISMSShipped') = 0
and object_name(p.object_id) LIKE @TableName
and fg.name LIKE @FileGroupName
-- and p.rows > 0
-- and p.index_id IN (0,1)
-- and p.partition_number = 2
ORDER BY TableName, IndexID, PartitionNumber


/*****************************************************
Next: repeat for all months until you get to the
current one. We were lucky that this was for a data
warehouse and the data was only imported hourly, so
we didn't have to worry about live data coming in
while we grabbed the current month. We just ran the
same process for the span from the first of the
current month through today+1, and as soon as we
verified it, we did the swap below.

If we had to worry about live data, I would have
BCP'd a date span from the first of the current
month through today-1, then done an after-hours
insert within a transaction to catch today's data,
and then done the swap below:
*****************************************************/
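Just for illustration, that after-hours catch-up insert might look roughly like this (a sketch only; the join through TimeDimTable and the assumption that SELECT * lines up with the staging table's columns are mine), with the swap itself following it:

--Sketch of the catch-up insert for live data: grab everything from midnight today forward,
--verify the counts, then commit and do the rename swap.
BEGIN TRAN

INSERT INTO dbo.TableName_Staging WITH (TABLOCK)
SELECT TN.*
FROM dbo.TableName TN
JOIN dbo.TimeDimTable TDT ON TN.TimeKeyID = TDT.TimeKeyID
WHERE TDT.DateTimeField >= CAST(CONVERT(VARCHAR(10), GETDATE(), 101) AS DATETIME) --midnight today

--verify row counts here, then:
--COMMIT
--ROLLBACK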
USE [DBName]
EXEC sp_rename 'TableName.IndexName','ZZ_IndexName', 'INDEX'
--Repeat for all original table indexes, triggers, etc.

EXEC sp_rename 'TableName_Staging.IndexName_Staging', 'IndexName', 'INDEX'
--Repeat for all _Staging table indexes, triggers, etc.


EXEC sp_rename 'TableName', 'ZZ_TableName'
EXEC sp_rename 'TableName_Staging', 'TableName'

--Now all our tables and their indexes, triggers, etc should be swapped
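Once the swap is done, a quick sanity check that the live table really has the Int columns doesn't hurt. Something like this (just a sketch against the system catalog):

--Confirm the renamed table's columns are now int instead of bigint
SELECT c.name AS ColumnName, t.name AS DataType
FROM sys.columns c
JOIN sys.types t ON t.user_type_id = c.user_type_id
WHERE c.object_id = OBJECT_ID('dbo.TableName')
ORDER BY c.column_id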

Wednesday, October 7, 2009

Fixing a partition problem

So last time I wrote about using a partition and mentioned it can be used for archiving data. I came across an issue where, because of an error, September's partition didn't start on the first of the month like it should have, so August's partition contains data for September. This is a problem because we want to archive off the data for August, but if we just drop the August partition we will also get rid of the few days of September, and we don't want to do that.


We have a few options for resolving this; most aren't good.
1) We can re-configure the partition.
-- This is bad because of blocking and because of the log file size.

2) We can keep the August partition and delete the August data.
-- This is bad, again because of the log file size.

3) We can keep the August partition and find a different way of getting rid of the data.
-- This has promise.

What I'm going to end up doing is this:

First I am going to do my normal BCP out of the August data.

Next, make an exact schema copy of the tables in question.
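For the SWITCH below to be allowed, that schema copy needs to match the source table exactly: same columns, same indexes, and partitioned on the same scheme (or sitting on the right filegroup for that partition). A rough sketch, with made-up column and scheme names:

--Sketch of the schema-only copy used as the switch target (placeholder names).
--Columns, indexes, and partitioning must match the source table.
CREATE TABLE dbo.TableName_Staging
(
    CreatedTime DATETIME NOT NULL,
    ColumnA     INT NULL,
    ColumnB     INT NULL
) ON PartitionSchemeName (CreatedTime)

CREATE CLUSTERED INDEX IX_TableName_Staging
    ON dbo.TableName_Staging (CreatedTime)
    ON PartitionSchemeName (CreatedTime)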

Next, switch the August partition to the new tables using this code for each table:

USE DatabaseName
ALTER TABLE TableName SWITCH PARTITION 2 TO TableName_Staging PARTITION 2

Now that the data is in a "_Staging" table I can work with it without causing blocking on the main table.

I do the following to get rid of the August data:


/*************************************************
1) Get row count of Sept data
*************************************************/
SELECT COUNT(*) FROM TableName_Staging WHERE CreatedTime >= '9/1/2009' -- 2857208

/*************************************************
2) Copy Sept Data into a Temp Table
-- Confirm row count
*************************************************/
SELECT * INTO #TableName_Staging FROM TableName_Staging WHERE CreatedTime >= '9/1/2009'

/*************************************************
3) Truncate Staging Table
4) Copy data from temp back into staging
-- Confirm row count before issuing Commit
*************************************************/
BEGIN TRAN
TRUNCATE TABLE TableName_Staging
INSERT INTO dbo.TableName_Staging
( ColumnA,
  ColumnB,
  ColumnC,
  ColumnD
)
SELECT * FROM #TableName_Staging

--ROLLBACK
--COMMIT

/*************************************************
5) Drop the temp table
*************************************************/
DROP TABLE #TableName_Staging


So now I have removed the August data from the _Staging table, and rather than deleting 50 million rows I only had to copy 2.8 million, which is a much smaller hit on my tran log and much, much faster.

Now I switch the partition back from the staging table to the live table:

USE DatabaseName
ALTER TABLE TableName_Staging SWITCH PARTITION 2 TO TableName PARTITION 2

And Bob's your uncle, as a co-worker likes to say.