I had a situation recently where, in looking at some tables, we realized they had originally been created with a number of columns set to BigInt. As we know, the BigInt type can hold values up to 9,223,372,036,854,775,807 and costs 8 bytes per value. We don't think we will ever have values in these specific fields bigger than the max of the Int data type (2,147,483,647), and Int uses half the space (4 bytes). This was the case for around 15-20 tables, several columns in each, and each table was around 20 GB. In many of the tables every field was a BigInt, so changing the data type to Int would cut the storage needed for the table roughly in half. Without thinking it through, one of the DBAs tried to change the data type on one of the tables (after hours) and the transaction log went through the roof, to the point that we had almost no free space left on the log drive....not good. So how do we solve the problem?
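For reference, the change that blew up the log was just a straight ALTER COLUMN, something like this (the table and column names here are made up for illustration):

--A narrowing ALTER COLUMN is a size-of-data operation: every row gets rewritten
--inside one transaction, so the log grows by roughly the size of the table.
ALTER TABLE dbo.TableName ALTER COLUMN SomeBigIntColumn INT NOT NULL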
These tables were set up on a partition scheme where each month gets its own filegroup; this will come in handy later on.
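To make that layout concrete, the setup is roughly the following sketch, assuming the partition key is the int TimeKeyID column; the function, scheme, filegroup names, and boundary values are all placeholders, not our real ones:

--Rough sketch of a monthly partition layout; all names and boundary values are placeholders
CREATE PARTITION FUNCTION pfMonthly (INT)
AS RANGE RIGHT FOR VALUES (1000, 2000, 3000) --one boundary per month of TimeKeyID values
CREATE PARTITION SCHEME psMonthly
AS PARTITION pfMonthly TO (FG_200801, FG_200802, FG_200803, FG_200804)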
/***********************************************
First: create staging table:
***********************************************/
1) This I do with the GUI...right-click on the table and choose Script Table As: CREATE To: New Query Editor Window.
2) Do the same thing for indexes, triggers, etc. (but here I copy to the clipboard and paste into the new table's window).
3) Do a replace-all on the table name to add _Staging to the end. Depending on your naming convention, this should add _Staging to the indexes and triggers as well. This is also where you change the BigInt columns to Int, which is the whole point of the exercise. A rough sketch of the end result follows these steps.
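The end result is something like this (the real script comes out of SSMS with all of the original columns, indexes, and triggers; the column names and partition scheme name below are placeholders):

--Minimal sketch of the edited staging script; column and partition scheme names are placeholders
CREATE TABLE dbo.TableName_Staging
(
TimeKeyID INT NOT NULL
,SomeMeasure INT NOT NULL --was BIGINT on the original table
,SomeOtherMeasure INT NULL --was BIGINT on the original table
) ON psMonthly (TimeKeyID) --same partition scheme as the original table
CREATE CLUSTERED INDEX IX_TableName_Staging ON dbo.TableName_Staging (TimeKeyID) ON psMonthly (TimeKeyID)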
/***********************************************
Next: determine the min date of the data:
***********************************************/
--We use a date/time dimension to store date values, so dates/times are stored as int values in all tables
SELECT MIN(TDT.DateTimeField) FROM TableName TN WITH(NOLOCK) JOIN TimeDimTable TDT WITH(NOLOCK) ON TN.TimeKeyID = TDT.TimeKeyID
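For anyone not using this pattern, the time dimension is just a lookup table, something along these lines (a simplified, assumed structure):

--Simplified, assumed shape of the time dimension; the fact tables store TimeKeyID instead of a datetime
CREATE TABLE dbo.TimeDimTable
(
TimeKeyID INT IDENTITY(1,1) PRIMARY KEY
,DateTimeField DATETIME NOT NULL
)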
/***********************************************
Next: Next we BCP out all the data for each
month and import it into our staging table.
I do this in yearly batches so that I can check
numbers to make sure everything matches as I go.
Also, stop before you get to the current month.
***********************************************/
SET NOCOUNT ON
DECLARE @sql VARCHAR(8000)
DECLARE @TblName VARCHAR (255)
DECLARE @query VARCHAR(8000)
DECLARE @dbname VARCHAR (255)
DECLARE @StartDate VARCHAR(20)
DECLARE @ThisStartDate DATETIME
DECLARE @EndDate VARCHAR(20)
DECLARE @ThisEndDate DATETIME
DECLARE @path VARCHAR(255)
SET @TblName = 'TableName'
SET @dbname = 'DBName'
SET @ThisStartdate = '01/01/2008'
SET @ThisEndDate = '02/01/2008'
SET @path = 'Z:\BCP\'
--The date you enter below needs to be the first of the month, two months after the last month you want to process. So 2/1/2009 means we will process up to 12/31/2008.
WHILE @ThisEndDate < '11/01/2009'
BEGIN
PRINT 'Start Time: ' + CAST(GETDATE() AS VARCHAR(30))
PRINT 'Running for Date: ' + CAST(@ThisStartDate AS VARCHAR(20))
SELECT @startdate = MIN(TimeKeyId), @enddate = MAX(TimeKeyID)-1 FROM TimeDimTable WITH(NOLOCK) WHERE DateTimeField BETWEEN @ThisStartDate AND @ThisEndDate
SET @query = 'SELECT COUNT(*) FROM ' + @dbname + '.dbo.'+ @tblname +' WITH(NOLOCK) WHERE TimeKeyID BETWEEN ' + @startdate + ' AND ' + @enddate
PRINT @query
EXEC(@query)
-- Rebuild @query as the actual data-export query (the COUNT(*) above was only a sanity check); ORDER BY matches the ORDER hint on the BULK INSERT below
SET @query = 'SELECT * FROM ' + @dbname + '.dbo.' + @tblname + ' WITH(NOLOCK) WHERE TimeKeyID BETWEEN ' + @startdate + ' AND ' + @enddate + ' ORDER BY TimeKeyID'
-- Writing Format File
PRINT 'Writing Format File: ' + CAST(GETDATE() AS VARCHAR(30))
SELECT @sql = 'bcp ' + @dbname + '.dbo.' + @tblname + ' format nul -f ' + @path + @tblname + '.fmt -t, -r \n -c -T -S' + @@ServerName
EXEC master..xp_cmdshell @sql
--PRINT @sql
-- Writing Data File
PRINT 'Writing Data File: ' + CAST(GETDATE() AS VARCHAR(30))
SELECT @sql = 'bcp "' + @query + '" queryout ' + @path + @tblname + '.csv -t, -r \n -c -T -S' + @@ServerName
EXEC master..xp_cmdshell @sql
--PRINT @sql
-- Import Data
PRINT 'Starting Import: ' + CAST(GETDATE() AS VARCHAR(30))
SET @sql = 'BULK INSERT ' + @dbname + '.dbo.' + @tblname + '_staging FROM ''' + @path + @tblname + '.csv''
WITH
(
ORDER (TimeKeyID ASC)
,FORMATFILE = ''' + @path + @tblname + '.fmt''
,ROWS_PER_BATCH = 33000000
,TABLOCK
)'
--PRINT @sql
EXEC (@sql)
SET @ThisStartDate = DATEADD(m,1,@ThisStartDate)
SET @ThisEndDate = DATEADD(m,1,@ThisEndDate)
PRINT 'Month Complete: ' + CAST(GETDATE() AS VARCHAR(30)) + CHAR(13) + CHAR(10)
END
PRINT 'Entire Run Complete: ' + CAST(GETDATE() AS VARCHAR(30)) + CHAR(13) + CHAR(10)
/*****************************************************
Next: check that the row counts match up. Here is where
having the data partitioned is nice: I pull the row count
for each partition of both tables. I paste this into Excel
with a formula to alert me if the values don't match:
=IF(B3,IF(D3<>E3,"BAD",""),"")
The columns I paste into Excel are:
IndexID, PartitionNumber, FileGroupName, RowCt, NewRowCt
Here is the query to get this data:
*****************************************************/
DECLARE @FileGroupName varchar(50)
,@TableName varchar(50)
SET @FileGroupName = 'Months_Partition_Im_Working_With'
SET @TableName = 'TableName'-- do this twice, adding _Staging the second time
SELECT OBJECT_NAME(p.object_id) AS TableName,
-- i.name AS IndexName,
p.index_id AS IndexID,
--NULL AS PartitionScheme,
p.partition_number AS PartitionNumber,
fg.name AS FileGroupName,
NULL AS LowerBoundaryValue,
NULL AS UpperBoundaryValue,
-- NULL AS Boundary,
p.rows
FROM sys.partitions AS p
JOIN sys.indexes AS i ON i.object_id = p.object_id AND i.index_id = p.index_id
JOIN sys.data_spaces AS ds ON ds.data_space_id = i.data_space_id
JOIN sys.filegroups AS fg ON fg.data_space_id = i.data_space_id
WHERE OBJECTPROPERTY(p.object_id, 'ISMSShipped') = 0
and object_name(p.object_id) LIKE @TableName
AND fg.NAME LIKE @FileGroupName
UNION ALL
--get info for partitioned table/indexes
SELECT OBJECT_NAME(p.object_id) AS TableName,
-- i.name AS IndexName,
p.index_id AS IndexID,
--ds.name AS PartitionScheme,
p.partition_number AS PartitionNumber,
fg.name AS FileGroupName,
prv_left.value AS LowerBoundaryValue,
prv_right.value AS UpperBoundaryValue,
-- CASE pf.boundary_value_on_right WHEN 1 THEN 'RIGHT' ELSE 'LEFT' END AS Boundary,
p.rows AS Rows
FROM sys.partitions AS p
JOIN sys.indexes AS i ON i.object_id = p.object_id AND i.index_id = p.index_id
JOIN sys.data_spaces AS ds ON ds.data_space_id = i.data_space_id
JOIN sys.partition_schemes AS ps ON ps.data_space_id = ds.data_space_id
JOIN sys.partition_functions AS pf ON pf.function_id = ps.function_id
JOIN sys.destination_data_spaces AS dds2 ON dds2.partition_scheme_id = ps.data_space_id AND dds2.destination_id = p.partition_number
JOIN sys.filegroups AS fg ON fg.data_space_id = dds2.data_space_id
LEFT JOIN sys.partition_range_values AS prv_left ON ps.function_id = prv_left.function_id AND prv_left.boundary_id = p.partition_number - 1
LEFT JOIN sys.partition_range_values AS prv_right ON ps.function_id = prv_right.function_id AND prv_right.boundary_id = p.partition_number
WHERE OBJECTPROPERTY(p.object_id, 'ISMSShipped') = 0
and object_name(p.object_id) LIKE @TableName
and fg.name LIKE @FileGroupName
-- and p.rows > 0
-- and p.index_id IN (0,1)
-- and p.partition_number = 2
ORDER BY TableName, IndexID, PartitionNumber
/*****************************************************
Next: repeat for all months until you get to the current one.
We were lucky that this was for a data warehouse and the
data was imported hourly, so we didn't have to worry about
live data coming in while we grabbed the current month's
data. We just ran the same thing for the span from the
first of the current month through today+1, and as soon as
we verified it, we did the swap below.
If we had needed to worry about live data, I would have
BCP'd a date span from the first of the current month
through today-1, then done an after-hours insert (inside a
transaction) to pick up today's data, and then done the
swap below. A rough sketch of that variant follows the
swap script.
*****************************************************/
USE [DBName]
EXEC sp_rename 'TableName.IndexName','ZZ_IndexName', 'INDEX'
--Repeat for all original table indexes, triggers, etc.
EXEC sp_rename 'TableName_Staging.IndexName_Staging', 'IndexName', 'INDEX'
--Repeat for all _Staging table indexes, triggers, etc.
EXEC sp_rename 'TableName', 'ZZ_TableName'
EXEC sp_rename 'TableName_Staging', 'TableName'
--Now all our tables and their indexes, triggers, etc should be swapped
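For completeness, here is roughly what the live-data variant described above would have looked like; we never had to run it, and the key lookup below is just one way to do it:

--Sketch of the live-data variant: top-up insert for today's rows plus the swap, in one transaction
DECLARE @TodayStartKey INT
SELECT @TodayStartKey = MIN(TimeKeyID)
FROM TimeDimTable WITH(NOLOCK)
WHERE DateTimeField >= DATEADD(dd, DATEDIFF(dd, 0, GETDATE()), 0) --midnight today
BEGIN TRANSACTION
    INSERT INTO TableName_Staging WITH(TABLOCK)
    SELECT * FROM TableName WHERE TimeKeyID >= @TodayStartKey --today's rows only
    EXEC sp_rename 'TableName', 'ZZ_TableName'
    EXEC sp_rename 'TableName_Staging', 'TableName'
    --plus the index/trigger renames shown above
COMMIT TRANSACTION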