search
top

How To Get The Most Frequently Used Column Values

Whenever I import external data, integrate to another database, or am new to a project, I need to get familiar with the database. The table schemas, relational integrity, and constraints are the first thing I look at and take me a long way, but soon I need to know what the data looks like.

In an ideal world, relational integrity and database constraints would define control this, and all I’d really need to do is look at those. But the reality is, in 15 years of working in this industry, most of the databases I’ve worked on, that I didn’t design, have barely used constraints and some haven’t even used relation integrity fully!

The need to get a good feel of the data is even more prevalent when working with dirty data, or when refactoring poorly written applications to ensure any refactoring doesn’t introduce other issues. I will usually wind up writing the following query repeatedly:

1
2
3
4
SELECT	column_name, COUNT(*)
FROM	TABLE_NAME
GROUP BY column_name
ORDER BY COUNT(*) DESC, column_name

This little query often reveals; inconsistencies between data and the application, where an application sets column X to possible values of ‘A’, ’B’, ‘C’, ‘D’, or ‘E’, but in reality, there may be zero ‘C’ and ‘E’ values in that column, but there is 6 ‘X’s, 1 ‘Q’, and an ‘?’. Or I may find that there are only 6 rows with data in that column, out of almost 3 million rows, indicating the column / application feature is unused.

Anyway, yesterday I finally wrote a little stored procedure which will print out the most frequent N values in each column for a specified table.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
/*
Purpose : Retrieves the top N most frequent values in 
        each column of the specified data table.
Copyright © Jaten Systems Inc. 2009, All rights reserved
http://www.jatensystems.com/
http://whileicompile.wordpress.com/
http://www.johnmacintyre.ca/
*/
CREATE proc jaten_diag_GetTableContents
(@tblName nvarchar(200),
@rowCount INT)
AS
BEGIN
	DECLARE @colId INT;
	DECLARE @colName nvarchar(100);
	DECLARE @SQL nvarchar(2048);
	DECLARE @tblRowCount INT;
 
	-- cursor to get columns
	DECLARE curColumns CURSOR FAST_FORWARD 
			FOR	SELECT	colid, [name] 
					FROM	syscolumns
					WHERE	id=object_id(@tblName)
					ORDER BY colid
	OPEN curColumns;
 
	-- get table row count
	SET	@SQL = 'SELECT @retCount=COUNT(*) FROM ' + @tblName;
	EXEC sp_executeSQL @SQL, N'@retCount INT OUTPUT', @retCount = @tblRowCount OUTPUT;
 
	-- print table header
	print '';
	print '---------------------------------';
	print '--- ' + @tblName;
	print '--- Row count : ' + cast(@tblRowCount as nvarchar)
	print '---------------------------------';
 
	-- get info for each column
	fetch NEXT FROM curColumns INTO @colId, @colName;
	while 0 = @@fetch_status
	BEGIN
			-- print column header
			print '';
			print '---------------------------------';
			print '--- Column [' + cast(@colId as nvarchar) + '] - ' + @colName + ' ---';
 
			-- compile & execute grouping sql
			SELECT	@SQL = 'SELECT	top ' + CAST(@rowCount AS nvarchar) 
							+ '		COUNT(*) AS [COUNT], '
							+ '		CAST(((COUNT(*) * 100)/' + CAST( @tblRowCount AS nvarchar) + ') AS nvarchar) + ''%'' AS [Percentage], ' 
							+ '		[' + @colName + '] AS [col_value] ' 
							+ 'FROM	' + @tblName + ' ' 
							+ 'GROUP BY [' + @colName + '] ' 
							+ 'ORDER BY COUNT(*) DESC, [' + @colName + ']';
			EXEC sp_executeSQL @SQL;
			--print @sql;
 
			-- next
			fetch NEXT FROM curColumns INTO @colId, @colName;
	END
 
	-- clean up
	close curColumns;
	deallocate curColumns;
END

Please note 2 things :
1. You need to run it with ‘Results to Text’ or ‘Results to File’ setting.
2. The table parameter will need square brackets if the table name uses unconventional characters.

If you create it and run it in AdventureWorks on the ‘Production.Product’ table

1
EXEC jaten_diag_GetTableContents 'Production.Product', 5

… you will get these results

———————————
— Production.Product
— Row count : 504
———————————

———————————
— Column [1] – ProductID —
count Percentage col_value
———– ——————————- ———–
1 0% 1
1 0% 2
1 0% 3
1 0% 4
1 0% 316

(5 row(s) affected)

….

———————————
— Column [6] – Color —
count Percentage col_value
———– ——————————- —————
248 49% NULL
93 18% Black
43 8% Silver
38 7% Red
36 7% Yellow

(5 row(s) affected)

….

Notice how the Color column reveals that almost half of the products do not have a color setting? This could imply relevancy or this data possibly has a problem being maintained. But also, notice how unique columns will obviously provide meaningless data.

The AdventureWorks database is a very clean database, so this example is a bit contrived, but in the real world, there are plenty of databases where this little procedure will allow you to get some insight into the data.

How do you get familiar with new data?

Copyright © John MacIntyre 2009, All rights reserved

WARNING – All source code is written to demonstrate the current concept. It may be unsafe and not exactly optimal.

Leave a Reply

Your email address will not be published. Required fields are marked *

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>

top