"You'll never be able to figure out our data", that's what I was told. 48 hours later after putting in some evening hours, mission accomplished. This is not a hypothetical event but rather a recurring theme in what we do at Premier. Every company believes that their data is so unique and different that only they can make sense of it. In some cases, they believe their data is so complex that it can't even be located, let alone pulled out of the underlying database. The truth, however, is that data can only be structured in so many ways. In order for a system to function, that structure must be logical. Whether it's financial data, manufacturing data, or payroll data and whether it resides in a mainframe or relational database, it's all just rows and columns. In order to make sense of it, one needs only to identify the important tables and then understand the relationships between them.
I've personally run into several projects where my client didn't know where the data of interest was stored. They could identify the database and the content they were looking for through the front end, but they couldn't speak to the names of the underlying tables or columns. This always presents a fun challenge and a great opportunity to add value. It’s a process I like to think of as “Data Spelunking” – searching through the database trying to see what you can find and going wherever it takes you.
Assuming the data is in a relational database, there are a couple of strategies you can employ to trek through the data and find what you're looking for. Both start with the system catalog at the heart of the database (or data dictionary, information schema, etc.), and both are straightforward.
If the database has tables and columns with meaningful names and you're looking for something specific, using the catalog to search for tables, or for columns, with likely names can often lead you quickly to the data you're interested in. For instance, if you were looking at an Oracle database supporting some system and wanted to find tables and columns related to vendors, you could try the following two queries (the first searches table names by themselves, the second finds tables by column name):
SELECT OWNER, TABLE_NAME FROM ALL_TABLES WHERE TABLE_NAME LIKE '%VENDOR%'
SELECT OWNER, TABLE_NAME, COLUMN_NAME FROM ALL_TAB_COLUMNS WHERE COLUMN_NAME LIKE '%VENDOR%' ORDER BY TABLE_NAME
Varying the search term, and trying common abbreviations of it, may enable you to pinpoint what you're looking for. This works extremely well for something like PeopleSoft, where most of the names are very meaningful.
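As a quick sketch of that idea, a single catalog query can cover several candidate spellings at once ('VNDR' here is just a hypothetical abbreviation; substitute whatever variants make sense for your system):

SELECT OWNER, TABLE_NAME, COLUMN_NAME
FROM ALL_TAB_COLUMNS
WHERE COLUMN_NAME LIKE '%VENDOR%'
   OR COLUMN_NAME LIKE '%VNDR%'   -- hypothetical abbreviation; adjust to your system
ORDER BY OWNER, TABLE_NAME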
Sometimes, however, that's not enough due to the sheer number of tables in the database, the number of tables sharing the same column names, or the need to identify many types of data. In such cases, the system catalog can be used to generate a list of tables with corresponding row counts for further analysis. You can use a SQL query against the system catalog to construct a whole series of queries that obtain counts for each table. For Oracle (as an example), you could use:
select 'Select ' || '''' || owner || '''' || ' as "Owner_Name", ' || '''' || table_name || '''' || ' as "Table_Name", Count(*) from ' || owner || '.' || table_name || ' union' from all_tables
This query will return a result that is actually a series of SQL queries, each followed by the word "union":
Select 'DEV' as "Owner_Name", 'PS_CUSTOMER' as" Table_Name", Count(*) from DEV.PS_CUSTOMER union
Select 'DEV' as "Owner_Name", 'PS_CUSTOMER_FSS' as" Table_Name", Count(*) from DEV.PS_CUSTOMER_FSS union
Select 'DEV' as "Owner_Name", 'PS_CUSTOMER_LANG' as" Table_Name", Count(*) from DEV.PS_CUSTOMER_LANG union
Select 'DEV' as "Owner_Name", 'PS_CUSTOPT_TEO1' as" Table_Name", Count(*) from DEV.PS_CUSTOPT_TEO1 union
Select 'DEV' as "Owner_Name", 'PS_CUSTOPT_TEO2' as" Table_Name", Count(*) from DEV.PS_CUSTOPT_TEO2 union
Select 'DEV' as "Owner_Name", 'PS_CUSTOPT_TEO3' as" Table_Name", Count(*) from DEV.PS_CUSTOPT_TEO3 union
Select 'DEV' as "Owner_Name", 'PS_CUSTOPT_TEOA' as" Table_Name", Count(*) from DEV.PS_CUSTOPT_TEOA union
Select 'DEV' as "Owner_Name", 'PS_CUST_ADDRESS' as" Table_Name", Count(*) from DEV.PS_CUST_ADDRESS union
Select 'DEV' as "Owner_Name", 'PS_CUST_ADDRSQ_LNG' as "Table_Name", Count(*) from DEV.PS_CUST_ADDRSQ_LNG union
Select 'DEV' as "Owner_Name", 'PS_CUST_ADDR_CNTCT' as "Table_Name", Count(*) from DEV.PS_CUST_ADDR_CNTCT union
Select 'DEV' as "Owner_Name", 'PS_CUST_ADDR_EXS' as "Table_Name", Count(*) from DEV.PS_CUST_ADDR_EXS union
Select 'DEV' as "Owner_Name", 'PS_CUST_ADDR_SEQ' as "Table_Name", Count(*) from DEV.PS_CUST_ADDR_SEQ union
You can then feed this result (minus the final "union") back into a SQL query tool, approximately 500 rows at a time, and obtain lists of tables along with their current row counts.
DEV PS_CUSTOMER 21914
DEV PS_CUSTOMER_FSS 14800
DEV PS_CUSTOMER_LANG 568
DEV PS_CUSTOPT_TEO1 0
DEV PS_CUSTOPT_TEO2 0
DEV PS_CUSTOPT_TEO3 0
DEV PS_CUSTOPT_TEOA 0
DEV PS_CUST_ADDRESS 29362
DEV PS_CUST_ADDRSQ_LNG 443
DEV PS_CUST_ADDR_CNTCT 0
DEV PS_CUST_ADDR_EXS 1236
DEV PS_CUST_ADDR_SEQ 27945
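As a shortcut, if the database's optimizer statistics are reasonably current, Oracle already stores approximate row counts in the catalog itself, and you can skip the generated queries entirely. Keep in mind that NUM_ROWS reflects the last statistics gathering, so it may be stale, or null for tables that have never been analyzed:

SELECT OWNER, TABLE_NAME, NUM_ROWS   -- approximate; populated by statistics gathering
FROM ALL_TABLES
ORDER BY NUM_ROWS DESC NULLS LAST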
These lists can then be added to a spreadsheet, sorted based on row count or name, and quickly scanned to eliminate tables that won't be of interest.
Generally, you will already have an expectation of data volume and can quickly eliminate tables based on counts. For instance, it's unlikely that you're going to migrate data from tables with under 100 rows, or maybe even under 500. Similarly, there may be some obvious upper bounds on counts for the type of data you're looking for; in most cases you don't have millions of items or customers. High volumes likely indicate transactional data or temporary tables used for system reporting.
The naming conventions of the tables will also let you pare the list down. Suffixes like "TMP" or "RPT" are typically giveaways that a table is used by system processes and is not the source of truth you're looking for. Sometimes you'll see date suffixes indicating that someone backed up a table. As you scan the list, trends should begin to leap off the page. You'll also be able to pick out situations where multiple tables with similar names have identical row counts, letting you quickly speculate on relationships you'll want to research further, as sketched below.
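One catalog trick that helps with those relationships is to look up every table containing a column you suspect is a shared key; tables that join on it tend to surface immediately. A sketch, using a hypothetical key column name:

SELECT OWNER, TABLE_NAME
FROM ALL_TAB_COLUMNS
WHERE COLUMN_NAME = 'CUST_ID'   -- 'CUST_ID' is a hypothetical key column; use one from your own schema
ORDER BY OWNER, TABLE_NAME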
Using this process, you can generally reduce the entire database to a list of likely tables in under an hour. From there, the more tedious work begins: looking at the content and index structure of each remaining table. I've done it successfully many times over the last decade, and it should work for nearly any relational database with minor tweaks to the queries used.
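For that last step, the catalog helps once again. A query like the following (using PS_CUSTOMER from the earlier output as a stand-in for whatever table you're examining) lists each index and its columns, which is often the fastest way to spot a table's natural keys:

SELECT INDEX_NAME, COLUMN_NAME, COLUMN_POSITION
FROM ALL_IND_COLUMNS
WHERE TABLE_NAME = 'PS_CUSTOMER'   -- substitute the table you're investigating
ORDER BY INDEX_NAME, COLUMN_POSITION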
Happy Spelunking!