
Basically, I have three tables that contain all the data I want, but I am having to write some convoluted JOIN and WHERE clauses that are not working for me. I have finally resorted to using temporary tables, but I was wondering whether there is a more long-term solution.

The situation: we pull large amounts of data via SOAP into our database. We have no control over how the data is organized, combined, or labeled, so we need to split it up as best we can so that it can eventually become useful to us.

What I am asking is: how do the pros "prep" data so that it can be quickly moved from other tables into a finally useful table, and how is it kept up to date as new data flows in? What is the terminology? What should I research?

Thanks in advance!

  • What do you mean, "statements that are not working for me?" Maybe you should address that first? – OldProgrammer Dec 30 '14 at 20:31
  • It's impossible to answer without more context. – Jason Dec 30 '14 at 20:36
  • Keep working on the joins and where statements. Once you have achieved success with that, create a view or stored procedure or function and you don't have to worry about new data coming in. – Dan Bracuk Dec 30 '14 at 20:44
  • I am pretty confident in my ability to join tables, if that is what you are getting at. What I am looking for is direction on how the industry usually combines data into a nice usable table, instead of having to create super-complex queries that either take too long to run or cannot be done unless you prep the data first. – trail_runner Dec 30 '14 at 20:45
  • We have some summary tables in our data warehouse. They are updated along with all the other tables when data is loaded, in this case daily. If you really need another table, you could take a similar approach. It just won't have real-time data in it. – Dan Bracuk Dec 30 '14 at 20:51
  • In terms of populating a data warehouse or data mart, the process is almost always described as three steps: **Extract, Transform, Load** (ETL). In terms of researching the strategies and approaches, I'd be looking for articles on the topics of "**Data Warehouse**" and "**Data Mart**". – spencer7593 Dec 30 '14 at 21:16

1 Answer


The terminology I use for preparing the data and getting it ready for insertion is "staging" the data. We typically insert/update rows in temporary staging tables.

We massage and tweak the data in the staging tables, assigning foreign keys, fixing malformed data, splitting big multi-use fields into individual fields, and so on, to get the data cleaned up BEFORE the rows are inserted into the actual target tables.

(I don't know that this is a standard terminology, others may refer to it differently.)
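As a sketch of the staging approach (using Python's `sqlite3` here purely for illustration; the table and column names are made up): raw feed rows land in a loosely-typed staging table, get cleaned there, and only then move into the real table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Staging table mirrors the raw feed: everything lands as text, no constraints.
cur.execute("CREATE TABLE staging_orders (raw_customer TEXT, raw_amount TEXT)")

# Clean target table with proper types and constraints.
cur.execute("""
    CREATE TABLE orders (
        customer TEXT NOT NULL,
        amount   REAL NOT NULL CHECK (amount >= 0)
    )
""")

# 1. Extract: dump the feed into staging exactly as received.
feed = [("  Acme Corp ", "100.50"), ("Widget Inc", "bad-value"), ("Acme Corp", "75")]
cur.executemany("INSERT INTO staging_orders VALUES (?, ?)", feed)

# 2. Transform + 3. Load: trim whitespace, skip rows whose amount is
# not numeric, and insert only the cleaned rows into the real table.
cur.execute("""
    INSERT INTO orders (customer, amount)
    SELECT TRIM(raw_customer), CAST(raw_amount AS REAL)
    FROM staging_orders
    WHERE raw_amount GLOB '[0-9]*'  -- crude "is numeric" check, fine for a sketch
""")

# Clear staging so the next load starts from an empty table.
cur.execute("DELETE FROM staging_orders")
```

The point is that validation failures never touch the target table; bad rows can be inspected (or rejected) while they are still in staging.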


FOLLOWUP

To improve query performance for complex data, we sometimes store pre-joined and pre-calculated results. Basically, we populate a table with "query ready" results, to make for much simpler queries of historic data. The big drawback to this is that we now have redundant data, which can become "out-of-sync" with the operational data. We use a scheduled (nightly) process to re-populate these tables.

(I'm unsure of the industry standard jargon for these types of tables.)
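A minimal sketch of such a "query ready" table (again `sqlite3` with a hypothetical schema): the scheduled job simply empties and re-populates the summary table from the operational tables, accepting that it is stale between runs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders    (customer_id INTEGER, amount REAL);

    -- "Query ready" table: pre-joined and pre-aggregated, redundant on purpose.
    CREATE TABLE daily_customer_totals (name TEXT, total REAL);

    INSERT INTO customers VALUES (1, 'Acme'), (2, 'Widget');
    INSERT INTO orders VALUES (1, 100.0), (1, 50.0), (2, 25.0);
""")

def refresh_summary(cur):
    """Nightly job: rebuild the summary table from the operational data."""
    cur.execute("DELETE FROM daily_customer_totals")
    cur.execute("""
        INSERT INTO daily_customer_totals (name, total)
        SELECT c.name, SUM(o.amount)
        FROM customers c
        JOIN orders o ON o.customer_id = c.id
        GROUP BY c.name
    """)

refresh_summary(cur)
```

Queries against `daily_customer_totals` then need no joins or aggregation at all, which is the whole point of paying the redundancy cost.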

In terms of researching this, these approaches are going to be described in articles/white papers on "data warehouse" and "data mart". The process is almost always described as "ETL", after its three major steps: Extract, Transform, Load. There's also a lot of noise in the industry press about "data mining" and "big data".

spencer7593
  • I just did a quick Google search on "staging", which led to this Wikipedia page: [en.wikipedia.org/wiki/Staging_(data)](http://en.wikipedia.org/wiki/Staging_%28data%29). It gives an overview of how we use temporary tables. – spencer7593 Dec 30 '14 at 20:42
  • Thanks, "staging" has led me to a lot more info on Google, including this: [Data warehouse](http://stackoverflow.com/questions/21010045/benefits-of-using-staging-database-while-designing-data-warehouse), which is finally leading me down the right road. If you have any more info, please let me know. – trail_runner Dec 30 '14 at 21:15
  • @trail_runner: Yes, I've updated my answer. On my first read-through of your question, I was thinking in terms of you importing foreign data into your three tables. If you have structured operational data, and you want to "prep" that for easy queries, that would be a "Data Warehouse", or, for only three tables in a specific domain, more appropriately called a "Data Mart". (With a larger, cross-domain Data Warehouse, we're often dealing with data from a wide variety of disparate operational systems, and a lot of work goes into getting that data "matched up".) – spencer7593 Dec 30 '14 at 21:21
  • The most critical thing to remember when denormalizing like this is that the data must be kept in sync with the original data through some process. In a small application it might be through a trigger; a more complex data warehouse should have a job that runs on a schedule to keep the data in sync. Further, reporting data should be just that: it should not be allowed to be changed except through the mechanism that keeps it up to date with the original data. – HLGEM Dec 30 '14 at 22:49
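For the small-application case HLGEM mentions, the trigger approach might look something like this (a `sqlite3` sketch with a hypothetical schema): every insert into the operational table adjusts the reporting table in the same transaction, so the two can never drift apart.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE orders (customer TEXT, amount REAL);

    -- Reporting table, written to ONLY by the trigger below.
    CREATE TABLE customer_totals (customer TEXT PRIMARY KEY, total REAL);

    CREATE TRIGGER orders_after_insert AFTER INSERT ON orders
    BEGIN
        -- Add to the running total if the customer already has a row...
        UPDATE customer_totals
        SET total = total + NEW.amount
        WHERE customer = NEW.customer;

        -- ...otherwise create the row.
        INSERT INTO customer_totals (customer, total)
        SELECT NEW.customer, NEW.amount
        WHERE NOT EXISTS
            (SELECT 1 FROM customer_totals WHERE customer = NEW.customer);
    END;
""")

cur.execute("INSERT INTO orders VALUES ('Acme', 100.0)")
cur.execute("INSERT INTO orders VALUES ('Acme', 50.0)")
```

The trade-off versus a scheduled rebuild is that the trigger adds cost to every operational write, which is why larger warehouses prefer the batch job.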