
I'm designing a PostgreSQL database that takes in readings from many sensor sources. I've done a lot of research into the design and I'm looking for some fresh input to help get me out of a rut here.

To be clear, I am not looking for help describing the sources of data or any related metadata. I am specifically trying to figure out how to best store data values (eventually of various types).

The basic structure of the data coming in is as follows:

  • For each data logging device, there are several channels.
  • For each channel, the logger reads data and attaches it to a record with a timestamp.
  • Different channels may have different data types, but generally a float4 will suffice.
  • Users should (through database functions) be able to add different value types, but this concern is secondary.
  • Loggers and channels will also be added through functions.

The distinguishing characteristic of this data layout is that I've got many channels associating data points to a single record with a timestamp and index number.

Now, to describe the data volume and common access patterns:

  • Data will be coming in from about 5 loggers, each with 48 channels, every minute.
    • The total data volume in this case will be 345,600 readings per day, 126 million per year, and this data needs to be continually read for the next 10 years at least.
  • More loggers & channels will be added in the future, possibly from physically different types of devices but hopefully with similar storage representation.
  • Common access will include querying similar channel types across all loggers and joining across logger timestamps. For example, get channel1 from logger1 and channel4 from logger2, and do a full outer join on logger1.time = logger2.time (a sketch of such a query follows this list).
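
To make that access pattern concrete, here is a minimal sketch of such a query against the layout proposed below; the logger and channel ids are illustrative:

-- Sketch: channel 1 of logger 1 aligned with channel 4 of logger 2
-- on logger timestamps.
SELECT COALESCE(l1.logger_time, l2.logger_time) AS t,
       l1.value AS logger1_ch1,
       l2.value AS logger2_ch4
FROM  (SELECT r.logger_time, v.value
       FROM RecordTable r
       JOIN DataValueTable v ON v.record_id = r.id
       WHERE r.logger_id = 1 AND v.channel_id = 1) AS l1
FULL OUTER JOIN
      (SELECT r.logger_time, v.value
       FROM RecordTable r
       JOIN DataValueTable v ON v.record_id = r.id
       WHERE r.logger_id = 2 AND v.channel_id = 4) AS l2
  ON l1.logger_time = l2.logger_time
ORDER BY t;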

I should also mention that each logger timestamp is subject to change due to time adjustment, and will be described in a different table showing the server's time reading, the logger's time reading, the transmission latency, the clock adjustment, and the resulting adjusted clock value. This adjustment happens for a set of logger records/timestamps depending on retrieval. It is my motivation for RecordTable below, but otherwise isn't of much concern for now, as long as I can reference a (logger, time, record) row from somewhere that will change the timestamps for associated data.
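
For illustration, a minimal sketch of what that adjustment table might look like; the table name, column names, and types here are my assumptions:

-- Hypothetical layout for the clock-adjustment bookkeeping described above.
CREATE TABLE ClockAdjustment (
    logger_id     integer     NOT NULL,  -- FK to a loggers table
    server_time   timestamptz NOT NULL,  -- server's time reading at retrieval
    logger_time   timestamptz NOT NULL,  -- logger's time reading at retrieval
    latency       interval    NOT NULL,  -- transmission latency
    adjustment    interval    NOT NULL,  -- computed clock adjustment
    adjusted_time timestamptz NOT NULL,  -- resulting adjusted clock value
    PRIMARY KEY (logger_id, server_time)
);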

I have considered quite a few schema options, the simplest resembling a hybrid EAV approach where the table itself describes the attribute, since most attributes will just be a real value called "value". Here's a basic layout:

RecordTable          DataValueTable
----------           --------------
[PK] id        <--   [FK] record_id
[FK] logger_id       [FK] channel_id
record_number        value
logger_time    

Considering that the combination of logger_id, record_number, and logger_time is unique, I suppose I am making use of a surrogate key here, but hopefully my justification (saving space, since the composite key would otherwise be repeated in every data value row) is meaningful. I have also considered adding a surrogate PK id to DataValueTable (rather than the PK being record_id and channel_id) in order to reference data values from other tables, but I am trying to resist the urge to make this model "too flexible" for now. I do, however, want to start getting data flowing soon and not have to change this part when extra features or differently-structured data need to be added later.
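
In DDL form, a minimal sketch of this layout might look like the following; the types are assumptions (float4 for values, per the above), and DataValueTable gets the composite PK just discussed:

CREATE TABLE RecordTable (
    id            bigserial   PRIMARY KEY,
    logger_id     integer     NOT NULL,  -- FK to a loggers table
    record_number integer     NOT NULL,
    logger_time   timestamptz NOT NULL,
    UNIQUE (logger_id, record_number, logger_time)
);

CREATE TABLE DataValueTable (
    record_id  bigint  NOT NULL REFERENCES RecordTable (id),
    channel_id integer NOT NULL,          -- FK to a channels table
    value      real,                      -- float4 suffices for now
    PRIMARY KEY (record_id, channel_id)
);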

At first, I was creating record tables for each logger and then value tables for each channel, describing them elsewhere (in one place), with views to connect them all, but that just felt "wrong" because I was repeating the same thing so many times. I guess I'm trying to find a happy medium between too many tables and too many rows. Partitioning the bigger table (DataValueTable) seems strange because I'd most likely be partitioning on channel_id, so each partition would have the same value for every row; partitioning that way would also require re-defining the check conditions in the parent table every time a channel is added (a sketch of that maintenance burden follows this paragraph). Partitioning by date is only applicable to the RecordTable, which isn't really necessary considering how relatively small it will be (7,200 rows per day with the 5 loggers).
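
For reference, that maintenance looks roughly like this under PostgreSQL's inheritance-based partitioning (a sketch, assuming one child table per channel):

-- Each new channel would need its own child table with a matching check
-- condition, plus an update to the trigger/rule that routes inserts.
CREATE TABLE DataValueTable_ch7 (
    CHECK (channel_id = 7)
) INHERITS (DataValueTable);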

I also considered using the above with partial indexes on channel_id, since DataValueTable will grow very large while the set of channel ids will remain smallish, but I am really not certain that this will scale well after many years. I have done some basic testing with mock data and the performance is only so-so, and I want it to remain exceptional as the data volume grows. Also, some have expressed concern about vacuuming and analyzing such a large table, and about dealing with the large number of indexes (up to 250 in this case).
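
Concretely, the partial indexes I am describing would be along these lines, one per channel (hence the large count):

-- Repeated for each of the ~240 channel ids.
CREATE INDEX datavalue_ch7_idx
    ON DataValueTable (record_id)
    WHERE channel_id = 7;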

On a very small side note, I will also be tracking changes to this data and allowing for annotations (e.g. a bird crapped on the sensor, so these values were adjusted/marked, etc.), so keep that in the back of your mind when considering the design here, but it is a separate concern for now.

Some background on my experience/technical level, if it helps to see where I'm coming from: I am a CS PhD student, and I work with data/databases on a regular basis as part of my research. However, my practical experience in designing a robust database for clients (this is part of a business) that has exceptional longevity and flexible data representation is somewhat limited. I think my main problem now is I am considering all the angles of approach to this problem instead of focusing on getting it done, and I don't see a "right" solution in front of me at all.

So, in conclusion, I guess these are my primary queries for you: if you've done something like this, what has worked for you? What are the benefits/drawbacks I'm not seeing of the various designs I've proposed here? How might you design something like this, given these parameters and access patterns?

I'll be happy to provide clarification/details where needed, and thanks in advance for being awesome.

Phillip
  • You say that different channels might have different data types. Can you give me a little more detail about that? – Mike Sherrill 'Cat Recall' Mar 08 '11 at 12:02
  • @Catcall: Many of the channels are translated to a 4-byte IEEE float from a proprietary FP format, while some others will be 8-byte floats (and possibly numeric) when there is a need for greater accuracy. Also, I meant to elaborate that some of the many (sensor) to one (logger/time/record) relationships may be different, e.g. there's a type that requires two distinct values, which means that the readings would best be stored in adjacent columns in the same table. There will even be some strings that will be manual-entry data, but I'm not sure I should try to fit that data into this format. – Phillip Mar 08 '11 at 16:22
  • @Phillip: So you can configure(?) some (all?) channels to return a 4-byte float, an 8-byte float, a scaled integer(?), a compound value (two distinct values), or a string. Did I get that right? Also, if you have two channels that both return a 4-byte float, might they have different constraints on their range? (I'm imagining something like temperature in degrees Fahrenheit and humidity--temperature might range from -50.0 to 130.0, humidity from 0 to 100.) – Mike Sherrill 'Cat Recall' Mar 08 '11 at 16:56
  • @Catcall: In short, yes. The loggers are programmable and some output channels have multiple parameters associated with them. All channels are configurable, so I figure the best I can do is provide a function to add one (whatever that ends up entailing), and also to make sure it's configured for their existing setup at launch. Also, you are correct about constraints determined by the data source- I was planning on allowing for alerts when data falls outside a specified range. Not sure yet if that will be via triggers or what, but however that goes I'm not worried about the insert overhead. – Phillip Mar 08 '11 at 18:23
  • @Phillip: Configurable channels means each channel's data type can change over time, right? Could be a 4-byte float for a couple of years, then reconfigured to be a string? – Mike Sherrill 'Cat Recall' Mar 09 '11 at 04:23
  • @Catcall: The physical channel may be used for something else, but a data source is identified by more than its physical location, so the channel would be assigned to a new data source. The actual channel numbers are abstracted in the logger by names, so this would mean a new channel_id in the database essentially. So I can assume that once a source of data is defined, its description may change but the type will never change. I've been meaning to put up the rest of my data model in a question edit- when I get to it, hopefully things are cleared up a bit. – Phillip Mar 09 '11 at 04:45

2 Answers


It is no problem at all to provide all this in a Relational database. PostgreSQL is not enterprise class, but it is certainly one of the better freeware SQLs.

To be clear, I am not looking for help describing the sources of data or any related metadata. I am specifically trying to figure out how to best store data values (eventually of various types).

That is your biggest obstacle. Contrary to program design, which allows decomposition and isolated analysis/design of components, databases need to be designed as a single unit. Normalisation and other design techniques need to consider both the whole, and the component in context. The data, the descriptions, the metadata have to be evaluated together, not as separate parts.

Second, starting off with surrogate keys implies that you already know the data and how it relates to other data, and that prevents you from genuinely modelling the data.

I have answered a very similar set of questions, coincidentally re very similar data. If you could read those answers first, it would save us both a lot of typing time on your question/answer.

Answer One/ID Obstacle
Answer Two/Main
Answer Three/Historical

PerformanceDBA
  • Thanks for commenting. I had previously read many of your answers before posting my question, and I can try to elaborate on the big picture as I do agree on its importance- I have just been very (probably too much) focused on storing the readings. I feel like a lot of your other answers help me a great deal, but there's enough about my situation that makes it a somewhat different problem. I will do my best to elaborate specifically how my situation might differ from those in your other answers. (I will aim to add to the question in the next 12 hours or so). Thanks again. – Phillip Mar 08 '11 at 04:09
  • @Phillip. Great! I did not mean to imply that your db would or should be the same; every db is different, unique, and its power is diminished when squeezed into a generic db. Don't worry about describing the difference, just focus on describing your db, and we will model one for you that is specific to your need. As identified in AnswerOne/ID Obstacle, get those ID hindrances off the board, and describe how the data relates to other data. If you have diagrams or models, even better. The concept of logical Sensors vs physical capture points (Logger:Channel) is quite advanced. – PerformanceDBA Mar 09 '11 at 09:54

I did something like this with seismic data for a petroleum exploration company.

My suggestion would be to store the meta-data in a database, and keep the sensor data in flat files, whatever that means for your computer's operating system.

You would have to write your own access routines if you want to modify the sensor data. Actually, you should never modify it: make a copy with the modifications applied, so that you can show later exactly what changes were made to the sensor data.
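
For example, a sketch of that change-log idea in SQL, in case the metadata database is the natural place for it (all names and types here are illustrative):

-- Raw readings stay immutable; corrections are recorded separately.
CREATE TABLE SensorAdjustment (
    record_id   bigint      NOT NULL,
    channel_id  integer     NOT NULL,
    old_value   real        NOT NULL,
    new_value   real        NOT NULL,
    reason      text        NOT NULL,  -- e.g. recalibration note
    adjusted_at timestamptz NOT NULL DEFAULT now(),
    PRIMARY KEY (record_id, channel_id, adjusted_at)
);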

Gilbert Le Blanc
  • While that is a good idea in theory, my client's data is actually subject to change via adjustments/corrections. For example, the calibration curve of a thermocouple changes over time. After recalibration, recent values may need to be updated in order to indicate the correct values. There will be a record of these changes as well. Also, there is a public portal to this data to go along with the client's internal access methods, so between that and the data transformations we aim to provide, a flat file probably wouldn't work out very well. – Phillip Mar 07 '11 at 21:30