1

We are looking to create software that receives log files from a large number of devices. We are looking at around 20 million log rows a day, at about 2 KB per log line (roughly 40 GB of raw data per day).

I have developed a lot of software, but never anything with this large a quantity of input data. The data needs to be searchable, sortable, and groupable by source IP, destination IP, alert level, etc.

It should also combine similar log entries ("occurred 6 times", etc.).

Any ideas and suggestions on the type of design, the choice of database, and the general thinking around this would be much appreciated.

UPDATE:
Found this presentation; it seems like a similar scenario. Any thoughts on it? http://skillsmatter.com/podcast/cloud-grid/mongodb-humongous-data-at-server-density

grandnasty
  • 739
  • 6
  • 21
  • Does it need to be a database? How often will it be queried? How quickly do you need the results? Microsoft's LogParser allows you to query the log files on disk in a SQL style. http://www.codinghorror.com/blog/2005/08/microsoft-logparser.html – Greg B Jul 01 '11 at 09:43
  • There will be hundreds of users logging in, viewing log entries and marking them as resolved, so every log entry is "managed". Results should be shown quickly. – grandnasty Jul 01 '11 at 10:09

4 Answers

0

Check this out, it might be helpful: https://github.com/facebook/scribe

Prathab K
  • 89
  • 5
0

I see a couple of things you may want to consider.

1) A message queue - receivers just drop each log line onto a queue, and another part of the system (a worker) takes care of it when time permits (see the sketch at the end of this answer).

2) A NoSQL store - Redis, MongoDB, Cassandra.

I think your real problem will be in querying the data, not in storing it.

You will also probably need a scalable solution; some NoSQL databases are distributed, and you may need that.
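
As a minimal sketch of point 1, here is the queue/worker pattern using Python's standard library; the in-process queue stands in for a real message broker (RabbitMQ, Kafka, etc.), the in-memory list stands in for the database, and the batch size is a placeholder:

    import queue
    import threading

    # Receivers enqueue raw log lines; a worker drains the queue and
    # batch-inserts, so ingestion never blocks on the datastore.
    log_queue = queue.Queue(maxsize=100_000)
    stored = []  # placeholder for the real database

    def worker():
        batch = []
        while True:
            line = log_queue.get()
            if line is None:            # sentinel: shut down
                break
            batch.append(line)
            if len(batch) >= 1000:      # bulk-insert, never row by row
                stored.extend(batch)    # real code: bulk write to the DB
                batch.clear()
        stored.extend(batch)            # flush the remainder

    t = threading.Thread(target=worker, daemon=True)
    t.start()

    # The receiving side just enqueues and returns immediately.
    log_queue.put("2011-07-01 09:43 src=10.0.0.1 dst=10.0.0.2 level=3 msg=...")
    log_queue.put(None)
    t.join()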

Sergey
  • 749
  • 6
  • 13
0

I'd base many decisions on how users will most often be selecting subsets of data -- by device? by date? by source IP? You want to keep indexes to a minimum and use only those you need to get the job done.

For low-cardinality columns where the indexing overhead is high yet the value of an index is low (e.g. alert level), I'd recommend a trigger that creates rows in another table to identify the rows corresponding to emergency situations (e.g. where alert level > x). That way alert level itself would not have to be indexed, and yet you could rapidly find all high-alert-level rows.
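
Here is a minimal sketch of that trigger idea, using SQLite purely as a stand-in; the table names, column names, and threshold are all hypothetical:

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
    CREATE TABLE logs (id INTEGER PRIMARY KEY, src_ip TEXT, msg TEXT,
                       alert_level INTEGER);
    CREATE TABLE high_alerts (log_id INTEGER REFERENCES logs(id));

    -- Copy only emergency rows into a small side table, so that
    -- alert_level itself never needs an index.
    CREATE TRIGGER flag_high_alert AFTER INSERT ON logs
    WHEN NEW.alert_level > 3
    BEGIN
        INSERT INTO high_alerts (log_id) VALUES (NEW.id);
    END;
    """)

    con.execute("INSERT INTO logs (src_ip, msg, alert_level) VALUES (?, ?, ?)",
                ("10.0.0.1", "disk failure", 5))
    print(con.execute("SELECT log_id FROM high_alerts").fetchall())  # [(1,)]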

Since users are updating the logs, you could move handled/managed rows older than 'x' days out of the active log and into an archive log, which would improve performance for ad-hoc queries.
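
A sketch of that archive job, again against a hypothetical schema (the resolved flag, the logged_at column, and the 30-day cutoff are made up for illustration):

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
    CREATE TABLE logs (id INTEGER PRIMARY KEY, msg TEXT, resolved INTEGER,
                       logged_at TEXT);
    CREATE TABLE logs_archive AS SELECT * FROM logs WHERE 0;  -- same shape, empty

    INSERT INTO logs VALUES (1, 'old, handled', 1, '2011-01-01 00:00:00');
    INSERT INTO logs VALUES (2, 'recent, open', 0, '2011-07-01 00:00:00');

    -- Move handled rows older than 30 days out of the active table.
    INSERT INTO logs_archive
        SELECT * FROM logs
        WHERE resolved = 1 AND logged_at < datetime('now', '-30 days');
    DELETE FROM logs
        WHERE resolved = 1 AND logged_at < datetime('now', '-30 days');
    """)

    print(con.execute("SELECT COUNT(*) FROM logs").fetchone())          # (1,)
    print(con.execute("SELECT COUNT(*) FROM logs_archive").fetchone())  # (1,)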

For identifying recurrent problems (the same problem on the same device, the same problem at the same IP address, or the same problem on all devices made by the same manufacturer or from the same manufacturing run, for example), you could identify the subset of columns that defines that particular kind of problem and then create (in a trigger) a hash of the values in those columns. Thus, all problems of the same kind would have the same hash value.

You could have multiple columns like this -- it would depend on your definition of "similar problem", on how many different problem-kinds you wanted to track, and on the subset of columns you'd need to enlist to define each kind of problem.

If you index the hash-value column, your users would be able to very quickly answer the question, "Are we seeing this kind of problem frequently?" They'd look at the current row, grab its hash value, and then search the database for other rows with that hash value.
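
A minimal sketch of such a column-subset hash in Python; the column names and rows here are invented, and in a real system the hash would be computed in the trigger and stored in an indexed column:

    import hashlib

    def problem_hash(row, columns):
        # Hash only the columns that define this kind of problem; rows
        # with the same values in those columns get the same hash.
        key = "\x1f".join(str(row[c]) for c in columns)
        return hashlib.sha1(key.encode("utf-8")).hexdigest()

    a = {"device": "fw-12", "src_ip": "10.0.0.1", "error": "E1042"}
    b = {"device": "fw-99", "src_ip": "10.0.0.1", "error": "E1042"}

    # "Same error code, regardless of device" as one problem-kind:
    print(problem_hash(a, ["error"]) == problem_hash(b, ["error"]))    # True
    # "Same error on the same device" as another:
    print(problem_hash(a, ["device", "error"]) ==
          problem_hash(b, ["device", "error"]))                        # False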

Tim
  • 5,371
  • 3
  • 32
  • 41
0

A web search on "Stackoverflow logging device data" yielded dozens of hits.

Here is one of them. The question asked may not be exactly the same as yours, but you should get dozens of interesting ideas from the responses.

Walter Mitty
  • 18,205
  • 2
  • 28
  • 58