Can MySQL reasonably perform queries on billions of rows?


I am not very familiar with your needs, but perhaps storing each data point in the database is a bit of overkill. It sounds almost like taking the approach of storing an image library by storing each pixel as a separate record in a relational database.

As a general rule, storing binary data in a database is the wrong choice most of the time; there is usually a better way to solve the problem. It is not inherently wrong to store binary data in a relational database, but the disadvantages often outweigh the gains. Relational databases, as the name suggests, are best suited to relational data, and binary data is not relational. It adds size (often significantly) to the database, can hurt performance, and may leave you asking how to maintain a billion-record MySQL instance. The good news is that there are databases especially well suited to storing binary data. One of them, though it is not always obvious, is your file system! Simply come up with a directory and file naming structure for your binary files, and store the file paths in your MySQL DB together with any other data that may yield value through querying.
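As a rough illustration of the file-system approach, here is a minimal Python sketch. It is only a sketch under assumptions: `sqlite3` stands in for a MySQL connection so the example runs anywhere, and the table name `spectra_files`, its columns, the hash-based directory layout, and both helper functions are hypothetical, not anything from the question.

```python
import hashlib
import sqlite3
import tempfile
from pathlib import Path


def store_blob(conn, root, run_id, payload):
    """Write payload under root and record its relative path in the database."""
    digest = hashlib.sha256(payload).hexdigest()
    # Fan out into subdirectories so no single directory grows too large.
    rel_path = Path(digest[:2]) / digest[2:4] / f"{digest}.bin"
    full_path = Path(root) / rel_path
    full_path.parent.mkdir(parents=True, exist_ok=True)
    full_path.write_bytes(payload)
    # Only the path and queryable metadata go into the database.
    conn.execute(
        "INSERT INTO spectra_files (run_id, path, n_bytes) VALUES (?, ?, ?)",
        (run_id, rel_path.as_posix(), len(payload)),
    )
    return rel_path


def load_blob(conn, root, file_id):
    """Look up the path by id, then read the bytes from disk."""
    (rel_path,) = conn.execute(
        "SELECT path FROM spectra_files WHERE id = ?", (file_id,)
    ).fetchone()
    return (Path(root) / rel_path).read_bytes()


conn = sqlite3.connect(":memory:")  # assumption: stand-in for MySQL
conn.execute(
    "CREATE TABLE spectra_files ("
    "id INTEGER PRIMARY KEY, run_id INTEGER, path TEXT, n_bytes INTEGER)"
)
root = tempfile.mkdtemp()
stored = store_blob(conn, root, run_id=1, payload=b"\x00\x01\x02\x03")
```

The database row stays small and queryable, while the bulk bytes live on disk where ordinary file tools can back them up.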

Another approach would be to use a document-based storage system for your datapoints (and perhaps spectra), and MySQL for the runs (or perhaps to put the runs into the same DB as the others).

I once worked with a very large (Terabyte+) MySQL database. The largest table we had was literally over a billion rows. This was using MySQL 5.0, so it’s possible that things may have improved.

It worked. MySQL processed the data correctly most of the time. It was extremely unwieldy though. (If you want six sigma-level availability with a terabyte of data, don’t use MySQL. We were a startup that had no DBA and limited funds.)

Just backing up and storing the data was a challenge. It would take days to restore the table if we needed to.

We had numerous tables in the 10-100 million row range. Any significant join against those tables took far too long to be practical. So we wrote stored procedures to ‘walk’ the tables and process joins against ranges of ids. In this way we’d process the data 10,000-100,000 rows at a time (join against ids 1-100,000, then 100,001-200,000, etc.). This was significantly faster than joining against the entire table.
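The range-walking idea can be sketched as follows. This is not the stored procedure we used: it is a Python loop over `sqlite3` as a stand-in for MySQL, and the `spectra` and `datapoints` tables, their columns, and the `join_in_batches` helper are all hypothetical names chosen for illustration.

```python
import sqlite3


def join_in_batches(conn, max_id, batch_size):
    """Join datapoints to spectra one primary-key range at a time."""
    totals = {}
    for start in range(1, max_id + 1, batch_size):
        end = start + batch_size - 1
        # Each query only touches one slice of the big table.
        rows = conn.execute(
            "SELECT s.run_id, SUM(d.intensity) "
            "FROM datapoints d JOIN spectra s ON s.id = d.spectrum_id "
            "WHERE d.id BETWEEN ? AND ? "
            "GROUP BY s.run_id",
            (start, end),
        )
        # Merge the partial result for this slice into the running totals.
        for run_id, subtotal in rows:
            totals[run_id] = totals.get(run_id, 0) + subtotal
    return totals


conn = sqlite3.connect(":memory:")  # assumption: stand-in for MySQL
conn.execute("CREATE TABLE spectra (id INTEGER PRIMARY KEY, run_id INTEGER)")
conn.execute(
    "CREATE TABLE datapoints ("
    "id INTEGER PRIMARY KEY, spectrum_id INTEGER, intensity INTEGER)"
)
conn.executemany("INSERT INTO spectra VALUES (?, ?)", [(1, 10), (2, 10), (3, 20)])
conn.executemany(
    "INSERT INTO datapoints VALUES (?, ?, ?)",
    [(i, (i % 3) + 1, i) for i in range(1, 1001)],
)
(max_id,) = conn.execute("SELECT MAX(id) FROM datapoints").fetchone()
batched = join_in_batches(conn, max_id, batch_size=100)
```

Because each slice is selected by a primary key range, every batch is a cheap range scan, and the partial results are combined outside the database.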

Using indexes that aren’t based on the primary key is also much more difficult on very large tables. MySQL 5.0 with InnoDB stores indexes in two pieces: every index other than the primary index stores primary key values rather than row locations. So indexed lookups are done in two parts: first MySQL goes to the secondary index and pulls from it the primary key values it needs, then it does a second lookup on the primary key index to find the rows themselves.
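The two-part lookup described above can be pictured with a toy model. This is plain Python dictionaries, not real storage engine code, and the column names `run_id` and `mz` are made up for the example.

```python
# Toy model, not real storage code: the table is keyed by primary key,
# and the secondary index maps a column value to primary key values.
rows_by_pk = {
    1: {"id": 1, "run_id": 7, "mz": 100.5},
    2: {"id": 2, "run_id": 8, "mz": 200.25},
    3: {"id": 3, "run_id": 7, "mz": 300.75},
}

# Build the secondary index on run_id: value -> list of primary keys.
secondary_index_run_id = {}
for pk, row in rows_by_pk.items():
    secondary_index_run_id.setdefault(row["run_id"], []).append(pk)


def lookup_by_run_id(run_id):
    # Step 1: the secondary index yields primary key values only.
    pks = secondary_index_run_id.get(run_id, [])
    # Step 2: each primary key needs its own lookup to reach the row.
    return [rows_by_pk[pk] for pk in pks]


matches = lookup_by_run_id(7)
```

On a table with hundreds of millions of rows, step 2 is where the cost piles up: every match from the secondary index becomes another lookup in the primary key index.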

The net of this is that for very large tables (100-200 million plus rows) indexing is more restrictive. You need fewer, simpler indexes. Even a simple SELECT statement that cannot use an index may never come back. WHERE clauses must hit indexes, or forget about it.

But all that being said, things did actually work. We were able to use MySQL with these very large tables and do calculations and get answers that were correct.

Trying to do analysis on 200 billion rows of data would require very high-end hardware and a lot of hand-holding and patience. Just keeping the data backed up in a format that you could restore from would be a significant job.

I agree with srini.venigalla’s answer that normalizing the data like crazy may not be a good idea here. Doing joins across multiple tables with that much data will open you up to the risk of filesorts, which could mean some of your queries would just never come back. Denormalizing with simple integer keys would give you a better chance of success.

Everything we had was InnoDB. Regarding MyISAM vs. InnoDB: the main thing would be to not mix the two. You can’t really optimize a server for both because of the way MySQL caches keys and other data. Pick one or the other for all the tables in a server if you can. MyISAM may help with some speed issues, but it may not help with the overall DBA work that needs to be done, which can be a killer.

Normalizing the data like crazy may not be the right strategy in this case. Keep your options open by storing the data both in normalized form and in the form of materialized views highly suited to your application (MySQL has no built-in materialized views, so in practice these would be summary tables that you maintain yourself). The key in this type of application is NOT writing ad hoc queries. Query modeling is more important than data modeling. Start with your target queries and work towards the optimum data model.

Is this reasonable?

I would also create an additional flat table with all data.

run_id | spectrum_id | data_id | <data table columns..> |

I would use this table as the primary source for all queries. The reason is to avoid having to do any joins. Joins without indexes will make your system practically unusable, and maintaining indexes on such huge tables will be equally painful.

The strategy is to query the above table first, dump the results into a temp table, and then join the temp table with the lookup tables for Run and Spectrum to get the data you want.
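A minimal sketch of that two-step strategy, assuming `sqlite3` as a stand-in for MySQL so it can run anywhere. The table names `flat_data`, `runs`, `spectra`, and `hits`, and every column other than `run_id`, `spectrum_id`, and `data_id`, are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # assumption: stand-in for MySQL
conn.executescript(
    """
    CREATE TABLE runs (run_id INTEGER PRIMARY KEY, label TEXT);
    CREATE TABLE spectra (spectrum_id INTEGER PRIMARY KEY, scan_time REAL);
    CREATE TABLE flat_data (
        run_id INTEGER, spectrum_id INTEGER, data_id INTEGER PRIMARY KEY,
        mz REAL, intensity REAL
    );
    INSERT INTO runs VALUES (1, 'run A'), (2, 'run B');
    INSERT INTO spectra VALUES (10, 0.5), (11, 1.0);
    INSERT INTO flat_data VALUES
        (1, 10, 100, 50.0, 5.0),
        (1, 11, 101, 60.0, 50.0),
        (2, 11, 102, 70.0, 500.0);
    """
)

# Step 1: filter the flat table on its own and park the hits in a temp table.
conn.execute(
    "CREATE TEMP TABLE hits ("
    "run_id INTEGER, spectrum_id INTEGER, data_id INTEGER, intensity REAL)"
)
conn.execute(
    "INSERT INTO hits "
    "SELECT run_id, spectrum_id, data_id, intensity "
    "FROM flat_data WHERE intensity >= ?",
    (10.0,),
)

# Step 2: join only the small temp table against the lookup tables.
result = conn.execute(
    "SELECT r.label, s.scan_time, h.data_id, h.intensity "
    "FROM hits h "
    "JOIN runs r ON r.run_id = h.run_id "
    "JOIN spectra s ON s.spectrum_id = h.spectrum_id "
    "ORDER BY h.data_id"
).fetchall()
```

The expensive scan touches one table with no joins, and the joins only ever see the handful of rows that survived the filter.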

Have you analyzed your Write needs vs Read needs? It will be very tempting to ditch SQL and go to non-standard data storage mechanisms. In my view, it should be the last resort.

To accelerate write speeds, you may want to try the HandlerSocket method. Percona, if I remember correctly, includes HandlerSocket in their install package. (No relation to Percona!)

