Database design optimization - MariaDB (MySQL)

Skillz

So I run a decent-sized database setup. It collects Distributed Computing "credits" and other metrics revolving around Distributed Computing projects. It's quick, but I think it can be faster, maybe. So here's what I know about it.


It's three databases that store all the information. One of them is a "static" database that keeps data that doesn't get updated all too frequently. The other two databases are where almost all the magic happens. These databases are near-identical clones of one another. While one of the databases is being updated with new stats, the other one is being read by the web server. When the database being updated is finished, they switch: the other database begins its update and the web server starts reading from the database that was just updated.

Now there is one particular query that takes ~20 minutes to complete. It's an "overall" ranking that combines everyone's scores from all projects and then ranks them.

I've tried taking this query, changing the database engine to InnoDB, and splitting the query up into 4 chunks. That's 2M+ users it's ranking, so each chunk works on roughly 500k users. Then it combines them all back into the table in the proper sort order. This took around 45 minutes or more.

So I switched it back to MyISAM and a single process to handle it.

I'm really not sure of how else I can make this so it works faster.

The hardware is Samsung 1.9TB PM9A3 U.2 NVMe drives. Each database is stored on its own drive, using symbolic links to achieve this. I've tried putting the entire database in RAM (over 750GB of RAM) to see if that would speed anything up, but no change.

The CPUs are a pair of Xeon Gold 6154.

I feel this is a CPU limitation and that going with a much faster CPU would see some speedup, but I don't know that it would be very drastic.
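
Conceptually, the overall ranking boils down to something like the sketch below (the table and column names are just placeholders; the real query from the script gets posted further down the thread):

sql
-- Placeholder names only: sum each CPID's credit across all projects, then
-- rank by the total. MariaDB 10.2+ can compute the rank with a window
-- function in the same pass.
SELECT cpid,
       SUM(credit) AS total_credit,
       RANK() OVER (ORDER BY SUM(credit) DESC) AS overall_rank
FROM per_project_credits
GROUP BY cpid;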
 
Do you have indexes created on the specific columns in question for the tables being queried?

Do you know the job that runs the report? Is it using good syntax that prevents things like full table scans where they aren't needed?

I'm thinking a good index and an optimized query will help a ton. This is where something like SQL Enterprise with its query analyzer would REALLY help to see what's eating time on your job.
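
If you want to see what indexes you already have, something like this in the MariaDB client will list them (swap in whatever your main stats table is actually called; boinc_user is just a guess here):

sql
-- Lists every index on the table along with cardinality estimates, plus the
-- full table definition. The table name is an assumption.
SHOW INDEX FROM boinc_user;
SHOW CREATE TABLE boinc_user;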
 
It does use indexing. I inherited this database/setup from someone who got tired of maintaining it. I'm also far from a "database guy" and have learned everything I know so far using AI.

I don't know if it's doing full table scans. How can I figure that out?
 
Honestly I would do a query analyzer to see what your query is doing. Search for a 3rd party tool for the DB software you're using?
 
I'm in the same boat with some reporting tools we have for work. One of our jobs has about 20 queries, and they take about 5 minutes each to run.

It only gets run once a month though, so I haven't put a lot of time into optimization. The table is very narrow, just 3 fields, but it's fairly large.
 
Honestly I would do a query analyzer to see what your query is doing. Search for a 3rd party tool for the DB software you're using?

I wouldn't even know where to look, what to look for, or how to understand what it tells me. Haha. Everything I've learned so far, or rather gotten accomplished so far, has been with AI.

I tried using an "optimizer" tool only for it to basically tell me I need to switch all the tables to InnoDB which is definitely the wrong move.
 
Most of that stuff is well beyond what I understand.

This is just a hobby/project web site, so spending a whole lot of time on it isn't really something I can do. If I was making money off this, sure, I'd take the time to learn everything I possibly can, even going as far as taking some courses that teach this stuff.

Another thing I noticed: when I built this new database server to replace the old one, I installed the latest version of MariaDB. Despite it being a much more powerful server, the query in question still took multiple minutes longer to complete, somewhere in the 25+ minute range, while the old server was doing it in between 15 and 20 minutes. I couldn't figure it out.

The new server had much faster CPUs and much more RAM, but it has enterprise U.2 NVMe drives.

Old server: dual Intel Xeon E5-2630 CPUs, 128GB RAM, consumer-grade M.2 NVMe drives (Samsung 970 and some other Samsungs)
New server: dual Intel Xeon Gold 6154 CPUs, 768GB RAM, enterprise-grade U.2 NVMe drives (Samsung PM9A3).

For the life of me, I couldn't figure out why this new server was running the same exact queries and taking longer.

At first I thought somehow the consumer drives were capable of doing this much faster. So that's when I put the whole database on a RAM drive on the new server (it's around 80GB in size) and ran the query. It still took just as long, so the drives aren't the bottleneck.

After a while, with AI help, we discovered the culprit was the newer version of MariaDB. The old server was running MariaDB 10.3 while the new server was running 10.11. Apparently somewhere around 10.4+ the MariaDB optimizer changed drastically, which caused the new server to take a good bit longer on the same query. So I had to tell the new server to use the old optimizer behavior, and it instantly started doing the same query a tad bit faster than the older server.
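
For reference, the optimizer knobs that usually get checked and rolled back after a 10.3 to 10.4+ upgrade look something like this (the exact setting that mattered in my case may differ, so treat this as a hedged starting point rather than the definitive fix):

sql
-- Check what the optimizer is currently doing:
SHOW VARIABLES LIKE 'optimizer_use_condition_selectivity';
SHOW VARIABLES LIKE 'optimizer_switch';

-- Roll back two of the 10.4-era changes for one session and re-test the query.
-- (optimizer_use_condition_selectivity defaulted to 1 in 10.3; rowid_filter is
-- an optimizer_switch flag added in 10.4.)
SET SESSION optimizer_use_condition_selectivity = 1;
SET SESSION optimizer_switch = 'rowid_filter=off';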
 
I haven't worked with this particular DB/tech.

In general:
  1. CPU cycles are taken up processing rows
  2. Memory is taken up processing columns/bytes
  3. Apply filters as early as possible on small tables
  4. Range-partition or cluster-index large tables by the dimension IDs used most for filtering
  5. Normalize and join by IDs (and index the join keys) as much as possible
  6. Avoid situations where you have to join two large tables directly. If necessary, create intermediate objects
  7. Execute the query in small chunks at a time to figure out which step is causing the slowdown, then optimize it
  8. Always do equijoins. Avoid less-than, greater-than, etc.
  9. Avoid NOT IN clauses. Try a LEFT JOIN with a WHERE clause to filter instead (see the sketch after this list)
  10. Avoid Cartesian joins. Understand the unique keys of the data you are joining
  11. Avoid unnecessary DISTINCT, ORDER BY, UNION, etc.
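
For point 9, a hedged example of the rewrite (hypothetical table names):

sql
-- Instead of:
--   SELECT ... FROM users u WHERE u.id NOT IN (SELECT user_id FROM banned);
-- an anti-join keeps only the rows with no match in the other table:
SELECT u.*
FROM users u
LEFT JOIN banned b ON b.user_id = u.id
WHERE b.user_id IS NULL;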
 
@Skillz These are all great points, but as far as point 7 goes, could you try to figure out which specific query or queries in the script is taking the longest? If you post that specific query or queries here, maybe we'll have some ideas on how to speed it up.
 
How is the ranking accomplished, a computed score? Is that calculated when the row is updated? Use EXPLAIN to figure out how the query is executed.

Running 3 DBs is not necessarily faster than running 1.
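
For example (guessing at the table and column names), just prefix the slow statement with EXPLAIN and look at the type and rows columns of the output:

sql
-- type = ALL means a full table scan; "rows" is the estimated number of rows
-- examined. The table/column names here are assumptions.
EXPLAIN
SELECT cpid, SUM(metric1) AS total
FROM boinc_user
GROUP BY cpid;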
 
New server: dual Intel Xeon Gold 6154 CPUs, 768GB RAM, enterprise-grade U.2 NVMe drives (Samsung PM9A3).
How much of that RAM gets consumed during the long running query? If not much, you could create a ramdrive using imdisk and copy everything needed for the query into a DB on the ramdrive. I tried this to benchmark SQL Server with HammerDB and got 10x more performance simply by virtue of the DB file being on the ramdrive.
 
The CPUs are a pair of Xeon Gold 6154.
For best performance, use Process Lasso to ensure that the DB engine does not spawn any threads on the second CPU. If it does and the threads on different CPUs need to talk to each other, they will get hit with the horrendously high socket to socket latency. That's the simplest way to avoid performance issues until you can figure out how to put the other CPU to good use while bypassing the socket latency hit.
 
How is the ranking accomplished, a computed score? Is that calculated when the row is updated? Use EXPLAIN to figure out how the query is executed.

Running 3 DBs is not necessarily faster than running 1.

It adds up all the BOINC credits for the user and then ranks them against everyone else's scores.

Running three databases is faster. While one is being updated, the other one is being read by the web server. Then it flips when the update process is completed. This ensures the database being updated isn't also being hit by the web server. Each database is stored on its own individual NVMe drive.


How much of that RAM gets consumed during the long running query? If not much, you could create a ramdrive using imdisk and copy everything needed for the query into a DB on the ramdrive. I tried this to benchmark SQL Server with HammerDB and got 10x more performance simply by virtue of the DB file being on the ramdrive.

I've put the whole database in a ramdisk before and it wasn't any faster.

For best performance, use Process Lasso to ensure that the DB engine does not spawn any threads on the second CPU. If it does and the threads on different CPUs need to talk to each other, they will get hit with the horrendously high socket to socket latency. That's the simplest way to avoid performance issues until you can figure out how to put the other CPU to good use while bypassing the socket latency hit.

Hmm, this is something I didn't think about. I'm not running Windows, but I could use taskset to pin it to the socket the NVMe drives are connected to directly. Not sure it will really speed things up, but it wouldn't hurt either.

I haven't had a chance to work on this as I've been busy with other things, mostly moving the whole site to a new web server, which is taking a bit more time than I anticipated.
 
It's not a hardware issue; it's the design of the DB/queries/stored procs. You can have a table that stores the data needed to feed the web server: a static table that you update after you compute the results.

Are you computing results as a scheduled task? Are you assuming everyone will need to be updated? The BOINC data, is it just all in one table? If you are just doing a sum by user and project, it should not take that long, especially if it is a stored procedure.

I never saw that kind of time with hardware from the late 1990s and multi-million-row RDBMSes. And that was on spinning platters.
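
A rough sketch of the "compute it, then swap in a static table" idea, with made-up table names (the web server only ever reads ranking_static; the rebuild happens off to the side and the swap is one atomic RENAME):

sql
-- Hypothetical names throughout; assumes ranking_static already exists and
-- boinc_user holds per-project rows keyed by cpid.
DELIMITER //
CREATE PROCEDURE rebuild_overall_ranking()
BEGIN
  DROP TABLE IF EXISTS ranking_new;
  CREATE TABLE ranking_new LIKE ranking_static;

  INSERT INTO ranking_new (cpid, total_credit)
  SELECT cpid, SUM(metric1)
  FROM boinc_user
  GROUP BY cpid;

  -- Atomic swap: readers never see a half-built table.
  RENAME TABLE ranking_static TO ranking_old,
               ranking_new    TO ranking_static;
  DROP TABLE ranking_old;
END //
DELIMITER ;

That's essentially the same flip-flop idea already being used at the database level, just done per table inside one database.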
 
I don't have the level of knowledge on this to answer your questions, but AI has helped me a lot with this setup, specifically with converting the PHP files to work with PHP 7+, as it was all coded for PHP 5 and wouldn't work in 7. It also tried to help me speed up the process by changing the table engine from MyISAM to InnoDB so we could do the batch processing in chunks that it designed for me, basically splitting the work up into 4 smaller chunks and running them concurrently. While this was faster for that specific step, overall it was much, much slower, because after it did this it then had to do some kind of combining of the data back into one, which took so long that the overall process was nearly 3x longer than before.
Based on our previous conversations, I can answer his questions:


1. "Are you computing results as a scheduled task?"


No, it's continuous processing:


  • Downloads run all day in a loop (checks all projects → 15 min sleep → repeat)
  • When a project's stats file timestamp changes, it downloads and processes that data immediately
  • The rollover happens once daily at 8:00 PM EDT (0000 UTC) - this is the scheduled part

2. "Are you assuming everyone will need to be updated?"


Yes, full table processing:


  • When a project's stats file updates, ALL users/teams/hosts for that project are re-parsed
  • Data flows: XML file → staging tables (project_userx) → main table (boinc_user)
  • The system doesn't do delta updates - it's a full reload per project per update

3. "The boinc data, is it just all in one table?"


Yes, single monolithic tables:


  • boinc_user - all users from all projects
  • boinc_team - all teams from all projects
  • boinc_host - all hosts from all projects
  • Each has a proj column (VARCHAR) to identify which project
  • Millions of total rows across all projects

4. "If you are just doing sum by user, project, it should not take that long, especially if it is a stored procedure"


No stored procedures - using Perl scripts with direct SQL:


  • UPDATE statements are constructed in Perl and executed directly
  • Not using stored procedures or bulk operations
  • From the code snippets in logs, appears to be row-by-row or small-batch operations
  • Takes ~1 hour per database to update (with 2 databases alternating)

Additional Context He Should Know:


  • MyISAM tables (not InnoDB) - table-level locking
  • Dual database flip-flop - while one updates, website reads from the other
  • You attempted chunking/splitting queries previously but it didn't help
  • The rollover process shifts historical columns: metric1 → days1 → days2 etc.

The way this works is kind of simple. Each one of the BOINC projects updates .xml files on their servers daily. Sometimes a couple of times per day, sometimes once per day. It's up to them to decide how often they update these files.

There are basically three key files: host.gz, user.gz and team.gz.

host.gz tracks all of the hosts attached to the project and updates the points each host has earned.
user.gz is similar to host.gz, but does this for each user that runs the project.
team.gz is similar to both of the above, but does the updates based on the points of the teams those users are part of.

My system takes these files, extracts them, then populates the data into the database. It then calculates, based on the new numbers, how many points the user, team, or host has earned in this current update and the current day, and stores the new total credit.

Since not all teams, users, or hosts get new points every day, or every update, there is no real way to determine this without comparing the data between them all each time. So that's why it just updates them all.
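
For what it's worth, that comparison can itself be done in SQL. Something like this (using the staging/main table names from the AI summary above, with metric1 as a stand-in for the credit column) would pull out just the new or changed rows for one project:

sql
-- 'someproject' and metric1 are stand-ins; new users show up with u.id IS NULL,
-- existing users only if their credit actually changed.
SELECT s.id
FROM project_userx AS s
LEFT JOIN boinc_user AS u
       ON u.proj = 'someproject' AND u.id = s.id
WHERE u.id IS NULL
   OR u.metric1 <> s.metric1;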
 
Good god, this is negating the purpose of an RDBMS lol. Is there a reason you are downloading every 15 minutes? Can you not just download one file at a fixed time, call it midnight, specify that the download is for the day, and then process it? I am assuming the download includes only whoever is active for that time interval?
You can determine which user/team has updates, no? You should only need to update the rows with changes.
Can you post the 3 files somewhere so I can take a look? Sanitise the names if you wish.

I like doing DB optimisation.
 
Good god, this is negating the purpose of an RDBMS lol. Is there a reason you are downloading every 15 minutes? Can you not just download one file at a fixed time, call it midnight, specify that the download is for the day, and then process it? I am assuming the download includes only whoever is active for that time interval?
You can determine which user/team has updates, no?

The download process takes longer than 15 minutes. I'm not sure of the actual time, but it checks all the projects. If a project's files have changed since the last check, it downloads them. If not, it moves on to the next one.

This way, users who, like me, check their stats multiple times a day to see how they're progressing can easily do so on a single page for all their projects. When a project updates, they'll know their new totals within about 1.5 hours.

I asked AI to specifically take a look at the one process that takes the longest (the one that takes ~20 minutes) and tell me about it. This is what it said:

Here's The Problem - I Found It!


Looking at the generate_data() subroutine (lines 214-254), this is what's killing performance:


The Bottleneck (Lines 224-244):




sql
INSERT INTO overall_boinc_userx
(select
cpid,
sum(metric1),
sum(metric2),
sum(metric4),
...
from boinc_user
group by cpid)

What This Does:


  1. Full table scan of boinc_user - reads EVERY user from EVERY project (millions of rows)
  2. GROUP BY cpid - groups all users with the same CPID across all projects
  3. Aggregates totals - sums credits/RAC/etc. for each unique CPID
  4. Inserts into temp table - writes aggregated results to overall_boinc_userx

Then at line 250-253, it updates team assignments with another correlated subquery:




sql
update overall_boinc_userx a
set a.t_team = (Select b.team from boinc_user_teams b where a.t_cpid = b.cpid)

Why It's Slow:


  1. No incremental updates - recalculates EVERYTHING from scratch every time
  2. Full table scan on millions of rows with MyISAM table-level locking
  3. GROUP BY on unindexed column (probably no index on cpid)
  4. Correlated subquery for team updates (runs once per user; a join-based rewrite is sketched right below)
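
A hedged rewrite of that team-assignment step as one multi-table UPDATE (same tables as in the script; note that a plain JOIN leaves users without a team row untouched, whereas the original subquery would set them to NULL):

sql
UPDATE overall_boinc_userx AS a
JOIN boinc_user_teams AS b ON b.cpid = a.t_cpid
SET a.t_team = b.team;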

Why Chunking Didn't Help:


You can't easily chunk a GROUP BY operation that needs to aggregate across the entire dataset. Even with InnoDB, the query structure is the problem, not the storage engine.




What Your Database Expert Is Getting At:


He's saying: "You shouldn't need to recalculate everything from scratch."


A better design would be:


  1. Static summary table - boinc_combined_user that's already computed
  2. Update only changed records - when a project updates, only recalculate affected CPIDs
  3. Proper indexes on (proj, cpid) or (cpid)
  4. Stored procedure doing set-based operations more efficiently

Does this match what you're experiencing? The script takes ~1 hour because it's doing this massive aggregation across millions of rows every single time it runs.
 
Perfect! Now I can give you a clear answer for your database expert.


Answer to His Question:


YES, there ARE indexes on the key fields:


Looking at the index output:


boinc_user table HAS indexes on:


  • PRIMARY KEY: (proj, id) - composite index
  • cpid index: (cpid, proj) - composite index (row 15-16 in the output)
    • Cardinality shows ~3,066,027 unique CPIDs out of ~6,132,055 total users
    • This means the index exists and is being used

boinc_combined_user table HAS:


  • PRIMARY KEY: (cpid)
  • Multiple other indexes for various ranking queries



BUT - Here's The Real Problem:


Even WITH proper indexes, the query design is still fundamentally slow because:


From the script (line 224-244):




sql
INSERT INTO overall_boinc_userx
(select cpid, sum(metric1), sum(metric2), sum(metric4), ...
from boinc_user
group by cpid)

What This Does:


  1. Reads ALL 6+ million rows from boinc_user (every user from every project)
  2. Groups by cpid (~3 million unique CPIDs)
  3. Sums all metrics for each CPID across all projects
  4. Inserts ~3 million aggregated rows into temp table
  5. Then does a correlated subquery update for teams (line 250)

Why It's Slow Even With Indexes:


  1. MyISAM table-level locking - entire boinc_user table locks during the read
  2. Full table aggregation - must process all 6M+ rows every single time
  3. Not incremental - recalculates EVERYTHING from scratch, even if only one project updated
  4. Temporary table operations - 3M+ rows written to temp table, then merged



What Your Expert Is Saying:


He's pointing out that you're doing a full recalculation when you should be doing incremental updates.


Better Design Would Be:


  • boinc_combined_user is already populated (static)
  • When PrimeGrid updates → only UPDATE the CPIDs that changed in PrimeGrid
  • Keep running totals instead of recalculating from scratch
  • Use something like:



sql
UPDATE boinc_combined_user
SET metric1 = metric1 - old_pgrid_value + new_pgrid_value
WHERE cpid IN (affected_cpids_from_pgrid)
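
To make that placeholder a bit more concrete (still a sketch, with a hypothetical pgrid_delta table holding the old and new per-project values for each affected CPID), the same idea as a join-based update:

sql
-- pgrid_delta is hypothetical: (cpid, old_value, new_value) for the CPIDs
-- whose PrimeGrid credit changed in this update.
UPDATE boinc_combined_user AS c
JOIN pgrid_delta AS d ON d.cpid = c.cpid
SET c.metric1 = c.metric1 - d.old_value + d.new_value;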

His "late 1990s hardware" comment means: "Even old hardware could SUM a few million rows quickly IF you were doing it right with stored procedures and incremental updates, not full recalculation."

Yes, it uses indexes. Another reply from AI to help me answer you properly. I do read what it says and try to make sense of it, to make sure it's actually telling me the right things, and from what I can tell, it is correct.
 