Database design optimization - MariaDB (MySQL)

Am I understanding this right? The Boinc_User table has ProjectID and UserID as the PK? Why not just UserID, and add an index on ProjectID? Typically you try to normalise the data so that you only update what you need to update. So User in one table, Project in another, and the daily transactions you download in a separate one. You update User and Project only if there are changes.

Like I said, provide the db as an export through mysqldump and I can take a look at it and see what can be done. That gives me the schema without the data; you just tell me approximately how many rows there are in each table.

Something like this, I think, I googled:

mysqldump -u [username] -p [database_name] > [output_filename].sql
 
Not sure what you are asking here, so I had to ask AI again. Haha.

YES! Your database expert is 100% correct and has identified the fundamental design flaw.


Current Design Problem:


Right now, boinc_user has:


  • PRIMARY KEY: (proj, id) - composite key

This means the same user appears multiple times in the table:




+-------+-------+--------+--------+
| proj  | id    | nick   | cpid   |
+-------+-------+--------+--------+
| pgrid | 12345 | Skillz | abc123 |
| seti  | 12345 | Skillz | abc123 |  <- SAME person, different project
| mine  | 67890 | Skillz | abc123 |  <- SAME person, different ID on this project
+-------+-------+--------+--------+

The Problems:


  1. User data is duplicated across every project they participate in
  2. Same user can have different IDs on different projects (that's why CPID exists)
  3. Can't directly update "user Skillz" - have to update each (proj, id) combination separately
  4. To get cross-project totals, must GROUP BY cpid and scan millions of duplicate user records



What He's Recommending (Proper Normalization):


Design 1: Normalized Structure




sql
-- Table 1: Users (one row per unique user)
CREATE TABLE users (
    cpid CHAR(32) PRIMARY KEY,
    nick VARCHAR(250),
    country VARCHAR(50),
    create_date INT
    -- other user attributes
);

-- Table 2: Projects (one row per project)
CREATE TABLE projects (
    proj_id CHAR(6) PRIMARY KEY,
    name VARCHAR(100),
    url VARCHAR(255)
    -- other project attributes
);

-- Table 3: User-Project Stats (daily transaction data)
CREATE TABLE user_project_stats (
    cpid CHAR(32),
    proj_id CHAR(6),
    stat_date DATE,
    total_credit DOUBLE,
    rac DOUBLE,
    project_user_id INT, -- their ID on that specific project
    -- other daily metrics
    PRIMARY KEY (cpid, proj_id, stat_date),
    INDEX (proj_id, stat_date)
    -- no separate INDEX (cpid) needed; it's the leading column of the PK
);

Benefits:


  1. User data stored ONCE - update "Skillz" in one place
  2. Only stats are repeated (by date/project) - this is transactional data
  3. Incremental updates: When PrimeGrid updates:



sql
-- Only update the rows that changed
UPDATE user_project_stats
SET total_credit = NEW_VALUE, rac = NEW_VALUE
WHERE proj_id = 'pgrid' AND stat_date = '2025-10-24'
AND cpid IN (list_of_changed_users);

  4. Cross-project totals become easy:



sql
-- Fast aggregation (only sums latest stats, not all history)
SELECT cpid, SUM(total_credit)
FROM user_project_stats
WHERE stat_date = CURDATE()
GROUP BY cpid;



Why Current Design Is Slow:


Your current overallboinc.pl script does:




sql
SELECT cpid, sum(metric1), sum(metric2)...
FROM boinc_user
GROUP BY cpid

This:


  • ✗ Reads 6 million+ rows (every user×project combination)
  • ✗ Most of that data hasn't changed since last run
  • ✗ Recalculates everything from scratch
  • ✗ Takes ~1 hour because it's processing millions of unchanged rows

With normalized design:




sql
-- Get today's stats for users in changed projects
-- Only processes NEW data
SELECT cpid, SUM(total_credit)
FROM user_project_stats
WHERE stat_date = CURDATE()
AND proj_id IN ('pgrid', 'mine') -- only projects that updated today
GROUP BY cpid



His Key Point: "You only update what you need to update"


Current Design:


  • PrimeGrid updates with 50,000 users → must reprocess ALL 6M rows to recalculate totals

Normalized Design:


  • PrimeGrid updates with 50,000 users → only INSERT/UPDATE those 50,000 rows in user_project_stats (see the upsert sketch after this list)
  • Overall totals can be maintained incrementally or recalculated from current day's stats only
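
A minimal sketch of that upsert, using the user_project_stats table proposed above (the literal values stand in for one row parsed from a project's XML file):

sql
-- Upsert one parsed row; only rows present in today's file get touched
INSERT INTO user_project_stats (cpid, proj_id, stat_date, total_credit, rac)
VALUES ('abc123', 'pgrid', CURDATE(), 72451572.86, 0.09)
ON DUPLICATE KEY UPDATE
    total_credit = VALUES(total_credit),
    rac = VALUES(rac);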



Summary:


Your expert is saying: "The problem isn't the queries or indexes - it's that you've denormalized the data in a way that forces you to recalculate everything repeatedly instead of doing incremental updates."


He's absolutely right. This is a database design issue that no amount of query optimization or better hardware will fix.


The real fix: Redesign to separate:


  1. Static user data (stored once)
  2. Static project data (stored once)
  3. Transactional stats data (daily updates only)

This would make your updates orders of magnitude faster because you'd only process what actually changed.

And I'm still a little lost on what this means. Let me read it over a few more times, maybe I'll get it. Give me a few minutes to figure this out please.
 
You need a User table that has, say, an autonumber as the PK (LocalUserID for example), then the Boinc User table, then a Project table. Add LocalUserID to the BoincUser table; then you need another table, let's call it LocalUserProject, where you list the LocalUserID with the ProjectID. That way your references are taken care of. These are probably low-change tables, so they form the basis of your sums.
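
A minimal sketch of what he's describing, with placeholder table and column names (none of these tables exist in the current schema):

sql
-- Hypothetical layout; names are placeholders
CREATE TABLE LocalUser (
    LocalUserID INT AUTO_INCREMENT PRIMARY KEY,  -- the autonumber PK
    cpid CHAR(32) NOT NULL UNIQUE,
    nick VARCHAR(250)
);

CREATE TABLE LocalUserProject (
    LocalUserID INT NOT NULL,
    ProjectID INT NOT NULL,
    project_user_id INT,  -- the user's ID on that specific project
    PRIMARY KEY (LocalUserID, ProjectID),
    FOREIGN KEY (LocalUserID) REFERENCES LocalUser(LocalUserID)
);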
 
Oh, I need to wait on my replies. You keep editing your post while I'm replying. HAHA!

Doing a mysqldump of one of the databases will be over 65GB in size. That's not really feasible for me to upload anywhere.

So what are you asking for specifically? The database structure without all the data? That would be smaller, but the command you gave dumps the whole database + data = huge file.
 
No, just export the db without the data. I don't know MariaDB's particular schema-export syntax.

I'll take a look later when I have some time.
 
I sent you a PM with it.

It's the same as MySQL. You just use the --no-data flag with the mysqldump command. That's what produced the file I sent you in the PM to download. It's tiny, less than 1MB lol

Much better than 65GB+.
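
For reference, the schema-only export is the earlier command plus that flag:

mysqldump -u [username] -p --no-data [database_name] > [output_filename].sql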
 
That is a lot of tables. The scripts you generated didn't work as an import in MSSQL because the syntax was not compatible, so I just installed MariaDB and looked at some of the tables. In general, text columns make poor keys, so for Project you should have a ProjectID that is an integer and is the PK, as in the sketch below.
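
For example, something like this (column names are just illustrative):

sql
-- Integer surrogate key instead of a text code as the PK
CREATE TABLE Project (
    ProjectID INT AUTO_INCREMENT PRIMARY KEY,
    short_name VARCHAR(20) NOT NULL UNIQUE,  -- e.g. 'pgrid'; keep the text code as a unique attribute
    name VARCHAR(100)
);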
 
OK, so the stats folder represents only new data from the last 24hrs? You add the numbers there to the existing numbers? The TeamID is just local to the project? There is no BOINC-wide team ID? How do you match across different projects? The user has the cpid as a unique identifier. I need to understand the entity relationships.

Is there a BOINC team table available somewhere that lists all the BOINC teams with a PK?
 
Stats folder? If it's the folder containing the XML files, then no. Nothing gets added or subtracted from this. This is what the system utilizes to populate/update the database. Those files just don't get deleted; I keep the latest copy on the server at all times. When a new/updated version of those files is downloaded, I overwrite the existing one with the new one and process that file to update the database.

There is a TeamID and a UserID; these IDs are unique to the project itself. The users (and hosts) have cross-project identification (CPID) numbers that allow me to generate a CPID page that shows all the projects that user has crunched on a single page.

The teams are matched on their "team page" using their names as far as I know.

The site, freedc.teamanandtech.org (this is the development version), will grab a stats file that's been recently updated and update the database with all its content. Then the system processes how many credits the teams, hosts and users earned since the last update. It also tracks what they did over the whole day as well as the previous day(s). It gives a total credit amount and ranks them based on this. This process repeats continuously all day long. When projects become inactive, obviously no more stats are updated, so nothing gets processed; those projects are skipped in the list. At the end of the day (UTC) a rollover process happens that moves all the current "Today" stats to "Yesterday" and all of "Yesterday's" stats to "2 days ago", and so on.
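
Roughly, that rollover amounts to something like this (the column names here are hypothetical; the real table layout isn't shown in this thread):

sql
-- Hypothetical daily rollover: shift each bucket back one day.
-- MariaDB applies SET assignments left to right, so the order matters.
UPDATE user_stats
SET credit_2days_ago = credit_yesterday,
    credit_yesterday = credit_today,
    credit_today     = 0;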
 
Isn't https://sech.me/boinc/Amicable/stats/ the latest stats from project Amicable? Typically refreshed every 24hrs?
 
Yes, those are the latest files that were updated. I think Amicable updates its numbers around 4 times a day.

Every time those files get updated, my stats collection system downloads the file, extracts it, moves it to a /stat/project directory, overwrites the previous one, then processes the file and updates the database.
 
So you have to follow their update schedule then? Each project has a different update frequency?
 
Yes and no.

Each project does have its own update frequencies and times, but I don't necessarily follow them. The download script just runs in a continuous loop: it checks all the projects' files to see if they've changed since the last check; if different, it downloads them; if not, it skips to the next one. Then a 15-minute timer runs at the end of the list and the process repeats.
 
So, the user XML:

<user>
<id>1</id>
<name>S C</name>
<country>Sweden</country>
<create_time>1483633404</create_time>
<total_credit>72451572.864025</total_credit>
<expavg_credit>0.094425</expavg_credit>
<expavg_time>1530457502.188703</expavg_time>
<cpid>ebeb727389d386958c5c414f38ca09bd</cpid>
<url>sech.me</url>
<teamid>72</teamid>
</user>

What is create_time?
Is id just the ID of the user in this project?
Is total_credit the delta since the last update, or this user's total for this project up to the XML export time?
What about expavg_credit and expavg_time, are these all deltas?
Is the teamid the ID from the team?
 
How about creating multiple DB instances under at least four VMs? Assign two to one CPU each. That way, your average resource utilization will go up and you may be able to get more performance out of the server without changing anything, other than pulling data from the four instances separately after the processing is completed.
 
<id>1</id>
The user's ID at this specific project

<name>S C</name>
The user's name

<country>Sweden</country>
The country they selected

<create_time>1483633404</create_time>
The date they created this account (a Unix timestamp)

<total_credit>72451572.864025</total_credit>
The total credits they have earned on the project so far. This number gets updated and grows as they earn more credits on the project.

<expavg_credit>0.094425</expavg_credit>
The average credits they've earned (I believe this is based on the last 14 days, but it's complicated and tracked by the project). This number will go up and down depending on when they run the project and how many credits they earned over a short time frame.

<expavg_time>1530457502.188703</expavg_time>
Not sure what this is; I don't think I do anything with it.

<cpid>ebeb727389d386958c5c414f38ca09bd</cpid>
The cross project identifier. This links their account to all the other projects they also run.

<url>sech.me</url>
This is the URL they can enter in their profile.

<teamid>72</teamid>
This is the team ID for this specific project.
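
Side note: create_time is a Unix timestamp, so MariaDB can render it as a date directly:

sql
-- Renders the epoch value as a datetime (in the session time zone);
-- 1483633404 is roughly 2017-01-05 UTC, matching the example above
SELECT FROM_UNIXTIME(1483633404);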
 
How about creating multiple DB instances under at least four VMs? [...]

Not sure how that would help. There are already two databases that flip/flop depending on what each is doing. While one database is being updated, the other one is being read by the web server. When the update is complete, they flip: the one that just got updated is now being read by the web server, and the database that was previously being read is now being updated.

This ensures the database being updated has full resources to do its updates.

Splitting them up even more, into VMs even, wouldn't really solve anything. Each VM instance would still take just as long to update the database as the current setup.
 
<total_credit>72451572.864025</total_credit>
The total credits they have earned on the project so far. [...]
So you just compare the total credit with the previous total credit to get the delta for this interval?
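
If that's the model, the per-interval delta under the normalized schema sketched earlier would look something like this (assuming one snapshot row per day):

sql
-- Delta = today's snapshot minus yesterday's, per user, for one project
SELECT cur.cpid,
       cur.total_credit - prev.total_credit AS credit_delta
FROM user_project_stats AS cur
JOIN user_project_stats AS prev
  ON prev.cpid = cur.cpid
 AND prev.proj_id = cur.proj_id
 AND prev.stat_date = cur.stat_date - INTERVAL 1 DAY
WHERE cur.proj_id = 'pgrid'
  AND cur.stat_date = CURDATE();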
 