YES! Your database expert is 100% correct and has identified the fundamental design flaw.
Current Design Problem:
Right now, boinc_user has:
- PRIMARY KEY: (proj, id) - composite key
This means the same user appears multiple times in the table:
+-------+--------+----------+--------+
| proj | id | nick | cpid |
+-------+--------+----------+--------+
| pgrid | 12345 | Skillz | abc123 |
| seti | 12345 | Skillz | abc123 | <- SAME person, different project
| mine | 67890 | Skillz | abc123 | <- SAME person, different ID on this project
+-------+--------+----------+--------+
The Problems:
- User data is duplicated across every project they participate in
- Same user can have different IDs on different projects (that's why CPID exists)
- Can't directly update "user Skillz" - have to update each (proj, id) combination separately
- To get cross-project totals, must GROUP BY cpid and scan millions of duplicate user records
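To make the update problem concrete, here is what a simple nickname change looks like in the current schema (a sketch using the column names from the table above). Even though it can be written as one statement keyed on CPID, the engine still has to find and rewrite one physical row per project the user participates in:

```sql
-- Current design: one logical change fans out across every project row.
-- This single UPDATE rewrites one row per (proj, id) combination
-- that shares the same CPID.
UPDATE boinc_user
SET nick = 'NewNick'
WHERE cpid = 'abc123';
```

Multiply that by every user attribute and every user, and the duplication cost of the denormalized layout becomes obvious.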
What He's Recommending (Proper Normalization):
Design 1: Normalized Structure
```sql
-- Table 1: Users (one row per unique user)
CREATE TABLE users (
    cpid        CHAR(32) PRIMARY KEY,
    nick        VARCHAR(250),
    country     VARCHAR(50),
    create_date INT
    -- other user attributes
);

-- Table 2: Projects (one row per project)
CREATE TABLE projects (
    proj_id CHAR(6) PRIMARY KEY,
    name    VARCHAR(100),
    url     VARCHAR(255)
    -- other project attributes
);

-- Table 3: User-Project Stats (daily transaction data)
CREATE TABLE user_project_stats (
    cpid            CHAR(32),
    proj_id         CHAR(6),
    stat_date       DATE,
    total_credit    DOUBLE,
    rac             DOUBLE,
    project_user_id INT,  -- their ID on that specific project
    -- other daily metrics
    PRIMARY KEY (cpid, proj_id, stat_date),
    INDEX (proj_id, stat_date)
    -- no separate INDEX (cpid): the primary key already leads with cpid
);
```
Benefits:
- User data stored ONCE - update "Skillz" in one place
- Only stats are repeated (by date/project) - this is transactional data
- Incremental updates: When PrimeGrid updates:
```sql
-- Only update the rows that changed
UPDATE user_project_stats
SET total_credit = NEW_VALUE, rac = NEW_VALUE
WHERE proj_id = 'pgrid'
  AND stat_date = '2025-10-24'
  AND cpid IN (list_of_changed_users);
```
- Cross-project totals become easy:
```sql
-- Fast aggregation (only sums the latest day's stats, not all history)
SELECT cpid, SUM(total_credit)
FROM user_project_stats
WHERE stat_date = CURDATE()
GROUP BY cpid;
```
Why Current Design Is Slow:
Your current overallboinc.pl script does:
```sql
SELECT cpid, SUM(metric1), SUM(metric2), ...
FROM boinc_user
GROUP BY cpid;
```
This:
- ✗ Reads 6 million+ rows (every user×project combination)
- ✗ Most of that data hasn't changed since last run
- ✗ Recalculates everything from scratch
- ✗ Takes ~1 hour because it's processing millions of unchanged rows
With normalized design:
```sql
-- Get today's stats for users in changed projects
-- Only processes NEW data
SELECT cpid, SUM(total_credit)
FROM user_project_stats
WHERE stat_date = CURDATE()
  AND proj_id IN ('pgrid', 'mine')  -- only projects that updated today
GROUP BY cpid;
```
His Key Point: "You only update what you need to update"
Current Design:
- PrimeGrid updates with 50,000 users → must reprocess ALL 6M rows to recalculate totals
Normalized Design:
- PrimeGrid updates with 50,000 users → only INSERT/UPDATE those 50,000 rows in user_project_stats
- Overall totals can be maintained incrementally or recalculated from current day's stats only
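The incremental daily load can be sketched as a single idempotent upsert (assuming MySQL syntax, which the `DOUBLE`/`CHAR` types in the schema suggest). Because the primary key is (cpid, proj_id, stat_date), re-running the same project import simply overwrites that day's rows instead of duplicating them:

```sql
-- Daily load for one project: upsert only the rows that changed.
-- The (cpid, proj_id, stat_date) primary key makes this idempotent.
INSERT INTO user_project_stats
    (cpid, proj_id, stat_date, total_credit, rac, project_user_id)
VALUES
    ('abc123', 'pgrid', CURDATE(), 1234567.8, 4321.0, 12345)
ON DUPLICATE KEY UPDATE
    total_credit = VALUES(total_credit),
    rac          = VALUES(rac);
```

In practice the importer would batch many VALUES rows per statement, but the shape stays the same: 50,000 changed users means 50,000 upserted rows, not a 6M-row rescan.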
Summary:
Your expert is saying:
"The problem isn't the queries or indexes - it's that you've denormalized the data in a way that forces you to recalculate everything repeatedly instead of doing incremental updates."
He's absolutely right. This is a database design issue that no amount of query optimization or better hardware will fix.
The real fix: Redesign to separate:
- Static user data (stored once)
- Static project data (stored once)
- Transactional stats data (daily updates only)
This would make your updates orders of magnitude faster because you'd only process what actually changed.
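As a rough illustration of the one-time migration, the static user table can be backfilled from the existing denormalized data. This is a sketch only: it uses just the `cpid` and `nick` columns shown in the example table above, and arbitrarily picks one nickname per CPID in case different projects disagree:

```sql
-- One-time backfill sketch: collapse duplicate user rows down to
-- one row per CPID. MAX(nick) arbitrarily resolves conflicting nicks.
INSERT INTO users (cpid, nick)
SELECT cpid, MAX(nick)
FROM boinc_user
GROUP BY cpid;
```

A similar pass per project would populate `user_project_stats` with each user's current credit as the opening day's row.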