MySQL 中随机选择10条记录
MySQL 中随机选择10条记录
SELECT id FROM user ORDER BY RAND() LIMIT 10;
数据量小于1000行的时候,上面的 sql 执行的快。但是当数据大于10000行, 排序的开销就变得很重。上面的操作中,我们在排序完就把几乎所有的行都丢掉了。
只要我们有一个数字主键,我们可以有更好的方式去实现这个功能,不需要对所有数据进行排序。
在上面的例子中, 我们假设 id 从1开始, 并且在1和 id 的最大值之间是连续的。
通过应用程序解决问题
可以在应用程序中计算随机id, 简化整个计算。
SELECT MAX(id) FROM user;
## 在应用程序中生成区间内的随机数:random-id
SELECT name FROM user WHERE id = <random-id>
由于MAX(id) == COUNT(id),我们只是生成1和 max (id) 之间的随机数, 并将其传递到数据库中检索随机行。
第一个select语句是NO-OP,并一直在被优化。第二个是针对常量的 eq 速度也很快。
通过数据库解决问题
# 生成一个随机ID
> SELECT RAND() * MAX(id) FROM user;
+------------------+
| RAND() * MAX(id) |
+------------------+
| 689.37582507297 |
+------------------+
# 返回值是double,但是我们需要的是 int
> SELECT CEIL(RAND() * MAX(id)) FROM user;
+-------------------------+
| CEIL(RAND() * MAX(id)) |
+-------------------------+
| 1000000 |
+-------------------------+
# 返回值是 int,分析性能
> EXPLAIN
SELECT CEIL(RAND() * MAX(id)) FROM random;
+----+-------------+-------+-------+------+-------------+
| id | select_type | table | type | rows | Extra |
+----+-------------+-------+-------+------+-------------+
| 1 | SIMPLE | random| index |1000000| Using index |
+----+-------------+-------+-------+------+-------------+
## 全表扫描?由于使用 MAX()函数了,导致优化丢失。
> EXPLAIN
SELECT CEIL(RAND() * (SELECT MAX(id) FROM random));
+----+-------------+-------+------+------+------------------------------+
| id | select_type | table | type | rows | Extra |
+----+-------------+-------+------+------+------------------------------+
| 1 | PRIMARY | NULL | NULL | NULL | No tables used |
| 2 | SUBQUERY | NULL | NULL | NULL | Select tables optimized away |
+----+-------------+-------+------+------+------------------------------+
## 子查询可以将性能损失挽回
> EXPLAIN
SELECT name
FROM user
WHERE id = (SELECT CEIL(RAND() *
(SELECT MAX(id)
FROM user));
+----+-------------+--------+------+---------------+------+---------+------+---------+------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+--------+------+---------------+------+---------+------+---------+------------------------------+
| 1 | PRIMARY | user | ALL | NULL | NULL | NULL | NULL | 1000000 | Using where |
| 3 | SUBQUERY | NULL | NULL | NULL | NULL | NULL | NULL | NULL | Select tables optimized away |
+----+-------------+--------+------+---------------+------+---------+------+---------+------------------------------+
> show warnings;
+-------+------+------------------------------------------+
| Level | Code | Message |
+-------+------+------------------------------------------+
| Note | 1249 | Select 2 was reduced during optimization |
+-------+------+------------------------------------------+
上面的方法是最明显的, 但也是最错误的做法。原因是:where子查询中的select为外部select每一行都会执行。具体解释参考:sql语句嵌套查询性能低
要找一种方法,保证random-id只生成一次:
SELECT name
FROM user JOIN
(SELECT CEIL(RAND() *
(SELECT MAX(id)
FROM user)) AS id
) AS r2
USING (id);
+----+-------------+------------+--------+------+------------------------------+
| id | select_type | table | type | rows | Extra |
+----+-------------+------------+--------+------+------------------------------+
| 1 | PRIMARY | <derived2> | system | 1 | |
| 1 | PRIMARY | user | const | 1 | |
| 2 | DERIVED | NULL | NULL | NULL | No tables used |
| 3 | SUBQUERY | NULL | NULL | NULL | Select tables optimized away |
+----+-------------+------------+--------+------+------------------------------+
内部select生成一个常量临时表, join 只在单行上执行。没有使用排序,没有通过应用程序,查询的大多数部分都被优化了。
非连续数据
删除一些行,构造ID非连续的记录。
SELECT name
FROM random AS r1 JOIN
(SELECT (RAND() *
(SELECT MAX(id)
FROM random)) AS id)
AS r2
WHERE r1.id >= r2.id
ORDER BY r1.id ASC
LIMIT 1;
+----+-------------+------------+--------+------+------------------------------+
| id | select_type | table | type | rows | Extra |
+----+-------------+------------+--------+------+------------------------------+
| 1 | PRIMARY | <derived2> | system | 1 | |
| 1 | PRIMARY | r1 | range | 689 | Using where |
| 2 | DERIVED | NULL | NULL | NULL | No tables used |
| 3 | SUBQUERY | NULL | NULL | NULL | Select tables optimized away |
+----+-------------+------------+--------+------+------------------------------+
join现在获取所有大于或等于我们随机值的ID,如果不能直接匹配则选择邻居。 但是一旦找到一行,就停止执行(LIMIT 1)。根据索引(ORDER BY id ASC)读取行。 当使用 >= 而不是a = 时,我们可以摆脱CEIL并以更少的工作获得相同的结果。
平等分配
当我们的ID分布不再相等时,我们选择的行也不是真正随机的。
> select * from holes;
+----+----------------------------------+----------+
| id | name | accesses |
+----+----------------------------------+----------+
| 1 | d12b2551c6cb7d7a64e40221569a8571 | 107 |
| 2 | f82ad6f29c9a680d7873d1bef822e3e9 | 50 |
| 4 | 9da1ed7dbbdcc6ec90d6cb139521f14a | 132 |
| 8 | 677a196206d93cdf18c3744905b94f73 | 230 |
| 16 | b7556d8ed40587a33dc5c449ae0345aa | 481 |
+----+----------------------------------+----------+
RAND方法会生成9到15之类的ID,这些ID都会导致id 16被选为下一个更高的数字。
这个问题没有真正的解决方案,但是由于你的数据大多是不变的,你可以添加一个映射表,将行号映射到id:
> create table holes_map ( row_id int not NULL primary key, random_id int not null);
> SET @id = 0;
> INSERT INTO holes_map SELECT @id := @id + 1, id FROM holes;
> select * from holes_map;
+--------+-----------+
| row_id | random_id |
+--------+-----------+
| 1 | 1 |
| 2 | 2 |
| 3 | 4 |
| 4 | 8 |
| 5 | 16 |
+--------+-----------+
row_id现在再次是连续,我们可以再次运行随机查询
SELECT name FROM holes
JOIN (SELECT r1.random_id
FROM holes_map AS r1
JOIN (SELECT (RAND() *
(SELECT MAX(row_id)
FROM holes_map)) AS row_id)
AS r2
WHERE r1.row_id >= r2.row_id
ORDER BY r1.row_id ASC
LIMIT 1) as rows ON (id = random_id);
1000次提取后,我们再次看到平均分布:
> select * from holes;
+----+----------------------------------+----------+
| id | name | accesses |
+----+----------------------------------+----------+
| 1 | d12b2551c6cb7d7a64e40221569a8571 | 222 |
| 2 | f82ad6f29c9a680d7873d1bef822e3e9 | 187 |
| 4 | 9da1ed7dbbdcc6ec90d6cb139521f14a | 195 |
| 8 | 677a196206d93cdf18c3744905b94f73 | 207 |
| 16 | b7556d8ed40587a33dc5c449ae0345aa | 189 |
+----+----------------------------------+----------+
维护连续的表
DROP TABLE IF EXISTS r2;
CREATE TABLE r2 (
id SERIAL,
name VARCHAR(32) NOT NULL UNIQUE
);
DROP TABLE IF EXISTS r2_equi_dist;
CREATE TABLE r2_equi_dist (
id SERIAL,
r2_id bigint unsigned NOT NULL UNIQUE
);
当我们在r2中更改某些内容时,我们希望r2_equi_dist也会更新。
DELIMITER $$
DROP TRIGGER IF EXISTS tai_r2$$
CREATE TRIGGER tai_r2
AFTER INSERT ON r2 FOR EACH ROW
BEGIN
DECLARE m BIGINT UNSIGNED DEFAULT 1;
SELECT MAX(id) + 1 FROM r2_equi_dist INTO m;
SELECT IFNULL(m, 1) INTO m;
INSERT INTO r2_equi_dist (id, r2_id) VALUES (m, NEW.id);
END$$
DELIMITER ;
DELETE FROM r2;
INSERT INTO r2 VALUES ( NULL, MD5(RAND()) );
INSERT INTO r2 VALUES ( NULL, MD5(RAND()) );
INSERT INTO r2 VALUES ( NULL, MD5(RAND()) );
INSERT INTO r2 VALUES ( NULL, MD5(RAND()) );
SELECT * FROM r2;
+----+----------------------------------+
| id | name |
+----+----------------------------------+
| 1 | 8b4cf277a3343cdefbe19aa4dabc40e1 |
| 2 | a09a3959d68187ce48f4fe7e388926a9 |
| 3 | 4e1897cd6d326f8079108292376fa7d5 |
| 4 | 29a5e3ed838db497aa330878920ec01b |
+----+----------------------------------+
SELECT * FROM r2_equi_dist;
+----+-------+
| id | r2_id |
+----+-------+
| 1 | 1 |
| 2 | 2 |
| 3 | 3 |
| 4 | 4 |
+----+-------+
INSERT非常简单,DELETE操作我们必须更新equi-dist-id以保持id的连续设置:
DELIMITER $$
DROP TRIGGER IF EXISTS tad_r2$$
CREATE TRIGGER tad_r2
AFTER DELETE ON r2 FOR EACH ROW
BEGIN
DELETE FROM r2_equi_dist WHERE r2_id = OLD.id;
UPDATE r2_equi_dist SET id = id - 1 WHERE r2_id > OLD.id;
END$$
DELIMITER ;
DELETE FROM r2 WHERE id = 2;
SELECT * FROM r2;
+----+----------------------------------+
| id | name |
+----+----------------------------------+
| 1 | 8b4cf277a3343cdefbe19aa4dabc40e1 |
| 3 | 4e1897cd6d326f8079108292376fa7d5 |
| 4 | 29a5e3ed838db497aa330878920ec01b |
+----+----------------------------------+
SELECT * FROM r2_equi_dist;
+----+-------+
| id | r2_id |
+----+-------+
| 1 | 1 |
| 2 | 3 |
| 3 | 4 |
+----+-------+
update操作需要维护外键约束:
DELIMITER $$
DROP TRIGGER IF EXISTS tau_r2$$
CREATE TRIGGER tau_r2
AFTER UPDATE ON r2 FOR EACH ROW
BEGIN
UPDATE r2_equi_dist SET r2_id = NEW.id WHERE r2_id = OLD.id;
END$$
DELIMITER ;
UPDATE r2 SET id = 25 WHERE id = 4;
SELECT * FROM r2;
+----+----------------------------------+
| id | name |
+----+----------------------------------+
| 1 | 8b4cf277a3343cdefbe19aa4dabc40e1 |
| 3 | 4e1897cd6d326f8079108292376fa7d5 |
| 25 | 29a5e3ed838db497aa330878920ec01b |
+----+----------------------------------+
SELECT * FROM r2_equi_dist;
+----+-------+
| id | r2_id |
+----+-------+
| 1 | 1 |
| 2 | 3 |
| 3 | 25 |
+----+-------+
一次多行
如果要返回多行,您可以:
- 多次执行查询
- 编写执行查询的存储过程并将结果存储在临时表中
存储过程
存储过程为你了程序语言结构:
- 循环
- 控制结构
- 程序
...
对于此任务,我们只需要一个循环:
ELIMITER $$
DROP PROCEDURE IF EXISTS get_rands$$
CREATE PROCEDURE get_rands(IN cnt INT)
BEGIN
DROP TEMPORARY TABLE IF EXISTS rands;
CREATE TEMPORARY TABLE rands ( rand_id INT );
loop_me: LOOP
IF cnt < 1 THEN
LEAVE loop_me;
END IF;
INSERT INTO rands
SELECT r1.id
FROM random AS r1 JOIN
(SELECT (RAND() *
(SELECT MAX(id)
FROM random)) AS id)
AS r2
WHERE r1.id >= r2.id
ORDER BY r1.id ASC
LIMIT 1;
SET cnt = cnt - 1;
END LOOP loop_me;
END$$
DELIMITER ;
CALL get_rands(4);
SELECT * FROM rands;
+---------+
| rand_id |
+---------+
| 133716 |
| 702643 |
| 112066 |
| 452400 |
+---------+
性能
我们有3个不同的查询来解决我们的问题:
- Q1. ORDER BY RAND()
- Q2. RAND() * MAX(ID)
- Q3. RAND() * MAX(ID) + ORDER BY ID
Q1预计成本为N * log2(N),Q2和Q3几乎恒定。
我们用N行(一千到一百万)填充表格并执行每次查询1000次。
100 1.000 10.000 100.000 1.000.000
Q1 0:00.718s 0:02.092s 0:18.684s 2:59.081s 58:20.000s
Q2 0:00.519s 0:00.607s 0:00.614s 0:00.628s 0:00.637s
Q3 0:00.570s 0:00.607s 0:00.614s 0:00.628s 0:00.637s
作者:浮梁翁
链接:https://www.jianshu.com/p/f68da9510066
来源:简书
著作权归作者所有。商业转载请联系作者获得授权,非商业转载请注明出处。
这个主题简介啊