SQLite3 query optimization: join vs subselect

I am trying to figure out the best way (it probably doesn't matter in this case) to find rows of one table based on the existence of a flag and of a relational id in a row of another table.

Here are the schemas:

    CREATE TABLE files (
        id INTEGER PRIMARY KEY,
        dirty INTEGER NOT NULL);

    CREATE TABLE resume_points (
        id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
        scan_file_id INTEGER NOT NULL);

I am using SQLite3.

The files table will be very large, typically 10K-5M rows. resume_points will be small (< 10K rows), with only 1-2 distinct scan_file_id values.

So my first thought was:

select distinct files.* from resume_points inner join files
on resume_points.scan_file_id=files.id where files.dirty = 1;

A coworker suggested turning the join around:

select distinct files.* from files inner join resume_points
on files.id=resume_points.scan_file_id where files.dirty = 1;

Then I thought: since we know the number of distinct scan_file_id values will be so small, perhaps a subselect would be optimal (in this rare instance):

select * from files where id in (select distinct scan_file_id from resume_points);

The explain outputs had the following numbers of rows: 42, 42, and 48 respectively.
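For anyone who wants to reproduce this comparison: EXPLAIN returns one row per bytecode instruction of the prepared statement, so its row count says how long the program is, not how long it takes to run. Below is a minimal sketch using Python's sqlite3 module against an empty in-memory database with the schemas above. The helper name `explain_rows` and the dictionary labels are made up for illustration, and the counts will differ between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE files (id INTEGER PRIMARY KEY, dirty INTEGER NOT NULL);
    CREATE TABLE resume_points (
        id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
        scan_file_id INTEGER NOT NULL);
""")

# The three candidate queries from the question, unchanged.
queries = {
    "join from resume_points": (
        "select distinct files.* from resume_points inner join files "
        "on resume_points.scan_file_id=files.id where files.dirty = 1"
    ),
    "join from files": (
        "select distinct files.* from files inner join resume_points "
        "on files.id=resume_points.scan_file_id where files.dirty = 1"
    ),
    "subselect": (
        "select * from files where id in "
        "(select distinct scan_file_id from resume_points)"
    ),
}


def explain_rows(sql):
    """Return how many bytecode instructions EXPLAIN lists for sql."""
    return len(conn.execute("EXPLAIN " + sql).fetchall())


opcode_counts = {name: explain_rows(sql) for name, sql in queries.items()}
for name, count in opcode_counts.items():
    print(name, count)
```

A longer bytecode listing does not imply a slower query, so timing against realistic data is still needed.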

Answer 1

TL;DR: the best query and index are:

create index uniqueFiles on resume_points (scan_file_id);
select * from (select distinct scan_file_id from resume_points) d join files on d.scan_file_id = files.id and files.dirty = 1;

Since I usually work with SQL Server, I first assumed that the query optimizer would surely find the optimal execution plan for such a simple query, no matter which way you write these equivalent SQL statements. So I downloaded SQLite and started playing around. Much to my surprise, there was a huge difference in performance.

Here is the setup code:

CREATE TABLE files (
id INTEGER PRIMARY KEY autoincrement,
dirty INTEGER NOT NULL);

CREATE TABLE resume_points (
id INTEGER PRIMARY KEY  AUTOINCREMENT  NOT NULL ,
scan_file_id INTEGER NOT NULL );

insert into files (dirty) values (0);
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;
insert into files (dirty) select (case when random() < 0 then 1 else 0 end) from files;

insert into resume_points (scan_file_id) select (select abs(random() % 8000000)) from files limit 5000;

insert into resume_points (scan_file_id) select (select abs(random() % 8000000)) from files limit 5000;
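Each `insert ... select ... from files` statement above doubles the table, so one seed row followed by 23 doublings gives 2^23 = 8,388,608 rows, which is why the resume points are drawn from a range of about 8,000,000 ids. SQLite's random() returns a signed 64-bit integer, so `random() < 0` marks roughly half of the new rows as dirty. A scaled-down sketch of the same technique with Python's sqlite3 module follows; the constant `DOUBLINGS` is an illustrative name, and 10 is used instead of 23 only to keep it fast.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE files ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, dirty INTEGER NOT NULL)"
)

# Seed a single row, then double the table repeatedly.
conn.execute("INSERT INTO files (dirty) VALUES (0)")

DOUBLINGS = 10  # the setup code above uses 23 doublings
for _ in range(DOUBLINGS):
    # Every existing row produces one new row with a random dirty flag.
    conn.execute(
        "INSERT INTO files (dirty) "
        "SELECT (CASE WHEN random() < 0 THEN 1 ELSE 0 END) FROM files"
    )

row_count = conn.execute("SELECT count(*) FROM files").fetchone()[0]
dirty_values = {row[0] for row in conn.execute("SELECT DISTINCT dirty FROM files")}
print(row_count)     # equals 2 ** DOUBLINGS
print(dirty_values)  # only 0 and/or 1 can appear
```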

I considered the following indexes:

create index dirtyFiles on files (dirty, id);
create index uniqueFiles on resume_points (scan_file_id);
create index fileLookup on files (id);

Below are the queries I tried and their execution times on my i5 laptop. The database file is only about 200 MB, since it contains no other data.

select distinct files.* from resume_points inner join files on resume_points.scan_file_id=files.id where files.dirty = 1;
4.3 - 4.5ms with and without index

select distinct files.* from files inner join resume_points on files.id=resume_points.scan_file_id where files.dirty = 1;
4.4 - 4.7ms with and without index

select * from (select distinct scan_file_id from resume_points) d join files on d.scan_file_id = files.id and files.dirty = 1;
2.0 - 2.5ms with uniqueFiles
2.6-2.9ms without uniqueFiles

select * from files where id in (select distinct scan_file_id from resume_points) and dirty = 1;
2.1 - 2.5ms with uniqueFiles
2.6-3ms without uniqueFiles

SELECT f.* FROM resume_points rp INNER JOIN files f on rp.scan_file_id = f.id
WHERE f.dirty = 1 GROUP BY f.id
4500 - 6190 ms with uniqueFiles
8.8-9.5 ms without uniqueFiles
14000 ms with uniqueFiles and fileLookup

select * from files where exists (
select * from resume_points where files.id = resume_points.scan_file_id) and dirty = 1;
8400 ms with uniqueFiles
7400 ms without uniqueFiles

It looks like SQLite's query optimizer isn't very advanced. The best queries first reduce resume_points to a small number of rows (two in the test case; the OP said it would be 1-2) and then look up each file to see whether it is dirty.

The dirtyFiles index didn't make much of a difference for any of the queries. I think that may be because of the way the data is arranged in the test tables, and it might make a difference on production tables. Even so, the difference would not be large, since there will be fewer than a handful of lookups. uniqueFiles does make a difference, since it can reduce the 10000 rows of resume_points to 2 rows without scanning through most of them. fileLookup made some queries slightly faster, but not enough to change the results significantly. Notably, it made the GROUP BY query very slow.

In conclusion, reduce the result set early to make the biggest difference.
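The "reduce first, then look up" idea can be checked on a small scale. The sketch below uses Python's sqlite3 module with hypothetical test data (10,000 files, 10,000 resume points referencing only files 101 and 202). It is not the benchmark above and reproduces none of its timings. It selects `files.*` in both queries so the results are directly comparable, then prints whatever plan the local SQLite version chooses for the reduce-first query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE files (id INTEGER PRIMARY KEY, dirty INTEGER NOT NULL);
    CREATE TABLE resume_points (
        id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
        scan_file_id INTEGER NOT NULL);
    CREATE INDEX uniqueFiles ON resume_points (scan_file_id);
""")

# Hypothetical data: odd file ids are dirty; the resume points reference
# only two distinct files, 101 (dirty) and 202 (clean).
conn.executemany(
    "INSERT INTO files (id, dirty) VALUES (?, ?)",
    ((i, i % 2) for i in range(1, 10001)),
)
conn.executemany(
    "INSERT INTO resume_points (scan_file_id) VALUES (?)",
    ((101 if i % 2 else 202,) for i in range(10000)),
)

# Shrink resume_points to its distinct ids first, then look up the files.
reduce_first = """
    SELECT files.* FROM (SELECT DISTINCT scan_file_id FROM resume_points) d
    JOIN files ON d.scan_file_id = files.id AND files.dirty = 1
"""
# Join every resume point to its file, then remove duplicates.
join_then_distinct = """
    SELECT DISTINCT files.* FROM resume_points
    INNER JOIN files ON resume_points.scan_file_id = files.id
    WHERE files.dirty = 1
"""

reduced = conn.execute(reduce_first).fetchall()
joined = conn.execute(join_then_distinct).fetchall()
print(reduced, joined)

# The last column of each EXPLAIN QUERY PLAN row is the readable plan step.
for row in conn.execute("EXPLAIN QUERY PLAN " + reduce_first):
    print(row[-1])
```

Both queries return the same single row here; they differ only in how much intermediate work the engine does to get there.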

Answer 2

Since files.id is the primary key, try GROUPing BY this field instead of checking DISTINCT files.*

SELECT f.*
FROM resume_points rp
INNER JOIN files f on rp.scan_file_id = f.id
WHERE f.dirty = 1
GROUP BY f.id

Another option worth evaluating for performance is adding an index on resume_points.scan_file_id.

CREATE INDEX index_resume_points_scan_file_id ON resume_points (scan_file_id)

Answer 3

You could try exists, which will not produce duplicate files:

select * from files
where exists (
    select * from resume_points 
    where files.id = resume_points.scan_file_id
)
and dirty = 1;

Of course, it may help to have the proper indexes:

files.dirty
resume_points.scan_file_id

Whether an index is useful will depend on your data.
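To illustrate why exists avoids duplicates, here is a small sketch using Python's sqlite3 module. The index names `idx_files_dirty` and `idx_resume_points_scan_file_id` are made up (only the columns are named above), and the few rows of data are hypothetical. It says nothing about speed, only about the shape of the result.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE files (id INTEGER PRIMARY KEY, dirty INTEGER NOT NULL);
    CREATE TABLE resume_points (
        id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
        scan_file_id INTEGER NOT NULL);

    -- Hypothetical index names; only the indexed columns come from the answer.
    CREATE INDEX idx_files_dirty ON files (dirty);
    CREATE INDEX idx_resume_points_scan_file_id ON resume_points (scan_file_id);

    -- File 1 is dirty and has three resume points; file 2 is clean.
    INSERT INTO files (id, dirty) VALUES (1, 1), (2, 0), (3, 1);
    INSERT INTO resume_points (scan_file_id) VALUES (1), (1), (1), (2);
""")

# A plain join repeats the file once per matching resume point.
plain_join_rows = conn.execute("""
    SELECT files.* FROM files
    INNER JOIN resume_points ON files.id = resume_points.scan_file_id
    WHERE files.dirty = 1
""").fetchall()

# EXISTS only tests for a match, so each file appears at most once.
exists_rows = conn.execute("""
    SELECT * FROM files
    WHERE EXISTS (
        SELECT * FROM resume_points
        WHERE files.id = resume_points.scan_file_id
    )
    AND dirty = 1
""").fetchall()

print(len(plain_join_rows), exists_rows)
```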

Answer 4

I think jtseng gave the solution.

select * from (select distinct scan_file_id from resume_points) d
join files on d.scan_file_id = files.id and files.dirty = 1

Basically, it is the same as what you posted as your last option:

select * from files where id in (select distinct scan_file_id from resume_points) and dirty = 1;

The point is that you need to avoid a full table scan/join.

So first you need your 1-2 distinct ids:

select distinct scan_file_id from resume_points

After that, only those 1-2 rows have to be joined against the other table instead of all 10K, which is where the performance gain comes from.

If you need this statement several times, I would put it into a view. The view will not change the performance, but it looks cleaner and is easier to read.
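A minimal sketch of the view idea, using Python's sqlite3 module. The view name `dirty_resume_files` and the sample rows are hypothetical. The view body is the distinct-first query, with the column list narrowed to the files columns so the view has unambiguous column names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE files (id INTEGER PRIMARY KEY, dirty INTEGER NOT NULL);
    CREATE TABLE resume_points (
        id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL,
        scan_file_id INTEGER NOT NULL);

    INSERT INTO files (id, dirty) VALUES (1, 1), (2, 0), (3, 1);
    INSERT INTO resume_points (scan_file_id) VALUES (1), (1), (2), (2);

    -- Hypothetical view name wrapping the distinct-first query.
    CREATE VIEW dirty_resume_files AS
        SELECT files.id, files.dirty
        FROM (SELECT DISTINCT scan_file_id FROM resume_points) d
        JOIN files ON d.scan_file_id = files.id AND files.dirty = 1;
""")

# Callers now use the short form; the underlying query is unchanged.
view_rows = conn.execute("SELECT * FROM dirty_resume_files").fetchall()
print(view_rows)
```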

Also check the query optimization documentation: http://www.sqlite.org/optoverview.html

Answer 5

If the table "resume_points" will only ever contain one or two distinct file ids, it seems to need only one or two rows, and it seems to need scan_file_id as its primary key. That table has only two columns, and the id number is meaningless.

And if that is the case, you don't need either of the id numbers.

pragma foreign_keys = on;
CREATE TABLE resume_points (
  scan_file_id integer primary key
);

CREATE TABLE files (
  scan_file_id integer not null references resume_points (scan_file_id),
  dirty INTEGER NOT NULL,
  primary key (scan_file_id, dirty)
);

And now you don't need a join either. Just query the "files" table.
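As a rough sketch of how this redesigned schema behaves, the following uses Python's sqlite3 module with a few hypothetical rows. It assumes an SQLite build with foreign key support, which has to be switched on per connection with the pragma.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Foreign key enforcement is off by default and is enabled per connection.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE resume_points (
        scan_file_id INTEGER PRIMARY KEY
    );
    CREATE TABLE files (
        scan_file_id INTEGER NOT NULL REFERENCES resume_points (scan_file_id),
        dirty INTEGER NOT NULL,
        PRIMARY KEY (scan_file_id, dirty)
    );

    -- Hypothetical sample rows.
    INSERT INTO resume_points (scan_file_id) VALUES (7), (9);
    INSERT INTO files (scan_file_id, dirty) VALUES (7, 1), (9, 0);
""")

# No join: every row in files already refers to an existing resume point.
dirty_files = conn.execute(
    "SELECT scan_file_id, dirty FROM files WHERE dirty = 1"
).fetchall()
print(dirty_files)

# A file row that points at a missing resume point is rejected.
try:
    conn.execute("INSERT INTO files (scan_file_id, dirty) VALUES (8, 1)")
    fk_rejected = False
except sqlite3.IntegrityError:
    fk_rejected = True
print(fk_rejected)
```

Note that with this layout the files table can only hold files that have a resume point, which is a different data model from the one in the question.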