I have been using SQLite occasionally to work with large (>100GB) datasets. Recently I had to split out a field that had comma seperated values into multiple rows. My SQL and database skills is not very mature, but I am learning new things all the time (eg. don’t forget to index columns that you need), so its been a fun addition to my toolkit.
In this particular case, I was working on multi-label classification problem (each example can have more than one label). The labels were stored as a comma seperated values in the same column. I needed to get (id, label) pairs out from the database. Postgres has a convenient method regexp_split_to_table which can be used to split the column using commas. SQLite (being lite), does not have such a function though so we got to implement it ourselves.
Lets take a look at how the data looks like
sqlite> SELECT eid,labels FROM examples;
e0000000|label1,label2,label3
e0000001|label1
e0000002|label2,label4
Since we need a query that can run on an unknown number of labels per row, we can use a Common Table Expression (CTE) to create a recursive query that splits the strings one substring at a time.
-- query.sql
WITH RECURSIVE split(eid, label, str) AS (
SELECT eid, '', labels||',' FROM examples
UNION ALL SELECT
eid,
substr(str, 0, instr(str, ',')),
substr(str, instr(str, ',')+1)
FROM split WHERE str!=''
)
SELECT eid, label
FROM split
WHERE label!='';
Each time a string is passed into the above query, it splits the string at the first comma it finds. It returns the left part of the string as a label and the right part of the string is passed into the next iteration of the recursive query to be processed again.
Given label1,label2,label3
, the first iteration of the CTE will return
label1
and run the recursive CTE on label2,label3
and so on. The rest of the
code is handling minor things such as ensuring there is a comma after every
label.
$ sqlite3 data.db < query.sql
e0000000|label1
e0000000|label2
e0000000|label3
e0000001|label1
e0000002|label2
e0000002|label4
The query produces exactly what we want. We can now run hyperparameter tuning to overfit on the validation set!
CTE’s are really powerful and are extensively used to traverse graphs (yes you heard that right, you dont need graph databases to store graphs). Read more about them from the SQLite documentation.