I have been using SQLite occasionally to work with large (>100GB) datasets. Recently I had to split out a field that had comma seperated values into multiple rows. My SQL and database skills is not very mature, but I am learning new things all the time (eg. don’t forget to index columns that you need), so its been a fun addition to my toolkit.
In this particular case, I was working on multi-label classification problem (each example can have more than one label). The labels were stored as a comma seperated values in the same column. I needed to get (id, label) pairs out from the database. Postgres has a convenient method regexp_split_to_table which can be used to split the column using commas. SQLite (being lite), does not have such a function though so we got to implement it ourselves.
Lets take a look at how the data looks like
sqlite> SELECT eid,labels FROM examples; e0000000|label1,label2,label3 e0000001|label1 e0000002|label2,label4
Since we need a query that can run on an unknown number of labels per row, we can use a Common Table Expression (CTE) to create a recursive query that splits the strings one substring at a time.
-- query.sql WITH RECURSIVE split(eid, label, str) AS ( SELECT eid, '', labels||',' FROM examples UNION ALL SELECT eid, substr(str, 0, instr(str, ',')), substr(str, instr(str, ',')+1) FROM split WHERE str!='' ) SELECT eid, label FROM split WHERE label!='';
Each time a string is passed into the above query, it splits the string at the first comma it finds. It returns the left part of the string as a label and the right part of the string is passed into the next iteration of the recursive query to be processed again.
label1,label2,label3, the first iteration of the CTE will return
label1 and run the recursive CTE on
label2,label3 and so on. The rest of the
code is handling minor things such as ensuring there is a comma after every
$ sqlite3 data.db < query.sql e0000000|label1 e0000000|label2 e0000000|label3 e0000001|label1 e0000002|label2 e0000002|label4
The query produces exactly what we want. We can now run hyperparameter tuning to overfit on the validation set!
CTE’s are really powerful and are extensively used to traverse graphs (yes you heard that right, you dont need graph databases to store graphs). Read more about them from the SQLite documentation.