Skip to content

Getting started

fuersten edited this page Oct 16, 2015 · 1 revision

Getting started

Setup

For this example I will use csv data from http://ourairports.com/data/, specifically the airports.csv, countries.csv and regions.csv files.

Schema

First you will have to create a schema for the required tables and a mapping for the csv files. This can be done via a command file with sql commands (csvsqldb -c <file_name>) or the interactive mode. You start the interactive mode by invoking csvsqldb -i.

First the schema for the tables. The order and types of the fields have to match exactly the order the fields have in the csv file. Things like primary keys and length of characters currently have no influence on the engine. This might change in the future. The schema files for the tables will be created in the .csvdb/tables metadata directory at the current path. csvsqldb will always look for a .csvdb directory in the current path if it is not specified by the -p command line argument. The .csvdb directory will be created, if it is not found in the specified path.

CREATE TABLE IF NOT EXISTS airport(id INTEGER PRIMARY KEY,ident CHAR(8),type VARCHAR(50),name VARCHAR(255),latitude_deg REAL,longitude_deg REAL,elevation_ft INTEGER,continent CHAR(2),iso_country CHAR(2),iso_region CHAR(6),municipality VARCHAR(255),scheduled_service CHAR(3),gps_code CHAR(4),iata_code CHAR(4),local_code CHAR(4),home_link VARCHAR(255),wikipedia_link VARCHAR(255),keywords VARCHAR(50));
CREATE TABLE IF NOT EXISTS region(id INTEGER PRIMARY KEY,code CHAR(6),local_code CHAR(4),name VARCHAR(255),continent CHAR(2),iso_country CHAR(2),wikipedia_link VARCHAR(255),keywords VARCHAR(50));
CREATE TABLE IF NOT EXISTS country(id INTEGER PRIMARY KEY,code CHAR(2),name VARCHAR(50),continent CHAR(2),wikipedia_link VARCHAR(255),keywords VARCHAR(50));

Mappings

Next we will create the mappings. This is more convenience than necessary, as you can specify a mapping using the command line argument -m. On the other hand, creating a mapping using sql has the advantage that you can specify not only a csv file pattern, but also a field separator (default is ,) and if you want to skip the first line of the csv file because it is a header line (default is false). If you create a mapping this way, it will also be stored in the .csvdb/mappings directory and can be used for the next csvsqldb invocations without specifying mappings. You can use regular expressions in the mappings.

CREATE MAPPING airport("airports\d*.csv",',',true);
CREATE MAPPING region("regions.csv",',',true);
CREATE MAPPING country("countries.csv",',',true);

Inspect

Now that we have created our schema, we can use the interactive mode in order to inspect, what metadata artifacts are available. Start the interactive mode by calling csvsqldb -i.

sql>

sql> help
help - this help
quit|exit - quit shell
version - show version
verbose ([on|off]) - show verbosity or switch it on/off
database - show the database path
clear history - clear all history entries
show [tables|mappings|columns <tablename>|functions|files] - show db objects
add file <path to csv file> - add the specified csv file for processing
<sql command> - execute the sql

sql> show tables
SYSTEM_DUAL
AIRPORT
COUNTRY
REGION

sql> show columns region
ID : INTEGER
CODE : VARCHAR
LOCAL_CODE : VARCHAR
NAME : VARCHAR
CONTINENT : VARCHAR
ISO_COUNTRY : VARCHAR
WIKIPEDIA_LINK : VARCHAR
KEYWORDS : VARCHAR

You get the idea. By the way the SYSTEM_DUAL table is a system table that only has one boolean column x that has exactly one row containing the value 0. This table is especially handy if you want to do some testing or arithmetic with sql, as csvsqldb does not allow to skip the FROM clause.

Add data

In order to execute some queries in interactive mode, we need some data sources (aka csv files). I have put the three above mentioned csv files in the directory ~/Downloads/csvsqldb_example. Now I can add them for processing in interactive mode using:

sql> add file ~/Downloads/csvsqldb_example/*.csv
added 3 new csv files for processing

sql> show files
csv files for processing:
/Users/lfg/Downloads/csvsqldb_example/airports.csv
/Users/lfg/Downloads/csvsqldb_example/countries.csv
/Users/lfg/Downloads/csvsqldb_example/regions.csv

Paths to files will not be stored in the metadata directory, so you have to add them every time you use the interactive mode. Alternatively you can specify them on the command line when you invoke csvsqldb:

csvsqldb -i ~/Downloads/csvsqldb_example/*.csv

Queries

Now that all is setup, lets do some queries on the data.

Easy stuff

sql> select count(*) from airport
#$alias_1
46741

sql> select count(*) as "COUNT" from airport
#COUNT
46741

Grouping and Ordering

sql> select count(*) as "COUNT", iso_region from airport where iso_region like 'DE%' group by iso_region order by "COUNT"
#COUNT,ISO_REGION
2,'DE-HH'
2,'DE-U-A'
3,'DE-HB'
5,'DE-SL'
5,'DE-BE'
22,'DE-ST'
24,'DE-TH'
24,'DE-MV'
25,'DE-SN'
30,'DE-SH'
42,'DE-RP'
49,'DE-BR'
78,'DE-HE'
89,'DE-BW'
95,'DE-NW'
103,'DE-NI'
170,'DE-BY'

Joining

sql> select ident,airport.name,municipality,region.name,country.name from airport join region on airport.iso_region = region.code join country on region.iso_country = country.code where airport.name like '%Heliport%' order by region.name,airport.name limit 10
#IDENT,AIRPORT.NAME,MUNICIPALITY,REGION.NAME,COUNTRY.NAME
'BGAQ','Aappilattoq (Kujalleq) Heliport','Nanortalik','(unassigned)','Greenland'
'BGAG','Aappilattoq (Qaasuitsup) Heliport','Qaasuitsup','(unassigned)','Greenland'
'GL-QCU','Akunaq Heliport',NULL,'(unassigned)','Greenland'
'BGAP','Alluitsup Paa Heliport','Alluitsup Paa','(unassigned)','Greenland'
'BGAS','Ammassivik Heliport',NULL,'(unassigned)','Greenland'
'BGAR','Arsuk Heliport',NULL,'(unassigned)','Greenland'
'BGAT','Attu Heliport','Attu','(unassigned)','Greenland'
'PR16','Banco Popular Center Heliport','San Juan','(unassigned)','Puerto Rico'
'PR23','Baxter-Aibonito Heliport','Aibonito','(unassigned)','Puerto Rico'
'PR08','Baxter-San German Heliport','San German','(unassigned)','Puerto Rico'

As command line tool

Doing a query in interactive mode is only one way to execute sql on csv files. You can also use the command line directly. Hereby you have to supply the sql, an optional mapping and the corresponding csv files.

csvsqldb --sql="select country.name from airport join region on airport.iso_region = region.code join country on region.iso_country = country.code group by country.name order by country.name limit 10" ~/Downloads/csvsqldb_example/airports.csv ~/Downloads/csvsqldb_example/regions.csv ~/Downloads/csvsqldb_example/countries.csv
#COUNTRY.NAME
'Afghanistan'
'Albania'
'Algeria'
'American Samoa'
'Andorra'
'Angola'
'Anguilla'
'Antarctica'
'Antigua and Barbuda'
'Argentina'

And with mappings specified, where the mapping is <csv_file_pattern> -> <table_name>:

csvsqldb --sql="select count(*) as \"COUNT\", iso_region from airport where iso_region like 'DE%' group by iso_region order by \"COUNT\"" --mapping="airports.csv->airport;countries.csv->country;regions.csv->region" ~/Downloads/csvsqldb_example/airports.csv ~/Downloads/csvsqldb_example/regions.csv ~/Downloads/csvsqldb_example/countries.csv

Clone this wiki locally