by Katja Seltmann with excerpts from Software Carpentry SQLite lesson.
Supplementary Material:
Goal of this lesson
A relational database is a way to store and manipulate information. Databases are arranged as tables. Each table has columns (also known as fields) that describe the data, and rows (also known as records) which contain the data.
When we are using a spreadsheet, we put formulas into cells to calculate new values based on old ones. When we are using a database, we send commands (usually called queries) to a database manager: a program that manipulates the database for us. The database manager does whatever lookups and calculations the query specifies, returning the results in a tabular form that we can then use as a starting point for further queries.
It is a database manager. Every database manager — Oracle, IBM DB2, PostgreSQL, MySQL, Microsoft Access, and SQLite — stores data in a different way, so a database created with one cannot be used directly by another. However, every database manager can import and export data in a variety of formats, like .csv, so it is possible to move information from one to another.
Both relational databases and spreadsheets are used to store and manipulate data.
relational DB | spreadsheets |
---|---|
handle more data | slow down quickly |
develop open science practices | difficult to follow methods |
easy to version | easy to loose data |
We are going to learn the basics of SQL using SQLite using data frames. You can think of a data frame as if they were tables in a relational database. We can do this with sqldf package.
sqldf is a powerful R package that allows 1) the use of SQLite syntax to be used directly on data frames, and 2) can create mini-SQLlite databases.
Here’s how to install sqldf:
install.packages("sqldf", dependencies = TRUE)
library("sqldf")
data
For this section, let’s first start by dealing with our data and folders:
The data is found here
createing a new folder for our sqldf lesson
cp the mammal_stats.csv file into the sqldf folder
create a new R file called sqldf-lesson.R and save it into the sqldf folder
commit your changes to github!
Clear that working memory…
rm(list = ls())
Check your working directory, make a new folder called sqldf
setwd("~/Desktop/software-carpentry-2016/data-files/sqldf")
getwd()
First, read in the data frame as we did before using read.csv
mammals <- read.csv("mammal_stats.csv", header=TRUE)
head(mammals)
TIP: header: logical: should the data frame use the first row as headers?
R gives you lots of ways to look at your data frame.
head(mammals)
tail(mammals)
ncol(mammals)
View(mammals)
SQLite gives you more ways with Select. Select statements using SQLite * indicates selecting all columns.
sqldf("select * from mammals limit 10")
Select distinct values in rows.
sqldf("select distinct `order` from mammals")
sqldf("select distinct `order`,species from mammals")
TIP: The word order is a column name, but it is also a reserved word in SQL. Put column names in `` to avoid confusion.
Select using filters, limits and ordering
sqldf("select * from mammals where `order`='Carnivora'")
sqldf("select `order`, species from mammals where `order`='Carnivora'")
Select using limits
sqldf("select * from mammals where `order`='Carnivora' limit 3")
Order using acending or decending values
sqldf("select * from mammals where `order`='Carnivora' order by `adult_body_mass_g` desc limit 10")
sqldf("select * from mammals where `order`='Carnivora' order by `order` asc, litter_size desc limit 10")
TIP: The opposite of desc is asc
Select based on wildcard searching.
sqldf("select * from mammals where species like 'Canis%'")
Select a distinct list of values from a column
sqldf("select distinct `order` as taxonOrder from mammals")
Select all where litter_size is not NA
sqldf("select * from mammals where litter_size is not null limit 10")
or
sqldf("select * from mammals where litter_size != 'N/A' limit 10")
Exercise 1: Select unique species with litter_size less than 1
Save your output to a new data frame
mammalsEdited <- sqldf("select `order` as taxonOrder, species, adult_body_mass_g as mass from mammals")
head(mammalsEdited)
Concatination
sqldf("select taxonOrder || '-' || species as name from mammalsEdited limit 10")
Remove white space
taxonString <- sqldf("select species, taxonOrder || '-' || replace(species,' ','-') as name from mammalsEdited limit 10")
head(taxonString)
Counting using SQLite by Groups and then making quick and simple barplots
numberSpecies <- sqldf("select count(species) as cnt,taxonOrder from mammalsEdited group by taxonOrder order by cnt desc")
head(numberSpecies)
par(las=2) # make label text perpendicular to axis
par(mar=c(8,8,3,2)) # increase y-axis margin
barplot(log(numberSpecies$cnt), names.arg=numberSpecies$taxonOrder)
Finding maximum and minimum
sqldf("select max(adult_body_mass_g) from mammals")
sqldf("select min(adult_body_mass_g) from mammals")
sqldf("select * from mammals where adult_body_mass_g = (select max(adult_body_mass_g) from mammals)")
SQLite has a lot of built in functions that can help with processing numbers or text.
Some particularly helpful ones are: trim(), upper(), round(), and random()
Exercise 2: Use the upper() function to output the taxonOrder names in mammalsEdited all in uppercase. Write the output to a new data frame.