MySQL and Databases Workshop
You will need a SQL client installed. Here are some suggestions:
- MySQL Workbench: http://dev.mysql.com/downloads/tools/workbench/
- Works well, but is overly complicated and somewhat confusing. The official client for MySQL
- Navicat: http://www.navicat.com/download/navicat-for-mysql
- Free trial; works for 30 days. Very nice (what I use), but the full version is $75-$95
- Sequel Pro: http://www.sequelpro.com/
- Free, but mac only
To connect, you will need the following information:
- Username: Your username
- Password: Your password
Everyone should already have some data in their database. This data is from the website Donors Choose. Please remember to treat this data confidentially. The structure of the data is available in this graphical schema. More information about the data is available from the Donors Choose Hacking Education documentation website.
To get information out of a database table, use the
SELECT id, grade_level, resource_type, total_price_excluding_optional_support, students_reached FROM projects
Notice that you have to specify which columns you want returned. You can specify any set of columns:
SELECT id, project_id FROM projects
Columns are returned in the order you ask for them:
SELECT project_id, id, FROM projects
You can also use
* as a wildcard to select all columns:
SELECT * from projects
You can ask MySQL to sort your results by using an
ORDER BY clause:
SELECT id, state, metro, county FROM schools ORDER BY state
You can sort by more than one thing:
SELECT id, state, metro, county FROM schools ORDER BY state, county
You can also reverse the ordering:
SELECT id, state, metro, county FROM schools ORDER BY state DESC
You can also order them randomly:
SELECT id, state, metro, county FROM schools ORDER BY RAND()
You can limit the number of results with the
LIMIT clause. This is really useful to make sure you are getting the
results you want before running a query on the full dataset.
SELECT id, state, metro, county FROM schools LIMIT 10
Limit is particularly useful when combined with the
ORDER BY clause:
SELECT id, state, metro, county FROM schools ORDER BY state LIMIT 10
ORDER BY can also be used to pick a random school:
SELECT id, state, metro, county FROM schools ORDER BY RAND() LIMIT 1
MySQL has a number of functions that allow you to calculate information. There are many of them, but the really useful
COUNT(). More information on these functions is
available on the MySQL documentation page: http://dev.mysql.com/doc/refman/5.0/en/group-by-functions.html
This query determines the largest donation in our dataset:
SELECT max(base_amount) FROM donations
This query gives both the largest and the smallest, and the average donation:
SELECT max(base_amount), min(base_amount), avg(base_amount) FROM donations
We can also count how many donations are in our dataset:
SELECT count(*) FROM donations
Try to answer the following two questions:
- What type of project (books, technology, etc.) asked for the largest amount of money? What grade level was it for?
- What is the total amount of money donated in our dataset?
Select subsets of the whole dataset
You don’t always have to work with the whole table at once. You can request subsets of the data using a
SELECT * FROM schools WHERE state = 'MI' SELECT id, project_id, base_amount FROM donations WHERE payment_method = 'paypal' AND base_amount < 100 SELECT id, project_id, base_amount FROM donations WHERE payment_method = 'paypal' OR payment_method = 'check'
You can use wildcards to do basic searches. The
% character serves as a wildcard, and you have to use
LIKE to match
SELECT * FROM schools WHERE state LIKE 'M%'
Calculations with WHERE
WHERE clause is particularly useful when combined with functions that calculate. This allows you to ask much more
interesting questions. For example, do people donate more when paying with credit cards or with paypal?
SELECT AVG(base_amount) FROM donations WHERE payment_method = 'credit_card' SELECT AVG(base_amount) FROM donations WHERE payment_method = 'paypal'
The way to think about this is the
WHERE clause always happens first. So, MySQL first finds all of the rows where the
WHERE clause is true, and then is does the calculation on the resulting rows.
Rather than writing a separate query for each payment method, SQL allows us to group the rows of the table and perform calculations on them.
SELECT payment_method, AVG(base_amount) FROM donations GROUP BY payment_method
You can still combine
GROUP BY with any of the other clauses.
SELECT payment_method, AVG(base_amount) FROM donations GROUP BY payment_method ORDER BY payment_method LIMIT 2
You can also give columns new names using an
AS clause. This allows you to order by the results of a calculation:
SELECT payment_method, AVG(base_amount) AS donation_size FROM donations GROUP BY payment_method ORDER BY donation_size DESC
Write a query to answer the following questions:
- Do high poverty projects ask for more money than low poverty projects?
- (note, the field that contains the amount of money asked for is
- (note, the field that contains the amount of money asked for is
- Are there more high-poverty projects than low-poverty projects?
Joining Multiple Tables Together
So far, everything we’ve done can also be done with Excel. However, one thing excel can’t do but SQL is good at is connecting multiple tables together. To do this, we do a join, which in SQL is so common that it is just a comma.
SELECT * FROM projects, donations
Notice that this query is really large. It takes every row from the projects table (500 rows), and combines it with
every possible row from the donations table (2328 rows). It makes every possible combination of rows (500 times 2328 =
1,164,000 rows!). Most of these, however, don’t have a lot of meaning. So normally, we limit these results using a
WHERE clause. In this case, each from of the donations table has a
project_id field, that specifies which project
the donation was for. So we can use a
WHERE clause to limit to just the interesting combinations:
SELECT * FROM projects, donations WHERE projects.id = donations.project_id
To illustrate why this is so useful, lets try to use a
GROUP BY and the
AVG() functino that we learned earlier to
answer an interesting question: Do people donate more money when donating to a high-poverty school?
This isn’t easy to answer using Excel style analysis on individual tables. The projects table has the information for
whether each project is at a high-poverty school or not. However, the donations table has the information about
people’s donations. So we can’t answer this question just by looking at a single table. To answer it, we need to join
the two tables together. We start by writing out basic join query, and using a
LIMIT clause to make sure it looks
SELECT * FROM projects, donations WHERE projects.id = donations.project_id LIMIT 50
Next, we can add the
GROUP BY and
SELECT projects.poverty_level, AVG(donations.base_amount) FROM projects, donations WHERE projects.id = donations.project_id GROUP BY projects.poverty_level
And, that provides our answer!
Write a query to answer the following questions:
- Which focus subject (i.e., subject area of project) asks for the most money?
focus_subjectstable has the list of focus subjects – types of projects as determined by the requester
primary_focus_subjectis a field in the projects table that links to the focus_subjects table
- Schools in which state get the largest donations on average?
- Hint: Use the schema to figure out which tables and fields to join
- Hint 2: You need to join 3 tables together to find this answer
MySQL has some extra facilities for handling dates in a database. For querying, they work just fine in
ORDER BY statements:
SELECT id, base_amount, timestamp FROM donations ORDER BY timestamp
You can also use functions like
DATE() to extract only the date part of the data, or
MONTH() to get just the month:
SELECT id, base_amount, DATE(timestamp), MONTH(timestamp), DAY(timestamp), YEAR(timestamp) FROM donations ORDER BY timestamp
You can use these functions in a
WHERE clause. For example, this query just gets donations in the afternoon:
SELECT id, base_amount FROM donations WHERE HOUR(timestamp) > 12 and HOUR(timestamp) < 17
You can also use these functions in
GROUP BY clauses. For example, the results from this query are actually reported
in a paper I published this summer:
SELECT YEAR(timestamp), AVG(base_amount), COUNT(*) FROM donations GROUP BY YEAR(timestamp)
Are donations seasonal? That is, are there more donations at some times of the year (like the beginning of a school year) than others? Do people make larger donations at some times of the year?
MONTH()is a great way to look at seasonality
Creating Database Tables
The easiest way to figure out the syntax for creating a table is to ask MySQL for help by asking it to show you the syntax for an existing table:
SHOW CREATE TABLE projects
This displays a valid
CREATE TABLE command. It also includes lots of optional information; your create table commands
don’t have to include everything that this one does. Let’s take it apart next.
First, we need to give the table a name. This is specified right after
CREATE TABLE. Then we have to specify what
the structure of that table will look like. The structure is everything between the parentheses. The basic
TABLE command looks like this:
CREATE TABLE table_name ( ... )
Inside the parentheses goes a comma separated list of columns that will appear in that table. For each column, you MUST specify what type of data it will contain. For example, here is a really simple myprojects table:
CREATE TABLE myprojects ( id INTEGER, name VARCHAR(500), amount DECIMAL )
It is always a good idea to specify a
PRIMARY KEY in every database table. The primary key is the column that
uniquely identifies each row of data in the table. I almost always add an
id column to every table that I’m
working with so each row can have a unique identifying number. Then I specify that row as a
CREATE TABLE myprojects ( id INTEGER, name VARCHAR(500), amount DECIMAL, PRIMARY KEY (id) )
MySQL has a nice feature that makes working with
id columns easier. It can automatically assign a new number in the
id field by specifying the field as
CREATE TABLE myprojects ( id INTEGER AUTO_INCREMENT, name VARCHAR(500), amount DECIMAL, PRIMARY KEY (id) )
By default, all columns of a database are allowed to contain blanks – called
NULLs. (Exception: columns specified as
PRIMARY KEY cannot contain
NULLs because they must be unique.) If you want to ask MySQL to force every row to
have a value in that column, you can specify the column at
NOT NULL. It is often a good idea to do this for important
columns; that way, MySQL will tell you if your data has an error in it.
CREATE TABLE myprojects ( id INTEGER AUTO_INCREMENT, name VARCHAR(500) NOT NULL, amount DECIMAL, PRIMARY KEY (id) )
You can also specify a default value for any column, which gets used if someone tries to add a row of data but doesn’t
specify what should go in that column. Normally, if you don’t specify some data, the default it
NULL, but using
DEFAULT clause can change that:
CREATE TABLE myprojects ( id INTEGER AUTO_INCREMENT, name VARCHAR(500) NOT NULL, amount DECIMAL DEFAULT 0.0, PRIMARY KEY (id) )
There are a number of common types of columns we see in databases. Full documentation is available on MySQL’s website. There are many more types than what is in the table below; but these are the most commonly used types.
||A whole number between -2 billion and 2 billion||
||A text string no more than
||A text string of arbitrary length||The text of Moby Dick|
||True or False||
||A number including a decimal point||
||A month-day-year specified date||
||An hour-minute-second specified time of day||
||A point in time, containing both date and time||
||One of a small set of possible choices||x|
Note, the date and time is always stored in the time zone that the MySQL server is running in.
Let’s get our database ready for some data from the US Census. The Census provides data about every state and about every zipcode in the US. Let’s create a table that will hold the data about states, and in particular, data about gender, age, and socio-economic status of each state. A lot of that information is available on the US Census website: http://quickfacts.census.gov/qfd/index.html
CREATE TABLE statement that creates a database table that stores census data about states.
To insert data into the database, you can use an
INSERT query. You need to specify three things: the table the data
should go in to, the columns of data you want to specify, and the actual data itself.
INSERT INTO myprojects (id, name, amount) VALUES (12, 'Big Data Research', 100000.00)
I usually don’t specify the
AUTO_UPDATE column, and let the database set it itself:
INSERT INTO myprojects (name, amount) VALUES ('Ricks Research', 500000.12)
If we don’t specify the amount, it will be entered as the default of 0.0:
INSERT INTO myprojects (name) VALUES ('Ricks Awesome Work')
You can specify the columns in any order you want. However, the data has to appear in the same order.
If you want to change or edit existing data in the database, you can always do it manually in the SQL client softwre you
have been using. However, sometimes you want to edit large quantities of data at once. To do that, you can use an
update query. For example, if you want to update all rows of the
myprojects table so that the amount is specified in
hundreds of dollars instead of just dollars, you can use this query:
UPDATE myprojects SET amount = amount / 100
You can also specify just the rows that you want to update using a
WHERE clause. The syntax is the same as for
UPDATE myprojects SET amount = amount + 1000 WHERE name LIKE 'Rick%'
Of course, you can use this same syntax to update just one entry
UPDATE myprojects SET amount = 100 WHERE id = 12
You can delete data using a
DELETE statement. To delete all of the rows in a table, just do
If you want to delete just some of the rows, use a
DELETE myprojects WHERE amount > 1000
You can eliminate a whole table with a
DROP TABLE myprojects;
Handy Tools for moving data around
MySQL can automatically create tables for you if it the source data is coming from another MySQL database. In that
case, you can tell it to automatically create a table and populate it with a
CREATE TABLE big_projects SELECT id, poverty_level, grade_level, total_price_excluding_optional_support AS amount FROM projects WHERE amount > 1000
You can also tell it to add a new, unique ID column:
CREATE TABLE big_projects (id INTEGER AUTO_INCREMENT PRIMARY KEY) SELECT poverty_level, grade_level, total_price_excluding_optional_support AS amount FROM projects WHERE total_price_excluding_optional_support > 1000
Likewise, you can use an
INSERT statement with a
INSERT INTO big_projects (poverty_level, grade_level, amount) SELECT poverty_level, grade_level, total_price_excluding_optional_support AS amount FROM project where total_price_excluding_optional_support > 500
Choose 3 states, and write manually write insert statements to insert data from the US Census into the database. The data is available on the US census website: http://quickfacts.census.gov/qfd/index.html
Once you’ve got your database structure down, it is a good idea to create indexes. Indexes help the database execute
queries faster. Generally, you want to create an index on all fields that will be used in
JOIN clauses. (Primary
keys automatically have an index, so you don’t need to manually create those.) You also want an index on any field that
is used in a
WHERE or a
GROUP BY clause and the query is really slow.
CREATE INDEX amount ON myprojects (amount)
To create these tables, I first worked on writing a big
SELECT query that would give me all of the information I
needed. It usually took a couple of
JOINs and some
GROUP BYs to get it working. Once I had the
working, I added a
CREATE TABLE <tablename> at the front of the query, and it saved the results of that query as a new
table. Now I had my tables for my research.
One other tool that I used in SQL was call a subquery. It turns out that you can write a
SELECT query, and use the
results of that query as if it were a table on its own. For example, consider this query to get the average number of
donors for each project:
SELECT AVG(sub.donors) FROM projects, (SELECT donations.project_id, count(donors.id) AS donors FROM donations, donors WHERE donors.id = donations.donor_id GROUP BY donations.project_id) AS sub WHERE projects.id = sub.project_id