BITlab: Behavior Information Technology

BITlab
404 Wilson Rd. Room 251
Communication Arts & Sciences
Michigan State University
East Lansing, MI 48824

MySQL and Databases Workshop

Important information

You will need a SQL client installed. Here are some suggestions:

To connect, you will need the following information:

Everyone should already have some data in their database. This data is from the website Donors Choose. Please remember to treat this data confidentially. The structure of the data is available in this graphical schema. More information about the data is available from the Donors Choose Hacking Education documentation website.

There are some good SQL cheat sheets here, here, here, and here.

Selecting

To get information out of a database table, use the select query:

SELECT id, grade_level, resource_type, total_price_excluding_optional_support, students_reached FROM projects

Notice that you have to specify which columns you want returned. You can specify any set of columns:

SELECT id, project_id FROM projects

Columns are returned in the order you ask for them:

SELECT project_id, id, FROM projects

You can also use * as a wildcard to select all columns:

SELECT * from projects

Sorting

You can ask MySQL to sort your results by using an ORDER BY clause:

SELECT id, state, metro, county FROM schools ORDER BY state

You can sort by more than one thing:

SELECT id, state, metro, county FROM schools ORDER BY state, county

You can also reverse the ordering:

SELECT id, state, metro, county FROM schools ORDER BY state DESC

You can also order them randomly:

SELECT id, state, metro, county FROM schools ORDER BY RAND()

Limiting results

You can limit the number of results with the LIMIT clause. This is really useful to make sure you are getting the results you want before running a query on the full dataset.

SELECT id, state, metro, county FROM schools LIMIT 10

Limit is particularly useful when combined with the ORDER BY clause:

SELECT id, state, metro, county FROM schools ORDER BY state LIMIT 10

Combining LIMIT with ORDER BY can also be used to pick a random school:

SELECT id, state, metro, county FROM schools ORDER BY RAND() LIMIT 1

Calculations

MySQL has a number of functions that allow you to calculate information. There are many of them, but the really useful ones are MAX(), MIN(), SUM(), AVG(), STDDEV(), and COUNT(). More information on these functions is available on the MySQL documentation page: http://dev.mysql.com/doc/refman/5.0/en/group-by-functions.html

This query determines the largest donation in our dataset:

SELECT max(base_amount) FROM donations

This query gives both the largest and the smallest, and the average donation:

SELECT max(base_amount), min(base_amount), avg(base_amount) FROM donations

We can also count how many donations are in our dataset:

SELECT count(*) FROM donations

Exercise

Try to answer the following two questions:

  1. What type of project (books, technology, etc.) asked for the largest amount of money? What grade level was it for?
  2. What is the total amount of money donated in our dataset?

Select subsets of the whole dataset

You don’t always have to work with the whole table at once. You can request subsets of the data using a WHERE clause:

SELECT * FROM schools WHERE state = 'MI'
SELECT id, project_id, base_amount FROM donations WHERE payment_method = 'paypal' AND base_amount < 100
SELECT id, project_id, base_amount FROM donations WHERE payment_method = 'paypal' OR payment_method = 'check'

You can use wildcards to do basic searches. The % character serves as a wildcard, and you have to use LIKE to match wildcards:

SELECT * FROM schools WHERE state LIKE 'M%'

Calculations with WHERE

The WHERE clause is particularly useful when combined with functions that calculate. This allows you to ask much more interesting questions. For example, do people donate more when paying with credit cards or with paypal?

SELECT AVG(base_amount) FROM donations WHERE payment_method = 'credit_card'
SELECT AVG(base_amount) FROM donations WHERE payment_method = 'paypal'

The way to think about this is the WHERE clause always happens first. So, MySQL first finds all of the rows where the WHERE clause is true, and then is does the calculation on the resulting rows.

GROUP BY

Rather than writing a separate query for each payment method, SQL allows us to group the rows of the table and perform calculations on them.

SELECT payment_method, AVG(base_amount) FROM donations GROUP BY payment_method

You can still combine GROUP BY with any of the other clauses.

SELECT payment_method, AVG(base_amount) FROM donations GROUP BY payment_method ORDER BY payment_method LIMIT 2

You can also give columns new names using an AS clause. This allows you to order by the results of a calculation:

SELECT payment_method, AVG(base_amount) AS donation_size FROM donations GROUP BY payment_method ORDER BY donation_size DESC

Exercise!

Write a query to answer the following questions:

Joining Multiple Tables Together

So far, everything we’ve done can also be done with Excel. However, one thing excel can’t do but SQL is good at is connecting multiple tables together. To do this, we do a join, which in SQL is so common that it is just a comma.

SELECT * FROM projects, donations

Notice that this query is really large. It takes every row from the projects table (500 rows), and combines it with every possible row from the donations table (2328 rows). It makes every possible combination of rows (500 times 2328 = 1,164,000 rows!). Most of these, however, don’t have a lot of meaning. So normally, we limit these results using a WHERE clause. In this case, each from of the donations table has a project_id field, that specifies which project the donation was for. So we can use a WHERE clause to limit to just the interesting combinations:

SELECT * FROM projects, donations WHERE projects.id = donations.project_id

To illustrate why this is so useful, lets try to use a GROUP BY and the AVG() functino that we learned earlier to answer an interesting question: Do people donate more money when donating to a high-poverty school?

This isn’t easy to answer using Excel style analysis on individual tables. The projects table has the information for whether each project is at a high-poverty school or not. However, the donations table has the information about people’s donations. So we can’t answer this question just by looking at a single table. To answer it, we need to join the two tables together. We start by writing out basic join query, and using a LIMIT clause to make sure it looks right:

SELECT * FROM projects, donations WHERE projects.id = donations.project_id LIMIT 50

Next, we can add the GROUP BY and AVG() clauses:

SELECT projects.poverty_level, AVG(donations.base_amount) FROM projects, donations WHERE projects.id = donations.project_id GROUP BY projects.poverty_level

And, that provides our answer!

Exercise!

Write a query to answer the following questions:

Dates

MySQL has some extra facilities for handling dates in a database. For querying, they work just fine in SELECT and ORDER BY statements:

SELECT id, base_amount, timestamp FROM donations ORDER BY timestamp

You can also use functions like DATE() to extract only the date part of the data, or MONTH() to get just the month:

SELECT id, base_amount, DATE(timestamp), MONTH(timestamp), DAY(timestamp), YEAR(timestamp) FROM donations ORDER BY timestamp

You can use these functions in a WHERE clause. For example, this query just gets donations in the afternoon:

SELECT id, base_amount FROM donations WHERE HOUR(timestamp) > 12 and HOUR(timestamp) < 17

You can also use these functions in GROUP BY clauses. For example, the results from this query are actually reported in a paper I published this summer:

SELECT YEAR(timestamp), AVG(base_amount), COUNT(*) FROM donations GROUP BY YEAR(timestamp)

Exercise!

Are donations seasonal? That is, are there more donations at some times of the year (like the beginning of a school year) than others? Do people make larger donations at some times of the year?

Creating Database Tables

The easiest way to figure out the syntax for creating a table is to ask MySQL for help by asking it to show you the syntax for an existing table:

SHOW CREATE TABLE projects

This displays a valid CREATE TABLE command. It also includes lots of optional information; your create table commands don’t have to include everything that this one does. Let’s take it apart next.

First, we need to give the table a name. This is specified right after CREATE TABLE. Then we have to specify what the structure of that table will look like. The structure is everything between the parentheses. The basic CREATE TABLE command looks like this:

CREATE TABLE table_name ( ... )

Inside the parentheses goes a comma separated list of columns that will appear in that table. For each column, you MUST specify what type of data it will contain. For example, here is a really simple myprojects table:

CREATE TABLE myprojects (
  id INTEGER,
  name VARCHAR(500),
  amount DECIMAL
)

It is always a good idea to specify a PRIMARY KEY in every database table. The primary key is the column that uniquely identifies each row of data in the table. I almost always add an INTEGER id column to every table that I’m working with so each row can have a unique identifying number. Then I specify that row as a PRIMARY KEY:

CREATE TABLE myprojects (
  id INTEGER,
  name VARCHAR(500),
  amount DECIMAL,
  PRIMARY KEY (id)
)

MySQL has a nice feature that makes working with id columns easier. It can automatically assign a new number in the id field by specifying the field as AUTO_INCREMENT:

CREATE TABLE myprojects (
  id INTEGER AUTO_INCREMENT,
  name VARCHAR(500),
  amount DECIMAL,
  PRIMARY KEY (id)
)

By default, all columns of a database are allowed to contain blanks – called NULLs. (Exception: columns specified as a PRIMARY KEY cannot contain NULLs because they must be unique.) If you want to ask MySQL to force every row to have a value in that column, you can specify the column at NOT NULL. It is often a good idea to do this for important columns; that way, MySQL will tell you if your data has an error in it.

CREATE TABLE myprojects (
  id INTEGER AUTO_INCREMENT,
  name VARCHAR(500) NOT NULL,
  amount DECIMAL,
  PRIMARY KEY (id)
)

You can also specify a default value for any column, which gets used if someone tries to add a row of data but doesn’t specify what should go in that column. Normally, if you don’t specify some data, the default it NULL, but using DEFAULT clause can change that:

CREATE TABLE myprojects (
  id INTEGER AUTO_INCREMENT,
  name VARCHAR(500) NOT NULL,
  amount DECIMAL DEFAULT 0.0,
  PRIMARY KEY (id)
)

Column Types

There are a number of common types of columns we see in databases. Full documentation is available on MySQL’s website. There are many more types than what is in the table below; but these are the most commonly used types.

Type Description Example
INTEGER A whole number between -2 billion and 2 billion 153
VARCHAR(n) A text string no more than n letters long Hello, World
TEXT A text string of arbitrary length The text of Moby Dick
BOOLEAN True or False True
DECIMAL A number including a decimal point 100.20
DATE A month-day-year specified date 2013-09-12
TIME An hour-minute-second specified time of day 16:56:12
DATETIME A point in time, containing both date and time 2007-09-11 10:53:03
ENUM(x,y,z) One of a small set of possible choices x

Note, the date and time is always stored in the time zone that the MySQL server is running in.

Exercise

Let’s get our database ready for some data from the US Census. The Census provides data about every state and about every zipcode in the US. Let’s create a table that will hold the data about states, and in particular, data about gender, age, and socio-economic status of each state. A lot of that information is available on the US Census website: http://quickfacts.census.gov/qfd/index.html

Write a CREATE TABLE statement that creates a database table that stores census data about states.

Inserting Data

To insert data into the database, you can use an INSERT query. You need to specify three things: the table the data should go in to, the columns of data you want to specify, and the actual data itself.

INSERT INTO myprojects (id, name, amount) VALUES (12, 'Big Data Research', 100000.00)

I usually don’t specify the AUTO_UPDATE column, and let the database set it itself:

INSERT INTO myprojects (name, amount) VALUES ('Ricks Research', 500000.12)

If we don’t specify the amount, it will be entered as the default of 0.0:

INSERT INTO myprojects (name) VALUES ('Ricks Awesome Work')

You can specify the columns in any order you want. However, the data has to appear in the same order.

Updating Data

If you want to change or edit existing data in the database, you can always do it manually in the SQL client softwre you have been using. However, sometimes you want to edit large quantities of data at once. To do that, you can use an update query. For example, if you want to update all rows of the myprojects table so that the amount is specified in hundreds of dollars instead of just dollars, you can use this query:

UPDATE myprojects SET amount = amount / 100

You can also specify just the rows that you want to update using a WHERE clause. The syntax is the same as for SELECT queries:

UPDATE myprojects SET amount = amount + 1000 WHERE name LIKE 'Rick%'

Of course, you can use this same syntax to update just one entry

UPDATE myprojects SET amount = 100 WHERE id = 12

Deleting Data

You can delete data using a DELETE statement. To delete all of the rows in a table, just do

DELETE myprojects

If you want to delete just some of the rows, use a WHERE clause:

DELETE myprojects WHERE amount > 1000

You can eliminate a whole table with a DROP statement:

DROP TABLE myprojects;

Handy Tools for moving data around

MySQL can automatically create tables for you if it the source data is coming from another MySQL database. In that case, you can tell it to automatically create a table and populate it with a SELECT query:

CREATE TABLE big_projects SELECT id, poverty_level, grade_level, total_price_excluding_optional_support AS amount FROM projects WHERE amount > 1000

You can also tell it to add a new, unique ID column:

CREATE TABLE big_projects (id INTEGER AUTO_INCREMENT PRIMARY KEY) SELECT poverty_level, grade_level, total_price_excluding_optional_support AS amount FROM projects WHERE total_price_excluding_optional_support > 1000

Likewise, you can use an INSERT statement with a SELECT statement:

INSERT INTO big_projects (poverty_level, grade_level, amount) SELECT poverty_level, grade_level, total_price_excluding_optional_support AS amount FROM project where total_price_excluding_optional_support > 500

Exercise

Choose 3 states, and write manually write insert statements to insert data from the US Census into the database. The data is available on the US census website: http://quickfacts.census.gov/qfd/index.html

Indexes

Once you’ve got your database structure down, it is a good idea to create indexes. Indexes help the database execute queries faster. Generally, you want to create an index on all fields that will be used in JOIN clauses. (Primary keys automatically have an index, so you don’t need to manually create those.) You also want an index on any field that is used in a WHERE or a GROUP BY clause and the query is really slow.

CREATE INDEX amount ON myprojects (amount)

Additional Notes

To create these tables, I first worked on writing a big SELECT query that would give me all of the information I needed. It usually took a couple of JOINs and some GROUP BYs to get it working. Once I had the SELECT query working, I added a CREATE TABLE <tablename> at the front of the query, and it saved the results of that query as a new table. Now I had my tables for my research.

One other tool that I used in SQL was call a subquery. It turns out that you can write a SELECT query, and use the results of that query as if it were a table on its own. For example, consider this query to get the average number of donors for each project:

SELECT AVG(sub.donors)
FROM projects, 
     (SELECT donations.project_id, count(donors.id) AS donors FROM donations, donors 
          WHERE donors.id = donations.donor_id GROUP BY donations.project_id) AS sub
WHERE projects.id = sub.project_id