Tentative schedule

1. Introduction (10’)

What is OpenRefine useful for?

  • Free, open source tool for data cleaning.

  • Makes challenging cleaning tasks easy.

  • Tracks history automatically.

2. Working with OpenRefine (35’)

  • How can we bring our data into OpenRefine?
  • How can we sort and summarize our data?
  • How can we find and correct errors in our raw data?

Create a project

Import Portal_rodents_19772002_scinameUUIDs.csv

Play with preview

Create a Text facet on scientificName

Facet > Text facet

  1. Sort by name and by count. Can you spot any problem?
  2. Hover over names to reveal the edit function
    • Fix something everywhere, then undo.
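For readers who want to see what a Text facet computes, here is a minimal Python sketch. It is not OpenRefine's implementation; the sample names and the `text_facet` helper are hypothetical, and the real column has many more rows.

```python
from collections import Counter

# Hypothetical sample of scientificName values (not taken from the real file).
names = [
    "Dipodomys merriami",
    "Dipodomys merriami",
    "Dipodomys merriami ",   # trailing space: counted as a different value
    "Chaetodipus baileyi",
    "Chaetodipus bailey",    # misspelling: also a different value
]

def text_facet(values):
    """Count how often each distinct value occurs, as a Text facet does."""
    return Counter(values)

facet = text_facet(names)

# "Sort by name" and "sort by count" are two orderings of the same counts.
by_name = sorted(facet.items())
by_count = sorted(facet.items(), key=lambda item: (-item[1], item[0]))

for value, count in by_count:
    print(repr(value), count)
```

Near-duplicate values that differ only by whitespace or spelling show up as separate entries, which is the kind of problem the facet helps you spot.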

Exercise

Solution

Cluster scientificName to clean it

  • Merge using metaphone3 (should identify three clusters).
  • Try different Methods but don’t Merge again.
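The idea behind key-collision clustering can be sketched in a few lines of Python. This sketch uses a simple fingerprint key (trim, lowercase, drop punctuation, sort unique tokens) rather than metaphone3, which is phonetic and can also group misspellings that this key would miss. The sample values and function names are hypothetical.

```python
import string
from collections import defaultdict

def fingerprint(value):
    """Build a normalized key: trim, lowercase, drop punctuation, sort unique tokens."""
    cleaned = value.strip().lower()
    cleaned = cleaned.translate(str.maketrans("", "", string.punctuation))
    tokens = sorted(set(cleaned.split()))
    return " ".join(tokens)

def cluster(values):
    """Group distinct values that share a key; keep groups with more than one variant."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return [sorted(group) for group in groups.values() if len(group) > 1]

# Hypothetical sample values (not taken from the real file).
names = [
    "Dipodomys merriami",
    "dipodomys merriami",
    " Dipodomys  merriami.",
    "Chaetodipus baileyi",
]

clusters = cluster(names)
print(clusters)
```

Merging a cluster then amounts to replacing every variant in the group with one chosen spelling.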

Split columns

Split scientificName into genus and species:

  • Edit Column > Split into several columns…

  • Replace the comma separator with a space.
  • Uncheck “Remove this column”.

(There is a problem with leading white space. See next.)

Trim whitespace

  • Undo the previous split, trim whitespace, and repeat the split.

(Finally undo to leave the dataset unsplit.)
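Why trimming matters before splitting can be illustrated with plain Python string methods. This is only an analogy for the behaviour seen in the exercise, assuming a value with a leading space; the sample value and the `split_name` helper are hypothetical.

```python
def split_name(value, separator=" "):
    """Split a name on the separator, keeping empty pieces."""
    return value.split(separator)

# Hypothetical value with a leading space.
raw = " Dipodomys merriami"

untrimmed = split_name(raw)          # leading space produces an empty first piece
trimmed = split_name(raw.strip())    # trimming first gives genus and species

print(untrimmed)
print(trimmed)
```

With the leading space in place, the first piece is empty and the genus and species shift one column to the right.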

Exercise

Solution

Key points

Facet and cluster your data to identify errors or outliers.

3. Filtering and Sorting (20’)

  • How can we select specific subsets of data?
  • How can we sort our data?

Exercise

  • Filter the scientificNames matching “bai”.
  • Type more characters.
  • Click on each species’ name, then on include / exclude.
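A text filter of this kind can be sketched as a substring match. The sketch below assumes case-insensitive matching by default; the sample names and the `text_filter` helper are hypothetical.

```python
def text_filter(values, query, case_sensitive=False):
    """Keep the values that contain the query as a substring."""
    if case_sensitive:
        return [v for v in values if query in v]
    lowered = query.lower()
    return [v for v in values if lowered in v.lower()]

# Hypothetical sample values (not taken from the real file).
names = ["Chaetodipus baileyi", "Baiomys taylori", "Dipodomys merriami"]

matches_bai = text_filter(names, "bai")
matches_bail = text_filter(names, "bail")

print(matches_bai)
print(matches_bail)
```

Typing more characters narrows the match, and the original list is never modified.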

Exercise

Sorting by one column

Exercise

Sorting by multiple columns
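Sorting by several columns means ordering by the first column and breaking ties with the next. A minimal Python sketch, assuming hypothetical rows with year and month columns named `yr` and `mo`:

```python
# Hypothetical rows; the column names yr and mo are assumptions.
rows = [
    {"recordID": 1, "yr": 1990, "mo": 7},
    {"recordID": 2, "yr": 1978, "mo": 3},
    {"recordID": 3, "yr": 1990, "mo": 1},
    {"recordID": 4, "yr": 1978, "mo": 11},
]

# One column: rows with the same year keep their original relative order.
by_year = sorted(rows, key=lambda row: row["yr"])

# Two columns: year first, then month within each year.
by_year_month = sorted(rows, key=lambda row: (row["yr"], row["mo"]))

print([row["recordID"] for row in by_year])
print([row["recordID"] for row in by_year_month])
```

`sorted` returns a new list, so the original `rows` stay in their original order.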

Key point

Sorting and filtering with OpenRefine keeps raw data raw.

4. Examining Numbers (20’)

  • How can we change the type of a column?
  • How can we visualize relationships among columns?

Change type of recordID to number

(First remove any facet.)

Exercise

Numeric facet

Exercise

Numeric facet
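The type change and the numeric facet can be sketched together in Python. This is an illustration, not OpenRefine's logic: cells that parse as numbers are converted, everything else stays as text, and the cells are then grouped roughly the way a numeric facet reports them. The sample cells and the `to_number` helper are hypothetical.

```python
def to_number(value):
    """Convert a text cell to int or float; return it unchanged if it is not numeric."""
    text = value.strip()
    for convert in (int, float):
        try:
            return convert(text)
        except ValueError:
            continue
    return value

# Hypothetical cell values as they might arrive from a CSV import (all text).
cells = ["1", "2", "3.5", "n/a", ""]

converted = [to_number(cell) for cell in cells]

# Group the cells into numeric, non-numeric and blank.
numeric = [c for c in converted if isinstance(c, (int, float))]
non_numeric = [c for c in converted if isinstance(c, str) and c.strip()]
blank = [c for c in converted if isinstance(c, str) and not c.strip()]

print(converted)
print(len(numeric), len(non_numeric), len(blank))
```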

Create a Scatterplot facet on recordID

(recordID, period, and 2 other columns should be numbers)

Exercise

Explore relationships among numeric columns

Exercise

Exercise

Key point

OpenRefine helps you get an overview of numerical data.

5. Scripts from OpenRefine (15’)

  • How can we document the data-cleaning steps?
  • How can we apply these steps to other datasets?

Export scripts of your work history

Undo / Redo > Extract…

  • OpenRefine records your history as a script.
  • You can extract and save it as plain text.

Export scripts of your work history

Undo / Redo > Apply…

  • You can apply stored cleaning steps to other datasets with the same structure.
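The extracted history is plain text in JSON form, so it can be inspected with any JSON tool before you reuse it. The Python sketch below reads such a script and lists its steps. The JSON here is an illustrative assumption of what an extract might look like, not a copy of real OpenRefine output; field names and values may differ between versions.

```python
import json

# Illustrative assumption of an extracted history; real extracts may differ.
history_text = """
[
  {"op": "core/text-transform",
   "columnName": "scientificName",
   "expression": "value.trim()",
   "description": "Trim whitespace in scientificName"},
  {"op": "core/column-split",
   "columnName": "scientificName",
   "separator": " ",
   "description": "Split scientificName by separator"}
]
"""

steps = json.loads(history_text)

# Summarize each step as (operation, column) to document the cleaning.
summary = [(step["op"], step["columnName"]) for step in steps]

for number, step in enumerate(steps, start=1):
    print(number, step["description"])
```

Because each step names the column it acts on, a script only applies cleanly to datasets whose columns match.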

Key points

  • You can reproduce your cleaning steps.

  • You can publish your data-cleaning steps in an appendix.

6. Exporting and Saving Data (15’)

How can we save and export our cleaned data from OpenRefine?

Export clean datasets or entire projects

Key points

You can export and share cleaned data, or entire projects.

7. Other Resources in OpenRefine

What other resources are available for working with OpenRefine?

End

Have you installed the software for the next lesson?