In my work, I frequently end up having to do a lot of data analysis. This usually involves taking multiple sources of data like spreadsheets JSON or XML files and pulling them into a database.
This data is often grouped together in unusable ways, and needs to be broken up in order to be usable for analysis.
Depending on what needs to happen, I’ll either write up a quick php script to iterate through, or I’ll go right to MySQL and break it apart.
In this post, I figured I’d go through a few common things that I end up doing for SQL scripts.
It’s important to note that MSSQL and MySQL handle regex a bit differently, so these functions may not work with MSSQL
cleaning up phone numbers
So here is the most common one. You’ve got a completely bonkers list of phone numbers, and you need to clean them.
id | name | phone_number |
---|---|---|
1 | Addison Mraz | 646.777.2459 x0215 |
2 | Jordyn Lynch | 1-855-799-1507 |
3 | Miss Adell Lesch | 314-645-0287 x123 |
4 | Ruben Muller | (980) 500-2414 |
5 | Prof. Liana Gorczany | 877.973.2683 |
6 | Mr. Theron Marquardt IV | +12705657302 |
7 | Ms. Joanie Keeling | 1-804-224-6819 x98524 |
8 | Electa Crona III | (479) 416-1081 x6780 |
9 | Sylvan Muller | (385) 402-2286 |
10 | Allene Gulgowski | 539.605.0982 |
First thing we do is clean out all of the extra characters from the data.
-- Step 1 - This query will remove all non-digits (0-9) from the phone number
UPDATE users SET phone_number = REGEXP_REPLACE(phone_number,'[^0-9]', '');
-- Step 2 - Now we remove the first digit, if it's 1 (for the country code for USA/Canada)
UPDATE users
SET phone_number = CASE
WHEN SUBSTR(phone_number, 1,1) = '1' THEN SUBSTR(phone_number, 2,LENGTH(phone_number) -1)
ELSE phone_number
END;
If it’s the type of data where you need extensions (for business phones), then you can create a new column and we can put extension in there.
-- Step 3 - Where ever the Phone Number is longer than 10 digits, take any digits after 10, and put them in the extension column
UPDATE users
SET extension = SUBSTR(phone_number, 11, LENGTH(phone_number) - 10)
WHERE LENGTH(phone_number) > 10;
-- Step 4 - Cleans the extension off the end of the phone number column
UPDATE users
SET phone_number = SUBSTR(phone_number, 1,10);
Here is the progression of the phone number through our queries.
id | name | initial phone_number | step 1 phone_number | step 2 phone_number | step 3 extension | step 4 / final phone_number |
---|---|---|---|---|---|---|
1 | Addison Mraz | 646.777.2459 x0215 | 64677724590215 | 64677724590215 | 0215 | 6467772459 |
2 | Jordyn Lynch | 1-855-799-1507 | 18557991507 | 8557991507 | 8557991507 | |
3 | Miss Adell Lesch | 314-645-0287 x123 | 3146450287123 | 3146450287123 | 123 | 3146450287 |
4 | Ruben Muller | (980) 500-2414 | 9805002414 | 9805002414 | 9805002414 | |
5 | Prof. Liana Gorczany | 877.973.2683 | 8779732683 | 8779732683 | 8779732683 | |
6 | Mr. Theron Marquardt IV | +12705657302 | 12705657302 | 2705657302 | 2705657302 | |
7 | Ms. Joanie Keeling | 1-804-224-6819 x98524 | 1804224681998524 | 804224681998524 | 98524 | 8042246819 |
8 | Electa Crona III | (479) 416-1081 x6780 | 47941610816780 | 47941610816780 | 6780 | 4794161081 |
9 | Sylvan Muller | (385) 402-2286 | 3854022286 | 3854022286 | 3854022286 | |
10 | Allene Gulgowski | 539.605.0982 | 5396050982 | 5396050982 | 5396050982 |
Finally, after all this data-cleansing and breakdown is done, you can reformat the phone number into a readable format.
-- XXX-XXX-XXXX
UPDATE users SET phone_number = CONCAT(SUBSTR(phone_number,1,3),'-',SUBSTR(phone_number,4,3),'-',SUBSTR(phone_number,7,4))
-- XXX.XXX.XXXX
UPDATE users SET phone_number = CONCAT(SUBSTR(phone_number,1,3),'.',SUBSTR(phone_number,4,3),'.',SUBSTR(phone_number,7,4))
-- (XXX) XXX-XXXX
UPDATE users SET phone_number = CONCAT('(',SUBSTR(phone_number,1,3),') ',SUBSTR(phone_number,4,3),'-',SUBSTR(phone_number,7,4))
clean up addresses
In the first example the use case was really about using REGEX to clean the data, in this example we’re going to use regex to test the data first.
Here is our sample data, and you can see that the address is all bunched up into one column.
Right away, we should add columns for unit_number
, street_number
, street
, city
, state
to the table
id | name | address |
---|---|---|
1 | Mr. Ronaldo Eichmann | apt. 285, 81001 Klocko Crossroad, Haleyhaven, Indiana |
2 | Hilma Baumbach | 2353 Cathryn Pass, West Jenniferview, Illinois |
3 | Lisa Schumm | 92467 Satterfield Locks, Wendyshire, Washington |
4 | Fredy McKenzie V | no. 179, 4769 Evangeline Garden, New Carlosstad, Vermont |
5 | Emanuel Ward | 78700 Vernon Ford, North Rudolph, New Jersey |
6 | Dr. Aiden Stokes | unit. 45 837 Franecki Meadows, Schmelerfurt, Oklahoma |
7 | Cory Windler | 719 Boyle Mall, Marcellusview, Oklahoma |
8 | Arnaldo Pagac | Unit 99, 499 Beahan Harbors, Annestad, Rhode Island |
9 | Hester Littel | 87673 Williamson Bridge, Port Nedra, Michigan |
10 | Kip Rogahn | No. 37 422 Wilson Streets, Nienowmouth, Arkansas |
/*
Step 1 - Copy the dirty unit numbers over to the unit_number column
REGEX_SUBSTR will basically use a regex pattern to find the first occurance of a regex pattern in a string.
In this case, we're looking for
^ - start of string
(unit|apt|no) - one of these options
.{1,2} - one or two characters of any kind. Usually will be just a space, or a comma then a space
[0-9]+ - one or more digits (0-9)
example: "apt. 285, 81001 Klocko Crossroad, Haleyhaven, Indiana" will return "apt. 285"
*/
UPDATE users SET unit_number = REGEXP_SUBSTR(address, '^(unit|apt|no).{1,2}[0-9]+');
/*
Step 2 - Now we're going to remove the unit number from the address field
This query just snips the length of unit number off the front of the address column
*/
UPDATE users
SET address = SUBSTR(address, LENGTH(unit_number)+1, LENGTH(address)-LENGTH(unit_number)) where unit_number IS NOT NULL;
/*
Step 3 - Now we clean up the unit number field
In this one, we're just looking for instances of grouped digits.
example: "apt. 285" will return "285"
*/
UPDATE users
SET unit_number = REGEXP_SUBSTR(unit_number, '[0-9]+') where unit_number IS NOT NULL;
/*
Step 4 - Clean up the address field
When we do step 2, it often leaves some junk at the beginning of the string.
Eg: "apt. 285, 81001 Klocko Crossroad, Haleyhaven, Indiana"
turns into ", 81001 Klocko Crossroad, Haleyhaven, Indiana"
REGEX_INSTR() will return the position of the first regex match
so we look for the next number and chop a bit more off the address field if there is any junk left
*/
UPDATE users
SET address = SUBSTR(address, REGEXP_INSTR(address, '[0-9]+'), LENGTH(address) - REGEXP_INSTR(address, '[0-9]+'));
So at this point, our data looks like this:
id | original address | step 1 unit_number | step 2 address | step 3 unit_number | step 4 address |
---|---|---|---|---|---|
1 | apt. 285, 81001 Klocko Crossroad, Haleyhaven, Indiana | apt. 285 | , 81001 Klocko Crossroad, Haleyhaven, Indiana | 285 | 81001 Klocko Crossroad, Haleyhaven, Indian |
2 | 2353 Cathryn Pass, West Jenniferview, Illinois | 2353 Cathryn Pass, West Jenniferview, Illinoi | |||
3 | 92467 Satterfield Locks, Wendyshire, Washington | 92467 Satterfield Locks, Wendyshire, Washingto | |||
4 | no. 179, 4769 Evangeline Garden, New Carlosstad, Vermont | no. 179 | , 4769 Evangeline Garden, New Carlosstad, Vermont | 179 | 4769 Evangeline Garden, New Carlosstad, Vermon |
5 | 78700 Vernon Ford, North Rudolph, New Jersey | 78700 Vernon Ford, North Rudolph, New Jerse | |||
6 | unit. 45 837 Franecki Meadows, Schmelerfurt, Oklahoma | unit. 45 | 837 Franecki Meadows, Schmelerfurt, Oklahoma | 45 | 837 Franecki Meadows, Schmelerfurt, Oklahom |
7 | 719 Boyle Mall, Marcellusview, Oklahoma | 719 Boyle Mall, Marcellusview, Oklahom | |||
8 | Unit 99, 499 Beahan Harbors, Annestad, Rhode Island | Unit 99 | , 499 Beahan Harbors, Annestad, Rhode Island | 99 | 499 Beahan Harbors, Annestad, Rhode Islan |
9 | 87673 Williamson Bridge, Port Nedra, Michigan | 87673 Williamson Bridge, Port Nedra, Michiga | |||
10 | No. 37 422 Wilson Streets, Nienowmouth, Arkansas | No. 37 | 422 Wilson Streets, Nienowmouth, Arkansas | 37 | 422 Wilson Streets, Nienowmouth, Arkansa |
Now we’re going to pull the street number out.
-- Step 5 - We basically the step 3 function again on the address to build the street_number column
UPDATE users SET street_number = REGEXP_SUBSTR(address, '[0-9]+');
-- Step 6 - Same function as step 2, just to clean the street number off of the address field
UPDATE users SET address = TRIM(SUBSTR(address, LENGTH(street_number)+1, LENGTH(address)-LENGTH(street_number)));
-- Step 7 - Now we're going to update the remaining fields with basic string functions
UPDATE users SET street = SUBSTRING_INDEX(address, ",", 1); -- fill address
UPDATE users SET address = TRIM(SUBSTR(address, LENGTH(street) + 2, LENGTH(address) - LENGTH(street))); -- remove street from address
UPDATE users SET city = SUBSTRING_INDEX(address, ",", 1); -- fill city
UPDATE users SET address = TRIM(SUBSTR(address, LENGTH(city) + 2, LENGTH(address) - LENGTH(city))); -- remove city from address
UPDATE users SET state = SUBSTRING_INDEX(address, ",", 1); -- fill state
We still have the address field in the table, but it’s safe to drop it now.
So here is our end result
id | name | original_address | unit_number | street_number | street_number | city | state |
---|---|---|---|---|---|---|---|
1 | Mr. Ronaldo Eichmann | apt. 285, 81001 Klocko Crossroad, Haleyhaven, Indiana | 285 | 81001 | 81001 | Klocko Crossroad | Haleyhaven |
2 | Hilma Baumbach | 2353 Cathryn Pass, West Jenniferview, Illinois | NULL | 2353 | 2353 | Cathryn Pass | West Jenniferview |
3 | Lisa Schumm | 92467 Satterfield Locks, Wendyshire, Washington | NULL | 92467 | 92467 | Satterfield Locks | Wendyshire |
4 | Fredy McKenzie V | no. 179, 4769 Evangeline Garden, New Carlosstad, Vermont | 179 | 4769 | 4769 | Evangeline Garden | New Carlosstad |
5 | Emanuel Ward | 78700 Vernon Ford, North Rudolph, New Jersey | NULL | 78700 | 78700 | Vernon Ford | North Rudolph |
6 | Dr. Aiden Stokes | unit. 45 837 Franecki Meadows, Schmelerfurt, Oklahoma | 45 | 837 | 837 | Franecki Meadows | Schmelerfurt |
7 | Cory Windler | 719 Boyle Mall, Marcellusview, Oklahoma | NULL | 719 | 719 | Boyle Mall | Marcellusview |
8 | Arnaldo Pagac | Unit 99, 499 Beahan Harbors, Annestad, Rhode Island | 99 | 499 | 499 | Beahan Harbors | Annestad |
9 | Hester Littel | 87673 Williamson Bridge, Port Nedra, Michigan | NULL | 87673 | 87673 | Williamson Bridge | Port Nedra |
10 | Kip Rogahn | No. 37 422 Wilson Streets, Nienowmouth, Arkansas | 37 | 422 | 422 | Wilson Streets | Nienowmouth |
special notes about MySQL and REGEX
- These functions above are all MySQL 8 functions, MySQL 5.7 doesn’t have nearly this kind of functionality.
- To see the MySQL 8 docs for Regular Expressions, check them out here: https://dev.mysql.com/doc/refman/8.0/en/regexp.html
- To see the MySQL 5.7 docs for Regular Expressions, check them out here: https://dev.mysql.com/doc/refman/5.7/en/regexp.html
- Even though MySQL 8 has way better functionality, it’s still not compatible with all regular expression formulas, for instance in php I’m able to do use
\d
as any digit, and\D
as anything but a digit, where MySQL does not seem to be compatible with these methods, instead you have to do things like[0-9]
and^0-9
useful REGEX tools
Regex-Vis is a really nice visualizer tool for REGEX statements
RegExr is a great way of testing regular expressions against bodies of text
Desktop Apps With Electron is a video course that Beyond Code put out a few years back that shows people how create desktop apps with Electron. One of the things that they show you how to do is basically programming your own Regex-Vis. Playing with it like this can supercharge your understanding of regular expressions
feedback / requests ?
Shoot me a tweet or direct message if you want me to do any other examples.