Aug 31, 2022

In my work, I frequently end up having to do a lot of data analysis. This usually involves taking multiple sources of data like spreadsheets JSON or XML files and pulling them into a database.

This data is often grouped together in unusable ways, and needs to be broken up in order to be usable for analysis.

Depending on what needs to happen, I’ll either write up a quick php script to iterate through, or I’ll go right to MySQL and break it apart.

In this post, I figured I’d go through a few common things that I end up doing for SQL scripts.

It’s important to note that MSSQL and MySQL handle regex a bit differently, so these functions may not work with MSSQL

cleaning up phone numbers

So here is the most common one. You’ve got a completely bonkers list of phone numbers, and you need to clean them.

id	name	phone_number
1	Addison Mraz	646.777.2459 x0215
2	Jordyn Lynch	1-855-799-1507
3	Miss Adell Lesch	314-645-0287 x123
4	Ruben Muller	(980) 500-2414
5	Prof. Liana Gorczany	877.973.2683
6	Mr. Theron Marquardt IV	+12705657302
7	Ms. Joanie Keeling	1-804-224-6819 x98524
8	Electa Crona III	(479) 416-1081 x6780
9	Sylvan Muller	(385) 402-2286
10	Allene Gulgowski	539.605.0982

Note - this data was randomly generated using Faker

First thing we do is clean out all of the extra characters from the data.

-- Step 1 - This query will remove all non-digits (0-9) from the phone number
UPDATE users SET phone_number = REGEXP_REPLACE(phone_number,'[^0-9]', '');

-- Step 2 - Now we remove the first digit, if it's 1 (for the country code for USA/Canada)
UPDATE users 
SET phone_number =  CASE 
                        WHEN SUBSTR(phone_number, 1,1) = '1' THEN SUBSTR(phone_number, 2,LENGTH(phone_number) -1) 
                        ELSE phone_number 
                    END;

If it’s the type of data where you need extensions (for business phones), then you can create a new column and we can put extension in there.

-- Step 3 - Where ever the Phone Number is longer than 10 digits, take any digits after 10, and put them in the extension column
UPDATE users 
SET extension = SUBSTR(phone_number, 11, LENGTH(phone_number) - 10) 
WHERE LENGTH(phone_number) > 10;


-- Step 4 - Cleans the extension off the end of the phone number column
UPDATE users
SET phone_number = SUBSTR(phone_number, 1,10);

Here is the progression of the phone number through our queries.

id	name	initial phone_number	step 1 phone_number	step 2 phone_number	step 3 extension	step 4 / final phone_number
1	Addison Mraz	646.777.2459 x0215	64677724590215	64677724590215	0215	6467772459
2	Jordyn Lynch	1-855-799-1507	18557991507	8557991507		8557991507
3	Miss Adell Lesch	314-645-0287 x123	3146450287123	3146450287123	123	3146450287
4	Ruben Muller	(980) 500-2414	9805002414	9805002414		9805002414
5	Prof. Liana Gorczany	877.973.2683	8779732683	8779732683		8779732683
6	Mr. Theron Marquardt IV	+12705657302	12705657302	2705657302		2705657302
7	Ms. Joanie Keeling	1-804-224-6819 x98524	1804224681998524	804224681998524	98524	8042246819
8	Electa Crona III	(479) 416-1081 x6780	47941610816780	47941610816780	6780	4794161081
9	Sylvan Muller	(385) 402-2286	3854022286	3854022286		3854022286
10	Allene Gulgowski	539.605.0982	5396050982	5396050982		5396050982

Finally, after all this data-cleansing and breakdown is done, you can reformat the phone number into a readable format.

-- XXX-XXX-XXXX
UPDATE users SET phone_number = CONCAT(SUBSTR(phone_number,1,3),'-',SUBSTR(phone_number,4,3),'-',SUBSTR(phone_number,7,4))

-- XXX.XXX.XXXX
UPDATE users SET phone_number = CONCAT(SUBSTR(phone_number,1,3),'.',SUBSTR(phone_number,4,3),'.',SUBSTR(phone_number,7,4))

-- (XXX) XXX-XXXX
UPDATE users SET phone_number = CONCAT('(',SUBSTR(phone_number,1,3),') ',SUBSTR(phone_number,4,3),'-',SUBSTR(phone_number,7,4))

clean up addresses

In the first example the use case was really about using REGEX to clean the data, in this example we’re going to use regex to test the data first.

Here is our sample data, and you can see that the address is all bunched up into one column.

Right away, we should add columns for unit_number, street_number, street, city, state to the table

id	name	address
1	Mr. Ronaldo Eichmann	apt. 285, 81001 Klocko Crossroad, Haleyhaven, Indiana
2	Hilma Baumbach	2353 Cathryn Pass, West Jenniferview, Illinois
3	Lisa Schumm	92467 Satterfield Locks, Wendyshire, Washington
4	Fredy McKenzie V	no. 179, 4769 Evangeline Garden, New Carlosstad, Vermont
5	Emanuel Ward	78700 Vernon Ford, North Rudolph, New Jersey
6	Dr. Aiden Stokes	unit. 45 837 Franecki Meadows, Schmelerfurt, Oklahoma
7	Cory Windler	719 Boyle Mall, Marcellusview, Oklahoma
8	Arnaldo Pagac	Unit 99, 499 Beahan Harbors, Annestad, Rhode Island
9	Hester Littel	87673 Williamson Bridge, Port Nedra, Michigan
10	Kip Rogahn	No. 37 422 Wilson Streets, Nienowmouth, Arkansas

Note - this data was randomly generated using Faker

/* 
    Step 1 - Copy the dirty unit numbers over to the unit_number column
    
    REGEX_SUBSTR will basically use a regex pattern to find the first occurance of a regex pattern in a string.

    In this case, we're looking for 
    
    ^ - start of string
    (unit|apt|no) - one of these options
    .{1,2} - one or two characters of any kind.  Usually will be just a space, or a comma then a space
    [0-9]+ - one or more digits (0-9)

    example: "apt. 285, 81001 Klocko Crossroad, Haleyhaven, Indiana" will return "apt. 285"
*/
UPDATE users SET unit_number = REGEXP_SUBSTR(address, '^(unit|apt|no).{1,2}[0-9]+');


/* 
    Step 2 - Now we're going to remove the unit number from the address field

    This query just snips the length of unit number off the front of the address column
*/
UPDATE users 
SET address = SUBSTR(address, LENGTH(unit_number)+1, LENGTH(address)-LENGTH(unit_number)) where unit_number IS NOT NULL;

/*
    Step 3 - Now we clean up the unit number field
    
    In this one, we're just looking for instances of grouped digits.

    example: "apt. 285" will return "285"

*/
UPDATE users 
SET unit_number = REGEXP_SUBSTR(unit_number, '[0-9]+') where unit_number IS NOT NULL;

/*
    Step 4 - Clean up the address field

    When we do step 2, it often leaves some junk at the beginning of the string.
    Eg: "apt. 285, 81001 Klocko Crossroad, Haleyhaven, Indiana" 
         turns into ", 81001 Klocko Crossroad, Haleyhaven, Indiana"

    REGEX_INSTR() will return the position of the first regex match
    so we look for the next number and chop a bit more off the address field if there is any junk left
*/

UPDATE users 
SET address = SUBSTR(address, REGEXP_INSTR(address, '[0-9]+'), LENGTH(address) - REGEXP_INSTR(address, '[0-9]+'));

So at this point, our data looks like this:

id	original address	step 1 unit_number	step 2 address	step 3 unit_number	step 4 address
1	apt. 285, 81001 Klocko Crossroad, Haleyhaven, Indiana	apt. 285	, 81001 Klocko Crossroad, Haleyhaven, Indiana	285	81001 Klocko Crossroad, Haleyhaven, Indian
2	2353 Cathryn Pass, West Jenniferview, Illinois				2353 Cathryn Pass, West Jenniferview, Illinoi
3	92467 Satterfield Locks, Wendyshire, Washington				92467 Satterfield Locks, Wendyshire, Washingto
4	no. 179, 4769 Evangeline Garden, New Carlosstad, Vermont	no. 179	, 4769 Evangeline Garden, New Carlosstad, Vermont	179	4769 Evangeline Garden, New Carlosstad, Vermon
5	78700 Vernon Ford, North Rudolph, New Jersey				78700 Vernon Ford, North Rudolph, New Jerse
6	unit. 45 837 Franecki Meadows, Schmelerfurt, Oklahoma	unit. 45	837 Franecki Meadows, Schmelerfurt, Oklahoma	45	837 Franecki Meadows, Schmelerfurt, Oklahom
7	719 Boyle Mall, Marcellusview, Oklahoma				719 Boyle Mall, Marcellusview, Oklahom
8	Unit 99, 499 Beahan Harbors, Annestad, Rhode Island	Unit 99	, 499 Beahan Harbors, Annestad, Rhode Island	99	499 Beahan Harbors, Annestad, Rhode Islan
9	87673 Williamson Bridge, Port Nedra, Michigan				87673 Williamson Bridge, Port Nedra, Michiga
10	No. 37 422 Wilson Streets, Nienowmouth, Arkansas	No. 37	422 Wilson Streets, Nienowmouth, Arkansas	37	422 Wilson Streets, Nienowmouth, Arkansa

Now we’re going to pull the street number out.

-- Step 5 - We basically the step 3 function again on the address to build the street_number column
UPDATE users SET street_number = REGEXP_SUBSTR(address, '[0-9]+');

-- Step 6 - Same function as step 2, just to clean the street number off of the address field
UPDATE users SET address = TRIM(SUBSTR(address, LENGTH(street_number)+1, LENGTH(address)-LENGTH(street_number)));

-- Step 7 - Now we're going to update the remaining fields with basic string functions
UPDATE users SET street = SUBSTRING_INDEX(address, ",", 1); -- fill address
UPDATE users SET address = TRIM(SUBSTR(address, LENGTH(street) + 2, LENGTH(address) - LENGTH(street))); -- remove street from address

UPDATE users SET city = SUBSTRING_INDEX(address, ",", 1); -- fill city
UPDATE users SET address = TRIM(SUBSTR(address, LENGTH(city) + 2, LENGTH(address) - LENGTH(city))); -- remove city from address

UPDATE users SET state = SUBSTRING_INDEX(address, ",", 1); -- fill state

We still have the address field in the table, but it’s safe to drop it now.

So here is our end result

id	name	original_address	unit_number	street_number	street_number	city	state
1	Mr. Ronaldo Eichmann	apt. 285, 81001 Klocko Crossroad, Haleyhaven, Indiana	285	81001	81001	Klocko Crossroad	Haleyhaven
2	Hilma Baumbach	2353 Cathryn Pass, West Jenniferview, Illinois	NULL	2353	2353	Cathryn Pass	West Jenniferview
3	Lisa Schumm	92467 Satterfield Locks, Wendyshire, Washington	NULL	92467	92467	Satterfield Locks	Wendyshire
4	Fredy McKenzie V	no. 179, 4769 Evangeline Garden, New Carlosstad, Vermont	179	4769	4769	Evangeline Garden	New Carlosstad
5	Emanuel Ward	78700 Vernon Ford, North Rudolph, New Jersey	NULL	78700	78700	Vernon Ford	North Rudolph
6	Dr. Aiden Stokes	unit. 45 837 Franecki Meadows, Schmelerfurt, Oklahoma	45	837	837	Franecki Meadows	Schmelerfurt
7	Cory Windler	719 Boyle Mall, Marcellusview, Oklahoma	NULL	719	719	Boyle Mall	Marcellusview
8	Arnaldo Pagac	Unit 99, 499 Beahan Harbors, Annestad, Rhode Island	99	499	499	Beahan Harbors	Annestad
9	Hester Littel	87673 Williamson Bridge, Port Nedra, Michigan	NULL	87673	87673	Williamson Bridge	Port Nedra
10	Kip Rogahn	No. 37 422 Wilson Streets, Nienowmouth, Arkansas	37	422	422	Wilson Streets	Nienowmouth

special notes about MySQL and REGEX

These functions above are all MySQL 8 functions, MySQL 5.7 doesn’t have nearly this kind of functionality.
To see the MySQL 8 docs for Regular Expressions, check them out here: https://dev.mysql.com/doc/refman/8.0/en/regexp.html
To see the MySQL 5.7 docs for Regular Expressions, check them out here: https://dev.mysql.com/doc/refman/5.7/en/regexp.html
Even though MySQL 8 has way better functionality, it’s still not compatible with all regular expression formulas, for instance in php I’m able to do use \d as any digit, and \D as anything but a digit, where MySQL does not seem to be compatible with these methods, instead you have to do things like [0-9] and ^0-9

useful REGEX tools

Regex-Vis is a really nice visualizer tool for REGEX statements

RegExr is a great way of testing regular expressions against bodies of text

Desktop Apps With Electron is a video course that Beyond Code put out a few years back that shows people how create desktop apps with Electron. One of the things that they show you how to do is basically programming your own Regex-Vis. Playing with it like this can supercharge your understanding of regular expressions

feedback / requests ?

Shoot me a tweet or direct message if you want me to do any other examples.