using regular expressions (REGEX) with mysql queries

I walk through a few common data cleanup regex patterns for mysql

Aug 31, 2022

In my work, I frequently end up having to do a lot of data analysis. This usually involves taking multiple sources of data like spreadsheets JSON or XML files and pulling them into a database.

This data is often grouped together in unusable ways, and needs to be broken up in order to be usable for analysis.

Depending on what needs to happen, I’ll either write up a quick php script to iterate through, or I’ll go right to MySQL and break it apart.

In this post, I figured I’d go through a few common things that I end up doing for SQL scripts.

It’s important to note that MSSQL and MySQL handle regex a bit differently, so these functions may not work with MSSQL

cleaning up phone numbers


So here is the most common one. You’ve got a completely bonkers list of phone numbers, and you need to clean them.

idnamephone_number
1Addison Mraz646.777.2459 x0215
2Jordyn Lynch1-855-799-1507
3Miss Adell Lesch314-645-0287 x123
4Ruben Muller(980) 500-2414
5Prof. Liana Gorczany877.973.2683
6Mr. Theron Marquardt IV+12705657302
7Ms. Joanie Keeling1-804-224-6819 x98524
8Electa Crona III(479) 416-1081 x6780
9Sylvan Muller(385) 402-2286
10Allene Gulgowski539.605.0982
Note - this data was randomly generated using Faker

First thing we do is clean out all of the extra characters from the data.

-- Step 1 - This query will remove all non-digits (0-9) from the phone number
UPDATE users SET phone_number = REGEXP_REPLACE(phone_number,'[^0-9]', '');

-- Step 2 - Now we remove the first digit, if it's 1 (for the country code for USA/Canada)
UPDATE users 
SET phone_number =  CASE 
                        WHEN SUBSTR(phone_number, 1,1) = '1' THEN SUBSTR(phone_number, 2,LENGTH(phone_number) -1) 
                        ELSE phone_number 
                    END;

If it’s the type of data where you need extensions (for business phones), then you can create a new column and we can put extension in there.

-- Step 3 - Where ever the Phone Number is longer than 10 digits, take any digits after 10, and put them in the extension column
UPDATE users 
SET extension = SUBSTR(phone_number, 11, LENGTH(phone_number) - 10) 
WHERE LENGTH(phone_number) > 10;


-- Step 4 - Cleans the extension off the end of the phone number column
UPDATE users
SET phone_number = SUBSTR(phone_number, 1,10);


Here is the progression of the phone number through our queries.

idnameinitial phone_numberstep 1 phone_numberstep 2 phone_numberstep 3 extensionstep 4 / final phone_number
1Addison Mraz646.777.2459 x0215646777245902156467772459021502156467772459
2Jordyn Lynch1-855-799-15071855799150785579915078557991507
3Miss Adell Lesch314-645-0287 x123314645028712331464502871231233146450287
4Ruben Muller(980) 500-2414980500241498050024149805002414
5Prof. Liana Gorczany877.973.2683877973268387797326838779732683
6Mr. Theron Marquardt IV+127056573021270565730227056573022705657302
7Ms. Joanie Keeling1-804-224-6819 x985241804224681998524804224681998524985248042246819
8Electa Crona III(479) 416-1081 x6780479416108167804794161081678067804794161081
9Sylvan Muller(385) 402-2286385402228638540222863854022286
10Allene Gulgowski539.605.0982539605098253960509825396050982

Finally, after all this data-cleansing and breakdown is done, you can reformat the phone number into a readable format.

-- XXX-XXX-XXXX
UPDATE users SET phone_number = CONCAT(SUBSTR(phone_number,1,3),'-',SUBSTR(phone_number,4,3),'-',SUBSTR(phone_number,7,4))

-- XXX.XXX.XXXX
UPDATE users SET phone_number = CONCAT(SUBSTR(phone_number,1,3),'.',SUBSTR(phone_number,4,3),'.',SUBSTR(phone_number,7,4))

-- (XXX) XXX-XXXX
UPDATE users SET phone_number = CONCAT('(',SUBSTR(phone_number,1,3),') ',SUBSTR(phone_number,4,3),'-',SUBSTR(phone_number,7,4))




clean up addresses

In the first example the use case was really about using REGEX to clean the data, in this example we’re going to use regex to test the data first.

Here is our sample data, and you can see that the address is all bunched up into one column.

Right away, we should add columns for unit_number, street_number, street, city, state to the table

idnameaddress
1Mr. Ronaldo Eichmannapt. 285, 81001 Klocko Crossroad, Haleyhaven, Indiana
2Hilma Baumbach2353 Cathryn Pass, West Jenniferview, Illinois
3Lisa Schumm92467 Satterfield Locks, Wendyshire, Washington
4Fredy McKenzie Vno. 179, 4769 Evangeline Garden, New Carlosstad, Vermont
5Emanuel Ward78700 Vernon Ford, North Rudolph, New Jersey
6Dr. Aiden Stokesunit. 45 837 Franecki Meadows, Schmelerfurt, Oklahoma
7Cory Windler719 Boyle Mall, Marcellusview, Oklahoma
8Arnaldo PagacUnit 99, 499 Beahan Harbors, Annestad, Rhode Island
9Hester Littel87673 Williamson Bridge, Port Nedra, Michigan
10Kip RogahnNo. 37 422 Wilson Streets, Nienowmouth, Arkansas
Note - this data was randomly generated using Faker

/* 
    Step 1 - Copy the dirty unit numbers over to the unit_number column
    
    REGEX_SUBSTR will basically use a regex pattern to find the first occurance of a regex pattern in a string.

    In this case, we're looking for 
    
    ^ - start of string
    (unit|apt|no) - one of these options
    .{1,2} - one or two characters of any kind.  Usually will be just a space, or a comma then a space
    [0-9]+ - one or more digits (0-9)

    example: "apt. 285, 81001 Klocko Crossroad, Haleyhaven, Indiana" will return "apt. 285"
*/
UPDATE users SET unit_number = REGEXP_SUBSTR(address, '^(unit|apt|no).{1,2}[0-9]+');


/* 
    Step 2 - Now we're going to remove the unit number from the address field

    This query just snips the length of unit number off the front of the address column
*/
UPDATE users 
SET address = SUBSTR(address, LENGTH(unit_number)+1, LENGTH(address)-LENGTH(unit_number)) where unit_number IS NOT NULL;

/*
    Step 3 - Now we clean up the unit number field
    
    In this one, we're just looking for instances of grouped digits.

    example: "apt. 285" will return "285"

*/
UPDATE users 
SET unit_number = REGEXP_SUBSTR(unit_number, '[0-9]+') where unit_number IS NOT NULL;

/*
    Step 4 - Clean up the address field

    When we do step 2, it often leaves some junk at the beginning of the string.
    Eg: "apt. 285, 81001 Klocko Crossroad, Haleyhaven, Indiana" 
         turns into ", 81001 Klocko Crossroad, Haleyhaven, Indiana"

    REGEX_INSTR() will return the position of the first regex match
    so we look for the next number and chop a bit more off the address field if there is any junk left
*/

UPDATE users 
SET address = SUBSTR(address, REGEXP_INSTR(address, '[0-9]+'), LENGTH(address) - REGEXP_INSTR(address, '[0-9]+'));


So at this point, our data looks like this:

idoriginal addressstep 1 unit_numberstep 2 addressstep 3 unit_numberstep 4 address
1apt. 285, 81001 Klocko Crossroad, Haleyhaven, Indianaapt. 285, 81001 Klocko Crossroad, Haleyhaven, Indiana28581001 Klocko Crossroad, Haleyhaven, Indian
22353 Cathryn Pass, West Jenniferview, Illinois2353 Cathryn Pass, West Jenniferview, Illinoi
392467 Satterfield Locks, Wendyshire, Washington92467 Satterfield Locks, Wendyshire, Washingto
4no. 179, 4769 Evangeline Garden, New Carlosstad, Vermontno. 179, 4769 Evangeline Garden, New Carlosstad, Vermont1794769 Evangeline Garden, New Carlosstad, Vermon
578700 Vernon Ford, North Rudolph, New Jersey78700 Vernon Ford, North Rudolph, New Jerse
6unit. 45 837 Franecki Meadows, Schmelerfurt, Oklahomaunit. 45837 Franecki Meadows, Schmelerfurt, Oklahoma45837 Franecki Meadows, Schmelerfurt, Oklahom
7719 Boyle Mall, Marcellusview, Oklahoma719 Boyle Mall, Marcellusview, Oklahom
8Unit 99, 499 Beahan Harbors, Annestad, Rhode IslandUnit 99, 499 Beahan Harbors, Annestad, Rhode Island99499 Beahan Harbors, Annestad, Rhode Islan
987673 Williamson Bridge, Port Nedra, Michigan87673 Williamson Bridge, Port Nedra, Michiga
10No. 37 422 Wilson Streets, Nienowmouth, ArkansasNo. 37422 Wilson Streets, Nienowmouth, Arkansas37422 Wilson Streets, Nienowmouth, Arkansa

Now we’re going to pull the street number out.

-- Step 5 - We basically the step 3 function again on the address to build the street_number column
UPDATE users SET street_number = REGEXP_SUBSTR(address, '[0-9]+');

-- Step 6 - Same function as step 2, just to clean the street number off of the address field
UPDATE users SET address = TRIM(SUBSTR(address, LENGTH(street_number)+1, LENGTH(address)-LENGTH(street_number)));

-- Step 7 - Now we're going to update the remaining fields with basic string functions
UPDATE users SET street = SUBSTRING_INDEX(address, ",", 1); -- fill address
UPDATE users SET address = TRIM(SUBSTR(address, LENGTH(street) + 2, LENGTH(address) - LENGTH(street))); -- remove street from address

UPDATE users SET city = SUBSTRING_INDEX(address, ",", 1); -- fill city
UPDATE users SET address = TRIM(SUBSTR(address, LENGTH(city) + 2, LENGTH(address) - LENGTH(city))); -- remove city from address

UPDATE users SET state = SUBSTRING_INDEX(address, ",", 1); -- fill state

We still have the address field in the table, but it’s safe to drop it now.

So here is our end result

idnameoriginal_addressunit_numberstreet_numberstreet_numbercitystate
1Mr. Ronaldo Eichmannapt. 285, 81001 Klocko Crossroad, Haleyhaven, Indiana2858100181001Klocko CrossroadHaleyhaven
2Hilma Baumbach2353 Cathryn Pass, West Jenniferview, IllinoisNULL23532353Cathryn PassWest Jenniferview
3Lisa Schumm92467 Satterfield Locks, Wendyshire, WashingtonNULL9246792467Satterfield LocksWendyshire
4Fredy McKenzie Vno. 179, 4769 Evangeline Garden, New Carlosstad, Vermont17947694769Evangeline GardenNew Carlosstad
5Emanuel Ward78700 Vernon Ford, North Rudolph, New JerseyNULL7870078700Vernon FordNorth Rudolph
6Dr. Aiden Stokesunit. 45 837 Franecki Meadows, Schmelerfurt, Oklahoma45837837Franecki MeadowsSchmelerfurt
7Cory Windler719 Boyle Mall, Marcellusview, OklahomaNULL719719Boyle MallMarcellusview
8Arnaldo PagacUnit 99, 499 Beahan Harbors, Annestad, Rhode Island99499499Beahan HarborsAnnestad
9Hester Littel87673 Williamson Bridge, Port Nedra, MichiganNULL8767387673Williamson BridgePort Nedra
10Kip RogahnNo. 37 422 Wilson Streets, Nienowmouth, Arkansas37422422Wilson StreetsNienowmouth


special notes about MySQL and REGEX

  • These functions above are all MySQL 8 functions, MySQL 5.7 doesn’t have nearly this kind of functionality.
  • To see the MySQL 8 docs for Regular Expressions, check them out here: https://dev.mysql.com/doc/refman/8.0/en/regexp.html
  • To see the MySQL 5.7 docs for Regular Expressions, check them out here: https://dev.mysql.com/doc/refman/5.7/en/regexp.html
  • Even though MySQL 8 has way better functionality, it’s still not compatible with all regular expression formulas, for instance in php I’m able to do use \d as any digit, and \D as anything but a digit, where MySQL does not seem to be compatible with these methods, instead you have to do things like [0-9] and ^0-9


useful REGEX tools

Regex-Vis is a really nice visualizer tool for REGEX statements

RegExr is a great way of testing regular expressions against bodies of text

Desktop Apps With Electron is a video course that Beyond Code put out a few years back that shows people how create desktop apps with Electron. One of the things that they show you how to do is basically programming your own Regex-Vis. Playing with it like this can supercharge your understanding of regular expressions

feedback / requests ?

Shoot me a tweet or direct message if you want me to do any other examples.