Using ChatGPT to solve regex problems
I feel like a lot of people in software engineering have yet to discover the power of ChatGPT’s Advanced Data Analysis plugin, perhaps because you have to pay for it.
Here’s a nice example: I asked it to solve a surprisingly tricky little regex problem (“match integers with 7 or more digits, but not decimals”).
This is a long conversation, but the important point is that I asked the question at the top and it then had a conversation almost entirely with itself (I only said “Yes” at one point): it wrote and ran tests to check its ideas, iterated on them, and came to a right answer at the end.
A regular expression (regex) pattern to match integers with 7 or more digits, while excluding decimal numbers, can be formulated as follows:
\b\d{7,}\b
Explanation:
\b: Asserts a word boundary, ensuring that the number is not part of a larger word or number.
\d{7,}: Matches a sequence of at least 7 digits.
\b: Another word boundary to ensure the number is not part of a larger word or number.
This pattern should match any integer number with 7 or more digits, provided that it is not part of a larger number or word (due to the word boundaries). It will not match decimal numbers as long as they are expressed with a decimal point, since the pattern requires digit characters only.
Would you like me to test this pattern with some example strings to ensure it works as expected?
import re
# Regex pattern
pattern = r'\b\d{7,}\b'
# Test strings
test_strings = [
"123456789",
"123456",
"0.123456789",
"1234567.89",
"text123456789text",
"text1234567",
"1234567text"
]
# Apply pattern and find matches
matches = {s: re.search(pattern, s) for s in test_strings}
matches
The regular expression \b\d{7,}\b was tested against a variety of strings, and the results were as follows:
"123456789": Matched (as expected)
"123456": Did not match (as expected, since it has less than 7 digits)
"0.123456789": Matched, but it's not intended to (since it's a decimal)
"1234567.89": Matched, but it's not intended to (since it's a decimal)
"text123456789text": Did not match (as expected, due to the word boundaries)
"text1234567": Did not match (as expected, due to the word boundaries)
"1234567text": Did not match (as expected, due to the word boundaries)
It seems like the regular expression is not handling decimals properly, as it matches the integer part of a decimal number. To fix this, we need to ensure that there is no decimal point immediately after the matched number. Here's an updated pattern:
\b\d{7,}(?!\.)
In this pattern:
(?!\.): A negative lookahead assertion, ensuring that the matched digits are not followed by a decimal point.
Let's test this updated pattern.
# Updated regex pattern
pattern_updated = r'\b\d{7,}(?!\.)'
# Apply updated pattern and find matches
matches_updated = {s: re.search(pattern_updated, s) for s in test_strings}
matches_updated
The updated pattern \b\d{7,}(?!\.) produced the following results:
"123456789": Matched (as expected)
"123456": Did not match (as expected, since it has less than 7 digits)
"0.123456789": Matched, but it's not intended to (since it's a decimal)
"1234567.89": Did not match (as expected, since it's a decimal)
"text123456789text": Did not match (as expected, due to the word boundaries)
"text1234567": Did not match (as expected, due to the word boundaries)
"1234567text": Matched (as expected)
It seems like we still have a problem with the third test case ("0.123456789"), where the digits after the decimal point are being matched. This is a bit tricky to handle with word boundaries alone, since the . is a non-word character, so \b still matches between it and the digits that follow.
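To see how \b behaves around a dot, here is a small sketch (not part of the original conversation) that lists the positions where Python's re module finds a word boundary in "0.123456789". The helper name boundary_positions is made up for this illustration.

```python
import re

def boundary_positions(text):
    # Collect the start index of every zero-width word-boundary match.
    return [m.start() for m in re.finditer(r'\b', text)]

sample = "0.123456789"
positions = boundary_positions(sample)
print(positions)  # [0, 1, 2, 11]

# Index 2 sits between "." and "1", so \b\d{7,}\b can start matching there.
first_pattern_match = re.search(r'\b\d{7,}\b', sample)
print(first_pattern_match.group())  # 123456789
```

A boundary exists wherever a word character (letter, digit, underscore) meets a non-word character or the edge of the string, which is why the dot does not shield the digits after it.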
To handle this and similar cases, we need a regex pattern that also checks for a non-digit character or the start of the string before the sequence of digits:
(?<=\D|^)\d{7,}(?!\.)
Explanation:
(?<=\D|^): A positive lookbehind assertion that checks for either a non-digit character (\D) or the start of the string (^) before the sequence of digits.
\d{7,}: Matches a sequence of at least 7 digits.
(?!\.): A negative lookahead assertion, ensuring that the matched digits are not followed by a decimal point.
Let's test this updated pattern again.
# Further updated regex pattern
pattern_further_updated = r'(?<=\D|^)\d{7,}(?!\.)'
# Apply further updated pattern and find matches
matches_further_updated = {s: re.search(pattern_further_updated, s) for s in test_strings}
matches_further_updated
It seems that we've encountered an error due to the variable-width lookbehind assertion (?<=\D|^). Variable-width lookbehind assertions are not supported in Python's re module.
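A minimal sketch of that failure, assuming the standard-library re module: compiling the pattern raises re.error because the two lookbehind alternatives have different widths. The sketch also tries one fixed-width alternative, the negative lookbehind (?<!\d), which is not what the conversation goes on to use; the variable names are illustrative only.

```python
import re

# \D is one character wide and ^ is zero characters wide, so this
# lookbehind has no single fixed width and is rejected at compile time.
try:
    re.compile(r'(?<=\D|^)\d{7,}(?!\.)')
    compile_error = None
except re.error as exc:
    compile_error = str(exc)

print(compile_error)

# "Not preceded by a digit" is always one character wide, and it also
# succeeds at the start of the string, so it covers the same two cases.
fixed_width = re.compile(r'(?<!\d)\d{7,}(?!\.)')
print(fixed_width.search("text1234567").group())  # 1234567
print(fixed_width.search("0.123456789").group())  # 123456789 (the dot problem remains)
```

The fixed-width version compiles, but like (?:\D|^) it still accepts a dot before the digits, so it does not solve the decimal problem by itself.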
We can modify our approach by checking for either a non-digit character or the start of the string using an alternation | in the main pattern (instead of using lookbehind):
(?:\D|^)\d{7,}(?!\.)
Explanation:
(?:\D|^): A non-capturing group that matches either a non-digit character (\D) or the start of the string (^).
\d{7,}: Matches a sequence of at least 7 digits.
(?!\.): A negative lookahead assertion, ensuring that the matched digits are not followed by a decimal point.
Please note that with this pattern, the match will include the preceding non-digit character (if there is one). To extract just the number from a match, you can capture the number part using a capturing group:
(?:\D|^)(\d{7,})(?!\.)
Then, you can extract the actual number from the match using group extraction. Let's test this modified pattern.
# Modified regex pattern
pattern_modified = r'(?:\D|^)(\d{7,})(?!\.)'
# Apply modified pattern and find matches
matches_modified = {s: re.search(pattern_modified, s) for s in test_strings}
# Extracting matched numbers using group extraction
extracted_matches = {s: m.group(1) if m is not None else None for s, m in matches_modified.items()}
extracted_matches
It seems like there's still an issue with the third test case ("0.123456789"), where the digits after the decimal point are being matched. This is because the dot (.) is being considered as a non-digit character, which satisfies the (?:\D|^) part of our regular expression.
To resolve this, we might have to make our pattern a bit more specific to handle dots appropriately. We can update the pattern to ensure that a dot is not immediately preceding the matched number. Here's one way to modify the pattern:
(?:[^.\d]|^)(\d{7,})(?!\.)
Explanation:
(?:[^.\d]|^): A non-capturing group that matches either a character that is not a dot or a digit ([^.\d]) or the start of the string (^).
(\d{7,}): A capturing group that matches a sequence of at least 7 digits.
(?!\.): A negative lookahead assertion, ensuring that the matched digits are not followed by a decimal point.
Let's test this updated pattern with the test strings.
# Further modified regex pattern
pattern_further_modified = r'(?:[^.\d]|^)(\d{7,})(?!\.)'
# Apply further modified pattern and find matches
matches_further_modified = {s: re.search(pattern_further_modified, s) for s in test_strings}
# Extracting matched numbers using group extraction
extracted_matches_further_modified = {s: m.group(1) if m is not None else None for s, m in matches_further_modified.items()}
extracted_matches_further_modified
The further modified pattern (?:[^.\d]|^)(\d{7,})(?!\.) provides the following results:
"123456789": Matched as '123456789' (as expected)
"123456": Did not match (as expected, since it has less than 7 digits)
"0.123456789": Did not match (as expected, since it's a decimal)
"1234567.89": Did not match (as expected, since it's a decimal)
"text123456789text": Matched as '123456789' (as expected, since it has at least 7 digits and is surrounded by non-digit characters)
"text1234567": Matched as '1234567' (as expected, since it has 7 digits and is preceded by a non-digit character)
"1234567text": Matched as '1234567' (as expected, since it has 7 digits and is followed by a non-digit character)
This updated pattern appears to work as expected for the given test cases. Is there anything else you'd like to modify or any other test cases you'd like to explore?
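One case the seven test strings do not cover is a decimal whose integer part has eight or more digits. The sketch below is not from the conversation: it checks that case against the final pattern and against a hypothetical alternative that uses fixed-width lookarounds on both sides. The names lookaround_pattern and extract are invented for this example.

```python
import re

# Final pattern from the conversation.
final_pattern = re.compile(r'(?:[^.\d]|^)(\d{7,})(?!\.)')

# Hypothetical alternative: the digits may be neither preceded nor
# followed by a digit or a dot (both lookarounds are one character wide).
lookaround_pattern = re.compile(r'(?<![\d.])\d{7,}(?![\d.])')

def extract(pattern, text):
    """Return the matched digits, or None when nothing matches."""
    m = pattern.search(text)
    if m is None:
        return None
    # Use the capturing group when the pattern defines one.
    return m.group(1) if pattern.groups else m.group(0)

extra_case = "12345678.9"  # decimal with an 8-digit integer part
print(extract(final_pattern, extra_case))       # 1234567
print(extract(lookaround_pattern, extra_case))  # None

# Both patterns agree on the original test strings.
test_strings = [
    "123456789", "123456", "0.123456789", "1234567.89",
    "text123456789text", "text1234567", "1234567text",
]
agree = all(extract(final_pattern, s) == extract(lookaround_pattern, s)
            for s in test_strings)
print(agree)  # True
```

With the final pattern, \d{7,} can backtrack from eight digits to seven, at which point the next character is a digit rather than a dot and the lookahead passes. Forbidding a following digit as well as a following dot closes that gap. Both patterns treat any trailing dot as a decimal point, so an integer directly followed by a sentence-ending period, such as "1234567.", yields no match.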