Visual Basic for Applications | Data Processing | Excel
Locating data in a range of values is an essential early step in most data processing projects and it often consists of taking the same steps to locate a dataset. Given the repetitiveness and similarity of such tasks, there are a few approaches that automate and help kickstart the process. Often one would jump between Find, Match and Looping arrays.
Andreas Killer came up with a nice routine for generating test data. It was slightly modified so that we are able to provide the threshold when calling the function:
Function generate_test_data(ByVal dbl_threshold As Double) | |
Dim arr_test_data(1 To 100000, 1 To 2) As Variant | |
Dim lng_counter As Long | |
Rnd -1652 'Random seed | |
For lng_counter = LBound(arr_test_data) To UBound(arr_test_data, 1) | |
If Rnd > dbl_threshold Then arr_test_data(lng_counter, 1) = "foo" | |
If Rnd > dbl_threshold Then arr_test_data(lng_counter, 2) = "bar" | |
Next lng_counter | |
With ActiveWorkbook.Sheets(1) | |
.UsedRange.ClearContents | |
.Range("A1").Resize(UBound(arr_test_data, 1), UBound(arr_test_data, 2)).Value2 = arr_test_data | |
End With | |
End Function |
We randomly generate foo and bar in column 1 and column 2 in a range of 100 000 rows, with the number of foos and bars controlled by the value of dbl_threshold.
Changing the dbl_threshold to a higher number will give few foos and bars, and changing it to a smaller value will give lots.
Since the foo and the bar are using different random numbers there will be rows with foo but no bar and bar but no foo, as well as rows with both foo and bar.
This makes it easy to test how the various methods compare with different densities of data.
Using the VBA function Timer - we will time how much time will it take to find every pair in the existing range.
The Timer function returns fractional portions of a second.
Function time_find(ByVal rng_test_data As Range) As Double | |
Dim lng_result_array As Long | |
Dim dbl_start_time As Double, dbl_end_time As Double | |
Dim rng_lookup_column As Range, rng_found_result As Range | |
Dim str_first_address As String | |
dbl_start_time = Timer | |
Set rng_lookup_column = rng_test_data.Resize(rng_test_data.Rows.Count, 1) | |
With rng_lookup_column | |
Set rng_found_result = .Find("foo", After:=.Cells(.Rows.Count, .Columns.Count), _ | |
LookIn:=xlValues, SearchDirection:=xlNext, MatchCase:=False) | |
str_first_address = rng_found_result.Address | |
Do | |
Set rng_found_result = .FindNext(rng_found_result) | |
If rng_found_result.Offset(0, 1) = "bar" Then | |
lng_result_array = lng_result_array + 1 | |
End If | |
Loop While Not rng_found_result Is Nothing And rng_found_result.Address <> str_first_address | |
End With | |
dbl_end_time = Timer | |
time_find = dbl_end_time - dbl_start_time | |
End Function |
This Excel VBA function will loop over all of the sheets in an activated workbook and map their values to an array. It's your responsibility to activate the correct workbook prior to starting the function. The function will initialize and change the contents of an array defined by the user. The function call should be done in this way:
Function time_match(ByVal rng_test_data As Range) As Double | |
Dim lng_match_position As Long, lng_result_array As Long | |
Dim dbl_start_time As Double, dbl_end_time As Double | |
Dim rng_lookup_column As Range | |
dbl_start_time = Timer | |
Set rng_lookup_column = rng_test_data.Resize(rng_test_data.Rows.Count, 1) | |
On Error GoTo Finish | |
Do | |
lng_match_position = Application.Match("foo", rng_lookup_column, False) | |
If rng_lookup_column(lng_match_position, 2) = "bar" Then | |
lng_result_array = lng_result_array + 1 | |
End If | |
Set rng_lookup_column = rng_lookup_column.Resize(rng_lookup_column.Rows.Count - lng_match_position, 1).Offset(lng_match_position, 0) | |
Loop | |
Finish: | |
dbl_end_time = Timer | |
time_match = dbl_end_time - dbl_start_time | |
End Function |
This Excel VBA function will extract the headers from an active sheet into an array. It's your responsibility to activate the correct workbook and sheet prior to starting the function. The function will initialize and change the contents of the array defined by the user. The user has the possibility to pick a predefined start cell or a whole range. The whole range can be chosen manualy if bPickWholeRange is set to True. The function call should be done in this way:
Function time_array(ByVal arr_test_data As Variant) As Double | |
Dim lng_array_counter As Long, lng_result_array As Long | |
Dim dbl_start_time As Double, dbl_end_time As Double | |
dbl_start_time = Timer | |
For lng_array_counter = LBound(arr_test_data, 1) To UBound(arr_test_data, 1) | |
If arr_test_data(lng_array_counter, 1) = "foo" And _ | |
arr_test_data(lng_array_counter, 2) = "bar" Then | |
lng_result_array = lng_result_array + 1 | |
End If | |
Next lng_array_counter | |
dbl_end_time = Timer | |
time_array = dbl_end_time - dbl_start_time | |
End Function |
This Excel VBA function will extract the table of data from an active sheet into an array. It's your responsibility to activate the correct workbook and sheet prior to starting the function. The function will initialize and change the contents of the array defined by the user. The user has the possibility to pick a predefined start cell or a whole range. The whole range can be chosen manualy if bPickWholeRange is set to True. The function call should be done in this way:
This Excel VBA function will create a new sheet with a user-defined name. The user must activate the desired workbook prior to calling the function. This function doesn't check if a sheet with the same name already exists and will halt the macro if a check hasn't been performed. The function call should be done in this way:
Sub compare_search() | |
Dim rng_test_data As Range | |
Dim int_trial_counter As Integer | |
Dim arr_results(1 To 100, 1 To 4) As Variant, arr_thresholds As Variant, _ | |
arr_test_data As Variant, var_threshold As Variant | |
With ActiveWorkbook.Sheets(2) | |
.Cells(1, 1).Resize(1, UBound(arr_results, 2)).Value2 = Array("Threshold", "Find", "Match", "Array") | |
End With | |
arr_thresholds = Array(0.99, 0.9, 0.85, 0.8, 0.75, 0.7, 0.65, 0.6, 0.55, 0.5, 0.45, 0.4, 0.35, 0.3, _ | |
0.25, 0.2, 0.15, 0.1, 0.05, 0.001) | |
For Each var_threshold In arr_thresholds | |
Call generate_test_data(var_threshold) | |
Set rng_test_data = ActiveWorkbook.Sheets(1).UsedRange | |
arr_test_data = rng_test_data.Value2 | |
For int_trial_counter = LBound(arr_results, 1) To UBound(arr_results, 1) | |
arr_results(int_trial_counter, 1) = var_threshold | |
arr_results(int_trial_counter, 2) = time_find(rng_test_data) | |
arr_results(int_trial_counter, 3) = time_match(rng_test_data) | |
arr_results(int_trial_counter, 4) = time_array(arr_test_data) | |
Next int_trial_counter | |
With ActiveWorkbook.Sheets(2) | |
.Cells(.UsedRange.Rows.Count + 1, 1).Resize(UBound(arr_results, 1), 4).Value2 = arr_results | |
End With | |
Next var_threshold | |
End Sub |
The implementation of the following functions can greately reduce execution times while providing enough flexibility to transform and validate data.