Discussion:
ORDER BY a second letter of a field
Eduardo
2011-08-21 18:50:56 UTC
Hello guys, I'm using DAO 3.6 with an Access 2003 database in a program.

I would like to generate recordsets ordered by the second (or third)
letter of a text field of a table. Is it possible?

Thanks.
Tony Toews
2011-08-21 19:47:22 UTC
Post by Eduardo
Hello guys, I'm using DAO 3.6 with an Access 2003 database in a program.
I would like to generate recordsets ordered by the second (or third)
letter of a text field of a table. Is it possible?
In your SQL try
ORDER BY Mid([FieldName],2,1)
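For example, opened from DAO (air code; the database, table and field
names here are placeholders):

Dim db As DAO.Database, rs As DAO.Recordset
Set db = DBEngine.OpenDatabase("C:\Dictionary.mdb")
'sort on the 2nd letter; use Mid([FieldName],3,1) for the 3rd, etc.
Set rs = db.OpenRecordset( _
    "SELECT * FROM Words ORDER BY Mid([Translit1],2,1)", dbOpenSnapshot)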

Tony
--
Tony Toews, Microsoft Access MVP
Tony's Main MS Access pages - http://www.granite.ab.ca/accsmstr.htm
Tony's Microsoft Access Blog - http://msmvps.com/blogs/access/
For a convenient utility to keep your users FEs and other files
updated see http://www.autofeupdater.com/
Nobody
2011-08-21 19:48:39 UTC
Post by Eduardo
Hello guys, I'm using DAO 3.6 with an Access 2003 database in a program.
I would like to generate recordsets ordered by the second (or third)
letter of a text field of a table. Is it possible?
Air code:

SELECT * FROM Customers ORDER BY Mid(FullName, 2)

Needless to say this would be slow because a temporary index will be
created.
Eduardo
2011-08-22 04:22:46 UTC
Post by Eduardo
Hello guys, I'm using DAO 3.6 with an Access 2003 database in a program.
I would like to generate recordsets ordered by the second (or third)
letter of a text field of a table. Is it possible?
Thanks.
OK, guys. I wanted to go ahead with the work, and what I did was to
create additional fields without the first letters.

It's a database with fixed data, so there are no data additions by the
program. It's a dictionary.

I need to consult 6 fields in a table that hold different
transliterations of Greek words, and I need to find a word by approximation.

So I added additional fields for each of the 6 fields (the
transliterations): one without the first letter, another without the
first 2 letters, and so on up to one without the first 9 letters. And I
filled the new data using a routine.

Then, for consulting, I build a number of recordsets, each one ordered
by a different field. There are about 60 recordsets.

But if your suggestion works, I can delete all the additional fields and
generate the recordsets by SQL.

What do you think is better? (I have it already working)

Thanks.
Tony Toews
2011-08-22 06:52:06 UTC
Post by Eduardo
It's a database with fixed data, so there are no data additions by the
program. It's a dictionary.
I need to consult 6 fields in a table that hold different
transliterations of Greek words, and I need to find a word by approximation.
So I added additional fields for each of the 6 fields (the
transliterations): one without the first letter, another without the
first 2 letters, and so on up to one without the first 9 letters. And I
filled the new data using a routine.
Did you index each of those additional 9*6 fields?

How many records in your table?
Post by Eduardo
Then, for consulting, I build a number of recordsets, each one ordered
by a different field. There are about 60 recordsets.
But if your suggestion works, I can delete all the additional fields and
generate the recordsets by SQL.
What do you think is better? (I have it already working)
It all depends on your speed requirements which depends on how many
records you have.

On the one hand, if it's a few thousand records and the Access database
file is on your local hard drive, then a full table scan on the fields
might be quite acceptable, as this might only be 5 or 10 MB in size.

On the other hand, if it's 50K records on a network, then you're
looking at as much as 100 times as long to do the same search.

On the third hand if you already have it working then don't muck with
it. <smile>

That said, stuff can happen. So if the users can enter new words and
you fill in these additional fields and something happens, you might
get data that is incomplete. Therefore you should have a routine that
does a bulk update of all these fields.

And if you are going to do a bulk update then you are strongly urged
to delete the indexes, do the bulk update and recreate the indexes.
Things will likely go much, much faster. There is DAO collections
code that you can use to delete/recreate the indexes.
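Something along these lines (air code; the table and index names are
made up):

Dim db As DAO.Database, tdf As DAO.TableDef, idx As DAO.Index
Set db = DBEngine.OpenDatabase("C:\Dictionary.mdb")
Set tdf = db.TableDefs("Words")
tdf.Indexes.Delete "idxTranslit1"   'drop the index before the bulk update
'... run the bulk update here ...
Set idx = tdf.CreateIndex("idxTranslit1")
idx.Fields.Append idx.CreateField("Translit1")
tdf.Indexes.Append idx              'recreate it afterwards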

Finally I don't care for the idea of 6 virtually identical fields in
one table. You'd be better off having one field with six rows in
another table. That would simplify the searching. Maybe. Depending.
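Sketched with made-up names: a child table Transliterations(WordID,
Translit) holding six rows per word, so a single query covers all
variants:

SELECT WordID, Translit
FROM Transliterations
WHERE Translit LIKE "anup*"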

Tony
--
Tony Toews, Microsoft Access MVP
Tony's Main MS Access pages - http://www.granite.ab.ca/accsmstr.htm
Tony's Microsoft Access Blog - http://msmvps.com/blogs/access/
For a convenient utility to keep your users FEs and other files
updated see http://www.autofeupdater.com/
Eduardo
2011-08-22 07:37:59 UTC
Post by Tony Toews
Post by Eduardo
It's a database with fixed data, so there are no data additions by the
program. It's a dictionary.
I need to consult 6 fields in a table that hold different
transliterations of Greek words, and I need to find a word by approximation.
So I added additional fields for each of the 6 fields (the
transliterations): one without the first letter, another without the
first 2 letters, and so on up to one without the first 9 letters. And I
filled the new data using a routine.
Did you index each of those additional 9*6 fields?
I tried, but Access told me that I can't have more than 32 indexes, so I
removed all the indexes.
Post by Tony Toews
How many records in your table?
~8600
Post by Tony Toews
Post by Eduardo
Then, for consulting, I build a number of recordsets, each one ordered
by a different field. There are about 60 recordsets.
But if your suggestion works, I can delete all the additional fields and
generate the recordsets by SQL.
What do you think is better? (I have it already working)
It all depends on your speed requirements which depends on how many
records you have.
I need more speed.
I realized that I also needed comparisons without the last letters, so
I added a new loop level discarding letters from the end.
iRecordSet.FindFirst iField & " LIKE '" & Left(iWord...

And with long words it's taking too much time, about 3 minutes.
Post by Tony Toews
On the one hand if it's a few thousand records and the Access database
file is on your local hard drive then a full table scan on the fields
might be quite acceptable as this might only be 5 or 10 Mb in size.
On the other hand, if it's 50K records on a network, then you're
looking at as much as 100 times as long to do the same search.
It's a local database on the HD.
Post by Tony Toews
On the third hand if you already have it working then don't muck with
it. <smile>
As I said, now I see that I need more speed.
Post by Tony Toews
That said stuff can happen. So if the users can enter new words and
you fill in these additional fields and something happens you might
get data that is incomplete. Therefore you should have a routine that
does a bulk update of all these fields.
There are no new record entries; it's a fixed dictionary of Bible words.
Post by Tony Toews
And if you are going to do a bulk update then you are strongly urged
to delete the indexes, do the bulk update and recreate the indexes.
Things will likely go much, much faster. There is DAO collections
code that you can use to delete/recreate the indexes.
I don't have much experience with "complex" database operations,
so I need to learn anything that could be useful to do this task better.
Post by Tony Toews
Finally I don't care for the idea of 6 virtually identical fields in
one table. You'd be better having one field with six rows in
another table. That would simplify the searching. Maybe. Depending.
And Access would let me add the indexes... (there would be fewer than
32 in each table).

But if I go for the SQL Mid(FieldName,2), could it be faster?
Post by Tony Toews
Tony
Thank you.
Schmidt
2011-08-22 13:00:41 UTC
Post by Eduardo
And Access would let me add the indexes... (there would
be fewer than 32 in each table).
But if I go for the SQL Mid(FieldName,2), could it be faster?
No - in this case (using builtin SQL-Functions) there's always
at least one full-table scan necessary.

But didn't we have this topic already "on the table"
(the thread, where you were asking "find a similar word")?

In short, if there are only about 8000 Word-entries,
which need "enhanced, unsharp scanning", then you
would be faster, if you handle these wordscans
not directly per "normal DB-Functions and table-scans"
(in case, the DB-engine in question does not
offer builtin "unsharp search-functions", which
are more capable than for example 'Like').

So the best approach would be, to move your
8000 words over into a String-Array and perform
your comparisons on this (or also additional,
differently ordered or shortened StringArrays)
simply by index, directly in your own App-Loops.
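A minimal sketch of that load step (assuming a JET table Words with a
text field Translit1):

Dim db As DAO.Database, rs As DAO.Recordset
Dim Words() As String, i As Long
Set db = DBEngine.OpenDatabase("C:\Dictionary.mdb")
Set rs = db.OpenRecordset("SELECT Translit1 FROM Words", dbOpenSnapshot)
rs.MoveLast: rs.MoveFirst               'populate RecordCount
ReDim Words(0 To rs.RecordCount - 1)
Do Until rs.EOF
    Words(i) = rs!Translit1 & ""        '& "" guards against Null
    i = i + 1
    rs.MoveNext
Loop
rs.Close

After that, all the "unsharp" comparisons run against Words() in memory.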

The problem is an old one - and your current
approach (same word-content in different
"shapes", shortened from right, shortened from
left, etc. ... and then multiple comparison-loops)
is the one, which perhaps everyone starts with.

But eventually one will find that this "naive"
approach might be sufficient for a small number
of words, but multiple array-scans are costly in
terms of performance and need a lot of memory
when the "unsharp requirements" grow - or the
wordcount gets higher.

And that's where all these more advanced,
unsharp algorithms come into play (Metaphone,
Ratcliff & Co.).

Have you already tried one of them against your
set of words?

Or could you give a list of words (maybe 10-20 should
be sufficient) - and then a few search terms,
for which you define the word (out of your
10-20 word list) that should be returned - just so that
we can see how tolerant/unsharp the algorithm should be...

Olaf
Tony Toews
2011-08-22 19:31:51 UTC
Post by Schmidt
So the best approach would be, to move your
8000 words over into a String-Array and perform
your comparisons on this (or also additional,
differently ordered or shortened StringArrays)
simply by index, directly in your own App-Loops.
Agreed.

I almost always, 99.999% of the time, think in terms of database
because I'm locked into that mindset after 30+ years and 15 years in
Access.

Tony
--
Tony Toews, Microsoft Access MVP
Tony's Main MS Access pages - http://www.granite.ab.ca/accsmstr.htm
Tony's Microsoft Access Blog - http://msmvps.com/blogs/access/
For a convenient utility to keep your users FEs and other files
updated see http://www.autofeupdater.com/
Eduardo
2011-08-22 23:55:34 UTC
Hello Olaf.
Post by Schmidt
Post by Eduardo
And Access would let me add the indexes... (there would
be fewer than 32 in each table).
But if I go for the SQL Mid(FieldName,2), could it be faster?
No - in this case (using builtin SQL-Functions) there's always
at least one full-table scan necessary.
OK.
Post by Schmidt
But didn't we have this topic already "on the table"
(the thread, where you were asking "find a similar word")?
Yes and no. At that time, it was to compare against a small set of words
that can range from 5 to 50 words, which I have in an array in memory.
They are also Bible words, but that case was for words used in a
verse or several verses.
That is already working fine.

Now, I'm working again on that program, and I found that I also need to
find a word when no verse is specified, and that's the difference,
because I need to search the whole dictionary.
Post by Schmidt
In short, if there are only about 8000 Word-entries,
which need "enhanced, unsharp scanning", then you
would be faster, if you handle these wordscans
not directly per "normal DB-Functions and table-scans"
(in case, the DB-engine in question does not
offer builtin "unsharp search-functions", which
are more capable than for example 'Like').
Yes, I remember that you suggested going to an SQLite database...
The only type of database I have worked with so far is Access,
so I'm trying to stay with Access unless I see that I really
must migrate to another one (and also learn all the things related to
distribution and deployment).
Post by Schmidt
So the best approach would be, to move your
8000 words over into a String-Array and perform
your comparisons on this (or also additional,
differently ordered or shortened StringArrays)
simply by index, directly in your own App-Loops.
This is something that I'm considering, but in the end they will be 8000
* 60, because of all the variants.
In fact there are two tables: ~8000 are the Hebrew words, and the
Greek words are ~5000.

This will take time: loading ~13,000 x 60 records into memory at the
start of the program.

In that case I would search each vector with a successive
approximations routine.
Post by Schmidt
The problem is an old one - and your current
approach (same word-content in different
"shapes", shortened from right, shortened from
left, etc. ... and then multiple comparison-loops)
is the one, which perhaps everyone starts with.
But eventually one will find that this "naive"
approach might be sufficient for a small number
of words, but multiple array-scans are costly in
terms of performance and need a lot of memory
when the "unsharp requirements" grow - or the
wordcount gets higher.
In this case, the dictionary won't ever change.
Post by Schmidt
And that's where all these more advanced,
unsharp algorithms come into play (Metaphone,
Ratcliff & Co.).
Have you already tried one of them against your
set of words?
Humm, no. What I'm doing is shortening from the right, shortening from
the left (also varying: the search word shortened 2 characters, the
database word shortened one, and so on), and doing a "FindFirst" on the
record in question. If I find a match, then I compare that word (not the
shortened one but the complete word) with the word that I'm searching
for (also the complete one) with an algorithm based on "Levenshtein
Distance".
Then I move first to the previous and later to the next contiguous
records until I reach a number of words that didn't match more than 50%,
then I stop (and go to the next search cycle in the loop).

At the end, I order the list of words that were found by the percentage
of coincidence, and if there are more than 20, I keep only the 20 most
relevant.
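The percentage is the Levenshtein distance normalized against the
longer word:

similarity = (longerLen - distance) / longerLen

so, for example, a distance of 2 between two words whose longer one has
12 letters gives (12 - 2) / 12, about 83%.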
Post by Schmidt
Or could you give a list of words (maybe 10-20 should
be sufficient) - and then a few search terms,
for which you define the word (out of your
10-20 word list) that should be returned - just so that
we can see how tolerant/unsharp the algorithm should be...
Olaf
OK, I'll copy some results from what I've made.
Suppose that the user wants to find the word "anupokritos", but he or
she doesn't remember how to spell it, and writes it "like it sounds":

******************
word: anaupocritos (Greek)
It took 159,06 seconds

Word ID: G 505 anupokritos percentage: 82
Word ID: G 506 anupotaktos percentage: 65
Word ID: G 5273 hupokrites percentage: 58
Word ID: G 379 anapologetos percentage: 58
Word ID: G 368 anantirrhetos percentage: 58
Word ID: G 369 anantirrhetos percentage: 58
Word ID: G 178 akatakritos percentage: 58
Word ID: G 799 Asugkritos percentage: 57
Word ID: G 5580 pseudochristos percentage: 56
Word ID: G 319 anagnorizomai percentage: 53
Word ID: G 338 anaitios percentage: 50
Word ID: G 402 anachoreo percentage: 50
Word ID: G 87 adiakritos percentage: 50
Word ID: G 361 anamartetos percentage: 50
Word ID: G 526 apallotrioo percentage: 50
Word ID: G 3480 Nazoraios percentage: 50
Word ID: G 4182 polupoikilos percentage: 50
Word ID: G 125 Aiguptos percentage: 50
Word ID: G 378 anapleroo percentage: 50
Word ID: G 377 anapipto percentage: 50
******************

Some more samples:

******************
word: neumaticós (Greek)
It took 124,41 seconds

Word ID: G 4153 pneumatikos percentage: 81
Word ID: G 4152 pneumatikos percentage: 81
Word ID: G 1193 dermatinos percentage: 66
Word ID: G 3020 Leuitikos percentage: 60
Word ID: G 4984 somatikos percentage: 60
Word ID: G 4985 somatikos percentage: 60
Word ID: G 1444 Hebraikos percentage: 60
Word ID: G 2122 eukairos percentage: 57
Word ID: G 2121 eukairos percentage: 57
Word ID: G 2739 kaumatizo percentage: 57
Word ID: G 3524 nephaleos percentage: 55
Word ID: G 5538 chrematismos percentage: 54
Word ID: G 2773 kermatistes percentage: 54
Word ID: G 3566 numphios percentage: 53
Word ID: G 3512 neoterikos percentage: 50
Word ID: G 2452 Ioudaikos percentage: 50
Word ID: G 2451 Ioudaikos percentage: 50
Word ID: G 2441 himatismos percentage: 50
Word ID: G 1054 Galatikos percentage: 50
Word ID: G 4262 probatikos percentage: 50
******************
word: teofenustos (Greek)
It took 152,21 seconds

Word ID: G 2315 theopneustos percentage: 65
Word ID: G 3504 neophutos percentage: 60
Word ID: G 1675 Hellenistes percentage: 54
Word ID: G 1354 Dionusios percentage: 53
Word ID: G 5118 tosoutos percentage: 53
Word ID: G 5108 toioutos percentage: 53
Word ID: G 5082 telikoutos percentage: 51
Word ID: G 4339 proselutos percentage: 51
Word ID: G 5533 chreopheiletes percentage: 50
Word ID: G 2459 Ioustos percentage: 50
Word ID: G 2312 theodidaktos percentage: 49
Word ID: G 2180 Ephesios percentage: 48
Word ID: G 1721 emphutos percentage: 48
******************
word: alos (Greek)
It took 16,13 seconds

Word ID: G 243 allos percentage: 97
Word ID: G 247 allos percentage: 97
Word ID: G 2570 kalos percentage: 79
Word ID: G 4535 salos percentage: 79
Word ID: G 2573 kalos percentage: 77
Word ID: G 216 alalos percentage: 66
Word ID: G 5194 hualos percentage: 66
Word ID: G 358 analos percentage: 66
Word ID: G 527 apalos percentage: 65
Word ID: G 836 aulos percentage: 55
Word ID: G 259 halosis percentage: 54
Word ID: G 3171 megalos percentage: 54
Word ID: G 806 asphalos percentage: 54
Word ID: G 250 aloe percentage: 53
Word ID: G 301 Amos percentage: 50
******************
word: Jehová Nisi (Hebrew)
It took 215,4 seconds

Word ID: H 3071 Yehovah nicciy percentage: 72
Word ID: H 3070 Yehovah yireh percentage: 54
******************
word: Jehová Sitquenu (Hebrew)
It took 292,88 seconds

Word ID: H 3072 Yehovah tsidqenuw percentage: 72
******************
word: Elojim (Hebrew)
It took 59,43 seconds

Word ID: H 430 'elohiym percentage: 97
Word ID: H 440 'Elowniy percentage: 53
******************
word: nefes (Hebrew)
It took 36,44 seconds

Word ID: H 5315 nephesh percentage: 80
Word ID: H 657 'ephec percentage: 77
Word ID: H 5311 nephets percentage: 62
Word ID: H 5233 nekec percentage: 62
Word ID: H 5298 Nepheg percentage: 60
Word ID: H 5309 nephel percentage: 60
Word ID: H 5316 nepheth percentage: 60
Word ID: H 660 'eph`eh percentage: 57
Word ID: H 7516 rephesh percentage: 49
Word ID: H 2665 chephes percentage: 49
******************

In all cases the desired word was the first one in the list.

Thanks.
Schmidt
2011-08-23 07:55:11 UTC
Post by Eduardo
Post by Schmidt
In short, if there are only about 8000 Word-entries,
which need "enhanced, unsharp scanning", then you
would be faster, if you handle these wordscans
not directly per "normal DB-Functions and table-scans"
(in case, the DB-engine in question does not
offer builtin "unsharp search-functions", which
are more capable than for example 'Like').
Yes, I remember that you suggested going to an SQLite database...
The only type of database I have worked with so far is Access,
so I'm trying to stay with Access unless I see that I really
must migrate to another one (and also learn all the things related to
distribution and deployment).
FWIW:
The RichClient4 wraps SQLite in an "ADO-like-way",
so that you can work with Connection- and Recordset-
Objects, without changing too much of your "ADO-thinking"
and the way you were coding with Recordsets.
There's also an Import-Class contained, to move
Data over from JET or other "ADO-compatible-DBs".

As for redistribution:
For RC4-(DB-)functionality (in case one does not need the
WebKit-Browser), there are only 3 Dlls to deploy:
DirectCOM.dll
vb_cairo_sqlite.dll
vbRichClient4.dll (only this one needs to be registered)

That's all there is (no MS-COM-dependencies in this toolset,
no problems on Win7).
Post by Eduardo
Post by Schmidt
So the best approach would be, to move your
8000 words over into a String-Array and perform
your comparisons on this (or also additional,
differently ordered or shortened StringArrays)
simply by index, directly in your own App-Loops.
This is something that I'm considering, but in the end they
will be 8000 * 60, because of all the variants.
In fact there are two tables: ~8000 are the Hebrew words, and the
Greek words are ~5000.
This will take time: loading ~13,000 x 60 records into memory at the
start of the program.
In that case I would search each vector with a successive
approximations routine.
Woah, didn't expect 60 different variants of
the same (differently "shaped") Word-expression.

No wonder, that this gets slow, if you go through
each of these 60 lists.
Post by Eduardo
Post by Schmidt
And that's where all these more advanced,
unsharp algorithms come into play (Metaphone,
Ratcliff & Co.).
Have you already tried one of them against your
set of words?
Humm, no.
Then please try the small Demo I made from your
supplied scan-examples.

(after a first test, maybe fill-up the G- and H-Arrays
with your real DB-Data first - and this will only need
the "real word" - not all of your 60 variants).

Performance is below 1 second when running in the IDE -
and after native compilation (with enhanced compiler-
options checked), a scan on both lists is below 50msec,
so that you can type your search-term and get the
answers "live and in realtime".
(the above for the sum of two scans: against 8000 entries
in the Greek- and 8000 entries in the Hebrew-Array as well).

So, yeah - check it out: www.datenhaus.de/Downloads/Ratcliff.zip

The included cRatcliff-Class contains a PreProcessor-
Routine, which currently (although being generic)
is "tuned", to respect some "special sounding stuff"
from the German language (Umlauts äöü, or 'ck' sounding
like 'k', ... etc.)
I'm pretty sure you can get even better results
from cRatcliff when you include additional PreProcessor-
routines (or enhance the existing one with some cases)
which respect the "soundings" of Greek or Hebrew
a bit better.

In case you want to see an SQLite-based version of
this demo (to get an impression of what this would
look like), then please say so ... the SQLite-wrapper
supports Ratcliff as a builtin SQL-function - but it
also allows you to define and add your own (VB6-implemented)
SQL-functions (in case you want to use, for example,
your own "tuned" Ratcliff-variants directly in your
"Selects").

Olaf
Eduardo
2011-08-23 20:41:57 UTC
Post by Schmidt
Then please try the small Demo I made from your
supplied scan-examples.
Hello Olaf,

The concept of preprocessing and comparison is quite similar to what I'm doing.

I also preprocess the words, so I have stored special fields in the
database ready for "direct" comparison.

In my case, I removed all the apostrophes, accents, circumflexes,
tildes, etc. from the letters of the words, and I also preprocess the
word to search in the same way.

Then I use for comparison a routine like this:

'********** code************
Public Function WordsSimilarity(ByVal nWord1 As String, _
ByVal nWord2 As String, Optional nIgnoreCaps As _
Boolean = True) As Single

Dim iDifferencesCount As Long
Dim iLen1 As Long
Dim iLen2 As Long
Dim iLongerWord As Long

iLen1 = Len(nWord1)
iLen2 = Len(nWord2)

If iLen2 > iLen1 Then
iLongerWord = iLen2
Else
iLongerWord = iLen1
End If

If nIgnoreCaps Then
nWord1 = LCase(nWord1)
nWord2 = LCase(nWord2)
End If

iDifferencesCount = LevenshteinDistance(nWord1, nWord2)

WordsSimilarity = (iLongerWord - iDifferencesCount) / iLongerWord
End Function

'********************************
'*** Compute Levenshtein Distance
'********************************

Private Function LevenshteinDistance(ByVal s As String, _
    ByVal t As String) As Integer
Dim d() As Integer ' matrix
Dim M As Integer ' length of t
Dim n As Integer ' length of s
Dim i As Integer ' iterates through s
Dim J As Integer ' iterates through t
Dim s_i As String ' ith character of s
Dim t_j As String ' jth character of t
Dim cost As Integer ' cost

' Step 1
n = Len(s)
M = Len(t)
If n = 0 Then
LevenshteinDistance = M
Exit Function
End If
If M = 0 Then
LevenshteinDistance = n
Exit Function
End If
ReDim d(0 To n, 0 To M) As Integer

' Step 2
For i = 0 To n
d(i, 0) = i
Next i

For J = 0 To M
d(0, J) = J
Next J

' Step 3
For i = 1 To n
s_i = Mid$(s, i, 1)

' Step 4
For J = 1 To M
t_j = Mid$(t, J, 1)

' Step 5
If s_i = t_j Then
cost = 0
Else
cost = 1
End If

' Step 6
d(i, J) = Minimum(d(i - 1, J) + 1, d(i, J - 1) + 1, _
    d(i - 1, J - 1) + cost)
Next J
Next i

' Step 7
LevenshteinDistance = d(n, M)
Erase d
End Function

'*******************************
'*** Get minimum of three values
'*******************************

Private Function Minimum(ByVal a As Integer, ByVal b As Integer, _
    ByVal c As Integer) As Integer
Dim mi As Integer

mi = a
If b < mi Then
mi = b
End If
If c < mi Then
mi = c
End If

Minimum = mi
End Function
'****************************
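For example:

Debug.Print WordsSimilarity("anaupocritos", "anupokritos")
' prints roughly 0.83 (Levenshtein distance 2 over the longer length 12)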

But, the "problem" was a wrong idea that I had, because I thought that
performing such a comparison in 8000 records would be too slow. I was
plain wrong.

Now I tried comparing each and every one of the 8000 words with a simple
loop, and it takes just 4.5 seconds in the IDE and 1.3 seconds compiled.

So I can delete all the additional fields and tables that I created to
"speed up" the search process.
Schmidt
2011-08-23 22:48:19 UTC
Post by Eduardo
Post by Schmidt
Then please try the small Demo I made from your
supplied scan-examples.
The concept of preprocess and comparison is quite
similar to what I'm doing.
Yes, I see.
Post by Eduardo
I also preprocess the words, so I have stored special fiels in
the database ready for "direct" comparison.
In my case, I removed all the apostrophes, accents, circumflexes,
tildes, etc. From the letters of the words, and I also preprocess
the word to search in the same way.
[similarity-checks, based on preprocessing and Levenshtein-distances]
Post by Eduardo
...
...
But, the "problem" was a wrong idea that I had, because I thought that
performing such a comparison in 8000 records would be too slow. I was
plain wrong.
Now I tried comparing all and every one of the 8000 words with a simple
loop, and I takes just 4,5 seconds in IDE and 1,3 seconds compiled.
If these 1.3 seconds are enough for your purposes,
then you should perhaps stick with your current
implementation, which you know best.
Just keep in mind that the Ratcliff/Obershelp
implementation in cRatcliff is about 100 times
faster than what you currently achieve in your
"non-SafeArray-tuned" implementation (already
fulfilling "realtime-typing-requirements") - but
perhaps even more important would be a longer
comparison-session between the different results
the two algorithms throw at you.

Nothing's perfect in this regard, but maybe
the one or the other gives slightly more
correct answers, especially in the "grey zones".
You will have to test that against your whole
Greek or Hebrew sets of course (maybe using
the Form in my Demo, which already offers two
ListBoxes, to compare also the output of two
different algorithms instead of two different
"Base-Comparison-Sets").

There's also something like a "combined algo"
out there (Levenshtein + Ratcliff/Obershelp),
by Atul Brad Buono, Donald Lessau - and then
finally tuned by Mike Williams - and similar in
performance to what the cRatcliff-Class has
to offer (although not yet coming with
preprocessing, but that could be added).
http://www.msvisual.com/viewtopic.php?t=27689&postdays=0&postorder=asc&start=30

So, yeah - happy testing/coding and good luck
with all that... :-)

Olaf
Eduardo
2011-08-24 17:22:58 UTC
Post by Schmidt
If these 1.3 seconds are enough for your purposes,
then you should perhaps stick with your current
implementation, which you know best.
The first thing I did was to load all the words into vectors in memory
instead of using a recordset. The speed improved with that (obviously,
as expected).
It loads the 13,000 records in a fraction of a second at the start of
the program.

In my case I perform three comparisons for each word, because I have
three different transliterations of the words in the dictionary.
(Previously I had 6, but I discarded three).

What is a transliteration?
(I don't know if Greek characters will be posted in the newsgroup, but
I'll try)

A Greek word:
ἀνυπόκριτος

The Greek and Hebrew languages (and also Arabic, Chinese, Japanese,
etc.) use another character set to encode words, so Greek uses
characters like ἀνυπ, and Hebrew characters like אֲבַטִּחִים
(I'm not assuming that you don't know this, I'm just explaining it to
anyone reading this thread)

OK, then if someone not familiar with these "strange" characters wants
to read something in Greek or Hebrew, some people developed ways to
transliterate, making a correspondence between the Greek or Hebrew
sounds (or characters) and our characters. So ἀ can be written as "a"
and π as "p" (I don't know if this is accurate, it's just an example).

But different people devised different ways to transliterate.
In short, I'm now using one Spanish transliteration and two English
transliterations to compare.

So the word of the example, which was ἀνυπόκριτος, is transliterated as
anupókritos in Spanish, as anupokritos in English transliteration 1,
and as anypokritos in English transliteration 2.

But the pronunciation in Spanish sounds like "anaupocritos".

Some users of the program may be familiar with one of the
transliterations, and most of them won't be familiar with any of them.
And I want to help them to find the word that they want to study.
Post by Schmidt
Just keep in mind that the Ratcliff/Obershelp
implementation in cRatcliff is about 100 times
faster than what you currently achieve in your
"non-SafeArray-tuned" implementation (already
fulfilling "realtime-typing-requirements") - but
perhaps even more important would be a longer
comparison-session between the different results
the two algorithms throw at you.
Nothing's perfect in this regard, but maybe
the one or the other gives slightly more
correct answers, especially in the "grey zones".
You will have to test that against your whole
Greek or Hebrew sets of course (maybe using
the Form in my Demo, which already offers two
ListBoxes, to compare also the output of two
different algorithms instead of two different
"Base-Comparison-Sets").
There's also something like a "combined algo"
out there (Levenshtein + Ratcliff/Obershelp),
by Atul Brad Buono, Donald Lessau - and then
finally tuned by Mike Williams - and similar in
performance to what the cRatcliff-Class has
to offer (although not yet coming with
preprocessing, but that could be added).
http://www.msvisual.com/viewtopic.php?t=27689&postdays=0&postorder=asc&start=30
Well, after I loaded the words into arrays in memory, I tested Mike
Williams's routine, and it was faster than mine. I also tested yours,
which was also faster than mine and even a bit faster than Mike
Williams's. And compiled, it performs really fast.

To get the results for "anaupocritos" it takes just 0.11 seconds on my
machine, for "Jehova sitquenu" it takes 0.16, and for "alos" 0.08 seconds.

The results are similar, although there are some small variations. I
decided to use yours.

I removed all the Select Case handling in the PreProcessing.
Post by Schmidt
So, yeah - happy testing/coding and good luck
with all that... :-)
Olaf
Thanks a lot.
Tony Toews
2011-08-22 19:33:41 UTC
Post by Eduardo
Post by Tony Toews
Did you index each of those additional 9*6 fields?
I tried, but Access told me that I can't have more than 32 indexes, so I
removed all the indexes.
Oops, I forgot about that problem.
Post by Eduardo
Post by Tony Toews
How many records in your table?
~8600
It's a local database on the HD.
Post by Tony Toews
That said stuff can happen. So if the users can enter new words and
you fill in these additional fields and something happens you might
get data that is incomplete. Therefore you should have a routine that
does a bulk update of all these fields.
There are no new record entries; it's a fixed dictionary of Bible words.
That helps.

Schmidt has the right idea. Go with arrays in memory. For 8800
records that'll be just fine. 8 million records, maybe not, but
these days with 2 GB of available RAM even that might be doable.

Tony
--
Tony Toews, Microsoft Access MVP
Tony's Main MS Access pages - http://www.granite.ab.ca/accsmstr.htm
Tony's Microsoft Access Blog - http://msmvps.com/blogs/access/
For a convenient utility to keep your users FEs and other files
updated see http://www.autofeupdater.com/
Eduardo
2011-08-23 00:05:25 UTC
Post by Tony Toews
Post by Eduardo
Post by Tony Toews
Did you index each of those additional 9*6 fields?
I tried, but Access told me that I can't have more than 32 indexes, so I
removed all the indexes.
Oops, I forgot about that problem.
Post by Eduardo
Post by Tony Toews
How many records in your table?
~8600
It's a local database on the HD.
Post by Tony Toews
That said stuff can happen. So if the users can enter new words and
you fill in these additional fields and something happens you might
get data that is incomplete. Therefore you should have a routine that
does a bulk update of all these fields.
There are no new record entries; it's a fixed dictionary of Bible words.
That helps.
Schmidt has the right idea. Go with arrays in memory. For 8800
records that'll be just fine. 8 million records, maybe not, but
these days with 2 GB of available RAM even that might be doable.
Tony
OK, but there are 8000 (in fact 13000, because there are two tables) * 60,
not just 8000.

I think I should do this in the background, and if the user wants to
search for a word immediately after he/she starts the program, I'll have
to say that the database is still not ready. It also has some
programming complications because of the use of DoEvents.
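Something like this inside the load loop (just a sketch; the names are
made up):

Dim n As Long
Do Until rs.EOF
    Words(n) = rs!Translit1 & ""
    n = n + 1
    If n Mod 500 = 0 Then DoEvents  'keep the UI responsive while loading
    rs.MoveNext
Loop
mDictionaryReady = True             'a flag checked before allowing a search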

But this is the next approach I'll explore, because it's taking too much
time now.

Another thing to try is to split the fields into several tables so I can
index every one. Do you think it's worth trying?
Eduardo
2011-08-23 03:35:06 UTC
Post by Eduardo
Another thing to try is to split the fields into several tables so I can
index every one. Do you think it's worth trying?
I did that. It improved the speed a lot.

Word               before           now

anaupocritos       159,06 seconds   26,09 seconds
neumaticós         124,41 seconds   19,71 seconds
teofenustos        152,21 seconds   27,32 seconds
alos                16,13 seconds    4,03 seconds
Jehová Nisi        215,4 seconds    25,57 seconds
Jehová Sitquenu    292,88 seconds   24,43 seconds
Elojim              59,43 seconds    7,28 seconds
nefes               36,44 seconds    4,56 seconds

This is much more acceptable.
I also added a fast-search option that speeds things up by a factor of
~3, and in most cases it finds the word. I perform fewer iterations in
that search.

It's still not as fast as I would like (considering slower machines),
but it could be acceptable.
Tony Toews
2011-08-24 19:27:14 UTC
Post by Eduardo
Post by Eduardo
Another thing to try is to split the fields into several tables so I can
index every one.
I did that. It improved the speed a lot.
Yes, but not as fast as I would've thought. Hmm, have you compacted
the database after you built the indexes? I'd also be curious to see
if the Access database file decreases in size dramatically after the
compact.

Tony
--
Tony Toews, Microsoft Access MVP
Tony's Main MS Access pages - http://www.granite.ab.ca/accsmstr.htm
Tony's Microsoft Access Blog - http://msmvps.com/blogs/access/
For a convenient utility to keep your users FEs and other files
updated see http://www.autofeupdater.com/
Eduardo
2011-08-25 03:42:23 UTC
Post by Tony Toews
Post by Eduardo
Post by Eduardo
Another thing to try is to split the fields into several tables so I can
index every one.
I did that. It improved the speed a lot.
Yes, but not as fast as I would've thought. Hmm, have you compacted
the database after you built the indexes? I'd also be curious to see
if the Access database file decreases in size dramatically after the
compact.
Yes, I compacted the database.

But those numbers were with my complicated routine "to speed up" the
search, which obviously wasn't working well.

Comparing all the records in the table took about 4 seconds with an
indexed field.

But now I'm loading the words into vectors in memory, which takes even
less time, and I'm using a faster algorithm (provided by Olaf) to
perform the comparisons, which takes less time still; compiled, it's now
in the range of 0.05 to 0.15 seconds.
Schmidt
2011-08-25 05:17:01 UTC
now it's in the range of 0.05 to 0.15
seconds (compiled).
Just out of interest, did you enable all the
checks in the extended compiler-options?

If not, this should bring another factor of 2-3
in speedup for this kind of algorithm.

Olaf
Eduardo
2011-08-25 07:22:23 UTC
Post by Schmidt
now it's in the range of 0.05 to 0.15
seconds (compiled).
Just out of interest, did you enable all the
checks in the extended compiler-options?
If not, this should bring another factor of 2-3
in speedup for this kind of algorithm.
Olaf
No... I get an error 9 if I enable all the options.

BTW, I changed the algorithm; it's not that fast anymore (but still fast).

I had some difficulty finding a word that I thought should appear,
so I combined your preprocessing (but modified) with a Levenshtein
Distance function (more optimized than the one I had before).

In the preprocessing, I kept the doubled-character removal part, but
also added consecutive-consonant removal (if a consonant is followed by
more consonants, only the initial consonant remains).
Anyway, every removal also has a weight, but of just two percent.

I'm testing it, and it seems to work fine so far.
Compiled, it now takes about 0.3 seconds to find a word.

If you want to see the code I'll post it (I just need to change some
variables that are named in Spanish).
Eduardo
2011-08-26 03:33:21 UTC
Olaf, here is my class:

Option Explicit

'based on Ratcliff-Implementation with additional PreProcessing
'Author: Olaf Schmidt (2008)
'modified by Eduardo (2011)

Private Type SAFEARRAY1D
cDims As Integer
fFeatures As Integer
cbElements As Long
cLocks As Long
pvData As Long
cElements As Long
lLbound As Long
End Type

Private Declare Sub BindArray Lib "kernel32" Alias "RtlMoveMemory" _
(PArr() As Any, PSrc&, Optional ByVal cb& = 4)
Private Declare Sub ReleaseArray Lib "kernel32" Alias "RtlMoveMemory" _
(PArr() As Any, Optional PSrc& = 0, Optional ByVal cb& = 4)

Private Declare Function CharLowerBuffA Lib "user32" _
(lpsz As Any, ByVal cchLength&) As Long
Private Declare Function CharLowerBuffW Lib "user32" _
(lpsz As Any, ByVal cchLength&) As Long
Private aLowChars%(&H8000 To &H7FFF)

Private Arr1() As Integer, S1Arr() As Integer, saS1Arr As SAFEARRAY1D
Private Arr2() As Integer, S2Arr() As Integer, saS2Arr As SAFEARRAY1D
Private mDeletedChars1() As Integer, S3Arr() As Integer, saS3Arr As SAFEARRAY1D
Private mDeletedChars2() As Integer, S4Arr() As Integer, saS4Arr As SAFEARRAY1D

Private Sub Class_Initialize()
saS1Arr.cDims = 1: saS1Arr.cbElements = 2 '2 Bytes per Element
BindArray S1Arr, VarPtr(saS1Arr)

saS2Arr.cDims = 1: saS2Arr.cbElements = 2 '2 Bytes per Element
BindArray S2Arr, VarPtr(saS2Arr)

saS3Arr.cDims = 1: saS3Arr.cbElements = 2 '2 Bytes per Element
BindArray S3Arr, VarPtr(saS3Arr)

saS4Arr.cDims = 1: saS4Arr.cbElements = 2 '2 Bytes per Element
BindArray S4Arr, VarPtr(saS4Arr)

ReDim Arr1(8192): ReDim Arr2(8192) 'preinitialize our LowerBufArrays

ReDim mDeletedChars1(8192): ReDim mDeletedChars2(8192) 'preinitialize

InitLowCharLUT
End Sub

Private Sub Class_Terminate()
ReleaseArray S1Arr 'resets S1Arr into its original, "virginal" state
ReleaseArray S2Arr 'resets S2Arr into its original, "virginal" state
ReleaseArray S3Arr 'resets S3Arr into its original, "virginal" state
ReleaseArray S4Arr 'resets S4Arr into its original, "virginal" state
End Sub


Public Function WordsSimilarity(S1 As String, S2 As String) As Long
Dim i As Long, l1 As Long, l2 As Long, D As Single
Dim iLonger As Long
Dim iEqualChars As Long
Dim iCoincidence As Single
Dim iDiferences As Long
Dim c As Long
Dim iIndexScanFromEnd As Long
Dim iLastDeletedEqualFromBegining As Long
Dim iNumberOfDeleted1 As Long
Dim iNumberOfDeleted2 As Long

l1 = Len(S1): l2 = Len(S2)
If l1 = 0 Or l2 = 0 Then Exit Function

'pretest for exact similarity
If l1 = l2 Then
If S1 = S2 Then WordsSimilarity = 100: Exit Function
End If

'make sure, we have enough space in our CompareBuffers
If UBound(Arr1) < l1 + 1 Then ReDim Preserve Arr1(l1 + 1)
If UBound(Arr2) < l2 + 1 Then ReDim Preserve Arr2(l2 + 1)
If UBound(mDeletedChars1) < l1 + 1 Then ReDim Preserve mDeletedChars1(l1 + 1)
If UBound(mDeletedChars2) < l2 + 1 Then ReDim Preserve mDeletedChars2(l2 + 1)

'prepare for fast preprocessing (map S1 and S2 to Integer-Arrays)
saS1Arr.pvData = StrPtr(S1): saS1Arr.cElements = l1 + 1
saS2Arr.pvData = StrPtr(S2): saS2Arr.cElements = l2 + 1
saS3Arr.cElements = l1 + 1
saS4Arr.cElements = l2 + 1

'preprocess the stringcontent into the real Buffers Arr1 and Arr2
' now we get the chars that have been deleted in the preprocessing into
' two arrays, mDeletedChars1 and mDeletedChars2
D = PreProcessing(S1Arr, Arr1, l1, mDeletedChars1, iNumberOfDeleted1)
D = D + PreProcessing(S2Arr, Arr2, l2, mDeletedChars2, iNumberOfDeleted2)

' test if the deleted chars were coincident (or at least some)
' for the two words, if so, then add the percentage again
iLastDeletedEqualFromBegining = 0
For c = 1 To iNumberOfDeleted1
If c > iNumberOfDeleted2 Then Exit For
If mDeletedChars1(c) = mDeletedChars2(c) Then
D = D - 0.04
iLastDeletedEqualFromBegining = c
End If
Next c

iIndexScanFromEnd = iNumberOfDeleted2 + 1
For c = iNumberOfDeleted1 To 1 Step -1
iIndexScanFromEnd = iIndexScanFromEnd - 1
If iIndexScanFromEnd < 1 Then Exit For
If c <= iLastDeletedEqualFromBegining Then Exit For
If iIndexScanFromEnd <= iLastDeletedEqualFromBegining Then Exit For
If mDeletedChars1(c) = mDeletedChars2(iIndexScanFromEnd) Then
D = D - 0.04
End If
Next c

If D > 0.12 Then D = 0.12 'limit the Replacement-Reduction
' (the replacement reduction limit was increased)

'reset the mapping
saS1Arr.pvData = 0: saS1Arr.cElements = 0
saS2Arr.pvData = 0: saS2Arr.cElements = 0
saS3Arr.pvData = 0: saS3Arr.cElements = 0
saS4Arr.pvData = 0: saS4Arr.cElements = 0

' the similarity is based on Levenshtein Distance
If l1 > l2 Then
iLonger = l1
Else
iLonger = l2
End If

iDiferences = LevDist03(Arr1, Arr2, l1, l2)
If iDiferences = 0 Then
WordsSimilarity = 100 - D * 100
Exit Function
End If
iEqualChars = iLonger - iDiferences
If iEqualChars = 0 Then
WordsSimilarity = 0
Exit Function
End If
iCoincidence = iEqualChars / iLonger

WordsSimilarity = (iCoincidence - D) * 100
If WordsSimilarity < 0 Then WordsSimilarity = 0
End Function

'preprocessing, optimized for the German language
Private Function PreProcessing(ByRef Src() As Integer, ByRef Dst() _
As Integer, L As Long, ByRef nDeletedChars() As Integer, _
nNumberOfDeleted As Long) As Single

Dim i As Long, J As Long, LChar As Integer, ReplCount As Long
Dim StartIdx As Long
Dim n As Long
Dim nc As Long
Dim ce As Long
Dim c As Long
'replace a leading 'h' with nothing and weight it with 2
If aLowChars(Src(0)) = 104 Then StartIdx = 1: ReplCount = 2

For i = StartIdx To L - 1
LChar = aLowChars(Src(i))

'each char-doubling of the *same* char is reduced to one char
If LChar = aLowChars(Src(i + 1)) Then
Dst(J) = LChar: J = J + 1
i = i + 1 'skip one source-char
ReplCount = ReplCount + 1
nNumberOfDeleted = nNumberOfDeleted + 1
nDeletedChars(nNumberOfDeleted) = LChar
GoTo Continue
End If

' If there are several consecutive consonants, only the first is left
' (may be too drastic; leaving two consecutive consonants may be
' also good or even better, it's something to modify and test)
' Also here the accents are not filtered (they would be treated as
' consonants); that's because I'm working with words already filtered
If IsConsonant(LChar) Then
n = 1
If i + n <= L - 1 Then
Do Until Not IsConsonant(aLowChars(Src(i + n)))
n = n + 1
If i + n + 1 > UBound(Src) Then Exit Do
Loop
End If
n = n - 1
If n > 0 Then
Dst(J) = LChar: J = J + 1
ReplCount = ReplCount + n
c = 1
For nc = nNumberOfDeleted + 1 To nNumberOfDeleted + n
nDeletedChars(nc) = aLowChars(Src(i + c))
c = c + 1
Next nc
nNumberOfDeleted = nNumberOfDeleted + n
i = i + n 'skip source-chars
GoTo Continue
End If
End If
ReplCount = ReplCount + 1

Dst(J) = LChar: J = J + 1: ReplCount = ReplCount - 1

Continue: Next i
Dst(J) = 0 'set terminating NullChar
L = J 'reflect the eventually reduced CharCount in L

PreProcessing = ReplCount * 0.02 '2 percent reduction per replace
End Function

Private Sub InitLowCharLUT()
Dim J&

'inits a lookup-table for fast (unicode-aware) Lower-Lookups
For J = -32768 To 32767: aLowChars(J) = J: Next J
If CharLowerBuffW(aLowChars(-32768), &H10000) = 0 Then
CharLowerBuffA aLowChars(65), (223 - 65) * 2
End If

' patch the stooges
' S 138/352, s 154/353 | O 140/338, o 156/339 | Z 142/381, z 158/382 |
' Y 159/376, ÿ 255/255
aLowChars(138) = 154: aLowChars(352) = 353
aLowChars(140) = 156: aLowChars(338) = 339
aLowChars(142) = 158: aLowChars(381) = 382
aLowChars(159) = 255: aLowChars(376) = 255
End Sub

Private Function IsConsonant(nChar As Integer) As Boolean
Select Case nChar
Case 97, 101, 105, 111, 117
IsConsonant = False
Case Else
IsConsonant = True
End Select
End Function

Public Function LevDist03(ByRef b1() As Integer, ByRef b2() _
As Integer, l1 As Long, l2 As Long) As Long

' by Donald Lessau (VB newsgroup), 20041129
'
' this is a VB implementation of the Levenshtein
' Distance function which ranks words by their
' similarity
Dim D() As Long ' matrix
Dim i As Long ' iterates through String1
Dim J As Long ' iterates through String2
Dim lCost As Long ' lCost

' Step 1
If l1 = 0 Then
LevDist03 = l2
Exit Function
End If
If l2 = 0 Then
LevDist03 = l1
Exit Function
End If
' Step 2: fill matrix
ReDim D(0 To l1, 0 To l2) As Long
For i = 0 To l1
D(i, 0) = i
Next
For J = 0 To l2
D(0, J) = J
Next
' Step 3

For i = 1 To l1
For J = 1 To l2
If b1(i - 1) = b2(J - 1) Then
lCost = 0
Else
lCost = 1
End If
D(i, J) = MinThree01(D(i - 1, J) + 1, _
D(i, J - 1) + 1, D(i - 1, J - 1) + lCost)
Next
Next
LevDist03 = D(l1, l2)
End Function

Private Function MinThree01(ByVal l1&, ByVal l2&, _
ByVal l3&) As Long
' by Donald, 20011116
If l1 < l2 Then
If l3 < l1 Then MinThree01 = l3 Else MinThree01 = l1
Else
If l2 < l3 Then MinThree01 = l2 Else MinThree01 = l3
End If
End Function
Schmidt
2011-08-26 15:28:34 UTC
Thanks for posting that back (archived here now)... :-)

Glad you were able to re-use the SafeArray-stuff
from cRatcliff to speed things up on your end -
but I'd adapt the now somewhat misleading comment
at the Top of this class (about being a "Ratcliff
with preprocessing") to "Levenshtein with Preprocessing",
since that's what's now acting as the workhorse (algo-wise).

Regards,

Olaf
Eduardo
2011-08-26 18:44:46 UTC
Post by Schmidt
but I'd adapt the now somewhat misleading comment
at the Top of this class (about being a "Ratcliff
with preprocessing") to "Levenshtein with Preprocessing",
since that's what's now acting as the workhorse (algo-wise).
Yes, I know. I didn't want to remove your authorship comments, but
you're right that it's somewhat misleading now.

It could be:

'based on code from Author: Olaf Schmidt (2008)
'modified by Eduardo (2011)
'Words similarity using Levenshtein Distance with PreProcessing
Schmidt
2011-08-26 20:19:15 UTC
Post by Eduardo
Post by Schmidt
but I'd adapt the now somewhat misleading comment
at the Top of this class (about being a "Ratcliff
with preprocessing") to "Levenshtein with Preprocessing",
since that's what's now acting as the workhorse (algo-wise).
Yes, I know. I didn't want to remove your authorship
comments, but you're right that it's somewhat misleading now.
Yeah, there are such cases, when you start with something
(some helpful, or "inspiring" code or code-structures),
but when you're done adapting it to your own problem,
you'll often find that you really created something
new and different (no longer resembling the code from
the other author much).

What I'm doing in such cases (where no false modesty's
needed IMO... meaning: when it's gone far beyond a simple
"some lines modified by: humble_me")
is something like:
'this is my new wonderful "foo- and bar-" class
'based on ideas from: link_to_inspiring_code (authors_name)


When entire routines were left (largely) intact, then
I'm basically doing the same as you've already done with:

Public Function LevDist03(ByRef b1() As Integer, ByRef b2() _
As Integer, l1 As Long, l2 As Long) As Long

' by Donald Lessau (VB newsgroup), 20041129
...
...


When I look at what remains in your new Class, then
I'd not have a problem even if you removed
my name completely (that would have the advantage that
I won't get sued in case you sell your software
to a power plant which later on "nukes an entire
area" - or something like that... <g>).

I was not the first who came up with this nice
SafeArray-Pointer-trickery, and the recursive
Ratcliff-algo (no longer present in your Class) was
also not "my idea" (only a VB-based implementation);
the only thing from cRatcliff which is "genuinely
my own invention" is the declaration of
BindArray and ReleaseArray (at least I didn't see
anybody doing that before, placing Optional parameters
in API-Declarations).

Take care... ;-)

Olaf
Eduardo
2011-08-26 22:08:41 UTC
Post by Schmidt
What I'm doing in such cases (where no false modesty's
needed IMO... meaning: when it's got far beyond a simple
"some lines modified by: humble_me")
'this is my new wonderfull "foo- and bar-" class
'based on ideas from: link_to_inspiring_code (authors_name)
When entire routines were left (largely) intact, then
Public Function LevDist03(ByRef b1() As Integer, ByRef b2() _
As Integer, l1 As Long, l2 As Long) As Long
' by Donald Lessau (VB newsgroup), 20041129
...
...
When I look at what remains in your new Class, then
I'd not have a problem, even when you'd remove
my name completely (would have the advantage, that
I will not get sued, in case you'd sell your software
to a powerplant, which later on "nukes an entire
area" - or something like that... <g>).
I was not the first, who came up with this nice
SafeArray-Pointer-trickery, and the (not existent
anymore in your Class) recursive Ratcliff-algo was
also not "my idea" (only a VB-based implementation),
the only thing from cRatcliff which is "genuinely
BindArray and ReleaseArray (at least I didn't see
anybody doing that before, placing Optional Parameters
in API-Declarations).
Take care... ;-)
Olaf
OK Olaf.

The new header is:

'this is a class for getting word similarity
'based on inspiring ideas from: www.datenhaus.de (Olaf Schmidt)
'use at your own risk
'(especially if you are going to use it in a nuclear power plant)
Schmidt
2011-08-26 23:41:01 UTC
Post by Eduardo
'this is a class for getting word similarity
'based on inspiring ideas from: www.datenhaus.de (Olaf Schmidt)
'use at your own risk
'(especially if you are going to use it in a nuclear power plant)
Lol, yeah - that's the spirit <g>

Seriously, what I had in mind, was a more solemn:

'Words similarity using Levenshtein Distance with PreProcessing
'Author: Eduardo ...
'Class-Layout and SafeArray-speedup based on ideas found in:
' www.datenhaus.de/Downloads/Ratcliff.zip (Olaf Schmidt)
'Levenshtein Implementation from Donald Lessau

Or something like that. ;-)

BTW, forgot to comment about your:
"I get an error 9 if I enable all the options"

Is this "index out of range"-error caused by the Class,
or somewhere else in the Application-Project?

In case this error is not happening in the Class,
then you could always move it into a separate
Dll-Project - and compile it (together with
other helper-functions) with full optimization -
as said, these "completely enabled checks" are
usually good for another factor of 2, sometimes
(depending on the algo) even a factor of 3.

Due to such a "binary-decoupling", you're then free
again to choose an entirely different compile-mode
(or different compile-options)
for your "Exe-Project" (without losing speed).

As a side-note (to enabling *all* extended
compiler-options in conjunction with
SafeArray-Pointer-stuff).

Sometimes it is necessary to leave the very first
extended option,
"assume that no aliasing is used" (translated from German),
unchecked.

This "unchecking of the first option" is not necessary
in each and every SafeArray-scenario, only in cases,
where a given SafeArrays .pData-Member gets
"switched often" -> to differently preallocated memory-
areas (as e.g. different Strings over their StrPtr)
meaning: more than once within a given subroutine.

For example, if you have an entire Array of Strings
to "parse fast" (by mapping the contents of each of
the String-Array's normal String members against a
single SafeArray-Helper), as e.g. in:

For i = 0 to Ubound(MyWordListArray)
saIntArray.pvData = StrPtr(MyWordListArray(i))

For j = 0 to Len(MyWordListArray(i)) - 1
'process the IntChars of the current String
Select Case IntArray(j)

End Select
Next j
Next i

Then this is the scenario, where you would need to
leave the first extended compiler-option unchecked.


In cRatcliff this scenario is *not* met -
and also in your new Class I cannot find such
"more than once per function" .pData - assignments
to (different) string-(pointer)-areas.
BTW, whilst looking at your code again at this occasion...
your new introduced saS3Arr and saS4Arr safearray-
types don't seem to be used in the class.

So this code should be safe with regard to
"enabling *all* the extended options".

Though the performance-gain from an enabled first
option is only 3-10% or so - and so I usually leave
it out in most of the "speed-optimized binaries",
just to play safe (no matter in what way I use
SafeArray-mappings there).

So, yeah (long story short) - in case you already
have an external Helper-Dll for your project, then
try to move the Class over there - and try the
extended compile-options on that Dll again.

Olaf
Ulrich Korndoerfer
2011-08-27 02:33:44 UTC
Hi Olaf,
Post by Schmidt
...
As a side-note (to enabling *all* extended
compiler-options in conjunction with
SafeArray-Pointer-stuff).
Sometimes it is necessary, to leave the very first
"assume, that no aliasing is used" (translated from german)
unchecked.
This "unchecking of the first option" is not necessary
in each and every SafeArray-scenario, only in cases,
where a given SafeArray's .pvData member gets
"switched often" -> to differently preallocated memory-
areas (as e.g. different Strings over their StrPtr)
meaning: more than once within a given subroutine.
For example, if you have an entire Array of Strings
to "parse fast" (by mapping the contents of each of
the String-Array's normal String members against a
For i = 0 to Ubound(MyWordListArray)
saIntArray.pvData = StrPtr(MyWordListArray(i))
For j = 0 to Len(MyWordListArray(i)) - 1
'process the IntChars of the current String
Select Case IntArray(j)
End Select
Next j
Next i
Then this is the scenario, where you would need to
leave the first extended compiler-option unchecked.
...
I think in the above scenario it is safe to check the "assume no aliasing"
option:

For i = 0 to Ubound(MyWordListArray)
saIntArray.pvData = StrPtr(MyWordListArray(i))
For j = 0 to Len(MyWordListArray(i)) - 1
'process the IntChars of the current String
Select Case IntArray(j)

End Select
Next j
Next i

forces the compiler in each inner loop step to always reload the integer
value from the array, because it is indexed with a variable (j). The
compiler does not know and cannot infer in this case which value j will
have, so it also cannot cache the value of IntArray(j).

In the following scenario, however, it would not be safe:

For i = 0 to Ubound(MyWordListArray)
saIntArray.pvData = StrPtr(MyWordListArray(i))
'process the first char of the current String
Select Case IntArray(0)

End Select
Next i

Here the compiler would cache the value of IntArray(0) and would reuse
the value in each loop step if "assume no aliasing" had been checked.
--
Ulrich Korndoerfer

VB tips, helpers, solutions -> http://www.prosource.de/Downloads/
MS Newsgruppen Alternativen -> http://www.prosource.de/ms-ng-umzug.html
Schmidt
2011-08-27 15:52:12 UTC
Post by Ulrich Korndoerfer
I think in above scenario it is safe to check the "assume no aliasing"
For i = 0 to Ubound(MyWordListArray)
saIntArray.pData = StrPtr(MyWordListArray(i))
For j = 0 to Len(MyWordListArray(i)) - 1
'process the IntChars of the current String
Select Case IntArray(j)
End Select
Next j
Next i
forces the compiler to reload the integer value from the array on each
inner loop step, because the array is indexed with a variable (j). The
compiler does not know and cannot infer in this case which value j will
have, so it also cannot cache the value of IntArray(j).
For i = 0 to Ubound(MyWordListArray)
saIntArray.pData = StrPtr(MyWordListArray(i))
'process the first char of the current String
Select Case IntArray(0)
End Select
Next i
Here the compiler would cache the value of IntArray(0) and reuse
that value in each loop step if "assume no aliasing" had been
checked.
You are right, my example above (the one with the j-indexed
access into the "mapped IntArray") was misleading, and would
not provoke this choking on an aliasing error.

I'd copied that from something I knew *was* choking,
but then "oversimplified" the case for the NG-post
(not really aware that these aliasing-problems are
not simply caused by "nested safearray-loop-scenarios",
but instead very specifically by using const-expressions
whilst referring to such "mapped array-members" ...
thanks for clearing that up).

Here is a more complete example to play around with the
extended compiler-switch for the aliasing-option (in
addition to your smaller one above).

It tries to show how one can end up in this
"aliasing-trap" just by attempting to further optimize
an already existing safearray-based routine:

(it needs a normal Exe-Project, which one should test compiled
- first with all extended options checked - and then with
the "assume no aliasing" option disabled, to make it work correctly)

'***Into a Form
Option Explicit

Private WordList() As String, WordListPtrs() As Long

Private Sub Form_Load()
AutoRedraw = True

'let's prepare a small String-Array with
'only three members, just to provide some
'real-world "raw-data" for the Demo
ReDim WordList(0 To 2)
WordList(0) = "abc"
WordList(1) = "abc123"
WordList(2) = "abc123xyz"

'let's further assume, that this Wordlist is
'"constant" and will not change its String-Contents
'So we can store the BSTR-Pointers of the StringList-
'Members in our WordListPtrs-Array
ReDim WordListPtrs(0 To UBound(WordList)) 'same size as our wordlist
Dim i As Long
For i = 0 To UBound(WordList)
'the line below does not store the Pointer to the
'beginning of the BSTR-Char-Data, but moves 4Bytes
'back instead, to catch also the LenB-area which
'precedes the CharData in a given (V)BSTRING
WordListPtrs(i) = StrPtr(WordList(i)) - 4
Next i

'Ok, our constant WordList-structures are filled and prepared...

'Now we hand the wordlist over, to a speed-optimized
'routine (e.g. to find something with a scan over all
'the words in the passed wordlist-pointer-array)
Dim WS As cWordScan
Set WS = New cWordScan
' Print WS.SafeArrayAliasingProblem1(WordList)
Print WS.SafeArrayAliasingProblem2(WordListPtrs)

End Sub


'***Into a Class, named cWordScan
Option Explicit

Private Type SAFEARRAY1D
cDims As Integer
fFeatures As Integer
cbElements As Long
cLocks As Long
pvData As Long
cElements As Long
lLbound As Long
End Type

Private Declare Sub BindArray Lib "kernel32" Alias "RtlMoveMemory" _
(PArr() As Any, PSrc&, Optional ByVal cb& = 4)
Private Declare Sub ReleaseArray Lib "kernel32" Alias "RtlMoveMemory" _
(PArr() As Any, Optional PSrc& = 0, Optional ByVal cb& = 4)

Private IntArr() As Integer, saIntArr As SAFEARRAY1D

Private Sub Class_Initialize()
'saIntArr defines the IntArr-behaviour, we later on map our Strings to
saIntArr.cDims = 1 'one-dimensional
saIntArr.cbElements = 2 '2 Bytes per Element (Int16)
saIntArr.lLbound = -2 'a lower bound of -2 allows "BSTR-LenB-catching"
saIntArr.cElements = 16384 'we assume this as MaxCharCount of a Word

BindArray IntArr, VarPtr(saIntArr)
End Sub

Private Sub Class_Terminate()
ReleaseArray IntArr
End Sub

'this function is not running into a "safearray-aliasing-trap"
'(so, all extended compiler-options may be checked)
Public Function SafeArrayAliasingProblem1(WordList() As String) As Long
Dim i As Long, j As Long, bNumericCharFound As Boolean

For i = 0 To UBound(WordList) 'loop over the words in our wordlist

saIntArr.pvData = StrPtr(WordList(i)) - 4 'the -4 matches lLbound = -2

For j = 0 To Len(WordList(i)) - 1 'loop WChars of the current word

'just to provide "something useful" here in this Demo-Function
Select Case IntArr(j)
Case 48 To 57: bNumericCharFound = True: Exit For
End Select

Next j

If bNumericCharFound Then
'increase the function-result when a word contains a numeric char
SafeArrayAliasingProblem1 = SafeArrayAliasingProblem1 + 1
bNumericCharFound = False
End If
Next i
End Function

'this second function here could result from an attempt,
'to speedup the existing function even more - and in turn
'one would run into the aliasing-problem, since now a
'const-expression is used, to refer into our mapped IntArr()
'(to work properly, the "aliasing"-compiler-option has to be disabled)
Public Function SafeArrayAliasingProblem2(WordListPtrs() As Long) As Long
Dim i As Long, j As Long, bNumericCharFound As Boolean, WordLen As Long

For i = 0 To UBound(WordListPtrs) 'loop over the words in our wordlist

'now within this outer loop, we try to work as fast as possible,
'not causing any calls into the VB-Runtime...

saIntArr.pvData = WordListPtrs(i) '<-using this avoids a StrPtr-call

'and the following line avoids a call to the Len()-Function
'... all perfectly reasonable speedup-considerations ...
WordLen = IntArr(-2) \ 2
'but this line above can also cause an (indirect) error, since
'the Len-info is retrieved over a const-expression (-2) whilst
'accessing the mapped IntArr - the VB-Compiler does not know,
'that the data-content behind IntArr was just switched by
'setting saIntArr.pvData to a new pointer - it does not know
'about the relation between saIntArr and IntArr - and so it is
'"optimizing away" the line below (out of the loop), in case
'the first ("assume no aliasing") compiler-option is checked

'the now following loop is perfectly fine - it's just the
'WordLen, which is not reliably delivered here, in case the
'"assume no aliasing"-option was checked
j = 0
Do While j < WordLen 'loop WChars of the current word

'just to provide something "useful" here in this Demo-Function
Select Case IntArr(j)
Case 48 To 57: bNumericCharFound = True: Exit Do
End Select

j = j + 1
Loop

If bNumericCharFound Then
'increase the function-result when a word contained a numeric char
SafeArrayAliasingProblem2 = SafeArrayAliasingProblem2 + 1
bNumericCharFound = False
End If
Next i
End Function


Olaf
Schmidt
2011-08-27 16:04:27 UTC
Permalink
Am 27.08.2011 17:52, schrieb Schmidt:

Sorry, made a mistake in a comment
Post by Schmidt
WordLen = IntArr(-2) \ 2
'but this line above can also cause an (indirect) error, since
'the Len-info is retrieved over a const-expression (-2) whilst
'accessing the mapped IntArr - the VB-Compiler does not know,
'that the data-content behind IntArr was just switched by
'setting saIntArr.pvData to a new pointer - it does not know
'about the relation between saIntArr and IntArr - and so it is
'"optimizing away" the line below (out of the loop), in case
Please change the last comment-line above to:
'"optimizing away" the line above (out of the loop), in case


Olaf
Ulrich Korndoerfer
2011-08-27 17:08:57 UTC
Permalink
Hi,
Post by Schmidt
WordListPtrs(i) = StrPtr(WordList(i)) - 4
WordLen = IntArr(-2) \ 2
You fetch the word len from the wrong address! Either use

WordListPtrs(i) = StrPtr(WordList(i)) - 4
WordLen = IntArr(0) \ 2

or

WordListPtrs(i) = StrPtr(WordList(i))
WordLen = IntArr(-2) \ 2

After this correction it might be possible that the code runs with
"assume no aliasing" checked. It depends on how aggressively the
compiler optimizes.

Aliasing in principle takes place when the content of a memory cell is
accessed by different means. Problems arise when the compiler does not
recognize this and one of the accesses writes to the memory cell while
another reads from it: the compiler then does not notice that the
content of the cell has changed. This may still lead to correct
results, as long as the compiler does not make "wrong" code changes
due to e.g. optimization efforts.

The problem is that we do not know which optimizing code changes the
compiler may or may not make, and we also do not know which kinds of
code changes the compiler will refrain from when "assume no aliasing"
is checked.

Aliasing is a well-known problem with C/C++ compilers. So C/C++
compilers have a set of switches specific to this problem that allow
fine-tuned adaptation to the code, and the compiler documentation may
describe well what is then allowed in code when a switch is set or
not. Or the compiler may issue warnings that it suspects aliasing when
compiling, so one can glance over the code and see whether it is correct.
Additionally, C++ in particular allows giving the compiler hints in
code by using the const keyword.

Sadly this information is not available for the VB native code compiler.
The underlying compiler of VB (the one that produces machine code) is a
rather old version of an MS C compiler, but the VB programmer cannot
access it and cannot set more specific aliasing switches.

So when "assume no aliasing" is checked, exactly in which cases the
compiler would not do code changes we do not know. It for example might
be that he only reacts on cases that are allowed by the language. Eg in
normal, not hacked VB code it is not possible that eg IntArr(0) delivers
different values when it is not aliased in a way VB allows.

For example:

IntArr(0) = 2
DoSomething IntArr(0)
Print IntArr(0)

with

Private Sub DoSomething(ByRef Value As Integer)
...
End Sub

is a case where aliasing takes place by using legal VB means; the
compiler should recognize this and act accordingly.

However with the safearray mapping one does things that are possible,
but not standard VB.

Now from my experience as a VB programmer, having written many lines
of code, I never came across a problem that had to do with aliasing,
i.e. code changes by the compiler, when "assume no aliasing" was
unchecked. This leads me to the conclusion that VB is ultraconservative
when dealing with this kind of thing, so even crude hacks do not lead
to erroneous code. In each case where aliasing might be possible - with
the compiler not knowing whether it is aliased, and where aliasing
would lead to incorrect results - it must act as if aliasing is in
effect.

E.g. with

A = 2: Print A

no aliasing is possible, and the compiler may in any case do constant
propagation and compile the call to Print with a constant value (2 in
this case).

However with

A = 2: B = 3: Print A

if B aliases A, using 2 for the call to Print would give an error. As
the compiler does not know whether B aliases A or not, it therefore
always reloads the value of A when calling Print (when "assume no
aliasing" is unchecked). In this case aliasing B with A can be done
with legal VB code:

A = 1: DoSomething A, A

Private Sub DoSomething(ByRef A As Integer, ByRef B As Integer)
A = 2: B = 3: Print A
End Sub

However with

IntArr(0) = 2
'This line manipulates the SA data pointer
Print IntArr(0)

from the compiler's knowledge of the VB language it could determine
that aliasing in this case is not possible, because this is a hack.
Nevertheless this code works when "assume no aliasing" is unchecked. So
the compiler really acts ultraconservatively and seems to reload values
every time a variable is used, at least if there is a chance that it
might have changed (that is, there is some code between the two
accesses of the variable). This of course costs performance.
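
To make that cost visible, a quick timing sketch (my untested
illustration, written as another method inside the cWordScan class
above, with .pvData already pointing at valid memory; timeGetTime
comes from winmm.dll):

Private Declare Function timeGetTime Lib "winmm.dll" () As Long

Public Sub MeasureReloads()
Dim t As Long, i As Long, Acc As Long
t = timeGetTime
For i = 1 To 10000000
Acc = Acc Xor IntArr(0) 'const-indexed read of the mapped array
Next i
MsgBox "10M reads took " & (timeGetTime - t) & " ms (Acc=" & Acc & ")"
End Sub

Compiled once with and once without "assume no aliasing", the
difference (if any) shows whether the read was hoisted out of the
loop - which is exactly the behavior discussed here.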
--
Ulrich Korndoerfer

VB tips, helpers, solutions -> http://www.prosource.de/Downloads/
MS Newsgruppen Alternativen -> http://www.prosource.de/ms-ng-umzug.html
Schmidt
2011-08-27 17:45:03 UTC
Permalink
Post by Ulrich Korndoerfer
Post by Schmidt
WordListPtrs(i) =
WordLen = IntArr(-2) \ 2
You fetch the word len from the wrong adress!
Nope, please see the definition of the safearray's
lLbound-Member in the Class' Initialize:
saIntArr.lLbound = -2

This setting in conjunction with:
saIntArr.pvData = StrPtr(SomeString) - 4

allows you to do your loop over the WChars
without changing your usual indexing:
from 0 to Len(SomeString) - 1
Post by Ulrich Korndoerfer
After this correction it might be possible that the
code runs with "assume no aliasing" checked.
As I already wrote, the code in SafeArrayAliasingProblem2
reliably reproduces the choking (when "assume no aliasing"
is checked), but is otherwise perfectly valid code which
runs just fine when the aliasing-check is turned off.

[extra-infos snipped]

Olaf
Ulrich Korndoerfer
2011-08-27 19:02:21 UTC
Permalink
Post by Schmidt
Post by Ulrich Korndoerfer
Post by Schmidt
WordListPtrs(i) =
WordLen = IntArr(-2) \ 2
You fetch the word len from the wrong adress!
Nope, please see the definition of the safe-arrays
saIntArr.lLbound = -2
Oops. You are right.
Post by Schmidt
saIntArr.pvData = StrPtr(SomeString) - 4
allows, to do your Loop over the WChars
from 0 to Len(SomeString) - 1
Post by Ulrich Korndoerfer
After this correction it might be possible that the
code runs with "assume no aliasing" checked.
As I already wrote, the code in SafeArrayAliasingProblem2
reliably reproduces the choking (when "assume no aliasing"
is checked), but is otherwise perfectly valid code which
runs just fine when the aliasing-check is turned off.
[extra-infos snipped]
Olaf
Mhm, it seems I did not make my point clear enough. So let's try
again:

only very experienced programmers should check "assume no aliasing".
They should then double-check their code and perform solid testing,
covering all relevant code paths. All others should stay away from
doing so.

Reasons:

- first it is not possible to restrict "assume no aliasing" to one
compilation unit (like a class, form or simple module) or a method
therein. Often one needs this only in one method to optimize
performance, but then one has to check the code of the complete
compilation unit and the code of the whole project!

- you yourself must identify all possible occurrences of aliasing; the
compiler will not help. Of course one can use aliasing at will when
"assume no aliasing" is checked, and those cases should then be easily
identifiable by the programmer, but more often there may be accidental
aliasing or potential aliasing, which is much harder to detect.

- then you have to know whether an identified potential or real
aliasing will cause problems when running the compiled code. For this
you have to know the rules the compiler uses when creating machine
code. This is the most severe obstacle: you don't know these rules, and
even if you knew them, they would be complex and difficult to grasp,
and their effects hard to anticipate just from reading the VB source
code. For certain cases you can guess or infer from your experience and
former tests, but in general you can't be sure. And remember: test runs
in the IDE do not help, because the rules for code creation for PCode
are completely different and are not affected by compiler switches.

Your SafeArrayAliasingProblem2 shows the above point: before you test,
you don't know whether this kind of aliasing will pose problems. And,
at least if you don't decompile the machine code, you don't know why it
is creating problems. One who is aware of what kind of optimizations a
compiler can typically do may guess that it comes from factoring the
assignment out of the loop; to be sure, however, one would have to look
at the machine code. And then you still cannot be sure whether the same
optimization would be done by the compiler if the surrounding code were
slightly different or other compiler switches were set.

So my point is: it is not possible to give general rules for how to
structure your code when "assume no aliasing" is checked.
--
Ulrich Korndoerfer

VB tips, helpers, solutions -> http://www.prosource.de/Downloads/
MS Newsgruppen Alternativen -> http://www.prosource.de/ms-ng-umzug.html
ralph
2011-08-27 21:51:16 UTC
Permalink
On Sat, 27 Aug 2011 21:02:21 +0200, Ulrich Korndoerfer
Post by Ulrich Korndoerfer
[complete post snipped]
For the benefit of others who might be curious (I assume you and
Olaf are already aware of this): you can get closer to viewing and
playing with different compilers and compiler options by creating a
simple VB application which captures the incoming command line and
stops. Name the application "c2.exe" and temporarily replace the C2
provided by VB. (MSC v12 IIRC.)

At that point you can view the command line options and the VB ".obj"
files. These files are "VB preprocessed C" files which can be treated
as plain C files to further build your product. (And with newer
compilers.)
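
Such a capture stub could look like this (an untested sketch of mine;
it assumes a Standard EXE with Sub Main as the startup object, and the
log filename is arbitrary):

'*** Module1.bas of a Standard EXE compiled as c2.exe
Sub Main()
Dim f As Integer
f = FreeFile
'append, so every per-module invocation gets recorded
Open App.Path & "\c2_capture.log" For Append As #f
Print #f, Command$
Close #f
'then simply stop - the build fails, but the .obj inputs and
'the full option set remain on disk for inspection
End Sub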

It has been many moons since I played with this, but IIRC two things
will become apparent. One, VB will occasionally generate extra
"undocumented options". It is not clear if they are being used or not,
i.e. something that never got adopted, or real but undocumented.

Second, Ulrich's suspicions are correct: there are often subtle
differences between VB code and the code the VB preprocessor
generates, making it difficult to predetermine just when or why a
particular optimization takes place. (Or in this case even when or why
particular C code is generated for a given VB construct.)

One can get lost in analyzing and studying the VB preprocessed C files
themselves. It is a remarkable achievement when one considers that VB
and C are so fundamentally different. Pcode is stack-based and C is
function(module)-based. The latter is the primary reason why so many
functionally identical VB constructs (viewed from a high level) can in
fact lead to subtly different C code at a lower level. Quite
fascinating.

However, based on my experience, having played with this for too many
months off 'n on, I can share with you that I seldom found a
justifiable reason to go there other than simple curiosity. And the
few I found were of limited value and produced mostly a maintenance
nightmare. <g> It is simpler to set a few options, compile and test,
go back, flip another, compile and go test again.

Also (as Ulrich noted), one needs to appreciate that the VB compiler
options are 'global' for all modules - so, for large projects, often
what you might improve in one running section has a negative impact
on another. I.e., if you go flipping options - you can't stop at a unit
test - you need to perform a full system test.

You can actually set different options for each module using the
technique I outlined above. Then later link them together. (Did I
mention maintenance nightmare? <bg>)

-ralph
Ulrich Korndoerfer
2011-08-27 22:48:43 UTC
Permalink
Post by ralph
...
For the benefit of others who might be curious (I assume you and
Olaf are already aware of this): you can get closer to viewing and
playing with different compilers and compiler options by creating a
simple VB application which captures the incoming command line and
stops. Name the application "c2.exe" and temporarily replace the C2
provided by VB. (MSC v12 IIRC.)
At that point you can view the command line options and the VB ".obj"
files. These files are "VB preprocessed C" files which can be treated
as plain C files to further build your product. (And with newer
compilers.)
Yeah. There are tools available (free and commercial IDE addins) that
allow one to "break" into the build process and also to alter it. Some
of them (at least one IIRC) allow controlling the build process by
reading build instructions from specially crafted comments inside the
VB source modules. It should also be possible to write a tool that
controls building at the method level.
Post by ralph
It has been many moons since I played with this, but IIRC two things
will become apparent. One, VB will occasionally generate extra
"undocumented options". It is not clear if they are being used or not,
i.e. something that never got adopted, or real but undocumented.
Second, Ulrich's suspicions are correct: there are often subtle
differences between VB code and the code the VB preprocessor
generates, making it difficult to predetermine just when or why a
particular optimization takes place. (Or in this case even when or why
particular C code is generated for a given VB construct.)
One can get lost in analyzing and studying the VB preprocessed C files
themselves. It is a remarkable achievement when one considers that VB
and C are so fundamentally different. Pcode is stack-based and C is
function(module)-based. The latter is the primary reason why so many
functionally identical VB constructs (viewed from a high level) can in
fact lead to subtly different C code at a lower level. Quite
fascinating.
However, based on my experience, having played with this for too many
months off 'n on, I can share with you that I seldom found a
justifiable reason to go there other than simple curiosity. And the
few I found were of limited value and produced mostly a maintenance
nightmare. <g> It is simpler to set a few options, compile and test,
go back, flip another, compile and go test again.
Yes, hunting down the generated C code is not justifiable. However,
checking "assume no aliasing" can give that extra performance boost. So
if one needs it, go this route and test; don't try to infer (too much).
Post by ralph
Also (as Ulrich noted), one needs to appreciate that the VB compiler
options are 'global' for all modules - so, for large projects, often
what you might improve in one running section has a negative impact
on another. I.e., if you go flipping options - you can't stop at a unit
test - you need to perform a full system test.
You can actually set different options for each module using the
technique I outlined above. Then later link them together. (Did I
mention maintenance nightmare? <bg>)
...
One could automate that inserting of build instructions by reading them
from the VB source code, so maintenance might be a little bit easier ;-)
--
Ulrich Korndoerfer

VB tips, helpers, solutions -> http://www.prosource.de/Downloads/
MS Newsgruppen Alternativen -> http://www.prosource.de/ms-ng-umzug.html
ralph
2011-08-27 23:48:58 UTC
Permalink
On Sun, 28 Aug 2011 00:48:43 +0200, Ulrich Korndoerfer
Post by Ulrich Korndoerfer
One could automate that inserting of build instructions by reading them
from the VB source code, so maintenance might be a little bit easier ;-)
I once created an elaborate build based on nmake to automate the
tweaking of several modules for a suite of applications. Took a bit of
tweaking, but it was a one-click wonder.

I was quite proud of myself ... that is until I had to stand in front
of the Project Manager and explain - "So you billed 10 hours doing
what now?" <bg>

-ralph
Ulrich Korndoerfer
2011-08-28 04:49:40 UTC
Permalink
Post by ralph
On Sun, 28 Aug 2011 00:48:43 +0200, Ulrich Korndoerfer
Post by Ulrich Korndoerfer
One could automate that inserting of build instructions by reading them
from the VB source code, so maintenance might be a little bit easier ;-)
I once created an elaborate build based on nmake to automate the
tweaking of several modules for a suite of applications. Took a bit of
tweaking, but it was a one-click wonder.
I was quite proud of myself ... that is until I had to stand in front
of the Project Manager and explain - "So you billed 10 hours doing
what now?" <bg>
I feel with you :-)

Btw, rereading my post, I recognize some severe language flaws, eg.
instead of inserting I should have used insertion, and some others.
Please excuse my bad command of english language. And never trust my
writings literally :-)
--
Ulrich Korndoerfer

VB tips, helpers, solutions -> http://www.prosource.de/Downloads/
MS Newsgruppen Alternativen -> http://www.prosource.de/ms-ng-umzug.html
ralph
2011-08-28 06:05:53 UTC
Permalink
On Sun, 28 Aug 2011 06:49:40 +0200, Ulrich Korndoerfer
Post by Ulrich Korndoerfer
[snip]
Btw, rereading my post, I recognize some severe language flaws, eg.
instead of inserting I should have used insertion, and some others.
Please excuse my bad command of english language. And never trust my
writings literally :-)
Ha. Your grammar is generally better than mine, especially in this
newsgroup. I unfortunately tend to drop into some kind of mixed
Western, Southern, Mid-West Jibberish.

Never trust my writings literally, as I may be drinking.

-ralph
Ulrich Korndoerfer
2011-08-28 06:34:08 UTC
Permalink
Post by ralph
...
Never trust my writings literally, as I may be drinking.
...
Prosit ;-)
--
Ulrich Korndoerfer

VB tips, helpers, solutions -> http://www.prosource.de/Downloads/
MS Newsgruppen Alternativen -> http://www.prosource.de/ms-ng-umzug.html
Schmidt
2011-08-28 15:08:23 UTC
Permalink
Post by Ulrich Korndoerfer
Mhm, it seems I did not make my point
clear enough.
Why do you think so?

We don't differ that much in our general opinion that
the "aliasing-switch" is better left untouched
(if one's not sure).

In my reply to Eduardo I already wrote:
"Though the performance-gain from an enabled first
option is only 3-10% or so - and so I usually leave
it out in most of the "speed-optimized binaries",
just to play safe (no matter in what way I use
SafeArray-mappings there)."
Post by Ulrich Korndoerfer
only very experienced programmers should check
"assume no aliasing".
Yep - but I consider those who (often) use SafeArray-
mappings in their fast routines "among them".
All the explanations which came up in these last postings
here were more or less meant for exactly these
"experienced speed freaks", I thought.

But I wouldn't go so far as to say that switching these
extended options on should somehow be reserved
only for those "who have at least some years under their
belts" - even a newbie should be "allowed" to play
with them, as long as he tests his compiled binaries
appropriately.


You wrote that you never experienced aliasing-problems
"over your years" (due to our "ultraconservative compiler")
whilst leaving this option mostly unchecked - it seems
reasonable (at least to me) that the compiler behaves that
way then, since - well, the option was not checked. ;-)

And from the experience "over my years" (compiling usually
with *all* options checked - especially in Helper-Dlls) -
I also never experienced any aliasing-problems ...
*as_long_as_no_SafeArray-mapping-routines_were_involved*.

Only the SafeArray-stuff has made problems in this regard so far
on my end - and not always, but under certain circumstances
(which ones, I already tried to describe - and I don't claim
that these are "all of them").

But as a "general rule" I stand by my words, that it is
(to 99%) safe, to enable *all* options, even in conjunction
with SafeArrays, when in your SafeArray-boosted-routine the
setting of the .pvData-pointer happens only *once*
(and is not repeatedly changed within the entered Main-
Routine as e.g. in the shown Loop over the WordList-
Pointers in the posted example - or in potentially called
SubRoutines ... these shouldn't alter the .pvData-Pointer
too, once set in the Main-Entry-Routine ... as said all that
needs to be considered only, when all extended options are
checked).
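
A schematic sketch of the pattern this rule allows (my untested
formulation, written as another method inside the posted cWordScan
class, reusing its saIntArr/IntArr mapping):

Public Function CountDigits(ByRef S As String) As Long
Dim j As Long
'.pvData is set exactly once per entry into this routine -
'and never switched again below, neither here nor in callees
saIntArr.pvData = StrPtr(S) - 4 'the -4 matches lLbound = -2
For j = 0 To Len(S) - 1
Select Case IntArr(j)
Case 48 To 57: CountDigits = CountDigits + 1
End Select
Next j
End Function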
Post by Ulrich Korndoerfer
So my point is: it is not possible to give general rules
for how to structure your code when "assume no aliasing"
is checked.
That's where we (slightly) differ, since I tried with my
post to give such a "rule of thumb" - at least one that
has worked for me over the years. And as you said - since
we have no extended information about our native compiler's
internal optimization strategies, one can only work
based on experience and tests.
But what we know (in the meantime) is that the compiler
apparently works "not very aggressively" with regard
to optimization - it starts to try some more sophisticated
things when the aliasing option is turned on - but compared
with the "full-blown C-compilers" I think it's still
a "conservative companion".

Where we do not differ is that enabling the
extended options (especially the first one) should
be accompanied by extended testing of the compiled
binary.

Olaf
Ulrich Korndoerfer
2011-08-28 23:04:07 UTC
Permalink
Post by Schmidt
...
You wrote that you never experienced aliasing-problems
"over your years" (due to our "ultraconservative compiler")
whilst leaving this option mostly unchecked - it seems
reasonable (at least to me) that the compiler behaves that
way then, since - well, the option was not checked. ;-)
One may suspect that the VB compiler is not perfect in avoiding
problems when aliasing takes place. C compilers especially *are* not
perfect: there *exist* commercial-grade compilers that fail on some
cases of aliasing, even when they are told that they cannot assume
aliasing is absent from the code. Because the VB backend compiler is a
C compiler, it is reasonable to suspect that VB may fail here too, even
when "assume no aliasing" is unchecked.

Only experience can tell whether a compiler (C or VB) behaves well in
regard to aliasing.

Experience has shown that some C compilers fail. My experience tells me
that the VB compiler is seemingly perfect in this respect - either
because the generated C code avoids such problems, or because the C
compiler backend is able to recognize them, or maybe both.

So it is not obvious that there are no aliasing problems when "assume
no aliasing" is unchecked. Experience however tells us that one can
trust the VB compiler, and that is valuable information, in my opinion.

So the rule for the "assume no aliasing" option unchecked is:

even when aliasing is in your code, the *compiler* will emit correct
machine code that behaves as written. The compiler will not do
optimizations that alter the behavior of the code.

As a side note that does not imply that the code will work as *intended*
by the programmer, because eg. a programmer might accidentially use
aliasing! But then it was not the compiler's fault, but the programmer's.
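
A small example of such accidental aliasing (my illustrative sketch,
not from the thread):

Private Sub AddTwice(ByRef Total As Long, ByRef Increment As Long)
Total = Total + Increment
Total = Total + Increment
End Sub

'called as: AddTwice X, X
'the programmer may expect X to grow by 2*X, but after the first
'addition Increment (which aliases Total) has already doubled,
'so X ends up at 4 times its old value - correct as written,
'but probably not as intended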
Post by Schmidt
And from the experience "over my years" (compiling usually
with *all* options checked - especially in Helper-Dlls) -
I also never experienced any aliasing-problems ...
*as_long_as_no_SafeArray-mapping-routines_were_involved*.
The difference from the "assume no aliasing" unchecked behavior
here is that it was not the compiler that resolved aliasing problems,
because it was told that aliasing does not take place.

It was *you* who identified those places where aliasing is in effect
and so might pose problems. *You* wrote or rewrote the code such that
either aliasing was not used, or such that despite using aliasing the
code works as written and intended - and you ensured this by testing.

So the rule for the "assume no aliasing" option *checked* is:

either the *programmer* identifies uses or potential uses of aliasing
and writes or rewrites the code such that no aliasing occurs or can
occur.

Or, when aliasing is used at will, the *compiler* may, due to its
optimizations, emit machine code that does not behave as written. The
*programmer* then has to write or rewrite his code such that the
compiled code behaves as written. Unfortunately the only viable way
to ensure that is by testing.
Post by Schmidt
Only the SafeArray-stuff has made problems in this regard so far
on my end - and not always, but under certain circumstances
(which ones, I already tried to describe - and I don't claim
that these are "all of them").
Because (I assume) those were the only cases where you used aliasing (at
will).
Post by Schmidt
But as a "general rule" I stand by my words, that it is
(to 99%) safe, to enable *all* options, even in conjunction
with SafeArrays, when in your SafeArray-boosted-routine the
setting of the .pvData-pointer happens only *once*
(and is not repeatedly changed within the entered Main-
Routine as e.g. in the shown Loop over the WordList-
Pointers in the posted example - or in potentially called
SubRoutines ... these shouldn't alter the .pvData-Pointer
too, once set in the Main-Entry-Routine ... as said all that
needs to be considered only, when all extended options are
checked).
Of course it is safe to enable all options, because correct code can
also be written with these options on. But then the programmer has to
have more knowledge, being aware of the consequences of switching these
options on.

To your rule: as said, setting the safearray data pointer more than
once in a routine may or may not lead to problems. And setting it only
once may or may not lead to problems as well. It depends on the code,
not on how often the data pointer is set.

Your rule should be reformulated:

when safearray mapping is used aliasing is done. So in this case be
aware that there might be problems caused by the compiler and test. If
there are problems, rewrite the code. But a general recommendation on
deducing from code when to rewrite or how to rewrite can not be given.
--
Ulrich Korndoerfer

VB tips, helpers, solutions -> http://www.prosource.de/Downloads/
MS Newsgruppen Alternativen -> http://www.prosource.de/ms-ng-umzug.html
Tony Toews
2011-08-27 19:08:46 UTC
Permalink
Post by Eduardo
'this is a class for getting words similarity
'based on inspiring ideas from: www.datenhaus.de (Olaf Schmidt)
'use at your own risk
'(specially if you are going to use it in a nuclear power plant)
Hehehe. You forgot to add: code not usable in Syria, North Korea or
Elbonia.

Tony
--
Tony Toews, Microsoft Access MVP
Tony's Main MS Access pages - http://www.granite.ab.ca/accsmstr.htm
Tony's Microsoft Access Blog - http://msmvps.com/blogs/access/
For a convenient utility to keep your users FEs and other files
updated see http://www.autofeupdater.com/
Tony Toews
2011-08-25 19:45:04 UTC
Permalink
Post by Eduardo
Comparing all the records in the table took about 4 seconds with an
indexed field.
But now I'm loading the words in vectors in memory, and it takes even
less, and I'm using a faster algorithm (provided by Olaf) to perform the
comparisons, and it takes less, and compiled even less, now it's in the
range of 0.05 seconds to 0.15 seconds (compiled).
Awesome. Glad you're making a *lot* of progress.

Tony
--
Tony Toews, Microsoft Access MVP
Tony's Main MS Access pages - http://www.granite.ab.ca/accsmstr.htm
Tony's Microsoft Access Blog - http://msmvps.com/blogs/access/
For a convenient utility to keep your users FEs and other files
updated see http://www.autofeupdater.com/
Eduardo
2011-08-25 23:55:58 UTC
Permalink
Post by Tony Toews
Post by Eduardo
Comparing all the records in the table took about 4 seconds with an
indexed field.
But now I'm loading the words in vectors in memory, and it takes even
less, and I'm using a faster algorithm (provided by Olaf) to perform the
comparisons, and it takes less, and compiled even less, now it's in the
range of 0.05 seconds to 0.15 seconds (compiled).
Awesome. Glad you're making a *lot* of progress.
Tony
Yes, thanks to Olaf and you.
David Kaye
2011-08-25 22:44:27 UTC
Permalink
Post by Eduardo
It's a database with fixed data, so there is not data additions by the
program. It's a dictionary.
A faster approach might be to add a new field containing the data beginning
with the second letter, then index that field. For instance:

ALTER TABLE mytable ADD COLUMN mynewfield TEXT(255)
UPDATE mytable SET mynewfield = Mid(myfield, 2)
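
And the index itself could then be created from DAO like this (my
untested addition; the file and index names are arbitrary):

Dim db As DAO.Database
Set db = DBEngine.OpenDatabase("dictionary.mdb")
db.Execute "CREATE INDEX idxMyNewField ON mytable (mynewfield)"
db.Close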
Eduardo
2011-08-25 23:54:25 UTC
Permalink
Post by David Kaye
A faster approach might be to add a new field containing the data beginning
with the second letter, then index that field.
Thanks David, that's what I did, but later I found that my idea was
wrong (about what I wanted to do) and now I solved the issue in another way.