Externalize dictionaries (for now)
This commit is contained in:
parent
28e188ae39
commit
7b5f8862ed
4 changed files with 163 additions and 61 deletions
157
README.md
|
@@ -3,4 +3,159 @@ human-friendly-id-gen
|
||||||
|
|
||||||
New Haskell project to generate Human Friendly Ids.
|
New Haskell project to generate Human Friendly Ids.
|
||||||
|
|
||||||
Those ids should be easy to read, write and to remember.
|
Those ids should be easier to read, write, and remember than classical random
base64 ids.

The package provides both a library and an executable `hfig` (for Human Friendly
Identifier Generator).

## Strategies

There are different strategies depending on your preferences.

### Short strategy

We generate random phonemes that are not too hard to pronounce, while using
enough different phonemes that words do not need to be long to avoid
collisions.

~~~
rupomdovi
waziridro
moplaloxo
kankujochplu
drubrusadka
dripuxmopbi
jotchibluzuv
plotabrprabudr
zopranblokplab
tirbrozprakow
~~~

Here is the probability of collision if you generate a sample of n of those words:

| n    | probability |
|------|-------------|
| 1k   | 2.5e-8      |
| 10k  | 2.5e-6      |
| 100k | 2.5e-4      |
| 1M   | 2.5e-2      |

You can also choose how many phonemes to use. With only 2 phonemes, you get words like:

~~~
blilwa
wirpa
winupl
tani
ludu
probrip
pichprox
joprux
drudibl
zibrku
~~~

The probability of collision becomes:

| n   | probability |
|-----|-------------|
| 10  | 1e-5        |
| 100 | 1e-3        |
| 1k  | 0.11        |
| 10k | 1.0         |

### Lovecraftian strategy

My nickname isn't yogsototh for nothing, so why not generate names that look as
if Lovecraft could have invented them?

~~~
ymhiovhotl
zhaobritl
v'odher
neltha
ucnouthlaxr
kola
adavhig
ctuthrilbh
yakthembru
athoubr'murh
~~~

The collision probability table looks like:

| n    | probability          |
|------|----------------------|
| 10   | 6.669334400426838e-8 |
| 100  | 6.669334400426838e-6 |
| 1k   | 6.669334400426837e-4 |
| 10k  | 6.669334400426838e-2 |
| 100k | 1.0                  |

If you generate two names for an id, you should be safe:

| n    | probability |
|------|-------------|
| 10   | 8.8e-17     |
| 100  | 8.8e-15     |
| 1k   | 8.8e-13     |
| 10k  | 8.8e-11     |
| 100k | 8.8e-9      |
| 1M   | 8.8e-7      |

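The two-name figures can be cross-checked against the single-name table. The sketch below assumes the birthday-bound approximation `min 1 (n^2 / (2 * S))`, the same shape as the `collisionProbability` helper that this commit removes from `HFIG.Dictionary`, and assumes it also applies to Lovecraftian names. The names `spaceFromRow`, `oneNameSpace` and `twoNameSpace` are illustrative and not part of the package.

```haskell
-- Birthday-bound approximation for n ids drawn from a space of `space`
-- equally likely values, capped at 1.
collisionProbability :: Double -> Double -> Double
collisionProbability space n = min 1 (n * n / (2 * space))

-- Invert the approximation: recover the size of the id space from one
-- (n, probability) row of a table. Only valid for rows that are not capped.
spaceFromRow :: Double -> Double -> Double
spaceFromRow n p = n * n / (2 * p)

-- Space of a single Lovecraftian name, taken from the n = 10 row.
oneNameSpace :: Double
oneNameSpace = spaceFromRow 10 6.669334400426838e-8

-- Two independently generated names multiply the spaces.
twoNameSpace :: Double
twoNameSpace = oneNameSpace * oneNameSpace
```

Under these assumptions a single name has roughly 7.5e8 possibilities, and `collisionProbability twoNameSpace 10` comes out just under 8.9e-17, which agrees with the first row of the two-name table if that table truncates rather than rounds.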
### Dictionary strategy

You can read any file; each line is treated as a word.
We then pick a few random words.

You can find some word lists to use in this repository.

There is a default English dictionary with approximately 370k English words.

Here is an example:

~~~
shuckins-digitinerved-microspectrophotometrical
indeterminableness-getaways-sceloporus
diverts-okayed-cast
semirhythmically-thasian-thrawart
smashups-phototherapeutics-swollenness
bindingness-phoenicia-ringy
execs-axes-barotaxis
monimiaceous-presutural-submembers
heterodyned-pourparley-zecchino
fragmentate-contrude-taeniae
~~~

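The way such an id is assembled can be sketched in a few lines. This is a minimal illustration using only `base`, not the package's implementation (which works on `Text` and `Vector`): `wordsFromContents` and `idFromIndices` are hypothetical names, and the random indices are assumed to be supplied by the caller. The `"-"` separator matches the examples in this README.

```haskell
import Data.List (intercalate)

-- | Treat every line of the file contents as one word.
wordsFromContents :: String -> [String]
wordsFromContents = lines

-- | Build an id from the words at the given positions, joined with "-".
-- Indices are wrapped with `mod`, so any Int is accepted; the caller is
-- expected to draw them uniformly at random.
idFromIndices :: [String] -> [Int] -> String
idFromIndices dict ixs
  | null dict = ""
  | otherwise = intercalate "-" [dict !! (i `mod` size) | i <- ixs]
  where
    size = length dict
```

For instance, `idFromIndices ["diverts", "okayed", "cast"] [0, 1, 2]` yields `"diverts-okayed-cast"`. List indexing is linear in the dictionary size, so a real implementation would prefer a vector, as the package does.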
And here are the collision probability tables.

Using 1 word to make the identifier:

| n   | probability |
|-----|-------------|
| 10  | 1.3e-4      |
| 100 | 1.3e-2      |
| 1k  | 1.0         |

Combining 2 words to make the identifier:

| n    | probability |
|------|-------------|
| 10   | 3.6e-10     |
| 100  | 3.6e-8      |
| 1k   | 3.6e-6      |
| 10k  | 3.6e-4      |
| 100k | 3.6e-2      |
| 1M   | 1.0         |

Combining 3 words to make the identifier:

| n    | probability |
|------|-------------|
| 10   | 9.8e-16     |
| 100  | 9.8e-14     |
| 1k   | 9.8e-12     |
| 10k  | 9.8e-10     |
| 100k | 9.8e-8      |
| 1M   | 9.8e-6      |

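These tables can be approximately reproduced with the birthday-bound formula from the `collisionProbability` helper in `HFIG.Dictionary`, `min (n^2 / (2 * N^l)) 1`, where `N` is the number of words and `l` the number of words per id. The sketch below takes the dictionary size as a plain number instead of a vector, and uses 370000 as a stand-in for the "approximately 370k" words mentioned above. `minWords` is a hypothetical helper, not part of the package, that picks the smallest number of words keeping the estimate under a target.

```haskell
-- | Birthday-bound approximation for n ids made of l words drawn from a
-- dictionary of dictSize words, capped at 1.
collisionProbability :: Double -> Double -> Double -> Double
collisionProbability dictSize n l = min 1 (n * n / (2 * dictSize ** l))

-- | Smallest number of words per id, searched in 1..10, that keeps the
-- estimated collision probability below `target` for n generated ids.
minWords :: Double -> Double -> Double -> Maybe Int
minWords dictSize n target =
  case [l | l <- [1 .. 10], collisionProbability dictSize n (fromIntegral l) < target] of
    (l : _) -> Just l
    []      -> Nothing
```

With a 370000-word dictionary, `collisionProbability 370000 10 1` is about 1.35e-4, close to the first row of the 1-word table, and `minWords 370000 1e6 1e-3` evaluates to `Just 3`: one million ids need three words to stay under a 0.1% estimated collision probability.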
@@ -17,13 +17,9 @@ main = do
     Short -> Short.idgen (fromMaybe 4 (optLen opts)) >>= putText
     Lovecraftian -> Lovecraftian.idgen (fromMaybe 1 (optLen opts)) >>= putText
     Dict -> do
-      let file = case optDict opts of
-            Just "english" -> "dictionaries/english.txt"
-            Just "first-names" -> "dictionaries/first-names.txt"
-            Just "generic" -> "dictionaries/generic.txt"
-            Just "literary" -> "dictionaries/literary.txt"
-            Just filepath -> toS (format fp filepath)
-            Nothing -> "dictionaries/english.txt"
+      file <- case optDict opts of
+        Just filepath -> return $ toS (format fp filepath)
+        Nothing -> die "Please select a dictionary file with the -d or --dict options"
       dict <- Dict.dictionaryFromFile file
       Dict.idgen dict (fromMaybe 3 (optLen opts)) >>= putText

@@ -13,58 +13,23 @@ Yet not the best for preventing collision.
 module HFIG.Dictionary
   ( idgen
   , dictionaryFromFile
-  , collisionProbability
   )
 where

 import Protolude

-import qualified System.Random.MWC as Random
-import qualified Control.Monad.Primitive as Prim
 import qualified Data.Vector as V
 import qualified Data.Text as T

+import qualified HFIG.Helpers as Helpers
+
 type Dictionary = V.Vector Text

+
 -- | Will generate readable short names The integer parameter determine the
 -- length in number of syllabus of the name
 idgen :: Dictionary -> Int -> IO Text
-idgen allwords n =
-  Random.withSystemRandom $ \gen ->
-    T.intercalate "-" <$> replicateM n (genWord gen allwords)
-
--- | Approximate collision probability other n generated name with complexity
--- parameter equal to l
---
--- For example if you generate 1000 words randomly with complexity parameter 4
--- We estimate the probability of collision to 3.85%
---
--- This is a nice helper function to use when you want to estimate the optimal
--- length of your ids
---
--- @
--- > collisionProbability 1000 4
--- 3.8580246913580245e-2
---
--- > collisionProbability 10000 5
--- 6.430041152263374e-2
---
--- > collisionProbability 10000 6
--- 1.0716735253772291e-3
--- @
-collisionProbability :: V.Vector Text -- ^ The dictionary
-                     -> Double -- ^ nb of generated names
-                     -> Double -- ^ length parameter used
-                     -> Double
-collisionProbability dict n l = min ((n**2) / (2 * (nbWords dict ** l))) 1
-
-nbWords :: V.Vector Text -> Double
-nbWords ws = fromIntegral $ V.length ws
-
-genWord :: Random.Gen (Prim.PrimState IO) -> V.Vector Text -> IO Text
-genWord gen allwords = do
-  (k :: Int) <- Random.uniformR (0, V.length allwords - 1) gen
-  return (allwords V.! k)
+idgen d = Helpers.idgen "-" [d]

 dictionaryFromFile :: FilePath -> IO (V.Vector Text)
 dictionaryFromFile dictName = (V.fromList . T.lines) <$> readFile dictName
@@ -12,7 +12,6 @@ Yet not the best for preventing collision.
 -}
 module HFIG.Lovecraftian
   ( idgen
-  , collisionProbability
   )
 where

@@ -43,16 +42,3 @@ nameparts = [
   , V.fromList ["a","e","i","u","o"]
   , V.fromList ["","","","","","","","","","","d","g","h","l","lb","lbh","n","r","rc","rh","s","sh","ss","st","sz","th","tl","x","xr","xz"]
   ]
-
--- | Approximate collision probability other n generated name with complexity
--- parameter equal to l
---
--- For example if you generate 1000 words randomly with complexity parameter 4
--- We estimate the probability of collision to 3.85%
---
--- This is a nice helper function to use when you want to estimate the optimal
--- length of your ids
-collisionProbability :: Double -- ^ nb of generated names
-                     -> Double -- ^ length parameter used
-                     -> Double
-collisionProbability = Helpers.collisionProbability nameparts