How to design a tinyURL website.
A: Has anyone worked out how to design a tiny URL service?
Reference: how Weibo's short-URL generation algorithm works (with example Java and PHP implementations)
I think for these questions, starting from a simple, naive solution would be a good way to go..
simply solve the problem first, without worrying about scaling
effectively, tiny url computes a hash of the given URL
the naive way is to just hash the long URL directly to a short string
how do you design the hash function?
so for any given URL, h = hash(url)
and you have a key-value datastore, which is a map from hash to url
check if the hash is already used by a different url,
if so, you need a different hash
for the short address, I think you take the md5 digest (128 bits, 32 hex chars), split it into 4 parts, and then just pick one of them and work from that..
the above solution is a naive solution to start with
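for instance, a minimal sketch of the naive solution (the 6-char base-62 alphabet and the names `to_base62` / `shorten` are my own choices, not from the discussion):

    #include <cstdint>
    #include <functional>
    #include <stdexcept>
    #include <string>
    #include <unordered_map>

    // naive shortener: hash the URL, render 6 base-62 chars, re-probe on collision
    const std::string ALPHABET =
        "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";

    std::string to_base62(std::uint64_t h) {
        std::string s(6, 'a');
        for (int i = 0; i < 6; ++i) { s[i] = ALPHABET[h % 62]; h /= 62; }
        return s;
    }

    std::unordered_map<std::string, std::string> store;  // short key -> original url

    std::string shorten(const std::string& url) {
        std::uint64_t h = std::hash<std::string>{}(url);
        for (int attempt = 0; attempt < 100; ++attempt) {
            std::string key = to_base62(h + attempt);  // pick a different hash on collision
            auto it = store.find(key);
            if (it == store.end()) { store[key] = url; return key; }
            if (it->second == url) return key;  // this url was already shortened
        }
        throw std::runtime_error("too many collisions");
    }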
then let us look into some interesting things about tiny url
first, this mapping is append-only; records are never deleted. it means the locking can be optimized here
existing entries will be read only
effectively it means lock free
new entries will be written to the db
the lock is only needed when 2 URLs hash to the same short string, right?
oh, you mean don't lock the whole table?
only lock part of it?
sure, even an rdbms never locks the whole table
just lock the row being written. good idea
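as a toy illustration of that point (a reader/writer lock standing in for the row-level locking a real db does; the class name is my own):

    #include <optional>
    #include <shared_mutex>
    #include <string>
    #include <unordered_map>

    // append-only table: existing entries are read-only, so readers never
    // block each other; a writer locks only long enough to insert one row
    class UrlTable {
        std::unordered_map<std::string, std::string> map_;  // short -> original url
        mutable std::shared_mutex mu_;
    public:
        std::optional<std::string> lookup(const std::string& key) const {
            std::shared_lock lock(mu_);  // shared: many concurrent readers
            auto it = map_.find(key);
            return it == map_.end() ? std::nullopt
                                    : std::optional<std::string>(it->second);
        }
        bool insert(const std::string& key, const std::string& url) {
            std::unique_lock lock(mu_);            // exclusive, but brief
            return map_.emplace(key, url).second;  // false if key already taken
        }
    };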
you might also want to discuss how to distribute the load
how to do better than a db can offer, and how to maintain consistency, i.e., the new tiny url should later correctly return the original URL
no matter how soon it is requested
say, you have a lot of urls already in db
at the same time, there are a lot of new requests every second
both reads (translations) and writes (new hashes)
creating a new tiny url from a long url can be a little slower; users can generally accept that
but reads will be a lot more than writes
any URL will only be created once, but read many times
how to distribute the workload evenly?
you want to use multiple dbs / servers
say, you have a hash of 6 chars
[A-Za-z0-9]
you may have one server, serving hashes starting with [A-Z]
one for [a-z], and one for [0-9]
so you have 3 servers
and the hashes will never overlap
another nice thing is, the url creation/request goes to the same server
so there is no consistency problem to worry about
and we expect the requests will distribute evenly
it depends on how the hash function works
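the routing rule from that example, as a sketch (the server numbering is my own):

    #include <string>

    // route a 6-char hash to one of the 3 servers by its first character;
    // the three ranges never overlap, and creation and lookup of a given
    // hash always land on the same server
    int pick_server(const std::string& hash) {
        char c = hash[0];
        if (c >= 'A' && c <= 'Z') return 0;  // server 0: hashes starting [A-Z]
        if (c >= 'a' && c <= 'z') return 1;  // server 1: [a-z]
        return 2;                            // server 2: [0-9]
    }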
meaning the same tiny url always leads to the same url
if you share it with friends
everybody will be redirected to the same page
the same URL generates the same TINY URL
because there are multiple servers, you have to consider consistency
after that, it's read-only
not necessarily i think, depends on how you design the system
you may want to avoid the db check to see if it is already there
then you need to make sure the new ID you generated is never an already-used one
it can be done, but you still need to write the record to the db
effectively you save a db read
to verify if it is already in use
but i think in those systems, cache will play an important role
avoiding the db read might not be as beneficial as we think
anyway, you need a db write after new ID creation
unless you want to risk data loss by keeping the new IDs in mem
and writing those into the db in batches
it will be a lot faster
it means, don't write only one ID into db every time
keep those IDs in memory,
and write them into db in batches
it will be a lot faster
since I/O comes with high overhead
and write lock is expensive
yes
it is some optimization
do you mean a db cache?
i mean, writing to db is slow
so you want to write 100 IDs a time
instead of writing 100 times with 1 ID each time
you definitely need a CACHE
it can be 10x faster
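a sketch of that batching idea (batch size 100 as in the example; `flush_to_db` is a hypothetical stand-in for the real multi-row db write):

    #include <cstdint>
    #include <string>
    #include <utility>
    #include <vector>

    // hypothetical: writes many (id, url) rows to the db in one call
    void flush_to_db(const std::vector<std::pair<std::uint64_t, std::string>>& rows);

    class BatchWriter {
        std::vector<std::pair<std::uint64_t, std::string>> buf_;  // pending rows, in mem
    public:
        void add(std::uint64_t id, const std::string& url) {
            buf_.emplace_back(id, url);
            if (buf_.size() >= 100) flush();  // 1 write per 100 IDs, not per ID
        }
        void flush() {
            if (buf_.empty()) return;
            flush_to_db(buf_);  // one I/O, one write-lock acquisition
            buf_.clear();
        }
    };

(the rows sitting in buf_ are exactly the data-loss risk mentioned above: a crash before flush() loses them)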
and there will be a lot of requests, so i think the db read will have a very high chance of cache hit
by generating different IDs for the same URL
you have a different set of problems
someone said it might be different
it will work
the cache and the DB store the same thing
but reading from the DB is slow
so you CACHE the very popular urls, the ones everybody shares, in memory
google can generate different tiny urls for the same URL
taking 6-character combinations of the 62 characters a-zA-Z0-9 can produce 50+ billion combinations (62^6). if you define a mapping between numbers and character combinations, each number gets a unique string; e.g., the 62nd combination is aaaaa9 and the 63rd is aaaaba. then use a shuffle algorithm to scramble the original string table before saving it, so the combination at any given position looks random.
store the long url in the database and take the returned id, then look up the corresponding string; e.g., if the returned ID is 1, the corresponding combination above is bbb; likewise ID 2 gives bba, and so on, until all combinations are used up and repeats become possible. so with the 62 characters above, taking any 6 to form a string, repeats only become possible after you have stored 50+ billion entries.
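the ID-to-string mapping just described, as a sketch (62 characters, 6 positions, 62^6 = 56,800,235,584 combinations):

    #include <cstdint>
    #include <string>

    const std::string CHARS =
        "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";

    // bijective: each id below 62^6 gets its own 6-char string
    std::string id_to_string(std::uint64_t id) {
        std::string s(6, CHARS[0]);
        for (int i = 5; i >= 0; --i) { s[i] = CHARS[id % 62]; id /= 62; }
        return s;  // 0 -> "aaaaaa", 61 -> "aaaaa9", 62 -> "aaaaba"
    }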
萌喵喵脑残粉(28015012) 1:46:42 AM
but one TINY URL per URL should be acceptable too, right?
i think, the above means, you generate an ID for every coming URL
the simplest URL-to-TINY-URL algorithm I've seen is just MD5 then MOD
say, for coming URL "google.com", you return ID "100"
that is, if there's no constraint on how many characters the TINY URL must be
and you already have a map from ID to string
100 -> abcdef
and most importantly, the IDs will only increase
how do you make the combinations unordered?
I also think handing them out in order is pretty dumb
shuffle
wait, the key point is the shuffle.. won't there be duplicates after shuffling?
0 -> zafkds
1 -> ajiiii
2 -> 7juo9a
something like this
i mean, shuffle the order
dont change the string
initially, 0 -> aaaaaa
1 -> aaaaab
2 -> aaaaac
after shuffle, 0 -> yudxax
shuffle the indexes, you mean
google generates different tiny urls for the same long url in order to track click counts
the guy who wrote that did not explain that well
so you have a mapping from ID -> string of length 6
right, just change the order so people can't guess
and different IDs map to different strings
then you only need to generate IDs, starting from 0
it is a lot faster
you only do n++ every time :)
no hashing
and no need to check
you are sure they will never conflict
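a toy sketch of the whole idea, reusing `id_to_string` from the earlier sketch. the real 62^6 table would be huge, so this demo only pre-generates a small prefix; the point is just that IDs are n++ and the shuffled table keeps the issued strings unique but unguessable:

    #include <algorithm>
    #include <atomic>
    #include <cstddef>
    #include <cstdint>
    #include <random>
    #include <string>
    #include <vector>

    // pre-generate the first n strings in order, then shuffle once;
    // a shuffle is a permutation, so the strings stay pairwise distinct
    std::vector<std::string> make_table(std::size_t n) {
        std::vector<std::string> t(n);
        for (std::size_t id = 0; id < n; ++id) t[id] = id_to_string(id);
        std::shuffle(t.begin(), t.end(), std::mt19937_64{12345});
        return t;
    }

    const std::vector<std::string> table = make_table(1000000);
    std::atomic<std::uint64_t> next_id{0};

    std::string new_short_url() {
        return table[next_id++];  // no hashing, no conflict check
    }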
so the last problem is
how to generate the IDs fast and reliably ?
fast -> do it in mem, and distributed
yes, the hashing can be done ahead of time
you only keep a mapping from IDs to strings
those are generated and shuffled ahead of time
for IDs, in mem n++ is easy
how to do it on multiple servers?
you mean the pre-generated?
never
because those are effectively in order
you shuffle them afterwards
refer to the example above
0 -> aaaaaa, 1->aaaaab
those are in order generated, so the values are never the same
忘了(2754664635) 1:56:23 AM
so when a url comes in you assign it an id, and the id maps directly to the short url
Junjie(229516660) 1:56:30 AM
yes
that's why you see a different ID every time
看热闹的BT(935610471) 1:56:56 AM
could the tiny url be decoded by someone, if it's generated this way?
Junjie(229516660) 1:57:05 AM
you never re-use it
so there is no way to decode it, right?
the ID part
the id has nothing to do with the url
in mem ++ is easy
but you want to do that on multiple servers
every time, the child ID server will request a block of new IDs
and all users will only get IDs from child servers
the central server only maintains what the next block would be
basically a single number
the child server will generate IDs in the given block
since the blocks do not overlap
there is no need to worry about different child servers giving out the same ID, right?
and the child server in mem does n++
return ++n :)
it is atomic, no lock is needed
now, we have the speed
then, availability
N is the current ID in the current block on a child server
say, child server X get a block [10000, 20000)
and currently N = 15000
so next ID returned will be 15001
something like that
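a sketch of the child server side (block size 10000 as in the example; `request_new_block` is the hypothetical call to the central server, which comes up again below):

    #include <atomic>
    #include <cstdint>
    #include <mutex>
    #include <utility>

    // hypothetical RPC to the central server: returns a fresh block [a, b)
    std::pair<std::uint64_t, std::uint64_t> request_new_block(std::uint64_t size);

    class ChildIdServer {
        std::atomic<std::uint64_t> n_{0};     // next ID in the current block
        std::atomic<std::uint64_t> stop_{0};  // exclusive end of the block
        std::mutex refill_mu_;
    public:
        std::uint64_t next_id() {
            std::uint64_t id = n_++;       // atomic n++: no lock on the hot path
            if (id < stop_) return id;     // e.g. block [10000, 20000), n was 15000
            std::lock_guard<std::mutex> lock(refill_mu_);  // block used up (rare)
            auto [a, b] = request_new_block(10000);
            n_ = a + 1;                    // a itself is handed out right now
            stop_ = b;
            return a;                      // IDs skipped around a refill are simply lost
        }
    };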
since the blocks are kept in mem, if a child server crashes
we simply mark its whole block as lost
when the child server is back, it requests a new block
the central server cannot keep the next available block number only in mem
it has to be stored reliably
but since there are not many requests (all for whole blocks) made to the central server
it won't be a big problem
and actually you can have multiple central servers if needed
忘了(2754664635) 2:06:55 AM
when finding the next available id, do you have to lock that table?
Junjie(229516660) 2:07:03 AM
as long as their ID space does not overlap
no table, it is only a single number
yes, it has to be in DB
in case the central server dies
hmm, actually i think it is a better way
than generating the same hash for a url
it is a lot faster and scalable
the block here is really two numbers
start and stop
I don't get how multiple servers find the next available id... it sounds like multithreading; on the same child server you'd need a lock.
忘了(2754664635) 2:11:02 AM
oh, so it's just a numeric range.
Junjie(229516660) 2:11:08 AM
yes
like child server X get a block of [a, b)
Junjie(229516660) 2:11:35 AM
request_new_block(size)
will return [a, b)
a is start, b is stop
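and the central server side, as a sketch (`save_next_to_db` is a hypothetical durable write; the point is that the next block number is persisted before a block is handed out, so a crash never re-issues IDs):

    #include <cstdint>
    #include <mutex>
    #include <utility>

    // hypothetical: durably store the next block start in the db
    void save_next_to_db(std::uint64_t next);

    class CentralIdServer {
        std::uint64_t next_ = 0;  // "basically a single number"
        std::mutex mu_;
    public:
        // returns [a, b); blocks never overlap, so child servers never collide
        std::pair<std::uint64_t, std::uint64_t> request_new_block(std::uint64_t size) {
            std::lock_guard<std::mutex> lock(mu_);
            std::uint64_t a = next_;
            next_ += size;
            save_next_to_db(next_);  // persist before handing the block out
            return {a, next_};
        }
    };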
so child servers and central server only deal with IDs
there is another server, which returns the 6-char string for an ID
it is pre-generated, read only
you only lose some IDs
oh, and since these ids are never reused, the same URL submitted twice would map to different short urls?
yes, drop the used block
maybe you could figure out how much of it was left
because you also record the real URL for the short URL
if you scan the DB
you will find where it stops
but not worth it maybe
simply use a new block
you don't store a direct relation between the real URL and the short URL
there's only real URL -> id -> short URL, right?
short url <-> id -> url
you have to record the mapping from id to url
child only get a small block every time
when used up, ask for a new block
deciding which child gets which block is also the central server's job
that's why if the child server crashes, you just discard the whole old block
depends, even a cache can be reliable to some extent
e.g., memcached high availability
your data automatically gets multiple copies
and distributed in different servers
and you have a policy to write data back to DB
does the db store id -> url?
yes
there is another read-only mapping, both ways, from short url to id and from id to short url
the id-to-short-url direction is used when returning the short url to the user
the other one is used when serving user requests
id <-> short url
this bi-directional mapping is read only
short url -> id -> original url
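putting the pieces together for serving requests, as a sketch (all the functions here are hypothetical stand-ins for the components discussed above):

    #include <cstdint>
    #include <string>

    // the pre-generated, read-only bi-directional mapping
    std::uint64_t short_url_to_id(const std::string& short_url);
    std::string id_to_short_url(std::uint64_t id);

    // the append-only store: id -> original url (db with a cache in front)
    std::string db_lookup_url(std::uint64_t id);

    // serving a click: short url -> id -> original url
    std::string redirect_target(const std::string& short_url) {
        return db_lookup_url(short_url_to_id(short_url));
    }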
-------------------------------------------------------------------------
Suppose you have a linked list and you want to speed up the allocation of new nodes.
One way to do this is to maintain a list of deleted nodes,
whose memory can be reused when new nodes are allocated.
Instead of using the default new and delete, new will be overloaded to try to get a node from
the list of deleted nodes; only if no deleted nodes are available would it dynamically allocate memory.
Delete will simply add the node to the list of deleted nodes.
This way, instead of allocating new memory from the heap (which is pretty time consuming),
we reuse the already allocated space. The technique is usually called caching (essentially a free list).
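a minimal sketch of that technique (the field names and int payload are my own; C++17 for the inline static member):

    #include <cstddef>
    #include <new>

    struct Node {
        int value;
        Node* next;

        // class-specific new: reuse a deleted node if one is available
        static void* operator new(std::size_t size) {
            if (free_list != nullptr) {
                Node* n = free_list;
                free_list = n->next;        // pop from the free list
                return n;
            }
            return ::operator new(size);    // otherwise go to the heap
        }

        // class-specific delete: keep the memory instead of freeing it
        static void operator delete(void* p) noexcept {
            Node* n = static_cast<Node*>(p);
            n->next = free_list;            // push onto the free list
            free_list = n;
        }

        static inline Node* free_list = nullptr;  // deleted nodes awaiting reuse
    };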
Is this the resource-managing class idea from Effective C++?