Monday, January 20, 2014

tinyURL (from a group discussion)

Q:
How do you design a tinyURL website?

A:    Has anyone figured out how to design tiny URL?

How Weibo's short-URL generation algorithm works (with Java and PHP implementation examples)



I think for these questions, starting from a simple, naive solution would be a good way to go.

Simply solve the problem without worrying about scaling first.

effectively, tiny url computes a hash of the given URL

The naive way is to hash directly to a short string.
How do you design the hash function?

as it scales, you will see hash collisions; that holds for any hash function, it does not matter which
so for any given URL, h = hash(url)
and you have a key-value datastore, which is a map from hash to url
check if the hash is already used by a different url,
if so, you need a different hash
For short URLs, I think you take the MD5 digest, split it into 4 parts, then just pick one of them and work from that.
the above solution is a naive solution to start with
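A minimal sketch of that naive approach, assuming a 6-character code over [A-Za-z0-9]; the FNV-1a hash and the salted retry on collision are illustrative choices, not what any real service necessarily uses:

```cpp
// A naive shortener (sketch): hash the URL, map hash -> 6-char code,
// and on a collision with a *different* URL, retry with a new salt.
// The alphabet, FNV-1a hash, and salting are assumptions for illustration.
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>

const std::string ALPHABET =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";

// Turn a 64-bit hash into a 6-character short code (base 62).
std::string toShortCode(uint64_t h) {
    std::string code;
    for (int i = 0; i < 6; ++i) {
        code += ALPHABET[h % 62];
        h /= 62;
    }
    return code;
}

// FNV-1a, salted so we can retry when two URLs collide.
uint64_t hashUrl(const std::string& url, uint64_t salt) {
    uint64_t h = 1469598103934665603ULL + salt;
    for (char c : url) {
        h ^= static_cast<unsigned char>(c);
        h *= 1099511628211ULL;
    }
    return h;
}

// The key-value datastore: short code -> original URL.
std::unordered_map<std::string, std::string> store;

std::string shorten(const std::string& url) {
    for (uint64_t salt = 0;; ++salt) {
        std::string code = toShortCode(hashUrl(url, salt));
        auto it = store.find(code);
        if (it == store.end()) {            // unused: claim it
            store[code] = url;
            return code;
        }
        if (it->second == url) return code; // same URL: return existing code
        // used by a different URL: fall through and try the next salt
    }
}
```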
then let us look into some interesting things about tiny url
first, this mapping is append-only; records are never deleted. it means the locking can be optimized here
existing entries will be read only
effectively it means lock-free reads
new entries will be written to the db

The lock is only needed when 2 URLs hash to the same short one, right?
Oh, you mean not locking the whole table?
Only locking part of it?
            sure, even an rdbms never locks the whole table
Just lock the row being written. Good idea.

you might also want to discuss how to distribute the load,
how to do better than what a db can offer, and how to maintain consistency, i.e., the new tiny url should later correctly return the original URL
no matter how soon it is requested
say, you have a lot of urls already in the db
at the same time, there are a lot of new requests every second
both reads (translations) and writes (new hashes)
creating a new tiny url from a long url can be a little slower; users can generally accept that
but reads will be a lot more frequent than writes
any URL will only be created once, but read many times
how to distribute the workload evenly?
you want to use multiple dbs / servers 

say, you have a hash of 6 chars

[A-Za-z0-9]

you may have one server, serving hashes starting with [A-Z]

one for [a-z], and one for [0-9]

so you have 3 servers

and the hashes will never overlap

another nice thing is, the url creation/request goes to the same server

so there is no consistency problem to worry about

and we expect the requests will distribute evenly 
it depends on how the hash function works
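The three-way split above can be sketched as a routing function; the server indices 0/1/2 are made up for illustration:

```cpp
// Routing sketch: the first character of the 6-char hash decides which
// server owns it. Server indices 0/1/2 are made up for illustration.
#include <cassert>
#include <string>

int serverFor(const std::string& code) {
    char c = code[0];
    if (c >= 'A' && c <= 'Z') return 0;  // server for [A-Z]
    if (c >= 'a' && c <= 'z') return 1;  // server for [a-z]
    return 2;                            // server for [0-9]
}
```

Both the creation and every later lookup of a given code land on the same server, which is why there is no cross-server consistency to worry about.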

That is, the same tiny url always leads to the same url.
if you share it with friends

everybody will be redirected to the same page

The same URL generates the same TINY URL.
Since there are multiple servers, you have to consider consistency.
After that it's read-only.
not necessarily i think, depends on how you design the system

you may want to avoid the db check to see if it is already there

then you need to make sure the new ID you generate is never a used one

it can be done, but you still need to write the record to the db

effectively you save a db read

the one that verifies whether it is already in use

Store the long URL in the database and take the returned id, then look up the corresponding string. For example, if the returned ID is 1, the corresponding string is bbb; likewise, for ID 2 the string is bba, and so on. Only after all 64 combinations are used does a repeat become possible.
but i think in those systems, cache will play an important role

avoiding the db read might not be as beneficial as we think

anyway, you still need a db write after creating the new ID

unless you want to risk data loss by keeping the new IDs in mem

and writing them into the db in batches

it will be a lot faster

it means, don't write only one ID into db every time

keep those IDs in memory,
and write them into db in batches

it will be a lot faster
since I/O comes with high overhead

and write lock is expensive

yes

it is some optimization

Do you mean a db cache?

i mean, writing to the db is slow
so you want to write 100 IDs at a time
 instead of writing 100 times with 1 ID written each time

You definitely need a CACHE, right?
it can be 10x faster
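A sketch of that batching idea, assuming a batch size of 100; the vector `db` and `flushToDb` are stand-ins for a real database and its bulk insert, and as noted, records still in the buffer are lost if the process dies before a flush:

```cpp
// Batching sketch: buffer new (id, url) records in memory and write them
// to the db 100 at a time. `db` and flushToDb() stand in for a real
// database and its bulk insert; both are illustrative.
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

const std::size_t BATCH_SIZE = 100;

std::vector<std::pair<long, std::string>> buffer;  // pending, in mem only
std::vector<std::pair<long, std::string>> db;      // stand-in for the db
int dbWrites = 0;                                  // bulk writes issued

void flushToDb() {
    // one I/O (and one write lock) per 100 records, not per record
    db.insert(db.end(), buffer.begin(), buffer.end());
    buffer.clear();
    ++dbWrites;
}

void record(long id, const std::string& url) {
    buffer.emplace_back(id, url);
    if (buffer.size() >= BATCH_SIZE) flushToDb();
}
```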

and there will be a lot of requests, so i think the db read will have a very high chance of cache hit

by generating different IDs for the same URL

you have a different set of problems

someone said it might be different
it will work

The cache and the DB store the same thing.
But reading from the DB is slow,
so you cache the very popular, widely shared URLs in memory.

google can generate different tiny urls for the same URL

Taking combinations of 6 from the 62 characters a-zA-Z0-9 yields 50+ billion combinations. Map numbers to character combinations in a fixed way and each number gets a unique string; for example, the 62nd combination is aaaaa9 and the 63rd is aaaaba. Then apply a shuffle algorithm: scramble the original sequence before saving it, so the string stored at each position is no longer in order.
Store the long URL in the database, take the returned id, and look up the corresponding string, as above. So with the 62 characters above, forming strings from any 6 of them, a repeat only becomes possible once you have stored 50+ billion entries.
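The quoted scheme can be sketched as plain base-62 encoding with [a-zA-Z0-9] as the digits, left-padded to 6 characters with 'a' (digit 0); 62^6 is about 56.8 billion, matching the 50+ billion figure in the quote:

```cpp
// Base-62 sketch of the quoted id -> string scheme: treat the id as a
// number in base 62 over [a-zA-Z0-9] and left-pad to 6 chars with 'a'
// (digit 0), so id 0 is "aaaaaa", id 1 is "aaaaab", and so on.
#include <cassert>
#include <string>

const std::string DIGITS =
    "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";

std::string encode(long long id) {
    std::string s(6, 'a');                 // "aaaaaa" == id 0
    for (int i = 5; i >= 0 && id > 0; --i) {
        s[i] = DIGITS[id % 62];            // least significant digit last
        id /= 62;
    }
    return s;
}
```

This reproduces the progression in the quote: the 62nd combination (id 61) is aaaaa9 and the 63rd (id 62) is aaaaba.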

萌喵喵脑残粉(28015012)  1:46:42 AM
But one TINY URL per URL is acceptable too, right?
i think, the above means, you generate an ID for every coming URL

The simplest URL-to-TINY-URL algorithm I've seen is just MD5 then MOD.
say, for coming URL "google.com", you return ID "100"

That works if there's no limit on how many characters the TINY URL must be.
and you already have a map from ID to string

100 -> abcdef

and most importantly, the IDs will only increase

How do the combinations become unordered?
I also think assigning them in order is too dumb.
shuffle

Wow, so the key is the shuffle.. won't there be repeats after shuffling?
0 -> zafkds
1 -> ajiiii
2 -> 7juo9

something like this

i mean, shuffle the order
 don't change the strings

initially, 0 -> aaaaaa

1 -> aaaaab
2 -> aaaaac

after shuffle, 0 -> yudxax

Shuffle the indexes, you mean?
google generates different tiny urls for the same long url in order to track click counts
the guy who wrote that did not explain that well

so you have a mapping from ID -> string of length 6

Just change the order, so people can't guess it.
and different IDs map to different strings

then you only need to generate IDs, starting from 0

it is a lot faster

you only do n++ every time :)

no hashing
and no need to check
you are sure they will never conflict
so the last problem is
how to generate the IDs fast and reliably ?

fast -> do it in mem, and distributed

yes, the hashing can be done ahead of time

you only keep a mapping from IDs to strings

those are generated and shuffled ahead of time

for IDs, in mem n++ is easy

how to do it on multiple servers?

you mean the pre-generated ones?
 they never conflict

because those are effectively in order

you shuffle them afterwards
refer to the example above

0 -> aaaaaa, 1->aaaaab

those are generated in order, so the values are never the same
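A sketch of that ahead-of-time shuffle: generate the strings in order, then permute the whole table once. A shuffle is a permutation, so every string still appears exactly once and distinct ids still map to distinct strings; only the order becomes unguessable. A tiny 62-entry table of 1-character strings keeps the example small (a real table would hold the 6-character strings):

```cpp
// Shuffle sketch: build the table in order, then permute it once,
// ahead of time. table[id] is the string served for that id.
#include <algorithm>
#include <cassert>
#include <random>
#include <string>
#include <vector>

std::vector<std::string> buildShuffledTable(unsigned seed) {
    const std::string DIGITS =
        "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
    std::vector<std::string> table;
    for (char c : DIGITS)
        table.push_back(std::string(1, c));            // in-order generation
    std::shuffle(table.begin(), table.end(),
                 std::mt19937(seed));                  // permute: no repeats
    return table;
}
```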

忘了(2754664635)  1:56:23 AM
So when a url comes in, you assign it an id, and the id maps directly to the short url.
Junjie(229516660)  1:56:30 AM
yes

that's why you see a different ID every time

看热闹的BT(935610471)  1:56:56 AM
If they are generated this way, could someone decode the tiny url?
Junjie(229516660)  1:57:05 AM
you never re-use it
so there is no way to decode it, right?

the ID part

The id has no relation to the url.
in mem ++ is easy

but you want to do that on multiple servers

every time, the child ID server will request a block of new IDs

and all users will only get IDs from child servers

the central server only maintains what the next block would be

basically a single number

the child server will generate IDs in the given block

since the blocks do not overlap

there is no need to worry about different child servers giving out the same ID, right?

and the child server in mem does n++

return ++n :)

it is atomic, no lock is needed

now, we have the speed

then, availability

N is the current ID in the current block on a child server

say, child server X gets a block [10000, 20000)

and currently N = 15000

so next ID returned will be 15001

something like that
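A sketch of that two-level ID scheme, with a made-up block size of 10000; in a real system the central counter would be persisted reliably, and the block-refill path would need its own coordination (it is shown single-threaded here for brevity):

```cpp
// Two-level ID service sketch: the central server hands out
// non-overlapping blocks [start, stop); each child hands out IDs from
// its current block with an atomic increment.
#include <atomic>
#include <cassert>
#include <utility>

const long BLOCK_SIZE = 10000;   // made-up block size

long nextBlockStart = 0;                    // central server state
std::pair<long, long> requestNewBlock() {   // returns [start, stop)
    long start = nextBlockStart;
    nextBlockStart += BLOCK_SIZE;
    return {start, start + BLOCK_SIZE};
}

struct ChildServer {
    std::atomic<long> next{0};
    long stop = 0;                          // end of current block

    long nextId() {
        long id = next.fetch_add(1);        // atomic n++, no lock
        if (id >= stop) {                   // block used up (or none yet)
            auto block = requestNewBlock(); // refill: single-threaded here
            next.store(block.first + 1);
            stop = block.second;
            return block.first;
        }
        return id;
    }
};
```

If a child crashes, the remainder of its in-memory block is simply abandoned; the next block from the central server never overlaps it.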

since the blocks are kept only in mem

if a child server dies, we simply mark its whole block as lost

when the child server is back, it requests a new block

the central server cannot keep the next available ID only in mem

it has to be stored reliably

but since not many requests (all for whole blocks) are made to the central server

it wont be a big problem

and actually you can have multiple central servers if needed

忘了(2754664635)  2:06:55 AM
When finding the next available id, do you need to lock that table?
Junjie(229516660)  2:07:03 AM
as long as their ID space does not overlap

no table, it is only a single number

yes, it has to be in DB

in case the central server dies

hmm, actually i think this is a better way

than generating a hash from the url

it is a lot faster and more scalable

the block here is really two numbers
 start and stop

I don't get how multiple servers find the next available id... it feels like multithreading; on the same child server you'd need a lock.
忘了(2754664635)  2:11:02 AM
Oh, so it's just a numeric interval.
Junjie(229516660)  2:11:08 AM
yes

like child server X get a block of [a, b)

Junjie(229516660)  2:11:35 AM
request_new_block(size)

will return [a, b)

a is start, b is stop

so child servers and central server only deal with IDs

there is another server, which returns the 6-char string for an ID

it is pre-generated, read only

you only lose some IDs

Oh, so those ids are not re-used; then would 2 URLs map to different ones?
yes, just drop the partially used block

maybe you could figure out how much of the block was left

because you also record the real URL for the short URL

if you scan the DB

you will find where it stops

but not worth it maybe

simply use a new block

So you don't store the relation between the real URL and the short URL directly?
It's just real URL -> id -> short URL, right?
short url <-> id -> url

you have to record the mapping from id to url

a child only gets a small block every time

when it is used up, it asks for a new block

Choosing which child gets which block is also the central server's job.
that's why if the child server crashes, you just discard the whole old block

depends, even a cache can be reliable to some extent

e.g., memcached with high availability

your data automatically gets multiple copies

and distributed in different servers

and you have a policy to write data back to DB

Does the db store id -> url?
yes

there is another read-only mapping, in both directions: from short url to id and from id to short url

the id-to-short-url direction is used when returning the short url to the user

the other one is used when serving user requests

id <-> short url
this bi-directional mapping are read only

short url -> id -> original url



-------------------------------------------------------------------------
Suppose you have a linked list and you want to speed up the allocation of new nodes.
One way to do this is to maintain a list of deleted nodes,
whose memory can be reused when new nodes are allocated.
Instead of using the default new and delete, new is overloaded to try to get a node from
the list of deleted nodes; only if no deleted nodes are available does it dynamically allocate memory.
Delete simply adds the node to the list of deleted nodes.
This way, instead of allocating new memory from the heap (which is pretty time consuming),
we reuse the already allocated space. The technique is usually called caching.
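A minimal sketch of the overloaded new/delete described above (strictly speaking this is a free list / memory pool rather than a cache); thread safety is omitted, and the pooled memory is never returned to the OS:

```cpp
// Free-list sketch: operator new pops a recycled node when one is
// available; operator delete pushes the node onto the free list
// instead of releasing it.
#include <cassert>
#include <cstdlib>
#include <new>

struct Node {
    int value = 0;
    Node* next = nullptr;

    static Node* freeList;                 // deleted nodes, chained via next

    static void* operator new(std::size_t size) {
        if (freeList) {                    // reuse a deleted node
            Node* n = freeList;
            freeList = freeList->next;
            return n;
        }
        return std::malloc(size);          // none cached: go to the heap
    }

    static void operator delete(void* p) {
        if (!p) return;
        Node* n = static_cast<Node*>(p);   // recycle instead of freeing
        n->next = freeList;
        freeList = n;
    }
};

Node* Node::freeList = nullptr;
```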

Is that the resource-managing class idea from Effective C++?
