Friday, January 31, 2014

System Design for Big Data [tinyurl] (reposted)

What is tinyurl?

tinyurl is a URL-shortening service: the user enters a long URL and the service returns a shorter, unique URL such as "http://tiny.me/5ie0V2". The highlighted part can be any 6-letter string over [0-9, a-z, A-Z]. That gives 62^6 ≈ 56.8 billion unique strings.

How does it work?

On Single Machine
Suppose we have a database table with three columns: id (auto-increment), actual url, and shortened url.

Intuitively, we could design a hash function that maps the actual url to the shortened url. But a string-to-string mapping is not easy to compute.

Notice that in the database, each record has a unique id associated with it. What if we convert the id into the shortened url?
Basically, we need a Bijective function f(x) = y such that
  • Each x must be associated with one and only one y;
  • Each y must be associated with one and only one x.
In our case, the set of x's are integers while the set of y's are 6-letter strings. In fact, each 6-letter string can itself be viewed as a number in base 62, if we map each distinct character to a digit,
e.g. 0-0, ..., 9-9, 10-a, 11-b, ..., 35-z, 36-A, ..., 61-Z.
The problem then becomes a base-conversion problem, which is a bijection (barring overflow :).
 public String shorturl(long id, int base, HashMap<Integer, Character> map) {
  StringBuilder res = new StringBuilder();
  while (id > 0) {
    int digit = (int) (id % base);     // least-significant digit first
    res.append(map.get(digit));
    id /= base;
  }
  while (res.length() < 6)  res.append('0'); // becomes leading zeros after reverse
  return res.reverse().toString();
}
For each input long url, the corresponding id is auto-generated (in O(1) time). The base-conversion algorithm runs in O(k) time, where k is the number of digits (here k=6). Note the id is taken as a long, since 62^6 exceeds the range of a 32-bit int.
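The same conversion, together with the decode direction that recovers the database id, can be sketched in Python (the alphabet order matches the mapping above; padding to 6 characters is an assumption):

```python
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
INDEX = {c: i for i, c in enumerate(ALPHABET)}

def encode(n, length=6):
    """Convert a database id to a base-62 short code, left-padded to `length`."""
    digits = []
    while n > 0:
        digits.append(ALPHABET[n % 62])   # least-significant digit first
        n //= 62
    return "".join(reversed(digits)).rjust(length, ALPHABET[0])

def decode(code):
    """Convert a short code back to the database id."""
    n = 0
    for c in code:
        n = n * 62 + INDEX[c]
    return n

assert decode(encode(12345)) == 12345
assert encode(61) == "00000Z"
```

Because the mapping is a bijection, retrieval never needs a lookup table: decoding the short code directly yields the row id.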

On Multiple Machines
Suppose the service gets more and more traffic, so we need to distribute the data onto multiple servers.

We can use a Distributed Database, but maintenance of such a db would be much more complicated (replicating data across servers, syncing among servers to get a unique id, etc.).

Alternatively, we can use Distributed Key-Value Datastore.
Some distributed datastores (e.g. Amazon's Dynamo) use Consistent Hashing: servers and inputs are hashed onto a ring of integers, and the server for an input is located from the input's hash value. We can then apply the base-conversion algorithm to the hash value of the input.

The basic process can be:
Insert
  1. Hash the input long url to a single integer;
  2. Locate a server on the ring and store the key--longUrl pair on that server;
  3. Compute the shortened url using base conversion (from base 10 to base 62) and return it to the user.
Retrieve
  1. Convert the shortened url back to the key using base conversion (from base 62 to base 10);
  2. Locate the server containing that key and return the longUrl.
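A toy sketch of the consistent-hashing lookup used in the Insert/Retrieve flow above (the server names and the choice of md5 are illustrative assumptions):

```python
import bisect
import hashlib

def h(key):
    """Hash a string onto the ring [0, 2^32)."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % (1 << 32)

class Ring:
    def __init__(self, servers):
        # each server is placed on the ring at its own hash value
        self.points = sorted((h(s), s) for s in servers)

    def locate(self, key):
        """Walk clockwise from hash(key) to the first server point."""
        keys = [p for p, _ in self.points]
        i = bisect.bisect_right(keys, h(key)) % len(self.points)
        return self.points[i][1]

ring = Ring(["server-a", "server-b", "server-c"])
server = ring.locate("http://example.com/some/long/url")
assert server in {"server-a", "server-b", "server-c"}
```

The same key always hashes to the same point, so inserts and later retrievals for one URL land on the same server.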
---------

Thursday, January 23, 2014

Given an integer, find the closest number that is a palindrome ??????????????

Q:
Given a base 10 number N, find the nearest palindromic base 10 number X to N. If N is already palindromic do nothing.
If two palindromic numbers are equidistant to N, (10 - 1 = 9, 10 + 1 = 11) returning either is acceptable.

A:


The procedure below actually computes the next larger palindrome (as in the SPOJ "next palindrome" problem); for the "nearest" version you would also compare against the smaller mirrored candidate. (Note, too, that 1991 in the example is itself already a palindrome, so under the problem statement above it would be returned unchanged.)

1) Increment the middle digit.

2) If the middle digit is 9, incrementing wraps it to 0; carry into the next digit to its left, i.e. mid-1,

and repeat the procedure for that digit.

If you reach the first (leftmost) digit and it is also 9, make it 0 and prepend a 1.

The procedure continues until we increment a digit that is not 9, or we pass the leftmost digit.

That does it.

e.g.

1991

The middle digit is 9: make it 0 and carry to the digit on its left

1991 ==> 1091

Now we reach the 1; increment it to 2

1091 ==> 2091. We reached a non-9 digit, so we stop.

Now mirror the left half onto the right:

2091 ==> 2002, done

Another case:

999

Set the middle 9 to 0 and carry to the left; we get

909. Do the same again:

as we are now on the leftmost digit and it is also 9, make it 0 and prepend a 1

909 ==> 1009

Now mirror it:

1009 ==> 1001

This approach works; it is also a problem on SPOJ.
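The carry-and-mirror idea above yields the next palindrome; for the "nearest" variant, a candidate-based sketch (function name is mine) mirrors the left half and also tries the half plus or minus one, plus the length-change edge cases:

```python
def nearest_palindrome(n):
    """Return the palindrome nearest to n (n itself if already palindromic)."""
    s = str(n)
    if s == s[::-1]:
        return n
    L = len(s)
    # edge candidates cover length changes, e.g. 999 and 10001 around 4-digit inputs
    candidates = {10 ** (L - 1) - 1, 10 ** L + 1}
    half = int(s[: (L + 1) // 2])
    for h in (half - 1, half, half + 1):
        p = str(h)
        # mirror the (possibly adjusted) left half onto the right
        candidates.add(int(p + p[-2::-1] if L % 2 else p + p[::-1]))
    # nearest by absolute distance; ties broken toward the smaller value
    return min(candidates, key=lambda c: (abs(c - n), c))

assert nearest_palindrome(1991) == 1991   # already a palindrome
assert nearest_palindrome(123) == 121
```

Every candidate is a palindrome by construction, so taking the minimum-distance candidate is safe even when an adjusted half changes length.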

find the next largest integer of it using the same digits

Q:

find the next largest integer of it using the same digits
   
A:
First distinguish positive from negative.

Assume the number is positive:

For example, 125467.
Scanning from the back, find the first digit that breaks the ascending (right-to-left) order; call its value a (here, 6).
Then, among the digits after a, find the smallest digit larger than it; call it b (here, 7), and swap a and b.
Then re-sort all the digits after that position in ascending order. (Here there are no remaining digits.)

Another example, 12543. The digit found first is 2; then find the smallest digit bigger than it, which is 3; swap them, giving 13542; then sort the suffix 542 in ascending order, or simply reverse it (it is in descending order), giving 13245.
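The steps above are the standard next-permutation algorithm; a sketch in Python (function name is mine):

```python
def next_permutation(num):
    """Return the next larger number using the same digits, or None if the
    digits are already in descending order (no larger arrangement exists)."""
    d = list(str(num))
    # 1) from the right, find the first index i with d[i] < d[i+1]
    i = len(d) - 2
    while i >= 0 and d[i] >= d[i + 1]:
        i -= 1
    if i < 0:
        return None
    # 2) in the (descending) suffix, the rightmost digit > d[i] is the
    #    smallest digit larger than d[i]; swap it with d[i]
    j = len(d) - 1
    while d[j] <= d[i]:
        j -= 1
    d[i], d[j] = d[j], d[i]
    # 3) the suffix is still descending; reversing it sorts it ascending
    d[i + 1:] = reversed(d[i + 1:])
    return int("".join(d))

assert next_permutation(125467) == 125476
assert next_permutation(12543) == 13245
```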





find number of 1s

Q:

 Please implement a function to get the number of 1s in an integer. For example, the integer 9 is 1001 in binary, so it returns 2 since there are two bits of 1s.


A:
Shift: either left-shift the constant 1, or right-shift n, testing against 1 each time (the code below right-shifts n; it shifts an unsigned copy, because a negative n under an arithmetic right shift would loop forever).
int NumberOf1(int n){
    unsigned int u = (unsigned int) n; // avoid sign-extension on right shift
    int count = 0;
    while(u)    {
        if(u & 1)
            count ++;

        u = u >> 1;
    }
    return count;
}

A trickier approach: use n & (n - 1) to clear the rightmost 1 bit.

 
int NumberOf1(int n){
    int count = 0;

    while (n){
        ++ count;
        n = (n - 1) & n;
    }

    return count;
}
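As a quick cross-check, the n & (n - 1) trick can be verified in Python against the built-in binary representation (helper name is mine):

```python
def number_of_1s(n):
    """Count set bits by repeatedly clearing the rightmost 1 with n & (n - 1)."""
    count = 0
    while n:
        n &= n - 1   # clears exactly one set bit per iteration
        count += 1
    return count

# 9 is 1001 in binary, so it has two 1 bits, as in the problem statement
assert number_of_1s(9) == 2
for x in (0, 1, 9, 255, 1 << 20):
    assert number_of_1s(x) == bin(x).count("1")
```

This also shows why the trick runs in O(number of set bits) rather than O(word size).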



Monday, January 20, 2014

Morris Traversal of a Binary Tree (non-recursive, no stack, O(1) space)

This post addresses one problem: how to implement preorder, inorder, and postorder traversal of a binary tree, with two requirements:
1. O(1) space complexity, i.e., only constant extra space may be used;
2. the shape of the tree must not be destroyed (changing it during intermediate steps is allowed).
Usually there are two common ways to implement preorder, inorder, and postorder traversal: recursion, and an iterative version using a stack. Both take O(n) space (the recursion itself occupies stack space, or a user-defined stack is used), so neither meets the requirement.
Morris Traversal achieves both goals: it needs only O(1) space, and it still finishes in O(n) time.
The hardest part of traversing in O(1) space is returning to the parent after finishing a subtree (assuming nodes carry no parent pointer), since no auxiliary stack may be used. To solve this, the Morris method borrows the idea of a threaded binary tree: no extra predecessor/successor pointers are allocated per node; instead, the unused null right pointers in the left subtree are made to point temporarily to the in-order predecessor's successor.
Morris originally gave only the inorder method; preorder is a small modification of it, while postorder takes a bit more thought. So we start with inorder.
First, define the tree node structure used in this post, consisting of val, left, and right:
struct TreeNode {
    int val;
    TreeNode *left;
    TreeNode *right;
    TreeNode(int x) : val(x), left(NULL), right(NULL) {}
};

1. Inorder Traversal

Steps:
1. If the current node's left child is null, output the current node and move to its right child.
2. If the current node's left child is not null, find the current node's in-order predecessor in its left subtree.
   a) If the predecessor's right child is null, set its right child to the current node. Move to the current node's left child.
   b) If the predecessor's right child is the current node, reset it to null (restoring the tree's shape). Output the current node. Move to the current node's right child.
3. Repeat steps 1 and 2 until the current node is null.
Figure: (image omitted: each iteration's result, left to right, top to bottom; cur marks the current node, dark nodes have already been output.)

Code:
void inorderMorrisTraversal(TreeNode *root) {
    TreeNode *cur = root, *prev = NULL;
    while (cur != NULL)
    {
        if (cur->left == NULL)          // 1.
        {
            printf("%d ", cur->val);
            cur = cur->right;
        }
        else
        {
            // find predecessor
            prev = cur->left;
            while (prev->right != NULL && prev->right != cur)
                prev = prev->right;

            if (prev->right == NULL)   // 2.a)
            {
                prev->right = cur;
                cur = cur->left;
            }
            else                       // 2.b)
            {
                prev->right = NULL;
                printf("%d ", cur->val);
                cur = cur->right;
            }
        }
    }
}
Complexity analysis:
Space complexity: O(1), since only two auxiliary pointers are used.
Time complexity: O(n). The main doubt when proving O(n) is the cost of locating the in-order predecessors of all nodes, i.e. these two lines:
while (prev->right != NULL && prev->right != cur)
    prev = prev->right;
Intuitively one might guess O(n lg n), since finding a single node's predecessor is related to the tree's height. In fact, finding the predecessors of all nodes takes only O(n) in total: a binary tree with n nodes has n-1 edges, and over the whole process each edge is walked at most twice, once to reach a node and once to find some node's predecessor. (In the original figure, red edges mark walks that reach a node and black edges mark walks that find a predecessor.) Hence the total is O(n).

2. Preorder Traversal

Preorder differs from inorder by a single line of code: where the output happens.
Steps:
1. If the current node's left child is null, output the current node and move to its right child.
2. If the current node's left child is not null, find the current node's in-order predecessor in its left subtree.
   a) If the predecessor's right child is null, set its right child to the current node. Output the current node (outputting here is the only difference from inorder). Move to the current node's left child.
   b) If the predecessor's right child is the current node, reset it to null. Move to the current node's right child.
3. Repeat steps 1 and 2 until the current node is null.
Figure: (image omitted)

Code:
void preorderMorrisTraversal(TreeNode *root) {
    TreeNode *cur = root, *prev = NULL;
    while (cur != NULL)
    {
        if (cur->left == NULL)
        {
            printf("%d ", cur->val);
            cur = cur->right;
        }
        else
        {
            prev = cur->left;
            while (prev->right != NULL && prev->right != cur)
                prev = prev->right;

            if (prev->right == NULL)
            {
                printf("%d ", cur->val);  // the only difference from inorder traversal
                prev->right = cur;
                cur = cur->left;
            }
            else
            {
                prev->right = NULL;
                cur = cur->right;
            }
        }
    }
}
Complexity analysis:
Time and space complexity are the same as for inorder.

3. Postorder Traversal

(Tiger's own note: alternatively, first swap every node's left and right subtrees, run the preorder traversal, then reverse the result, and finally swap the subtrees back.)

Postorder is slightly more involved. It needs a temporary node dump whose left child is root, plus a subroutine that outputs, in reverse order, the nodes on the path between two given nodes.
Steps:
Set the current node to the temporary node dump.
1. If the current node's left child is null, move to its right child.
2. If the current node's left child is not null, find the current node's in-order predecessor in its left subtree.
   a) If the predecessor's right child is null, set its right child to the current node. Move to the current node's left child.
   b) If the predecessor's right child is the current node, reset it to null. Output, in reverse order, all nodes on the path from the current node's left child to this predecessor. Move to the current node's right child.
3. Repeat steps 1 and 2 until the current node is null.
Figure: (image omitted)

Code:
void reverse(TreeNode *from, TreeNode *to) // reverse the tree nodes 'from' -> 'to'.
{
    if (from == to)
        return;
    TreeNode *x = from, *y = from->right, *z;
    while (true)
    {
        z = y->right;
        y->right = x;
        x = y;
        y = z;
        if (x == to)
            break;
    }
}

void printReverse(TreeNode* from, TreeNode *to) // print the reversed tree nodes 'from' -> 'to'.
{
    reverse(from, to);

    TreeNode *p = to;
    while (true)
    {
        printf("%d ", p->val);
        if (p == from)
            break;
        p = p->right;
    }

    reverse(to, from);
}

void postorderMorrisTraversal(TreeNode *root) {
    TreeNode dump(0);
    dump.left = root;
    TreeNode *cur = &dump, *prev = NULL;
    while (cur)
    {
        if (cur->left == NULL)
        {
            cur = cur->right;
        }
        else
        {
            prev = cur->left;
            while (prev->right != NULL && prev->right != cur)
                prev = prev->right;

            if (prev->right == NULL)
            {
                prev->right = cur;
                cur = cur->left;
            }
            else
            {
                printReverse(cur->left, prev);  // print the left-child..predecessor path in reverse
                prev->right = NULL;
                cur = cur->right;
            }
        }
    }
}
Complexity analysis:
Space is again O(1); time is also O(n). The reverse-output step only increases the constant factor.
Note:
All of the code above, plus tests, is available in my Github.
References:
http://www.geeksforgeeks.org/inorder-tree-traversal-without-recursion-and-without-stack/
http://www.geeksforgeeks.org/morris-traversal-for-preorder/
http://stackoverflow.com/questions/6478063/how-is-the-complexity-of-morris-traversal-on
http://blog.csdn.net/wdq347/article/details/8853371
Data Structures and Algorithms in C++ by Adam Drozdek
---------------
Previously I only knew the recursive and stack+iterative ways to traverse a binary tree; only yesterday did I learn about this O(1)-space method. The above is my summary based on online material plus my own understanding; corrections are very welcome.
Original post; reposting is welcome with attribution: http://www.cnblogs.com/AnnieKim/archive/2013/06/15/MorrisTraversal.html.

tinyURL (from a group discussion)

Q:
how to design a tinyURL website.

A:    Has anyone worked out how to design tiny URL?

(See also: "How Weibo's short-URL generation algorithm works, with Java and PHP implementation examples")



I think for those questions, starting from a simple, naive solution is a good way to go.

simply solve the problem without worrying about scaling first

effectively, tiny url computes a hash of the given URL

The naive way is to hash directly to a short string
but how do we design the hash function?

as it scales, you see hash collisions; it does not matter which hash function you pick
so for any given URL, h = hash(url)
and you have a key-value datastore, which is a map from hash to url
check if the hash is already used by a different url,
if so, you need a different hash
For short URLs, I think you take the md5, split the digest into 4 parts, then just pick one part and work with it.
the above solution is a naive solution to start with
then let us look into some interesting things about tiny url
first, this mapping is append-only; records are never deleted. It means the locking can be optimized here
existing entries will be read only
effectively it means lock free
new entries will be written to the db

By "lock", you mean we only need to lock when 2 URLs hash to the same short one?
Oh, you mean don't lock the whole table?
Only lock part of it?
            sure, even an RDBMS never locks the whole table
Just lock the row being written. Good idea.

you might also want to discuss how to distribute the load
how to do better than a db can offer, and how to maintain consistency, i.e., the new tiny url should later correctly return the original URL
no matter how soon it is requested
say, you have a lot of urls already in db
at the same time, there are a lot of new requests every seconds
both reads (translations) and writes (new hashes)
to create a new tiny url from a long url can be a little slower, users can generally accept the fact
but reads will be a lot more than writes
for any URL, it will only be created once, but read multiple times
how to distribute the workload evenly?
you want to use multiple dbs / servers 

say, you have a hash of 6 chars

[A-Za-z0-9]

you may have one server, serving hashes starting with [A-Z]

one for [a-z], and one for [0-9]

so you have 3 servers

so you have 3 servers

and the hashes will never overlap

another nice thing is, the url creation/request goes to the same server

so there is no consistency problem to worry about

and we expect the requests will distribute evenly 
it depends on how the hash functions works

So the same tiny url leads to the same url
if you share it with friends

everybody will be redirected to the same page

The same URL generates the same TINY URL
With multiple servers, you have to consider consistency
After that, only reads are needed
not necessarily i think, it depends on how you design the system

you may want to avoid the db check to see if it is already there

then you need to make sure the new ID you generated will never be an already-used one

it can be done, but you still need to write the record to the db

effectively you save a db read

to verify if it is already in use

Store the long URL in the database and take the returned id, then look up the corresponding string; e.g. if the returned ID is 1 the string combination is bbb, for ID 2 it is bba, and so on; a repeat can only appear once all 62 combinations per position are exhausted
but i think in those systems, cache will play an important role

avoiding the db read might not be as beneficial as we think

anyway, you need a db write after new ID creation

unless you want to risk data loss by keeping the new IDs in mem

and writing those into the db in batches

it will be a lot faster

it means, don't write only one ID into db every time

keep those IDs in memory,
and write them into db in batches

it will be a lot faster
since I/O comes with high overhead

and write lock is expensive

yes

it is some optimization

Do you mean a db cache?

i mean, writing to db is slow
so you want to write 100 IDs at a time
 instead of writing 100 times with 1 ID written each time

You definitely need a CACHE
it can be 10x faster

and there will be a lot of requests, so i think the db read will have a very high chance of cache hit

by generating different IDs for the same URL

you have a different set of problems

someone said it might be different
it will work

The cache and the DB store the same thing
but reading from the DB is slow
so you keep the very popular, widely shared URLs cached in memory

Google can generate different tiny urls for the same URL

Taking 6-character combinations of [a-zA-Z0-9] (62 characters) gives over 50 billion combinations. Map numbers to character combinations to get unique strings; e.g. the 62nd combination is aaaaa9 and the 63rd is aaaaba. Then apply a shuffle algorithm to permute the original strings before saving, so the combination stored at each position is in scrambled order.
Store the long URL in the database and take the returned id, then look up the corresponding string: if the returned ID is 1 the combination is bbb, for ID 2 it is bba, and so on; a repeat can only appear once all combinations are exhausted. So with the 62 characters above, taking arbitrary 6-character combinations, repeats only become possible once your data volume passes 50 billion.
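The "generate in order, then shuffle" scheme just described can be sketched in Python, using a tiny 2-character code space so the whole table fits in memory (a real system would pre-generate and persist the shuffled table):

```python
import itertools
import random

ALPHABET = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"

# pre-generate all codes in order (aa, ab, ac, ...), then shuffle once
codes = ["".join(t) for t in itertools.product(ALPHABET, repeat=2)]
random.Random(42).shuffle(codes)  # fixed seed only so the sketch is repeatable

id_to_code = codes                        # id n -> codes[n]
code_to_id = {c: i for i, c in enumerate(codes)}

# a shuffle is a permutation: no code is lost and no code repeats
assert len(set(codes)) == 62 * 62
assert code_to_id[id_to_code[100]] == 100
```

This is why the shuffle cannot introduce repeats: it only reorders an already collision-free enumeration, so consecutive ids map to unguessable codes.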

萌喵喵脑残粉(28015012)  1:46:42 AM
But one TINY URL per long URL is acceptable too, right?
i think, the above means, you generate an ID for every coming URL

The simplest URL-to-TINY-URL algorithm I've seen is just MD5 then MOD
say, for coming URL "google.com", you return ID "100"

That works if there is no limit on how many characters the TINY URL must have
and you already have a map from ID to string

100 -> abcdef

and most importantly, the IDs will only increase

How do you get an unordered combination?
I also think assigning them in order is too dumb
shuffle

Hold on, the key is the shuffle... won't there be repeats after shuffling?
0 -> zafkds
1 -> ajiiii
2 -> 7juo9

something like this

i mean, shuffle the order
 don't change the strings

initially, 0 -> aaaaaa

1 -> aaaaab
2 -> aaaaac

after shuffle, 0 -> yudxax

Shuffle the indexes, you mean
Google generates different tiny urls for the same long url in order to track click counts
the guy who wrote that did not explain that well

so you have a mapping from ID -> string of length 6

Right, just change the order so nobody can guess
and different IDs maps to different strings

then you only need to generate IDs, starting from 0

it is a lot faster

you only do n++ every time :)

no hashing
and no need to check
you are sure they will never conflict
so the last problem is
how to generate the IDs fast and reliably ?

fast -> do it in mem, and distributed

yes, the hashing can be done ahead of time

you only keep a mapping from IDs to strings

those are generated and shuffled ahead of time

for IDs, in mem n++ is easy

how to do it on multiple servers?

you mean the pre-generated?
 never

because those are effectively in order

you shuffle them afterwards
refer to the example above

0 -> aaaaaa, 1->aaaaab

those are in order generated, so the values are never the same

忘了(2754664635)  1:56:23 AM
When a url comes in, assign it an id; the id maps directly to a short url
Junjie(229516660)  1:56:30 AM
yes

that's why you see a different ID every time

看热闹的BT(935610471)  1:56:56 AM
If it is generated this way, couldn't the tiny url be decoded by someone?
Junjie(229516660)  1:57:05 AM
you never re-use it
so there is no way to decode it, right?

the ID part

The id has nothing to do with the url
in mem ++ is easy

but you want to do that on multiple servers

each time, a child ID server will request a block of new IDs

and all users will only get IDs from child servers

the central server only maintains what the next block would be

basically a single number

the child server will generate IDs in the given block

since the blocks do not overlap

there is no need to worry about different child servers giving out the same ID, right?

and the child server in mem does n++

return ++n :)

it is atomic, no lock is needed

now, we have the speed

then, availability

N is the current ID in the current block on a child server

say, child server X get a block [10000, 20000)

and currently N = 15000

so next ID returned will be 15001

something like that

since the blocks are kept in mem

we simply mark the whole block lost

when the child server is back, request a new block

the central server cannot keep the next available ID only in mem

it has to be stored reliably

but since there are not many requests (all whole blocks) made to the central server

it wont be a big problem

and actually you can have multiple central servers if needed

忘了(2754664635)  2:06:55 AM
When finding the next available id, don't you need to lock that table?
Junjie(229516660)  2:07:03 AM
as long as their ID space does not overlap

no table, it is only a single number

yes, it has to be in DB

in case the central server dies

hmm, actually i think it is a better way

than generating same hash for url

it is a lot faster and scalable

the block here is really two numbers
 start and stop

I don't get how multiple servers find the next available id... it feels like multithreading; finding it on the same child server would need a lock.
忘了(2754664635)  2:11:02 AM
Oh, so it is just a numeric interval.
Junjie(229516660)  2:11:08 AM
yes

like child server X get a block of [a, b)

Junjie(229516660)  2:11:35 AM
request_new_block(size)

will return [a, b)

a is start, b is stop

so child servers and central server only deal with IDs

there is another server, which returns the 6-char string for an ID

it is pre generated, read only

you only lose some IDs

Oh. So those ids are not reused; then will 2 URLs map to different short urls?
yes drop the used block

maybe you can figure out how much of it was left

because you also record the real URL for the short URL

if you scan the DB

you will find where it stops

but not worth it maybe

simply use a new block

So we don't store the relation between real URL and short URL directly
only real URL -> id -> short URL, right?
short url <-> id -> url

you have to record the mapping from id to url

child only get a small block every time

when used up, ask for a new block

Choosing which child to hand a block to is also the central server's job
that's why if the child server crashes, you just discard the whole old block

depends, even a cache can be reliable to some extent

e.g., mem cached high availability

your data automatically gets multiple copies

and distributed in different servers

and you have a policy to write data back to DB

Does the db store id->url?
yes

there is another read-only mapping, both ways, from short url to id and from id to short url

id to short url is used when returning the short url to the user

the other one is used when serving user requests

id <-> short url
this bi-directional mapping are read only

short url -> id -> original url
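The block-allocation scheme from the discussion above, sketched in Python (class and method names are invented for illustration): the central server hands out non-overlapping [start, stop) ranges, and each child server then does a plain in-memory increment.

```python
class CentralServer:
    """Persistently tracks only the next unallocated ID."""
    def __init__(self):
        self.next_start = 0   # in a real system, stored durably in a DB

    def request_new_block(self, size):
        start = self.next_start
        self.next_start += size          # the durable write would happen here
        return start, self.next_start    # the half-open range [start, stop)

class ChildServer:
    def __init__(self, central, block_size=10000):
        self.central, self.block_size = central, block_size
        self.n = self.stop = 0

    def next_id(self):
        if self.n >= self.stop:          # block exhausted: fetch a new one
            self.n, self.stop = self.central.request_new_block(self.block_size)
        self.n += 1
        return self.n - 1

central = CentralServer()
x, y = ChildServer(central, 3), ChildServer(central, 3)
ids = [x.next_id(), x.next_id(), y.next_id(), x.next_id(), x.next_id()]
assert len(set(ids)) == len(ids)   # blocks never overlap, so no duplicate IDs
```

If a child server crashes, the remainder of its in-memory block is simply discarded; those IDs are lost but never reused, which is exactly the trade-off described above.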



-------------------------------------------------------------------------
 Suppose you have a linked list and you want to speed up the allocation of new nodes.
One way to do this is to maintain a list of deleted nodes,
whose memory can be reused when new nodes are allocated.
Instead of using the default new and delete, operator new is overloaded to first try to take a node from
 the list of deleted nodes; only if no deleted nodes are available does it dynamically allocate memory.
Operator delete simply adds the node to the deleted-node list.
This way, instead of allocating new memory from the heap (which is pretty time-consuming),
we reuse already-allocated space. The technique is usually called a free list (a form of caching).
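A minimal sketch of the idea (in Python rather than C++ operator overloading; the `NodePool` and `Node` names are invented for illustration): freed nodes are chained into a free list and handed back out before any fresh allocation.

```python
class Node:
    __slots__ = ("val", "next")

class NodePool:
    """Free-list allocator: reuse released nodes before creating new ones."""
    def __init__(self):
        self._free = None      # head of the free list (chained via .next)
        self.fresh_allocs = 0  # how many nodes were truly allocated

    def alloc(self, val):
        node = self._free
        if node is not None:           # reuse a previously freed node
            self._free = node.next
        else:                          # free list empty: allocate fresh
            node = Node()
            self.fresh_allocs += 1
        node.val, node.next = val, None
        return node

    def free(self, node):
        node.next = self._free         # push onto the free list
        self._free = node

pool = NodePool()
a = pool.alloc(1)
pool.free(a)
b = pool.alloc(2)                      # reuses the same object, no new allocation
assert b is a and pool.fresh_allocs == 1
```

Both alloc and free are O(1) pushes/pops on the free-list head, which is what makes the technique pay off under heavy node churn.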

Is this the resource-managing-class idea described in Effective C++?

Wednesday, January 15, 2014

300. Longest Increasing Subsequence -M !!!!!!!!!!!!!!!

Given an unsorted array of integers, find the length of longest increasing subsequence.
Example:
Input: [10,9,2,5,3,7,101,18]
Output: 4 
Explanation: The longest increasing subsequence is [2,3,7,101], therefore the length is 4. 
Note:
  • There may be more than one LIS combination, it is only necessary for you to return the length.
  • Your algorithm should run in O(n^2) complexity.
Follow up: Could you improve it to O(n log n) time complexity?

A:

For each new number val, look at all earlier values (O(n^2)): any earlier value < val can be the element preceding val in an increasing subsequence.

class Solution {
public:
    int lengthOfLIS(vector<int>& nums) {
        int n = nums.size();
        vector<int> Count(n,0);// at each index, ending here, how many LIS can have
        
        int res =0;
        for(int i =0;i<n;++i)
        {
            int preCount =0;
            for(int j=0;j<i;++j)
            {
                if(nums[j]<nums[i])
                {
                    preCount = max(preCount, Count[j]);
                }
            }
            Count[i] = 1 + preCount;
            res = max(res, Count[i]);
        }
        return res;
    }
};

Second pass: DP, maintaining at each step the smallest possible ending value seen so far for every length.
I.e., V[i] is the smallest possible ending value of a Longest Increasing Subsequence of length i+1.

class Solution {
public:
    int lengthOfLIS(vector<int>& nums) {
        if (nums.empty()) return 0;
        vector<int> V{nums[0]}; // V[i] records the minimum ending value of length (i+1)
        for (int i = 1; i < nums.size(); i++) {
            int val = nums[i];
            for (int j = V.size() - 1; j >= 0; j--) {
                if (val > V[j]) { // care only if we can append at the end
                    if (j == V.size() - 1) {
                        V.push_back(val);
                        break;
                    }
                } else if (val < V[j]) {
                    if (j == 0 || val > V[j - 1]) {
                        V[j] = val;
                        break; // since we only edit once
                    }
                }
            }
        }
        return V.size();
    }
};

---------The version above does not exploit the fact that V is already sorted, and only one entry ever needs changing. So keep an array V whose entry for each length is the smallest possible ending value found so far (this sequence is necessarily increasing), and binary-search it.

class Solution {
public:
    int lengthOfLIS(vector<int>& nums) {
        // save the record that: for length of k(use index +1), the  minimum ending Value 
        vector<int> vec;
        for(auto val : nums)
        {   // binary search for the first position whose value is >= val (may run past the end)
            int low = 0, high = vec.size();
            while(low<high)
            {
                int mid = low+ (high - low)/2;
                if(vec[mid] >= val) // here, >= is the key.  (previous, I choose >)
                {
                    high = mid;
                }else{
                    low = mid+1;
                }
            }
            if(low >= vec.size()) // failed to find an ending value bigger than current
            {
                vec.push_back(val);
            }else{
                vec[low] = val;
            }
        }
        return vec.size();
    }
};



American flag sort


 wiki link


An American flag sort is an efficient, in-place variant of radix sort that distributes items into hundreds of buckets. Non-comparative sorting algorithms such as radix sort and American flag sort are typically used to sort large objects such as strings, for which comparison is not a unit-time operation.[1] American flag sort iterates through the bits of the objects, considering several bits of each object at a time. For each set of bits, American flag sort makes two passes through the array of objects: first to count the number of objects that will fall in each bin, and second to place each object in its bucket. This works especially well when sorting a byte at a time, using 256 buckets. With some optimizations, it is twice as fast as quicksort for large sets of strings.[1]
The name comes by analogy with the Dutch national flag problem in the last step: efficiently partition the array into many "stripes".


Algorithm

Sorting algorithms in general sort a list of objects according to some ordering scheme. In contrast to comparison-based sorting algorithms, such as quicksort, American flag sort can only sort integers (or objects that can be interpreted as integers). In-place sorting algorithms, including American flag sort, run without allocating a significant amount of memory beyond that used by the original array. This is a significant advantage, both in memory savings and in time saved copying the array.
American flag sort works by successively dividing a list of objects into buckets based on the first digit of their base-N representation (the base used is referred to as the radix). When N is 2, each object can be swapped into the correct bucket by using the Dutch national flag algorithm. When N is larger, however, objects cannot be immediately swapped into place, because it is unknown where each bucket should begin and end.

American flag sort gets around this problem by making two passes through the array.
1) The first pass counts the number of objects that belong in each of the N buckets. The beginning and end of each bucket in the original array is then computed as the sum of sizes of preceding buckets.
2)The second pass swaps each object into place.

American flag sort is most efficient with a radix that is a power of 2, because bit-shifting operations can be used instead of expensive logarithms to compute the value of each digit. When sorting strings using 8- or 7-bit encodings such as ASCII, it is typical to use a radix of 256, which corresponds to sorting character-by-character.[1]

Pseudocode

american_flag_sort(Array, Radix, Digit)
  # first pass: compute counts
  Counts <- zeros(Radix)
  for object X in Array:
    Counts[digit Digit of X in base Radix] += 1
  # compute bucket offsets (each bucket starts after the preceding buckets)
  Offsets <- [ sum(Counts[0..i-1]) for i in 0..Radix-1 ]
  # second pass: swap objects into place
  for object X in Array:
    swap X to the bucket starting at Offsets[digit Digit of X in base Radix]
  # recurse on the next digit within each bucket
  for each Bucket:
    american_flag_sort(Bucket, Radix, Digit + 1)
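Assuming non-negative 32-bit integers and a radix of 256 (one byte at a time, most-significant byte first), the pseudocode can be made concrete in Python as a sketch:

```python
def american_flag_sort(a, lo=0, hi=None, shift=24):
    """In-place MSD radix sort of non-negative 32-bit ints, 256 buckets."""
    if hi is None:
        hi = len(a)
    if hi - lo <= 1 or shift < 0:
        return
    radix = 256
    # first pass: count how many objects fall in each bucket
    counts = [0] * radix
    for i in range(lo, hi):
        counts[(a[i] >> shift) & 0xFF] += 1
    # compute bucket start offsets as sums of preceding bucket sizes
    offsets = [lo] * radix
    for b in range(1, radix):
        offsets[b] = offsets[b - 1] + counts[b - 1]
    starts = offsets[:]          # bucket boundaries, kept for the recursion
    nxt = offsets[:]             # next free slot within each bucket
    # second pass: swap each object into its bucket
    for b in range(radix):
        end = starts[b] + counts[b]
        while nxt[b] < end:
            d = (a[nxt[b]] >> shift) & 0xFF
            if d == b:
                nxt[b] += 1      # already in the right bucket
            else:
                a[nxt[b]], a[nxt[d]] = a[nxt[d]], a[nxt[b]]
                nxt[d] += 1
    # recurse into each bucket on the next byte
    for b in range(radix):
        if counts[b] > 1:
            american_flag_sort(a, starts[b], starts[b] + counts[b], shift - 8)

a = [170, 45, 75, 90, 802, 24, 2, 66]
american_flag_sort(a)
assert a == [2, 24, 45, 66, 75, 90, 170, 802]
```

The two passes match the description above: counting determines where each "stripe" begins, and the swapping pass permutes elements in place without auxiliary arrays of objects.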

Tuesday, January 14, 2014

Maximum Sum Subarray (non-LC)

Q:

the maximum subarray problem is the task of finding the contiguous subarray within a one-dimensional array of numbers (containing at least one positive number) which has the largest sum. For example, for the sequence of values −2, 1, −3, 4, −1, 2, 1, −5, 4; the contiguous subarray with the largest sum is 4, −1, 2, 1, with sum 6.

  A:

Kadane's algorithm

Kadane's algorithm consists of a scan through the array values, computing at each position the maximum subarray ending at that position. This subarray is either empty (in which case its sum is zero) or consists of one more element than the maximum subarray ending at the previous position. Thus, the problem can be solved with the following code, expressed here in Python:
def max_subarray(A):
    max_ending_here = max_so_far = 0
    for x in A:
        max_ending_here = max(0, max_ending_here + x)
        max_so_far = max(max_so_far, max_ending_here)
    return max_so_far
A variation of the problem that does not allow zero-length subarrays to be returned in the case that the entire array consists of negative numbers can be solved with the following code:
def max_subarray(A):
    max_ending_here = max_so_far = A[0]
    for x in A[1:]:
        max_ending_here = max(x, max_ending_here + x)
        max_so_far = max(max_so_far, max_ending_here)
    return max_so_far
 
 
 
The algorithm can also be easily modified to keep track of the starting and ending indices of the maximum subarray (see commented code).
Because of the way this algorithm uses optimal substructures (the maximum subarray ending at each position is calculated in a simple way from a related but smaller and overlapping subproblem, the maximum subarray ending at the previous position) this algorithm can be viewed as a simple example of dynamic programming.
The runtime complexity of Kadane's algorithm is O(n).

 The above is a bit scattered. In short it is DP: treat each newly added index as the ending index of the max-sum subarray, and check whether the running sum before it is positive.

public class MaxSumSubarray {
    // Kadane's algorithm
    public int maxSumSubArray(int[] A) {
        if (A == null || A.length == 0)
            return 0;
        int n = A.length;
        int[] maxSum = new int[n];
        maxSum[0] = A[0];
        int result = maxSum[0];

        for (int i = 1; i < n; i++) {
            maxSum[i] = A[i] + Math.max(0, maxSum[i - 1]);
            result = Math.max(result, maxSum[i]);
        }
        return result;
    }
}


Tuesday, January 7, 2014

Post office / village problem

Q:

A highway is modeled as a number line with V villages, each located at some point on the line. Choose P of the villages in which to build post offices so that the sum, over all villages, of the distance to the nearest post office is minimized.

A:

[Analysis]: classic DP.
1. First consider building only ONE post office among the villages: clearly the office should be built at the median village. That is, to serve villages a..b with one office at minimum cost, build it at village (a+b)/2 (easy to see with a picture).
2. Now consider MULTIPLE offices. Split the problem into subproblems: the minimum cost of building j offices among the first i villages equals, minimized over k, (the minimum cost of building j-1 offices among the first k villages) + (the minimum cost of one office serving villages k+1 through i). The one-office cost was derived above.
3. State representation: two arrays:
dp[i][j]: minimum cost of building j post offices among the first i villages
sum[i][j]: minimum cost of building 1 post office serving villages i through j
Transition: dp[i][j] = min(dp[i][j], dp[k][j-1] + sum[k+1][i]); the DP boundary is dp[i][1] = sum[1][i]; the answer is dp[vil_num][post_num].
4. Computing the sum array efficiently: suppose 6 villages at coordinates p1..p6. For sum[1][4] the office goes at village 2 or 3: at 2 the cost is p4-p2+p3-p2+p2-p1 = p4+p3-p2-p1, and at 3 it is p4-p3+p3-p2+p3-p1 = p4+p3-p2-p1, the same either way. For sum[1][5] the median village is 3, and the total distance of villages 1..4 to it was just computed (it equals sum[1][4]), so we only need to add the distance from village 5 to village 3. Likewise sum[1][6] = sum[1][5] plus the distance from village 6 to the median. Hence the recurrence: sum[i][j] = sum[i][j-1] + p[j] - p[(i+j)/2].

public class PostOffice {
    int[] villages; // village coordinates, sorted ascending

    int numberOfVillages;
    int numberOfPostOffices;

    public PostOffice(int numberOfVillages, int numberOfPostOffices) {
        this.numberOfVillages = numberOfVillages;
        this.numberOfPostOffices = numberOfPostOffices;
        villages = new int[numberOfVillages];
    }

    // dp[i][j]: min cost of serving the first i villages with j offices
    // sum[i][j]: cost of serving villages i..j (1-based) with ONE office at their median
    public int minTotalCost() {
        int n = numberOfVillages, p = numberOfPostOffices;
        int[][] sum = new int[n + 2][n + 2];
        for (int i = 1; i <= n; i++)
            for (int j = i + 1; j <= n; j++)
                sum[i][j] = sum[i][j - 1] + villages[j - 1] - villages[(i + j) / 2 - 1];

        final int INF = Integer.MAX_VALUE / 2;
        int[][] dp = new int[n + 1][p + 1];
        for (int i = 1; i <= n; i++) {
            dp[i][1] = sum[1][i];
            for (int j = 2; j <= Math.min(p, i); j++) {
                dp[i][j] = INF;
                for (int k = j - 1; k < i; k++)
                    dp[i][j] = Math.min(dp[i][j], dp[k][j - 1] + sum[k + 1][i]);
            }
        }
        return dp[n][p];
    }

    public static void main(String[] args) {
        PostOffice problem = new PostOffice(10, 5);
        int[] coords = {1, 2, 3, 6, 7, 9, 11, 22, 44, 50};
        System.arraycopy(coords, 0, problem.villages, 0, coords.length);
        System.out.println(problem.minTotalCost()); // minimum total distance
    }
}



Mistakes