
[Translation: Joel on Software] Back to Basics

2013-11-01 21:24
Joel on Software
Back to Basics
by Joel Spolsky
Tuesday, December 11, 2001
We spend a lot of time on this site talking about exciting Big Picture Stuff like .NET versus Java, XML strategy, Lock-In, competitive strategy, software design, architecture, and so forth. All this stuff is a layer cake, in a way. At the top layer, you've got software strategy. Below that, we think about architectures like .NET, and below that, individual products: software development products like Java or platforms like Windows.
Go lower on the cake, please. DLLs? Objects? Functions? No! Lower! At some point you're thinking about lines of code written in programming languages.
Still not low enough. Today I want to think about CPUs. A little bit of silicon moving bytes around. Pretend you are a beginning programmer. Tear away all that knowledge you've built up about programming, software, management, and get back to the lowest level Von Neumann fundamental stuff. Wipe J2EE out of your mind for a moment. Think Bytes.
Why are we doing this? I think that some of the biggest mistakes people make even at the highest architectural levels come from having a weak or broken understanding of a few simple things at the very lowest levels. You've built a marvelous palace but the foundation is a mess. Instead of a nice cement slab, you've got rubble down there. So the palace looks nice but occasionally the bathtub slides across the bathroom floor and you have no idea what's going on.
So today, take a deep breath. Walk with me, please, through a little exercise which will be conducted using the C programming language.
Remember the way strings work in C: they consist of a bunch of bytes followed by a null character, which has the value 0. This has two obvious implications:
1. There is no way to know where the string ends (that is, the string length) without moving through it, looking for the null character at the end.
2. Your string can't have any zeros in it. So you can't store an arbitrary binary blob like a JPEG picture in a C string.
Why do C strings work this way? It's because the PDP-7 microprocessor, on which UNIX and the C programming language were invented, had an ASCIZ string type. ASCIZ meant "ASCII with a Z (zero) at the end."
Is this the only way to store strings? No, in fact, it's one of the worst ways to store strings. For non-trivial programs, APIs, operating systems, class libraries, you should avoid ASCIZ strings like the plague. Why?
Let's start by writing a version of the code for strcat, the function which appends one string to another.

void strcat( char* dest, char* src )
{
     while (*dest) dest++;          /* walk to the null terminator of dest */
     while (*dest++ = *src++);      /* copy src, terminator included */
}


Study the code a bit and see what we're doing here. First, we're walking through the first string looking for its null-terminator. When we find it, we walk through the second string, copying one character at a time onto the first string.
This kind of string handling and string concatenation was good enough for Kernighan and Ritchie, but it has its problems. Here's a problem. Suppose you have a bunch of names that you want to append together in one big string:

char bigString[1000];     /* I never know how much to allocate... */
bigString[0] = '\0';
strcat(bigString, "John, ");
strcat(bigString, "Paul, ");
strcat(bigString, "George, ");
strcat(bigString, "Joel ");


This works, right? Yes. And it looks nice and clean.
What is its performance characteristic? Is it as fast as it could be? Does it scale well? If we had a million strings to append, would this be a good way to do it?
No. This code uses the Shlemiel the painter's algorithm. Who is Shlemiel? He's the guy in this joke:
Shlemiel gets a job as a street painter, painting the dotted lines down the middle of the road. On the first day he takes a can of paint out to the road and finishes 300 yards of the road. "That's pretty good!" says his boss, "you're a fast worker!" and pays him a kopeck.
The next day Shlemiel [w1] only gets 150 yards done. "Well, that's not nearly as good as yesterday, but you're still a fast worker. 150 yards is respectable," and pays him a kopeck.
The next day Shlemiel paints 30 yards of the road. "Only 30!" shouts his boss. "That's unacceptable! On the first day you did ten times that much work! What's going on?"
"I can't help it," says Shlemiel. "Every day I get farther and farther away from the paint can!"
(For extra credit, what are the real numbers?)

This lame joke illustrates exactly what's going on when you use strcat like I just did. Since the first part of strcat has to scan through the destination string every time, looking for that dang null terminator again and again, this function is much slower than it needs to be and doesn't scale well at all. Lots of code you use every day has this problem. Many file systems are implemented in a way that it's a bad idea to put too many files in one directory, because performance starts to drop off dramatically when you get thousands of items in one directory. Try opening an overstuffed Windows recycle bin to see this in action -- it takes hours to show up, which is clearly not linear in the number of files it contains. There must be a Shlemiel the Painter's Algorithm in there somewhere. Whenever something seems like it should have linear performance but it seems to have n-squared performance, look for hidden Shlemiels. They are often hidden by your libraries. Looking at a column of strcats or a strcat in a loop doesn't exactly shout out "n-squared," but that is what's happening.
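To put numbers on the n-squared behaviour, here is a minimal sketch that instruments the same loop. The names counting_strcat and total_walked are made up for this illustration, and the two-byte piece "ab" is an arbitrary choice.

```c
#include <assert.h>
#include <stdlib.h>

/* Same loops as the strcat above, but it also reports how many bytes of
   dest it had to walk before it found the null terminator. */
static size_t counting_strcat(char *dest, const char *src)
{
    size_t walked = 0;
    while (*dest) { dest++; walked++; }
    while ((*dest++ = *src++) != '\0')
        ;
    return walked;
}

/* Appends the two-byte piece "ab" `count` times to an empty string and
   returns the total number of bytes walked while hunting for terminators. */
size_t total_walked(size_t count)
{
    char *buf = malloc(2 * count + 1);
    size_t total = 0, i;

    assert(buf != NULL);
    buf[0] = '\0';
    for (i = 0; i < count; i++)
        total += counting_strcat(buf, "ab");
    free(buf);
    return total;
}
```

With two-byte pieces the total comes to n(n-1) bytes for n appends, so doubling the number of appends roughly quadruples the scanning work, even though the amount of data copied only doubles.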
How do we fix this? A few smart C programmers implemented their own mystrcat as follows:

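char* mystrcat( char* dest, char* src )
{
     while (*dest) dest++;          /* walk to the null terminator of dest */
     while (*dest++ = *src++);      /* copy src, terminator included */
     return --dest;                 /* step back onto the new terminator */
}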


What have we done here? At very little extra cost we're returning a pointer to the end of the new, longer string. That way the code that calls this function can decide to append further without rescanning the string.
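The calling code for this step does not appear at this point in this copy. A sketch of what it might look like, reusing the Beatles example that appears later in the article, follows. The wrapper build_names is a hypothetical name added so the snippet stands alone, and src is declared const here, which the article's version does not do.

```c
#include <assert.h>
#include <string.h>

/* mystrcat as described above: returns a pointer to the new terminator. */
char *mystrcat(char *dest, const char *src)
{
    while (*dest) dest++;
    while ((*dest++ = *src++) != '\0')
        ;
    return --dest;
}

/* Builds the name list into out, which the caller must make large enough.
   Each call starts where the previous one finished, so nothing is rescanned. */
char *build_names(char *out)
{
    char *p = out;

    assert(out != NULL);
    out[0] = '\0';
    p = mystrcat(p, "John, ");
    p = mystrcat(p, "Paul, ");
    p = mystrcat(p, "George, ");
    p = mystrcat(p, "Joel ");
    return out;
}
```

Because p always points at the current terminator, the first loop in mystrcat exits immediately on every call.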

This is, of course, linear in performance, not n-squared, so it doesn't suffer from degradation when you have a lot of stuff to concatenate.
The designers of Pascal were aware of this problem and "fixed" it by storing a byte count in the first byte of the string. These are called Pascal Strings. They can contain zeros and are not null terminated. Because a byte can only store numbers between 0 and 255, Pascal strings are limited to 255 bytes in length, but because they are not null terminated they occupy the same amount of memory as ASCIZ strings. The great thing about Pascal strings is that you never have to have a loop just to figure out the length of your string. Finding the length of a string in Pascal is one assembly instruction instead of a whole loop. It is monumentally faster.
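A length-prefixed string can be sketched in C roughly as follows. The PString type and the pstr_ functions are hypothetical names for this illustration, not an API from any real library.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical layout: byte 0 holds the length, bytes 1..255 hold the data. */
typedef struct {
    unsigned char bytes[256];
} PString;

/* Stores up to 255 bytes of data (zeros allowed); returns how many were kept. */
size_t pstr_set(PString *s, const void *data, size_t len)
{
    assert(s != NULL && data != NULL);
    if (len > 255)
        len = 255;
    s->bytes[0] = (unsigned char)len;
    memcpy(s->bytes + 1, data, len);
    return len;
}

/* The length is a single load from byte 0; there is no loop. */
size_t pstr_len(const PString *s)
{
    return s->bytes[0];
}

/* Appends b to a, truncating at 255 bytes. No scan for a terminator. */
void pstr_cat(PString *a, const PString *b)
{
    size_t alen = pstr_len(a);
    size_t blen = pstr_len(b);

    if (alen + blen > 255)
        blen = 255 - alen;
    memcpy(a->bytes + 1 + alen, b->bytes + 1, blen);
    a->bytes[0] = (unsigned char)(alen + blen);
}
```

The append jumps straight to the end of the destination using the stored length, which is what removes the Shlemiel behaviour. The price is the 255-byte ceiling.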
The old Macintosh operating system used Pascal strings everywhere. Many C programmers on other platforms used Pascal strings for speed. Excel uses Pascal strings internally which is why strings in many places in Excel are limited to 255 bytes, and it's also one reason Excel is blazingly fast.
For a long time, if you wanted to put a Pascal string literal in your C code, you had to write:
char* str = "\006Hello!";
Yep, you had to count the bytes by hand, yourself, and hardcode it into the first byte of your string. Lazy programmers would do this, and have slow programs:
char str[] = "*Hello!";     /* an array, not a pointer to a literal, so byte 0 is writable */
str[0] = strlen(str) - 1;

Notice in this case you've got a string that is null terminated (the compiler did that) as well as a Pascal string. I used to call these fucked strings because it's easier than calling them null terminated pascal strings but this is a rated-G channel so you will have to use the longer name.
I elided an important issue earlier. Remember this line of code?
char bigString[1000];     /* I never know how much to allocate... */
Since we're looking at the bits today I shouldn't have ignored this. I should have done this correctly: figured out how many bytes I needed and allocated the right amount of memory.
Shouldn't I have?
Because otherwise, you see, a clever hacker will read my code and notice that I'm only allocating 1000 bytes and hoping it will be enough, and they'll find some clever way to trick me into strcatting a 1100 byte string into my 1000 bytes of memory, thus overwriting the stack frame and changing the return address so that when this function returns, it executes some code which the hacker himself wrote. This is what they're talking about when they say that a particular program has a buffer overflow susceptibility. It was the number one cause of hacks and worms in the olden days before Microsoft Outlook made hacking easy enough for teenagers to do.
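One common defence, sketched here under the assumption that the caller knows the buffer's capacity, is to check the sizes before copying. The name append_checked is hypothetical.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: appends src to the string in dest, whose buffer is
   cap bytes in total. Returns 0 on success, or -1 (leaving dest untouched)
   if the result plus its null terminator would not fit. */
int append_checked(char *dest, size_t cap, const char *src)
{
    size_t dlen = strlen(dest);
    size_t slen = strlen(src);

    assert(dlen < cap);             /* dest must already be a valid string */
    if (slen >= cap - dlen)         /* need dlen + slen + 1 <= cap */
        return -1;
    memcpy(dest + dlen, src, slen + 1);
    return 0;
}
```

This refuses the 1100-byte append instead of writing past the end of the buffer. It still rescans dest on every call, so it addresses the overflow but not the Shlemiel problem.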
OK, so all those programmers are just lame-asses. They should have figured out how much memory to allocate.
But really, C does not make this easy on you. Let's go back to my Beatles example:
char bigString[1000];     /* I never know how much to allocate... */
char *p = bigString;
bigString[0] = '\0';
p = mystrcat(p, "John, ");
p = mystrcat(p, "Paul, ");
p = mystrcat(p, "George, ");
p = mystrcat(p, "Joel ");

How much should we allocate? Let's try doing this The Right Way.
char* bigString;
int i = 0;
i = strlen("John, ")
    + strlen("Paul, ")
    + strlen("George, ")
    + strlen("Joel ");
bigString = (char*) malloc (i + 1);
/* remember space for null terminator! */
...
My eyes glazeth over. You're probably about ready to change the channel already. I don't blame you, but bear with me because it gets really interesting.
We have to scan through all the strings once just figuring out how big they are, then we scan through them again concatenating. At least if you use Pascal strings the strlen operation is fast. Maybe we can write a version of strcat that reallocates memory for us.
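The measure-then-copy approach described above can be sketched as one function. join_exact is a hypothetical name, and the caller is assumed to free the result.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: joins `count` C strings into one freshly allocated
   string of exactly the right size. Returns NULL if malloc fails. */
char *join_exact(const char *parts[], size_t count)
{
    size_t total = 0, i;
    char *result, *p;

    assert(parts != NULL);
    for (i = 0; i < count; i++)         /* pass 1: measure every piece */
        total += strlen(parts[i]);

    result = malloc(total + 1);         /* +1 for the null terminator */
    if (result == NULL)
        return NULL;

    p = result;
    for (i = 0; i < count; i++) {       /* pass 2: scan again while copying */
        size_t len = strlen(parts[i]);
        memcpy(p, parts[i], len);
        p += len;
    }
    *p = '\0';
    return result;
}
```

Every piece is scanned twice with strlen, which is the cost the paragraph above complains about. With length-prefixed strings both passes would read a stored count instead.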
That opens another whole can of worms: memory allocators. Do you know how malloc works? The nature of malloc is that it has a long linked list of available blocks of memory called the free chain. When you call malloc, it walks the linked list looking for a block of memory that is big enough for your request. Then it cuts that block into two blocks -- one the size you asked for, the other with the extra bytes, and gives you the block you asked for, and puts the leftover block (if any) back into the linked list. When you call free, it adds the block you freed onto the free chain. Eventually, the free chain gets chopped up into little pieces and you ask for a big piece and there are no big pieces available the size you want. So malloc calls a timeout and starts rummaging around the free chain, sorting things out, and merging adjacent small free blocks into larger blocks. This takes 3 1/2 days. The end result of all this mess is that the performance characteristic of malloc is that it's never very fast (it always walks the free chain), and sometimes, unpredictably, it's shockingly slow while it cleans up. (This is, incidentally, the same performance characteristic of garbage collected systems, surprise surprise, so all the claims people make about how garbage collection imposes a performance penalty are not entirely true, since typical malloc implementations had the same kind of performance penalty, albeit milder.)
Smart programmers minimize the potential disruption of malloc by always allocating blocks of memory that are powers of 2 in size. You know, 4 bytes, 8 bytes, 16 bytes, 18446744073709551616 bytes, etc. For reasons that should be intuitive to anyone who plays with Lego, this minimizes the amount of weird fragmentation that goes on in the free chain. Although it may seem like this wastes space, it is also easy to see how it never wastes more than 50% of the space. So your program uses no more than twice as much memory as it needs to, which is not that big a deal.
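Rounding a request up to a power of two takes only a few lines. This is a minimal sketch, and round_up_pow2 is a hypothetical helper rather than a standard function.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: returns the smallest power of two that is >= n,
   for n >= 1. Returns 0 if that power does not fit in a size_t. */
size_t round_up_pow2(size_t n)
{
    size_t p = 1;

    assert(n >= 1);
    while (p < n) {
        if (p > (size_t)-1 / 2)     /* doubling again would overflow */
            return 0;
        p *= 2;
    }
    return p;
}
```

The result is always less than 2n, which is where the "never more than twice as much memory" bound comes from: a request for 17 bytes becomes 32, not 34 or more.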
Suppose you wrote a smart strcat function that reallocates the destination buffer automatically. Should it always reallocate it to the exact size needed? My teacher and mentor Stan Eisenstat suggests that when you call realloc, you should always double the size of memory that was previously allocated. That means that you never have to call realloc more than lg n times, which has decent performance characteristics even for huge strings, and you never waste more than 50% of your memory.
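A minimal sketch of the doubling strategy, assuming a small struct that remembers the length and capacity. StrBuf, the strbuf_ functions, the 16-byte starting size and the reallocs counter are all choices made for this illustration, and overflow of the size arithmetic is not handled.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    char    *data;      /* null terminated once initialised */
    size_t   len;       /* bytes used, not counting the terminator */
    size_t   cap;       /* bytes allocated */
    unsigned reallocs;  /* number of times the buffer grew */
} StrBuf;

int strbuf_init(StrBuf *b)
{
    assert(b != NULL);
    b->data = malloc(16);
    if (b->data == NULL)
        return -1;
    b->data[0] = '\0';
    b->len = 0;
    b->cap = 16;
    b->reallocs = 0;
    return 0;
}

/* Appends src, doubling the capacity until the result fits. The stored
   length means dest is never rescanned. Returns 0, or -1 if realloc fails. */
int strbuf_append(StrBuf *b, const char *src)
{
    size_t slen = strlen(src);

    if (b->len + slen + 1 > b->cap) {
        size_t newcap = b->cap;
        char *tmp;

        while (b->len + slen + 1 > newcap)
            newcap *= 2;
        tmp = realloc(b->data, newcap);
        if (tmp == NULL)
            return -1;
        b->data = tmp;
        b->cap = newcap;
        b->reallocs++;
    }
    memcpy(b->data + b->len, src, slen + 1);
    b->len += slen;
    return 0;
}

void strbuf_free(StrBuf *b)
{
    free(b->data);
    b->data = NULL;
    b->len = b->cap = 0;
}
```

Appending "ab" a thousand times grows the 16-byte buffer to 2048 bytes in 7 reallocations, in line with the lg n bound quoted above.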
Anyway. Life just gets messier and messier down here in byte-land. Aren't you glad you don't have to write in C anymore? We have all these great languages like Perl and Java and VB and XSLT that never make you think of anything like this, they just deal with it, somehow. But occasionally, the plumbing infrastructure sticks up in the middle of the living room, and we have to think about whether to use a String class or a StringBuilder class, or some such distinction, because the compiler is still not smart enough to understand everything about what we're trying to accomplish and is trying to help us not write inadvertent Shlemiel the Painter algorithms.
Last week I wrote that you can't implement the SQL statement SELECT author FROM books fast when your data is stored in XML. Just in case everybody didn't understand what I was talking about, and now that we've been rolling around in the CPU all day, this assertion might make more sense.
How does a relational database implement SELECT author FROM books? In a relational database, every row in a table (e.g. the books table) is exactly the same length in bytes, and every field is always at a fixed offset from the beginning of the row. So, for example, if each record in the books table is 100 bytes long, and the author field is at offset 23, then there are authors stored at byte 23, 123, 223, 323, etc. What is the code to move to the next record in the result of this query? Basically, it's this:
pointer += 100;
One CPU instruction. Faaaaaaaaaast.
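Spelled out a little further, a scan over fixed-length records might look like the sketch below. The 100-byte record and the author offset of 23 come from the example above. The 40-byte, null-padded author field and the function names are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum {
    RECORD_SIZE   = 100,   /* from the example above */
    AUTHOR_OFFSET = 23,    /* from the example above */
    AUTHOR_SIZE   = 40     /* assumed width of the null-padded author field */
};

/* The author field of record `index` is at a computable address:
   no searching, just base + index * 100 + 23. */
const char *author_at(const unsigned char *table, size_t index)
{
    return (const char *)(table + index * RECORD_SIZE + AUTHOR_OFFSET);
}

/* Counts the records whose author equals `name`, stepping from record to
   record with a single pointer addition. */
size_t count_author(const unsigned char *table, size_t nrecords,
                    const char *name)
{
    const unsigned char *record = table;
    size_t i, hits = 0;

    assert(table != NULL && name != NULL);
    for (i = 0; i < nrecords; i++) {
        const char *author = (const char *)record + AUTHOR_OFFSET;
        if (strncmp(author, name, AUTHOR_SIZE) == 0)
            hits++;
        record += RECORD_SIZE;      /* the "pointer += 100" step */
    }
    return hits;
}
```

Moving to the next record never depends on what the previous record contains, which is the property the XML version lacks.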
Now let's look at the books table in XML.
<?xml blah blah>
<books>
<book>
<title>UI Design for Programmers</title>
<author>Joel Spolsky</author>
</book>
<book>
<title>The Chop Suey Club</title>
<author>Bruce Weber</author>
</book>
</books>

Quick question. What is the code to move to the next record?
Uh...
At this point a good programmer would say, well, let's parse the XML into a tree in memory so that we can operate on it reasonably quickly. The amount of work that has to be done here by the CPU to SELECT author FROM books will bore you absolutely to tears. As every compiler writer knows, lexing and parsing are the slowest part of compiling. Suffice it to say that it involves a lot of string stuff, which we discovered is slow, and a lot of memory allocation stuff, which we discovered is slow, as we lex, parse, and build an abstract syntax tree in memory. That assumes that you have enough memory to load the whole thing at once. With relational databases, the performance of moving from record to record is fixed and is, in fact, one CPU instruction. That's very much by design. And thanks to memory mapped files you only have to load the pages of disk that you are actually going to use. With XML, if you preparse, the performance of moving from record to record is fixed but there's a huge startup time, and if you don't preparse, the performance of moving from record to record varies based on the length of the record before it and is still hundreds of CPU instructions long.
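For the "don't preparse" case, a rough sketch of moving to the next record shows where the time goes. This is not a real XML parser: next_author is a hypothetical helper that only looks for literal author tags with strstr, and it ignores attributes, comments, CDATA and encodings.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: finds the next <author>...</author> element at or
   after `from`. On success stores the start and length of the element text
   and returns the position just past the closing tag, from which the next
   search can continue. Returns NULL when there are no more authors. */
const char *next_author(const char *from, const char **text, size_t *len)
{
    static const char open_tag[]  = "<author>";
    static const char close_tag[] = "</author>";
    const char *start, *end;

    assert(from != NULL && text != NULL && len != NULL);
    start = strstr(from, open_tag);     /* byte-by-byte scan */
    if (start == NULL)
        return NULL;
    start += sizeof open_tag - 1;
    end = strstr(start, close_tag);     /* and another one */
    if (end == NULL)
        return NULL;
    *text = start;
    *len = (size_t)(end - start);
    return end + (sizeof close_tag - 1);
}
```

Each call has to read every byte between one author and the next, so the cost of a step depends on how long the intervening titles and tags are, unlike the fixed pointer += 100.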
What this means to me is that you can't use XML if you need performance and have lots of data. If you have a little bit of data, or if what you're doing doesn't have to be fast, XML is a fine format. And if you really want the best of both worlds, you have to come up with a way to store metadata next to your XML, something like Pascal strings' byte count, which give you hints about where things are in the file so that you don't have to parse and scan for them. But of course then you can't use text editors to edit the file because that messes up the metadata, so it's not really XML anymore.
For those three gracious members of my audience who are still with me at this point, I hope you've learned something or rethought something. I hope that thinking about boring first-year computer-science stuff like how strcat and malloc actually work has given you new tools to think about the latest, top level, strategic and architectural decisions that you make in dealing with technologies like XML. For homework, think about why Transmeta chips will always feel sluggish. Or why the original HTML spec for TABLES was so badly designed that large tables on web pages can't be shown quickly to people with modems. Or about why COM is so dang fast but not when you're crossing process boundaries. Or about why the NT guys put the display driver into kernelspace instead of userspace.
These are all things that require you to think about bytes, and they affect the big top-level decisions we make in all kinds of architecture and strategy. This is why my view of teaching is that first year CS students need to start at the basics, using C and building their way up from the CPU. I am actually physically disgusted that so many computer science programs think that Java is a good introductory language, because it's "easy" and you don't get confused with all that boring string/malloc stuff but you can learn cool OOP stuff which will make your big programs ever so modular. This is a pedagogical disaster waiting to happen. Generations of graduates are descending on us and creating Shlemiel The Painter algorithms right and left and they don't even realize it, since they fundamentally have no idea that strings are, at a very deep level, difficult, even if you can't quite see that in your perl script. If you want to teach somebody something well, you have to start at the very lowest level. It's like Karate Kid [w2]. Wax On, Wax Off. Wax On, Wax Off. Do that for three weeks. Then Knocking The Other Kid's Head off is easy.



[w1] Translator's note: should "Shlemiel" be rendered as 薛乐蜜?

[w2] NHK: 1984 ... an American film starring Ralph Macchio ... The Karate Kid