Web Application developer using various technologies, pretty focused on scalable architecture design, database performance, caching systems. Long time Linux user with excellent understanding of the entire open source web development stack, especially Python & Django. Have also written several packages with Go.
Specialties: python, golang, javascript, distributed systems, networking, unix
At HiiDef, I worked mainly on the Flavors.me project, covering nearly all aspects including database performance, backend speed enhancements, the spidering infrastructure, deployment, backend tooling, i18n support, webfont system, front-end javascript performance.
Chief programmer and architect of high school sports platform.
Wireless Network Research related software development, visualization development, distributed systems, web applications, government contracts.
Since Go's 1.0 release over a year ago, there's been a struggle with go get and its lack of versioning. I've been thinking about this problem, and though I'm not alone, I wanted to describe the status quo, the problems people have with it, and issues with some commonly proposed solutions.
First, it's worth explaining upfront that Go's import paths correspond to directory paths somewhere on $GOPATH/src/. Provided this relationship exists, go build will be able to find those imports and build your program. If that path is also a URL to a Go package, go get will be able to install it using that same path.
This duality of import path and get path is great for open source, and because of it Go has quickly built an impressive amount of third party libraries installable and importable in this way. This superficially looks a lot like package frameworks that many developers are used to, like gem, npm, cpan, or pypi, but without a central package registry. The Go authors are clear, however, that go get exists as a convenience for a common workflow; not to solve the many problems with dependency management.
Of course, it works well enough and is in use widely enough that it is overwhelmingly the main way to install libraries and packages. Given that reality, the obvious fly in the ointment is that if upstream dependencies change, your code might no longer build. The centralized package repositories provide frozen versions of packages so you can always specify particular versions or sets of versions required to build your program, but Go's ad hoc method requires that packages maintain backwards compatibility (or downstream maintainers remain vigilant) for things to remain go getable.
If you are working on a standalone product, this is easily overcome by freezing the versions you need in your $GOPATH, maintaining external dependencies in separate repos, or even importing them into your product's repo. This would involve some manual management (perhaps aided by SCM features like git subtrees or submodules), but it would work fine.
The problem is worse when you're working on a library or a piece of reusable code that you want to be able to "distribute" via go get. Ensuring a particular version of a dependency gets installed with the current system requires you give it its own go getable url. This is how the problem with package versions has been approached by some members of the Go community already, with the MongoDB driver mgo having two import paths: labix.org/v1/mgo and labix.org/v2/mgo. Of course, many hosted code repositories do not allow this type of URL control, and splitting projects up among multiple versioned repositories (ie sqlx, sqlx2) is not always appropriate.
It's worth noting at this point that much of the above is perceived as positive by a significant number of people in the Go community. The friction involved with maintaining a mature, versioned library provides a threshold of effort which can be taken as a positive indicator of quality. Many find the idea that software dependencies are resolvable at build time by resources on the web to be undesirable in the first place. Pulling all dependencies into one tree and managing them manually is either common or expected at Go's birthplace, Google.
By far the most popular suggestion people have to address this issue is to add something to go get paths which the go tool will interpret as a tag or a branch. Usually, the path looks something like github.com/jmoiron/sqlx:2.0, with the version number coming at the end. Unfortunately, this would either violate Go's package and binary naming conventions or the duality of import and get path, which is necessary for an install/get command to recursively resolve dependencies, neither of which are acceptable. Another possible solution in this vein is to allow urls like github.com/jmoiron/@tag:2.0/sqlx, which avoids the problems mentioned above but means you lose the common relationship that the go get path is also reachable by a browser.
There's a case to be made that the problem is simply intractable. If you're relying on a URL to exist that you do not control and is not guaranteed to be preserved by a trusted third party, you're opening yourself up to build failures. You might have a reasonable assurance that a tag will not change or a branch can stay backwards compatible, but you have no such assurance that these repositories won't simply be deleted. In the Google IO 2013 fireside chat, bradfitz remarked "Your deploy to production script shouldn't involve fetching some random dude's stuff on github."
Despite such arguments, I believe the authors of Go might underestimate the value of go get in the growth in use of the language. If you're deploying to production, of course you aren't using go get, but if you are building a library to be distributed, it's best to be able to build off of the increasingly rich ecosystem of go packages in a reliable way.
There's no convincing argument that a similar system to go get could not be developed that fixes some of the problems with the current system, or that the current system could not be amended with additional features that would make tracking and upgrading dependencies better while still playing nicely with the basic workspace conventions and build tools. At this point, the choices are manual management or the shotgun go get -u approach, and if the versioning issue is going to persist, then surely more can be done to help downstream developers manage these issues.
A few weeks ago, Alex Gaynor gave a talk at Waza entitled "Why Python, Ruby, and Javascript are slow". A video is now available, and it is a good watch if you use any of these languages. Even if you don't, or have just viewed the slides, it is still thought provoking.
He rightly cites (unsurprising, given his pypy experience) the old rags of dynamic typing or monkey patching being slow as being the modern version of the 1990s "VMs are slow" mantra: oft repeated, but essentially untrue. Instead, he identifies hash/map addiction and over-allocation as the major sources of slowness in scripting languages.
Hash addiction is a term I've re-coined here to describe the prevalence and importance of hashing and hash tables in scripting languages and the predilection of programmers who mainly write in these languages to reach for the hash table wherever named lookups are desired.
Some hash addiction is cultural. Programmers of elegant scripting languages desire similarly elegant looking code with flexible data structures even where no flexibility is in play. Some who know better will understand that technical limitations make the difference between hashes and other types of named lookups essentially nil for many popular language runtimes.
These programmers know that the other side to hash addiction is technical and implicit. In CPython in particular, nearly all namespace lookups are done by hashing. For object namespaces, this hash is even easily accessible via obj.__dict__. It's a well known optimization idiom for CPython code to reduce the number of namespace traversals by re-defining out of scope variables or deeply nested names in your critical sections to avoid excess lookups; partly due to their hashy nature, and partly due to their repetition per loop. Not to be outdone, in JavaScript, the object and hash table are even more intimately related.
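A minimal sketch of that idiom, in Python 3 syntax (the Point class and both function names are made up for illustration): the second function hoists the attribute lookups out of the loop so each iteration works with local names only. How much this helps depends on the interpreter version, so treat it as an illustration rather than a benchmark.

```python
import math


class Point:
    def __init__(self, x, y):
        self.x = x
        self.y = y


def norms_plain(points):
    out = []
    for p in points:
        # every iteration looks up math.sqrt and out.append again
        out.append(math.sqrt(p.x * p.x + p.y * p.y))
    return out


def norms_hoisted(points):
    sqrt = math.sqrt      # module attribute looked up once
    out = []
    append = out.append   # bound method looked up once
    for p in points:
        append(sqrt(p.x * p.x + p.y * p.y))
    return out


# The instance namespace is an ordinary dict:
print(Point(3, 4).__dict__)
```

Both functions return the same list; only the number of namespace lookups per iteration differs.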
This was, ca. 2001, a difficult thing for Java exiles to wrap their heads around, because statically typed languages handle namespace lookups at compile time. Any given reference to a variable, whether it is Foo or com.java.sys.std.coolguys.Foo, incurs the same runtime costs. Go does this as well, which largely removes implicit hash addiction. Go does however include modern hash map semantics with its map type; more on these below.
The case of allocation is more interesting, because it's the primary source of slowness in many languages that have garbage collection. Garbage collection means cleaner APIs which can intelligently allocate their own buffers and can therefore relieve the user of what can be a tiresome or error-prone burden. Over-copying or over-allocation also has both cultural and technical causes.
Alex talks about how python programmers are addicted to string methods like str.split or str.strip despite the fact they must allocate new strings whenever they are used. Anytime you've ignored the results of map (yes, even pool.map) you've in some way succumbed to over-copying. Think about range vs xrange or lists versus iterators in general. But languages like Python also lack APIs to explicitly allocate buffers and lists for reuse and to avoid unnecessary and costly runtime resizing (allocation!), which means that none of the tools built on top of these languages can make use of these time honored techniques either.
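The lists-versus-iterators point can be sketched in a few lines of Python 3 (where range is already lazy, like Python 2's xrange). The function names are hypothetical; the eager version builds a whole intermediate list only to throw it away, while the lazy one hands sum one value at a time.

```python
def total_eager(n):
    # allocates a list of n integers that is discarded right after summing
    squares = [i * i for i in range(n)]
    return sum(squares)


def total_lazy(n):
    # generator expression: no intermediate list is ever built
    return sum(i * i for i in range(n))
```

The results are identical; the difference is only in how much memory is allocated along the way.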
Allocation is a source of slowness in Go as well. In perhaps the seminal work on the subject of performance in Go, Russ Cox profiles a naively written Go program which commits both cardinal performance sins and, through excising hashes in favor of structs and lists and reducing allocations through classic techniques like caching/memoization, he manages to speed up a program by 15x while reducing its memory usage by nearly 2.5x. But, while Go has implicit allocation for single value variables, it has widely used idiomatic ways to pre-allocate slices and pointers to assist in reuse and to reduce copying.
This all might seem rather unremarkable, but there's a philosophical point to be made in all of this: there are cultural and technical sources of performance issues, and they are related. In his talk, Alex criticizes the naïveté of both tools and programmers. He bemoans the lack of sophistication in tools that make implicit hash addiction and allocation problems unavoidable, but he also complains about the commonality of poorly performing idioms in scripting language code. The tools, or in some cases lack of tools, encourage these poorly performing idioms in this code.
But with Go, you have allocation APIs, you have compile time name lookups, you have ways to avoid copying ([]byte vs string, pointers), and furthermore the best ways to take advantage of these techniques are generally the idiomatic ways of solving problems in the language. You have a map type which is, compared to its scripting cousins, inflexible: the key and value types are fixed at compile time. You have duck typing in the form of interface{} which works incredibly well when you absolutely must have it, but is otherwise cumbersome, requiring costly (in terms of line count and time) type assertions to functionally employ. These things combined mean that most Go programmers would rather reach for a struct when some kind of named collection is required. The struct they build has type safety, implicit and statically sized allocation known at compile time, and essentially free namespace lookups. You have a language where the cultural and technical performance deficits are minimized in a way that's positively reinforced by the semantics (and, sometimes, lack thereof) of the language.
I've written a fair amount of Go code, but recently I found the opportunity to spend about a month writing nothing but. Having got back to Python the last week, I've had this visceral feeling of being downright wasteful with time and memory. Some of that waste is cultural and avoidable; it's something I catch myself doing in a cute, elegant, slower way. A lot of it is technical, inescapable, and maddening.
Yesterday, Travis Reeder published an article about iron.io's migration from a ruby backend to a golang one titled "How we went from 30 servers to 2". People have commented that these savings could have been gained from writing critical sections in C or by going over the original code with an eye for performance. Putting the obvious parallelization benefits that Go has over most other languages aside, the point they're missing here is that you more or less achieve these results by writing normal Go code. You needn't inline C, perform semantically odd inline tricks, or open yourself up to the litany of programming errors that lie in wait for you in most languages that lack garbage collection. What's going on in your code is just right there in your code, unmasked by thick layers of powerful but obfuscating runtime complexity.
Control sequences are somehow, in 2013, something that don't really quite work properly yet. C-Left and C-Right (control-left-arrow & control-right-arrow) are common idioms in OSX and Windows for skipping back/forward one word. In control-sequence land, C-Left maps to ^[[5D and C-Right to ^[[5C, which is generally programmed to Do The Right Thing.
However, on OSX, where you have an embarrassment of modifier keys and a bucking of traditional norms, the option key is used for this behavior instead. Prior to OSX 10.7, these sent ^[[5D and ^[[5C to the terminal, which, if you were using bash, unhelpfully printed a D or a C. After 10.7, these were changed to the sequences ^[b and ^[f, which are emacs control sequences which bash understands as meaning backward-word and forward-word.
Unfortunately, Vim doesn't quite see things the same way. Although some emacs-mode editor escapes work on most Vim configurations in insert mode (like C-w for delete-back-word), b and f don't. For me, at least, they also defied remapping.
So, how can you get C-Left and C-Right to behave consistently in your OSX terminal, and in Vim, and in screen?
The best I could come up with was to first change them both to their traditional escapes; \033[5D and \033[5C (where \033 is the ESC key) for option-cursor-left and option-cursor-right in the OSX terminal keyboard escapes preferences.
From here, you can create the keymappings for those escapes necessary to work as expected in bash and Vim. For bash, I added these bindings to my ~/.bashrc:
bind '"\e[5C": forward-word'
bind '"\e[5D": backward-word'
# a commenter on the internet recommended these additional escapes
bind '"\e[1;5C": forward-word'
bind '"\e[1;5D": backward-word'
Finally, I mapped these escape sequences in Vim to the <C-Left> and <C-Right> functions, which work as expected, in my ~/.vimrc
map <ESC>[5D <C-Left>
map <ESC>[5C <C-Right>
map! <ESC>[5D <C-Left>
map! <ESC>[5C <C-Right>
This makes option-left and option-right work across bash and vim, both in and outside of screen.
A quick note about these mappings, since it generally goes completely undiscussed, and people just paste this stuff into their vimrc without understanding it: key mapping in Vim is mode-specific; that is, you can map keys to do different things in different modes. map uses the mapmode nvo, which stands for normal, visual, and operator-pending modes, and map! uses the mapmode ic which stands for insert and command modes. For some explanations on what each of these modes corresponds to, check out :help :map-modes.
One last caveat is that, for me, Vim treats <C-Left> and <C-Right> differently in normal mode compared to insert mode. In normal mode, they correspond to the B and W movement keys, respectively, whereas in insert mode they seem to correspond to b and w.
Go's version 1 will be a year old in about 2 months, and while I've been writing various libraries and applications in it since its release, I'm still coming across subtle false equivalences that I've made coming over from Python.
Though there can be some confusion in the terminology, everything in Python is a reference. These references are always passed around by value, but in essence, all references to a thing have the same access to that thing, and all assignments change the reference of a particular name:
>>> def foo(l):
...     print id(l)
...     l = "something else"
...     print id(l)
...
>>> x = (1, 2, 3)
>>> foo(x)
140725676402880
140725676539440
>>> x
(1, 2, 3)
>>> id(x)
140725676402880
In the code above, x doesn't change to "something else" because the code in foo is merely setting the variable l to refer to some new thing. It doesn't affect the binding of the variable x. When foo receives its parameter, l and x clearly refer to the same object, 140725676402880, but "changing" x would require mutating the data at that location, and cannot be done by reassigning l to some new reference instead.
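The same distinction with a mutable object, as a small Python 3 sketch (rebind and mutate are hypothetical names): rebinding the parameter leaves the caller's list alone, while mutating through the parameter is visible to the caller because both names refer to one object.

```python
def rebind(items):
    # only the local name changes; the caller's object is untouched
    items = ["something else"]


def mutate(items):
    # changes the one object that both names refer to
    items.append("something else")


x = [1, 2, 3]
identity_before = id(x)

rebind(x)
after_rebind = list(x)    # snapshot: still [1, 2, 3]

mutate(x)
after_mutate = list(x)    # snapshot: [1, 2, 3, 'something else']
```

Throughout, id(x) stays the same: neither call changes which object x is bound to.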
Golang has pointers and values, which means you can write functions which change outer variables by explicitly passing the address of those variables to a function that takes a pointer to its type. Still, the assignment semantics are the same for each: a := b always creates a new copy of b called a. When b is a pointer, a gets a copy of the same location b points to, and can be re-pointed elsewhere independently of b; when b is a value, a is a new copy of that value occupying different memory. However, Go goes to some lengths to make pointers and values behave similarly in certain situations:
package main

import "fmt"

type T struct{ Name string }

// String has a value receiver, so it can be called on both T and *T.
func (t T) String() string {
    return t.Name
}

func main() {
    vals := []T{T{"John"}, T{"Jane"}}
    ptrs := []*T{&T{"John"}, &T{"Jane"}}
    fmt.Println(vals)
    fmt.Println(ptrs)
    fmt.Println(vals[0].String(), ptrs[0].String())
}
This will print [John Jane] twice, and John John, because Go knows it can call T.String() for a *T. If String() were instead defined as func (t *T) String..., Go would still call that method on addressable T's, and have a build error trying to call it on non-addressable literal T's (it will of course take literal *T's, but you'll have to use parens, eg. (&T{"John"}).String()).
When you set a variable t := vals[0], it becomes a copy of that first val; when you set t := ptrs[0], it becomes a copy of the pointer. This is still obvious, but has subtle repercussions where you might not think of things as assignment:
// let's create a list of pointers to each T in vals
ptrcopy := []*T{}
for _, t := range vals {
    ptrcopy = append(ptrcopy, &t)
}
fmt.Println(ptrcopy)
The code above does not work, and will print out [Jane Jane]. The loop variable t is a single variable throughout the loop, and as you loop over vals, it will get a copy of the next T, but the address (&t) stays the same. The equivalent code on the ptrs slice will work, because t will be a copy of the next address to a T, and the addresses will change as you loop over them and append them to ptrcopy. In order to make this work on vals, we'd have to do this:
for i := range vals {
    ptrcopy = append(ptrcopy, &vals[i])
}
When you stop to think what golang's for loop is actually doing, this behavior isn't all that surprising, but since I was working under the false notion that for _, x := range xs was basically the same as Python's for x in xs, it bit me in a very confusing way.
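For contrast, here is a minimal Python 3 sketch of the behaviour I had wrongly assumed carried over (the T class mirrors the Go struct above and is otherwise hypothetical). In Python the loop name is rebound to each existing element in turn, so collecting it collects references to the original objects rather than to one reused slot.

```python
class T:
    def __init__(self, name):
        self.name = name


vals = [T("John"), T("Jane")]

refs = []
for t in vals:
    # t is bound to the element itself, not to a copy of it
    refs.append(t)

names = [r.name for r in refs]
```

Here refs[0] is vals[0], so mutating one is visible through the other, which is exactly what the Go value loop does not give you.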
Go's goroutines make it easy to make embarrassingly parallel programs, but in many "real world" cases resources can be limited and attempting to do everything at once can exhaust your access to them.
In these cases, you need to limit the concurrency of your program to fall in line with the acceptable or optimum use of those resources. In many languages which use threads (or greenlets), pools are used to limit concurrency. They can be as easy to use as this example in python:
from gevent.pool import Pool
from requests import get
concurrency = 5
urls = ["url1", "url2", "..."]
Pool(concurrency).map(get, urls)
Since Go already has built-in concurrency primitives in the way of goroutines and channels, let's look at an example which is an idiom borrowed from the net package:
concurrency := 5
// a buffered channel used as a counting semaphore
sem := make(chan bool, concurrency)
urls := []string{"url1", "url2"}
for _, url := range urls {
    sem <- true // blocks while all slots are taken
    go func(url string) {
        defer func() { <-sem }() // free the slot on exit
        // get the url
    }(url)
}
// wait for the stragglers by filling the semaphore to capacity
for i := 0; i < cap(sem); i++ {
    sem <- true
}
First, a channel is created called sem (as it will act as a semaphore) with the level of concurrency desired. As we loop over the urls, we attempt to put a bool onto the channel. If it isn't full, we fire off the goroutine on the URL, which defers a read from the semaphore which frees its slot.
After the last goroutine is fired, up to concurrency goroutines may still be running. In order to make sure we wait for all of them to finish, we attempt to fill the semaphore back up to its capacity. Once that succeeds, we know that the last goroutine has read from the semaphore, as we've done len(urls) + cap(sem) writes and len(urls) reads off the channel.
This is of course more verbose than the pool example above, but it's conceptually very simple, and there is opportunity for semaphore write/reads to be delayed or surround multiple throughput-controlled sections of the code in a flexible yet readable manner.
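To compare with the pool example, the same semaphore idea can be sketched with nothing but Python's standard library. This is an illustration under assumptions, not a replacement for gevent: run_limited and its peak counter are hypothetical names, and the tasks are plain callables standing in for the URL fetches.

```python
import threading


def run_limited(tasks, concurrency):
    """Run each callable in its own thread, at most `concurrency` at a time."""
    sem = threading.BoundedSemaphore(concurrency)
    lock = threading.Lock()
    results = []
    threads = []
    running = 0
    peak = 0  # highest number of workers observed running at once

    def worker(task):
        nonlocal running, peak
        with lock:
            running += 1
            peak = max(peak, running)
        try:
            value = task()
            with lock:
                results.append(value)
        finally:
            with lock:
                running -= 1
            sem.release()  # free the slot, like the deferred <-sem

    for task in tasks:
        sem.acquire()  # blocks while all slots are taken, like sem <- true
        thread = threading.Thread(target=worker, args=(task,))
        threads.append(thread)
        thread.start()

    for thread in threads:
        thread.join()  # plays the role of refilling the channel
    return results, peak
```

The acquire before each start is the throttle; join does the waiting that the Go version gets by filling the channel back up to capacity.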
A template is a transform on a context to produce a presentation of that context.
In the beginning, there was printf. Printf takes format strings (templates) and a list of arguments (context), and applies the format to those arguments to produce a presentation:
>>> for item, price in {"Rice": 3.50, "Beans": 5.25, "Onions": 0.29}.items():
...     print "%-10s %0.2f" % (item, price)
...
Onions     0.29
Rice       3.50
Beans      5.25
Printf lacks logic. If you want to make a shopping list only show items on sale, you must determine whether an item is on sale in the code surrounding your print statements. This is fine, because printf is always used in code.
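A small sketch of that division of labour, in Python 3 syntax (the on_sale flag and the sale_lines function are made up for the example): the format string only formats, and the decision about what to show lives in the surrounding code.

```python
ITEMS = [
    ("Rice", 3.50, False),   # (name, price, on_sale)
    ("Beans", 5.25, True),
    ("Onions", 0.29, True),
]


def sale_lines(items):
    # the "logic" is this filter; the template is just the format string
    return ["%-10s %0.2f" % (name, price)
            for name, price, on_sale in items
            if on_sale]


for line in sale_lines(ITEMS):
    print(line)
```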
On the web, we need to produce documents. With code and printf, this is possible. In the bad old days, when we ignored the wisdom of the ancients, we wrote PHP scripts (templates) which operated on HTTP requests and environment variables (context) to produce a webpage. PHP as a set of instructions for context transform is a powerful (if flawed), fully featured, Turing complete language, which allows for any logic to be embedded in the template. Early tutorials often showed database connections being made and more data being pulled into the context (based on parameters from the request).
This ended up being a bad idea for a lot of reasons:
Common system design started to change, and a wall was erected between the logic in the controller producing the context and the logic in the template which produced the output: controller logic should be limited to producing presentation-neutral contexts, and template logic should be limited to presenting that context. This is such a consensus that modern PHP applications are structured completely differently from my caricature of the past and themselves use template languages which are simpler and more secure.
To borrow (again) from Alex Martelli, if making less logic possible in the template was such an improvement, it's a natural fallacy to think that eliminating logic altogether would make everything just perfect. Since this is an implicit claim fundamental to the philosophy of logic-less systems, it's worth examining a bit more closely. First, let's establish what that actually means. Mustache's documentation loosely defines what is meant by logic-less with these words:
We call it "logic-less" because there are no if statements, else clauses, or for loops. Instead there are only tags.
Let's investigate these claims and their impact on Mustache.
Important in Mustache is the concept of "sections", which narrow down the wider context to deal with whatever was mentioned in the opening tag. Sections behave differently depending on what the tag was: if it was an object, the section now has that object's namespace as its context. If it was a list, the section repeats per list item, each item's namespace being the context for the section on each iteration, and if it is a function then that function is called with the section as its input.
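As a rough mental model only, and not how any Mustache implementation is actually written, the section rules just described can be mimicked in a few lines of Python. Everything here is an assumption made for illustration: render_section is a hypothetical name, the section body is a str.format template rather than real Mustache syntax, and nested lookups are ignored.

```python
def render_section(value, body):
    """Toy model of a section: `value` is what the opening tag resolved to."""
    if callable(value):
        # a function receives the section text as its input
        return value(body)
    if not value:
        # missing key (None), False, or an empty list: render nothing
        return ""
    if isinstance(value, list):
        # repeat the body once per item, each item being the context
        return "".join(body.format(**item) for item in value)
    if isinstance(value, dict):
        # an object: its namespace becomes the context
        return body.format(**value)
    # any other truthy value: render the body once, unchanged
    return body
```

For example, render_section([{"name": "a"}, {"name": "b"}], "<li>{name}</li>") repeats the body per item, while an empty list renders nothing at all.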
Unsurprisingly, sections will not render if the key does not exist or the list is empty. To be able to do something in this case, there are inverted sections, which will render if the key was falsy. Though the semantics of sections are limited, they actually do achieve most of the required semantics of if statements, else clauses, and for loops which Mustache otherwise lacks. So, is the logic-less boast empty sleight of hand?
Not exactly. Inverted sections are only if-else replacements if the test condition for your else was essentially boolean (empty or not empty). If you are looping over your list of users and want to show VIPs in a different color, you will need to express that state ahead of time in some boolean way. So far though, this doesn't violate the wall of separation; such an is_vip attribute on a user is a valid presentation neutral bit of data.
The problem arises when you have presentations which can differ based on the structure of the data in the context or even the context itself. If you want to display a list of 20 photos 4 per row, you're forced to violate the wall; either you pass in a list of photos as a matrix, embedding details of the presentation in the structure of the context, or you go through items in the context adding superficially presentation neutral data whose only purpose is to overcome these limitations. If, instead of VIP status, you wanted to alter user color based on their number of twitter followers, you'd have to embed those design considerations in the context, and change them there in the future when the design requirements change.
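The first workaround, passing the photos in as a matrix, might look like this in the controller (a Python sketch; rows and the sample data are hypothetical). Note that the 4 is purely a layout decision, yet it has to live in the code that builds the context.

```python
def rows(items, per_row):
    # chunk a flat list into rows of at most `per_row` items
    return [items[i:i + per_row] for i in range(0, len(items), per_row)]


photos = ["photo%02d.jpg" % n for n in range(20)]

# the context now encodes a presentation detail: four photos per row
context = {"photo_rows": rows(photos, 4)}
```

Changing the design to five per row means changing the controller, not the template.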
The original Django Template system was a pioneer of limited logic templating, but despite its effort to provide enough in the way of logic, it ran aground in areas. As an example, it eschewed traditional if/else logic for a bevy of if tags like if, ifequal, and ifnotequal, which allowed the template to have a limited but mostly functional conditional system. It was argued that this limitation was simpler for designers, and the required if/else semantics were still possible by nesting these tags. Eventually, the unwieldy verbosity of endlessly nesting if sections and the limitations led to the development of much more standard if/else tags with programming language style boolean operators.
For a little one man operation like this, I can just about grin and bear intrusions of display logic into the controller, but what I can't bear is the inability to actually achieve display logic in the template. After quite a bit of use and reflection, I've decided that the skin-deep logic-less claim of Mustache serves mostly to frustrate rather than to guide towards good practice. Since it doesn't actually limit logic itself, but rather the ergonomics in the application of that logic, and, often, its locality, the claimed benefits must be called into question.
There was a lament in #go-nuts today that Go's interfaces cannot accept any implementation. It wasn't the first such complaint, and this isn't the first exploration of why it might not be a great idea.
Regardless, some interesting golang-ish code was provided which demonstrated the concept and showed how a re-imagining of golang interfaces as traits could allow code reuse between different structs by simply allowing interfaces to implement functions themselves:
type Moveable trait {
    Pos() int
    SetPos(int)
    Move(m Moveable, v int) {
        m.SetPos(m.Pos() + v)
    }
}

type Entity struct {
    pos int
}

func (e *Entity) Pos() int {
    return e.pos
}

func (e *Entity) SetPos(v int) {
    e.pos = v
}

e := new(Entity)
e.Move(4)
There are a few things to note here. A decision has to be made if Entity implements its own Move() function. The calling conventions for the shared code looks very natural and still allows for mutability via other interfaces fulfilled by the implementer. Worryingly, embedding the trait would work quite differently from embedding a struct or an interface: embedding an interface does nothing, and when embedding structs, the outer struct inherits the implementation of the inner, but the receiver of calls on that interface is the inner struct (which has its own attributes), not the outer.
Here is how I've achieved this type of code sharing in go up to this point:
type Moveable interface {
    Pos() int
    SetPos(int)
}

func Move(m Moveable, v int) {
    m.SetPos(m.Pos() + v)
}

type Entity struct {
    pos int
}

func (e *Entity) Pos() int {
    return e.pos
}

func (e *Entity) SetPos(v int) {
    e.pos = v
}

e := new(Entity)
Move(e, 4)
Aside from having the virtue of currently working, this allows us to keep the very simple and straightforward interface, struct, and method concepts, dodge the difficult conceptual/philosophical questions above, and still duck type Move to work on anything implementing Moveable. The resultant library interface is a bit less attractive and, for some things, it could make certain types of method chaining difficult or impossible. In practice, if this were a library, the package name can provide some of that missing context.
In the end, go is a procedural language, and while it has some object oriented concepts like encapsulation and interface contracts, it's a bad idea to try to bend that into something it's not:
There's a class of templates like CTemplates, Mustache, and lately Handlebars, which are known as "logic-less" templates. My experience lies mostly with Mustache, but in general these templates act on an immutable context representable by simple data protocols like protobufs or json.
The implementations are generally incredibly fast and straightforward, as they limit the user to existence/truthiness checks and looping over object fields and lists. This makes them ideal for client-side templating, which is why a lot of these target JavaScript first, but they've started to find inroads in server-side templating as well, as they keep you honest and they allow you to run the same code on the same contexts on the server or in the browser.
Because true logic is impossible, they prevent you from embedding business logic in the templates. Unfortunately, they also prevent you from embedding presentation logic in your templates. I like these templating languages, but I feel like there's something missing. Today, in order to implement a simple paneling system, I had to write this code:
type M map[string]interface{}

panels := []M{}
for i, p := range Panels {
    side := "left"
    if i%2 != 0 {
        side = "right"
    }
    panels = append(panels, M{
        "side":    side,
        "right":   side == "right",
        "content": p.Render(),
    })
}
And it felt gross. What this does is take a list of panels and embed their content in a list which encodes their position and a boolean variable which lets me know if it is on the right or not (since you cannot check that "side" == "right").
It struck me that this is analogous to the FizzBuzz problem, and while I have generally liked working with logic-less templates, I feel that there must be some middle ground where this kind of pretty prevalent problem should still be easily solvable without resorting to passing closures or determining display logic in the view:
Given a list of panels, place them in divs with an alternating class. Every two panels, place a clearing div. Do the same, but with 3 classes and 3 divs per row.
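For reference, the logic itself is tiny in ordinary code, which is what makes its absence from the template feel awkward. A sketch in Python, where render_panels, the class names, and the "clear" class are all made up for illustration:

```python
def render_panels(panels, classes):
    """Cycle through `classes`; emit a clearing div after each full row."""
    per_row = len(classes)
    out = []
    for i, content in enumerate(panels):
        out.append('<div class="%s">%s</div>' % (classes[i % per_row], content))
        if (i + 1) % per_row == 0:
            out.append('<div class="clear"></div>')
    return "\n".join(out)


two_up = render_panels(["a", "b", "c"], ["left", "right"])
three_up = render_panels(list("abcdef"), ["left", "middle", "right"])
```

Going from two panels per row to three is a change to one argument, with no new flags threaded through the context.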
It’s kinda popular to bash regex. There's that JWZ Classic:
Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.
and there's plenty of modern equivalents. But I think that occasionally, just occasionally, it's the best tool for the job. If you're trying to dole out sage advice to beginners, I think it's better to give a balanced view, rather than instilling within them a "regex is evil" dogma that they will take ages to outgrow.
Let’s consider a function which extracts numbers from a string, which can have other characters in it; say for detecting a phone number. We can make two strings to see how our functions can perform in short-string and long-string conditions, just to see how regex and non-regex approaches work in different scenarios:
s = "My phone number is (123)456-7890."
ls = "My phone number is" * 5000 + "(123)456-7890."
len(s), len(ls)
# (33, 90014)
Let’s make 4 different functions, to see how different things might cause overhead here. The first uses isdigit(), the second compares each character against '0' and '9' and relies on the way the string is encoded, the third uses a pre-compiled regex, and the fourth an on-the-fly regex.
import re

def pextract(s):
    return ''.join(c for c in s if c.isdigit())

def pextract_n(s):
    return ''.join(c for c in s if c >= '0' and c <= '9')

pat = re.compile(r"(\d+)")

def pextract_re(s):
    return ''.join(pat.findall(s))

def pextract_rec(s):
    return ''.join(re.findall(r"(\d+)", s))
Let’s see what timeit shows us:
timeit pextract(s)
100000 loops, best of 3: 4.78 us per loop
timeit pextract(ls)
100 loops, best of 3: 8.01 ms per loop
timeit pextract_n(s)
100000 loops, best of 3: 4.11 us per loop
timeit pextract_n(ls)
100 loops, best of 3: 7.19 ms per loop
timeit pextract_re(s)
100000 loops, best of 3: 2.18 us per loop
timeit pextract_re(ls)
100 loops, best of 3: 2.98 ms per loop
timeit pextract_rec(s)
100000 loops, best of 3: 3.19 us per loop
timeit pextract_rec(ls)
100 loops, best of 3: 2.89 ms per loop
It shouldn’t be surprising to people who have lots of experience with python that the regex module is faster: looping over things in python is slow, and since the regex code does most of its looping inside the re module rather than in the interpreter, it runs noticeably faster. It might be surprising that even the on-the-fly version outperforms doing this task manually. The re module caches compiled patterns, so re.findall pays for a cache lookup on each call rather than a full compilation, which is consistent with the gap on the short string (and it was surprising even to me that the on-the-fly version fared slightly better on the longer string; the difference is small enough that it is probably measurement noise).
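If you want to rerun the comparison outside of an interactive session, something like the following sketch should do. It repeats the four functions so it stands alone; the `bench` helper and the loop counts are my own choices for illustration, and the absolute numbers will differ by machine and Python version.

```python
import re
import timeit

s = "My phone number is (123)456-7890."
ls = "My phone number is" * 5000 + "(123)456-7890."

def pextract(s):
    return ''.join(c for c in s if c.isdigit())

def pextract_n(s):
    return ''.join(c for c in s if '0' <= c <= '9')

pat = re.compile(r"(\d+)")

def pextract_re(s):
    return ''.join(pat.findall(s))

def pextract_rec(s):
    return ''.join(re.findall(r"(\d+)", s))

def bench(func, text, number):
    """Best-of-3 average time per call, in microseconds (hypothetical helper)."""
    best = min(timeit.repeat(lambda: func(text), number=number, repeat=3))
    return best / number * 1e6

if __name__ == "__main__":
    for func in (pextract, pextract_n, pextract_re, pextract_rec):
        print(f"{func.__name__:>13}  short: {bench(func, s, 10000):8.2f} us"
              f"  long: {bench(func, ls, 20):10.2f} us")
```

Taking the minimum of several repeats mirrors the "best of 3" figure that the timeit output above reports.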
If I were writing this code, I’d write the first function. Unless it was absolutely required, I would not resort to the regex versions. However, I don’t think the regex version is particularly ugly, and this is a really straightforward filter.
Imagine a more complex but equally contrived example: say you wanted to get the sum of the possibly-floating-point numbers mentioned in the body of a text. This is a more complex regex, and might appear ugly to many, but in fact a naive but still quite decent version ends up being rather short and concise:
def sum_floats(s):
    return sum(map(float, re.findall(r"(\d+(?:\.\d+)?)", s)))
This code has faults: it won't recognize negative numbers, it will consider any digits in the text to be numbers, it will read "3.4.3" as "3.4" and "3", and more. In some ways, it's the poster child for why regex are evil; if these issues must be fixed, the regex will become monstrously complex. This is the reason you should shy away from regex.
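Those faults are easy to see by looking at what the pattern actually matches. A small sketch, with sample strings made up for illustration and the pattern bound to a name (`NUMBER`) that is not in the original:

```python
import re

# Same pattern as in sum_floats, compiled once so the matches can be inspected.
NUMBER = re.compile(r"(\d+(?:\.\d+)?)")

def sum_floats(s):
    return sum(map(float, NUMBER.findall(s)))

print(NUMBER.findall("it dropped to -5 overnight"))  # ['5']: the sign is lost
print(NUMBER.findall("version 3.4.3"))               # ['3.4', '3']
print(NUMBER.findall("room 101 costs 2.50"))         # ['101', '2.50']: the room number counts too
```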
However, the equivalent function that twiddles with str.split and has to keep track of the run of digits and decimals and such is much longer, less clear, and far slower. The regex version embodies all of the matching logic within the regex language; provided you can read it, you would grok the above expression a lot faster than you would the equivalent manual python code.
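For comparison, here is one way the manual version might look. This is only a sketch: it walks the string by index rather than using str.split, it assumes ASCII digits, and the name `find_numbers_manual` is invented for the example. It aims to match the same substrings as the regex above, faults included.

```python
def find_numbers_manual(s):
    """Collect runs of digits with an optional .digits fraction, scanning by hand."""
    def is_digit(c):
        return '0' <= c <= '9'

    numbers = []
    i, n = 0, len(s)
    while i < n:
        if not is_digit(s[i]):
            i += 1
            continue
        start = i
        # Consume the integer part.
        while i < n and is_digit(s[i]):
            i += 1
        # Consume a fraction only if the dot is followed by at least one digit.
        if i + 1 < n and s[i] == '.' and is_digit(s[i + 1]):
            i += 1
            while i < n and is_digit(s[i]):
                i += 1
        numbers.append(s[start:i])
    return numbers

def sum_floats_manual(s):
    return sum(map(float, find_numbers_manual(s)))
```

Everything the eighteen or so lines of index bookkeeping do here is carried by the single pattern in the regex version.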
For little one-offs, or when quick and dirty really is good enough and any better is YAGNI, regex can be an invaluable tool. If you can't read the regex above comfortably (save perhaps the non-capturing group syntax) and think that it is obscure, that is your weakness, just the same as it would be if you found the code using the genexprs above confusing. You must learn your tools better before you can understand and then, with experience, judge them.
The poignant post Staring Into The Abyss by the Gtk+3 maintainer Benjamin Otte is a pretty stark admission that Gnome3 is rudderless and in danger of failure or irrelevancy. When asked why he wouldn't be attending the Gnome conference Guadec in a previous post, he replied:
Because a mass in St. Peter’s Basilica is the wrong place to speak out as an Atheist.
I wrote a comment about this post on Reddit which I won't reproduce here. After trying out Linux Mint 13's Cinnamon and MATE distributions, I decided to go and look into whether or not I could contribute some alternatives to the Gnome3 defaults which would make life easier for Gnome2 exiles and perhaps get incorporated into distributions like Mint.
The first thing I did was Google for a Gnome3 Control Center API, only to find a mailing list entry saying that this was present in a deprecated fashion in 3.0 and would be removed in 3.2. In a typical display of hubris and arrogance, the maintainer remarked:
The "System Settings" isn't a random dumping ground for preferences. If we (we being the designers, and then the maintainers, in that order) don't think that the setting belongs in the System Settings, then it won't go in there.
Of course, the designers and the maintainers of the Gnome team actually have no say whatsoever in what goes into any panel in Gnome; since Gnome is a dense nest of 130+ packages, most users who use it will get it from a distribution, which will certainly feel free to modify the Control Center if it doesn't meet their needs. This kind of obstinate stance raises the effort bar for integration and leads to a worse desktop experience. It is absolutely nothing new to Gnome developers.
Way back in Gnome 2.6, Gnome developers changed the default behavior of the file browser from a single browsing window to a spatial mode. Reversible in Gconf, a configuration system editable only by hand or via an obtuse Windows-registry-like editor, this decision drew the ire of most Gnome users, and was turned off by default by many popular distributions years before the decision to make it default was finally reversed in 2.29.
The Gnome developers continue to develop their platform for an imagined demographic. Pushback from concerned users seems to strengthen their resolve to remove or alter the features they clamor for. When Adam Dingle sent an email to the list concerned about the removal of compact mode from Nautilus, the only acceptable mode for dealing with files that have realistic names, the Gnome UX guru made a hand-wavy (and false) statement about the physiology of the human hand, implicitly presuming most users clicked a scrollbar to scroll (a scrollbar that has been shrunk dramatically in recent releases, one that has had its arrow controls removed), and dogmatically dismissed any feature which might result in horizontal scroll outright.
In the original thread about the feature removal, the ultimate reason for its removal was revealed: there was broken behavior when placing labels "beside icons" in the compact view. Although the ticket wrongly states that there's little difference between compact and text beside icons, a reply to the original thread wryly states:
Don't worry about it, the text beside icons bug was fixed by removing the offending feature too.
The astonishing part about this is that labels beside icons was considered a more important feature than two views in the file manager.
The problem isn't so much that Gnome isn't interested in making a desktop environment for me. They've been targeting what I consider to be a mostly hypothetical user base for years. The difference is that, as a platform, it was always easy or possible for the packagers, distributions, or even users to restore the behavior we were accustomed to. Spatial Nautilus? Arbitrarily moving window buttons to the left? Neutering configuration applets in the Control Center? These were fixable either through Gconf or by hooking a replacement into an API.
But despite its many packages, the new vision of Gnome seems to be non-cooperative and monolithic. Complex and mature pieces of software are duplicated within the Gnome project for political/release maintenance reasons, and then the superior third party software is locked out of integrating as seamlessly. If your application hasn't been vetted by our designers or our maintainers, well then you can sit on it and spin. Want to add font control to the Control Center? Fork the Control Center. The modern equivalent of the spatial Nautilus decision seems like it would be to remove all modes aside from spatial.
After just a short evening of research, it sadly seems that cooperation with Gnome is impossible. The only palatable option seems to be to remove the offending feature.