Group-by Considered Harmful

Oct 19th, 2012

The main reason big data is so trendy is that it’s closely tied to web scale, and that of course means money. It’s the kind of problem you would love to have. Everybody thinks big data is about being Facebook, or Twitter, or Tumblr.

But big data is not only about moving your humongous petabytes of user pictures from one place to another, big data is also a state of mind, and it can help you solve problems better and faster even if you’re not on the petabyte scale.

And if you have ever seen a report taking ages to render, this post is for you. Because we’re in the Google age, so if you tell me it’s acceptable for you to take fucking seconds to spit that information out, you are delusional. In 2008, Google index size was over a trillion pages, and the typical search returns in less than 0.2 seconds.

And one of the main reasons Google is so fast while engineers still deliver applications with such poor performance is that we’re still modeling data as if disk space were an issue.

Let me explain.

Let’s think about a telephony system that needs to keep track of every call entering a company or made by employees. A typical simplified CDR (call detail record) table will look like this:

+------------+----------+----------+-----------+---------------+-----------+----------+
| Call Start | Call End | Duration | Direction |    Number     | Extension |   User   |
+------------+----------+----------+-----------+---------------+-----------+----------+
| 20:05      | 20:15    |      600 | OUT       | +130512319292 |      1010 | John.Doe |
+------------+----------+----------+-----------+---------------+-----------+----------+

That’s perfectly fine, and if someone wants to know who the hell did that 40 minute call to the hot-line, it’s all in there.

But let’s suppose you’re a contact center organization with 2000 extensions; you could easily be dialing 150K calls an hour. That’s over a billion rows in a year’s operation, and the table is being hit pretty hard. Then someone asks: how much money have we spent on calls to Morocco? Or better yet: I want an alarm to be triggered when a deviation from standard operation occurs, to detect fraud.
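As a rough check of that row count, here is the arithmetic written out. It assumes the dialing rate holds around the clock, every day of the year; the text only gives the hourly figure, so treat the result as an upper-end estimate.

```python
# Back-of-the-envelope row count for the cdr table.
# Assumption: 150K calls/hour, 24 hours a day, 365 days a year.
CALLS_PER_HOUR = 150_000
HOURS_PER_YEAR = 24 * 365

rows_per_year = CALLS_PER_HOUR * HOURS_PER_YEAR
print(f"{rows_per_year:,} rows per year")  # 1,314,000,000 rows per year
```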

So to avoid locking and contention over your call table, you may do something like this:

+------------+-----------------+----------+--------------+-------------+
|   Slice    | Number of calls | Duration |     User     | Destination |
+------------+-----------------+----------+--------------+-------------+
| 2012-10-10 |             121 |    49239 | John.Doe     | Germany     |
| 2012-10-10 |              90 |    28711 | John.Doe     | Australia   |
| 2012-09-02 |              78 |    12111 | Richard.Read | Uruguay     |
+------------+-----------------+----------+--------------+-------------+

So you separate the realtime transactions hitting the cdr from the table used for statistical queries. In data warehousing there’s a name for that, it’s called ETL: the process of offloading your data to a system designed to be queried against. A real pain in the ass.
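A minimal sketch of that offloading step, in plain Python rather than a real ETL tool. The row layout and the `summarize` helper are hypothetical, and the destination is assumed to be already resolved from the dialed number.

```python
from collections import defaultdict

# Hypothetical cdr rows: (date, duration_seconds, user, destination).
cdr_rows = [
    ("2012-10-10", 600, "John.Doe", "Germany"),
    ("2012-10-10", 120, "John.Doe", "Germany"),
    ("2012-10-10", 300, "John.Doe", "Australia"),
    ("2012-09-02", 45, "Richard.Read", "Uruguay"),
]

def summarize(rows):
    """Roll raw call rows up into one summary row per (slice, user, destination)."""
    summary = defaultdict(lambda: {"calls": 0, "duration": 0})
    for day, duration, user, destination in rows:
        bucket = summary[(day, user, destination)]
        bucket["calls"] += 1
        bucket["duration"] += duration
    return dict(summary)

call_summary = summarize(cdr_rows)
for key, totals in sorted(call_summary.items()):
    print(key, totals)
```

Every question the summary can answer still requires a pass over these rows, which is where the grouping cost discussed next comes from.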

Of course you have an advantage, you can answer many questions using the same stored information.

How many calls has Ricky made this month?
How many minutes were spent this week calling Bangladesh, by all users?
What’s the day of the month with the most calls?

And we all know how: we use GROUP BY, the most overlooked clause in performance analysis.

SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
SELECT User, SUM(Duration) as Duration
FROM CallSummary
WHERE Slice = '2012-10-10'
GROUP BY User;
COMMIT;
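To try the query without a database server, here is a sketch using Python's built-in sqlite3 module and the sample rows from the summary table. The column names (`NumberOfCalls`, `Destination`) are assumptions, and the isolation-level statement is left out because SQLite does not support it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE CallSummary ("
    "Slice TEXT, NumberOfCalls INTEGER, Duration INTEGER, User TEXT, Destination TEXT)"
)
conn.executemany(
    "INSERT INTO CallSummary VALUES (?, ?, ?, ?, ?)",
    [
        ("2012-10-10", 121, 49239, "John.Doe", "Germany"),
        ("2012-10-10", 90, 28711, "John.Doe", "Australia"),
        ("2012-09-02", 78, 12111, "Richard.Read", "Uruguay"),
    ],
)

# The engine has to visit every matching row to build each group's sum.
totals = conn.execute(
    "SELECT User, SUM(Duration) AS Duration FROM CallSummary "
    "WHERE Slice = ? GROUP BY User",
    ("2012-10-10",),
).fetchall()
print(totals)  # [('John.Doe', 77950)]
```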

Then you face a problem again, whether to be fast, or to be up-to-date. Because if you update your summary every night you cannot detect a fraud deviation in the last 30 minutes, but if you update your summary with every call you have again the contention issue.

But why is it that you need to have all the information in the same table?

It’s the relational way of storing information, which gives you flexibility when you don’t know what you’re going to do with your data, but can be the cause of performance degradation. If you’re using group-by, the engine is iterating over each value in order to sum it up, there’s no free way out of that. I will show you a way that costs disk space and a loss in query flexibility.

Before continuing with data modeling, let me do a little digression on data visualization.

There’s a paper by Stephen Few, who divides analytical audiences into three different types:

1) Exploratory Analytics

Strong need for data exploration. The analyst approaches data with usually open ended questions “What may be hiding here that’s interesting?”. Tools such as Panopticon Explorer or Spotfire are in this category.

2) Custom Analytics

People analyzing information in routine ways, performing the same analytical task as part of their job. The user only needs to understand the information presented and the business processes.

3) Customizable Analytics

Provides ready-made libraries for developers that need to create an application, such as QlikView.

And for some reason that escapes me, engineers almost always get stuck in a mix of options 1 and 2. We think our users are creative geniuses with a need to derive hidden insights from our data, so we put these wonderful flexible tools in place for them.

Bullshit. People want to get their job done and go home with the kids, and it’s your job to know better.

Most of the time reporting or “analytics” as it’s usually called, is not analytics, because the user is not doing analysis, he’s just looking at data translated into information. And what data you’re putting on the screen, is the result of proper program management and understanding of your user’s needs. Or is it that it took you months to push that product out of the door and you don’t really know which information is relevant to your users?

And if you do, why is it that you’re modeling your data as if you don’t?

Hey, we’ll just let them query the information using all these dimensions, cool, and you can have this pivot table here, you see it …pivoting… give me a fucking break, the last time I started dragging and dropping dimensions in a pivot table I had a burnout. Who are your users?

Digression ended, how do you escape the group-by trap?

Well, you just store it grouped according to your query needs. I’ll show you first the idea, and then a specific Cassandra modeling of it.

Let’s suppose that after talking to our users, we come to the conclusion they need the following questions answered:

How many minutes spent in outgoing calls to different destinations, by day, week and month.
How many calls made by users by day, week and month.

We can store each counter separately, and only lock when updating a specific counter. Our counters will have this format:

OutgoingMinutes:<Destination>:<Period>:<Slice>
CallsMade:<User>:<Period>:<Slice>

Where <Destination> is any destination we’ve called, <User> is any user in the system, <Period> is one of “ByDay”, “ByMonth” and “ByWeek”, and <Slice> is the specific time the value is being counted for.

So an instance of our data may look like this:

OutgoingMinutes:Canada:ByDay:2012-10-09 = 432 ; 432 minutes in calls made to Canada on the day 2012-10-09
OutgoingMinutes:USA:ByMonth:2012-11 = 54021 ; 54021 minutes in calls made to USA in November
CallsMade:John.Doe:ByDay:2012-12-20 = 43 ; John made 43 calls today

If you’re looking at it and you think it looks like a map, it’s pretty much one. The main difference is that each key is isolated from the others, so you don’t need to lock a page and four indexes in order to update a row. Each value gets updated separately.
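Here is a small sketch of how a single call fans out into those counters. A `Counter` stands in for a store with atomic per-key increments; the helper names and the `ByWeek` slice format (ISO year and week) are assumptions, since the text only shows day and month slices.

```python
from collections import Counter
from datetime import date

counters = Counter()  # stand-in for a store with atomic per-key increments

def slices(day):
    """Return the (period, slice) pairs a call on `day` contributes to."""
    iso_year, iso_week, _ = day.isocalendar()
    return [
        ("ByDay", day.isoformat()),
        ("ByWeek", f"{iso_year}-W{iso_week:02d}"),  # assumed week format
        ("ByMonth", day.strftime("%Y-%m")),
    ]

def record_outgoing_call(user, destination, day, minutes):
    """Bump every counter an outgoing call touches; no shared row is updated."""
    for period, slice_ in slices(day):
        counters[f"OutgoingMinutes:{destination}:{period}:{slice_}"] += minutes
        counters[f"CallsMade:{user}:{period}:{slice_}"] += 1

record_outgoing_call("John.Doe", "Canada", date(2012, 10, 9), 10)
record_outgoing_call("John.Doe", "Canada", date(2012, 10, 9), 5)
print(counters["OutgoingMinutes:Canada:ByDay:2012-10-09"])  # 15
print(counters["CallsMade:John.Doe:ByMonth:2012-10"])       # 2
```

Each call costs six increments in this sketch, which is the write amplification paid in exchange for reads that never have to group anything.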

There are two things we need to be sure about:

Counter increment or decrement supporting concurrency
A way to query for date ranges, and it should be fast

So it happens, we can do this with Cassandra counters, and we get the bonus of having our data replicated and distributed, so you can scale out almost linearly, no matter how many users or destinations you have, or how long you decide to keep track of statistics.

The modeling is pretty straightforward, it looks like this:

Statistics = {
   OutgoingMinutes:Canada:ByDay = {
        2012-10-09 : 432
        2012-10-08 : 121
        2012-10-07 : 987
        2012-09-11 : 100
        ...
   },
   OutgoingMinutes:USA:ByMonth = {
        2012-11 : 54021
        2012-10 : 43222
        ...
   },
   CallsMade:John.Doe:ByDay = {
        2012-12-20 : 43
        2012-12-19 : 34
        ...
   }
}

I like modeling Cassandra schemas like JSON, it helps visualize the data layout better. For this particular case the elements are:

Statistics is the Column Family, OutgoingMinutes:Canada:ByDay is the Row Key, 2012-10-09 is the Column name -you’re going to be sorting by this value- and 432 is the value, this is a counter column type that supports distributed updating.
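Because column names within a row are kept sorted, a date-range query is a contiguous slice rather than a scan. The sketch below imitates that with Python's bisect module on one row; it is an in-memory illustration of the layout, not Cassandra client code, and `slice_range` is a hypothetical helper.

```python
from bisect import bisect_left, bisect_right

# One row of the Statistics column family, modeled as parallel sorted lists.
# Column names are ISO dates, so string order matches chronological order.
row_key = "OutgoingMinutes:Canada:ByDay"
columns = ["2012-09-11", "2012-10-07", "2012-10-08", "2012-10-09"]
values = [100, 987, 121, 432]

def slice_range(start, end):
    """Return (column, value) pairs with start <= column <= end."""
    lo = bisect_left(columns, start)
    hi = bisect_right(columns, end)
    return list(zip(columns[lo:hi], values[lo:hi]))

october = slice_range("2012-10-01", "2012-10-31")
print(october)
print(sum(v for _, v in october))  # 1540
```

This is why the slice goes in the column name and the rest of the key goes in the row key: the sort order does the date filtering for you.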

You can create the column family like this:

create column family Statistics with column_type=Standard and default_validation_class=CounterColumnType and key_validation_class=UTF8Type and comparator=UTF8Type;

When you’ve been factoring things out for years, repeating the key in order to have different values is nothing less than a deadly sin.

OutgoingMinutes:Canada:ByDay = 100
OutgoingMinutes:Canada:ByMonth = 1000
OutgoingMinutes:Canada:ByYear = 10000

It’s like it’s screaming for normalization, so many bytes being wasted in key names!

But hey, we’re rendering views like this in our cluster with 100+ GB of data in milliseconds, without doing any kind of ETL.

'i6 analytics visualization'

We’re trading disk space for computation and I/O time in order to improve query performance. I/O time is the usual culprit: having to iterate over a more flexible data structure in order to render a particular view. Sometimes you can even trade computation time for I/O time to improve performance, as in the case of Cassandra data compression.

Wrapping up, this is not a case against relational databases, it’s a case for modeling things carefully. You don’t have to stick to an option because it seems flexible enough to accommodate all needs. You can always implement a message queue on top of MySQL, but it would be foolish not to use RabbitMQ, and you can always implement an inverted index on top of MySQL, but you’d be better off using Lucene.

Data storage doesn’t need to be always relational.

Waterfall Is a Mindset

Sep 2nd, 2012

The Agile camp has become an industry, books, seminars, agile coaches, awesome incubators, there’s a lot of people making a living preaching this approach to building software, and it’s ok. Hey, this guy even said you can build a product without writing up-front what you’re fucking supposed to do, kudos to him.

But there’s one thing missing in the picture, and it’s the fact that everybody seems to think agile is about process, and I dare to say it’s not only about process, it’s also about people. There’s a lot of waterfallists thinking they’re agile just because they’re in a meeting every morning where nobody’s allowed to use a chair.

Bullshit, let me show you.

This is what waterfall looks like by the book:

'Waterfall by the book'

This tries to convey the idea that waterfall is about process, you see, in each box there’s a task being done in each step, cool.

But that’s not how people look at this picture, this is what people see when they get shown these boxes:

'Waterfall by the walking practitioner'

Roles, what’s my part, you know? The worst thing never said about waterfall is that it instills the idea that the development process has an underlying hierarchy of roles. And as happens in any hierarchy, nobody wants to be at the bottom, ain’t it?

Even the picture itself has a reminiscence of the infamous law of the henhouse, I wouldn’t be the one getting all the shit either.

'La ley del gallinero'

As Agile positions itself as more about process and less about people, there’s going to be a lot of fresh engineers who know how to do a sprint but are waterfallists at heart.

Fortunately they’re easy to spot if you come across one of them, they almost always say:

I want to be an architect
I don’t want to be a tester
I don’t want to program forever, it’s just a step in my career.

Let them burn in hell, yet another reason why we don’t have ranks.

'Waterfallists doing a scrum meeting'

The World of Warcraft of the Linkedinz

Sep 1st, 2012

At inConcert we don’t have any ranks, we ain’t the fucking army you know?

We are all R&D staff. Of course we have different skills, some people are better at some things than others, and each person has their own inclination towards something, whether it’s math, machine learning or user interface design.

But there’s no rank.

There are even job descriptions, whether you’re a designer or an engineer. They’re based on the idea proposed by Johanna Rothman, and they state:

What are you supposed to do
What skills you need
What are your deliverables

It serves as a compass when you start to understand what’s important in your position (no sole lickers please)

But there’s no rank.

You can be awesome and delivering solutions across the full stack or you can be struggling with a small subset of the product or technology we happen to be using. (And it will affect how much money you make)

But there’s no rank.

And this is important, because these are times of disease, and we have a disease in town I want to warn you about; it’s called the Badge Hunting disease.

Have you heard about the case of the CEO of his one person company? We used to call that a freelancer! These badge collecting trolls are like the World of Warcraft of the Linkedinz, have you seen them? That guy updating his profile every single week; his code passes a test? Test Passing Manager. Helped a colleague? Team Leader. Drew some mockups? Product Manager. Got some flyers printed and mailed? Event co-organizer. Made a website? Entrepreneur.

Are you trying to impress your mother? These guys are more worried about the looks than about getting something done, and you should avoid them like the plague.

I’m not advocating against having specific roles when a company grows into a stage of needing them, hey, I even think some badges you can earn, but if you deserve it I don’t think you really care about them.

Show me what you’ve built, all the rest is bullshit.

'Badge collected during the battle of the linkedinz in the year 2280'

And that’s why we have no ranks, if you want a badge go play some WoW.

5 Hiring Mistakes You Shall Never Do

Sep 1st, 2012

These are hiring mistakes you can make even knowing you’re making a mistake. The problem is that sometimes you have this deadline, or you’ve been searching for the perfect hire for months, and you lower your standards. Hey, maybe that guy who arrived 20 minutes late to the interview wasn’t that bad after all…

Or wasn’t he?

It’s always short sighted. There’s a saying in Spanish, “pan para hoy y hambre para mañana”, something like “feast today, famine tomorrow”. It will come back to bite you.

No matter what, here are 5 things that you shall never do.

1 You shall not hire in a hurry

Just the fact that you need someone in a hurry is a red flag. Hiring someone in that situation is going from bad to worse. In the short term you may put out the fire, but in the long run you’ve compromised your foundations.

If you really, really, really need a puppet to fill a chair tomorrow, hire a freelancer and dismiss them when the project’s over. Building a team is like building a family, you don’t invite the first monkey coming out the door to be your cousin when you need someone to hug you.

2 You shall not hire someone you don’t respect

It doesn’t mean you can only hire your heroes; respect is about recognition of abilities or achievements. You can respect someone’s smarts, you can respect what someone has built, or you can be in awe of someone’s interpersonal skills.

I’m not talking about B players hiring C players. I’m talking about the case where you undervalued the position you’re hiring for, and started thinking “good enough” was good enough.

It’s not.

Aim for someone who can blow the position out of the water, hiring someone who can barely do the job says you are ok with mediocrity in your team.

If you have some task that good enough will do, outsource it.

3 You shall never hire a prick

There will come the time when you interview this someone who’s a freaking beast, smart, fast, experienced, but he’s a jerk.

No hire.

There’s this company here in Uruguay who used to have two brilliant jerks in their own apartment to keep them from fucking up the rest of the team. Unless this kind of jerk isolation works for you, no hire, you’re building a team you know?

4 You shall never hire by price

Would you buy a plane ticket from an airline who’s not doing their maintenance, the pilot started flying yesterday, and had 10 incidents the last month, just because the tickets are half the price?

Why is it that people put their companies in the hands of incompetents just because they’re cheap? Cheap is always more expensive in the long run. It’s not only about hiring 10x developers: lousy developers create more bugs, write unmaintainable code and will make your customers mad because they can’t handle a phone call like a human being.

Beware, more expensive doesn’t always equal better, there are some clever incompetents who are fond of asking more money when the market’s hot. Never hire by price means salary is not your main deciding factor.

5 You shall never ever hire on behalf of others

If your company has more than 10 employees, you are not hiring by yourself. You think you’re so smart you know what’s good for your company, above and beyond? It doesn’t matter, each person gets hired by her manager and her future team.

This is what you can do:

You set the standard
You set the hiring process
You get veto power at the end
That’s it

Let people make their own fuck ups and take responsibility for them, that’s how you grow better leaders.

On Firing Someone

Aug 31st, 2012

In Uruguay it’s not common for software companies to fire people. On one side, the employment law makes it almost impossible to fire someone, even if they’ve robbed you and it’s on video. On the other, there are so many companies evading taxes that they’re held hostage by their own misdemeanor.

It’s a shame. Firing someone who’s not pulling his own weight is almost as important as hiring the right people; it’s how you build your team, it’s how you choose who you are.

Letting someone go is a traumatic process for everyone involved, and being something uncommon, almost no one has experiences to share.

Since I’ve fired a few people over the years, I thought I should share a few things I’ve learned.

The problem won’t disappear by itself

First things first, a company is not a country club, it’s more like a pro sport team. You play to win, the best players are on the field, if someone doesn’t fit, he goes home.

When someone in the team is not playing his part, everybody knows it, and there’s nothing you can do to hide it. The offender knows it, your team knows it, you know it, and the people in the stands know it.

If you have someone on your team that’s not performing and you’re doing nothing about it, you’re a buffoon, and everybody thinks you don’t have what it takes to run the show.

Who’s guilty? You’re guilty

Common reasons someone doesn’t perform as expected:

Doesn’t have the skills to do the job
Can’t adapt to the rest of the team
Can’t adapt to the dynamic of the company
Uncomfortable with the tasks at hand

Most of the time it’s not the person’s fault, it’s your fault for hiring them wrong. You’re demanding something you should have verified that person could deliver before hiring them. Aren’t you ashamed of yourself?

Ask yourself, did I check this person was qualified for the job and a fit for the team and company culture?

Pro tip: This is why you want people to be hired by the ones working with them, not by some human resources suit 3 floors above.

Why you need to fire them

If you don’t, it undermines team morale

If you work in a company where someone is not pulling his own weight and everything is fine, would you be motivated to do your best work? Keeping a healthy team is your responsibility.

Work doesn’t get done

There’s no acceptable reason for having someone do 50% of the work, it’s your job to find someone giving 100% in the position.

Lowered standards

If you accept a job at half the quality from a person without the skills, hardly anyone will believe that quality work is crucial for success.

Opportunities by the thousand

'Dismissal communication approaches'

A job is not a lifelong incarceration and evaluations are never absolutes. Someone who doesn’t fit in one team can shine in others. Never try too hard to make a pig sing, it wastes your time and it annoys the pig.

What is your responsibility

Being unemployed is a stressful situation, so sending someone home isn’t something you do on a rage day you’re mad about traffic congestion.

So…

You talk

Being crystal clear in your communications is the most undervalued skill in software development. People have lives. People have problems. Never thought about it, huh?

Personal issues may impact work performance and you should be supportive. It’s also reasonable to think the company may have a bad time someday and your team won’t jump ship. So you don’t let people down, period.

Beware, this is not about blind loyalty. I’ve met people with a different problem every single week, and some companies are definitely a sinking ship.

You do gymnastics

I have my fair share of bad hires, and I’ve paid for them with a lot of personal time. The usual time is the one you spend helping someone adapt or acquire missing skills. It doesn’t always work, but it’s your job to try.

How long you do gymnastics is inversely proportional to the level of seniority of the subject; I’ve done a year and a half, and I’ve done two weeks.

Making the call

If talking doesn’t help, and gymnastics ain’t losing any fat; best for both is to break up.

As in any relationship, some people like to create uncomfortable situations in order to force the other to be the one doing the break up.

Don’t let that be you.

On one side, you’re a chicken shit, on the other, it contaminates your opportunity to stand by something, letting someone go is also the time when people see what you think is important, what you think is critical, and what you think is unacceptable. So, have your company clean, and make the call.

Just be clear and avoid absolutes, “you are no good” is almost never the case. “We aren’t a good fit” is almost always certain.

Wrapping up

When communicating the leave, I prefer sending an email to the team, it takes the drama out of the situation. If you’ve done your homework this should be expected, and most of the time your team will be grateful.

'Dismissal communication approaches'

Avoid by any means justifying yourself on the decision. Trashing a person that’s not with you anymore to justify your call brings bad karma.

As the one in charge you’re not the one to be presented in a good light, do the shitty job, take the blame, and shut the fuck up.

The Curse of the Technocrat

Aug 30th, 2012

I’m almost always mad when I think about the country being led by improvisers without the basic skills to run a hot-dog stand. Prioritizing politics over technical knowledge should have ended when Natural Philosophy came to town.

The rigor provided by the scientific method has uncovered a lot of charlatans.

But there is a phenomenon I call the curse of the technocrat that’s rampant in the software world; it happens when the cursed technocrat starts thinking the most important thing is her technology.

I’m quite sure it’s part of an identity crisis, when you’ve been so long sharpening your saw, becoming good at something deeply technical by nature, the survival instincts of feeling good about yourself will make you think your shit is important by itself.

Well, it’s not.

The most important thing at a particular moment, is the problem you have to solve, technique per-se has no meaning.

'Chair designed by a cursed technocrat'

It’s like having a carpenter who’s happy about using a saw but knows nothing about the people using his chairs. Or a violinist with great technique playing notes nobody wants to listen to.

As ridiculous as it seems, the software world is plagued by these cursed technocrats, with great technique, but not giving a fuck about customers having a better time.

These are my self-assessment questions to detect any trace of the curse:

Can my grandma understand what problem I solved today, and for whom?
Is this making life easier for somebody?
Will this make something cheaper for somebody?
Would someone pay to have this problem solved?
Is this a problem worth solving?
Have I spent more days paying technical debt than creating value?

Doing things for self gratification is great, but pay attention: if you’re supposed to be an engineer, you may be under the curse of the technocrat.

An antidote? Make yourself accountable for your results, not for your techniques. Good technique develops if you aim high enough.

The Hardest Thing to Grasp When Learning How to Program

Aug 29th, 2012

Today I saw an old Quora post asking what’s the hardest concept to understand when learning how to program.

tl;dr

Programming is hard, because reality is hard
@guilespi Reflexiones De Resaca

My answer:

Some concepts are better explained by another student who has just grasped them than by a teacher who finds all of it almost obvious.

So my list of difficult concepts to learn is made of the many things most people do wrong, even with many years of experience.

Edge cases: Almost everybody can get the happy-path version of an algorithm working with a little bit of work. Not everybody can get the permutation of possibilities and edge cases right.
Organizing code: Almost everybody can get an unmaintainable sheet of poorly designed code to spit out some coherent output. Not everybody can even understand, when they’re looking at a huge mess, why it is a mess.
Avoid repetition: This complexity arises from the need to hold a sometimes huge system in your head at the same time. Designing small libraries and doing a bottom-up design helps, but there will come a time when you need to have this huge monstrosity in your head at once, and you’ll fail.
Concurrency: While there are many strategies to deal with many things happening at once, having a lot of moving parts is hard. Just as it’s hard to have a system with thousands of small pieces talking to each other.

Many of these situations are just particular cases derived from Brooks’s paper No Silver Bullet: you cannot evade reality’s essential complexity with tricks aimed at the accidental complexity of programming.

Programming is hard, because reality is hard.


Interrupted

Unordered thoughts about programming, engineering and dealing with the people in the process.
