

Data Collection: Outlining and Eliminating Obstacles


Sometimes good data is hard to get. Not all data is created equal, and not all data can be accessed equally. Often, during the data collection process, we run into problems trying to access the data we want. Thinking back to our last data collection example, what if we really need a weather station, but don't have one? Mapping out your data sources can often highlight alternative solutions, or at the very least make clear the costs (both time and money) required to access the data. Some common problems, along with some solutions, are:

1: Not having electronic data available.

Having electronic data can greatly benefit analysis, for instance, being able to search the text for instances of certain words. There can be problems, though. For instance, if the data you want comes from before 2000, it is highly unlikely it will be in a clean electronic format. Between 2000 and 2010, the data may have been scanned electronically, but still be hard to search (such as an image PDF with no text layer). This type of data can be quite time-consuming to collect, but modern scanners and OCR methods mean you can scan documents quickly and convert them to searchable text. For example, scanning all of the local planning applications and creating a searchable index can now be done at a few dozen documents per work hour.
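As a rough sketch of the "searchable index" end of that pipeline: once OCR has produced plain text, a simple inverted index is enough for basic word searches. The filenames and text below are invented examples, and real OCR output would need more cleaning (punctuation, hyphenation), but the principle is the same.

```python
from collections import defaultdict

def build_index(documents):
    """Map each word to the set of document names that contain it."""
    index = defaultdict(set)
    for name, text in documents.items():
        for word in text.lower().split():
            index[word].add(name)
    return index

# Hypothetical OCR output from two scanned planning applications.
docs = {
    "application_001.txt": "proposed extension to the rear of the dwelling",
    "application_002.txt": "removal of a protected tree from the front garden",
}
index = build_index(docs)
```

Looking up `index["tree"]` then returns the set of applications mentioning trees.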

2: Impactful events during the data collection process.

One of the key “good data qualities” from yesterday was to have consistency in the data collection method. Without it, it can be hard to gauge long-term trends or the impact of other factors. For instance, if your shop moved during the data collection process, this event can have a drastic impact on sales. Effectively, this splits your dataset into a “pre-move” and a “post-move” dataset, each with their own characteristics. Outlining these major events can explain sudden changes (such as an unexpected sudden increase in sales) when performing the analysis. An example of a smaller event of the same type is a large sale – sales will increase, but profit may be consistent, a different characteristic to your business’ normal “non-sale” times.

3: Data is not clean or suffers from major inconsistencies.

If your sensors, such as the door customer counter, weather station, or even web traffic counter, are not correctly configured, this can greatly impact what we can do with the data. There are two common types of error: bias and variance. If a sensor is biased, it is consistently incorrect by a regular amount, such as a weather station that always reports the temperature 2 degrees above the actual value. This can be identified by cross-checking against other data sources and removing the bias, with little harm done. Variance is a bigger problem. For instance, if our weather station added a random amount to the actual temperature, our data becomes very noisy and hard to predict. Removing variance from a dataset is hard, but doable by fitting models to the data (i.e. "the data generally trends upwards") and using the model, rather than the raw data, to perform the analysis.
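To make this concrete, here is a small sketch (using simulated readings, not real sensor data) of removing a known bias, and of smoothing variance by fitting a simple trend line and analysing the model rather than the raw readings:

```python
import random

random.seed(0)
true_temps = [20 + 0.1 * day for day in range(30)]  # slow upward trend

# A biased sensor always reads 2 degrees high; cross-checking against a
# trusted reference reveals the offset, which we can simply subtract.
biased = [t + 2 for t in true_temps]
offset = sum(b - t for b, t in zip(biased, true_temps)) / len(true_temps)
debiased = [b - offset for b in biased]

# A high-variance sensor adds random noise to each reading. Fit a simple
# linear model by least squares, then work with the model, not the raw data.
noisy = [t + random.uniform(-3, 3) for t in true_temps]
days = list(range(len(noisy)))
mean_d = sum(days) / len(days)
mean_y = sum(noisy) / len(noisy)
slope = (sum((d - mean_d) * (y - mean_y) for d, y in zip(days, noisy))
         / sum((d - mean_d) ** 2 for d in days))
trend = [mean_y + slope * (d - mean_d) for d in days]
```

The fitted slope lands close to the true underlying trend even though individual readings are off by up to 3 degrees.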

4: Not having access to the data.

Some data is simply not accessible outside of certain circles. I saw this quite a lot in my previous career as an academic working on cybercrime. We had data we weren't allowed to give to other people, and other people had data that we weren't allowed to have. Think of the possibilities if we could combine our forces! Regardless, this is the state of play for many industries concerned about intellectual property. Tools such as non-disclosure agreements can assist in getting new data, but can also severely restrict the resulting analysis.

5: Data for sale.

Data is a great asset for a company to have, so it should come as no surprise that there are many companies who simply collect and sell data. Sometimes, the data is available via an API, such as openweathermap, who provide worldwide weather and forecasts. It is free to a certain level of usage, after which you need to pay for more access. While some data is cheap (a few dollars a month), some services are incredibly expensive at tens, or hundreds, of thousands of dollars per month. This can be a barrier to “exploratory” data analysis, where we just collect a whole bunch of data and see what it says. This makes a strong case for starting with the business question to restrict the scope of the analysis.
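As a sketch of what API access looks like in practice, here is how a request to OpenWeatherMap's current-weather endpoint might be built (replace the placeholder key with a real one tied to your usage tier; the free tier covers a limited number of calls):

```python
from urllib.parse import urlencode

def build_weather_url(city, api_key):
    # OpenWeatherMap's current-weather endpoint; the API key is tied to
    # your usage tier (free up to a limit, paid beyond it).
    base = "https://api.openweathermap.org/data/2.5/weather"
    return base + "?" + urlencode({"q": city, "appid": api_key, "units": "metric"})

url = build_weather_url("Melbourne,AU", "YOUR_API_KEY")
```

Fetching and parsing the JSON response is then a one-liner with any HTTP library.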

6: Unknown sourcing.

Finally, for some data, we may just not know whether the data exists or where to find it. It may help us to know the traffic build-up of every street in a city, but is that data stored anywhere? If so, where, and how do we access it? Finding new data sources can be tricky, but there are collections of resources that can be good starting points, such as, for Australians, the government's open data portals, which have a whole bunch of datasets available.


Reading electricity usage with a smart meter


dataPipeline is a Victorian company, and because of that, we are lucky to have smart electricity meters installed in our houses. This allows the energy companies to get exact meter usage without having to send someone to our house to read the meter. What makes this better, though, is that it also allows us to read this data ourselves, and do some interesting analysis with it!

Reading the data

Reading the data from the smart meter will require a smart meter reader. I personally use an Eagle Energy Gateway from Rainforest Automation, which may be a little overkill. This will create a server on your home network that you can access to get your data. It also connects you to an online service that collects the data and displays it for you.

Rainforest has other tools that are easier to use, such as the Emu, which gives you a screen showing the current usage. This post isn't an ad for Rainforest, though; there are plenty of other options out there that can do the job, such as those offered by AGL, Lumo Energy, Origin Energy and a few others.

Connecting it up

In order to collect the data, the smart meter reader needs to be configured to connect to your smart meter, a process called ‘pairing’.

Until recently, you needed to purchase the device through a dealer authorised by the energy company. That authorised dealer would then pair your smart meter and reader, in collaboration with your energy provider. This was a bit annoying, as I wanted my own data, and the Eagle was one of the better tools for that use case.

My energy provider is Powercor, and they have recently released an online interface that lets you connect your own smart meter readers. All you do is provide the details of the reader to Powercor through their myEnergy website and then configure the device as per the device’s instructions. Once that is done, the device will start reading the smart meter usage.

Direct Benefits

With the data collected by the reader, the benefits start to come in. First, you can see, in real time, how much energy you are using at your house. This allows you to simply turn items in your house on and off to see how much energy they use!

Second, we have solar panels, and the smart reader is aware of this. Due to this, we can see when we are generating more electricity than we are using so that we can use other items, such as the washing machine, when it is cheap to do so. We don’t (yet) have a battery, and our feed-in payment is much less than the usage tariff, so it makes sense to use this electricity rather than feed it back to the grid.

Analytics Potential

Once the direct benefits are realised, analytics-based benefits start to be seen. For instance, we can now plan our energy usage based on known good times for our solar panels, and combine that with weather data to get better predictions. This answers questions such as “should I do the washing on Tuesday or Wednesday?”

Further analytics can be performed to simplify the usage of the device. For instance, we can perform some simple analytics to "red light/green light" particular usage. Continuing the theme of the washing machine, we can create a portal which shows a green light when we can run the washing machine on solar power, and a red light when we cannot. Simple dashboarding like this means we don't need to remember how much energy each appliance uses and work it out ourselves each time.
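The red light/green light rule boils down to a one-line comparison. A minimal sketch, where the washing machine's draw of 0.5 kW is an assumed figure, not a measured one:

```python
def washing_machine_light(generation_kw, usage_kw, appliance_kw=0.5):
    """Green when the current solar surplus can cover the appliance's draw."""
    surplus = generation_kw - usage_kw
    return "green" if surplus >= appliance_kw else "red"
```

Feeding this function the live readings from the smart meter reader is all a simple dashboard needs.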

Having this data gives us more control over our electricity usage. Using this data takes this a step further.


Data Collection: Connecting Data


Following on from our last data collection blog, the next step is to link the available data to those business questions. This may be looking at your company’s inventory, past sales, or crop yields. This may also include looking outward at competitor sales, advertising levels, or world affairs.

To answer a business question using a data-driven method, we need to perform good, solid data analysis. To do good analysis, it is a requirement to obtain good data. There is a common saying in data analytics: “Garbage in, garbage out”, meaning that if your data isn’t very good, your results won’t be any good, regardless of the quality of the analysis.

The question, then, is what does good data look like? Good data has these qualities:

  • Low error in measurement, i.e. values are likely to be the actual values. Think of the problems a weather station would have if an air conditioner was blowing on it.
  • Consistency in data collection method. Back to the weather station, if we kept moving it from shade to the sun, our readings would be irregular.
  • Clearly labeled. If someone lost the manual to our weather station, is that 22 reading the temperature or the wind speed?
  • Longitudinal in nature. If we were looking for the average temperature for a given place, we need to consider seasons, so we need at least 3 years of good data.
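On that last point, a minimal sketch of computing seasonal (monthly) averages across several years of readings - the temperature values here are made up for illustration:

```python
from collections import defaultdict

def monthly_averages(readings):
    """readings: (date, temperature) pairs, with dates as 'YYYY-MM-DD' strings."""
    by_month = defaultdict(list)
    for date, temp in readings:
        by_month[date[5:7]].append(temp)  # group on the month part
    return {month: sum(temps) / len(temps) for month, temps in by_month.items()}

readings = [
    ("2014-01-15", 30.0), ("2015-01-15", 32.0), ("2016-01-15", 31.0),  # summers
    ("2014-07-15", 10.0), ("2015-07-15", 12.0), ("2016-07-15", 11.0),  # winters
]
```

With three years of data, each month's average is backed by several observations rather than a single, possibly unusual, year.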

These principles mean two things. First, you need to work out now what data you will need in the future. Second, you need to do it properly early, lest you get a few years down the track and find your collected data doesn't suit its intended purpose.

Another important factor in collecting data is to ensure you are collecting data that directly answers your business questions. There is no point collecting all of that weather data (and spending time calibrating the weather station) if that doesn’t directly impact your business questions.

As a final note, data itself can be a business advantage. Time spent collecting data, particularly using custom sensors or from hard to obtain sources, is an informational advantage over your competitors. It can also open up the possibility of answering new questions (that your competitors cannot), or answering other questions to a higher accuracy.

Think about the main unique sources of data that your organisation has access to. Write these down! It is a great starting point.


PyCon 2016


Each year, conferences around the world are held to talk about application and theory in the Python programming language (dataPipeline’s language of choice). These PyCons are increasingly popular and increasingly useful. In this article we share a few of our favourite talks from recent PyCon events, focused around using Python in data analytics settings.

A beginner's guide to deep learning

Irene Chen gives a beginner's guide to deep learning, covering the basics of classifiers, neural networks and more. If you are new to and interested in machine learning, this talk is quite informative, and I would definitely suggest giving it a watch. You'll learn about techniques and applications, possibly giving some insight into directions you can go in the future.

Visual Diagnostics for more informed Machine Learning

Rebecca Bilbro gives a talk on a more informed take on machine learning. She goes over many ways to visualize machine learning algorithms using graphs and plots, such as:

  1. box plots
  2. histograms
  3. sploms (scatter plot matrices)
  4. joint plots

and so on. Rebecca shows how to use these plots and graphs; she has even set up her own library for them, called Yellowbrick. Yellowbrick is a suite of visual analysis and diagnostic tools designed to facilitate machine learning with scikit-learn.

Functional Programming for Pythonistas

Bianca Gibson talks about functional programming in Python. This is a nice talk about how functional programming in Python differs from other languages, and how in many cases it is easier to use.

Python for Windows

Peter Lovett talks about Python on Windows. This is a great talk for anyone who is using Windows and wishes to get involved with Python. Peter covers the limitations of Python on Windows, as well as the positives of the OS.

This is a super helpful video if you are just starting out and are trying to figure out how to use Python with Windows.


Data Collection


Collecting the right data can be a tough first step for many small businesses. It can be hard to find the time or resources to get started. I’m a small business owner myself and know the difficulty of getting time or resources to start new projects outside of my expertise area.

Over a series of posts, we will discuss these five very important questions:

  • What are your business questions?
  • Connecting data to your questions.
  • Outlining and eliminating obstacles
  • Collecting data for the long term
  • Implementing a data collection process

Onto today’s topic: What are your business questions?

Data-driven decisions need good data analytics, and good data analytics is the result of matching data with the right questions. This is true for science, government and industry. It is especially true in small business, where there are fewer resources available, so the shotgun approach of "collect all data and work it out later" is just not feasible.

The most important thing, as Stephen Covey says, is to “start with the end in mind”. Starting with business questions defined is critically important, as it helps reduce wasteful collection and analysis of data, which costs time and money.

A good business question is well scoped, and is clearly answerable, whether or not the answer is what you expected or wanted. Good business questions often come from hypotheses, which are predictions about how your business impacts, or is impacted, by other events.

For instance, a good business question is:

“Are the conversions from my website increasing over time?”

This is clearly defined in scope (we are interested in conversions from the website), can be answered succinctly (yes or no), and provides a great platform to start collecting data - we would collect conversion information from our website log.
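As a rough sketch of how such a question gets answered once the conversion data is collected (the weekly counts below are invented), a crude first pass just compares the earlier and later halves of the series:

```python
def conversions_increasing(weekly_conversions):
    """Crude yes/no: is the average of the later half above the earlier half?"""
    half = len(weekly_conversions) // 2
    early = sum(weekly_conversions[:half]) / half
    late = sum(weekly_conversions[half:]) / (len(weekly_conversions) - half)
    return late > early
```

A real analysis would also look at trend lines and seasonality, but even this simple check gives the succinct yes/no answer the question asks for.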

An example of a poor business question is:

“How do I get new customers?”

Such a question doesn’t give a clear scope, is hard to answer succinctly, and is too open ended to just start collecting new information. It provides a start to a research project, which in turn may lead to a more clearly defined research question, but this type of question is hard to solve analytically.

The road to good insights from data starts with a good question.


Gamification for business


Gamification for business, GAMES.. for BUSINESS!? How could games possibly help out the work place?

Well hold on, it’s not an actual game. So what is gamification? How can it help your business?

Let’s start with what exactly gamification is.

Gamification is the process of taking something that already exists, such as a website or an enterprise application, and integrating game mechanics into it. The reasoning behind this is to motivate participation, increase the level of engagement from our employees, and gain stronger loyalty from them as well.

How does it work?

Gamification takes techniques created by game designers to engage players and applies them to a non-game environment in order to motivate actions that add value to your business. Mint is a good example of gamification: Mint is an app for helping you fix your financial problems, and it uses gamification by employing a variety of goals and trackers, along with visual breakdowns for understanding your spending and budget allocation.

(Image: Mint's budget visuals.)

Gamification isn’t about creating a new thing, it’s all about amplifying the effect of an existing project, experience or task, by applying motivational techniques that make games so engaging.

Another example is Chore Wars. Whilst Chore Wars isn't directly aimed at business, it could be used for your employees' day-to-day tasks. The idea of Chore Wars is that you create a hero and then set tasks; as each task is completed, you are rewarded with XP and gold coins to add things to your hero. This could be a fun way to motivate your employees, particularly if you run a company where the majority of your employees sit at a desk all day.

Game Mechanics

According to one common framework, gamification is built on 10 main game mechanics. These mechanics are:

  • Fast Feedback: Immediate feedback or response to actions

    Encourage users to continue or adjust their activities with onscreen notifications, text messages or emails.

  • Transparency: Where everyone stands

    Show users exactly where they stand on the metrics that matter to you and to your audience.

  • Goals: Short- and long-term goals to achieve

    Missions or challenges give users a purpose for interaction, and educate users about what is valued and possible within the experience.

  • Badges: Evidence of accomplishments

    An indicator of accomplishment or mastery of a skill is especially meaningful within a community that understands its value.

  • Leveling Up: Status within the work community

    Levels indicate long-term or sustained achievement.

  • Onboarding: An engaging and compelling way to learn

    Video games train you how to play as you play – users learn by doing.

  • Competition: How I’m doing compared to others

    Raise the stakes for accomplishing a goal by showing users how they compare to others, as individuals or in teams.

  • Collaboration: Accomplish a goal working with others

    Connect users as a team to accomplish larger tasks, to drive competition, and to encourage knowledge sharing.

  • Community: A context for achievement

    Community gives meaning to goals, badges, competitions, and other mechanics.

  • Points: Tangible, measurable evidence of my accomplishments

    Used to keep score and establish status or accumulated to purchase virtual or real goods.

These mechanics are proven to motivate and engage users, and you may use any combination of these techniques to accomplish your business goals.
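For the technically inclined, mechanics like points, levels and badges boil down to very simple bookkeeping. A toy sketch in Python, where the 100-points-per-level rule is a made-up example, not a recommendation:

```python
class Player:
    """Points accumulate; levels and badges mark progress."""

    def __init__(self, name):
        self.name = name
        self.points = 0
        self.badges = []

    def award(self, points, badge=None):
        self.points += points
        if badge is not None:
            self.badges.append(badge)

    @property
    def level(self):
        # Hypothetical rule: a new level every 100 points.
        return self.points // 100 + 1
```

The hard part of gamification isn't the code - it is choosing rewards that actually align with the behaviour you want to encourage.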

The end game plan

The end-game idea is that gamification will transform business models by extending relationships, improving overall engagement in business, and increasing loyalty with employees and customers. It works because it taps into the desires that exist in us all: community, feedback, achievement and reward.


Training your chatbot


Chatbots have finally reached a stage where they can become quite useful, for businesses and for day-to-day users; they are even super easy to create. The hard part is teaching them!

In our last post about chatbots we gave a quick introduction and explained how chatbots could help out your business.

In this post we are going to walk through some basic training, look at some results, and talk about how we can improve the chatbot.

What does it mean to train?

When we say train a chatbot, we mean teach it how to communicate with a person. You wouldn't want to go onto your bank's website, see that they have a chat program, ask if you can get a loan for a car, and have it respond with "yes I like hippos."

So how do we make sure that the chatbot knows what it’s talking about and how to communicate it appropriately?

How do we train our bot?

In this section we will use ChatterBot as an example. ChatterBot is a Python-based chatbot library that allows a programmer to quickly create a chatbot. Here is how to create a really simple chatbot; this can be run offline or integrated into a web interface.

This code creates the chatbot (its name is Norman) and wires in the adapters that give it its abilities. Any input typed into the chatbot is then sent, via the storage adapter, to database.json.

database.json is the file where all of the questions and answers created by having a conversation with the chatbot are stored. This is where the chatbot will go to find answers to your questions.

from chatterbot import ChatBot

bot = ChatBot(
    "Norman",  # The chatbot's name.
    logic_adapters=[
        "chatterbot.logic.MathematicalEvaluation",  # Simple maths equations.
        "chatterbot.logic.TimeLogicAdapter",        # Telling the time.
    ],
    input_adapter="chatterbot.input.TerminalAdapter",
    output_adapter="chatterbot.output.TerminalAdapter",
    storage_adapter="chatterbot.storage.JsonFileStorageAdapter",
    database="database.json",  # Stores all chat information.
)

# This loop allows the chatbot to keep responding until it is interrupted.
while True:
    try:
        bot_input = bot.get_response(None)
    except (KeyboardInterrupt, EOFError, SystemExit):
        break
At this point, the only thing the Chatbot will do is copy exactly what you say.

User: Hi

Chatbot : Hi

Note you can run this program offline or through a web interface

Now we want to be able to get the chatbot to talk back to us. It’s not that difficult, we just need to change the code slightly to use corpus data. Corpus data is a large collection of texts that we will use to train the bot. Chatterbot comes with a few corpus data packs, we are going to use the greetings corpus data for this example.

from chatterbot import ChatBot
from chatterbot.trainers import ChatterBotCorpusTrainer

# Chatbot name
chatterbot = ChatBot("Training Example")

# Tell the chatbot we are training from corpus data, and point it at the
# location of that data: the English greetings corpus shipped with ChatterBot.
chatterbot.set_trainer(ChatterBotCorpusTrainer)
chatterbot.train("chatterbot.corpus.english.greetings")

# This loop allows the chatbot to keep responding until it is interrupted.
while True:
    try:
        bot_input = chatterbot.get_response(None)
    except (KeyboardInterrupt, EOFError, SystemExit):
        break

The corpus data is a list of greetings. For example, the following exchanges can be found in this corpus:


Hi, How are you?
Hi, I’m good thanks

Whats up?
Not much, you?

In the next section, we will show what it looks like in code and how to use it.

Creating data and Training issues

We can manually create data for the chatbot to read. The way we would go about this would be to create a JSON file called greetings.json. Within this file, you would then add your greetings and the response you want the chatbot to give back to the user for each greeting.

The code for manual entry would look like this:

{ "greetings": [ [ "Hello", "Hi" ] ] }

The result from running your chatbot and typing in Hello would look like this:

User: “Hello”

ChatBot: “Hi”

and so if you were to change the data to something else, like:

{ "greetings": [ [ "Hello", "Hi, How are you?" ] ] }

the bot would respond like this:

User: “Hello”

ChatBot: “Hi, how are you?”

This is the simplest way to get your chatbot talking; however, it is completely impractical. For example, if a user typed:

User: “Hello”

Chatbot: “Hi, how are you?”

User: “good how are you?”

Chatbot: “Hi, how are you?”

As you can see, this is going to be a big problem. Doing it this way would be very hard if you had a large database of questions: you would need to manually type in every answer for every question you could think of. That's a lot of questions!

The other main issue is that people ask things in different ways. The chatbot at this stage isn't going to be able to pick that up, which is going to lead to incorrect information and answers.
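One small mitigation is to normalise user input before looking up a response, so that different phrasings of the same greeting map to one entry. A rough sketch, with a hypothetical response table rather than ChatterBot's own lookup:

```python
import string

RESPONSES = {"hello": "Hi, how are you?", "whats up": "Not much, you?"}

def normalise(text):
    # Lower-case and strip punctuation so "Hello!" and "hello" match.
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(cleaned.split())

def reply(user_text):
    return RESPONSES.get(normalise(user_text), "Sorry, I don't understand.")
```

This only papers over surface variation; genuinely different phrasings of the same question still need smarter matching.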

We are still working on exactly how to fix these issues. Our latest project has seen us change from JSON files to MongoDB for storage, as JSON files can't handle large amounts of data. That fixes the storage problem; however, we are still entering the input manually.

If you wish to know how to change databases, the ChatterBot documentation has all of the answers.

How do we improve the bot?

This is the tricky question. Currently, the smartest bots, such as A.L.I.C.E., run off a language called AIML (the Artificial Intelligence Markup Language). This language makes it a bit easier to train your bot; however, such bots are still quite dumb, and you can tell you're talking to a robot and not a human.

At dataPipeline we are currently integrating Parsey McParseface, Google's newly released language parsing model for English (which they state is 94% accurate at text parsing), with SyntaxNet, an open-source neural network framework for TensorFlow that provides a foundation for Natural Language Understanding (NLU) systems (also created by Google). The aim is to create an AI chatbot that can not only hold a human-like conversation, but also learn from the experience of talking and become an intelligent piece of programming.

In our next post we will go over how to implement the Chatbot into a live web interface, and show you a working model!


Introduction to chatbots


We have reached a stage in chatbots where we can have increasingly engaging and human conversations, allowing businesses to take a hold of this technology and use it with their customers.

So what is a chatbot? And how can a business use them?

A chatbot is a program that is able to talk to people, answer questions and give information about, well, anything. Surprisingly, making a chatbot isn't all that difficult. If you use ChatterBot, it has detailed step-by-step instructions on how to create a simple chatbot using Python. The hardest part is training the bot, which we will cover in the next blog post. Training involves teaching your chatbot what it is talking about and how to hold that conversation.

In an article on Business Insider, they explain how we've trained ourselves to click through apps or search in weird phrases to get the information we want. You wake up thinking 'I wonder if it will rain today' and, instead of looking out of your window, you go to your phone and open an app to search with your zip code to see if it's raining.

Chatbots could change everything about how you surf the web. In the future, you could just say 'I wonder if it will rain today' and a chatbot, knowing your location, would be able to tell you conversationally whether you should bring an umbrella - no app or search box required.

As for the business side of things: if you had a company where you are getting asked a lot of support questions, you could set up a chatbot on your website that could answer these questions for you. Chatbots still have a long way to go to be reliable, and need to be trained for their specific use case, but chatbot quality is increasing quite quickly, making this a viable option for the long-term plans of many businesses.

In our next blog post we will go through how to train a chatbot, and how to do this in practice.


Version control in data mining


Developing data mining code is much like developing other code in most respects. However, the nature of data mining involves lots of back-and-forth experimentation, and this can make it difficult to keep track of changes. In this article, I share some of the practices I use to try to keep on top of things.

At dataPipeline, we use version control for, well, almost everything. This includes our data mining projects, our other software and even our website. Version control is a means of keeping track of the files in a project and is usually used in software projects. However, it can be used for any other project, from writing to data analysis and so on.

In version control, you add files to a central list. Any changes you make are then recorded on this list, including file edits, new files, deleted files, and so on. Version control systems, such as git, also allow you to create a commit message, where you add a note on the change, outlining why you did it.

Good practice for version control includes:

  • Creating text-based files, rather than custom formats, which can be more easily tracked by version control systems.
  • Providing good commit messages that clearly explain the changes made to the files and why they were made.
  • Having commits be incremental, rather than ground breaking, so that a single commit can be considered in isolation.
  • The code in the central list should always “work”, allowing code to be compared to previous versions.
  • Single stages of changes, a commit should be about a single change to the code, for a single reason.

On that last point, a “single change” could be one line, or 100 lines, but it is about “one thing”, in much the same way that a paragraph in a book is about “one topic”.

For data mining projects, keeping track of the software is an obvious component - all software we develop is stored in version control. Beyond that, data mining projects also have experimentation parameters, such as the values given to particular methods, their results and so on. We keep track of results files, as well as the parameters that led to them. This means that if we break the code, we can always backtrack to when it was working and recover our results. This is a lesson learned the hard way, from a project where I was getting good results, changed some code, and could not unchange it to get those same results again.
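One lightweight way to track parameters alongside results is to write both into a small file that gets committed with the code. A sketch - the run id, parameter names and results here are all hypothetical:

```python
import json
import tempfile
from pathlib import Path

def record_run(directory, run_id, params, results):
    """Write parameters and results side by side, ready to commit with the code."""
    path = Path(directory) / "run_{}.json".format(run_id)
    path.write_text(json.dumps({"params": params, "results": results}, indent=2))
    return path

# Hypothetical experiment: which parameters produced which accuracy.
with tempfile.TemporaryDirectory() as directory:
    path = record_run(directory, "042", {"n_estimators": 100}, {"accuracy": 0.87})
    saved = json.loads(path.read_text())
```

With these files in version control, "which settings produced that good result three weeks ago?" becomes a question the history can answer.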

One type of file that we do not keep in version control is intermediate results files. Most data mining projects involve altering data from one format to another, and this results in lots of intermediate files. I don't keep these in version control, as they should always be recoverable by running the code. I should point out that I do keep backups of these files, though that isn't as critical as the code that created them.

Any data files, such as files that map keys to values (think “postcodes” to “lat/lon” mappings) are sometimes kept in version control, but only if they are small. Larger files are stored in backup instead, and downloadable from our internal systems if needed. The reason is that the version control system keeps track of all changes, so adding large files means that the code base stays large forever, even if that file isn’t needed later on.



A guide to centralising your data



As an organisation's data capability grows, so too must the tools it uses. From Excel spreadsheets to databases, this initial period of growth can be quite tricky. In this post, we will look at some of the tools you can use to centralise your data, and how they are likely to affect your workflows.

Most data analysis programs start the same way - Excel spreadsheets spread across network drives, emailed around, and so on. However, as the projects mature, you'll want to start centralising both the data (so everyone sees the same data) and the common processes (such as fixing up common problems with your data). This can require a bit of setup, but the end result is significantly better - reduced work, reduced mistakes, and better analysis and outcomes.


If you have never used one before, centralised databases can seem tricky: it appears there is no common interface, and that everyone needs to learn SQL in order to do basic things. This isn't necessarily true - databases can be easy to work with, and Excel has lots of tools for working with them (if you do use Excel day-to-day).

Most people who have worked in IT will have some experience with databases, whether it is MySQL, MariaDB, PostgreSQL, MongoDB, SQLite, or so on.

Let’s have a look at the basic pros and cons of some of the major databases:

  • MySQL or MariaDB: A nice database system. MariaDB is an open source fork of MySQL, created when MySQL's owner started closing development off from the community. Many, many people have experience with MySQL. Both are free, with MySQL having commercial assistance available.
  • MSSQL: Microsoft’s database of choice. If you are a Windows shop, with Windows servers, this is likely to be the path of least resistance, but expect to pay when the database grows.
  • PostgreSQL: Like MariaDB, but a little more complicated to learn. If you are starting from scratch with no database experience, I’d go with MariaDB. PostgreSQL has some nice features that MariaDB doesn’t, and overall appears to be a slightly better database (but not by a lot).
  • SQLite: Super simple to use, super fast, and really portable (the whole database is contained in a single file). The one major downside is that it is effectively single-user - only one process can write at a time.
  • MongoDB: One of the easiest “NoSQL” options, where you don’t have rigid tables – you just put data in, and worry about its format yourself, with the database not constraining you. Great if you are creating a new app, and don’t know what the structure will look like yet, but useful later on too.
  • Firebase: Effectively a hosted version of MongoDB. Great if you have an app, or will always have internet access. Firebase is low setup, low maintenance, and you let Google deal with difficult problems like scaling your servers.
  • Access: Good for one-shot small applications, but I would recommend skipping this step and going straight to a “real” database.
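To show how low the barrier to entry can be, here is a minimal sketch using Python’s built-in sqlite3 module (the table and sample data are invented for illustration):

```python
import sqlite3

# An in-memory database; use a filename like "sales.db" to persist it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (product TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("widget", 19.95), ("gadget", 5.00), ("widget", 19.95)])

# Everyone queries the same central data with plain SQL.
total = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
print(total)
```

The same queries work against MariaDB or PostgreSQL with little more than a change of connection library.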

The database I recommend is… whatever your IT people have experience with. Any differences between major databases are likely to be significantly overcome by a lack of understanding of the database engine if you don’t have anyone with experience. Talk to your support team (or us!) and look at the best option for your environment.


I recommend setting up automated systems for getting your data into databases. There are lots of tools for importing data into databases, but unless you are entering data manually, you should be looking to link your systems together in order to effectively manage your data. Each manual step is a way to introduce delays, miscommunications, and data-centric problems.

This step can be quite complex, but the good news is that you set it and forget it. For instance, writing a program to download data from one program (say your bookkeeping system) and enter data into another (e.g. your sales prediction system) can be hard, but only needs to be done once, and updated if either of those connected systems has a major change.
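As a sketch of what such a link might look like - assuming a hypothetical bookkeeping system that exports CSV, and a local SQLite database (both invented for illustration):

```python
import csv
import io
import sqlite3

def import_sales(csv_text, conn):
    """Load a CSV export from a (hypothetical) bookkeeping system
    into a central database, replacing manual copy-and-paste."""
    rows = [(r["date"], float(r["amount"]))
            for r in csv.DictReader(io.StringIO(csv_text))]
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    return len(rows)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (date TEXT, amount REAL)")
export = "date,amount\n2023-01-05,120.50\n2023-01-06,80.00\n"
print(import_sales(export, conn))
```

Scheduled to run nightly, a small script like this removes a whole class of manual-entry mistakes.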


Excel is a great tool. It gets a bad rap, but at the end of the day, it is easy to use and fit for many purposes. That said, there are better tools.

The first one many companies look at is Tableau. It has some of the easiest data connection and integration tools in the business, but overall it is quite clunky, making some quite normal things hard to do. It is better than Excel for reporting on data, but keep to the basic Tableau templates and don’t try too much customisation.

If you have regular reporting, I recommend investing in a system to do it for you. For instance, tools like Pandas in the Python programming language allow complex data analysis to be done quite easily, and plotting tools like Bokeh produce awesome plots. These reports can be saved as HTML, which allows you to serve them on your intranet, or put them on your website publicly with ease. A good candidate for this type of reporting is a quarterly sales report, which should be automatically generated and sent to everyone each quarter with no human intervention - the report is quite straightforward, after all!
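As a sketch of the idea, here is a small quarterly summary using Pandas (the sales figures are invented; a real report would pull them from your database):

```python
import pandas as pd

# Invented monthly sales figures standing in for a real data source.
sales = pd.DataFrame({
    "month": pd.to_datetime(["2023-01-31", "2023-02-28", "2023-03-31",
                             "2023-04-30", "2023-05-31", "2023-06-30"]),
    "revenue": [1200, 1500, 1100, 1700, 1600, 1800],
})

# Aggregate to quarters and render as HTML for the intranet.
quarterly = sales.groupby(sales["month"].dt.to_period("Q"))["revenue"].sum()
html_report = quarterly.to_frame().to_html()
print(quarterly)
```

Pairing a script like this with a scheduler (cron, Task Scheduler) gives the “no human intervention” quarterly report described above.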

In summary, centralising your data analysis capability is a difficult step to take, but the results are worth it for your company.


Colour blindness with data visualisation


Data analysts all over the world use visualisation tools such as Tableau, or Seaborn via Python, to get graphs of their data out to their clients.

As shown below, these can look great!

[Image: an example data visualisation in full colour]

However, what happens if you’re colour blind?

Colour blindness affects approximately 1 in 12 men (8%) and 1 in 200 women worldwide. For anyone who suffers from it, it can be rather hard to tell exactly which colours are which. The following colour combinations are especially hard for colour blind people: green & red; green & brown; blue & purple; green & blue; light green & yellow; blue & grey; green & grey; green & black. You should try to avoid using these combinations. There have also been studies suggesting that brighter, higher-contrast colours may help a colour blind person recognise a colour more easily.

This is what the above image looks like to someone who is colour blind:

[Image: the same visualisation as seen with colour blindness]

As you can see it’s quite hard to tell what is what.

Jeffrey Shaffer, known as a Tableau Zen Master, wrote this blog post giving 5 tips on how you can change the colours of your Tableau graphs and diagrams to better suit colour blind people. There is also an article here about how Seaborn and Python can be used to get good colour-blind-friendly results.

Daltonize is a program that simulates the three types of dichromatic colour blindness in images and matplotlib figures. Generalising and omitting a lot of details, these types are:

  1. Deuteranopia: green weakness
  2. Protanopia: red weakness
  3. Tritanopia: blue weakness (extremely rare)

Daltonize can also adjust the colour palette of an input image so that a colour blind person can perceive the full information in the content.
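If you use Seaborn, a quick mitigation is its built-in colour-blind-friendly palette - a minimal sketch:

```python
import seaborn as sns

# Seaborn ships a "colorblind" palette designed to remain
# distinguishable under the common forms of colour blindness.
palette = sns.color_palette("colorblind", 6)
sns.set_palette(palette)  # all subsequent plots use these colours
```

One line of setup, and every chart you produce afterwards avoids the worst colour combinations listed above.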

If you’re planning on developing or designing some data sets using graphs with colours, remember the colour blind!


Cluster analysis


What is cluster analysis used for?

Clustering is a useful tool for exploring datasets. It can be used for text, images, and many other datasets, providing a glimpse into the internal structure of the data. Cluster analysis can be very useful when the data is hard to visualise, or for big data applications.

Cluster analysis is a type of unsupervised learning, which means we don’t have a preconceived idea of what we want the resulting model to learn. This is in contrast to supervised learning, such as learning which images are of cars versus trucks (useful for automated road toll billing). In supervised learning, we provide a training dataset that is already labelled. The job of the analysis is to learn how to take the input and derive the given output.

For cluster analysis, exploring the data is key. This can back up existing knowledge, or it can find new structures. For example, this article clusters US senators based on their voting habits. When you split the data into two clusters, a clear Republican-versus-Democrat divide appears. As you increase the number of clusters, you can see that the Republican “voting bloc” stays quite rigid, while the Democrats are more prone to breaking into sub-clusters. Further, you can see which senators don’t vote with their party, and whether independents lean more towards one party or the other.

One of the main challenges with cluster analysis is that of choosing features. For supervised learning, you can add new features - for example, Senator age could be added to the above-linked dataset. The learning process will learn whether or not the feature is useful because it knows what it is looking for. However for cluster analysis, adding new features can severely diminish the quality of the results as the clustering algorithm will gladly use any information, useful or not, that you give it.

Challenges aside, cluster analysis is a great way to get a new viewpoint on a dataset. Take, for instance, the Whereabouts London project, which uses cluster analysis to show which areas of London are similar to each other, giving structure to London.

[Image: map of the Whereabouts London clusters]

Analyses like this give insight into data and provide a view for future analysis. In many cases, the cluster analysis can be used to determine the future questions that will be asked of the data.
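As a minimal sketch of the technique, here is k-means clustering with scikit-learn on invented two-dimensional data:

```python
from sklearn.cluster import KMeans

# Two obvious groups of invented points.
points = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
          [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]]

# Ask for two clusters; the algorithm finds the structure itself -
# no labels are given, which is what makes this unsupervised.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(points)
print(labels)
```

With real data the interesting part is interpreting the clusters - what do the points grouped together have in common?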


The problem of the mean



The mean is the most used single measure for interpreting a dataset. It is often the first question we ask when we get some new data - “what is the average?”. However, it has some major issues that need to be kept in mind when using it, otherwise it is quite easy to get the wrong impression.

Calculating the mean is quite simple: you add up all the values and divide by how many there are. This simple formula can be found in nearly every piece of software that takes data, due to both its heavy use and its simplicity (is that cause and effect?).

The first major issue is that it is just a single statistic. There has been a significant amount of work done on creating single statistics to represent complicated ideas. This includes the mean to describe some data, the accuracy to evaluate the results, or the p-value to compare experiments. Every time a complex result is evaluated by a single statistic, there is significant information that is lost in the process.

After World War 2, United States Air Force pilots had trouble controlling their planes. Slight mismatches between pilot and cockpit caused slight delays in control, with catastrophic results, including the loss of pilots. It was quickly discovered that pilots never quite fit their cockpits, despite the cockpits being built to fit the average pilot.

What was less quickly discovered was that there is no such thing as an average pilot. The Air Force was quite strict about selecting pilots who conformed to a standard height, but even then, no pilot fell within the average range on all 10 attributes (such as height and arm length) that were studiously measured and used to design cockpits. The “average cockpit” actually served not a single pilot. Even worse, only three in every hundred pilots were approximately average on even three dimensions (where “approximately average” was a broad “in the middle 30%”).

For more information on this, see this fantastic article. It outlines this finding in more depth and its ultimate solution of designing cockpits to be customised rather than trying to find average-sized pilots.

The second major issue with using the mean is outliers. An outlier is a piece of data that is inconsistent with the rest of the dataset. It could be an anomaly that actually existed (such as an overly tall basketball player), a data entry error (e.g. entering 100 instead of 10.0), instrument failure (such as a malfunctioning temperature sensor), or any of a variety of other causes.

To give an example, consider the numbers [1, 2, 3, 4, 5]. This set of five numbers has a mean (3) that does a good job of representing the dataset. However, suppose we add just one more point: an outlier caused by a data entry error. This new data point, which should have been 3 but was instead entered as 30, changes the mean to a non-representative 7.5! This “mean” isn’t even in the realm of the values we have seen so far!
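The arithmetic is easy to check:

```python
data = [1, 2, 3, 4, 5]
print(sum(data) / len(data))          # 3.0 - representative

# One value mistyped: 3 entered as 30.
data_bad = [1, 2, 3, 4, 5, 30]
print(sum(data_bad) / len(data_bad))  # 7.5 - misleading
```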

Outliers are usually easily picked up by looking at distributions of data, rather than single statistics. They can be removed if they appear erroneous, adjusted if the cause of the error can be identified (such as a missing decimal place), or otherwise adjusted using algorithmic means.

Luckily, there are solutions. I recommend looking at distributions, rather than single statistics. A distribution, such as a histogram, can give a visual on the dataset that no single statistic could hope to achieve. Further, it is important that questions are asked of the data, and assumptions tested. This, however, is a topic for another post.


Evaluation is hard


A key part of any data analytics project is evaluating the end result. Doing this effectively is hard, and can undermine the work done in the other stages of the project, especially if a bad evaluation framework results in the wrong model being chosen.

Evaluation is hard, so I highly recommend that significant thought is given to evaluation. There is an adage “what gets measured gets managed”, but this can be a double-edged sword. What, exactly, is measured is what, exactly, is optimised for. If your evaluation method misses the mark, your end results won’t do what you expected, despite the “high performance” reported.

Evaluation Metrics

The most natural way to evaluate something is to count the proportion of times it was right, also known as the accuracy. If your system makes 100 predictions, and 70 are correct, your accuracy is 70%.

The accuracy is quite straightforward to understand, but can’t deal with what are known as “unbalanced classes”. For example, if you are building a system to find pictures of boats on the internet, you’ll have far fewer images of boats than of other things. If 99% of images contain no boats, you can easily get 99% accuracy by just predicting “no boat” for every image. This is a highly accurate system, but it doesn’t help us at all. In this case, our evaluation metric has failed us.

There are a few ways to “balance” the accuracy, and the F-measure (also known as the F-score) is a commonly used method. First, we answer two questions:

  1. Of all the images that are of boats, what percentage did we classify as boats?
  2. Of all the images that were classified as boats, what percentage actually were?

The first question gives us the recall of our system, while the second gives us the precision. Combining these two results gives us the F-score. Our theoretical “never-boat” system would score very poorly on this score, leading us to choose a model that actually predicts boats effectively.
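A sketch of the calculation, using the boat example with invented counts:

```python
def f_score(tp, fp, fn):
    """F-score from true positives, false positives and false negatives.
    It is the harmonic mean of precision and recall."""
    if tp == 0:
        return 0.0  # found nothing correctly
    precision = tp / (tp + fp)  # of predicted boats, how many were boats?
    recall = tp / (tp + fn)     # of actual boats, how many did we find?
    return 2 * precision * recall / (precision + recall)

# A classifier that finds 6 of 10 boats, with 2 false alarms.
print(f_score(tp=6, fp=2, fn=4))
# The "never predict boat" classifier: high accuracy, F-score of zero.
print(f_score(tp=0, fp=0, fn=10))
```

Note how the degenerate classifier that impressed the accuracy metric scores zero here.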

Another way of thinking about evaluation is through ranking. If our system is predicting the outcome of a football season, we might assign a likelihood to each team winning the championship. Our “prediction” for the winner is the one with the highest likelihood.

If the second-highest team wins instead, our prediction is wrong, in black-and-white terms. However, our system did quite well! For problems such as these, many evaluation methods work on ranking and likelihoods to compensate for this issue.

Other metrics apply for other problems. For example, text segmentation is a task where we try to break up text based on some criteria. An example is breaking a word into syllables. If our syllables are “off by one” character, they are wrong, but we probably want our evaluation criteria to identify that they were at least close!

Choosing an evaluation metric is hard, and takes careful consideration. As with any data analytics step, starting with the end in mind is the key - what is the goal you are trying to achieve?

From there, it is a task of critical thinking to ensure that this evaluation is effective. How do we measure that? What aren’t we measuring? How will that affect the results?


The future of fraud detection


Fraud has always been a major problem for many sectors, including obvious ones such as banking, government welfare, grants, charities and so on.

Less obvious ones include, well, all of us. Whether that includes small business owners who get hit with fraudulent invoices, or customers who get their credit cards skimmed at a register, the threat of fraud affects us all.

Luckily, the field of fraud detection is growing rapidly, with more insightful algorithms coming out regularly, including from public-sector researchers such as those at the Internet Commerce Security Laboratory. These researchers are developing new techniques for profiling and detecting fraud. There is also an industry sector eager to take these findings and put them into practice, resulting in fraud detection products usable by you and me.

As identified by John Verver, data analysis has a key role to play in fraud detection and the techniques are improving quite dramatically.

Most people have some experience with fraud detection, even if it is a second-hand account of a friend who was called by their bank to ask if a transaction on their account was legitimate. Behind the scenes, algorithms and experts work together to find rules that separate normal behaviour from the abnormal behaviour that indicates fraud. This is data mining in practice: those patterns are then codified into a filter that searches through transactions and flags those likely to be fraudulent.
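As a toy sketch of such a filter - flagging transactions that sit far outside a customer’s usual spending (the data and threshold are invented):

```python
from statistics import mean, stdev

def flag_unusual(amounts, threshold=2.0):
    """Flag transaction amounts more than `threshold` standard
    deviations from the mean - a crude 'abnormal behaviour' rule."""
    mu, sigma = mean(amounts), stdev(amounts)
    return [a for a in amounts
            if sigma > 0 and abs(a - mu) / sigma > threshold]

# Everyday spending, then one suspiciously large transaction.
print(flag_unusual([20, 25, 22, 30, 21, 24, 5000]))
```

Real systems use far richer features (merchant, location, timing), but the shape - model normal behaviour, flag deviations - is the same.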

The Australian Taxation Office is building (and using) systems that model the behaviour of taxpayers, and using those models to look for abnormal behaviour. This includes how taxpayers interact. In the past, this would have been seen as “too large” a problem, which is why the tax office has focused on different types of taxpayers (such as tradies) in different years, rather than everyone, every year. However, I estimate that very quickly (say, within 5 years), organisations like the ATO will be able to model significant portions of Australia’s economy, leading to a situation where tax fraud is very difficult to hide.

Human-Algorithm Interaction

Recent news articles have talked about the use of social media to find those defrauding welfare systems. In one example, a couple were each claiming a single person’s benefit, despite obviously being in a long-term relationship (they lived together and had just announced a pregnancy). In many cases, this sort of thing is investigated manually based on a tip-off: someone calls in a suspected fraudster and the welfare agency investigates.

This type of investigation can benefit from an automated analysis, even if it does only part of the work. For instance, rules can be set up to look for patterns such as “frequent postings with another person on single’s payment” to look for this kind of fraud. Such systems are probably not yet advanced enough to automatically be judge and jury, although they are definitely smart enough to find “red flags” requiring further investigation.

This type of machine-learning alerting system can take a problem that is too large for a person, such as searching online for details about people, and turn it into a smaller number of specific events to investigate.

Companies that use this type of system should also be on the lookout for how it improves over time. In this sense, the system both learns new rules and has new rules added by people.

This leads to a natural progression where more work is done via the automated system and the people are free to either investigate new forms of fraud or to do other tasks (such as building new products). As this happens, the company does more with existing sources and becomes more profitable, modern and robust.


Insight from external data


Data analysis insight doesn’t just come from data your organisation has created. While much of our knowledge comes from looking internally, there are substantial gains to be had in our understanding of the world, and of your business sector, by looking outward. These external datasets can come from others in your industry, government organisations, third-party organisations, researchers and so on.

Using external data sources has an odd synergy with competitors. While having internal data provides your company with an advantage that others do not have (because they don’t have the same information), external data is often publicly available. This means that to have an advantage, you need to have better analysis tools and better integration with your internal data and expertise.

It might surprise many, but the Australian government has an extensive number of publicly available datasets. Below we outline some example datasets from the Australian Bureau of Statistics, which all businesses are free to use.

Building Approvals

Extensive information is released on the number, location, and types of building approvals each month. This type of data can be useful for organisations to find fast moving areas, new types of businesses being built, and other information about the location they conduct business in.

Crime Victimisation

Crime is a big indicator of demographic information such as income and business viability in an area. Knowing crime statistics for an area can assist in determining whether to expand into a new area. Conversely, if your business is in providing security, these might be the exact areas you are looking for.

From their page:

The survey collected data, via personal interview, about people’s experiences of crime victimisation for a selected range of personal and household crimes. The survey also collected data about whether persons experiencing crime reported these incidents to police, selected characteristics of persons experiencing crime, and selected characteristics of the most recent incident they experienced.

Labour Force

Probably the most talked-about ABS dataset, the current trends of employment are newsworthy even when they don’t change. It provides a general indicator of the expansion of business, a predictor of future customer sentiment, and much more. One important use is to help predict the outcome of elections - if employment is down, it is more likely that the government will be changed. The page also includes some analysis of the general trend, without getting too political.

Consumer Price Index

The CPI (Consumer Price Index) records the percentage change in the cost of many goods, broken down by different sectors. Prices are also seasonally adjusted to provide a better direct comparison per quarter, and data is provided for a long time period.

Overall, these datasets (and many, many more) can help your business to obtain external insight, providing direction for your business. Integrating these datasets provides both technical and business challenges, but overall the outcome will benefit your business.


Automating tasks with data analytics


Automation can provide significant gains for your business. Even difficult tasks can be automated, with the effects impacting all industries including real estate, catering, manufacturing, or software development.

Data analytics can help to automate many tasks, specifically tasks that involve making decisions from past knowledge. In this sense, data analytics codifies existing knowledge that you or your business has, turning it into software that can make those decisions for you, faster and at a greater scale than previously possible. Recent advances in algorithms have meant that many businesses can now run with a fraction of the staffing resources they used to need. This automation frees staff from more monotonous work, allowing businesses to redirect those staff (who often have extensive experience in the industry sector) towards new opportunities for the business. This cycle of opportunity-development-automation allows businesses to grow and evolve over time while keeping staff engaged and active in the business. All the while, the automated business ventures keep working, earning more passive income for the business.

One example of a business benefiting from automation is used-car buying and selling. Thanks to automation, many online businesses have appeared that can take details about your car and offer a buying price based on internal information about different car types. Included in this model are adjustments for risk (i.e. cars that are more likely to have been in an accident), profit level (an adjustable parameter), and the time needed to pick up the car. The customer gets an instant price online, and then all the business needs to do is send someone out to actually collect the car - and even scheduling this can be automated! Insurance agencies also use this model to provide quotes for property, businesses and cars.

It is important to note that not all automation is the same. As business continues to leverage automation, it is the expertise that goes into building those automated processes that will help one business stand out from the rest. Your domain expertise, combined with our technical expertise can help automate tasks with a high accuracy and effectiveness. This domain expertise is also helpful for improving the model, so that it can learn from new scenarios, and also improve the quality of the decision in known scenarios.

Thinking about your business, what are some of the possibilities for automation? Here are some general attributes of tasks that indicate they can probably be automated:

  • Does it involve summarising large amounts of text?
  • Does the person doing the task need to perform the same action (physical or mental) many times?
  • Can the process be outlined clearly enough to give to a new staff member (including new staff members with experience in the industry)?
  • Is the task monotonous in nature? While it might seem an odd criterion, "boring tasks" can often be automated in this way.
  • Do you have a history of data, mapping decisions or actions made, to the outcome they produced?
  • Does it involve taking data from one source, filtering or altering it, and storing it somewhere else? Examples include electronic filing, measurement analysis, or parsing outputs from a machine.

If you answered yes to any of those, there is a good chance it can be automated. At dataPipeline, we are always looking for new ways to bring automation through data analytics to small businesses, helping them grow and achieve more. Recent examples include predicting staff requirements, reading large amounts of text to extract general trends, and integrating different data feeds into a single report.

You might have ideas about tasks in your business that could be automated. If so, we are interested to hear more, and can help take you from idea to implementation. Even if you just have a thought, we are happy to give initial guidance on whether it is possible, and what the plan could look like. Just send us an email with your name, contact details, and a short description of your need, and we will contact you with a plan of action.


The Internet of Things

What is the Internet of Things (IoT)?

The Internet of Things (IoT) is the next big wave of technology, after the current trend of mobile devices. It is growing very fast, and by 2025 it is predicted that most business sectors will be using it in some way or another.

What is this Internet of Things anyway? Well, it is really quite simple. The IoT is the connection of devices that talk to each other - light switches, doors, cars, even whole cities! It can be just about anything: you connect these things to the internet and can then gather data from, or perform actions with, these devices. The increase in data will be quite dramatic, but so too will the increase in possibilities arising from it. Dr. John Barrett explains the Internet of Things in his TED talk.

As an example, a company named Automatic has designed an adapter and an app for your smartphone to be used in the car. The adapter is plugged into your car and relays information about it back to the app on your phone - fuel usage and costs, distance driven, and even diagnostic information! If an engine light appears, the adapter sends a message via Bluetooth to your phone explaining what is wrong with the engine and what you might be able to do to fix the issue. Of course, this is only a small example of what can be done with the Internet of Things.

As the Internet of Things continues to grow, it will greatly benefit business. Let’s say you are a farmer with animals that need water. You could deploy sensors that monitor the temperature, humidity, soil moisture and so on. Each sensor pushes its readings to a central server, which can alert the farmer on whether they need to water today or not.
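A toy sketch of that server-side rule (the field name and threshold are invented for illustration):

```python
def needs_watering(readings, soil_moisture_min=30.0):
    """Decide whether to alert the farmer, given the sensor
    readings pushed to the central server (most recent last)."""
    latest = readings[-1]
    return latest["soil_moisture"] < soil_moisture_min

# Readings as the central server might have collected them.
log = [{"soil_moisture": 41.0}, {"soil_moisture": 27.5}]
print(needs_watering(log))  # the latest reading is too dry
```

Even a rule this simple, fed by cheap sensors, replaces a daily manual check.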

Sensors will be the main source of data for at least the next few years of the Internet of Things. With them, a huge number of new possibilities arise: traffic control, heart monitoring, appliances like fire alarms that contact emergency services themselves, and many more. All of this sounds great! However, it is not without its drawbacks. Security will be a major issue with the Internet of Things, and business owners will, now more than ever, want to make sure they have the best and most up-to-date security software. If everything is connected to the internet, then attackers have more access points into your systems, and can target databases, private documents and even camera footage. Many new internet-enabled systems have had such exploits, ranging from home monitors to even Barbies.

Despite these challenges, businesses will increasingly use IoT for connectivity, device-to-device communication and automation, and for improving their service offerings. With this comes a raft of possibilities arising from performing data analysis on this data.


A/B tests


A/B testing is too coarse

A/B tests are used by a large number of organisations to help them choose which website design to use, which sales script to use, or many other things. However, they are often used in place of getting to know your customers, which can deliver much more powerful results.

People run A/B tests to compare the effectiveness of two different websites, processes, or approaches to a task. As an example, if you run a sales funnel website, you might think of two approaches, and then run an A/B test to work out which one is more effective.

An actual A/B test is then performed, where visitors to the website are randomly shown one of the two options - that is, they are shown webpage “A” or webpage “B”. You then measure the conversion rate between the two options, to see which performs better.

To determine whether method “A” is actually better than “B”, we then statistically evaluate the results to obtain a probability that the result is accurate (well, technically, that the result isn’t accurate, but that is a discussion for another day). Click here for a good article explaining the reasoning behind the complexity of A/B testing.

A/B tests are a great way to start making data-driven decisions. However, they have quite a few issues, which are important to outline before you run experiments:

Continue until the end

You must decide ahead of time on the number of visitors you will run the experiment for, and not stop the experiment until that number is reached.

“Peeking” at the results mid-way through the experiment will ruin any statistical significance your test has. This is especially true if you decide to “stop” the experiment because statistical significance has been achieved. The reason is that, even if the statistical calculations give you a significant answer, stopping an experiment mid-way through drastically undermines a key assumption behind the statistics.

This website goes into great detail about these issues, explains why they occur at a technical level, and gives some examples. I highly recommend that those interested in the “why” behind the “don’t peek” rule read that page. Other sources for more information are this website, or this academic article if you really want the technical detail.

In summary, if you peek at the results, you are breaking a fundamental assumption behind performing an A/B test, especially if you decide to stop the experiment because “significance has been reached”.

    p-values don't work

    There are two major reasons why you shouldn’t be using p-values from a t-test anyway, which is what most A/B test packages will give you. First, “Significance”, in the statistical sense, does not mean “Significance” from a business sense. If you want to work out which version of your website you should go for, an A/B test may tell you that one is better, with a significance of less than 0.05. This is the “magic” value that most people use. However, it means that one in twenty experiments will have the wrong result, and it only tells you there is a difference that is unlikely to be the result of randomness (chance) in your data. This “0.05” value is just an arbitrary value people use, which has the benefit of regularly finding “significant” results while seeming like a tough threshold to reach. Further, it doesn’t work when you run multiple tests at the same time. Finally, significance in a statistical sense is just the idea that the two are different, not that the difference has a real impact.

    Second, most p-values are computed based on a t-test, which has a number of assumptions behind it that are usually ignored, but quite important to ensure that the result actually matches the expectation behind running the test.

    Instead, we recommend that you use Bayesian statistics, which has been shown to produce accurate results, without the assumptions that go into a t-test. This doesn’t magically solve all of the issues, but it does provide a more robust framework.
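    As a rough illustration of the Bayesian approach, the sketch below compares two variants using Beta posteriors over made-up conversion counts. The numbers, the uniform prior, and the variable names are all illustrative assumptions, not output from any particular A/B testing package.

```python
import random

random.seed(0)  # fixed seed so the simulation is reproducible

# Observed (hypothetical) data: conversions out of visitors per variant.
a_conv, a_n = 120, 1000
b_conv, b_n = 150, 1000

samples = 10000
b_wins = 0
for _ in range(samples):
    # Beta(successes + 1, failures + 1) is the posterior under a uniform prior.
    p_a = random.betavariate(a_conv + 1, a_n - a_conv + 1)
    p_b = random.betavariate(b_conv + 1, b_n - b_conv + 1)
    b_wins += p_b > p_a

print(f"P(B better than A) ≈ {b_wins / samples:.2f}")
```

    With these invented counts, the printed probability comes out well above 0.9 – a statement like “B is very likely better than A” is far more directly interpretable for a business decision than a p-value.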

    A/B testing is very coarse

    An A/B test really only checks how a particular version of your website (or process, etc.) performs against everyone who visits your site. A/B testing is a useful tool, but it should be used in the appropriate way: to decide between a small number of options (usually two, but this isn’t a requirement).

    What is more important, though, is learning who your customers are. Learn their problems, requirements, and environments, and use that information to deliver more targeted information, rather than relying solely on the results of an A/B test.

    Customer segmentation is one method to help businesses do this. It starts with breaking your customer group into sub-groups (also called segments, clusters, partitions, or many other terms). Then you target each of these sub-groups with a more specific website/design/sales copy/etc based on their specific needs.

    You can still do A/B tests!

    We actually do run A/B tests ourselves, and I encourage you to do the same – as an exploratory tool, not a decision-making tool. First, we think about problems that clients may face, then talk with them about potential solutions. From there, we develop a product or solution and run an A/B-style test if we can’t work out which solution will deliver better value. However, this test is only one piece of the puzzle: it is used as evidence to inform the decision, not to make the decision itself.

    We are really interested in hearing stories about any difficulties you have faced with A/B tests. Send an email to [email protected] and let us know the greatest challenge you face in using statistical tests in your organisation.


    Start-up Analytics: First Steps


    You have a great idea for a new small business or start-up, and you are gathering quite a bit of interest from the people around you. Next, what sort of data should you be collecting?

    The answer, of course, depends on the business’ structure, products, and goals. However, there is some common ground which will apply to most businesses.

    Customer Acquisition

    The first major set of metrics relates to customer acquisition. You’ll probably have the following questions:

    1. Where do my customers come from?
    2. How did they find me?
    3. Which channels are bringing me the most bang-for-the-buck?

    The easiest way to answer these questions, at least at the start, is to just ask. Ring up your current customers, send them an email, or attach a survey to your invoice/receipt. While this may seem like an approach that doesn’t scale, many large companies still do this, and it works very well.

    One thing to keep in mind with customer-focused surveys is length. I’ve found that many customers get quite annoyed by two common mistakes companies make with their surveys.

    The first is a huge number of questions: pages of Likert scales, and lots and lots of questions. This indicates a poorly formed set of questions, where the scope of the survey wasn’t restricted and the “let’s just ask them everything” approach was taken. Keep it small, keep it simple, and limit yourself to about 5 questions – never, ever more than 10 – for a business-to-customer survey (this applies even if your customers are businesses). Having too many questions will cause low response rates, and you’ll miss important feedback: the people who are frustrated with your business will often not bother to fill out your 20-question survey, yet these are often the very people you want to hear from. (Oh, and by the way, a single question with 5 sub-questions counts as 5 questions!)

    I should note that academic studies are often an exception here: a properly formulated research question may need lots of control questions and so on, leading to a large number of questions overall.

    The second mistake I’ve seen in business surveys is the “and just one more thing” approach. This is related to the above, but aims to trick the user into filling out more information. The customer fills out the one-page survey, submits, and then… gets asked to fill out a few more questions. After all that, they submit, and then… yet another page appears. This again is poor survey design, leading to a poor user experience. Think about the reliability of the answers to those last few questions, when you have been annoying the customer for the previous 15 minutes with “oh, and another thing”.

    Business Progress

    The second set of metrics relates more internally – how your business is going: What are the expenses? How much income is coming in?

    This used to be quite hard to keep track of and visualise, but modern accounting packages are making it easier and easier. I can now bring up my Xero page and see who has paid, how much I can expect to come in next week, and what my overall expenses are, and then drill down to see the detail of each. Software like this saves a lot of time otherwise spent “guessing” the overall status of your company, and a lot of expense in chasing up the details.

    Accurate bookkeeping can seem like a tedious affair, but can make the world of difference when looking back and projecting forward. This, of course, is true for most record-keeping, but it is critical for businesses to thrive.

    In addition to the legal requirements though, there are many types of reports that you can do to gain better insight into your business. One great example is a cohort analysis, where customers are broken up into different groups (such as “month they signed-up” or demographic), and reported on, based on their spending and engagement with your company. This often gains better insight than looking at your customers as one big group. For instance, if the customers you gained last year no longer use your product, you have a retention problem, even if your overall number of users is going up.
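    A minimal sketch of this kind of cohort report, using invented sign-up data, might look like the following; a real analysis would pull these records from your accounting or CRM system rather than hard-coding them.

```python
from collections import defaultdict

# Hypothetical customer records: sign-up month and whether they are still active.
customers = [
    {"signed_up": "2015-01", "active": True},
    {"signed_up": "2015-01", "active": False},
    {"signed_up": "2015-02", "active": True},
    {"signed_up": "2015-02", "active": True},
    {"signed_up": "2015-02", "active": False},
]

# Group customers into cohorts by the month they signed up.
cohorts = defaultdict(lambda: {"total": 0, "active": 0})
for c in customers:
    cohorts[c["signed_up"]]["total"] += 1
    cohorts[c["signed_up"]]["active"] += c["active"]

# Report retention per cohort rather than one overall number.
for month, stats in sorted(cohorts.items()):
    retention = stats["active"] / stats["total"]
    print(f"{month}: {retention:.0%} retained")
```

    The same grouping idea extends to spending per cohort, which is how you spot a retention problem that an overall growth number would hide.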

    In many cases, the results will match your intuition about how the business is going. After all, it is your business, and you know how it is run. However, as your business scales, it becomes harder to keep track yourself, and more important to use summarisation reports to get a sense of where to go.

    Finally, it is really important to remember that data analytics is only as good as the people using it. Good analytics is driven by expertise in algorithms, expertise in application, and expertise in delivery. However, the most important person is the end-user, the client, who will not only be using the end results, but also asking the right questions early to guide the analysis.

    At dataPipeline, we can definitely help with choosing and applying the right algorithms, but we pride ourselves on integrating your expertise into the process. Let us be the technique experts, while you remain the domain expert. If you would like to work together on a project, feel free to contact us anytime, or call on 0430 013 554.


    A Small Cafe Predicting Future Demand


    Case Study

    Based in Ballarat, a small cafe is looking to have a bit more control over their operations. The cafe is alternating between needing 2 or 3 staff on each day, depending on the number of customers that are coming in. On days where they get this wrong, they are either too busy to keep up with demand, or paying for a staff member they don’t need, hurting profitability.

    After an initial consultation to discuss the requirements, we listed some possible reasons for the variability:

    • Weather – the cafe generally does better in warmer weather, both seasonally and on specific days, based on factors like rain.
    • Day of the week – some days are generally better than others, although this trend is not always reliable.
    • Major events in the region – the cafe is busiest when there is a local event, while non-local events can reduce the number of customers.
    • Unknown causes – sometimes large groups suddenly come in; these are difficult to manage properly and hard to predict.
    • School holidays and long weekends – different public and school holidays affect demand at the cafe differently.

    In addition, we discussed the impact on the amount of work needed. The total sales for a day did not always accurately reflect the amount of work required. Instead, the number of customers was a better indicator of overall work, based on the data we already had available. We also discussed that some customers were harder to serve than others: while some came in for off-the-shelf goods, others wanted coffee and cooked meals, which require more work.

    After taking the available data, where we used number of transactions to approximate the number of customers, we were able to come up with a predictive model that estimates the number of staff needed on a given day. The model takes into account many of the possible reasons above, and we set up an alert service where the estimates for the next two weeks are predicted each week, and emailed to the client. This lets the cafe owner plan ahead, and let staff know as early as possible what the roster is like. The predictive model uses data from the business and external sources, such as weather and holiday information.
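    The client’s actual model isn’t shown here, but a purely illustrative sketch of how such a prediction might combine the factors above – with an invented baseline and invented coefficients – could look like this:

```python
def predict_customers(max_temp_c, is_weekend, local_event, school_holiday):
    """Estimate the number of customers for a day (illustrative numbers only)."""
    estimate = 80.0                      # baseline customers per day
    estimate += 2.0 * (max_temp_c - 15)  # warmer days are busier
    estimate += 30.0 if is_weekend else 0.0
    estimate += 50.0 if local_event else 0.0
    estimate += 15.0 if school_holiday else 0.0
    return max(estimate, 0.0)

def staff_needed(expected_customers, customers_per_staff=60):
    # Round up: each block of customers needs another staff member,
    # with a minimum of 2 staff to open at all.
    return max(2, -(-int(expected_customers) // customers_per_staff))

customers = predict_customers(max_temp_c=24, is_weekend=True,
                              local_event=False, school_holiday=False)
print(customers, staff_needed(customers))  # 128.0 3
```

    In practice, the coefficients would be fitted from the historical transaction data rather than hand-picked, and the weather inputs would come from a forecast feed.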

    Evaluation of the model was performed by taking the previous roster and comparing the actual versus predicted staff numbers for those days. Comparing the results, the predictive model we came up with saved around 20% of costs through better staff management, while reducing the number of days where demand was greater than staff. The predictive model also has the capability to scale as the business grows.

    We also set some goals for collecting more data for the next year, including collecting more specific data, such as the items ordered. This would allow us to predict demand for each product type, and get a better estimate of the overall amount of work. In addition, we can start surveying some customers to find out if there are other factors that impact on demand throughout the year that we are unaware of. This surveying can be formal (by handing out paper-based surveys or providing a link to an online survey), or informal, where the staff simply ask and make a note of the answer.


    Learning Data Mining with Python


    Recently, I’ve had my new book, Learning Data Mining with Python, published. The book is an introduction to data mining for people who already have the basics of programming. Little time is given to the details of how code works, allowing more time for actually learning the algorithms and how they work.

    In the book, I cover prediction, classification, affinity analysis, clustering, and lots of other algorithm types. Each algorithm is matched with a real-world example of how to use it, along with some examples of where else the concepts could be applied. Each chapter has its own contained code sample, meaning that by the end of the book, you’ll have twelve different data mining projects up and running.

    The book is available on Amazon and directly from the publisher Packt. If you want to have a peek inside first, you can read some of it at Google Books.

    The chapter titles are:

    • Chapter 1: Getting Started with Data Mining
    • Chapter 2: Classifying with scikit-learn Estimators
    • Chapter 3: Predicting Sports Winners with Decision Trees
    • Chapter 4: Recommending Movies Using Affinity Analysis
    • Chapter 5: Extracting Features with Transformers
    • Chapter 6: Social Media Insight Using Naive Bayes
    • Chapter 7: Discovering Accounts to Follow Using Graph Mining
    • Chapter 8: Beating CAPTCHAs with Neural Networks
    • Chapter 9: Authorship Attribution
    • Chapter 10: Clustering News Articles
    • Chapter 11: Classifying Objects in Images Using Deep Learning
    • Chapter 12: Working with Big Data
    • Appendix A: Next Steps…


    Summary of PyCon AU 2015


    I recently attended PyCon AU 2015, Australia’s main Python conference (that’s Python-the-programming-language). The conference was a great success, with lots of great talks, and motivated me significantly to get started on some new projects. Below are just some of the interesting talks I attended, and the outcomes from each. Most of the talks will be available online in the near future.

    Data Science Miniconf

    Friday 31st July

    Custom Python Applications in Neuroscience

    In this talk, Simon Salinas and Sharma Gagan spoke about their use of Python in developing packages for viewing brain scans. Their project was highly successful, with great outcomes after only a small amount of coding, showing the power of the Python programming language. Their code was able to view images in the widely used DICOM format (common in the medical industry).

    Adventures in scikit-learn’s Random Forest

    Greg Saunders spoke about random forests, a powerful data mining algorithm that requires little tuning to get good results on most datasets. He spoke about what they are, and how to use them. The explanation was a great summary of what can be a complex algorithm. The talk also contained a great explanation of how to do some data preprocessing and management to get the data into a nice, easy-to-use format.

    Not Invented Here: Porting Scientific Software to Python

    Getting code from academia to work can be a pain, and all too often there is a mentality to just rewrite it yourself. In this talk, Andrew Walker spoke about the dangers of doing that, and some other options for interoperability, i.e. getting code from another language working in Python. This talk was a great overview of the different techniques, and I’m going to be looking into a couple for my own projects.

    An End-to-End Machine Learning Ecosystem in a Quarter

    Chris Hausler from Zendesk spoke about how they (almost) built an enterprise-grade system for machine learning from their vast amount of customer data in just a quarter (of a year). The talk had lots of insights about the trade-offs needed to build a system well, but also quickly. Their system already does a great job in working out if customers will have a good service experience, and has great scope for improvement (many of which Chris and his team are working on already).

    Predicting sports winners using data analytics with pandas and scikit-learn

    This was my talk, in which I went through the material in chapter 3 of my new book (!) on how to predict the winner of an NBA match with Python’s pandas and scikit-learn libraries. The talk was well received, and I got lots of positive feedback. After the talk, I received some help from Sean Malloy which allowed me to integrate betting odds into the computation. My system doesn’t make a profit (yet), but the loss is only small, especially compared to lesser systems.

    PyCon Day 1

    Saturday 1st August

    Designed for education: a Python solution

    In this talk, Carrie Anne Philbin spoke about how to teach Python, particularly to kids. Some of the problems were not so obvious, like having a text editor that is simple to use, rather than a full-featured version. The talk was enjoyable and highlighted some of the assumptions technical people make about how teaching should happen, rather than how it does happen.

    Docker + Python

    In this talk, Tim Butler outlined Docker, and how to use it with a Python system. The talk was great. While I have been using Docker for a while now, I never felt I was doing it right, but now I have some solid strategies for improving my usage of it.

    Python for less than $7

    My favourite talk of the conference was Graeme Cross’ overview of installing Python on microcontrollers that cost less than $15. The talk was quite in-depth yet easy to follow, and had a good discussion of the advantages and disadvantages of each option. I’m currently investigating hardware programming of this type through the Ballarat Hackerspace, and I was madly making notes of nearly everything Graeme said to digest later!

    PyCon Day 2

    Sunday 2nd August

    Consequences of an Insightful Algorithm

    In this talk, Carina C. Zona outlined the dangers of applying machine learning in situations where people are involved. The take home message was to be aware of the limitations of the algorithms you employ in your organisation, what the impact is expected to be, and what the impact may be in error situations. While I felt the language used was a bit of “blame the algorithm”, I don’t think this was the intent - it is the people using the algorithms that need to be aware of how they are using them, and exactly what they are doing. For example, when trying to work out the memorable times in a person’s year, a big social media site inadvertently showed very negative and painful memories to some people. While these were the most impactful, they probably weren’t the ones that the person most wanted to remember.

    Learn you a Flask

    I loved this talk as an introduction to Flask, a lightweight library for building web services and websites. Lachlan Blackhall overviewed normal usage of Flask, going beyond the very introductory usage you find in most tutorials. I’ll be using the lessons learnt here quite regularly in upcoming projects.

    Easy wins with Cython: fast and multi-core

    While Python has a reputation for being a slow language to run, this usually isn’t true, and Caleb Hattingh outlined how to use Cython to speed it up anyway. Very little work was needed to get Cython working on some code, and the speed-up gains were over 8000% (I’m not joking!). For code that has lots of computation to do, this may be one of the better options, instead of moving to another language.


    Discovering trends in text using topic modelling


    You have developed a survey, got hundreds of people to fill it out, and have started your analysis of the results. You find that most people ticked “Agree” or “Strongly Agree” to many of your questions, and you have some numerical averages, like people visiting your store 3.6 times on average each month. But how do you get useful information from the other field – “Additional Comments”?

    Many surveys ask people to provide additional comments. In addition, many other ways of collecting information involve text-based data. It is hard to analyse effectively, but that doesn’t mean you should ignore it. The data here can provide insight into what your customers or clients know that you don’t – incredibly useful information!

    Luckily, there are lots of ways to analyse this data, although you won’t find these options in Excel. The most straightforward way is to manually read the responses yourself and take notes on the themes that you observe. There are even formal methods for doing this, such as Grounded Theory, which provides some rigour to this type of analysis. However, it is time consuming, and there are ways to automate the analysis, giving you a good view of the data without having to read hundreds of responses.

    A great option is to perform a word count. To do this, we take each document, split the text into words, and then count the frequency of each word. There are some nuances though, so a little processing is involved:

    1. Remove common words, like “the”, “is” and other similar words (known as stop words).
    2. Find word “stems”, so that “read” and “reading” are counted together, rather than as separate words.

    With these fixes applied, a word count over your responses becomes far more meaningful. Note that because words are stemmed, words like “thinking” will be counted as the word “think”.
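    A minimal sketch of such a word counter is shown below. The stop-word list is tiny and the suffix-stripping stemmer is deliberately crude – a real analysis would use a fuller stop-word list and a proper stemmer, such as NLTK’s PorterStemmer.

```python
import re
from collections import Counter

# A small illustrative stop-word list; real analyses use much larger ones.
STOP_WORDS = {"the", "is", "a", "an", "and", "of", "to", "in", "it", "that", "i"}

def crude_stem(word):
    # Very rough suffix stripping, standing in for a proper stemming algorithm.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def word_count(document):
    # Lower-case, split into words, drop stop words, then count stems.
    words = re.findall(r"[a-z']+", document.lower())
    stems = [crude_stem(w) for w in words if w not in STOP_WORDS]
    return Counter(stems)

counts = word_count("Reading is fun. I read about reading.")
print(counts.most_common(3))  # [('read', 3), ('fun', 1), ('about', 1)]
```

    Notice how “reading” and “read” end up counted together under the stem “read”, while “is” and “I” are dropped entirely.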


    From here, you can put these counts into a word cloud builder to create a nice visualisation.


    There are also much more complex ways to extract more information:

    • Merge frequently occurring word pairs, giving you the difference between “artificial intelligence” and “artificial” - quite a difference to your analysis!
    • Flagging odd responses for manual analysis. For example, someone might put random text in, which can create odd word counts. Identifying and removing this will improve the overall analysis.
    • Semantic merging, where words that have similar meaning are combined. For instance, when asked about the weather, some people might say “rainy” and others “wet”, but they really mean the same thing, and could be counted together.
    • Fix misspellings, so that you don’t lose word frequencies (particularly for harder to spell words).
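    As a small illustration of the first idea, merging frequently occurring word pairs can be sketched by counting bigrams (adjacent word pairs); the threshold of two occurrences here is an arbitrary choice for the example.

```python
from collections import Counter

# A tiny made-up document, already split into words.
words = "artificial intelligence will change artificial intelligence research".split()

# Count adjacent word pairs (bigrams).
bigrams = Counter(zip(words, words[1:]))

# Treat any pair seen at least twice as a single merged term.
merged = {" ".join(pair) for pair, n in bigrams.items() if n >= 2}
print(merged)  # {'artificial intelligence'}
```

    With the pair merged, “artificial intelligence” is counted as its own term rather than inflating the counts of “artificial” and “intelligence” separately.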

    These types of analysis can also be combined with a topic analysis, where lower-level word meanings are combined into overall topics. For example, if we were analysing news articles, we might wish to split them into “sports” or “world news” articles. We could even combine it with last week’s blog post on sentiment analysis – see if the “additional comments” in your survey are positive or negative!

    Overall, text-based data can provide huge insight into your business, but it does take a bit more work to extract properly. If you need any assistance, we can help.


    Analysing social media: Getting Started


    Social media has taken the world by storm, and businesses are often eager to integrate it into their business. Analysing how a business is viewed on social media is an important task, and one that can be quite difficult to keep up with. Social media analysis involves searching for mentions of your business or sector, analysing the messages, and even resolving complaints online. For many businesses, this has been thrust upon them - even if they didn’t want a social media presence, they do need to be aware of what people are saying about them. Even small businesses can use data in big ways, and social media analysis is a great way to get started.

    There are many businesses out there that can help with your social media presence. The most common form of assistance is to have someone run your social media for you - creating new content, responding to messages, and generally looking out for your business online. It can be hard work doing this type of analysis manually, and a little automation can go a long way. Many customers take to social media to voice their approval or complaints about different businesses. Some of these will be positive mentions, while some will be disgruntled customers who had a bad experience. If you don’t work quickly to reply to poor social media mentions, the negative impact can reach many potential customers before you have had a chance to respond to it!

    There are many ways to automate your social media analysis. The first step, though, is to find mentions of your business (or broader sector) on these social media websites. You can do this manually – head to Twitter, search for your business name, and see what comes up. Regular searches like this can help you keep on top of social media, but most businesses are only mentioned rarely online. This means a lot of manual searches that yield no results, and lots of wasted effort.

    In addition, there are lots of sites out there, such as Twitter, Facebook, Yelp, Google Places and so on (see here for more!). Even if you have accounts on all these sites, it can take a long time to search each, and to do those searches regularly.

    This is where automation comes in.

    Many of these websites have an API, an Application Programming Interface, which allows programs to interact with the website to collect and process data. This allows us to write a program that can automatically perform searches, such as looking for mentions of your business on each of these websites. Best of all, it can be set to run at all hours of the day, so you never miss a tweet! We can then write a program that records these mentions, performs some analysis (see below!), and sends you an email digest every day. Or every hour. Or even in real-time as they occur.
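    As a sketch of the shape such a program takes: the fetch_mentions function below is a hypothetical stand-in for a real API client (each site’s official API has its own authentication and query format), and simply returns canned data here.

```python
# Hypothetical stand-in for a real social media API client.
def fetch_mentions(term):
    return [
        {"id": 1, "text": "Loving the coffee at Example Cafe!"},
        {"id": 2, "text": "Example Cafe was too slow today."},
    ]

seen_ids = set()

def new_mentions(term):
    # Keep only mentions we haven't already reported in a previous run.
    fresh = [m for m in fetch_mentions(term) if m["id"] not in seen_ids]
    seen_ids.update(m["id"] for m in fresh)
    return fresh

def build_digest(mentions):
    # Format the mentions as a plain-text digest, ready to email.
    lines = [f"- {m['text']}" for m in mentions]
    return "Mentions found:\n" + "\n".join(lines) if lines else "No new mentions."

digest = build_digest(new_mentions("Example Cafe"))
print(digest)
```

    Scheduled to run hourly (or continuously), this loop is what turns a dozen manual searches a day into a single digest email.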

    We can also take it a step further, to see if these mentions are talking positively or negatively about the business. It can be helpful to respond to positive mentions online, but often not time-critical. Negative mentions often are more time critical though, and should probably be looked at with more urgency.

    Determining whether a mention is positive or negative is called sentiment analysis. To do this, we look at the words in the text, and see which ones denote positive or negative emotions. The distribution of these words helps to determine if the overall text is positive or negative. This is a relatively new area of data mining, but one that is being increasingly used in analyses of this type.
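    A minimal sketch of this word-based approach is shown below; the tiny word lists are invented for illustration, standing in for the large curated lexicons (or trained models) that real sentiment systems use.

```python
# Illustrative word lists only; real systems use large curated lexicons.
POSITIVE = {"good", "great", "love", "excellent", "happy", "fast"}
NEGATIVE = {"bad", "terrible", "hate", "slow", "awful", "broken"}

def sentiment(message):
    # Score each word: +1 for positive words, -1 for negative words.
    words = message.lower().split()
    score = sum(w.strip(".,!?") in POSITIVE for w in words) \
          - sum(w.strip(".,!?") in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(sentiment("Great coffee, I love this place!"))  # positive
print(sentiment("Terrible service, so slow."))        # negative
```

    Counting words this way is crude – it misses negation (“not great”) and sarcasm – which is why trained models do better, but it captures the core idea of scoring the distribution of emotional words.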


    I’ve put a basic sentiment analysis model online – type in a message, and see if the program correctly guesses whether it has positive or negative sentiment. This particular model is about 75% accurate, and is very fast – it can analyse thousands of messages a second. Other models are available that are a bit slower, but more accurate. Depending on the website, automated sentiment analysis techniques like this can reach over 90% accuracy.

    Putting these together, we have a system that:

    1. Automatically looks for social media mentions,
    2. Determines if they are talking positively or negatively about the business, and
    3. Alerts you when they are found.

    Now, you can stay up to date with your social media presence, without the hassle of having to log into more than a dozen sites each day.

    Next week, I’ll talk about how to find trending topics online, getting a quick overview of what is happening in the news, on social media, or from your in-house documents.


    Small Business Using Data Analytics in a Big Way


    Data analytics is being increasingly used by major organisations to attract new customers, serve existing customers more efficiently and improve overall performance. If you own or manage a small business, you probably already collect plenty of data that can be used to answer questions you may have about your business.

    Data analytics involves questioning your business, or its environment, and using data to drive action. Small businesses use data too, and can benefit from improving their analysis of it. Almost every business tracks the amount of time spent on each project, and tracks its expenses to see where money is being spent.

    As an example, customer segmentation is a powerful data analysis tool that gives business owners a way to manage different groups of customers. By analysing transaction data, we split all of the customers into like-minded groups. From here, you can create services that best target each segment and then send more targeted advertising.
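    As a toy illustration, the sketch below splits invented customers into segments using simple hand-picked rules; in practice, we would typically run a clustering algorithm (such as k-means) over the transaction data rather than fixed thresholds.

```python
from collections import defaultdict

# Hypothetical customer summaries derived from transaction data.
customers = [
    {"name": "A", "visits_per_month": 8, "avg_spend": 40.0},
    {"name": "B", "visits_per_month": 1, "avg_spend": 150.0},
    {"name": "C", "visits_per_month": 6, "avg_spend": 35.0},
    {"name": "D", "visits_per_month": 1, "avg_spend": 20.0},
]

def segment(customer):
    # Hand-picked thresholds, purely for illustration.
    frequent = customer["visits_per_month"] >= 4
    big_spender = customer["avg_spend"] >= 100
    if frequent:
        return "regulars"
    return "big-ticket" if big_spender else "occasional"

segments = defaultdict(list)
for c in customers:
    segments[segment(c)].append(c["name"])
print(dict(segments))  # {'regulars': ['A', 'C'], 'big-ticket': ['B'], 'occasional': ['D']}
```

    Once each customer has a segment label, you can target the “regulars” with loyalty offers and the “big-ticket” group with premium services, rather than sending everyone the same message.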

    Grouping customer segments

    Analysing this data can yield useful insights, but so too can collecting data of other types. For instance, determining where your customers come from and how they find your business can help target future marketing opportunities. Tracking hours worked on projects can assist in predicting the time spent on future projects – and help determine if you need to hire a new employee to help with the workload.

    Analysis solutions come in two major types:

    • The first is the one-off report where a question is asked, data is collected and a report written that addresses the question. An example might be to analyse the current state of the market, with a view to determining whether prices will rise or fall next year.
    • The second major type of analysis solution is ongoing tracking, whereby data is collected on an ongoing basis to track something of interest. For example, you could track the time that an item is stored in your warehouse before being shipped or sold. Tracking this metric over time will give you up-to-the-minute information on how well you are utilising your space. Interventions, like having a sale on slow-moving items, can be tracked too, to see if they have a major impact on performance.

    Analytics can also help automate complex tasks. For instance, if you survey your users, you can use text mining to automatically discover topics people are talking about. Likewise, you can also track your social media pages and automatically flag any complaint to make sure you can respond quickly. I’ll cover each of these two types of analysis, and more, in upcoming posts.

    The key to finding good analytical opportunities is not to think about the data you do (or don’t) have, but the pressing questions that drive your business. In this way, data analytics doesn’t replace a manager or director of a business – it can’t – what it can do though, is to help them answer questions and to find new questions to ask. Starting with the question first allows you to determine what data you should be collecting, and how to analyse that data to give you the answers you need.
