Authentication methods have gone through huge development in the past few years and I believe that the spread of IoT devices will bring further advancement. Currently, however, the most widely used authentication method on the Internet is password-based authentication and we know that this method is far from being ideal – at least in the way many people use it.
Let’s imagine a website on which people can create accounts. The web application stores a non-decryptable hash (or more precisely the output of a hash function) of the password given by the user.
A hash functions assign a value to its input, which is currently the password. Additionally, it should not assign the same value for two different passwords or at least it must minimize the probability of this to happen. Also, there must not be a function which can be used to get the password back from the value, created by the hash function, hence the value is non-decryptable. One of the commonly used hash functions is MD5, which assigns a 128 bit long number (usually written using hexadecimal digits) to our passwords.
Let’s assume that someone finds a vulnerability in our imaginary web application which allows the attacker to access to the database where these hashes are stored. In order to find the password of a user, the attacker must find a string that creates the same hash as the one that can be found in the database. To achieve this, he/she can use a dictionary, which may contain commonly used passwords, the most common words of a language, etc. If this fails, brute-force attack can be used: the attacker can try every combination of the characters which are believed to be used in the password. The problem is that humans are usually quite predictable when it comes to choosing passwords. We rarely use a password which is much more complex than what is required by the policy of the site. Also, we are addicted to patterns, words and we can hardly remember truly random passwords.
To make the cracking of passwords harder, websites suggest, and in many cases even require the users to have more complex login credentials. To be considered strong, a password needs to be at least 8 characters long, it should contain at least one lowercase letter, one uppercase letter, one digit and one special character. One may think that these restrictions will force the users to create more random passwords. We decided to check if this is true and see if we can find any patterns which can be used by an attacker to make brute-force attacks more effective.
For our experiment, we used the human-only password list of CrackStation. From the password list, we use 8 characters long passwords which contains at least one lowercase letter, one uppercase letter, one digit and one special character. We get a list of 17152 passwords, which we divided into two parts. The first part, which contains 12864 passwords is used for our analysis, the second part, which contains 4288 passwords is used to validate our results.
We have created a list of 12864 strings which represents ideal passwords and will be referenced as our control set. Every string in this list is 8 characters long, each character is randomly chosen from the [a-zA-Z0-9,.?:_)(=/%!+”‘] set and all of the the strings contains at least one lowercase letter, one uppercase letter, one digit and one special character.
Number of passwords grouped by their length
Based on the whole dataset, most of the people chose to use 8 characters long passwords. The possible reason for this is that 8 character is the minimal length of passwords on several sites. As you can see on the diagram above, 9 and 10 characters long passwords are also common.
Distribution of characters
In order to make brute-force attacks more effective, an attacker may use the information about how frequently each character occurs in passwords: trying more common characters first can lead to a faster result, than trying them in alphabetical order.
The graph below shows a function similar to the cumulative distribution function, which is commonly used in probability theory to describe distributions. First sort the characters in descending order by the number of their occurrence, then let’s choose a character randomly from our dataset, and start to iterate trough this sorted list of characters. For each iteration, the function shows the chance whether the character chosen from the data set has already been iterated or not. The black curve shows the function for the distribution gained from the dataset, the blue curve is the same function for the distribution from the control set.
Note how much steeper the curve of the dataset is. Only by trying the first 20 most common of the 120 characters, we have more than 54% chance to have any randomly chosen character matched, whereas in the control set of ideal passwords, this probability is only 33%. Also, the control set is made up of 76 different characters; using more characters would make this probability even less.
To find out whether the knowledge of the language of the user helps creating a good order for the characters to use in a brute-force attack, we have also compared the distribution of the characters in the dataset with an English text, namely The Hitch Hiker’s Guide to the Galaxy. The total variation distance between the distribution of the dataset the Hitch Hiker’s Guide is 0.148. While this is much less than the distance between the control set and the book (0.349), it is still more than what we have expected for two texts with the same language. For example, the distance between The Hitch Hiker’s Guide to the Galaxy and Animal Farm is only 0.038. We won’t panic however – if the dataset represents the passwords used on the Internet well, it can be used to make brute-force attacks more effective.
Below you can see the 20 most common characters from each character class of the dataset and their probability of occurrence within the character class.
To be continued…
To make it easier to read, we decided to divide our post into two parts. In the second part, we will try to find out if there are patterns in the way people use and arrange the characters from the four character classes. We will make assumptions on how these patterns can affect the effectiveness of brute-force attacks and share our thoughts about the possible solutions of the problem for both users and providers.