Can crime be prevented?

Difficult as it may seem, the answer to the question in the title is: in many cases, yes. We live in the age of information and data, and by learning from data we can gain insight into virtually any phenomenon, crime included.

Detecting and preventing crime is one of the great challenges law enforcement agencies face in making citizens’ lives safer. In this article we show how we at Human Trends can contribute to the detection and prevention of crime.

Association rules

Association rules are among the most intuitive and interpretable techniques in machine learning. An association rule is an implication of the form “if X then Y”:

X ⇒ Y

where X and Y are disjoint sets of items.

Let’s take an example. Imagine that a supermarket gives us access to its database, in which all customer purchases are recorded, and asks us to find frequent purchase patterns, such as which items are usually bought together. Applying association rules, one of the rules we might obtain is

{onions, vegetables} ⇒ {meat}

This rule would indicate that a customer who buys onions and vegetables together is likely to buy meat as well. This information can be used as a basis for decision making. In this particular case, the rules could inform marketing decisions, such as promotional prices for certain products or where to place them within the supermarket.
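How likely “likely” is can be quantified with the two standard measures of an association rule: support (how often all the items appear together) and confidence (how often Y appears when X does). The following Python sketch computes both for the supermarket rule. The baskets are invented purely for illustration, and the function names are our own choice.

```python
# Toy baskets, invented for illustration only (not real supermarket data).
transactions = [
    {"onions", "vegetables", "meat"},
    {"onions", "vegetables", "meat", "bread"},
    {"onions", "vegetables"},
    {"meat", "bread"},
    {"vegetables", "milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent): support(X and Y) / support(X)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

rule_support = support({"onions", "vegetables", "meat"}, transactions)
rule_confidence = confidence({"onions", "vegetables"}, {"meat"}, transactions)
print(f"support = {rule_support:.2f}, confidence = {rule_confidence:.2f}")
```

In this toy data, two of the five baskets contain all three items, and two of the three baskets with onions and vegetables also contain meat.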

Association rules and crime prevention

Association rules are a tool with enormous potential. They yield precise, intuitive results, a fundamental requirement if we want to understand what is happening.

If, instead of a supermarket database, we had access to a database of the crimes that occurred in some city, state, etc., we could apply association rules to find frequent patterns in these crimes and learn from those patterns in order to prevent crime.

The following is a practical example of how applying association rules can provide useful information about the crimes committed.

Crime Detection and Prevention in San Francisco

Frequent item sets

The data set we have available contains information collected by different San Francisco police departments about crimes that occurred in San Francisco during 2016. The objective of this article is to find useful association rules that provide non-trivial knowledge and reveal how the available variables relate to the crimes that occurred.

The first step is an exploratory data analysis, together with all the pre-processing needed to make the data representative and suitable for generating association rules¹. Once this pre-processing is done, we can start to obtain interesting information by analyzing the data.

If we plot a histogram of the items in the data set, we get²:

Figure 1. Histogram of the elements present in the data set.

The variables fall into three groups: those describing the type of crime, those describing when the crime occurred, and those describing where it occurred.

The following information can be extracted from the graph. The item Time = Late is the most frequent, followed by Category = Larceny/theft. For the variable Descript, the most frequent crime is grand theft from locked car. The days of the week are evenly distributed, with Friday and Saturday slightly more frequent than the rest. Among the districts, the Southern District stands out as the most frequent. The fourth week of the month is somewhat less frequent than the other three. The seasons of the year (derived from the month) are distributed almost equally, with autumn being the most frequent.
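As a rough illustration of how the item frequencies behind a plot like Figure 1 can be computed, the following Python sketch turns each record into a set of “Variable = value” items and keeps only the items above the 0.05 cut-off mentioned in footnote 2. The records, the field name District and the value labels are invented for the example; they are not taken from the real data set, and the actual pre-processing is not described in this article.

```python
from collections import Counter

# Hypothetical records; field names and values are assumptions for illustration.
records = [
    {"Category": "Larceny/theft", "Time": "Late", "District": "Southern"},
    {"Category": "Larceny/theft", "Time": "Late", "District": "Mission"},
    {"Category": "Assault", "Time": "Night", "District": "Southern"},
    {"Category": "Larceny/theft", "Time": "Morning", "District": "Southern"},
]

def to_transaction(record):
    """Encode a record as a set of 'Variable = value' items."""
    return {f"{name} = {value}" for name, value in record.items()}

def item_frequencies(transactions, min_frequency=0.05):
    """Relative frequency of each item, keeping those above `min_frequency`."""
    counts = Counter(item for t in transactions for item in t)
    n = len(transactions)
    return {
        item: count / n
        for item, count in counts.most_common()
        if count / n > min_frequency
    }

transactions = [to_transaction(r) for r in records]
frequencies = item_frequencies(transactions)
for item, freq in frequencies.items():
    print(f"{item}: {freq:.2f}")
```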

If we use an appropriate algorithm for this problem (unsupervised learning), we can efficiently obtain the frequent item sets. If we look at some of them (the first 10, for example), we get the following:

Figure 2. The first 10 frequent itemsets.
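The article does not name the algorithm used, so the following is only a minimal sketch of one classic option, an Apriori-style level-wise search, written in plain Python. The transactions and the 0.5 minimum support are invented for illustration, and a real data set of this size would normally call for an optimized library implementation.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Apriori-style search: return {itemset: support} with support >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    # Level 1: frequent single items.
    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items}
    current = {s for s in current if support(s) >= min_support}

    result = {}
    k = 1
    while current:
        for s in current:
            result[s] = support(s)
        k += 1
        # Join step: unions of two frequent (k-1)-itemsets that have exactly k items.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        current = {c for c in candidates if support(c) >= min_support}
    return result

# Hypothetical transactions in the "Variable = value" style used in this article.
transactions = [
    {"Time = Late", "Category = Larceny/theft", "District = Southern"},
    {"Time = Late", "Category = Larceny/theft"},
    {"Time = Late", "District = Southern"},
    {"Category = Assault", "District = Mission"},
]
itemsets = frequent_itemsets(transactions, min_support=0.5)
for itemset, sup in sorted(itemsets.items(), key=lambda kv: -kv[1]):
    print(sorted(itemset), sup)
```

The prune step is what keeps the search efficient: an itemset can only be frequent if all of its subsets are frequent, so most combinations are never counted.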

Let’s look, as an example, at the first frequent item set. The most notable fact is that crimes occur most often in the afternoon. The “afternoon” interval has been defined, in this case, as running from 16:00 to 20:00. If we look at the distribution of hours, we get the following:

Figure 3. Histogram of the hours in which the crimes occurred.

The most frequent single value is 00:00. However, when we consider the entire afternoon interval (from 16:00 to 20:00) and count all the crimes that occur in it, this interval is the most frequent. Within it, crimes are most concentrated around 16:00-17:00. In the United States, the working day usually runs from 9:00 a.m. to 5:00 p.m., so this suggests that many crimes occur when people leave work and are returning home, or when shops, offices, etc. are closing.
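The difference between the most frequent hour and the most frequent interval can be illustrated with a short Python sketch. Only the afternoon interval (16:00 to 20:00, treated here as excluding 20:00 itself) comes from this article; the other interval boundaries and names, and the list of hours, are assumptions made up for the example.

```python
from collections import Counter

def time_slot(hour):
    """Map an hour (0-23) to a coarse interval of the day."""
    if 16 <= hour < 20:
        return "afternoon"  # 16:00-20:00, as defined in the article
    if 6 <= hour < 12:
        return "morning"    # assumed boundary
    if 12 <= hour < 16:
        return "midday"     # assumed boundary
    return "night"          # assumed: 20:00-06:00

# Invented hours of occurrence, chosen so that 00:00 is the most common hour.
hours = [0, 0, 0, 0, 9, 12, 16, 16, 16, 17, 17, 18, 19, 22]

by_hour = Counter(hours)
by_slot = Counter(time_slot(h) for h in hours)

most_common_hour = by_hour.most_common(1)[0][0]
most_common_slot = by_slot.most_common(1)[0][0]
print("most frequent hour:", most_common_hour)
print("most frequent interval:", most_common_slot)
```

Here 00:00 is the single most frequent hour, yet the afternoon interval collects more crimes in total because it sums over four consecutive hours.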

The same analysis could be carried out for the other frequent itemsets. This would give us an overview of the problem and some insight into what kinds of rules to expect.

Association rules

Once the analysis of the frequent itemsets is done, we can move on to generating association rules. The rules have been generated with the same algorithm used to obtain the frequent itemsets.

With the data we have, a total of 243 rules have been generated. Many of these rules are obvious: they provide only trivial knowledge and are not valid as results. To find non-trivial rules, we must dig into the generated rule set.
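As a minimal sketch of this step, the following Python code derives rules from a table of frequent itemsets and their supports, keeps those above a minimum confidence, and sorts them by lift (confidence divided by the support of the consequent). Ranking by lift is one common way to push uninformative rules down the list; it is an assumption on our part, not necessarily how the rules in this article were filtered. The supports and the 0.7 threshold are invented for illustration.

```python
from itertools import combinations

def generate_rules(supports, min_confidence):
    """Derive rules X => Y from a dict {frozenset: support} of frequent itemsets.

    Assumes every subset of each itemset is also present in `supports`.
    Returns (antecedent, consequent, confidence, lift) tuples, highest lift first.
    """
    rules = []
    for itemset, sup in supports.items():
        if len(itemset) < 2:
            continue
        for size in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, size)):
                consequent = itemset - antecedent
                conf = sup / supports[antecedent]
                if conf >= min_confidence:
                    lift = conf / supports[consequent]
                    rules.append((antecedent, consequent, conf, lift))
    return sorted(rules, key=lambda rule: rule[3], reverse=True)

# Hypothetical supports, not taken from the San Francisco data.
supports = {
    frozenset({"Time = Late"}): 0.6,
    frozenset({"Category = Larceny/theft"}): 0.5,
    frozenset({"District = Southern"}): 0.4,
    frozenset({"Time = Late", "Category = Larceny/theft"}): 0.3,
    frozenset({"Category = Larceny/theft", "District = Southern"}): 0.3,
}

rules = generate_rules(supports, min_confidence=0.7)
for antecedent, consequent, conf, lift in rules:
    print(sorted(antecedent), "=>", sorted(consequent), f"conf={conf:.2f} lift={lift:.2f}")
```

A lift close to 1 means the antecedent tells us little about the consequent, which is often a sign of a rule that adds no new knowledge.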

Some rules that do provide useful knowledge are shown below.

First rule

This rule tells us that, if a crime occurs on this particular street, it will be a larceny/theft with a probability of 88.3%. This is quite remarkable. There are records of 298 crimes on this street in total, of which 263 belong to the larceny/theft category. If we also look within the crimes categorized as larceny/theft, the most frequent category on this street, we find the following:

We can see that they all have to do with vehicle theft (except the fourth instance). If we look through Google Maps at the particular point of this street that indicates the rule, we get the following image:

Figure 4. Particular point that appears in the first rule.

As we can see, it is a car park where there is a restaurant and a place of interest called Camera Obscura & Holograph, which is something like a scientific attraction. The restaurant is a high-priced restaurant, and therefore the people who come here are usually well off.

The fact that 263 larceny/theft incidents, whose recorded descriptions are almost all vehicle-related, occurred in a single year in such a specific area indicates that this is a hot spot, and that leaving a car in this location carries a high risk. One possible solution would be to station a security guard there, so that people can leave their cars with peace of mind while at the restaurant or in the area.
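The 88.3% quoted for the first rule is simply the ratio of the two counts given for this street, as the following short Python check shows. The variable names are our own.

```python
total_crimes_on_street = 298   # all recorded crimes on this street
larceny_theft_on_street = 263  # those categorized as larceny/theft

# Confidence of the rule "this street => Category = Larceny/theft".
confidence = larceny_theft_on_street / total_crimes_on_street
print(f"confidence = {confidence:.1%}")  # prints "confidence = 88.3%"
```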

Second rule

This rule is noteworthy because the crime it describes is theft of belongings from an unlocked car. If we look for information about this crime in California law, we find that:

“If you entered another person’s vehicle with criminal intent, you are not charged with theft with the use of force if the doors were unlocked. For example, if you entered the car through windows that were left open, the break-in is not established and you cannot be charged with burglary. However, you could be charged with theft of items in the car or theft of a car.”

That is, if there is no evidence that the car door(s) were forced, the alleged perpetrator(s) cannot be charged with burglary. This makes us suspect that the thieves have some mechanism for opening the car so that the entry does not appear to have been forced. The temporal distribution of this particular type of crime in the Mission District was as follows:

It is somewhat surprising that there are 160 cases in a single three-month period, and that the difference from the other periods is so large. Consulting the internet, one finds that many events take place in the Mission district in the fall, one of the most important being the Yerba Buena Gardens Festival. In 2016 this festival started in summer and ended in autumn, and these seasons coincide with the periods in which this particular crime was most frequent. The event brings more people than usual to the Mission district, and this increase could lead to more thefts in both summer and autumn. A picture of the event:

Figure 5. Yerba Buena Gardens Festival event.

There were many events like this during the summer-autumn period of 2016, and this may be the cause of the high rate of theft recorded during these dates.


The work shown here provides deeper insight into the crimes that occurred in San Francisco in 2016. Generating rules has yielded useful information, including rules that were not expected beforehand.

This rule-extraction work could be oriented toward particular rules, for example by type of crime or by area, according to the needs of whoever engages Human Trends for the investigation. The work reflects the great usefulness of association rules: they offer a simple way to obtain useful knowledge from data about which, a priori, we know nothing.

¹ This exploratory data analysis and pre-processing differ for each problem and cannot be generalized. It is also a long process, so for ease of reading the article does not explain it in detail. All the results shown were obtained from the pre-processed data.

² So that the plot displays correctly, the histogram does not show all the items, only those whose relative frequency in the complete data set is greater than 0.05.

Antonio Parri González

Data Scientist – Physicist
