Cyclistic Case Study
Cyclistic Case Study
Cyclistic is Chicago based fictional bike-sharing company. They own 5,824 bicycles that are geotracked and locked into a network of 692 stations across Chicago. The bikes can be unlocked from one station and returned to any other station in the system anytime.
Their flexible pricing includes single-ride passes, full-day passes, and annual memberships. Customers who purchase single-ride or full-day passes are referred to as casual riders. Customers who purchase annual memberships are Cyclistic members.
Cyclistic’s finance analysts have concluded that annual members are much more profitable than casual riders, so the director of marketing believes the company’s future success depends on maximizing the number of annual memberships. Therefore, we need to understand how casual riders and annual members use Cyclistic bikes differently and why casual riders would buy a membership.
Considering the above overview, the business task is answering the question: how do annual members and casual riders use Cyclistic bikes differently? We need to answer this by analyzing the user data, in order for the marketing team to be able to effectively target casual riders, therefore maximizing the conversion rate to Cyclistic members.
The key stakeholders for this project are:
Lily Moreno, the marketing director
Cyclistic marketing analytics team
Cyclistic executive team
We can find the past 12 months of user data (from 01/10/2020 to 30/09/2021) stored on the companies AWS cloud storage. The data has been made available by Motivate International Inc. under this license.
The data is stored in 12 separate .csv files, each representing one month. The tables consist of 13 columns representing the time, location, ID and the casual/member rider characteristics. Each row represents a bike trip from start to finish and the rows range from 49.623 to 822.411.
At first glance, there seems to be a lot of NULL values, and the unique station id's have inconsistent formatting, which prevents the merging of the tables.
Because of the large amount of data, we will use SQL for the cleaning and analyzing process. The tool of choice will be BigQuery and all the data will be stored on GCS (Google Cloud Services).
Considering the unsalvageable state of the station id's, I decided to drop them, and use the station names and geographical coordinates instead. This allows the use of the UNION ALL operator, which merges the 12 tables into one. The WITH clause lets us store this newly created merged table for later use:
WITH all_data AS (
SELECT * EXCEPT (start_station_id, end_station_id) FROM `sonic-momentum-324112.Cyclistic_data.202010-divvy-tripdata`
UNION ALL
SELECT * EXCEPT (start_station_id, end_station_id) FROM `sonic-momentum-324112.Cyclistic_data.202011-divvy-tripdata`
UNION ALL
SELECT * EXCEPT (start_station_id, end_station_id) FROM `sonic-momentum-324112.Cyclistic_data.202012-divvy-tripdata`
UNION ALL
SELECT * EXCEPT (start_station_id, end_station_id) FROM `sonic-momentum-324112.Cyclistic_data.202101-divvy-tripdata`
UNION ALL
SELECT * EXCEPT (start_station_id, end_station_id) FROM `sonic-momentum-324112.Cyclistic_data.202102-divvy-tripdata`
UNION ALL
SELECT * EXCEPT (start_station_id, end_station_id) FROM `sonic-momentum-324112.Cyclistic_data.202103-divvy-tripdata`
UNION ALL
SELECT * EXCEPT (start_station_id, end_station_id) FROM `sonic-momentum-324112.Cyclistic_data.202104-divvy-tripdata`
UNION ALL
SELECT * EXCEPT (start_station_id, end_station_id) FROM `sonic-momentum-324112.Cyclistic_data.202105-divvy-tripdata`
UNION ALL
SELECT * EXCEPT (start_station_id, end_station_id) FROM `sonic-momentum-324112.Cyclistic_data.202106-divvy-tripdata`
UNION ALL
SELECT * EXCEPT (start_station_id, end_station_id) FROM `sonic-momentum-324112.Cyclistic_data.202107-divvy-tripdata`
UNION ALL
SELECT * EXCEPT (start_station_id, end_station_id) FROM `sonic-momentum-324112.Cyclistic_data.202108-divvy-tripdata`
UNION ALL
SELECT * EXCEPT (start_station_id, end_station_id) FROM `sonic-momentum-324112.Cyclistic_data.202109-divvy-tripdata`),
Next, we created the duration column, which represents the number of minutes the ride lasted. This was done by subtracting the ride starting time from the ending time. We also extracted the day of the week and created a column that combines the start and end station. All in this query:
data AS (SELECT *, TIMESTAMP_DIFF(ended_at, started_at, MINUTE) AS duration,
CASE
WHEN EXTRACT (DAYOFWEEK FROM started_at) = 1 THEN "Sunday"
WHEN EXTRACT (DAYOFWEEK FROM started_at) = 2 THEN "Monday"
WHEN EXTRACT (DAYOFWEEK FROM started_at) = 3 THEN "Tuesday"
WHEN EXTRACT (DAYOFWEEK FROM started_at) = 4 THEN "Wednesday"
WHEN EXTRACT (DAYOFWEEK FROM started_at) = 5 THEN "Thursday"
WHEN EXTRACT (DAYOFWEEK FROM started_at) = 6 THEN "Friday"
WHEN EXTRACT (DAYOFWEEK FROM started_at) = 7 THEN "Saturday"
END AS day,
CONCAT (start_station_name," - ", end_station_name) AS trip
FROM all_data),
The geographical latitude and longitude varies slightly for the same station. Probably because of the change in the location of the dock where each bike was left. We can solve this issue if we group by stations and calculate the average latitude and longitude for them:
start_coordinates AS (SELECT DISTINCT start_station_name,
ROUND(AVG(start_lat), 8) AS start_lat,
ROUND (AVG(start_lng), 8) AS start_lng,
FROM all_data
WHERE start_station_name IS NOT NULL
GROUP BY start_station_name),
end_coordinates AS (SELECT DISTINCT end_station_name,
ROUND(AVG(end_lat), 8) AS end_lat,
ROUND (AVG(end_lng), 8) AS end_lng,
FROM all_data
WHERE end_station_name IS NOT NULL
GROUP BY end_station_name),
Then we created another temporary table, where we added in the newly created columns and joined in the cleaned latitude and longitude for the stations. This table was used to sort and filter through the data to find any inconsistencies:
filtered_data AS (SELECT ride_id, rideable_type, started_at, ended_at, d.start_station_name, d.end_station_name, s.start_lat, s.start_lng, e.end_lat, e.end_lng, member_casual, duration,day, trip
FROM data as d
LEFT JOIN start_coordinates AS s
ON d.start_station_name = s.start_station_name
LEFT JOIN end_coordinates AS e
ON d.end_station_name = e.end_station_name
By sorting and filtering through the data we encountered 3 issues:
The ride duration contained some impossible numbers, like negative minutes and 58 day long rides (when the longest one is a day)
The end station and start station have NULL values. Where the station was NULL, the geographical coordinates were also NULL, so we couldn't use them as an identifier.
There are 715 distinct stations when there should only be 692.
With a simple WHERE clause we can filter out all the NULL station values and any ride that is shorter than 1 minute or longer than a full day:
WHERE d.start_station_name IS NOT NULL AND d.end_station_name IS NOT NULL
AND duration BETWEEN 1 AND 1440)
To identify the duplicate stations, we can group by the geographical coordinates and count the distinct values of the stations.
##This was used to identify duplicate start stations using the longitude and latitude
WITH start_duplicates AS (SELECT CONCAT(start_lat," &", start_lng) AS coord, COUNT (DISTINCT (start_station_name)) AS count,
FROM `Cyclistic_data.final_data2`
GROUP BY coord
HAVING count > 1),
start_stations AS (SELECT start_station_name, CONCAT(start_lat," &", start_lng) AS coord
FROM `Cyclistic_data.final_data2`)
SELECT start_station_name, s.coord
FROM start_stations AS s
INNER JOIN start_duplicates AS d
ON s.coord=d.coord
GROUP BY start_station_name, s.coord
ORDER BY s.coord
##This was used to identify duplicate end stations using the longitude and latitude
WITH end_duplicates AS (SELECT CONCAT(end_lat," &", end_lng) AS coord, COUNT (DISTINCT (end_station_name)) AS count,
FROM `sonic-momentum-324112.Cyclistic_data.final_data2`
GROUP BY coord
HAVING count > 1),
end_stations AS (SELECT end_station_name, CONCAT(end_lat," &", end_lng) AS coord
FROM `sonic-momentum-324112.Cyclistic_data.final_data2`)
SELECT end_station_name, s.coord
FROM end_stations AS s
INNER JOIN end_duplicates AS d
ON s.coord=d.coord
GROUP BY end_station_name, s.coord
ORDER BY s.coord*/
Now that we have a list of all the duplicate stations, we can replace these using the CASE WHEN statement.
CASE
WHEN start_station_name = 'Harding Ave & 26th St' THEN 'Lawndale Ave & 30th St'
WHEN start_station_name = 'Komensky Ave & 31st St' THEN 'Tripp Ave & 31st St'
WHEN start_station_name = 'Kildare Ave & 26th St' THEN 'Tripp Ave & 31st St'
WHEN start_station_name = 'Lawndale Ave & 16th St' THEN 'N Green St & W Lake St'
WHEN start_station_name = 'N Carpenter St & W Lake St' THEN 'Lavergne & Fullerton'
WHEN start_station_name = 'Lamon Ave & Armitage Ave' THEN 'Tripp Ave & 31st St'
WHEN start_station_name = 'N Hampden Ct & W Diversey Ave' THEN 'Hampden Ct & Diversey Ave'
WHEN start_station_name = 'Kilpatrick Ave & Parker Ave' THEN 'Kostner Ave & Wrightwood Ave'
WHEN start_station_name = 'Mulligan Ave & Wellington Ave' THEN 'Meade Ave & Diversey Ave'
WHEN start_station_name = '351' THEN 'Meade Ave & Diversey Ave'
WHEN start_station_name = 'Roscoe & Harlem' THEN 'Oketo Ave & Addison St'
WHEN start_station_name = 'DuSable Museum' THEN 'Museum'
WHEN start_station_name = 'DuSable Lake Shore Dr & Ohio St' THEN 'Lake Shore Dr & Ohio St'
WHEN start_station_name = 'DuSable Lake Shore Dr & Monroe St' THEN 'Lake Shore Dr & Diversey Pkwy'
WHEN start_station_name = 'DuSable Lake Shore Dr & Wellington Ave' THEN 'Lake Shore Dr & Wellington Ave'
WHEN start_station_name = 'DuSable Lake Shore Dr & North Blvd' THEN 'Lake Shore Dr & North Blvd'
WHEN start_station_name = 'DuSable Lake Shore Dr & Belmont Ave' THEN 'Lake Shore Dr & Belmont Ave'
WHEN start_station_name = 'DuSable Lake Shore Dr & Diversey Pkwy' THEN 'Lake Shore Dr & Diversey Pkwy'
ELSE TRIM(start_station_name)
END AS start_station_name,
CASE
WHEN end_station_name = 'Harding Ave & 26th St' THEN 'Lawndale Ave & 30th St'
WHEN end_station_name = 'Komensky Ave & 31st St' THEN 'Tripp Ave & 31st St'
WHEN end_station_name = 'Central Park Ave & Douglas Blvd' THEN 'Tripp Ave & 31st St'
WHEN end_station_name = 'Lawndale Ave & 16th St' THEN 'Lawndale Ave & 16th St'
WHEN end_station_name = 'N Carpenter St & W Lake St' THEN 'N Green St & W Lake St'
WHEN end_station_name = 'N Damen Ave & W Wabansia St' THEN 'Damen Ave & Wabansia Ave'
WHEN end_station_name = 'N Hampden Ct & W Diversey Ave' THEN 'Hampden Ct & Diversey Ave'
WHEN end_station_name = 'Kilpatrick Ave & Parker Ave' THEN 'Kostner Ave & Wrightwood Ave'
WHEN end_station_name = 'Mulligan Ave & Wellington Ave' THEN 'Meade Ave & Diversey Ave'
WHEN end_station_name = 'Lamon Ave & Armitage Ave' THEN 'Lavergne & Fullerton'
WHEN end_station_name = 'DuSable Museum' THEN 'Museum'
WHEN end_station_name = 'DuSable Lake Shore Dr & Ohio St' THEN 'Lake Shore Dr & Ohio St'
WHEN end_station_name = 'DuSable Lake Shore Dr & Monroe St' THEN 'Lake Shore Dr & Diversey Pkwy'
WHEN end_station_name = 'DuSable Lake Shore Dr & Wellington Ave' THEN 'Lake Shore Dr & Wellington Ave'
WHEN end_station_name = 'DuSable Lake Shore Dr & North Blvd' THEN 'Lake Shore Dr & North Blvd'
WHEN end_station_name = 'DuSable Lake Shore Dr & Belmont Ave' THEN 'Lake Shore Dr & Belmont Ave'
WHEN end_station_name = 'DuSable Lake Shore Dr & Diversey Pkwy' THEN 'Lake Shore Dr & Diversey Pkwy'
ELSE TRIM(end_station_name)
END AS end_station_name,
With the duplicates removed, the final table is ready to be analyzed. The entire query can be found here!
In order to simplify and organize the analysis process, I've broken it down into 3 parts: Where?, When? and How? are casual riders and annual members using the bikes differently.
Using the latitude and longitude data we can pinpoint each start and end station in Chicago. This helps us to figure out where casual riders and annual members start and end their trips.
Interactive Tableau Dashboard For Casual GeoData
The above visualization shows the frequency of visits for casual riders to each bike station above 2000 uses. The most frequented stations can be spotted by their larger shape and red color. The top 5 stations are marked on the map.
The map clearly shows that casual riders overwhelmingly prefer riding along the Chicago shoreline, with the biggest concentration being around the Chicago Harbor and Millenium Park.
Annual members behave differently when it comes to their preferred stations:
Interactive Tableau Dashboard For Member GeoData
The annual member's usage is less dense on the shoreline, harbor and park. Member frequented stations are more spread out around Chicago. They are not as heavily concentrated on certain areas. However, we can still see that most of the traffic is around office buildings. Wells St. and Clark St. having the busiest stations. This also confirms our hypothesis, that annual members mostly use the bikes, to commute to work.
When we look at the total number of rides throughout the year, it all comes down to the temperature.
It looks like the number of trips slowly start increasing starting in March, peaking in July for the casual riders, and in August for the annual members. After the summer months, it begins to decrease hitting the lowest point in February. The peak season for both casual riders and annual members is between May and October.
45% of all the rides happen in the summer months.
Looking at the preferred days for the riders, gives us the following insights:
While the number of trips for annual members stays consistent throughout the week (with a slight decrease on Sunday and Monday), casual riders clearly prefer the weekend. The below query shows us that 33% of all trips for casuals happen on Saturday and Sunday:
SELECT (SELECT COUNT (*)
FROM `sonic-momentum-324112.Cyclistic_data.corrected_final`
WHERE day = "Saturday" OR day = "Sunday") / (
SELECT COUNT(*)
FROM `sonic-momentum-324112.Cyclistic_data.Corrected_Data`)
FROM `sonic-momentum-324112.Cyclistic_data.Corrected_Data`
Breaking it further down to hours:
The above chart represents the total nr. of trips at each hour during a day. The 3 peaks in the trend line for members hints at the fact that most of them use Cyclistic to commute to work. The first peak is around 7 AM when most people go to work, the second one at 12PM around lunch break, and the biggest peak is at 5 PM, when most people leave work. Casual riders have a slower increase from the waking hours and they peak at the exact same time as the members. After 5 PM the number of trips quickly decreases as the sun begins to set.
However, during the weekends the bike usage for both members and casual riders rises gradually peaking earlier, between 12PM and 5PM:
We haven't found any significant difference between casual riders and annual members when it comes to bike types. Both strongly prefer the classical bikes.
However, when it comes to ride duration, there are significant differences. The below query showed us that casual riders on average take 115% longer trips than members:
SELECT (((SELECT AVG(duration)
FROM `sonic-momentum-324112.Cyclistic_data.corrected_final`
WHERE member_casual = "casual") - (
SELECT AVG(duration)
FROM `sonic-momentum-324112.Cyclistic_data.Corrected_Data`
WHERE member_casual = "member"))/(SELECT AVG(duration)
FROM `sonic-momentum-324112.Cyclistic_data.Corrected_Data`
WHERE member_casual = "member"))*100
FROM `sonic-momentum-324112.Cyclistic_data.Corrected_Data`
This significant difference can also be seen on the annual trend line, representing the average ride duration (in minutes), throughout the year:
Even though casual ride duration takes a dip in the cold winter, it's still almost two times longer than annual members.
The daily chart for average ride length shows us the same thing:
The average ride duration is at least two times longer for casual riders on any day of the week.
Getting back to the business task: how do annual members and casual riders use Cyclistic bikes differently? The answers we found are:
While annual members mostly use the bikes to commute to work, casual riders use them to see the sights in Chicago, preferring the shoreline, most of the traffic being around the Chicago Harbor and Millenium Park.
While annual members consistently use the bikes each day of the week (with a slight dip on Sunday and Monday), casual riders prefer the weekend with most of the traffic happening from Friday to Sunday. They both avoid the winter cold and prefer riding in the summer. During the day, we can notice 3 peaks for the annual riders on weekdays, which coincides with the start of work (8 AM), lunch break (12 PM), and the chart peaks at the end of work (5 PM), which coincidently is the same time the chart peaks for the casual riders as well. The difference is that it steadily increases for casuals until 5 PM. On weekends traffic peaks for both between 12 PM and 5 PM but the majority are casuals on these days.
Casual Riders have an average ride duration that is 115% longer than the average duration for annual members.
Recommendations:
The best time to target casual riders with marketing campaigns is from the beginning of June until the end of August, on the weekend, between 12 PM and 5 PM. They can be found riding along the Chicago shoreline with most of the trips starting or finishing around the Harbor and Millenium Park.
Create a cheaper weekend-only membership to slowly entice casual users towards a full annual membership. This membership will only allow them to use the bikes on Friday, Saturday and Sunday.
Adding a program to the annual membership that incentivizes long rides. Getting points in accordance with the ride length or distance, which can be exchanged into rewards or free rides. This should incentivize casual riders who prefer longer rides, but might also appeal to annual members and motivate them to continue their membership.