...get to know your data; not all data is the same...
Explore
Understanding the data and how it's structured helps me establish context and identify which questions I can answer.
Quality Data Sources
Ensuring the integrity of a dataset provides a strong foundation and confidence that my analysis will give the best possible results. When working with a new dataset, I check the reliability of the source, the completeness of the data, and its relevance to the project.
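These initial checks can be sketched in a few lines of pandas (a minimal illustration only; the column names and values below are made up, not data from any real project):

```python
import pandas as pd

# Illustrative sample; real data would be loaded from the source system.
df = pd.DataFrame({
    "ride_id": ["a1", "a2", "a2", None],
    "started_at": ["2021-01-01 08:00", "2021-01-01 09:15",
                   "2021-01-01 09:15", None],
})

missing = df.isna().sum()             # completeness: missing values per column
total_rows = len(df)                  # overall coverage
unique_ids = df["ride_id"].nunique()  # fewer unique ids than rows hints at duplicates

print(missing.to_dict(), total_rows, unique_ids)
```

A gap between row count and unique identifiers, or a high missing-value count, is an early signal that the source needs closer inspection before any analysis begins.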
Process
Once I become familiar with a dataset, I process it so it's ready for analysis. This includes any cleaning and transformation that may be required.
Data Cleaning
I clean the data to make sure it is complete and correct, otherwise my analysis may lead to inaccurate insights. The process I use varies depending on the source.
Generally, I check for and complete the following activities:
only keep data relevant to the problem (reduces complexity and increases efficiency)
check for and remove duplicate observations (accuracy)
check for and update misspelled values (ensures consistent categorisation)
check for null values and consider if they need to be removed, populated, or remain empty (accuracy)
remove extra spaces (consistency)
check the range of values in each column to ensure they are appropriate (outliers may be an error or could be integral to the analysis)
I verify any changes made to ensure they were executed appropriately.
Transformation
Transforming data helps me organise it and makes it easier to use.
During this process, I:
review and update column names (this ensures they are meaningful)
combine or separate data where required (e.g. separating Date/Time for efficiency)
check and update required column data types (this ensures they match the data)
denormalise the data (for performance and dimension modelling)
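A minimal pandas sketch of these transformations (the column names and values are hypothetical, chosen only to illustrate the steps):

```python
import pandas as pd

# Illustrative data; real data would come from the source system.
df = pd.DataFrame({"startedAt": ["2021-03-01 08:30:00"],
                   "memberCasual": ["member"]})

# Rename columns so they are meaningful and consistent.
df = df.rename(columns={"startedAt": "started_at",
                        "memberCasual": "member_type"})

# Update the data type so it matches the data.
df["started_at"] = pd.to_datetime(df["started_at"])

# Separate Date and Time for efficiency.
df["start_date"] = df["started_at"].dt.date
df["start_time"] = df["started_at"].dt.time
```

Splitting a combined timestamp like this makes grouping by day or by time-of-day cheaper downstream, since each analysis no longer has to re-parse the full datetime.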
I perform other transformations during analysis as required by the project. I adhere to Matthew Roche's general rule that "data should be transformed as far upstream as possible, and as far downstream as necessary". In other words, if I'm making a transformation, I make it as close to the source as is practically possible. This helps to:
increase efficiency and performance (transform once and reuse the data)
reduce maintenance (single or fewer locations to maintain logic)
introduce consistency and simplicity (the same data can be used across multiple applications)
Tools
The tools I use to inspect and process data vary depending on the source and requirements of the project.
Below is an example where I used R for my '2021 Cyclistic Riders' project.
Accountability
Processing data can result in a significantly different dataset. I create transparency by reporting my results.
Document
Documenting the steps I use to process data ensures accountability: it shows that the steps were executed appropriately and that the resulting data is reliable. The documentation also serves as a future reference for similar data processing requirements.
Change Log
I use change logs to document notable modifications made to the data, with explanations of why they were made. This helps me communicate the changes to stakeholders and colleagues, which creates transparency and accountability.
Below is an example of a Change Log for my '2021 Cyclistic Riders' project.

Contact
If you would like to discuss job opportunities and how I can help you deliver, I can be contacted through LinkedIn.
For a summary of the value and contributions I have made throughout my career, my resume is available on request.