Professional Experience


Disney+Hotstar, Gurugram

Sr. Software Engineer

July'21-present

  • Cost Initiatives:
    • Leveraged Hive query stats to identify unused Hive tables and archived the underlying data, saving close to $7,000 monthly.
    • Redash is used as the BI tool for building dashboards and running ad-hoc queries; the ad-hoc queue that runs its Hive queries was constantly under-scaled (running at capacity), the opposite of what we expected for ad-hoc usage. Brought the Redash metadata (queries, widgets, dashboards) into the data lake via a CDC job and set up a monthly job that removes the schedule from queries that have not been used, reducing the ad-hoc queue size by almost half (see the sketch after this list).
    • Archived old data of partitioned tables based on their usage.
    • Combined multiple country-specific jobs into single tenant-level jobs.
  • Event Onboarding
  • Platform Reliability
  • Cluster Setup
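
  A minimal sketch of the monthly Redash cleanup described under Cost Initiatives, assuming the CDC job lands Redash's queries, widgets, and visualizations tables in the data lake (table, column, and threshold names are illustrative):

    # Identify scheduled Redash queries that have not been refreshed recently and are
    # not backing any dashboard widget; the monthly job then drops their schedules.
    from datetime import datetime, timedelta
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("redash-stale-schedules").enableHiveSupport().getOrCreate()
    cutoff = (datetime.utcnow() - timedelta(days=90)).strftime("%Y-%m-%d")  # assumed retention window

    queries = spark.table("redash_cdc.queries")                # assumed CDC table: id, name, schedule, retrieved_at
    widgets = spark.table("redash_cdc.widgets")                # assumed CDC table: one row per dashboard widget
    visualizations = spark.table("redash_cdc.visualizations")  # assumed CDC table: maps widgets to query ids

    # Queries still powering a dashboard widget are kept regardless of when they last ran.
    in_use = (widgets.join(visualizations, widgets.visualization_id == visualizations.id)
                     .select(visualizations.query_id)
                     .distinct())

    stale = (queries
             .filter(F.col("schedule").isNotNull())
             .filter(F.col("retrieved_at") < F.lit(cutoff))
             .join(in_use, queries.id == in_use.query_id, "left_anti"))

    # The actual job removes the schedule for these ids through the Redash API.
    stale.select("id", "name", "retrieved_at").show(truncate=False)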

Super Highway Labs (Shuttl), Gurugram

Data Engineer

Feb'21-July'21

  • Designing the Data Lake: The organization's entire infrastructure is hosted on AWS. The goal of this project was to move all teams to querying data through Athena, with the company-wide data in AWS S3 serving as the single source of truth.
    • Data Ingestion
      • Scraping data from third parties: Set up a pipeline to download multiple datasets via third-party APIs and store them in the data lake on AWS S3.
        Tools used: AWS Lambda, AWS Cloudwatch, AWS Secrets Manager, AWS Athena, AWS S3, PyArrow, Pandas, Terraform.
      • Ingesting business data stored in Google Sheets: To make data kept in Google Sheets usable for OLAP queries, set up a pipeline that downloads the sheets, dumps the data to S3, and creates Athena tables.
        Tools used: AWS Lambda, AWS Cloudwatch, AWS Secrets Manager, AWS Athena, AWS S3, Terraform.
      • Migrating data from a legacy MySQL instance to S3: Moved data from a MySQL database hosted on an EC2 instance to S3 using AWS DMS; AWS Glue Crawlers were used to create the databases and tables.
    • File compaction:
      • Athena queries were slow because the data lake contained thousands of very small files.
      • Wrote a generic Spark job to compact the files stored in S3 and repartitioned older data into fewer partitions to optimize query performance (see the sketch at the end of this list).
      • Leveraged the AWS Glue catalog to determine the partitions and S3 locations of the Athena tables.
      Tools used: AWS Glue (to deploy the Spark job), AWS Athena, AWS S3, Apache Spark.
    • Reporting tool: Designed a generic, configuration-based reporting tool that queries data via Athena and delivers the results to the concerned parties via email or Slack notifications.
      Tools used: AWS Lambda, AWS Cloudwatch, AWS Secrets Manager, AWS Athena, AWS S3, AWS SES, Terraform.
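
  A minimal sketch of the generic compaction job described above, assuming the Glue catalog is the source of truth for partition locations (database, table, and target file count are illustrative):

    # Look up an Athena table's partitions in the Glue catalog and rewrite each
    # partition's many small files as a handful of larger Parquet files.
    import boto3
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("s3-file-compaction").getOrCreate()
    glue = boto3.client("glue")

    DATABASE, TABLE, TARGET_FILES = "analytics", "events", 8  # assumed configuration values

    paginator = glue.get_paginator("get_partitions")
    for page in paginator.paginate(DatabaseName=DATABASE, TableName=TABLE):
        for partition in page["Partitions"]:
            location = partition["StorageDescriptor"]["Location"]
            df = spark.read.parquet(location)

            # Write compacted output to a staging prefix; the real job swaps it into
            # place and refreshes the partition metadata afterwards so Athena never
            # sees a half-written partition.
            staging = location.rstrip("/") + "_compacted/"
            df.coalesce(TARGET_FILES).write.mode("overwrite").parquet(staging)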

Alphonso Inc., Bengaluru

Technologist

July'19-Jan'21

  • Data Quality Project: Played a key role in improving data quality across the organization.
    • Outlier Detection and Removal: The aim was to identify outliers in the data and remove or flag those data points.
      • Performed several analyses to identify potential issues and devised a plan for flagging the affected data in the production dataset.
      • Leveraged a time-series decomposition algorithm (STL decomposition) to detect anomalies in the data and notify the concerned team via Slack (see the sketch at the end of this list).
      • Presented the effect of removing/flagging the outliers and its consequences at different levels, backed by supporting statistics.
      • Used Databricks' Delta format to flag the already existing data.
      • Solved multiple challenges, such as the underlying data being updated by multiple processes at the same time and triggering jobs that depend on the outlier flag.
    • Data Skewness: Dealt with location- and viewership-based skewness in the data.
      • Derived threshold values to trim data from some areas, resulting in a more balanced dataset.
      • Flagged production data that should not be used when a more balanced dataset is required for analysis/reporting.
  • Data Insight Project
    • Data Monitoring: Daily monitoring of the major datasets used in the organization.
      • Set up daily Airflow jobs to generate stats for several datasets used across the organization.
      • Generated stats are stored in different datastores such as InfluxDB, Elasticsearch, or HDFS.
      • Generalized the code to generate stats for any dataset based on a configuration file.
    • Google Sheets-based dashboard: Used Google Sheets to share the state of the data with all concerned teams.
      • Used the Google Sheets API to publish the stats.
      • Defined global thresholds for the monitored metrics to detect anomalies.
      • Highlighted rows containing anomalous metrics in different colors depending on the criticality of the metric.
    • Created a Kibana dashboard for data visualization.
    • Contributed to the development of a React-based dashboard providing in-depth analysis of the datasets.
  • Managed multiple ETL jobs that process incoming client data and generate aggregated datasets used by multiple downstream processes.
  • Took part in critical discussions on improving data processing, disk space usage, solving small-file problems, releasing new data versions, etc.
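
  A minimal sketch of the STL-based anomaly check from the Data Quality project, assuming a daily metric with weekly seasonality (the input path, metric name, threshold, and Slack webhook are illustrative):

    # Decompose a daily time series with STL, flag days whose residual is an
    # outlier, and post the anomalous dates to a Slack channel.
    import pandas as pd
    import requests
    from statsmodels.tsa.seasonal import STL

    def detect_anomalies(series: pd.Series, threshold: float = 3.0) -> pd.Series:
        """Return points whose STL residual is more than `threshold` sigmas from the mean."""
        result = STL(series, period=7).fit()  # weekly seasonality assumed
        residual = result.resid
        zscores = (residual - residual.mean()) / residual.std()
        return series[zscores.abs() > threshold]

    daily = pd.read_parquet("s3://bucket/stats/daily_counts.parquet")  # hypothetical stats output
    series = daily.set_index("date")["impressions"].asfreq("D")

    anomalies = detect_anomalies(series)
    if not anomalies.empty:
        requests.post(
            "https://hooks.slack.com/services/XXX",  # placeholder webhook URL
            json={"text": f"Anomalous days detected: {list(anomalies.index.date)}"},
        )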

Scholastic Record


M.Tech in Computer Science | Indian Institute of Technology, Delhi | July 2019 | DGPA: 8.98
B.Tech in Computer Science | Hansraj College, Delhi University | July 2017 | Marks: 90.78%
Class XII | Kendriya Vidyalaya | May 2013 | Marks: 95.60%
Class X | Kendriya Vidyalaya | May 2013 | CGPA: 9.4

Online Courses


Introduction to Designing Data Lakes on AWS

Coursera
Issued March 2021 · No Expiration Date
Credential Id: F9EAH7BXZQJS

AWS Fundamentals: Going Cloud-Native

Coursera
Issued Feb 2021 · No Expiration Date
Credential Id: 4NL5Q2NNH3EQ

Big Data Analysis with Scala and Spark

Coursera
Issued Sep 2019 · No Expiration Date
Credential Id: 8524CVHCZKWZ
