Data Factory Implementation
- Customer
- JSC "Kazakhtelecom"
- Project manager on the customer side
- IT Provider
- Axellect
- Year of project completion
- 2023
- Project timeline
- June 2022 - November 2023
- Project scope
- 22300 man-hours
- Goals
-
Before the "Data Factory Implementation" project started, the following goals were set:
1) Set up a clear strategic roadmap for Data Factory development on a 3-year horizon;
2) Provide Data Factory consumers with user-friendly instruments (data model, data marts, and data governance tools) to work with data effectively;
3) Establish standardized, company-wide approaches and processes for handling datasets;
4) Increase the conversion rate of analytical hypotheses into working solutions that drive business value;
5) Decrease Time-2-Market for the implementation of any data asset (data mart, BI dashboard, ML model, etc.);
6) Demonstrate to the company's executives the importance of investing in data management and data governance.
- Project Results
-
The main results of the project are:
Qualitative:
1) A single view of the Data Function's development;
2) A clear operating model (roles and processes);
3) An approach to running the Data Function that the client can apply independently in the future.
Quantitative:
1) Time-2-Market for the implementation of any data asset (data mart, BI dashboard, ML model, etc.) decreased by ~17%;
2) The conversion rate of analytical hypotheses into working solutions increased 2.5x.
- The uniqueness of the project
-
The project is distinguished by its clear focus on business impact. "Data Factory Implementation" gave data specialists (engineers, analysts) and business users a data ecosystem with all the instruments required to process huge amounts of data effectively: precalculated data marts, data discovery tools (glossary and catalog), a data quality engine, and enablers for building and testing advanced analytical models. In addition, data governance processes covering the full cycle of data asset development were established during the project and incorporated into the operational activities of data-related teams.
All of the above was measured in clear business metrics:
1) Time-2-Market for the implementation of any data asset (data mart, BI dashboard, ML model, etc.) decreased by ~17%;
2) The conversion rate of analytical hypotheses into working solutions increased 2.5x.
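The profile does not state how these two metrics were calculated. As a minimal sketch, assuming they are simple before/after comparisons, the arithmetic could look like the following. All input numbers (delivery days, hypothesis counts) and function names are hypothetical illustrations, not project data.

```python
def relative_decrease(before: float, after: float) -> float:
    """Return the fractional decrease from `before` to `after` (0.17 means 17%)."""
    if before <= 0:
        raise ValueError("baseline must be positive")
    return (before - after) / before


def growth_factor(before: float, after: float) -> float:
    """Return how many times larger `after` is than `before` (2.5 means 2.5x)."""
    if before <= 0:
        raise ValueError("baseline must be positive")
    return after / before


# Hypothetical inputs, chosen only to illustrate the arithmetic.
BASELINE_DELIVERY_DAYS = 60.0   # assumed average delivery time before the project
CURRENT_DELIVERY_DAYS = 50.0    # assumed average delivery time after the project
BASELINE_CONVERSION = 4 / 40    # assumed: 4 of 40 hypotheses became working solutions
CURRENT_CONVERSION = 10 / 40    # assumed: 10 of 40 hypotheses became working solutions

if __name__ == "__main__":
    decrease = relative_decrease(BASELINE_DELIVERY_DAYS, CURRENT_DELIVERY_DAYS)
    factor = growth_factor(BASELINE_CONVERSION, CURRENT_CONVERSION)
    print(f"Time-2-Market decrease: {decrease:.0%}")
    print(f"Conversion rate growth: {factor:.1f}x")
```

A relative decrease is reported against the baseline, so the same absolute saving reads as a smaller percentage when the starting delivery time is longer.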
- Used software
-
ETL instruments: Apache Airflow, Luigi, RabbitMQ, Spark, Dask and Ray;
Streaming ETL: Informatica Change Data Capture (CDC) and Apache Kafka;
Storage system: Hadoop HDFS with Apache Impala, Hive Metastore, Hadoop YARN, Apache Ranger;
MPP database: Greenplum;
Analytics engine: ClickHouse;
Data Governance tools: Informatica Axon Data Governance, Informatica EDC (Enterprise Data Catalog), Informatica Data Quality.
- Difficulty of implementation
-
The main difficulties of the "Data Factory Implementation" project fall into technical and organizational categories.
On the technical side, the main challenge was integrating a fragmented and sophisticated IT landscape into Data Factory: more than 100 internal and external sources, each with its own business logic, together producing about 4 petabytes of data. Integrating these datasets required cross-functional engagement, with contributions from both data specialists (engineers, analysts) and business users.
On the organizational side, it was quite time-consuming to move the company's employees from legacy practices to the new ways of working with data, which require them to provide descriptions, quality rules, and lineage for their data assets. Another task was to explain what "data ownership" actually means and what would change for each particular user. This demanded many educational and change-management initiatives from the project leaders.
- Project Description
-
The "Data Factory Implementation" project included the following milestones:
1) Data Strategy Development;
2) Data Strategy Implementation (DWH refactoring and incorporation of data governance processes into day-to-day activities).
Data Strategy Development consisted of the following parts:
1.1) As Is Assessment: JSC "Kazakhtelecom" was assessed in terms of its data landscape, its data governance operating model (people and processes), and its analytical use cases;
1.2) To Be Model Creation: the target data architecture (including functional components and particular vendors) was designed, as well as the operating model (target roles, responsibilities, and processes for working with data). In addition, the corporate data model concept was created using an industry model (TM Forum) as a reference. The last and most important deliverable was the roadmap for the 3-year horizon, which includes the list of initiatives and the budget required to implement them.
Data Strategy Implementation was divided into two parts:
2.1) Organizational aspect: this included several data governance training seminars, which explained the specific changes to the day-to-day operations of the business lines. After the training courses, the operating model was piloted on the Retail business line. It was then adjusted based on the lessons learned and scaled to the whole organization.
2.2) Technical aspect: it was decided to set up a full track for one concrete data domain, and managerial reporting was chosen. The track began with the collection of business requirements, followed by prototyping and development of the data model and data mart layers (on the new functional components). Next, the data governance tools were brought in: the required business and technical metadata were established in the Data Catalog and Business Glossary, with lineage created at the same time. In addition, data quality rules were collected and onboarded to the Data Quality tool to provide continuous monitoring.
This approach was approved together with the client and used to establish iterative refactoring of the previous Data Warehouse solution.
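The onboarding of data quality rules for continuous monitoring, described in the technical aspect above, can be illustrated with a small sketch. The project used Informatica Data Quality, whose configuration is not shown in this profile; the code below is a tool-agnostic illustration in plain Python. The rule names, owners, column names, and sample rows are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List

Row = Dict[str, object]  # one record of a data mart, keyed by column name


@dataclass(frozen=True)
class QualityRule:
    """A named check with an assigned business owner (hypothetical structure)."""
    name: str
    owner: str
    check: Callable[[Row], bool]


def evaluate(rules: Iterable[QualityRule], rows: Iterable[Row]) -> Dict[str, float]:
    """Return, for each rule, the share of rows that pass it (1.0 if there are no rows)."""
    materialized: List[Row] = list(rows)
    report: Dict[str, float] = {}
    for rule in rules:
        passed = sum(1 for row in materialized if rule.check(row))
        report[rule.name] = passed / len(materialized) if materialized else 1.0
    return report


# Hypothetical rules for a managerial reporting data mart.
KNOWN_BRANCHES = {"Retail", "Corporate", "Technical"}
RULES = [
    QualityRule("revenue_not_null", "finance",
                lambda r: r.get("revenue") is not None),
    QualityRule("revenue_non_negative", "finance",
                lambda r: r.get("revenue") is not None and r["revenue"] >= 0),
    QualityRule("branch_is_known", "retail",
                lambda r: r.get("branch") in KNOWN_BRANCHES),
]

# Hypothetical sample rows, including deliberately bad records.
SAMPLE_ROWS = [
    {"branch": "Retail", "revenue": 120.0},
    {"branch": "Corporate", "revenue": None},
    {"branch": "Unknown", "revenue": -5.0},
    {"branch": "Technical", "revenue": 40.0},
]

if __name__ == "__main__":
    for name, rate in evaluate(RULES, SAMPLE_ROWS).items():
        print(f"{name}: {rate:.0%} of rows pass")
```

Keeping each rule as a named object with an owner mirrors the idea that every data asset has quality rules and an assigned business owner; running `evaluate` on a schedule would give a simple form of continuous monitoring.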
- Project geography
-
The project scope covers all branches of JSC "Kazakhtelecom": the Retail, Corporate, and Technical business lines, the Information Technology division, and the Central division, which maintains the functioning of all the organization's processes. Overall, the project can be summarized by the following metrics:
1) More than 100 internal and external sources producing about 4 petabytes of data were integrated into Data Factory;
2) More than 8000 raw tables were converted into data models and data marts with descriptions, quality rules, and assigned business owners recorded in the data governance tools;
3) More than 10 data governance processes were formulated, formalized, and implemented in day-to-day operations;
4) More than 100 people (internal employees, integrators, and consultants) took part in the project implementation;
5) The full project took a year and a half, including strategy development and implementation.