Instance name | Region | Public IPv4 address | Public IPv4 DNS | Private IPv4 | pem file name
Instance_title | Seoul | 3.33.33.3xx | ec2-3-xx-xx-xxx.ap-northeast-2.compute.amazonaws.com | 172.31.xx.xxx | student-key.pem

 

1. Launch a terminal

   - On Windows, use PowerShell. (If you connect with PuTTY instead, you need a ppk file rather than a pem file.)

   - On macOS, use the default Terminal, iTerm, or whichever terminal you normally use.

 

2. Go into the .ssh folder

   - If there is no .ssh folder, create one.

 

3. Connect with the ssh command

   - ssh -i "pem file name" ubuntu@<public IPv4 address or public IPv4 DNS>

   - Example: ssh -i "student-key.pem" ubuntu@ec2-15-152-xxx-xxx.ap-northeast-3.compute.amazonaws.com

 

Windows

In C drive - Users - <computer name> (if the name is in Korean, change it to English), create a new folder named .ssh.

Put the student-key.pem file into that folder.

 

In Windows PowerShell:

cd .ssh

dir

(dir should list the student-key.pem file you placed in .ssh.)

 


https://www.youtube.com/watch?v=1ejIgrDDEhU 

 

https://www.youtube.com/watch?v=yyFXx8v6LbU 

 

This was a remarkable insight. It discusses the relationship between demographic structure and microeconomics: when the birth rate falls and the working-age population shrinks, wages rise because labor becomes more valuable and unions grow stronger, and prices rise accordingly.

 

Korea is facing a demographic cliff and an aging population, and China's population has fallen to second place; the whole world is going through this. What is striking is that Japan, which we regarded as the symbol of stagnation, has actually held up well through productivity gains such as robotics. While other countries are only now entering their demographic slump, Japan is said to have endured 'that generation' and to be heading toward stabilization. What I also found fascinating is the prediction that the end state is simply moving past aging and living with a smaller population.

 

 


https://journals.sagepub.com/doi/pdf/10.1177/0022242920972932

 

 

 

 

 


http://www.globalhha.com/doclib/data/upload/doc_con/5e50c522eeb91.pdf

 

 

 


 

Machine Learning Operations (MLOps): Overview, Definition, and Architecture

 

https://arxiv.org/abs/2205.02302

 

ABSTRACT

The final goal of all industrial machine learning (ML) projects is to develop ML products and rapidly bring them into production. However, it is highly challenging to automate and operationalize ML products and thus many ML endeavors fail to deliver on their expectations. The paradigm of Machine Learning Operations (MLOps) addresses this issue. MLOps includes several aspects, such as best practices, sets of concepts, and development culture. However, MLOps is still a vague term and its consequences for researchers and professionals are ambiguous. To address this gap, we conduct mixed-method research, including a literature review, a tool review, and expert interviews. As a result of these investigations, we provide an aggregated overview of the necessary principles, components, and roles, as well as the associated architecture and workflows. Furthermore, we furnish a definition of MLOps and highlight open challenges in the field. Finally, this work provides guidance for ML researchers and practitioners who want to automate and operate their ML products with a designated set of technologies.

 

KEYWORDS

CI/CD, DevOps, Machine Learning, MLOps, Operations, Workflow Orchestration

 

1 Introduction

Machine Learning (ML) has become an important technique to leverage the potential of data and allows businesses to be more innovative [1], efficient [13], and sustainable [22]. However, the success of many productive ML applications in real-world settings falls short of expectations [21]. A large number of ML projects fail—with many ML proofs of concept never progressing as far as production [30]. From a research perspective, this does not come as a surprise as the ML community has focused extensively on the building of ML models, but not on (a) building production-ready ML products and (b) providing the necessary coordination of the resulting, often complex ML system components and infrastructure, including the roles required to automate and operate an ML system in a real-world setting [35]. For instance, in many industrial applications, data scientists still manage ML workflows manually to a great extent, resulting in many issues during the operations of the respective ML solution [26].

To address these issues, the goal of this work is to examine how manual ML processes can be automated and operationalized so that more ML proofs of concept can be brought into production. In this work, we explore the emerging ML engineering practice “Machine Learning Operations”—MLOps for short—precisely addressing the issue of designing and maintaining productive ML. We take a holistic perspective to gain a common understanding of the involved components, principles, roles, and architectures. While existing research sheds some light on various specific aspects of MLOps, a holistic conceptualization, generalization, and clarification of ML systems design are still missing. Different perspectives and conceptions of the term “MLOps” might lead to misunderstandings and miscommunication, which, in turn, can lead to errors in the overall setup of the entire ML system. Thus, we ask the research question:

RQ: What is MLOps?

To answer that question, we conduct a mixed-method research endeavor to (a) identify important principles of MLOps, (b) carve out functional core components, (c) highlight the roles necessary to successfully implement MLOps, and (d) derive a general architecture for ML systems design. In combination, these insights result in a definition of MLOps, which contributes to a common understanding of the term and related concepts.

In so doing, we hope to positively impact academic and practical discussions by providing clear guidelines for professionals and researchers alike with precise responsibilities. These insights can assist in allowing more proofs of concept to make it into production by having fewer errors in the system’s design and, finally, enabling more robust predictions in real-world environments.

The remainder of this work is structured as follows. We will first elaborate on the necessary foundations and related work in the field. Next, we will give an overview of the utilized methodology, consisting of a literature review, a tool review, and an interview study. We then present the insights derived from the application of the methodology and conceptualize these by providing a unifying definition. We conclude the paper with a short summary, limitations, and outlook.

 

2 Foundations of DevOps

In the past, different software process models and development methodologies surfaced in the field of software engineering. Prominent examples include waterfall [37] and the agile manifesto [5]. Those methodologies have similar aims, namely to deliver production-ready software products. A concept called “DevOps” emerged in the years 2008/2009 and aims to reduce issues in software development [9,31]. DevOps is more than a pure methodology and rather represents a paradigm addressing social and technical issues in organizations engaged in software development. It has the goal of eliminating the gap between development and operations and emphasizes collaboration, communication, and knowledge sharing. It ensures automation with continuous integration, continuous delivery, and continuous deployment (CI/CD), thus allowing for fast, frequent, and reliable releases. Moreover, it is designed to ensure continuous testing, quality assurance, continuous monitoring, logging, and feedback loops. Due to the commercialization of DevOps, many DevOps tools are emerging, which can be differentiated into six groups [23,28]: collaboration and knowledge sharing (e.g., Slack, Trello, GitLab wiki), source code management (e.g., GitHub, GitLab), build process (e.g., Maven), continuous integration (e.g., Jenkins, GitLab CI), deployment automation (e.g., Kubernetes, Docker), monitoring and logging (e.g., Prometheus, Logstash). Cloud environments are increasingly equipped with ready-to-use DevOps tooling that is designed for cloud use, facilitating the efficient generation of value [38]. With this novel shift towards DevOps, developers need to care about what they develop, as they need to operate it as well. As empirical results demonstrate, DevOps ensures better software quality [34]. People in the industry, as well as academics, have gained a wealth of experience in software engineering using DevOps. This experience is now being used to automate and operationalize ML.

 

3 Methodology

To derive insights from the academic knowledge base while also drawing upon the expertise of practitioners from the field, we apply a mixed-method approach, as depicted in Figure 1. As a first step, we conduct a structured literature review [20,43] to obtain an overview of relevant research. Furthermore, we review relevant tooling support in the field of MLOps to gain a better understanding of the technical components involved. Finally, we conduct semi-structured interviews [33,39] with experts from different domains. On that basis, we conceptualize the term “MLOps” and elaborate on our findings by synthesizing literature and interviews in the next chapter (“Results”).

3.1 Literature Review

To ensure that our results are based on scientific knowledge, we conduct a systematic literature review according to the method of Webster and Watson [43] and Kitchenham et al. [20]. After an initial exploratory search, we define our search query as follows: ((("DevOps" OR "CICD" OR "Continuous Integration" OR "Continuous Delivery" OR "Continuous Deployment") AND "Machine Learning") OR "MLOps" OR "CD4ML"). We query the scientific databases of Google Scholar, Web of Science, Science Direct, Scopus, and the Association for Information Systems eLibrary. It should be mentioned that the use of DevOps for ML, MLOps, and continuous practices in combination with ML is a relatively new field in academic literature. Thus, only a few peer-reviewed studies are available at the time of this research. Nevertheless, to gain experience in this area, the search included non-peer-reviewed literature as well. The search was performed in May 2021 and resulted in 1,864 retrieved articles. Of those, we screened 194 papers in detail. From that group, 27 articles were selected based on our inclusion and exclusion criteria (e.g., the term MLOps or DevOps and CI/CD in combination with ML was described in detail, the article was written in English, etc.). All 27 of these articles were peer-reviewed.

 

 

3.2 Tool Review

After going through 27 articles and eight interviews, various open-source tools, frameworks, and commercial cloud ML services were identified. These tools, frameworks, and ML services were reviewed to gain an understanding of the technical components of which they consist. An overview of the identified tools is depicted in Table 1 of the Appendix.

 

3.3 Interview Study

To answer the research questions with insights from practice, we conduct semi-structured expert interviews according to Myers and Newman [33]. One major aspect in the research design of expert interviews is choosing an appropriate sample size [8]. We apply a theoretical sampling approach [12], which allows us to choose experienced interview partners to obtain high-quality data. Such data can provide meaningful insights with a limited number of interviews. To get an adequate sample group and reliable insights, we use LinkedIn—a social network for professionals—to identify experienced ML professionals with profound MLOps knowledge on a global level. To gain insights from various perspectives, we choose interview partners from different organizations and industries, different countries and nationalities, as well as different genders. Interviews are conducted until no new categories and concepts emerge in the analysis of the data. In total, we conduct eight interviews with experts (α - θ), whose details are depicted in Table 2 of the Appendix. According to Glaser and Strauss [5, p.61], this stage is called “theoretical saturation.” All interviews are conducted between June and August 2021. With regard to the interview design, we prepare a semi-structured guide with several questions, documented as an interview script [33]. During the interviews, “soft laddering” is used with “how” and “why” questions to probe the interviewees’ means-end chain [39]. This methodical approach allowed us to gain additional insight into the experiences of the interviewees when required. All interviews are recorded and then transcribed. To evaluate the interview transcripts, we use an open coding scheme [8].

 

4 Results

We apply the described methodology and structure our resulting insights into a presentation of important principles, their resulting instantiation as components, the description of necessary roles, as well as a suggestion for the architecture and workflow resulting from the combination of these aspects. Finally, we derive the conceptualization of the term and provide a definition of MLOps.

4.1 Principles

A principle is viewed as a general or basic truth, a value, or a guide for behavior. In the context of MLOps, a principle is a guide to how things should be realized in MLOps and is closely related to the term “best practices” from the professional sector. Based on the outlined methodology, we identified nine principles required to realize MLOps. Figure 2 provides an illustration of these principles and links them to the components with which they are associated.

 

P1 CI/CD automation.

CI/CD automation provides continuous integration, continuous delivery, and continuous deployment. It carries out the build, test, delivery, and deploy steps. It provides fast feedback to developers regarding the success or failure of certain steps, thus increasing the overall productivity [15,17,26,27,35,42,46] [α, β, θ].

P2 Workflow orchestration.

Workflow orchestration coordinates the tasks of an ML workflow pipeline according to directed acyclic graphs (DAGs). DAGs define the task execution order by considering relationships and dependencies [14,17,26,32,40,41] [α, β, γ, δ, ζ, η].

P3 Reproducibility.

Reproducibility is the ability to reproduce an ML experiment and obtain the exact same results [14,32,40,46] [α, β, δ, ε, η].

P4 Versioning.

Versioning ensures the versioning of data, model, and code to enable not only reproducibility, but also traceability (for compliance and auditing reasons) [14,32,40,46] [α, β, δ, ε, η].

P5 Collaboration.

Collaboration ensures the possibility to work collaboratively on data, model, and code. Besides the technical aspect, this principle emphasizes a collaborative and communicative work culture aiming to reduce domain silos between different roles [14,26,40] [α, δ, θ].

P6 Continuous ML training & evaluation.

Continuous training means periodic retraining of the ML model based on new feature data. Continuous training is enabled through the support of a monitoring component, a feedback loop, and an automated ML workflow pipeline. Continuous training always includes an evaluation run to assess the change in model quality [10,17,19,46] [β, δ, η, θ].

P7 ML metadata tracking/logging.

Metadata is tracked and logged for each orchestrated ML workflow task. Metadata tracking and logging is required for each training job iteration (e.g., training date and time, duration, etc.), including the model specific metadata—e.g., used parameters and the resulting performance metrics, model lineage: data and code used—to ensure the full traceability of experiment runs [26,27,29,32,35] [α, β, δ, ε, ζ, η, θ].

P8 Continuous monitoring.

Continuous monitoring implies the periodic assessment of data, model, code, infrastructure resources, and model serving performance (e.g., prediction accuracy) to detect potential errors or changes that influence the product quality [4,7,10,27,29,42,46] [α, β, γ, δ, ε, ζ, η].

P9 Feedback loops.

Multiple feedback loops are required to integrate insights from the quality assessment step into the development or engineering process (e.g., a feedback loop from the experimental model engineering stage to the previous feature engineering stage). Another feedback loop is required from the monitoring component (e.g., observing the model serving performance) to the scheduler to enable the retraining [4,6,7,17,27,46] [α, β, δ, ζ, η, θ].

 

4.2 Technical Components

After identifying the principles that need to be incorporated into MLOps, we now elaborate on the precise components and implement them in the ML systems design. In the following, the components are listed and described in a generic way with their essential functionalities. The references in brackets refer to the respective principles that the technical components are implementing.

C1 CI/CD Component (P1, P6, P9).

The CI/CD component ensures continuous integration, continuous delivery, and continuous deployment. It takes care of the build, test, delivery, and deploy steps. It provides rapid feedback to developers regarding the success or failure of certain steps, thus increasing the overall productivity [10,15,17,26,35,46] [α, β, γ, ε, ζ, η]. Examples are Jenkins [17,26] and GitHub Actions [η].

C2 Source Code Repository (P4, P5).

The source code repository ensures code storing and versioning. It allows multiple developers to commit and merge their code [17,25,42,44,46] [α, β, γ, ζ, θ]. Examples include Bitbucket [11] [ζ], GitLab [11,17] [ζ], GitHub [25] [ζ, η], and Gitea [46].

C3 Workflow Orchestration Component (P2, P3, P6).

The workflow orchestration component offers task orchestration of an ML workflow via directed acyclic graphs (DAGs). These graphs represent execution order and artifact usage of single steps of the workflow [26,32,35,40,41,46] [α, β, γ, δ, ε, ζ, η]. Examples include Apache Airflow [α, ζ], Kubeflow Pipelines [ζ], Luigi [ζ], AWS SageMaker Pipelines [β], and Azure Pipelines [ε].
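
To make the DAG idea concrete, here is a minimal sketch assuming Apache Airflow 2.x (one of the orchestrators listed above); the task names and the weekly schedule are placeholders, not something prescribed by the paper.

```python
# Minimal Airflow 2.x sketch: an ML workflow expressed as a DAG whose edges
# define the task execution order. Task bodies are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract_features():
    print("pull versioned features from the feature store")

def train_model():
    print("train the model on the extracted features")

def evaluate_model():
    print("evaluate the trained model before registering it")

with DAG(
    dag_id="ml_workflow_pipeline",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@weekly",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract_features", python_callable=extract_features)
    train = PythonOperator(task_id="train_model", python_callable=train_model)
    evaluate = PythonOperator(task_id="evaluate_model", python_callable=evaluate_model)

    extract >> train >> evaluate  # the DAG: extract -> train -> evaluate
```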

C4 Feature Store System (P3, P4).

A feature store system ensures central storage of commonly used features. It has two databases configured: One database as an offline feature store to serve features with normal latency for experimentation, and one database as an online store to serve features with low latency for predictions in production [10,14] [α, β, ζ, ε, θ]. Examples include Google Feast [ζ], Amazon AWS Feature Store [β, ζ], Tecton.ai and Hopsworks.ai [ζ]. This is where most of the data for training ML models will come from. Moreover, data can also come directly from any kind of data store.
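
As an illustration of the offline/online split, here is a deliberately simplified, framework-free Python sketch; a real setup would use one of the feature stores named above, and the entity and feature names are invented for the example.

```python
import pandas as pd

class SimpleFeatureStore:
    """Toy feature store with an offline store (full history, normal latency)
    and an online store (latest values per entity, low latency)."""

    def __init__(self, offline_df: pd.DataFrame, key: str):
        self.offline = offline_df
        # keep only the most recent row per entity in memory for fast lookups
        self.online = (
            offline_df.sort_values("event_time").groupby(key).last().to_dict("index")
        )

    def get_training_data(self) -> pd.DataFrame:
        return self.offline              # offline store: experimentation/training

    def get_online_features(self, entity_id) -> dict:
        return self.online[entity_id]    # online store: low-latency serving

store = SimpleFeatureStore(
    pd.DataFrame({
        "customer_id": [1, 1, 2],
        "event_time": pd.to_datetime(["2021-06-01", "2021-06-02", "2021-06-02"]),
        "avg_order_value": [42.0, 45.5, 13.0],
    }),
    key="customer_id",
)
print(store.get_online_features(1))      # latest features for customer 1
```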

C5 Model Training Infrastructure (P6).

The model training infrastructure provides the foundational computation resources, e.g., CPUs, RAM, and GPUs. The provided infrastructure can be either distributed or non-distributed. In general, a scalable and distributed infrastructure is recommended [7,10,24–26,29,40,45,46] [δ, ζ, η, θ]. Examples include local machines (not scalable) or cloud computation [7] [η, θ], as well as non-distributed or distributed computation (several worker nodes) [25,27]. Frameworks supporting computation are Kubernetes [η, θ] and Red Hat OpenShift [γ].

C6 Model Registry (P3, P4).

The model registry centrally stores the trained ML models together with their metadata. It has two main functionalities: storing the ML artifact and storing the ML metadata (see C7) [4,6,14,17,26,27] [α, β, γ, ε, ζ, η, θ]. Advanced storage examples include MLflow [α, η, ζ], AWS SageMaker Model Registry [ζ], Microsoft Azure ML Model Registry [ζ], and Neptune.ai [α]. Simple storage examples include Microsoft Azure Storage, Google Cloud Storage, and Amazon AWS S3 [17].

C7 ML Metadata Stores (P4, P7).

ML metadata stores allow for the tracking of various kinds of metadata, e.g., for each orchestrated ML workflow pipeline task. Another metadata store can be configured within the model registry for tracking and logging the metadata of each training job (e.g., training date and time, duration, etc.), including the model specific metadata—e.g., used parameters and the resulting performance metrics, model lineage: data and code used [14,25–27,32] [α, β, δ, ζ, θ]. Examples include orchestrators with built-in metadata stores tracking each step of experiment pipelines [α] such as Kubeflow Pipelines [α,ζ], AWS SageMaker Pipelines [α,ζ], Azure ML, and IBM Watson Studio [γ]. MLflow provides an advanced metadata store in combination with the model registry [32,35].
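
A minimal sketch of such metadata tracking with MLflow (named above as an advanced metadata store combined with the model registry); the parameter, tag, and metric names are invented for illustration.

```python
import mlflow

with mlflow.start_run(run_name="training-job-42"):
    mlflow.log_param("n_estimators", 200)            # model-specific parameter
    mlflow.log_param("training_data_version", "v3")  # lineage: data version used
    mlflow.set_tag("git_commit", "abc123")           # lineage: code version used
    # ... model training happens here ...
    mlflow.log_metric("rmse", 0.82)                  # resulting performance metric
```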

C8 Model Serving Component (P1).

The model serving component can be configured for different purposes. Examples are online inference for real-time predictions or batch inference for predictions using large volumes of input data. The serving can be provided, e.g., via a REST API. As a foundational infrastructure layer, a scalable and distributed model serving infrastructure is recommended [7,11,25,40,45,46] [α, β, δ, ζ, η, θ]. One example of a model serving component configuration is the use of Kubernetes and Docker technology to containerize the ML model, and leveraging a Python web application framework like Flask [17] with an API for serving [α]. Other Kubernetes supported frameworks are KFServing of Kubeflow [α], TensorFlow Serving, and Seldon.io serving [40]. Inferencing could also be realized with Apache Spark for batch predictions [θ]. Examples of cloud services include Microsoft Azure ML REST API [ε], AWS SageMaker Endpoints [α, β], IBM Watson Studio [γ], and Google Vertex AI prediction service [δ].
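
As a sketch of the Flask-based configuration mentioned above, the following containerizable web app exposes a REST endpoint for online inference; the model file name and the request format are assumptions made for the example.

```python
import joblib
from flask import Flask, jsonify, request

app = Flask(__name__)
model = joblib.load("model.joblib")  # trained model pulled from the model registry

@app.route("/predict", methods=["POST"])
def predict():
    features = request.get_json()["features"]        # e.g. [[0.1, 3.2, ...]]
    prediction = model.predict(features).tolist()
    return jsonify({"prediction": prediction})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)               # typically run inside a container
```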

C9 Monitoring Component (P8, P9).

The monitoring component takes care of the continuous monitoring of the model serving performance (e.g., prediction accuracy). Additionally, monitoring of the ML infrastructure, CI/CD, and orchestration are required [7,10,17,26,29,36,46] [α, ζ, η, θ]. Examples include Prometheus with Grafana [η, ζ], the ELK stack (Elasticsearch, Logstash, and Kibana) [α, η, ζ], and simply TensorBoard [θ]. Examples with built-in monitoring capabilities are Kubeflow [θ], MLflow [η], and AWS SageMaker Model Monitor or CloudWatch [ζ].
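
A small sketch of what such monitoring can look like with the Prometheus Python client (Prometheus with Grafana is listed above); the metric names and the accuracy gauge are illustrative choices, not part of the paper.

```python
from prometheus_client import Counter, Gauge, Histogram, start_http_server

PREDICTIONS = Counter("predictions_total", "Prediction requests served")
LATENCY = Histogram("prediction_latency_seconds", "Prediction latency in seconds")
ACCURACY = Gauge("model_accuracy", "Accuracy from the most recent evaluation run")

def serve_prediction(model, features):
    with LATENCY.time():                 # record model serving latency
        prediction = model.predict(features)
    PREDICTIONS.inc()
    return prediction

start_http_server(8000)                  # Prometheus scrapes :8000/metrics
```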

4.3 Roles

After describing the principles and their resulting instantiation of components, we identify necessary roles in order to realize MLOps in the following. MLOps is an interdisciplinary group process, and the interplay of different roles is crucial to design, manage, automate, and operate an ML system in production. In the following, every role, its purpose, and related tasks are briefly described:

R1 Business Stakeholder (similar roles: Product Owner, Project Manager).

The business stakeholder defines the business goal to be achieved with ML and takes care of the communication side of the business, e.g., presenting the return on investment (ROI) generated with an ML product [17,24,26] [α, β, δ, θ].

R2 Solution Architect (similar role: IT Architect).

The solution architect designs the architecture and defines the technologies to be used, following a thorough evaluation [17,27] [α, ζ].

R3 Data Scientist (similar roles: ML Specialist, ML Developer).

The data scientist translates the business problem into an ML problem and takes care of the model engineering, including the selection of the best-performing algorithm and hyperparameters [7,14,26,29] [α, β, γ, δ, ε, ζ, η, θ].

R4 Data Engineer (similar role: DataOps Engineer).

The data engineer builds up and manages data and feature engineering pipelines. Moreover, this role ensures proper data ingestion to the databases of the feature store system [14,29,41] [α, β, γ, δ, ε, ζ, η, θ].

R5 Software Engineer.

The software engineer applies software design patterns, widely accepted coding guidelines, and best practices to turn the raw ML problem into a well-engineered product [29] [α, γ].

R6 DevOps Engineer.

The DevOps engineer bridges the gap between development and operations and ensures proper CI/CD automation, ML workflow orchestration, model deployment to production, and monitoring [14–16,26] [α, β, γ, ε, ζ, η, θ].

R7 ML Engineer/MLOps Engineer.

The ML engineer or MLOps engineer combines aspects of several roles and thus has cross-domain knowledge. This role incorporates skills from data scientists, data engineers, software engineers, DevOps engineers, and backend engineers (see Figure 3). This cross-domain role builds up and operates the ML infrastructure, manages the automated ML workflow pipelines and model deployment to production, and monitors both the model and the ML infrastructure [14,17,26,29] [α, β, γ, δ, ε, ζ, η, θ].

5 Architecture and Workflow

On the basis of the identified principles, components, and roles, we derive a generalized MLOps end-to-end architecture to give ML researchers and practitioners proper guidance. It is depicted in Figure 4. Additionally, we depict the workflows, i.e., the sequence in which the different tasks are executed in the different stages. The artifact was designed to be technology-agnostic. Therefore, ML researchers and practitioners can choose the best-fitting technologies and frameworks for their needs.

As depicted in Figure 4, we illustrate an end-to-end process, from MLOps project initiation to the model serving. It includes (A) the MLOps project initiation steps; (B) the feature engineering pipeline, including the data ingestion to the feature store; (C) the experimentation; and (D) the automated ML workflow pipeline up to the model serving.

(A) MLOps project initiation.

(1) The business stakeholder (R1) analyzes the business and identifies a potential business problem that can be solved using ML. (2) The solution architect (R2) defines the architecture design for the overall ML system and decides on the technologies to be used after a thorough evaluation. (3) The data scientist (R3) derives an ML problem—such as whether regression or classification should be used—from the business goal. (4) The data engineer (R4) and the data scientist (R3) work together in an effort to understand which data is required to solve the problem. (5) Once the answers are clarified, the data engineer (R4) and data scientist (R3) collaborate to locate the raw data sources for the initial data analysis. They check the distribution and quality of the data and perform validation checks. Furthermore, they ensure that the incoming data from the data sources is labeled, meaning that a target attribute is known, as this is a mandatory requirement for supervised ML. In this example, the data sources already had labeled data available as the labeling step was covered during an upstream process.

(B1) Requirements for feature engineering pipeline.

The features are the relevant attributes required for model training. After the initial understanding of the raw data and the initial data analysis, the fundamental requirements for the feature engineering pipeline are defined, as follows: (6) The data engineer (R4) defines the data transformation rules (normalization, aggregations) and cleaning rules to bring the data into a usable format. (7) The data scientist (R3) and data engineer (R4) together define the feature engineering rules, such as the calculation of new and more advanced features based on other features. These initially defined rules must be iteratively adjusted by the data scientist (R3) either based on the feedback coming from the experimental model engineering stage or from the monitoring component observing the model performance. 

(B2) Feature engineering pipeline.

The initially defined requirements for the feature engineering pipeline are taken by the data engineer (R4) and software engineer (R5) as a starting point to build up the prototype of the feature engineering pipeline. The initially defined requirements and rules are updated according to the iterative feedback coming either from the experimental model engineering stage or from the monitoring component observing the model’s performance in production. As a foundational requirement, the data engineer (R4) defines the code required for the CI/CD (C1) and orchestration component (C3) to ensure the task orchestration of the feature engineering pipeline. This role also defines the underlying infrastructure resource configuration. (8) First, the feature engineering pipeline connects to the raw data, which can be (for instance) streaming data, static batch data, or data from any cloud storage. (9) The data will be extracted from the data sources. (10) The data preprocessing begins with data transformation and cleaning tasks. The transformation rule artifact defined in the requirement gathering stage serves as input for this task, and the main aim of this task is to bring the data into a usable format. These transformation rules are continuously improved based on the feedback.

(11) The feature engineering task calculates new and more advanced features based on other features. The predefined feature engineering rules serve as input for this task. These feature engineering rules are continuously improved based on the feedback. (12) Lastly, a data ingestion job loads batch or streaming data into the feature store system (C4). The target can either be the offline or online database (or any kind of data store).

(C) Experimentation.

Most tasks in the experimentation stage are led by the data scientist (R3). The data scientist is supported by the software engineer (R5). (13) The data scientist (R3) connects to the feature store system (C4) for the data analysis. (Alternatively, the data scientist (R3) can also connect to the raw data for an initial analysis.) In case of any required data adjustments, the data scientist (R3) reports the required changes back to the data engineering zone (feedback loop). 

(14) Then the preparation and validation of the data coming from the feature store system is required. This task also includes the train and test split dataset creation. (15) The data scientist (R3) estimates the best-performing algorithm and hyperparameters, and the model training is then triggered with the training data (C5). The software engineer (R5) supports the data scientist (R3) in the creation of well-engineered model training code. (16) Different model parameters are tested and validated interactively during several rounds of model training. Once the performance metrics indicate good results, the iterative training stops. The best-performing model parameters are identified via parameter tuning. The model training task and model validation task are then iteratively repeated; together, these tasks can be called “model engineering.” The model engineering aims to identify the best-performing algorithm and hyperparameters for the model. (17) The data scientist (R3) exports the model and commits the code to the repository.
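
A compact sketch of steps (14)–(17) in scikit-learn; the synthetic data, the random forest, and the parameter grid are stand-ins chosen for the example, not choices made by the paper.

```python
import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# (14) data preparation: train and test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# (15)-(16) model engineering: iterative training and parameter tuning
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [100, 200], "max_depth": [None, 10]},
    cv=3,
)
search.fit(X_train, y_train)
print("test accuracy:", search.score(X_test, y_test))

# (17) export the best-performing model before committing the code
joblib.dump(search.best_estimator_, "model.joblib")
```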

As a foundational requirement, either the DevOps engineer (R6) or the ML engineer (R7) defines the code for the (C2) automated ML workflow pipeline and commits it to the repository. Once either the data scientist (R3) commits a new ML model or the DevOps engineer (R6) and the ML engineer (R7) commit new ML workflow pipeline code to the repository, the CI/CD component (C1) detects the updated code and automatically triggers the CI/CD pipeline carrying out the build, test, and delivery steps. The build step creates artifacts containing the ML model and tasks of the ML workflow pipeline. The test step validates the ML model and ML workflow pipeline code. The delivery step pushes the versioned artifact(s)—such as images—to the artifact store (e.g., image registry).

(D) Automated ML workflow pipeline.

The DevOps engineer (R6) and the ML engineer (R7) take care of the management of the automated ML workflow pipeline. They also manage the underlying model training infrastructure in the form of hardware resources and frameworks supporting computation such as Kubernetes (C5). The workflow orchestration component (C3) orchestrates the tasks of the automated ML workflow pipeline. For each task, the required artifacts (e.g., images) are pulled from the artifact store (e.g., image registry). Each task can be executed via an isolated environment (e.g., containers). Finally, the workflow orchestration component (C3) gathers metadata for each task in the form of logs, completion time, and so on.

Once the automated ML workflow pipeline is triggered, each of the following tasks is managed automatically: (18) automated pulling of the versioned features from the feature store systems (data extraction). Depending on the use case, features are extracted from either the offline or online database (or any kind of data store). (19) Automated data preparation and validation; in addition, the train and test split is defined automatically. (20) Automated final model training on new unseen data (versioned features). The algorithm and hyperparameters are already predefined based on the settings of the previous experimentation stage. The model is retrained and refined. (21) Automated model evaluation and iterative adjustments of hyperparameters are executed, if required. Once the performance metrics indicate good results, the automated iterative training stops. The automated model training task and the automated model validation task can be iteratively repeated until a good result has been achieved. (22) The trained model is then exported and (23) pushed to the model registry (C6), where it is stored e.g., as code or containerized together with its associated configuration and environment files.

For all training job iterations, the ML metadata store (C7) records metadata such as parameters to train the model and the resulting performance metrics. This also includes the tracking and logging of the training job ID, training date and time, duration, and sources of artifacts. Additionally, the model-specific metadata called "model lineage," combining the lineage of data and code, is tracked for each newly registered model. This includes the source and version of the feature data and model training code used to train the model. Also, the model version and status (e.g., staging or production-ready) are recorded.

Once the status of a well-performing model is switched from staging to production, it is automatically handed over to the DevOps engineer or ML engineer for model deployment. From there, the (24) CI/CD component (C1) triggers the continuous deployment pipeline. The production-ready ML model and the model serving code are pulled (initially prepared by the software engineer (R5)). The continuous deployment pipeline carries out the build and test step of the ML model and serving code and deploys the model for production serving. The (25) model serving component (C8) makes predictions on new, unseen data coming from the feature store system (C4). This component can be designed by the software engineer (R5) as online inference for real-time predictions or as batch inference for predictions concerning large volumes of input data. For real-time predictions, features must come from the online database (low latency), whereas for batch predictions, features can be served from the offline database (normal latency). Model-serving applications are often configured within a container and prediction requests are handled via a REST API. As a foundational requirement, the ML engineer (R7) manages the model-serving computation infrastructure. The (26) monitoring component (C9) continuously observes the model-serving performance and infrastructure in real time. Once a certain threshold is reached, such as detection of low prediction accuracy, the information is forwarded via the feedback loop. The (27) feedback loop is connected to the monitoring component (C9) and ensures fast and direct feedback allowing for more robust and improved predictions. It enables continuous training, retraining, and improvement. With the support of the feedback loop, information is transferred from the model monitoring component to several upstream receiver points, such as the experimental stage, data engineering zone, and the scheduler (trigger). The feedback to the experimental stage is taken forward by the data scientist for further model improvements. The feedback to the data engineering zone allows for the adjustment of the features prepared for the feature store system. Additionally, the detection of concept drifts as a feedback mechanism can enable (28) continuous training. For instance, once the model-monitoring component (C9) detects a drift in the data [3], the information is forwarded to the scheduler, which then triggers the automated ML workflow pipeline for retraining (continuous training). A change in adequacy of the deployed model can be detected using distribution comparisons to identify drift. Retraining is not only triggered automatically when a statistical threshold is reached; it can also be triggered when new feature data is available, or it can be scheduled periodically.
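
The drift-triggered retraining described in (26)–(28) can be sketched with a simple distribution comparison; the Kolmogorov–Smirnov test and the threshold below are one possible choice, not the method prescribed by the paper.

```python
import numpy as np
from scipy.stats import ks_2samp

def drift_detected(train_values: np.ndarray, live_values: np.ndarray,
                   p_threshold: float = 0.01) -> bool:
    """Compare the live distribution of one feature against the training distribution."""
    _statistic, p_value = ks_2samp(train_values, live_values)
    return p_value < p_threshold           # small p-value -> distributions differ

def monitor_and_maybe_retrain(train_values, live_values, trigger_retraining):
    # trigger_retraining stands in for the scheduler call that starts the
    # automated ML workflow pipeline (continuous training).
    if drift_detected(train_values, live_values):
        trigger_retraining()
```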

6 Conceptualization

With the findings at hand, we conceptualize the literature and interviews. It becomes obvious that the term MLOps is positioned at the intersection of machine learning, software engineering, DevOps, and data engineering (see Figure 5 in the Appendix). We define MLOps as follows:

MLOps (Machine Learning Operations) is a paradigm, including aspects like best practices, sets of concepts, as well as a development culture when it comes to the end-to-end conceptualization, implementation, monitoring, deployment, and scalability of machine learning products. Most of all, it is an engineering practice that leverages three contributing disciplines: machine learning, software engineering (especially DevOps), and data engineering. MLOps is aimed at productionizing machine learning systems by bridging the gap between development (Dev) and operations (Ops). Essentially, MLOps aims to facilitate the creation of machine learning products by leveraging these principles: CI/CD automation, workflow orchestration, reproducibility; versioning of data, model, and code; collaboration; continuous ML training and evaluation; ML metadata tracking and logging; continuous monitoring; and feedback loops.

7 Open Challenges

Several challenges for adopting MLOps have been identified after conducting the literature review, tool review, and interview study. These open challenges have been organized into the categories of organizational, ML system, and operational challenges.

Organizational challenges.

The mindset and culture of data science practice is a typical challenge in organizational settings [2]. As our insights from literature and interviews show, to successfully develop and run ML products, there needs to be a culture shift away from model-driven machine learning toward a product-oriented discipline [γ]. The recent trend of data-centric AI also addresses this aspect by putting more focus on the data-related aspects taking place prior to the ML model building. The roles associated with these activities, in particular, should have a product-focused perspective when designing ML products [γ]. A great number of skills and individual roles are required for MLOps [β]. As our identified sources point out, there is a lack of highly skilled experts for these roles—especially with regard to architects, data engineers, ML engineers, and DevOps engineers [29,41,44] [α, ε]. This is related to the necessary education of future professionals—as MLOps is typically not part of data science education [7] [γ]. Posoldova (2020) [35] further stresses this aspect by remarking that students should not only learn about model creation, but must also learn about technologies and components necessary to build functional ML products.

Data scientists alone cannot achieve the goals of MLOps. A multi-disciplinary team is required [14], thus MLOps needs to be a group process [α]. This is often hindered because teams work in silos rather than in cooperative setups [α]. Additionally, different knowledge levels and specialized terminologies make communication difficult. To lay the foundations for more fruitful setups, the respective decision-makers need to be convinced that an increased MLOps maturity and a product-focused mindset will yield clear business improvements [γ].

ML system challenges.

A major challenge with regard to MLOps systems is designing for fluctuating demand, especially in relation to the process of ML training [7]. This stems from potentially voluminous and varying data [10], which makes it difficult to precisely estimate the necessary infrastructure resources (CPU, RAM, and GPU) and requires a high level of flexibility in terms of scalability of the infrastructure [7,26] [δ].

Operational challenges.

In productive settings, it is challenging to operate ML manually due to different stacks of software and hardware components and their interplay. Therefore, robust automation is required [7,17]. Also, a constant incoming stream of new data forces retraining capabilities. This is a repetitive task which, again, requires a high level of automation [18] [θ]. These repetitive tasks yield a large number of artifacts that require a strong governance [24,29,40] as well as versioning of data, model, and code to ensure robustness and reproducibility [11,27,29]. Lastly, it is challenging to resolve a potential support request (e.g., by finding the root cause), as many parties and components are involved. Failures can be a combination of ML infrastructure and software [26].

8 Conclusion

With the increase of data availability and analytical capabilities, coupled with the constant pressure to innovate, more machine learning products than ever are being developed. However, only a small number of these proofs of concept progress into deployment and production. Furthermore, the academic space has focused intensively on machine learning model building and benchmarking, but too little on operating complex machine learning systems in real-world scenarios. In the real world, we observe data scientists still managing ML workflows manually to a great extent. The paradigm of Machine Learning Operations (MLOps) addresses these challenges. In this work, we shed more light on MLOps. By conducting a mixed-method study analyzing existing literature and tools, as well as interviewing eight experts from the field, we uncover four main aspects of MLOps: its principles, components, roles, and architecture. From these aspects, we infer a holistic definition. The results support a common understanding of the term MLOps and its associated concepts, and will hopefully assist researchers and professionals in setting up successful ML projects in the future.

 


Blog:

https://zzsza.github.io/mlops/2018/12/28/mlops/

 

Automating Machine Learning Operations: MLOps


 

Why do MLOps?

- Every tool ultimately comes down to its core function plus better performance or lower cost.

 

1. Drift

- Model drift and data drift are inherent, chronic problems of any model. Since you cannot stare at every individual data point each time drift occurs, you need an efficient data pipeline that can deal with it.

 

2. Lower labor costs

- Recurring development work can be addressed with interaction-centric AI, i.e., development driven through UI/UX. AI researchers are often said to command salaries in the hundreds of millions of won; turning the work into a process would lower the barrier to entry and reduce the skill level required.

 

3. Lower data costs

- Even data collection is a cost. What I have felt strongly in practice is that, to collaborate with the staff of poultry and fish farms, I had to continually educate, persuade, and drill into them the principles of AI, its scope of applicability, quality control of the data collection devices, labeling methods, and so on. There is a gap between the domain and AI. Turning reality into data was not easy.

 

- Preprocessing and labeling the collected data is also a significant cost. This part alone could be a product in its own right, but it cannot be left out of the overall model development process. You need to train on at least 30,000 samples per class, with 50,000–80,000 recommended. And when you hear about models that performed well, it usually turns out that they improved the labeling of borderline cases among the data coming in from the field.

https://multicore-it.com/84?category=686770 

 

Building a Better System, Explored Through a Project (Overview)


 

 

Excellent material on project planning and design.


      Paper link:

      Hidden Technical Debt in Machine Learning Systems (nips.cc)

       

      Search term: Technical Debt

       

       

      Abstract

       Machine learning offers a fantastically powerful toolkit for building useful complex prediction systems quickly. This paper argues it is dangerous to think of these quick wins as coming for free. Using the software engineering framework of technical debt, we find it is common to incur massive ongoing maintenance costs in real-world ML systems. We explore several ML-specific risk factors to account for in system design. These include boundary erosion, entanglement, hidden feedback loops, undeclared consumers, data dependencies, configuration issues, changes in the external world, and a variety of system-level anti-patterns.


      1 Introduction

      As the machine learning (ML) community continues to accumulate years of experience with live systems, a wide-spread and uncomfortable trend has emerged: developing and deploying ML systems is relatively fast and cheap, but maintaining them over time is difficult and expensive.

       

      This dichotomy can be understood through the lens of technical debt, a metaphor introduced by Ward Cunningham in 1992 to help reason about the long term costs incurred by moving quickly in software engineering. As with fiscal debt, there are often sound strategic reasons to take on technical debt. Not all debt is bad, but all debt needs to be serviced. Technical debt may be paid down by refactoring code, improving unit tests, deleting dead code, reducing dependencies, tightening APIs, and improving documentation [8]. The goal is not to add new functionality, but to enable future improvements, reduce errors, and improve maintainability. Deferring such payments results in compounding costs. Hidden debt is dangerous because it compounds silently.

       

      In this paper, we argue that ML systems have a special capacity for incurring technical debt, because they have all of the maintenance problems of traditional code plus an additional set of ML-specific issues. This debt may be difficult to detect because it exists at the system level rather than the code level. Traditional abstractions and boundaries may be subtly corrupted or invalidated by the fact that data influences ML system behavior. Typical methods for paying down code level technical debt are not sufficient to address ML-specific technical debt at the system level.

       

      This paper does not offer novel ML algorithms, but instead seeks to increase the community’s awareness of the difficult tradeoffs that must be considered in practice over the long term. We focus on system-level interactions and interfaces as an area where ML technical debt may rapidly accumulate. At a system-level, an ML model may silently erode abstraction boundaries. The tempting re-use or chaining of input signals may unintentionally couple otherwise disjoint systems. ML packages may be treated as black boxes, resulting in large masses of “glue code” or calibration layers that can lock in assumptions. Changes in the external world may influence system behavior in unintended ways. Even monitoring ML system behavior may prove difficult without careful design.


      2 Complex Models Erode Boundaries

      Traditional software engineering practice has shown that strong abstraction boundaries using encapsulation and modular design help create maintainable code in which it is easy to make isolated changes and improvements. Strict abstraction boundaries help express the invariants and logical consistency of the information inputs and outputs from a given component [8].

       

      Unfortunately, it is difficult to enforce strict abstraction boundaries for machine learning systems by prescribing specific intended behavior. Indeed, ML is required in exactly those cases when the desired behavior cannot be effectively expressed in software logic without dependency on external data. The real world does not fit into tidy encapsulation. Here we examine several ways that the resulting erosion of boundaries may significantly increase technical debt in ML systems.

       

      Entanglement.

      Machine learning systems mix signals together, entangling them and making isolation of improvements impossible. For instance, consider a system that uses features x_1, ..., x_n in a model. If we change the input distribution of values in x_1, the importance, weights, or use of the remaining n − 1 features may all change. This is true whether the model is retrained fully in a batch style or allowed to adapt in an online fashion. Adding a new feature x_{n+1} can cause similar changes, as can removing any feature x_j. No inputs are ever really independent. We refer to this here as the CACE principle: Changing Anything Changes Everything. CACE applies not only to input signals, but also to hyper-parameters, learning settings, sampling methods, convergence thresholds, data selection, and essentially every other possible tweak.
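
A toy numerical illustration of CACE (not from the paper): degrading only the input x_1 changes the learned weight of x_2, even though the relationship between x_2 and the target never changed.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
z = rng.normal(0, 1, 10_000)           # latent quantity the target depends on
y = z
x2 = z + rng.normal(0, 0.5, 10_000)    # a fixed, untouched input signal

def fitted_weights(x1_noise_scale):
    x1 = z + rng.normal(0, x1_noise_scale, 10_000)
    return LinearRegression().fit(np.column_stack([x1, x2]), y).coef_

print(fitted_weights(0.1))   # x1 is clean    -> most weight lands on x1
print(fitted_weights(2.0))   # x1 is degraded -> weight shifts onto x2
```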

       

      One possible mitigation strategy is to isolate models and serve ensembles. This approach is useful in situations in which sub-problems decompose naturally such as in disjoint multi-class settings like [14]. However, in many cases ensembles work well because the errors in the component models are uncorrelated. Relying on the combination creates a strong entanglement: improving an individual component model may actually make the system accuracy worse if the remaining errors are more strongly correlated with the other components.

       

      A second possible strategy is to focus on detecting changes in prediction behavior as they occur. One such method was proposed in [12], in which a high-dimensional visualization tool was used to allow researchers to quickly see effects across many dimensions and slicings. Metrics that operate on a slice-by-slice basis may also be extremely useful.

       

      Correction Cascades.

      There are often situations in which model m_a for problem A exists, but a solution for a slightly different problem A′ is required. In this case, it can be tempting to learn a model m′_a that takes m_a as input and learns a small correction as a fast way to solve the problem.

       

      However, this correction model has created a new system dependency on m_a, making it significantly more expensive to analyze improvements to that model in the future. The cost increases when correction models are cascaded, with a model for problem A′′ learned on top of m′_a, and so on, for several slightly different test distributions. Once in place, a correction cascade can create an improvement deadlock, as improving the accuracy of any individual component actually leads to system-level detriments. Mitigation strategies are to augment m_a to learn the corrections directly within the same model by adding features to distinguish among the cases, or to accept the cost of creating a separate model for A′.

       

      Undeclared Consumers.

      Oftentimes, a prediction from a machine learning model m_a is made widely accessible, either at runtime or by writing to files or logs that may later be consumed by other systems. Without access controls, some of these consumers may be undeclared, silently using the output of a given model as an input to another system. In more classical software engineering, these issues are referred to as visibility debt [13].

       

      Undeclared consumers are expensive at best and dangerous at worst, because they create a hidden tight coupling of model m_a to other parts of the stack. Changes to m_a will very likely impact these other parts, potentially in ways that are unintended, poorly understood, and detrimental. In practice, this tight coupling can radically increase the cost and difficulty of making any changes to m_a at all, even if they are improvements. Furthermore, undeclared consumers may create hidden feedback loops, which are described in more detail in section 4.

       

      Undeclared consumers may be difficult to detect unless the system is specifically designed to guard against this case, for example with access restrictions or strict service-level agreements (SLAs). In the absence of barriers, engineers will naturally use the most convenient signal at hand, especially when working against deadline pressures.

       


      3 Data Dependencies Cost More than Code Dependencies

      In [13], dependency debt is noted as a key contributor to code complexity and technical debt in classical software engineering settings. We have found that data dependencies in ML systems carry a similar capacity for building debt, but may be more difficult to detect. Code dependencies can be identified via static analysis by compilers and linkers. Without similar tooling for data dependencies, it can be inappropriately easy to build large data dependency chains that can be difficult to untangle.

       

      Unstable Data Dependencies.

      To move quickly, it is often convenient to consume signals as input features that are produced by other systems. However, some input signals are unstable, meaning that they qualitatively or quantitatively change behavior over time. This can happen implicitly, when the input signal comes from another machine learning model itself that updates over time, or a data-dependent lookup table, such as for computing TF/IDF scores or semantic mappings. It can also happen explicitly, when the engineering ownership of the input signal is separate from the engineering ownership of the model that consumes it. In such cases, updates to the input signal may be made at any time. This is dangerous because even “improvements” to input signals may have arbitrary detrimental effects in the consuming system that are costly to diagnose and address. For example, consider the case in which an input signal was previously mis-calibrated. The model consuming it likely fit to these mis-calibrations, and a silent update that corrects the signal will have sudden ramifications for the model.

       

      One common mitigation strategy for unstable data dependencies is to create a versioned copy of a given signal. For example, rather than allowing a semantic mapping of words to topic clusters to change over time, it might be reasonable to create a frozen version of this mapping and use it until such a time as an updated version has been fully vetted. Versioning carries its own costs, however, such as potential staleness and the cost to maintain multiple versions of the same signal over time.
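
A minimal sketch of this versioning mitigation, assuming a word-to-topic mapping stored as a JSON file; the file naming scheme and the mapping contents are invented for the example.

```python
import json

def freeze_signal_version(mapping: dict, version: str) -> str:
    """Write a frozen copy of the input signal under an explicit version."""
    path = f"word_to_topic.{version}.json"
    with open(path, "w") as f:
        json.dump(mapping, f)
    return path

def load_pinned_signal(version: str) -> dict:
    # The consuming model references this exact version; an updated mapping
    # only takes effect once it has been vetted and the pin is bumped.
    with open(f"word_to_topic.{version}.json") as f:
        return json.load(f)

freeze_signal_version({"inflation": "economics", "airflow": "mlops"}, version="2021-06")
mapping = load_pinned_signal("2021-06")
```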

       

      Underutilized Data Dependencies.

      In code, underutilized dependencies are packages that are mostly unneeded [13]. Similarly, underutilized data dependencies are input signals that provide little incremental modeling benefit. These can make an ML system unnecessarily vulnerable to change, sometimes catastrophically so, even though they could be removed with no detriment.

       

      As an example, suppose that to ease the transition from an old product numbering scheme to new product numbers, both schemes are left in the system as features. New products get only a new number, but old products may have both and the model continues to rely on the old numbers for some products. A year later, the code that stops populating the database with the old numbers is deleted. This will not be a good day for the maintainers of the ML system.

       

      Underutilized data dependencies can creep into a model in several ways.

        • Legacy Features. The most common case is that a feature F is included in a model early in its development. Over time, F is made redundant by new features but this goes undetected.

        • Bundled Features. Sometimes, a group of features is evaluated and found to be beneficial. Because of deadline pressures or similar effects, all the features in the bundle are added to the model together, possibly including features that add little or no value.

        • ε-Features. As machine learning researchers, it is tempting to improve model accuracy even when the accuracy gain is very small or when the complexity overhead might be high.

        • Correlated Features. Often two features are strongly correlated, but one is more directly causal. Many ML methods have difficulty detecting this and credit the two features equally, or may even pick the non-causal one. This results in brittleness if world behavior later changes the correlations.

       

      Underutilized dependencies can be detected via exhaustive leave-one-feature-out evaluations. These should be run regularly to identify and remove unnecessary features.
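
A sketch of such a leave-one-feature-out evaluation with scikit-learn; the synthetic data, the random forest, and the 0.005 tolerance are placeholders chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=8, n_informative=3, random_state=0)
baseline = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=5).mean()

for j in range(X.shape[1]):
    X_without_j = np.delete(X, j, axis=1)            # drop feature j
    score = cross_val_score(RandomForestClassifier(random_state=0), X_without_j, y, cv=5).mean()
    if score >= baseline - 0.005:                    # no meaningful loss without it
        print(f"feature {j} looks underutilized ({score:.3f} vs baseline {baseline:.3f})")
```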

      Figure 1: Only a small fraction of real-world ML systems is composed of the ML code, as shown by the small black box in the middle. The required surrounding infrastructure is vast and complex.

       

      Static Analysis of Data Dependencies.

      In traditional code, compilers and build systems perform static analysis of dependency graphs. Tools for static analysis of data dependencies are far less common, but are essential for error checking, tracking down consumers, and enforcing migration and updates. One such tool is the automated feature management system described in [12], which enables data sources and features to be annotated. Automated checks can then be run to ensure that all dependencies have the appropriate annotations, and dependency trees can be fully resolved. This kind of tooling can make migration and deletion much safer in practice.


      4 Feedback Loops

      One of the key features of live ML systems is that they often end up influencing their own behavior if they update over time. This leads to a form of analysis debt, in which it is difficult to predict the behavior of a given model before it is released. These feedback loops can take different forms, but they are all more difficult to detect and address if they occur gradually over time, as may be the case when models are updated infrequently.

       

      Direct Feedback Loops.

      A model may directly influence the selection of its own future training data. It is common practice to use standard supervised algorithms, although the theoretically correct solution would be to use bandit algorithms. The problem here is that bandit algorithms (such as contextual bandits [9]) do not necessarily scale well to the size of action spaces typically required for real-world problems. It is possible to mitigate these effects by using some amount of randomization [3], or by isolating certain parts of data from being influenced by a given model.
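As a hedged illustration of the randomization idea (not the bandit machinery of [9]), the sketch below adds an epsilon-greedy layer that occasionally ignores the model's ranking, so that a slice of future training data is collected independently of the model's own choices.

```python
# Sketch of exploration by randomization: with probability epsilon, serve a
# uniformly random item instead of the model's top choice, so that some
# future training data is not fully determined by the current model.
import random

_rng = random.Random(42)  # seeded only so the example is reproducible

def choose_item(scored_items, epsilon=0.05):
    """scored_items: list of (item, model_score) pairs."""
    if _rng.random() < epsilon:
        item, _ = _rng.choice(scored_items)              # exploration
        return item, "explored"
    item, _ = max(scored_items, key=lambda pair: pair[1])  # exploitation
    return item, "exploited"

items = [("ad_a", 0.91), ("ad_b", 0.40), ("ad_c", 0.13)]
for _ in range(5):
    print(choose_item(items))
```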

       

      Hidden Feedback Loops.

      Direct feedback loops are costly to analyze, but at least they pose a statistical challenge that ML researchers may find natural to investigate [3]. A more difficult case is hidden feedback loops, in which two systems influence each other indirectly through the world.

       

      One example of this may be if two systems independently determine facets of a web page, such as one selecting products to show and another selecting related reviews. Improving one system may lead to changes in behavior in the other, as users begin clicking more or less on the other components in reaction to the changes. Note that these hidden loops may exist between completely disjoint systems. Consider the case of two stock-market prediction models from two different investment companies. Improvements (or, more scarily, bugs) in one may influence the bidding and buying behavior of the other.


      5 ML-System Anti-Patterns

      It may be surprising to the academic community to know that only a tiny fraction of the code in many ML systems is actually devoted to learning or prediction – see Figure 1. In the language of Lin and Ryaboy, much of the remainder may be described as “plumbing” [11].

       

      It is unfortunately common for systems that incorporate machine learning methods to end up with high-debt design patterns. In this section, we examine several system-design anti-patterns [4] that can surface in machine learning systems and which should be avoided or refactored where possible.

       

      Glue Code.

      ML researchers tend to develop general purpose solutions as self-contained packages. A wide variety of these are available as open-source packages at places like mloss.org, or from in-house code, proprietary packages, and cloud-based platforms.

       

      Using generic packages often results in a glue code system design pattern, in which a massive amount of supporting code is written to get data into and out of general-purpose packages. Glue code is costly in the long term because it tends to freeze a system to the peculiarities of a specific package; testing alternatives may become prohibitively expensive. In this way, using a generic package can inhibit improvements, because it makes it harder to take advantage of domain-specific properties or to tweak the objective function to achieve a domain-specific goal. Because a mature system might end up being (at most) 5% machine learning code and (at least) 95% glue code, it may be less costly to create a clean native solution rather than re-use a generic package.

       

An important strategy for combating glue code is to wrap black-box packages into common APIs. This allows supporting infrastructure to be more reusable and reduces the cost of changing packages.
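A minimal sketch of this strategy, assuming two stand-in packages with different calling conventions: the surrounding infrastructure depends only on one small `predict` contract, so swapping the underlying package does not ripple through the glue code.

```python
# Sketch of wrapping heterogeneous packages behind one small interface.
from abc import ABC, abstractmethod

class Model(ABC):
    @abstractmethod
    def predict(self, features: dict) -> float:
        ...

class PackageAWrapper(Model):
    """Adapter around a hypothetical package that expects a feature vector."""
    def predict(self, features):
        vector = [features.get(k, 0.0) for k in ("f1", "f2")]
        return sum(vector) / len(vector)            # stand-in for package A's call

class PackageBWrapper(Model):
    """Adapter around a hypothetical package with a different convention."""
    def predict(self, features):
        return max(features.values(), default=0.0)  # stand-in for package B's call

def serve(model: Model, features: dict) -> float:
    # Serving code only ever sees the common interface.
    return model.predict(features)

print(serve(PackageAWrapper(), {"f1": 0.2, "f2": 0.8}))
print(serve(PackageBWrapper(), {"f1": 0.2, "f2": 0.8}))
```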

       

      Pipeline Jungles.

      As a special case of glue code, pipeline jungles often appear in data preparation. These can evolve organically, as new signals are identified and new information sources added incrementally. Without care, the resulting system for preparing data in an ML-friendly format may become a jungle of scrapes, joins, and sampling steps, often with intermediate files output. Managing these pipelines, detecting errors and recovering from failures are all difficult and costly [1]. Testing such pipelines often requires expensive end-to-end integration tests. All of this adds to technical debt of a system and makes further innovation more costly.

       

      Pipeline jungles can only be avoided by thinking holistically about data collection and feature extraction. The clean-slate approach of scrapping a pipeline jungle and redesigning from the ground up is indeed a major investment of engineering effort, but one that can dramatically reduce ongoing costs and speed further innovation.

       

Glue code and pipeline jungles are symptomatic of integration issues that may have a root cause in overly separated “research” and “engineering” roles. When ML packages are developed in an ivory-tower setting, the result may appear like black boxes to the teams that employ them in practice. A hybrid research approach where engineers and researchers are embedded together on the same teams (and indeed, are often the same people) can help reduce this source of friction significantly [16].

       

      Dead Experimental Codepaths.

      A common consequence of glue code or pipeline jungles is that it becomes increasingly attractive in the short term to perform experiments with alternative methods by implementing experimental codepaths as conditional branches within the main production code. For any individual change, the cost of experimenting in this manner is relatively low—none of the surrounding infrastructure needs to be reworked. However, over time, these accumulated codepaths can create a growing debt due to the increasing difficulties of maintaining backward compatibility and an exponential increase in cyclomatic complexity. Testing all possible interactions between codepaths becomes difficult or impossible. A famous example of the dangers here was Knight Capital’s system losing $465 million in 45 minutes, apparently because of unexpected behavior from obsolete experimental codepaths [15].
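For illustration only, the compressed caricature below shows how experimental codepaths accumulate as conditionals inside production scoring code; three boolean flags already imply eight interaction cases to reason about and test.

```python
# Caricature of the anti-pattern: experimental codepaths as conditionals in
# production code. Each flag seems cheap in isolation, but the combinations
# quickly outgrow what anyone actually tests.
def score(features, use_new_normalizer=False, exp_2021_ranker=False,
          legacy_fallback=True):
    x = sum(features.values())
    if use_new_normalizer:
        x = x / (len(features) or 1)
    if exp_2021_ranker:
        x = x * 1.1          # long-abandoned experiment, never cleaned up
    if legacy_fallback and x == 0:
        x = 0.5
    return x

print(score({"f1": 1.0, "f2": 3.0}, use_new_normalizer=True))
```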

       

As with the case of dead flags in traditional software [13], it is often beneficial to periodically re-examine each experimental branch to see what can be ripped out. Often only a small subset of the possible branches is actually used; many others may have been tested once and abandoned.

       

      Abstraction Debt.

The above issues highlight the fact that there is a distinct lack of strong abstractions to support ML systems. Zheng recently made a compelling comparison of the state of ML abstractions to the state of database technology [17], making the point that nothing in the machine learning literature comes close to the success of the relational database as a basic abstraction. What is the right interface to describe a stream of data, or a model, or a prediction?

       

      For distributed learning in particular, there remains a lack of widely accepted abstractions. It could be argued that the widespread use of Map-Reduce in machine learning was driven by the void of strong distributed learning abstractions. Indeed, one of the few areas of broad agreement in recent years appears to be that Map-Reduce is a poor abstraction for iterative ML algorithms.

       

The parameter-server abstraction seems much more robust, but there are multiple competing specifications of this basic idea [5, 10]. The lack of standard abstractions makes it all too easy to blur the lines between components.

       

      Common Smells.

In software engineering, a design smell may indicate an underlying problem in a component or system [7]. We identify a few ML system smells below; these are not hard-and-fast rules, but subjective indicators.

       

  • Plain-Old-Data Type Smell. The rich information used and produced by ML systems is all too often encoded with plain data types like raw floats and integers. In a robust system, a model parameter should know if it is a log-odds multiplier or a decision threshold, and a prediction should know various pieces of information about the model that produced it and how it should be consumed (see the sketch after this list).

       • Multiple-Language Smell. It is often tempting to write a particular piece of a system in a given language, especially when that language has a convenient library or syntax for the task at hand. However, using multiple languages often increases the cost of effective testing and can increase the difficulty of transferring ownership to other individuals.

        • Prototype Smell. It is convenient to test new ideas in small scale via prototypes. However, regularly relying on a prototyping environment may be an indicator that the full-scale system is brittle, difficult to change, or could benefit from improved abstractions and interfaces. Maintaining a prototyping environment carries its own cost, and there is a significant danger that time pressures may encourage a prototyping system to be used as a production solution. Additionally, results found at small scale rarely reflect the reality at full scale.
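As a small counterpoint to the Plain-Old-Data Type smell above, the sketch referenced in that bullet shows a prediction that carries its interpretation and provenance along with its raw value; the field names are invented for illustration.

```python
# Sketch: a prediction that knows what it is, rather than a bare float.
from dataclasses import dataclass

@dataclass(frozen=True)
class Prediction:
    value: float         # raw score
    kind: str            # e.g. "log_odds" or "probability"
    model_id: str        # which model produced it
    threshold: float     # decision threshold it should be compared against

    def decide(self) -> bool:
        return self.value >= self.threshold

p = Prediction(value=0.72, kind="probability", model_id="spam_v7", threshold=0.6)
print(p.decide())  # True, and the consumer can see how to interpret the 0.72
```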


      6 Configuration Debt

      Another potentially surprising area where debt can accumulate is in the configuration of machine learning systems. Any large system has a wide range of configurable options, including which features are used, how data is selected, a wide variety of algorithm-specific learning settings, potential pre- or post-processing, verification methods, etc. We have observed that both researchers and engineers may treat configuration (and extension of configuration) as an afterthought. Indeed, verification or testing of configurations may not even be seen as important. In a mature system which is being actively developed, the number of lines of configuration can far exceed the number of lines of the traditional code. Each configuration line has a potential for mistakes.

       

Consider the following examples. Feature A was incorrectly logged from 9/14 to 9/17. Feature B is not available on data before 10/7. The code used to compute feature C has to change for data before and after 11/1 because of changes to the logging format. Feature D is not available in production, so substitute features D′ and D′′ must be used when querying the model in a live setting. If feature Z is used, then training jobs must be given extra memory due to lookup tables or they will train inefficiently. Feature Q precludes the use of feature R because of latency constraints.

       

      All this messiness makes configuration hard to modify correctly, and hard to reason about. However, mistakes in configuration can be costly, leading to serious loss of time, waste of computing resources, or production issues. This leads us to articulate the following principles of good configuration systems:

        • It should be easy to specify a configuration as a small change from a previous configuration.

        • It should be hard to make manual errors, omissions, or oversights.

        • It should be easy to see, visually, the difference in configuration between two models.

        • It should be easy to automatically assert and verify basic facts about the configuration: number of features used, transitive closure of data dependencies, etc.

        • It should be possible to detect unused or redundant settings.

        • Configurations should undergo a full code review and be checked into a repository.
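A minimal sketch of a configuration object that tries to honor a few of these principles is shown below: new configurations are expressed as small changes from a base, and basic facts (duplicate features, mutually exclusive features, resource requirements) are asserted automatically. Every field and check here is hypothetical, loosely modeled on the examples earlier in this section.

```python
# Hypothetical sketch: configs as small diffs from a base, with automatic
# assertions about basic facts.
import dataclasses

@dataclasses.dataclass(frozen=True)
class TrainConfig:
    features: tuple = ("f_a", "f_b", "f_c")
    learning_rate: float = 0.1
    extra_memory_gb: int = 0

    def derive(self, **changes):
        """Express a new config as a small change from this one."""
        return dataclasses.replace(self, **changes)

    def validate(self):
        assert len(self.features) == len(set(self.features)), "duplicate features"
        assert not ({"f_q", "f_r"} <= set(self.features)), "f_q precludes f_r"
        if "f_z" in self.features:
            assert self.extra_memory_gb >= 4, "f_z needs extra memory for lookup tables"

base = TrainConfig()
experiment = base.derive(features=base.features + ("f_z",), extra_memory_gb=8)
experiment.validate()
print(dataclasses.asdict(experiment))
```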

       


      7 Dealing with Changes in the External World

      One of the things that makes ML systems so fascinating is that they often interact directly with the external world. Experience has shown that the external world is rarely stable. This background rate of change creates ongoing maintenance cost.

       

      Fixed Thresholds in Dynamic Systems.

      It is often necessary to pick a decision threshold for a given model to perform some action: to predict true or false, to mark an email as spam or not spam, to show or not show a given ad. One classic approach in machine learning is to choose a threshold from a set of possible thresholds, in order to get good tradeoffs on certain metrics, such as precision and recall. However, such thresholds are often manually set. Thus if a model updates on new data, the old manually set threshold may be invalid. Manually updating many thresholds across many models is time-consuming and brittle. One mitigation strategy for this kind of problem appears in [14], in which thresholds are learned via simple evaluation on heldout validation data.
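A hedged sketch of this idea: rather than hard-coding a threshold, re-derive it on held-out validation data whenever the model updates, for example by taking the lowest threshold that still meets a precision target. This illustrates the general approach rather than the specific method of [14].

```python
# Sketch: re-learn a decision threshold from held-out validation data so it
# stays consistent with a precision target as the model updates.
import numpy as np

def pick_threshold(scores, labels, min_precision=0.9):
    """Return the lowest threshold whose precision on the validation data
    meets min_precision (or None if no threshold qualifies)."""
    best = None
    for t in np.unique(scores):
        preds = scores >= t
        if preds.sum() == 0:
            continue
        precision = (labels[preds] == 1).mean()
        if precision >= min_precision:
            best = t if best is None else min(best, t)
    return best

rng = np.random.default_rng(1)
labels = rng.integers(0, 2, size=500)
scores = 0.6 * labels + 0.4 * rng.random(500)   # noisy scores correlated with labels
print(pick_threshold(scores, labels, min_precision=0.9))
```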

       

      Monitoring and Testing.

      Unit testing of individual components and end-to-end tests of running systems are valuable, but in the face of a changing world such tests are not sufficient to provide evidence that a system is working as intended. Comprehensive live monitoring of system behavior in real time combined with automated response is critical for long-term system reliability.

       

      The key question is: what to monitor? Testable invariants are not always obvious given that many ML systems are intended to adapt over time. We offer the following starting points.

  • Prediction Bias. In a system that is working as intended, it should usually be the case that the distribution of predicted labels is equal to the distribution of observed labels. This is by no means a comprehensive test, as it can be met by a null model that simply predicts average values of label occurrences without regard to the input features. However, it is a surprisingly useful diagnostic, and changes in metrics such as this are often indicative of an issue that requires attention. For example, this method can help to detect cases in which the world behavior suddenly changes, making training distributions drawn from historical data no longer reflective of current reality. Slicing prediction bias by various dimensions isolates issues quickly, and can also be used for automated alerting (see the sketch after this list).

        • Action Limits. In systems that are used to take actions in the real world, such as bidding on items or marking messages as spam, it can be useful to set and enforce action limits as a sanity check. These limits should be broad enough not to trigger spuriously. If the system hits a limit for a given action, automated alerts should fire and trigger manual intervention or investigation.

  • Up-Stream Producers. Data is often fed through to a learning system from various upstream producers. These up-stream processes should be thoroughly monitored and tested, and should routinely meet a service level objective that takes the downstream ML system's needs into account. Further, any up-stream alerts must be propagated to the control plane of the ML system to ensure its accuracy. Similarly, any failure of the ML system to meet its established service level objectives should also be propagated down-stream to all consumers, and directly to their control planes if at all possible.
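The sketch referenced in the Prediction Bias bullet above: compare mean predicted rates against mean observed rates per slice and flag slices whose gap exceeds a tolerance. The slice names and tolerance are made up for illustration.

```python
# Sketch: compare mean predicted rate vs. mean observed rate per slice and
# flag slices whose gap exceeds a tolerance.
def prediction_bias_by_slice(records, tolerance=0.05):
    """records: iterable of (slice_name, predicted_prob, observed_label)."""
    by_slice = {}
    for slice_name, pred, obs in records:
        by_slice.setdefault(slice_name, []).append((pred, obs))
    alerts = []
    for slice_name, rows in by_slice.items():
        mean_pred = sum(p for p, _ in rows) / len(rows)
        mean_obs = sum(o for _, o in rows) / len(rows)
        if abs(mean_pred - mean_obs) > tolerance:
            alerts.append((slice_name, mean_pred, mean_obs))
    return alerts

records = [("mobile", 0.30, 1), ("mobile", 0.35, 0),
           ("desktop", 0.10, 0), ("desktop", 0.12, 0), ("desktop", 0.90, 0)]
print(prediction_bias_by_slice(records))
```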

       

Because external changes occur in real-time, responses must also occur in real-time. Relying on human intervention in response to alert pages is one strategy, but it can be brittle for time-sensitive issues. Creating systems that allow automated response without direct human intervention is often well worth the investment.


      8 Other Areas of ML-related Debt

      We now briefly highlight some additional areas where ML-related technical debt may accrue.

       

      Data Testing Debt.

If data replaces code in ML systems, and code should be tested, then it seems clear that some amount of testing of input data is critical to a well-functioning system. Basic sanity checks are useful, as are more sophisticated tests that monitor changes in input distributions.
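A minimal sketch of the kind of checks meant here, assuming a numeric feature column and stored training-time statistics: basic sanity checks on individual values plus a crude drift check on the live mean.

```python
# Sketch: basic sanity checks plus a crude drift check that compares the
# live feature mean against training-time statistics.
import math

TRAINING_STATS = {"age": {"mean": 34.0, "std": 12.0, "min": 0, "max": 120}}

def check_feature(name, values, max_z=3.0):
    stats = TRAINING_STATS[name]
    problems = []
    for v in values:                                  # per-value sanity checks
        if v is None or math.isnan(v):
            problems.append(f"{name}: missing value")
        elif not (stats["min"] <= v <= stats["max"]):
            problems.append(f"{name}: out-of-range value {v}")
    clean = [v for v in values if v is not None and not math.isnan(v)]
    if clean:                                         # crude distribution-drift check
        live_mean = sum(clean) / len(clean)
        z = abs(live_mean - stats["mean"]) / (stats["std"] / math.sqrt(len(clean)))
        if z > max_z:
            problems.append(f"{name}: mean shifted (z={z:.1f})")
    return problems

print(check_feature("age", [25.0, 31.0, 150.0, float("nan"), 40.0]))
```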

       

      Reproducibility Debt.

      As scientists, it is important that we can re-run experiments and get similar results, but designing real-world systems to allow for strict reproducibility is a task made difficult by randomized algorithms, non-determinism inherent in parallel learning, reliance on initial conditions, and interactions with the external world.

       

      Process Management Debt.

Most of the use cases described in this paper have talked about the cost of maintaining a single model, but mature systems may have dozens or hundreds of models running simultaneously [14, 6]. This raises a wide range of important problems, including how to update many configurations for many similar models safely and automatically, how to manage and assign resources among models with different business priorities, and how to visualize and detect blockages in the flow of data in a production pipeline. Developing tooling to aid recovery from production incidents is also critical. An important system-level smell to avoid is common processes with many manual steps.

       

      Cultural Debt.

      There is sometimes a hard line between ML research and engineering, but this can be counter-productive for long-term system health. It is important to create team cultures that reward deletion of features, reduction of complexity, improvements in reproducibility, stability, and monitoring to the same degree that improvements in accuracy are valued. In our experience, this is most likely to occur within heterogeneous teams with strengths in both ML research and engineering.


      9 Conclusions: Measuring Debt and Paying it Off

      Technical debt is a useful metaphor, but it unfortunately does not provide a strict metric that can be tracked over time. How are we to measure technical debt in a system, or to assess the full cost of this debt? Simply noting that a team is still able to move quickly is not in itself evidence of low debt or good practices, since the full cost of debt becomes apparent only over time. Indeed, moving quickly often introduces technical debt. A few useful questions to consider are:

        • How easily can an entirely new algorithmic approach be tested at full scale?

        • What is the transitive closure of all data dependencies?

        • How precisely can the impact of a new change to the system be measured?

        • Does improving one model or signal degrade others?

        • How quickly can new members of the team be brought up to speed?

       

      We hope that this paper may serve to encourage additional development in the areas of maintainable ML, including better abstractions, testing methodologies, and design patterns. Perhaps the most important insight to be gained is that technical debt is an issue that engineers and researchers both need to be aware of. Research solutions that provide a tiny accuracy benefit at the cost of massive increases in system complexity are rarely wise practice. Even the addition of one or two seemingly innocuous data dependencies can slow further progress.

       

      Paying down ML-related technical debt requires a specific commitment, which can often only be achieved by a shift in team culture. Recognizing, prioritizing, and rewarding this effort is important for the long term health of successful ML teams.

       

      Acknowledgments

      This paper owes much to the important lessons learned day to day in a culture that values both innovative ML research and strong engineering practice. Many colleagues have helped shape our thoughts here, and the benefit of accumulated folk wisdom cannot be overstated. We would like to specifically recognize the following: Roberto Bayardo, Luis Cobo, Sharat Chikkerur, Jeff Dean, Philip Henderson, Arnar Mar Hrafnkelsson, Ankur Jain, Joe Kovac, Jeremy Kubica, H. Brendan McMahan, Satyaki Mahalanabis, Lan Nie, Michael Pohl, Abdul Salem, Sajid Siddiqi, Ricky Shan, Alan Skelly, Cory Williams, and Andrew Young.

       

      A short version of this paper was presented at the SE4ML workshop in 2014 in Montreal, Canada.


      My Opinions : 

This is that famous Google paper. While everyone else was simply shouting "AI," it put forward the idea of AI as a system. I think this is what brought the field back around to MLOps and software engineering.
