Katib is a Kubernetes-native project for automated machine learning (AutoML).

Overview



Katib is a Kubernetes-native project for automated machine learning (AutoML). Katib supports Hyperparameter Tuning, Early Stopping and Neural Architecture Search.

Katib is agnostic to machine learning (ML) frameworks. It can tune hyperparameters of applications written in any language of the user’s choice and natively supports many ML frameworks, such as TensorFlow, Apache MXNet, PyTorch, XGBoost, and others.

Katib can run training jobs using any Kubernetes custom resource, with out-of-the-box support for Kubeflow Training Operator, Argo Workflows, Tekton Pipelines, and many more.

Katib stands for secretary in Arabic.

Search Algorithms

Katib supports several search algorithms. Follow the Kubeflow documentation to learn more about each algorithm, and check the Suggestion service guide to implement your own algorithm.

Hyperparameter Tuning        | Neural Architecture Search | Early Stopping
---------------------------- | -------------------------- | --------------
Random Search                | ENAS                       | Median Stop
Grid Search                  | DARTS                      |
Bayesian Optimization        |                            |
TPE                          |                            |
Multivariate TPE             |                            |
CMA-ES                       |                            |
Sobol's Quasirandom Sequence |                            |
HyperBand                    |                            |

To run the above algorithms, Katib relies on several hyperparameter optimization frameworks.
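For illustration only, here is a minimal sketch (not part of the original README) of how one of these algorithms is selected through the Katib Python SDK. The algorithm name and setting mirror the Bayesian optimization example quoted later on this page; everything else is an assumption.

# Sketch: choosing a search algorithm with the Katib Python SDK (v1beta1 API).
# Assumes the kubeflow-katib package is installed.
from kubeflow.katib import V1beta1AlgorithmSpec, V1beta1AlgorithmSetting

# Any algorithm name from the table above can be used; settings are algorithm-specific.
algorithm = V1beta1AlgorithmSpec(
    algorithm_name="bayesianoptimization",
    algorithm_settings=[
        # "random_state" mirrors the Bayesian optimization example further down this page.
        V1beta1AlgorithmSetting(name="random_state", value="10"),
    ],
)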

Installation

For the various Katib installation options, check the Kubeflow guide. Follow the steps below to install Katib standalone.

Prerequisites

These are the minimal requirements to install Katib:

  • Kubernetes >= 1.17
  • kubectl >= 1.21

Latest Version

To install the latest Katib version, run this command:

kubectl apply -k "github.com/kubeflow/katib.git/manifests/v1beta1/installs/katib-standalone?ref=master"

Release Version

To install a specific Katib release (for example, v0.11.1), run this command:

kubectl apply -k "github.com/kubeflow/katib.git/manifests/v1beta1/installs/katib-standalone?ref=v0.11.1"

Make sure that all Katib components are running:

$ kubectl get pods -n kubeflow

NAME                                READY   STATUS      RESTARTS   AGE
katib-cert-generator-rw95w          0/1     Completed   0          35s
katib-controller-566595bdd8-hbxgf   1/1     Running     0          36s
katib-db-manager-57cd769cdb-4g99m   1/1     Running     0          36s
katib-mysql-7894994f88-5d4s5        1/1     Running     0          36s
katib-ui-5767cfccdc-pwg2x           1/1     Running     0          36s

For Katib Experiments, check the complete list of examples.
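As a hedged, illustrative sketch (not an official recipe), the snippet below defines and submits a simple random-search Experiment with the Katib Python SDK, similar to the random example quoted in the issues below. The container image, command, namespace, and metric name are assumptions.

# Sketch: defining and submitting an Experiment with the Katib Python SDK (v1beta1 API).
from kubernetes.client import V1ObjectMeta
from kubeflow.katib import (
    KatibClient,
    V1beta1AlgorithmSpec,
    V1beta1Experiment,
    V1beta1ExperimentSpec,
    V1beta1FeasibleSpace,
    V1beta1ObjectiveSpec,
    V1beta1ParameterSpec,
    V1beta1TrialParameterSpec,
    V1beta1TrialTemplate,
)

# Search space: a single learning-rate hyperparameter.
parameters = [
    V1beta1ParameterSpec(
        name="lr",
        parameter_type="double",
        feasible_space=V1beta1FeasibleSpace(min="0.01", max="0.03"),
    ),
]

# Trial template: a plain Kubernetes batch/v1 Job; image and command are illustrative.
trial_spec = {
    "apiVersion": "batch/v1",
    "kind": "Job",
    "spec": {
        "template": {
            "spec": {
                "containers": [
                    {
                        "name": "training-container",
                        "image": "docker.io/kubeflowkatib/mxnet-mnist",
                        "command": [
                            "python3",
                            "/opt/mxnet-mnist/mnist.py",
                            "--lr=${trialParameters.learningRate}",
                        ],
                    }
                ],
                "restartPolicy": "Never",
            }
        }
    },
}

experiment = V1beta1Experiment(
    api_version="kubeflow.org/v1beta1",
    kind="Experiment",
    metadata=V1ObjectMeta(name="random-example", namespace="kubeflow"),
    spec=V1beta1ExperimentSpec(
        max_trial_count=12,
        parallel_trial_count=3,
        max_failed_trial_count=3,
        objective=V1beta1ObjectiveSpec(
            type="maximize",
            goal=0.99,
            objective_metric_name="Validation-accuracy",
        ),
        algorithm=V1beta1AlgorithmSpec(algorithm_name="random"),
        parameters=parameters,
        trial_template=V1beta1TrialTemplate(
            primary_container_name="training-container",
            trial_parameters=[
                V1beta1TrialParameterSpec(
                    name="learningRate",
                    description="Learning rate",
                    reference="lr",
                ),
            ],
            trial_spec=trial_spec,
        ),
    ),
)

# Submit the Experiment to the cluster; the namespace must already exist.
KatibClient().create_experiment(experiment, namespace="kubeflow")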

Documentation

Community

We are always growing our community and invite new users and AutoML enthusiasts to contribute to the Katib project. The following links provide information about getting involved in the community:

Contributing

Please feel free to test the system! The developer guide is a good starting point for our developers.

Blog posts

Events

Citation

If you use Katib in a scientific publication, we would appreciate citations to the following paper:

A Scalable and Cloud-Native Hyperparameter Tuning System, George et al., arXiv:2006.02085, 2020.

BibTeX entry:

@misc{george2020katib,
    title={A Scalable and Cloud-Native Hyperparameter Tuning System},
    author={Johnu George and Ce Gao and Richard Liu and Hou Gang Liu and Yuan Tang and Ramdoot Pydipaty and Amit Kumar Saha},
    year={2020},
    eprint={2006.02085},
    archivePrefix={arXiv},
    primaryClass={cs.DC}
}
Issues
  • how to collect the indicator of training results???

    /kind bug

    After completion of bayesianoptimization automated training, the corresponding indicator results cannot be collected. Could you please tell me how to collect the indicator of training results? My YAML file is as follows:

    apiVersion: "kubeflow.org/v1alpha3"
    kind: Experiment
    metadata:
      namespace: kubeflow
      labels:
        controller-tools.k8s.io: "1.0"
      name: bayesianoptimization-example
    spec:
      objective:
        type: maximize
        goal: 0.99
        objectiveMetricName: Validation-accuracy
        additionalMetricNames:
          - accuracy
      algorithm:
        algorithmName: bayesianoptimization
        algorithmSettings:
          - name: "random_state"
            value: "10"
      parallelTrialCount: 3
      maxTrialCount: 12
      maxFailedTrialCount: 3
      MetricsCollectorSpec:
        Collector:
          Kind: stdOut
      parameters:
        - name: --lr
          parameterType: double
          feasibleSpace:
            min: "0.01"
            max: "0.03"
        - name: --num-layers
          parameterType: int
          feasibleSpace:
            min: "2"
            max: "5"
        - name: --optimizer
          parameterType: categorical
          feasibleSpace:
            list:
              - sgd
              - adam
              - ftrl
      trialTemplate:
        goTemplate:
          rawTemplate: |-
            apiVersion: batch/v1
            kind: Job
            metadata:
              name: {{.Trial}}
              namespace: {{.NameSpace}}
            spec:
              template:
                spec:
                  containers:
                    - name: {{.Trial}}
                      image: docker.io/katib/mxnet-mnist-example
                      command:
                        - "python"
                        - "/mxnet/example/image-classification/train_mnist.py"
                        - "--batch-size=64"
                        {{- with .HyperParameters}}
                        {{- range .}}
                        - "{{.Name}}={{.Value}}"
                        {{- end}}
                        {{- end}}
                  restartPolicy: Never

    What steps did you take and what happened: [A clear and concise description of what the bug is.]

    What did you expect to happen:

    Anything else you would like to add: [Miscellaneous information that will assist in solving the issue.]

    Environment:

    • Kubeflow version:0.7.0
    • Minikube version:
    • Kubernetes version: (use kubectl version):1.15.5
    • OS (e.g. from /etc/os-release):CentOS Linux release 7.7.1908
    kind/bug 
    opened by cleveryg 99
  • Disable dynamic creation for admission hooks and update dependencies

    Fixes: https://github.com/kubeflow/katib/issues/1405.

    This PR introduces a new mechanism to obtain certificates for the webhooks. I updated the YAMLs for our webhooks and added an initContainer to the Katib controller which executes the cert-generator.sh script. This script creates a CertificateSigningRequest and the katib-webhook-cert secret, and patches the webhook configurations with the appropriate caBundle. Since we have the katib-webhook-cert secret in the manifest, the cleanup process should delete everything.

    So we don't need to deploy cert-manager for Katib.

    @gaocegege @johnugeorge @yanniszark @kuikuikuizzZ @knkski What do you think about this approach ?

    Also I updated controller-runtime to v0.8.2 and k8s.io deps to v0.20.4. That requires some changes:

    • Change some packages location
    • Change the arguments for client calls (List, Get, etc.)
    • In the newer Kubernetes versions we can't add owner reference for cluster-scoped objects (e.g. PV) with namespace-scoped object (e.g. Suggestion). Thus, I have to disable owner reference for the PV which is created when Experiment has FromVolume resume policy. For that reason, I added PersistentVolumeReclaimPolicy: Delete for the PV and once PVC is garbage collected, PV should also be deleted.
    • I removed PyTorch operator from the dependencies because of this problem.

    I still need to run some tests and create a new image for the cert generator. It would be great if you could start reviewing this.

    /cc @gaocegege @johnugeorge

    lgtm size/XXL approved 
    opened by andreyvelich 62
  • [feature] Reconsider the design of Trial Template

    /kind feature

    Describe the solution you'd like [A clear and concise description of what you want to happen.]

    We need to marshal the TFJob to a JSON string and then use it to create Experiments if we are using the K8s client-go. That is not ideal, and the Go template is ugly, too.
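    As a hedged illustration (not from the issue itself), the sketch below shows the kind of marshalling the current design forces on API clients: a structured TFJob spec has to be serialized into a raw template string before it can be embedded in an Experiment. The field names follow the v1alpha3 goTemplate shape used elsewhere on this page; the TFJob content is illustrative.

    # Sketch of the pain point: a typed object is flattened into a raw string.
    import json

    # A structured TFJob spec that a client would rather pass as a typed object.
    tfjob = {
        "apiVersion": "kubeflow.org/v1",
        "kind": "TFJob",
        "spec": {"tfReplicaSpecs": {"Worker": {"replicas": 1}}},
    }

    # With the goTemplate-based trialTemplate, the object must be serialized into a
    # string that the controller re-parses later, losing type checking along the way.
    trial_template = {"goTemplate": {"rawTemplate": json.dumps(tfjob)}}
    print(trial_template)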

    Anything else you would like to add: [Miscellaneous information that will assist in solving the issue.]

    priority/p0 kind/feature 
    opened by gaocegege 56
  • Switch to AWS CI/CD

    Related: https://github.com/kubeflow/katib/issues/1332. I will debug the infra in this PR.

    I also made a few changes to improve CI/CD quality.

    /cc @gaocegege @johnugeorge /cc @Jeffwan @PatrickXYS @jlewi @Bobgy

    lgtm size/XXL approved 
    opened by andreyvelich 55
  • Katib v1alpha2 API for CRDs

    @YujiOshima @gaocegege @johnugeorge @alexandraj777 @hougangliu @xyhuang

    This is an initial proposal for the Katib v1alpha2 API. The changes here reflect the discussion in https://github.com/kubeflow/katib/issues/370.

    Comments and suggestions are welcome.

    Please note that the NAS APIs are not included here since the feature is still in an early phase.


    This change is Reviewable

    lgtm approved size/L 
    opened by richardsliu 54
  • Studyctl crd

    Add StudyController CRD: studycontroller.kubeflow.org Operator: StudyController

    Updated examples. This implementation polls worker status in the Go process of the StudyController. Though I understand this is not an elegant implementation, it has the least impact on the existing code.

    As a next step, we should create a Worker CRD and its controller and support multiple job types (k8s, TF-Job, ...). Assign @gaocegege


    This change is Reviewable

    lgtm size/XXL approved 
    opened by YujiOshima 50
  • Population based training

    What this PR does / why we need it:

    Support the discovery of modulated hyperparameters rather than attempting to find a fixed set over the entire training process. The paper has more details about the technique.

    Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):

    This PR provides some initial support for PBT within Katib (#1382).

    Checklist:

    • [ ] Docs included if any changes are user facing
    lgtm size/XXL approved ok-to-test 
    opened by a9p 46
  • Improve Katib README

    Related: #1332. I will debug the infra in this PR.

    • [x] This is the PR to see if we can trigger AWS Presubmit.
    • [x] This is the PR to see if Github UI integrate aws-kf-ci-bot
    size/XS lgtm approved 
    opened by PatrickXYS 44
  • can't set up CRD "Experiment"

    when I deploy katib_v1alpha3 with scripts/v1alpha3/deploy.sh, the katib-controller pod gives the following error: {"level":"info","ts":1578296376.3173876,"logger":"entrypoint","msg":"Config:","experiment-suggestion-name":"default","cert-local-filesystem":false} {"level":"info","ts":1578296376.375878,"logger":"entrypoint","msg":"Registering Components."} {"level":"info","ts":1578296376.3765948,"logger":"entrypoint","msg":"Setting up controller"} {"level":"info","ts":1578296376.3766346,"logger":"experiment-controller","msg":"Using the default suggestion implementation"} {"level":"info","ts":1578296376.3767953,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"experiment-controller","source":"kind source: /, Kind="} {"level":"error","ts":1578296376.3768966,"logger":"kubebuilder.source","msg":"if kind is a CRD, it should be installed before calling Start","kind":{"Group":"kubeflow.org","Kind":"Experiment"},"error":"no matches for kind "Experiment" in version "kubeflow.org/v1alpha3"","stacktrace":"github.com/kubeflow/katib/vendor/github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/src/github.com/kubeflow/katib/vendor/github.com/go-logr/zapr/zapr.go:128\ngithub.com/kubeflow/katib/vendor/sigs.k8s.io/controller-runtime/pkg/source.(*Kind).Start\n\t/go/src/github.com/kubeflow/katib/vendor/sigs.k8s.io/controller-runtime/pkg/source/source.go:89\ngithub.com/kubeflow/katib/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Watch\n\t/go/src/github.com/kubeflow/katib/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:122\ngithub.com/kubeflow/katib/pkg/controller.v1alpha3/experiment.addWatch\n\t/go/src/github.com/kubeflow/katib/pkg/controller.v1alpha3/experiment/experiment_controller.go:119\ngithub.com/kubeflow/katib/pkg/controller.v1alpha3/experiment.add\n\t/go/src/github.com/kubeflow/katib/pkg/controller.v1alpha3/experiment/experiment_controller.go:107\ngithub.com/kubeflow/katib/pkg/controller.v1alpha3/experiment.Add\n\t/go/src/github.com/kubeflow/katib/pkg/controller.v1alpha3/experiment/experiment_controller.go:62\ngithub.com/kubeflow/katib/pkg/controller%2ev1alpha3.AddToManager\n\t/go/src/github.com/kubeflow/katib/pkg/controller.v1alpha3/controller.go:28\nmain.main\n\t/go/src/github.com/kubeflow/katib/cmd/katib-controller/v1alpha3/main.go:90\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:203"} {"level":"error","ts":1578296376.377135,"logger":"experiment-controller","msg":"Experiment watch failed","error":"no matches for kind "Experiment" in version "kubeflow.org/v1alpha3"","stacktrace":"github.com/kubeflow/katib/vendor/github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/src/github.com/kubeflow/katib/vendor/github.com/go-logr/zapr/zapr.go:128\ngithub.com/kubeflow/katib/pkg/controller.v1alpha3/experiment.addWatch\n\t/go/src/github.com/kubeflow/katib/pkg/controller.v1alpha3/experiment/experiment_controller.go:121\ngithub.com/kubeflow/katib/pkg/controller.v1alpha3/experiment.add\n\t/go/src/github.com/kubeflow/katib/pkg/controller.v1alpha3/experiment/experiment_controller.go:107\ngithub.com/kubeflow/katib/pkg/controller.v1alpha3/experiment.Add\n\t/go/src/github.com/kubeflow/katib/pkg/controller.v1alpha3/experiment/experiment_controller.go:62\ngithub.com/kubeflow/katib/pkg/controller%2ev1alpha3.AddToManager\n\t/go/src/github.com/kubeflow/katib/pkg/controller.v1alpha3/controller.go:28\nmain.main\n\t/go/src/github.com/kubeflow/katib/cmd/katib-controller/v1alpha3/main.go:90\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:203"} 
{"level":"error","ts":1578296376.3772092,"logger":"experiment-controller","msg":"Trial watch failed","error":"no matches for kind "Experiment" in version "kubeflow.org/v1alpha3"","stacktrace":"github.com/kubeflow/katib/vendor/github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/src/github.com/kubeflow/katib/vendor/github.com/go-logr/zapr/zapr.go:128\ngithub.com/kubeflow/katib/pkg/controller.v1alpha3/experiment.add\n\t/go/src/github.com/kubeflow/katib/pkg/controller.v1alpha3/experiment/experiment_controller.go:108\ngithub.com/kubeflow/katib/pkg/controller.v1alpha3/experiment.Add\n\t/go/src/github.com/kubeflow/katib/pkg/controller.v1alpha3/experiment/experiment_controller.go:62\ngithub.com/kubeflow/katib/pkg/controller%2ev1alpha3.AddToManager\n\t/go/src/github.com/kubeflow/katib/pkg/controller.v1alpha3/controller.go:28\nmain.main\n\t/go/src/github.com/kubeflow/katib/cmd/katib-controller/v1alpha3/main.go:90\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:203"} {"level":"error","ts":1578296376.377267,"logger":"entrypoint","msg":"unable to register controllers to the manager","error":"no matches for kind "Experiment" in version "kubeflow.org/v1alpha3"","stacktrace":"github.com/kubeflow/katib/vendor/github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/src/github.com/kubeflow/katib/vendor/github.com/go-logr/zapr/zapr.go:128\nmain.main\n\t/go/src/github.com/kubeflow/katib/cmd/katib-controller/v1alpha3/main.go:91\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:203"}

    And the UI pod gives the following error:

    2020/01/06 06:56:46 CreateExperiment from YAML failed: no matches for kind "Experiment" in version "kubeflow.org/v1alpha3"

    lifecycle/stale 
    opened by wyong16 41
  • Katib experiments run indefinitely without completing a single trial

    /kind bug

    Hi, I'm setting up a Katib job through the Kale deployment panel after creating a Kale pipeline. The pipeline builds successfully, but the Katib experiments run forever and don't complete a single trial.

    I expect the Katib jobs to run successfully, but to no avail.

    Any way/suggestion to go about this?

    Environment:

    • Kubeflow version (kfctl version):
    • Minikube version (minikube version):
    • Kubernetes version: (use kubectl version):
    • OS (e.g. from /etc/os-release):
    kind/bug 
    opened by Dampolo03 39
  • ERROR:grpc._server:Exception calling application: Method not implemented!

    /kind bug

    Hi, I'm having trouble using Katib v1alpha3. First, I installed Katib as follows:

    1. git clone https://github.com/kubeflow/katib
    2. sh katib/scripts/v1alpha3/deploy.sh

    Then I applied random-example.yaml (an example in katib/examples/v1alpha3):

    kubectl apply -f random-example.yaml

    Results:

    kubectl get pods -n kubeflow
    NAME                                     READY   STATUS    RESTARTS   AGE
    katib-controller-6c6974678d-zsnlc        1/1     Running   1          24m
    katib-db-558f649dc6-8cd9t                1/1     Running   0          24m
    katib-manager-5f74bdff84-4d78z           1/1     Running   0          24m
    katib-ui-6568bd6b44-qbq5k                1/1     Running   0          24m
    random-example-random-846dc99654-bxb8j   1/1     Running   0          23m

    kubectl get trials -n kubeflow
    NAME                      TYPE      STATUS   AGE
    random-example-drpkvb4b   Running   True     23m
    random-example-k7xv6ktt   Running   True     23m
    random-example-w6jlwdp2   Running   True     23m

    kubectl get experiment -n kubeflow -oyaml apiVersion: v1 items:

    • apiVersion: kubeflow.org/v1alpha3 kind: Experiment metadata: annotations: kubectl.kubernetes.io/last-applied-configuration: | {"apiVersion":"kubeflow.org/v1alpha3","kind":"Experiment","metadata":{"annotations":{},"labels":{"controller-tools.k8s.io":"1.0"},"name":"random-example","namespace":"kubeflow"},"spec":{"algorithm":{"algorithmName":"random"},"maxFailedTrialCount":3,"maxTrialCount":12,"objective":{"additionalMetricNames":["accuracy"],"goal":0.99,"objectiveMetricName":"Validation-accuracy","type":"maximize"},"parallelTrialCount":3,"parameters":[{"feasibleSpace":{"max":"0.03","min":"0.01"},"name":"--lr","parameterType":"double"},{"feasibleSpace":{"max":"5","min":"2"},"name":"--num-layers","parameterType":"int"},{"feasibleSpace":{"list":["sgd","adam","ftrl"]},"name":"--optimizer","parameterType":"categorical"}],"trialTemplate":{"goTemplate":{"rawTemplate":"apiVersion: batch/v1\nkind: Job\nmetadata:\n name: {{.Trial}}\n namespace: {{.NameSpace}}\nspec:\n template:\n spec:\n containers:\n - name: {{.Trial}}\n image: docker.io/kubeflowkatib/mxnet-mnist-example\n command:\n - "python"\n - "/mxnet/example/image-classification/train_mnist.py"\n - "--batch-size=64"\n {{- with .HyperParameters}}\n {{- range .}}\n - "{{.Name}}={{.Value}}"\n {{- end}}\n {{- end}}\n restartPolicy: Never"}}}} creationTimestamp: "2019-12-20T07:58:52Z" finalizers:
      • update-prometheus-metrics generation: 2 labels: controller-tools.k8s.io: "1.0" name: random-example namespace: kubeflow resourceVersion: "11682124" selfLink: /apis/kubeflow.org/v1alpha3/namespaces/kubeflow/experiments/random-example uid: 9005bab0-22fe-11ea-8cf0-0679676001a5 spec: algorithm: algorithmName: random algorithmSettings: null maxFailedTrialCount: 3 maxTrialCount: 12 metricsCollectorSpec: collector: kind: StdOut objective: additionalMetricNames:
        • accuracy goal: 0.99 objectiveMetricName: Validation-accuracy type: maximize parallelTrialCount: 3 parameters:
      • feasibleSpace: max: "0.03" min: "0.01" name: --lr parameterType: double
      • feasibleSpace: max: "5" min: "2" name: --num-layers parameterType: int
      • feasibleSpace: list:
        • sgd
        • adam
        • ftrl name: --optimizer parameterType: categorical trialTemplate: goTemplate: rawTemplate: |- apiVersion: batch/v1 kind: Job metadata: name: {{.Trial}} namespace: {{.NameSpace}} spec: template: spec: containers: - name: {{.Trial}} image: docker.io/kubeflowkatib/mxnet-mnist-example command: - "python" - "/mxnet/example/image-classification/train_mnist.py" - "--batch-size=64" {{- with .HyperParameters}} {{- range .}} - "{{.Name}}={{.Value}}" {{- end}} {{- end}} restartPolicy: Never status: conditions:
      • lastTransitionTime: "2019-12-20T07:58:52Z" lastUpdateTime: "2019-12-20T07:58:52Z" message: Experiment is created reason: ExperimentCreated status: "True" type: Created
      • lastTransitionTime: "2019-12-20T08:00:22Z" lastUpdateTime: "2019-12-20T08:00:22Z" message: Experiment is running reason: ExperimentRunning status: "True" type: Running currentOptimalTrial: observation: metrics: null parameterAssignments: null startTime: "2019-12-20T07:58:52Z" trials: 3 trialsRunning: 3 kind: List metadata: resourceVersion: "" selfLink: ""

    kubectl logs -n kubeflow random-example-random-846dc99654-bxb8j

    INFO:hyperopt.utils:Failed to load dill, try installing dill via "pip install dill" for enhanced pickling support.
    INFO:hyperopt.fmin:Failed to load dill, try installing dill via "pip install dill" for enhanced pickling support.
    ERROR:grpc._server:Exception calling application: Method not implemented!
    Traceback (most recent call last):
      File "/usr/local/lib/python3.6/site-packages/grpc/_server.py", line 434, in _call_behavior
        response_or_iterator = behavior(argument, context)
      File "/usr/src/app/github.com/kubeflow/katib/pkg/apis/manager/v1alpha3/python/api_pb2_grpc.py", line 135, in ValidateAlgorithmSettings
        raise NotImplementedError('Method not implemented!')
    NotImplementedError: Method not implemented!

    What can I do to fix it? Thank you for your help in solving this problem.

    • Kubernetes version: (use kubectl version): Client Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.5+icp", GitCommit:"903c3b31caddc675ce2d8bddf62aa0f875c2a3bc", GitTreeState:"clean", BuildDate:"2019-05-08T06:16:32Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"} Server Version: version.Info{Major:"1", Minor:"13", GitVersion:"v1.13.5+icp", GitCommit:"903c3b31caddc675ce2d8bddf62aa0f875c2a3bc", GitTreeState:"clean", BuildDate:"2019-05-08T06:16:32Z", GoVersion:"go1.11.5", Compiler:"gc", Platform:"linux/amd64"}

    • OS (e.g. from /etc/os-release): CentOS Linux release 7.7.1908 (Core)

    kind/bug 
    opened by devxoxo 38
  • Metrics can go out of view when visualizing experiment metrics

    /kind bug

    What steps did you take and what happened: It appears that metrics can go outside the viewport when visualizing katib metrics.

    What did you expect to happen: I expected the metrics to be bounded by the viewport so that they are always visible (we should always be able to see y = 0) or to have the ability to zoom in or scroll.

    Environment:

    • Katib version (check the Katib controller image version): katib-controller:v0.12.0, katib sdk 0.13.0
    • Kubernetes version: (kubectl version): 1.21
    • OS (uname -a): Ubuntu
    • Kubeflow: 1.4.1
    • Cloud provider: AWS
    • Screen resolution : 1920x1080 100% (not zoomed in)

    Impacted by this bug? Give it a 👍 We prioritize the issues with the most 👍

    kind/bug 
    opened by AlexandreBrown 0
  • Plan to support Java SDK

    /kind feature

    Describe the solution you'd like [A clear and concise description of what you want to happen.]

    I want to add HPO to my own MLOps platform through Katib. I currently use Java Spring for the backend service, but Katib only provides a Python SDK. I know it supports the gRPC API, but it has limitations (e.g., creating a new Experiment).


    Anything else you would like to add: [Miscellaneous information that will assist in solving the issue.]

    I wonder if you have any plans to support an SDK for Java. If so, I wonder in which version the Java SDK would be available.


    Love this feature? Give it a 👍 We prioritize the features with the most 👍

    kind/feature 
    opened by pod3275 0
  • WIP:implement postgres for katib db

    What this PR does / why we need it:

    • Implement PostgreSQL for the Katib backend and support it as a built-in feature of Katib.

    • For proof of concept, I just replaced MySQL with PostgreSQL as the default DB. However, this could be rolled back after review; it's just a temporary change for review convenience.

    • I tested with the following kustomization configuration: https://github.com/anencore94/katib/tree/test-kustomization-for-postgres-db

    Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged): Fixes #915

    Checklist:

    • [ ] Docs included if any changes are user facing
    size/XL do-not-merge/work-in-progress 
    opened by anencore94 8
  • Bump terser from 4.8.0 to 4.8.1 in /pkg/ui/v1beta1/frontend

    Bumps terser from 4.8.0 to 4.8.1.

    Changelog

    Sourced from terser's changelog.

    v4.8.1 (backport)

    • Security fix for RegExps that should not be evaluated (regexp DDOS)
    Commits

    Dependabot compatibility score

    Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.


    Dependabot commands and options

    You can trigger Dependabot actions by commenting on this PR:

    • @dependabot rebase will rebase this PR
    • @dependabot recreate will recreate this PR, overwriting any edits that have been made to it
    • @dependabot merge will merge this PR after your CI passes on it
    • @dependabot squash and merge will squash and merge this PR after your CI passes on it
    • @dependabot cancel merge will cancel a previously requested merge and block automerging
    • @dependabot reopen will reopen this PR if it is closed
    • @dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
    • @dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
    • @dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
    • @dependabot use these labels will set the current labels as the default for future PRs for this repo and language
    • @dependabot use these reviewers will set the current reviewers as the default for future PRs for this repo and language
    • @dependabot use these assignees will set the current assignees as the default for future PRs for this repo and language
    • @dependabot use this milestone will set the current milestone as the default for future PRs for this repo and language

    You can disable automated security fix PRs for this repo from the Security Alerts page.

    size/XS dependencies javascript 
    opened by dependabot[bot] 2
  • GPU not consuming for Katib experiment - GKE Could not load dynamic library 'libcuda.so.1'; dlerror: libcuda.so.1: cannot open shared object file: No such file or directory

    /kind bug

    What steps did you take and what happened: I am trying to create a Kubeflow pipeline that tunes the hyperparameters of a text classification model in TensorFlow using Katib on GKE clusters. I created a cluster using the commands below:

    CLUSTER_NAME="kubeflow-pipelines-standalone-v2"
    ZONE="us-central1-a"
    MACHINE_TYPE="n1-standard-2"
    SCOPES="cloud-platform"
    NODES_NUM=1
    
    gcloud container clusters create $CLUSTER_NAME --zone $ZONE --machine-type $MACHINE_TYPE --scopes $SCOPES --num-nodes $NODES_NUM
    
    gcloud config set compute/zone $ZONE
    gcloud container clusters get-credentials $CLUSTER_NAME
    
    export PIPELINE_VERSION=1.8.2
    kubectl apply -k "github.com/kubeflow/pipelines/manifests/kustomize/cluster-scoped-resources?ref=$PIPELINE_VERSION"
    kubectl wait --for condition=established --timeout=60s crd/applications.app.k8s.io
    kubectl apply -k "github.com/kubeflow/pipelines/manifests/kustomize/env/dev?ref=$PIPELINE_VERSION"
    # katib
    kubectl apply -k "github.com/kubeflow/katib.git/manifests/v1beta1/installs/katib-standalone?ref=v0.13.0"
    kubectl apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone?ref=v1.4.0"
    kubectl apply -f ./test.yaml
    
    # disabling caching
    export NAMESPACE=kubeflow
    kubectl get mutatingwebhookconfiguration cache-webhook-${NAMESPACE}
    kubectl patch mutatingwebhookconfiguration cache-webhook-${NAMESPACE} --type='json' -p='[{"op":"replace", "path": "/webhooks/0/rules/0/operations/0", "value": "DELETE"}]'
    
    kubectl describe configmap inverse-proxy-config -n kubeflow | grep googleusercontent.com
    
    GPU_POOL_NAME="gpu-pool2"
    CLUSTER_NAME="kubeflow-pipelines-standalone-v2"
    CLUSTER_ZONE="us-central1-a"
    GPU_TYPE="nvidia-tesla-k80"
    GPU_COUNT=1
    MACHINE_TYPE="n1-highmem-8"
    NODES_NUM=1
    
    # Node pool creation may take several minutes.
    gcloud container node-pools create ${GPU_POOL_NAME} --accelerator type=${GPU_TYPE},count=${GPU_COUNT} --zone ${CLUSTER_ZONE} --cluster ${CLUSTER_NAME} --num-nodes=0 --machine-type=${MACHINE_TYPE} --scopes=cloud-platform --num-nodes $NODES_NUM
      
    kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml
    kubectl apply -f https://raw.githubusercontent.com/kubernetes/kubernetes/master/cluster/addons/device-plugins/nvidia-gpu/daemonset.yaml
    

    I then created a Kubeflow pipeline:

    
    from kfp import compiler
    import kfp
    import kfp.dsl as dsl
    from kfp import components
    
    @dsl.pipeline(
        name="End to End Pipeline",
        description="An end to end mnist example including hyperparameter tuning, train and inference"
    )
    def pipeline_func(
        time_loc = "gs://faris_bucket_us_central/Pipeline_data/input_dataset/dbpedia_model/GKE_Katib/time_csv/",
        hyper_image_uri_train = "gcr.io/.............../hptunekatib:v7",
        hyper_image_uri = "gcr.io/.............../hptunekatibclient:v7",
        model_uri = "gs://faris_bucket_us_central/Pipeline_data/dbpedia_hyper_models/GKE_Katib/",
        experiment_name = "dbpedia-exp-1",
        experiment_namespace = "kubeflow",
        experiment_timeout_minutes = 60
    ):
        
        # first stage : ingest and preprocess -> returns uploaded gcs URI for the pre processed dataset, setting memmory to 32GB, CPU to 4 CPU
        hp_tune = dsl.ContainerOp(
              name='hp-tune-katib',
              image=hyper_image_uri,
              arguments=[
                '--experiment_name', experiment_name,
                '--experiment_namespace', experiment_namespace,
                '--experiment_timeout_minutes', experiment_timeout_minutes,
                '--delete_after_done', True,
                '--hyper_image_uri', hyper_image_uri_train,
                '--time_loc', time_loc, 
                '--model_uri', model_uri
    
              ],
              file_outputs={'best-params': '/output.txt'}
            ).set_gpu_limit(1)
        
        # restricting the maximum usable memory and cpu for preprocess stage
        hp_tune.set_memory_limit("49G")
        hp_tune.set_cpu_limit("7")
    
    # Run the Kubeflow Pipeline in the user's namespace.
    if __name__ == '__main__':
        
        # compiling the model and generating tar.gz file to upload to Kubeflow Pipeline UI
        import kfp.compiler as compiler
    
        compiler.Compiler().compile(
            pipeline_func, 'pipeline_db.tar.gz'
        )
    

    These are my two containers:

    1. To launch the katib experiments based on the specified parameters and arguments passed to the dsl.ContainerOp()
    2. The main training script for text classification. This container is passed as "image" to the trial spec for katib

    gcr.io/.............../hptunekatibclient:v7

    # importing required packages
    import argparse
    import datetime
    from datetime import datetime as dt
    from distutils.util import strtobool
    import json
    import os
    import logging
    import time
    import pandas as pd
    from google.cloud import storage
    from pytz import timezone
    
    from kubernetes.client import V1ObjectMeta
    
    from kubeflow.katib import KatibClient
    from kubeflow.katib import ApiClient
    from kubeflow.katib import V1beta1Experiment
    
    from kubeflow.katib import ApiClient
    from kubeflow.katib import V1beta1ExperimentSpec
    from kubeflow.katib import V1beta1AlgorithmSpec
    from kubeflow.katib import V1beta1ObjectiveSpec
    from kubeflow.katib import V1beta1ParameterSpec
    from kubeflow.katib import V1beta1FeasibleSpace
    from kubeflow.katib import V1beta1TrialTemplate
    from kubeflow.katib import V1beta1TrialParameterSpec
    from kubeflow.katib import V1beta1MetricsCollectorSpec
    from kubeflow.katib import V1beta1CollectorSpec
    from kubeflow.katib import V1beta1FileSystemPath
    from kubeflow.katib import V1beta1SourceSpec
    from kubeflow.katib import V1beta1FilterSpec
    
    logger = logging.getLogger()
    logging.basicConfig(level=logging.INFO)
    
    FINISH_CONDITIONS = ["Succeeded", "Failed"]
    
    
    # function to record the start time and end time to calculate execution time, pipeline start up and teardown time
    def write_time(types, time_loc):
    
        formats = "%Y-%m-%d %I:%M:%S %p"
    
        now_utc = dt.now(timezone('UTC'))
        now_asia = now_utc.astimezone(timezone('Asia/Kolkata'))
        start_time = str(now_asia.strftime(formats))
        time_df = pd.DataFrame({"time":[start_time]})
        print("written")
        time_df.to_csv(time_loc + types + ".csv", index=False)
    
    
    def get_args():
        parser = argparse.ArgumentParser(description='Katib Experiment launcher')
        parser.add_argument('--experiment_name', type=str,
                            help='Experiment name')
        parser.add_argument('--experiment_namespace', type=str, default='anonymous',
                            help='Experiment namespace')
        parser.add_argument('--experiment_timeout_minutes', type=int, default=60*24,
                            help='Time in minutes to wait for the Experiment to complete')
        parser.add_argument('--delete_after_done', type=strtobool, default=True,
                            help='Whether to delete the Experiment after it is finished')
        parser.add_argument('--hyper_image_uri', type=str, default="gcr.io/.............../hptunekatib:v2",
                            help='Hyper image uri')
        parser.add_argument('--time_loc', type=str, default="gs://faris_bucket_us_central/Pipeline_data/input_dataset/dbpedia_model/GKE_Katib/time_csv/",
                            help='Time loc')
        parser.add_argument('--model_uri', type=str, default="gs://faris_bucket_us_central/Pipeline_data/dbpedia_hyper_models/GKE_Katib/",
                            help='Model URI')
        
        return parser.parse_args()
    
    def wait_experiment_finish(katib_client, experiment, timeout):
        polling_interval = datetime.timedelta(seconds=30)
        end_time = datetime.datetime.now() + datetime.timedelta(minutes=timeout)
        experiment_name = experiment.metadata.name
        experiment_namespace = experiment.metadata.namespace
        while True:
            current_status = None
            try:
                current_status = katib_client.get_experiment_status(name=experiment_name, namespace=experiment_namespace)
            except Exception as e:
                logger.info("Unable to get current status for the Experiment: {} in namespace: {}. Exception: {}".format(
                    experiment_name, experiment_namespace, e))
            # If Experiment has reached complete condition, exit the loop.
            if current_status in FINISH_CONDITIONS:
                logger.info("Experiment: {} in namespace: {} has reached the end condition: {}".format(
                    experiment_name, experiment_namespace, current_status))
                return
            # Print the current condition.
            logger.info("Current condition for Experiment: {} in namespace: {} is: {}".format(
                experiment_name, experiment_namespace, current_status))
            # If timeout has been reached, rise an exception.
            if datetime.datetime.now() > end_time:
                raise Exception("Timout waiting for Experiment: {} in namespace: {} "
                                "to reach one of these conditions: {}".format(
                                    experiment_name, experiment_namespace, FINISH_CONDITIONS))
            # Sleep for poll interval.
            time.sleep(polling_interval.seconds)
    
    
    if __name__ == "__main__":
        
    
        args = get_args()
        
        write_time("hyper_parameter_tuning_start", args.time_loc)
        
        # Trial count specification.
        max_trial_count = 2
        max_failed_trial_count = 2
        parallel_trial_count = 1
    
        # Objective specification.
        objective = V1beta1ObjectiveSpec(
            type="minimize",
            # goal=100,
            objective_metric_name="accuracy"
            # additional_metric_names=["accuracy"]
        )
    
        # Objective specification.
    #     metrics_collector_specs = V1beta1MetricsCollectorSpec(
    #         collector=V1beta1CollectorSpec(kind="File"),
    #         source=V1beta1SourceSpec(
    #             file_system_path=V1beta1FileSystemPath(
    #                 # format="TEXT",
    #                 path="/opt/trainer/katib/metrics.log",
    #                 kind="File"
    #             ),
    #             filter=V1beta1FilterSpec(
    #                 # metrics_format=["{metricName: ([\\w|-]+), metricValue: ((-?\\d+)(\\.\\d+)?)}"]
    #                 metrics_format=["([\w|-]+)\s*=\s*([+-]?\d*(\.\d+)?([Ee][+-]?\d+)?)"]
    
    #             )
    #         )
    #     )
    
        # Algorithm specification.
        algorithm = V1beta1AlgorithmSpec(
            algorithm_name="random",
        )
    
        # Experiment search space.
        # In this example we tune learning rate and batch size.
        parameters = [
            V1beta1ParameterSpec(
                name="batch_size",
                parameter_type="discrete",
                feasible_space=V1beta1FeasibleSpace(
                    list=["32", "42", "52", "62", "64"]
                ),
            ),
            V1beta1ParameterSpec(
                name="learning_rate",
                parameter_type="double",
                feasible_space=V1beta1FeasibleSpace(
                    min="0.001",
                    max="0.005"
                ),
            )
        ]
    
        # TODO (andreyvelich): Use community image for the mnist example.
        trial_spec = {
            "apiVersion": "kubeflow.org/v1",
            "kind": "TFJob",
            "spec": {
                "tfReplicaSpecs": {
                    "PS": {
                        "replicas": 1,
                        "restartPolicy": "Never",
                        "template": {
                            "metadata": {
                                "annotations": {
                                    "sidecar.istio.io/inject": "false",
                                }
                            },
                            "spec": {
                                "containers": [
                                    {
                                        "name": "tensorflow",
                                        "image": args.hyper_image_uri,
                                        "command": [
                                            "python",
                                            "/opt/trainer/task.py",
                                            "--model_uri=" + args.model_uri,
                                            "--batch_size=${trialParameters.batchSize}",
                                            "--learning_rate=${trialParameters.learningRate}"
    
                                        ],
                                        "ports" : [
                                            {
                                                "containerPort": 2222,
                                                "name" : "tfjob-port"
                                            }
                                        ]
                                        # "resources": {
                                        #     "limits" : {
                                        #         "cpu": "1"
                                        #     }
                                        # }
                                    }
                                ]
                            }
                        }
                    },
                    "Worker": {
                        "replicas": 1,
                        "restartPolicy": "Never",
                        "template": {
                            "metadata": {
                                "annotations": {
                                    "sidecar.istio.io/inject": "false",
                                }
                            },
                            "spec": {
                                "containers": [
                                    {
                                        "name": "tensorflow",
                                        "image": args.hyper_image_uri,
                                        "command": [
                                            "python",
                                            "/opt/trainer/task.py",
                                            "--model_uri=" + args.model_uri,
                                            "--batch_size=${trialParameters.batchSize}",
                                            "--learning_rate=${trialParameters.learningRate}"
                                        ],
                                        "ports" : [
                                            {
                                                "containerPort": 2222,
                                                "name" : "tfjob-port"
                                            }
                                        ]
                                        # "resources": {
                                        #     "limits" : {
                                        #         "nvidia.com/gpu": 1
                                        #     }
                                        # }
                                    }
                                ]
                            }
                        }
                    }
                }
            }
        }
    
    
        # Configure parameters for the Trial template.
        trial_template = V1beta1TrialTemplate(
            primary_container_name="tensorflow",
            trial_parameters=[
                V1beta1TrialParameterSpec(
                    name="batchSize",
                    description="batch size",
                    reference="batch_size"
                ),
                V1beta1TrialParameterSpec(
                    name="learningRate",
                    description="Learning rate",
                    reference="learning_rate"
                ),
            ],
            trial_spec=trial_spec
        )
    
        # Create an Experiment from the above parameters.
        experiment_spec = V1beta1ExperimentSpec(
            max_trial_count=max_trial_count,
            max_failed_trial_count=max_failed_trial_count,
            parallel_trial_count=parallel_trial_count,
            objective=objective,
            algorithm=algorithm,
            parameters=parameters,
            trial_template=trial_template
        )
    
        experiment_name = args.experiment_name
        experiment_namespace = args.experiment_namespace
    
        logger.info("Creating Experiment: {} in namespace: {}".format(experiment_name, experiment_namespace))
    
        # Create Experiment object.
        experiment = V1beta1Experiment(
            api_version="kubeflow.org/v1beta1",
            kind="Experiment",
            metadata=V1ObjectMeta(
                name=experiment_name,
                namespace=experiment_namespace
            ),
            spec=experiment_spec
        )
        logger.info("Experiment Spec : " + str(experiment_spec))
        
        
        logger.info("Experiment: " + str(experiment))
    
        # Create Katib client.
        katib_client = KatibClient()
        # Create Experiment in Kubernetes cluster.
        output = katib_client.create_experiment(experiment, namespace=experiment_namespace)
    
        # Wait until Experiment is created.
        end_time = datetime.datetime.now() + datetime.timedelta(minutes=60)
        while True:
            current_status = None
            # Try to get Experiment status.
            try:
                current_status = katib_client.get_experiment_status(name=experiment_name, namespace=experiment_namespace)
            except Exception:
                logger.info("Waiting until Experiment is created...")
            # If current status is set, exit the loop.
            if current_status is not None:
                break
            # If timeout has been reached, rise an exception.
            if datetime.datetime.now() > end_time:
                raise Exception("Timout waiting for Experiment: {} in namespace: {} to be created".format(
                    experiment_name, experiment_namespace))
            time.sleep(1)
    
        logger.info("Experiment is created")
    
        # Wait for Experiment finish.
        wait_experiment_finish(katib_client, experiment, args.experiment_timeout_minutes)
    
        # Check if Experiment is successful.
        if katib_client.is_experiment_succeeded(name=experiment_name, namespace=experiment_namespace):
            logger.info("Experiment: {} in namespace: {} is successful".format(
                experiment_name, experiment_namespace))
    
            optimal_hp = katib_client.get_optimal_hyperparameters(
                name=experiment_name, namespace=experiment_namespace)
            logger.info("Optimal hyperparameters:\n{}".format(optimal_hp))
    
            # # Create dir if it doesn't exist.
            # if not os.path.exists(os.path.dirname("output.txt")):
            #     os.makedirs(os.path.dirname("output.txt"))
            # Save HyperParameters to the file.
            with open("output.txt", 'w') as f:
                f.write(json.dumps(optimal_hp))
        else:
            logger.info("Experiment: {} in namespace: {} is failed".format(
                experiment_name, experiment_namespace))
            # Print Experiment if it is failed.
            experiment = katib_client.get_experiment(name=experiment_name, namespace=experiment_namespace)
            logger.info(experiment)
    
        # Delete Experiment if it is needed.
        if args.delete_after_done:
            katib_client.delete_experiment(name=experiment_name, namespace=experiment_namespace)
            logger.info("Experiment: {} in namespace: {} has been deleted".format(
                experiment_name, experiment_namespace))
            
        write_time("hyper_parameter_tuning_end", args.time_loc)
    
    

    Dockerfile

    FROM gcr.io/deeplearning-platform-release/tf-gpu.2-8
    
    # installing packages
    RUN pip install pandas
    RUN pip install gcsfs
    RUN pip install google-cloud-storage
    RUN pip install pytz
    RUN pip install kubernetes
    RUN pip install kubeflow-katib
    # moving code to preprocess
    
    RUN mkdir /hp_tune
    COPY task.py /hp_tune
    
    # CREDENTIAL Authentication
    COPY /prj-vertex-ai-2c390f7e8fec.json /hp_tune/prj-vertex-ai-2c390f7e8fec.json
    ENV GOOGLE_APPLICATION_CREDENTIALS="/hp_tune/prj-vertex-ai-2c390f7e8fec.json"
    
    # entry point
    ENTRYPOINT ["python3", "/hp_tune/task.py"]
    

    gcr.io/.............../hptunekatib:v7

    # import os
    # os.system("pip install tensorflow-gpu==2.8.0")
    
    from sklearn.preprocessing import LabelEncoder
    import tensorflow as tf
    import tensorflow_hub as hub
    import tensorflow_text as text
    import os
    from tensorflow.keras.layers import Conv1D, MaxPool1D ,Embedding ,concatenate
    from tensorflow.keras.layers import Activation, Dropout, Flatten, Dense,Input 
    from tensorflow.keras.models import Model 
    from tensorflow import keras
    from datetime import datetime
    from pytz import timezone
    from sklearn.model_selection import train_test_split
    import pandas as pd
    from google.cloud import storage
    import argparse
    import logging
    
    logger = logging.getLogger()
    logging.basicConfig(level=logging.INFO)    
    
    logger.info("Num GPUs Available: " + str(tf.config.list_physical_devices('GPU')))
    import subprocess
    process = subprocess.Popen(['sh', '-c', 'nvidia-smi'], stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out, err = process.communicate()
    logger.info("NVIDIA SMI " + str(out))
    def format_strs(x):
        strs = ""
        if x > 0:
            sign_t = "+"
            strs += "+"
        else:
            sign_t = "-"
            
            strs += "-"
            
        strs = strs + "{:.1e}".format(x)
        
        if "+" in strs[1:]:
            sign = "+"
            strs = strs[1:].split("+")
        else:
            sign = "-"
            strs = strs[1:].split("-")
            
        last_d = strs[1][1:] if strs[1][0] == "0" else strs[1]
        
        strs_f = sign_t + strs[0] + sign + last_d
        return strs_f
        
    def get_args():
        '''Parses args. Must include all hyperparameters you want to tune.'''
    
        parser = argparse.ArgumentParser()
        
        parser.add_argument(
          '--learning_rate',
          required=True,
          type=float,
          help='learning_rate')
        
        parser.add_argument(
          '--batch_size',
          required=True,
          type=int,
          help='batch_size')
        
        parser.add_argument(
          '--model_uri',
          required=True,
          type=str,
          help='Model Uri')
        
        args = parser.parse_args()
        return args
    
    def download_blob(bucket_name, source_blob_name, destination_file_name):
        """Downloads a blob from the bucket."""
        # The ID of your GCS bucket
        # bucket_name = "your-bucket-name"
    
        # The ID of your GCS object
        # source_blob_name = "storage-object-name"
    
        # The path to which the file should be downloaded
        # destination_file_name = "local/path/to/file"
    
        storage_client = storage.Client()
    
        bucket = storage_client.bucket(bucket_name)
    
        # Construct a client side representation of a blob.
        # Note `Bucket.blob` differs from `Bucket.get_blob` as it doesn't retrieve
        # any content from Google Cloud Storage. As we don't need additional data,
        # using `Bucket.blob` is preferred here.
        blob = bucket.blob(source_blob_name)
        blob.download_to_filename(destination_file_name)
    
    
    def create_dataset():
        
        download_blob("faris_bucket_us_central", "Pipeline_data/input_dataset/dbpedia_model/data/" + "train.csv", "train.csv")
        
        trainData = pd.read_csv('train.csv')
        trainData.columns = ['label','title','description']
        
        # trainData = trainData.sample(frac=0.002)
        
        X_train, X_test, y_train, y_test = train_test_split(trainData['description'], trainData['label'], stratify=trainData['label'], test_size=0.1, random_state=0)
        
        return X_train, X_test, y_train, y_test
    
    
    def train_model(train_X, train_y, test_X, test_y, learning_rate, batch_size):
      
        logger.info("Training with lr = " + str(learning_rate) + "bs = " + str(batch_size))
        bert_preprocess = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3")
        bert_encoder = hub.KerasLayer("https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-12_H-128_A-2/2", trainable=False)
    
        text_input = tf.keras.layers.Input(shape=(), dtype=tf.string, name='text')
        preprocessed_text = bert_preprocess(text_input)
        outputs = bert_encoder(preprocessed_text)
    
        # Neural network layers
        l = tf.keras.layers.Dropout(0.2, name="dropout")(outputs['pooled_output']) # dropout_rate
        l = tf.keras.layers.Dense(14,activation='softmax',kernel_initializer=tf.keras.initializers.GlorotNormal(seed=24))(l) # dense_units
    
        model = Model(inputs=[text_input], outputs=l)
    
        model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate),loss='categorical_crossentropy',metrics=['accuracy'])
        
        history = model.fit(train_X, train_y, epochs=5, validation_data=(test_X, test_y), batch_size=batch_size)
        
        return model, history
    
    
    def main():
        
        args = get_args()
        logger.info("Creating dataset")
        train_X, test_X, train_y, test_y = create_dataset()
        
        # one_hot_encoding the class label
        encoder = LabelEncoder()
        encoder.fit(train_y)
        y_train_encoded = encoder.transform(train_y)
        y_test_encoded = encoder.transform(test_y)
    
        y_train_ohe = tf.keras.utils.to_categorical(y_train_encoded)
        y_test_ohe = tf.keras.utils.to_categorical(y_test_encoded)
        
        logger.info("Training model")
        model = train_model(
            train_X,
            y_train_ohe,
            test_X,
            y_test_ohe,
            args.learning_rate,
            int(float(args.batch_size))
        )
        
        logger.info("Saving model")
        artifact_filename = 'saved_model'
        local_path = artifact_filename
        tf.saved_model.save(model[0], local_path)
        
        # Upload model artifact to Cloud Storage
        model_directory = args.model_uri + "-".join(os.environ["HOSTNAME"].split("-")[:-2]) + "/"
        local_path = "saved_model/assets/vocab.txt"
        storage_path = os.path.join(model_directory, "assets/vocab.txt")
        blob = storage.blob.Blob.from_string(storage_path, client=storage.Client())
        blob.upload_from_filename(local_path)
        
        local_path = "saved_model/variables/variables.data-00000-of-00001"
        storage_path = os.path.join(model_directory, "variables/variables.data-00000-of-00001")
        blob = storage.blob.Blob.from_string(storage_path, client=storage.Client())
        blob.upload_from_filename(local_path)
        
        local_path = "saved_model/variables/variables.index"
        storage_path = os.path.join(model_directory, "variables/variables.index")
        blob = storage.blob.Blob.from_string(storage_path, client=storage.Client())
        blob.upload_from_filename(local_path)
        
        local_path = "saved_model/saved_model.pb"
        storage_path = os.path.join(model_directory, "saved_model.pb")
        blob = storage.blob.Blob.from_string(storage_path, client=storage.Client())
        blob.upload_from_filename(local_path)
    
        logger.info("Model Saved at " + model_directory)
        
        logger.info("Keras Score: " + str(model[1].history["accuracy"][-1]))
        
        hp_metric = model[1].history["accuracy"][-1]
        
        print("accuracy =", format_strs(hp_metric))
    
    if __name__ == "__main__":
        main()
    
    

    Dockerfile

    # FROM gcr.io/deeplearning-platform-release/tf-cpu.2-8
    FROM gcr.io/deeplearning-platform-release/tf-gpu.2-8
    
    RUN mkdir -p /opt/trainer
    
    # RUN pip install scikit-learn
    RUN pip install tensorflow_text==2.8.1
    # RUN pip install tensorflow-gpu==2.8.0
    
    # CREDENTIAL Authentication
    COPY /prj-vertex-ai-2c390f7e8fec.json /prj-vertex-ai-2c390f7e8fec.json
    ENV GOOGLE_APPLICATION_CREDENTIALS="/prj-vertex-ai-2c390f7e8fec.json"
    
    COPY *.py /opt/trainer/
    
    # # RUN chgrp -R 0 /opt/trainer && chmod -R g+rwX /opt/trainer
    # RUN chmod -R 777 /home/trainer
    
    ENTRYPOINT ["python", "/opt/trainer/task.py"]
    
    # Sets up the entry point to invoke the trainer.
    # ENTRYPOINT ["python", "-m", "trainer.task"]
    
    
    

    The pipeline runs, but it does not use the GPU, and this piece of code

    logger.info("Num GPUs Available: " + str(tf.config.list_physical_devices('GPU')))
    import subprocess
    process = subprocess.Popen(['sh', '-c', 'nvidia-smi'], stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out, err = process.communicate()
    logger.info("NVIDIA SMI " + str(out))
    

    returns an empty list and an empty string. It is as if the GPU does not exist. I am attaching the container logs below.

    Container logs (all rows share: pod dbpedia-exp-1-ntq7tfvj-ps-0, TFJob dbpedia-exp-1-ntq7tfvj, replica ps-0, operator tfjob-controller, cluster kubeflow-pipelines-standalone-v2, namespace kubeflow, project prj-vertex-ai, node gke-kubeflow-pipelines-s-default-pool-e4e6dda3-544k, timestamp 2022-07-11T06:07:30Z):

    stream | severity | textPayload
    -- | -- | --
    stdout | INFO | accuracy = +9.9e-1
    stderr | ERROR | INFO:root:Num GPUs Available: []
    stderr | ERROR | 2022-07-11 06:07:30.811609: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:156] kernel driver does not appear to be running on this host (dbpedia-exp-1-ntq7tfvj-ps-0): /proc/driver/nvidia/version does not exist
    stderr | ERROR | 2022-07-11 06:07:30.811541: W tensorflow/stream_executor/cuda/cuda_driver.cc:269] failed call to cuInit: UNKNOWN ERROR (303)
    stderr | ERROR | 2022-07-11 06:07:30.811461: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load [truncated]

    /kind bug
    

    What did you expect to happen:

    I expected the pipeline stage to use the GPU and run the text classification on it, but it does not.

    Anything else you would like to add: [Miscellaneous information that will assist in solving the issue.]

    Environment:

    • Katib version (check the Katib controller image version): v0.13.0
    • Kubernetes version: (kubectl version): 1.22.8-gke.202
    • OS (uname -a): linux/ COS in containers
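    The logs above suggest the Trial pod was scheduled on a node without a working NVIDIA driver, which typically happens when the container never requests a GPU in the first place. On GKE the node pool also needs GPU nodes with the NVIDIA driver installed; the /proc/driver/nvidia/version error above usually means no driver is present on the node. A minimal sketch of a TFJob-based trialSpec container that requests one GPU; the image name is illustrative and not taken from the issue:

    trialTemplate:
      primaryContainerName: tensorflow
      trialSpec:
        apiVersion: kubeflow.org/v1
        kind: TFJob
        spec:
          tfReplicaSpecs:
            Worker:
              replicas: 1
              restartPolicy: OnFailure
              template:
                spec:
                  containers:
                    - name: tensorflow
                      image: gcr.io/my-project/trainer:gpu   # illustrative image
                      resources:
                        limits:
                          nvidia.com/gpu: 1   # without this limit the pod can land on CPU-only nodes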

    Impacted by this bug? Give it a 👍 We prioritize the issues with the most 👍

    kind/bug 
    opened by farisfirenze 8
  • Add support for entire kubeflow pipelines as trial target (in addition to containers)

    Add support for entire kubeflow pipelines as trial target (in addition to containers)

    /kind feature

    Describe the solution you'd like Hyper parameters not only affect the training step but also upstream pipeline components like feature transformation for example (e.g. parameters of a normalization transformation). In addition, transformation and training steps should be able to make use of kfp's parallel components (e.g. SparkJob, TFJob, ...). It would be helpful to not only allow to specify containers as trial targets but also complete kubeflow pipelines. As the latter also expose parameters they can be either set directly (non-hyperparameters) or added to the hyper parameter space.

    Anything else you would like to add: I've started to create simple container image which can be used as trial target which acts as a proxy and downstream triggers parameterized Kubeflow pipeline executions with the respective hyper parameters. A Kubernetes Custom Resource can be created as well down the line.
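    A rough sketch of the proxy approach described above: the Trial's only container launches a parameterized Kubeflow Pipelines run with the trial's hyperparameters and waits for it to finish. Every name here (image, launcher script, pipeline package, parameter) is hypothetical and not taken from the issue:

    trialTemplate:
      primaryContainerName: kfp-launcher
      trialParameters:
        - name: learningRate
          reference: lr
      trialSpec:
        apiVersion: batch/v1
        kind: Job
        spec:
          template:
            spec:
              restartPolicy: Never
              containers:
                - name: kfp-launcher
                  image: example/kfp-trial-launcher:latest   # hypothetical proxy image
                  command:
                    - python
                    - launch.py                              # hypothetical script, e.g. calling kfp.Client().create_run_from_pipeline_package
                    - "--pipeline=training_pipeline.yaml"
                    - "--lr=${trialParameters.learningRate}"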


    Love this feature? Give it a 👍 We prioritize the features with the most 👍

    kind/feature 
    opened by romeokienzler 1
Releases(v0.14.0-rc.0)
  • v0.14.0-rc.0(Jun 30, 2022)

  • v0.13.0(Mar 4, 2022)

    This is the Katib v0.13.0 release.

    Breaking changes:

    1. The namespace label that enables Metrics Collector injection is changed to katib.kubeflow.org/metrics-collector-injection=enabled #1740 (see the sketch after this list)
    2. The current request number field in the gRPC API is renamed to current_request_number #1728
    3. The training.kubeflow.org prefix is added to the default primary pod labels job-role and replica-type of the Training Operators #1813
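    A minimal sketch of a namespace carrying the new label, so that Katib keeps injecting the metrics collector sidecar after upgrading; the namespace name is illustrative:

    apiVersion: v1
    kind: Namespace
    metadata:
      name: my-training-namespace   # illustrative name
      labels:
        katib.kubeflow.org/metrics-collector-injection: enabled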

    New Features

    Algorithms and Components

    • Implement validation for Early Stopping (#1709 by @tenzen-y)
    • Change namespace label for Metrics Collector injection (#1740 by @andreyvelich)
    • Modify gRPC API with Current Request Number (#1728 by @andreyvelich)
    • Allow to remove each resource in Katib config (#1729 by @andreyvelich)
    • Support leader election for Katib Controller (#1713 by @tenzen-y)
    • Change default Metrics Collect format (#1707 by @anencore94)
    • Bump Python version to 3.9 (#1731 by @tenzen-y)
    • Update Go version to 1.17 (#1683 by @andreyvelich)
    • Create Python script to run e2e Argo Workflow (#1674 by @andreyvelich)
    • Reimplement Katib Cert Generator in Go (#1662 by @tenzen-y)
    • SDK: change list apis to return objects as default (#1630 by @anencore94)

    UI Features

    • Enhance Katib UI feasible space (#1721 by @seong7)
    • Handle missing TrialTemplates in Katib UI (#1652 by @kimwnasptd)
    • Add Prettier devDependency in Katib UI (#1629 by @seong7)

    Documentation

    • Fix a link for GRPC API documentation (#1786 by @tenzen-y)
    • Add my presentations that include Katib (#1753 by @terrytangyuan)
    • Add Akuity to list of adopters (#1749 by @terrytangyuan)
    • Change Argo -> Argo Workflows (#1741 by @terrytangyuan)
    • Update Algorithm Service Doc for the new CI script (#1724 by @andreyvelich)
    • Update link to Training Operator (#1699 by @terrytangyuan)
    • Refactor examples folder structure (#1691 by @andreyvelich)
    • Fix README in examples directory (#1687 by @tenzen-y)
    • Add Kubeflow MXJob example (#1688 by @andreyvelich)
    • Update FPGA examples (#1685 by @eliaskoromilas)
    • Refactor README (#1667 by @andreyvelich)
    • Change the minimal Kustomize version in the developer guide (#1675 by @tenzen-y)
    • Add Katib release process guide (#1641 by @andreyvelich)

    Bug Fixes

    • Remove unrecognized keys from metadata.yaml in Charmed operators (#1759 by @DnPlas)
    • Fix the default Metrics Collector regex (#1755 by @andreyvelich)
    • Fix Status Handling in Charmed Operators (#1743 by @DomFleischmann)
    • Fix bug on list type HP in Katib UI (#1704 by @seong7)
    • Fix Range for Int and Double values in Grid search (#1732 by @andreyvelich)
    • Check if parameter references exist in Experiment parameters (#1726 by @henrysecond1)
    • Fix same set for HyperParameters in Bayesian Optimization algorithm (#1701 by @fabianvdW)
    • Close MySQL statement and rows resources when SQL exec ends (#1720 by @chenwenjun-github)
    • Fix Cluster Role of Katib Controller to access image pull secrets (#1725 by @henrysecond1)
    • Emit events when fails to reconcile all Trials (#1706 by @henrysecond1)
    • Missing metrics port annotation (#1715 by @alexeykaplin)
    • Fix absolute value in Katib UI (#1676 by @anencore94)
    • Add missing omitempty parameter to APIs (#1645 by @andreyvelich)
    • Reconcile semantics for Suggestion Algorithms (#1633 by @johnugeorge)
    • Fix default label for Training Operators (#1813 by @andreyvelich)
    • Update supported Python version for Katib SDK (#1798 by @tenzen-y)

    Misc

    • Use release tags for Trial images (#1757 by @andreyvelich)
    • Upgrade cert-manager API from v1alpha2 to v1 (#1752 by @haoxins)
    • Add Workflow to Publish Katib Images (#1746 by @andreyvelich)
    • Update Charmed Katib Operators + CI to 0.12 (#1717 by @knkski)
    • Updating Katib CI to use Training Operator (#1710 by @midhun1998)
    • Update OWNERS for Charmed operators (#1718 by @ca-scribner)
    • Implement some unit tests for the Katib Config package (#1690 by @tenzen-y)
    • Add GitHub Actions for Python unit tests (#1677 by @andreyvelich)
    • Add OWNERS file for the Katib new UI (#1681 by @kimwnasptd)
    • Add envtest to check reconcileRBAC (#1678 by @tenzen-y)
    • Use golangci-lint as linter for Go (#1671 by @tenzen-y)
    Source code(tar.gz)
    Source code(zip)
  • v0.12.0(Oct 6, 2021)

    This is the Katib v0.12.0 release.

    The major advantages:

    • Optuna Suggestion service with the new algorithms, big thanks to @g-votte and @c-bata.
    • Sobol's Quasirandom Sequence algorithm and the IPOP-CMA-ES and BIPOP-CMA-ES restart strategies, big thanks to @c-bata (see the sketch after this list).
    • Katib can now run Trials as Argo Workflows, big thanks to @andreyvelich.
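    A hedged sketch of selecting these algorithms in an Experiment spec; the algorithmName values follow the Katib documentation, while the restart_strategy setting shown for CMA-ES should be treated as an assumption:

    algorithm:
      algorithmName: cmaes            # or: sobol, tpe, multivariate-tpe
      algorithmSettings:
        - name: restart_strategy      # assumed setting name; selects IPOP/BIPOP restarts
          value: ipop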

    New Features

    Algorithms and Components

    • Add Optuna based suggestion service (#1613 by @g-votte)
    • Support Sobol's Quasirandom Sequence using Goptuna. (#1523 by @c-bata)
    • Bump the Goptuna version up to v0.8.0 with IPOP-CMA-ES and BIPOP-CMA-ES support. (#1519 by @c-bata)
    • Validate possible operations for Grid suggestion (#1205 by @andreyvelich)
    • Validate for Bayesian Optimization algorithm settings (#1600 by @anencore94)
    • Add Support for Argo Workflows (#1605 by @andreyvelich)
    • Add Support for XGBoost Operator with LightGBM example (#1603 by @andreyvelich)
    • Allow empty resources for CPU and Memory in Katib config (#1564 by @andreyvelich)
    • Add kustomization overlay: katib-openshift (#1513 by @maanur)
    • Switch to SDI in Katib Charm (#1555 by @knkski)

    UI Features

    • Add Multivariate TPE to Katib UI (#1625 by @andreyvelich)
    • Update Katib UI with Optuna Algorithm Settings (#1626 by @andreyvelich)
    • Change the default image for the new Katib UI (#1608 by @andreyvelich)

    Documentation

    • Add Katib 2021 ROADMAP (#1524 by @andreyvelich)
    • Add AutoML and Training WG Summit July 2021 (#1615 by @andreyvelich)
    • Add the new Katib presentations 2021 (#1539 by @andreyvelich)
    • Add Doc checklist to PR template (#1568 by @andreyvelich)
    • Fix typo in operators/README (#1557 by @evilnick)
    • Adds docs on how to use Katib Charm within KF (#1556 by @RFMVasconcelos)
    • Fix a link to Kustomize manifest for new Katib UI (#1521 by @c-bata)

    Bug Fixes

    • Fix UI for handling missing params (#1657 by @kimwnasptd)
    • Reconcile semantics for Suggestion Algorithms (#1644 by @johnugeorge)
    • Fix Metrics Collector error in case of non-existing Process (#1614 by @andreyvelich)
    • Fix mysql version in docker image (#1594 by @munagekar)
    • Fix grep in Tekton Experiment Doc (#1578 by @andreyvelich)
    • Error messages corrected (#1522 by @himanshu007-creator)
    • Install charmcraft 1.0.0 (#1593 by @DomFleischmann)

    Misc

    • Modify XGBoostJob example for the new Controller (#1623 by @andreyvelich)
    • Modify Labels for controller resources (#1621 by @andreyvelich)
    • Modify Labels for Katib Components (#1611 by @andreyvelich)
    • Upgrade CRDs to apiextensions.k8s.io/v1 (#1610 by @andreyvelich)
    • Update Katib SDK with OpenAPI generator (#1572 by @andreyvelich)
    • Disable default PV for Experiment with resume from volume (#1552 by @andreyvelich)
    • Remove PV from MySQL component (#1527 by @andreyvelich)
    • feat: add naming regex check on validating webhook (#1541 by @anencore94)

    Change Log

    Check the Full Change Log.

    Source code(tar.gz)
    Source code(zip)
  • v0.11.1(Jun 11, 2021)

    This is the Katib v0.11.1 release.

    Bug fixes

    • Fix Katib manifest for Kubeflow 1.3 (https://github.com/kubeflow/katib/pull/1503 by @yanniszark)
    • Fix Katib release script (https://github.com/kubeflow/katib/pull/1510 by @andreyvelich)

    Enhancements

    • Remove Application CR (https://github.com/kubeflow/katib/pull/1509 by @yanniszark)
    • Modify Katib manifest to support newer Kustomize version (https://github.com/kubeflow/katib/pull/1515 by @DavidSpek and @andreyvelich)

    Check the Full Change Log.

    Source code(tar.gz)
    Source code(zip)
  • v0.11.0(Mar 22, 2021)

    This is the Katib v0.11.0 release. The major advantages:

    • Katib now supports Kubernetes >= 1.18
    • Possibility to deploy a new Katib UI, big thanks to @kimwnasptd!
    • Juju operator support, big thanks to @DomFleischmann, @knkski and @RFMVasconcelos!

    New Features

    Core Features

    • Disable dynamic Webhook creation (https://github.com/kubeflow/katib/pull/1450 by @andreyvelich and @tenzen-y)
    • Add the waitAllProcesses flag to the Katib config (https://github.com/kubeflow/katib/pull/1394 by @robbertvdg)
    • Migrate Katib to Go modules (https://github.com/kubeflow/katib/pull/1438 by @andreyvelich)
    • Update Katib SDK with the get_success_trial_details API (https://github.com/kubeflow/katib/pull/1442 by @Adarsh2910)
    • Add release process script (https://github.com/kubeflow/katib/pull/1473 by @andreyvelich)
    • Refactor the Katib installation using Kustomize (https://github.com/kubeflow/katib/pull/1464 by @andreyvelich)

    UI Features and Enhancements

    • First step for the Katib new UI implementation (https://github.com/kubeflow/katib/pull/1427 by @kimwnasptd)
    • Add missing fields to the Katib new UI (https://github.com/kubeflow/katib/pull/1463 by @kimwnasptd)
    • Add instructions to install the new Katib UI (https://github.com/kubeflow/katib/pull/1476 by @kimwnasptd)

    Katib Juju operator

    • Add Juju operator support for Katib (https://github.com/kubeflow/katib/pull/1403 by @knkski and @RFMVasconcelos)
    • Add GitHub Actions for the Juju operator (https://github.com/kubeflow/katib/pull/1407 by @knkski)
    • Add install docs for the Juju operator (https://github.com/kubeflow/katib/pull/1411 by @RFMVasconcelos)
    • Modify ClusterRoles for the Juju operator (https://github.com/kubeflow/katib/pull/1426 by @DomFleischmann)
    • Update the Juju operator with the new Katib Webhooks (https://github.com/kubeflow/katib/pull/1465 by @knkski)

    Bug fixes

    • Fix compare step for Early Stopping (https://github.com/kubeflow/katib/pull/1386 by @andreyvelich)
    • Fix Early Stopping in the Goptuna Suggestion (https://github.com/kubeflow/katib/pull/1404 by @andreyvelich)
    • Fix SDK examples to work with the Katib 0.10 (https://github.com/kubeflow/katib/pull/1402 by @andreyvelich)
    • Fix links in the TFEvent Metrics Collector (https://github.com/kubeflow/katib/pull/1417 by @zuston)
    • Fix the gRPC build script (https://github.com/kubeflow/katib/pull/1492 by @andreyvelich)

    Documentation

    • Modify docs for the Katib 0.10 (https://github.com/kubeflow/katib/pull/1392 by @andreyvelich)
    • Add Katib presentation list (https://github.com/kubeflow/katib/pull/1446 by @andreyvelich)
    • Add Canonical to the Katib Adopters (https://github.com/kubeflow/katib/pull/1401 by @RFMVasconcelos)
    • Update developer guide with the Katib controller flags (https://github.com/kubeflow/katib/pull/1449 by @annajung)
    • Add Fuzhi to the Katib Adopters (https://github.com/kubeflow/katib/pull/1451 by @Planck0591)
    • Fix Katib broken links to the Kubeflow guides (https://github.com/kubeflow/katib/pull/1477 by @theofpa)
    • Add the Katib Webhook docs (https://github.com/kubeflow/katib/pull/1486 by @andreyvelich)

    Misc

    • Add recreate strategy for the MySQL deployment (https://github.com/kubeflow/katib/pull/1393 by @andreyvelich)
    • Modify worker image for the Katib AWS CI/CD (https://github.com/kubeflow/katib/pull/1423 by @PatrickXYS)
    • Add the SVG logo for Katib (https://github.com/kubeflow/katib/pull/1414 by @knkski)
    • Verify empty Objective in the Experiment defaults (https://github.com/kubeflow/katib/pull/1445 by @andreyvelich)
    • Move the Katib manifests upstream (https://github.com/kubeflow/katib/pull/1432 by @yanniszark)
    • Build the Trial images in the Katib CI (https://github.com/kubeflow/katib/pull/1457 by @andreyvelich)
    • Add script to update the boilerplates (https://github.com/kubeflow/katib/pull/1491 by @andreyvelich)

    Change Log

    Check the Full Change Log.

    Source code(tar.gz)
    Source code(zip)
  • v0.10.0(Nov 7, 2020)

    This is the Katib 0.10 release for Kubeflow 1.2. The new Katib v1beta1 API version has been released.

    New Features

    Core Features

    • The new Trial template design (https://github.com/kubeflow/katib/issues/1208)
    • Support custom Kubernetes CRD in the Trial template (https://github.com/kubeflow/katib/issues/1214)
      • Add example for the Tekton Pipeline (https://github.com/kubeflow/katib/pull/1339)
      • Add example for the Kubeflow MPIJob (https://github.com/kubeflow/katib/pull/1342)
    • Support early stopping with the Median Stopping Rule (https://github.com/kubeflow/katib/pull/1344)
    • Resume Experiment from the volume (https://github.com/kubeflow/katib/pull/1275)
      • Support volume settings in the Katib config (https://github.com/kubeflow/katib/pull/1291)
    • Extract the Experiment metrics in multiple ways (https://github.com/kubeflow/katib/pull/1140)
    • Update the Python SDK for the v1beta1 version (https://github.com/kubeflow/katib/pull/1252)

    UI Features and Enhancements

    • Show the Trial parameters on the submit Experiment page (https://github.com/kubeflow/katib/pull/1224)
    • Enable to set the Trial template YAML from the submit Experiment page (https://github.com/kubeflow/katib/pull/1363)
    • Optimise the Katib UI image (https://github.com/kubeflow/katib/pull/1232)
    • Enable sorting in the Trial list table (https://github.com/kubeflow/katib/pull/1251)
    • Add pages to the Trial list table (https://github.com/kubeflow/katib/pull/1262)
    • Use the V4 version for the Material UI (https://github.com/kubeflow/katib/pull/1254)
    • Automatically delete an empty ConfigMap with Trial templates (https://github.com/kubeflow/katib/pull/1260)
    • Create a ConfigMap with Trial templates (https://github.com/kubeflow/katib/pull/1265)
    • Support metrics strategies on the submit Experiment page (https://github.com/kubeflow/katib/pull/1364)
    • Add the resume policy to the submit Experiment page (https://github.com/kubeflow/katib/pull/1362)
    • Enable to create an early stopping Experiment from the submit Experiment page (https://github.com/kubeflow/katib/pull/1373)

    Bug fixes

    • Check the Trials count before deleting it (https://github.com/kubeflow/katib/pull/1223)
    • Check that Trials are deleted (https://github.com/kubeflow/katib/pull/1288)
    • Fix the out of range error in the Hyperopt suggestion (https://github.com/kubeflow/katib/pull/1315)
    • Fix the pod ownership to inject the metrics collector (https://github.com/kubeflow/katib/pull/1303)

    Misc

    • Switch the test infra to the AWS (https://github.com/kubeflow/katib/pull/1356)
    • Use the docker.io/kubeflowkatib registry to release images (https://github.com/kubeflow/katib/pull/1372)

    Change Log

    See the Full Change Log.

    Source code(tar.gz)
    Source code(zip)
  • v0.9.0(Jun 16, 2020)

  • v0.6.0-rc.0(Jun 28, 2019)

  • v0.1.2-alpha(Jun 5, 2018)

    Full Changelog

    Closed issues:

    • [request] Invite libbyandhelen as reviewer for algorithm support #82
    • cli failed to connect #80
    • CreateStudy RPC error: Objective_Value_Name is required #73
    • [cli] Use cobra to refactor the cli #54
    • Reduce time it takes to build all images #50
    • [release] Ksonnet the katib #32

    Merged pull requests:

    Source code(tar.gz)
    Source code(zip)
    katib-cli-darwin-amd64.darwin(13.65 MB)
    katib-cli-linux-amd64(13.66 MB)
  • v0.1.1-alpha(Apr 26, 2018)

    Full Changelog

    Closed issues:

    • [upstream] Update name in kubernetes/test-infra #63
    • [go] Update the package name, again #62
    • [test] Fix broken unit test cases #58
    • Provide a cli binary for macOS / darwin #57
    • Error running katib on latest master (04/13) #44
    • Upload existing models to modelDB interface #43
    • [release] Add cli to v0.1.0-alpha #31
    • [discussion] Find a new way to install CLI #26
    • [maintainance] Setup the repository #8
    • Existing approaches and design for hyperparameter-tuning #2

    Merged pull requests:

    Source code(tar.gz)
    Source code(zip)
    katib-cli-darwin-amd64(13.57 MB)
    katib-cli-linux-amd64(13.59 MB)
  • v0.1.0-alpha(Apr 10, 2018)

    Closed issues:

    • [suggestion] Move the logic about random service to random package #18
    • [build-release] Reuse the vendor during the image building process #14
    • [go] Rename the package from mlkube/katib to this repo #7
    • [go] Establish vendor dependencies for go #5
    • Rename to hyperparameter-tuning ? #1

    Merged pull requests:

    Source code(tar.gz)
    Source code(zip)
    katib-cli-darwin-amd64(11.57 MB)
    katib-cli-linux-amd64(10.22 MB)