PyText
- git : https://github.com/facebookresearch/pytext
- paper : PyText-A-seamless-path-from-NLP-research-to-production-using-PyTorch3.pdf
- blog: https://code.fb.com/ai-research/pytext-open-source-nlp-framework/
Notes
what
PyText is a modeling framework that helps researchers and engineers build end-to-end pipelines for training or inference.
goal
PyText, built on PyTorch 1.0, is designed to achieve the following:
- Make experimentation with new modeling ideas as easy and as fast as possible.
- Make it easy to use pre-built models on new data with minimal extra work.
- Define a clear workflow for both researchers and engineers to build, evaluate, and ship their models to production with minimal overhead.
- Ensure high performance (low latency and high throughput) on deployed models at inference.
design
- Everything in PyText is a component.
- Task: combines the various components required for a training or inference task into a pipeline (sketched below).
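A minimal sketch of this component pattern (illustrative only; the class and field names below are simplified stand-ins, not the actual PyText source):

```python
from dataclasses import dataclass, field
from typing import List


# Illustrative stand-ins for PyText components: every component carries its
# own config, and a Task's config is just the configs of its children.
@dataclass
class DataHandlerConfig:
    columns_to_read: List[str] = field(default_factory=lambda: ["doc_label", "text"])
    shuffle: bool = True


@dataclass
class TrainerConfig:
    epochs: int = 15
    early_stop_after: int = 0


@dataclass
class TaskConfig:
    data_handler: DataHandlerConfig = field(default_factory=DataHandlerConfig)
    trainer: TrainerConfig = field(default_factory=TrainerConfig)


class DocClassificationTask:
    """Combines the child components into one training/inference pipeline."""

    def __init__(self, config: TaskConfig):
        # Each child component is built from its own sub-config, which is
        # why the whole pipeline can be described by a single JSON file.
        self.config = config
```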
As a concrete example, here is a sample config for a document classification task: a JSON file that defines the parameters of all the child components.
```json
{
  "config": {
    "task": {
      "DocClassificationTask": {
        "data_handler": {
          "columns_to_read": ["doc_label", "text"],
          "shuffle": true
        },
        "model": {
          "representation": {
            "BiLSTMPooling": {
              "pooling": {
                "SelfAttention": {
                  "attn_dimension": 128,
                  "dropout": 0.4
                }
              },
              "bidirectional": true,
              "dropout": 0.4,
              "lstm": {
                "lstm_dim": 200,
                "num_layers": 2
              }
            }
          },
          "output_config": {
            "loss": {
              "CrossEntropyLoss": {}
            }
          },
          "decoder": {
            "hidden_dims": [128]
          }
        },
        "features": {
          "word_feat": {
            "embed_dim": 200,
            "pretrained_embeddings_path": "/tmp/embeds",
            "vocab_size": 250000,
            "vocab_from_train_data": true
          }
        },
        "trainer": {
          "random_seed": 0,
          "epochs": 15,
          "early_stop_after": 0,
          "log_interval": 1,
          "eval_interval": 1,
          "max_clip_norm": 5
        },
        "optimizer": {
          "type": "adam",
          "lr": 0.001,
          "weight_decay": 0.00001
        },
        "metric_reporter": {
          "output_path": "/tmp/test_out.txt"
        },
        "exporter": {}
      }
    }
  }
}
```
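The repository README demonstrates feeding such a file to the bundled trainer (e.g. `pytext train < demo/configs/docnn.json`). As a quick illustration of how the nesting addresses each child component's parameters (the file name below is hypothetical):

```python
import json

# Load the document-classification config shown above
# ("doc_classification.json" is a hypothetical file name).
with open("doc_classification.json") as f:
    config = json.load(f)["config"]

task = config["task"]["DocClassificationTask"]
# Each nested key selects one child component and its parameters.
print(task["model"]["representation"]["BiLSTMPooling"]["lstm"])
# {'lstm_dim': 200, 'num_layers': 2}
```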
features
- It provides ways to customize the handling of raw data, the reporting of metrics, the training methodology, and the exporting of trained models.
- PyText users are free to implement one or more of these components and can expect the entire pipeline to work out of the box (a customization sketch follows this list).
- A number of default pipelines are implemented for popular tasks which can be used as-is.
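For instance, swapping in only the metrics component might look like the following (a hedged sketch; the base class and `report` hook are illustrative placeholders, not the exact PyText interface):

```python
# Sketch of customizing a single pipeline stage; MetricReporterBase and its
# report() hook are illustrative placeholders, not the exact PyText API.
class MetricReporterBase:
    def report(self, predictions, targets) -> dict:
        raise NotImplementedError


class AccuracyReporter(MetricReporterBase):
    """Custom metric reporting; all other pipeline stages stay default."""

    def report(self, predictions, targets) -> dict:
        correct = sum(int(p == t) for p, t in zip(predictions, targets))
        return {"accuracy": correct / max(len(targets), 1)}


# The trainer would invoke the reporter after each evaluation pass.
reporter = AccuracyReporter()
print(reporter.report([1, 0, 1], [1, 1, 1]))  # {'accuracy': 0.666...}
```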
workflow
1. Implement the model in PyText, and make sure offline metrics on the test set look good.
2. Publish the model to the bundled PyTorch-based inference service, and do a real-time small-scale evaluation on a live traffic sample.
3. Export it automatically to a Caffe2 net (see the export sketch after this list). In some cases, e.g. when using complex control flow logic and custom data structures, this might not yet be supported via PyTorch 1.0.
4. If the procedure in step 3 isn't supported, use the PyTorch C++ API to rewrite the model (only the torch.nn.Module subclass) and wrap it in a Caffe2 operator.
5. Publish the model to the production-grade Caffe2 prediction service and start serving live traffic.
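PyText's export path to Caffe2 goes through ONNX; a minimal sketch of that step, using a toy stand-in for a trained model (the model, input shape, and file name here are hypothetical):

```python
import torch
import torch.nn as nn


class TinyDocModel(nn.Module):
    """Toy stand-in for a trained PyText document classifier."""

    def __init__(self, vocab_size=250000, embed_dim=200, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.fc = nn.Linear(embed_dim, num_classes)

    def forward(self, tokens):
        # Mean-pool token embeddings, then classify.
        return self.fc(self.embed(tokens).mean(dim=1))


model = TinyDocModel()
model.eval()

# Dummy batch of token IDs matching the model's expected input shape.
dummy_tokens = torch.randint(0, 250000, (1, 20))

# Trace the model and emit an ONNX graph that a Caffe2 runtime can load.
torch.onnx.export(model, (dummy_tokens,), "doc_model.onnx")
```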
Challenges
- Data pre-processing
  A shared featurization library preprocesses the raw input by performing tasks like the following (a toy sketch follows this list):
  - Text tokenization and normalization
  - Mapping characters to IDs for character-based models
  - Performing token alignments for gazetteer features
  - Training: raw data -> Data Handler (invokes the featurization lib) -> Trainer -> Exporter
  - Inference: raw data -> Data Processor (invokes the same featurization lib) -> Predictor (holds a model exported by an Exporter) -> Predictions
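A toy version of the shared featurization step, tokenization/normalization plus token-to-ID mapping (the vocabulary and UNK handling below are illustrative):

```python
from typing import List

# Toy featurization: normalize, tokenize, and map tokens to IDs.
# Real PyText featurization covers much more (character IDs for
# character-based models, gazetteer alignments, etc.).
UNK = 0
vocab = {"<unk>": UNK, "the": 1, "movie": 2, "was": 3, "great": 4}


def featurize(text: str) -> List[int]:
    tokens = text.lower().split()  # tokenization + normalization
    return [vocab.get(tok, UNK) for tok in tokens]  # token -> ID, OOV -> UNK


# The same function is invoked by the Data Handler at training time and by
# the Data Processor at inference time, keeping features consistent.
print(featurize("The movie was GREAT"))  # [1, 2, 3, 4]
```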
- Vocabulary management
Future Work
- Modeling Capabilities
- Performance Benchmarks and Improvements
  - Training speed
  - Inference speed
- Model Interpretability
- Model Robustness
- Mobile Deployment Support
Other Intro Docs
- https://www.jiqizhixin.com/articles/2018-12-15-3?from=synced&keyword=pytext
- https://blog.youkuaiyun.com/sinat_33455447/article/details/85064284