PyText
- git : https://github.com/facebookresearch/pytext
- paper : PyText-A-seamless-path-from-NLP-research-to-production-using-PyTorch3.pdf
- blog: https://code.fb.com/ai-research/pytext-open-source-nlp-framework/
Notes
what
PyText is a modeling framework that helps researchers and engineers build end-to-end pipelines for training or inference.
goal
PyText, built on PyTorch 1.0, is designed to achieve the following:
- Make experimentation with new modeling ideas as easy and as fast as possible.
- Make it easy to use pre-built models on new data with minimal extra work.
- Define a clear workflow for both researchers and engineers to build, evaluate, and ship their models to production with minimal overhead.
- Ensure high performance (low latency and high throughput) on deployed models at inference.
design
- Everything in PyText is a component.
- Task: combines the various components required for a training or inference task into a pipeline (sketched below).
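A minimal sketch of this component pattern (illustrative only; the class and field names below are simplified stand-ins, not the actual PyText source):

```python
from dataclasses import dataclass, field
from typing import List


# Illustrative stand-ins for PyText components: every component carries its
# own config, and a Task's config is just the configs of its children.
@dataclass
class DataHandlerConfig:
    columns_to_read: List[str] = field(default_factory=lambda: ["doc_label", "text"])
    shuffle: bool = True


@dataclass
class TrainerConfig:
    epochs: int = 15
    early_stop_after: int = 0


@dataclass
class TaskConfig:
    data_handler: DataHandlerConfig = field(default_factory=DataHandlerConfig)
    trainer: TrainerConfig = field(default_factory=TrainerConfig)


class DocClassificationTask:
    """Combines the child components into one training/inference pipeline."""

    def __init__(self, config: TaskConfig):
        # Each child component is built from its own sub-config, which is
        # why the whole pipeline can be described by a single JSON file.
        self.config = config
```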
As a concrete example, here is a sample config for a document classification task: a JSON file that defines the parameters of all the child components.
```json
{
  "config": {
    "task": {
      "DocClassificationTask": {
        "data_handler": {
          "columns_to_read": ["doc_label", "text"],
          "shuffle": true
        },
        "model": {
          "representation": {
            "BiLSTMPooling": {
              "pooling": {
                "SelfAttention": {
                  "attn_dimension": 128,
                  "dropout": 0.4
                }
              },
              "bidirectional": true,
              "dropout": 0.4,
              "lstm": {
                "lstm_dim": 200,
                "num_layers": 2
              }
            }
          },
          "output_config": {
            "loss": {
              "CrossEntropyLoss": {}
            }
          },
          "decoder": {
            "hidden_dims": [128]
          }
        },
        "features": {
          "word_feat": {
            "embed_dim": 200,
            "pretrained_embeddings_path": "/tmp/embeds",
            "vocab_size": 250000,
            "vocab_from_train_data": true
          }
        },
        "trainer": {
          "random_seed": 0,
          "epochs": 15,
          "early_stop_after": 0,
          "log_interval": 1,
          "eval_interval": 1,
          "max_clip_norm": 5
        },
        "optimizer": {
          "type": "adam",
          "lr": 0.001,
          "weight_decay": 0.00001
        },
        "metric_reporter": {
          "output_path": "/tmp/test_out.txt"
        },
        "exporter": {}
      }
    }
  }
}
```
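The repository README demonstrates feeding such a file to the bundled trainer (e.g. `pytext train < demo/configs/docnn.json`). As a quick illustration of how the nesting addresses each child component's parameters (the file name below is hypothetical):

```python
import json

# Load the document-classification config shown above
# ("doc_classification.json" is a hypothetical file name).
with open("doc_classification.json") as f:
    config = json.load(f)["config"]

task = config["task"]["DocClassificationTask"]
# Each nested key selects one child component and its parameters.
print(task["model"]["representation"]["BiLSTMPooling"]["lstm"])
# {'lstm_dim': 200, 'num_layers': 2}
```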
features
- It provides ways to customize the handling of raw data, the reporting of metrics, the training methodology, and the exporting of trained models.
- PyText users are free to implement one or more of these components and can expect the entire pipeline to work out of the box (a customization sketch follows this list).
- A number of default pipelines are implemented for popular tasks which can be used as-is.
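For instance, swapping in only the metrics component might look like the following (a hedged sketch; the base class and `report` hook are illustrative placeholders, not the exact PyText interface):

```python
# Sketch of customizing a single pipeline stage; MetricReporterBase and its
# report() hook are illustrative placeholders, not the exact PyText API.
class MetricReporterBase:
    def report(self, predictions, targets) -> dict:
        raise NotImplementedError


class AccuracyReporter(MetricReporterBase):
    """Custom metric reporting; all other pipeline stages stay default."""

    def report(self, predictions, targets) -> dict:
        correct = sum(int(p == t) for p, t in zip(predictions, targets))
        return {"accuracy": correct / max(len(targets), 1)}


# The trainer would invoke the reporter after each evaluation pass.
reporter = AccuracyReporter()
print(reporter.report([1, 0, 1], [1, 1, 1]))  # {'accuracy': 0.666...}
```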
workflow
1. Implement the model in PyText, and make sure offline metrics on the test set look good.
2. Publish the model to the bundled PyTorch-based inference service, and do a real-time small-scale evaluation on a live traffic sample.
3. Export it automatically to a Caffe2 net (see the export sketch after this list). In some cases, e.g. when using complex control flow logic and custom data structures, this might not yet be supported via PyTorch 1.0.
4. If the procedure in step 3 isn't supported, use the PyTorch C++ API to rewrite the model (only the torch.nn.Module subclass) and wrap it in a Caffe2 operator.
5. Publish the model to the production-grade Caffe2 prediction service and start serving live traffic.
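PyText's export path to Caffe2 goes through ONNX; a minimal sketch of that step, using a toy stand-in for a trained model (the model, input shape, and file name here are hypothetical):

```python
import torch
import torch.nn as nn


class TinyDocModel(nn.Module):
    """Toy stand-in for a trained PyText document classifier."""

    def __init__(self, vocab_size=250000, embed_dim=200, num_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.fc = nn.Linear(embed_dim, num_classes)

    def forward(self, tokens):
        # Mean-pool token embeddings, then classify.
        return self.fc(self.embed(tokens).mean(dim=1))


model = TinyDocModel()
model.eval()

# Dummy batch of token IDs matching the model's expected input shape.
dummy_tokens = torch.randint(0, 250000, (1, 20))

# Trace the model and emit an ONNX graph that a Caffe2 runtime can load.
torch.onnx.export(model, (dummy_tokens,), "doc_model.onnx")
```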
Challenges
- Data pre-processing
  A shared featurization library preprocesses the raw input by performing tasks like the following (a toy sketch follows this list):
  - Text tokenization and normalization
  - Mapping characters to IDs for character-based models
  - Performing token alignments for gazetteer features
  - Training: raw data -> Data Handler (invokes the featurization lib) -> Trainer -> Exporter
  - Inference: raw data -> Data Processor (invokes the same featurization lib) -> Predictor (holds a model exported by an Exporter) -> Predictions
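A toy version of the shared featurization step, tokenization/normalization plus token-to-ID mapping (the vocabulary and UNK handling below are illustrative):

```python
from typing import List

# Toy featurization: normalize, tokenize, and map tokens to IDs.
# Real PyText featurization covers much more (character IDs for
# character-based models, gazetteer alignments, etc.).
UNK = 0
vocab = {"<unk>": UNK, "the": 1, "movie": 2, "was": 3, "great": 4}


def featurize(text: str) -> List[int]:
    tokens = text.lower().split()  # tokenization + normalization
    return [vocab.get(tok, UNK) for tok in tokens]  # token -> ID, OOV -> UNK


# The same function is invoked by the Data Handler at training time and by
# the Data Processor at inference time, keeping features consistent.
print(featurize("The movie was GREAT"))  # [1, 2, 3, 4]
```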
- Vocabulary management
Future Work
- Modeling Capabilities
- Performance Benchmarks and Improvements
  - Training speed
  - Inference speed
- Model Interpretability
- Model Robustness
- Mobile Deployment Support
Other Intro Docs
- https://www.jiqizhixin.com/articles/2018-12-15-3?from=synced&keyword=pytext
- https://blog.youkuaiyun.com/sinat_33455447/article/details/85064284