Serverless Stanford Named Entity Recognizer
This project enables you to deploy the Stanford Named Entity Recognizer (NER) to a "serverless" environment based on AWS Lambda and API Gateway.
Why?
The general advantages of serverless computing include lower cost, scalability and productivity. For this project, these translate to:
- The ability to analyse text in virtually any environment - most notably from the browser
- Processing a large number of texts concurrently - potentially thousands
- Ease and speed of iteration - just deploy with one command after making changes to your models or label interpretation logic
How?
Getting started
- Make sure you have one of the following installed on your machine:
  - Docker
  Or
  - Node.js, a JDK and Maven
- Sign up for an AWS account
- Configure your AWS credentials for deployment with the Serverless framework. If you are working with docker, make sure they are set as the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY.
- Install dependencies:
  - With docker:
    docker build -t sner .
  Or
  - With Node/JDK/Maven, install the Serverless dependencies by running this command in the project root directory:
    npm install
Deploying to AWS
With docker:
docker run --rm -it -e AWS_ACCESS_KEY_ID=$AWS_ACCESS_KEY_ID -e AWS_SECRET_ACCESS_KEY=$AWS_SECRET_ACCESS_KEY sner npm run deploy -- --stage=dev
Or
With Node/JDK/Maven:
npm run deploy -- --stage=dev
After a successful deployment, you should see your POST and GET endpoints displayed, e.g.
...
endpoints:
POST - https://xxxxxx.execute-api.xx-xxxx-x.amazonaws.com/dev/entities
GET - https://xxxxxx.execute-api.xx-xxxx-x.amazonaws.com/dev/entities
...
Trying it out
You can try the GET endpoint by appending the query parameter "text" to it, set to the text you wish to analyse. A browser will encode the spaces for you; other clients need to URL-encode the text themselves. For example:
https://xxxxxx.execute-api.xx-xxxx-x.amazonaws.com/dev/entities?text=Stanford University is located in Silicon Valley and was founded in November 1885
Response:
{
  "ORGANIZATION": [
    {
      "name": "Stanford University",
      "count": 1
    }
  ],
  "LOCATION": [
    {
      "name": "Silicon Valley",
      "count": 1
    }
  ],
  "DATE": [
    {
      "name": "November 1885",
      "count": 1
    }
  ]
}
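When calling the GET endpoint from code rather than a browser, the text has to be percent-encoded first. The following is a minimal sketch using the JDK's built-in HTTP client (Java 11+). The class name `EntitiesGetExample` and the helper `buildUri` are hypothetical names chosen for illustration, and `ENDPOINT` is the placeholder URL from the deployment output, which must be replaced with your own before `main` will succeed.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

class EntitiesGetExample {
    // Placeholder taken from the deployment output; replace with your own endpoint.
    static final String ENDPOINT =
            "https://xxxxxx.execute-api.xx-xxxx-x.amazonaws.com/dev/entities";

    /** Builds the GET URL with the text percent-encoded for use in a query string. */
    static URI buildUri(String endpoint, String text) {
        // URLEncoder produces form encoding, where a space becomes '+'.
        // A literal '+' in the input is already encoded as %2B, so this
        // replacement only turns encoded spaces into %20.
        String encoded = URLEncoder.encode(text, StandardCharsets.UTF_8).replace("+", "%20");
        return URI.create(endpoint + "?text=" + encoded);
    }

    public static void main(String[] args) throws Exception {
        String text = "Stanford University is located in Silicon Valley "
                + "and was founded in November 1885";
        HttpRequest request = HttpRequest.newBuilder(buildUri(ENDPOINT, text)).GET().build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode());
        System.out.println(response.body()); // JSON grouped by entity type
    }
}
```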
Example payload for the POST endpoint:
{
  "text": "Stanford University is located in Silicon Valley and was founded in November 1885"
}
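A POST request with this payload could be sent as sketched below, again with the JDK's HTTP client (Java 11+). The class `EntitiesPostExample` and the helper `toPayload` are hypothetical, `ENDPOINT` is the placeholder from the deployment output, and the `Content-Type` header is an assumption: the project does not state whether the endpoint requires it.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

class EntitiesPostExample {
    // Placeholder taken from the deployment output; replace with your own endpoint.
    static final String ENDPOINT =
            "https://xxxxxx.execute-api.xx-xxxx-x.amazonaws.com/dev/entities";

    /** Wraps the text in a {"text": "..."} payload, escaping what JSON requires. */
    static String toPayload(String text) {
        StringBuilder sb = new StringBuilder("{\"text\": \"");
        for (char c : text.toCharArray()) {
            switch (c) {
                case '"':  sb.append("\\\""); break;
                case '\\': sb.append("\\\\"); break;
                case '\n': sb.append("\\n");  break;
                case '\r': sb.append("\\r");  break;
                case '\t': sb.append("\\t");  break;
                default:
                    if (c < 0x20) {
                        // Remaining control characters must be written as \\uXXXX escapes.
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
            }
        }
        return sb.append("\"}").toString();
    }

    public static void main(String[] args) throws Exception {
        String text = "Stanford University is located in Silicon Valley "
                + "and was founded in November 1885";
        HttpRequest request = HttpRequest.newBuilder(URI.create(ENDPOINT))
                .header("Content-Type", "application/json") // assumed, not confirmed by the project
                .POST(HttpRequest.BodyPublishers.ofString(toPayload(text)))
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode());
        System.out.println(response.body());
    }
}
```

POST is the better fit for long texts, since the text travels in the request body rather than in the URL.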
What?
Label interpretation logic
The "business logic" lives in the EntityExtractor class and processes text in the following way:
- Finds labels associated with each word in a string using the CoreNLP library
- Filters the labels to leave only those corresponding to named entities
- Extracts the names, types and number of times each entity occurs in the text from the remaining labels
- Groups the entity names and counts by their types
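The last three steps can be illustrated with a small, dependency-free sketch. This is not the project's EntityExtractor: the class `EntityGroupingSketch`, the `LabelledWord` type and the `group` method are hypothetical, and the labelling step is assumed to have already produced one label per word. It also assumes that "O" marks words outside any entity (the background label Stanford NER uses by default) and that adjacent words with the same label belong to one entity, which means two neighbouring entities of the same type would be merged.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class EntityGroupingSketch {
    /** A word together with the label a classifier assigned to it. */
    static final class LabelledWord {
        final String word;
        final String label;
        LabelledWord(String word, String label) {
            this.word = word;
            this.label = label;
        }
    }

    /** Returns entity type -> (entity name -> count), in first-seen order. */
    static Map<String, Map<String, Integer>> group(List<LabelledWord> words) {
        Map<String, Map<String, Integer>> grouped = new LinkedHashMap<>();
        StringBuilder name = new StringBuilder();
        String currentLabel = "O";
        for (LabelledWord w : words) {
            if (!w.label.equals(currentLabel)) {
                // The label changed, so the entity being built (if any) is complete.
                flush(grouped, currentLabel, name);
                currentLabel = w.label;
            }
            if (!"O".equals(currentLabel)) { // skip words that are not part of an entity
                if (name.length() > 0) name.append(' ');
                name.append(w.word);
            }
        }
        flush(grouped, currentLabel, name); // the text may end inside an entity
        return grouped;
    }

    /** Records the finished entity under its type and increments its count. */
    private static void flush(Map<String, Map<String, Integer>> grouped,
                              String label, StringBuilder name) {
        if (!"O".equals(label) && name.length() > 0) {
            grouped.computeIfAbsent(label, k -> new LinkedHashMap<>())
                   .merge(name.toString(), 1, Integer::sum);
        }
        name.setLength(0);
    }

    public static void main(String[] args) {
        List<LabelledWord> words = new ArrayList<>();
        words.add(new LabelledWord("Stanford", "ORGANIZATION"));
        words.add(new LabelledWord("University", "ORGANIZATION"));
        words.add(new LabelledWord("is", "O"));
        words.add(new LabelledWord("located", "O"));
        words.add(new LabelledWord("in", "O"));
        words.add(new LabelledWord("Silicon", "LOCATION"));
        words.add(new LabelledWord("Valley", "LOCATION"));
        // prints {ORGANIZATION={Stanford University=1}, LOCATION={Silicon Valley=1}}
        System.out.println(group(words));
    }
}
```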
Configuration
The pom.xml and serverless.yml files contain most of the important settings in this project.
<project>
  <!--...-->
  <properties>
    <!--...-->
    <ner.model1>english.all.3class.distsim</ner.model1>
    <ner.model2>english.conll.4class.distsim</ner.model2>
    <ner.model3>english.muc.7class.distsim</ner.model3>
    <!--...-->
  </properties>
  <!--...-->
  <build>
    <plugins>
      <!--...-->
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-shade-plugin</artifactId>
        <!--...-->
        <configuration>
          <!--...-->
          <filters>
            <filter>
              <!-- This minimises the output jar file size to remain within the [Lambda limits](https://docs.aws.amazon.com/lambda/latest/dg/limits.html) by only including your selected models -->
              <includes>
                <include>${ner.prefix}${ner.model1}.*</include>
                <include>${ner.prefix}${ner.model2}.*</include>
                <include>${ner.prefix}${ner.model3}.*</include>
              </includes>
            </filter>
          </filters>
        </configuration>
        <!--...-->
      </plugin>
      <!--...-->
    </plugins>
  </build>
  <!--...-->
</project>
<properties>
  <nlp.version>3.9.1</nlp.version>
  <!--...-->
</properties>
- Change the AWS Lambda name, memory and region in the serverless.yml file
- Configure your endpoints in the serverless.yml file