mongo.ipynb

Document database MongoDB¶

The goal of the exercise is to learn the basics of working with the document database MongoDB. In the exercise, you will work with data that describes restaurant ratings in JSON format. One entry will look like this:

{
  "address": {
     "building": "1007",
     "coord": [ -73.856077, 40.848447 ],
     "street": "Morris Park Ave",
     "zipcode": "10462"
  },
  "borough": "Bronx",
  "cuisine": "Bakeries",
  "grades": [
     { "date": { "$date": 1393804800000 }, "grade": "A", "score": 2 },
     { "date": { "$date": 1378857600000 }, "grade": "A", "score": 6 },
     { "date": { "$date": 1358985600000 }, "grade": "A", "score": 10 },
     { "date": { "$date": 1322006400000 }, "grade": "A", "score": 9 },
     { "date": { "$date": 1299715200000 }, "grade": "B", "score": 14 }
  ],
  "name": "Morris Park Bake Shop",
  "restaurant_id": "30075445"
}

You will use the Python library pymongo to connect and query.

In [ ]:

!pip install pymongo
import pymongo  # install and import the client library

In [ ]:

# connection to the database server running in the DataLab environment
client = pymongo.MongoClient('mongodb://mongo:27017/', username='student', password='student')
db = client['datalab']      # connection to the database
db.list_collection_names()  # we display all the collections in the database

In [ ]:

db.restaurants.estimated_document_count()  # total number of documents in the `restaurants` collection

Queries¶

The basic types of queries are performed using the command db.[collection].find([query]). The call returns a cursor object, which you can use to find the number of documents in the result and scroll through individual records. The following command will display restaurants in Manhattan:

In [ ]:

result = db.restaurants.find({ "borough": "Manhattan" })
for doc in result.limit(5):  # we only go through the first 5 documents included in the result
    print(doc)

If we want to find out only the number of records, we can use db.[collection].count_documents([query]).

In [ ]:

db.restaurants.count_documents({ "borough": "Manhattan" })

You can also refer to nested objects in queries, e.g. to the routing number nested in the address object:

In [ ]:

db.restaurants.count_documents({ "address.zipcode": "10075" })

Linking via . can be used even when nested objects are in an array:

In [ ]:

db.restaurants.count_documents({ "grades.grade": "B" })  # number of restaurants with `B` rating

In addition to searching by values, you can use comparison using the smaller $lt or larger $gt commands. The following query returns restaurants with a rating greater than 30:

In [ ]:

db.restaurants.count_documents({ "grades.score": { "$gt": 30 } })

More complex queries are created by specifying multiple properties in the query object (logical conjunction a). For example the following query returns Italian restaurants with zip code 10075:

In [ ]:

result = db.restaurants.find({ "cuisine": "Italian", "address.zipcode": "10075" })
for doc in result:
    # data is represented as objects that are worked with like a map
    print(doc['name'])  # we will list the names of the restaurants

The logical condition or is written with a separate command $or, the following example returns restaurants that offer Italian food or have the zip code 10075:

In [ ]:

db.restaurants.count_documents({ "$or": [ { "cuisine": "Italian" }, { "address.zipcode": "10075" } ] })

The results can be arranged with the sort command. The following command sorts all objects first by neighborhood in ascending order and then by zip code in descending order:

In [ ]:

result = db.restaurants.find().sort([("borough", 1), ("address.zipcode", -1)])
for doc in result.limit(5):  # for the first 5 documents in the result
    print(doc['borough'], doc['address']['zipcode'])  # write the name of the neighborhood and the zip code in the address

Aggregations¶

With the help of aggregations, we can create complex queries that filter and group data according to specified criteria. With aggregations, the query is broken down into a sequence of basic operations, which are then applied sequentially to the objects in the collection. During processing, data can be filtered ($match command), transformed ($project command), ordered ($sort command) and aggregated ($group command).

The basic command format for aggregating data is db.[collection].aggregate( [ [command1], [command2], [...] ] ).

For example the following command first filters out only those objects from the collection that are in Queens and cook Brazilian cuisine, then groups them by zip code, and finally calculates how many restaurants there are for each zip code:

In [ ]:

result = db.restaurants.aggregate([
  { "$match": { "borough": "Queens", "cuisine": "Brazilian" } },
  { "$group": { "_id": "$address.zipcode" , "count": { "$sum": 1 } } }
])

for count in result:
    print(count)

In the $group command, instead of the simple number of objects in the group, you can specify additional aggregation functions, such as average $avg (in the example, only part of the query with $group command is shown):

In [ ]:

{ "$group": { "_id": "$address.zipcode" , "avg_grade": { "$avg": "$grades.score" } } }

You can specify null as _id to calculate the aggregate function for all records without grouping. See this page for more information about aggregations.

Tasks¶

Task 2.1¶

Count how many B-rated Italian restaurants there are in Manhattan.

Task 2.2¶

Sort Brazilian restaurants in Queens by zip code in descending order and name in ascending order.

Task 2.3¶

Read the documentation for the $unwind aggregation command. Calculate the total score for all restaurants.

Task 2.4¶

Read the documentation for the $unwind aggregation command. Calculate the total score for restaurants in Queens and group them by grade rating.

Task 2.5¶

Read the documentation for the $unwind aggregation command. Calculate the total score for restaurants in Queens, group them by zip code, and return the 10 zip codes with the highest scores.

Task 2.6¶

Create a geographic index for the address.coord property with command db.restaurants.create_index([('address.coord', pymongo.GEO2D)]). Read the documentation on the $geoNear command. Return the 10 closest restaurants around the geographic location [ -73.856077, 40.848447 ].

In [ ]: