Web Scraping on AQI

Jyothi Panuganti
3 min read · May 20, 2020

Collecting the Data for analysis

In general, let's first understand what web scraping is and what it is used for.

**What is Web Scraping?**

Web scraping is a technique for collecting data from a website. It can be done manually, but it is usually automated with a software tool that requests pages and extracts the data from them.

Note: You should be very careful before you proceed. Some sites do not allow collecting their data with any tool, so please read the site's terms and conditions first.
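Besides reading the terms, one quick programmatic check is the site's robots.txt file, which Python's built-in urllib.robotparser can parse. A minimal sketch (the rules below are made up for illustration, not tutiempo's actual policy):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only
robots_lines = """
User-agent: *
Disallow: /private/
Allow: /climate/
""".splitlines()

rp = RobotFileParser()
rp.parse(robots_lines)

# Climate pages are allowed for any user agent under these rules
print(rp.can_fetch("*", "https://en.tutiempo.net/climate/3-2010/ws-431280.html"))  # True
# Anything under /private/ is disallowed
print(rp.can_fetch("*", "https://en.tutiempo.net/private/data.html"))  # False
```

In a real script you would call rp.set_url(...) with the site's robots.txt URL and rp.read() instead of parsing a hard-coded string.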

The architecture of web scraping

From the picture above, we can see that scraping works in four steps:

Step 1: A request is sent to the website using one of several techniques.

Step 2: In response, we get the page content from the website.

Step 3: The scraper code collects the required data from that response.

Step 4: The collected data is saved into data storage.
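As a rough sketch of those four steps in Python, here the live request is replaced with a hard-coded HTML string (so steps 1 and 2 are only indicated), and a tiny parser built on the standard library's html.parser stands in for the scraper:

```python
from html.parser import HTMLParser

# Steps 1-2: in the real script, requests.get(url) sends the request and the
# response body is the page HTML. Here we fake a tiny response instead:
html_response = "<table><tr><td>28.4</td><td>65</td></tr></table>"

# Step 3: a minimal scraper that collects the table-cell values
class CellCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.cells = []
    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_cell = True
    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False
    def handle_data(self, data):
        if self.in_cell:
            self.cells.append(data)

parser = CellCollector()
parser.feed(html_response)
print(parser.cells)  # ['28.4', '65']

# Step 4: save the collected data to storage
with open("cells.txt", "w") as f:
    f.write(",".join(parser.cells))
```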

Likewise, I have used some code to scrape the data from the Air Quality Index website: http://en.tutiempo.net/climate.

This is the actual URL we scrape for our particular station: https://en.tutiempo.net/climate/3-2010/ws-431280.html

The URL can be divided into two parts:

  1. the base, https://en.tutiempo.net/climate/, and
  2. the rest, 3-2010/ws-431280.html, which encodes a particular month and year. Here 3 is the month (i.e., March) and 2010 is the year; these are the only values we need to vary.
  3. In the sample code, instead of hard-coding the month and year, I have written {} and {} and used .format() with two parameters, the month and the year, to build each URL.
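For example, plugging month 3 and year 2010 into the template reproduces exactly the URL shown above:

```python
url_template = 'https://en.tutiempo.net/climate/{}-{}/ws-431280.html'
url = url_template.format(3, 2010)
print(url)  # https://en.tutiempo.net/climate/3-2010/ws-431280.html
```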

If you go through the code, it will clear up any confusion. Here is the code:

    import requests

    def retrieve_html():                    # define a function named retrieve_html
        for year in range(2010, 2019):      # loop over the years 2010 to 2018
            for month in range(1, 13):      # loop over the months 1 to 12
                # build the URL for this month and year (explained in the picture
                # above); the site accepts unpadded months (e.g., 3-2010), so no
                # separate branch for months below 10 is needed
                url = 'https://en.tutiempo.net/climate/{}-{}/ws-431280.html'.format(month, year)
                texts = requests.get(url)               # request the URL
                text_utf = texts.text.encode('utf-8')   # encode the page text for saving

After retrieving the data from the site, you need to save it into data storage. For that we also need three more lines of code. Let's have fun with it:

    # continuing inside the inner loop (this also needs "import os" at the top)
                if not os.path.exists("Data_Hyd/Html_Data/{}".format(year)):  # check whether the year directory exists
                    os.makedirs("Data_Hyd/Html_Data/{}".format(year))         # create the Data_Hyd folder with Html_Data/<year> inside it
                with open("Data_Hyd/Html_Data/{}/{}.html".format(year, month), "wb") as output:
                    output.write(text_utf)  # save the page as <month>.html inside the year directory
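To see what those storage lines do without hitting the website, here is the same pattern run against a temporary directory with placeholder HTML (the paths mirror the post's Data_Hyd/Html_Data layout; the page content is faked):

```python
import os
import tempfile

base = tempfile.mkdtemp()                       # stand-in for the project folder
year, month = 2010, 3
fake_html = b"<html>placeholder page</html>"    # stand-in for text_utf

year_dir = os.path.join(base, "Data_Hyd/Html_Data/{}".format(year))
if not os.path.exists(year_dir):
    os.makedirs(year_dir)                       # creates Data_Hyd/Html_Data/2010

with open(os.path.join(year_dir, "{}.html".format(month)), "wb") as output:
    output.write(fake_html)                     # saves the page as 3.html

print(os.path.exists(os.path.join(year_dir, "3.html")))  # True
```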

Here is the directory into which the data was collected; have a look:

Here, under the Html_Data folder, we can find a directory per year, each containing 12 months of files. Likewise, I have collected data from 2013 to 2018 for different countries and cities for my analysis.

I am very curious about my work on AQI (Air Quality Index) predictions.

I have started with web scraping as Part 1 of the project: scraping the website with a simple script of around 20 lines.

I am done with data collection, and I will be back with preprocessing and feature engineering.

I hope you enjoyed my narration; if so, clap. 👏

Have great learning in data science!


Jyothi Panuganti

Data Science Enthusiast, Blogger, content writer, and Freelancer.