BukManiac/Optimizations
CS290F Fall 2006 - UCSB Computer Science - Thorsten von Eicken
BukManiac -- Your Search for Books ends here !
Home | Overview | Design | Critical Paths & Analysis | Optimizations | Scaling the Database | Scaling the Application | Plugging in the Cache
Contents |
Optimization strategies used
1. Index on Title, Author
We implemented the database using the standard InnoDB database provided by MySQL. The entries here show the time taken to access the database for SQL statements generated when the user navigates through the path for a given critical flow.
According to our analysis, the database access time was very large for search by title/author queries as it had resulted in a scan of the entire books table. The LIKE %blah% queries to address fulltext search turn out to be very expensive. To address the same, we implemented an Index to implement fulltext search for the title and author field of the books table.
We created a book_searches table which is the MyISAM table for books (contains only id, title and author fields). Indexes were created on the title and author field of the book_searches table.
Final Observation: We observed that with the index, performance is enhanced greatly as the table is not searched in its entirity and rather an inexpensive search on the index takes place, greatly reducing the database lookup time. The time for updating Orders and orderlines tables remains the same. To prove this,corresponding garphs have been shown below .
2. Retrieving less data
We tried to work with automatic pagination implemented in Rails. But we observed certain problems such as
- Rails, by default does a SELECT COUNT(*) FROM table to implement pagination. This is expensive on the database
- Sometimes, a nil object is returned for the next page, even if it exists.
So, we decided to implement Pagination on our own as described in the book. as given for
- Search Results
- Displaying Users Orders
- Displaying Orders in Admin View
Code in model:
def Order.find_user_orders(user , page_num)
find(:all,
:conditions => "user_id = "+ user.id.to_s,
:limit => 10,
:offset => page_num*10,
:order => "id DESC"
)
end
Calling the method:
<%=button_to ('Next', :controller => 'users', :action => 'showorders' , :pagenum => nextpage)
Displaying first 5 reviews with book description: Instead of dispalying the entire reviews of a book , we now display only the first 5 reviews. This sounds logical as the user would rarely go through all of the reviews.
3. Cleaning Database Tables
We noticed that there were certain fields in tables that were not required such as ISBN, average rating, number of ratings, etc in books. Hence, we removed these fields to reduce the size of the table.
4. Denormalizing the Schema
Earlier we were using User table to display the reviewer Name on the Reviews page. But, now we are keeping the Reviewer name in the reviews table itself. This avoids an additional access to the User table and thus saves time.
This has also been done in the Orders table, we store the order total in the order table even though it can be evaluated from the corresponding rows in the orderlines table.
Performance Measurement Setup on 1 Server
In order to evaluate the Index approach, we ran the following EC2 instances -
A) Rails application using '%like%' sql query (InnoDB Table for books)
B) Rails application using Index approach (Index on MyISAM Table,, as explained above)
C) Load Generator : This instance was used to simulate multiple concurrent users. These users are directed to navigate the website
following our pre-identified Critical Paths. For this purpose, we ran a shell script that dynamically created httperf commands
with different 'Number of sessions/sec' for a fixed number of sessions. This was repeated for sessions = 50,100,200.
In the end, this script was executed to create load on above specified server instances.
HTTPERF => This is a tool to measure web server performance.It loads a server via simulating users in teh following way : it generates a fixed number of HTTP GET requests and measures how many replies (responses) came back from the server and at what rate the responses arrived.
We used the following httperf command to create load on the server.
httperf --hog --server server_name --wsesslog=number_of_sessions,0,loadfile --session-cookie --rate session_rate > ouput_file
Explanation of Httperf parameters used :
--hog : This option requests to use up as many TCP ports as necessary.Without this option, httperf is typically limited to using ephemeral ports (in the range from 1024 to 5000). --server : This option is used to address the Host with which we are required to establish a connection . --wsesslog N1,N2,filename: This option requests the generation and measurement of sessions instead of individual requests. N1 is the total number of sessions to generate. N2 is the time(in seconds) that user think separates consecutive call bursts. We kept N2 = 0 , for faster execution of the load tests. Filename is the file which contains the route for the simulates user. --port : This option specifies the port number N on which the web server is listening for HTTP requests. By default, httperf uses port number 80. hence we did not specify this option as our instances were running on Port 80 itself. --rate : This option specifies the fixed rate at which connections or sessions are created. Since we requied to load the server with concurrent and not sequential sessions, we successively increased this rate to higher values starting from 0.5 --session-cookie : This option takes care of cookie on per session basis.
We had identified the following 3 Citical paths for our users:
1. Flow 1: 30% Home Page -> Search by Title -> View 1st book -> Add to Cart -> Login -> Checkout
2. Flow 2 : 10% Home Page -> Search by Title -> View 1st book -> View 2nd book -> Add to Cart -> Login ->Seacrh by Title -> View
3rd book -> Show Cart -> Checkout
3. Flow 3: 60% Home Page -> Search By title -> View 2nd book -> View 4th book -> Search by Author -> View 1st book -> No Checkout
See Bukmaniac Critical Flows for further details
Results
We observed that with the Fulltext search Index, the performace improved remarkably !
Response Time
The graph shows a comparison for response time for the load simulating the mix of the above mentioned critical paths for 200 total sessions.
The response time decreases significantly for the index based search because the search results are returned quickly by the database.
Without the index based search, most of the queries on the database take more than 1 second to complete (for books). This increases the response time greatly. With the full text index based search, the results are quickly returned from the index. The increase of response time with the increasing rate of users increases but not as much when compared to non-index search.
Reply Rate
The average reply rate is greatly enhanced by use of the index. Reasons being the same that index access is faster than table scanning (using LIKE %abcd% type queries)
Since replies are generated quickly, more replies can be generated at a given time.


