bunsly / homeharvest
Python package for real estate scraping of MLS listing data
Home Page: https://tryhomeharvest.com/
License: MIT License
Make it possible to scrape realtor.ca as well.
I would like to thank you first for offering this tool for individual home buyers to search for properties.
This is really helpful and I really appreciate the work.
One question about the return array: is there a way to include schools in it?
Thank you.
Hi,
I was wondering if realtor.com provides home value estimates as one of their fields. If so, could you please add it? Redfin and Zillow provide home value estimates.
Thank you,
Amir
Excellent work, guys. I saw this repo from your Reddit post. It seems you used to support Zillow and Redfin listings in the past. Any reason why you dropped them?
Hello again,
I just realized that the sold df only returns a few properties. Typically users are interested in properties sold over the last 3 months, 6 months, and one year for analysis. Could you please add this option? In the sold database, SOLD DATE is missing, and perhaps it could be used to retrieve properties sold over the past months.
Also, I was wondering if it would be possible to supply more than one location, for example "San Diego", "Ramona", in the location field. City names do not cover the suburb areas, so being able to state more than one location would be quite useful.
Thank you very much in advance for adding these to the program.
Regards,
Amir Ali
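Until multi-location support lands, one way to get the same effect is to call the scraper once per location and concatenate the results. This is a minimal sketch of that pattern; the `scrape_property` below is a stub standing in for `homeharvest.scrape_property` (with made-up columns) so the example runs without network access.

```python
import pandas as pd

def scrape_property(location, listing_type="for_sale"):
    # Stub standing in for homeharvest.scrape_property; columns are illustrative
    return pd.DataFrame({"street": [f"123 Main St, {location}"],
                         "list_price": [500_000]})

locations = ["San Diego, CA", "Ramona, CA"]
frames = [scrape_property(loc, listing_type="for_sale") for loc in locations]

# Concatenate and drop duplicates in case the search areas overlap
combined = pd.concat(frames, ignore_index=True).drop_duplicates(subset="street")
print(len(combined))  # one row per unique address across both locations
```

Deduplicating on the address (or, with real data, on a stable key such as `mls_id`) keeps overlapping city/suburb searches from double-counting listings.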
At first glance, the example code does not return Zillow descriptions.
I am working on learning python and found this project. It is really awesome!
I was able to modify the output name of the Excel file based on the parameters I was searching for.
I am trying to find a way to add realtor names to the list of sold properties. I looked through the HTML on the site and found it under a <strong data-testid> element with classes like Seller-sc-okw218-0 jDsXJc. Looking through your code, it looks like it goes directly to the API, so the names on the site might not match what shows up in the JSON payload.
Is there a way to run a query against the API to see what field names it has? Or can realtor fields be added? I was trying to do it myself but lack an understanding of how this all fits together at this point.
Sorry if this is not the place to post this, but I am not sure what to do without the field names.
Traceback (most recent call last):
properties: pd.DataFrame = scrape_property(
^^^^^^^^^^^^^^^^
final_df = _scrape_single_site(location, site_name[0], listing_type, proxy)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
results = site.search()
^^^^^^^^^^^^^
resp.raise_for_status()
raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 403 Client Error: Forbidden for url: https://www.zillow.com/homes/recently_sold/92037_rb/
Hi, in the downloaded Excel file, the 'style' field has an extra 'PropertyType.' prefix. Please remove it. Thank you.
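Until that is fixed upstream, the stringified enum prefix can be stripped with one pandas call. A runnable sketch with a toy frame; the member names (`SINGLE_FAMILY`, `CONDOS`) are assumptions for illustration:

```python
import pandas as pd

# Toy frame mimicking the exported 'style' column described above
df = pd.DataFrame({"style": ["PropertyType.SINGLE_FAMILY", "PropertyType.CONDOS"]})

# Drop the stringified enum class prefix, keeping just the member name;
# regex=False treats "PropertyType." as a literal string, not a pattern
df["style"] = df["style"].str.replace("PropertyType.", "", regex=False)
print(df["style"].tolist())  # ['SINGLE_FAMILY', 'CONDOS']
```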
This is a great project. I want to ask if you are able to scrape the "Quantarium" value on realtor.com.
Regards,
Hi this is really impressive! Thank you so much for sharing. I am looking for more property history that we can see from the Realtor.com website or the home value estimate. Do you think if it's something that is valuable to add to this project? If so, can you direct me to their API doc so that I can add it myself if you don't have time to do this? Really appreciate it!
I've tested with multiple cities, and the total count in the csv matches the count on realtor.com (with Hide Pending/Contingent selected).
However, when I tried to query an entire state, the count in the csv is significantly lower than on realtor.com.
Example:
For state CA: From HomeHarvest csv < 10,000 listings
On Realtor.com: 78,097 listings
I've tested with other states too, with the same observation.
Searching by address does not give the expected details for full_baths and half_baths. The POST request returns null for bath info, leading to None in the dataframe.
from homeharvest import scrape_property

properties = scrape_property(
    location="denver co",
    listing_type="for_sale"
)

for idx in range(properties.shape[0]):
    detail = scrape_property(location=properties.iloc[idx]['street'], listing_type='for_sale')
    print(detail.iloc[0]["full_baths"], detail.iloc[0]["half_baths"])
None None
1 None
3 None
...
None None
None None
None None
...
Hello. When creating the CSV output, it only shows the value of the property, but not the monthly rent the property owner is asking for. For the rental use case it would be important to know that to evaluate properties.
Hi. I've been trying to run your program but for some reason it's limiting the results to about 2000 and I know that there has to be more than 2000 listings available. Could you look into this please?
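If the ~2000-result cap is per request (an assumption), one workaround is to split the search into smaller date windows via the library's date_from/date_to parameters and merge the chunks. The helper below only generates the window pairs and is runnable on its own:

```python
from datetime import date, timedelta

def date_windows(start: date, end: date, days: int = 30):
    """Yield (date_from, date_to) ISO-string pairs covering [start, end]."""
    cur = start
    while cur <= end:
        nxt = min(cur + timedelta(days=days - 1), end)
        yield cur.isoformat(), nxt.isoformat()
        cur = nxt + timedelta(days=1)

windows = list(date_windows(date(2024, 1, 1), date(2024, 3, 31)))
print(windows[0])  # ('2024-01-01', '2024-01-30')
```

Each pair would then feed one `scrape_property(..., date_from=a, date_to=b)` call, with the resulting DataFrames concatenated and deduplicated afterwards.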
Hi there. I'm a licensed Realtor and tech hobbyist. I've been playing with the HomeHarvest script in python and it's incredible. Thank you so much for putting this together!
I'm reaching out with a request and a few questions.
Request: Is there any possibility of adding home condition as an additional attribute that can be returned? Normally, the MLS will have something like "Good", "Excellent", etc.
Question 1: Is there an upper limit to radius or past_days?
Question 2: Are there limits on the property types returned? For example, does the search exclude land? Multifamily properties? (I'm trying to get the HomeHarvest results to match my search results on Realtor.com and am getting close by playing with property type filters but not yet finding a precise match.)
Thank you again for this library - it's absolutely incredible.
In the latest homeharvest version (0.3.8), the property's street name in the scraped data is missing the direction:
19754 Sonia Ln
instead of
19754 SW Sonia Ln
Hello,
The Realtor.Com "sold" data has many properties with missing values for the "price" field. This hinders market analysis.
Thank you,
Amir
Thank you for providing this tool for individual home buyers to search for properties; it's really helpful. I have a question about the schema. Are we utilizing all the attributes of the schema to fetch data? If not, could you please guide me on how to access the full schema of the Realtor data source? I need to fetch additional data about properties such as property descriptions, listing brokerage info, listing agent info, property tax rate, property tax amount estimate, number of units, etc. Your assistance would be greatly appreciated.
In the GitHub keywords, previous issues/pull requests, and the git history, I see mentions of both zillow and redfin scraping. However, at first glance it looks like the scrapers were removed in version 3.0 (29664e4).
Are these sites no longer supported or is there just a different way to scrape from them that I'm missing?
Python 3.10.11
Versions tested: 0.2.13
What I tried to do:
import pandas as pd
from homeharvest import scrape_property

properties: pd.DataFrame = scrape_property(
    site_name=["zillow", "realtor.com", "redfin"],
    location="85281",
    listing_type="for_rent"  # for_sale / sold
)
Output:
requests.exceptions.HTTPError: 403 Client Error: Forbidden for url: https://www.zillow.com/homes/for_rent/85281_rb/
Testing this URL in browser works OK.
Trying to add "nearby_schools" to the search query causes an error for the area search. Same error when trying to include "tax_history". Is there a list of valid fields that can be included in the query?
I was wondering how I can get the photos of the listings? When I try to use the img_src column (as output in the Excel sheet), it says it doesn't exist. Are the photos not scraped? Any way to add this?
It is odd to me that my Excel sheet does show an img_src (and parsed images, e.g. from https://photos.zillowstatic.com), but beyond that I am completely unable to find anything about images (nor img_src anywhere in the code).
Thanks!
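If the export does carry a photo column of comma-separated URLs (as a later report in this thread about "alt photos" suggests), it can be split into Python lists with pandas. Both the column name `alt_photos` and the `", "` separator are assumptions here:

```python
import pandas as pd

# Toy frame; 'alt_photos' as a comma-separated URL string is an assumption
df = pd.DataFrame({"alt_photos": [
    "https://photos.zillowstatic.com/a.jpg, https://photos.zillowstatic.com/b.jpg",
    None,  # listings without photos stay NaN after the split
]})

df["photo_list"] = df["alt_photos"].str.split(", ")
print(df["photo_list"].iloc[0])
```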
Implement TLS client.
It appears that Realtor.com splits its search results into a small section of "Relevant listings" and a longer section with the rest of the properties. The scrape_property function only returns the small first section of listings, not the longer second section.
Curious if there's any features that feel missing or would make the tool better
Forgive me if I am dumb; I just started using this repo today. But it appears that the list_price on many sold properties is incorrect, and is given as the same as the sold_price. When I visit the Realtor web link for the listing and scroll down to the 'Property History' section, I can see that the list price often does not match what is being returned by the scrape_property function.
When I tried the query in the README for San Diego, I did not find the same issue, but it appears to be widespread in Portland, OR (though I have no idea why it would be different there).
Replication:
from homeharvest import scrape_property

properties = scrape_property(
    location="Portland, OR",
    listing_type="sold",  # or (for_sale, for_rent, pending)
    # past_days=30,  # sold in last 30 days - listed in last 30 days if (for_sale, for_rent)
    date_from="2024-04-01",  # alternative to past_days
    date_to="2024-04-18",
    # foreclosure=True
    # mls_only=True,  # only fetch MLS listings
)
properties = properties.sort_values(by='mls_id')
for i in range(1, 5):
    print('Listed:', properties['list_price'].iloc[i],
          'Sold:', properties['sold_price'].iloc[i],
          'Link:', properties['property_url'].iloc[i])
Listed: 520000 Sold: 520000 Link: https://www.realtor.com/realestateandhomes-detail/1393689161
Listed: 575000 Sold: 575000 Link: https://www.realtor.com/realestateandhomes-detail/9466515869
Listed: 345000 Sold: 345000 Link: https://www.realtor.com/realestateandhomes-detail/1036098292
Listed: 275000 Sold: 275000 Link: https://www.realtor.com/realestateandhomes-detail/1448309661
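To gauge how widespread this is, one can count the rows where the two prices coincide. A runnable toy sketch with made-up numbers; in real use the scraped DataFrame would be passed in instead:

```python
import pandas as pd

# Toy stand-in for the scraped 'sold' DataFrame
df = pd.DataFrame({
    "list_price": [520_000, 575_000, 345_000, 300_000],
    "sold_price": [520_000, 575_000, 345_000, 290_000],
})

# Rows where list price exactly equals sold price are worth spot-checking
suspect = df[df["list_price"] == df["sold_price"]]
print(f"{len(suspect)}/{len(df)} rows have identical list and sold prices")
```

Some of these rows will be genuine at-asking sales, so the count is an upper bound on the bug's reach, not an exact measure.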
Since Zillow and Redfin support has been dropped, there is little sense in keeping tags such as zillow, zillow-scraper, redfin, and redfin-scraper attached to the repository. They should be removed to avoid causing confusion.
When making a request where location is a specific address, I do not believe any alt photos are being returned. If I search by zip code, for example, I do get alt photos, even for properties that did not return alt photos when I searched by that specific address. Let me know if there's any other information I can provide, but I suspect this would be easily reproducible unless I'm making some user error I'm just not seeing.
Awesome library as a whole by the way, great work.
Hello,
Thank you very much for this wonderful code. I tested the code with Redfin, and it runs perfectly. May I request the following additions, if possible? I only use Redfin data.
403 Client Error: Forbidden for url: https://www.realtor.com/api/v1/rdc_search_srp?client_id=rdc-search-new-communities&schema=vesta
I was trying to look up a city using your format, CITY, ST (state), and this error is thrown.
To start with, really amazing tool and thanks for sharing.
I found the URLs for "for_rent" properties are not working. It seems the "for_rent" listings use a different id rather than MLS listing ids.
Sometimes I get a KeyError: 'centroid' coming from this line when using scrape_property with the radius parameter. It seems the key is missing from location_info, which is determined by a call to this API endpoint: "https://parser-external.geo.moveaws.com/suggest" in handle_location. I'm guessing centroid is missing because the latter endpoint can't match the address to coordinates? If so, should we raise an exception instead?
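A defensive sketch of the behavior suggested above: look the key up with `.get` and raise a descriptive exception when the geocoder returns no coordinates. The dict shape (`centroid` with `lat`/`lon` keys) mimics the /suggest response described in the report and is an assumption:

```python
def extract_centroid(location_info: dict) -> tuple:
    """Return (lat, lon), raising a clear error if 'centroid' is absent."""
    centroid = location_info.get("centroid")
    if centroid is None:
        raise ValueError("Geocoder returned no coordinates; "
                         "radius search needs a resolvable address")
    return centroid["lat"], centroid["lon"]

# Resolvable location returns its coordinates
print(extract_centroid({"centroid": {"lat": 32.7, "lon": -117.2}}))  # (32.7, -117.2)
```

Raising a ValueError with an explanation beats the raw KeyError, since the caller learns the address could not be geocoded rather than seeing an internal dict-lookup failure.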
Based on the demo it does not seem to allow you to filter based on zip code. Is it possible to add this?
Many cities are unable to obtain data under the "Sold" tag when crawling house information. The error message is as follows. What is the reason for this?
'NoneType' object has no attribute 'value'
It seems that there is an issue with the web scraping tool at tryhomeharvest.com. May I ask what the reason is?
The error is as follows: HTTPSConnectionPool(host='parser-external.geo.moveaws.com', port=443): Max retries exceeded with url: /suggest?input=San+Francisco%2C+CA&client_id=for-sale&limit=1&area_types=city%2Cstate%2Ccounty%2Cpostal_code%2Caddress%2Cstreet%2Cneighborhood%2Cschool%2Cschool_district%2Cuniversity%2Cpark (Caused by ProxyError('Unable to connect to proxy', OSError('Tunnel connection failed: 407 Proxy Authentication Required')))
Can the search params be modified to include the dict of tags for each property, as well as the long paragraph-like description? Additional info like the property details section shown on an individual listing would be great as well.
I'm looking through the code to see if there's a way to get the scraper to pull in school information, which is a huge item people are concerned with when looking at homes. I'm not seeing that data represented anywhere.