bunsly / homeharvest
Python package for real estate scraping of MLS listing data
Home Page: https://tryhomeharvest.com/
License: MIT License
Make it possible to scrape realtor.ca as well.
I would like to thank you first for offering this tool for individual home buyers to search for properties.
This is really helpful and I really appreciate the work.
One question about the return array: is there a way to include schools in it?
Thank you.
Hi,
I was wondering if realtor.com provides home value estimates as one of their fields. If so, could you please add it? Redfin and Zillow provide home value estimates.
Thank you,
Amir
Excellent work, guys. I saw this repo from your Reddit post. It seems you used to support Zillow and Redfin listings in the past. Any reason why you dropped them?
Hello again,
I just realized that the sold df only returns a few properties. Typically users are interested in properties sold over the last 3 months, 6 months, and one year for analysis. Could you please add this option? In the sold database, SOLD DATE is missing, and perhaps it could be used to retrieve properties sold over the past months.
Also, I was wondering if it would be possible to supply more than one location, for example "San Diego", "Ramona", in the location field. City names do not cover the suburb areas, so being able to state more than one location would be quite useful.
Thank you very much in advance for adding these to the program.
Regards,
Amir Ali
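Until multi-location support lands, one way to get the same effect is to call the scraper once per location and concatenate the results. This is a minimal sketch of that pattern; the `scrape_property` below is a stub standing in for `homeharvest.scrape_property` (with made-up columns) so the example runs without network access.

```python
import pandas as pd

def scrape_property(location, listing_type="for_sale"):
    # Stub standing in for homeharvest.scrape_property; columns are illustrative
    return pd.DataFrame({"street": [f"123 Main St, {location}"],
                         "list_price": [500_000]})

locations = ["San Diego, CA", "Ramona, CA"]
frames = [scrape_property(loc, listing_type="for_sale") for loc in locations]

# Concatenate and drop duplicates in case the search areas overlap
combined = pd.concat(frames, ignore_index=True).drop_duplicates(subset="street")
print(len(combined))  # one row per unique address across both locations
```

Deduplicating on the address (or, with real data, on a stable key such as `mls_id`) keeps overlapping city/suburb searches from double-counting listings.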
At first glance, the example code does not return Zillow descriptions.
I am working on learning python and found this project. It is really awesome!
I was able to modify the output name of the Excel file based on the parameters I was searching for.
I am trying to find a way to add realtor names to the list of sold properties. I looked through the HTML on the site and found it under a <strong data-testid> element with classes like Seller-sc-okw218-0 jDsXJc. Looking through your code, it looks like it goes directly to the API, so the names on the site might not match what shows up in the JSON payload.
Is there a way to run a query against the API to see what field names it has? Or can realtor fields be added? I was trying to do it myself but lack an understanding of how this all fits together at this point.
Sorry if this is not the place to post this, but I am not sure what to do without the field names.
Traceback (most recent call last):
properties: pd.DataFrame = scrape_property(
^^^^^^^^^^^^^^^^
final_df = _scrape_single_site(location, site_name[0], listing_type, proxy)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
results = site.search()
^^^^^^^^^^^^^
resp.raise_for_status()
raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 403 Client Error: Forbidden for url: https://www.zillow.com/homes/recently_sold/92037_rb/
Hi, in the downloaded Excel file, the 'style' field has an extra 'PropertyType.' prefix. Please remove it. Thank you.
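Until that is fixed upstream, the stringified enum prefix can be stripped with one pandas call. A runnable sketch with a toy frame; the member names (`SINGLE_FAMILY`, `CONDOS`) are assumptions for illustration:

```python
import pandas as pd

# Toy frame mimicking the exported 'style' column described above
df = pd.DataFrame({"style": ["PropertyType.SINGLE_FAMILY", "PropertyType.CONDOS"]})

# Drop the stringified enum class prefix, keeping just the member name;
# regex=False treats "PropertyType." as a literal string, not a pattern
df["style"] = df["style"].str.replace("PropertyType.", "", regex=False)
print(df["style"].tolist())  # ['SINGLE_FAMILY', 'CONDOS']
```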
This is a great project. I want to ask if you are able to scrape the "Quantarium" value on realtor.com.
Regards,
Hi this is really impressive! Thank you so much for sharing. I am looking for more property history that we can see from the Realtor.com website or the home value estimate. Do you think if it's something that is valuable to add to this project? If so, can you direct me to their API doc so that I can add it myself if you don't have time to do this? Really appreciate it!
I've tested with multiple cities, and the total count in the csv matches the count on realtor.com (with Hide Pending/Contingent selected).
However, when I tried to query an entire state, the count in the csv is significantly lower than on realtor.com.
Example:
For state CA: From HomeHarvest csv < 10,000 listings
On Realtor.com: 78,097 listings
I've tested with other states too, with the same observation.
Searching by address does not give the expected details for full_baths and half_baths. The POST request returns null for bath info, leading to None in the dataframe.
from homeharvest import scrape_property

properties = scrape_property(
    location="denver co",
    listing_type="for_sale"
)

for idx in range(properties.shape[0]):
    detail = scrape_property(location=properties.iloc[idx]['street'], listing_type='for_sale')
    print(detail.iloc[0]["full_baths"], detail.iloc[0]["half_baths"])
None None
1 None
3 None
...
None None
None None
None None
...
Hello. When creating the CSV output, it only shows the value of the property, but not the monthly rent the property owner is asking for. For the rental use case it would be important to know that to evaluate properties.
Hi. I've been trying to run your program but for some reason it's limiting the results to about 2000 and I know that there has to be more than 2000 listings available. Could you look into this please?
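If the ~2000-result cap is per request (an assumption), one workaround is to split the search into smaller date windows via the library's date_from/date_to parameters and merge the chunks. The helper below only generates the window pairs and is runnable on its own:

```python
from datetime import date, timedelta

def date_windows(start: date, end: date, days: int = 30):
    """Yield (date_from, date_to) ISO-string pairs covering [start, end]."""
    cur = start
    while cur <= end:
        nxt = min(cur + timedelta(days=days - 1), end)
        yield cur.isoformat(), nxt.isoformat()
        cur = nxt + timedelta(days=1)

windows = list(date_windows(date(2024, 1, 1), date(2024, 3, 31)))
print(windows[0])  # ('2024-01-01', '2024-01-30')
```

Each pair would then feed one `scrape_property(..., date_from=a, date_to=b)` call, with the resulting DataFrames concatenated and deduplicated afterwards.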
Hi there. I'm a licensed Realtor and tech hobbyist. I've been playing with the HomeHarvest script in python and it's incredible. Thank you so much for putting this together!
I'm reaching out with a request and a few questions.
Request: Is there any possibility of adding home condition as an additional attribute that can be returned? Normally, the MLS will have something like "Good", "Excellent", etc.
Question 1: Is there an upper limit to radius or past_days?
Question 2: Are there limits on the property types returned? For example, does the search exclude land? Multifamily properties? (I'm trying to get the HomeHarvest results to match my search results on Realtor.com and am getting close by playing with property type filters but not yet finding a precise match.)
Thank you again for this library - it's absolutely incredible.
In the latest homeharvest version (0.3.8), the property's street name in the scraped data is missing the direction:
19754 Sonia Ln
instead of
19754 SW Sonia Ln
Hello,
The Realtor.Com "sold" data has many properties with missing values for the "price" field. This hinders market analysis.
Thank you,
Amir
Thank you for providing this tool for individual home buyers to search for properties; it's really helpful. I have a question about the schema. Are we utilizing all the attributes of the schema to fetch data? If not, could you please guide me on how to access the full schema of the Realtor data source? I need to fetch additional data about properties such as property descriptions, listing brokerage info, listing agent info, property tax rate, property tax amount estimate, number of units, etc. Your assistance would be greatly appreciated.
In the GitHub keywords, previous issues/pull requests, and the git history, I see mentions of both zillow and redfin scraping. However, at first glance it looks like the scrapers were removed in version 3.0 (29664e4).
Are these sites no longer supported or is there just a different way to scrape from them that I'm missing?
Python 3.10.11
Versions tested: 0.2.13
What I tried to do:
import pandas as pd
from homeharvest import scrape_property

properties: pd.DataFrame = scrape_property(
    site_name=["zillow", "realtor.com", "redfin"],
    location="85281",
    listing_type="for_rent"  # for_sale / sold
)
Output:
requests.exceptions.HTTPError: 403 Client Error: Forbidden for url: https://www.zillow.com/homes/for_rent/85281_rb/
Testing this URL in browser works OK.
Trying to add "nearby_schools" to the search query causes an error for the area search. Same error when trying to include "tax_history". Is there a list of valid fields that can be included in the query?
I was wondering how I can get the photos of the listings? When I try to use the img_src column (as output in the Excel sheet), it says it doesn't exist. Are the photos not scraped? Any way to add this?
It is odd to me that my Excel sheet does show an img_src (and parsed images, e.g. from https://photos.zillowstatic.com), but beyond that I am completely unable to find anything about images (nor img_src anywhere in the code).
Thanks!
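If the export does carry a photo column of comma-separated URLs (as a later report in this thread about "alt photos" suggests), it can be split into Python lists with pandas. Both the column name `alt_photos` and the `", "` separator are assumptions here:

```python
import pandas as pd

# Toy frame; 'alt_photos' as a comma-separated URL string is an assumption
df = pd.DataFrame({"alt_photos": [
    "https://photos.zillowstatic.com/a.jpg, https://photos.zillowstatic.com/b.jpg",
    None,  # listings without photos stay NaN after the split
]})

df["photo_list"] = df["alt_photos"].str.split(", ")
print(df["photo_list"].iloc[0])
```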
Implement TLS client.
It appears that Realtor.com splits its search results into a small section of "Relevant listings" and a longer section with the rest of the properties. The scrape_property function only returns the small first section of listings, not the longer second section.
Curious if there's any features that feel missing or would make the tool better
Forgive me if I am dumb; I just started using this repo today. But it appears that the list_price on many sold properties is incorrect, and is given as the same as the sold_price. When I visit the Realtor web link for the listing and scroll down to the 'Property History' section, I can see that the list price often does not match what is being returned by the scrape_property function.
When I tried the query in the README for San Diego, I did not find the same issue, but it appears to be widespread in Portland, OR (though I have no idea why it would be different there).
Replication:
from homeharvest import scrape_property

properties = scrape_property(
    location="Portland, OR",
    listing_type="sold",  # or (for_sale, for_rent, pending)
    # past_days=30,  # sold in last 30 days - listed in last 30 days if (for_sale, for_rent)
    date_from="2024-04-01",  # alternative to past_days
    date_to="2024-04-18",
    # foreclosure=True
    # mls_only=True,  # only fetch MLS listings
)
properties = properties.sort_values(by='mls_id')
for i in range(1, 5):
    print('Listed:', properties['list_price'].iloc[i],
          'Sold:', properties['sold_price'].iloc[i],
          'Link:', properties['property_url'].iloc[i])
Listed: 520000 Sold: 520000 Link: https://www.realtor.com/realestateandhomes-detail/1393689161
Listed: 575000 Sold: 575000 Link: https://www.realtor.com/realestateandhomes-detail/9466515869
Listed: 345000 Sold: 345000 Link: https://www.realtor.com/realestateandhomes-detail/1036098292
Listed: 275000 Sold: 275000 Link: https://www.realtor.com/realestateandhomes-detail/1448309661
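To gauge how widespread this is, one can count the rows where the two prices coincide. A runnable toy sketch with made-up numbers; in real use the scraped DataFrame would be passed in instead:

```python
import pandas as pd

# Toy stand-in for the scraped 'sold' DataFrame
df = pd.DataFrame({
    "list_price": [520_000, 575_000, 345_000, 300_000],
    "sold_price": [520_000, 575_000, 345_000, 290_000],
})

# Rows where list price exactly equals sold price are worth spot-checking
suspect = df[df["list_price"] == df["sold_price"]]
print(f"{len(suspect)}/{len(df)} rows have identical list and sold prices")
```

Some of these rows will be genuine at-asking sales, so the count is an upper bound on the bug's reach, not an exact measure.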
Since Zillow and Redfin support has been dropped, there is little sense in keeping tags such as zillow, zillow-scraper, redfin, and redfin-scraper attached to the repository. They should be removed to avoid causing confusion.
When making a request where location is a specific address, I do not believe any alt photos are being returned. If I search by zip code, for example, I do get alt photos, even for properties that did not return alt photos when I searched by that specific address. Let me know if there's any other information I can provide, but I suspect this would be easily reproducible unless I'm making some user error I'm just not seeing.
Awesome library as a whole by the way, great work.
Hello,
Thank you very much for this wonderful code. I tested the code with Redfin, and it runs perfectly. May I request the following additions, if possible? I only use Redfin data.
403 Client Error: Forbidden for url: https://www.realtor.com/api/v1/rdc_search_srp?client_id=rdc-search-new-communities&schema=vesta
I was trying to look up a city using your format, CITY, ST (state), and this error is thrown.
To start with, really amazing tool and thanks for sharing.
I found the URLs for "for_rent" properties are not working. It seems the "for_rent" listings use a different id rather than MLS listing ids.
Sometimes I get a KeyError: 'centroid' coming from this line when using scrape_property with the radius parameter. It seems the key is missing from location_info, which is determined by a call to this API endpoint: "https://parser-external.geo.moveaws.com/suggest" in handle_location. I'm guessing centroid is missing because the latter endpoint can't match the address to coordinates? If so, should we raise an exception instead?
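A defensive sketch of the behavior suggested above: look the key up with `.get` and raise a descriptive exception when the geocoder returns no coordinates. The dict shape (`centroid` with `lat`/`lon` keys) mimics the /suggest response described in the report and is an assumption:

```python
def extract_centroid(location_info: dict) -> tuple:
    """Return (lat, lon), raising a clear error if 'centroid' is absent."""
    centroid = location_info.get("centroid")
    if centroid is None:
        raise ValueError("Geocoder returned no coordinates; "
                         "radius search needs a resolvable address")
    return centroid["lat"], centroid["lon"]

# Resolvable location returns its coordinates
print(extract_centroid({"centroid": {"lat": 32.7, "lon": -117.2}}))  # (32.7, -117.2)
```

Raising a ValueError with an explanation beats the raw KeyError, since the caller learns the address could not be geocoded rather than seeing an internal dict-lookup failure.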
Based on the demo it does not seem to allow you to filter based on zip code. Is it possible to add this?
Many cities are unable to obtain data under the "Sold" tag when crawling house information. The error message is as follows. What is the reason for this?
'NoneType' object has no attribute 'value'
It seems that there is an issue with the web scraping tool at tryhomeharvest.com. May I ask what the reason is?
The error is as follows: HTTPSConnectionPool(host='parser-external.geo.moveaws.com', port=443): Max retries exceeded with url: /suggest?input=San+Francisco%2C+CA&client_id=for-sale&limit=1&area_types=city%2Cstate%2Ccounty%2Cpostal_code%2Caddress%2Cstreet%2Cneighborhood%2Cschool%2Cschool_district%2Cuniversity%2Cpark (Caused by ProxyError('Unable to connect to proxy', OSError('Tunnel connection failed: 407 Proxy Authentication Required')))
Can the search params be modified to include the dict of tags for each property, as well as the long paragraph-like description? Additional info like the property details section shown on an individual listing would be great as well.
I'm looking through the code to see if there's a way to get the scraper to pull in school information, which is a huge item people are concerned with when looking at homes. I'm not seeing that data represented anywhere.