
bunsly / homeharvest


Python package for real estate scraping of MLS listing data

Home Page: https://tryhomeharvest.com/

License: MIT License

Python 100.00%
finance properties real-estate realtor redfin webscraping zillow data scraper scraping

homeharvest's People

Contributors

cullenwatson, ddxv, joecryptotoo, robertomr100, zacharyhampton

homeharvest's Issues

Schools?

I would like to thank you first for offering this tool for individual home buyers to search for properties.
This is really helpful and I really appreciate the work.

One question: is there a way to include schools in the returned array?

Thank you.

Estimate of home value

Hi,

I was wondering if realtor.com provides home value estimates as one of their fields. If so, could you please add it? Redfin and Zillow provide home value estimates.

Thank you,
Amir

Sold duration

Hello again,

I just realized that the sold df only returns a few properties. Typically users are interested in the properties sold over the last 3 months, 6 months, and one year for analysis. Could you please add this option? In the sold database, SOLD DATE is missing; perhaps it could be used to retrieve properties sold over the past months.

Also, I was wondering if it would be possible to have more than one location, for example "San Diego", "Ramona" for the location field. This is useful as the city names do not cover the suburb areas and it could be quite useful to be able to state more than one location in the location field.

Thank you very much in advance for adding these to the program.

Regards,
Amir Ali
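
Until a multi-location option exists, one possible workaround is to call the scraper once per location and combine the results afterwards. The sketch below assumes only that `scrape_property` accepts `location` and `listing_type` keyword arguments, as in the snippets elsewhere on this page. `scrape_locations` is a hypothetical helper, not part of HomeHarvest, and the scraper is passed in as an argument so the helper can be tried without network access.

```python
from typing import Any, Callable, Dict, Iterable


def scrape_locations(
    locations: Iterable[str],
    scrape: Callable[..., Any],
    listing_type: str = "sold",
) -> Dict[str, Any]:
    """Call `scrape` once per location and return the results keyed by location.

    `scrape` is expected to behave like homeharvest.scrape_property, i.e. to
    accept `location` and `listing_type` keyword arguments (an assumption).
    """
    results: Dict[str, Any] = {}
    for location in locations:
        # Skip duplicates so the same area is not requested twice.
        if location in results:
            continue
        results[location] = scrape(location=location, listing_type=listing_type)
    return results
```

With the real library this might be called as `scrape_locations(["San Diego, CA", "Ramona, CA"], scrape_property)`, and the per-location DataFrames could then be combined with `pandas.concat(list(results.values()), ignore_index=True)`.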

Any way to pull realtor data with sold properties or listings?

I am working on learning Python and found this project. It is really awesome!
I was able to modify the output name of the Excel file based on the parameters I was searching for.
I am trying to find a way to add realtor names to the list of sold properties. I looked through the HTML on the site and found it is under <strong data-testi id or Seller-sc-okw218-0 jDsXJc. Looking through your code, it looks like it goes directly to the API, so the names on the site might not be the same as what shows up in the JSON payload.

Is there a way to query the API to see what field names it has? Or can realtor fields be added? I was trying to do it myself, but at this point I lack an understanding of how this all works together.

Sorry if this is not the place to post this, but I am not sure what to do without the field names.

Zillow error message

Traceback (most recent call last):
    properties: pd.DataFrame = scrape_property(
                               ^^^^^^^^^^^^^^^^
    final_df = _scrape_single_site(location, site_name[0], listing_type, proxy)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    results = site.search()
              ^^^^^^^^^^^^^
    resp.raise_for_status()
    raise HTTPError(http_error_msg, response=self)
requests.exceptions.HTTPError: 403 Client Error: Forbidden for url: https://www.zillow.com/homes/recently_sold/92037_rb/

style

Hi, in the downloaded Excel file, the 'style' field has an additional string 'PropertyType.'. Please remove it. Thank you.
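
A prefix like this is what Python produces when a plain `Enum` member is converted with `str()`, so one plausible cause (an assumption, not confirmed by the issue) is that the enum member itself is written to the sheet. The sketch below uses a hypothetical stand-in `PropertyType` enum and a hypothetical helper `style_label` to show both the behaviour and a way to clean values that were already exported. It needs Python 3.9+ for `str.removeprefix`.

```python
from enum import Enum
from typing import Any, Optional


class PropertyType(Enum):
    """Hypothetical stand-in for the library's property-type enum."""

    SINGLE_FAMILY = "SINGLE_FAMILY"
    CONDO = "CONDO"


def style_label(value: Any) -> Optional[str]:
    """Return a plain label such as 'CONDO' without the 'PropertyType.' prefix."""
    if value is None:
        return None
    if isinstance(value, Enum):
        # Use the member's name instead of str(member), which adds the class name.
        return value.name
    # Fall back to stripping the prefix from an already-stringified cell.
    return str(value).removeprefix("PropertyType.")
```

For a sheet that has already been exported, the same helper could be applied to the column with pandas, for example `df["style"] = df["style"].map(style_label)`.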

Scrape an important value

This is a great project. I want to ask whether you are able to scrape the "Quantarium" value on realtor.com.

Regards,

Looking for more features, such as home value estimate or history

Hi, this is really impressive! Thank you so much for sharing. I am looking for more of the property history that we can see on the Realtor.com website, or the home value estimate. Do you think this would be valuable to add to this project? If so, can you direct me to their API docs so that I can add it myself if you don't have time to do this? Really appreciate it!

Query listings for entire state

I've tested with multiple cities, and the total count in the csv matches the count on realtor.com (select Hide Pending/Contingent).

However, when I tried to query for an entire state, the count in the csv is significantly lower than on realtor.com.

Example:
For state CA: From HomeHarvest csv < 10,000 listings
On Realtor.com: 78,097 listings

I've tested with other states too, with the same result.

bathroom information not accurate when searching by address

Searching by address does not give the expected details for full_baths and half_baths. The POST request returns null for bath info, leading to None in the dataframe.

Steps to reproduce

from homeharvest import scrape_property

# Search an area first, then look up each result again by its street address.
properties = scrape_property(
    location="denver co",
    listing_type="for_sale"
)

for idx in range(properties.shape[0]):
    detail = scrape_property(location=properties.iloc[idx]['street'], listing_type='for_sale')
    print(detail.iloc[0]["full_baths"], detail.iloc[0]["half_baths"])

Expected output

None None
1 None
3 None
.
.
.

Received output

None None
None None
None None
.
.
.

csv issue - rent prices

Hello. When creating the csv output it only shows the value of the property but not the monthly rent the property owner is asking for. For the rental use case it would be important to know that to evaluate properties.

Return limited

Hi. I've been trying to run your program but for some reason it's limiting the results to about 2000 and I know that there has to be more than 2000 listings available. Could you look into this please?
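
If the limit is applied per query (an assumption; the issue does not say where the cap comes from), one way to work around it is to split a search into several smaller date ranges and run one query per range. The sketch below only generates the ranges. It is written with the `date_from` and `date_to` parameters in mind, which appear in a replication snippet elsewhere on this page; `date_windows` itself is a hypothetical helper, and whether the library treats both dates as inclusive is not stated in the source.

```python
from datetime import date, timedelta
from typing import Iterator, Tuple


def date_windows(start: date, end: date, days: int = 7) -> Iterator[Tuple[str, str]]:
    """Yield consecutive, non-overlapping (date_from, date_to) pairs as ISO strings.

    Both ends of each window are inclusive, and the last window is clipped to `end`.
    """
    if days < 1:
        raise ValueError("days must be at least 1")
    current = start
    while current <= end:
        window_end = min(current + timedelta(days=days - 1), end)
        yield current.isoformat(), window_end.isoformat()
        # The next window starts the day after this one ends.
        current = window_end + timedelta(days=1)
```

Each pair could then be passed to `scrape_property(..., date_from=..., date_to=...)`, with the per-window results concatenated and de-duplicated afterwards.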

Tech-Hobbyist Realtor Request

Hi there. I'm a licensed Realtor and tech hobbyist. I've been playing with the HomeHarvest script in python and it's incredible. Thank you so much for putting this together!

I'm reaching out with a request and a few questions.

Request: Is there any possibility of adding home condition as an additional attribute that can be returned? Normally, the MLS will have something like "Good", "Excellent", etc.
Question 1: Is there an upper limit to radius or past_days?
Question 2: Are there limits on the property types returned? For example, does the search exclude land? Multifamily properties? (I'm trying to get the HomeHarvest results to match my search results on Realtor.com and am getting close by playing with property type filters but not yet finding a precise match.)

Thank you again for this library - it's absolutely incredible.

0.3 Update

  • Change default site to realtor (and remove changing site in examples, as this will be the main site)
  • Radius search (for comps)
  • Last X days search for sold listings
  • Simplify data points to be similar, perhaps slightly more informational than MLS

Additional attributes of property

Thank you for providing this tool for individual home buyers to search for properties; it's really helpful. I have a question about the schema. Are we utilizing all the attributes of the schema to fetch data? If not, could you please guide me on how to access the full schema of the Realtor data source? I need to fetch additional data about properties such as property descriptions, listing brokerage info, listing agent info, property tax rate, property tax amount estimate, number of units, etc. Your assistance would be greatly appreciated.

Zillow/Redfin removed?

In the GitHub keywords, previous issues/pull requests, and the git history I see mentions of both Zillow and Redfin scraping. However, at first glance it looks like the scrapers were removed in version 3.0 (29664e4).

Are these sites no longer supported or is there just a different way to scrape from them that I'm missing?

Zillow: 403 Forbidden

Python 3.10.11
Versions tested: 0.2.13

What I tried to do:

properties: pd.DataFrame = scrape_property(
    site_name=["zillow", "realtor.com", "redfin"],
    location="85281",
    listing_type="for_rent" # for_sale / sold
)

Output:
requests.exceptions.HTTPError: 403 Client Error: Forbidden for url: https://www.zillow.com/homes/for_rent/85281_rb/

Testing this URL in browser works OK.

Feature additions

  • "Pending" or "Contingent" property searches
  • Last X months ago search on Sold properties

nearby_schools causes 500 error

Trying to add "nearby_schools" to the search query causes an error for the area search. Same error when trying to include "tax_history". Is there a list of valid fields that can be included in the query?

Photos

I was wondering how I can get the photos of the listings. When I try to use the img_src column (as output in the Excel sheet), it says it doesn't exist. Are the photos not scraped? Is there any way to add this?

It is odd to me that my Excel sheet does show an img_src column (with parsed images, e.g. from https://photos.zillowstatic.com), yet I am completely unable to find anything about images (or img_src) anywhere in the code.

Thanks!

Fields & Bug Fixes

  • Zestimate field
  • Auth token retrieval -> class method, only runs once per python instance (to prevent hitting px)
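
The "runs once per Python instance" idea in the second item can be sketched with a class-level cache. This is a minimal illustration, not the project's implementation: `TokenCache` and the `fetch` callable are hypothetical names, and `fetch` stands in for whatever network request actually retrieves the token.

```python
import threading
from typing import Callable, Optional


class TokenCache:
    """Fetch an auth token once per Python process and reuse it afterwards."""

    _token: Optional[str] = None
    _lock = threading.Lock()

    @classmethod
    def get(cls, fetch: Callable[[], str]) -> str:
        """Return the cached token, calling `fetch` only if none is cached yet."""
        with cls._lock:
            if cls._token is None:
                # `fetch` stands in for the real (network) token request.
                cls._token = fetch()
            return cls._token
```

Because the token lives on the class rather than on an instance, every scraper object created in the same process shares it, and the lock keeps concurrent callers from fetching it twice.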

list_price on SOLD properties is often incorrectly given to be equal to the sold_price

Forgive me if I am dumb, I just started using this repo today. But it appears that the list_price on many sold properties is incorrect, and it is given as the same as the sold_price. When I visit the Realtor web link to the listing and scroll down to the 'Property History' section, I can see that the list price often does not match what is being returned by the scrape_property function.

When I tried the query in the ReadMe for San Diego, I did not find the same issue but it appears to be widespread in Portland, OR (though I have no idea why it would be different there).

Replication:

from homeharvest import scrape_property

properties = scrape_property(
  location="Portland, OR",
  listing_type="sold",  # or (for_sale, for_rent, pending)
  # past_days=30,  # sold in last 30 days - listed in last 30 days if (for_sale, for_rent)

  date_from="2024-04-01", # alternative to past_days
  date_to="2024-04-18",
  # foreclosure=True

  # mls_only=True,  # only fetch MLS listings
)

properties = properties.sort_values(by='mls_id')

for i in range(1,5):
    print('Listed:', properties['list_price'].iloc[i],
          'Sold:', properties['sold_price'].iloc[i],
          'Link:', properties['property_url'].iloc[i])

Listed: 520000 Sold: 520000 Link: https://www.realtor.com/realestateandhomes-detail/1393689161
Listed: 575000 Sold: 575000 Link: https://www.realtor.com/realestateandhomes-detail/9466515869
Listed: 345000 Sold: 345000 Link: https://www.realtor.com/realestateandhomes-detail/1036098292
Listed: 275000 Sold: 275000 Link: https://www.realtor.com/realestateandhomes-detail/1448309661
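
To gauge how widespread the problem is in a given result set, rather than inspecting a handful of rows, the share of rows with identical prices can be computed. The sketch below is a hypothetical helper that works on plain row dictionaries using the `list_price` and `sold_price` column names from the snippet above; it does not depend on HomeHarvest.

```python
from typing import Iterable, Mapping, Optional


def share_list_equals_sold(rows: Iterable[Mapping[str, Optional[float]]]) -> float:
    """Return the fraction of rows whose list_price equals sold_price.

    Rows missing either price are ignored; returns 0.0 if no row has both.
    """
    compared = 0
    equal = 0
    for row in rows:
        list_price = row.get("list_price")
        sold_price = row.get("sold_price")
        if list_price is None or sold_price is None:
            continue
        compared += 1
        if list_price == sold_price:
            equal += 1
    return equal / compared if compared else 0.0
```

With a DataFrame, the rows could be supplied as `properties.dropna(subset=["list_price", "sold_price"]).to_dict("records")`, so that missing values stored as NaN are excluded before the comparison.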

Remove Zillow and Redfin from repository tags.

Since Zillow and Redfin support has been dropped, there is little sense in keeping tags such as zillow, zillow-scraper, redfin, and redfin-scraper attached to the repository.

They should be removed to avoid causing confusion.

Alt Photos not Returned When Searching a Single Address

When making a request where location is equal to a specific address I do not believe any alt photos are being returned. If I search by zip code though for example I do get alt photos, even for properties that did not return alt photos when I searched by that specific address. Let me know if there's any other information I can provide but I suspect this would be easily reproducible unless I'm making some user error I'm just not seeing.

Awesome library as a whole by the way, great work.

Missing fields

Hello,
Thank you very much for this wonderful code. I tested the code with Redfin and it is running perfectly. May I request the following additions if possible? I only use Redfin data.

  1. The listing_type only supports "sold" and "for_sale". It would be great if "Pending" or "Contingent" properties can also be returned.
  2. The for_sale df does not return DAY_ON_MARKET, LATITUDE, and LONGITUDE
  3. The sold df does not return SOLD_DATE, LATITUDE, and LONGITUDE.

Thank you very much in advance for these updates.

Regards,
Amir Ali

URL for "for_rent" property not accurate

To start with, really amazing tool and thanks for sharing.

I found that the URLs for "for_rent" properties are not working. It seems the "for_rent" listings use a different ID rather than MLS listing IDs.

`KeyError: 'centroid'` for certain addresses when setting `radius` parameter in `scrape_property`

Sometimes I get a KeyError: 'centroid' coming from this line when using scrape_property with the radius parameter. It seems the key is missing from location_info which is determined by a call to this API endpoint: "https://parser-external.geo.moveaws.com/suggest" in handle_location. I'm guessing centroid is missing b/c the latter endpoint can't match the address to coordinates? If so, should we raise an exception instead?
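
The suggestion to raise an exception could look roughly like the sketch below. It assumes only what the issue states, namely that `location_info` is a mapping that sometimes lacks a `centroid` key. `LocationNotGeocodedError` and `require_centroid` are hypothetical names, not part of the library.

```python
from typing import Any, Mapping


class LocationNotGeocodedError(ValueError):
    """Hypothetical exception for a location that has no usable coordinates."""


def require_centroid(location_info: Mapping[str, Any], query: str) -> Any:
    """Return location_info['centroid'] or raise a descriptive error if it is absent."""
    centroid = location_info.get("centroid")
    if centroid is None:
        raise LocationNotGeocodedError(
            f"No coordinates found for {query!r}; a radius search needs a "
            "location that resolves to a point."
        )
    return centroid
```

Subclassing `ValueError` lets callers catch the failure with a handler they may already have for bad input, while the message names the offending location instead of surfacing a bare `KeyError`.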

Question

Based on the demo it does not seem to allow you to filter based on zip code. Is it possible to add this?

Seeking help

For many cities, I am unable to obtain data under the "Sold" tag when crawling house information. The error message is as follows. What is the reason for this?

'NoneType' object has no attribute 'value'
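
That message is the text of the `AttributeError` Python raises when code reads `.value` on `None`. One plausible cause (an assumption, since the issue includes no traceback) is an optional enum field that is missing for some sold listings. A defensive accessor such as the hypothetical `enum_value` below avoids the crash by passing `None` through.

```python
from enum import Enum
from typing import Any, Optional


def enum_value(member: Optional[Enum]) -> Any:
    """Return member.value, or None when the member itself is None."""
    return None if member is None else member.value
```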

Seeking help

It seems that there is an issue with the web scraping tool at tryhomeharvest.com. May I ask what the reason is?

The error is as follows: HTTPSConnectionPool(host='parser-external.geo.moveaws.com', port=443): Max retries exceeded with url: /suggest?input=San+Francisco%2C+CA&client_id=for-sale&limit=1&area_types=city%2Cstate%2Ccounty%2Cpostal_code%2Caddress%2Cstreet%2Cneighborhood%2Cschool%2Cschool_district%2Cuniversity%2Cpark (Caused by ProxyError('Unable to connect to proxy', OSError('Tunnel connection failed: 407 Proxy Authentication Required')))
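
HTTP status 407 means the proxy itself demanded credentials that were missing or rejected, so the request never reached the geo endpoint. If the proxy is supplied as a URL with embedded credentials, special characters in the username or password need percent-encoding. The sketch below is a hypothetical helper built on the standard library; whether the resulting URL form is what the tool's `proxy` argument expects is an assumption.

```python
from urllib.parse import quote


def build_proxy_url(user: str, password: str, host: str, port: int, scheme: str = "http") -> str:
    """Build a proxy URL with percent-encoded credentials."""
    # safe="" also encodes characters such as '@', ':' and '/' that would
    # otherwise be misread as URL delimiters.
    return f"{scheme}://{quote(user, safe='')}:{quote(password, safe='')}@{host}:{port}"
```

For example, a password containing `@` or `:` would otherwise split the URL in the wrong place and lead the proxy to see incomplete credentials.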

Tags/Description of a property

Can the search params be modified to include the dict of tags for each property, as well as the long, paragraph-style description? Additional info, like the property details section shown on an individual listing, would be great as well.

School ratings included

I'm looking through the code to see if there's a way to get the scraper to pull in school information, which is a huge item people are concerned with when looking at homes. I'm not seeing that data represented anywhere.
