📰 news-fetch

news-fetch is an open-source, easy-to-use news crawler that extracts structured information from almost any news website 🌐. It can recursively follow internal hyperlinks and read RSS feeds to fetch both recent and archived articles 📚. You only need to provide the root URL of a news website to crawl it completely 🔍. news-fetch builds on several existing libraries and tools, including news-please by Felix Hamborg and Newspaper3k by Lucas (欧阳象) Ou-Yang, and leverages features from both 🤖.

I built this tool to minimize NaN or empty values when scraping data from various news websites 🚀. It’s platform-independent and written in Python 3, making it easy for programmers and developers to access news data for their applications 💻.
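One way to picture the "minimize empty values" goal is a field-by-field fallback between the results of two extractors. The sketch below is only an illustration of that idea, not the package's actual implementation: the helper names (is_missing, merge_records) and the field names in the sample dictionaries are hypothetical.

```python
from math import isnan
from typing import Any, Mapping


def is_missing(value: Any) -> bool:
    """Return True for None, NaN, blank strings, and empty collections."""
    if value is None:
        return True
    if isinstance(value, float) and isnan(value):
        return True
    if isinstance(value, str):
        return not value.strip()
    if isinstance(value, (list, tuple, set, dict)):
        return len(value) == 0
    return False


def merge_records(*records: Mapping[str, Any]) -> dict:
    """Merge records field by field; earlier records take priority.

    A later record only fills a field that is still missing.
    """
    merged: dict = {}
    for record in records:
        for key, value in record.items():
            if is_missing(merged.get(key)) and not is_missing(value):
                merged[key] = value
            else:
                # Keep the existing value, or record the key if it is new.
                merged.setdefault(key, value)
    return merged


# Hypothetical outputs of two different extractors for the same article.
primary = {"headline": "Example headline", "authors": [], "date_publish": None}
secondary = {
    "headline": "",
    "authors": ["Jane Doe"],
    "date_publish": float("nan"),
    "language": "en",
}

merged = merge_records(primary, secondary)
print(merged)
```

Here the headline comes from the first record, the authors and language from the second, and the publication date stays None because neither source supplied a usable value.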

🔗 Project Links

Source	Link
PyPI:	https://pypi.org/project/news-fetch/
Repository:	https://santhoshse7en.github.io/news-fetch/
Documentation:	https://santhoshse7en.github.io/news-fetch_doc/ (Not Yet Created!)

📦 Dependencies

📝 Extracted Information

news-fetch extracts the following attributes from news articles. You can also check out an example JSON file generated by news-please.

📰 Headline
✍️ Author(s)
📅 Publication date
🗞️ Publication
📂 Category
🌍 Source domain
📑 Article content
📝 Summary
🔑 Keywords
🌐 URL
🌐 Language
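
As a rough picture of what one extracted article could look like as structured data, the sketch below models the attributes listed above as a plain Python dataclass and serializes it to JSON. The class name ArticleRecord and all of its field names are assumptions made for illustration; they are not guaranteed to match the attribute names that news-fetch or news-please actually expose.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional
from urllib.parse import urlparse


@dataclass
class ArticleRecord:
    """Hypothetical container mirroring the attribute list; not the library's API."""

    headline: Optional[str] = None
    authors: List[str] = field(default_factory=list)
    publication_date: Optional[str] = None
    publication: Optional[str] = None
    category: Optional[str] = None
    source_domain: Optional[str] = None
    content: Optional[str] = None
    summary: Optional[str] = None
    keywords: List[str] = field(default_factory=list)
    url: Optional[str] = None
    language: Optional[str] = None

    def to_json(self) -> str:
        """Serialize the record, keeping non-ASCII characters readable."""
        return json.dumps(asdict(self), ensure_ascii=False, indent=2)


article_url = (
    "https://www.thehindu.com/news/cities/Madurai/"
    "aa-plays-a-pivotal-role-in-helping-people-escape-from-the-grip-of-alcoholism/"
    "article67716206.ece"
)

record = ArticleRecord(
    headline="AA plays a pivotal role in helping people escape from the grip of alcoholism",
    source_domain=urlparse(article_url).netloc,  # derived from the URL itself
    url=article_url,
)

print(record.to_json())
```

Fields that were not extracted stay as None or as an empty list, which makes gaps easy to spot in the JSON output.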

🔧 Dependency Installation

Use the package manager pip to install the required dependencies:

pip install -r requirements.txt

🚀 Usage

You can download the source by clicking the green download button on GitHub.

To scrape all the details of a news article, use the Newspaper class:

from newsfetch.news import Newspaper

news = Newspaper(url='https://www.thehindu.com/news/cities/Madurai/aa-plays-a-pivotal-role-in-helping-people-escape-from-the-grip-of-alcoholism/article67716206.ece')
print(news.headline)
# Output: 'AA plays a pivotal role in helping people escape from the grip of alcoholism'

To extract article URLs from a targeted website, create a GoogleSearchNewsURLExtractor, passing the search keyword and the newspaper's URL as the keyword and news_domain arguments:

from newsfetch.google import GoogleSearchNewsURLExtractor

google = GoogleSearchNewsURLExtractor(keyword='Alcoholics Anonymous', news_domain='https://timesofindia.indiatimes.com/')
print(google.urls)
"""
['https://timesofindia.indiatimes.com/city/pune/pune-takes-a-stand-against-alcoholism-experts-collaborate-with-alcoholics-anonymous/articleshow/114438466.cms', 
'https://timesofindia.indiatimes.com/city/mumbai/we-have-lost-jobs-homes-alcoholics-anonymous/articleshow/96824383.cms', 
'https://timesofindia.indiatimes.com/city/gurgaon/gurgaons-alcoholics-open-up-about-their-road-to-recovery/articleshow/45080744.cms', 
'https://timesofindia.indiatimes.com/city/goa/alcoholism-is-illness-not-issue-of-weak-willpower-say-experts/articleshow/105320008.cms', 
'https://timesofindia.indiatimes.com/city/bhopal/alcoholism-is-an-illness-bhopal-aa-silver-jubilee-celebration/articleshow/106849014.cms', 
'https://timesofindia.indiatimes.com/city/ahmedabad/alcoholics-anonymous-switches-to-online-sessions/articleshow/76144639.cms', 
'https://timesofindia.indiatimes.com/city/kochi/keralites-trying-to-kick-alcoholism-alcoholics-anonymous/articleshow/13977818.cms', 
'https://timesofindia.indiatimes.com/city/chandigarh/alcoholics-anonymous-turned-their-lives-around/articleshow/18239.cms', 
'https://timesofindia.indiatimes.com/city/mumbai/like-air-india-flyer-alcoholics-anonymous-members-reap-whirlwind-of-job-loss-broken-homes/articleshow/96820403.cms', 
'https://timesofindia.indiatimes.com/city/nagpur/alcoholics-anonymous-meet-promotes-one-day-at-a-time/articleshow/50538092.cms']
"""
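
The two calls above can be chained: take the extracted URLs and fetch a headline for each one. The helper below is a minimal sketch of that loop. The function collect_headlines and its fetch_article parameter are hypothetical names introduced here; the article factory is passed in as an argument so the loop can be tried without network access or the package installed.

```python
from typing import Callable, Dict, Iterable, Optional


def collect_headlines(
    urls: Iterable[str],
    fetch_article: Callable[[str], object],
) -> Dict[str, Optional[str]]:
    """Map each URL to its headline, or to None if fetching fails."""
    headlines: Dict[str, Optional[str]] = {}
    for url in urls:
        if url in headlines:
            continue  # skip duplicate URLs
        try:
            article = fetch_article(url)
            # Assumes the returned object exposes a `headline` attribute.
            headlines[url] = getattr(article, "headline", None)
        except Exception:
            # One failing page should not stop the whole batch.
            headlines[url] = None
    return headlines
```

With news-fetch installed, the objects from the examples above could be plugged in as collect_headlines(google.urls, lambda u: Newspaper(url=u)).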

🤝 Contributing

Pull requests are welcome! For major changes, please open an issue first to discuss what you would like to change.

Make sure to update tests as appropriate.

📄 License

This project is licensed under the MIT License.