How to get your training data
How to get the training data?
exMedia_Machines/Seminar_Einführung-in-die-Programmierung-KI/04_07-11_maschinelles-lesen/02_load_scrape-data.ipynb
See more: https://www.nltk.org/book/ch03.html
Reading a file from your own database
filename = 'Dateipfad'       # 'Dateipfad' = path to your file
file = open(filename, 'rt')  # open for reading as text
amw1 = file.read()           # read the whole file into one string
file.close()
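Once the file is read, the raw string can be tokenized for further processing, for example with NLTK (a minimal sketch; it assumes the nltk package and its 'punkt' tokenizer data are installed):

import nltk

# nltk.download('punkt')  # one-time download of the tokenizer models
tokens = nltk.word_tokenize(amw1)  # split the raw text into word tokens
print(tokens[:20])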
Preprocessed training datasets
links go here
Wikipedia
Wiki2Text
Extracts a plain-text corpus from MediaWiki XML dumps such as Wikipedia; see: https://github.com/rspeer/wiki2text
Wikiextractor
https://github.com/attardi/wikiextractor
WikipediaAPI
https://pypi.org/project/Wikipedia-API/
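A minimal sketch of fetching one article as plain text with this package; the user agent string and page title are placeholders (recent versions of Wikipedia-API require a descriptive user agent):

import wikipediaapi

wiki = wikipediaapi.Wikipedia(
    user_agent='training-data-demo (contact@example.org)',
    language='en')

page = wiki.page('Machine learning')
if page.exists():
    print(page.summary[:200])  # first 200 characters of the summary
    text = page.text           # full plain text of the article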
Scraping tweets
- https://medium.com/@limavallantin/mining-twitter-for-sentiment-analysis-using-python-a74679b85546
- https://medium.com/better-programming/how-to-build-a-twitter-sentiments-analyzer-in-python-using-textblob-948e1e8aae14
- https://www.researchgate.net/post/How_to_download_the_hashtag_data_set_from_twitter_and_instagram
Example code from https://gist.github.com/sxshateri/540aead254bfa7810ee8bbb2d298363e:
import tweepy
import csv
import pandas as pd
import sys

# API credentials here
consumer_key = 'INSERT CONSUMER KEY HERE'
consumer_secret = 'INSERT CONSUMER SECRET HERE'
access_token = 'INSERT ACCESS TOKEN HERE'
access_token_secret = 'INSERT ACCESS TOKEN SECRET HERE'

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

# Search word/hashtag value
HashValue = ""

# search start date value. the search will start from this date to the current date.
StartDate = ""

# getting the search word/hashtag and date range from user
HashValue = input("Enter the hashtag you want the tweets to be downloaded for: ")
StartDate = input("Enter the start date in this format yyyy-mm-dd: ")

# Open/Create a file to append data
csvFile = open(HashValue + '.csv', 'a')

# Use csv Writer
csvWriter = csv.writer(csvFile)

for tweet in tweepy.Cursor(api.search, q=HashValue, count=20, lang="en",
                           since=StartDate, tweet_mode='extended').items():
    print(tweet.created_at, tweet.full_text)
    csvWriter.writerow([tweet.created_at, tweet.full_text.encode('utf-8')])

print("Scraping finished and saved to " + HashValue + ".csv")
# sys.exit()
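Note that this gist targets tweepy 3.x: in tweepy 4 the search endpoint was renamed to api.search_tweets and the wait_on_rate_limit_notify argument was removed, so the script needs small adjustments on newer versions.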
Downloading web pages
In HTML format:
from urllib import request

url = "https://theorieblog.attac.de/quo-vadis-homo-spiens/"
html = request.urlopen(url).read().decode('utf8')
print(html[:60])  # first 60 characters of the raw HTML
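The result still contains HTML markup; to reduce it to plain text, the NLTK book uses BeautifulSoup. A minimal sketch, assuming the beautifulsoup4 package is installed:

from urllib import request
from bs4 import BeautifulSoup

url = "https://theorieblog.attac.de/quo-vadis-homo-spiens/"
html = request.urlopen(url).read().decode('utf8')

# parse the HTML and keep only the visible text
soup = BeautifulSoup(html, 'html.parser')
text = soup.get_text()
print(text[:200])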
Already in plain-text format (e.g. from Project Gutenberg):
from urllib import request

url = "http://www.gutenberg.org/files/2554/2554-0.txt"
response = request.urlopen(url)
raw = response.read().decode('utf8')
print(raw[1000:1275])
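To keep the downloaded text as a local training file, write it back to disk (the file name here is just an example):

# save the raw text so it can be reused as a local training file
with open('gutenberg-2554.txt', 'w', encoding='utf8') as f:
    f.write(raw)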