How to get your training data
From exmediawiki
How to get the training data?
exMedia_Machines/Seminar_Einführung-in-die-Programmierung-KI/04_07-11_maschinelles-lesen/02_load_scrape-data.ipynb
See also: https://www.nltk.org/book/ch03.html
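Reading a file from your own data collection
filename = 'Dateipfad'  # placeholder: path to your text file
# the with-block closes the file automatically, even if reading fails
with open(filename, 'rt') as file:
    amw1 = file.read()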
Preprocessed training datasets
(links to be added here)
Wikipedia
Wiki2Text
Extracts a plain-text corpus from MediaWiki XML dumps such as Wikipedia; see: https://github.com/rspeer/wiki2text
Wikiextractor
https://github.com/attardi/wikiextractor
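WikiExtractor writes its results into files in which every article is wrapped in a `<doc ...>` element. A minimal sketch of reading such output back into Python, assuming the default (non-JSON) output format with `id`, `url` and `title` attributes in that order; the function name `iter_docs` and the sample string are made up for illustration:

```python
import re

# Assumed default WikiExtractor layout: <doc id="..." url="..." title="..."> text </doc>
DOC_PATTERN = re.compile(
    r'<doc id="(?P<id>[^"]*)" url="(?P<url>[^"]*)" title="(?P<title>[^"]*)">\n'
    r'(?P<text>.*?)</doc>',
    re.DOTALL,
)

def iter_docs(raw):
    """Yield (title, text) pairs from the contents of one extracted file."""
    for match in DOC_PATTERN.finditer(raw):
        yield match.group("title"), match.group("text").strip()

# Hypothetical sample standing in for the contents of an extracted file
sample = (
    '<doc id="1" url="https://example.org/wiki?curid=1" title="First">\n'
    'First\n\nSome text.\n</doc>\n'
    '<doc id="2" url="https://example.org/wiki?curid=2" title="Second">\n'
    'Second\n\nMore text.\n</doc>\n'
)

docs = list(iter_docs(sample))
print(docs)
```

For a real run, `raw` would be the contents of one of the files WikiExtractor produced, read with `open(...).read()`.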
WikipediaAPI
https://pypi.org/project/Wikipedia-API/
Scraping tweets
- https://medium.com/@limavallantin/mining-twitter-for-sentiment-analysis-using-python-a74679b85546
- https://medium.com/better-programming/how-to-build-a-twitter-sentiments-analyzer-in-python-using-textblob-948e1e8aae14
- https://www.researchgate.net/post/How_to_download_the_hashtag_data_set_from_twitter_and_instagram
Example code from https://gist.github.com/sxshateri/540aead254bfa7810ee8bbb2d298363e:
import tweepy
import csv
import pandas as pd
import sys
# API credentials here
consumer_key = 'INSERT CONSUMER KEY HERE'
consumer_secret = 'INSERT CONSUMER SECRET HERE'
access_token = 'INSERT ACCESS TOKEN HERE'
access_token_secret = 'INSERT ACCESS TOKEN SECRET HERE'
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
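# Note: this script targets Tweepy 3.x. Tweepy 4 removed wait_on_rate_limit_notify
# and renamed api.search to api.search_tweets.
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)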
# Search word/hashtag value
HashValue = ""
# search start date value. the search will start from this date to the current date.
StartDate = ""
# getting the search word/hashtag and date range from user
HashValue = input("Enter the hashtag you want the tweets to be downloaded for: ")
StartDate = input("Enter the start date in this format yyyy-mm-dd: ")
# Open/create a file to append data
csvFile = open(HashValue + '.csv', 'a', newline='', encoding='utf-8')
# Use the csv writer
csvWriter = csv.writer(csvFile)
for tweet in tweepy.Cursor(api.search, q=HashValue, count=20, lang="en", since=StartDate, tweet_mode='extended').items():
    print(tweet.created_at, tweet.full_text)
    # write the text as str; encoding it to bytes would store "b'...'" in Python 3
    csvWriter.writerow([tweet.created_at, tweet.full_text])
csvFile.close()
print("Scraping finished and saved to " + HashValue + ".csv")
#sys.exit()
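To use the scraped tweets as training data, the CSV file has to be read back in. A minimal sketch using only the standard library, assuming the two-column layout (timestamp, text) written by the script above; the helper name `load_tweets` and the demo file are hypothetical:

```python
import csv
import os
import tempfile

def load_tweets(path):
    """Return the tweet texts (second column) from a CSV with rows [created_at, text]."""
    with open(path, newline='', encoding='utf-8') as csv_file:
        return [row[1] for row in csv.reader(csv_file) if len(row) >= 2]

# Demo with a throwaway file standing in for "<hashtag>.csv"
demo_path = os.path.join(tempfile.mkdtemp(), "demo.csv")
with open(demo_path, 'w', newline='', encoding='utf-8') as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow(["2020-01-01 12:00:00", "first tweet, with a comma"])
    writer.writerow(["2020-01-02 08:30:00", "second tweet\nover two lines"])

tweets = load_tweets(demo_path)
print(tweets)
```

Opening the file with `newline=''` lets the csv module handle tweets that contain line breaks, which are common in practice.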
Downloading web pages
In HTML format:
from urllib import request
url = "https://theorieblog.attac.de/quo-vadis-homo-spiens/"
html = request.urlopen(url).read().decode('utf8')
print(html[:60])  # show the first 60 characters
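A page downloaded this way still contains HTML markup. One possible way to reduce it to plain text with the standard library alone is sketched here; the names `TextExtractor` and `html_to_text`, the list of block-level tags and the sample HTML are illustrative choices, and a dedicated library such as BeautifulSoup is likely to be more robust on real pages:

```python
from html.parser import HTMLParser

# Tags after which a line break is inserted (an illustrative, incomplete selection)
BLOCK_TAGS = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}

class TextExtractor(HTMLParser):
    """Collect text content, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)

def html_to_text(html):
    """Return the text of an HTML string, one non-empty line per block element."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    lines = (" ".join(line.split()) for line in "".join(parser.parts).splitlines())
    return "\n".join(line for line in lines if line)

sample_html = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>Title</h1><p>Hello <b>world</b>!</p></body></html>"
)
text = html_to_text(sample_html)
print(text)
```

With a downloaded page, the call would be `html_to_text(html)`.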
Already in plain-text format (e.g. from Project Gutenberg):
from urllib import request
url = "http://www.gutenberg.org/files/2554/2554-0.txt"
response = request.urlopen(url)
raw = response.read().decode('utf8')
print(raw[1000:1275])  # show a short excerpt
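Project Gutenberg files begin and end with licence text that is usually not wanted in a training corpus. A minimal sketch of cutting it away, assuming the file contains the usual `*** START OF ...` and `*** END OF ...` marker lines; the function name `strip_gutenberg_boilerplate` and the sample text are illustrative:

```python
import re

# Marker lines as they commonly appear in Project Gutenberg plain-text files
START_RE = re.compile(
    r"^\*\*\* ?START OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$", re.MULTILINE
)
END_RE = re.compile(
    r"^\*\*\* ?END OF (?:THE|THIS) PROJECT GUTENBERG EBOOK.*$", re.MULTILINE
)

def strip_gutenberg_boilerplate(raw):
    """Return the text between the marker lines, or raw unchanged if a marker is missing."""
    start = START_RE.search(raw)
    end = END_RE.search(raw)
    if not start or not end or end.start() < start.end():
        return raw
    return raw[start.end():end.start()].strip()

# Hypothetical miniature file for demonstration
sample = (
    "The Project Gutenberg EBook of Example\n"
    "*** START OF THIS PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "\nCHAPTER I\nIt was a dark night.\n\n"
    "*** END OF THIS PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    "License text follows.\n"
)
body = strip_gutenberg_boilerplate(sample)
print(body)
```

Applied to the download, this would be `strip_gutenberg_boilerplate(raw)`; if the markers are not found, the text is returned unchanged so nothing is lost silently.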