Time | Nickname | Message
00:03 | | Stilett0 has joined #archiveteam-bs
00:04 | ola_norsk | JAA: aye..trying to archive a twitter hashtag has taught me that :/ "There was a problem loading..(retry button)"
00:05 | JAA | Yeah, Twitter's also pretty good at not letting you grab everything.
00:05 | JAA | Reddit as well.
00:05 | JAA | (We were having a discussion about that earlier in #archivebot.)
00:05 | JAA | At least you can iterate over all thread IDs in a reasonable amount of time on Reddit though.
00:07 | JAA | So it appears that you can get 10k results from the vid.me API.
00:07 | ola_norsk | i feel naughty doing curl requests to https://web.archive.org/save/https://twitter.com/hashtag/netneutrality?f=tweets , currently every 3rd minute :/
00:07 | JAA | You can do that for different categories, new/hot, and probably search terms (didn't try).
00:09 | JAA | There are 17 categories plus hot, new, and team picks. In the ideal case, that means 20 sections times 10k results, which is still only about 1/7th of the whole site.
00:09 | JAA | This is only about how to gather lists of videos and their metadata (uploader, description, etc.), not the actual videos.
00:09 | JAA | (Videos are available as Dash and HLS streams.)
00:10 | JAA | There are also tags, and of course you can retrieve all (?) of an uploader's videos.
00:10 | ola_norsk | JAA: As for twitter, i think one problem is that they would easily present an archive of ANYTHING, as long as they get paid for it.
00:10 | JAA | For each tag, you get hot, new, and top videos.
00:11 | JAA | ola_norsk: Yeah, probably.
00:11 | ola_norsk | JAA: most definitely
00:14 | JAA | There's a "random video" link. We could hammer that to get videos. I don't want to do the math on how many times we'd need to retrieve it to discover the vast majority of all videos right now though.
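
The math being skipped here is the coupon collector problem. As a rough sketch, assuming each "random video" hit returns one of n videos uniformly at random (an assumption; the site's sampling is unknown), the expected number of requests to see every video at least once is

    n * H(n) ≈ n * ln(n)

With the n ≈ 1.36 million videos mentioned below, that works out to roughly 1.36e6 * 14.1 ≈ 19 million requests, which is why the list API (about 190k requests, discussed later in the log) is the saner route.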
00:14 | ola_norsk | JAA: for a legal warrant, or a slump of money, they could present all tweets with any hashtag, since the dawn of ti..twitter
00:14 | JAA | Ah, I thought you were talking about vid.me now.
00:15 | JAA | Yeah, there is a company which has an entire archive of Twitter, I believe.
00:15 | ola_norsk | ah, sorry, that was just a link regarding GOG Connect
00:16 | JAA | Ah, you're not in #archiveteam. vid.me is shutting down on Dec 14.
00:16 | JAA | That's why I'm looking into them.
00:16 | ola_norsk | really? that soon?
00:16 | JAA | https://medium.com/vidme/goodbye-for-now-120b40becafa
00:16 | ola_norsk | wow, that's going to piss off a lot of germans :D
00:17 | JAA | ola_norsk: I was thinking about Gnip, by the way. Looks like Twitter bought them a few years ago.
00:18 | ola_norsk | "We're building something new." ..
00:19 | ola_norsk | a.k.a "Trust us, we're not completely destroying this shit..We're building something new!"..
00:20 | Frogging | free image/video host "couldn't find a path to sustainability"
00:20 | ola_norsk | man, i actually thought vid.me had something good going
00:20 | Frogging | what a surprise :p
00:21 | ola_norsk | https://archive.org/details/jscott_geocities
00:24 | ola_norsk | wow, there's actually people who cancelled their youtube accounts after having used vid.me's easy export solution
00:24 | ola_norsk | and as far as i know, that shit might not be so easy to export back, since i don't think YT does import by url..
00:25 | ola_norsk | oh well
00:28 | omglolbah | why not upload to both? <.<
00:28 | ola_norsk | aye
00:30 | ola_norsk | omglolbah: according to "SidAlpha", if you know that youtuber, he wouldn't because it would mean he'd have to interact on several platforms..
00:30 | omglolbah | If only he had moved to vidme
00:31 | ola_norsk | that was his response to the request for that: not to move, but to upload there as well
00:31 | omglolbah | no, I'm saying I wished he had moved so that he would be gone :p
00:31 | ola_norsk | oh
00:34 | ola_norsk | where does shit go if Youtube goes though? I mean, Google Video went to Youtube..
00:36 | ola_norsk | Where did Yahoo Video go?
00:36 | ola_norsk | Justin.tv became Twitch right?
00:37 | zino | Justin.tv created Twitch and then closed down. Nothing was automatically moved. I don't know if Justin had vods though.
00:37 | ola_norsk | aye
00:38 | JAA | I was wrong about the vid.me API not returning all results.
00:38 | JAA | The actual API does return everything, or at least nearly everything.
00:38 | JAA | The "API" used by the website doesn't.
00:38 | JAA | I just didn't find the real API docs previously.
00:38 | JAA | https://docs.vid.me/#api-Videos-List
00:38 | JAA | No auth required either.
00:38 | JAA | You can get chunks of 100 videos per request.
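
A rough sketch of paging through that endpoint, assuming it takes the offset/limit parameters JAA refers to just below (parameter names taken from the discussion, not verified against the docs), and using the ~1.36M total video count mentioned later:

    # Page through the vid.me list API, 100 videos per request: ~13,600 pages total.
    for offset in $(seq 0 100 1360500); do
        curl -s "https://api.vid.me/videos/list?offset=${offset}&limit=100" > "videos_${offset}.json"
    done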
00:38 | zino | \o/
00:39 | zino | Do we have a death date?
00:39 | JAA | It gets quite slow for large offsets, indicating that they don't know how to use offsets.
00:39 | JAA | 14 Dec
00:39 | zino | :-/
00:39 | JAA | how to use indices*
00:39 | zino | indexes?
00:40 | Frogging | where would youtube go? nowhere. it's too big :p
00:40 | JAA | I never know which plural's correct.
00:40 | ola_norsk | Frogging: aye
00:40 | bithippo | Frogging: we'll just show up with a tractor trailer "Load it all in back y'all"
00:40 | JAA | The real API returns a bit more videos, by the way: 1360532.
00:40 | JAA | (About 11k more, specifically.)
00:41 | JAA | Might be the NSFW/unmoderated/private filter stuff.
00:41 | JAA | bithippo: YouTube is around 1 exabyte. Have fun with that.
00:41 | JAA | Well, at least that order of magnitude.
00:41 | bithippo | I used to manage hundreds of petabytes :-P
00:41 | * | ola_norsk shoves in his usb stick and applies youtube-dl !
00:41 | bithippo | lol
00:42 | JAA | I'm sure someone from China will sell you a 1 EB USB stick if you ask them.
00:42 | JAA | Well, "1 EB".
00:42 | bithippo | Which will quickly err out once a few GB have been written .... :(
00:42 | JAA | Yep
00:42 | ola_norsk | i'll just save it all in /dev/null
00:42 | JAA | Or not error out, just overwrite the previous data etc.
00:43 | zino | Depends. Some of them are cyclical, so you can write all you want as long as you don't try to read it. :)
00:43 | JAA | Yep
00:43 | JAA | I'm a fan of S4.
00:43 | JAA | The Super Simple Storage Service.
00:43 | JAA | http://www.supersimplestorageservice.com/
00:44 | bithippo | That pricing is a bargain.
00:44 | zino | bithippo: Interesting. What did you work with that included 100s of PiBs? I deal in 10s of them.
00:44 | bithippo | Data taking for an LHC detector
00:44 | JAA | Ooh, nice!
00:45 | bithippo | Only a couple hundred TB of spinning disk on storage arrays, the rest were tape archive libraries.
00:45 | zino | bithippo: Ah. I sort of do that on the sly. Part of our storage is for the Nordic LHC grid.
00:45 | bithippo | #TeamCMS
00:46 | zino | I deal mostly with climate data though. Have a few petabytes of that.
00:46 | bithippo | That's awesome.
00:46 | bithippo | I <3 big data sets
00:47 | zino | Indeed :)
00:47 | dashcloud | @ola_norsk If you're interested in how to make something be emulated on IA, here are some pages that lay it out for you: http://digitize.archiveteam.org/index.php/Internet_Archive_Emulation http://digitize.archiveteam.org/index.php/Making_Software_Emulate_on_IA
00:49 | ola_norsk | dashcloud: ty, i'm thinking there must be ways. If there's dosbox, there's e.g. Frodo that could run in that..
00:50 | dashcloud | I've done a bunch of DOSBOX games, and there's a whole collection of emulated DOS/Win31/Mac Classic stuff up
00:51 | zino | ola_norsk: What, the C64 emu?
00:51 | ola_norsk | yes
00:51 | zino | No, nonononono. Go help the jsmess people get Vice running instead.
00:52 | ola_norsk | i was hoping that was already done
00:52 | zino | I know it's started.
00:52 | ola_norsk | good stuff
00:52 | zino | But it might be stalled forever for all I know.
00:54 | ola_norsk | i have no idea about these things, but it would be cool to see C64 on Internet Arcade
00:54 | zino | JAA: I'll have very little time to do anything before the 9th, and probably not much after either, but ping me if storage is needed for vid.me.
00:55 | ola_norsk | dashcloud, i'll try to make an item per that, using dosbox
00:55 | ola_norsk | ty for info
00:56 | JAA | zino: Will do. I'll set up a scrape of the API first to get all the relevant information about the platform. Then we'll see.
00:57 | dashcloud | if your software needs installation or configuration before the first run, you'll want to do that ahead of time
00:57 | JAA | scrape/archive, whatever. That's the information we can save for sure.
00:57 | JAA | Unless they ban us...
00:58 | JAA | Using minVideoId and maxVideoId might be faster than the offset/limit method, especially for the later pages.
00:59 | JAA | Current video IDs are slightly above 19 million, so that's around 190k requests (to be sure no videos are missed).
01:03 | jrwr | attending my first 2600 meeting
01:05 | wp494 | so the thing with vidme, there's a bunch of original stuff
01:06 | wp494 | there's a little bit of lewd stuff (they ban outright porn, but they do permit "artistic" nsfw)
01:06 | wp494 | and then there's a bit of it that consists of reuploads of copyrighted stuff
01:06 | jrwr | OK
01:06 | wp494 | not that I think it'll be a big deal since IA can just dark the affected stuff if someone does come yelling, but something to keep in mind
01:07 | JAA | Sounds more or less what I'd expect.
01:07 | JAA | like*
01:09 | JAA | I can't find any information about API rate limits, except this Reddit thread: https://redd.it/6acvg5
01:11 | | icedice has quit IRC (Quit: Leaving)
01:18 | | Ceryn has quit IRC (Connection closed)
01:26 | ola_norsk | "The Internet is Living on Borrowed Time" .. https://vid.me/1LriY (ironically on vid.me) ..That's a pretty dark title, for being Lunduke :d
01:33 | JAA | To be fair, it's also available on YouTube: https://www.youtube.com/watch?v=1VD_pJOFnZ0
01:54 | ola_norsk | that's not fair :D
01:54 | ola_norsk | i think most of his vids are also on IA :d
01:55 | ola_norsk | but yeah
01:57 | ola_norsk | seriously though. I imagine there's a shitload of german vidme'ers currently bewildered as to what to do..
01:59 | ola_norsk | a lot of people used the url importing at vidme, thinking they would simply move their entire channels..
02:00 | ola_norsk | from what i've heard tales of, germany youtube is not the same youtube as everywhere elsetube
02:07 | ranma | GEMA blocks a fuckton of music there
02:07 | ola_norsk | aye
02:09 | ola_norsk | ranma: is that the only reason though? There were so many germans coming to vid.me that a video was made about it..
02:12 | ranma | JAA: how do you get your data OUT of S4?
02:12 | ranma | and what are the costs?
02:13 | CoolCanuk | s4?
02:13 | ola_norsk | ranma: "German INVASION"...100k creators..https://vid.me/JjNaH
02:14 | ranma | oh, it's a joke :'(
02:14 | CoolCanuk | i hate slow internet. ml
02:14 | CoolCanuk | *fml
02:18 | | phuzion has joined #archiveteam-bs
02:19 | ola_norsk | ranma: does it simply block ALL music? i can't see any other reason for such a noticeable influx and flight of users
02:22 | ola_norsk | ranma: It's actually hard to browse vidme because of it at times, since often 1 in 2 videos on the feed is german
02:22 | ranma | kinda wish some site could ZIP/7z another site
02:23 | ranma | just noticed archivebot slurped down https://ftp.modland.com/
02:23 | CoolCanuk | did it *completely* slurp modland?
02:23 | ola_norsk | dd -i http://google.com -o http://bing.com
02:25 | ranma | <Major> Muad-Dib: Your job for https://ftp.modland.com/ has finished.
02:25 | ranma | actually, not that i'd have the space for it, tho
02:26 | | ola_norsk has quit IRC (its the beer talking)
02:34 | CoolCanuk | does anyone have a great upload script for ia? their docs are too much for me to understand and uploading 1 by 1 is painful
02:34 | ez | for anyone wanting to mirror vid.me, it's possible to page everything there: https://api.vid.me/videos/list?minVideoId=100&maxVideoId=1000
02:34 | ez | just step the min/max (it's easier on the db).
02:35 | CoolCanuk | ..... https://usercontent.irccloud-cdn.com/file/PZalOsZ6/image.png
02:35 | ez | JAA: ^
02:39 | CoolCanuk | our wiki is more stable than this beta-like system :P
02:40 | bithippo | CoolCanuk: What do you mean by "upload script for ia"?
02:40 | bithippo | Such as https://github.com/jjjake/internetarchive ?
02:41 | CoolCanuk | an easier way
02:41 | CoolCanuk | eg I can loop it for 100s of files in a folder, but upload as 100 items.
02:42 | bithippo | That repo is your best bet for that sort of operation.
02:42 | bithippo | What sort of files and metadata?
02:43 | CoolCanuk | pdf
02:43 | CoolCanuk | currently, newspapers
02:44 | CoolCanuk | and sears crap
02:46 | bithippo | Hmm
02:46 | bithippo | The two routes would be "web interface", which gives you a nice interface and shouldn't be too painful if you're putting up each folder as an item (with all of the files contained within that folder attributed to the item). Failing that, you'd need some light python or bash scripting skills to pick up files per item, associate metadata with each item, and upload.
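
The scripting route bithippo describes could look something like this, assuming the `ia` command-line tool from the jjjake/internetarchive repo linked above is installed and configured (`ia configure`), and that each folder of PDFs becomes one item named after its folder. The identifier and metadata scheme here is illustrative, not prescribed:

    # One IA item per folder: folder name as identifier and title, all PDFs inside attached.
    for dir in */; do
        id="${dir%/}"
        ia upload "$id" "$dir"*.pdf \
            --metadata="title:${id}" \
            --metadata="mediatype:texts"
    done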
02:46
π
|
bithippo |
I could be wrong of course! But that's my interpretation based on working with the IA interfaces. |
02:47
π
|
ez |
tbh, IA interface is just plain atrocious to use |
02:47
π
|
bithippo |
Indeed. |
02:47
π
|
ez |
i suppose thats artificial barrier of entry on purpose to avoid people uploading crap |
02:48
π
|
CoolCanuk |
I guess |
02:48
π
|
CoolCanuk |
I'm uploading stuff I know will probably not be found anywhere else |
02:49
π
|
ez |
yea, the commitment to jump the hoops is paired with commitment to curate content |
02:50
π
|
CoolCanuk |
Only think I'm worried about is repetitive strain injury |
02:52
π
|
|
wp494_ has joined #archiveteam-bs |
02:59
π
|
|
wp494 has quit IRC (Read error: Operation timed out) |
03:03
π
|
|
ld1 has quit IRC (Quit: ~) |
03:06
π
|
|
ld1 has joined #archiveteam-bs |
03:19
π
|
CoolCanuk |
why does IA have a difficult time using the FIRST page of a pdf as the COVER >:( |
03:41
π
|
|
wp494_ is now known as wp494 |
03:47
π
|
MrRadar |
X-posting from #archiveteam: if you're using youtube-dl to grab vid.me content, be aware of this issue: https://github.com/rg3/youtube-dl/issues/14199 |
03:47
π
|
MrRadar |
tl;dr: their HLS streams return a data format youtube-dl doesn't fully handle resulting in corrupted output files |
03:48
π
|
MrRadar |
Use a workaround in the 2nd to last comment to force youtube-dl to grab from the DASH endpoints instead |
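
The linked issue has the exact workaround; the general shape, assuming vid.me's DASH formats show up in youtube-dl's format listing, is to inspect the formats and pin a non-HLS format id explicitly (the URL and format id below are placeholders, not the issue's actual recipe):

    # Inspect the available formats, then pick a DASH format id instead of the default HLS one.
    youtube-dl -F 'https://vid.me/SOMEID'
    youtube-dl -f 'DASH_FORMAT_ID+bestaudio/best' 'https://vid.me/SOMEID'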
03:54
π
|
wp494 |
posting highlights of https://www.youtube.com/watch?v=KMaWSinw4MI&t=41m33s here |
03:55
π
|
wp494 |
first one being that linus has significant disagreements with senior management, and especially NCIX's owner |
03:55
π
|
wp494 |
which seems to be a very common theme |
03:56
π
|
wp494 |
he also left NCIX because the people he mentored departed |
03:56
π
|
wp494 |
says he thinks some were forced out because of extraordinarily poor management decisions |
03:57
π
|
wp494 |
(in his opinion) |
03:57
π
|
Frogging |
I'm reading this https://np.reddit.com/r/bapcsalescanada/comments/77h771/for_anyone_that_purchased_a_8700k_from_ncix/domm2ca/?context=3 |
04:09
π
|
|
josho493 has joined #archiveteam-bs |
04:09
π
|
wp494 |
linus pitched what sounded like a pretty good idea, try and get bought, but how? his solution was to open "NCIX Lite"s across the country which would be really small pickup places that you could ship to since shipping direct to your home sometimes killed the deal |
04:10
π
|
wp494 |
said that the writing was on the wall as early as 7 years ago (before Amazon was doing pickup) to anyone actively paying attention, so the idea would've been to attract someone similar to Amazon if not Amazon themselves using that infrastructure when they wanted to gobble someone Whole Foods style |
04:12
π
|
wp494 |
linus said when management didn't do that, he said it became obvious that he had to GTFO |
04:12
π
|
wp494 |
he says he hasn't been screwed over personally by Steve (the owner) and his wife unlike some of the other horror stories going out |
04:16
π
|
wp494 |
he wound up signing a non-compete for 2 years (which got extended by 1) |
04:17
π
|
wp494 |
when he left he took the LTT assets, and did it on paper (and was glad he did), because even though he wouldn't think Steve would do anything untoward to him, creditors are sharks looking for their next kill |
04:19
π
|
wp494 |
and that's about it |
04:27
π
|
ez |
why do people bother with fairly standard eshop drama, was ncix the canadian amazon or something? |
04:27
π
|
MrRadar |
More like Canadian Newegg... before Newegg moved into Canada |
04:28
π
|
MrRadar |
It was *the* place to go for computer parts online, from what I understand |
04:28
π
|
ez |
ah |
04:28
π
|
ez |
razor thin margins, yea we have that locally too |
04:28
π
|
ez |
all with fake "in stock" stickers where you wait 2 weeks and everything |
04:30
π
|
|
qw3rty115 has joined #archiveteam-bs |
04:34
π
|
|
qw3rty114 has quit IRC (Read error: Operation timed out) |
04:38
π
|
Frogging |
they had a location here in Ottawa. I used to shop there until they closed it |
05:07
π
|
|
josho493 has quit IRC (Quit: Page closed) |
05:09
π
|
CoolCanuk |
defunct as of today :o |
05:09
π
|
CoolCanuk |
*yesterday |
05:11
π
|
|
Mateon1 has quit IRC (Ping timeout: 245 seconds) |
05:12
π
|
|
Mateon1 has joined #archiveteam-bs |
05:15
π
|
CoolCanuk |
am I the only one who doesnt really see the big deal of google home/mini or amazon alexa? |
05:29
π
|
|
ranavalon has quit IRC (Read error: Connection reset by peer) |
05:36
π
|
Frogging |
there's a big deal? |
05:38
π
|
|
shindakun has joined #archiveteam-bs |
05:38
π
|
|
Jcc10 has joined #archiveteam-bs |
05:42
π
|
ez |
CoolCanuk: we're all waiting for amazon to give access to alexa transcripts to app devs |
05:42
π
|
ez |
so we can start archiving every little embarassing thing anyone has ever said |
05:42
π
|
wp494 |
which of vidme's logos should I use for the article, the wordmark or their "astro" mascot |
05:42
π
|
wp494 |
https://vid.me/media |
05:48
π
|
CoolCanuk |
the one on the main page (red) |
05:48
π
|
wp494 |
wordmark it is |
05:48
π
|
CoolCanuk |
sadly cant be eps or svg :( |
05:49
π
|
wp494 |
gonna resize it a little otherwise it'll appear about as big in a warrior project |
05:49
π
|
CoolCanuk |
or we could fix the template |
05:49
π
|
CoolCanuk |
wait what do you mean |
05:50
π
|
wp494 |
lemme go dig through the spuf logs to show you |
05:50
π
|
wp494 |
(come to think of it I'm not even sure if I took an image, I might have just pull requested and moved on) |
05:51
π
|
CoolCanuk |
our {{Template project}} should be fixed to a larger logo size |
05:51
π
|
CoolCanuk |
using it online is not an issue, because we can dynamicly resize |
05:51
π
|
CoolCanuk |
http://tracker.archiveteam.org/ |
05:52
π
|
wp494 |
yeah there it could benefit from being a touch bigger at least for logos that are rectangles instead of squares |
05:52
π
|
CoolCanuk |
apparently we can't... :| |
05:52
π
|
wp494 |
(it seems to like squares the best) |
05:52
π
|
CoolCanuk |
"benefit"? |
05:52
π
|
CoolCanuk |
distortion? |
05:52
π
|
wp494 |
and yeah, I was about to say, our copy of mediawiki isn't quite as flexible as wikimedia's where you can stuff in any number and it'll spit it out for you |
05:52
π
|
wp494 |
even ridiculously large ones like 10000px |
05:52
π
|
CoolCanuk |
I just noticed that. that's too bad |
05:53
π
|
CoolCanuk |
another reason to use SVG. |
05:54
π
|
wp494 |
even SVGs too |
05:54
π
|
CoolCanuk |
no. SVGs are not raster |
05:55
π
|
CoolCanuk |
you can blow them up to 1000000000px and it will never distort unless you have embedded rasters |
05:57
π
|
CoolCanuk |
https://upload.wikimedia.org/wikipedia/commons/3/35/Tux.svg |
06:02
π
|
wp494 |
ok I was gonna recreate an example with SPUF but there's a live one that I can get you right now |
06:03
π
|
wp494 |
see how the miiverse logo goes a bit out of its bounds and pushes content downwards: https://i.imgur.com/P3Wcfbp.png |
06:03
π
|
CoolCanuk |
ew |
06:03
π
|
CoolCanuk |
logo should be within that white div, not yellow |
06:04
π
|
CoolCanuk |
(within, not overlaid) |
06:04
π
|
wp494 |
now take the version of the steam icon we had stored on the wiki and stuffed into the project code (http://www.archiveteam.org/images/4/48/Steam_Icon_2014.png) and it wound up being a bit worse than that example |
06:04
π
|
wp494 |
luckily a 100px version that mediawiki gracefully generated more or less solved things: https://github.com/ArchiveTeam/spuf-grab/pull/2/commits/1c319d3d144cc13599f1fe571e699ca8b3d79e60 |
06:04
π
|
CoolCanuk |
not the image's fault, it's the tracker ;) |
06:05
π
|
wp494 |
afaik tracker main page was ok |
06:05
π
|
CoolCanuk |
how could it be ok |
06:05
π
|
wp494 |
note how it looks like it's fine on http://tracker.archiveteam.org/ |
06:05
π
|
CoolCanuk |
simply use max-width for img in css |
06:05
π
|
CoolCanuk |
*height |
06:06
π
|
wp494 |
but with that said scroll bars do appear |
06:06
π
|
CoolCanuk |
then you need to |
06:06
π
|
CoolCanuk |
overflow: hidden |
06:06
π
|
wp494 |
but it's nothing near as annoying as the in-warrior example, though still a nuisance albeit very minor |
06:07
π
|
CoolCanuk |
I will fix it |
06:08
π
|
wp494 |
k so a 600 x 148 version will go up on the wiki |
06:08
π
|
wp494 |
and then if it causes problems we can grab a 100px url |
06:08
π
|
wp494 |
for project code |
06:09
π
|
CoolCanuk |
we have or |
06:09
π
|
CoolCanuk |
**or |
06:09
π
|
CoolCanuk |
just use max-height: 100px |
06:09
π
|
CoolCanuk |
;) |
06:09
π
|
wp494 |
ok project page is going up |
06:09
π
|
CoolCanuk |
lol how did it let you upload file name with a space :P |
06:09
π
|
CoolCanuk |
it makes me use _ |
06:10
π
|
wp494 |
it does insert a _ |
06:10
π
|
wp494 |
the recent changes bot treats it as a space though |
06:11
π
|
wp494 |
but for actually using the filename you're going to need to use underscores |
06:11
π
|
CoolCanuk |
o |
06:11
π
|
wp494 |
aw crap I'm getting spam filtered and I don't even get a prompt to put in the secret phrase |
06:12
π
|
wp494 |
oh well let's see if this workaround of inserting a space in the url works |
06:12
π
|
CoolCanuk |
heh |
06:12
π
|
CoolCanuk |
SHHHH that's supposed to be a secret :x |
06:13
π
|
wp494 |
ok wow that apparently worked |
06:15
π
|
CoolCanuk |
i'll fix it for ya |
06:15
π
|
wp494 |
gl with the filter |
06:15
π
|
CoolCanuk |
oh you fixed it |
06:15
π
|
wp494 |
I was surprised I was even able to toss such a tiny little stone at that goliath |
06:18
π
|
wp494 |
ok that's a solid foundation I think |
06:19
π
|
CoolCanuk |
huh |
06:20
π
|
CoolCanuk |
I have a workaround :P |
06:21
π
|
|
slyphic has quit IRC (Read error: Operation timed out) |
06:21
π
|
Odd0002 |
I got a 508 clicking that purplebot link |
06:21
π
|
SketchCow |
godane: What does "WOC" mean with the MPGs? |
06:21
π
|
Odd0002 |
resource limit reached |
06:22
π
|
CoolCanuk |
this 208 error will be the death of me |
06:22
π
|
CoolCanuk |
*508 |
06:23
π
|
Odd0002 |
connection timed out now... |
06:23
π
|
CoolCanuk |
same here ughhhh |
06:24
π
|
CoolCanuk |
ffff |
06:24
π
|
CoolCanuk |
impossible to eidt |
06:25
π
|
Odd0002 |
oh finally |
06:25
π
|
CoolCanuk |
there must be more than just "shared hosting" being the problem |
06:28
π
|
CoolCanuk |
can the topic in #archiveteam changed from Compuserve to vidme? lmfao |
06:28
π
|
CoolCanuk |
*be changed |
06:32
π
|
wp494 |
if it gets pointed out a few times like with compuserve then someone will probably do it |
06:32
π
|
wp494 |
if it's just once or twice more then it's no big deal just say "yeah we're on it" |
06:33
π
|
CoolCanuk |
fair |
07:06
π
|
CoolCanuk |
making up a tag for vidme is going to be tricky. it's so short.. hard to come with a spinoff |
07:42
π
|
|
Pixi has quit IRC (Ping timeout: 255 seconds) |
07:42
π
|
|
Pixi has joined #archiveteam-bs |
07:44
π
|
|
BlueMaxim has quit IRC (Ping timeout: 633 seconds) |
07:45
π
|
|
BlueMaxim has joined #archiveteam-bs |
09:05
π
|
|
Dimtree has quit IRC (Peace) |
09:11
π
|
|
Dimtree has joined #archiveteam-bs |
09:55
π
|
|
fie has quit IRC (Ping timeout: 245 seconds) |
10:11
π
|
|
fie has joined #archiveteam-bs |
10:16
π
|
|
CoolCanuk has quit IRC (Quit: Connection closed for inactivity) |
10:27
π
|
|
BlueMaxim has quit IRC (Read error: Connection reset by peer) |
10:35
π
|
|
schbirid has joined #archiveteam-bs |
11:08
π
|
|
fie has quit IRC (Ping timeout: 246 seconds) |
11:21
π
|
|
fie has joined #archiveteam-bs |
11:33
π
|
|
bithippo has quit IRC (My MacBook Air has gone to sleep. ZZZzzzβ¦) |
11:44
π
|
|
jschwart has joined #archiveteam-bs |
12:11
π
|
JAA |
ez: Yep, that's what I came up with yesterday as well. You can either iterate min/maxVideoId in blocks of 100 with limit=100 or implement pagination. I'd probably go for the former, i.e. retrieve video IDs 1 to 100, 101 to 200, etc. (need to figure out whether these parameters are exclusive or not though). |
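
A sketch of that former approach, stepping the ID range in blocks of 100 as ez suggests. Whether minVideoId/maxVideoId are inclusive is still unverified, as JAA notes, so the block arithmetic here is an assumption:

    # Walk the ~19M ID space in blocks of 100 via minVideoId/maxVideoId: ~190k requests.
    for min in $(seq 1 100 19000000); do
        max=$((min + 99))
        curl -s "https://api.vid.me/videos/list?minVideoId=${min}&maxVideoId=${max}&limit=100" \
            > "videos_${min}.json"
    done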
12:17
π
|
|
MangoTec has joined #archiveteam-bs |
12:41
π
|
jrwr |
my god |
12:41
π
|
jrwr |
the best thing I've ever heard just got tweeted |
12:41
π
|
jrwr |
@ElonMusk: Payload will be my midnight cherry Tesla Roadster playing Space Oddity. Destination is Mars orbit. Will be in deep space for a billion years or so if it doesnβt blow up on ascent. |
12:57
π
|
zino |
Elon knows how to put on a show. |
12:57
π
|
jrwr |
Yep |
12:58
π
|
jrwr |
I mean, he thinks its going to blow, they didn't want to make a real payload... so fuck it send a Car |
13:01
π
|
zino |
At this time I'll recommand the old Top Gear episode where they convert a car to a space shuttle and blast it off with rockets. |
13:01
π
|
zino |
recommend* |
13:24
π
|
|
MangoTec has quit IRC (Quit: Page closed) |
13:43
π
|
schbirid |
hetzner's auctions seem to have dropped in price a lot, 1/3 aka -10β¬ for what i have |
13:43
π
|
schbirid |
https://www.hetzner.com/sb |
13:44
π
|
schbirid |
nvm, had fucking US version without VAT =( |
13:49
π
|
odemg |
https://medium.com/vidme/goodbye-for-now-120b40becafa |
13:49
π
|
odemg |
https://medium.com/vidme/goodbye-for-now-120b40becafa |
13:49
π
|
odemg |
https://medium.com/vidme/goodbye-for-now-120b40becafa |
13:49
π
|
odemg |
What the fuck!! |
13:49
π
|
odemg |
Okay you know about it |
13:49
π
|
odemg |
but what the actual fuck! |
14:10
π
|
jrwr |
people are finding out its REALLY hard to make a video website |
14:16
π
|
odemg |
It's easy to make a video site, it's just hard to monetise it, mediacru.sh was the best in terms of technology in my opinion but they didn't manage to monetise either. |
14:16
π
|
|
ranavalon has joined #archiveteam-bs |
14:17
π
|
|
ranavalon has quit IRC (Remote host closed the connection) |
14:17
π
|
odemg |
I'm collecting video ids from reddit anyways, heads up the bulk of the older urls (and possibly new ones) are going to be reddit porn related. |
14:17
π
|
|
ranavalon has joined #archiveteam-bs |
14:18
π
|
schbirid |
wait, youtube is still operating at loss |
14:18
π
|
schbirid |
why the FUCK are people making so much money on their ad share then? |
14:18
π
|
schbirid |
*? |
14:21
π
|
|
voidsta has joined #archiveteam-bs |
14:22
π
|
odemg |
Google isn't operating at a loss, so they can keep YouTube afloat and keep trying new things to pump up their bottom line, which is why we see a new yt related shit storm every other week, yt may as well be called YouTube[beta] or YouTube[this is an experiment] |
14:23
π
|
schbirid |
YouTube{incredible journey] |
14:24
π
|
odemg |
Though because it's Google ad because there is no real competition for them making any real headway we can talk like yt is 'never' going to close doors, or turn their service off, but it'll come, maybe not today, maybe not in 5 years, but it'll come when we're 'what the fucking' at a Google blog post announcing there coming plans to phase out YouTube or just turn it off. |
14:26
π
|
odemg |
Hopefully that comes at a time 500PB* is nothing and something we can grab in a few months |
14:26
π
|
JAA |
... except YouTube will be 10 EB by then. |
14:27
π
|
Kaz |
wait what |
14:27
π
|
Kaz |
vimeo is dead? |
14:28
π
|
jrwr |
vid.me I though |
14:28
π
|
Kaz |
vid.me |
14:28
π
|
Kaz |
ffs it looked very close to vimeo |
14:28
π
|
odemg |
It's an odd time we're living in when we first started 10TB was insane to think we could get, now we're doing sites nearing 300TB without a great deal of thought, we're scaling pretty well with the times I suppose, but how long before ia close doors and we have to find somewhere to put that? (I know we're talking about it...) |
14:29
π
|
jrwr |
Ugh |
14:29
π
|
jrwr |
if IA ever goes bust |
14:29
π
|
Kaz |
so |
14:29
π
|
Kaz |
we have 2 weeks for vidme |
14:29
π
|
odemg |
yup |
14:29
π
|
JAA |
I'm setting up an API scrape right now. |
14:30
π
|
Kaz |
probably needs a channel, not sure how big it is |
14:30
π
|
odemg |
#vidmeh |
14:30
π
|
JAA |
1.3x million videos |
14:30
π
|
schbirid |
vidwithoutme, vidnee, vidmeh |
14:31
π
|
Kaz |
vidmeh will do |
14:31
π
|
zino |
This will almost need to be a warrior project. We can probably fix storage, but there is no way we can download this in time using a script-solution unless someone buys up Amazon nodes to do it. |
14:32
π
|
zino |
JAA: Any idea what the average size of a video is? |
14:32
π
|
JAA |
zino: I haven't looked at the videos themselves at all yet, only the metadata. |
14:32
π
|
JAA |
The API returns a link to download the videos as an MP4, by the way. |
14:32
π
|
JAA |
The website uses Dash/HLS. |
14:35
π
|
JAA |
Those MP4s are hosted on CloudFront, by the way, i.e. Amazon. That could be annoying. |
14:51
π
|
jrwr |
wiki is slow as balls |
15:07
π
|
|
voidsta has left |
15:08
π
|
MrRadar |
zino: I've been scraping a few channels and here's what I've seen so far. Their highest quality is 2 mbps video (at 1080p or 720p depending on the original resolution) with audio between 128kbps and 320 kbps(!) |
15:09
π
|
MrRadar |
SD-quality video is around 1200 kbps |
15:09
π
|
jrwr |
Ugh |
15:10
π
|
jrwr |
thats not too bad overall |
15:10
π
|
MrRadar |
And I'm grabbing with youtube-dl's "bestvideo+bestaudio" option, if storage/bandwidth becomes an issue they have lower-quality versions we could grab instead |
15:11
π
|
jrwr |
Na |
15:11
π
|
jrwr |
We have da powerrrr |
15:11
π
|
jrwr |
right now I'm working on the grabber, mostly just going to mod eroshare-grab |
15:12
π
|
MrRadar |
Some files are randomly capped at 150 KB/s download while others will saturate my 50 mbit connection |
15:12
π
|
jrwr |
the channel pages are going to be interesting since they scroll load type |
15:12
π
|
MrRadar |
As long as the URLs for those follow a pattern that shouldn't be too hard |
15:13
π
|
MrRadar |
Oh, I just noticed there's a channel, #vidmeh |
15:13
π
|
jrwr |
ya |
15:23
π
|
HCross2 |
Nothing is bloody working |
15:24
π
|
jrwr |
for what |
15:24
π
|
HCross2 |
I've spent all day trying to get my proxmox cluster sorted |
15:30
π
|
jrwr |
dat CDN |
15:35
π
|
|
Jcc10 has quit IRC (Ping timeout: 260 seconds) |
15:39
π
|
jrwr |
hay JAA your pulling all the APIs, are you saving all the reposes so we can get the raw URL for the videos? |
15:39
π
|
jrwr |
reponses* |
16:01
π
|
JAA |
Yeah, of course I save them. To WARC, specifically. |
16:04
π
|
|
kristian_ has joined #archiveteam-bs |
16:29
π
|
|
CoolCanuk has joined #archiveteam-bs |
16:33
π
|
|
fie has quit IRC (Ping timeout: 360 seconds) |
16:43
π
|
|
shin has joined #archiveteam-bs |
16:44
π
|
|
fie has joined #archiveteam-bs |
17:06
π
|
shindakun |
don't know if it will help but i was made a brute force video/metadata downloader for vidme https://github.com/shindakun/vidme i don't really have the bandwidth or storage to let it run though |
17:07
π
|
shindakun |
you guys already have a lot of tooling though |
17:07
π
|
JAA |
No need to bruteforce, we can get a list of all videos through their API. |
17:07
π
|
PurpleSym |
shindakun: /join #vidmeh |
17:07
π
|
JAA |
(I'm doing that currently.) |
17:08
π
|
shindakun |
that's basically what it does sort of... i found some seemed to be unlisted so i request details for every videoid |
17:08
π
|
shindakun |
off to vidmeh lol |
17:08
π
|
JAA |
Right. There's an API endpoint for getting lists of videos though, so you don't have to run through all ~19M IDs. |
17:09
π
|
JAA |
You can do it with 190k requests. With further optimisation, it might be possible to decrease that even further, but that's a bit more complex. |
17:09
π
|
|
ola_norsk has joined #archiveteam-bs |
17:11
π
|
ola_norsk |
made a test C64/dosbox emulator item (https://archive.org/details/iaCSS64_test) , but it seems very slow. At least on my potato pc. |
17:13
π
|
ola_norsk |
unfortunatly i'm no ms-dos guru. But might there be a way to optimize speed trough some dos utilites/settings that could reside in the zip file? |
17:14
π
|
zino |
You are emulating in two layers. It's not going to be fast, or accurate. |
17:16
π
|
ola_norsk |
yeah it's kind of emu-inception :d But, could fastER be done perhaps? |
17:17
π
|
ola_norsk |
i did try it in Brave browser as well as Chromium, and Brave seemed to run it a bit better. |
17:17
π
|
ola_norsk |
and my pc is kind of shit |
17:21
π
|
Igloo |
/join #vidmeh |
17:21
π
|
Igloo |
ahem |
17:27
π
|
|
Stilett0 has quit IRC (Ping timeout: 246 seconds) |
17:29
π
|
CoolCanuk |
ahhhhh. CLEVER |
17:32
π
|
|
Pixi has quit IRC (Quit: Pixi) |
17:38
π
|
|
kristian_ has quit IRC (Quit: Leaving) |
17:45
π
|
|
mundus201 is now known as mundus |
17:46
π
|
|
Pixi has joined #archiveteam-bs |
18:08
π
|
hook54321 |
How can I automatically save links from an RSS feed onto the wayback machine? |
18:08
π
|
|
pizzaiolo has joined #archiveteam-bs |
18:14
π
|
CoolCanuk |
i'd use something like this http://xmlgrid.net/xml2text.html . then get rid of the non urls in excel/google sheets. |
18:15
π
|
JAA |
Ew |
18:15
π
|
CoolCanuk |
then upload your list of urls to pastebin, get the raw link. in #archivebot , use !ao < PASTEBINrawLINK |
18:15
π
|
CoolCanuk |
you got a better idea, JAA ? :P |
18:15
π
|
ola_norsk |
if you have the links in a list; curl --silent --max-time 120 --connect-timeout 30 'https://web.archive.org/save/THE_LINK_TO_SAVE' > /dev/null , is a way to save them i think |
18:15
π
|
JAA |
Grab the feed, extract the links (by parsing the XML), throw them into wpull, upload WARC to IA. Throw everything into a cronjob, done. |
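
A minimal sketch of that cronjob's body, assuming a plain RSS feed with <link> elements, wpull installed, and the `ia` uploader configured. The feed URL and item name are placeholders, and the GNU grep one-liner is a quick hack rather than the proper XML parsing JAA suggests:

    #!/bin/bash
    # Pull item links out of the feed, crawl them into a WARC, push the WARC to IA.
    curl -s 'https://example.com/feed.xml' | grep -oP '(?<=<link>)[^<]+' > urls.txt
    wpull --input-file urls.txt --warc-file rss-snapshot --page-requisites
    ia upload my-rss-warcs rss-snapshot.warc.gz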
18:16 | CoolCanuk | o ok
18:16 | JAA | I suspect he's looking for something that doesn't require writing code though.
18:16 | CoolCanuk | most users are :P
18:17 | CoolCanuk | also why curl? can't we just use HTTP GET?
18:17 | astrid | that's what curl does
18:17 | JAA | That's what curl does. You could also use wget, wpull, or whatever else.
18:18 | JAA | Hell, you could do it with openssl s_client if you really wanted to.
18:18 | JAA | And yeah, you can obviously replace the "throw them into wpull, upload WARC to IA" with that.
18:18 | CoolCanuk | oh.. I thought curl downloads the web.archive.org page as well
18:18 | JAA | It wouldn't grab the requisites though, I think.
18:19 | JAA | CoolCanuk: That's exactly what it does, and it triggers a server-side archiving.
18:19 | CoolCanuk | unhelpful if you have a bad internet connection and don't want to download the archive.org page every request :P
18:20 | ola_norsk | idk :d i just use that as cronjobs to save tweets https://pastebin.com/raw/ZE4udKTi
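
The pastebin has ola_norsk's actual jobs; the general shape of such a cron entry, reusing the curl command quoted earlier in the log with the roughly 3-minute interval mentioned there, would be something like:

    # Every 3 minutes: hit the Save Page Now endpoint and discard the response.
    */3 * * * * curl --silent --max-time 120 --connect-timeout 30 'https://web.archive.org/save/https://twitter.com/hashtag/netneutrality?f=tweets' > /dev/null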
18:20 | arkiver | no page requisites are saved when you use /save/ like that
18:20 | arkiver | only the one URL you have after /save/
18:20 | arkiver | no images, or other stuff from the page is saved
18:21 | ola_norsk | doh
18:21 | CoolCanuk | (which is probably fine for net neutrality.. it should mostly be text/links to other sites)
18:22 | CoolCanuk | if there are any images, it's likely already been posted before
18:22 | arkiver | you can't see what picture is on a page if it's not saved
18:23 | arkiver | no matter how many times the picture might have been saved in other places across the web
18:23 | CoolCanuk | you can't see pictures that are still online?
18:23 | ola_norsk | twitter also uses their damn t.co url shortening
18:24 | arkiver | I think we save things in case they go offline
18:25 | astrid | <3
18:26 | CoolCanuk | (I hope that wasn't passive aggressive) :(
18:27 | * | arkiver isn't an aggressive person :)
18:27 | CoolCanuk | aggressive at archiving :P
18:27 | CoolCanuk | hehe
18:28 | arkiver | :)
18:29 | ola_norsk | i've been running those cronjobs since the 26th (i think). Should i perhaps just halt that idea then, or might it be useful data for someone else to dig through? At least the text and links are there i guess..
18:29 | ola_norsk | was planning to run them until the netneutrality voting stuff is over on the 14th(?)
18:30 | arkiver | text is always useful
18:30 | JAA | Definitely better than nothing.
18:30 | arkiver | I believe the data from Alexa on IA also does not include pictures
18:34 | arkiver | but I'm not totally sure about that
18:34 | ola_norsk | i'm just going to let it run then
18:35 | JAA | What does the /save/ URL return exactly? Are the URLs for page requisites also replaced with /save/ URLs?
18:35 | JAA | If so, it might be possible to use wget --page-requisites to grab them.
18:38 | ola_norsk | one sec
18:39 | ola_norsk | https://pastebin.com/raw/dJrVbnpr
18:40 | ola_norsk | that's what i get when running: curl -H "User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/62.0.3202.89 Chrome/62.0.3202.89 Safari/537.36" --silent --max-time 120 --connect-timeout 30 'https://web.archive.org/save/https://twitter.com/hashtag/netneutrality?f=tweets'
18:40 | JAA | Yep, everything there is also replaced with /save/ URLs.
18:40 | JAA | So give wget --page-requisites a shot if you want.
18:40 | JAA | (Plus a bunch of other options, obviously.)
18:41 | arkiver | JAA: yes
18:41 | ola_norsk | ok
18:41 | arkiver | I believe embeds are replaced with a /save/_embed/ URL and links with a /save/ URL
18:42 | JAA | Yep
18:44 | ola_norsk | by 'other options' do you mean just to make it run quiet?
18:44 | JAA | Yeah, and making it not write the files to disk.
18:45 | ola_norsk | ok
18:45 | JAA | Not sure what else you'd need for this.
18:45 | ola_norsk | me neither unfortunately, i had to browse a bit just to learn that much curl :d
18:52 | ola_norsk | but i'll check it out
18:53 | ola_norsk | i did ask info@archive.org if it's ok to do the curl commands so frequently (every 3-5 minutes), but no response back yet.
19:04 | ola_norsk | i just hope they won't suddenly go 'wtf is this!?' and block me :d
19:08 | | ZexaronS has joined #archiveteam-bs
19:08 | arkiver | no
19:08 | arkiver | it's just one URL that's saved per curl command
19:09 | arkiver | https://archive.org/details/liveweb?sort=-publicdate
19:12 | arkiver | the number of URLs per item in there is a lot higher than how many you are saving in a day
19:13 | ola_norsk | as long as it's fine with IA i'm good
19:14 | ola_norsk | arkiver: could there be a way to 'retro-crawl' the tweets i've already saved?
19:14 | ola_norsk | to get the images to load into the saves, i mean
19:15 | | Stilett0 has joined #archiveteam-bs
19:16 | ola_norsk | this is the mail i wrote on the 27th btw: https://pastebin.com/AV1vbKUr
19:16 | arkiver | I'm sure they're fine with it
19:16 | ola_norsk | good stuff
19:16 | arkiver | let me know if anything goes wrong
19:17 | ola_norsk | ok
19:17 | arkiver | with the 'retro-crawl', I guess you could get the older captures, get the URLs for the pictures from those and save those
19:17 | arkiver | but you can't really /save/ an old page again
19:19 | arkiver | or continue a /save/ or something
19:19 | ola_norsk | ok. I'm guessing at least some number of the tweets are bound to have been deleted by the users themselves (or banned user accounts).
19:20 | JAA | If you visit the pages, it should grab any images that aren't in the archives already.
19:20 | JAA | So I guess you could make your browser go through all those old crawls.
19:20 | ola_norsk | ouch
19:20 | ola_norsk | but yeah, that is what i meant :d
19:21 | JAA | Or perhaps it would work with wget --page-requisites as well, not sure.
19:32 | ola_norsk | i'll rather try that than sit scrolling in my browser :D
19:32 | ola_norsk | opening a capture in the browser does not seem to work to pull the images https://web.archive.org/web/20171130120002/https:/twitter.com/hashtag/netneutrality?f=tweets
19:36 | ola_norsk | only user avatars etc seem to be present
19:51 | ola_norsk | and those f*cking t.co links...pissing me off :/
19:52 | | dashcloud has quit IRC (Read error: Operation timed out)
19:58 | | dashcloud has joined #archiveteam-bs
19:58 | CoolCanuk | I'm not American and articles aren't helping... how fast is Cumulus Media declining?
20:01 | CoolCanuk | This looks like quite the "portfolio" https://en.wikipedia.org/wiki/List_of_radio_stations_owned_by_Cumulus_Media
20:01 | schbirid | does gdrive use some kind of incremental throttling for uploads? i am down to 1.5MB/s now :(
20:01 | schbirid | and it seems quite linear over time
20:02 | | bitspill has joined #archiveteam-bs
20:02 | ola_norsk | CoolCanuk: https://www.marketwatch.com/investing/stock/cmlsq ..Not sure if it's really indicative though
20:02 | CoolCanuk | omg
20:03 | CoolCanuk | 0.095?!
20:04 | CoolCanuk | iHeartRadio also seems troubled
20:06 | CoolCanuk | however, iHeartRadio in Canada is likely not impacted, since I'm pretty sure Bell purchased rights to use it and it's a crappy radio streaming app for Bell Media radio stations, not true iHeartRadio
20:07 | ola_norsk | CoolCanuk: All i see is the slope going down :d https://www.marketwatch.com/investing/stock/cmlsq/charts That's basically the max of my knowledge about stocks and shit :d
20:14 | CoolCanuk | same here
20:18 | | SimpBrain has quit IRC (Remote host closed the connection)
20:20 | ola_norsk | CoolCanuk: a friend of mine who unfortunately passed away in 2015 once showed me a daytrading software thingy. If i remember correctly, the only thing that differed from the free API testing was that all the data was delayed
20:28 | ola_norsk | CoolCanuk: it wouldn't be useful for trading, but perhaps for alerting about online services going to hell
20:29 | Kaz | schbirid: i think there's a limit of 750GB/day uploaded?
20:29 | Kaz | if you're close to that, could explain things
20:30 | schbirid | ah, maybe
20:31 | schbirid | nope... today is just at "Transferred: 104.014 GBytes (1.540 MBytes/s)"
20:31 | zino | schbirid, any packet loss?
20:31 | schbirid | no idea, how do i check?
20:32 | Frogging | ping maybe
20:32 | zino | Well, step one: Be on linux (and run the upload from the same machine), step two: run "mtr hostname.here"
20:32 | schbirid | no idea what the hostnames for gdrive are
20:32 | Frogging | oh yeah mtr that's better
20:32 | schbirid | mtr rules
20:32 | zino | Step 0: Install iftop and check what address all your data is going too. :)
20:33 | zino | to*
20:35 | schbirid | duh, i feel dumb
20:35 | zino | Don't. There are many ways to do this, and today you learned a new one.
20:36 | schbirid | relearned
20:38 | zino | There will be a test tomorrow on what all the flags to tar are and what they do!
20:38 | schbirid | i use longform
20:38 | schbirid | :P
20:38 | schbirid | tar is easy
20:54 | schbirid | looks like there is no traffic at all and rclone is doing some crap instead. makes sense to have the "speed" die down linearly then
20:55 | | dashcloud has quit IRC (Quit: No Ping reply in 180 seconds.)
20:56 | | SmileyG has joined #archiveteam-bs
20:57 | | Smiley has quit IRC (Read error: Operation timed out)
21:04 | ola_norsk | CoolCanuk: that cumulus media thing made my brain conjure up some silly idea https://pastebin.com/raw/32k6st0E
21:04 | | SmileyG has quit IRC (Ping timeout: 260 seconds)
21:07 | | dashcloud has joined #archiveteam-bs
21:19 | | Smiley has joined #archiveteam-bs
21:20 | Kaz | schbirid: any cpu activity from rclone?
21:20 | | BlueMaxim has joined #archiveteam-bs
21:57 | schbirid | i just straced it and it has connection timeouts all over
21:58 | | schbirid has quit IRC (Quit: Leaving)
22:00 | CoolCanuk | should Wikia be moved to Fandom, or is it okay to redirect Fandom to Wikia?
22:01 | ola_norsk | JAA: i tried this wget command, wget -O /dev/null --header="Accept: text/html" --user-agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:21.0) Gecko/20100101 Firefox/21.0" --quiet --page-requisites "https://web.archive.org/save/https://twitter.com/hashtag/bogus?f=tweets" ..it's 100% quiet, though it doesn't seem to return more than using curl did.
22:09 | ola_norsk | JAA: i won't know until the captures show up on wayback though
22:13 | JAA | ola_norsk: You might want to write a log file to figure out what it's doing exactly. -o is the option, I think.
22:14 | ola_norsk | JAA: without -O it does make a directory structure. but it doesn't seem to contain image data
22:15 | ola_norsk | JAA: It seems to be just the same data, only then in e.g. web.archive.org/save/https\:/twitter.com/hashtag/bogus\?f=tweets
22:17 | ola_norsk | in folders, i mean, instead of the (same?) data going to -O
22:19 | JAA | Hm
22:20 | ola_norsk | JAA: https://pastebin.com/FKu3mHbh this shows the structure of what it does
22:20 | ola_norsk | JAA: the 'hashtag/bogus?f=tweets' is the only file apart from robots.txt
22:20 | JAA | Right
22:21 | | noirscape has joined #archiveteam-bs
22:22 | ola_norsk | could Lynx browser be tricked into acting like a 'real' browser perhaps?
22:22 | | noirscape has quit IRC (Client Quit)
22:23 | JAA | I doubt it.
22:23 | | fie has quit IRC (Ping timeout: 633 seconds)
22:23 | JAA | Not sure why your command doesn't work.
22:23 | JAA | But yeah, a log file would help.
22:23 | JAA | Maybe with -v or -d even.
22:25 | ola_norsk | one sec
22:27 | | MrDignity has joined #archiveteam-bs
22:28 | ola_norsk | JAA: 'default output is verbose.' ..and there's quite little there i'm afraid :/
22:28 | ola_norsk | i'll see if there's some options that give better output
22:30 | | fie has joined #archiveteam-bs
22:30 | JAA | ola_norsk: "Not following https://web.archive.org/save/_embed/https://pbs.twimg.com/profile_images/848200666199629824/ZwvxQIzP_bigger.jpg because robots.txt forbids it."
22:30 | JAA | Fucking robots.txt
22:30 | JAA | It breaks everything. :-P
22:31 | ola_norsk | try setting --user-agent
22:31 | JAA | I did.
22:31 | ola_norsk | maybe it's a javascript thingy, that loads all the shit? :/
22:31 | JAA | I used your exact command.
22:31 | JAA | -e robots=off
22:35 | ola_norsk | hmm
22:36 | | shin has quit IRC (Quit: Connection closed for inactivity)
22:37 | ola_norsk | JAA: here is output from me running the command (Note, it's in Norwegian :/ ) https://pastebin.com/awJ9j4D8
22:37 | CoolCanuk | (please correct me if i'm wrong)
22:37 | ola_norsk | JAA: could it be i'm using an older wget or something?
22:37 | JAA | ola_norsk: With -e robots=off?
22:37 | JAA | Maybe, what version are you using?
22:38 | JAA | I'm on 1.18.
22:39 | JAA | I don't think it should matter too much though.
22:39 | ola_norsk | JAA: GNU Wget 1.17.1
22:40 | ola_norsk | sry, didn't notice the robots=off
22:40 | JAA | Hmm
22:41 | JAA | It seems that it doesn't work with -O /dev/null, interesting.
22:41 | ola_norsk | robots=off did something else indeed, but i'm guessing it didn't do much better than when you ran it
22:42 | ola_norsk | a slew of 404 errors appeared
22:44 | JAA | Yeah, I got a bunch of 404s as well, but not all requests were 404s.
22:44 | ola_norsk | --2017-12-02 23:41:17-- https://web.archive.org/save/_embed/https://abs.twimg.com/a/1512085154/css/t1/images/ui-icons_2e83ff_256x240.png
22:44 | ola_norsk | Connecting to web.archive.org (web.archive.org)|207.241.225.186|:443 ... connected.
22:44 | ola_norsk | HTTP request sent, awaiting response ... 404 Not Found
22:44 | ola_norsk | 2017-12-02 23:41:18 ERROR 404: Not Found.
22:44 | ola_norsk | is one png
22:45 | JAA | Yeah, that doesn't exist.
22:45 | JAA | But my command earlier grabbed https://pbs.twimg.com/profile_images/848200666199629824/ZwvxQIzP_bigger.jpg for example.
22:45 | ola_norsk | so, it's robots.txt on the endpoints that causes the failures?
22:45 | JAA | robots.txt at web.archive.org, yes.
22:46 | JAA | Ah, no.
22:46 | JAA | That's what causes wget not to retrieve the page requisites without -e robots=off.
22:46 | ola_norsk | no, i mean at e.g. abs.twimg.com ?
22:47 | JAA | Those 404s, not sure. Might just be broken links or misparsing.
22:48 | ola_norsk | damn internet, it's a broken big fat mess
22:48 | ola_norsk | cloudflare and shit
22:48 | CoolCanuk | which website are you trying to access that cloudflare won't let you
22:49 | CoolCanuk | I can possibly help get the true IP
22:50 | ola_norsk | it's to get waybackmachine to capture webpages, including images, by doing just a request
22:51 | CoolCanuk | oh :/
22:51 | JAA | HTML is a huge clusterfuck. Well, to be precise, HTML is fine, but the parsing engines' forgiveness is awful.
22:51 | JAA | And don't get me started on JavaScript.
22:52 | ola_norsk | CoolCanuk: i've messed up, thinking it would actually do captures by doing just that with automatic requests..but turns out it wasn't that easy :/
22:52 | ola_norsk | JAA: aye. Is it possible that twitter uses javascript to put in the images, AFTER the page is loaded?
22:52 | JAA | Definitely possible.
22:53 | ola_norsk | JAA: if so, i'm giving up even trying :d
22:55 | JAA | But at least part of it is not scripted.
22:55 | JAA | My test earlier grabbed https://pbs.twimg.com/media/DQDHMryX4AEseEo.jpg for example, which is an image from a post most likely (though I'm not going to try and figure out which one).
22:55 | ola_norsk | I think i'll just let the curl stuff run until the 14th, and let someone brighter than me figure it out in the future.
22:56 | JAA | Sometimes, I hate the WM interface. "3 captures" *click* only lists one.
22:57 | ola_norsk | one thing is images, but another is that basically all links on twitter are shortened links
22:57 | JAA | Yeah, but if you want to follow those, you'll definitely need more than that.
22:57 | JAA | I mean, it might work with --recursive and --level 1 or something like that.
22:58 | JAA | But it would really be better to just write WARCs locally and upload those to IA.
22:58 | ola_norsk | the t.co links do come with the actual link in the ALT= tag i think, not sure though
22:59 | ola_norsk | <a alt=> property i mean
22:59 | JAA | Never looked into them.
22:59 | JAA | What you're describing is more or less what I'm doing from time to time with webcams.
23:00 | JAA | I did that during the eclipse in the US in August, and I'm currently retrieving images from cams across Catalonia every 5 minutes.
23:01 | JAA | It's just a script which runs wpull in the background + sleep 300 in a loop.
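
A sketch of that loop's shape, assuming wpull is installed and using a placeholder cam URL (JAA's actual script isn't shown in the log):

    #!/bin/bash
    # Fetch the cam image into a timestamped WARC every 5 minutes, wpull in the background.
    while true; do
        wpull 'https://example.com/cam.jpg' --warc-file "cam-$(date +%Y%m%d-%H%M%S)" &
        sleep 300
    done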
23:01
π
|
JAA |
A cronjob might be cleaner, but whatever. |
23:02
π
|
ola_norsk |
with --recursive it does seem to take a hell of a lot longer.. |
23:02
π
|
ola_norsk |
and that's maybe a good sign |
23:02
π
|
JAA |
Yeah, it's now retrieving all of Twitter. |
23:02
π
|
JAA |
Well, maybe not all of it, but a ton. |
23:03
π
|
* |
ola_norsk suddenly archive all of internets |
23:03
π
|
JAA |
Solving IA's problems. Genius! |
23:03
π
|
ola_norsk |
aye |
23:03
π
|
ola_norsk |
maybe that level thing is not a bad idea :d |
23:05
π
|
JAA |
:-P |
23:05
π
|
ola_norsk |
any way i could limit it to let's say 1-2 "hops" away from twitter? :D |
23:05
π
|
ola_norsk |
...seriously, it's still going |
23:06
π
|
ola_norsk |
it went from #bogus hashtag to shotting #MAGA.. |
23:07
π
|
JAA |
Yep, and it'll retrieve every other hashtag it can find. |
23:07
π
|
ola_norsk |
aye |
23:07
π
|
JAA |
It's the best recursion. Believe me! |
23:07
π
|
ola_norsk |
'recurse all the things!' lol |
23:09
π
|
ola_norsk |
at the very least i think it needs some pause between these requests :d |
23:09
π
|
zino |
stack exausted, core dumped. |
23:10
π
|
ola_norsk |
it's doing bloddy mobile.twitter.com now .. |
23:10
π
|
ola_norsk |
nobody needs that |
23:12
π
|
ola_norsk |
it's brilliant though :D , i just hope it did the images :D |
23:12
π
|
JAA |
It did exactly what you told it to. :-P |
23:13
π
|
ola_norsk |
that just proves computes are stupid :d |
23:13
π
|
JAA |
Yeah, that or... :-P |
23:14
π
|
ola_norsk |
the Illuminati did it |
23:16
π
|
ola_norsk |
but, i'm thinking if was limited to just 1-2 hops, even 1, that would be enough to get most images. Or? |
23:17
π
|
JAA |
--page-requisites gets the images already. |
23:17
π
|
JAA |
(But apparently only if you actually write the files to disk. My tests with -O /dev/null did not work.) |
23:17
π
|
JAA |
You only need recursion with a level limit if you also want to follow links on the page. |
23:17
π
|
JAA |
Which might make sense, retrieving the individual tweets for example. |
23:18
π
|
ola_norsk |
could you pastebin the command you did that does image capture? |
23:18
π
|
JAA |
But if you want to have any control over what it grabs (for example, not 100 copies of the support and ToS sites), it'll get complex... |
23:18
π
|
JAA |
Uh |
23:18
π
|
JAA |
Closed the window already, hold on. |
23:19
π
|
ola_norsk |
the --recursion is violent :d |
23:19
π
|
JAA |
It's awesome, you just need to know how to control it. :-) |
23:19
π
|
|
jschwart has quit IRC (Quit: Konversation terminated!) |
23:20
π
|
ola_norsk |
aye |
23:21
π
|
ola_norsk |
as for any output, if i can't put in /dev/null it'll go in a ramdisk that cleared quicly |
23:21
π
|
ola_norsk |
that's |
23:22
π
|
JAA |
Uhm, dafuq? https://web.archive.org/web/20171202231923/https:/twitter.com/hashtag/bogus?f=tweets |
23:23
π
|
JAA |
That's my grab from a few minutes ago. |
23:23
π
|
JAA |
Well, it did grab the CSS etc. |
23:23
π
|
JAA |
I didn't specify the UA though. That might have something to do with it. |
23:24
π
|
ola_norsk |
i'm not sure how they distrubute the requests between 'nodes' |
23:24
π
|
JAA |
The command was wget --page-requisites -e robots=off 'https://web.archive.org/save/https://twitter.com/hashtag/bogus?f=tweets' |
23:24
π
|
ola_norsk |
ty |
23:24
π
|
JAA |
Regarding the temporary files: mktemp -d, then cd into it, run wget, cd out, rm -rf the directory. |
23:24
π
|
JAA |
Five-line bash script. :-) |
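
A direct transcription of that recipe, taking the page to save as the script's argument:

    #!/bin/bash
    # Trigger Save Page Now with requisites, using a throwaway working directory.
    tmp=$(mktemp -d)
    cd "$tmp"
    wget --page-requisites -e robots=off "https://web.archive.org/save/$1"
    cd - > /dev/null
    rm -rf "$tmp"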
23:28 | ola_norsk | JAA: sometimes i notice twitter.com requires login for anything. Maybe it varies by country. I'm not sure.
23:29 | ola_norsk | ty, gold stuff
23:29 | JAA | Yeah, Twitter's quite annoying to do anything with at all.
23:30 | JAA | We still don't have a solution for archiving an entire account or hashtag.
23:30 | ola_norsk | they make money off of doing that
23:31 | ola_norsk | so they will not make it easy
23:31 | ola_norsk | if you're from a research institution, they would easily hand over a hashtag archive from day 0. For a slump of money, of course
23:32 | JAA | Yeah
23:36 | CoolCanuk | is there a mirror of the wiki we can use until it's stable?
23:36 | JAA | No, I don't think so.
23:37 | JAA | There's a snapshot from a few months ago in the Wayback Machine, I believe.
23:46 | ola_norsk | that command entails 1.7 megabytes of data :D what is the internet coming to?? lol
23:48 | ola_norsk | mankind doesn't deserve it :d
23:49 | JAA | "The average website is now larger than the original DOOM." was a headline a few years ago...
23:49 | JAA | web page* I guess
23:49 | ola_norsk | aye, i think just the fucking front page of my online bank is ~10MB :/
23:52 | ola_norsk | no wonder dolphins are dying from space radiation and ozone