#archiveteam-bs 2017-12-02,Sat

Time Nickname Message
00:03 πŸ”— Stilett0 has joined #archiveteam-bs
00:04 πŸ”— ola_norsk JAA: aye..trying to archive a twitter hashtag has taught me that :/ "There was a problem loading..(retry button)")
00:05 πŸ”— JAA Yeah, Twitter's also pretty good at not letting you grab everything.
00:05 πŸ”— JAA Reddit as well.
00:05 πŸ”— JAA (We were having a discussion about that earlier in #archivebot.)
00:05 πŸ”— JAA At least you can iterate over all thread IDs in a reasonable amount of time on Reddit though.
00:07 πŸ”— JAA So it appears that you can get 10k results from the vid.me API.
00:07 πŸ”— ola_norsk i feel naughty doing curl requests to https://web.archive.org/save/https://twitter.com/hashtag/netneutrality?f=tweets , currently every 3rd minute :/
00:07 πŸ”— JAA You can do that for different categories, new/hot, and probably search terms (didn't try).
00:09 πŸ”— JAA There are 17 categories plus hot, new, and team picks. In the ideal case, that means 20 sections times 10k results, which is still only about 1/7th of the whole site.
00:09 πŸ”— JAA This is only about how to gather lists of videos and their metadata (uploader, description, etc.), not the actual videos.
00:09 πŸ”— JAA (Videos are available as Dash and HLS streams.)
00:10 πŸ”— JAA There are also tags, and of course you can retrieve all (?) of an uploader's videos.
00:10 πŸ”— ola_norsk JAA: As for twitter, i think one problem is that they would easily present an archive of ANYTHING, as long as they get paid for it.
00:10 πŸ”— JAA For each tag, you get hot, new, and top videos.
00:11 πŸ”— JAA ola_norsk: Yeah, probably.
00:11 πŸ”— ola_norsk JAA: most definitely
00:14 πŸ”— JAA There's a "random video" link. We could hammer that to get videos. I don't want to do the math how many times we need to retrieve it to discover the vast majority of all videos right now though.
00:14 πŸ”— ola_norsk JAA: for a legal warrant, or a lump of money, they could present all tweets with any hashtag, since the dawn of ti..twitter
00:14 πŸ”— JAA Ah, I thought you were talking about vid.me now.
00:15 πŸ”— JAA Yeah, there is a company which has an entire archive of Twitter, I believe.
00:15 πŸ”— ola_norsk ah, sorry, that was just a link regarding GOG Connect
00:16 πŸ”— JAA Ah, you're not in #archiveteam. vid.me is shutting down on Dec 14.
00:16 πŸ”— JAA That's why I'm looking into them.
00:16 πŸ”— ola_norsk really? that soon?
00:16 πŸ”— JAA https://medium.com/vidme/goodbye-for-now-120b40becafa
00:16 πŸ”— ola_norsk wow, that's going to piss off a lot of germans :D
00:17 πŸ”— JAA ola_norsk: I was thinking about Gnip, by the way. Looks like Twitter bought them a few years ago.
00:18 πŸ”— ola_norsk "We're building something new." ..
00:19 πŸ”— ola_norsk a.k.a "Trust us, we're not completely destroying this shit..We're building something new!"..
00:20 πŸ”— Frogging free image/video host "couldn't find a path to sustainability"
00:20 πŸ”— ola_norsk man, i actually thought vid.me had something good going
00:20 πŸ”— Frogging what a surprise :p
00:21 πŸ”— ola_norsk https://archive.org/details/jscott_geocities
00:24 πŸ”— ola_norsk wow, there's actually people who cancelled their youtube accounts after having used vid.me's easy export solution
00:24 πŸ”— ola_norsk and as far as i know, that shit might not be so easy to export back, since i don't think YT does import by url..
00:25 πŸ”— ola_norsk oh well
00:28 πŸ”— omglolbah why not upload to both? <.<
00:28 πŸ”— ola_norsk aye
00:30 πŸ”— ola_norsk omglolbah: according to "SidAlpha", if you know that youtuber, he wouldn't because it would mean he'd have to interact on several platforms..
00:30 πŸ”— omglolbah If only he had moved to vidme
00:31 πŸ”— ola_norsk that was his response to the request for that, not move, but upload there as well
00:31 πŸ”— omglolbah no, I'm saying I wished he had moved so that he would be gone :p
00:31 πŸ”— ola_norsk oh
00:34 πŸ”— ola_norsk where does shit go if Youtube goes though? I mean, Google Video went to Youtube..
00:36 πŸ”— ola_norsk Where did Yahoo Video go?
00:36 πŸ”— ola_norsk Justin.tv became Twitch right?
00:37 πŸ”— zino Justin.tv created Twitch and then closed down. Nothing was automatically moved. I don't know if Justin had vods though.
00:37 πŸ”— ola_norsk aye
00:38 πŸ”— JAA I was wrong about the vid.me API not returning all results.
00:38 πŸ”— JAA The actual API does return everything, or at least nearly everything.
00:38 πŸ”— JAA The "API" used by the website doesn't.
00:38 πŸ”— JAA I just didn't find the real API docs previously.
00:38 πŸ”— JAA https://docs.vid.me/#api-Videos-List
00:38 πŸ”— JAA No auth required either.
00:38 πŸ”— JAA You can get chunks of 100 videos per request.
00:38 πŸ”— zino \o/
00:39 πŸ”— zino Do we have a death date?
00:39 πŸ”— JAA It gets quite slow for large offsets, indicating that they don't know how to use offsets.
00:39 πŸ”— JAA 14 Dec
00:39 πŸ”— zino :-/
00:39 πŸ”— JAA how to use indices*
00:39 πŸ”— zino indexes?
00:40 πŸ”— Frogging where would youtube go? nowhere. it's too big :p
00:40 πŸ”— JAA I never know which plural's correct.
00:40 πŸ”— ola_norsk Frogging; aye
00:40 πŸ”— bithippo Frogging: we'll just show up with a tractor trailer "Load it all in back y'all"
00:40 πŸ”— JAA The real API returns a bit more videos, by the way: 1360532.
00:40 πŸ”— JAA (About 11k more, specifically.)
00:41 πŸ”— JAA Might be the NSFW/unmoderated/private filter stuff.
00:41 πŸ”— JAA bithippo: YouTube is around 1 exabyte. Have fun with that.
00:41 πŸ”— JAA Well, at least that order of magnitude.
00:41 πŸ”— bithippo I used to manage hundreds of petabytes :-P
00:41 πŸ”— * ola_norsk shoves in his usb stick and applies youtube-dl !
00:41 πŸ”— bithippo lol
00:42 πŸ”— JAA I'm sure someone from China will sell you a 1 EB USB stick if you ask them.
00:42 πŸ”— JAA Well, "1 EB".
00:42 πŸ”— bithippo Which will quickly err out once a few GB have been written .... :(
00:42 πŸ”— JAA Yep
00:42 πŸ”— ola_norsk i'll just save it all in /dev/null
00:42 πŸ”— JAA Or not error out, just overwrite the previous data etc.
00:43 πŸ”— zino Depends. Some of them are cyclical, so you can write all you want as long as you don't try to read it. :)
00:43 πŸ”— JAA Yep
00:43 πŸ”— JAA I'm a fan of S4.
00:43 πŸ”— JAA The Super Simple Storage Service.
00:43 πŸ”— JAA http://www.supersimplestorageservice.com/
00:44 πŸ”— bithippo That pricing is a bargain.
00:44 πŸ”— zino bithippo: Interesting. What did you work with that included 100s of PiBs? I deal in 10s of them.
00:44 πŸ”— bithippo Data taking for LHC detector
00:44 πŸ”— JAA Ooh, nice!
00:45 πŸ”— bithippo Only a couple hundred TB of spinning disk on storage arrays, the rest were tape archive libraries.
00:45 πŸ”— zino bithippo: Ah. I sort of do that on the sly. Part of our storage is for the Nordic LHC grid.
00:45 πŸ”— bithippo #TeamCMS
00:46 πŸ”— zino I deal mostly with climate data though. Have a few petabytes of that.
00:46 πŸ”— bithippo That's awesome.
00:46 πŸ”— bithippo I <3 big data sets
00:47 πŸ”— zino Indeed :)
00:47 πŸ”— dashcloud @ola_norsk If you're interested in how to make something be emulated on IA, here's some pages that lay it out for you- http://digitize.archiveteam.org/index.php/Internet_Archive_Emulation http://digitize.archiveteam.org/index.php/Making_Software_Emulate_on_IA
00:49 πŸ”— ola_norsk dashcloud: ty, i'm thinking there must be ways. If there's dosbox, there's e.g Frodo that could run in that..
00:50 πŸ”— dashcloud I've done a bunch of DOSBOX games, and there's a whole collection of emulated DOS/Win31/Mac Classic stuff up
00:51 πŸ”— zino ola_norsk: What, the C64 emu?
00:51 πŸ”— ola_norsk yes
00:51 πŸ”— zino No, nonononono. Go help the jsmess people get Vice running instead.
00:52 πŸ”— ola_norsk i was hoping that was already done
00:52 πŸ”— zino I know it's started.
00:52 πŸ”— ola_norsk good stuff
00:52 πŸ”— zino But it might be stalled forever for all I know.
00:54 πŸ”— ola_norsk i have no idea about these things, but it would be cool to see C64 on Internet Arcade
00:54 πŸ”— zino JAA: I'll have very little time to do anything before the 9th, and probably not much after either, but ping me if storage is needed for vid.me.
00:55 πŸ”— ola_norsk dashcloud, i'll try to make an item for that, using dosbox
00:55 πŸ”— ola_norsk ty for info
00:56 πŸ”— JAA zino: Will do. I'll set up a scrape of the API first to get all the relevant information about the platform. Then we'll see.
00:57 πŸ”— dashcloud if your software needs installation or configuration before the first run, you'll want to do that ahead of time
00:57 πŸ”— JAA scrape/archive, whatever. That's the information we can save for sure.
00:57 πŸ”— JAA Unless they ban us...
00:58 πŸ”— JAA Using minVideoId and maxVideoId might be faster than the offset/limit method, especially for the later pages.
00:59 πŸ”— JAA Current video IDs are slightly above 19 million, so that's around 190k requests (to be sure no videos are missed).
01:03 πŸ”— jrwr attending my first 2600 meeting
01:05 πŸ”— wp494 so the thing with vidme, there's a bunch of original stuff
01:06 πŸ”— wp494 there's a little bit of lewd stuff (they ban outright porn, but they do permit "artistic" nsfw)
01:06 πŸ”— wp494 and then there's a bit of it that consists of reuploads of copyrighted stuff
01:06 πŸ”— jrwr OK
01:06 πŸ”— wp494 not that I think it'll be a big deal since IA can just dark the affected stuff if someone does come yelling, but something to keep in mind
01:07 πŸ”— JAA Sounds more or less what I'd expect.
01:07 πŸ”— JAA like*
01:09 πŸ”— JAA I can't find any information about API rate limits, except this Reddit thread: https://redd.it/6acvg5
01:11 πŸ”— icedice has quit IRC (Quit: Leaving)
01:18 πŸ”— Ceryn has quit IRC (Connection closed)
01:26 πŸ”— ola_norsk "The Internet is Living on Borrowed Time" .. https://vid.me/1LriY (ironically on vid.me) ..That's pretty dark title, for being Lunduke :d
01:33 πŸ”— JAA To be fair, it's also available on YouTube: https://www.youtube.com/watch?v=1VD_pJOFnZ0
01:54 πŸ”— ola_norsk thats not fair :D
01:54 πŸ”— ola_norsk i think most of his vids are also on IA :d
01:55 πŸ”— ola_norsk but yeah
01:57 πŸ”— ola_norsk seriously though. I imagine there's a shitload of german vidme'ers currently bewildered as to what to do..
01:59 πŸ”— ola_norsk a lot of people used the url importing at vidme, thinking they would simply move their entire channels..
02:00 πŸ”— ola_norsk from what i've heard tales of, germany youtube is not the same youtube as everywhere elsetube
02:07 πŸ”— ranma GEMA blocks a fuckton of music there
02:07 πŸ”— ola_norsk aye
02:09 πŸ”— ola_norsk ranma: is that the only reason though? There were so many germans coming to vid.me that a video was made about it..
02:12 πŸ”— ranma JAA: how do you get your data OUT of S4?
02:12 πŸ”— ranma and what are the costs?
02:12 πŸ”— CoolCanuk s4?
02:13 πŸ”— ola_norsk ranma: "German INVASION"...100k creators..https://vid.me/JjNaH
02:13 πŸ”— ranma oh, it's a joke :'(
02:14 πŸ”— CoolCanuk i hate slow internet. ml
02:14 πŸ”— CoolCanuk *fml
02:14 πŸ”— phuzion has joined #archiveteam-bs
02:18 πŸ”— ola_norsk ranma: does it simply block ALL music? i can't see any other reason for such a noticeable influx and flight of users
02:19 πŸ”— ola_norsk ranma: It's actually hard to browse vidme because of it at times, since often 1 in 2 videos on the feed is german
02:22 πŸ”— ranma kinda wish some site could ZIP/7z another site
02:22 πŸ”— ranma just noticed archivebot slurped down https://ftp.modland.com/
02:23 πŸ”— CoolCanuk did it *completely* slurp modland?
02:23 πŸ”— ola_norsk dd -i http://google.com -o http://bing.com
02:23 πŸ”— ranma <Major> Muad-Dib: Your job for https://ftp.modland.com/ has finished.
02:25 πŸ”— ranma actually, not that i'd have the space for it, tho
02:25 πŸ”— ola_norsk has quit IRC (its the beer talking)
02:26 πŸ”— CoolCanuk does anyone have a great upload script for ia? their docs are too much for me to understand and uploading 1 by 1 is painful
02:34 πŸ”— ez for anyone wanting to mirror vid.me, its possible to page everything there: https://api.vid.me/videos/list?minVideoId=100&maxVideoId=1000
02:34 πŸ”— ez just step the min/max (its easier on the db).
02:34 πŸ”— CoolCanuk ..... https://usercontent.irccloud-cdn.com/file/PZalOsZ6/image.png
02:35 πŸ”— ez JAA: ^
02:35 πŸ”— CoolCanuk our wiki is more stable than this beta-like system :P
02:39 πŸ”— bithippo CoolCanuk: What do you mean by "upload script for ia"?
02:40 πŸ”— bithippo Such as https://github.com/jjjake/internetarchive ?
02:40 πŸ”— CoolCanuk an easier way
02:41 πŸ”— CoolCanuk e.g. I can loop it for 100s of files in a folder, but upload as 100 items.
02:41 πŸ”— bithippo That repo is your best bet for that sort of operation.
02:42 πŸ”— bithippo What sort of files and metadata?
02:42 πŸ”— CoolCanuk pdf
02:43 πŸ”— CoolCanuk currently, newspapers
02:43 πŸ”— CoolCanuk and sears crap
02:44 πŸ”— bithippo Hmm
02:46 πŸ”— bithippo The two routes would be "web interface", which gives you a nice interface and shouldn't be too painful if you're putting up each folder as an item (with all of the files contained within that folder attributed to the item). Failing that, you'd need some light python or bash scripting skills to pick up files per item, associate metadata with each item, and upload.
02:46 πŸ”— bithippo I could be wrong of course! But that's my interpretation based on working with the IA interfaces.
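
A rough sketch of the scripted route bithippo describes, using the command-line "ia" tool that ships with the internetarchive repo linked above (after running "ia configure" once); the identifier scheme and metadata values here are made up for illustration:

    #!/bin/bash
    # One item per folder: every PDF inside a folder becomes a file of that item.
    for dir in */; do
        name=$(basename "$dir")
        ia upload "newspaper-$name" "$dir"*.pdf \
            --metadata="mediatype:texts" \
            --metadata="title:$name"
    done
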
02:47 πŸ”— ez tbh, IA interface is just plain atrocious to use
02:47 πŸ”— bithippo Indeed.
02:47 πŸ”— ez i suppose that's an artificial barrier of entry on purpose to avoid people uploading crap
02:48 πŸ”— CoolCanuk I guess
02:48 πŸ”— CoolCanuk I'm uploading stuff I know will probably not be found anywhere else
02:49 πŸ”— ez yea, the commitment to jump the hoops is paired with commitment to curate content
02:50 πŸ”— CoolCanuk Only thing I'm worried about is repetitive strain injury
02:52 πŸ”— wp494_ has joined #archiveteam-bs
02:59 πŸ”— wp494 has quit IRC (Read error: Operation timed out)
03:03 πŸ”— ld1 has quit IRC (Quit: ~)
03:06 πŸ”— ld1 has joined #archiveteam-bs
03:19 πŸ”— CoolCanuk why does IA have a difficult time using the FIRST page of a pdf as the COVER >:(
03:41 πŸ”— wp494_ is now known as wp494
03:47 πŸ”— MrRadar X-posting from #archiveteam: if you're using youtube-dl to grab vid.me content, be aware of this issue: https://github.com/rg3/youtube-dl/issues/14199
03:47 πŸ”— MrRadar tl;dr: their HLS streams return a data format youtube-dl doesn't fully handle resulting in corrupted output files
03:48 πŸ”— MrRadar Use the workaround in the 2nd-to-last comment to force youtube-dl to grab from the DASH endpoints instead
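
The exact workaround is in the linked issue; one generic way to steer youtube-dl away from the HLS formats is a format filter on the protocol field, assuming the DASH/MP4 formats are served over plain HTTP(S) (the video URL below is a placeholder):

    youtube-dl -f 'bestvideo[protocol^=http]+bestaudio[protocol^=http]/best[protocol^=http]' \
        'https://vid.me/XXXX'
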
03:54 πŸ”— wp494 posting highlights of https://www.youtube.com/watch?v=KMaWSinw4MI&t=41m33s here
03:55 πŸ”— wp494 first one being that linus has significant disagreements with senior management, and especially NCIX's owner
03:55 πŸ”— wp494 which seems to be a very common theme
03:56 πŸ”— wp494 he also left NCIX because the people he mentored departed
03:56 πŸ”— wp494 says he thinks some were forced out because of extraordinarily poor management decisions
03:57 πŸ”— wp494 (in his opinion)
03:57 πŸ”— Frogging I'm reading this https://np.reddit.com/r/bapcsalescanada/comments/77h771/for_anyone_that_purchased_a_8700k_from_ncix/domm2ca/?context=3
04:09 πŸ”— josho493 has joined #archiveteam-bs
04:09 πŸ”— wp494 linus pitched what sounded like a pretty good idea, try and get bought, but how? his solution was to open "NCIX Lite"s across the country which would be really small pickup places that you could ship to since shipping direct to your home sometimes killed the deal
04:10 πŸ”— wp494 said that the writing was on the wall as early as 7 years ago (before Amazon was doing pickup) to anyone actively paying attention, so the idea would've been to attract someone similar to Amazon if not Amazon themselves using that infrastructure when they wanted to gobble someone Whole Foods style
04:12 πŸ”— wp494 linus said when management didn't do that, he said it became obvious that he had to GTFO
04:12 πŸ”— wp494 he says he hasn't been screwed over personally by Steve (the owner) and his wife unlike some of the other horror stories going out
04:16 πŸ”— wp494 he wound up signing a non-compete for 2 years (which got extended by 1)
04:17 πŸ”— wp494 when he left he took the LTT assets, and did it on paper (and was glad he did), because even though he wouldn't think Steve would do anything untoward to him, creditors are sharks looking for their next kill
04:19 πŸ”— wp494 and that's about it
04:27 πŸ”— ez why do people bother with fairly standard eshop drama, was ncix the canadian amazon or something?
04:27 πŸ”— MrRadar More like Canadian Newegg... before Newegg moved into Canada
04:28 πŸ”— MrRadar It was *the* place to go for computer parts online, from what I understand
04:28 πŸ”— ez ah
04:28 πŸ”— ez razor thin margins, yea we have that locally too
04:28 πŸ”— ez all with fake "in stock" stickers where you wait 2 weeks and everything
04:30 πŸ”— qw3rty115 has joined #archiveteam-bs
04:34 πŸ”— qw3rty114 has quit IRC (Read error: Operation timed out)
04:38 πŸ”— Frogging they had a location here in Ottawa. I used to shop there until they closed it
05:07 πŸ”— josho493 has quit IRC (Quit: Page closed)
05:09 πŸ”— CoolCanuk defunct as of today :o
05:09 πŸ”— CoolCanuk *yesterday
05:11 πŸ”— Mateon1 has quit IRC (Ping timeout: 245 seconds)
05:12 πŸ”— Mateon1 has joined #archiveteam-bs
05:15 πŸ”— CoolCanuk am I the only one who doesnt really see the big deal of google home/mini or amazon alexa?
05:29 πŸ”— ranavalon has quit IRC (Read error: Connection reset by peer)
05:36 πŸ”— Frogging there's a big deal?
05:38 πŸ”— shindakun has joined #archiveteam-bs
05:38 πŸ”— Jcc10 has joined #archiveteam-bs
05:42 πŸ”— ez CoolCanuk: we're all waiting for amazon to give access to alexa transcripts to app devs
05:42 πŸ”— ez so we can start archiving every little embarassing thing anyone has ever said
05:42 πŸ”— wp494 which of vidme's logos should I use for the article, the wordmark or their "astro" mascot
05:42 πŸ”— wp494 https://vid.me/media
05:48 πŸ”— CoolCanuk the one on the main page (red)
05:48 πŸ”— wp494 wordmark it is
05:48 πŸ”— CoolCanuk sadly cant be eps or svg :(
05:49 πŸ”— wp494 gonna resize it a little otherwise it'll appear about as big in a warrior project
05:49 πŸ”— CoolCanuk or we could fix the template
05:49 πŸ”— CoolCanuk wait what do you mean
05:50 πŸ”— wp494 lemme go dig through the spuf logs to show you
05:50 πŸ”— wp494 (come to think of it I'm not even sure if I took an image, I might have just pull requested and moved on)
05:51 πŸ”— CoolCanuk our {{Template project}} should be fixed to a larger logo size
05:51 πŸ”— CoolCanuk using it online is not an issue, because we can dynamically resize
05:51 πŸ”— CoolCanuk http://tracker.archiveteam.org/
05:52 πŸ”— wp494 yeah there it could benefit from being a touch bigger at least for logos that are rectangles instead of squares
05:52 πŸ”— CoolCanuk apparently we can't... :|
05:52 πŸ”— wp494 (it seems to like squares the best)
05:52 πŸ”— CoolCanuk "benefit"?
05:52 πŸ”— CoolCanuk distortion?
05:52 πŸ”— wp494 and yeah, I was about to say, our copy of mediawiki isn't quite as flexible as wikimedia's where you can stuff in any number and it'll spit it out for you
05:52 πŸ”— wp494 even ridiculously large ones like 10000px
05:52 πŸ”— CoolCanuk I just noticed that. that's too bad
05:53 πŸ”— CoolCanuk another reason to use SVG.
05:54 πŸ”— wp494 even SVGs too
05:54 πŸ”— CoolCanuk no. SVGs are not raster
05:55 πŸ”— CoolCanuk you can blow them up to 1000000000px and it will never distort unless you have embedded rasters
05:57 πŸ”— CoolCanuk https://upload.wikimedia.org/wikipedia/commons/3/35/Tux.svg
06:02 πŸ”— wp494 ok I was gonna recreate an example with SPUF but there's a live one that I can get you right now
06:03 πŸ”— wp494 see how the miiverse logo goes a bit out of its bounds and pushes content downwards: https://i.imgur.com/P3Wcfbp.png
06:03 πŸ”— CoolCanuk ew
06:03 πŸ”— CoolCanuk logo should be within that white div, not yellow
06:04 πŸ”— CoolCanuk (within, not overlaid)
06:04 πŸ”— wp494 now take the version of the steam icon we had stored on the wiki and stuffed into the project code (http://www.archiveteam.org/images/4/48/Steam_Icon_2014.png) and it wound up being a bit worse than that example
06:04 πŸ”— wp494 luckily a 100px version that mediawiki gracefully generated more or less solved things: https://github.com/ArchiveTeam/spuf-grab/pull/2/commits/1c319d3d144cc13599f1fe571e699ca8b3d79e60
06:04 πŸ”— CoolCanuk not the image's fault, it's the tracker ;)
06:05 πŸ”— wp494 afaik tracker main page was ok
06:05 πŸ”— CoolCanuk how could it be ok
06:05 πŸ”— wp494 note how it looks like it's fine on http://tracker.archiveteam.org/
06:05 πŸ”— CoolCanuk simply use max-width for img in css
06:05 πŸ”— CoolCanuk *height
06:06 πŸ”— wp494 but with that said scroll bars do appear
06:06 πŸ”— CoolCanuk then you need to
06:06 πŸ”— CoolCanuk overflow: hidden
06:06 πŸ”— wp494 but it's nothing near as annoying as the in-warrior example, though still a nuisance albeit very minor
06:07 πŸ”— CoolCanuk I will fix it
06:08 πŸ”— wp494 k so a 600 x 148 version will go up on the wiki
06:08 πŸ”— wp494 and then if it causes problems we can grab a 100px url
06:08 πŸ”— wp494 for project code
06:09 πŸ”— CoolCanuk we have or
06:09 πŸ”— CoolCanuk **or
06:09 πŸ”— CoolCanuk just use max-height: 100px
06:09 πŸ”— CoolCanuk ;)
06:09 πŸ”— wp494 ok project page is going up
06:09 πŸ”— CoolCanuk lol how did it let you upload file name with a space :P
06:09 πŸ”— CoolCanuk it makes me use _
06:10 πŸ”— wp494 it does insert a _
06:10 πŸ”— wp494 the recent changes bot treats it as a space though
06:11 πŸ”— wp494 but for actually using the filename you're going to need to use underscores
06:11 πŸ”— CoolCanuk o
06:11 πŸ”— wp494 aw crap I'm getting spam filtered and I don't even get a prompt to put in the secret phrase
06:12 πŸ”— wp494 oh well let's see if this workaround of inserting a space in the url works
06:12 πŸ”— CoolCanuk heh
06:12 πŸ”— CoolCanuk SHHHH that's supposed to be a secret :x
06:13 πŸ”— wp494 ok wow that apparently worked
06:15 πŸ”— CoolCanuk i'll fix it for ya
06:15 πŸ”— wp494 gl with the filter
06:15 πŸ”— CoolCanuk oh you fixed it
06:15 πŸ”— wp494 I was surprised I was even able to toss such a tiny little stone at that goliath
06:18 πŸ”— wp494 ok that's a solid foundation I think
06:19 πŸ”— CoolCanuk huh
06:20 πŸ”— CoolCanuk I have a workaround :P
06:21 πŸ”— slyphic has quit IRC (Read error: Operation timed out)
06:21 πŸ”— Odd0002 I got a 508 clicking that purplebot link
06:21 πŸ”— SketchCow godane: What does "WOC" mean with the MPGs?
06:21 πŸ”— Odd0002 resource limit reached
06:22 πŸ”— CoolCanuk this 208 error will be the death of me
06:22 πŸ”— CoolCanuk *508
06:23 πŸ”— Odd0002 connection timed out now...
06:23 πŸ”— CoolCanuk same here ughhhh
06:24 πŸ”— CoolCanuk ffff
06:24 πŸ”— CoolCanuk impossible to edit
06:25 πŸ”— Odd0002 oh finally
06:25 πŸ”— CoolCanuk there must be more than just "shared hosting" being the problem
06:28 πŸ”— CoolCanuk can the topic in #archiveteam changed from Compuserve to vidme? lmfao
06:28 πŸ”— CoolCanuk *be changed
06:32 πŸ”— wp494 if it gets pointed out a few times like with compuserve then someone will probably do it
06:32 πŸ”— wp494 if it's just once or twice more then it's no big deal just say "yeah we're on it"
06:33 πŸ”— CoolCanuk fair
07:06 πŸ”— CoolCanuk making up a tag for vidme is going to be tricky. it's so short.. hard to come up with a spinoff
07:42 πŸ”— Pixi has quit IRC (Ping timeout: 255 seconds)
07:42 πŸ”— Pixi has joined #archiveteam-bs
07:44 πŸ”— BlueMaxim has quit IRC (Ping timeout: 633 seconds)
07:45 πŸ”— BlueMaxim has joined #archiveteam-bs
09:05 πŸ”— Dimtree has quit IRC (Peace)
09:11 πŸ”— Dimtree has joined #archiveteam-bs
09:55 πŸ”— fie has quit IRC (Ping timeout: 245 seconds)
10:11 πŸ”— fie has joined #archiveteam-bs
10:16 πŸ”— CoolCanuk has quit IRC (Quit: Connection closed for inactivity)
10:27 πŸ”— BlueMaxim has quit IRC (Read error: Connection reset by peer)
10:35 πŸ”— schbirid has joined #archiveteam-bs
11:08 πŸ”— fie has quit IRC (Ping timeout: 246 seconds)
11:21 πŸ”— fie has joined #archiveteam-bs
11:33 πŸ”— bithippo has quit IRC (My MacBook Air has gone to sleep. ZZZzzz…)
11:44 πŸ”— jschwart has joined #archiveteam-bs
12:11 πŸ”— JAA ez: Yep, that's what I came up with yesterday as well. You can either iterate min/maxVideoId in blocks of 100 with limit=100 or implement pagination. I'd probably go for the former, i.e. retrieve video IDs 1 to 100, 101 to 200, etc. (need to figure out whether these parameters are exclusive or not though).
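
A minimal sketch of the block paging JAA describes, assuming the documented /videos/list endpoint takes minVideoId, maxVideoId, and limit as query parameters and that IDs top out around 19 million (whether the bounds are inclusive still needs checking, as noted above):

    #!/bin/bash
    # Walk the vid.me ID space in blocks of 100 and keep every raw JSON response.
    MAX_ID=19100000   # assumption based on the ~19M figure mentioned earlier
    for ((min=1; min<=MAX_ID; min+=100)); do
        max=$((min + 99))
        curl -s "https://api.vid.me/videos/list?minVideoId=${min}&maxVideoId=${max}&limit=100" \
            > "videos_${min}_${max}.json"
        sleep 1   # rate limits are undocumented (see the Reddit thread above), so go easy
    done
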
12:17 πŸ”— MangoTec has joined #archiveteam-bs
12:41 πŸ”— jrwr my god
12:41 πŸ”— jrwr the best thing I've ever heard just got tweeted
12:41 πŸ”— jrwr @ElonMusk: Payload will be my midnight cherry Tesla Roadster playing Space Oddity. Destination is Mars orbit. Will be in deep space for a billion years or so if it doesn't blow up on ascent.
12:57 πŸ”— zino Elon knows how to put on a show.
12:57 πŸ”— jrwr Yep
12:58 πŸ”— jrwr I mean, he thinks its going to blow, they didn't want to make a real payload... so fuck it send a Car
13:01 πŸ”— zino At this time I'll recommand the old Top Gear episode where they convert a car to a space shuttle and blast it off with rockets.
13:01 πŸ”— zino recommend*
13:24 πŸ”— MangoTec has quit IRC (Quit: Page closed)
13:43 πŸ”— schbirid hetzner's auctions seem to have dropped in price a lot, 1/3 aka -10€ for what i have
13:43 πŸ”— schbirid https://www.hetzner.com/sb
13:44 πŸ”— schbirid nvm, had fucking US version without VAT =(
13:49 πŸ”— odemg https://medium.com/vidme/goodbye-for-now-120b40becafa
13:49 πŸ”— odemg https://medium.com/vidme/goodbye-for-now-120b40becafa
13:49 πŸ”— odemg https://medium.com/vidme/goodbye-for-now-120b40becafa
13:49 πŸ”— odemg What the fuck!!
13:49 πŸ”— odemg Okay you know about it
13:49 πŸ”— odemg but what the actual fuck!
14:10 πŸ”— jrwr people are finding out its REALLY hard to make a video website
14:16 πŸ”— odemg It's easy to make a video site, it's just hard to monetise it, mediacru.sh was the best in terms of technology in my opinion but they didn't manage to monetise either.
14:16 πŸ”— ranavalon has joined #archiveteam-bs
14:17 πŸ”— ranavalon has quit IRC (Remote host closed the connection)
14:17 πŸ”— odemg I'm collecting video ids from reddit anyways, heads up the bulk of the older urls (and possibly new ones) are going to be reddit porn related.
14:17 πŸ”— ranavalon has joined #archiveteam-bs
14:18 πŸ”— schbirid wait, youtube is still operating at a loss
14:18 πŸ”— schbirid why the FUCK are people making so much money on their ad share then?
14:18 πŸ”— schbirid *?
14:21 πŸ”— voidsta has joined #archiveteam-bs
14:22 πŸ”— odemg Google isn't operating at a loss, so they can keep YouTube afloat and keep trying new things to pump up their bottom line, which is why we see a new yt related shit storm every other week, yt may as well be called YouTube[beta] or YouTube[this is an experiment]
14:23 πŸ”— schbirid YouTube[incredible journey]
14:24 πŸ”— odemg Though because it's Google and because there is no real competition for them making any real headway, we can talk like yt is 'never' going to close doors, or turn their service off, but it'll come, maybe not today, maybe not in 5 years, but it'll come when we're 'what the fucking' at a Google blog post announcing their coming plans to phase out YouTube or just turn it off.
14:26 πŸ”— odemg Hopefully that comes at a time 500PB* is nothing and something we can grab in a few months
14:26 πŸ”— JAA ... except YouTube will be 10 EB by then.
14:27 πŸ”— Kaz wait what
14:27 πŸ”— Kaz vimeo is dead?
14:28 πŸ”— jrwr vid.me I though
14:28 πŸ”— Kaz vid.me
14:28 πŸ”— Kaz ffs it looked very close to vimeo
14:28 πŸ”— odemg It's an odd time we're living in; when we first started, 10TB was insane to think we could get, now we're doing sites nearing 300TB without a great deal of thought. We're scaling pretty well with the times I suppose, but how long before IA closes its doors and we have to find somewhere to put that? (I know we're talking about it...)
14:29 πŸ”— jrwr Ugh
14:29 πŸ”— jrwr if IA ever goes bust
14:29 πŸ”— Kaz so
14:29 πŸ”— Kaz we have 2 weeks for vidme
14:29 πŸ”— odemg yup
14:29 πŸ”— JAA I'm setting up an API scrape right now.
14:30 πŸ”— Kaz probably needs a channel, not sure how big it is
14:30 πŸ”— odemg #vidmeh
14:30 πŸ”— JAA 1.3x million videos
14:30 πŸ”— schbirid vidwithoutme, vidnee, vidmeh
14:31 πŸ”— Kaz vidmeh will do
14:31 πŸ”— zino This will almost need to be a warrior project. We can probably fix storage, but there is no way we can download this in time using a script-solution unless someone buys up Amazon nodes to do it.
14:32 πŸ”— zino JAA: Any idea what the average size of a video is?
14:32 πŸ”— JAA zino: I haven't looked at the videos themselves at all yet, only the metadata.
14:32 πŸ”— JAA The API returns a link to download the videos as an MP4, by the way.
14:32 πŸ”— JAA The website uses Dash/HLS.
14:35 πŸ”— JAA Those MP4s are hosted on CloudFront, by the way, i.e. Amazon. That could be annoying.
14:51 πŸ”— jrwr wiki is slow as balls
15:07 πŸ”— voidsta has left
15:08 πŸ”— MrRadar zino: I've been scraping a few channels and here's what I've seen so far. Their highest quality is 2 mbps video (at 1080p or 720p depending on the original resolution) with audio between 128kbps and 320 kbps(!)
15:09 πŸ”— MrRadar SD-quality video is around 1200 kbps
15:09 πŸ”— jrwr Ugh
15:10 πŸ”— jrwr thats not too bad overall
15:10 πŸ”— MrRadar And I'm grabbing with youtube-dl's "bestvideo+bestaudio" option, if storage/bandwidth becomes an issue they have lower-quality versions we could grab instead
15:11 πŸ”— jrwr Na
15:11 πŸ”— jrwr We have da powerrrr
15:11 πŸ”— jrwr right now I'm working on the grabber, mostly just going to mod eroshare-grab
15:12 πŸ”— MrRadar Some files are randomly capped at 150 KB/s download while others will saturate my 50 mbit connection
15:12 πŸ”— jrwr the channel pages are going to be interesting since they're the scroll-load type
15:12 πŸ”— MrRadar As long as the URLs for those follow a pattern that shouldn't be too hard
15:13 πŸ”— MrRadar Oh, I just noticed there's a channel, #vidmeh
15:13 πŸ”— jrwr ya
15:23 πŸ”— HCross2 Nothing is bloody working
15:24 πŸ”— jrwr for what
15:24 πŸ”— HCross2 I've spent all day trying to get my proxmox cluster sorted
15:30 πŸ”— jrwr dat CDN
15:35 πŸ”— Jcc10 has quit IRC (Ping timeout: 260 seconds)
15:39 πŸ”— jrwr hey JAA you're pulling all the APIs, are you saving all the reposes so we can get the raw URL for the videos?
15:39 πŸ”— jrwr responses*
16:01 πŸ”— JAA Yeah, of course I save them. To WARC, specifically.
16:04 πŸ”— kristian_ has joined #archiveteam-bs
16:29 πŸ”— CoolCanuk has joined #archiveteam-bs
16:33 πŸ”— fie has quit IRC (Ping timeout: 360 seconds)
16:43 πŸ”— shin has joined #archiveteam-bs
16:44 πŸ”— fie has joined #archiveteam-bs
17:06 πŸ”— shindakun don't know if it will help but i made a brute force video/metadata downloader for vidme https://github.com/shindakun/vidme i don't really have the bandwidth or storage to let it run though
17:07 πŸ”— shindakun you guys already have a lot of tooling though
17:07 πŸ”— JAA No need to bruteforce, we can get a list of all videos through their API.
17:07 πŸ”— PurpleSym shindakun: /join #vidmeh
17:07 πŸ”— JAA (I'm doing that currently.)
17:08 πŸ”— shindakun that's basically what it does sort of... i found some seemed to be unlisted so i request details for every videoid
17:08 πŸ”— shindakun off to vidmeh lol
17:08 πŸ”— JAA Right. There's an API endpoint for getting lists of videos though, so you don't have to run through all ~19M IDs.
17:09 πŸ”— JAA You can do it with 190k requests. With further optimisation, it might be possible to decrease that even further, but that's a bit more complex.
17:09 πŸ”— ola_norsk has joined #archiveteam-bs
17:11 πŸ”— ola_norsk made a test C64/dosbox emulator item (https://archive.org/details/iaCSS64_test) , but it seems very slow. At least on my potato pc.
17:13 πŸ”— ola_norsk unfortunately i'm no ms-dos guru. But might there be a way to optimize speed through some dos utilities/settings that could reside in the zip file?
17:14 πŸ”— zino You are emulating in two layers. It's not going to be fast, or accurate.
17:16 πŸ”— ola_norsk yeah it's kind of emu-inception :d But, could fastER be done perhaps?
17:17 πŸ”— ola_norsk i did try it in Brave browser as well as Chromium, and Brave seemed to run it a bit better.
17:17 πŸ”— ola_norsk and my pc is kind of shit
17:21 πŸ”— Igloo /join #vidmeh
17:21 πŸ”— Igloo ahem
17:27 πŸ”— Stilett0 has quit IRC (Ping timeout: 246 seconds)
17:29 πŸ”— CoolCanuk ahhhhh. CLEVER
17:32 πŸ”— Pixi has quit IRC (Quit: Pixi)
17:38 πŸ”— kristian_ has quit IRC (Quit: Leaving)
17:45 πŸ”— mundus201 is now known as mundus
17:46 πŸ”— Pixi has joined #archiveteam-bs
18:08 πŸ”— hook54321 How can I automatically save links from an RSS feed onto the wayback machine?
18:08 πŸ”— pizzaiolo has joined #archiveteam-bs
18:14 πŸ”— CoolCanuk i'd use something like this http://xmlgrid.net/xml2text.html . then get rid of the non urls in excel/google sheets.
18:15 πŸ”— JAA Ew
18:15 πŸ”— CoolCanuk then upload your list of urls to pastebin, get the raw link. in #archivebot , use !ao < PASTEBINrawLINK
18:15 πŸ”— CoolCanuk you got a better idea, JAA ? :P
18:15 πŸ”— ola_norsk if you have the links in a list; curl --silent --max-time 120 --connect-timeout 30 'https://web.archive.org/save/THE_LINK_TO_SAVE' > /dev/null , is a way to save them i think
18:15 πŸ”— JAA Grab the feed, extract the links (by parsing the XML), throw them into wpull, upload WARC to IA. Throw everything into a cronjob, done.
18:16 πŸ”— CoolCanuk o ok
18:16 πŸ”— JAA I suspect he's looking for something that doesn't require writing code though.
18:16 πŸ”— CoolCanuk most users are :P
18:16 πŸ”— CoolCanuk also why curl? cant we just use HTTP GET?
18:17 πŸ”— astrid that's what curl does
18:17 πŸ”— JAA That's what curl does. You could also use wget, wpull, or whatever else.
18:17 πŸ”— JAA Hell, you could do it with openssl s_client if you really wanted to.
18:18 πŸ”— JAA And yeah, you can obviously replace the "throw them into wpull, upload WARC to IA" with that.
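
A minimal sketch of that kind of cronjob, using the plain /save/ approach discussed here; the feed URL is a placeholder, the grep pattern assumes GNU grep and item links sitting in plain <link> elements, and (as noted just below) /save/ alone only grabs the linked page itself, not its requisites:

    #!/bin/bash
    # Pull an RSS feed, extract the item links, and ask the Wayback Machine to save each one.
    curl -s 'https://example.com/feed.xml' \
        | grep -oP '<link>\K[^<]+' \
        | while read -r url; do
              curl -s --max-time 120 "https://web.archive.org/save/${url}" > /dev/null
          done
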
18:18 πŸ”— CoolCanuk oh.. I thought curl downloads the web.archive.org page as well
18:18 πŸ”— JAA It wouldn't grab the requisites though, I think.
18:18 πŸ”— JAA CoolCanuk: That's exactly what it does, and it triggers a server-side archiving.
18:19 πŸ”— CoolCanuk unhelpful if you have a bad internet connection and don't want to download the archive.org page every request :P
18:19 πŸ”— ola_norsk idk :d i just use that as cronjobs to save tweets https://pastebin.com/raw/ZE4udKTi
18:20 πŸ”— arkiver no page requisites are saved when you use /save/ like that
18:20 πŸ”— arkiver only the one URL you have after /save/
18:20 πŸ”— arkiver no images, or other stuff from the page is saved
18:20 πŸ”— ola_norsk doh
18:21 πŸ”— CoolCanuk (which is probably fine for net neutrality.. it should mostly be text/links to othe rsites)
18:21 πŸ”— CoolCanuk if there are any images, it's likely already been posted before
18:22 πŸ”— arkiver you can't see what picture is on a page if it's not saved
18:22 πŸ”— arkiver no matter how many times the picture might have been saved in other places across the web
18:23 πŸ”— CoolCanuk you can't see pictures that are still online?
18:23 πŸ”— ola_norsk twitter also uses their damn t.co url shortening
18:23 πŸ”— arkiver I think we save things in case they go offline
18:24 πŸ”— astrid <3
18:25 πŸ”— CoolCanuk (I hope that wasn't passive aggressive) :(
18:26 πŸ”— * arkiver isn't an aggressive person :)
18:27 πŸ”— CoolCanuk aggressive at archiving :P
18:27 πŸ”— CoolCanuk hehe
18:27 πŸ”— arkiver :)
18:28 πŸ”— ola_norsk i've been running those cronjobs since the 26th (i think). Should i perhaps just halt that idea then, or might it be useful data for someone else to dig through? At least the text and links are there i guess..
18:29 πŸ”— ola_norsk was planning to run them until the netneutrality voting stuff is over on the 14th(?)
18:29 πŸ”— arkiver text is always useful
18:30 πŸ”— JAA Definitely better than nothing.
18:30 πŸ”— arkiver I believe the data from Alexa on IA also does not include pictures
18:30 πŸ”— arkiver but I'm not totally sure about that
18:34 πŸ”— ola_norsk i'm just going to let it run then
18:34 πŸ”— JAA What does the /save/ URL return exactly? Are the URLs for page requisites also replaced with /save/ URLs?
18:35 πŸ”— JAA If so, it might be possible to use wget --page-requisites to grab them.
18:35 πŸ”— ola_norsk one sec
18:38 πŸ”— ola_norsk https://pastebin.com/raw/dJrVbnpr
18:39 πŸ”— ola_norsk that's what i get when running: curl -H "User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/62.0.3202.89 Chrome/62.0.3202.89 Safari/537.36" --silent --max-time 120 --connect-timeout 30 'https://web.archive.org/save/https://twitter.com/hashtag/netneutrality?f=tweets'
18:40 πŸ”— JAA Yep, everything there is also replaced with /save/ URLs.
18:40 πŸ”— JAA So give wget --page-requisites a shot if you want.
18:40 πŸ”— JAA (Plus a bunch of other options, obviously.)
18:40 πŸ”— arkiver JAA: yes
18:41 πŸ”— ola_norsk ok
18:41 πŸ”— arkiver I believe embeds are replaced with a /save/_embed/ URL and links with a /save/ URL
18:41 πŸ”— JAA Yep
18:42 πŸ”— ola_norsk by 'other options' do you mean just to make it run quiet?
18:44 πŸ”— JAA Yeah, and making it not write the files to disk.
18:44 πŸ”— ola_norsk ok
18:45 πŸ”— JAA Not sure what else you'd need for this.
18:45 πŸ”— ola_norsk me neither unfortunately, i had to browse a bit just to learn that much curl :d
18:45 πŸ”— ola_norsk but i'll check it out
18:52 πŸ”— ola_norsk i did ask info@archive.org if it's ok to do the curl commands so frequently (every 3-5 minutes), but no response back yet.
18:53 πŸ”— ola_norsk i just hope they won't suddenly go 'wtf is this!?' and block me :d
19:04 πŸ”— ZexaronS has joined #archiveteam-bs
19:08 πŸ”— arkiver no
19:08 πŸ”— arkiver it's just one URL that's saved per curl command
19:08 πŸ”— arkiver https://archive.org/details/liveweb?sort=-publicdate
19:09 πŸ”— arkiver the number of URLs per item in there is a lot higher than how many you are saving in a day
19:12 πŸ”— ola_norsk as long it's fine with IA i'm good
19:13 πŸ”— ola_norsk arkiver: could there be a way to 'retro-crawl' the tweets i've already saved?
19:14 πŸ”— ola_norsk to get the images to load into the saves, i mean
19:14 πŸ”— Stilett0 has joined #archiveteam-bs
19:15 πŸ”— ola_norsk this is the mail i wrote on the 27th btw: https://pastebin.com/AV1vbKUr
19:16 πŸ”— arkiver I'm sure they're fine with it
19:16 πŸ”— ola_norsk good stuff
19:16 πŸ”— arkiver let me know if anything goes wrong
19:16 πŸ”— ola_norsk ok
19:17 πŸ”— arkiver with the 'retro-crawl', I guess you could get the older captures, get the URLs for the pictures from those and save those
19:17 πŸ”— arkiver but you can't really /save/ an old page again
19:17 πŸ”— arkiver or continue a /save/ or something
19:19 πŸ”— ola_norsk ok. I'm guessing at least some number of the tweets are bound to have become deleted by the users themselves (or banned user accounts).
19:19 πŸ”— JAA If you visit the pages, it should grab any images that aren't in the archives already.
19:20 πŸ”— JAA So I guess you could make your browser go through all those old crawls.
19:20 πŸ”— ola_norsk ouch
19:20 πŸ”— ola_norsk but yeah, that is what i meant :d
19:20 πŸ”— JAA Or perhaps it would work with wget --page-requisites as well, not sure.
19:21 πŸ”— ola_norsk i'll rather try that than sit scrolling in my browser :D
19:32 πŸ”— ola_norsk opening a capture in the browser does not seem to work to pull the images https://web.archive.org/web/20171130120002/https:/twitter.com/hashtag/netneutrality?f=tweets
19:32 πŸ”— ola_norsk only user avatars etc seems to be present
19:36 πŸ”— ola_norsk and those f*cking t.co links...pissing me off :/
19:51 πŸ”— dashcloud has quit IRC (Read error: Operation timed out)
19:52 πŸ”— dashcloud has joined #archiveteam-bs
19:58 πŸ”— CoolCanuk I'm not American and articles aren't helping... how fast is Cumulus Media declining ?
19:58 πŸ”— CoolCanuk This looks like quite the "portfolio" https://en.wikipedia.org/wiki/List_of_radio_stations_owned_by_Cumulus_Media
20:01 πŸ”— schbirid does gdrive use some kind of incremental throttling for uploads? i am down to 1.5MB/s now :(
20:01 πŸ”— schbirid and it seems quite linear over time
20:01 πŸ”— bitspill has joined #archiveteam-bs
20:02 πŸ”— ola_norsk CoolCanuk: https://www.marketwatch.com/investing/stock/cmlsq ..Not sure if it's really indicative though
20:02 πŸ”— CoolCanuk omg
20:02 πŸ”— CoolCanuk 0.095?!
20:03 πŸ”— CoolCanuk iHeartRadio also seems troubled
20:04 πŸ”— CoolCanuk however, iHeartRadio in Canada is likely not impacted, since I'm pretty sure Bell purchased rights to use it and it's a crappy radio streaming app for Bell Media radio stations- not true iHeartRadio
20:06 πŸ”— ola_norsk CoolCanuk: All i see is the slope going down :d https://www.marketwatch.com/investing/stock/cmlsq/charts That's basically the max of my knowledge about stocks and shit :d
20:07 πŸ”— CoolCanuk same here
20:14 πŸ”— SimpBrain has quit IRC (Remote host closed the connection)
20:18 πŸ”— ola_norsk CoolCanuk: a friend of mine who unfortunately passed away in 2015 once showed me daytrading thingy software. If i remember correctly the only thing that differed from the free API testing was that all the data was delayed
20:20 πŸ”— ola_norsk CoolCanuk: it wouldn't be useful for trading, but perhaps for alerting about online services going to hell
20:28 πŸ”— Kaz schbirid: i think there's a limit of 750GB/day uploaded?
20:29 πŸ”— Kaz if you're close to that, could explain things
20:29 πŸ”— schbirid ah, maybe
20:30 πŸ”— schbirid nope... today is just at "Transferred: 104.014 GBytes (1.540 MBytes/s)"
20:31 πŸ”— zino schbirid, any packet loss?
20:31 πŸ”— schbirid no idea, how do i check?
20:31 πŸ”— Frogging ping maybe
20:32 πŸ”— zino Well, step one: Be on linux (and run the upload from the same machine), step two: run "mtr hostname.here"
20:32 πŸ”— schbirid no idea what the hostnames for gdrive are
20:32 πŸ”— Frogging oh yeah mtr that's better
20:32 πŸ”— schbirid mtr rules
20:32 πŸ”— zino Step 0: Install iftop and check what address all your data is going too. :)
20:32 πŸ”— zino to*
20:33 πŸ”— schbirid duh, i feel dumb
20:35 πŸ”— zino Don't. There are many ways to do this, and today you learned a new one.
20:35 πŸ”— schbirid relearned
20:36 πŸ”— zino There will be a test on what all flags to tar are and what they do tomorrow!
20:38 πŸ”— schbirid i use longform
20:38 πŸ”— schbirid :P
20:38 πŸ”— schbirid tar is easy
20:38 πŸ”— schbirid looks like there is no traffic at all and rclone is doing some crap instead. makes sense to have the "speed" die down linearly then
20:54 πŸ”— dashcloud has quit IRC (Quit: No Ping reply in 180 seconds.)
20:55 πŸ”— SmileyG has joined #archiveteam-bs
20:56 πŸ”— Smiley has quit IRC (Read error: Operation timed out)
20:57 πŸ”— ola_norsk CoolCanuk: that cumulus media thing made my brain conjure up some silly idea https://pastebin.com/raw/32k6st0E
21:04 πŸ”— SmileyG has quit IRC (Ping timeout: 260 seconds)
21:04 πŸ”— dashcloud has joined #archiveteam-bs
21:07 πŸ”— Smiley has joined #archiveteam-bs
21:19 πŸ”— Kaz schbirid: any cpu activity from rclone?
21:20 πŸ”— BlueMaxim has joined #archiveteam-bs
21:20 πŸ”— schbirid i just straced it and it has connection time outs all over
21:57 πŸ”— schbirid has quit IRC (Quit: Leaving)
21:58 πŸ”— CoolCanuk should Wikia be moved to Fandom, or is it okay to redirect Fandom to Wikia?
22:00 πŸ”— ola_norsk JAA: i tried this wget command, wget -O /dev/null --header="Accept: text/html" --user-agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:21.0) Gecko/20100101 Firefox/21.0" --quiet --page-requisites "https://web.archive.org/save/https://twitter.com/hashtag/bogus?f=tweets" ..it's 100% quiet, though it doesn't seem to return more than using curl did.
22:01 πŸ”— ola_norsk JAA: i won't know until the captures show up on wayback though
22:09 πŸ”— JAA ola_norsk: You might want to write a log file to figure out what it's doing exactly. -o is the option, I think.
22:13 πŸ”— ola_norsk JAA: without -O it does make a directory structure. but it doesn't seem to contain image data
22:14 πŸ”— ola_norsk JAA: It seems to be just the same data, only then in e.g web.archive.org/save/https\:/twitter.com/hashtag/bogus\?f=tweets
22:15 πŸ”— ola_norsk in folders, i mean, instead of the (same?) data going to -O
22:17 πŸ”— JAA Hm
22:19 πŸ”— ola_norsk JAA: https://pastebin.com/FKu3mHbh this showes the structure of what it does
22:20 πŸ”— ola_norsk JAA: the 'hashtag/bogus?f=tweets' is the only file apart from robots.txt
22:20 πŸ”— JAA Right
22:20 πŸ”— noirscape has joined #archiveteam-bs
22:21 πŸ”— ola_norsk could Lynx browser be tricked into acting like a 'real' browser perhaps?
22:22 πŸ”— noirscape has quit IRC (Client Quit)
22:22 πŸ”— JAA I doubt it.
22:23 πŸ”— fie has quit IRC (Ping timeout: 633 seconds)
22:23 πŸ”— JAA Not sure why your command doesn't work.
22:23 πŸ”— JAA But yeah, a log file would help.
22:23 πŸ”— JAA Maybe with -v or -d even.
22:23 πŸ”— ola_norsk one sec
22:25 πŸ”— MrDignity has joined #archiveteam-bs
22:27 πŸ”— ola_norsk JAA: 'default output is verbose.' ..and there's quite little there i'm afraid :/
22:28 πŸ”— ola_norsk i'll see if there's some options that give better output
22:28 πŸ”— fie has joined #archiveteam-bs
22:30 πŸ”— JAA ola_norsk: "Not following https://web.archive.org/save/_embed/https://pbs.twimg.com/profile_images/848200666199629824/ZwvxQIzP_bigger.jpg because robots.txt forbids it."
22:30 πŸ”— JAA Fucking robots.txt
22:30 πŸ”— JAA It breaks everything. :-P
22:30 πŸ”— ola_norsk try setting --user-agent
22:31 πŸ”— JAA I did.
22:31 πŸ”— ola_norsk maybe it's a javascript thingy, that loads all the shit? :/
22:31 πŸ”— JAA I used your exact command.
22:31 πŸ”— JAA -e robots=off
22:31 πŸ”— ola_norsk hmm
22:35 πŸ”— shin has quit IRC (Quit: Connection closed for inactivity)
22:36 πŸ”— ola_norsk JAA: here is output from me running the command (Note, it's in norwegian :/ ) https://pastebin.com/awJ9j4D8
22:37 πŸ”— CoolCanuk (please correct me if i'm wrong)
22:37 πŸ”— ola_norsk JAA: could it be i'm using older wget or something?
22:37 πŸ”— JAA ola_norsk: With -e robots=off?
22:37 πŸ”— JAA Maybe, what version are you using?
22:37 πŸ”— JAA I'm on 1.18.
22:38 πŸ”— JAA I don't think it should matter too much though.
22:39 πŸ”— ola_norsk JAA: GNU Wget 1.17.1
22:39 πŸ”— ola_norsk sry, didn't notice the robots=off
22:40 πŸ”— JAA Hmm
22:40 πŸ”— JAA It seems that it doesn't work with -O /dev/null, interesting.
22:41 πŸ”— ola_norsk robots=off did something else indeed, but i'm guessing it didn't do much better than when you ran it
22:41 πŸ”— ola_norsk a slew of 404 errors appeared
22:42 πŸ”— JAA Yeah, I got a bunch of 404s as well, but not all requests were 404s.
22:44 πŸ”— ola_norsk --2017-12-02 23:41:17-- https://web.archive.org/save/_embed/https://abs.twimg.com/a/1512085154/css/t1/images/ui-icons_2e83ff_256x240.png
22:44 πŸ”— ola_norsk Connecting to web.archive.org (web.archive.org)|207.241.225.186|:443 ... connected.
22:44 πŸ”— ola_norsk HTTP request sent, awaiting response ... 404 Not Found
22:44 πŸ”— ola_norsk 2017-12-02 23:41:18 ERROR 404: Not Found.
22:44 πŸ”— ola_norsk is one png
22:44 πŸ”— JAA Yeah, that doesn't exist.
22:45 πŸ”— JAA But my command earlier grabbed https://pbs.twimg.com/profile_images/848200666199629824/ZwvxQIzP_bigger.jpg for example.
22:45 πŸ”— ola_norsk so, it's robots.txt on the endpoints that causes the failures?
22:45 πŸ”— JAA robots.txt at web.archive.org, yes.
22:45 πŸ”— JAA Ah, no.
22:46 πŸ”— JAA That's what causes wget not to retrieve the page requisites without -e robots=off.
22:46 πŸ”— ola_norsk no, i mean at e.g : abs.twimg.com ?
22:46 πŸ”— JAA Those 404s, not sure. Might just be broken links or misparsing.
22:47 πŸ”— ola_norsk damn internet, it's a broken big fat mess
22:48 πŸ”— ola_norsk cloudflare and shit
22:48 πŸ”— CoolCanuk which website are you trying to access that cloudflare wont let you
22:48 πŸ”— CoolCanuk I can possibly help get the true IP
22:49 πŸ”— ola_norsk it's to get the wayback machine to capture webpages, including images, by doing just a request
22:50 πŸ”— CoolCanuk oh :/
22:51 πŸ”— JAA HTML is a huge clusterfuck. Well, to be precise, HTML is fine, but the parsing engines' forgiveness is awful.
22:51 πŸ”— JAA And don't get me started on JavaScript.
22:51 πŸ”— ola_norsk CoolCanuk: i've messed up, thinking it would actually do captures by doing just that with automatic requests..but turns out it wasn't that easy :/
22:52 πŸ”— ola_norsk JAA: aye. Is it possible that twitter uses javascript to put in the images, AFTER the page is loaded?
22:52 πŸ”— JAA Definitely possible.
22:52 πŸ”— ola_norsk JAA: if so, i'm giving up even trying :d
22:53 πŸ”— JAA But at least part of it is not scripted.
22:55 πŸ”— JAA My test earlier grabbed https://pbs.twimg.com/media/DQDHMryX4AEseEo.jpg for example, which is an image from a post most likely (though I'm not going to try and figure out which one).
22:55 πŸ”— ola_norsk I think i'll just let the curl stuff run until the 14th, and let someone brigther than me figure it out in the future.
22:55 πŸ”— JAA Sometimes, I hate the WM interface. "3 captures" *click* only lists one.
22:56 πŸ”— ola_norsk one thing is images, but another is that basically all links on twitter are shortened links
22:57 πŸ”— JAA Yeah, but if you want to follow those, you'll definitely need more than that.
22:57 πŸ”— JAA I mean, it might work with --recursive and --level 1 or something like that.
22:57 πŸ”— JAA But it would really be better to just write WARCs locally and upload those to IA.
22:57 πŸ”— ola_norsk the t.co links do come with the actual link in the ALT= tag i think, not sure though
22:58 πŸ”— ola_norsk <a alt=> property i mean
22:58 πŸ”— JAA Never looked into them.
22:59 πŸ”— JAA What you're describing is more or less what I'm doing from time to time with webcams.
22:59 πŸ”— JAA I did that during the eclipse in the US in August, and I'm currently retrieving images from cams across Catalonia every 5 minutes.
23:00 πŸ”— JAA It's just a script which runs wpull in the background + sleep 300 in a loop.
23:01 πŸ”— JAA A cronjob might be cleaner, but whatever.
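
A sketch of that kind of loop, with the webcam URL and file naming as placeholders:

    #!/bin/bash
    # Grab a webcam image into a timestamped WARC every five minutes.
    while true; do
        wpull --warc-file "cam-$(date +%Y%m%d-%H%M%S)" 'https://example.org/webcam.jpg' &
        sleep 300
    done
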
23:02 πŸ”— ola_norsk with --recursive it does seem to take a hell of a lot longer..
23:02 πŸ”— ola_norsk and that's maybe a good sign
23:02 πŸ”— JAA Yeah, it's now retrieving all of Twitter.
23:02 πŸ”— JAA Well, maybe not all of it, but a ton.
23:03 πŸ”— * ola_norsk suddenly archive all of internets
23:03 πŸ”— JAA Solving IA's problems. Genius!
23:03 πŸ”— ola_norsk aye
23:03 πŸ”— ola_norsk maybe that level thing is not a bad idea :d
23:05 πŸ”— JAA :-P
23:05 πŸ”— ola_norsk any way i could limit it to let's say 1-2 "hops" away from twitter? :D
23:05 πŸ”— ola_norsk ...seriously, it's still going
23:06 πŸ”— ola_norsk it went from #bogus hashtag to shotting #MAGA..
23:07 πŸ”— JAA Yep, and it'll retrieve every other hashtag it can find.
23:07 πŸ”— ola_norsk aye
23:07 πŸ”— JAA It's the best recursion. Believe me!
23:07 πŸ”— ola_norsk 'recurse all the things!' lol
23:09 πŸ”— ola_norsk at the very least i think it needs some pause between these requests :d
23:09 πŸ”— zino stack exausted, core dumped.
23:10 πŸ”— ola_norsk it's doing bloody mobile.twitter.com now ..
23:10 πŸ”— ola_norsk nobody needs that
23:12 πŸ”— ola_norsk it's brilliant though :D , i just hope it did the images :D
23:12 πŸ”— JAA It did exactly what you told it to. :-P
23:13 πŸ”— ola_norsk that just proves computers are stupid :d
23:13 πŸ”— JAA Yeah, that or... :-P
23:14 πŸ”— ola_norsk the Illuminati did it
23:16 πŸ”— ola_norsk but, i'm thinking if it was limited to just 1-2 hops, even 1, that would be enough to get most images. Or?
23:17 πŸ”— JAA --page-requisites gets the images already.
23:17 πŸ”— JAA (But apparently only if you actually write the files to disk. My tests with -O /dev/null did not work.)
23:17 πŸ”— JAA You only need recursion with a level limit if you also want to follow links on the page.
23:17 πŸ”— JAA Which might make sense, retrieving the individual tweets for example.
23:18 πŸ”— ola_norsk could you pastebin the command you did that does image capture?
23:18 πŸ”— JAA But if you want to have any control over what it grabs (for example, not 100 copies of the support and ToS sites), it'll get complex...
23:18 πŸ”— JAA Uh
23:18 πŸ”— JAA Closed the window already, hold on.
23:19 πŸ”— ola_norsk the --recursion is violent :d
23:19 πŸ”— JAA It's awesome, you just need to know how to control it. :-)
23:19 πŸ”— jschwart has quit IRC (Quit: Konversation terminated!)
23:20 πŸ”— ola_norsk aye
23:21 πŸ”— ola_norsk as for any output, if i can't put it in /dev/null it'll go in a ramdisk that cleared quickly
23:21 πŸ”— ola_norsk that's
23:22 πŸ”— JAA Uhm, dafuq? https://web.archive.org/web/20171202231923/https:/twitter.com/hashtag/bogus?f=tweets
23:23 πŸ”— JAA That's my grab from a few minutes ago.
23:23 πŸ”— JAA Well, it did grab the CSS etc.
23:23 πŸ”— JAA I didn't specify the UA though. That might have something to do with it.
23:24 πŸ”— ola_norsk i'm not sure how they distrubute the requests between 'nodes'
23:24 πŸ”— JAA The command was wget --page-requisites -e robots=off 'https://web.archive.org/save/https://twitter.com/hashtag/bogus?f=tweets'
23:24 πŸ”— ola_norsk ty
23:24 πŸ”— JAA Regarding the temporary files: mktemp -d, then cd into it, run wget, cd out, rm -rf the directory.
23:24 πŸ”— JAA Five-line bash script. :-)
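
Spelled out, that wrapper might look like this (using the same example hashtag URL as above):

    #!/bin/bash
    tmp=$(mktemp -d)    # throwaway working directory
    cd "$tmp"
    wget --quiet --page-requisites -e robots=off 'https://web.archive.org/save/https://twitter.com/hashtag/bogus?f=tweets'
    cd - > /dev/null
    rm -rf "$tmp"       # nothing needs to be kept locally
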
23:28 πŸ”— ola_norsk JAA: sometimes i notice twitter.com requires login for anything. Maybe it varies by country. I'm not sure.
23:29 πŸ”— ola_norsk ty, gold stuff
23:29 πŸ”— JAA Yeah, Twitter's quite annoying to do anything with it at all.
23:29 πŸ”— JAA We still don't have a solution for archiving an entire account or hashtag.
23:30 πŸ”— ola_norsk they make money off of doing that
23:30 πŸ”— ola_norsk so they will not make it easy
23:31 πŸ”— ola_norsk if you're from a research institution, they would easily hand over hashtag archive from day0. For a lump of money, of course
23:32 πŸ”— JAA Yeah
23:36 πŸ”— CoolCanuk is there a mirror of the wiki we can use until it's stable?
23:36 πŸ”— JAA No, I don't think so.
23:37 πŸ”— JAA There's a snapshot from a few months ago in the Wayback Machine, I believe.
23:46 πŸ”— ola_norsk that command entails 1.7Megabytes of data :D what is the internet coming to?? lol
23:48 πŸ”— ola_norsk mankind doesn't deserve it :d
23:49 πŸ”— JAA "The average website is now larger than the original DOOM." was a headline a few years ago...
23:49 πŸ”— JAA web page* I guess
23:49 πŸ”— ola_norsk aye, i think just the fucking front page of my online bank is ~10MB :/
23:52 πŸ”— ola_norsk no wonder dolphins are dying from space radiation and ozone
