[00:03] *** Stilett0 has joined #archiveteam-bs [00:04] JAA: aye..trying to archive a twitter hashtag has taught me that :/ "There was a problem loading..(retry button)") [00:05] Yeah, Twitter's also pretty good at not letting you grab everything. [00:05] Reddit as well. [00:05] (We were having a discussion about that earlier in #archivebot.) [00:05] At least you can iterate over all thread IDs in a reasonable amount of time on Reddit though. [00:07] So it appears that you can get 10k results from the vid.me API. [00:07] i feel naughty doing curl requests to https://web.archive.org/save/https://twitter.com/hashtag/netneutrality?f=tweets , currently every 3rd minute :/ [00:07] You can do that for different categories, new/hot, and probably search terms (didn't try). [00:09] There are 17 categories plus hot, new, and team picks. In the ideal case, that means 20 sections times 10k results, which is still only about 1/7th of the whole site. [00:09] This is only about how to gather lists of videos and their metadata (uploader, description, etc.), not the actual videos. [00:09] (Videos are available as Dash and HLS streams.) [00:10] There are also tags, and of course you can retrieve all (?) of an uploader's videos. [00:10] JAA: As for twitter, i think one problem is that they would easily present an archive of ANYTHING, as long as they get paid for it. [00:10] For each tag, you get hot, new, and top videos. [00:11] ola_norsk: Yeah, probably. [00:11] JAA: most definitely [00:14] There's a "random video" link. We could hammer that to get videos. I don't want to do the math how many times we need to retrieve it to discover the vast majority of all videos right now though. [00:14] JAA: for a legal warrant, or a slump of money, they could present all tweets with any hastag, since the dawn of ti..twitter [00:14] Ah, I thought you were talking about vid.me now. [00:15] Yeah, there is a company which has an entire archive of Twitter, I believe. [00:15] ah, sorry, that was just a link regarding GOG Connect [00:16] Ah, you're not in #archiveteam. vid.me is shutting down on Dec 14. [00:16] That's why I'm looking into them. [00:16] really? that soon? [00:16] https://medium.com/vidme/goodbye-for-now-120b40becafa [00:16] wow, that's going to piss off alot of germans :D [00:17] ola_norsk: I was thinking about Gnip, by the way. Looks like Twitter bought them a few years ago. [00:18] "We’re building something new." .. [00:19] a.k.a "Trust us, we're not completely destroying this shit..We're building something new!".. [00:20] free image/video host "couldn't find a path to sustainability" [00:20] man, i actually thought vid.me had something good going [00:20] what a surprise :p [00:21] https://archive.org/details/jscott_geocities [00:24] wow, there's actually people who cancelled their youtube accounts after having used vid.me's easy export solution [00:24] and as far as i know, that shit might not be such easy to export back, since i don't think YT does import by url.. [00:25] og well [00:28] why not upload to both? <.< [00:28] aye [00:30] omglolbah: according to "SidAlpha", if you know that youtuber, he would'nt because it would mean he'd have to interact on several platforms.. [00:30] If only he had moved to vidme [00:31] that was his response to the request for that, not move, but upload there as well [00:31] no, I'm saying I wished he had moved so that he would be gone :p [00:31] oh [00:34] where does shit go if Youtube goes though? I mean, Google Video went to Youtube.. [00:36] Where did Yahoo Video go? 
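For the record, the math skipped above ("how many times we need to retrieve the random-video link") has a standard form: assuming each hit returns a uniformly random video out of n — an assumption, since the endpoint's actual behaviour isn't stated — discovering nearly all videos is the coupon collector's problem:

    E[\text{requests}] = n H_n \approx n \ln n, \qquad n \approx 1.4\times10^{6} \;\Rightarrow\; E \approx 1.4\times10^{6} \cdot 14.2 \approx 2\times10^{7}

Even settling for ~99% coverage still costs roughly n ln 100 ≈ 6.4 million requests, which is why enumerating videos through the API (as discussed below) is the far cheaper route.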
[00:36] Justin.tv became Twitch right? [00:37] Justin.tv created Twitch and then closed down. Nothing was automatically moved. I don't know if Justin had vods though. [00:37] aye [00:38] I was wrong about the vid.me API not returning all results. [00:38] The actual API does return everything, or at least nearly everything. [00:38] The "API" used by the website doesn't. [00:38] I just didn't find the real API docs previously. [00:38] https://docs.vid.me/#api-Videos-List [00:38] No auth required either. [00:38] You can get chunks of 100 videos per request. [00:38] \o/ [00:39] Do we have a death date? [00:39] It gets quite slow for large offsets, indicating that they don't know how to use offsets. [00:39] 14 Dec [00:39] :-/ [00:39] how to use indices* [00:39] indexes? [00:40] where would youtube go? nowhere. it's too big :p [00:40] I never know which plural's correct. [00:40] Frogging; aye [00:40] Frodding: we'll just show up with a tractor trailer "Load it all in back y'all" [00:40] The real API returns a bit more videos, by the way: 1360532. [00:40] (About 11k more, specifically.) [00:41] Might be the NSFW/unmoderated/private filter stuff. [00:41] bithippo: YouTube is around 1 exabyte. Have fun with that. [00:41] Well, at least that order of magnitude. [00:41] I used to manage hundreds of petabytes :-P [00:41] * ola_norsk shoves in in his usb stick and applies youtube-dl ! [00:41] lol [00:42] I'm sure someone from China will sell you a 1 EB USB stick if you ask them. [00:42] Well, "1 EB". [00:42] Which will quickly err out once a few GB have been written .... :( [00:42] Yep [00:42] i'll just save it all in /dev/null [00:42] Or not error out, just overwrite the previous data etc. [00:43] Depends. Some of them are cyclical, so you can write all you want as long as you don't try to read it. :) [00:43] Yep [00:43] I'm a fan of S4. [00:43] The Super Simple Storage Service. [00:43] http://www.supersimplestorageservice.com/ [00:44] That pricing is a bargain. [00:44] bithippo: Interesting. What did you work with that included 100s of PiBs? I deal in 10s of them. [00:44] Data taking for LHC detector [00:44] Ooh, nice! [00:45] Only a couple hundred TB of spinning disk on storage arrays, the rest were tape archive libraries. [00:45] bithippo: Ah. I sort of do that on the sly. Part of our storage is for the Nordic LHC grid. [00:45] #TeamCMS [00:46] I deal mostly with crimate data though. Have a few petabytes of that. [00:46] That's awesome. [00:46] I <3 big data sets [00:47] Indeed :) [00:47] @ola_norsk If you're interested in how to make something be emulated on IA, here's some pages that lay it out for you- http://digitize.archiveteam.org/index.php/Internet_Archive_Emulation http://digitize.archiveteam.org/index.php/Making_Software_Emulate_on_IA [00:49] dashcloud: ty, i'm thinking there must be ways. If there's dosbox, there's e.g Frodo that could run in that.. [00:50] I've done a bunch of DOSBOX games, and there's a whole collection of emulated DOS/Win31/Mac Classic stuff up [00:51] ola_norsk: What, the C64 emu? [00:51] yes [00:51] No, nonononono. Go helt the jsmess people get Vice running instead. [00:52] i was hoping that was already done [00:52] I know it's started. [00:52] good stuff [00:52] But it might be stalled forever for all I know. [00:54] i have no idea about these things, but it would be cool to see C64 on Internet Arcade [00:54] JAA: I'll have very little time to do anything before the 9th, and probably not much after either, but ping me if storage is needed for vid.me. 
[00:55] dashcloud, ill try to make an item per that, using dosbox [00:55] ty for info [00:56] zino: Will do. I'll set up a scrape of the API first to get all the relevant information about the platform. Then we'll see. [00:57] if your software needs installation or configuration before the first run, you'll want to do that ahead of time [00:57] scrape/archive, whatever. That's the information we can save for sure. [00:57] Unless they ban us... [00:58] Using minVideoId and maxVideoId might be faster than the offset/limit method, especially for the later pages. [00:59] Current video IDs are slightly above 19 million, so that's around 190k requests (to be sure no videos are missed). [01:03] attending my first 2600 meeting [01:05] so the thing with vidme, there's a bunch of original stuff [01:06] there's a little bit of lewd stuff (they ban outright porn, but they do permit "artistic" nsfw) [01:06] and then there's a bit of it that consists of reuploads of copyrighted stuff [01:06] OK [01:06] not that I think it'll be a big deal since IA can just dark the affected stuff if someone does come yelling, but something to keep in mind [01:07] Sounds more or less what I'd expect. [01:07] like* [01:09] I can't find any information about API rate limits, except this Reddit thread: https://redd.it/6acvg5 [01:11] *** icedice has quit IRC (Quit: Leaving) [01:18] *** Ceryn has quit IRC (Connection closed) [01:26] "The Internet is Living on Borrowed Time" .. https://vid.me/1LriY (ironically on vid.me) ..That's pretty dark title, for being Lunduke :d [01:33] To be fair, it's also available on YouTube: https://www.youtube.com/watch?v=1VD_pJOFnZ0 [01:54] thats not fair :D [01:54] i think most of his vids are also on IA :d [01:55] but yeah [01:57] seriously though. I imagine there's a shitload of german vidme'ers currently bewildered as to what to do.. [01:59] a lot of people used the url importing at vidme, thinking they would simple move their entire channels.. [02:00] from what i've heard tales of, germany youtube is not the same youtube as everywhere elsetube [02:07] GEMA blocks a fuckton of music there [02:07] aye [02:09] ranma: is that the only reason though? There were so many germans coming to vid.me it was made a video about it.. [02:12] JAA: how do you get your data OUT of S4? [02:12] and what are the costs? [02:12] s4? [02:13] ranma: "German INVASION"...100k creators..https://vid.me/JjNaH [02:13] oh, it's a joke :'( [02:14] i hate slow internet. ml [02:14] *fml [02:14] *** phuzion has joined #archiveteam-bs [02:18] ranma: does it simply block ALL music? i can't see any other reason for such a noticable influx and flight of users [02:19] ranma: It's actually hard to browse vidme because of it at times, since often 1 in 2 videos on the feed is german [02:22] kinda wish some site could ZIP/7z another site [02:22] just noticed archivebot slurped down https://ftp.modland.com/ [02:23] did it *completely* slurp modland? [02:23] dd -i http://google.com -o http://bing.com [02:23] Muad-Dib: Your job for https://ftp.modland.com/ has finished. [02:25] actually, not that i'd have the space for it, tho [02:25] *** ola_norsk has quit IRC (its the beer talking) [02:26] does anyone have a great upload script for ia? their docs are too much for me to understand and uploading 1 by 1 is painful [02:34] for anyone wanting to mirror vid.me, its possible to page everything there: https://api.vid.me/videos/list?minVideoId=100&maxVideoId=1000 [02:34] just step the min/max (its easier on the db). [02:34] ..... 
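A minimal sketch of the min/max paging approach just described: step https://api.vid.me/videos/list in windows of 100 IDs up to the ~19 million mark, which works out to roughly 190k requests. Whether minVideoId/maxVideoId are inclusive is not confirmed in the chat, so the window boundaries are an assumption, as is the one-second politeness delay (no published rate limit was found).

    #!/bin/bash
    # Sketch only: boundaries may be off by one depending on whether
    # minVideoId/maxVideoId are inclusive -- check against the API docs.
    MAX_ID=19000000   # current video IDs are "slightly above 19 million"
    STEP=100          # the API returns at most 100 videos per request

    for ((lo = 1; lo <= MAX_ID; lo += STEP)); do
        hi=$((lo + STEP - 1))
        curl -s "https://api.vid.me/videos/list?minVideoId=${lo}&maxVideoId=${hi}&limit=${STEP}" \
            > "videos_${lo}_${hi}.json"
        sleep 1   # be gentle with their DB, per the "easier on the db" note above
    done

In practice the responses are being recorded to WARC rather than loose JSON files; this loop only illustrates the ID windowing.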
https://usercontent.irccloud-cdn.com/file/PZalOsZ6/image.png [02:35] JAA: ^ [02:35] our wiki is more stable than this beta-like system :P [02:39] CoolCanuk: What do you mean by "upload script for ia"? [02:40] Such as https://github.com/jjjake/internetarchive ? [02:40] an easier way [02:41] eg I can loop is for 100s of files in a folder, but upload as 100 items. [02:41] That repo is your best bet for that sort of operation. [02:42] What sort of files and metadata? [02:42] pdf [02:43] currently, newspapers [02:43] and sears crap [02:44] Hmm [02:46] The two routes would be "web interface", which gives you a nice interface and shouldn't be too painful if you're putting up each folder as an item (with all of the files contained within that folder attributed to the item). Failing that, you'd need some light python or bash scripting skills to pickup up files per item, associate metadata with each item, and upload. [02:46] I could be wrong of course! But that's my interpretation based on working with the IA interfaces. [02:47] tbh, IA interface is just plain atrocious to use [02:47] Indeed. [02:47] i suppose thats artificial barrier of entry on purpose to avoid people uploading crap [02:48] I guess [02:48] I'm uploading stuff I know will probably not be found anywhere else [02:49] yea, the commitment to jump the hoops is paired with commitment to curate content [02:50] Only think I'm worried about is repetitive strain injury [02:52] *** wp494_ has joined #archiveteam-bs [02:59] *** wp494 has quit IRC (Read error: Operation timed out) [03:03] *** ld1 has quit IRC (Quit: ~) [03:06] *** ld1 has joined #archiveteam-bs [03:19] why does IA have a difficult time using the FIRST page of a pdf as the COVER >:( [03:41] *** wp494_ is now known as wp494 [03:47] X-posting from #archiveteam: if you're using youtube-dl to grab vid.me content, be aware of this issue: https://github.com/rg3/youtube-dl/issues/14199 [03:47] tl;dr: their HLS streams return a data format youtube-dl doesn't fully handle resulting in corrupted output files [03:48] Use a workaround in the 2nd to last comment to force youtube-dl to grab from the DASH endpoints instead [03:54] posting highlights of https://www.youtube.com/watch?v=KMaWSinw4MI&t=41m33s here [03:55] first one being that linus has significant disagreements with senior management, and especially NCIX's owner [03:55] which seems to be a very common theme [03:56] he also left NCIX because the people he mentored departed [03:56] says he thinks some were forced out because of extraordinarily poor management decisions [03:57] (in his opinion) [03:57] I'm reading this https://np.reddit.com/r/bapcsalescanada/comments/77h771/for_anyone_that_purchased_a_8700k_from_ncix/domm2ca/?context=3 [04:09] *** josho493 has joined #archiveteam-bs [04:09] linus pitched what sounded like a pretty good idea, try and get bought, but how? 
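On the "upload script for ia" question above: the jjjake/internetarchive repo linked earlier ships an `ia` command-line tool, which is probably the least painful route for turning each folder of PDFs into one item. A rough sketch, assuming one item per folder, folder names that are valid IA identifiers (ASCII, no spaces, globally unique), and placeholder metadata that would need to be filled in for the actual newspapers:

    #!/bin/bash
    # Requires: pip install internetarchive && ia configure   (stores IA credentials)
    # Each subdirectory becomes one item; all PDFs inside it are attached to that item.
    for dir in */; do
        identifier="${dir%/}"              # e.g. "some-newspaper-1954-06-01" (placeholder)
        ia upload "$identifier" "$dir"*.pdf \
            --metadata="mediatype:texts" \
            --metadata="title:${identifier}"   # placeholder metadata; adjust per item
    done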
his solution was to open "NCIX Lite"s across the country which would be really small pickup places that you could ship to since shipping direct to your home sometimes killed the deal [04:10] said that the writing was on the wall as early as 7 years ago (before Amazon was doing pickup) to anyone actively paying attention, so the idea would've been to attract someone similar to Amazon if not Amazon themselves using that infrastructure when they wanted to gobble someone Whole Foods style [04:12] linus said when management didn't do that, he said it became obvious that he had to GTFO [04:12] he says he hasn't been screwed over personally by Steve (the owner) and his wife unlike some of the other horror stories going out [04:16] he wound up signing a non-compete for 2 years (which got extended by 1) [04:17] when he left he took the LTT assets, and did it on paper (and was glad he did), because even though he wouldn't think Steve would do anything untoward to him, creditors are sharks looking for their next kill [04:19] and that's about it [04:27] why do people bother with fairly standard eshop drama, was ncix the canadian amazon or something? [04:27] More like Canadian Newegg... before Newegg moved into Canada [04:28] It was *the* place to go for computer parts online, from what I understand [04:28] ah [04:28] razor thin margins, yea we have that locally too [04:28] all with fake "in stock" stickers where you wait 2 weeks and everything [04:30] *** qw3rty115 has joined #archiveteam-bs [04:34] *** qw3rty114 has quit IRC (Read error: Operation timed out) [04:38] they had a location here in Ottawa. I used to shop there until they closed it [05:07] *** josho493 has quit IRC (Quit: Page closed) [05:09] defunct as of today :o [05:09] *yesterday [05:11] *** Mateon1 has quit IRC (Ping timeout: 245 seconds) [05:12] *** Mateon1 has joined #archiveteam-bs [05:15] am I the only one who doesnt really see the big deal of google home/mini or amazon alexa? [05:29] *** ranavalon has quit IRC (Read error: Connection reset by peer) [05:36] there's a big deal? [05:38] *** shindakun has joined #archiveteam-bs [05:38] *** Jcc10 has joined #archiveteam-bs [05:42] CoolCanuk: we're all waiting for amazon to give access to alexa transcripts to app devs [05:42] so we can start archiving every little embarassing thing anyone has ever said [05:42] which of vidme's logos should I use for the article, the wordmark or their "astro" mascot [05:42] https://vid.me/media [05:48] the one on the main page (red) [05:48] wordmark it is [05:48] sadly cant be eps or svg :( [05:49] gonna resize it a little otherwise it'll appear about as big in a warrior project [05:49] or we could fix the template [05:49] wait what do you mean [05:50] lemme go dig through the spuf logs to show you [05:50] (come to think of it I'm not even sure if I took an image, I might have just pull requested and moved on) [05:51] our {{Template project}} should be fixed to a larger logo size [05:51] using it online is not an issue, because we can dynamicly resize [05:51] http://tracker.archiveteam.org/ [05:52] yeah there it could benefit from being a touch bigger at least for logos that are rectangles instead of squares [05:52] apparently we can't... :| [05:52] (it seems to like squares the best) [05:52] "benefit"? [05:52] distortion? 
[05:52] and yeah, I was about to say, our copy of mediawiki isn't quite as flexible as wikimedia's where you can stuff in any number and it'll spit it out for you [05:52] even ridiculously large ones like 10000px [05:52] I just noticed that. that's too bad [05:53] another reason to use SVG. [05:54] even SVGs too [05:54] no. SVGs are not raster [05:55] you can blow them up to 1000000000px and it will never distort unless you have embedded rasters [05:57] https://upload.wikimedia.org/wikipedia/commons/3/35/Tux.svg [06:02] ok I was gonna recreate an example with SPUF but there's a live one that I can get you right now [06:03] see how the miiverse logo goes a bit out of its bounds and pushes content downwards: https://i.imgur.com/P3Wcfbp.png [06:03] ew [06:03] logo should be within that white div, not yellow [06:04] (within, not overlaid) [06:04] now take the version of the steam icon we had stored on the wiki and stuffed into the project code (http://www.archiveteam.org/images/4/48/Steam_Icon_2014.png) and it wound up being a bit worse than that example [06:04] luckily a 100px version that mediawiki gracefully generated more or less solved things: https://github.com/ArchiveTeam/spuf-grab/pull/2/commits/1c319d3d144cc13599f1fe571e699ca8b3d79e60 [06:04] not the image's fault, it's the tracker ;) [06:05] afaik tracker main page was ok [06:05] how could it be ok [06:05] note how it looks like it's fine on http://tracker.archiveteam.org/ [06:05] simply use max-width for img in css [06:05] *height [06:06] but with that said scroll bars do appear [06:06] then you need to [06:06] overflow: hidden [06:06] but it's nothing near as annoying as the in-warrior example, though still a nuisance albeit very minor [06:07] I will fix it [06:08] k so a 600 x 148 version will go up on the wiki [06:08] and then if it causes problems we can grab a 100px url [06:08] for project code [06:09] we have or [06:09] **or [06:09] just use max-height: 100px [06:09] ;) [06:09] ok project page is going up [06:09] lol how did it let you upload file name with a space :P [06:09] it makes me use _ [06:10] it does insert a _ [06:10] the recent changes bot treats it as a space though [06:11] but for actually using the filename you're going to need to use underscores [06:11] o [06:11] aw crap I'm getting spam filtered and I don't even get a prompt to put in the secret phrase [06:12] oh well let's see if this workaround of inserting a space in the url works [06:12] heh [06:12] SHHHH that's supposed to be a secret :x [06:13] ok wow that apparently worked [06:15] i'll fix it for ya [06:15] gl with the filter [06:15] oh you fixed it [06:15] I was surprised I was even able to toss such a tiny little stone at that goliath [06:18] ok that's a solid foundation I think [06:19] huh [06:20] I have a workaround :P [06:21] *** slyphic has quit IRC (Read error: Operation timed out) [06:21] I got a 508 clicking that purplebot link [06:21] godane: What does "WOC" mean with the MPGs? [06:21] resource limit reached [06:22] this 208 error will be the death of me [06:22] *508 [06:23] connection timed out now... [06:23] same here ughhhh [06:24] ffff [06:24] impossible to eidt [06:25] oh finally [06:25] there must be more than just "shared hosting" being the problem [06:28] can the topic in #archiveteam changed from Compuserve to vidme? 
lmfao [06:28] *be changed [06:32] if it gets pointed out a few times like with compuserve then someone will probably do it [06:32] if it's just once or twice more then it's no big deal just say "yeah we're on it" [06:33] fair [07:06] making up a tag for vidme is going to be tricky. it's so short.. hard to come with a spinoff [07:42] *** Pixi has quit IRC (Ping timeout: 255 seconds) [07:42] *** Pixi has joined #archiveteam-bs [07:44] *** BlueMaxim has quit IRC (Ping timeout: 633 seconds) [07:45] *** BlueMaxim has joined #archiveteam-bs [09:05] *** Dimtree has quit IRC (Peace) [09:11] *** Dimtree has joined #archiveteam-bs [09:55] *** fie has quit IRC (Ping timeout: 245 seconds) [10:11] *** fie has joined #archiveteam-bs [10:16] *** CoolCanuk has quit IRC (Quit: Connection closed for inactivity) [10:27] *** BlueMaxim has quit IRC (Read error: Connection reset by peer) [10:35] *** schbirid has joined #archiveteam-bs [11:08] *** fie has quit IRC (Ping timeout: 246 seconds) [11:21] *** fie has joined #archiveteam-bs [11:33] *** bithippo has quit IRC (My MacBook Air has gone to sleep. ZZZzzz…) [11:44] *** jschwart has joined #archiveteam-bs [12:11] ez: Yep, that's what I came up with yesterday as well. You can either iterate min/maxVideoId in blocks of 100 with limit=100 or implement pagination. I'd probably go for the former, i.e. retrieve video IDs 1 to 100, 101 to 200, etc. (need to figure out whether these parameters are exclusive or not though). [12:17] *** MangoTec has joined #archiveteam-bs [12:41] my god [12:41] the best thing I've ever heard just got tweeted [12:41] @ElonMusk: Payload will be my midnight cherry Tesla Roadster playing Space Oddity. Destination is Mars orbit. Will be in deep space for a billion years or so if it doesn’t blow up on ascent. [12:57] Elon knows how to put on a show. [12:57] Yep [12:58] I mean, he thinks its going to blow, they didn't want to make a real payload... so fuck it send a Car [13:01] At this time I'll recommand the old Top Gear episode where they convert a car to a space shuttle and blast it off with rockets. [13:01] recommend* [13:24] *** MangoTec has quit IRC (Quit: Page closed) [13:43] hetzner's auctions seem to have dropped in price a lot, 1/3 aka -10€ for what i have [13:43] https://www.hetzner.com/sb [13:44] nvm, had fucking US version without VAT =( [13:49] https://medium.com/vidme/goodbye-for-now-120b40becafa [13:49] https://medium.com/vidme/goodbye-for-now-120b40becafa [13:49] https://medium.com/vidme/goodbye-for-now-120b40becafa [13:49] What the fuck!! [13:49] Okay you know about it [13:49] but what the actual fuck! [14:10] people are finding out its REALLY hard to make a video website [14:16] It's easy to make a video site, it's just hard to monetise it, mediacru.sh was the best in terms of technology in my opinion but they didn't manage to monetise either. [14:16] *** ranavalon has joined #archiveteam-bs [14:17] *** ranavalon has quit IRC (Remote host closed the connection) [14:17] I'm collecting video ids from reddit anyways, heads up the bulk of the older urls (and possibly new ones) are going to be reddit porn related. [14:17] *** ranavalon has joined #archiveteam-bs [14:18] wait, youtube is still operating at loss [14:18] why the FUCK are people making so much money on their ad share then? [14:18] *? 
[14:21] *** voidsta has joined #archiveteam-bs [14:22] Google isn't operating at a loss, so they can keep YouTube afloat and keep trying new things to pump up their bottom line, which is why we see a new yt related shit storm every other week, yt may as well be called YouTube[beta] or YouTube[this is an experiment] [14:23] YouTube{incredible journey] [14:24] Though because it's Google ad because there is no real competition for them making any real headway we can talk like yt is 'never' going to close doors, or turn their service off, but it'll come, maybe not today, maybe not in 5 years, but it'll come when we're 'what the fucking' at a Google blog post announcing there coming plans to phase out YouTube or just turn it off. [14:26] Hopefully that comes at a time 500PB* is nothing and something we can grab in a few months [14:26] ... except YouTube will be 10 EB by then. [14:27] wait what [14:27] vimeo is dead? [14:28] vid.me I though [14:28] vid.me [14:28] ffs it looked very close to vimeo [14:28] It's an odd time we're living in when we first started 10TB was insane to think we could get, now we're doing sites nearing 300TB without a great deal of thought, we're scaling pretty well with the times I suppose, but how long before ia close doors and we have to find somewhere to put that? (I know we're talking about it...) [14:29] Ugh [14:29] if IA ever goes bust [14:29] so [14:29] we have 2 weeks for vidme [14:29] yup [14:29] I'm setting up an API scrape right now. [14:30] probably needs a channel, not sure how big it is [14:30] #vidmeh [14:30] 1.3x million videos [14:30] vidwithoutme, vidnee, vidmeh [14:31] vidmeh will do [14:31] This will almost need to be a warrior project. We can probably fix storage, but there is no way we can download this in time using a script-solution unless someone buys up Amazon nodes to do it. [14:32] JAA: Any idea what the average size of a video is? [14:32] zino: I haven't looked at the videos themselves at all yet, only the metadata. [14:32] The API returns a link to download the videos as an MP4, by the way. [14:32] The website uses Dash/HLS. [14:35] Those MP4s are hosted on CloudFront, by the way, i.e. Amazon. That could be annoying. [14:51] wiki is slow as balls [15:07] *** voidsta has left [15:08] zino: I've been scraping a few channels and here's what I've seen so far. Their highest quality is 2 mbps video (at 1080p or 720p depending on the original resolution) with audio between 128kbps and 320 kbps(!) [15:09] SD-quality video is around 1200 kbps [15:09] Ugh [15:10] thats not too bad overall [15:10] And I'm grabbing with youtube-dl's "bestvideo+bestaudio" option, if storage/bandwidth becomes an issue they have lower-quality versions we could grab instead [15:11] Na [15:11] We have da powerrrr [15:11] right now I'm working on the grabber, mostly just going to mod eroshare-grab [15:12] Some files are randomly capped at 150 KB/s download while others will saturate my 50 mbit connection [15:12] the channel pages are going to be interesting since they scroll load type [15:12] As long as the URLs for those follow a pattern that shouldn't be too hard [15:13] Oh, I just noticed there's a channel, #vidmeh [15:13] ya [15:23] Nothing is bloody working [15:24] for what [15:24] I've spent all day trying to get my proxmox cluster sorted [15:30] dat CDN [15:35] *** Jcc10 has quit IRC (Ping timeout: 260 seconds) [15:39] hay JAA your pulling all the APIs, are you saving all the reposes so we can get the raw URL for the videos? 
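For reference, the "bestvideo+bestaudio" grab mentioned above looks roughly like the following. The output template and the info-JSON/thumbnail flags are standard youtube-dl options added here for completeness, not something specified in the chat; and per the youtube-dl issue cross-posted earlier, the HLS formats can come out corrupted, so the DASH/progressive formats are the ones to prefer.

    # Sketch: grab one vid.me video at the highest quality, keeping metadata alongside.
    youtube-dl \
        -f "bestvideo+bestaudio" \
        --write-info-json --write-thumbnail \
        -o "vidme/%(uploader)s/%(id)s - %(title)s.%(ext)s" \
        "https://vid.me/1LriY"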
[15:39] reponses* [16:01] Yeah, of course I save them. To WARC, specifically. [16:04] *** kristian_ has joined #archiveteam-bs [16:29] *** CoolCanuk has joined #archiveteam-bs [16:33] *** fie has quit IRC (Ping timeout: 360 seconds) [16:43] *** shin has joined #archiveteam-bs [16:44] *** fie has joined #archiveteam-bs [17:06] don't know if it will help but i was made a brute force video/metadata downloader for vidme https://github.com/shindakun/vidme i don't really have the bandwidth or storage to let it run though [17:07] you guys already have a lot of tooling though [17:07] No need to bruteforce, we can get a list of all videos through their API. [17:07] shindakun: /join #vidmeh [17:07] (I'm doing that currently.) [17:08] that's basically what it does sort of... i found some seemed to be unlisted so i request details for every videoid [17:08] off to vidmeh lol [17:08] Right. There's an API endpoint for getting lists of videos though, so you don't have to run through all ~19M IDs. [17:09] You can do it with 190k requests. With further optimisation, it might be possible to decrease that even further, but that's a bit more complex. [17:09] *** ola_norsk has joined #archiveteam-bs [17:11] made a test C64/dosbox emulator item (https://archive.org/details/iaCSS64_test) , but it seems very slow. At least on my potato pc. [17:13] unfortunatly i'm no ms-dos guru. But might there be a way to optimize speed trough some dos utilites/settings that could reside in the zip file? [17:14] You are emulating in two layers. It's not going to be fast, or accurate. [17:16] yeah it's kind of emu-inception :d But, could fastER be done perhaps? [17:17] i did try it in Brave browser as well as Chromium, and Brave seemed to run it a bit better. [17:17] and my pc is kind of shit [17:21] /join #vidmeh [17:21] ahem [17:27] *** Stilett0 has quit IRC (Ping timeout: 246 seconds) [17:29] ahhhhh. CLEVER [17:32] *** Pixi has quit IRC (Quit: Pixi) [17:38] *** kristian_ has quit IRC (Quit: Leaving) [17:45] *** mundus201 is now known as mundus [17:46] *** Pixi has joined #archiveteam-bs [18:08] How can I automatically save links from an RSS feed onto the wayback machine? [18:08] *** pizzaiolo has joined #archiveteam-bs [18:14] i'd use something like this http://xmlgrid.net/xml2text.html . then get rid of the non urls in excel/google sheets. [18:15] Ew [18:15] then upload your list of urls to pastebin, get the raw link. in #archivebot , use !ao < PASTEBINrawLINK [18:15] you got a better idea, JAA ? :P [18:15] if you have the links in a list; curl --silent --max-time 120 --connect-timeout 30 'https://web.archive.org/save/THE_LINK_TO_SAVE' > /dev/null , is a way to save them i think [18:15] Grab the feed, extract the links (by parsing the XML), throw them into wpull, upload WARC to IA. Throw everything into a cronjob, done. [18:16] o ok [18:16] I suspect he's looking for something that doesn't require writing code though. [18:16] most users are :P [18:16] also why curl? cant we just use HTTP GET? [18:17] that's what curl does [18:17] That's what curl does. You could also use wget, wpull, or whatever else. [18:17] Hell, you could do it with openssl s_client if you really wanted to. [18:18] And yeah, you can obviously replace the "throw them into wpull, upload WARC to IA" with that. [18:18] oh.. I thought curl downloads the web.archive.org page as well [18:18] It wouldn't grab the requisites though, I think. [18:18] CoolCanuk: That's exactly what it does, and it triggers a server-side archiving. 
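A minimal sketch of the RSS answer given above (grab the feed, extract the links, fire each at the /save/ endpoint, run the whole thing from cron). The link extraction is quick and dirty and assumes a plain RSS 2.0 feed with `<link>` elements; the feed URL and the five-second delay are placeholders. As noted just below, /save/ used this way archives only the single URL, not its page requisites, so wpull + WARC upload remains the better archival route.

    #!/bin/bash
    # Sketch: push every <link> from an RSS feed to the Wayback Machine's /save/ endpoint.
    FEED_URL="https://example.org/feed.xml"   # placeholder feed

    curl -s "$FEED_URL" \
        | grep -oE '<link>[^<]+</link>' \
        | sed 's/<[^>]*>//g' \
        | while read -r url; do
            curl -s --max-time 120 --connect-timeout 30 "https://web.archive.org/save/${url}" > /dev/null
            sleep 5   # don't hammer the endpoint
          done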
[18:19] unhelpful if you have a bad internet connection and don't want to download the archive.org page every request :P [18:19] idk :d i just use that as cronjobs to save tweets https://pastebin.com/raw/ZE4udKTi [18:20] no page requisites are saved when you use /save/ like that [18:20] only the one URL you have after /save/ [18:20] no images, or other stuff from the page is saved [18:20] doh [18:21] (which is probably fine for net neutrality.. it should mostly be text/links to othe rsites) [18:21] if there are any images, it's likely already been posted before [18:22] you can't see what picture is on a page if it's not saved [18:22] no matter how many times the picture might have been saved in other places acros the web [18:23] you can't see pictures that are still online? [18:23] twitter also uses their damn tc.co url shortening [18:23] I think we save things in case they go offline [18:24] <3 [18:25] (I hope that wasn't passive aggressive) :( [18:26] * arkiver isn't an aggressive person :) [18:27] aggressive at archiving :P [18:27] hehe [18:27] :) [18:28] i've been running those cronjobs since the 26th (i think). Should i perhaps just halt that idea then, or might it be useful data for someone else to dig trough? At least the text and links are there i guess.. [18:29] was planning to run them until the netneutrality voting stuff is over on the 14th(?) [18:29] text is always useful [18:30] Definitely better than nothing. [18:30] I believe the data from Alexa on IA also does not include pictures [18:30] but I'm not totally sure about that [18:34] i'm just going to let it run then [18:34] What does the /save/ URL return exactly? Are the URLs for page requisites also replaced with /save/ URLs? [18:35] If so, it might be possible to use wget --page-requisites to grab them. [18:35] one sec [18:38] https://pastebin.com/raw/dJrVbnpr [18:39] that's what i get when running: curl -H "User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/62.0.3202.89 Chrome/62.0.3202.89 Safari/537.36" --silent --max-time 120 --connect-timeout 30 'https://web.archive.org/save/https://twitter.com/hashtag/netneutrality?f=tweets' [18:40] Yep, everything there is also replaced with /save/ URLs. [18:40] So give wget --page-requisites a shot if you want. [18:40] (Plus a bunch of other options, obviously.) [18:40] JAA: yes [18:41] ok [18:41] I believe embed are replace with a /save/_embed/ URL and links with a /save/ URL [18:41] Yep [18:42] by 'other options' do you mean just to make it run quiet? [18:44] Yeah, and making it not write the files to disk. [18:44] ok [18:45] Not sure what else you'd need for this. [18:45] me neither unfortunatly, i had to browse a bit just to learn that much curl :d [18:45] but i'll check it out [18:52] i did ask info@archive.org if it's ok to do the curl commands so frequent (every 3-5 minute), but no response back yet. [18:53] i just hope they won't suddenly go 'wtf is this!?' and block me :d [19:04] *** ZexaronS has joined #archiveteam-bs [19:08] no [19:08] it's just one URL that's saved per curl command [19:08] https://archive.org/details/liveweb?sort=-publicdate [19:09] the number of URLs per item in there is a lot higher than how many you are saving in a day [19:12] as long it's fine with IA i'm good [19:13] arkiver: could there be a way to 'retro-crawl' the tweets i've already saved? 
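For completeness, the "every 3-5 minutes" cron setup being described boils down to a crontab entry along these lines. The pastebin contents aren't reproduced here, so treat this as a reconstruction from the curl command quoted above, with the caveat already discussed: called this way, /save/ captures only the one URL, no images.

    # crontab -e
    # m h dom mon dow  command
    */5 * * * *  curl --silent --max-time 120 --connect-timeout 30 'https://web.archive.org/save/https://twitter.com/hashtag/netneutrality?f=tweets' > /dev/null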
[19:14] to get the images to load into the saves, i mean [19:14] *** Stilett0 has joined #archiveteam-bs [19:15] this is the mail i wrote on the 27th btw: https://pastebin.com/AV1vbKUr [19:16] I'm sure they're fine with it [19:16] good stuff [19:16] let me know if anything goes wrong [19:16] ok [19:17] with the 'retro-crawl', I guess you could get the older captures, get the URLs for the pictures from those and save those [19:17] but you can't really /save/ an old page again [19:17] or continue a /save/ or something [19:19] ok. I'm guessing at least some number of the tweets are bound to have become deleted by the users themselves (or banned user accounts). [19:19] If you visit the pages, it should grab any images that aren't in the archives already. [19:20] So I guess you could make your browser go through all those old crawls. [19:20] ouch [19:20] but yeah, that is what i meant :d [19:20] Or perhaps it would work with wget --page-requisites as well, not sure. [19:21] i'll rather try that than sit scrolling in my browser :D [19:32] opening a capture in the browser does not seem to work to pull the images https://web.archive.org/web/20171130120002/https:/twitter.com/hashtag/netneutrality?f=tweets [19:32] only user avatars etc seems to be present [19:36] and those f*cking t.co links...pissing me off :/ [19:51] *** dashcloud has quit IRC (Read error: Operation timed out) [19:52] *** dashcloud has joined #archiveteam-bs [19:58] I'm not American and articles aren't helping... how fast is Cumulus Media declining ? [19:58] This looks like quite the "portfolio" https://en.wikipedia.org/wiki/List_of_radio_stations_owned_by_Cumulus_Media [20:01] does gdrive use some kind of incremental throttling for uploads? i am down to 1.5MB/s now :( [20:01] and it seems quite linear over time [20:01] *** bitspill has joined #archiveteam-bs [20:02] CoolCanuk: https://www.marketwatch.com/investing/stock/cmlsq ..Not sure if it's really indicative though [20:02] omg [20:02] 0.095?! [20:03] iHeartRadio also seems troubled [20:04] however, iHeartRadio in Canada is likely not impacted, since I'm pretty sure Bell purchased rights to use it and it's a crappy radio streaming app for Bell Media radio stations- not true iHeartRadio [20:06] CoolCanuk: All i see is the slope going down :d https://www.marketwatch.com/investing/stock/cmlsq/charts That's basically the max of my knowledge about stocks and shit :d [20:07] same here [20:14] *** SimpBrain has quit IRC (Remote host closed the connection) [20:18] CoolCanuk: a friend of mine who unfortunitaly passed away in 2015 once showed me daytrading thingy software. If i remember correctly the only thing that differed from the free API testing was that all the data was delayed [20:20] CoolCanuk: it wouldn't be useful for trading, but perhaps for alerting about online services going to hell [20:28] schbirid: i think there's a limit of 750GB/day uploaded? [20:29] if you're close to that, could explain things [20:29] ah, maybe [20:30] nope... today is just at "Transferred: 104.014 GBytes (1.540 MBytes/s)" [20:31] schbirid, any packet loss? [20:31] no idea, how do i check? [20:31] ping maybe [20:32] Well, step one: Be on linux (and run the upload from the same machine), step two: run "mtr hostname.here" [20:32] no idea what the hostnames for gdrive are [20:32] oh yeah mtr that's better [20:32] mtr rules [20:32] Step 0: Install iftop and check what address all your data is going too. :) [20:32] to* [20:33] duh, i feel dumb [20:35] Don't. 
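On the "retro-crawl" idea discussed above: the Wayback CDX API isn't brought up in the chat, but it is one way to enumerate the captures that already exist for the hashtag page so each can be replayed with --page-requisites. Whether replaying a capture actually triggers live fetches of missing images is the assumption stated earlier in the log, not something verified here; a sketch under that assumption:

    #!/bin/bash
    # Sketch: list existing captures of the hashtag page via the Wayback CDX API,
    # then replay each one with --page-requisites so missing embeds get requested.
    TARGET='https://twitter.com/hashtag/netneutrality?f=tweets'

    curl -s -G 'https://web.archive.org/cdx/search/cdx' \
            --data-urlencode "url=${TARGET}" \
            --data-urlencode 'fl=timestamp' \
            --data-urlencode 'collapse=digest' \
        | while read -r ts; do
            wget --quiet --page-requisites -e robots=off "https://web.archive.org/web/${ts}/${TARGET}"
            sleep 10
          done

The downloaded files pile up locally, so this combines well with the throwaway-directory pattern sketched further down.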
There are many ways to do this, and today you learned a new one. [20:35] relearned [20:36] There will be a test on what all flags to tar are and what they do tomorrow! [20:38] i use longform [20:38] :P [20:38] tar is easy [20:38] looks like there is notraffic at all and rclone is doing some crap instead. makes sense to have the "speed" die down linearly then [20:54] *** dashcloud has quit IRC (Quit: No Ping reply in 180 seconds.) [20:55] *** SmileyG has joined #archiveteam-bs [20:56] *** Smiley has quit IRC (Read error: Operation timed out) [20:57] CoolCanuk: that cumulus media thing made my brain conjure up some silly idea https://pastebin.com/raw/32k6st0E [21:04] *** SmileyG has quit IRC (Ping timeout: 260 seconds) [21:04] *** dashcloud has joined #archiveteam-bs [21:07] *** Smiley has joined #archiveteam-bs [21:19] schbirid: any cpu activity from rclone? [21:20] *** BlueMaxim has joined #archiveteam-bs [21:20] i just straced it and it has connection time outs all over [21:57] *** schbirid has quit IRC (Quit: Leaving) [21:58] should Wikia be moved to Fandom, or is it okay to redirect Fandom to Wikia? [22:00] JAA: i tried this wget command, wget -O /dev/null --header="Accept: text/html" --user-agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:21.0) Gecko/20100101 Firefox/21.0" --quiet --page-requisites "https://web.archive.org/save/https://twitter.com/hashtag/bogus?f=tweets" ..it's 100% quiet, though it doesn't seem to return more than using curl did. [22:01] JAA: i won't know until the captures show up on wayback though [22:09] ola_norsk: You might want to write a log file to figure out what it's doing exactly. -o is the option, I think. [22:13] JAA: without -O it does make a directory structure. but it doesn't seem to contain image data [22:14] JAA: It seems to be just the same data, only then in e.g web.archive.org/save/https\:/twitter.com/hashtag/bogus\?f=tweets [22:15] in folders, i mean, instead of the (same?) data going to -O [22:17] Hm [22:19] JAA: https://pastebin.com/FKu3mHbh this showes the structure of what it does [22:20] JAA: the 'hashtag/bogus?f=tweets' is the only file apart from robots.txt [22:20] Right [22:20] *** noirscape has joined #archiveteam-bs [22:21] could Lynx browser be tricked into acting like a 'real' browser perhaps? [22:22] *** noirscape has quit IRC (Client Quit) [22:22] I doubt it. [22:23] *** fie has quit IRC (Ping timeout: 633 seconds) [22:23] Not sure why your command doesn't work. [22:23] But yeah, a log file would help. [22:23] Maybe with -v or -d even. [22:23] one sec [22:25] *** MrDignity has joined #archiveteam-bs [22:27] JAA: 'default output is verbose.' ..and there's quite little there i'm afraid :/ [22:28] ill see if there's some options that give it better [22:28] *** fie has joined #archiveteam-bs [22:30] ola_norsk: "Not following https://web.archive.org/save/_embed/https://pbs.twimg.com/profile_images/848200666199629824/ZwvxQIzP_bigger.jpg because robots.txt forbids it." [22:30] Fucking robots.txt [22:30] It breaks everything. :-P [22:30] try setting --user-agent [22:31] I did. [22:31] maybe it's a javascript thingy, that loads all the shit? :/ [22:31] I used your exact command. [22:31] -e robots=off [22:31] hmm [22:35] *** shin has quit IRC (Quit: Connection closed for inactivity) [22:36] JAA: here is output from me running the command (Note, it's in norwegian :/ ) https://pastebin.com/awJ9j4D8 [22:37] (please correct me if i'm wrong) [22:37] JAA: could it be i'm using older wget or something? [22:37] ola_norsk: With -e robots=off? 
[22:37] Maybe, what version are you using? [22:37] I'm on 1.18. [22:38] I don't think it should matter too much though. [22:39] JAA: GNU Wget 1.17.1 [22:39] sry, didn't notice the robots=off [22:40] Hmm [22:40] It seems that it doesn't work with -O /dev/null, interesting. [22:41] robots=off did something else indeed, but i'm guessing it didn't do much better than when you ran it [22:41] a slew of 404 errors appeared [22:42] Yeah, I got a bunch of 404s as well, but not all requests were 404s. [22:44] --2017-12-02 23:41:17-- https://web.archive.org/save/_embed/https://abs.twimg.com/a/1512085154/css/t1/images/ui-icons_2e83ff_256x240.png [22:44] Kobler til web.archive.org (web.archive.org)|207.241.225.186|:443 … tilkoblet. [22:44] HTTP-forespørsel sendt. Venter på svar … 404 Not Found [22:44] 2017-12-02 23:41:18 PROGRAMFEIL 404: Not Found. [22:44] is one png [22:44] Yeah, that doesn't exist. [22:45] But my command earlier grabbed https://pbs.twimg.com/profile_images/848200666199629824/ZwvxQIzP_bigger.jpg for example. [22:45] so, it's robots.txt on the endpoints that causes the failures? [22:45] robots.txt at web.archive.org, yes. [22:45] Ah, no. [22:46] That's what causes wget not to retrieve the page requisites without -e robots=off. [22:46] no, i mean at e.g : abs.twimg.com ? [22:46] Those 404s, not sure. Might just be broken links or misparsing. [22:47] damn internet, it's a broken big fat mess [22:48] cloudflare and shit [22:48] which website are you trying to access that cloudflare wont let you [22:48] I can possibly help get the true IP [22:49] it's to get waybackmachine to capture webpages, including images, with doing just request [22:50] oh :/ [22:51] HTML is a huge clusterfuck. Well, to be precise, HTML is fine, but the parsing engines' forgiveness is awful. [22:51] And don't get me started on JavaScript. [22:51] CoolCanuk: i've messed up, thinking it would actually do captures by doing just that with automatic requests..but turns out it wasn't that easy :/ [22:52] JAA: aye. Is it possible that twitter uses javascript to put in the images, AFTER the page is loaded? [22:52] Definitely possible. [22:52] JAA: if so, i'm giving up even trying :d [22:53] But at least part of it is not scripted. [22:55] My test earlier grabbed https://pbs.twimg.com/media/DQDHMryX4AEseEo.jpg for example, which is an image from a post most likely (though I'm not going to try and figure out which one). [22:55] I think i'll just let the curl stuff run until the 14th, and let someone brigther than me figure it out in the future. [22:55] Sometimes, I hate the WM interface. "3 captures" *click* only lists one. [22:56] one thing is images, but another is that basically all links on twitter are shorterened links [22:57] Yeah, but if you want to follow those, you'll definitely need more than that. [22:57] I mean, it might work with --recursive and --level 1 or something like that. [22:57] But it would really be better to just write WARCs locally and upload those to IA. [22:57] the t.co links do come with the actual link the ALT= tag i think , not sure though [22:58] property i mean [22:58] Never looked into them. [22:59] What you're describing is more or less what I'm doing from time to time with webcams. [22:59] I did that during the eclipse in the US in August, and I'm currently retrieving images from cams across Catalonia every 5 minutes. [23:00] It's just a script which runs wpull in the background + sleep 300 in a loop. [23:01] A cronjob might be cleaner, but whatever. 
[23:02] with --recursive it does seem to take a hell of a lot longer.. [23:02] and that's maybe a good sign [23:02] Yeah, it's now retrieving all of Twitter. [23:02] Well, maybe not all of it, but a ton. [23:03] * ola_norsk suddenly archive all of internets [23:03] Solving IA's problems. Genius! [23:03] aye [23:03] maybe that level thing is not a bad idea :d [23:05] :-P [23:05] any way i could limit it to let's say 1-2 "hops" away from twitter? :D [23:05] ...seriously, it's still going [23:06] it went from #bogus hashtag to shotting #MAGA.. [23:07] Yep, and it'll retrieve every other hashtag it can find. [23:07] aye [23:07] It's the best recursion. Believe me! [23:07] 'recurse all the things!' lol [23:09] at the very least i think it needs some pause between these requests :d [23:09] stack exausted, core dumped. [23:10] it's doing bloddy mobile.twitter.com now .. [23:10] nobody needs that [23:12] it's brilliant though :D , i just hope it did the images :D [23:12] It did exactly what you told it to. :-P [23:13] that just proves computes are stupid :d [23:13] Yeah, that or... :-P [23:14] the Illuminati did it [23:16] but, i'm thinking if was limited to just 1-2 hops, even 1, that would be enough to get most images. Or? [23:17] --page-requisites gets the images already. [23:17] (But apparently only if you actually write the files to disk. My tests with -O /dev/null did not work.) [23:17] You only need recursion with a level limit if you also want to follow links on the page. [23:17] Which might make sense, retrieving the individual tweets for example. [23:18] could you pastebin the command you did that does image capture? [23:18] But if you want to have any control over what it grabs (for example, not 100 copies of the support and ToS sites), it'll get complex... [23:18] Uh [23:18] Closed the window already, hold on. [23:19] the --recursion is violent :d [23:19] It's awesome, you just need to know how to control it. :-) [23:19] *** jschwart has quit IRC (Quit: Konversation terminated!) [23:20] aye [23:21] as for any output, if i can't put in /dev/null it'll go in a ramdisk that cleared quicly [23:21] that's [23:22] Uhm, dafuq? https://web.archive.org/web/20171202231923/https:/twitter.com/hashtag/bogus?f=tweets [23:23] That's my grab from a few minutes ago. [23:23] Well, it did grab the CSS etc. [23:23] I didn't specify the UA though. That might have something to do with it. [23:24] i'm not sure how they distrubute the requests between 'nodes' [23:24] The command was wget --page-requisites -e robots=off 'https://web.archive.org/save/https://twitter.com/hashtag/bogus?f=tweets' [23:24] ty [23:24] Regarding the temporary files: mktemp -d, then cd into it, run wget, cd out, rm -rf the directory. [23:24] Five-line bash script. :-) [23:28] JAA: sometimes i notice twitter.com requires login for anything. Maybe it varies by country. I'm not sure. [23:29] ty, gold stuff [23:29] Yeah, Twitter's quite annoying to do anything with it at all. [23:29] We still don't have a solution for archiving an entire account or hashtag. [23:30] they make money of off doing that [23:30] so they will not make it easy [23:31] if you're from a research institution, they would easily hand over hashtag archive from day0. For a slump of money, of course [23:32] Yeah [23:36] is there a mirror of the wiki we can use until it's stable? [23:36] No, I don't think so. [23:37] There's a snapshot from a few months ago in the Wayback Machine, I believe. 
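The "five-line bash script" described at the end there, written out as a sketch. A temporary directory is used because, as noted above, -O /dev/null apparently breaks --page-requisites; the recursion flags for "1 hop" are left commented out because, as the test above showed, unrestricted recursion happily wanders off into the rest of Twitter.

    #!/bin/bash
    # Grab a page plus its requisites into a throwaway directory, then clean up.
    tmpdir="$(mktemp -d)"
    cd "$tmpdir" || exit 1
    wget --page-requisites -e robots=off 'https://web.archive.org/save/https://twitter.com/hashtag/bogus?f=tweets'
    # For one hop of links as well (individual tweets etc.), something like:
    #   wget --recursive --level=1 --page-requisites -e robots=off '...'
    # but expect far more than intended without tight --accept-regex/--reject-regex rules.
    cd - > /dev/null
    rm -rf "$tmpdir"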
[23:46] that command entails 1.7 megabytes of data :D what is the internet coming to?? lol [23:48] mankind doesn't deserve it :d [23:49] "The average website is now larger than the original DOOM." was a headline a few years ago... [23:49] web page* I guess [23:49] aye, i think just the fucking front page of my online bank is ~10MB :/ [23:52] no wonder dolphins are dying from space radiation and ozone