[00:01] Did it ever allow for multiple pages on a site?
[00:02] * joepie91 walks in
[00:02] hey Joe, we're trying to figure out the most graceful way of saving YTMD
[00:03] yeah, saw in -bs
[00:03] max: do you think you can provide a list of sites?
[00:04] when you say list, you mean the list of YTMD's?
[00:04] YTMND*'s?
[00:04] arkiver: I think you could go through http://blah.ytmnd.com/info/{number}/json incrementally?
[00:04] * joepie91 is currently reading
[00:04] should be pretty easy actually he could just send us the zone file
[00:04] since literally every YTMND is on its own host
[00:04] ErkDog: I assume he has a wildcard *.ytmnd.com at the DNS level
[00:04] max: do you by any chance still have the source of the HTML5 version, even if it's incomplete?
[00:04] ohhhhhh good point :(
[00:05] ohhhh well then in his database table
[00:05] Thanks for pointing that out nicolas17
[00:05] max: if released as open-source it might drive people to continue developing it, even if just as a future-proof way of viewing the YTMBD stuff
[00:05] er
[00:05] max: would the http://blah.ytmnd.com/info/{number}/json method get us all sites?
[00:05] YTMND *
[00:05] he should have *.ytmnd.com so it could match
[00:05] Or are there some special cases
[00:08] http://archiveteam.org/index.php?title=YTMND
[00:08] Awesome Frogging!
[00:09] well I've been trying random numbers in the json
[00:09] up to 25000 with a response so far
[00:09] some have said error no site
[00:09] ok
[00:09] hey max do you have a list of names that you could share?
[00:10] ^that would be very helpful
[00:10] https://puu.sh/qF8Bx/9d14b6c213.png
[00:11] so basically we could crawl the JSONs up
[00:11] and it gives you the "domain" in the json
[00:11] *** Froggypwn has quit IRC (Read error: Operation timed out)
[00:11] then we crawl the domain.ytmnd.com for WARCing
[00:13] https://puu.sh/qF8MY/96a6c609bd.png
[00:13] https://puu.sh/qF8NY/de3ee558f9.png
[00:13] I joined YTMND 12 yrs ago, jesus
[00:19] http://ateam-test-1.ytmnd.com/
[00:19] just made that
[00:20] OK I just made one, its ID is: 1008765
[00:22] *** howdoicom has quit IRC (Quit: Page closed)
[00:24] wow you have got to be kidding me
[00:24] I can't edit the YTMND wiki page
[00:24] https://puu.sh/qF9sQ/fa44f78678.png
[00:25] because YTMND.com is a blacklisted external site
[00:25] lol
[00:26] Frogging must have super powers
[00:26] *** Petri152 has joined #archiveteam
[00:28] I had to put spaces in the URLs, I guess someone will have to fix it besides me
[00:28] *** RedType_ has quit IRC (Read error: Operation timed out)
[00:29] ytmnd is on the tracker page
[00:30] max: any limits or special status codes?
[00:30] I summarised the info we had so far that would allow a successful crawl arkiver
[00:30] thanks
[00:30] barring additional resources/info provided by Max
[00:32] Also if it's only 1.7 TB, I could crawl and push that out in a few days, if you didn't want to do all the extra crazy stuff to add it into the warrior
[00:42] ErkDog: I think the 1.7 TB includes stuff that isn't publicly or easily accessible
[00:43] so crawling you would get less :P
[00:43] Hmm.. for a handful of sites, forcing HTML5 results in an error message saying the audio could not be decoded
[00:51] max: Jason Scott here, we can also do a full version of the 1.7 TB collection and put it into the Internet Archive's dark archives for safekeeping.
[00:51] We should do both
[00:51] We should. That's what I'm saying.
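A minimal sketch of the incremental JSON crawl discussed above, in Python. The /info/{number}/json endpoint, the "domain" field, and the rough ID ceiling (the freshly made test site got ID 1008765) come from the log; the exact response layout for valid and missing sites is an assumption.

```python
# Sketch: enumerate YTMND site IDs via the per-site JSON endpoint and emit
# subdomain URLs to feed into a WARC crawl. "blah" stands in for any
# subdomain, since *.ytmnd.com is wildcard DNS per the discussion above.
import json
import urllib.request

def discover_sites(start=1, stop=1_100_000):
    """Yield (site_id, url) for every ID that resolves to a site."""
    for site_id in range(start, stop):
        info_url = f"http://blah.ytmnd.com/info/{site_id}/json"
        try:
            with urllib.request.urlopen(info_url, timeout=30) as resp:
                info = json.load(resp)
        except Exception:
            continue  # network hiccup or non-200; retry logic omitted
        domain = info.get("domain")  # assumed key, per the screenshots above
        if domain:  # "error no site" responses are assumed to carry no domain
            yield site_id, f"http://{domain}.ytmnd.com/"

for site_id, url in discover_sites(1, 1000):
    print(site_id, url)
```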
[00:51] awesome
[00:52] I think we're going to start the crawls from, for example, http://ytmnd.com/sites/991586/profile
[00:53] I'm off
[00:54] max: if you can, please provide a list of sites, users and keywords (if that isn't easily possible, we can extract some ourselves too)
[00:54] * arkiver is afk for the night
[00:59] *** JesseW has joined #archiveteam
[01:10] *** DoomTay has quit IRC (Quit: Page closed)
[01:21] *** tomwsmf has quit IRC (Read error: Operation timed out)
[01:26] *** JesseW has quit IRC (Ping timeout: 370 seconds)
[01:34] *** RedType has joined #archiveteam
[01:37] *** DoomTay has joined #archiveteam
[01:50] *** JesseW has joined #archiveteam
[02:00] *** Froggypwn has joined #archiveteam
[02:01] *** Honno has joined #archiveteam
[02:02] *** DoomTay has quit IRC (Quit: Page closed)
[02:06] *** DoomTay has joined #archiveteam
[02:45] *** DoomTay has quit IRC (Quit: Page closed)
[02:51] *** tuankiet6 has joined #archiveteam
[02:52] *** tuankiet6 is now known as tuankiet
[03:04] *** RichardG has quit IRC (Read error: Connection reset by peer)
[03:04] *** RichardG has joined #archiveteam
[03:12] *** RichardG has quit IRC (Quit: Keyboard not found, press F1 to continue)
[03:15] *** JesseW has quit IRC (Ping timeout: 370 seconds)
[03:39] *** RichardG has joined #archiveteam
[03:50] *** mutoso has quit IRC (Ping timeout: 260 seconds)
[03:58] *** mutoso has joined #archiveteam
[04:16] *** JesseW has joined #archiveteam
[04:18] *** nicolas17 has quit IRC (Read error: Operation timed out)
[04:22] *** Sk1d has quit IRC (Ping timeout: 194 seconds)
[04:30] *** Sk1d has joined #archiveteam
[04:42] *** JesseW has quit IRC (Ping timeout: 370 seconds)
[04:52] *** RedType has quit IRC (Read error: Operation timed out)
[05:02] *** JesseW has joined #archiveteam
[05:31] *** RichardG has quit IRC (Read error: Operation timed out)
[05:31] *** RichardG has joined #archiveteam
[05:54] *** dan- has quit IRC (Ping timeout: 633 seconds)
[05:54] *** RedType has joined #archiveteam
[05:57] *** dan- has joined #archiveteam
[06:25] *** JesseW has quit IRC (Ping timeout: 370 seconds)
[06:29] *** uosdwis has joined #archiveteam
[06:31] hi. I'm running the warrior but it's not getting any items. is the tracker down?
[06:32] *** aschmitz has quit IRC (Read error: Operation timed out)
[06:39] *** aschmitz has joined #archiveteam
[07:02] *** uosdwis has quit IRC (Quit: Page closed)
[07:11] *** RichardG has quit IRC (Ping timeout: 255 seconds)
[07:21] tracker is up, but we complete projects faster than we can start them
[07:21] but uosdwis is gone anyway
[07:22] we are too good
[07:25] *** BlueMaxim has quit IRC (Read error: Operation timed out)
[07:26] Anyone got this error while running googlecode-grab? Lua runtime error: googlecode.lua:375: invalid use of '%' in replacement string. (It's just on Arch Linux I think; my VPS running Ubuntu 16.04 doesn't have this error)
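That error is a known Lua pitfall rather than anything Arch-specific: string.gsub treats '%' in the *replacement* string as an escape character, so a captured URL containing percent-encoding (say %C3) blows up when it is reused as a replacement. A guess at the cause and fix, assuming googlecode.lua:375 substitutes a URL into another string via gsub, shown in Lua since that is the language of the failing script:

```lua
-- A URL with percent-encoding is unsafe as a gsub replacement:
local url = "http://example.com/caf%C3%A9"       -- hypothetical example URL
-- ("x"):gsub("x", url)  --> invalid use of '%' in replacement string

-- Doubling '%' makes the replacement literal:
local safe = url:gsub("%%", "%%%%")
local out = ("x"):gsub("x", safe)
print(out)  --> http://example.com/caf%C3%A9
```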
[07:26] *** BlueMaxim has joined #archiveteam
[07:32] *** les has joined #archiveteam
[07:32] WHAT FORSOOTH, PRITHEE TELL ME THE SECRET WORD
[07:33] yahoosucks
[07:33] got it, thanks
[07:33] *** les has quit IRC (Client Quit)
[07:52] *** db48x has quit IRC (Read error: Operation timed out)
[08:02] *** kristian_ has joined #archiveteam
[08:33] *** fie has joined #archiveteam
[08:38] *** RichardG has joined #archiveteam
[08:47] *** BlueMaxim has quit IRC (Read error: Operation timed out)
[08:48] *** BlueMaxim has joined #archiveteam
[09:26] Go forth
[10:23] *** kristian_ has quit IRC (Leaving)
[10:36] *** Peetz0r_ is now known as Peetz0r
[10:41] *** db48x has joined #archiveteam
[10:49] *** db48x has quit IRC (Remote host closed the connection)
[10:59] *** wp494 has quit IRC (Read error: Connection reset by peer)
[11:00] *** WinterFox has joined #archiveteam
[11:14] *** db48x has joined #archiveteam
[11:41] *** fie_ has joined #archiveteam
[11:43] *** fie has quit IRC (Read error: Operation timed out)
[11:58] *** swonsy has joined #archiveteam
[12:01] Hello everybody. tell me, is addons.mozilla.org already archived? if so, where you can download files? Thank you
[12:01] where can i download*
[12:04] Try the Wayback Machine?
[12:04] It might be archived.
[12:05] Could you give me a link to the archives?
[12:06] https://web.archive.org/web/*/addons.mozilla.org
[12:07] but the actual extensions aren't archived
[12:09] but i need the archived files of the extensions, not just the extensions' pages on AMO
[12:09] yes, i know
[12:10] This is bad
[12:12] then tell me, will your team archived AMO with extensions files? in future
[12:14] archive*
[12:17] it's a good idea
[12:19] of course))
[12:21] because Mozilla is dying
[12:23] many extensions have already disappeared from the AMO site, as well as from the developers' sites
[12:25] you need to preserve at least those that are left
[12:39] *** BlueMaxim has quit IRC (Quit: Leaving)
[12:43] *** vitzli has joined #archiveteam
[12:43] *** WinterFox has quit IRC (Read error: Operation timed out)
[12:58] so i was thinking if you want to warc the entire site, i can write a quick script that reads from the db and just generates a massive list of every page on ytmnd.com as well as all the subdomains
[12:59] SketchCow: also hi!
[13:00] also there's an API that has been down for a few years because no one ever used it and it was a pain to maintain, but it could give a ton of access to otherwise hidden data if i turned it back on and made it work.
[13:00] or i could make the json on the subdomains include more information
[13:01] i colocate and bandwidth is cheap, so frankly i dont care if the archiving is ddosing the site.
[13:01] that said, i can copy the 1.7 TB assets archive to you guys faster than using single HTTP GETs
[13:01] We should do both
[13:02] the whole asset dir is /#/#/.
[13:02] and they're immutable
[13:02] If we don't crawl this through HTTP GETs it will not be in the wayback machine
[13:02] ok
[13:03] A copy of the data would be nice to have too, besides the crawl
[13:03] the only data im hesitant to provide is email/password hashes/private messages
[13:04] I'm not sure about a dump, but it won't be in the crawled data if it is not publicly available information
[13:04] As for the dump and private information,
[13:04] people dated people they met on ytmnd, so there are likely some very personal messages
[13:05] *** BartoCH has quit IRC (Ping timeout: 260 seconds)
[13:05] Internet Archive can keep items dark ('inaccessible'), so that might be what you want for a copy of the data
[13:05] Or you can encrypt the private data only and send it that way
[13:06] Or leave it out, of course
[13:06] im just not sure i see the value in archiving private messages
[13:06] SketchCow ^
[13:06] i checked last night and the view data on sites is the majority of the data, at 460 million rows
[13:07] ok
[13:07] I think it'd be good to talk with SketchCow about what we should do with the copy of the data.
[13:08] For the crawl, it would be great if you can create that list of every page and subdomain
[13:08] and i guess it might be worth making the html5 player a bit nicer just so people in the future can see them
[13:11] *** BartoCH has joined #archiveteam
[13:16] That might be a nice idea
[13:31] this is exciting :)
[14:02] *** nicolas17 has joined #archiveteam
[14:18] *** wp494 has joined #archiveteam
[14:30] *** tuankiet has quit IRC (Ping timeout: 246 seconds)
[14:38] *** tuankiet6 has joined #archiveteam
[15:37] *** DoomTay has joined #archiveteam
[15:37] max: Out of curiosity, how does YTMND handle busy servers?
[15:38] Because we once dealt with a website where in such a situation, the site would instead serve a "servers are busy" message while still having a status code of 200
[15:49] well
[15:50] it doesn't do anything special
[15:50] but spidering would technically pollute the view data
[15:50] not that it's really that important
[15:50] but it was tuned to get a lot more traffic than it does now so it should probably be fine in that regard
[15:51] *** vitzli has quit IRC (Quit: Leaving)
[15:55] Anyway, it looks like they're going to look at JSONs with the domain always being on picard.ytmnd.com, which I'm not sure about. Unless they eventually crawl the JSON at a given site's actual domain, it will probably wind up broken once the site is on the Wayback Machine
[16:02] max: Can you provide Internet Archive a copy of the data with all private messages removed?
[16:06] if the private messages are removed then can the dump be made public?
[16:07] but spidering would technically pollute the view data
[16:08] It's probably best to first get a copy to IA and after that do the crawl, so the original statistics are saved
[16:17] *** tomwsmf has joined #archiveteam
[16:26] max: what are your thoughts on my question last night, regarding providing the HTML5 player as an open-source thing so that people can continue to develop it?
[16:27] (if they so desire)
[16:52] *** Morbus has quit IRC (Quit: http://www.disobey.com/)
[17:16] GAWKER.COM closes down last week.
[17:16] Anything left to grab? We were pretty comprehensive.
[17:17] *** riordan has joined #archiveteam
[17:26] *** W has joined #archiveteam
[17:27] @SketchCow Forgive me for being a perma-n00b/admirer, but when you grabbed gawker, did you also grab the whole gawker media/kinja network?
[17:27] There’s a bunch of real weird shit in there like dog.gawker.com that’s, well… fascinating
[17:27] We're likely to double-check
[17:28] also tons of their posts are crazy reliant on embedded content (embedded tweets)
[17:28] awesome
[17:28] *** kristian_ has joined #archiveteam
[17:30] On behalf of the staff of old-school cultural heritage orgs: thank you all for doing this when we won’t
[17:30] because computers scare us and we’ve been told they’re very expensive
[17:30] <3
[17:31] embedded tweets archive pretty well
[17:31] because they're a <blockquote> with some javascript that makes it look fancy
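For reference, Twitter's standard embed code is exactly that: a plain blockquote holding the tweet text, plus a script include that dresses it up. That is why the text survives in a WARC even when the javascript never runs. Roughly as follows, with the user and status ID hypothetical:

```html
<!-- The tweet text is ordinary HTML, readable without widgets.js -->
<blockquote class="twitter-tweet">
  <p>Tweet text lives here as plain markup.</p>
  &mdash; Example User (@example)
  <a href="https://twitter.com/example/status/1234567890">August 18, 2016</a>
</blockquote>
<!-- This script progressively enhances the blockquote into the card UI -->
<script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>
```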
[17:32] *** bithippo has joined #archiveteam
[17:32] joepie91: it's sort of already open source. i originally wanted to make all the code open source but never ended up doing it because i was ashamed of some of the older code
[17:33] "available and non-obfuscated if you click 'view source'" != "open source and under a free license" :)
[17:33] right
[17:34] *** W has quit IRC (Ping timeout: 268 seconds)
[17:34] it's open source minus the license, i'd have no problem making it gpl or whatever you guys suggest
[17:34] Is ArchiveTeam picking up gawker.com? gawker.com/gawker-com-to-end-operations-next-week-1785455712
[17:37] bithippo: Can we give you a task?
[17:37] I accept all sorts of tasks.
[17:37] 1. Sit in this channel
[17:38] 2. For the next 12 hours, when someone with a new name comes in and goes "WHAT ABOUT THE GAWKERZ"
[17:38] 3. You say "We're on it!"
[17:38] Point taken :) My apologies.
[17:38] No point
[17:38] I'm assigning you this task
[17:38] Pretty simple one
[17:38] we were talking about it literally right before you joined :P
[17:39] Engage maximum regret.
[17:39] [13:16:24] <@SketchCow> GAWKER.COM closes down last week.
[17:39] is it last week or next week :p
[17:39] Next week.
[17:39] I'm ..... distracted today.
[17:39] I thought it was a metaphor or something, heh
[17:41] The combination of tense and time frame was pretty confusing
[17:41] *** verifiedJ has joined #archiveteam
[17:42] *** SketchCow sets mode: +b *!*webchat@*.res.bhn.net
[17:42] *** DoomTay was kicked by SketchCow (DoomTay)
[17:42] (I'm interested if he sticks around if he's just in #archivebot)
[17:43] o.o
[17:44] why was that?
[17:44] nicolas17.
[17:44] If you come into Act 2 of the play
[17:44] Please avoid trying to ask why everyone's doing everything on stage
[17:45] https://archive.org/download/Uptime_Magazine_Volume_11_Number_5_1985_Side_1/screenshot_00.jpg
[17:48] *** Morbus has joined #archiveteam
[17:51] *** schbirid has joined #archiveteam
[17:54] *** ikreymer has joined #archiveteam
[17:57] *** alembic has joined #archiveteam
[18:00] Who is the main point of contact for archiving Gawker at this point?
[18:01] max: one sec
[18:01] max: have a look here: http://cryto.net/~joepie91/blog/2013/03/21/licensing-for-beginners/
[18:01] max: and don't be afraid about code quality, I can assure you that people would much rather have crappy code be open-source, than not available/reusable at all :)
[18:01] (and that's assuming that it's crappy to begin with)
[18:01] at least when it's open-source, they can safely improve it
[18:02] (also, technically speaking, something cannot be "open-source" unless it's licensed under an OSI-compliant license :P)
[18:02] (er, sorry, OSD)
[18:04] *** gfscott has joined #archiveteam
[18:05] joepie91: CC0 is not a license
[18:06] it is
[18:06] No.
[18:06] it is an attempt at public domain dedication that falls back to a license
[18:06] let's not discuss that here
[18:06] * Nemo_bis shuts up the nitpicker before it gets too late.
[18:06] ^
[18:09] *** m4rk3r has joined #archiveteam
[18:10] *** ikreymer has quit IRC ()
[18:11] *** ikreymer has joined #archiveteam
[18:16] *** AlexLehm has joined #archiveteam
[18:32] *** kristian_ has quit IRC (Leaving)
[18:48] *** riordan_ has joined #archiveteam
[18:50] *** riordan has quit IRC (Read error: Operation timed out)
[18:50] *** riordan_ is now known as riordan
[18:55] CC0 is a license.
[18:55] There, we're done.
[18:56] It's allowed to be a license you think is a fucking joke, just like POSIX is a joke
[18:56] (Get up get up get and get down / POSIX is a joke in your town)
[18:56] So, I'm on a show tonight.
[18:56] http://amyontheradio.com/
[18:58] SketchCow: I heard RMS regrets renaming the POSIX_ME_HARDER environment variable to POSIXLY_CORRECT
[19:05] *** alembic has quit IRC (Ping timeout: 268 seconds)
[19:09] *** riordan has quit IRC (riordan)
[19:10] *** riordan_ has joined #archiveteam
[19:17] *** riordan_ has quit IRC (Read error: Operation timed out)
[19:28] *** riordan has joined #archiveteam
[19:49] *** swonsy has quit IRC (Quit: Page closed)
[19:56] *** riordan has quit IRC (riordan)
[19:57] *** riordan has joined #archiveteam
[19:58] *** riordan_ has joined #archiveteam
[19:58] *** riordan has quit IRC (Read error: Operation timed out)
[20:11] *** riordan_ has quit IRC (Ping timeout: 633 seconds)
[20:21] SketchCow: will the radio show be archived by you?
[20:22] Well, by archive team
[20:23] What time are you on?
[20:25] "Who Will Archive ArchiveTeam?"
[20:25] *** schbirid has quit IRC (Quit: Leaving)
[20:25] http://www.deeptalkradio.com/network-schedule/
[20:27] i wonder if i can just start wget and keep it open, the show time is too late for europe
[20:28] 9pm ET
[20:29] or 2am London
[20:30] bithippo: we do
[20:31] Kaz: Should've added the /s, sorry about that
[20:31] To be fair though (I say this because I haven't seen you around here before), there were/are plans to back up the IA
[20:32] *** schbirid has joined #archiveteam
[20:32] so, you joke but there is some actual project there :)
[20:32] I joke, but I know you're entirely serious. One of my projects on the backburner is to figure out how to dynamically assign IA torrents to torrent client endpoints that exist solely to back up a shard of the IA
[20:32] _in my spare time_
[20:33] so like ia.bak but a different way
[20:33] Similar to ArchiveTeam warriors, but for distributed storage
[20:33] yeah
[20:33] So you'd spin up the VM on a machine with a lot of storage, and IA would hand you torrents to consume and back up locally that were currently least distributed to backup clients.
[20:37] *** arrith has joined #archiveteam
[21:19] *** ikreymer has quit IRC (Read error: Connection reset by peer)
[21:20] *** ikreymer has joined #archiveteam
[21:27] *** bithippo has quit IRC (Quit: Page closed)
[21:48] *** verifiedJ has left
[22:09] *** Jogie has joined #archiveteam
[22:15] *** m4rk3r has quit IRC (m4rk3r)
[22:17] *** gfscott has quit IRC (gfscott)
[22:26] *** Stiletto has quit IRC (Ping timeout: 246 seconds)
[22:38] *** BlueMaxim has joined #archiveteam
[22:39] *** ikreymer has quit IRC (Remote host closed the connection)
[22:41] *** ikreymer has joined #archiveteam
[22:46] *** ikreymer has quit IRC (Remote host closed the connection)
[22:47] *** ikreymer has joined #archiveteam
[22:49] *** William has joined #archiveteam
[22:49] Does Archiveteam plan on jamming Gawker.com into the warrior? - http://gawker.com/gawker-com-to-end-operations-next-week-1785455712
[22:50] William: it's done: https://archive.org/search.php?query=subject%3A%22gawker.com%22
[22:51] Says sitemap, is the content downloaded?
[22:52] http://gawker.com/sitemap_bydate.xml?startTime=2016-08-01T00:00:00&endTime=2016-12-31T23:59:59
[22:52] all gawker.com sites have sitemaps
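A short sketch of walking that dated sitemap to enumerate post URLs for a grab. The sitemap_bydate.xml endpoint and its startTime/endTime parameters are the ones quoted above; the XML is assumed to follow the standard sitemaps.org schema:

```python
# Sketch: pull post URLs out of a Kinja-style dated sitemap so they can be
# fed to a crawler. Works per gawker.com property, per the note above.
import urllib.request
import xml.etree.ElementTree as ET

SITEMAP = ("http://gawker.com/sitemap_bydate.xml"
           "?startTime=2016-08-01T00:00:00&endTime=2016-12-31T23:59:59")
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

with urllib.request.urlopen(SITEMAP, timeout=30) as resp:
    tree = ET.parse(resp)

# <loc> entries may be posts or nested sitemaps; print them all.
for loc in tree.findall(".//sm:loc", NS):
    print(loc.text)
```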
[22:53] *** William has quit IRC (Client Quit)
[23:08] actually, I'll post it here as well
[23:08] https://searx.me/
[23:08] this search engine lets you get results as JSON
[23:08] can be useful for discovery
[23:09] *** Honno has quit IRC (Read error: Operation timed out)
[23:43] *** Stiletto has joined #archiveteam
[23:56] *** W has joined #archiveteam
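On the searx point: searx instances serve the usual /search endpoint with a format=json parameter when the operator enables it, which makes them usable for scripted URL discovery. A sketch, with searx.me as the instance named above and the query a placeholder:

```python
# Sketch: ask a searx instance for JSON results and print the hit URLs.
# format=json must be enabled in the instance's settings; the "results"
# list with per-hit "url" fields is searx's standard JSON layout.
import json
import urllib.parse
import urllib.request

params = urllib.parse.urlencode({"q": "site:ytmnd.com", "format": "json"})
req = urllib.request.Request(
    "https://searx.me/search?" + params,
    headers={"User-Agent": "discovery-sketch/0.1"},  # some instances reject blank UAs
)
with urllib.request.urlopen(req, timeout=30) as resp:
    data = json.load(resp)

for hit in data.get("results", []):
    print(hit["url"])
```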