[00:01] hook54321, files are safe on the mirrors
[00:01] just my torrent box
[00:01] all I need to do is redownload to my torrent server
[00:03] Which mirrors? The cm ones?
[00:03] Right. I have a plan. It's 00:02 here. I am going to set the files to redownload from ItsYoda - get some sleep and then work on this tomorrow
[00:03] Does ItsYoda have the forums too?
[00:03] yes
[00:03] ok, we're safe then
[00:08] hook54321, download is running. Will update in the morning. Goodnight all
[00:11] goodnight
[00:12] goodnight HCross
[01:08] *** siemensak has quit IRC (Quit: Page closed)
[01:58] I remember someone mentioning a huge URL database here a while ago but I can't seem to find it now, anyone know what it was?
[02:01] *** Specular has joined #archiveteam-bs
[02:04] Nevermind, found it: https://commoncrawl.org/
[02:07] "The organization's crawlers respect nofollow and robots.txt policies." pffftt :/
[02:08] dashcloud: srsly.de's up, if you want to go see what I was talking about yesterday
[02:11] no FAQ on how they differ from other archiving efforts
[02:11] yea kind of strange
[02:12] they should at least have an faq
[02:12] I don't get it, what is their purpose?
[02:15] It looks like a bunch of screenshots, mostly of terminals
[02:16] wp494
[02:17] what kind of terminals though
[02:17] using their primitive index searches found on their blog (seemingly the only links to the crawls) doesn't seem to contain much
[02:18] They have a blog? Where?
[02:18] http://commoncrawl.org/connect/blog/
[02:18] but can you set it up lol
[02:19] Specular: Oh, I thought you were talking about srsly.de
[02:20] oh. Srsly.de looks like some site that displays exposed VNC terminals.
[02:20] ^^^
[02:20] Specular: You can search it here: http://urlsearch.commoncrawl.org/
[02:20] Specular, wp494 : "Failed to connect to server (code: 1006)"
[02:21] hook54321, where did you even find that link btw?
[02:21] Specular: https://duckduckgo.com/?q=commoncrawl+search+urls&ia=web
[02:21] Some search like that
[02:22] hook54321: yeah I've been getting that when attempting to view
[02:22] it's early days at C3 so maybe it'll get fixed
[02:22] C3?
[02:23] chaos communication congress
[02:23] run by germany's chaos computer club
[02:23] more or less euro DEFCON if you don't count Black Hat
[02:23] Specular: They don't have anything for archiveteam.org, so it's not very big :/
[02:25] they state they crawled the top million domains from Alexa but even so there aren't actually many results within it, yeah
[02:26] Are they just trying to create a database of tons of urls or are they trying to crawl them too?
[02:27] *archive them too?
[02:28] clicking on some of the results in my search it appears they archive the raw HTML but since it doesn't display the actual page I'm not sure if any images or other dependent files are or if it's just a copy of the source HTML as-is
[02:29] been in operation since 2010 according to the earliest blog result
[03:39] *** pizzaiolo has left
[03:40] *** Specular has quit IRC (Ping timeout: 633 seconds)
[03:50] *** VADemon has quit IRC (Quit: left4dead)
[04:22] I need help solving this: "prove you are not a bot to ten decimal places"
[04:41] BTW, back on Dec 21st, someone made an account on the archiveteam wiki and shared it on bugmenot. I've now randomized the password for it; if someone can think of a good use for such a public/anonymous account, speak up.
[04:46] Ah, I see someone did use the bugmenot account to update the status of a French ISP's hosting. Thanks for doing that, in any case.
[05:18] *** Sk1d has quit IRC (Ping timeout: 250 seconds)
[05:25] *** Sk1d has joined #archiveteam-bs
[06:49] *** wp494 has quit IRC (Quit: LOUD UNNECESSARY QUIT MESSAGES)
[08:26] *** Specular has joined #archiveteam-bs
[08:29] *** Specular_ has joined #archiveteam-bs
[08:33] *** Specular has quit IRC (Ping timeout: 370 seconds)
[08:38] *** Specular_ has quit IRC (Ping timeout: 370 seconds)
[08:39] *** Specular_ has joined #archiveteam-bs
[09:17] *** GE has joined #archiveteam-bs
[09:22] *** VADemon has joined #archiveteam-bs
[10:19] *** Specular_ has quit IRC (Ping timeout: 370 seconds)
[10:20] *** Specular_ has joined #archiveteam-bs
[10:20] *** schbirid has joined #archiveteam-bs
[10:28] *** Simpbrain has joined #archiveteam-bs
[10:47] *** GE has quit IRC (Quit: zzz)
[11:25] *** Ravenloft has quit IRC (Ping timeout: 260 seconds)
[11:40] *** vitzli has joined #archiveteam-bs
[12:11] Somebody2: actually, it may be cool to keep the account!
[12:12] I imagine someone wanting to fix a typo but not bothered enough to make an account while not wanting to edit as an IP
[12:13] oh, but the wiki is read-only for IPs
[12:16] *** vitzli has quit IRC (Quit: Leaving)
[12:17] *** GE has joined #archiveteam-bs
[12:18] you could also keep playing whack-a-mole, but with the account name "bugmenot" it's at least obvious where the edits are coming from
[12:18] i often create bugmenot accounts
[12:18] as long as it's not used to spam..
[12:18] because signing up is a pain in the ass if your contribution is something tiny
[12:18] schbirid: yeah, I've made a bunch as well
[12:43] *** Specular_ has quit IRC (Ping timeout: 370 seconds)
[12:47] *** Specular_ has joined #archiveteam-bs
[12:56] *** BlueMaxim has quit IRC (Quit: Leaving)
[13:15] *** Simpbrain has quit IRC (Remote host closed the connection)
[13:31] *** pizzaiolo has joined #archiveteam-bs
[13:42] so turned out i forgot to install atomicparsley so that get_iplayer would add metadata to the m4a files
[13:43] this will affect most of the Newshour uploads and 2016-11 files of The World Tonight
[13:43] good news is the metadata xml file was uploaded with them so i don't need to be redoing them cause of this problem
[13:55] i got another review: https://archive.org/details/DTIC_ADA041895
[14:24] *** Specular_ has quit IRC (Ping timeout: 370 seconds)
[14:26] *** Specular_ has joined #archiveteam-bs
[14:52] *** Specular_ has quit IRC (Quit: whoosh)
[14:59] Torrent is up http://yda.pw/CyanogenMod.torrent
[15:00] you need seeders? how big is it?
[15:01] 413GB, needs seeders
[15:35] I am working on this, if anyone knows/is inclined to help https://www.wikidata.org/wiki/Q4787261
[15:38] arkiver: torrent has been shoved into the IA
[15:43] so some good news on the WallBuilders Live archive i have been downloading for the last 3 years
[15:44] looks like they have (at least) 2011 episodes up now
[15:45] this also means i will be getting the 128kbs mp3s also
[15:45] instead of just the 32kbs ones
[15:46] only from 2012-06 to 2013-09 need the upgraded versions
[15:56] ok so it goes back to feb 2009 at least
[16:01] HCross: nice
[16:01] What item is it in?
[16:06] *** RichardG has quit IRC (Quit: Keyboard not found, press F1 to continue)
[16:08] *** RichardG has joined #archiveteam-bs
[16:18] https://archive.org/details/CyanogenMod281216
[16:21] arkiver: I don't see a peer from the IA on my torrent
[16:23] it's doing something https://catalogd.archive.org/log/613242494
[16:23] give it a few hours
[16:24] *** TheKiwi has quit IRC (Quit: Connection closed for inactivity)
[16:24] *** VADemon has quit IRC (Quit: left4dead)
[16:28] and so my upload of Wallbuilders live mp3s starts now
[16:29] with description metadata
[16:30] first one from 2009 that i got is up: https://archive.org/details/wallbuilders-live-2009-02-19
[17:22] arkiver: there we go. IA is downloading things
[17:23] :D
[17:26] arkiver: also rapidly running out of IO
[17:27] is this all on newsbuddy?
[17:48] *** atlogbot has quit IRC (Quit: atlogbot)
[17:49] *** swebb has quit IRC (Quit: badcheese.com - where crap sometimes gets done)
[18:09] arkiver: not the torrent, as there isn't a torrent client installed. I may install one as I can see it being useful
[18:17] Somebody2: another way the bugmenot enables good-faith edits is in the case a user doesn't have his password handy on a public computer for example
[18:34] (15:58:55) yan: SketchCow: I knew you'd respond that way re: wikimedia ;) On a positive note, I dumped over a third (and going) of links to the file formats wiki on wikidata and they were quite happy to take them!
[18:34] what links? *curious*
[18:41] luckcolor: brazilian :P
[18:43] ah lel
[18:43] cause i'm italian lol
[18:43] *** kristian_ has joined #archiveteam-bs
[18:54] *** atlogbot has joined #archiveteam-bs
[18:54] *** atlogbot has quit IRC (Remote host closed the connection)
[18:55] arkiver, any reason the IA would stop downloading randomly?
[18:56] *** swebb has joined #archiveteam-bs
[18:57] *** atlogbot has joined #archiveteam-bs
[18:57] HCross: no idea, but if it doesn't resume it'll time out and fail and we'll just restart
[18:58] ok. the seeders are still doing their thing
[18:59] *** atlogbot has quit IRC (Remote host closed the connection)
[18:59] *** swebb has quit IRC (Client Quit)
[19:00] *** atlogbot has joined #archiveteam-bs
[19:00] *** swebb has joined #archiveteam-bs
[19:07] *** pizzaiolo has left
[19:29] *** kristian_ has quit IRC (Quit: Leaving)
[19:33] *** Asparagir has quit IRC (Asparagir)
[19:34] *** Asparagir has joined #archiveteam-bs
[19:34] *** Asparagir has quit IRC (Client Quit)
[19:34] *** Asparagir has joined #archiveteam-bs
[19:36] so, anyone in town for 33c3? i did not get a ticket but would be up for awkward meets
[19:48] *** Start has quit IRC (Read error: Connection reset by peer)
[19:48] *** Start has joined #archiveteam-bs
[19:59] *** Simpbrain has joined #archiveteam-bs
[20:03] how large are the ftp-gov-items? 8GB is fine, 80 is too much for this machine
[20:05] *** arkiver2 has joined #archiveteam-bs
[20:05] current average is 420MB, but can be bigger.
[20:07] In the pipeline, you can edit the max size.
[20:10] *** arkiver2 has quit IRC (Quit: AndroIRC - Android IRC Client ( http://www.androirc.com ))
[20:10] *** arkiver2 has joined #archiveteam-bs
[20:13] I can set concurrent_items=1, max_items=100; but not size of item as it's unknown, isn't it?
[20:14] MAX_SIZE variable in pipeline.py
[20:17] oh right, I was looking at the runner
[20:27] it's currently set to 10GB; I'm worried ExtractRecordsInfo will hang or run out of memory with such large files
[20:30] with my latest 1256MB item, up to 2.4GB of real memory (without shared or swap) was used
[20:41] *** tfgbd_znc has quit IRC (Read error: Connection reset by peer)
[20:43] t2t2: it won't, we're not loading the payloads into memory (afaik)
[20:43] hmm
[20:44] strange. I'll test
[20:44] wpull isn't, but warc is
[20:50] also, rsync seems to be compressing during transfer, but isn't warc.gz already compressed?
[20:51] wasting cpu on both sender and receiver for no gain
[20:52] *** pizzaiolo has joined #archiveteam-bs
[21:00] *** pizzaiolo has quit IRC (Read error: Operation timed out)
[21:01] *** Asparagir has quit IRC (Asparagir)
[21:06] *** Simpbrain has quit IRC (Remote host closed the connection)
[21:08] *** siemensak has joined #archiveteam-bs
[21:12] so yeah, it did run out of memory: http://i.imgur.com/7V49acA.png
[21:14] *** pizzaiolo has joined #archiveteam-bs
[21:15] *** jrwr has joined #archiveteam-bs
[21:57] *** BlueMaxim has joined #archiveteam-bs
[22:08] *** siemensak has quit IRC (Quit: http://www.mibbit.com ajax IRC Client)
[22:10] *** schbirid has quit IRC (Quit: Leaving)
[22:43] *** arkiver2 has quit IRC (Ping timeout: 244 seconds)
[22:47] PurpleSym: thanks for sorting remix-dot-nin :D first check in I've managed since xmas. won't be back to do anything else until next week :(
[23:02] *** pizzaiolo has quit IRC (Ping timeout: 244 seconds)
[23:43] *** GE has quit IRC (Remote host closed the connection)
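
Note on the 20:50 rsync question: the point that compressing an already-compressed warc.gz a second time costs CPU for no gain can be checked with a small sketch. This is only an illustration, not what rsync or the pipeline does internally. The synthetic payload, the `vocab` list and the `saved_fraction` helper are made up for the example, and Python's gzip/zlib stand in for whatever compression rsync applies when compression is enabled.

```python
import gzip
import random
import zlib


def saved_fraction(before: bytes, after: bytes) -> float:
    """Fraction of bytes saved going from `before` to `after` (negative means it grew)."""
    return 1.0 - len(after) / len(before)


# Hypothetical stand-in for WARC content: pseudo-random text drawn from a small
# vocabulary, which compresses well once (like most HTML or logs).
rng = random.Random(0)
vocab = [f"word{i}" for i in range(500)]
raw = " ".join(rng.choice(vocab) for _ in range(100_000)).encode("ascii")

# First pass: what producing a .warc.gz roughly corresponds to.
warc_gz_like = gzip.compress(raw)

# Second pass: compressing the already-compressed bytes again in transit.
recompressed = zlib.compress(warc_gz_like, 6)

first_pass = saved_fraction(raw, warc_gz_like)
second_pass = saved_fraction(warc_gz_like, recompressed)

print(f"first pass saves  {first_pass:.1%}")
print(f"second pass saves {second_pass:.1%}")
```

The second figure should come out close to zero (it can even be slightly negative because of container overhead), which matches the "no gain" observation above. With rsync itself, compression is controlled by the `-z` / `--compress` option, so leaving it off for .warc.gz uploads would avoid the extra work.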