[00:48] *** dxrt has quit IRC (Ping timeout: 252 seconds)
[00:51] *** chazchaz has quit IRC (Read error: Operation timed out)
[00:57] *** dxrt has joined #archiveteam-bs
[00:57] *** Fusl____ sets mode: +o dxrt
[00:57] *** Fusl sets mode: +o dxrt
[00:57] *** Fusl_ sets mode: +o dxrt
[01:01] *** chazchaz has joined #archiveteam-bs
[01:12] *** ats has joined #archiveteam-bs
[01:12] *** ats_ has quit IRC (Read error: Operation timed out)
[01:17] *** bluefoo has quit IRC (Read error: Connection reset by peer)
[01:19] *** bluefoo has joined #archiveteam-bs
[01:24] *** larryv has quit IRC (Quit: larryv)
[02:05] *** DogsRNice has quit IRC (Read error: Connection reset by peer)
[02:50] Just posted https://archive.org/details/ouyalibrary
[02:50] See you in an unspecified CIA jail
[03:30] *** qw3rty2 has joined #archiveteam-bs
[03:35] *** qw3rty has quit IRC (Ping timeout: 745 seconds)
[03:37] *** odemgi has joined #archiveteam-bs
[03:40] *** odemgi_ has quit IRC (Ping timeout: 252 seconds)
[03:45] *** Stiletto has quit IRC (Ping timeout: 360 seconds)
[04:02] *** Stiletto has joined #archiveteam-bs
[05:00] *** fredgido has quit IRC (Ping timeout: 612 seconds)
[05:01] *** fredgido has joined #archiveteam-bs
[05:15] *** fredgido_ has joined #archiveteam-bs
[05:22] *** fredgido has quit IRC (Read error: Operation timed out)
[06:19] *** deevious has quit IRC (Quit: deevious)
[06:28] *** joshua_ has quit IRC (Remote host closed the connection)
[06:33] *** DigiDigi has quit IRC (Remote host closed the connection)
[07:28] *** eythian has quit IRC (Ping timeout: 258 seconds)
[07:34] *** eythian has joined #archiveteam-bs
[09:42] *** K4k has quit IRC (Read error: Operation timed out)
[09:43] *** K4k has joined #archiveteam-bs
[09:52] *** deevious has joined #archiveteam-bs
[10:41] *** bluefoo has quit IRC (Read error: Connection reset by peer)
[11:01] *** bluefoo has joined #archiveteam-bs
[11:20] *** killsushi has quit IRC (Quit: Leaving)
[13:22] *** second has joined #archiveteam-bs
[13:26] *** deevious has quit IRC (Quit: deevious)
[13:43] *** DigiDigi has joined #archiveteam-bs
[13:44] *** fredgido has joined #archiveteam-bs
[13:50] *** fredgido_ has quit IRC (Read error: Operation timed out)
[15:22] *** JH881326 has joined #archiveteam-bs
[15:40] *** eythian has quit IRC (Remote host closed the connection)
[15:40] Heyyyyy godane - I finally got one for you.
[15:40] https://community.apan.org/wg/tradoc-g2/fmso/p/fmso-file-cabinet
[15:40] I got all the books in the "bookshelf" up. But these monographs. Your best skills in transferring them over, please
[15:41] also https://community.apan.org/wg/tradoc-g2/fmso/p/oe-watch-issues
[15:42] *** eythian has joined #archiveteam-bs
[15:57] *** super3_ is now known as super3
[16:07] I completely forgot I had that thing running against the archivebot collection, and it's been doing the work! Down to page 4
[16:25] *** Sauce has joined #archiveteam-bs
[16:30] *** Stilettoo has joined #archiveteam-bs
[16:31] *** Stiletto has quit IRC (Ping timeout: 255 seconds)
[16:45] *** ShellyRol has quit IRC (Ping timeout: 492 seconds)
[16:46] *** ShellyRol has joined #archiveteam-bs
[17:04] *** ShellyRol has quit IRC (Remote host closed the connection)
[17:04] *** ShellyRol has joined #archiveteam-bs
[17:06] *** Sauce has quit IRC (C:\exit.exe)
[17:07] *** jamiew has joined #archiveteam-bs
[17:13] SketchCow: i will check do it the same way i did ERIC archives
[17:13] find the pdfs and filter out the html pages after
[17:15] i will call it the APAN archives or something like that
[17:17] oh this maybe good
[17:17] looks like i maybe able to grab every id with a pdf
[17:17] even stuff like OEWatch
[17:18] even though i'm doing it with the fmso path
[17:32] *** Stilettoo is now known as Stiletto
[17:44] Is archlinux just storing packages here https://archive.org/search.php?query=creator%3A%22Arch+Linux%22
[17:45] along with history
[18:04] *** zhongfu has quit IRC (Remote host closed the connection)
[18:05] *** zhongfu has joined #archiveteam-bs
[18:33] *** jamiew has quit IRC (Quit: My iMac has gone to sleep. ZZZzzz…)
[18:42] *** PurpleSym has quit IRC (Remote host closed the connection)
[18:44] *** bluefoo has quit IRC (Read error: Operation timed out)
[18:51] *** DogsRNice has joined #archiveteam-bs
[18:51] *** ShellyRol has quit IRC (Read error: Connection reset by peer)
[18:51] *** ShellyRol has joined #archiveteam-bs
[18:54] *** systwi has joined #archiveteam-bs
[19:03] *** bluefoo has joined #archiveteam-bs
[20:34] *** systwi_ has joined #archiveteam-bs
[20:36] *** ats has quit IRC (Quit: new motherboard (caution: sharks))
[20:42] *** systwi has quit IRC (Read error: Operation timed out)
[20:43] *** MaximeleG has joined #archiveteam-bs
[20:44] *** MaximeleG has quit IRC (Client Quit)
[20:50] *** MaximeleG has joined #archiveteam-bs
[20:50] *** MaximeleG has quit IRC (Client Quit)
[20:58] I'm still working on the archival of the legacy battle.net forums. So far I have all of EU, TW and KR discovered (~5.5M URLs discovered), and I'm almost done crawling the US forums (will probably be another 5-10M URLs). Could someone more knowledgeable guide me through the process of actually archiving it all once I have all the URLs?
[20:59] I have a data extractor written in python that I can run on a page and extract the relevant in JSON format. I wanted to build a proper archive of this, not just HTML dumps
[21:16] jleclanch: you mean us.battle.net? thats been done already
[21:18] as for archiving, we've done that on mips since they started blocking archivebot pipelines
[21:22] *** odemgi has quit IRC (Read error: Connection reset by peer)
[21:22] *** odemgi has joined #archiveteam-bs
[21:24] done where, do you have info about it?
[21:24] and yeah, like these forums: https://eu.battle.net/forums/en/wow/12309726/?page=5
[21:40] oh wow i didnt know these were a thing
[21:41] i thought us.battle.net was the only one
[21:42] Fusl: there's us.bnet, eu.bnet, kr.bnet and tw.bnet. I believe I have an exhaustive list of every single one of them (publicly available, that is)
[21:42] Fusl: metadata on every single forum: https://dpaste.de/KNmh
[21:45] thanks, ill start a mips job for those
[21:45] Fusl: what's a mips job?
[21:46] like #archivebot but on a special server with lots of ips and horse power
[21:47] Fusl: I see. Is it possible to get a dump of the HTML of all the pages collected? I'm interested in extracting the data and making it all searchable
[21:47] I have URL lists if you want
[21:47] sure
[21:47] let me upload them to drive, 1s
[21:49] Fusl: kr.battle.net, all URLs https://usercontent.irccloud-cdn.com/file/lgtj5EAe/urls.kr.txt.xz
[21:49] Fusl: tw.battle.net, all URLs https://usercontent.irccloud-cdn.com/file/XUfmFZvd/urls.tw.txt.xz
[21:49] (eu is still compressing)
[21:49] Fusl: eu.battle.net, all URLs https://usercontent.irccloud-cdn.com/file/L8bVuss2/urls.eu.txt.xz
[21:50] I should have US ready in a couple of days at most.
[21:51] ill throw all of them + the category starting point urls into mips to grab all of it
[21:51] Actually I say all URLs, none of these contain category URLs (topic listings). I'll be able to generate those at some point, but I already actually have all of them in a sqlite db
[21:51] but it's all posts URLs, including pages inside the posts
[21:53] thanks
[21:53] I'll ping back when US is ready. Fusl: where can I get a hold of all the pages afterwards, in a way that won't require me to scrape archive.org?
[21:53] ill dump them onto a storage server, you can download it with rsync from there for a few weeks until they are purged
[21:54] sweet, thank you :)
[21:54] I can also generate category page URLs if you want, btw (I have total topic counts, so I can figure out how many pages there are in a category)
[21:55] no worries, i generated those using the json
[21:55] well by exhaustive I mean including ?page=x, ?page=x+1, etc
[21:56] eg. this is the last page of the wow general discussion forum: https://us.battle.net/forums/en/wow/984270/?page=22090 (and loading it now probably will crash)
[22:08] feel free to follow the scraping progress here http://103.230.141.2:29000/ and here https://atdash.meo.ws/d/grabsitemips/grab-site-mips-pipeline?orgId=1&var-ident=e8cf4322&var-ident=ef4e18eb&var-ident=e16f9714&var-ident=d0b72ae6
[22:09] neat
[22:29] *** BlueMax has quit IRC (Quit: Leaving)
[22:38] *** ats has joined #archiveteam-bs
[22:43] *** killsushi has joined #archiveteam-bs
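Editor's note: the exchange around [21:54] to [21:56] describes deriving every `?page=x` category URL from a forum's total topic count. A minimal sketch of that idea follows. It is not the script anyone in the channel actually ran: the function name `category_page_urls`, the `topics_per_page` parameter, and the numbers in the usage line are all hypothetical, and the real number of topics per listing page is not stated in the log.

```python
def category_page_urls(category_url, topic_count, topics_per_page):
    """Return the ?page=N URL for every listing page of one forum category.

    topics_per_page is an assumption supplied by the caller; the log does
    not say how many topics the battle.net forums show per page.
    """
    if topics_per_page <= 0:
        raise ValueError("topics_per_page must be positive")
    # Integer ceiling division; an empty category still has one page.
    page_count = max(1, (topic_count + topics_per_page - 1) // topics_per_page)
    base = category_url.rstrip("/") + "/"
    return [f"{base}?page={n}" for n in range(1, page_count + 1)]


if __name__ == "__main__":
    # Hypothetical topic count and page size, for illustration only.
    for url in category_page_urls(
        "https://eu.battle.net/forums/en/wow/12309726/", 120, 50
    ):
        print(url)
```

With the hypothetical values above, the loop prints three URLs ending in `?page=1`, `?page=2`, and `?page=3`.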