#archiveteam-bs 2019-09-25,Wed

Time	Nickname	Message
00:48 ^🔗		dxrt has quit IRC (Ping timeout: 252 seconds)
00:51 ^🔗		chazchaz has quit IRC (Read error: Operation timed out)
00:57 ^🔗		dxrt has joined #archiveteam-bs
00:57 ^🔗		Fusl____ sets mode: +o dxrt
00:57 ^🔗		Fusl sets mode: +o dxrt
00:57 ^🔗		Fusl_ sets mode: +o dxrt
01:01 ^🔗		chazchaz has joined #archiveteam-bs
01:12 ^🔗		ats has joined #archiveteam-bs
01:12 ^🔗		ats_ has quit IRC (Read error: Operation timed out)
01:17 ^🔗		bluefoo has quit IRC (Read error: Connection reset by peer)
01:19 ^🔗		bluefoo has joined #archiveteam-bs
01:24 ^🔗		larryv has quit IRC (Quit: larryv)
02:05 ^🔗		DogsRNice has quit IRC (Read error: Connection reset by peer)
02:50 ^🔗	SketchCow	Just posted https://archive.org/details/ouyalibrary
02:50 ^🔗	SketchCow	See you in an unspecified CIA jail
03:30 ^🔗		qw3rty2 has joined #archiveteam-bs
03:35 ^🔗		qw3rty has quit IRC (Ping timeout: 745 seconds)
03:37 ^🔗		odemgi has joined #archiveteam-bs
03:40 ^🔗		odemgi_ has quit IRC (Ping timeout: 252 seconds)
03:45 ^🔗		Stiletto has quit IRC (Ping timeout: 360 seconds)
04:02 ^🔗		Stiletto has joined #archiveteam-bs
05:00 ^🔗		fredgido has quit IRC (Ping timeout: 612 seconds)
05:01 ^🔗		fredgido has joined #archiveteam-bs
05:15 ^🔗		fredgido_ has joined #archiveteam-bs
05:22 ^🔗		fredgido has quit IRC (Read error: Operation timed out)
06:19 ^🔗		deevious has quit IRC (Quit: deevious)
06:28 ^🔗		joshua_ has quit IRC (Remote host closed the connection)
06:33 ^🔗		DigiDigi has quit IRC (Remote host closed the connection)
07:28 ^🔗		eythian has quit IRC (Ping timeout: 258 seconds)
07:34 ^🔗		eythian has joined #archiveteam-bs
09:42 ^🔗		K4k has quit IRC (Read error: Operation timed out)
09:43 ^🔗		K4k has joined #archiveteam-bs
09:52 ^🔗		deevious has joined #archiveteam-bs
10:41 ^🔗		bluefoo has quit IRC (Read error: Connection reset by peer)
11:01 ^🔗		bluefoo has joined #archiveteam-bs
11:20 ^🔗		killsushi has quit IRC (Quit: Leaving)
13:22 ^🔗		second has joined #archiveteam-bs
13:26 ^🔗		deevious has quit IRC (Quit: deevious)
13:43 ^🔗		DigiDigi has joined #archiveteam-bs
13:44 ^🔗		fredgido has joined #archiveteam-bs
13:50 ^🔗		fredgido_ has quit IRC (Read error: Operation timed out)
15:22 ^🔗		JH881326 has joined #archiveteam-bs
15:40 ^🔗		eythian has quit IRC (Remote host closed the connection)
15:40 ^🔗	SketchCow	Heyyyyy godane - I finally got one for you.
15:40 ^🔗	SketchCow	https://community.apan.org/wg/tradoc-g2/fmso/p/fmso-file-cabinet
15:40 ^🔗	SketchCow	I got all the books in the "bookshelf" up. But these monographs. Your best skills in transferring them over, please
15:41 ^🔗	SketchCow	also https://community.apan.org/wg/tradoc-g2/fmso/p/oe-watch-issues
15:42 ^🔗		eythian has joined #archiveteam-bs
15:57 ^🔗		super3_ is now known as super3
16:07 ^🔗	SketchCow	I completely forgot I had that thing running against the archivebot collection, and it's been doing the work! Down to page 4
16:25 ^🔗		Sauce has joined #archiveteam-bs
16:30 ^🔗		Stilettoo has joined #archiveteam-bs
16:31 ^🔗		Stiletto has quit IRC (Ping timeout: 255 seconds)
16:45 ^🔗		ShellyRol has quit IRC (Ping timeout: 492 seconds)
16:46 ^🔗		ShellyRol has joined #archiveteam-bs
17:04 ^🔗		ShellyRol has quit IRC (Remote host closed the connection)
17:04 ^🔗		ShellyRol has joined #archiveteam-bs
17:06 ^🔗		Sauce has quit IRC (C:\exit.exe)
17:07 ^🔗		jamiew has joined #archiveteam-bs
17:13 ^🔗	godane	SketchCow: i will check, and do it the same way i did the ERIC archives
17:13 ^🔗	godane	find the pdfs and filter out the html pages after
17:15 ^🔗	godane	i will call it the APAN archives or something like that
17:17 ^🔗	godane	oh this may be good
17:17 ^🔗	godane	looks like i may be able to grab every id with a pdf
17:17 ^🔗	godane	even stuff like OEWatch
17:18 ^🔗	godane	even though i'm doing it with the fmso path
17:32 ^🔗		Stilettoo is now known as Stiletto
17:44 ^🔗	second	Is archlinux just storing packages here https://archive.org/search.php?query=creator%3A%22Arch+Linux%22
17:45 ^🔗	second	along with history
18:04 ^🔗		zhongfu has quit IRC (Remote host closed the connection)
18:05 ^🔗		zhongfu has joined #archiveteam-bs
18:33 ^🔗		jamiew has quit IRC (Quit: My iMac has gone to sleep. ZZZzzz…)
18:42 ^🔗		PurpleSym has quit IRC (Remote host closed the connection)
18:44 ^🔗		bluefoo has quit IRC (Read error: Operation timed out)
18:51 ^🔗		DogsRNice has joined #archiveteam-bs
18:51 ^🔗		ShellyRol has quit IRC (Read error: Connection reset by peer)
18:51 ^🔗		ShellyRol has joined #archiveteam-bs
18:54 ^🔗		systwi has joined #archiveteam-bs
19:03 ^🔗		bluefoo has joined #archiveteam-bs
20:34 ^🔗		systwi_ has joined #archiveteam-bs
20:36 ^🔗		ats has quit IRC (Quit: new motherboard (caution: sharks))
20:42 ^🔗		systwi has quit IRC (Read error: Operation timed out)
20:43 ^🔗		MaximeleG has joined #archiveteam-bs
20:44 ^🔗		MaximeleG has quit IRC (Client Quit)
20:50 ^🔗		MaximeleG has joined #archiveteam-bs
20:50 ^🔗		MaximeleG has quit IRC (Client Quit)
20:58 ^🔗	jleclanch	I'm still working on the archival of the legacy battle.net forums. So far I have all of EU, TW and KR discovered (~5.5M URLs discovered), and I'm almost done crawling the US forums (will probably be another 5-10M URLs). Could someone more knowledgeable guide me through the process of actually archiving it all once I have all the URLs?
20:59 ^🔗	jleclanch	I have a data extractor written in python that I can run on a page and extract the relevant data in JSON format. I wanted to build a proper archive of this, not just HTML dumps
21:16 ^🔗	Fusl	jleclanch: you mean us.battle.net? thats been done already
21:18 ^🔗	Fusl	as for archiving, we've done that on mips since they started blocking archivebot pipelines
21:22 ^🔗		odemgi has quit IRC (Read error: Connection reset by peer)
21:22 ^🔗		odemgi has joined #archiveteam-bs
21:24 ^🔗	jleclanch	done where, do you have info about it?
21:24 ^🔗	jleclanch	and yeah, like these forums: https://eu.battle.net/forums/en/wow/12309726/?page=5
21:40 ^🔗	Fusl	oh wow i didnt know these were a thing
21:41 ^🔗	Fusl	i thought us.battle.net was the only one
21:42 ^🔗	jleclanch	Fusl: there's us.bnet, eu.bnet, kr.bnet and tw.bnet. I believe I have an exhaustive list of every single one of them (publicly available, that is)
21:42 ^🔗	jleclanch	Fusl: metadata on every single forum: https://dpaste.de/KNmh
21:45 ^🔗	Fusl	thanks, ill start a mips job for those
21:45 ^🔗	jleclanch	Fusl: what's a mips job?
21:46 ^🔗	Fusl	like #archivebot but on a special server with lots of ips and horse power
21:47 ^🔗	jleclanch	Fusl: I see. Is it possible to get a dump of the HTML of all the pages collected? I'm interested in extracting the data and making it all searchable
21:47 ^🔗	jleclanch	I have URL lists if you want
21:47 ^🔗	Fusl	sure
21:47 ^🔗	jleclanch	let me upload them to drive, 1s
21:49 ^🔗	jleclanch	Fusl: kr.battle.net, all URLs https://usercontent.irccloud-cdn.com/file/lgtj5EAe/urls.kr.txt.xz
21:49 ^🔗	jleclanch	Fusl: tw.battle.net, all URLs https://usercontent.irccloud-cdn.com/file/XUfmFZvd/urls.tw.txt.xz
21:49 ^🔗	jleclanch	(eu is still compressing)
21:49 ^🔗	jleclanch	Fusl: eu.battle.net, all URLs https://usercontent.irccloud-cdn.com/file/L8bVuss2/urls.eu.txt.xz
21:50 ^🔗	jleclanch	I should have US ready in a couple of days at most.
21:51 ^🔗	Fusl	ill throw all of them + the category starting point urls into mips to grab all of it
21:51 ^🔗	jleclanch	Actually I say all URLs, none of these contain category URLs (topic listings). I'll be able to generate those at some point, but I already actually have all of them in a sqlite db
21:51 ^🔗	jleclanch	but it's all posts URLs, including pages inside the posts
21:53 ^🔗	Fusl	thanks
21:53 ^🔗	jleclanch	I'll ping back when US is ready. Fusl: where can I get a hold of all the pages afterwards, in a way that won't require me to scrape archive.org?
21:53 ^🔗	Fusl	ill dump them onto a storage server, you can download it with rsync from there for a few weeks until they are purged
21:54 ^🔗	jleclanch	sweet, thank you :)
21:54 ^🔗	jleclanch	I can also generate category page URLs if you want, btw (I have total topic counts, so I can figure out how many pages there are in a category)
21:55 ^🔗	Fusl	no worries, i generated those using the json
21:55 ^🔗	jleclanch	well by exhaustive I mean including ?page=x, ?page=x+1, etc
21:56 ^🔗	jleclanch	eg. this is the last page of the wow general discussion forum: https://us.battle.net/forums/en/wow/984270/?page=22090 (and loading it now probably will crash)
22:08 ^🔗	Fusl	feel free to follow the scraping progress here http://103.230.141.2:29000/ and here https://atdash.meo.ws/d/grabsitemips/grab-site-mips-pipeline?orgId=1&var-ident=e8cf4322&var-ident=ef4e18eb&var-ident=e16f9714&var-ident=d0b72ae6
22:09 ^🔗	jleclanch	neat
22:29 ^🔗		BlueMax has quit IRC (Quit: Leaving)
22:38 ^🔗		ats has joined #archiveteam-bs
22:43 ^🔗		killsushi has joined #archiveteam-bs
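
Note on the 21:54-21:56 exchange: jleclanch describes deriving every category listing URL (?page=x, ?page=x+1, ...) from a category's total topic count. A minimal sketch of that idea follows. It is not jleclanch's or Fusl's actual code. The function name and the topics-per-page value are assumptions, since the log does not say how many topics one listing page holds.

```python
def category_page_urls(base_url: str, topic_count: int, topics_per_page: int) -> list[str]:
    """Build the ?page=N listing URLs for one forum category.

    base_url        -- category URL ending in "/", as in the log's examples
    topic_count     -- total number of topics in the category
    topics_per_page -- ASSUMPTION: the page size is not stated in the log
    """
    if topic_count < 0 or topics_per_page <= 0:
        raise ValueError("topic_count must be >= 0 and topics_per_page > 0")
    # Ceiling division in integers; an empty category still has one listing page.
    pages = max(1, (topic_count + topics_per_page - 1) // topics_per_page)
    return [f"{base_url}?page={n}" for n in range(1, pages + 1)]


if __name__ == "__main__":
    # Hypothetical numbers, for illustration only.
    for url in category_page_urls("https://us.battle.net/forums/en/wow/984270/", 120, 50):
        print(url)
```

With the real topic counts from the sqlite db mentioned at 21:51, the same loop would be run once per category to get the exhaustive listing URLs.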
