[00:24] *** BlueMax has joined #archiveteam-bs
[00:30] *** lennier2 has joined #archiveteam-bs
[00:39] *** sivoais_ has joined #archiveteam-bs
[00:41] *** justcool3 has joined #archiveteam-bs
[00:43] *** lennier1 has quit IRC (Ping timeout: 745 seconds)
[00:43] *** lennier2 is now known as lennier1
[00:49] *** sivoais has quit IRC (se.hub irc.efnet.nl)
[00:49] *** Mateon1 has quit IRC (se.hub irc.efnet.nl)
[00:49] *** Ctrl has quit IRC (se.hub irc.efnet.nl)
[00:49] *** Meli has quit IRC (se.hub irc.efnet.nl)
[01:20] *** OrIdow6 has quit IRC (Read error: Connection reset by peer)
[01:20] *** OrIdow6 has joined #archiveteam-bs
[01:24] *** Mateon1 has joined #archiveteam-bs
[01:29] *** Ctrl has joined #archiveteam-bs
[01:30] *** fredgido_ has quit IRC (Read error: Connection reset by peer)
[01:34] *** fredgido_ has joined #archiveteam-bs
[02:31] *** vitzli has joined #archiveteam-bs
[02:33] *** vitzli has quit IRC (Client Quit)
[02:42] *** wyatt8750 has joined #archiveteam-bs
[02:43] *** wyatt8740 has quit IRC (Ping timeout: 260 seconds)
[03:31] *** DogsRNice has quit IRC (Read error: Connection reset by peer)
[03:41] *** qw3rty__ has quit IRC (Ping timeout: 265 seconds)
[03:55] *** wyatt8750 has quit IRC (Read error: Operation timed out)
[03:55] *** wyatt8740 has joined #archiveteam-bs
[04:16] *** HP_Archiv has joined #archiveteam-bs
[04:45] *** OrIdow6 has quit IRC (Quit: Leaving.)
[06:07] *** britmob_ has quit IRC (Read error: Connection reset by peer)
[06:08] *** britmob has joined #archiveteam-bs
[06:11] *** dashcloud has quit IRC (Quit: No Ping reply in 180 seconds.)
[06:18] *** dashcloud has joined #archiveteam-bs
[06:24] *** Terbium has quit IRC (Read error: Connection reset by peer)
[06:25] *** Terbium has joined #archiveteam-bs
[06:39] *** qw3rty has joined #archiveteam-bs
[06:50] *** justcool3 has quit IRC (Quit: Connection closed for inactivity)
[07:20] *** OrIdow6 has joined #archiveteam-bs
[09:26] Some time ago I asked about archiving someone's Twitter, and someone mentioned socialbot? I'd like to try it out (someone I follow is planning to close their account)
[09:54] *** OrIdow6 has quit IRC (Remote host closed the connection)
[09:54] *** OrIdow6 has joined #archiveteam-bs
[10:01] Hello Arcorann, which Twitter account do you want to archive?
[10:22] Handle is @CINDERELLAGlRLS
[10:40] Ah, found the source saying they're shutting down; it's now being run through socialbot and will go through AB afterwards
[10:40] Thanks
[10:42] *** asdf01010 has joined #archiveteam-bs
[10:45] *** asdf0101 has quit IRC (Read error: Operation timed out)
[10:45] *** asdf01010 is now known as asdf0101
[11:09] *** wessel152 has joined #archiveteam-bs
[11:35] *** BlueMax has quit IRC (Read error: Connection reset by peer)
[12:09] *** lunik13 has quit IRC (Ping timeout: 265 seconds)
[12:23] *** lunik13 has joined #archiveteam-bs
[12:28] *** dashcloud has quit IRC (Read error: Operation timed out)
[12:34] *** HP_Archiv has quit IRC (Quit: Leaving)
[14:26] *** Arcorann has quit IRC (Read error: Connection reset by peer)
[14:53] *** pew has quit IRC (Ping timeout: 265 seconds)
[14:53] *** pew has joined #archiveteam-bs
[15:59] *** dashcloud has joined #archiveteam-bs
[16:14] *** Dj-Wawa has quit IRC (Quit: Dj-Wawa)
[16:15] *** Dj-Wawa has joined #archiveteam-bs
[16:19] *** DogsRNice has joined #archiveteam-bs
[16:21] *** Panasonic has quit IRC (Read error: Connection reset by peer)
[17:10] *** Dj-Wawa has quit IRC (Dj-Wawa)
[17:21] *** Dj-Wawa has joined #archiveteam-bs
[17:51] Looks like arkiver is purging
[18:03] *** ripdog has quit IRC (Remote host closed the connection)
[18:14] *** logchfoo3 starts logging #archiveteam-bs at Thu Jun 25 18:14:03 2020
[18:14] *** logchfoo3 has joined #archiveteam-bs
[18:14] *** pew has joined #archiveteam-bs
[18:17] *** VoynichCr has joined #archiveteam-bs
[18:17] *** Dj-Wawa has joined #archiveteam-bs
[18:18] *** i0npulse has joined #archiveteam-bs
[18:18] *** roxfan has joined #archiveteam-bs
[18:19] *** atomicthu has joined #archiveteam-bs
[18:22] JAA: hello
[18:24] i have no idea how suitable what i'm doing is for Archive Team's work; i've just got a wpull one-liner that scrapes an individual SA subforum, downloading the forum and threads plus page requisites
[18:25] every page downloaded has my username embedded in it
[18:25] due to being logged in and such
[18:25] Are you producing WARC files or plain .html etc.?
[18:25] both
[18:26] Good. The WARC files are what we're after. But yes, being logged in is always a bit problematic.
[18:26] I saw a post earlier that the admins aren't sure what the future of the site is and are looking at offsite options. I turned off grabbing stuff like "show user's posts in this thread" and clickable links to quoted posts to save time, so those links will be broken.
[18:27] no idea if wpull's database allows a rescrape with different parameters to grab stuff like that. if it does i might turn off external page requisites too
[18:28] I'm not getting Archives either, despite paying for access, because it looks like that functionality is broken serverside
[18:29] Old threads that have continued posting and haven't fallen off the boards will still be present
[18:29] the one I've been running since yesterday was only scraping The Firing Range as a test; I've just added YOSPOS and The Dorkroom since that's pretty much where I posted when I was on there
[18:31] It's not possible to rescrape that directly just by changing options after the fact, but the URLs can be pulled from the DB file and then grabbed in a separate wpull run.
[18:32] Right, I've been wondering about those "Archives"... Does that mean that over half of the content on the forums is just impossible to access?
[18:32] Well, over half the threads, almost three quarters of the posts.
[18:32] Going by the numbers on the homepage, that is.
[18:34] yeah basically if a thread didn't get posted in for long enough it falls into the archives
[18:35] there are very long-running threads in some forums, but in some cases threads go through iterations of being closed and a new one created, and the old one would fall into archives
[18:36] JAA: a hacky workaround might be to just stuff the scraped files into a local web server, and then run a second scrape on that :v
[18:37] workaround for rescrape with requisites, rather
[18:38] Eww
[18:39] i've had to kill the scraper a few times when it hung on a graceful stop. I just realized I don't know if that ruins the WARC or not (--warc-append is turned on)
[18:40] It shouldn't, except it might. Which wpull version are you using?
[18:41] 2.0.1
[18:41] Yeah, that's what I thought.
[18:41] I recommend using 2.0.3 instead. It's not on PyPI but needs to be installed from GitHub.
[18:42] ok
[18:42] pip install git+https://github.com/ArchiveTeam/wpull.git@v2.0.3
[18:42] one more question: what's the best way to get it to download page requisites from anything on any "somethingawful.com" subdomain, but not scrape pictures from imgur etc?
[18:43] What killing wpull does ruin is the log files. So keep any tmp-*.log.gz files you may have in the directory.
[18:43] got it
[18:44] `--page-requisites --span-hosts-allow page-requisites --domains somethingawful.com` should do it I think.
[18:44] thanks.
[18:49] I might restart the scrape and move things to a big linode. Possibly grab the whole forum if I'm able.
[18:50] what amount of RAM/vps size should I budget here?
[18:54] You only need (base system +) ~100-200 MB RAM for wpull. The resource that will likely be limiting is the CPU due to the HTML parsing, compressing for WARCs, etc. wpull is single-threaded (for practical purposes), so you won't gain anything by having more than 1 core either.
[18:54] ok.
[18:58] can, or should I, indicate in the user agent that the scrape is for Archive Team, along with a request to PM me if I need to slow it down? I know I'm not using whatever official solution exists for this
[18:58] my own was "fuck lowtax, this is atomicthumbs archiving selected subforums, if i'm slamming the server please PM and I'll slow it down" :v
[18:59] it's behind cloudflare but I have no idea what % of my requests actually hit a dynamic page
[19:01] thank you for all your help btw
[19:12] *** Meli has joined #archiveteam-bs
[19:12] It's generally a good idea to identify yourself. You're logged in though, so they'd already know who you are in theory. Adding your profile URL to the UA might still be good.
[19:50] *** godane has joined #archiveteam-bs
[20:12] *** mtntmnky_ has joined #archiveteam-bs
[20:13] *** mtntmnky has quit IRC (Remote host closed the connection)
[20:20] *** roxfan has quit IRC (Remote host closed the connection)
[20:20] *** schbirid has quit IRC (Quit: Leaving)
[21:23] *** mtntmnky has joined #archiveteam-bs
[21:33] oh christ i never paid for no-ads and doing that now would mean giving lowtax money
[21:34] *** mtntmnky_ has quit IRC (Remote host closed the connection)
[21:53] gonna fire this thing up once python 3.6 finishes building on the linode.
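Putting the pieces of this exchange together, a full invocation might look like the sketch below. This is a hypothetical assembly, not the command actually run in the chat: the seed URL (`forumid=XXX`), `cookies.txt` (for the logged-in session), the user-agent string, and the output names are all placeholders.

```sh
# Recommended wpull version, installed from GitHub as noted above:
pip install git+https://github.com/ArchiveTeam/wpull.git@v2.0.3

# Hypothetical invocation assembled from the options discussed above.
wpull 'https://forums.somethingawful.com/forumdisplay.php?forumid=XXX' \
    --recursive \
    --page-requisites \
    --span-hosts-allow page-requisites \
    --domains somethingawful.com \
    --load-cookies cookies.txt \
    --user-agent 'atomicthumbs archiving selected subforums; PM me and I will slow down' \
    --warc-file sa-forums \
    --warc-append \
    --database sa-forums.db \
    --delete-after  # omit this to also keep the plain .html tree alongside the WARCs
```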
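On the point about pulling URLs back out of the DB file for a separate run: wpull's `--database` file is plain SQLite, so something like the following sketch works. The `url_strings` table name is an assumption based on wpull 2.x; the layout varies between versions, so inspect the schema rather than trusting these names.

```sh
# Check what the tables actually look like in your wpull version first:
sqlite3 sa-forums.db '.schema'

# Assuming a url_strings table holding the raw URL text (wpull 2.x layout),
# dump every URL the crawl has seen:
sqlite3 sa-forums.db 'SELECT url FROM url_strings;' > seen-urls.txt

# Filter that list down to the link types the first run skipped, then feed it
# to a second wpull run with different parameters:
wpull --input-file seen-urls.txt \
    --page-requisites \
    --warc-file sa-forums-rescrape \
    --database sa-forums-rescrape.db
```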
[21:53] is there any reason I shouldn't use --html-parser libxml2-lxml on wpull?
[21:54] got postgres installed for sqlalchemy
[21:54] *** lennier2 has joined #archiveteam-bs
[21:55] *** lennier1 has quit IRC (Read error: Operation timed out)
[21:56] *** lennier2 is now known as lennier1
[22:01] *** Jens has joined #archiveteam-bs
[22:04] atomicthu: wpull only works with SQLite (even though it uses SQLAlchemy). It uses syntax (INSERT OR IGNORE) that to my knowledge is not supported by anything else.
[22:04] thank you
[22:05] The libxml2-lxml parser might fail in some corner cases of invalid HTML, but it should work fine for the vast majority of sites and is much faster than html5lib.
[22:21] sweet
[22:21] i'm running it on a dedicated cpu instance so hopefully it'll mostly be waiting for the forum
[23:07] *** Arcorann has joined #archiveteam-bs
[23:19] http://www.tigris.org/ "The site will be decommissioned and shut down on 1-July-2020"
[23:19] Yeah... I know that, now let me access the stuff.
[23:19] Someone is doing dev work on the prod system, it seems.
[23:22] INFO Converting links in file ‘forums.somethingawful.com/js/vb/forums.combined.js?1476414227’ (type=None).
[23:22] ERROR Fatal exception.
[23:22] [stack trace]
[23:22] Exception: Unknown link type.
[23:22] maybe that's a reason not to use lxml
[23:23] Hmm, no idea, never used -k before as I typically only save WARCs and use --delete-after for the rest.
[23:24] hm
[23:24] That seems like a bug though, and also type really shouldn't be None.
[23:24] This is my first time using WARCs. Are they as easily browsable as a local mirror of the site, given the right tool?
[23:24] Yes, with pywb.
[23:26] In theory, you could also produce your familiar tree structure from it.
[23:26] WARCs contain the raw HTTP requests and responses, so really any processing can be done with them later if desired.
[23:26] I'm not aware of a tool to do it, but it should be possible by reusing some wpull innards.
[23:26] yeah it looks like i could just... crawl warcserver if i wanted to
[23:26] and convert the links?
[23:27] That would also work if your archive isn't too large (since it'll be fairly inefficient).
[23:30] *** Maylay has quit IRC (Read error: Operation timed out)
[23:39] *** BlueMax has joined #archiveteam-bs
[23:42] *** Maylay has joined #archiveteam-bs
[23:45] jodizzle: yeah fuck that
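An aside on the SQLite-only point above: INSERT OR IGNORE is SQLite's own spelling of conflict handling (PostgreSQL, for instance, spells it ON CONFLICT DO NOTHING), which is why swapping a different backend under wpull's SQLAlchemy layer doesn't just work. A quick illustration of the SQLite behaviour, using a throwaway database:

```sh
sqlite3 /tmp/demo.db "
CREATE TABLE urls (url TEXT PRIMARY KEY);
INSERT OR IGNORE INTO urls VALUES ('http://example.com/');
INSERT OR IGNORE INTO urls VALUES ('http://example.com/');  -- duplicate, silently skipped
SELECT count(*) FROM urls;                                  -- prints 1
"
```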
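For browsing the finished WARCs with pywb, as suggested at the end, a minimal setup looks roughly like this; the collection name and WARC filename are placeholders.

```sh
pip install pywb

# Create a collection and add the finished WARC(s) to it:
wb-manager init sa-forums
wb-manager add sa-forums sa-forums-00000.warc.gz

# Serve it; replay is then available at
# http://localhost:8080/sa-forums/https://forums.somethingawful.com/
wayback
```

The "crawl warcserver" idea floated above would then amount to pointing a link-converting crawler (e.g. wget --mirror --convert-links) at that localhost URL to produce a plain browsable tree, with the inefficiency JAA notes.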