[00:02] *** scorche` has joined #archiveteam-bs
[00:03] *** achip has quit IRC (Ping timeout: 255 seconds)
[00:03] *** scorche` has quit IRC (hub.efnet.us irc.Prison.NET)
[00:03] *** sivoais_ has quit IRC (hub.efnet.us irc.Prison.NET)
[00:03] *** scorche has quit IRC (hub.efnet.us irc.Prison.NET)
[00:03] *** Somebody2 has quit IRC (hub.efnet.us irc.Prison.NET)
[00:07] *** sivoais has joined #archiveteam-bs
[00:13] *** achip has joined #archiveteam-bs
[00:13] *** scorche has joined #archiveteam-bs
[00:13] *** Somebody2 has joined #archiveteam-bs
[00:30] *** godane has quit IRC (Read error: Operation timed out)
[00:31] *** godane has joined #archiveteam-bs
[01:11] *** lennier2 has joined #archiveteam-bs
[01:15] *** lennier1 has quit IRC (Read error: Operation timed out)
[01:15] *** lennier2 is now known as lennier1
[01:47] *** larryv has joined #archiveteam-bs
[02:01] *** mgrandi has joined #archiveteam-bs
[02:19] *** lennier2 has joined #archiveteam-bs
[02:22] *** lennier1 has quit IRC (Read error: Connection reset by peer)
[02:22] *** lennier2 is now known as lennier1
[02:54] *** mgrandi has quit IRC (Leaving)
[03:04] *** DogsRNice has quit IRC (Read error: Connection reset by peer)
[03:36] *** qw3rty_ has joined #archiveteam-bs
[03:43] *** qw3rty has quit IRC (Read error: Operation timed out)
[04:55] *** benjins has quit IRC (Read error: Operation timed out)
[05:24] *** justcool3 has joined #archiveteam-bs
[05:50] *** HP_Archiv has joined #archiveteam-bs
[06:02] any ideas why wpull won't resume from a crawl that finished, with new regexes to get it to pull more?
[06:02] i rerun the command and just get
[06:02] INFO FINISHED.
[06:02] INFO Duration: 0:00:00. Speed: -- B/s.
[06:02] INFO Downloaded: 0 files, 0.0 B.
[06:03] and it creates a 4kb additional numbered warc
[07:24] *** larryv_ has joined #archiveteam-bs
[07:24] *** larryv_ has quit IRC (Client Quit)
[07:29] *** larryv has quit IRC (Read error: Operation timed out)
[09:20] *** achip has quit IRC (Read error: Operation timed out)
[09:25] *** achip has joined #archiveteam-bs
[09:44] *** bsmith093 has quit IRC (Ping timeout: 272 seconds)
[10:00] *** bsmith093 has joined #archiveteam-bs
[10:10] *** benjins has joined #archiveteam-bs
[10:35] atomicthu: As mentioned yesterday, that doesn't work. The URLs are already marked as "skipped" in the DB, so adjusting the options doesn't do anything.
[10:35] woops
[10:35] i misinterpreted or forgot. thank you
[10:37] *** HP_Archiv has quit IRC (Quit: Leaving)
[11:34] *** ranma_ has joined #archiveteam-bs
[11:36] *** ranma has quit IRC (Ping timeout: 272 seconds)
[11:43] *** BlueMax has quit IRC (Read error: Connection reset by peer)
[11:46] *** Ravenloft has joined #archiveteam-bs
[12:14] *** dashcloud has quit IRC (Read error: Operation timed out)
[13:10] *** dashcloud has joined #archiveteam-bs
[13:11] *** HP_Archiv has joined #archiveteam-bs
[13:19] *** sembiance has quit IRC (Remote host closed the connection)
[13:51] *** HP_Archiv has quit IRC (Quit: Leaving)
[14:41] So apparently Tigris.org already nuked everything even though the deadline is 1 July.
[14:43] Their shutdown notice advised people to "copy and archive any data you wish to keep before that date [2020-07-01]"...
[14:43] Great job, millions of issue comments and discussion posts destroyed prematurely just when I was getting ready to start my grab.
[14:44] :(
[15:05] The Center for Election Science's forums are shutting down. They're entering read-only mode on 30 July and will remain online until the end of the year or so. https://forum.electionscience.org/t/archiving-the-forum-july-30th/697
[15:05] Fairly small, just an AB job in August.
[15:10] *** Arcorann has quit IRC (Read error: Connection reset by peer)
[15:27] Has anyone done anything with regards to the Something Awful forums so far?
[15:28] I mean apart from atomicthu. :-)
[15:28] I'm considering emailing the mods to see if we can work something out with them.
[15:29] To get all ~178M posts going back to the early years.
[15:41] atomicthu: Also, can you clarify please what you meant when you said the archives were inaccessible even with the archive upgrade?
[15:41] *** Rapix has joined #archiveteam-bs
[16:01] *** dashcloud has quit IRC (Ping timeout: 272 seconds)
[16:03] anyone interested in a time-sensitive archiving project? Find any sites/stores etc that are "Dixie" themed. Many stores are, such as https://dixieoutfitters.com/
[16:04] If history is any indicator, WordPress will shut some down, and some will just go under from outside pressure
[16:05] *** dashcloud has joined #archiveteam-bs
[16:06] yikes
[16:08] it's already happened to many Confederate groups, now the target is on the stores: https://fox2now.com/news/missouri/confederate-store-in-branson-missouri-at-protests-center/
[16:08] I have too much going on to run this one as a project, but I think it would be great for someone to do so
[16:43] *** Ctrl has quit IRC (Ping timeout: 857 seconds)
[17:00] *** larryv has joined #archiveteam-bs
[17:01] Interesting. Any ideas on where and how to look?
[17:01] I'm guessing that there isn't a simple directory of them.
[17:04] *** VoynichCr has left
[17:05] jodizzle no, probably more of a hunt for them with Google searches
[17:06] there are probably links to similar sites at the back of some websites, like so many sites do, but haven't looked deep enough to know if there is a larger list of them
[17:07] Hm, alright.
I might look into it
[17:07] Nice :)
[17:08] *** VoynichCr has joined #archiveteam-bs
[17:23] *** Ctrl has joined #archiveteam-bs
[17:45] *** DogsRNice has joined #archiveteam-bs
[17:47] this community is getting shut down soon http://www.skvrs.com/ https://steamcommunity.com/groups/skiver
[17:48] *** lunik13 has quit IRC (Quit: :x)
[17:50] *** lunik13 has joined #archiveteam-bs
[17:50] *** maxfan8_ has joined #archiveteam-bs
[17:51] *** maxfan8_ has quit IRC (Client Quit)
[17:51] these are the only two places they have
[17:56] sorry i didn't put this in the archivebot channel btw, it's kinda busy there right now so this probably would have been buried quickly
[17:57] *** maxfan8_ has joined #archiveteam-bs
[18:01] *** maxfan8 has quit IRC (Ping timeout: 745 seconds)
[18:01] *** maxfan8_ is now known as maxfan8
[18:05] *** Pixi has quit IRC (Quit: Leaving)
[18:14] DogsRNice, did a run of that community website, although it might be an attempt 2 depending if it gets to the forums somehow
[18:18] *** Pixi has joined #archiveteam-bs
[18:28] i can't really tell if it grabbed the forums or not
[18:47] *** mtntmnky has quit IRC (Remote host closed the connection)
[18:48] *** mtntmnky has joined #archiveteam-bs
[18:48] *** mtntmnky has quit IRC (Client Quit)
[18:57] *** mtntmnky has joined #archiveteam-bs
[20:25] *** t3 has joined #archiveteam-bs
[20:39] *** RichardG_ has joined #archiveteam-bs
[20:40] *** RichardG has quit IRC (Read error: Connection reset by peer)
[20:48] *** RichardG_ is now known as RichardG
[21:56] *** dashcloud has quit IRC (Quit: http://quassel-irc.org - Chat comfortably. Anywhere.)
[22:25] *** wyatt8750 has joined #archiveteam-bs
[22:26] *** wyatt8740 has quit IRC (Read error: Operation timed out)
[22:27] *** Nikchemny has joined #archiveteam-bs
[22:27] https://archive.org/details/archiveteam_videobot_twitter_com_907101280488837121 Please delete this item
[22:29] Nikchemny: #archiveteam is only intended for very important messages.
[22:29] Not random chat.
[22:30] Ok
[22:30] JAA: Btw, I took two photos of election posters in my city. Are they needed?
[22:30] Yeah, that sounds like something we should archive.
[22:30] Upload them to https://transfer.notkiska.pw/ ?
[22:30] Emm
[22:30] Nope
[22:31] I meant upload them to the wiki
[22:31] Ah
[22:31] For the article
[22:31] Yeah, that's also good. :-)
[22:31] So, is videobot random?
[22:32] Or
[22:32] It was on request. I think there were ways to give it individual tweets, users, hashtags, search terms, etc.
[22:33] It has been broken for 2 years or so now.
[22:33] *** maxfan8 has quit IRC (Quit: WeeChat 2.8)
[22:33] *** maxfan8 has joined #archiveteam-bs
[22:34] Wow
[22:34] #archivebot
[22:36] Wow. American politician liked this porn video http://web.archive.org/web/20170912042432im_/https://pbs.twimg.com/media/DJfxCREUMAAg9GU.jpg
[22:40] Ah yes, that's probably why it was archived then.
[22:42] So, all videos (on yt) about the elections can be found here: https://www.youtube.com/results?search_query=%D0%BF%D0%BE%D0%BF%D1%80%D0%B0%D0%B2%D0%BA%D0%B8
[22:43] #youtubearchive on hackint for YouTube stuff
[22:44] /join #youtubearchive
[22:44] whoops
[22:45] *** ramolg has joined #archiveteam-bs
[22:48] I'm sorry, I can't connect to hackint; I'll just create another account on IA and mirror videos on it
[22:49] uh
[22:49] Nikchemny: what videos?
[22:52] arkiver These videos https://www.youtube.com/results?search_query=%D0%BF%D0%BE%D0%BF%D1%80%D0%B0%D0%B2%D0%BA%D0%B8
[22:54] Nikchemny: why should they be archived?
[23:03] *** Rapix has quit IRC (Read error: Connection reset by peer)
[23:09] *** lennier1 has quit IRC (Quit: Going offline, see ya! (www.adiirc.com))
[23:11] *** lennier1 has joined #archiveteam-bs
[23:16] *** ranma_ has quit IRC ()
[23:22] arkiver: Almost all of them are about this "great" event: https://en.wikipedia.org/wiki/2020_amendments_to_the_Constitution_of_Russia
[23:26] Hi all.
I've been working on a project to both archive and consolidate US government technical report databases like DTIC, OSTI, and NTRS.
[23:27] I'm planning to launch a website later this year.
[23:27] I'm a competent bot programmer but am at the limits of my web programming knowledge. I'm particularly clueless about how to write a good search engine for this database.
[23:28] Right now I have about 2 GB in a sqlite file, DTIC citations only.
[23:28] Anyone have any recommendations about how to write a good search engine? Preferably in Python.
[23:28] arkiver: also there are these videos: https://www.youtube.com/results?search_query=%D0%BF%D0%BE%D0%BF%D1%80%D0%B0%D0%B2%D0%BA%D0%B8+%D1%80%D0%B5%D0%BA%D0%BB%D0%B0%D0%BC%D0%B0
[23:28] About election advertisements
[23:29] DTIC's current search engine is pretty bad so I want to make a good one.
[23:30] Also: I'm getting a lot of non-public information via FOIA requests. At present I have declassified citations to thousands of classified reports, though obviously none of the reports.
[23:32] ramolg: You probably don't want to write your own search engine. Look into Solr, Elasticsearch, etc.
[23:42] JAA: Thanks, I'm looking at Solr right now but it's clear that there's a lot I need to learn before I can approach this.
[23:43] Yeah, large-scale text indexing is not easy.
[23:45] How much easier would it be if I only searched the metadata/citations rather than full text of the reports?
[23:47] Also: I've been very slowly spidering this little-known website from Los Alamos National Labs: https://permalink.lanl.gov/object/view?what=info:lanl-repo/lareport/LA-00001
[23:47] The only "interface" to the website that I'm aware of is changing the report number in the URL string.
[23:48] I've downloaded every page listed on Google and the Internet Archive.
[23:48] I've been running a bot to systematically try almost every report number, but if I try more than about one a minute they IP ban me.
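(Editor's sketch, not part of the log.) The throttled enumeration described above, one report number at a time against the permalink.lanl.gov URL pattern, could look roughly like this. It is not ramolg's bot: the names `candidate_urls` and `Throttle` are hypothetical, the five-digit zero padding is inferred from the single example `LA-00001`, the 60-second interval is taken from the "about one a minute" remark, and the actual HTTP fetch is deliberately left out.

```python
import time
from typing import Callable, Iterator, Optional

# URL prefix copied from the example in the log; everything after it is assumed
# to be "LA-" plus a zero-padded report number.
BASE = "https://permalink.lanl.gov/object/view?what=info:lanl-repo/lareport/"


def candidate_urls(start: int, stop: int, width: int = 5) -> Iterator[str]:
    """Yield one candidate URL per report number in [start, stop)."""
    for n in range(start, stop):
        yield f"{BASE}LA-{n:0{width}d}"


class Throttle:
    """Block so that consecutive wait() calls are at least min_interval seconds apart."""

    def __init__(
        self,
        min_interval: float = 60.0,  # assumption: roughly one request per minute
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self._min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None

    def wait(self) -> None:
        now = self._clock()
        if self._last is not None:
            remaining = self._min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

A crawl loop would call `throttle.wait()` before each request and hand the URL to whatever HTTP client is in use. The clock and sleep functions are injectable so the pacing logic can be checked without waiting in real time.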
[23:48] Any ideas about how to get a comprehensive list of pages on that site?
[23:51] You can certainly build a basic search function with any DB and some LIKE conditions or similar, but at least with SQLite, you'll quickly run into performance issues if you run that at scale. Proper DBs will do better, but for free-form text, not so sure.
[23:51] Noted.
[23:51] You'll want more IPs, i.e. rent cheap virtual/cloud servers etc.
[23:53] *** Nikchemny has quit IRC (Ping timeout: 252 seconds)
[23:53] I know that godane did some things regarding DTIC before. Not sure about any of the other databases you mentioned.
[23:55] I'm really interested in any technical report database, and also things that haven't been digitized yet.
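(Editor's sketch, not part of the log.) The "basic search function with any DB and some LIKE conditions" mentioned at 23:51 might look like the following with Python's built-in `sqlite3`. The table name `citations`, its columns, and the sample rows are all hypothetical placeholders, since the schema of the 2 GB DTIC file is not given in the log. Each `LIKE '%term%'` condition has to scan every row, which is the performance problem JAA warns about.

```python
import sqlite3
from typing import List, Sequence, Tuple


def make_demo_db() -> sqlite3.Connection:
    """Build a tiny in-memory table with made-up placeholder rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE citations (report_id TEXT, title TEXT, abstract TEXT)")
    conn.executemany(
        "INSERT INTO citations VALUES (?, ?, ?)",
        [
            ("X-001", "Radar antenna design", "Notes on antenna arrays."),
            ("X-002", "Sonar signal processing", "Includes a radar comparison."),
            ("X-003", "100% humidity test", "Chamber results."),
        ],
    )
    return conn


def search_citations(
    conn: sqlite3.Connection, terms: Sequence[str], limit: int = 20
) -> List[Tuple[str, str]]:
    """Return (report_id, title) rows whose title or abstract contains every term."""
    clauses = []
    params: list = []
    for term in terms:
        # Escape LIKE wildcards so user input is matched literally.
        escaped = term.replace("\\", "\\\\").replace("%", "\\%").replace("_", "\\_")
        pattern = f"%{escaped}%"
        clauses.append("(title LIKE ? ESCAPE '\\' OR abstract LIKE ? ESCAPE '\\')")
        params.extend([pattern, pattern])
    where = " AND ".join(clauses) if clauses else "1"
    sql = (
        f"SELECT report_id, title FROM citations WHERE {where} "
        "ORDER BY report_id LIMIT ?"
    )
    return conn.execute(sql, params + [limit]).fetchall()
```

Search terms are passed as bound parameters rather than interpolated into the SQL, and SQLite's `LIKE` is case-insensitive for ASCII by default, so `radar` also matches `Radar`. For anything beyond a small table, a dedicated index (Solr or Elasticsearch, as suggested in the log) is the more realistic route.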